📄 Documents · Community

table-extractor

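A Skill that extracts tables from PDFs with high accuracy, handles complex structures such as merged cells, multi-line rows, and spanning headers, and converts tabular data from reports and similar documents to CSV or Excel.

📜 Original English description (for reference)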

Extract tables from PDFs with high accuracy using camelot. Handles complex table structures including merged cells, multi-line rows, and spanning headers. Use when a user asks to extract a table from a PDF, pull tabular data from a document, convert PDF tables to CSV or Excel, or parse structured tables from reports.

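🇯🇵 Commentary for Japanese creators

In a nutshell

Note: This commentary was added by the jpskill.com editorial team for Japanese business settings. It is reference information, independent of how the Skill itself behaves.

⚡ Recommended: install with a single command (about 60 seconds)

Copy the command below and paste it into a terminal (Mac/Linux) or PowerShell (Windows). Download, extraction, and placement are fully automatic.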

🍎 Mac / 🐧 Linux

mkdir -p ~/.claude/skills && cd ~/.claude/skills && curl -L -o table-extractor.zip https://jpskill.com/download/15449.zip && unzip -o table-extractor.zip && rm table-extractor.zip

🪟 Windows (PowerShell)

$d = "$env:USERPROFILE\.claude\skills"; ni -Force -ItemType Directory $d | Out-Null; iwr https://jpskill.com/download/15449.zip -OutFile "$d\table-extractor.zip"; Expand-Archive "$d\table-extractor.zip" -DestinationPath $d -Force; ri "$d\table-extractor.zip"

When it finishes, restart Claude Code. After that, a plain request such as "extract the tables from this PDF" triggers the Skill automatically.

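💾 Manual download (if the command line feels difficult)

1. Click the blue button below to download table-extractor.zip
2. Double-click the ZIP file to extract it; this creates a table-extractor folder
3. Move that folder to C:\Users\<your name>\.claude\skills\ (Windows) or ~/.claude/skills/ (Mac)
4. Restart Claude Code

⬇ Download as .zip (recommended) ⬇ .skill format (for advanced users) Original source ↗

⚠️ Download and use at your own risk. This site accepts no responsibility for the content, behavior, or safety of the Skill.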

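🎯 What this Skill can do

The description below explains what this Skill does for you. When you ask Claude for something in this area, the Skill activates automatically.

📦 How to install (3 steps)

1. Click the "Download" button above to get the .skill file
2. Change the file extension from .skill to .zip and extract it (macOS can extract it automatically)
3. Put the extracted folder in .claude/skills/ under your home folder
- macOS / Linux: ~/.claude/skills/
- Windows: %USERPROFILE%\.claude\skills\

Restart Claude Code and you are done. You do not need to say "use this Skill"; it is invoked automatically for relevant requests.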


Last updated: 2026-05-18
Retrieved: 2026-05-18
Bundled files: 1

📜 Original SKILL.md (the English/Chinese text that Claude reads)

Table Extractor

Overview

Extract tables from PDF documents with high accuracy using camelot-py. Handles complex table structures including merged cells, multi-line rows, spanning headers, and borderless tables. Outputs clean DataFrames that can be exported to CSV, Excel, or JSON.

Instructions

When a user asks you to extract tables from a PDF, follow this process:

Step 1: Install and verify dependencies

# Install camelot and its dependencies
pip install "camelot-py[base]" ghostscript opencv-python-headless pandas

# Verify ghostscript is available (required by camelot)
gs --version 2>/dev/null || echo "Install ghostscript: sudo apt install ghostscript"

If ghostscript is not available, fall back to pdfplumber:

pip install pdfplumber pandas
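The fallback rule above can be sketched as a small environment check. This is a minimal sketch, not part of the Skill: the helper names (`ghostscript_available`, `module_available`, `choose_backend`, `extract_with_pdfplumber`) are hypothetical, and the Windows executable names are an assumption about a typical Ghostscript install. `pdfplumber.open` and `page.extract_tables()` are pdfplumber's own calls.

```python
import importlib.util
import shutil


def ghostscript_available() -> bool:
    """Return True if a Ghostscript executable is found on PATH."""
    # "gs" is the usual name on Linux/macOS; the gswin* names are assumed for Windows.
    return any(shutil.which(name) for name in ("gs", "gswin64c", "gswin32c"))


def module_available(name: str) -> bool:
    """Return True if the named top-level module can be imported."""
    return importlib.util.find_spec(name) is not None


def choose_backend(has_camelot: bool, has_ghostscript: bool, has_pdfplumber: bool) -> str:
    """Prefer camelot when Ghostscript is present, otherwise fall back to pdfplumber."""
    if has_camelot and has_ghostscript:
        return "camelot"
    if has_pdfplumber:
        return "pdfplumber"
    return "none"


def extract_with_pdfplumber(path: str) -> list:
    """Return every table in the PDF as a list of rows (cells are strings or None)."""
    import pdfplumber  # imported lazily so the checks above work without it installed

    found = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            found.extend(page.extract_tables())
    return found


backend = choose_backend(
    module_available("camelot"),
    ghostscript_available(),
    module_available("pdfplumber"),
)
print(f"Selected backend: {backend}")
```

Keeping the decision in a pure function makes the fallback easy to check without touching a real PDF.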

Step 2: Inspect the PDF to locate tables

import camelot

# Quick scan: how many tables are in the document?
tables = camelot.read_pdf("document.pdf", pages="all", flavor="lattice")
print(f"Found {len(tables)} tables using lattice detection")

# If no tables found, try stream detection (for borderless tables)
if len(tables) == 0:
    tables = camelot.read_pdf("document.pdf", pages="all", flavor="stream")
    print(f"Found {len(tables)} tables using stream detection")

# Summary of each table
for i, table in enumerate(tables):
    print(f"\nTable {i}: {table.shape[0]} rows x {table.shape[1]} cols (page {table.page})")
    print(f"Accuracy: {table.accuracy:.1f}%")
    print(table.df.head(3))

Step 3: Choose the right extraction flavor

Lattice flavor (for tables with visible borders/gridlines):

tables = camelot.read_pdf(
    "document.pdf",
    pages="1,2,3",        # Specific pages
    flavor="lattice",
    line_scale=40,         # Adjust line detection sensitivity
    process_background=True # Detect lines on colored backgrounds
)

Stream flavor (for borderless tables, whitespace-separated):

tables = camelot.read_pdf(
    "document.pdf",
    pages="1",
    flavor="stream",
    edge_tol=50,          # Tolerance for edge detection
    row_tol=10,           # Tolerance for grouping text into rows
    columns=["72,200,350,500"]  # Manual column boundaries if auto-detect fails
)
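The string formats of `pages` and `columns` are easy to get wrong, so here is a small sketch of building them from ordinary Python lists. The helper names `pages_arg` and `columns_arg` are hypothetical. The sketch assumes, as in the call above, that `columns` is a list holding one comma-separated string of x-coordinates per table.

```python
def pages_arg(pages) -> str:
    """Build the `pages` string, e.g. [1, 2, 3] -> "1,2,3"."""
    return ",".join(str(p) for p in pages)


def columns_arg(x_coords) -> list:
    """Build the `columns` value for a single table.

    The result is a one-element list whose string holds the x-coordinates
    of the column separators in ascending order.
    """
    return [",".join(str(x) for x in sorted(x_coords))]


print(pages_arg([1, 2, 3]))
print(columns_arg([350, 72, 500, 200]))
```

The results can be passed directly as `pages=pages_arg([...])` and `columns=columns_arg([...])` in the stream call shown above.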

Step 4: Clean and process extracted tables

import pandas as pd

for i, table in enumerate(tables):
    df = table.df

    # Promote first row to header if it contains column names
    if df.iloc[0].str.match(r'^[A-Za-z]').all():
        df.columns = df.iloc[0]
        df = df[1:].reset_index(drop=True)

    # Clean whitespace and newlines within cells
    df = df.apply(lambda col: col.str.strip().str.replace(r'\n', ' ', regex=True))

    # Remove completely empty rows
    df = df.dropna(how='all').replace('', pd.NA).dropna(how='all')

    # Convert numeric columns
    for col in df.columns:
        try:
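            # regex=False treats "$" as a literal character rather than an end-of-string anchor
            df[col] = pd.to_numeric(df[col].str.replace(',', '', regex=False).str.replace('$', '', regex=False))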
        except (ValueError, AttributeError):
            pass  # Keep as string

    print(f"\nCleaned Table {i}:")
    print(df.head())
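The numeric conversion above does not cover parenthesized negatives such as "(500)", which are common in financial tables. A minimal sketch of a per-cell parser follows. The function name `parse_number` and the set of currency symbols it strips are assumptions for illustration, not part of the Skill.

```python
import re

# Characters removed before parsing: assumed currency symbols, thousands separators, whitespace.
_STRIP = re.compile(r"[$€£¥,\s]")
# A plain decimal number with an optional leading minus sign.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?|-?\.\d+")


def parse_number(text):
    """Convert strings such as "$1,234.50" or "(500)" to float.

    Returns None when the text is not numeric, so the caller can keep
    the original string instead.
    """
    cleaned = _STRIP.sub("", text or "")
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]  # accounting style: parentheses mean a negative value
    if not _NUMBER.fullmatch(cleaned):
        return None
    value = float(cleaned)
    return -value if negative else value


for sample in ["$1,234.50", "(500)", "-12", "N/A"]:
    print(sample, "->", parse_number(sample))
```

A column can then be converted only when every non-empty cell parses, which avoids turning a mixed text column into partial numbers.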

Step 5: Handle complex table structures

Merged cells and spanning headers:

# Forward-fill merged cells (common in row headers)
df.iloc[:, 0] = df.iloc[:, 0].replace('', pd.NA).ffill()

# Handle multi-level column headers
if df.iloc[0:2].apply(lambda x: x.str.len().mean()).mean() < 20:
    # Combine first two rows as multi-level header
    new_cols = df.iloc[0] + " - " + df.iloc[1]
    df.columns = new_cols.str.strip(" - ")
    df = df[2:].reset_index(drop=True)
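To make the two techniques above concrete, here is a small worked example on a hand-built DataFrame standing in for an extraction result. The data and the helper `combine_header_rows` are hypothetical. Unlike the snippet above, the helper also forward-fills a spanning header across the empty cells to its right, which assumes the extractor leaves those cells blank.

```python
import pandas as pd


def combine_header_rows(top, bottom, sep=" - "):
    """Join two header rows; an empty top cell inherits the label to its left."""
    combined, current = [], ""
    for upper, lower in zip(top, bottom):
        current = upper.strip() or current  # forward-fill the spanning header
        parts = [p for p in (current, lower.strip()) if p]
        combined.append(sep.join(parts))
    return combined


# Hypothetical extraction result: two header rows, then data with a merged first column.
raw = pd.DataFrame(
    [
        ["Region", "2023", "", "2024", ""],
        ["", "Q1", "Q2", "Q1", "Q2"],
        ["East", "100", "120", "130", "140"],
        ["", "90", "95", "99", "101"],
    ]
)

df = raw.iloc[2:].reset_index(drop=True)
df.columns = combine_header_rows(raw.iloc[0].tolist(), raw.iloc[1].tolist())

# Blank cells left by a vertically merged cell take the value above them.
first = df.columns[0]
df[first] = df[first].replace("", pd.NA).ffill()
print(df)
```

Forward-filling is only safe for columns where a blank really means "same as above"; applying it to value columns would invent data.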

Tables spanning multiple pages:

# Extract from all pages and concatenate
all_tables = camelot.read_pdf("document.pdf", pages="all", flavor="lattice")

# Group tables that are continuations (same column count)
groups = {}
for t in all_tables:
    key = t.shape[1]
    groups.setdefault(key, []).append(t.df)

for col_count, dfs in groups.items():
    combined = pd.concat(dfs, ignore_index=True)
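    # Remove header rows repeated at page breaks: drop later rows identical to the first row
    repeated = (combined == combined.iloc[0]).all(axis=1)
    repeated.iloc[0] = False  # keep the first (real) header row
    combined = combined[~repeated].reset_index(drop=True)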
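Grouping by column count alone can merge unrelated tables that happen to share a width. A stricter variant, sketched here under stated assumptions, stitches a table onto the previous one only when it sits on the very next page and has the same number of columns. The function `merge_continuations` is hypothetical. It expects non-empty frames whose first row is the header, and with camelot the input could be built as `[(t.page, t.df) for t in all_tables]`, assuming `t.page` is an integer page number.

```python
import pandas as pd


def merge_continuations(parts):
    """Stitch tables that continue across consecutive pages.

    `parts` is a list of (page_number, DataFrame) pairs in document order.
    """
    merged = []  # each entry: [last_page, list_of_frames]
    for page, frame in parts:
        if merged:
            last_page, frames = merged[-1]
            same_width = frames[0].shape[1] == frame.shape[1]
            if same_width and page == last_page + 1:
                header = frames[0].iloc[0].tolist()
                # Drop a header row repeated at the top of the new page.
                if frame.iloc[0].tolist() == header:
                    frame = frame.iloc[1:]
                frames.append(frame)
                merged[-1][0] = page
                continue
        merged.append([page, [frame]])
    return [pd.concat(frames, ignore_index=True) for _, frames in merged]


page1 = pd.DataFrame([["Item", "Qty"], ["A", "1"]])
page2 = pd.DataFrame([["Item", "Qty"], ["B", "2"]])
other = pd.DataFrame([["x", "y", "z"]])

result = merge_continuations([(1, page1), (2, page2), (5, other)])
print(len(result))
print(result[0])
```

Only the first row of each continuation is compared with the header, so identical data rows elsewhere in the table are kept.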

Step 6: Export the results

# CSV (one file per table)
for i, table in enumerate(tables):
    table.df.to_csv(f"table_{i+1}.csv", index=False)

# Excel (all tables as separate sheets)
with pd.ExcelWriter("extracted_tables.xlsx") as writer:
    for i, table in enumerate(tables):
        table.df.to_excel(writer, sheet_name=f"Table_{i+1}", index=False)

# JSON
for i, table in enumerate(tables):
    table.df.to_json(f"table_{i+1}.json", orient="records", indent=2)

print(f"Exported {len(tables)} tables")
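If sheets are named after table titles rather than `Table_1`, `Table_2`, the names have to satisfy Excel's rules: at most 31 characters, none of the characters `[ ] : * ? / \`, no leading or trailing apostrophe, and no duplicates (compared case-insensitively). A minimal sketch of a sanitizer follows; the function name `safe_sheet_name` and the `_2`, `_3` suffix scheme are assumptions for illustration.

```python
import re

_INVALID = re.compile(r"[\[\]:*?/\\]")


def safe_sheet_name(name: str, used: set) -> str:
    """Return a sheet name Excel accepts and record it in `used`."""
    base = _INVALID.sub("_", name).strip("'")[:31] or "Sheet"
    candidate, n = base, 1
    while candidate.lower() in used:
        # Name already taken: append _2, _3, ... while staying within 31 characters.
        n += 1
        suffix = f"_{n}"
        candidate = base[: 31 - len(suffix)] + suffix
    used.add(candidate.lower())
    return candidate


used_names = set()
for title in ["Income/Statement", "Income_Statement", "Quarterly Summary: Revenue by Region and Segment"]:
    print(safe_sheet_name(title, used_names))
```

The returned value can be passed as `sheet_name=` to `to_excel` in the loop above.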

Examples

Example 1: Extract financial tables from an annual report

User request: "Extract all tables from this annual report PDF"

Actions:

1. Scan all pages with lattice flavor (financial reports typically have bordered tables)
2. Identify income statement, balance sheet, and cash flow tables by column headers
3. Clean numeric values (remove $, commas, parentheses for negatives)
4. Export each table to a separate CSV and combine into one Excel workbook

Output: "Extracted 7 tables across 42 pages. Exported to extracted_tables.xlsx with sheets: Income_Statement, Balance_Sheet, Cash_Flow, Revenue_Breakdown, Expenses, Quarterly_Summary, KPIs."

Example 2: Extract a specific table from a research paper

User request: "Get the results table from page 8 of this paper"

Actions:

1. Target page 8 specifically: camelot.read_pdf("paper.pdf", pages="8")
2. If multiple tables on the page, show summaries and let the user pick
3. Clean the extracted table and handle any multi-line cells
4. Export as CSV

Output: A single CSV file with the results table, plus a preview of the first few rows printed to the console.

Example 3: Batch process multiple PDFs

User request: "Extract the summary table from each of these 20 monthly reports"

Actions:

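import glob

import camelot
import pandas as pd

results = []
for pdf_path in sorted(glob.glob("reports/*.pdf")):
    tables = camelot.read_pdf(pdf_path, pages="1", flavor="lattice")
    if tables:
        df = tables[0].df.copy()  # First table on first page
        df["source_file"] = pdf_path  # Record the origin for traceability
        results.append(df)

# pd.concat raises ValueError on an empty list, so combine only when something was found
if results:
    combined = pd.concat(results, ignore_index=True)
    combined.to_csv("all_summaries.csv", index=False)
else:
    print("No tables found in reports/*.pdf")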

Output: A single CSV combining the summary table from all 20 reports with a source_file column for traceability.

Guidelines

- Always try lattice flavor first (bordered tables). Fall back to stream for borderless tables.
- Check the accuracy score on each table. Below 80% indicates extraction issues that need manual review.
- For scanned PDFs, run OCR first (e.g., ocrmypdf) before table extraction.
- When camelot struggles, try pdfplumber as an alternative: page.extract_table(table_settings={...}).
- Clean numeric data aggressively: remove currency symbols, commas, and handle parenthesized negatives.
- For tables with merged cells, use forward-fill on the appropriate columns.
- When extracting from multiple pages, watch for repeated header rows at page breaks.
- Always preview the extracted data before exporting to catch alignment or parsing issues.
- Report extraction quality metrics (accuracy, row/column count) so the user can verify correctness.
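The accuracy check and the quality report from the guidelines can be combined in one helper. This is a sketch under stated assumptions: `quality_report` is a hypothetical name, and it relies only on the `shape`, `page`, and `accuracy` attributes already used in Step 2. The demo uses stand-in objects instead of real camelot tables so that it runs without a PDF.

```python
from types import SimpleNamespace


def quality_report(tables, threshold=80.0):
    """Summarise each table and flag those whose accuracy is below the threshold."""
    report = []
    for index, table in enumerate(tables, start=1):
        rows, cols = table.shape
        report.append(
            {
                "table": index,
                "page": table.page,
                "rows": rows,
                "cols": cols,
                "accuracy": round(table.accuracy, 1),
                "needs_review": table.accuracy < threshold,
            }
        )
    return report


# Stand-ins with the same attributes as the tables inspected in Step 2.
demo = [
    SimpleNamespace(shape=(12, 4), page=3, accuracy=99.2),
    SimpleNamespace(shape=(8, 6), page=7, accuracy=71.5),
]
for entry in quality_report(demo):
    print(entry)
```

Tables flagged with `needs_review` are the ones to preview by hand or retry with the other flavor before exporting.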