pdf-analyzer

Extract text, tables, metadata, and structured data from PDF files. Use when a user asks to read a PDF, parse a PDF, extract data from a PDF, summarize a PDF document, pull tables from a PDF, or convert PDF content to structured formats like JSON or CSV. Handles single and multi-page documents, scanned PDFs, and PDFs with complex table layouts.

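⚡ Recommended: install with a single command (about 60 seconds)

Copy the command below and paste it into a terminal (Mac/Linux) or PowerShell (Windows). It downloads, extracts, and places the files automatically.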

🍎 Mac / 🐧 Linux

mkdir -p ~/.claude/skills && cd ~/.claude/skills && curl -L -o pdf-analyzer.zip https://jpskill.com/download/15241.zip && unzip -o pdf-analyzer.zip && rm pdf-analyzer.zip

🪟 Windows (PowerShell)

$d = "$env:USERPROFILE\.claude\skills"; ni -Force -ItemType Directory $d | Out-Null; iwr https://jpskill.com/download/15241.zip -OutFile "$d\pdf-analyzer.zip"; Expand-Archive "$d\pdf-analyzer.zip" -DestinationPath $d -Force; ri "$d\pdf-analyzer.zip"

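When it finishes, restart Claude Code. The Skill then activates automatically when you make a related request in plain language, such as "summarize this PDF".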

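💾 Manual download (if you prefer not to use the command line)

1. Download pdf-analyzer.zip from https://jpskill.com/download/15241.zip
2. Double-click the ZIP file to extract it; this creates a pdf-analyzer folder
3. Move that folder to C:\Users\<your name>\.claude\skills\ (Windows) or ~/.claude/skills/ (Mac)
4. Restart Claude Code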

⚠️ Download and use at your own risk. This site accepts no responsibility for the content, behavior, or safety of the Skill.

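📦 Installation (3 steps)

1. Download the .skill file from this page
2. Change the file extension from .skill to .zip and extract it (macOS can extract it automatically)
3. Place the extracted folder in .claude/skills/ under your home folder
- macOS / Linux: ~/.claude/skills/
- Windows: %USERPROFILE%\.claude\skills\

Restart Claude Code to finish. You do not need to say "use this Skill"; it is invoked automatically for related requests.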

Last updated: 2026-05-18
Retrieved: 2026-05-18
Bundled files: 1

📜 Original SKILL.md (the text Claude reads)

PDF Analyzer

Overview

Extract text, tables, and structured data from PDF files and convert them into usable formats. This skill handles text extraction, table detection, metadata reading, and output formatting for single or multi-page PDFs.

Instructions

When a user asks you to analyze, read, parse, or extract data from a PDF file, follow these steps:

Step 1: Identify the PDF and goal

Determine the file path and what the user wants extracted:

- Full text: All readable text from every page
- Tables: Structured tabular data
- Metadata: Title, author, creation date, page count
- Specific sections: Targeted content from certain pages
- Summary: A condensed version of the document contents

Step 2: Choose the extraction method

Write a Python script using one of these libraries (prefer pdfplumber for tables, PyMuPDF for speed):

For text extraction:

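import pdfplumber

def extract_text(pdf_path):
    """Return a list of {"page", "text"} dicts, one per page that has text."""
    text_by_page = []
    with pdfplumber.open(pdf_path) as pdf:
        # Page numbers are 1-based so they match what a PDF viewer shows.
        for i, page in enumerate(pdf.pages, start=1):
            text = page.extract_text()
            # Pages without a text layer (e.g. scans) give no text and are skipped.
            if text:
                text_by_page.append({"page": i, "text": text.strip()})
    return text_by_page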

For table extraction:

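import csv

import pdfplumber

def extract_tables(pdf_path, output_csv=None):
    """Return every table as {"page", "headers", "rows"}; optionally write a CSV."""
    all_tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages, start=1):
            for table in page.extract_tables():
                if not table:
                    continue  # skip empty detections so table[0] cannot fail
                # The first row is treated as the header row.
                all_tables.append({
                    "page": i,
                    "headers": table[0],
                    "rows": table[1:]
                })
    if output_csv and all_tables:
        # Only the first table's headers are written, so this assumes
        # every table shares the same columns.
        with open(output_csv, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(all_tables[0]["headers"])
            for table in all_tables:
                writer.writerows(table["rows"])
    return all_tables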

For metadata:

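import pdfplumber

def extract_metadata(pdf_path):
    """Return the page count and the PDF's document-information dictionary."""
    with pdfplumber.open(pdf_path) as pdf:
        return {
            "pages": len(pdf.pages),
            # Keys such as Title or Author appear only if the PDF sets them.
            "metadata": pdf.metadata
        }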

Step 3: Run the script and format output

Execute the script, then present results in the format the user needs (plain text, JSON, CSV, markdown table, or summary).
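As a rough illustration of the formatting step, the sketch below turns one entry of the list returned by extract_tables (a dict with "page", "headers", and "rows") into JSON-ready records or a markdown table. The helper names table_to_records and table_to_markdown are hypothetical, and rendering empty cells (None) as empty strings is an assumption, not something the skill prescribes.

```python
import json


def _cell(value):
    """Render a cell as text; None (an empty cell) becomes ''."""
    return "" if value is None else str(value).replace("\n", " ").strip()


def table_to_records(table):
    """Convert one extracted table into a list of {header: value} dicts."""
    headers = [_cell(h) for h in table["headers"]]
    return [
        {h: _cell(v) for h, v in zip(headers, row)}
        for row in table["rows"]
    ]


def table_to_markdown(table):
    """Render one extracted table as a markdown table string."""
    headers = [_cell(h) for h in table["headers"]]
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in table["rows"]:
        lines.append("| " + " | ".join(_cell(v) for v in row) + " |")
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative data only, shaped like one extract_tables() entry.
    sample = {
        "page": 1,
        "headers": ["Item", "Quantity"],
        "rows": [["Widget A", "100"], ["Widget B", None]],
    }
    print(json.dumps(table_to_records(sample), indent=2))
    print(table_to_markdown(sample))
```

Cells that contain a pipe character would break the markdown layout; this sketch does not escape them.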

Step 4: Handle errors gracefully

If extraction fails, try these fallback approaches:

1. Switch from pdfplumber to PyMuPDF (fitz)
2. For scanned PDFs, suggest OCR with pytesseract
3. For encrypted PDFs, inform the user a password is needed
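One way the fallback chain might look is sketched below, assuming both libraries are installed. The function names and the (texts, errors) return shape are hypothetical. The sketch only detects encryption through PyMuPDF's needs_pass flag; how pdfplumber reports an encrypted file is not assumed, so any error it raises simply triggers the next extractor.

```python
def extract_with_pdfplumber(pdf_path):
    """Primary extractor: one text string per page via pdfplumber."""
    import pdfplumber  # lazy import: a missing library only fails this step

    with pdfplumber.open(pdf_path) as pdf:
        return [page.extract_text() or "" for page in pdf.pages]


def extract_with_pymupdf(pdf_path):
    """Fallback extractor: one text string per page via PyMuPDF (fitz)."""
    import fitz

    with fitz.open(pdf_path) as doc:
        if doc.needs_pass:
            raise PermissionError("PDF is encrypted; a password is needed")
        return [page.get_text() for page in doc]


def extract_with_fallback(pdf_path, extractors):
    """Try each extractor in order and return the first usable result.

    Returns (texts, errors). texts is None when every extractor failed or
    found only empty pages, which hints that the PDF may be scanned and
    need OCR.
    """
    errors = []
    for extractor in extractors:
        try:
            texts = extractor(pdf_path)
        except PermissionError:
            raise  # encrypted: switching libraries will not help
        except Exception as exc:  # broad on purpose: any failure triggers the fallback
            errors.append(f"{extractor.__name__}: {exc}")
            continue
        if any(t.strip() for t in texts):
            return texts, errors
        errors.append(f"{extractor.__name__}: no text found")
    return None, errors
```

A caller would pass [extract_with_pdfplumber, extract_with_pymupdf], suggest OCR when texts comes back as None, and ask the user for a password when PermissionError is raised.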

Examples

Example 1: Extract all tables from an invoice PDF

User request: "Pull the line items table from invoice.pdf and save as CSV"

Actions taken:

1. Open invoice.pdf with pdfplumber
2. Detect tables on each page
3. Extract headers and rows
4. Write to invoice_items.csv

Output:

Found 1 table on page 1 with 5 columns: Item, Quantity, Unit Price, Tax, Total
Extracted 12 rows. Saved to invoice_items.csv.

Preview:
| Item     | Quantity | Unit Price | Tax    | Total   |
|----------|----------|------------|--------|---------|
| Widget A | 100      | $2.50      | $25.00 | $275.00 |
| Widget B | 50       | $4.00      | $20.00 | $220.00 |

Example 2: Summarize a research paper

User request: "Summarize the key findings from paper.pdf"

Actions taken:

1. Extract full text from all pages
2. Identify abstract, introduction, results, and conclusion sections
3. Produce a structured summary

Output:

Document: paper.pdf (14 pages)

Title: Effects of Sleep Duration on Cognitive Performance
Authors: Smith et al., 2024

Key Findings:
- Participants sleeping < 6 hours showed 23% slower reaction times
- Optimal cognitive performance observed at 7-8 hours of sleep
- No significant benefit observed beyond 9 hours

Methods: Randomized controlled trial, n=240, over 12 weeks
Conclusion: 7-8 hours of sleep optimizes cognitive performance in adults
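The section-identification step in Example 2 could be approximated with a simple heading match, as in the sketch below. It assumes each section heading sits on its own line, optionally numbered; the regular expression and the split_sections helper are hypothetical and will miss papers that format headings differently.

```python
import re

# Matches lines such as "Abstract", "1. Introduction", "3 Results", "Conclusions:".
HEADING_RE = re.compile(
    r"^\s*(?:\d+(?:\.\d+)*\.?\s+)?"
    r"(abstract|introduction|results|conclusions?)\s*[:.]?\s*$",
    re.IGNORECASE,
)


def split_sections(text):
    """Group lines under the most recent recognised heading.

    Lines before the first recognised heading are ignored. Text under an
    unrecognised heading stays attached to the previous recognised section.
    """
    sections = {}
    current = None
    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            current = match.group(1).lower()
            if current == "conclusions":
                current = "conclusion"
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in sections.items()}
```

The returned dict maps each section name found to its text, which can then be summarised section by section.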

Guidelines

- Always check whether the PDF has extractable text before attempting extraction. Some PDFs are image-only and require OCR.
- For large PDFs (100+ pages), process in batches and show progress.
- When extracting tables, validate that column counts are consistent across rows. Merged cells often cause misalignment.
- Preserve the original page numbers in output so the user can cross-reference.
- If a PDF has both text and scanned pages, extract text where available and flag scanned pages for OCR.
- Never invent table headers. Use the first row as the header unless the user specifies otherwise.
- For multi-column layouts (academic papers), extract text column by column in reading order, not line by line across columns.
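Three of the guidelines above (batching, column-count validation, and flagging scanned pages) can be sketched with plain Python helpers that work on already-extracted data. All three function names are hypothetical, and the defaults batch_size=25 and min_chars=20 are arbitrary assumptions to tune per document.

```python
from collections import Counter


def batched_page_ranges(total_pages, batch_size=25):
    """Yield (start, end) page-index ranges, end exclusive, covering all pages."""
    for start in range(0, total_pages, batch_size):
        yield start, min(start + batch_size, total_pages)


def inconsistent_rows(table):
    """Return (row_index, column_count) for rows whose width differs
    from the most common width in the table."""
    if not table:
        return []
    expected = Counter(len(row) for row in table).most_common(1)[0][0]
    return [(i, len(row)) for i, row in enumerate(table) if len(row) != expected]


def pages_needing_ocr(text_by_page, min_chars=20):
    """Return 1-based page numbers whose extracted text is empty or nearly so."""
    return [
        number
        for number, text in enumerate(text_by_page, start=1)
        if len((text or "").strip()) < min_chars
    ]


if __name__ == "__main__":
    # Progress display for a hypothetical 120-page document.
    for start, end in batched_page_ranges(120, batch_size=50):
        print(f"Processing pages {start + 1}-{end} of 120")
```

Rows reported by inconsistent_rows are candidates for merged-cell misalignment and are worth showing to the user rather than silently padding.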