opendataloader-pdf

Parse PDFs into AI-ready structured data: extract text, tables, images, and metadata with high accuracy. Use when processing PDF documents for RAG, extracting data from invoices or contracts, or building document processing pipelines.

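⚡ Recommended: install with a single command (60 seconds)

Copy the command below and paste it into a terminal (Mac/Linux) or PowerShell (Windows). It downloads the archive, unzips it, and places the files automatically.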

🍎 Mac / 🐧 Linux

mkdir -p ~/.claude/skills && cd ~/.claude/skills && curl -L -o opendataloader-pdf.zip https://jpskill.com/download/15208.zip && unzip -o opendataloader-pdf.zip && rm opendataloader-pdf.zip

🪟 Windows (PowerShell)

$d = "$env:USERPROFILE\.claude\skills"; ni -Force -ItemType Directory $d | Out-Null; iwr https://jpskill.com/download/15208.zip -OutFile "$d\opendataloader-pdf.zip"; Expand-Archive "$d\opendataloader-pdf.zip" -DestinationPath $d -Force; ri "$d\opendataloader-pdf.zip"

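When it finishes, restart Claude Code. After that, a plain-language request in this Skill's area (for example, "extract the tables from this PDF") triggers it automatically.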

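💾 Manual download (if you would rather not use the command line)

1. Click the blue button below to download opendataloader-pdf.zip
2. Double-click the ZIP file to unzip it; this creates an opendataloader-pdf folder
3. Move that folder to C:\Users\<your name>\.claude\skills\ (Windows) or ~/.claude/skills/ (Mac)
4. Restart Claude Code

⬇ Download as .zip (recommended) ⬇ .skill format (advanced users) Original source ↗

⚠️ Download and use at your own risk. This site accepts no responsibility for the content, behavior, or safety of the Skill.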

🎯 What this Skill can do

The description below explains what this Skill does for you. When you ask Claude for something in this area, the Skill is triggered automatically.

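📦 Installation (3 steps)

1. Click the "Download" button above to get the .skill file
2. Change the file extension from .skill to .zip and extract it (macOS can extract it automatically)
3. Place the extracted folder in .claude/skills/ under your home folder
- macOS / Linux: ~/.claude/skills/
- Windows: %USERPROFILE%\.claude\skills\

Restart Claude Code to finish. You do not need to say "use this Skill"; it is invoked automatically for relevant requests.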

Last updated: 2026-05-18
Retrieved: 2026-05-18
Bundled files: 1

📜 Original SKILL.md (the English/Chinese text that Claude reads)

OpenDataLoader PDF — AI-Ready Document Parsing

Overview

Parse PDF documents into clean, structured data optimized for AI consumption. Extract text with layout preservation, tables as structured JSON, images with captions, and rich metadata. Ideal for RAG pipelines, document analysis, and data extraction workflows.

Instructions

Step 1: Choose Your Parsing Strategy

PDF Type	Best Approach	Tool
Text-native (digital)	Direct text extraction	pdfplumber, PyMuPDF
Scanned / image-based	OCR pipeline	Tesseract, EasyOCR
Tables-heavy	Table-aware extraction	Camelot, pdfplumber
Complex layouts	Vision LLM	Claude/GPT-4o vision
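
As a rough illustration of how the table above might be applied, the sketch below picks an approach from two per-document measurements. The function name `choose_strategy`, its inputs, and both thresholds (100 characters per page, one table per two pages) are assumptions made for illustration and are not part of the Skill.

```python
def choose_strategy(chars_per_page, table_count, min_chars=100, table_ratio=0.5):
    """Pick a parsing approach from simple per-document measurements.

    chars_per_page: number of extractable characters found on each page
    table_count:    number of tables detected in the document
    """
    if not chars_per_page:
        raise ValueError('document has no pages')
    avg_chars = sum(chars_per_page) / len(chars_per_page)
    if avg_chars < min_chars:
        # Almost no text layer: the pages are probably scans
        return 'ocr'
    if table_count / len(chars_per_page) >= table_ratio:
        # Many tables relative to page count: prefer table-aware extraction
        return 'table-aware'
    return 'direct-text'


print(choose_strategy([1800, 2100, 1950], table_count=0))  # direct-text
print(choose_strategy([0, 12, 3], table_count=0))          # ocr
print(choose_strategy([900, 1100], table_count=2))         # table-aware
```

Complex layouts are left out on purpose: deciding that a page needs a vision model usually takes a human look or a failed extraction, not a character count.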

Step 2: Set Up the Python Pipeline

pip install pdfplumber pymupdf "camelot-py[cv]" Pillow
# For OCR: pip install pytesseract easyocr
# (pytesseract also needs the Tesseract binary installed on the system)

Step 3: Extract Text with Layout Awareness

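import pdfplumber

def extract_text_structured(pdf_path):
    """Extract text preserving document structure."""
    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages):
            # layout=True pads the text with whitespace to mimic the page layout
            text = page.extract_text(layout=True) or ''
            # keep_blank_chars=True keeps spaces inside a run, so each "word"
            # here is a same-line phrase, which suits multi-word headings
            runs = page.extract_words(keep_blank_chars=True, extra_attrs=['fontname', 'size'])
            # Heuristic: treat runs set larger than 14 pt as headings
            headers = [r for r in runs if r['size'] > 14]
            pages.append({
                'page': i + 1, 'text': text,
                'headers': [h['text'] for h in headers],
                # Count whitespace-separated words from the text, not the runs
                'word_count': len(text.split())
            })
    return pages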
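
The fixed 14-point threshold above depends on the document's typography. One possible alternative, sketched here as an assumption and not as the Skill's method, flags words whose size exceeds the page's median size by a ratio. `find_headers` and the 1.2 ratio are hypothetical; the input mirrors the dicts that `extract_words` returns when `size` is requested through `extra_attrs`.

```python
from statistics import median


def find_headers(words, ratio=1.2):
    """Return the text of words set noticeably larger than the median size."""
    if not words:
        return []
    # The median approximates the body-text size on a typical page
    body_size = median(w['size'] for w in words)
    return [w['text'] for w in words if w['size'] > body_size * ratio]


sample = [
    {'text': 'Quarterly Results', 'size': 18.0},
    {'text': 'Revenue grew', 'size': 10.0},
    {'text': 'in every region', 'size': 10.0},
    {'text': 'during the quarter.', 'size': 10.0},
]
print(find_headers(sample))  # ['Quarterly Results']
```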

Step 4: Extract Tables as Structured Data

def extract_tables(pdf_path):
    """Extract tables as list of dicts."""
    results = []
    # The "text" strategies infer cell edges from word alignment,
    # so no ruling lines are needed
    settings = {"vertical_strategy": "text",
                "horizontal_strategy": "text", "snap_tolerance": 5}
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages):
            tables = page.extract_tables(settings)
            for j, table in enumerate(tables):
                # Skip empty tables and tables that contain only a header row
                if not table or len(table) < 2:
                    continue
                # Empty header cells come back as None; name them by position
                headers = [str(h).strip() if h else f'col_{k}'
                           for k, h in enumerate(table[0])]
                rows = []
                for row in table[1:]:
                    row_dict = {}
                    for k, cell in enumerate(row):
                        key = headers[k] if k < len(headers) else f'col_{k}'
                        row_dict[key] = str(cell).strip() if cell else ''
                    rows.append(row_dict)
                results.append({'page': i+1, 'table_index': j, 'headers': headers,
                                'rows': rows, 'row_count': len(rows)})
    return results
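
Every cell comes back as a string, so numeric columns usually need converting before analysis. The helper below is a minimal sketch: `coerce_cell` and `coerce_rows` are hypothetical names, and the decision to drop thousands separators, a leading `$`, and a trailing `%` is an assumption about the data, not something the Skill prescribes.

```python
import re

# Plain decimal numbers only; words such as "nan" or "inf" stay as text
_NUMBER = re.compile(r'[-+]?\d+(?:\.\d+)?')


def coerce_cell(value):
    """Convert strings such as '1,234.5', '$847.3' or '23.1%' to float.

    Non-numeric text is returned unchanged. A percent sign is dropped,
    so '23.1%' becomes 23.1 (not 0.231).
    """
    cleaned = value.strip().replace(',', '').lstrip('$').rstrip('%').strip()
    return float(cleaned) if _NUMBER.fullmatch(cleaned) else value


def coerce_rows(rows):
    """Apply coerce_cell to every cell of every row dict."""
    return [{key: coerce_cell(cell) for key, cell in row.items()} for row in rows]


rows = [{'Quarter': 'Q4 2025', 'Revenue ($M)': '847.3',
         'Growth (%)': '12.4', 'Operating Margin': '23.1%'}]
print(coerce_rows(rows)[0])
# {'Quarter': 'Q4 2025', 'Revenue ($M)': 847.3, 'Growth (%)': 12.4, 'Operating Margin': 23.1}
```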

Step 5: Extract Images and Metadata

import os
import fitz  # PyMuPDF

def extract_images(pdf_path, output_dir='./images'):
    """Extract embedded images from PDF."""
    os.makedirs(output_dir, exist_ok=True)
    images = []
    with fitz.open(pdf_path) as doc:  # close the file when done
        for page_num in range(len(doc)):
            page = doc[page_num]
            for img_idx, img in enumerate(page.get_images(full=True)):
                # img[0] is the xref of the embedded image
                base_image = doc.extract_image(img[0])
                filename = f'page{page_num+1}_img{img_idx+1}.{base_image["ext"]}'
                filepath = os.path.join(output_dir, filename)
                with open(filepath, 'wb') as f:
                    f.write(base_image['image'])
                images.append({'page': page_num+1, 'file': filepath,
                               'format': base_image['ext'],
                               'width': base_image.get('width'),
                               'height': base_image.get('height')})
    return images

def extract_metadata(pdf_path):
    """Extract PDF metadata."""
    with fitz.open(pdf_path) as doc:
        meta = doc.metadata or {}
        # Metadata values can be None, so fall back to '' explicitly
        return {'title': meta.get('title') or '', 'author': meta.get('author') or '',
                'pages': len(doc), 'encrypted': doc.is_encrypted}

Step 6: Build RAG-Ready Chunks

def chunk_for_rag(pages, chunk_size=500, overlap=50):
    """Split pages into overlapping chunks for RAG."""
    if not 0 <= overlap < chunk_size:
        raise ValueError('overlap must be >= 0 and smaller than chunk_size')
    step = chunk_size - overlap  # consecutive chunks share `overlap` words
    chunks = []
    for page in pages:
        text = page['text']
        if not text:
            continue
        words = text.split()
        for i in range(0, len(words), step):
            chunk_words = words[i:i + chunk_size]
            # Drop fragments under 20 words (this also skips very short pages)
            if len(chunk_words) < 20:
                continue
            chunks.append({'text': ' '.join(chunk_words), 'page': page['page'],
                           'chunk_index': len(chunks), 'word_count': len(chunk_words)})
    return chunks
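
To make the window arithmetic concrete: with `chunk_size=500` and `overlap=50`, the window advances 450 words at a time. The helper below (a hypothetical `chunk_spans`, written only to illustrate the loop above) lists the word ranges produced for a page of a given length.

```python
def chunk_spans(n_words, chunk_size=500, overlap=50, min_words=20):
    """Return (start, end) word offsets, end exclusive, for one page."""
    step = chunk_size - overlap
    spans = []
    for start in range(0, n_words, step):
        end = min(start + chunk_size, n_words)
        # Mirror the pipeline's rule: fragments under min_words are dropped
        if end - start >= min_words:
            spans.append((start, end))
    return spans


print(chunk_spans(1000))  # [(0, 500), (450, 950), (900, 1000)]
print(chunk_spans(460))   # [(0, 460)]
print(chunk_spans(15))    # []
```

The last case shows a side effect worth knowing about: a page with fewer than 20 words produces no chunk at all.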

Step 7: Full Pipeline — PDF to AI-Ready JSON

import json

def pdf_to_ai_ready(pdf_path, output_path=None):
    """Complete pipeline: PDF to structured AI-ready data."""
    result = {
        'source': pdf_path,
        'metadata': extract_metadata(pdf_path),
        'pages': extract_text_structured(pdf_path),
        'tables': extract_tables(pdf_path),
        'images': extract_images(pdf_path),
    }
    result['chunks'] = chunk_for_rag(result['pages'])
    result['stats'] = {
        'total_pages': len(result['pages']),
        'total_tables': len(result['tables']),
        'total_images': len(result['images']),
        'total_chunks': len(result['chunks']),
    }
    if output_path:
        # default=str stringifies any value that json cannot serialize natively
        with open(output_path, 'w', encoding='utf-8') as f:
            json.dump(result, f, indent=2, default=str)
    return result
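
The pipeline above builds the whole result in memory before writing it. For large documents, one option is to write records as JSON Lines while they are produced. The sketch below assumes a generator of page dicts; `write_jsonl` and `fake_pages` are hypothetical names, and `fake_pages` only stands in for a real page-by-page parser.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write one JSON object per line so records never accumulate in memory."""
    count = 0
    with open(path, 'w', encoding='utf-8') as f:
        for record in records:  # records may be a generator
            f.write(json.dumps(record, ensure_ascii=False, default=str) + '\n')
            count += 1
    return count


def fake_pages(n):
    """Stand-in generator; a real one would yield pages as they are parsed."""
    for i in range(n):
        yield {'page': i + 1, 'text': f'text of page {i + 1}'}


out_path = os.path.join(tempfile.mkdtemp(), 'pages.jsonl')
print(write_jsonl(fake_pages(3), out_path))  # 3
```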

Step 8: Handle Scanned PDFs with OCR

import fitz  # PyMuPDF
import pytesseract
from PIL import Image

def ocr_pdf(pdf_path):
    """OCR scanned PDF pages."""
    pages = []
    with fitz.open(pdf_path) as doc:
        for i in range(len(doc)):
            # Render at 300 dpi; the default pixmap is RGB without alpha
            pix = doc[i].get_pixmap(dpi=300)
            img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
            # Uses Tesseract's default language (English) unless lang= is passed
            text = pytesseract.image_to_string(img)
            pages.append({'page': i + 1, 'text': text, 'method': 'ocr'})
    return pages
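
Some PDFs mix text-native and scanned pages. One way to combine both extractors, sketched under the assumption that the two page lists cover the same pages in the same order, is to keep the native text where it exists and use OCR text elsewhere. `merge_with_ocr` and the 50-character threshold are hypothetical.

```python
def merge_with_ocr(text_pages, ocr_pages, min_chars=50):
    """Prefer native text per page; fall back to OCR text for near-empty pages."""
    merged = []
    for native, ocr in zip(text_pages, ocr_pages):
        native_text = (native.get('text') or '').strip()
        if len(native_text) >= min_chars:
            merged.append({'page': native['page'], 'text': native['text'],
                           'method': 'native'})
        else:
            merged.append({'page': ocr['page'], 'text': ocr['text'],
                           'method': 'ocr'})
    return merged


text_pages = [
    {'page': 1, 'text': 'This page has a proper text layer with plenty of characters in it.'},
    {'page': 2, 'text': ''},
]
ocr_pages = [
    {'page': 1, 'text': 'This page has a proper text layer', 'method': 'ocr'},
    {'page': 2, 'text': 'Scanned signature page', 'method': 'ocr'},
]
print([p['method'] for p in merge_with_ocr(text_pages, ocr_pages)])  # ['native', 'ocr']
```

Running OCR on every page just to discard most of it is wasteful; in practice the near-empty pages could be identified first and only those rendered.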

Examples

Example 1: Extract Data from a Quarterly Financial Report

A finance team processes a 48-page quarterly report PDF to feed into their analysis pipeline:

result = pdf_to_ai_ready('Q4-2025-Annual-Report-Acme-Corp.pdf', 'acme_q4.json')
print(result['stats'])
# {'total_pages': 48, 'total_tables': 12, 'total_images': 7, 'total_chunks': 34}

# Extract the revenue table from page 8
revenue_table = [t for t in result['tables'] if t['page'] == 8][0]
print(revenue_table['headers'])
# ['Quarter', 'Revenue ($M)', 'Growth (%)', 'Operating Margin']
print(revenue_table['rows'][0])
# {'Quarter': 'Q4 2025', 'Revenue ($M)': '847.3', 'Growth (%)': '12.4', 'Operating Margin': '23.1%'}

# Feed chunks into RAG system
# (embed_and_store is a placeholder for your own embedding/storage function)
for chunk in result['chunks']:
    embed_and_store(chunk['text'], metadata={'page': chunk['page'], 'source': 'acme_q4'})

Example 2: Batch Process Legal Contracts for Clause Extraction

A legal team processes a directory of scanned contract PDFs to identify key clauses:

import json
import os

contract_dir = './contracts/vendor-agreements/'
for filename in os.listdir(contract_dir):
    if not filename.lower().endswith('.pdf'):
        continue
    pdf_path = os.path.join(contract_dir, filename)

    # Try text extraction first, fall back to OCR for scanned docs
    result = pdf_to_ai_ready(pdf_path)
    total_text = sum(len((p['text'] or '').strip()) for p in result['pages'])
    if total_text < 100:  # likely scanned
        result['pages'] = ocr_pdf(pdf_path)
        result['chunks'] = chunk_for_rag(result['pages'])
        # Keep the stats in sync with the OCR output
        result['stats']['total_pages'] = len(result['pages'])
        result['stats']['total_chunks'] = len(result['chunks'])

    print(f"{filename}: {result['stats']['total_pages']} pages, "
          f"{result['stats']['total_chunks']} chunks, "
          f"{result['stats']['total_tables']} tables")
    # Output: "vendor-agreement-globaltech-2025.pdf: 24 pages, 18 chunks, 3 tables"

    # Save the (possibly OCR-based) result for downstream AI analysis;
    # calling pdf_to_ai_ready again here would discard the OCR text
    json_path = os.path.splitext(pdf_path)[0] + '.json'
    with open(json_path, 'w', encoding='utf-8') as f:
        json.dump(result, f, indent=2, default=str)
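
The example stops at saving JSON. As one possible next step toward identifying key clauses, the sketch below scans chunks for clause keywords with regular expressions. The function `find_clauses` and the keyword patterns are illustrative assumptions; real clause extraction would likely need an LLM or a domain-specific model on top of a filter like this.

```python
import re

# Hypothetical keyword patterns, one per clause type
CLAUSE_PATTERNS = {
    'termination': re.compile(r'\bterminat(?:e|es|ed|ion)\b', re.IGNORECASE),
    'indemnification': re.compile(r'\bindemnif(?:y|ies|ied|ication)\b', re.IGNORECASE),
    'confidentiality': re.compile(r'\bconfidential(?:ity)?\b', re.IGNORECASE),
}


def find_clauses(chunks, patterns=CLAUSE_PATTERNS):
    """Map each clause type to the chunk indexes and pages that mention it."""
    hits = {name: [] for name in patterns}
    for chunk in chunks:
        for name, pattern in patterns.items():
            if pattern.search(chunk['text']):
                hits[name].append({'chunk_index': chunk['chunk_index'],
                                   'page': chunk['page']})
    return hits


chunks = [
    {'text': 'Either party may terminate this Agreement with 30 days notice.',
     'page': 3, 'chunk_index': 0},
    {'text': 'Vendor shall indemnify Customer against third-party claims.',
     'page': 7, 'chunk_index': 1},
]
print(find_clauses(chunks))
# {'termination': [{'chunk_index': 0, 'page': 3}], 'indemnification': [{'chunk_index': 1, 'page': 7}], 'confidentiality': []}
```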

Guidelines

Always check font encoding — some PDFs produce garbled text; try PyMuPDF if pdfplumber fails
Use Camelot for bordered tables — pdfplumber works better for borderless tables
Process large PDFs page-by-page — stream results to disk to avoid memory issues
Vision LLM fallback — for complex layouts, send page screenshots to Claude or GPT-4o as images
Validate extracted data — spot-check tables and text against the original PDF before using in production
Handle encrypted PDFs — check doc.is_encrypted and prompt for password before extraction
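
The first guideline can be partly automated. Text layers with broken font encodings often surface as `(cid:NN)` placeholders or Unicode replacement characters, so a simple ratio check can flag pages worth retrying with another extractor. This is a heuristic sketch: `looks_garbled` and the 5% threshold are assumptions, and it will not catch text that is wrong but looks like ordinary characters.

```python
import re

_CID = re.compile(r'\(cid:\d+\)')


def looks_garbled(text, max_bad_ratio=0.05):
    """Heuristic check for a broken text layer."""
    if not text.strip():
        # Empty text points to a scanned page, not an encoding problem
        return False
    cid_hits = len(_CID.findall(text))
    replacement_hits = text.count('\ufffd')
    # Compare bad markers with the number of whitespace-separated tokens
    tokens = max(len(text.split()), 1)
    return (cid_hits + replacement_hits) / tokens > max_bad_ratio


print(looks_garbled('Total revenue increased by 12.4% year over year.'))  # False
print(looks_garbled('(cid:12)(cid:45)(cid:7) (cid:88)(cid:3)'))            # True
```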

References

pdfplumber — detailed PDF text and table extraction
PyMuPDF — fast PDF processing with image extraction
Camelot — accurate table extraction from PDFs