
pdf-ocr

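A Skill that recognizes characters in scanned PDFs and images and extracts the text, so that paper documents can be digitized and their contents read. It also handles low-quality scans and multiple languages.

📜 Original English description (for reference)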

Extract text from scanned PDFs using optical character recognition. Use when a user asks to OCR a PDF, read a scanned document, extract text from an image PDF, digitize a scanned file, convert a scanned PDF to text, or read text from a photograph of a document. Supports multiple languages and handles low-quality scans.

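🇯🇵 Commentary for Japanese creators

In a nutshell

Note: This commentary was added by the jpskill.com editorial team for Japanese business settings. It is reference information and is independent of how the Skill itself behaves.

⚡ Recommended: install with a single command (60 seconds)

Copy the command below and paste it into a terminal (Mac/Linux) or PowerShell (Windows). Download, extraction, and placement are fully automatic.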

🍎 Mac / 🐧 Linux

mkdir -p ~/.claude/skills && cd ~/.claude/skills && curl -L -o pdf-ocr.zip https://jpskill.com/download/15243.zip && unzip -o pdf-ocr.zip && rm pdf-ocr.zip

🪟 Windows (PowerShell)

$d = "$env:USERPROFILE\.claude\skills"; ni -Force -ItemType Directory $d | Out-Null; iwr https://jpskill.com/download/15243.zip -OutFile "$d\pdf-ocr.zip"; Expand-Archive "$d\pdf-ocr.zip" -DestinationPath $d -Force; ri "$d\pdf-ocr.zip"

When it finishes, restart Claude Code. The Skill then triggers automatically when you simply ask in plain language, for example "OCR this scanned PDF".

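💾 Prefer to download manually (for those who find the command difficult)

1. Press the blue button below to download pdf-ocr.zip
2. Double-click the ZIP file to extract it; this creates a pdf-ocr folder
3. Move that folder to C:\Users\YourName\.claude\skills\ (Windows) or ~/.claude/skills/ (Mac)
4. Restart Claude Code

⬇ Download as .zip (recommended) ⬇ .skill format (for advanced users) Original source ↗

⚠️ Download and use at your own risk. This site accepts no responsibility for the content, behavior, or safety of the Skill.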

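🎯 What this Skill can do

The description below explains what this Skill does for you. It triggers automatically when you ask Claude for help in this area.

📦 How to install (3 steps)

1. Press the "Download" button above to get the .skill file
2. Change the file extension from .skill to .zip and extract it (macOS can extract it automatically)
3. Place the extracted folder in .claude/skills/ under your home folder
- macOS / Linux: ~/.claude/skills/
- Windows: %USERPROFILE%\.claude\skills\

Restart Claude Code and you are done. You do not need to say "use this Skill…"; it is invoked automatically for relevant requests.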


Last updated: 2026-05-18
Retrieved: 2026-05-18
Bundled files: 1

📜 Original SKILL.md (the English/Chinese text that Claude reads)

PDF OCR

Overview

Extract readable text from scanned or image-based PDF documents using optical character recognition (OCR). This skill converts PDF pages to images, runs OCR to detect text, and outputs clean structured text. Handles multi-page documents, multiple languages, and low-quality scans with preprocessing.

Instructions

When a user asks to OCR a scanned PDF or extract text from an image-based PDF, follow these steps:

Step 1: Check if OCR is actually needed

First, attempt normal text extraction. If the PDF already contains selectable text, OCR is unnecessary:

import pdfplumber

def check_text_content(pdf_path):
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages[:3]:
            text = page.extract_text()
            if text and len(text.strip()) > 50:
                return True  # Has extractable text, OCR not needed
    return False  # Image-only PDF, needs OCR
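
The check above samples only the first three pages and returns as soon as one of them has text, so a PDF that mixes digital and scanned pages may be misclassified. As a hedged sketch (not part of the original skill), the same 50-character cutoff could be applied per page. `pages_needing_ocr` and `min_chars` are hypothetical names, and the input is assumed to be the list of values that `page.extract_text()` returns for each page.

```python
def pages_needing_ocr(page_texts, min_chars=50):
    """Return 1-based page numbers whose extracted text is too short to trust.

    page_texts: one entry per page, as returned by pdfplumber's
    page.extract_text() (a string, possibly empty, or None).
    min_chars: assumed cutoff, mirroring the 50-character check above.
    """
    needs_ocr = []
    for number, text in enumerate(page_texts, start=1):
        # A page counts as "has text" only if it exceeds the cutoff.
        if not text or len(text.strip()) <= min_chars:
            needs_ocr.append(number)
    return needs_ocr


# Illustrative input: page 1 has a text layer, pages 2 and 3 do not.
sample = ["A" * 120, None, "   "]
print(pages_needing_ocr(sample))  # [2, 3]
```

Only the listed pages would then need to go through the image conversion and OCR steps.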

Step 2: Install and verify dependencies

Ensure the required tools are available:

# Install Tesseract OCR engine
# Ubuntu/Debian:
sudo apt-get install tesseract-ocr
# macOS:
brew install tesseract

# Install Python packages
pip install pytesseract pdf2image Pillow

# For additional languages:
sudo apt-get install tesseract-ocr-deu  # German
sudo apt-get install tesseract-ocr-fra  # French
sudo apt-get install tesseract-ocr-jpn  # Japanese
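
To verify that the tools are actually reachable, a small check along the following lines may help. This is a sketch under assumptions, not part of the original skill: `check_dependencies` is a hypothetical helper, and it also looks for `pdftoppm` because pdf2image relies on the Poppler utilities, which the install commands above do not cover.

```python
import importlib.util
import shutil


def check_dependencies():
    """Report which OCR prerequisites can be found on this machine."""
    report = {}
    # Command-line tools: the Tesseract engine, and Poppler's pdftoppm,
    # which pdf2image calls to rasterize PDF pages.
    for tool in ("tesseract", "pdftoppm"):
        report[tool] = shutil.which(tool) is not None
    # Python packages, listed by import name (Pillow is imported as PIL).
    for module in ("pytesseract", "pdf2image", "PIL", "pdfplumber"):
        report[module] = importlib.util.find_spec(module) is not None
    return report


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

If `pdftoppm` is reported missing, Poppler is usually available as `poppler-utils` on Debian/Ubuntu and `poppler` in Homebrew.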

Step 3: Convert PDF pages to images

from pdf2image import convert_from_path

def pdf_to_images(pdf_path, dpi=300):
    images = convert_from_path(pdf_path, dpi=dpi)
    return images

Use 300 DPI for standard documents. Increase to 400-600 DPI for small text or low-quality scans.
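
To make the DPI trade-off concrete, the sketch below estimates the pixel size and uncompressed memory of one rendered page, and splits a document into small page ranges. It is an illustration under stated assumptions: US Letter pages (8.5 x 11 in), 3 bytes per pixel for RGB, and hypothetical helper names `page_pixels`, `page_bytes`, and `page_batches`.

```python
def page_pixels(dpi, width_in=8.5, height_in=11.0):
    """Pixel dimensions of one page rendered at the given DPI (US Letter assumed)."""
    return round(width_in * dpi), round(height_in * dpi)


def page_bytes(dpi, bytes_per_pixel=3):
    """Approximate uncompressed size of one RGB page image, in bytes."""
    width, height = page_pixels(dpi)
    return width * height * bytes_per_pixel


def page_batches(total_pages, batch_size=10):
    """Yield 1-based (first_page, last_page) ranges covering the document."""
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


for dpi in (300, 600):
    width, height = page_pixels(dpi)
    print(f"{dpi} DPI: {width}x{height} px, ~{page_bytes(dpi) / 1e6:.0f} MB per page")

print(list(page_batches(23, batch_size=10)))  # [(1, 10), (11, 20), (21, 23)]
```

Doubling the DPI roughly quadruples the memory per page. Each range could be passed to `convert_from_path(pdf_path, dpi=dpi, first_page=first, last_page=last)` so that only a few page images are held in memory at once.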

Step 4: Preprocess images for better accuracy

Apply preprocessing to improve OCR quality:

from PIL import Image, ImageFilter, ImageEnhance

def preprocess_image(image):
    # Convert to grayscale
    gray = image.convert('L')
    # Increase contrast
    enhancer = ImageEnhance.Contrast(gray)
    enhanced = enhancer.enhance(2.0)
    # Sharpen
    sharpened = enhanced.filter(ImageFilter.SHARPEN)
    # Binarize (threshold)
    threshold = 150
    binary = sharpened.point(lambda x: 255 if x > threshold else 0)
    return binary
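
The fixed threshold of 150 above may not suit every scan, for instance faded or unevenly lit pages. One common alternative, swapped in here purely as an illustration and not something the skill prescribes, is Otsu's method, which picks the threshold that best separates dark and light pixels. The sketch works on a 256-bin grayscale histogram; `otsu_threshold` is a hypothetical helper name.

```python
def otsu_threshold(histogram):
    """Return the gray level that maximizes between-class variance (Otsu's method).

    histogram: 256 pixel counts, one per gray level 0..255.
    """
    total = sum(histogram)
    if total == 0:
        return 0
    sum_all = sum(level * count for level, count in enumerate(histogram))
    weight_bg = 0   # number of pixels at or below the candidate threshold
    sum_bg = 0      # sum of gray levels of those pixels
    best_level, best_variance = 0, -1.0
    for level, count in enumerate(histogram):
        weight_bg += count
        if weight_bg == 0:
            continue  # no background pixels yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is already in the background class
        sum_bg += level * count
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_level = variance, level
    return best_level


# Illustrative histogram: dark ink around level 50, light paper around level 200.
hist = [0] * 256
hist[50] = 100
hist[200] = 100
print(otsu_threshold(hist))  # 50
```

With Pillow, the histogram of a grayscale image is available as `gray.histogram()`, so the returned value could stand in for the hard-coded `threshold = 150` in `preprocess_image`.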

Step 5: Run OCR on each page

import pytesseract

def ocr_pages(images, lang='eng'):
    results = []
    for i, image in enumerate(images):
        processed = preprocess_image(image)
        text = pytesseract.image_to_string(processed, lang=lang)
        results.append({
            "page": i + 1,
            "text": text.strip(),
            "confidence": get_confidence(processed, lang)
        })
    return results

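def get_confidence(image, lang='eng'):
    # image_to_data returns one 'conf' entry per detected box; -1 marks boxes without a recognized word
    data = pytesseract.image_to_data(image, lang=lang, output_type=pytesseract.Output.DICT)
    # conf values may be strings, ints, or floats depending on the pytesseract version,
    # so parse them with float() (int('96.5') would raise ValueError)
    confidences = [float(c) for c in data['conf'] if float(c) > 0]
    return sum(confidences) / len(confidences) if confidences else 0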

Step 6: Output the results

Combine and format the extracted text. Save as a text file or return directly:

def save_results(results, output_path):
    with open(output_path, 'w', encoding='utf-8') as f:
        for page in results:
            f.write(f"--- Page {page['page']} (confidence: {page['confidence']:.0f}%) ---\n")
            f.write(page['text'] + "\n\n")
    return output_path

Examples

Example 1: OCR a scanned contract

User request: "Extract text from this scanned contract scan_contract.pdf"

Actions taken:

1. Check for existing text layer - none found, OCR needed
2. Convert 5 pages to images at 300 DPI
3. Preprocess and run OCR in English

Output:

OCR completed for scan_contract.pdf (5 pages)

Page-by-page confidence:
  Page 1: 96% confidence
  Page 2: 94% confidence
  Page 3: 91% confidence
  Page 4: 95% confidence
  Page 5: 88% confidence (lower quality scan detected)

Output saved to: scan_contract_text.txt (4,230 words extracted)

Note: Page 5 had lower image quality. Review that page for accuracy.
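
A note like the one above can be produced by comparing each page's confidence against a cutoff. A minimal sketch of that step follows; the function name `pages_to_review` and the 90% cutoff are assumptions chosen to match this example (88% flagged, 91% not), not values stated by the skill. `results` is assumed to have the shape returned by `ocr_pages`.

```python
def pages_to_review(results, min_confidence=90.0):
    """Return page numbers whose mean OCR confidence falls below the cutoff."""
    return [r["page"] for r in results if r["confidence"] < min_confidence]


# Confidence values taken from the example output above; text omitted for brevity.
results = [
    {"page": 1, "text": "", "confidence": 96},
    {"page": 2, "text": "", "confidence": 94},
    {"page": 3, "text": "", "confidence": 91},
    {"page": 4, "text": "", "confidence": 95},
    {"page": 5, "text": "", "confidence": 88},
]
print(pages_to_review(results))  # [5]
```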

Example 2: OCR a multi-language document

User request: "Read this scanned document, it's in German: rechnung.pdf"

Actions taken:

1. Verify tesseract-ocr-deu language pack is installed
2. Convert pages to images at 300 DPI
3. Run OCR with lang='deu'

Output:

OCR completed for rechnung.pdf (2 pages) using German language model

  Page 1: 93% confidence
  Page 2: 95% confidence

Extracted 812 words. Output saved to: rechnung_text.txt
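
The first action in this example, verifying the language pack, could look roughly like the sketch below. `missing_languages` is a hypothetical helper; it takes the `lang` string passed to Tesseract (including combined forms such as `eng+deu`) and a list of installed language codes.

```python
def missing_languages(lang, installed):
    """Return the codes in a Tesseract lang string that are not installed."""
    requested = [code for code in lang.split("+") if code]
    available = set(installed)
    return [code for code in requested if code not in available]


# Illustrative list; on a real system it could come from
# pytesseract.get_languages() or the `tesseract --list-langs` command.
installed = ["eng", "osd", "jpn"]
print(missing_languages("eng+deu", installed))  # ['deu']
print(missing_languages("jpn", installed))      # []
```

An empty result means OCR can proceed; otherwise the missing packs (for example `tesseract-ocr-deu`) would need to be installed first.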

Example 3: Batch OCR multiple scanned files

User request: "OCR all the scanned PDFs in the ./receipts/ folder"

Actions taken:

1. Find all PDF files in ./receipts/ (found 12 files)
2. Check each for existing text layer
3. Run OCR on the 10 files that need it

Output:

Batch OCR complete: 12 files processed

  Already had text: 2 files (skipped)
  OCR completed:    10 files
  Average confidence: 92%

Output files saved to ./receipts/ocr_output/
  receipt_001_text.txt (97% confidence)
  receipt_002_text.txt (94% confidence)
  ...
  receipt_010_text.txt (85% confidence - review recommended)
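
The batch layout in this example (an `ocr_output/` subfolder and `<name>_text.txt` files) can be sketched with `pathlib`. `plan_batch` is a hypothetical helper that only computes the input/output pairs; the text-layer check and the OCR itself would still go through the functions from the steps above.

```python
from pathlib import Path


def plan_batch(folder, output_dirname="ocr_output"):
    """Pair every PDF in a folder with the text file its OCR output would go to."""
    folder = Path(folder)
    output_dir = folder / output_dirname
    jobs = []
    # Sorted so files are processed in a stable, predictable order.
    for pdf_path in sorted(folder.glob("*.pdf")):
        jobs.append((pdf_path, output_dir / f"{pdf_path.stem}_text.txt"))
    return jobs


if __name__ == "__main__":
    for pdf_path, text_path in plan_batch("./receipts"):
        print(f"{pdf_path} -> {text_path}")
```

Note that `glob("*.pdf")` is case-sensitive on Linux, so files ending in `.PDF` would need a second pattern. The output folder is not created here; `output_dir.mkdir(exist_ok=True)` would be needed before saving.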

Guidelines

- Always check for existing text content before running OCR. Many PDFs already have a text layer.
- Use 300 DPI as the default resolution. Increase for small fonts or poor quality scans.
- Report confidence scores per page so users know which pages may need manual review.
- For multi-language documents, specify the correct Tesseract language code. Multiple languages can be combined: lang='eng+deu'.
- Preprocess images before OCR: grayscale conversion, contrast enhancement, and binarization significantly improve accuracy.
- For rotated or skewed scans, apply deskewing before OCR using image rotation detection.
- Large PDFs should be processed page by page to manage memory usage.
- Common Tesseract language codes: eng (English), deu (German), fra (French), spa (Spanish), jpn (Japanese), chi_sim (Chinese Simplified), kor (Korean).
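
For the rotation guideline, one possible approach (an assumption, since the skill does not name a method) is Tesseract's orientation and script detection, which pytesseract exposes as `image_to_osd`. The sketch below only parses the `Rotate:` field from that report. `parse_osd_rotation` is a hypothetical helper, and the sample report is illustrative rather than real output from a specific file.

```python
import re


def parse_osd_rotation(osd_text):
    """Extract the 'Rotate' angle (in degrees) from a Tesseract OSD report.

    Returns 0 when the field is missing, so callers can skip rotation.
    """
    match = re.search(r"^Rotate:\s*(\d+)", osd_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else 0


# Illustrative report in the format returned by pytesseract.image_to_osd(image).
sample_osd = """Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 21.27
Script: Latin
Script confidence: 4.14
"""
print(parse_osd_rotation(sample_osd))  # 90
```

A non-zero angle could then be applied with Pillow before OCR, for example `image.rotate(-angle, expand=True)`. This assumes `Rotate` is the clockwise correction (Pillow rotates counter-clockwise), so the direction is worth confirming on a known sample. OSD detects orientation in 90-degree steps only; correcting a skew of a few degrees needs a separate technique.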