
ai-provider-claude-vision

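A Skill that uses Claude's image recognition capabilities to read information from images and PDFs, extract structured data, estimate token costs, and support visual-information understanding across a range of business tasks.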

📜 Original English description (for reference)

Image understanding and document analysis with Claude's multimodal capabilities -- image input formats, PDF processing, multi-image patterns, structured extraction, and token cost estimation

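🇯🇵 Commentary for Japanese creators

In a nutshell

※ This commentary was added by the jpskill.com editorial team for Japanese business settings. It is reference information, independent of how the Skill itself behaves.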

⚡ Recommended: install with a single command (60 seconds)

Copy the command below and paste it into a terminal (Mac/Linux) or PowerShell (Windows). Download, extraction, and placement are fully automated.

🍎 Mac / 🐧 Linux

mkdir -p ~/.claude/skills && cd ~/.claude/skills && curl -L -o ai-provider-claude-vision.zip https://jpskill.com/download/10215.zip && unzip -o ai-provider-claude-vision.zip && rm ai-provider-claude-vision.zip

🪟 Windows (PowerShell)

$d = "$env:USERPROFILE\.claude\skills"; ni -Force -ItemType Directory $d | Out-Null; iwr https://jpskill.com/download/10215.zip -OutFile "$d\ai-provider-claude-vision.zip"; Expand-Archive "$d\ai-provider-claude-vision.zip" -DestinationPath $d -Force; ri "$d\ai-provider-claude-vision.zip"

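When it finishes, restart Claude Code. The Skill then activates automatically when you make a relevant request in plain language, for example "Extract the line items from this receipt image."

💾 Prefer to download manually (for those who find the command difficult)

1. Press the blue button below to download ai-provider-claude-vision.zip
2. Double-click the ZIP file to extract it; this creates an ai-provider-claude-vision folder
3. Move that folder to C:\Users\<your name>\.claude\skills\ (Windows) or ~/.claude/skills/ (Mac)
4. Restart Claude Code

⬇ Download as .zip (recommended) ⬇ .skill format (for advanced users) Original source ↗

⚠️ Download and use at your own risk. This site accepts no responsibility for the content, behavior, or safety of the Skill.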

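🎯 What this Skill can do

The description below explains what this Skill does for you. It activates automatically when you ask Claude for help in this area.

📦 How to install (3 steps)

1. Press the "Download" button above to get the .skill file
2. Change the file extension from .skill to .zip and extract it (macOS can extract it automatically)
3. Place the extracted folder in .claude/skills/ under your home folder
   - macOS / Linux: ~/.claude/skills/
   - Windows: %USERPROFILE%\.claude\skills\

Restart Claude Code and you are done. You do not need to say "use this Skill"; it is invoked automatically for relevant requests.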


Last updated: 2026-05-18
Retrieved: 2026-05-18
Bundled files: 1

📜 Original SKILL.md (the English source that Claude reads)

Claude Vision Patterns

Quick Guide: Use type: "image" content blocks for images (base64, URL, or file_id) and type: "document" content blocks for PDFs. Supported image formats: JPEG, PNG, GIF, WebP. Placing images before text in the content array improves results. Token cost formula: tokens = (width * height) / 750. Images are auto-resized if the long edge exceeds 1568px or the image would cost more than ~1600 tokens. PDFs use type: "document" with media_type: "application/pdf". No OCR library is needed -- Claude reads text directly from images and PDFs.

<critical_requirements>

CRITICAL: Before Using This Skill

All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, import type, named constants)

(You MUST use type: "image" for images and type: "document" for PDFs -- they are different content block types)

(You MUST place images and documents BEFORE text in the content array -- Claude performs better with visual content first)

(You MUST always provide max_tokens in every request -- it is required and has no default)

(You MUST iterate over response.content blocks -- never assume a single text block in the response)

(You MUST use named constants for max_tokens, token budgets, and pixel limits -- no magic numbers)

</critical_requirements>

Auto-detection: Claude vision, image analysis, image input, base64 image, URL image, type image, type document, media_type image/jpeg, media_type image/png, image/webp, image/gif, application/pdf, PDF processing, document extraction, multimodal, multi-image, image comparison, chart analysis, screenshot analysis, image understanding, visual content, vision API

When to use:

Sending images to Claude for analysis, description, or data extraction
Processing PDF documents for text extraction, chart analysis, or summarization
Comparing multiple images in a single request
Extracting structured data from screenshots, receipts, charts, or forms
Building document processing pipelines with Claude
Estimating token costs for image-heavy workloads

Key patterns covered:

Image input via base64, URL, and Files API
PDF document input and processing
Multi-image requests and comparison patterns
Image + text prompting best practices
Token cost estimation and image sizing
Structured data extraction from visual content
Multi-turn vision conversations
Prompt caching with images and PDFs

When NOT to use:

General Claude API usage without images or documents -- use the general Anthropic SDK patterns instead
Image generation or editing -- Claude is understanding-only, it cannot create or modify images
Identifying specific people in images -- Claude refuses to name people (Anthropic policy)
Medical diagnostic imaging (CTs, MRIs) -- not designed for clinical diagnosis

Examples Index

Core: Image & PDF Input -- Base64, URL, file_id, PDF input, multi-image, token estimation
Extraction & Prompting -- Structured extraction, comparison, prompting best practices, caching
Quick API Reference -- Content block types, supported formats, size limits, token formula

<philosophy>

Philosophy

Claude's vision capabilities treat images and documents as first-class content blocks alongside text. There is no separate "vision API" -- you add image or document blocks to the same Messages API you already use for text.

Core principles:

Images are content blocks, not attachments -- Images and PDFs are content blocks in the messages array, interleaved with text. Whatever the source (base64, URL, or file_id), the image enters the request as a content block rather than as a separate attachment.
Image-first ordering -- Place images before text in the content array. This mirrors how the "documents first, query last" ordering improves text prompts. Claude processes visual content better when it sees the image before the question.
No OCR needed -- Claude reads text directly from images and PDFs. You do not need to pre-extract text with an OCR library. For PDFs, Claude processes both the extracted text and a rendered image of each page.
Token costs scale with pixels -- Image tokens are proportional to resolution: tokens = (width * height) / 750. Downsizing images before sending saves tokens without losing meaningful detail for most use cases.
PDFs are dual-processed -- Each PDF page is converted to an image AND has its text extracted. Claude sees both, giving it access to visual layout and textual content.

When to use vision:

Analyzing screenshots, photos, charts, diagrams, or infographics
Extracting data from forms, receipts, invoices, or tables
Processing PDF documents for summarization, extraction, or analysis
Comparing multiple images (before/after, A/B testing, design review)
Understanding visual context that text alone cannot capture

When NOT to use:

Pure text tasks with no visual component -- vision adds unnecessary token cost
Tasks requiring pixel-perfect spatial precision -- Claude's spatial reasoning is approximate
Identifying specific people -- Claude refuses to name individuals (Anthropic policy)
Replacing professional medical imaging analysis (CTs, MRIs, X-rays)

</philosophy>

<patterns>

Core Patterns

Pattern 1: Base64 Image Input

Read a local file, encode it to base64, and send it as a type: "image" content block. Place the image block before the text block.

// Request content: image block first, text prompt second. On the reply, iterate over every block in response.content.
content: [
  {
    type: "image",
    source: { type: "base64", media_type: "image/png", data: imageData },
  },
  { type: "text", text: "Describe what you see in this image." },
];

Why good: Image before text improves results, explicit media_type, structured content blocks

// BAD: base64 as text string -- Claude cannot interpret raw base64
content: "What's in this image? " + imageData;

Why bad: Passing base64 as text string instead of image content block, Claude cannot interpret raw base64 text as an image

See: examples/core.md for full runnable examples with base64, URL, and Files API
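The two halves of this pattern (image-first request content, and reading the reply block by block) can be sketched without any network call. This is a minimal sketch, not the Skill's own code: buildVisionContent and collectText are hypothetical helper names, and the local types are simplified stand-ins for the richer types the SDK exports.

```typescript
// Simplified local types (assumption: the real SDK types carry more fields).
type ImageMediaType = "image/jpeg" | "image/png" | "image/gif" | "image/webp";

type ImageBlock = {
  type: "image";
  source: { type: "base64"; media_type: ImageMediaType; data: string };
};
type TextBlock = { type: "text"; text: string };
type RequestBlock = ImageBlock | TextBlock;

// Replies can contain non-text blocks (for example tool_use), so the type stays open.
type ResponseBlock = { type: string; text?: string };

// Hypothetical helper: image block first, prompt second.
function buildVisionContent(
  imageData: string,
  mediaType: ImageMediaType,
  prompt: string,
): RequestBlock[] {
  return [
    {
      type: "image",
      source: { type: "base64", media_type: mediaType, data: imageData },
    },
    { type: "text", text: prompt },
  ];
}

// Hypothetical helper: visit every block instead of assuming content[0] is text.
function collectText(blocks: ResponseBlock[]): string {
  const parts: string[] = [];
  for (const block of blocks) {
    if (block.type === "text" && typeof block.text === "string") {
      parts.push(block.text);
    }
  }
  return parts.join("\n");
}
```

In a real request, the array from buildVisionContent would be the content of a user message, and the same loop would run over response.content.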

Pattern 2: URL vs Base64 vs Files API

Three source types for images. Choose based on where your image lives.

// URL source -- simplest, smallest payload
source: { type: "url", url: "https://example.com/chart.png" }

// Base64 source -- local files
source: { type: "base64", media_type: "image/jpeg", data: base64String }

// Files API source (beta) -- upload once, reuse across requests
source: { type: "file", file_id: "file_abc123" }

When to use: URL for hosted images, base64 for local files, Files API for multi-turn or repeated use

See: examples/core.md for full examples of each source type
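One way to encode the "where does the image live" choice is a small mapping function. This is a sketch under stated assumptions: ImageLocation and toImageSource are hypothetical names introduced for illustration, and the three source shapes simply mirror the fragments above.

```typescript
type ImageMediaType = "image/jpeg" | "image/png" | "image/gif" | "image/webp";

// The three source shapes shown in this pattern.
type ImageSource =
  | { type: "url"; url: string }
  | { type: "base64"; media_type: ImageMediaType; data: string }
  | { type: "file"; file_id: string };

// Hypothetical description of where the image currently lives.
type ImageLocation =
  | { kind: "hosted"; url: string }
  | { kind: "local"; mediaType: ImageMediaType; base64: string }
  | { kind: "uploaded"; fileId: string };

function toImageSource(location: ImageLocation): ImageSource {
  switch (location.kind) {
    case "hosted":
      // Hosted image: send the URL, smallest payload.
      return { type: "url", url: location.url };
    case "local":
      // Local file: the caller has already base64-encoded the bytes.
      return {
        type: "base64",
        media_type: location.mediaType,
        data: location.base64,
      };
    case "uploaded":
      // Files API (beta): reuse the id returned by an earlier upload.
      return { type: "file", file_id: location.fileId };
  }
}
```

The discriminated union makes the compiler reject a location that mixes fields, such as a URL together with a file id.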

Pattern 3: PDF Document Input

PDFs use type: "document", not type: "image". Confusing the two is the most common mistake.

// Correct: type "document" for PDFs
{ type: "document", source: { type: "base64", media_type: "application/pdf", data: pdfData } }

// WRONG: type "image" for PDFs -- causes API errors
{ type: "image", source: { type: "base64", media_type: "application/pdf", data: pdfData } }

Why good: type: "document" enables dual processing (text extraction + page rendering)

Why bad: Using type: "image" for PDFs causes API errors. PDFs require type: "document".

See: examples/core.md for base64 and URL PDF examples, examples/extraction.md for PDF caching
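A guard that derives the block type from the media type keeps the image/document mix-up out of calling code. The sketch below assumes base64 input only; buildMediaBlock and isImageMediaType are hypothetical helpers, not part of any SDK.

```typescript
const PDF_MEDIA_TYPE = "application/pdf";
const IMAGE_MEDIA_TYPES = ["image/jpeg", "image/png", "image/gif", "image/webp"] as const;

type ImageMediaType = (typeof IMAGE_MEDIA_TYPES)[number];

type MediaBlock =
  | {
      type: "image";
      source: { type: "base64"; media_type: ImageMediaType; data: string };
    }
  | {
      type: "document";
      source: { type: "base64"; media_type: typeof PDF_MEDIA_TYPE; data: string };
    };

function isImageMediaType(value: string): value is ImageMediaType {
  return IMAGE_MEDIA_TYPES.some((candidate) => candidate === value);
}

// Hypothetical helper: PDFs become "document" blocks, supported images become "image" blocks.
function buildMediaBlock(mediaType: string, data: string): MediaBlock {
  if (mediaType === PDF_MEDIA_TYPE) {
    return {
      type: "document",
      source: { type: "base64", media_type: PDF_MEDIA_TYPE, data },
    };
  }
  if (isImageMediaType(mediaType)) {
    return { type: "image", source: { type: "base64", media_type: mediaType, data } };
  }
  // Anything else must be converted to a supported format before sending.
  throw new Error(`Unsupported media type: ${mediaType}`);
}
```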

Pattern 4: Multiple Images with Labels

Label images with text blocks so Claude can reference them clearly.

content: [
  { type: "text", text: "Image 1:" },
  {
    type: "image",
    source: { type: "base64", media_type: "image/jpeg", data: image1 },
  },
  { type: "text", text: "Image 2:" },
  {
    type: "image",
    source: { type: "base64", media_type: "image/jpeg", data: image2 },
  },
  {
    type: "text",
    text: "Compare these two images and describe the differences.",
  },
];

Why good: Labels let Claude reference specific images unambiguously

Why bad (without labels): Claude may confuse which image is which when no labels are provided

See: examples/core.md for full multi-image example
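For more than two images, the label/image pairs can be generated rather than written by hand. A minimal sketch, assuming base64 input; buildLabeledComparison and InputImage are hypothetical names.

```typescript
type ImageMediaType = "image/jpeg" | "image/png" | "image/gif" | "image/webp";

type TextBlock = { type: "text"; text: string };
type ImageBlock = {
  type: "image";
  source: { type: "base64"; media_type: ImageMediaType; data: string };
};
type Block = TextBlock | ImageBlock;

type InputImage = { mediaType: ImageMediaType; data: string };

// Hypothetical helper: emits "Image N:" before each image, then the question last.
function buildLabeledComparison(images: InputImage[], question: string): Block[] {
  const content: Block[] = [];
  images.forEach((image, index) => {
    content.push({ type: "text", text: `Image ${index + 1}:` });
    content.push({
      type: "image",
      source: { type: "base64", media_type: image.mediaType, data: image.data },
    });
  });
  content.push({ type: "text", text: question });
  return content;
}
```

The labels are 1-based so that they match how the prompt text refers to the images.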

Pattern 5: Token Cost Estimation

Token formula: tokens = (width * height) / 750. Auto-resize triggers at 1568px long edge or ~1.15 megapixels.

const TOKENS_PER_PIXEL_DIVISOR = 750;
const MAX_LONG_EDGE_PX = 1568;
const MAX_MEGAPIXELS = 1.15;

function estimateImageTokens(width: number, height: number): number {
  let w = width,
    h = height;
  const longEdge = Math.max(w, h);
  const mp = (w * h) / 1_000_000;
  if (longEdge > MAX_LONG_EDGE_PX || mp > MAX_MEGAPIXELS) {
    const scale = Math.min(
      MAX_LONG_EDGE_PX / longEdge,
      Math.sqrt(MAX_MEGAPIXELS / mp),
    );
    w = Math.round(width * scale);
    h = Math.round(height * scale);
  }
  return Math.ceil((w * h) / TOKENS_PER_PIXEL_DIVISOR);
}
// With these constants: 200x200 -> 54 tokens | 1000x1000 -> 1334 | 4000x3000 -> 1534 (auto-resized to 1238x929)

Why good: Named constants, accounts for auto-resize, documents the formula

See: examples/core.md for full estimateImageTokens() utility and countTokens() usage, reference.md for the complete size/token/cost table

Pattern 6: Structured Data Extraction

Combine vision with messages.parse() and Zod schemas for typed extraction.

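import Anthropic from "@anthropic-ai/sdk";
import { zodOutputFormat } from "@anthropic-ai/sdk/helpers/zod";
import { readFileSync } from "node:fs";
import { z } from "zod";

const MAX_TOKENS = 1024; // structured-extraction budget from the decision framework
const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
const receiptImage = readFileSync("receipt.jpg").toString("base64"); // hypothetical local file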

const ReceiptData = z.object({
  merchant: z.string(),
  date: z.string(),
  items: z.array(
    z.object({ name: z.string(), quantity: z.number(), price: z.number() }),
  ),
  total: z.number(),
  currency: z.string(),
});

const response = await client.messages.parse({
  model: "claude-sonnet-4-6",
  max_tokens: MAX_TOKENS,
  messages: [
    {
      role: "user",
      content: [
        {
          type: "image",
          source: {
            type: "base64",
            media_type: "image/jpeg",
            data: receiptImage,
          },
        },
        {
          type: "text",
          text: "Extract all receipt information from this image.",
        },
      ],
    },
  ],
  output_config: { format: zodOutputFormat(ReceiptData) },
});

const receipt = response.parsed_output; // fully typed

Why good: Zod schema for type-safe extraction, messages.parse() for auto-validation, image before text

See: examples/extraction.md for receipt, chart, form, comparison, and multi-document extraction patterns

</patterns>

<performance>

Performance Optimization

Image Sizing Strategy

Image resolution vs token cost:
200x200   -> ~54 tokens    ($0.00016/image at Sonnet 4.6 pricing)
1000x1000 -> ~1334 tokens  ($0.004/image)
1092x1092 -> ~1590 tokens  ($0.0048/image) -- max 1:1 without auto-resize
4000x3000 -> ~1590 tokens  (auto-resized; fitting the 1568px long edge alone would still exceed the ~1600-token cap)

Pre-resize images to no more than 1568px on the long edge and 1.15 megapixels to avoid auto-resize latency
Small images under 200px on any edge may degrade output quality
Images over 8000x8000px are rejected outright
20+ images in one request limits each image to 2000x2000px max
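The pre-resize advice above can be turned into a target-size calculation to run before handing the image to whatever resizing library a project uses. This is a sketch, assuming 1.15 megapixels means 1,150,000 pixels; fitWithinLimits is a hypothetical helper, and it only computes dimensions, it does not touch image data.

```typescript
const MAX_LONG_EDGE_PX = 1568;
const MAX_PIXELS = 1_150_000; // assumption: 1.15 megapixels read as 1,150,000 pixels

type Size = { width: number; height: number };

// Hypothetical helper: target dimensions that respect both limits, keeping aspect ratio.
function fitWithinLimits(width: number, height: number): Size {
  const edgeScale = MAX_LONG_EDGE_PX / Math.max(width, height);
  const pixelScale = Math.sqrt(MAX_PIXELS / (width * height));
  const scale = Math.min(1, edgeScale, pixelScale);
  if (scale === 1) {
    // Already within both limits: never upscale.
    return { width, height };
  }
  // Floor so that rounding cannot push the result back over a limit.
  return {
    width: Math.max(1, Math.floor(width * scale)),
    height: Math.max(1, Math.floor(height * scale)),
  };
}
```

For a 4000x3000 photo the megapixel limit is the binding one, so the long edge ends up well under 1568px.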

Cost Reduction Techniques

Resize before sending -- A 4000x3000 image is auto-resized to roughly the same token count as a 1092x1092 image, but the server-side resize adds latency. Pre-resize to save time.
Use URL sources when images are already hosted -- avoids encoding overhead and reduces request payload size
Use the Files API for images used across multiple requests -- upload once, reference by file_id
Cache PDFs with cache_control: { type: "ephemeral" } when asking multiple questions about the same document
Use token counting (client.messages.countTokens()) before expensive requests to estimate costs
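The PDF caching item can be illustrated by the shape of the content array: the cache_control marker sits on the document block, and the question follows it. A minimal sketch, assuming base64 PDF input; buildCachedPdfContent is a hypothetical helper.

```typescript
type CachedPdfBlock = {
  type: "document";
  source: { type: "base64"; media_type: "application/pdf"; data: string };
  cache_control: { type: "ephemeral" };
};
type TextBlock = { type: "text"; text: string };

// Hypothetical helper: same document block on every call, only the question changes.
function buildCachedPdfContent(
  pdfData: string,
  question: string,
): [CachedPdfBlock, TextBlock] {
  return [
    {
      type: "document",
      source: { type: "base64", media_type: "application/pdf", data: pdfData },
      cache_control: { type: "ephemeral" },
    },
    { type: "text", text: question },
  ];
}
```

Keeping the document block byte-for-byte identical across requests is what lets later questions reuse the cached prefix.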

PDF Token Costs

Text extraction: ~1,500-3,000 tokens per page depending on density
Image rendering: Each page also incurs image token costs (same formula)
Total per page: text tokens + image tokens (dual processing)
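The per-page arithmetic above can be written as a range estimate. This is a rough sketch: the 1,500 and 3,000 text-token bounds come from the list above, while imageTokensPerPage is an input the caller must supply, because the rendered page size is not stated here. estimatePdfTokens is a hypothetical helper.

```typescript
const TEXT_TOKENS_PER_PAGE_LOW = 1500;
const TEXT_TOKENS_PER_PAGE_HIGH = 3000;

type TokenRange = { low: number; high: number };

// Hypothetical helper: total = pages * (text tokens + image tokens), as a low/high range.
function estimatePdfTokens(pageCount: number, imageTokensPerPage: number): TokenRange {
  return {
    low: pageCount * (TEXT_TOKENS_PER_PAGE_LOW + imageTokensPerPage),
    high: pageCount * (TEXT_TOKENS_PER_PAGE_HIGH + imageTokensPerPage),
  };
}
```

For a real budget, client.messages.countTokens() on the actual request is the more reliable figure; this range is only for early sizing.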

</performance>

<decision_framework>

Decision Framework

Image Source Type

Where is your image?
+-- Local file        -> Base64 encode with readFileSync().toString("base64")
+-- Public URL        -> Use type: "url" source (simplest, smallest payload)
+-- Already uploaded  -> Use type: "file" source with file_id (Files API, beta)
+-- Multiple requests -> Upload once via Files API, reuse file_id

Image vs Document Block

What type of file?
+-- JPEG, PNG, GIF, WebP -> type: "image"
+-- PDF                  -> type: "document" with media_type: "application/pdf"
+-- Other formats        -> Convert to a supported format first

Token Budget for max_tokens

What kind of analysis?
+-- Brief description    -> 256-512 max_tokens
+-- Detailed analysis    -> 1024-2048 max_tokens
+-- Document summarization -> 2048-4096 max_tokens
+-- Structured extraction  -> 1024 max_tokens (JSON output is compact)
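Since max_tokens should come from named constants, the budget tree above maps naturally onto a lookup table. A sketch only: it takes the upper end of each range as the value, which is an assumption, and the names AnalysisKind and maxTokensFor are hypothetical.

```typescript
type AnalysisKind =
  | "brief_description"
  | "detailed_analysis"
  | "document_summarization"
  | "structured_extraction";

// Upper end of each range from the decision tree (assumption: favor headroom over thrift).
const MAX_TOKENS_BY_ANALYSIS: Record<AnalysisKind, number> = {
  brief_description: 512,
  detailed_analysis: 2048,
  document_summarization: 4096,
  structured_extraction: 1024,
};

function maxTokensFor(kind: AnalysisKind): number {
  return MAX_TOKENS_BY_ANALYSIS[kind];
}
```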

</decision_framework>

<red_flags>

RED FLAGS

High Priority Issues:

Using type: "image" for PDFs -- PDFs require type: "document" with media_type: "application/pdf"
Passing base64 data as a text string instead of an image content block -- Claude cannot interpret raw base64 text
Not providing max_tokens -- required on every request, no default
Images larger than 8000x8000px -- rejected by the API
Images over the API file size limit of 5MB per image (10MB on claude.ai)
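The size limits in this list can be checked locally before a request is sent. The sketch below is an illustration under stated assumptions: it also applies the 2000x2000px limit for requests with 20+ images (read here as 20 or more, the stricter reading), treats 5MB as 5 * 1024 * 1024 bytes, and uses hypothetical names (ImageInfo, findLimitViolations).

```typescript
const MAX_EDGE_PX = 8000;
const MAX_EDGE_PX_MANY_IMAGES = 2000;
const MANY_IMAGES_THRESHOLD = 20; // assumption: "20+" means 20 or more
const MAX_IMAGE_BYTES = 5 * 1024 * 1024; // assumption: 5MB counted in binary megabytes

type ImageInfo = { name: string; width: number; height: number; bytes: number };

// Hypothetical helper: returns one message per violated limit, empty when all images pass.
function findLimitViolations(images: ImageInfo[]): string[] {
  const edgeLimit =
    images.length >= MANY_IMAGES_THRESHOLD ? MAX_EDGE_PX_MANY_IMAGES : MAX_EDGE_PX;
  const problems: string[] = [];
  for (const image of images) {
    if (image.width > edgeLimit || image.height > edgeLimit) {
      problems.push(
        `${image.name}: ${image.width}x${image.height} exceeds ${edgeLimit}px per edge`,
      );
    }
    if (image.bytes > MAX_IMAGE_BYTES) {
      problems.push(`${image.name}: ${image.bytes} bytes exceeds the per-image size limit`);
    }
  }
  return problems;
}
```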

Medium Priority Issues:

Placing text before images in the content array -- Claude performs better with images first
Not labeling multiple images -- Claude may confuse which image is which without "Image 1:", "Image 2:" labels
Sending full-resolution images when a smaller version would suffice -- wastes tokens and adds latency from auto-resizing
Using base64 for publicly available images -- URL source is simpler and reduces payload
Not using cache_control when asking multiple questions about the same PDF -- each request re-processes the full document

Common Mistakes:

Expecting Claude to generate or edit images -- it is understanding-only
Using vision for tasks requiring precise spatial reasoning (exact pixel coordinates, analog clock reading) -- Claude's spatial abilities are approximate
Relying on Claude to identify specific people -- it refuses to name individuals per Anthropic policy
Assuming exact object counts -- Claude gives approximate counts, especially for many small objects
Forgetting that PDF pages are dual-processed (text + image) -- token costs are higher than text-only

Gotchas & Edge Cases:

Images under 200px on any edge may produce lower quality analysis
When sending 20+ images in a single request, each image is limited to 2000x2000px max
API supports up to 600 images per request (100 for 200k context window models), but request size limits (32MB) are often reached first
Claude does not read image EXIF metadata -- orientation, camera info, GPS data are not accessible
PDFs with passwords or encryption are not supported -- only standard PDFs
The Files API for images and documents is currently in beta (betas: ["files-api-2025-04-14"])
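Multi-turn vision conversations do not need a new copy of the image in each follow-up message -- it stays in the conversation history, which you resend with every request because the Messages API is stateless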
For PDFs, dense pages with complex tables or heavy graphics can fill the context window before reaching the 600-page limit
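The multi-turn point can be made concrete: each Messages API request carries the whole conversation, so the image block travels inside the first user message and follow-up turns are plain text. A minimal sketch with simplified local types; appendFollowUp is a hypothetical helper.

```typescript
type ImageBlock = { type: "image"; source: { type: "url"; url: string } };
type TextBlock = { type: "text"; text: string };
type Message = {
  role: "user" | "assistant";
  content: string | Array<ImageBlock | TextBlock>;
};

// Hypothetical helper: extends the history without touching the original image message.
function appendFollowUp(
  history: Message[],
  assistantReply: string,
  followUp: string,
): Message[] {
  return [
    ...history,
    { role: "assistant", content: assistantReply },
    { role: "user", content: followUp },
  ];
}
```

The returned array is what the next request would send as messages; the image appears once, in the first entry.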

</red_flags>

<critical_reminders>

CRITICAL REMINDERS