
document-workflows

Use this skill for building end-to-end document processing workflows and pipelines using LandingAI ADE. Trigger when users need to: (1) Process batches of documents in parallel or async, (2) Build classify-then-extract pipelines for mixed document types, (3) Prepare parsed documents for RAG systems with chunking and vector DB ingestion, (4) Load extraction results into databases like Snowflake or export to CSV/DataFrames, (5) Visualize extraction results: draw bounding box overlays on pages, crop chunk images, or highlight/annotate specific words or phrases found in documents, (6) Build Streamlit or web UIs for document processing, (7) Find and highlight specific terms within document sections using word-level grounding (e.g. highlight "L2S" in the Introduction, redact PII, annotate extracted values on the original page). This skill complements the document-extraction skill which covers ADE SDK basics. Use document-extraction to write code that executes parse/extract/split operations with more precision and less cost than adding the document image to the prompt and asking the LLM to find the relevant info. Use document-workflows when composing those operations into pipelines, or when you need visualization, annotation, or word-level grounding on parsed documents.

⚡ Recommended: one-line install (about 60 seconds)

Copy the command below and paste it into a terminal (Mac/Linux) or PowerShell (Windows). It downloads, unzips, and places the files automatically.

🍎 Mac / 🐧 Linux
mkdir -p ~/.claude/skills && cd ~/.claude/skills && curl -L -o document-workflows.zip https://jpskill.com/download/22415.zip && unzip -o document-workflows.zip && rm document-workflows.zip
🪟 Windows (PowerShell)
$d = "$env:USERPROFILE\.claude\skills"; ni -Force -ItemType Directory $d | Out-Null; iwr https://jpskill.com/download/22415.zip -OutFile "$d\document-workflows.zip"; Expand-Archive "$d\document-workflows.zip" -DestinationPath $d -Force; ri "$d\document-workflows.zip"

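When it finishes, restart Claude Code. The skill then triggers automatically when you ask for a related task in plain language, for example "batch-process these invoices".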

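💾 Manual download (if you prefer not to use the command line)
  1. Click the blue button below to download document-workflows.zip
  2. Double-click the ZIP file to unzip it; this creates a document-workflows folder
  3. Move that folder to C:\Users\<your name>\.claude\skills\ (Windows) or ~/.claude/skills/ (Mac)
  4. Restart Claude Code

⚠️ Download and use at your own risk. This site accepts no responsibility for the content, behavior, or safety of the skill.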

🎯 What this skill does

The description on this page explains what this skill does for you. It triggers automatically when you ask Claude for work in this area.

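📦 Installation (3 steps)

  1. Click the "Download" button above to get the .skill file
  2. Change the file extension from .skill to .zip and extract it (macOS can extract it automatically)
  3. Put the extracted folder into .claude/skills/ in your home folder
    • macOS / Linux: ~/.claude/skills/
    • Windows: %USERPROFILE%\.claude\skills\

Restart Claude Code to finish. You do not need to say "use this skill"; it is invoked automatically for related requests.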

Last updated: 2026-05-18
Retrieved: 2026-05-18
Bundled files: 7


Document Workflows — ADE Pipeline Patterns

Overview

This skill provides reusable building blocks for composing LandingAI ADE primitives (parse, extract, split) into production-ready document processing pipelines. It complements the document-extraction skill:

| Concern | document-extraction | document-workflows |
|---|---|---|
| Scope | ADE SDK API: parse, extract, split, grounding | End-to-end pipelines: batch, RAG, DB, classify-route |
| When | Need to call a single ADE operation | Need to compose operations into a workflow |
| Code | SDK method calls with parameters | Complete functions with error handling, parallelism |
| Deps | landingai-ade only | + workflow-specific libs (pandas, chromadb, etc.) |

Philosophy: Organize by workflow pattern (batch, RAG, DB insertion), not by document type. The same pattern applies whether documents are invoices, utility bills, or medical forms.


Step 0 (mandatory) — Pre-Flight Document Exploration {#pre-flight}

Run this before writing any pipeline code whenever working with documents whose internal structure has not already been inspected in this session.

Rule: never write section-detection, heading-matching, or text-search code without first running Tool 2 (diagnostic parse) on the sample document. Heading format is document-specific and cannot be inferred from the task description or document type alone — the only reliable way to know it is to look at the actual ADE output.

Common surprises: a paper's "Introduction" heading may appear as 1. Introduction (plain text, no #), ## Introduction, INTRODUCTION (all-caps), or embedded inside a text chunk with body copy. Getting this wrong means a silent failure (zero chunks matched) that requires a full re-parse to debug.

Run Tool 1 (visual render) and Tool 2 (diagnostic parse) on 1–3 representative sample documents before writing any code. This takes under a minute and prevents debugging iterations that a pre-flight would have avoided.

Tool 1 — Visual page render

Render 1–2 pages as PNG and read them as visual context. No ADE credits used, but each PNG consumes context tokens. Use when layout is ambiguous or document origin is unknown (handwriting? scan? form?).

.venv/bin/python - << 'EOF'
import pymupdf
from pathlib import Path
from PIL import Image

pdf = Path('path/to/sample.pdf')
out_dir = Path('/tmp/ade_preflight'); out_dir.mkdir(exist_ok=True)
doc = pymupdf.open(pdf)
for pg in range(min(2, len(doc))):   # first 2 pages only
    pix = doc[pg].get_pixmap(matrix=pymupdf.Matrix(1.5, 1.5))   # 108 DPI
    img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
    out = out_dir / f"{pdf.stem}_page{pg + 1}.png"
    img.save(out)
    print(out)
doc.close()
EOF

Then read the saved PNGs. Doing so immediately answers:

  • Are headings bold text? (→ ADE may output a plain-text heading, not # Heading)
  • Is the document handwritten or scanned? → Tesseract OCR needed, not PyMuPDF
  • Single-column or two-column layout?
  • Any noise: running headers, page numbers, watermarks, stamps?

Tool 2 — ADE diagnostic parse

Parses 1 sample and prints markdown structure + chunk inventory. Uses ADE credits — keep to 1–3 samples only, never the full corpus.

.venv/bin/python - << 'EOF'
import os
from pathlib import Path
from collections import Counter
from dotenv import load_dotenv

# Load API key: prefer existing env var, then .env file lookup
load_dotenv()  # Load API key from .env. Add a path to the .env if needed.

from landingai_ade import LandingAIADE
client = LandingAIADE()
pr = client.parse(document=Path('path/to/sample.pdf'))

print("=== MARKDOWN (first 80 lines) ===")
for i, ln in enumerate(pr.markdown.splitlines()[:80], 1):
    print(f"{i:3}: {ln}")

print("\n=== CHUNKS ===")
for ch in pr.chunks:
    txt = (ch.markdown or '').replace('\n', ' ')[:70]
    b = ch.grounding.box
    print(f"p{ch.grounding.page} {ch.type:12} "
          f"l={b.left:.2f} t={b.top:.2f} r={b.right:.2f} b={b.bottom:.2f} | {txt}")

print(f"\nPages: {pr.metadata.page_count}  "
      f"Chunks: {len(pr.chunks)}  "
      f"Types: {dict(Counter(ch.type for ch in pr.chunks))}")
EOF

Cost note: Save the parse result with pr.model_dump() to a JSON file after the first run. Load it for later development instead of calling client.parse() again. Only re-parse when the document set changes.
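The caching advice above could be sketched as follows. This is a minimal standard-library sketch, not part of the ADE SDK: `load_or_parse` and its `parse_fn` argument are hypothetical names, and it assumes the parse result can be turned into a plain dict (for instance via `pr.model_dump()` as mentioned above).

```python
import json
from pathlib import Path
from typing import Any, Callable


def load_or_parse(
    cache_path: Path,
    parse_fn: Callable[[], dict[str, Any]],
) -> dict[str, Any]:
    """Return the cached parse result if it exists; otherwise call
    parse_fn once and write its result to cache_path as JSON."""
    if cache_path.exists():
        # Cache hit: no API call is made, so no credits are spent
        return json.loads(cache_path.read_text(encoding="utf-8"))

    # Hypothetical usage: parse_fn = lambda: client.parse(document=pdf).model_dump()
    result = parse_fn()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    # default=str keeps the dump from failing on values json cannot encode
    cache_path.write_text(
        json.dumps(result, ensure_ascii=False, default=str),
        encoding="utf-8",
    )
    return result
```

Deleting the cache file is then the explicit signal to re-parse when the document set changes.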

What to look for

| Observation | Implication |
|---|---|
| Heading is 1. Introduction (plain text, no #) | ADE markdown won't use ATX header → use ADE extract, not regex |
| Heading format varies across docs (# INTRO in one, 1. Intro in another) | Regex will break on some docs → use ADE extract for robustness |
| Every ch.markdown starts with <a id='...'></a> | Strip anchor before string matching or display |
| Two-column: chunks on same page with l=0.07 vs l=0.50 | Text order is left column then right; sections may span both |
| Chunk text cut mid-word at page break | Section spans pages; collect chunks from multiple pages |
| marginalia chunks at t<0.08 or t>0.90 | Running headers / page numbers → exclude from content extraction |
| Scanned / handwritten content visible in page image | PyMuPDF text extraction won't work → use Tesseract OCR |
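Two of the observations above (anchor tags and marginalia) translate into small helpers. The sketch below assumes chunks expose `type`, `markdown`, `grounding.page`, and `grounding.box.top` as in the diagnostic parse; the function names, the `_fake_chunk` stand-in, and the cut-off defaults (taken from the table) are illustrative rather than SDK features.

```python
import re
from types import SimpleNamespace
from typing import Optional

ANCHOR_RE = re.compile(r"<a id='[^']*'>\s*</a>")


def clean_text(markdown: Optional[str]) -> str:
    """Remove ADE anchor tags so string matching sees only the text."""
    return ANCHOR_RE.sub("", markdown or "").strip()


def is_margin_chunk(chunk, top_cut: float = 0.08, bottom_cut: float = 0.90) -> bool:
    """True for marginalia chunks in the top or bottom band of the page."""
    top = chunk.grounding.box.top
    return chunk.type == "marginalia" and (top < top_cut or top > bottom_cut)


def content_chunks(chunks) -> list[tuple[int, str]]:
    """Return (page, cleaned text) for chunks that are not margin noise."""
    return [
        (ch.grounding.page, clean_text(ch.markdown))
        for ch in chunks
        if not is_margin_chunk(ch) and clean_text(ch.markdown)
    ]


def _fake_chunk(kind: str, page: int, top: float, markdown: str):
    # Stand-in for an ADE chunk, used only to demonstrate the helpers
    box = SimpleNamespace(left=0.07, top=top, right=0.93, bottom=top + 0.03)
    return SimpleNamespace(
        type=kind, markdown=markdown,
        grounding=SimpleNamespace(page=page, box=box),
    )


sample = [
    _fake_chunk("marginalia", 0, 0.03, "<a id='c0'></a>\n\nJournal of Examples"),
    _fake_chunk("text", 0, 0.20, "<a id='c1'></a>\n\n1. Introduction"),
    _fake_chunk("marginalia", 0, 0.95, "<a id='c2'></a>\n\n1"),
]
print(content_chunks(sample))
```

Only the text chunk survives here; the running header and page number are dropped, and the anchor tag is gone from the remaining text.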

Tool 3 — Post-Crop Visual Verification (mandatory for bounding-box workflows) {#post-crop-verification}

After producing any bounding-box crop or overlay (figure extraction, chunk cropping, table cell extraction, word-level grounding), read back at least one output PNG as an image and describe what you see. Compare your description against the user's request. This catches:

  • Wrong-page bugs — ADE page numbers are 0-indexed; an off-by-one error lands the crop on an adjacent page with completely different content
  • Wrong-region bugs — coordinate system mismatches that crop blank space or an unrelated section

Rule: never declare a crop workflow complete without visually reading at least one output PNG and confirming its content matches the user's request.
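Both bug classes can be partly guarded in code before the visual check. The sketch below assumes boxes are normalized to the 0..1 range (as the overlay code in the Visualization section multiplies them by the image width and height) and that pages are 0-indexed; `box_to_pixels` and `page_index` are hypothetical helper names. It does not replace reading the PNG.

```python
def box_to_pixels(
    left: float, top: float, right: float, bottom: float,
    width: int, height: int,
) -> tuple[int, int, int, int]:
    """Convert a normalized (0..1) box into an integer pixel rectangle.

    Raises ValueError when the box does not look normalized, which is
    the usual symptom of a coordinate-system mismatch."""
    if not (0.0 <= left < right <= 1.0 and 0.0 <= top < bottom <= 1.0):
        raise ValueError("box is not normalized; check the coordinate system")
    x0, y0 = int(left * width), int(top * height)
    x1, y1 = int(right * width), int(bottom * height)
    # Clamp the far edges to the image bounds
    return x0, y0, min(x1, width), min(y1, height)


def page_index(page: int, page_count: int) -> int:
    """Validate a 0-indexed page number before indexing rendered pages."""
    if not 0 <= page < page_count:
        raise IndexError(f"page {page} out of range for {page_count} pages")
    return page
```

A box passed in pixel units, or a 1-indexed page number equal to the page count, fails loudly here instead of producing a silently wrong crop.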

Verification steps

  1. Save the first crop as PNG (the workflow already does this)
  2. Read the PNG file as an image (use the read_file tool on the PNG path)
  3. Describe what you see: what content, table, figure, or text appears?
  4. Compare against the user's request:
    • User asked for "the Events table" → does the crop show an Events table?
    • User asked for "Figure 3" → does the crop show a chart/diagram?
    • User asked for "Introduction section" → does the crop show intro text?
  5. If the description doesn't match → investigate page indexing and bounding-box coordinates before continuing
  6. Only proceed with remaining crops after the first one is verified

Why LLM vision, not heuristics

A blank-check heuristic (e.g. "mean brightness > 250 → blank") catches only the most obvious failures. The agent's own vision capability can semantically verify: "this crop shows a bar chart" vs "the user asked for a data table." This catches wrong-page errors even when the crop contains valid content from the wrong section.


Quick Reference — Building Blocks

| # | Block | Pattern | Reference |
|---|---|---|---|
| 0 | Pre-flight (mandatory) | Render pages + diagnostic parse before building | Above |
| 1 | Parse + Save | Single doc → JSON + markdown | Below |
| 2 | Parse + Extract + Save | Single doc → structured data | Below |
| 3 | Batch (sync) | ThreadPoolExecutor + tqdm | batch-processing.md |
| 4 | Batch (async) | AsyncLandingAIADE + aiolimiter | batch-processing.md |
| 5 | Large files | Parse Jobs API (async polling) | batch-processing.md |
| 6 | Classify → Extract | Enum classification + schema routing | Below |
| 7 | Results → DataFrame | Flatten nested extraction to tables | database-integration.md |
| 8 | Results → CSV | Summary + per-document export | database-integration.md |
| 9 | Results → Snowflake | 4 normalized tables + COPY upload | database-integration.md |
| 10 | Chunks → RAG CSV | 19-column chunk dataset | rag-pipelines.md |
| 11 | Chunks → ChromaDB | OpenAI embeddings + persistent store | rag-pipelines.md |
| 12 | Chunks → FAISS | LangChain Documents + FAISS index | rag-pipelines.md |
| 13 | RAG query | RetrievalQA chain with sources | rag-pipelines.md |
| 14 | Chunk images | Crop chunks from pages as PNGs | visualization.md |
| 15 | Grounding overlay | Color-coded bounding boxes on pages | visualization.md |
| 16 | Word-level grounding | OCR + fuzzy match highlighting | visualization.md |
| 17 | Section extraction | Named section from markdown (regex or ADE extract) | Below |
| 18 | Embedding computation | Local (FastEmbed) or API (OpenAI) with best practices | rag-pipelines.md |
| 19 | Hierarchical chunking | Group ADE chunks into semantic units for embedding | rag-pipelines.md |
| 20 | Multi-granularity RAG | Chunk vs hierarchical vs document-level strategy | rag-pipelines.md |
| 21 | Table stitching | Parse-only or parse+extract merge of multi-page tables | table-stitching.md |
|  | Schema catalog | Ready-to-use Pydantic models | schema-catalog.md |

Core Workflow: Parse + Extract + Save

The fundamental two-step ADE pattern. Every other workflow builds on this.

import io
from pathlib import Path
from typing import Any, Tuple, Type

from landingai_ade import LandingAIADE
from landingai_ade.lib import pydantic_to_json_schema


def parse_extract_save(
    doc_path: Path,
    client: LandingAIADE,
    schema_cls: Type[Any],
    output_dir: str = "./ade_results",
) -> Tuple[Any, Any]:
    """Parse a document, extract structured data, save both
    as JSON via save_to. Returns (parse_result, extract_result)."""
    # Step 1 — Parse (auto-saves {stem}_parse_output.json)
    parse_result = client.parse(
        document=doc_path, save_to=output_dir,
    )

    # Step 2 — Extract (auto-saves {stem}_extract_output.json)
    extract_result = client.extract(
        schema=pydantic_to_json_schema(schema_cls),
        markdown=io.BytesIO(
            parse_result.markdown.encode("utf-8")
        ),
        save_to=output_dir,
    )
    return parse_result, extract_result

save_to parameter: Available on parse(), extract(), and split(). Creates the folder if needed and writes {input_filename}_{method}_output.json. This is a client-side convenience — the full response is saved locally after the API call.
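Because the output file name is predictable, a pipeline can check for an existing result before spending credits again. The sketch below assumes the `{stem}_{method}_output.json` naming described above (as in the code comments of `parse_extract_save`); `expected_output` and `needs_parse` are hypothetical helpers, not SDK functions, and the exact naming should be confirmed against a real run.

```python
from pathlib import Path


def expected_output(
    doc_path: Path, method: str, output_dir: str = "./ade_results",
) -> Path:
    """Path that save_to is described as writing for this document.
    Assumes the {stem}_{method}_output.json naming stated above."""
    return Path(output_dir) / f"{doc_path.stem}_{method}_output.json"


def needs_parse(doc_path: Path, output_dir: str = "./ade_results") -> bool:
    """True when no saved parse output exists yet for doc_path."""
    return not expected_output(doc_path, "parse", output_dir).exists()
```

For example, `[fp for fp in files if needs_parse(fp)]` would restrict a re-run to documents that have no saved parse output.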

Parse-Only (no extraction)

def parse_and_save(
    doc_path: Path,
    client: LandingAIADE,
    output_dir: str = "./ade_results",
) -> Any:
    return client.parse(
        document=doc_path, save_to=output_dir,
    )

Schemas: See schema-catalog.md for ready-to-use Pydantic models (invoice, utility bill, bank statement, pay stub, food label, CME certificate, document classifier). See the document-extraction skill for schema design rules.


Classify-then-Extract

Process mixed document types by first classifying, then applying the appropriate schema. Two approaches:

Approach 1: Classification Extraction (any document mix)

from typing import Literal
from pydantic import BaseModel, Field


class DocType(BaseModel):
    type: Literal[
        "invoice", "bank_statement", "pay_stub",
        "utility_bill",
    ] = Field(description="The type of the document.")


# Map types to schemas (from schema-catalog.md)
SCHEMA_MAP: dict[str, type] = {
    "invoice": InvoiceSchema,
    "bank_statement": BankStatementSchema,
    "pay_stub": PayStubSchema,
    "utility_bill": UtilityBillSchema,
}


def classify_and_extract(
    doc_path: Path,
    client: LandingAIADE,
) -> dict:
    """Classify a document then extract with the matching
    schema."""
    pr = client.parse(document=doc_path)

    # Classify from the parsed markdown (the whole document is sent, not only page 1)
    cls = client.extract(
        schema=pydantic_to_json_schema(DocType),
        markdown=pr.markdown,
    )
    doc_type: str = cls.extraction["type"]

    # Extract with type-specific schema
    schema_cls = SCHEMA_MAP[doc_type]
    er = client.extract(
        schema=pydantic_to_json_schema(schema_cls),
        markdown=pr.markdown,
    )
    return {
        "type": doc_type,
        "extraction": er.extraction,
        "parse_result": pr,
        "extract_result": er,
    }

Approach 2: Split API (multi-document PDFs)

When a single PDF contains multiple document types (e.g., a packet with invoices + receipts), use the Split API first:

def split_classify_extract(
    pdf_path: Path,
    client: LandingAIADE,
    split_classes: list[dict],
) -> list[dict]:
    """Split a multi-doc PDF, classify each split, extract."""
    pr = client.parse(document=pdf_path, split="page")

    # Split into sub-documents
    split_result = client.split(
        markdown=pr.markdown,
        split_class=split_classes,
    )

    results = []
    for split_doc in split_result.splits:
        # Classify
        cls = client.extract(
            schema=pydantic_to_json_schema(DocType),
            markdown=split_doc.markdowns[0],
        )
        doc_type = cls.extraction["type"]

        # Extract
        schema_cls = SCHEMA_MAP[doc_type]
        er = client.extract(
            schema=pydantic_to_json_schema(schema_cls),
            markdown=split_doc.markdowns[0],
        )
        results.append({
            "type": doc_type,
            "extraction": er.extraction,
            "pages": split_doc.pages,
        })
    return results

Split API parameters: Use split_class (list of dicts with name, description, identifier keys). See the document-extraction skill for full Split API reference.
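As an illustration of the shape described above, here is a hypothetical `split_classes` value for the invoices-plus-receipts packet, with a small validator. The class names, descriptions, and identifier values are invented for the example; which keys are mandatory is not stated here, so the sketch only requires `name` and rejects keys outside the three listed.

```python
ALLOWED_KEYS = {"name", "description", "identifier"}

# Hypothetical classes for a packet of invoices and receipts
split_classes: list[dict[str, str]] = [
    {
        "name": "invoice",
        "description": "A bill from a vendor listing items and an amount due.",
        "identifier": "invoice number",
    },
    {
        "name": "receipt",
        "description": "Proof of payment for a completed purchase.",
    },
]


def validate_split_classes(classes: list[dict[str, str]]) -> list[str]:
    """Check key names and name uniqueness; return the class names in order."""
    names: list[str] = []
    for i, cls in enumerate(classes):
        unknown = set(cls) - ALLOWED_KEYS
        if unknown:
            raise ValueError(f"class {i}: unknown keys {sorted(unknown)}")
        name = cls.get("name", "").strip()
        if not name:
            raise ValueError(f"class {i}: 'name' is missing or empty")
        if name in names:
            raise ValueError(f"class {i}: duplicate name {name!r}")
        names.append(name)
    return names
```

Validating locally catches a misspelled key before the request is sent. The class names would also need to match the keys of `SCHEMA_MAP` if they are used for routing.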

When to use Split vs Classification:

  • Split API: One PDF contains multiple separate documents
  • Classification extraction: Each file is one document, but types vary

Section Extraction

Extract a named section (e.g. "Introduction", "Abstract") from a parsed document's markdown. Two approaches — choose based on document diversity and whether the extra API cost is justified.

Decision: If the diagnostic parse (Tool 2) shows consistent ATX headers (## Introduction, ## 2. Methods) across all your documents, use Approach A. If you see any plain-text numbered headings (1. Introduction) or formatting variation across documents, skip Approach A entirely and go straight to Approach B.

| Approach | When to use |
|---|---|
| A: regex | Uniform, well-structured docs (academic papers, reports). Free, fast. |
| B: ADE extract | Mixed or unpredictable formatting (slides, scanned papers, varied templates). Costs an extra extract credit per document. |
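The decision rule above can be applied mechanically to the markdown collected during pre-flight. This is a rough sketch: `choose_approach` is a hypothetical helper, and its plain-text heading pattern is a heuristic that also fires on short numbered list items, which errs toward the safer Approach B.

```python
import re

# ATX heading such as "## Introduction" or "## 2. Methods"
ATX_RE = re.compile(r"^#{1,6}\s+\S", re.MULTILINE)
# Plain-text numbered heading such as "1. Introduction" on its own line
NUMBERED_RE = re.compile(r"^\d+\.?\s+[A-Z][A-Za-z ]{3,}\s*$", re.MULTILINE)


def choose_approach(markdowns: list[str]) -> str:
    """Return "A" only if every sample has ATX headings and none has
    plain-text numbered headings; otherwise return "B"."""
    if not markdowns:
        return "B"  # nothing inspected yet, so do not trust regex
    for md in markdowns:
        if not ATX_RE.search(md) or NUMBERED_RE.search(md):
            return "B"
    return "A"
```

Running this over the 1–3 sample documents from the diagnostic parse gives a quick first answer; a visual scan of the printed markdown remains the final check.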

Approach A — Rule-based regex (free, fast, brittle)

ADE may emit headings as ATX markdown (## 2. Related Work) or plain-text (1. Introduction) even within the same document. Handle both patterns:

import re

def find_section(markdown: str, name: str) -> str | None:
    """Extract a named section from ADE markdown, handling both ATX
    headers (## Introduction) and plain-text numbered headings
    (1. Introduction) which ADE may emit inconsistently."""

    # Pattern 1: ATX header  (# Introduction, ## 1. Introduction …)
    m = re.search(
        r"^(#{1,6})\s+(?:\d+\.?\s+)?" + re.escape(name) + r"\b.*$",
        markdown, re.IGNORECASE | re.MULTILINE,
    )
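    if m:
        level = len(m.group(1))
        # The section ends at the next header of the same or a higher level
        end = re.search(r"^#{1," + str(level) + r"}\s",
                        markdown[m.end():], re.MULTILINE)
        # end.start() is relative to m.end(), so add that offset exactly once
        end_pos = m.end() + (end.start() if end else len(markdown[m.end():]))
        return markdown[m.start():end_pos].strip()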

    # Pattern 2: plain-text numbered heading  (1. Introduction)
    m2 = re.search(r"^(?:\d+\.?\s+)?" + re.escape(name) + r"\s*$",
                   markdown, re.IGNORECASE | re.MULTILINE)
    if m2:
        end2 = re.search(
            r"^#{1,6}\s|^(?:\d+\.?\s+)[A-Z][a-zA-Z ]{3,}\s*$",
            markdown[m2.end():], re.MULTILINE,
        )
        end_pos = m2.end() + (end2.start() if end2 else len(markdown[m2.end():]))
        return markdown[m2.start():end_pos].strip()
    return None

Approach B — ADE extract (robust, handles document diversity)

Use ADE's own extraction to semantically locate sections — no regex needed. The LLM understands section meaning even when formatting is inconsistent:

from pydantic import BaseModel, Field
from landingai_ade import LandingAIADE
from landingai_ade.lib import pydantic_to_json_schema
from pathlib import Path


class PaperSections(BaseModel):
    abstract: str = Field(
        description="The abstract section, plain text only, "
                    "no markdown formatting or anchor tags."
    )
    introduction: str = Field(
        description="The introduction section, plain text only, "
                    "no markdown formatting or anchor tags."
    )


client = LandingAIADE()
pr = client.parse(document=Path("paper.pdf"))
er = client.extract(
    schema=pydantic_to_json_schema(PaperSections),
    markdown=pr.markdown,
)
intro_text = er.extraction["introduction"]

Cost note: Each extract() call consumes additional credits on top of parse(). For high-volume pipelines with uniform document types, Approach A avoids this cost. For diverse or unpredictable documents the accuracy improvement justifies the extra credit.


Multi-Page Table Stitching {#table-stitching}

When a table spans multiple pages, ADE may emit it as separate table chunks per page — and may emit some pages as plain text instead of table chunks. This inconsistency can occur on any page, not just the last one.

Three approaches handle this, with different cost/accuracy/fragility trade-offs:

| Approach | ADE Calls | Handles non-table chunks | Fragility |
|---|---|---|---|
| A: Parse + Extract | 2 | ✓ LLM reads full markdown | Low: no custom parsing |
| B: HTML table parsing | 1 | ✓ with fallback regex | High: requires uniform row structure |
| C: pandas read_html | 1 | ✗ misses non-table chunks | Medium |

Decision guide:

  • Use Approach A when accuracy is paramount and cost is secondary
  • Use Approach B when rows are highly uniform, document structure is predictable, and cost savings justify the fragility of regex-based parsing
  • Use Approach C for quick prototyping or when missing some rows is acceptable

Pre-flight additions for table stitching

Before choosing an approach, run the diagnostic parse (Tool 2) and check:

| What to check | How | Why |
|---|---|---|
| Chunk types per page | Count type == "table" vs "text" per page | Any page may have inconsistent types |
| Column count consistency | Compare column counts across table chunks | Inconsistent counts may indicate different tables |
| Header row presence | Check first row of each table chunk | Needed for detection and row filtering |
| Non-target tables | Look for summary/metadata tables with same column count | Must distinguish target from others |
| Row uniformity | Compare row structure across pages | Low uniformity makes Approach B fragile |
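The first check in the list above (chunk types per page) might look like the following. The sketch assumes chunks expose `type` and `grounding.page` as in the diagnostic parse; the helper names and the `SimpleNamespace` stand-ins are illustrative only.

```python
from collections import Counter, defaultdict
from types import SimpleNamespace


def chunk_types_by_page(chunks) -> dict[int, Counter]:
    """Count chunk types for each 0-indexed page."""
    per_page: dict[int, Counter] = defaultdict(Counter)
    for ch in chunks:
        per_page[ch.grounding.page][ch.type] += 1
    return dict(per_page)


def pages_without_table(chunks, first_page: int, last_page: int) -> list[int]:
    """Pages inside the table's page range that produced no table chunk.
    These are the pages where rows may have been emitted as plain text."""
    counts = chunk_types_by_page(chunks)
    return [
        p for p in range(first_page, last_page + 1)
        if counts.get(p, Counter())["table"] == 0
    ]


def _chunk(kind: str, page: int):
    # Stand-in for an ADE chunk, used only for demonstration
    return SimpleNamespace(type=kind, grounding=SimpleNamespace(page=page))


demo = [_chunk("table", 0), _chunk("text", 1), _chunk("text", 1), _chunk("table", 2)]
print(pages_without_table(demo, 0, 2))
```

In the demo, page 1 sits inside the table's range but has only text chunks, so it is flagged for closer inspection.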

Domain-specific semantic checks

After stitching, add validation checks that leverage domain knowledge:

  • Financial: running balances, column totals = sum of rows
  • Inventory: quantity conservation across rows
  • Time-series: chronological ordering, no sequence gaps
  • Scientific: consistent units, monotonic IDs

These checks serve as both validation (confirming correctness) and disambiguation (resolving structural ambiguity in parsed output).
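As one concrete instance of such a check, a running-balance validation for a stitched financial table could be sketched like this. The `amount` and `balance` field names and the string-valued inputs are assumptions made for the example; `Decimal` avoids floating-point rounding in the comparison.

```python
from decimal import Decimal


def check_running_balance(rows: list[dict[str, str]], opening: str) -> list[int]:
    """Return indexes of rows whose balance differs from the previous
    balance plus the row amount. An empty list means the rows are consistent."""
    bad: list[int] = []
    prev = Decimal(opening)
    for i, row in enumerate(rows):
        expected = prev + Decimal(row["amount"])
        actual = Decimal(row["balance"])
        if actual != expected:
            bad.append(i)
        # Continue from the stated balance so a single error is reported once
        prev = actual
    return bad


rows = [
    {"amount": "-20.00", "balance": "80.00"},
    {"amount": "50.00", "balance": "130.00"},
    {"amount": "-10.00", "balance": "125.00"},  # inconsistent row
    {"amount": "5.00", "balance": "130.00"},
]
print(check_running_balance(rows, opening="100.00"))
```

A flagged index often points at a page boundary where a row was dropped or duplicated during stitching, which is where the check doubles as disambiguation.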

Full code for all three approaches with reusable patterns: see table-stitching.md.


Batch Processing

Two patterns depending on scale. Both include per-document error handling.

Quick: ThreadPoolExecutor (sync)

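from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Any

from landingai_ade import LandingAIADE
from tqdm import tqdm

# parse_extract_save is the function defined in the core workflow above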


def batch_process(
    files: list[Path],
    schema_cls: type,
    max_workers: int = 4,
) -> list[tuple[Path, Any, Any]]:
    client = LandingAIADE()
    results: list[tuple[Path, Any, Any]] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(
                parse_extract_save, fp, client, schema_cls
            ): fp
            for fp in files
        }
        for fut in tqdm(
            as_completed(futures), total=len(futures)
        ):
            fp = futures[fut]
            try:
                results.append((fp, *fut.result()))
            except Exception as e:
                print(f"FAILED {fp.name}: {e}")
    return results

Scalable: AsyncLandingAIADE (async)

import asyncio
from aiolimiter import AsyncLimiter
from landingai_ade import AsyncLandingAIADE


async def batch_parse_async(
    files: list[Path],
    rate_limit: int = 30,
) -> list[dict]:
    client = AsyncLandingAIADE()
    limiter = AsyncLimiter(rate_limit, 60)

    async def _process(fp: Path) -> dict | None:
        try:
            async with limiter:
                return {
                    "path": fp,
                    "result": await client.parse(document=fp),
                }
        except Exception as e:
            print(f"FAILED {fp.name}: {e}")
            return None

    raw = await asyncio.gather(*[_process(fp) for fp in files])
    return [r for r in raw if r]

Full code with output directory organization, CSV export, and chunk image saving: see batch-processing.md.


Results to DataFrames and CSV

Flatten nested ADE extraction results into 4 normalized tables:

import uuid
from datetime import datetime, timezone


def rows_from_doc(
    file_path: str,
    parse_result: Any,
    extract_result: Any,
    run_id: str = "",
) -> tuple[dict, list[dict], list[dict], dict]:
    """Returns (main_row, line_rows, chunk_rows, md_record).

    - main_row: flattened top-level fields (nested__field)
    - line_rows: one per list item (line items, transactions)
    - chunk_rows: one per parsed chunk with bounding boxes
    - md_record: full markdown for traceability
    """
    doc_uuid = str(uuid.uuid4())
    f = extract_result.extraction

    # Flatten top-level fields
    main_row = {"doc_uuid": doc_uuid, "document_name": Path(file_path).name}
    for k, v in f.items():
        if isinstance(v, dict):
            for sk, sv in v.items():
                main_row[f"{k}__{sk}"] = sv
        elif not isinstance(v, list):
            main_row[k] = v

    # Extract list fields as line rows
    line_rows = [
        {"doc_uuid": doc_uuid, "list_field": k, "line_index": i, **item}
        for k, v in f.items() if isinstance(v, list)
        for i, item in enumerate(v) if isinstance(item, dict)
    ]

    # Chunk rows from parse result
    chunk_rows = [
        {
            "doc_uuid": doc_uuid,
            "chunk_id": getattr(ch, "id", None),
            "chunk_type": getattr(ch, "type", None),
            "page": ch.grounding.page if hasattr(ch, "grounding") else None,
        }
        for ch in (parse_result.chunks or [])
    ]

    md_record = {
        "doc_uuid": doc_uuid,
        "markdown": parse_result.markdown,
    }
    return main_row, line_rows, chunk_rows, md_record

Full code with Snowflake upload, UUID traceability, and bounding box columns: see database-integration.md.
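To make the flattening rules concrete, here is a standalone worked example on a made-up extraction dict. `flatten_extraction` is a hypothetical helper that mirrors only the `main_row` and `line_rows` logic of `rows_from_doc` (without the UUID and chunk handling); the sample field names are invented.

```python
from typing import Any


def flatten_extraction(
    extraction: dict[str, Any],
) -> tuple[dict[str, Any], list[dict[str, Any]]]:
    """Split an extraction into one flat main row and a list of line rows."""
    main_row: dict[str, Any] = {}
    line_rows: list[dict[str, Any]] = []
    for key, value in extraction.items():
        if isinstance(value, dict):
            # Nested object: one column per sub-field, joined with "__"
            for sub_key, sub_value in value.items():
                main_row[f"{key}__{sub_key}"] = sub_value
        elif isinstance(value, list):
            # List of objects: one row per item, tagged with its source field
            for i, item in enumerate(value):
                if isinstance(item, dict):
                    line_rows.append({"list_field": key, "line_index": i, **item})
        else:
            main_row[key] = value
    return main_row, line_rows


sample_extraction = {
    "invoice_number": "INV-001",
    "vendor": {"name": "Acme", "country": "US"},
    "line_items": [
        {"description": "Widget", "amount": 10.0},
        {"description": "Gadget", "amount": 5.5},
    ],
}
main_row, line_rows = flatten_extraction(sample_extraction)
print(main_row)
print(line_rows)
```

The nested `vendor` object becomes the columns `vendor__name` and `vendor__country`, while each entry of `line_items` becomes its own row that can be joined back to the document by a shared key.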


RAG Preparation

Quick path from parsed documents to a queryable RAG system. Two embedding options: local (free, offline) or API (higher quality).

Option A — Local embeddings with FastEmbed (free)

import re
from fastembed import TextEmbedding


def ade_to_embeddings_local(
    parse_results: list[dict],
    model: str = "BAAI/bge-small-en-v1.5",
) -> list[dict]:
    """Embed ADE chunks locally. Returns list of dicts with
    text, vector, and grounding metadata."""
    embedder = TextEmbedding(model_name=model)
    items: list[dict] = []
    for pr in parse_results:
        for ch in (pr["parse_result"].chunks or []):
            # Guard against a missing markdown value, as the diagnostic parse does
            text = re.sub(
                r"<a id='[^']*'>\s*</a>", "", ch.markdown or "",
            ).strip()
            if not text:
                continue
            items.append({
                "text": text,
                "source": pr["name"],
                "page": ch.grounding.page,
                "box": {
                    "l": ch.grounding.box.left,
                    "t": ch.grounding.box.top,
                    "r": ch.grounding.box.right,
                    "b": ch.grounding.box.bottom,
                },
            })
    vecs = list(embedder.embed([i["text"] for i in items]))
    for item, vec in zip(items, vecs):
        item["vector"] = vec.tolist()
    return items

Option B — API embeddings with OpenAI

from langchain_core.documents import Document  # replaces the deprecated langchain.docstore path
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings


def ade_to_rag(
    parse_results: list[dict],
    embedding_model: str = "text-embedding-3-small",
) -> FAISS:
    """Convert ADE parse results to a FAISS vector store.

    Args:
        parse_results: list of {"name": str, "parse_result": ParseResponse}
    """
    docs = [
        Document(
            page_content=ch.markdown,
            metadata={
                "source": item["name"],
                "chunk_type": getattr(ch, "type", ""),
                "page": ch.grounding.page if hasattr(ch, "grounding") else -1,
            },
        )
        for item in parse_results
        for ch in (item["parse_result"].chunks or [])
        if (ch.markdown or "").strip()
    ]
    return FAISS.from_documents(
        docs, OpenAIEmbeddings(model=embedding_model)
    )

Full code with embedding best practices, hierarchical chunking, multi-granularity strategies, ChromaDB, LangChain RetrievalQA, and CSV export: see rag-pipelines.md.

Advanced RAG patterns in rag-pipelines.md:

  • Embedding computation (blocks 18–19) — choosing between local (FastEmbed, free) and API (OpenAI, higher quality) embeddings, including batch sizing and rate limiting
  • Hierarchical chunking (block 20) — embed at multiple granularities (chunk, section, document) for hybrid retrieval
  • Multi-granularity RAG (block 21) — combine chunk-level precision with document-level context, routing queries to the right embedding level based on scope
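The idea behind hierarchical and multi-granularity chunking can be sketched without any embedding library: start from chunk-level items and derive coarser records by concatenating their text. The example below is an assumption-laden illustration, not the code from rag-pipelines.md. `build_granularities` is a hypothetical helper, it expects dicts with "text", "source", and "page" keys as produced in Option A, and it uses the page as a stand-in for "section" because section boundaries are not available in those dicts.

```python
from collections import defaultdict


def build_granularities(items: list[dict]) -> list[dict]:
    """Return chunk-, page-, and document-level records ready for embedding."""
    # Chunk level: keep every item as-is and tag it with its level.
    records = [{"level": "chunk", **it} for it in items]

    by_page: dict[tuple[str, int], list[str]] = defaultdict(list)
    by_doc: dict[str, list[str]] = defaultdict(list)
    for it in items:
        by_page[(it["source"], it["page"])].append(it["text"])
        by_doc[it["source"]].append(it["text"])

    # Page level (stand-in for "section"): one record per (source, page).
    for (source, page), texts in by_page.items():
        records.append({"level": "page", "source": source, "page": page,
                        "text": "\n\n".join(texts)})
    # Document level: one record per source file.
    for source, texts in by_doc.items():
        records.append({"level": "document", "source": source, "page": None,
                        "text": "\n\n".join(texts)})
    return records


items = [
    {"text": "Introduction", "source": "a.pdf", "page": 0},
    {"text": "Method overview", "source": "a.pdf", "page": 0},
    {"text": "Results table", "source": "a.pdf", "page": 1},
]
records = build_granularities(items)
print(sorted({r["level"] for r in records}))  # ['chunk', 'document', 'page']
```

Each record can then be embedded and stored with its "level" as metadata, so a query router can restrict the search to the granularity that matches the question's scope.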

Visualization

Quick snippet for bounding box overlays on parsed pages:

from PIL import Image, ImageDraw
import pymupdf  # not used in this snippet; typically used to render PDF pages into images

CHUNK_COLORS = {
    "text": (40, 167, 69),
    "table": (0, 123, 255),
    "figure": (255, 0, 255),
    "marginalia": (111, 66, 193),
}

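def annotate_page(
    img: Image.Image, chunks: list, page: int,
) -> Image.Image:
    """Draw each chunk's bounding box for the given page on a copy of img."""
    annotated = img.copy()
    draw = ImageDraw.Draw(annotated)
    w, h = img.size  # box coordinates are normalized (0-1); scale to pixels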
    for ch in chunks:
        if not hasattr(ch, "grounding") or ch.grounding.page != page:
            continue
        box = ch.grounding.box
        color = CHUNK_COLORS.get(getattr(ch, "type", ""), (200, 200, 200))
        draw.rectangle(
            [int(box.left * w), int(box.top * h),
             int(box.right * w), int(box.bottom * h)],
            outline=color, width=3,
        )
    return annotated
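The overlay code above multiplies normalized box coordinates by the image size. The same conversion is the first step when cropping a chunk out of a page image, where a little padding and clamping to the page bounds are usually wanted. The helper below is a sketch under that assumption; `to_pixel_box` and its `pad` parameter are hypothetical names, not part of the ADE SDK or visualization.md.

```python
def to_pixel_box(
    left: float, top: float, right: float, bottom: float,
    width: int, height: int, pad: int = 0,
) -> tuple[int, int, int, int]:
    """Convert a normalized (0-1) box to pixel coordinates.

    Adds `pad` pixels on every side and clamps the result to the image bounds.
    """
    x0 = max(0, int(left * width) - pad)
    y0 = max(0, int(top * height) - pad)
    x1 = min(width, int(right * width) + pad)
    y1 = min(height, int(bottom * height) + pad)
    return x0, y0, x1, y1


print(to_pixel_box(0.25, 0.5, 0.75, 0.75, width=1000, height=2000, pad=10))
# (240, 990, 760, 1510)
```

The returned tuple is in (left, upper, right, lower) order, which is the order Pillow's `Image.crop` expects, so `img.crop(to_pixel_box(...))` would yield the chunk image.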

Full code with chunk image cropping, extraction-only overlays, and word-level OCR grounding: see visualization.md.


Streamlit UI Pattern

Quick Streamlit app for interactive document processing:

import tempfile
from pathlib import Path

import streamlit as st
from landingai_ade import LandingAIADE

st.title("Document Processor")

uploaded = st.file_uploader(
    "Upload document", type=["pdf", "png", "jpg"]
)
if uploaded:
    # Save the upload to the system temp directory; keep only the base
    # file name so the path cannot escape that directory.
    tmp = Path(tempfile.gettempdir()) / Path(uploaded.name).name
    tmp.write_bytes(uploaded.read())

    client = LandingAIADE()

    with st.spinner("Parsing..."):
        pr = client.parse(document=tmp)

    st.subheader("Markdown Preview")
    st.markdown(pr.markdown[:2000])

    st.subheader("Chunks")
    for ch in pr.chunks:
        with st.expander(
            f"{ch.type} (page {ch.grounding.page})"
        ):
            st.text(ch.markdown[:500])

Requires: pip install landingai-ade streamlit
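Streamlit reruns the whole script on every widget interaction, so an app structured like the one above would call the parse API again each time the user expands a chunk. One way to avoid repeated calls is to memoize results by a hash of the uploaded bytes. The sketch below shows the idea with the standard library only. `ParseCache` and `fake_parse` are hypothetical names, and in a real app the cache would need to survive reruns, for example by storing it in `st.session_state` or by using Streamlit's own caching decorators instead.

```python
import hashlib
from typing import Any, Callable


class ParseCache:
    """Memoize parse results by the SHA-256 digest of the uploaded bytes."""

    def __init__(self, parse_fn: Callable[[bytes], Any]) -> None:
        self._parse_fn = parse_fn
        self._results: dict[str, Any] = {}

    def get(self, data: bytes) -> Any:
        """Return the cached result for data, parsing only on first sight."""
        key = hashlib.sha256(data).hexdigest()
        if key not in self._results:
            self._results[key] = self._parse_fn(data)
        return self._results[key]


calls: list[int] = []


def fake_parse(data: bytes) -> str:
    """Stand-in for a real parse call; records each invocation."""
    calls.append(len(data))
    return f"parsed {len(data)} bytes"


cache = ParseCache(fake_parse)
first = cache.get(b"same file")
second = cache.get(b"same file")
print(len(calls))  # 1
```

Keying on content rather than file name means re-uploading an identical file under a different name still hits the cache, while an edited file with the same name is parsed again.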

For a full Streamlit app with batch upload, extraction display, and visualization tabs, adapt the patterns in batch-processing.md and visualization.md.


Dependency Guide

| Workflow | Install |
|---|---|
| Core (parse + extract) | pip install landingai-ade |
| Batch sync | pip install landingai-ade tqdm |
| Batch async | pip install landingai-ade aiolimiter |
| DataFrames / CSV | pip install landingai-ade pandas |
| Snowflake | pip install landingai-ade pandas snowflake-connector-python[pandas] |
| RAG (local embeddings) | pip install landingai-ade fastembed |
| RAG (ChromaDB) | pip install landingai-ade chromadb openai |
| RAG (FAISS + LangChain) | pip install landingai-ade langchain langchain-openai langchain-community faiss-cpu |
| Visualization | pip install landingai-ade Pillow pymupdf |
| Word-level grounding | pip install landingai-ade Pillow pymupdf pytesseract fuzzywuzzy (plus the tesseract binary) |
| Streamlit UI | pip install landingai-ade streamlit |
| Schema conversion | from landingai_ade.lib import pydantic_to_json_schema (included in landingai-ade) |
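Before running a given workflow it can help to check which of its dependencies are importable. The sketch below does that with the standard library's `importlib.util.find_spec`. `missing_modules` and `WORKFLOW_MODULES` are hypothetical names, the mapping covers only a few of the workflows listed above, and it uses import names, which differ from the pip package names in some cases (Pillow imports as `PIL`, faiss-cpu as `faiss`, snowflake-connector-python as `snowflake.connector`).

```python
from importlib.util import find_spec

# Import names (not pip names) for a few of the workflows listed above.
WORKFLOW_MODULES = {
    "core": ["landingai_ade"],
    "snowflake": ["landingai_ade", "pandas", "snowflake.connector"],
    "rag_faiss": ["landingai_ade", "langchain", "langchain_openai",
                  "langchain_community", "faiss"],
    "visualization": ["landingai_ade", "PIL", "pymupdf"],
}


def missing_modules(modules: list[str]) -> list[str]:
    """Return the module names that cannot be found in this environment."""
    missing = []
    for name in modules:
        try:
            found = find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package (e.g. "snowflake") is absent
            # or a module has no usable spec.
            found = False
        if not found:
            missing.append(name)
    return missing


for workflow, modules in WORKFLOW_MODULES.items():
    gaps = missing_modules(modules)
    print(f"{workflow}: {'ok' if not gaps else 'missing ' + ', '.join(gaps)}")
```

This only checks that a module can be located, not that its version is compatible, and it does not detect system binaries such as tesseract.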

Reference Files

Read these for full implementations when building a specific workflow:

  • schema-catalog.md — Ready-to-use Pydantic schemas for invoice, utility bill, bank statement, pay stub, food label, CME certificate, and document classification
  • batch-processing.md — ThreadPoolExecutor, AsyncLandingAIADE, and Parse Jobs API patterns with full error handling
  • rag-pipelines.md — Chunks to CSV, ChromaDB ingestion, FAISS + LangChain, and RAG query chains
  • database-integration.md — DataFrame normalization, Snowflake upload, and CSV export patterns
  • visualization.md — Chunk image cropping, bounding box overlays, and word-level OCR grounding
  • table-stitching.md — Parse+Extract (robust), HTML parsing (fragile), and pandas approaches for merging multi-page tables into a single output

Bundled files

Note: This is the list of files included in the ZIP. In addition to `SKILL.md` itself, it may contain reference materials, samples, and scripts.