ref-hallucination-arena

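A Skill that measures how likely an AI model is to fabricate references. It checks each cited paper against databases such as Crossref and reports errors and per-discipline accuracy.

📜 Original English description (for reference)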

Benchmark LLM reference recommendation capabilities by verifying every cited paper against Crossref, PubMed, arXiv, and DBLP. Measures hallucination rate, per-field accuracy (title/author/year/DOI), discipline breakdown, and year constraint compliance. Supports tool-augmented (ReAct + web search) mode. Use when the user asks to evaluate, benchmark, or compare models on academic reference hallucination, literature recommendation quality, or citation accuracy.

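⚡ Recommended: install with a single command (about 60 seconds)

Copy the command below and paste it into a terminal (Mac/Linux) or PowerShell (Windows). It downloads the archive, unpacks it, and puts the files in place automatically.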

🍎 Mac / 🐧 Linux

mkdir -p ~/.claude/skills && cd ~/.claude/skills && curl -L -o ref-hallucination-arena.zip https://jpskill.com/download/10342.zip && unzip -o ref-hallucination-arena.zip && rm ref-hallucination-arena.zip

🪟 Windows (PowerShell)

$d = "$env:USERPROFILE\.claude\skills"; ni -Force -ItemType Directory $d | Out-Null; iwr https://jpskill.com/download/10342.zip -OutFile "$d\ref-hallucination-arena.zip"; Expand-Archive "$d\ref-hallucination-arena.zip" -DestinationPath $d -Force; ri "$d\ref-hallucination-arena.zip"

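When it finishes, restart Claude Code. The Skill then triggers automatically when you ask in plain language, for example "Compare these models on reference hallucination."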

💾 Manual download (if you prefer not to use the command line)

1. Download ref-hallucination-arena.zip from this page.
2. Double-click the ZIP file to unpack it; this creates a ref-hallucination-arena folder.
3. Move that folder to C:\Users\<your name>\.claude\skills\ (Windows) or ~/.claude/skills/ (Mac).
4. Restart Claude Code.

⚠️ Download and use at your own risk. This site accepts no responsibility for the content, behavior, or safety of the Skill.

🎯 What this Skill can do

The Skill description on this page explains what the Skill does for you. When you ask Claude for something in this area, the Skill triggers automatically.

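📦 Installation (3 steps)

1. Download the .skill file from this page.
2. Change the file extension from .skill to .zip and unpack it (macOS can unpack it automatically).
3. Put the resulting folder in the .claude/skills/ directory under your home folder:
- macOS / Linux: ~/.claude/skills/
- Windows: %USERPROFILE%\.claude\skills\

Restart Claude Code to finish. You do not need to say "use this Skill"; it is invoked automatically for related requests.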

Last updated: 2026-05-18
Retrieved: 2026-05-18
Bundled files: 1

📜 Original SKILL.md (the English/Chinese text that Claude reads)

Reference Hallucination Arena Skill

Evaluate how accurately LLMs recommend real academic references using the OpenJudge RefArenaPipeline:

1. Load queries: read them from a JSON/JSONL dataset
2. Collect responses: gather BibTeX-formatted references from the target models
3. Extract references: parse BibTeX entries from the model output
4. Verify references: cross-check against Crossref / PubMed / arXiv / DBLP
5. Score and rank: compute verification rate, per-field accuracy, and discipline breakdown
6. Generate report: Markdown report plus visualization charts
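
To make the extraction step concrete, here is a minimal sketch of pulling BibTeX entries out of free-form model output. It is not OpenJudge's parser: `extract_bibtex` and both regular expressions are hypothetical, and the sketch assumes flat `field = {value}` lines with the closing brace of each entry on its own line (nested braces are not handled).

```python
import re

# Hypothetical illustration only; the real pipeline may parse BibTeX differently.
ENTRY_RE = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,(.*?)\n\}", re.DOTALL)
FIELD_RE = re.compile(r"(\w+)\s*=\s*[{\"](.*?)[}\"]\s*,?\s*$", re.MULTILINE)


def extract_bibtex(text):
    """Return one dict per BibTeX entry found in `text`."""
    entries = []
    for entry_type, key, body in ENTRY_RE.findall(text):
        # Each "name = {value}" line becomes a lowercase key in the result.
        fields = {name.lower(): value.strip() for name, value in FIELD_RE.findall(body)}
        entries.append({"type": entry_type.lower(), "key": key, **fields})
    return entries


sample = """Here is one paper you could cite:
@article{vaswani2017,
  title = {Attention Is All You Need},
  author = {Vaswani, Ashish and others},
  year = {2017}
}
"""

for ref in extract_bibtex(sample):
    print(ref["key"], "|", ref["title"], "|", ref["year"])
```

Text outside the `@type{...}` blocks is ignored, which matters because models often wrap references in explanatory prose.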

Prerequisites

# Install OpenJudge
pip install py-openjudge

# Extra dependency for ref_hallucination_arena (chart generation)
pip install matplotlib

Gather from user before running

Info	Required?	Notes
Config YAML path	Yes	Defines endpoints, dataset, verification settings
Dataset path	Yes	JSON/JSONL file with queries (can be set in config)
API keys	Yes	Env vars: `OPENAI_API_KEY`, `DASHSCOPE_API_KEY`, etc.
CrossRef email	No	Improves API rate limits for verification
PubMed API key	No	Improves PubMed rate limits
Output directory	No	Default: `./evaluation_results/ref_hallucination_arena`
Report language	No	`"en"` (default) or `"zh"`
Tavily API key	No	Required only if using tool-augmented mode

Quick start

CLI

# Run evaluation with config file
python -m cookbooks.ref_hallucination_arena --config config.yaml --save

# Resume from checkpoint (default behavior)
python -m cookbooks.ref_hallucination_arena --config config.yaml --save

# Start fresh, ignore checkpoint
python -m cookbooks.ref_hallucination_arena --config config.yaml --fresh --save

# Override output directory
python -m cookbooks.ref_hallucination_arena --config config.yaml \
  --output_dir ./my_results --save

Python API

import asyncio
from cookbooks.ref_hallucination_arena.pipeline import RefArenaPipeline

async def main():
    pipeline = RefArenaPipeline.from_config("config.yaml")
    result = await pipeline.evaluate()

    for rank, (model, score) in enumerate(result.rankings, 1):
        print(f"{rank}. {model}: {score:.1%}")

asyncio.run(main())

CLI options

Flag	Default	Description
`--config`	—	Path to YAML configuration file (required)
`--output_dir`	config value	Override output directory
`--save`	`False`	Save results to file
`--fresh`	`False`	Start fresh, ignore checkpoint
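
The flag semantics in the table can be mirrored with `argparse`. This is a sketch of an equivalent parser, not the cookbook's actual entry point; `build_parser` is a hypothetical name, and the only details taken from the table are the four flags and their defaults.

```python
import argparse


def build_parser():
    """Hypothetical parser that mirrors the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="ref_hallucination_arena")
    parser.add_argument("--config", required=True, help="Path to YAML configuration file")
    # None means "keep the output directory from the config file".
    parser.add_argument("--output_dir", default=None, help="Override output directory")
    parser.add_argument("--save", action="store_true", help="Save results to file")
    parser.add_argument("--fresh", action="store_true", help="Start fresh, ignore checkpoint")
    return parser


args = build_parser().parse_args(["--config", "config.yaml", "--fresh", "--save"])
print(args.config, args.output_dir, args.save, args.fresh)
```

Because `--save` and `--fresh` default to `False`, a run without `--fresh` resumes from an existing checkpoint, which matches the "default behavior" noted in the quick start.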

Minimal config file

task:
  description: "Evaluate LLM reference recommendation capabilities"

dataset:
  path: "./data/queries.json"

target_endpoints:
  model_a:
    base_url: "https://api.openai.com/v1"
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4"
    system_prompt: "You are an academic literature recommendation expert. Recommend {num_refs} real papers in BibTeX format. Only recommend papers you are confident actually exist."

  model_b:
    base_url: "https://dashscope.aliyuncs.com/compatible-mode/v1"
    api_key: "${DASHSCOPE_API_KEY}"
    model: "qwen3-max"
    system_prompt: "You are an academic literature recommendation expert. Recommend {num_refs} real papers in BibTeX format. Only recommend papers you are confident actually exist."

Full config reference

task

Field	Required	Description
`description`	Yes	Evaluation task description
`scenario`	No	Usage scenario

dataset

Field	Default	Description
`path`	—	Path to JSON/JSONL dataset file (required)
`shuffle`	`false`	Shuffle queries before evaluation
`max_queries`	`null`	Max queries to use (`null` = all)

target_endpoints.<name>

Field	Default	Description
`base_url`	—	API base URL (required)
`api_key`	—	API key, supports `${ENV_VAR}` (required)
`model`	—	Model name (required)
`system_prompt`	built-in	System prompt; use `{num_refs}` placeholder
`max_concurrency`	`5`	Max concurrent requests for this endpoint
`extra_params`	—	Extra API request params (e.g. `temperature`)
`tool_config.enabled`	`false`	Enable ReAct agent with Tavily web search
`tool_config.tavily_api_key`	env var	Tavily API key
`tool_config.max_iterations`	`10`	Max ReAct iterations (1–30)
`tool_config.search_depth`	`"advanced"`	`"basic"` or `"advanced"`
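
The `${ENV_VAR}` and `{num_refs}` placeholders are both simple text substitutions. The sketch below shows one plausible way they could be resolved; `expand_env` and `render_prompt` are hypothetical helpers, and failing on an unset variable is an assumption rather than documented behavior.

```python
import os
import re

_ENV_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value, env=None):
    """Replace each ${NAME} in `value` with the matching environment variable."""
    env = os.environ if env is None else env

    def replace(match):
        name = match.group(1)
        if name not in env:
            # Assumption: a missing key should fail loudly instead of sending "".
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _ENV_PATTERN.sub(replace, value)


def render_prompt(template, num_refs):
    """Fill the {num_refs} placeholder without touching other braces."""
    return template.replace("{num_refs}", str(num_refs))


print(expand_env("${OPENAI_API_KEY}", {"OPENAI_API_KEY": "sk-example"}))
print(render_prompt("Recommend {num_refs} real papers in BibTeX format.", 5))
```

Using `str.replace` rather than `str.format` keeps a system prompt that contains literal BibTeX braces from raising a formatting error.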

verification

Field	Default	Description
`crossref_mailto`	—	Email for Crossref polite pool
`pubmed_api_key`	—	PubMed API key
`max_workers`	`10`	Concurrent verification threads (1–50)
`timeout`	`30`	Per-request timeout in seconds
`verified_threshold`	`0.7`	Min composite score to count as VERIFIED
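
For orientation, this is roughly what a Crossref lookup with a `mailto` address looks like. The sketch uses the public Crossref REST API (`query.bibliographic`, `rows`, and `mailto` are real query parameters), but `crossref_query_url` and `fetch_candidates` are hypothetical helpers and not the way the pipeline necessarily issues its requests.

```python
import json
import urllib.request
from urllib.parse import urlencode

CROSSREF_WORKS_URL = "https://api.crossref.org/works"


def crossref_query_url(title, mailto=None, rows=3):
    """Build a bibliographic search URL; `mailto` opts in to the polite pool."""
    params = {"query.bibliographic": title, "rows": rows}
    if mailto:
        params["mailto"] = mailto
    return f"{CROSSREF_WORKS_URL}?{urlencode(params)}"


def fetch_candidates(url, timeout=30):
    """Fetch candidate works (network call; 30 s mirrors the `timeout` default)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)["message"]["items"]


print(crossref_query_url("Attention Is All You Need", mailto="me@example.org"))
```

Only the URL builder runs here; `fetch_candidates` is left uncalled so the example works offline.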

evaluation

Field	Default	Description
`timeout`	`120`	Model API request timeout in seconds
`retry_times`	`3`	Number of retry attempts

output

Field	Default	Description
`output_dir`	`./evaluation_results/ref_hallucination_arena`	Output directory
`save_queries`	`true`	Save loaded queries
`save_responses`	`true`	Save model responses
`save_details`	`true`	Save verification details

report

Field	Default	Description
`enabled`	`true`	Enable report generation
`language`	`"zh"`	Report language: `"zh"` or `"en"`
`include_examples`	`3`	Examples per section (1–10)
`chart.enabled`	`true`	Generate charts
`chart.orientation`	`"vertical"`	`"horizontal"` or `"vertical"`
`chart.show_values`	`true`	Show values on bars
`chart.highlight_best`	`true`	Highlight best model

Dataset format

Each query in the JSON/JSONL dataset:

{
  "query": "Please recommend papers on Transformer architectures for NLP.",
  "discipline": "computer_science",
  "num_refs": 5,
  "language": "en",
  "year_constraint": {"min_year": 2020}
}

Field	Required	Description
`query`	Yes	Prompt for reference recommendation
`discipline`	No	`computer_science`, `biomedical`, `physics`, `chemistry`, `social_science`, `interdisciplinary`, `other`
`num_refs`	No	Expected number of references (default: 5)
`language`	No	`"zh"` or `"en"` (default: `"zh"`)
`year_constraint`	No	`{"exact": 2023}`, `{"min_year": 2020}`, `{"max_year": 2015}`, or `{"min_year": 2020, "max_year": 2024}`
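
The field rules above can be checked before a run. The following sketch applies the documented defaults and tests a publication year against a `year_constraint`; `normalize_query` and `year_complies` are hypothetical helpers, and treating `min_year` and `max_year` as inclusive bounds is an assumption.

```python
DISCIPLINES = {
    "computer_science", "biomedical", "physics", "chemistry",
    "social_science", "interdisciplinary", "other",
}


def normalize_query(record):
    """Validate one dataset record and fill in the documented defaults."""
    if not record.get("query"):
        raise ValueError("each record needs a non-empty 'query'")
    discipline = record.get("discipline")
    if discipline is not None and discipline not in DISCIPLINES:
        raise ValueError(f"unknown discipline: {discipline}")
    return {
        "query": record["query"],
        "discipline": discipline,
        "num_refs": record.get("num_refs", 5),      # documented default
        "language": record.get("language", "zh"),   # documented default
        "year_constraint": record.get("year_constraint"),
    }


def year_complies(year, constraint):
    """Check a publication year against a year_constraint (bounds assumed inclusive)."""
    if not constraint:
        return True
    if "exact" in constraint:
        return year == constraint["exact"]
    if "min_year" in constraint and year < constraint["min_year"]:
        return False
    if "max_year" in constraint and year > constraint["max_year"]:
        return False
    return True


query = normalize_query({"query": "Please recommend papers on Transformer architectures for NLP."})
print(query["num_refs"], query["language"])
print(year_complies(2021, {"min_year": 2020}), year_complies(2019, {"min_year": 2020}))
```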

Official dataset: OpenJudge/ref-hallucination-arena

Interpreting results

Overall accuracy (verification rate):

> 75% — Excellent: model rarely hallucinates references
60–75% — Good: most references are real, some fabrication
40–60% — Fair: significant hallucination, use with caution
< 40% — Poor: model frequently fabricates references
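
The bands translate directly into a small lookup. In this sketch `accuracy_band` is a hypothetical helper; the listed ranges share their endpoints, so assigning a boundary value to the higher band (60% counts as Good, 40% as Fair) is an assumption.

```python
def accuracy_band(rate):
    """Map a verification rate in [0, 1] to the qualitative band listed above."""
    if rate > 0.75:
        return "Excellent"
    if rate >= 0.60:
        return "Good"   # assumption: shared endpoints go to the higher band
    if rate >= 0.40:
        return "Fair"
    return "Poor"


for rate in (0.82, 0.68, 0.45, 0.31):
    print(f"{rate:.0%} -> {accuracy_band(rate)}")
```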

Per-field accuracy:

title_accuracy — % of titles matching real papers
author_accuracy — % of correct author lists
year_accuracy — % of correct publication years
doi_accuracy — % of valid DOIs

Verification status:

VERIFIED — title + author + year all exactly match a real paper
SUSPECT — partial match (e.g. title matches but authors differ)
NOT_FOUND — no match in any database
ERROR — API timeout or network failure
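
As an illustration of how these numbers relate, the sketch below derives a verification rate and per-field accuracy from per-reference records. The record shape (`status` plus a `matches` dict) and the `summarize` helper are hypothetical, and counting every record in the denominator, including `ERROR` ones, is an assumption.

```python
from collections import Counter

FIELDS = ("title", "author", "year", "doi")


def summarize(records):
    """Aggregate hypothetical per-reference records into headline metrics."""
    total = len(records)
    status_counts = Counter(record["status"] for record in records)
    summary = {
        "verification_rate": status_counts["VERIFIED"] / total if total else 0.0,
        "status_counts": dict(status_counts),
    }
    for field in FIELDS:
        hits = sum(bool(record["matches"].get(field)) for record in records)
        summary[f"{field}_accuracy"] = hits / total if total else 0.0
    return summary


records = [
    {"status": "VERIFIED", "matches": {"title": True, "author": True, "year": True, "doi": True}},
    {"status": "VERIFIED", "matches": {"title": True, "author": True, "year": True, "doi": False}},
    {"status": "SUSPECT", "matches": {"title": True, "author": False, "year": True, "doi": False}},
    {"status": "NOT_FOUND", "matches": {}},
]
print(summarize(records))
```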

Ranking order: overall accuracy → year compliance rate → avg confidence → completeness
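
A tie-breaking order like this is usually expressed as a tuple sort key. The sketch below assumes higher is better for all four metrics; `rank_models` and the dictionary keys are hypothetical names, not fields of the real results file.

```python
def rank_models(stats):
    """Sort models by the documented tie-break order, best first."""
    return sorted(
        stats,
        key=lambda s: (
            s["overall_accuracy"],
            s["year_compliance_rate"],
            s["avg_confidence"],
            s["completeness"],
        ),
        reverse=True,
    )


stats = [
    {"model": "model_a", "overall_accuracy": 0.70, "year_compliance_rate": 0.90,
     "avg_confidence": 0.80, "completeness": 1.0},
    {"model": "model_b", "overall_accuracy": 0.70, "year_compliance_rate": 0.95,
     "avg_confidence": 0.75, "completeness": 1.0},
    {"model": "model_c", "overall_accuracy": 0.80, "year_compliance_rate": 0.60,
     "avg_confidence": 0.70, "completeness": 0.9},
]
print([s["model"] for s in rank_models(stats)])
```

Here `model_c` wins on overall accuracy, and `model_b` edges out `model_a` only because their overall accuracy ties and year compliance decides.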

Output files

evaluation_results/ref_hallucination_arena/
├── evaluation_report.md          # Detailed Markdown report
├── evaluation_results.json       # Rankings, per-field accuracy, scores
├── verification_chart.png        # Per-field accuracy bar chart
├── discipline_chart.png          # Per-discipline accuracy chart
├── queries.json                  # Loaded evaluation queries
├── responses.json                # Raw model responses
├── extracted_refs.json           # Extracted BibTeX references
├── verification_results.json     # Per-reference verification details
└── checkpoint.json               # Pipeline checkpoint for resume
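
After a run with `--save`, a quick check can confirm which of these files were written. `missing_outputs` is a hypothetical helper; note that some files may legitimately be absent, for example when `save_responses` or `chart.enabled` is turned off.

```python
from pathlib import Path

EXPECTED_FILES = [
    "evaluation_report.md",
    "evaluation_results.json",
    "verification_chart.png",
    "discipline_chart.png",
    "queries.json",
    "responses.json",
    "extracted_refs.json",
    "verification_results.json",
    "checkpoint.json",
]


def missing_outputs(output_dir):
    """Return the expected file names that are not present in `output_dir`."""
    root = Path(output_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


missing = missing_outputs("./evaluation_results/ref_hallucination_arena")
print("missing:", missing if missing else "none")
```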

API key by model

Model prefix	Environment variable
`gpt-*`, `o1-*`, `o3-*`	`OPENAI_API_KEY`
`claude-*`	`ANTHROPIC_API_KEY`
`qwen-*`, `dashscope/*`	`DASHSCOPE_API_KEY`
`deepseek-*`	`DEEPSEEK_API_KEY`
Custom endpoint	set `api_key` + `base_url` in config
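
The prefix table amounts to a first-match lookup. This sketch reads each table entry as a literal prefix; `env_var_for` is a hypothetical helper and does not claim to reproduce how OpenJudge resolves keys.

```python
PREFIX_TO_ENV = [
    (("gpt-", "o1-", "o3-"), "OPENAI_API_KEY"),
    (("claude-",), "ANTHROPIC_API_KEY"),
    (("qwen-", "dashscope/"), "DASHSCOPE_API_KEY"),
    (("deepseek-",), "DEEPSEEK_API_KEY"),
]


def env_var_for(model):
    """Return the environment variable for a model name, or None if no prefix matches."""
    for prefixes, env_name in PREFIX_TO_ENV:
        if model.startswith(prefixes):  # str.startswith accepts a tuple of prefixes
            return env_name
    return None


for name in ("gpt-4", "claude-3-opus", "qwen-max", "my-local-model"):
    print(name, "->", env_var_for(name))
```

A `None` result corresponds to the "Custom endpoint" row: set `api_key` and `base_url` explicitly in the config. Under this literal reading a name such as `qwen3-max` does not start with `qwen-`, which is consistent with the minimal config setting `api_key` directly for that endpoint.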

Additional resources

Full config examples: cookbooks/ref_hallucination_arena/examples/
Documentation: docs/validating_graders/ref_hallucination_arena.md
Official dataset: HuggingFace
Leaderboard: openjudge.me/leaderboard