auto-arena
Automatically evaluate and compare multiple AI models or agents without pre-existing test data. Generates test queries from a task description, collects responses from all target endpoints, auto-generates evaluation rubrics, runs pairwise comparisons via a judge model, and produces win-rate rankings with reports and charts. Supports checkpoint resume, incremental endpoint addition, and judge model hot-swap. Use when the user asks to compare, benchmark, or rank multiple models or agents on a custom task, or run an arena-style evaluation.
Copy the command below and paste it into Terminal (Mac/Linux) or PowerShell (Windows). Download, extraction, and placement are all handled automatically.
mkdir -p ~/.claude/skills && cd ~/.claude/skills && curl -L -o auto-arena.zip https://jpskill.com/download/10337.zip && unzip -o auto-arena.zip && rm auto-arena.zip
$d = "$env:USERPROFILE\.claude\skills"; ni -Force -ItemType Directory $d | Out-Null; iwr https://jpskill.com/download/10337.zip -OutFile "$d\auto-arena.zip"; Expand-Archive "$d\auto-arena.zip" -DestinationPath $d -Force; ri "$d\auto-arena.zip"
When it finishes, restart Claude Code, then just ask in plain language (e.g. "compare these two models on my task") and the skill activates automatically.
💾 Manual download (if the command line isn't for you)
- 1. Click the blue button below to download auto-arena.zip
- 2. Double-click the ZIP file to extract it; an auto-arena folder appears
- 3. Move that folder to C:\Users\<your name>\.claude\skills\ (Windows) or ~/.claude/skills/ (Mac)
- 4. Restart Claude Code
📦 Installation (3 steps)
- 1. Click the "Download" button above to get the .skill file
- 2. Rename the extension from .skill to .zip and extract it (macOS can extract it automatically)
- 3. Put the extracted folder in .claude/skills/ under your home folder
  - macOS / Linux: ~/.claude/skills/
  - Windows: %USERPROFILE%\.claude\skills\

Restart Claude Code and you're done. You don't need to say "use this Skill"; related requests invoke it automatically.
📖 SKILL.md (the original text that Claude reads)
Auto Arena Skill
End-to-end automated model comparison using the OpenJudge AutoArenaPipeline:
- Generate queries — LLM creates diverse test queries from task description
- Collect responses — query all target endpoints concurrently
- Generate rubrics — LLM produces evaluation criteria from task + sample queries
- Pairwise evaluation — judge model compares every model pair (with position-bias swap)
- Analyze & rank — compute win rates, win matrix, and rankings
- Report & charts — Markdown report + win-rate bar chart + optional matrix heatmap
Prerequisites
```bash
# Install OpenJudge
pip install py-openjudge

# Extra dependency for auto_arena (chart generation)
pip install matplotlib
```
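To verify that both packages are present before a run:

```bash
pip show py-openjudge matplotlib
```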
Gather from user before running
| Info | Required? | Notes |
|---|---|---|
| Task description | Yes | What the models/agents should do (set in config YAML) |
| Target endpoints | Yes | At least 2 OpenAI-compatible endpoints to compare |
| Judge endpoint | Yes | Strong model for pairwise evaluation (e.g. gpt-4, qwen-max) |
| API keys | Yes | Env vars: OPENAI_API_KEY, DASHSCOPE_API_KEY, etc. |
| Number of queries | No | Default: 20 |
| Seed queries | No | Example queries to guide generation style |
| System prompts | No | Per-endpoint system prompts |
| Output directory | No | Default: ./evaluation_results |
| Report language | No | "zh" (default) or "en" |
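Most of this maps directly into the config YAML; API keys are read from environment variables, so export whichever keys your config references before launching (values below are placeholders):

```bash
export OPENAI_API_KEY="sk-..."
export DASHSCOPE_API_KEY="sk-..."
```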
Quick start
CLI
```bash
# Run evaluation
python -m cookbooks.auto_arena --config config.yaml --save

# Use pre-generated queries
python -m cookbooks.auto_arena --config config.yaml \
    --queries_file queries.json --save

# Start fresh, ignore checkpoint
python -m cookbooks.auto_arena --config config.yaml --fresh --save

# Re-run only pairwise evaluation with new judge model
# (keeps queries, responses, and rubrics)
python -m cookbooks.auto_arena --config config.yaml --rerun-judge --save
```
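If you pass `--queries_file`, point it at a JSON file containing your test queries. The exact schema isn't documented here, so treat the shape below as an assumption: the safest reference is the `queries.json` written by a previous `--save` run.

```json
[
  "How do I return a damaged item?",
  "Which payment methods do you support?"
]
```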
Python API
```python
import asyncio
from cookbooks.auto_arena.auto_arena_pipeline import AutoArenaPipeline

async def main():
    pipeline = AutoArenaPipeline.from_config("config.yaml")
    result = await pipeline.evaluate()
    print(f"Best model: {result.best_pipeline}")
    for rank, (model, win_rate) in enumerate(result.rankings, 1):
        print(f"{rank}. {model}: {win_rate:.1%}")

asyncio.run(main())
```
Minimal Python API (no config file)
```python
import asyncio
from cookbooks.auto_arena.auto_arena_pipeline import AutoArenaPipeline
from cookbooks.auto_arena.schema import OpenAIEndpoint

async def main():
    pipeline = AutoArenaPipeline(
        task_description="Customer service chatbot for e-commerce",
        target_endpoints={
            "gpt4": OpenAIEndpoint(
                base_url="https://api.openai.com/v1",
                api_key="sk-...",
                model="gpt-4",
            ),
            "qwen": OpenAIEndpoint(
                base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
                api_key="sk-...",
                model="qwen-max",
            ),
        },
        judge_endpoint=OpenAIEndpoint(
            base_url="https://api.openai.com/v1",
            api_key="sk-...",
            model="gpt-4",
        ),
        num_queries=20,
    )
    result = await pipeline.evaluate()
    print(f"Best: {result.best_pipeline}")

asyncio.run(main())
```
CLI options
| Flag | Default | Description |
|---|---|---|
| `--config` | — | Path to YAML configuration file (required) |
| `--output_dir` | config value | Override output directory |
| `--queries_file` | — | Path to pre-generated queries JSON (skip generation) |
| `--save` | `False` | Save results to file |
| `--fresh` | `False` | Start fresh, ignore checkpoint |
| `--rerun-judge` | `False` | Re-run pairwise evaluation only (keep queries/responses/rubrics) |
Minimal config file
```yaml
task:
  description: "Academic GPT assistant for research and writing tasks"

target_endpoints:
  model_v1:
    base_url: "https://api.openai.com/v1"
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4"
  model_v2:
    base_url: "https://api.openai.com/v1"
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-3.5-turbo"

judge_endpoint:
  base_url: "https://api.openai.com/v1"
  api_key: "${OPENAI_API_KEY}"
  model: "gpt-4"
```
Full config reference
task
| Field | Required | Description |
|---|---|---|
| `description` | Yes | Clear description of the task models will be tested on |
| `scenario` | No | Usage scenario for additional context |
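A minimal `task` block using both fields (the scenario text is illustrative):

```yaml
task:
  description: "Customer service chatbot for e-commerce"
  scenario: "Shoppers ask about orders, refunds, and shipping via live chat"
```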
target_endpoints.<name>
| Field | Default | Description |
|---|---|---|
| `base_url` | — | API base URL (required) |
| `api_key` | — | API key, supports `${ENV_VAR}` (required) |
| `model` | — | Model name (required) |
| `system_prompt` | — | System prompt for this endpoint |
| `extra_params` | — | Extra API params (e.g. temperature, max_tokens) |
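A sketch of one endpoint entry exercising the optional fields; the endpoint name, prompt, and parameter values are illustrative:

```yaml
target_endpoints:
  support_bot:
    base_url: "https://api.openai.com/v1"
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4"
    system_prompt: "You are a concise customer-support assistant."
    extra_params:
      temperature: 0.2
      max_tokens: 1024
```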
judge_endpoint
Same fields as target_endpoints.<name>. Use a strong model (e.g. gpt-4, qwen-max) with low temperature (~0.1) for consistent judgments.
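Since the judge accepts the same fields, one way to pin a low judging temperature is via `extra_params`:

```yaml
judge_endpoint:
  base_url: "https://api.openai.com/v1"
  api_key: "${OPENAI_API_KEY}"
  model: "gpt-4"
  extra_params:
    temperature: 0.1
```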
query_generation
| Field | Default | Description |
|---|---|---|
| `num_queries` | `20` | Total number of queries to generate |
| `seed_queries` | — | Example queries to guide generation |
| `categories` | — | Query categories with weights for stratified generation |
| `endpoint` | judge endpoint | Custom endpoint for query generation |
| `queries_per_call` | `10` | Queries generated per API call (1–50) |
| `num_parallel_batches` | `3` | Parallel generation batches |
| `temperature` | `0.9` | Sampling temperature (0.0–2.0) |
| `top_p` | `0.95` | Top-p sampling (0.0–1.0) |
| `max_similarity` | `0.85` | Dedup similarity threshold (0.0–1.0) |
| `enable_evolution` | `false` | Enable Evol-Instruct complexity evolution |
| `evolution_rounds` | `1` | Evolution rounds (0–3) |
| `complexity_levels` | `["constraints", "reasoning", "edge_cases"]` | Evolution strategies |
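A sketch combining the common knobs; the `categories` shape (a category-to-weight mapping) is an assumption, so check the shipped examples for the exact format:

```yaml
query_generation:
  num_queries: 40
  seed_queries:
    - "Where is my order?"
    - "How do I request a refund?"
  categories:            # assumed shape: category name -> weight
    order_tracking: 0.4
    refunds: 0.3
    product_questions: 0.3
  temperature: 0.9
  max_similarity: 0.85
  enable_evolution: true
  evolution_rounds: 1
```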
evaluation
| Field | Default | Description |
|---|---|---|
| `max_concurrency` | `10` | Max concurrent API requests |
| `timeout` | `60` | Request timeout in seconds |
| `retry_times` | `3` | Retry attempts for failed requests |
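For slow or rate-limited endpoints, lower the concurrency and raise the timeout, e.g.:

```yaml
evaluation:
  max_concurrency: 5
  timeout: 120
  retry_times: 3
```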
output
| Field | Default | Description |
|---|---|---|
| `output_dir` | `./evaluation_results` | Output directory |
| `save_queries` | `true` | Save generated queries |
| `save_responses` | `true` | Save model responses |
| `save_details` | `true` | Save detailed results |
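For example, to keep runs side by side while skipping the bulky per-comparison details:

```yaml
output:
  output_dir: "./evaluation_results/run_01"
  save_queries: true
  save_responses: true
  save_details: false
```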
report
| Field | Default | Description |
|---|---|---|
| `enabled` | `false` | Enable Markdown report generation |
| `language` | `"zh"` | Report language: "zh" or "en" |
| `include_examples` | `3` | Examples per section (1–10) |
| `chart.enabled` | `true` | Generate win-rate chart |
| `chart.orientation` | `"horizontal"` | "horizontal" or "vertical" |
| `chart.show_values` | `true` | Show values on bars |
| `chart.highlight_best` | `true` | Highlight best model |
| `chart.matrix_enabled` | `false` | Generate win-rate matrix heatmap |
| `chart.format` | `"png"` | Chart format: "png", "svg", or "pdf" |
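A report block that turns everything on; nesting the `chart.*` fields under a `chart:` mapping follows the dotted names above but is an assumption worth checking against the shipped examples:

```yaml
report:
  enabled: true
  language: "en"
  include_examples: 3
  chart:
    enabled: true
    orientation: "horizontal"
    show_values: true
    highlight_best: true
    matrix_enabled: true
    format: "svg"
```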
Interpreting results
Win rate: the percentage of pairwise comparisons a model wins. Each pair is evaluated in both orders (original + swapped) to eliminate position bias; with 3 models and 20 queries, for example, each model takes part in 2 pairs × 20 queries × 2 orders = 80 comparisons.
Rankings example:
```
1. gpt4_baseline    [################----] 80.0%
2. qwen_candidate   [############--------] 60.0%
3. llama_finetuned  [##########----------] 50.0%
```
Win matrix: `win_matrix[A][B]` = how often model A beats model B across all queries.
Checkpoint & resume
The pipeline saves progress after each step. Interrupted runs resume automatically:
- `--fresh` — ignore checkpoint, start from scratch
- `--rerun-judge` — re-run only the pairwise evaluation step (useful when switching judge models); keeps queries, responses, and rubrics intact
- Adding new endpoints to the config triggers incremental response collection; existing responses are preserved
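A typical sequence using only the flags above:

```bash
# First run; suppose it is interrupted partway through
python -m cookbooks.auto_arena --config config.yaml --save

# Re-running the same command resumes from checkpoint.json
python -m cookbooks.auto_arena --config config.yaml --save

# After editing judge_endpoint in config.yaml, redo only the judging
python -m cookbooks.auto_arena --config config.yaml --rerun-judge --save
```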
Output files
```
evaluation_results/
├── evaluation_results.json   # Rankings, win rates, win matrix
├── evaluation_report.md      # Detailed Markdown report (if enabled)
├── win_rate_chart.png        # Win-rate bar chart (if enabled)
├── win_rate_matrix.png       # Matrix heatmap (if matrix_enabled)
├── queries.json              # Generated test queries
├── responses.json            # All model responses
├── rubrics.json              # Generated evaluation rubrics
├── comparison_details.json   # Pairwise comparison details
└── checkpoint.json           # Pipeline checkpoint
```
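A minimal sketch for inspecting a finished run; the top-level key names are assumptions based on the documented file contents (rankings, win rates, win matrix):

```python
import json

# Load the summary written by a --save run
with open("evaluation_results/evaluation_results.json") as f:
    results = json.load(f)

# Assumed keys; print the full dict if these differ in your version
print(results.get("rankings"))
print(results.get("win_matrix"))
```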
API key by model
| Model prefix | Environment variable |
|---|---|
| `gpt-*`, `o1-*`, `o3-*` | `OPENAI_API_KEY` |
| `claude-*` | `ANTHROPIC_API_KEY` |
| `qwen-*`, `dashscope/*` | `DASHSCOPE_API_KEY` |
| `deepseek-*` | `DEEPSEEK_API_KEY` |
| Custom endpoint | set `api_key` + `base_url` in config |
Additional resources
- Full config examples: cookbooks/auto_arena/examples/
- Documentation: Auto Arena Guide