baseline-selection-audit

A Skill that audits whether the comparison baselines in an ML or AI paper's experiments are appropriate, current, and able to withstand peer review.

📜 Original English description (for reference)

Audit whether an ML or AI paper's experimental baselines are necessary, fair, current, and reviewer-proof. Use this skill whenever the user is planning experiments, comparing methods, choosing baselines, worried about missing SOTA or unfair comparisons, preparing a reviewer-proof experiment section, or converting a literature review into must-have, should-have, optional, and not-comparable baselines.

🇯🇵 Commentary for Japanese creators

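Note: This commentary was added by the jpskill.com editorial team for Japanese business settings. It is reference information and is independent of the Skill's actual behavior.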

⚠️ Download and use this Skill at your own risk. This site accepts no responsibility for its content, operation, or safety.

🎯 What this Skill can do

The description below explains what this Skill does for you. It is triggered automatically when you give Claude a request in this area.

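📦 Installation (3 steps)

1. Press the Download button on this Skill's page to get the .skill file
2. Change the file extension from .skill to .zip and extract it (macOS can extract it automatically)
3. Place the extracted folder in .claude/skills/ under your home folder
- macOS / Linux: ~/.claude/skills/
- Windows: %USERPROFILE%\.claude\skills\

Restart Claude Code and you are done. The Skill is invoked automatically on related requests, even if you do not say "use this Skill…".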

Last updated: 2026-05-17
Retrieved: 2026-05-17
Bundled files: 1

📜 Original SKILL.md (the English/Chinese text that Claude reads)

Baseline Selection Audit

Turn a claim, method, draft experiment plan, or literature map into a reviewer-proof baseline set and fairness ledger.

Use this skill when:

experiments are being planned and the right baselines are unclear
a literature review found competitors but they have not been converted into comparisons
a paper may be missing SOTA, direct competitors, classics, ablation baselines, or control baselines
a reviewer might complain about unfair tuning, scale, data, compute, protocol, or metric differences
a rebuttal or revision needs to decide which additional baseline experiment is worth running
the user needs to justify why a baseline is excluded as not comparable

Do not use this skill for citation metadata checks. Use citation-audit for BibTeX and LaTeX correctness. Use citation-coverage-audit when the primary question is missing references rather than missing comparisons.

Pair this skill with:

literature-review-sprint before this skill when the competing paper map is incomplete
algorithm-design-planner when the closest baseline changes the method design
experiment-design-planner after this skill to turn selected baselines into a concrete experiment matrix
run-experiment only after baseline scope, fairness rules, and stop conditions are clear
result-diagnosis when baseline results are surprising, unstable, or stronger than the proposed method
paper-evidence-board when baseline risks must be linked to paper claims, figures, and sections
research-project-memory when baseline decisions, risks, and actions should persist across sessions

Skill Directory Layout

<installed-skill-dir>/
├── SKILL.md
└── references/
    ├── baseline-taxonomy.md
    ├── fairness-ledger.md
    ├── memory-writeback.md
    ├── report-template.md
    └── reviewer-risk.md

Progressive Loading

Always read references/baseline-taxonomy.md, references/fairness-ledger.md, and references/reviewer-risk.md.
Read references/report-template.md before writing the final audit.
Read references/memory-writeback.md when the project has a memory/ directory or component .agent/ folders, or when the user asks for persistent project memory.
If the baseline set depends on current SOTA, recent concurrent work, or venue expectations, verify with current sources through web search, OpenReview, proceedings, arXiv, PMLR, ACL Anthology, CVF, DBLP, Semantic Scholar, or user-provided papers.
If current verification is unavailable, mark baseline status as provisional and identify the missing search needed before final experiment planning.
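
The file-loading rules above can be sketched as a small function. This is only an illustration: the function name and its boolean parameters are hypothetical, and only the file paths come from the skill itself.

```python
from typing import List

# Files the skill says to read on every run.
ALWAYS = [
    "references/baseline-taxonomy.md",
    "references/fairness-ledger.md",
    "references/reviewer-risk.md",
]


def references_to_read(
    writing_final_audit: bool = False,
    has_memory_dir: bool = False,
    has_agent_folders: bool = False,
    user_wants_memory: bool = False,
) -> List[str]:
    """Return the reference files to load, in reading order (hypothetical helper)."""
    files = list(ALWAYS)
    if writing_final_audit:
        files.append("references/report-template.md")
    # Any one of the three memory conditions triggers the write-back guide.
    if has_memory_dir or has_agent_folders or user_wants_memory:
        files.append("references/memory-writeback.md")
    return files
```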

Core Principles

Baselines exist to defend a claim, not to decorate a table.
Separate closest conceptual competitor, strongest empirical baseline, standard benchmark baseline, ablation baseline, and control baseline.
A baseline can be missing for citation purposes, comparison purposes, or both. Name which one.
Fairness must cover data, model size, compute, tuning, metric, protocol, code availability, and reporting.
Do not ask the user to run every possible baseline. Rank by reviewer impact and decision value.
Excluding a baseline requires a defensible reason and often a citation or limitation statement.
A strong baseline beating the method is project information, not merely an experiment failure.
The output must hand off directly to experiment-design-planner.

Step 1 - Recover Claim and Comparison Surface

Collect:

paper claim or experiment claim
proposed method and closest baseline, if known
target task, dataset, benchmark, metric, and protocol
target venue or community expectations
existing results, draft tables, or planned experiments
literature-review outputs, if available
code availability and compute budget
project memory IDs such as CLM-###, EVD-###, RSK-###, or ACT-###

Rewrite the claim into:

We need to show that [method] improves [property] over [comparison set] under [task/protocol], without the result being explained by [confound].

If this cannot be written, route to research-idea-validator, algorithm-design-planner, or paper-evidence-board.
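
One way to make the "cannot be written" condition concrete is to treat the claim sentence as a template with five required slots. The sketch below assumes slot names (`method`, `property`, `comparison_set`, `task_protocol`, `confound`) chosen here for illustration; `rewrite_claim` is a hypothetical helper, not part of the skill.

```python
from string import Formatter

CLAIM_TEMPLATE = (
    "We need to show that {method} improves {property} over {comparison_set} "
    "under {task_protocol}, without the result being explained by {confound}."
)


def rewrite_claim(**slots: str) -> str:
    """Fill the Step 1 claim template, or raise if any slot is missing or blank."""
    # Formatter.parse yields (literal, field_name, spec, conversion) tuples;
    # field_name is None for trailing literal text.
    required = [name for _, name, _, _ in Formatter().parse(CLAIM_TEMPLATE) if name]
    missing = [name for name in required if not slots.get(name, "").strip()]
    if missing:
        # In the skill's terms, this is the point at which to route elsewhere.
        raise ValueError("claim cannot be written; missing: " + ", ".join(missing))
    return CLAIM_TEMPLATE.format(**slots)
```

A `ValueError` here corresponds to the routing case: the missing slot names show which part of the claim still needs work before baselines can be selected.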

Step 2 - Build Candidate Baseline Pool

Use:

literature review outputs
cited related work
benchmark leaderboards or official baselines
recent accepted papers at the target venue
code repositories or model releases
reviewer comments, if this is rebuttal mode

Classify each candidate using references/baseline-taxonomy.md.

The pool should include:

direct competitor
strongest current method
standard benchmark baseline
classic baseline
previous version or nearest ablation of the user's method
no-method or trivial control baseline
oracle, upper bound, or diagnostic baseline when appropriate
resource-matched baseline
domain-specific baseline expected by the venue

Step 3 - Assign Baseline Requirement Level

For each candidate, assign exactly one:

must-have: paper is hard to defend without it
should-have: materially improves reviewer confidence, but omission may be defensible
optional: useful context, low acceptance impact
not-comparable: related but unfair or invalid as a direct comparison
citation-only: should be discussed/cited but does not need an experiment

Every must-have baseline needs an owner, experiment form, fairness constraints, and fallback if impossible.

Every not-comparable baseline needs a reason:

different task or data
incompatible metric
unavailable code and reproduction too expensive
different resource regime
uses extra supervision or data
no public details sufficient for faithful reproduction
evaluates a different claim
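
The two "every ... needs" rules in this step lend themselves to a mechanical check. The following is a minimal sketch, assuming a record layout invented for illustration (`Candidate` and its field names are hypothetical); only the five level names and the required items come from the text above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Level(Enum):
    MUST_HAVE = "must-have"
    SHOULD_HAVE = "should-have"
    OPTIONAL = "optional"
    NOT_COMPARABLE = "not-comparable"
    CITATION_ONLY = "citation-only"


@dataclass
class Candidate:
    name: str
    level: Level  # a single field, so each candidate holds exactly one level
    owner: Optional[str] = None
    experiment_form: Optional[str] = None
    fairness_constraints: Optional[str] = None
    fallback: Optional[str] = None
    exclusion_reason: Optional[str] = None


def problems(c: Candidate) -> List[str]:
    """List the Step 3 requirements that a candidate record does not yet meet."""
    issues = []
    if c.level is Level.MUST_HAVE:
        for field in ("owner", "experiment_form", "fairness_constraints", "fallback"):
            if not getattr(c, field):
                issues.append(f"{c.name}: must-have baseline is missing {field}")
    if c.level is Level.NOT_COMPARABLE and not c.exclusion_reason:
        issues.append(f"{c.name}: not-comparable baseline needs a reason")
    return issues
```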

Step 4 - Audit Fairness

Read references/fairness-ledger.md.

For each must-have and should-have baseline, check:

same data split and preprocessing
same training data and extra-data policy
comparable model size or explicit scale control
comparable compute or explicit compute-normalized metric
comparable tuning budget
comparable evaluation metric and decoding/sampling protocol
correct official code or faithful reimplementation
enough seeds, confidence intervals, or variance reporting
same reporting unit: tokens, examples, images, FLOPs, wall-clock, NFE, parameters, or memory

If fairness cannot be achieved, decide whether to:

change claim
add a matched subset comparison
run a smaller diagnostic comparison
mark baseline as citation-only with clear limitation
defer to rebuttal risk
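
As a rough illustration, the nine fairness checks can be held as keys of a per-baseline ledger row, with anything not explicitly confirmed treated as unresolved. The key names below are shorthand invented for this sketch; the actual format defined in references/fairness-ledger.md is not shown here and may differ.

```python
from typing import Dict, List

# Hypothetical key names, one per check listed in Step 4.
FAIRNESS_CHECKS = (
    "data_split_and_preprocessing",
    "training_data_and_extra_data_policy",
    "model_size_or_scale_control",
    "compute_or_compute_normalized_metric",
    "tuning_budget",
    "metric_and_decoding_protocol",
    "official_code_or_faithful_reimplementation",
    "seeds_or_variance_reporting",
    "reporting_unit",
)


def unresolved_checks(ledger_row: Dict[str, object]) -> List[str]:
    """Return the checks that are absent or not explicitly True for one baseline."""
    # A missing key counts as unresolved: silence is not evidence of fairness.
    return [check for check in FAIRNESS_CHECKS if ledger_row.get(check) is not True]
```

Each name returned is a point where one of the fallback decisions above (change the claim, add a matched subset, and so on) has to be made.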

Step 5 - Forecast Reviewer Attacks

Read references/reviewer-risk.md.

For each missing, weak, or unfair baseline, write the likely reviewer objection:

Reviewer could say: [attack].
Severity: fatal / major / medium / minor
Mitigation: run / cite / justify / narrow claim / move to appendix / accept risk

Prioritize by acceptance impact:

fatal novelty or comparison threat
required benchmark/SOTA omission
unfair tuning or compute
weak ablation baseline
unclear protocol
missing control
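
The objection template can be kept as a small record so that severity and mitigation stay within the allowed values. In this sketch the `Attack` class and `prioritize` function are hypothetical, and sorting by severity label is a simplification: the acceptance-impact ordering listed above still needs human judgment.

```python
from dataclasses import dataclass
from typing import Iterable, List

SEVERITY_ORDER = {"fatal": 0, "major": 1, "medium": 2, "minor": 3}
MITIGATIONS = {"run", "cite", "justify", "narrow claim", "move to appendix", "accept risk"}


@dataclass
class Attack:
    baseline: str
    objection: str
    severity: str
    mitigation: str

    def __post_init__(self) -> None:
        # Reject labels outside the template's vocabulary.
        if self.severity not in SEVERITY_ORDER:
            raise ValueError(f"unknown severity: {self.severity}")
        if self.mitigation not in MITIGATIONS:
            raise ValueError(f"unknown mitigation: {self.mitigation}")

    def render(self) -> str:
        """Format the record in the three-line template from Step 5."""
        return (
            f"Reviewer could say: {self.objection}\n"
            f"Severity: {self.severity}\n"
            f"Mitigation: {self.mitigation}"
        )


def prioritize(attacks: Iterable[Attack]) -> List[Attack]:
    """Order attacks from fatal to minor; ties keep their input order."""
    return sorted(attacks, key=lambda a: SEVERITY_ORDER[a.severity])
```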

Step 6 - Produce Experiment Handoff

For experiment-design-planner, output:

selected baselines and requirement levels
exact comparison table rows
fairness ledger fields to log
metrics and protocol constraints
ablation/control baselines
stop conditions
expected reviewer question each baseline answers
fallback plan if a baseline is impossible

If compute is limited, propose a staged plan:

minimal reviewer-proof set
high-impact optional additions
appendix or deferred baselines
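
A staged plan could be drafted along these lines. The mapping is an assumption made for this sketch, not a rule stated by the skill: must-have baselines form the minimal set regardless of cost, should-have baselines are added while the budget lasts, and everything else that could be run is deferred. The cost unit and the `stage_plan` name are also hypothetical.

```python
from typing import Dict, List, Tuple

Baseline = Tuple[str, str, float]  # (name, requirement level, estimated cost)


def stage_plan(baselines: List[Baseline], budget: float) -> Tuple[Dict[str, List[str]], float]:
    """Split baselines into three stages and return the stages plus leftover budget."""
    stages: Dict[str, List[str]] = {"minimal": [], "additions": [], "deferred": []}
    remaining = budget
    # Must-haves are never dropped; a negative remainder signals an infeasible plan.
    for name, level, cost in baselines:
        if level == "must-have":
            stages["minimal"].append(name)
            remaining -= cost
    for name, level, cost in baselines:
        if level == "should-have" and cost <= remaining:
            stages["additions"].append(name)
            remaining -= cost
        elif level in ("should-have", "optional"):
            stages["deferred"].append(name)
        # not-comparable and citation-only entries need no run, so they are skipped.
    return stages, remaining
```

If the leftover budget comes back negative, the minimal set alone exceeds the available compute, which is a signal to narrow the claim or revisit the fallbacks rather than to drop a must-have baseline silently.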

Step 7 - Write the Baseline Audit Report

Read references/report-template.md.

If saving to a project and no path is given, use:

docs/experiments/baseline_selection_audit_YYYY-MM-DD_<short-name>.md

If working inside a code repo or code worktree created by init-python-project / new-workspace, prefer:

docs/reports/baseline_selection_audit_YYYY-MM-DD_<short-name>.md
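
The two default locations differ only in the folder, so the path can be built by one helper. This is a sketch: `report_path` is a hypothetical name, the caller decides whether the code-repo case applies, and lowercasing the short name into a hyphenated slug is an assumption, since the skill does not say how `<short-name>` should be normalized.

```python
import re
from datetime import date
from pathlib import PurePosixPath
from typing import Optional


def report_path(
    short_name: str,
    in_code_repo: bool = False,
    today: Optional[date] = None,
) -> PurePosixPath:
    """Build the default report path for a baseline selection audit."""
    day = (today or date.today()).isoformat()  # YYYY-MM-DD
    # Assumed normalization: lowercase, runs of other characters become one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", short_name.lower()).strip("-")
    folder = "docs/reports" if in_code_repo else "docs/experiments"
    return PurePosixPath(folder) / f"baseline_selection_audit_{day}_{slug}.md"
```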

The report must include:

claim under audit
candidate baseline pool
requirement-level table
fairness ledger
reviewer attack forecast
selected experiment matrix handoff
baselines excluded and why
memory update section

Step 8 - Write Back to Project Memory

Read references/memory-writeback.md when memory exists.

Update the smallest useful set of entries:

memory/risk-board.md: missing, unfair, unavailable, or not-comparable baseline risks
memory/evidence-board.md: planned baseline comparisons and ablations
memory/action-board.md: implementation, run, citation, or justification actions
memory/claim-board.md: claims narrowed by baseline feasibility
memory/decision-log.md: durable decisions to include, exclude, or stage baselines
worktree .agent/worktree-status.md: baseline implementation purpose and exit condition
paper/.agent/: table/section implications when a draft exists

Use certainty labels:

verified for baselines checked against primary sources or official code
user-stated for constraints supplied by the user
inferred for reviewer risks and fairness judgments
unverified for candidates not yet checked

Final Sanity Check

Before finalizing:

every paper claim has at least one direct comparison or control
closest conceptual competitor and strongest empirical baseline are not conflated
must-have baselines are explicit
excluded baselines have defensible reasons
fairness constraints are concrete enough to run
reviewer attacks are written in reviewer language
the output can feed directly into experiment-design-planner
project memory is updated when present