
🛠️ Clawpathy Autoresearch

clawpathy-autoresearch

A Skill in which, given a task and evaluation criteria, an LLM repeatedly rewrites the skill definition and tunes it automatically until the downstream agent performs well.

⏱ Incident postmortems: 1 day → 1 hour

📺 Watch the video first (YouTube)

▶ [Shocking] The strongest AI agent "Claude Code": its latest features, how to use it, and super-practical techniques for making programming more efficient with AI, explained! ↗

※ A video selected by the jpskill.com editorial team for reference. The video's content may not match the Skill's behavior exactly.

📜 Original English description (for reference)

Eval-driven skill tuning. Given a task and an LLM-judge rubric, iteratively rewrites a SKILL.md until a downstream executor agent performs well against the judge. Low-code: all evaluation is LLM-as-judge, not deterministic Python.

🇯🇵 Commentary for Japanese creators

In a nutshell

A Skill in which, given a task and evaluation criteria, an LLM repeatedly rewrites the skill definition and tunes it automatically until the downstream agent performs well.

※ Supplementary commentary by the jpskill.com editorial team for Japanese business settings. It is reference information independent of the Skill's actual behavior.

⚠️ Download and use at your own risk. This site accepts no responsibility for the content, behavior, or safety.

🎯 What this Skill can do

The description below explains what this Skill will do for you. When you ask Claude for work in this area, the Skill activates automatically.

📦 Installation (3 steps)

  1. Click the "Download" button above to get the .skill file
  2. Rename the extension from .skill to .zip and extract it (macOS can extract it automatically)
  3. Place the extracted folder in .claude/skills/ in your home folder
    • macOS / Linux: ~/.claude/skills/
    • Windows: %USERPROFILE%\.claude\skills\

Restart Claude Code and you're done. Even without saying "Use this Skill to…", it will be invoked automatically for relevant requests.
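
If you prefer to script steps 2 and 3, a rough Python equivalent is sketched below. The file name is illustrative, and the sketch assumes the .skill archive is a renamed zip that contains the skill folder at its top level.

# Rough Python equivalent of steps 2-3 (file name illustrative).
# Assumes the .skill archive holds the skill folder at its top level.
import zipfile
from pathlib import Path

archive = Path("clawpathy-autoresearch.skill")   # the downloaded file
skills_dir = Path.home() / ".claude" / "skills"
skills_dir.mkdir(parents=True, exist_ok=True)
with zipfile.ZipFile(archive) as zf:             # a .skill file is a renamed zip
    zf.extractall(skills_dir)                    # drops the folder into ~/.claude/skills/
print("Extracted into", skills_dir)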

See the detailed usage guide →

Last updated: 2026-05-17
Retrieved: 2026-05-17
Bundled files: 1

💬 Just say this: sample prompts

  • Using Clawpathy Autoresearch, show me a minimal sample setup
  • Tell me the main ways to use Clawpathy Autoresearch and what to watch out for
  • Tell me how to integrate Clawpathy Autoresearch into an existing project

Just paste one of these into Claude Code and the Skill will activate automatically.

📖 The original SKILL.md that Claude reads (full contents)

This body is the original text (English or Chinese) that the AI (Claude) reads. A Japanese translation is being added progressively.

clawpathy-autoresearch

Eval-driven skill development. The system iteratively rewrites a SKILL.md so a downstream executor agent performs better at a task class, as judged by an LLM against a paper/task-specific rubric.

Core idea

  propose (sonnet)  →  execute (sonnet, shell)  →  judge (opus, rubric)
       ↑                                                       │
       └──────── feedback: verdict + recommended edits ────────┘
  • Proposer rewrites SKILL.md based on the last judge verdict.
  • Executor runs the new SKILL.md end-to-end inside a workspace.
  • Judge scores methodology (primary) and outputs (secondary) against a per-task rubric. Lower is better; 0 = perfect.
  • Keep the new SKILL.md only if it strictly beats the best score; else revert. Stop on target_score or on early_stop_n consecutive regressions.
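
The keep/revert rule in the last bullet boils down to the following control flow. This is an illustrative sketch only; propose, execute, and judge here are placeholders standing in for the real subagent calls, not the project's actual API.

# Illustrative sketch of the strict-better keep/revert loop described above.
# propose/execute/judge are placeholders for the real subagent dispatches.
def tuning_loop(seed_skill, rubric, propose, execute, judge,
                max_iters=20, target_score=None, early_stop_n=3):
    best_skill, best_score = seed_skill, None   # lower is better; 0 = perfect
    verdict, regressions = None, 0
    for _ in range(max_iters):
        candidate = propose(best_skill, verdict)       # rewrite SKILL.md from last verdict
        outputs = execute(candidate)                   # run the new skill end-to-end
        score, verdict = judge(outputs, rubric)        # LLM judge against the rubric
        if best_score is None or score < best_score:   # keep only if strictly better
            best_skill, best_score, regressions = candidate, score, 0
        else:                                          # tie or regression: revert
            regressions += 1
        if target_score is not None and best_score <= target_score:
            break                                      # target reached
        if regressions >= early_stop_n:                # consecutive regressions
            break
    return best_skill, best_score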

You are the orchestrator

You (the agent reading this) don't run the loop yourself. You dispatch subagents to build the workspace, then hand off to the Python loop.

Phase 1 — Scout

Dispatch a subagent with prompts/scout.md to research the paper/task. Report key findings to the user in a few lines.

Phase 2 — Scope (you + user)

Have a conversation. Ask ONE question at a time, multiple-choice where helpful. Agree on:

  • what to reproduce / what success looks like
  • which data sources are in-bounds
  • what methodology expectations belong in the rubric
  • iteration budget and target_score (if any)

Present a summary and get approval.

Phase 3 — Build

Dispatch a builder subagent with prompts/builder.md and the agreed scope. It writes:

  • task.json
  • rubric.md — the authoritative scoring rubric for the LLM judge
  • reference/ (optional; judge-only)
  • skill/SKILL.md — seed
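
For illustration only, a seed task.json might bundle the scope decisions and loop knobs agreed in Phase 2. The field names below are hypothetical, not a documented schema; the real builder subagent decides the actual contents.

# Hypothetical seed task.json; field names are illustrative, not a documented schema.
import json
from pathlib import Path

task = {
    "task": "Reproduce the headline analysis of the chosen paper",  # agreed in scoping
    "data_sources": ["public dataset agreed in scoping"],           # in-bounds sources
    "max_iters": 10,        # iteration budget
    "target_score": 0,      # stop early once reached (0 = perfect)
    "early_stop_n": 3,      # consecutive regressions before giving up
}
Path("WORKSPACE/task.json").write_text(json.dumps(task, indent=2))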

Validate:

from pathlib import Path
from skills.clawpathy_autoresearch import validate_workspace
print(validate_workspace(Path("WORKSPACE")))  # [] means valid

Phase 4 — Loop

python -m skills.clawpathy_autoresearch WORKSPACE_DIR
# or with custom models:
python -m skills.clawpathy_autoresearch WORKSPACE_DIR \
  --proposer-model sonnet --executor-model sonnet --judge-model opus

The loop streams progress to WORKSPACE/history.jsonl, snapshots every iteration's skill to WORKSPACE/snapshots/iter-NNN.md, and writes the executor's full transcript to WORKSPACE/executor_runs/iter-NNN.log.
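
To check on progress, you can read history.jsonl directly. A minimal sketch, assuming each row exposes the score and kept fields noted in the layout below:

# Minimal sketch: print per-iteration scores from history.jsonl.
# Assumes each JSON row has "score" and "kept" keys, as noted in the workspace layout.
import json
from pathlib import Path

for i, line in enumerate(Path("WORKSPACE/history.jsonl").read_text().splitlines()):
    row = json.loads(line)
    print(f"iter {i:03d}: score={row['score']} kept={row['kept']}")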

Workspace layout

workspace/
  task.json                  # task metadata + loop knobs
  rubric.md                  # LLM-judge rubric (the heart of the system)
  reference/                 # optional ground truth, judge-only
  skill/SKILL.md             # iterated by the loop
  output/                    # executor outputs (cleared each iter)
  executor_runs/iter-NNN.log # transcripts (judge reads these)
  snapshots/iter-NNN.md      # per-iter SKILL.md snapshots
  history.jsonl              # one row per iter: score, kept, verdict

Key principles

  • LLM judge only. No deterministic Python scorers. All evaluation goes through judge.md + opus. This keeps the system low-code and lets the rubric carry paper-specific nuance without adding code.
  • Methodology is primary. The rubric weights "did the agent use sound methods?" above "did the numbers match?". Ground-truth match is a signal, not the objective — the goal is better SKILL.md files.
  • Never leak ground truth. reference/ is judge-only. The executor prompt says not to read it, and the judge penalises leakage.
  • No hardcoded answers in SKILL.md. The proposer prompt and the judge both enforce this. The executor must derive results by running methods.
  • Snapshots + strict-better revert. Score on the first iter becomes the floor. Later iters that tie or regress revert to the best.

Safety

  • All processing is local except scout web fetches for public resources.
  • ClawBio disclaimer: research/education tool, not a medical device.

Gotchas

  • Do not skip scoping. The rubric is paper-specific; a generic rubric tunes nothing. Get the user to agree on methodology expectations.
  • Do not write a Python scorer. Earlier versions of this project did. They rewarded API-fetching, not methodology. The judge is the scorer.
  • Do not hand-pick the "best" snapshot yourself. Trust the loop. If the judge is calibrated wrong, fix the rubric, not the history.