💼 Autoforge
A Skill that mathematically optimizes AI-agent performance, improving prompts, code, and entire projects through rigorous verification.
📺 Watch the video first (YouTube)
▶ [Automation] 6 cutting-edge AI power-user techniques, all in one video! [ClaudeCode / AI agents / AI-driven management / Skills / MCP] ↗
* A reference video selected by the jpskill.com editorial team. The video's content may not exactly match the Skill's behavior.
📜 Original English description (for reference)
AutoForge is a production-grade autonomous optimization framework for AI agents. It replaces subjective "reflection" with mathematically rigorous convergence loops — tracking every iteration in TSV, cross-validating with multiple models, and stopping only when pass rates confirm real improvement. Four specialized modes: prompt (skill & doc optimization via scenario simulation), code (sandboxed test execution with measurable criteria), audit (CLI verification against live tool behavior), and project (whole-repo cross-file consistency analysis). Battle-tested across 50+ iterations on production skills. Use when: user says "autoforge", "forge", "optimize skill", "improve", "run autoforge", "optimize code", "improve script", "optimize repo", "forge project", "check project", "repo audit".
🇯🇵 Guide for Japanese creators
* Supplementary notes by the jpskill.com editorial team for Japanese business users. This is reference information independent of the Skill's actual behavior.
⚠️ Download and use at your own risk. This site assumes no responsibility for the Skill's content, behavior, or safety.
🎯 What this Skill can do
The description below explains what this Skill does for you. When you ask Claude for work in this area, the Skill activates automatically.
📦 Installation (3 steps)
- 1. Click the "Download" button above to get the .skill file
- 2. Rename the extension from .skill to .zip and extract the archive (macOS can extract it automatically)
- 3. Place the extracted folder in .claude/skills/ under your home folder:
  - macOS / Linux: ~/.claude/skills/
  - Windows: %USERPROFILE%\.claude\skills\
Restart Claude Code and you're done. You don't need to say "use this Skill…" — it is invoked automatically for relevant requests.
See the detailed usage guide →
- Last updated: 2026-05-17
- Retrieved: 2026-05-17
- Bundled files: 6
💬 Just talk to it — sample prompts
- › With Autoforge, analyze my business and propose three improvements
- › Use Autoforge to prepare materials for next week's meeting
- › With Autoforge, organize the current issues into an action plan
Paste any of these into Claude Code and the Skill activates automatically.
📖 The original SKILL.md that Claude reads (expand contents)
* This body is the original text (English or Chinese) that the AI (Claude) reads. Japanese translations are being added over time.
AutoForge — Autonomous Optimization Framework
Stop reflecting. Start converging. Every iteration is measured, logged, and validated — not vibed.
AutoForge replaces ad-hoc "improve this" prompts with a rigorous optimization loop: define evals, run iterations, track pass rates in TSV, report live to your channel, and stop only when math says you're done. Multi-model cross-validation prevents the "same model grades its own homework" blind spot.
Four modes. One convergence standard.
| Mode | What it does | Best for |
|---|---|---|
| `prompt` | Simulate 5 scenarios/iter, evaluate Yes/No | SKILL.md, prompts, doc templates |
| `code` | Sandboxed test execution, measure exit/stdout/stderr | Shell scripts, Python tools, pipelines |
| `audit` | Test CLI commands live, verify SKILL.md matches reality | CLI skill documentation |
| `project` | Scan whole repo, cross-file consistency analysis | README↔CLI drift, Dockerfile↔deps, CI gaps |
AutoForge — Top-Agent Architecture
Overview
Agent (you)
├── State: results.tsv, current target file state, iteration counter
├── Iteration 1: evaluate → improve → write TSV → report
├── Iteration 2: evaluate → improve → write TSV → report
├── ...
└── Finish: report.sh --final → configured channel
Sub-Agent = You
"Sub-Agent" is a conceptual role, not a separate process. You (the top-agent) execute each iteration yourself: simulate/execute → evaluate → write TSV → call report.sh. The templates below describe what you do PER ITERATION — not what you send to another agent.
For code mode, run tests using the exec tool.
Multi-Model Setup (recommended for Deep Audits)
For complex audits, you can split two roles across different models:
| Role | Model | Task |
|---|---|---|
| Optimizer | Opus / GPT-4.1 | Analyzes, finds issues, writes fixes |
| Validator | GPT-5 / Gemini (different model) | Checks against ground truth, provides pass rate |
Flow: Optimizer and Validator alternate. Optimizer iterations have status improved/retained/discard. Validator iterations confirm or refute the pass rate. Spawn validators as sub-agents with sessions_spawn and explicit model.
When to use Multi-Model: Deep Audits (>5 iterations expected), complex ground truth, or when a single model is blind to its own errors.
When Single-Model suffices: Simple CLI audits, prompt optimization, code with clear tests.
Configuration
AutoForge uses environment variables for reporting. All are optional — without them, output goes to stdout.
| Variable | Default | Description |
|---|---|---|
| `AF_CHANNEL` | `telegram` | Messaging channel for reports |
| `AF_CHAT_ID` | (none) | Chat/group ID for report delivery |
| `AF_TOPIC_ID` | (none) | Thread/topic ID within the chat |
Hard Invariants
These rules apply always, regardless of mode:
- TSV is mandatory. Every iteration writes exactly one row to `results/[target]-results.tsv`.
- Reporting is mandatory. Call `report.sh` immediately after every TSV row.
- `--dry-run` never overwrites the target. Only TSV, `*-proposed.md`, and reports are written.
- Mode isolation is strict. Only execute steps for the assigned mode.
- Iteration 1 = Baseline. Evaluate the original version unchanged, status `baseline`.
Modes — Read ONLY Your Mode!
You are assigned ONE mode. Ignore all sections for other modes.
| Mode | What happens | Output |
|---|---|---|
| `prompt` | Mentally simulate skill/prompt, evaluate against evals | Improved prompt text |
| `code` | Run tests in sandbox, measure results | Improved code |
| `audit` | Test CLI commands (read-only only!) + verify SKILL.md against reality | Improved SKILL.md |
| `project` | Scan whole repo, cross-file analysis, fix multiple files per iteration | Improved repository |
Your mode is in the task prompt. Everything else is irrelevant to you.
TSV Format (same for ALL modes)
Header (once at loop start):
printf '%s\t%s\t%s\t%s\t%s\n' "iteration" "prompt_version_summary" "pass_rate" "change_description" "status" > results/[target]-results.tsv
Row per iteration:
printf '%s\t%s\t%s\t%s\t%s\n' "1" "Baseline" "58%" "Original version" "baseline" >> results/[target]-results.tsv
Use `printf`, not `echo -e`! `echo -e` interprets backslashes in field values; `printf '%s'` outputs strings literally.
5 columns, TAB-separated, EXACTLY this order:
| # | Column | Type | Rules |
|---|---|---|---|
| 1 | `iteration` | Integer | 1, 2, 3, ... |
| 2 | `prompt_version_summary` | String | Max 50 Unicode chars. No tabs, no newlines. |
| 3 | `pass_rate` | String | Number + %: 58%, 92%, 100%. Always integer. |
| 4 | `change_description` | String | Max 100 Unicode chars. No tabs, no newlines. |
| 5 | `status` | Enum | Exactly one of: baseline · improved · retained · discard |
Escaping rules:
- Tabs in text fields → replace with spaces
- Newlines in text fields → replace with `|`
- Empty fields → use hyphen `-` (never leave empty)
- `$` and backticks → use `printf '%s'` or escape with `\$` (prevents unintended variable interpolation)
- Unicode/Emoji allowed, count as 1 character (not bytes)
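The escaping rules above can be sketched as a small bash helper. This is a minimal sketch; `sanitize_field` and `write_row` are hypothetical helper names, not part of the skill:

```shell
# Hypothetical helper applying the escaping rules above to one field.
sanitize_field() {
  local s="$1"
  s="${s//$'\t'/ }"      # tabs -> spaces
  s="${s//$'\n'/|}"      # newlines -> pipe
  [ -z "$s" ] && s="-"   # empty -> hyphen
  printf '%s' "$s"
}

# Write one 5-column TSV row with every field sanitized.
write_row() {
  printf '%s\t%s\t%s\t%s\t%s\n' \
    "$(sanitize_field "$1")" "$(sanitize_field "$2")" \
    "$(sanitize_field "$3")" "$(sanitize_field "$4")" \
    "$(sanitize_field "$5")"
}
```

Because `printf '%s'` is used throughout, `$`, backticks, and emoji in field values pass through literally.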
Status rules (based on pass-rate comparison):
- `baseline` — Mandatory for Iteration 1. Evaluate original version only.
- `improved` — Pass rate higher than previous best → new version becomes current state
- `retained` — Pass rate equal or marginally better → predecessor remains
- `discard` — Pass rate lower → change discarded, revert to best state
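The comparison behind these statuses could be sketched like this (`decide_status` is an illustrative name; "marginally better" is a judgment call, so this sketch treats only an equal rate as retained):

```shell
# Sketch of the status rules: compare the new integer pass rate
# against the best rate seen so far.
decide_status() {
  local new="$1" best="$2"          # integer pass rates, e.g. 75 and 58
  if   [ "$new" -gt "$best" ]; then echo improved
  elif [ "$new" -eq "$best" ]; then echo retained   # "marginally better" needs judgment
  else                              echo discard
  fi
}
```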
Reporting (same for ALL modes)
After EVERY TSV row (including baseline):
bash scripts/report.sh results/[target]-results.tsv "[Skill Name]"
After loop ends, additionally with --final:
bash scripts/report.sh results/[target]-results.tsv "[Skill Name]" --final
The report script reads AF_CHANNEL, AF_CHAT_ID, and AF_TOPIC_ID from environment. Without them, it prints to stdout with ANSI colors.
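For example, a Telegram setup might look like this before starting the loop. The IDs below are placeholders, and the exact set of supported channel values is defined by report.sh itself:

```shell
# Placeholder configuration -- substitute your own chat/topic IDs.
export AF_CHANNEL=telegram
export AF_CHAT_ID=-1001234567890
export AF_TOPIC_ID=42
bash scripts/report.sh results/backup-results.tsv "Backup Skill"
```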
Stop Conditions (for ALL modes)
Priority — first matching condition wins, top to bottom:
- 🛑 Minimum iterations — If specified in the task (e.g. "min 5"), this count MUST be reached. No other condition can stop the loop before then.
- 🛑 Max 30 iterations — Hard safety net, stop immediately.
- ❌ 3× `discard` in a row → structural problem, stop + analyze.
- ✅ 3× 100% pass rate (after minimum) → confirmed perfect, done.
- ➡️ 5× `retained` in a row → converged, done.
Counting rules:
- `3× 100%` = three iterations with `pass_rate == 100%`, not necessarily consecutive.
- `5× retained` and `3× discard` = consecutive (in a row).
- `baseline` counts toward no series.
- `improved` interrupts `retained` and `discard` series.
At 100% in early iterations: Keep going! Test harder edge cases. Only 3× 100% after the minimum confirms true perfection.
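The counting rules can be derived directly from the TSV. A minimal awk sketch (`count_series` is a hypothetical helper, not part of the skill's scripts):

```shell
# Sketch: derive stop-condition counters from a results TSV.
# Prints: total_100 consecutive_retained consecutive_discard
count_series() {
  awk -F'\t' 'NR > 1 {
    if ($3 == "100%") total100++
    # baseline affects no series; improved/discard reset retained, and vice versa
    if ($5 == "retained") { ret++ } else if ($5 != "baseline") ret = 0
    if ($5 == "discard")  { dis++ } else if ($5 != "baseline") dis = 0
  } END { printf "%d %d %d\n", total100+0, ret+0, dis+0 }' "$1"
}
```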
Recognizing Validator Noise
In multi-model setups, the Validator can produce false positives — fails that aren't real issues:
- Config path vs tool name confusion (e.g. `agents.list[]` ≠ `agents_list` tool)
- Inverted checks ("no X" → Validator looks for X as required)
- Normal English as forbidden reference (e.g. "runtime outcome" ≠ `runtime: "acp"`)
- Overcounting (thread commands counted as subagent commands)
Rule: If, after all real issues are fixed, more than 3 discards arrive in a row and the fail justifications don't hold up under scrutiny → declare convergence; don't validate endlessly.
Execution Modes
| Flag | Behavior |
|---|---|
| `--dry-run` (default) | Only TSV + proposed files. Target file/repo remains unchanged. |
| `--live` | Target file/repo is overwritten. Auto-backup → `results/backups/` |
| `--resume` | Read existing TSV, continue from last iteration. On invalid format: abort. |
mode: prompt
Only read this section if your task contains `mode: prompt`!
Per Iteration: What you do
- Read current prompt/skill
- Mentally simulate 5 different realistic scenarios
- Evaluate each scenario against all evals (Yes=1, No=0)
- Pass rate = (Sum Yes) / (Eval count × 5 scenarios) × 100
- Compare with best previous pass rate → determine status
- On `improved`: propose minimal, surgical improvement
- Write TSV row + call report.sh
- Check stop conditions
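The pass-rate formula in the steps above reduces to integer arithmetic (`prompt_pass_rate` is an illustrative helper name; per the TSV rules the result is always an integer percent):

```shell
# Sketch: integer pass rate for prompt mode.
# 5 scenarios per iteration, each scored Yes=1 / No=0 against every eval.
prompt_pass_rate() {
  local yes="$1" evals="$2"
  echo $(( yes * 100 / (evals * 5) ))
}
```

For example, 12 Yes answers across 4 evals × 5 scenarios gives 60%.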
At the End
Best version → results/[target]-proposed.md + report.sh --final
mode: code
Only read this section if your task contains `mode: code`!
Per Iteration: What you do
- Create sandbox: `SCRATCH=$(mktemp -d) && cd $SCRATCH`
- Write current code to sandbox
- Execute test command (with `timeout 60s`)
- Measure: exit_code, stdout, stderr, runtime
- Evaluate against evals → calculate pass rate
- On `improved`: minimal code improvement + verify again
- Write TSV row + call report.sh
- Check stop conditions
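Steps 1–4 above could be wrapped in one measurement helper like this (a minimal sketch assuming GNU coreutils `timeout` is available; `run_measured` is a hypothetical name):

```shell
# Sketch of one code-mode measurement: run the test command in a
# scratch directory with a timeout and capture exit code, output, runtime.
run_measured() {
  local scratch; scratch=$(mktemp -d)
  local start end
  start=$(date +%s)
  ( cd "$scratch" && timeout 60s "$@" >stdout.log 2>stderr.log )
  local exit_code=$?
  end=$(date +%s)
  echo "exit=$exit_code runtime=$((end - start))s"
  cat "$scratch/stdout.log"
}
```

The captured exit code, logs, and runtime then feed the evals in the table below.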
Code Eval Types
| Eval Type | Description | Example |
|---|---|---|
| `exit_code` | Process exit code | `exit_code == 0` |
| `output_contains` | stdout contains string | `"SUCCESS" in stdout` |
| `output_matches` | stdout matches regex | `r"Total: \d+"` |
| `test_pass` | Test framework green | pytest exit 0 |
| `runtime` | Runtime limit | < 5000ms |
| `no_stderr` | No error output | `stderr == ""` |
| `file_exists` | Output file created | result.json exists |
| `json_valid` | Output is valid JSON | `json.loads(stdout)` |
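Two of these eval types could be checked from shell like this (a sketch assuming `python3` is on PATH; the function names are illustrative, not part of the skill):

```shell
# Illustrative shell checks for two eval types.
# json_valid: succeeds iff the argument parses as JSON.
json_valid()     { printf '%s' "$1" | python3 -m json.tool >/dev/null 2>&1; }
# output_matches: succeeds iff the output matches the extended regex.
output_matches() { printf '%s' "$1" | grep -qE "$2"; }
```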
At the End
Best code → results/[target]-proposed.[ext] + report.sh --final
mode: audit
Only read this section if your task contains `mode: audit`!
⚠️ DO NOT write your own code. Only test CLI commands of the target tool (--help + read-only).
Two Variants
Simple Audit (CLI skill, clear commands):
- 2 iterations: Baseline → Proposed Fix
- For tools with clear `--help` output and simple command structure
Deep Audit (complex docs, many checks):
- Iterative loop like prompt/code, same stop conditions
- For extensive documentation with many checkpoints (e.g. config keys, tool policy, parameter lists)
- Recommended: Multi-Model setup (Opus Optimizer + external Validator)
Simple Audit Flow
- Write TSV header
- Iteration 1 (Baseline): Test every documented command → pass rate → TSV + report
- Iteration 2 (Proposed Fix): Write improved SKILL.md → expected pass rate → TSV + report
- Improved SKILL.md → `results/[target]-proposed.md`
- Detail results → `results/[target]-audit-details.md` (NOT in TSV!)
- report.sh `--final`
Deep Audit Flow
- Write TSV header
- Iteration 1 (Baseline): Extract ground truth from source, define all checks, evaluate baseline
- Iterations 2+: Optimizer fixes issues → Validator checks → TSV + report per iteration
- Loop runs until stop conditions trigger (3× 100%, 5× retained, 3× discard)
- Final version → `results/[target]-proposed.md` or `results/[target]-v1.md`
- report.sh `--final`
Fixed Evals (audit)
- Completeness — Does SKILL.md cover ≥80% of real commands/config?
- Correctness — Are ≥90% of documented commands/params syntactically correct?
- No stale references — Does everything documented actually exist?
- No missing core features — Are all important features covered?
- Workflow quality — Does quick-start actually work?
mode: project
Only read this section if your task contains `mode: project`!
⚠️ This mode operates on an ENTIRE repository/directory, not a single file. Cross-file consistency is the core feature — this is NOT "audit on many files."
Three Phases
Project mode runs through three sequential phases. Phases 1 and 2 happen once (in Iteration 1 = Baseline). Phase 3 is the iterative fix loop.
Phase 1: Scan & Plan
1. Analyze the repo directory:

       # Discover structure
       tree -L 3 --dirsfirst [target_dir]
       ls -la [target_dir]

2. Identify relevant files and classify by priority:

   | Priority | Files |
   |---|---|
   | critical | README, Dockerfile, CI workflows (.github/workflows), package.json/requirements.txt, main entry points |
   | normal | Tests, configs, scripts, .env.example, .gitignore |
   | low | Docs, examples, LICENSE, CHANGELOG |

3. Build the File-Map — a mental inventory of what exists and what's missing.

4. Compose eval set: merge user-provided evals with auto-detected evals (see Default Evals below).
Phase 2: Cross-File Analysis
Run consistency checks across files. Each check = one eval point:
| Check | What it verifies |
|---|---|
| README ↔ CLI | Documented commands/flags match actual --help output |
| Dockerfile ↔ deps | requirements.txt / package.json versions match what Dockerfile installs |
| CI ↔ project structure | Workflow references correct paths, scripts, test commands |
| `.env.example` ↔ code | Every env var in code has a corresponding entry in .env.example |
| Imports ↔ dependencies | Every import / require has a matching dependency declaration |
| Tests ↔ source | Test files exist for critical modules |
| `.gitignore` ↔ artifacts | Build outputs, secrets, and caches are excluded |
Result of Phase 2: A complete eval checklist with per-file and cross-file checks, each scored Yes/No.
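One of the cross-file checks above could be sketched as a grep pass. This assumes Python sources that read env vars via `os.getenv("VAR")`; `missing_env_vars` is an illustrative name, and real code may read the environment in other ways:

```shell
# Sketch of the ".env.example <-> code" check: print env vars that
# Python sources read via getenv("...") but .env.example doesn't define.
missing_env_vars() {
  local repo="$1"
  grep -rhoE 'getenv\("[A-Z_]+"\)' "$repo" --include='*.py' \
    | grep -oE '[A-Z_]+' | sort -u \
    | while read -r var; do
        grep -q "^${var}=" "$repo/.env.example" 2>/dev/null || echo "$var"
      done
}
```

Each variable it prints counts as one failed eval point in the checklist.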
Phase 3: Iterative Fix Loop
Same loop logic as prompt/code/audit — TSV, report.sh, stop conditions. Key differences:
- Multiple files can be changed per iteration
- Pass rate = aggregated over ALL evals (file-specific + cross-file)
- Fixes are minimal and surgical — don't refactor blindly, only fix what improves pass rate
- `change_description` includes which files were touched: `"Fix Dockerfile + CI workflow sync"`
Per Iteration: What you do
- Evaluate current repo state against all evals (file-specific + cross-file)
- Calculate pass rate: (passing evals / total evals) × 100
- Compare with best previous pass rate → determine status
- On `improved`: apply minimal, surgical fixes to the fewest files necessary
- Verify the fix didn't break other evals (re-run affected checks)
- Write TSV row + call report.sh
- Check stop conditions
Dry-Run vs Live
| Flag | Behavior |
|---|---|
| `--dry-run` (default) | Fixed files → `results/[target]-proposed/` directory (mirrors repo structure). Original repo untouched. |
| `--live` | Files overwritten in-place. Originals backed up → `results/backups/` (preserving directory structure). |
Default Evals (auto-applied unless overridden)
These evals are automatically used when the user doesn't provide custom evals. The agent detects which are applicable based on what exists in the repo:
| # | Eval | Condition |
|---|---|---|
| 1 | README accurate? (describes actual features/commands) | README exists |
| 2 | Tests present and green? (pytest / npm test / go test) | Test files or test config detected |
| 3 | CI configured and syntactically correct? | .github/workflows/ or .gitlab-ci.yml exists |
| 4 | No hardcoded secrets? (`grep -rE "(password\|api_key\|token\|secret)\s*="`) | Always |
| 5 | Dependencies complete? (requirements.txt ↔ imports, package.json ↔ requires) | Dependency file exists |
| 6 | Dockerfile functional? (docker build succeeds or Dockerfile syntax valid) | Dockerfile exists |
| 7 | .gitignore sensible? (no secrets, build artifacts excluded) | .gitignore exists |
| 8 | License present? | Always |
Eval Scoring
Pass Rate = (Passing Evals / Total Applicable Evals) × 100
Evals that don't apply (e.g. "Dockerfile functional?" when no Dockerfile exists) are excluded from the total, not counted as passes.
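In shell, that scoring rule reduces to (an illustrative helper; the applicable count already excludes non-applicable evals per the rule above):

```shell
# Sketch: integer pass rate over applicable evals only.
eval_pass_rate() {
  local passing="$1" applicable="$2"
  [ "$applicable" -eq 0 ] && { echo 0; return; }   # guard empty eval sets
  echo $(( passing * 100 / applicable ))
}
```

So a repo with no Dockerfile and 6 of 8 remaining evals passing scores 6/8 = 75%, not 7/9.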
At the End
- `--dry-run`: All proposed changes → `results/[target]-proposed/` directory
- `--live`: Changes already applied, backups in `results/backups/`
- report.sh `--final`
- Optionally: `results/[target]-project-details.md` with per-file findings (NOT in TSV!)
Directory Structure
autoforge/
├── SKILL.md ← This file
├── results/
│ ├── [target]-results.tsv ← TSV logs
│ ├── [target]-proposed.md ← Proposed improvement (prompt/audit)
│ ├── [target]-proposed/ ← Proposed repo changes (project mode)
│ │ ├── README.md
│ │ ├── Dockerfile
│ │ └── ...
│ ├── [target]-v1.md ← Deep audit final version
│ ├── [target]-audit-details.md ← Audit details (audit mode only)
│ ├── [target]-project-details.md ← Project details (project mode only)
│ └── backups/ ← Auto-backups (--live)
│ ├── [file].bak ← Single file backups (prompt/code/audit)
│ └── [target]-backup/ ← Full directory backup (project mode)
├── scripts/
│ ├── report.sh ← Channel reporting
│ └── visualize.py ← PNG chart (optional)
├── references/
│ ├── eval-examples.md ← Pre-built evals
│ └── ml-mode.md ← ML training guide
└── examples/
├── demo-results.tsv ← Demo data
└── example-config.json ← Example configuration
Examples (task descriptions, NOT CLI commands)
AutoForge is not a CLI tool — it's a skill prompt for the agent:
# Optimize a prompt
"Start autoforge mode: prompt for the coding-agent skill.
Evals: PTY correct? Workspace protected? Clearly structured?"
# Audit a CLI skill (simple)
"Start autoforge mode: audit for notebooklm-py."
# Deep audit with multi-model
"Start autoforge mode: audit (deep) for subagents docs.
Optimizer: Opus, Validator: GPT-5
Extract ground truth from source, validate iteratively."
# Optimize code
"Start autoforge mode: code for backup.sh.
File: ./backup.sh
Test: bash backup.sh personal --dry-run
Evals: exit_code==0, backup file created, < 10s runtime"
# Optimize a whole repository
"Start autoforge mode: project for ./my-app
Evals: Tests green? CI correct? No hardcoded secrets? README accurate?"
# Project mode with custom focus
"Start autoforge mode: project for /path/to/api-server
Focus: Docker + CI pipeline consistency
Evals: docker build succeeds, CI workflow references correct paths,
.env.example covers all env vars used in code"
# Project mode dry-run (default)
"Start autoforge mode: project for ./my-tool --dry-run
Use default evals. Show me what needs fixing."
Eval Examples → Mode Mapping
references/eval-examples.md provides ready-to-use Yes/No evals grouped by category. Here's how they map to AutoForge modes:
| eval-examples.md Category | AutoForge Mode | Notes |
|---|---|---|
| Briefing, Email, Calendar, Summary, Proposal | `prompt` | Mental simulation with scenario evals |
| Python Script, Shell Script, API, Data Pipeline, Build | `code` | Real execution with measurable criteria |
| CI/CD, Docker, Helm, Kubernetes, Terraform | `code` or `project` | code for single files, project for cross-file |
| Code Review, API Documentation | `audit` | Verify docs match reality |
| Project / Repository, Cross-File Consistency, Security Baseline | `project` | Whole-repo scanning and cross-file checks |
Pick evals from the matching category and paste them into your task prompt as the eval set.
Tips
- Always start with `--dry-run`
- `prompt` = think, `code` = execute, `audit` = test CLI, `project` = optimize repo
- Simple Audit for clear CLI skills, Deep Audit for complex docs
- Project mode scans the whole repo — cross-file consistency is the killer feature
- Multi-Model for Deep Audits: different models cover different blind spots
- At >3 discards after all fixes: check for validator noise, declare convergence if justified
- TSV + report.sh are NOT optional — they are the user interface
- For ML training: see `references/ml-mode.md`
Bundled files
* List of files contained in the ZIP. Besides `SKILL.md` itself, it may include reference materials, samples, and scripts.
- 📄 SKILL.md (21,207 bytes)
- 📎 README.md (16,512 bytes)
- 📎 references/eval-examples.md (9,042 bytes)
- 📎 references/ml-mode.md (3,004 bytes)
- 📎 scripts/report.sh (6,152 bytes)
- 📎 scripts/visualize.py (4,825 bytes)