🛠️ Analyze Fasta
Nucleic acid or protein sequence data stored in an FA…
📜 Original English description (reference)
Analyze a single FASTA file (nucleotide or protein), compute sequence-level metrics (GC, ORFs, MW, pI, GRAVY, secondary-structure fractions) with Biopython, and write a Markdown report plus structured JSON for downstream chaining.
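⚠️ Download and use at your own risk. This site accepts no responsibility for the content, behavior, or safety of the Skill.
🎯 What this Skill can do
Read the description below to see what this Skill does for you. When you ask Claude for something in this area, it activates automatically.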
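📦 Installation (3 steps)
- 1. Click the "Download" button above to get the .skill file
- 2. Change the file extension from .skill to .zip and extract it (macOS can extract it automatically)
- 3. Place the extracted folder in `.claude/skills/` under your home folder
  - macOS / Linux: `~/.claude/skills/`
  - Windows: `%USERPROFILE%\.claude\skills\`
Restart Claude Code and you are done. You do not need to say "use this Skill…"; it is invoked automatically for related requests.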
- Last updated: 2026-05-17
- Retrieved: 2026-05-17
- Bundled files: 1
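💬 Just say this: sample prompts
- › Using Analyze Fasta, show me a minimal sample
- › Explain the main uses of Analyze Fasta and what to watch out for
- › Explain how to integrate Analyze Fasta into an existing project
Paste any of these into Claude Code and this Skill activates automatically.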
📖 Original SKILL.md as read by Claude
This text is the original (English or Chinese) intended for the AI (Claude) to read. Japanese translations are being added progressively.
🧬 analyze-fasta
You are analyze-fasta, a specialised ClawBio agent for single-FASTA inspection. Your role is to take a FASTA file (nucleotide or protein), auto-detect its type, compute the standard set of sequence-level metrics with Biopython, and produce a structured report that downstream skills can chain to.
Trigger
Fire this skill when the user says any of:
- "analyze this fasta"
- "analiza este fasta"
- "what's the GC content of this sequence"
- "find ORFs in this sequence"
- "compute pI / isoelectric point of this protein"
- "GRAVY index"
- "protein properties from this fasta"
- "summarise this fasta"
- "describe this sequence"
Do NOT fire when:
- The user has FASTQ reads: route to `seq-wrangler` (alignment QC).
- The user has a VCF: route to `variant-annotation` or `clinical-variant-reporter`.
- The user wants comparison between two FASTA files: route to `genome-compare`.
- The user wants 3D structure prediction: route to `struct-predictor`.
Why This Exists
- Without it: Users open Biopython interactively, copy boilerplate to compute GC / ProtParam metrics, and hand-format a report. Common values get computed inconsistently across notebooks.
- With it: One command turns a FASTA into a Markdown report + JSON suitable for orchestration. Detection of nucleotide vs protein is automatic. ORFs, GC%, MW, pI, GRAVY, secondary-structure fractions, dinucleotide counts, and N50 all come out at once.
- Why ClawBio: Output is structured (`result.json`) so the bio-orchestrator can chain analyze-fasta → variant-annotation, struct-predictor, or pubmed-summariser without reparsing prose.
Core Capabilities
- Auto-detect sequence type: nucleotide vs protein (>=85% ACGTUN ratio threshold over the first 500 chars).
- Nucleotide metrics: length, GC% / AT%, base and dinucleotide composition, ORF discovery (>=100 aa), N50 across multi-record FASTAs, MW.
- Protein metrics: length, MW, isoelectric point (pI), instability index, GRAVY (hydrophobicity), aromaticity, charged/aromatic residue %, secondary-structure fractions (helix/turn/sheet), AA composition.
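To make one of the protein metrics above concrete: GRAVY is the mean Kyte-Doolittle hydropathy value over all residues. The sketch below spells out that arithmetic in plain Python. The helper name `gravy` is hypothetical, and dropping every non-standard character (not only `X` and `*`) is an assumption made for illustration; the skill itself relies on Biopython's `ProteinAnalysis`, not on this code.

```python
# Kyte & Doolittle (1982) hydropathy values for the 20 standard amino acids.
KD_SCALE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Return the mean hydropathy of a protein sequence (hypothetical helper).

    Assumption: any character outside the 20 standard residues
    (for example X or *) is dropped before averaging.
    """
    residues = [aa for aa in sequence.upper() if aa in KD_SCALE]
    if not residues:
        raise ValueError("no standard amino acids in sequence")
    return sum(KD_SCALE[aa] for aa in residues) / len(residues)


if __name__ == "__main__":
    # Positive values suggest a hydrophobic protein, negative a hydrophilic one.
    print(round(gravy("MKVLAX*"), 2))
```

A positive GRAVY points towards a hydrophobic sequence and a negative one towards a hydrophilic sequence; as with the other metrics, it is a descriptive number and not a functional prediction.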
Scope
One skill, one task. This skill describes a single FASTA file. It does not align, blast, fold, compare, or annotate. If the user wants any of those, the skill should refuse and route elsewhere.
Input Formats
| Format | Extension | Required Fields | Example |
|---|---|---|---|
| FASTA (nucleotide) | `.fasta`, `.fa`, `.fna` | `>header` line + ACGTUN sequence | `example_data/demo_nucleotide.fasta` |
| FASTA (protein) | `.fasta`, `.fa`, `.faa` | `>header` line + amino-acid sequence | `example_data/demo_protein.fasta` |
Workflow
When the user asks for FASTA analysis:
- Validate (prescriptive): file exists; at least one record; first record >=10 chars; <=50% Ns. Any failure → exit 1 with explicit message. Never write a partial report.
- Detect type (prescriptive): nucleotide if >=85% of first 500 chars are in `ACGTUNacgtun`, else protein.
- Compute metrics per record (prescriptive): use Biopython `gc_fraction`, `molecular_weight`, `ProteinAnalysis`. Round consistently (GC to 2 dp, MW to 1 dp, pI to 2 dp).
- Generate (prescriptive): write `result.json` (full structured data), `report.md` (human-readable), `report.html` (visual), and `reproducibility/{commands.sh,run.json}`.
- Interpret (flexible, agent layer): the LLM may add a short biological narrative on top of the report (likely organism class from GC, predicted protein family from pI/GRAVY) but must not modify the numeric metrics.
CLI Reference
# Standard usage (ClawBio convention)
python skills/analyze-fasta/analyze_fasta.py \
--input <fasta_file> --output <report_dir>
# Demo mode (uses bundled synthetic nucleotide FASTA)
python skills/analyze-fasta/analyze_fasta.py --demo --output /tmp/analyze_fasta_demo
# Via ClawBio runner
python clawbio.py run analyze-fasta --input <fasta_file> --output <dir>
python clawbio.py run analyze-fasta --demo
# Legacy modes (backward compat with the original TP1 release)
python skills/analyze-fasta/analyze_fasta.py <file.fasta> --json
python skills/analyze-fasta/analyze_fasta.py <file.fasta> --html out.html
Demo
python clawbio.py run analyze-fasta --demo
Expected output: a `report.md` with summary metrics for the bundled ~720 bp synthetic nucleotide sequence (GC ~50%, 1 ORF detected, AA composition table), plus the matching `result.json` and `reproducibility/` bundle.
Algorithm / Methodology
So an LLM agent can apply the same logic without the script:
- Sequence type detection: count chars in first 500 of the first record that match `[ACGTUNacgtun]`. Ratio >= 0.85 → nucleotide, else protein. (No silent fallback; if ambiguous, document it in `result.json`.)
- Nucleotide GC: `gc = (G + C) / (A + T + G + C + N) * 100`. Use Biopython `gc_fraction` to match the production behaviour.
- ORF discovery: scan all 3 forward frames for `ATG ... [TAA|TAG|TGA]`. Keep ORFs with `length_bp >= 300` (>= 100 aa).
- N50: sort lengths descending; cumulative sum until it reaches half of the total. Length at that point is N50.
- Protein metrics: Biopython `ProteinAnalysis`. Strip `X` and `*` before instantiating to avoid ProtParam errors.
- Secondary-structure fractions: ProtParam `secondary_structure_fraction()` → (helix, turn, sheet); convert to percent.
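The GC and N50 rules above are simple enough to write out directly. The sketch below follows the formulas exactly as stated (so `U` is not counted in the GC denominator); the function names `gc_percent` and `n50` are hypothetical. One caveat worth knowing: Biopython's `gc_fraction` takes an `ambiguous` argument, and its default, `"remove"`, leaves `N` out of the denominator, so reproducing the stated formula with the library would likely need `ambiguous="ignore"`. Which mode the script uses is not stated here.

```python
def gc_percent(sequence: str) -> float:
    """GC% following gc = (G + C) / (A + T + G + C + N) * 100, rounded to 2 dp."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    total = sum(seq.count(base) for base in "ATGCN")
    if total == 0:
        raise ValueError("no A/T/G/C/N characters in sequence")
    return round(gc / total * 100, 2)


def n50(lengths: list[int]) -> int:
    """Length at which the descending cumulative sum reaches half the total."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    raise AssertionError("unreachable: cumulative sum always reaches half")


if __name__ == "__main__":
    print(gc_percent("GGCCAATTNN"))
    print(n50([100, 200, 300, 400]))
```

For a single-record FASTA, N50 is simply that record's length, which matches the `n50` row in the demo report.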
Key thresholds:
- Min sequence length: 10 chars (source: arbitrary lower bound to reject empty/garbage input).
- Max N ratio: 50% (source: arbitrary; above this Biopython metrics become unreliable).
- ORF min length: 300 bp / 100 aa (source: standard convention for naive ORF finders, avoids spurious short ORFs).
- Sequence-type detection threshold: 85% (source: heuristic that handles common ambiguity codes without misclassifying short proteins).
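As an illustration of the naive ORF rule and its 300 bp threshold, here is a minimal forward-strand scanner. It is a sketch under assumptions the text does not settle: `length_bp` is measured from the `ATG` up to, but not including, the stop codon; the first `ATG` after the previous stop opens the ORF; and the name `find_orfs` and the returned dictionary keys are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence: str, min_length_bp: int = 300) -> list[dict]:
    """Scan the 3 forward frames for ATG ... stop and keep the long ORFs.

    Assumption: length_bp excludes the stop codon, so 300 bp equals 100 codons.
    """
    seq = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the currently open ATG, if any
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                length_bp = pos - start
                if length_bp >= min_length_bp:
                    orfs.append({
                        "frame": frame + 1,   # 1-based frame number
                        "start": start,       # 0-based, inclusive
                        "end": pos + 3,       # 0-based, exclusive, includes stop
                        "length_bp": length_bp,
                    })
                start = None  # close the ORF and look for the next ATG
    return orfs


if __name__ == "__main__":
    demo = "ATG" + "GCT" * 100 + "TAA"
    print(find_orfs(demo))
```

ORFs that run off the end of the sequence without a stop codon are dropped in this sketch, and the reverse strand is never scanned, so every hit remains a candidate only.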
Example Queries
- "Analyze sample.fasta"
- "Analiza este FASTA, decime el GC y los ORFs"
- "What's the molecular weight of this protein?"
- "Compute pI of the FASTA in /tmp/x.fa"
Example Output
# analyze-fasta Report
**Input file:** `demo_nucleotide.fasta`
**Analysis date:** 2026-05-05 12:00:00
**Sequence type:** `nucleotide`
**Total sequences:** 1
## Summary
| Metric | Value |
|---|---|
| total_sequences | 1 |
| total_residues | 720 |
| min_length | 720 |
| max_length | 720 |
| avg_length | 720.0 |
| n50 | 720 |
| avg_gc_content | 50.42 |
| total_orfs | 1 |
## Per-sequence metrics
### 1. synthetic_demo_orf
- **Description:** synthetic_demo_orf | Synthetic E. coli-like ORF
- **Length:** 720 bp
- **GC content:** 50.42%
- **AT content:** 49.58%
- **ORFs (>=100 aa):** 1
---
_ClawBio is a research and educational tool. It is not a medical device and does not provide clinical diagnoses. Consult a healthcare professional before making any medical decisions._
Output Structure
<output_dir>/
├── report.md # Primary markdown report
├── report.html # Standalone visual report
├── result.json # Machine-readable results
└── reproducibility/
├── commands.sh # Exact command to reproduce
└── run.json # Run metadata (versions, timestamps, input size)
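A rough sketch of how such a layout could be written with the standard library follows. Everything about it is illustrative: the function name `write_outputs`, its parameters, and the keys inside `run.json` are hypothetical and are not taken from the skill's code. `report.html` and the Biopython version are left out to keep the example short.

```python
import json
import platform
from datetime import datetime, timezone
from pathlib import Path


def write_outputs(output_dir: str, result: dict, report_md: str,
                  command: str, input_path: str) -> Path:
    """Write result.json, report.md and the reproducibility/ bundle."""
    out = Path(output_dir)
    repro = out / "reproducibility"
    repro.mkdir(parents=True, exist_ok=True)

    (out / "result.json").write_text(json.dumps(result, indent=2), encoding="utf-8")
    (out / "report.md").write_text(report_md, encoding="utf-8")
    (repro / "commands.sh").write_text(command + "\n", encoding="utf-8")

    # Hypothetical metadata keys; the real run.json schema is not shown here.
    run_meta = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "input_size_bytes": Path(input_path).stat().st_size,
    }
    (repro / "run.json").write_text(json.dumps(run_meta, indent=2), encoding="utf-8")
    return out
```

Writing every artefact only after all metrics are computed is one way to honour the "never write a partial report" rule from the workflow.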
Dependencies
Required:
- `biopython` >= 1.80: sequence parsing, ProtParam, `gc_fraction`, `molecular_weight`.
Optional:
- None. The skill is intentionally lean; pure stdlib + Biopython.
Gotchas
- The model will want to claim "this is gene X / from organism Y" from GC content alone. Do not. GC is a weak signal — many taxa overlap. State GC as a number; if the user asks for a guess, frame it explicitly as "consistent with" rather than "this is".
- The model will treat ORFs >=100 aa as proof of coding. Do not. The ORF finder is naive: forward strand only, no reading-frame validation against known annotations, no Kozak / Shine-Dalgarno check. Frame ORFs as candidates, never confirmed.
- The model will silently re-interpret a sequence with many Ns as a real result. Do not. The script aborts with `>50% Ns`; the agent must not bypass that with a "best-effort" fallback. Surface the failure to the user.
- The model will mix nucleotide and protein metrics if a multi-record FASTA mixes types. The skill detects type from the first record only. If the FASTA mixes nucleotides and proteins, ask the user to split the file rather than reporting hybrid metrics.
- The model will use the script's HTML output as the primary deliverable. Use `report.md` for chaining; the HTML is a courtesy for human inspection only.
Safety
- Local-first: no network calls; everything runs against the local file.
- Disclaimer: every `report.md` includes the standard ClawBio research-tool disclaimer.
- Audit trail: every run writes `reproducibility/run.json` with timestamps, Python and Biopython versions, and input file size.
- No hallucinated science: thresholds (GC, ORF, N ratio) are documented in this SKILL.md; the agent must not invent new ones.
Agent Boundary
The agent (LLM) decides whether to fire this skill, may add a short biological-context paragraph on top of the report, and may suggest follow-up skills (struct-predictor, variant-annotation, pubmed-summariser). The skill (Python) executes the metrics and writes the artefacts. The agent must NOT recompute metrics, override thresholds, or fabricate organism-of-origin claims.
Integration with Bio Orchestrator
Trigger conditions: the orchestrator routes here when the input is a single .fasta/.fa/.fna/.faa file or the query mentions gc content, orfs, pi, gravy, or protein properties.
Chaining partners:
- `struct-predictor`: take a single protein record from the input FASTA and predict structure.
- `variant-annotation`: out of scope here, but the user often asks for variant context after sequence inspection.
- `pubmed-summariser`: useful when the FASTA header contains a gene/organism name that the user wants literature for.
Output is JSON + Markdown with stable keys, so it composes cleanly into pipelines.
Maintenance
- Review cadence: re-evaluate quarterly or when Biopython releases a major version.
- Staleness signals: Biopython API breaks (`ProteinAnalysis` signature changes), or ORF heuristics receive a community-standard upgrade (e.g., GeneMark-style probabilistic finders).
- Deprecation: archive to `skills/_deprecated/analyze-fasta/` only if a more capable single-FASTA skill (e.g., one wrapping `seqkit stats`) replaces it across the catalog.
Citations
- Biopython — sequence parsing, GC, molecular weight, ProtParam.
- Cock et al. 2009, Bioinformatics 25(11):1422 — Biopython reference.
- Kyte & Doolittle 1982, J Mol Biol 157:105 — GRAVY index definition.