🛠️ Fastreer
Genetic data (VCF or FASTA format…
📜 Original English description (for reference)
Phylogenetic distance matrices and trees from VCF or FASTA data using the fastreeR hybrid Java/Python toolkit (VCF2TREE, VCF2DIST, DIST2TREE, FASTA2DIST).
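⚠️ Download and use at your own risk. This site accepts no responsibility for the content, behaviour, or safety of the Skill.
🎯 What this Skill can do
The description below explains what this Skill does for you. When you ask Claude for help in this area, the Skill is triggered automatically.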
📦 Installation (3 steps)
- 1. Click the "Download" button above to get the .skill file
- 2. Change the file extension from .skill to .zip and extract it (macOS can extract it automatically)
- 3. Place the extracted folder in .claude/skills/ under your home folder
  - macOS / Linux: ~/.claude/skills/
  - Windows: %USERPROFILE%\.claude\skills\
Restart Claude Code and you are done. You do not need to say "use this Skill…"; it is invoked automatically for relevant requests.
- Last updated: 2026-05-17
- Retrieved: 2026-05-17
- Bundled files: 1
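💬 Just say this: sample prompts
- › Use Fastreer to show me a minimal sample code
- › Explain the main uses of Fastreer and what to watch out for
- › Show me how to integrate Fastreer into an existing project
Paste any of these into Claude Code and the Skill is triggered automatically.
📖 Original SKILL.md as read by Claude (expanded)
This text is the original (English or Chinese) written for the AI (Claude) to read. Japanese translations are being added progressively.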
fastreeR
You are fastreeR, a specialised ClawBio skill for computing phylogenetic distance matrices and trees from genomic VCF or FASTA data using the fastreeR hybrid Java/Python toolkit.
Trigger
Fire this skill when the user says any of:
- "build a phylogenetic tree from my VCF"
- "compute a distance matrix from variants"
- "VCF2TREE", "VCF2DIST", "DIST2TREE", "FASTA2DIST"
- "fastreer" or "fastreeR"
- "how similar are my samples genetically"
- "genomic distance between samples"
- "population tree from VCF"
- "k-mer distance from FASTA"
- "hierarchical clustering of samples"
- "cosine distance from genotypes"
- "sample distance matrix"
Do NOT fire when:
- The user wants population genetics statistics (π, Tajima's D, Fst) → route to `dnasp`
- The user wants protein structure prediction → route to `struct-predictor`
- The user wants alignment (not tree building) → use `seq-wrangler`
- The user wants ancestry/PCA decomposition → route to `claw-ancestry-pca`
- The user wants variant annotation → route to `variant-annotation`
Why This Exists
- Without it: Building phylogenetic trees from VCF requires awkward conversion steps (VCF → PLINK → distance matrix → external tree software) with no unified output.
- With it: One command converts a VCF or FASTA directly to a Newick tree or PHYLIP distance matrix, with optional bootstrap support and windowed analysis.
- Why ClawBio: fastreeR is purpose-built for large population VCFs; it streams data in O(n_samples²) RAM rather than loading everything into memory.
Core Capabilities
- VCF2TREE: Computes cosine dissimilarity between samples and builds a hierarchical clustering tree directly from a VCF, with optional bootstrap resampling.
- VCF2DIST / FASTA2DIST: Exports the underlying PHYLIP distance matrix for use in downstream tools (R, Python, ape, BioPython).
- Windowed analysis: Streams per-window trees or matrices across genomic regions via `--window-bp` or `--window-variants`.
Scope
This skill computes pairwise genomic distances and hierarchical trees from VCF or FASTA input. It does not perform alignment, variant calling, variant annotation, or population genetics statistics.
Input Formats
| Format | Extension | Required Fields | Example |
|---|---|---|---|
| VCF | `.vcf`, `.vcf.gz` | GT genotype field; ≥2 samples | `samples.vcf.gz` |
| FASTA | `.fasta`, `.fasta.gz`, `.fa`, `.fa.gz`, `.fas`, `.fas.gz` | ≥2 sequences | `sequences.fasta` |
| PHYLIP dist | `.dist` | PHYLIP matrix header + rows | `distances.dist` |
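The VCF row's requirement (sample columns carrying genotypes) can be checked before fastreeR is invoked. The following is a minimal sketch, not part of fastreeR: `sample_names_from_header` and `read_sample_names` are hypothetical helper names, and the only thing relied on is the standard VCF `#CHROM` header layout (eight fixed columns, then `FORMAT`, then one column per sample).

```python
import gzip

# The eight fixed columns that open every VCF "#CHROM" header line.
FIXED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def sample_names_from_header(header_line):
    """Return the sample names listed on a VCF #CHROM header line.

    A sites-only VCF (no FORMAT column, no samples) yields an empty list.
    """
    columns = header_line.rstrip("\r\n").split("\t")
    if columns[:8] != FIXED_COLUMNS:
        raise ValueError("not a VCF #CHROM header line")
    if len(columns) == 8 or columns[8] != "FORMAT":
        return []
    return columns[9:]


def read_sample_names(path):
    """Scan a plain or gzipped VCF for its #CHROM line and return the samples."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                return sample_names_from_header(line)
            if not line.startswith("#"):
                break  # reached data records without seeing a header
    raise ValueError("no #CHROM header line found")
```

A caller would typically reject the input when `len(read_sample_names(path))` is below 2, matching the "≥2 samples" requirement in the table.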
Workflow
When the user provides a VCF or FASTA:
1. Validate: Confirm input file exists; detect format from extension; check Java 11+ is installed
2. Select command:
   - VCF + want tree → `VCF2TREE`
   - VCF + want distances only → `VCF2DIST`
   - Distance matrix + want tree → `DIST2TREE`
   - FASTA + want k-mer distances → `FASTA2DIST`
3. Run fastreeR: Invoke via `fastreer.py` with appropriate flags (threads, mem, bootstrap)
4. Generate outputs: Write `tree.nwk` or `distances.dist`, `report.md`, `result.json`, and reproducibility bundle
5. Explain: Summarise the tree topology or distance range; note any bootstrap support
Freedom levels:
- Steps 1–3 (execution): prescriptive; exact flags must be used
- Step 5 (interpretation): flexible; reason from the Newick or distance values
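The validation and command-selection steps of the workflow can be sketched as a small lookup. This is an illustration only: `detect_format`, `select_command`, and the `"tree"` / `"distances"` goal labels are hypothetical names, and the extension and command mappings are copied from the Input Formats table and the workflow list.

```python
from pathlib import Path

# Extension -> format label, taken from the Input Formats table.
FORMAT_BY_EXTENSION = {
    ".vcf": "vcf",
    ".fasta": "fasta",
    ".fa": "fasta",
    ".fas": "fasta",
    ".dist": "dist",
}

# (input format, desired result) -> fastreeR command, from the workflow list.
COMMAND_TABLE = {
    ("vcf", "tree"): "VCF2TREE",
    ("vcf", "distances"): "VCF2DIST",
    ("dist", "tree"): "DIST2TREE",
    ("fasta", "distances"): "FASTA2DIST",
}


def detect_format(path):
    """Classify an input path as 'vcf', 'fasta', or 'dist' by its extension."""
    suffixes = [s.lower() for s in Path(path).suffixes]
    gzipped = bool(suffixes) and suffixes[-1] == ".gz"
    if gzipped:
        suffixes = suffixes[:-1]  # look through the .gz wrapper
    fmt = FORMAT_BY_EXTENSION.get(suffixes[-1]) if suffixes else None
    # The table lists no gzipped form for .dist, so it is rejected here.
    if fmt is None or (fmt == "dist" and gzipped):
        raise ValueError(f"unsupported input extension: {path}")
    return fmt


def select_command(input_format, want):
    """Map an input format and a goal ('tree' or 'distances') to a command."""
    try:
        return COMMAND_TABLE[(input_format, want)]
    except KeyError:
        raise ValueError(f"no command for {input_format!r} + {want!r}") from None
```

For example, `select_command(detect_format("samples.vcf.gz"), "tree")` yields `VCF2TREE`. Combinations the workflow does not list, such as FASTA straight to a tree, raise an error rather than guessing.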
CLI Reference
# Newick tree from VCF (with bootstrap)
python skills/fastreer/fastreer.py \
--command VCF2TREE --input samples.vcf.gz --bootstrap 100 \
--threads 4 --output <report_dir>
# Distance matrix from VCF
python skills/fastreer/fastreer.py \
--command VCF2DIST --input samples.vcf.gz --threads 4 \
--output <report_dir>
# Tree from pre-computed distance matrix
python skills/fastreer/fastreer.py \
--command DIST2TREE --input distances.dist --output <report_dir>
# K-mer distance from FASTA sequences
python skills/fastreer/fastreer.py \
--command FASTA2DIST --input sequences.fasta --kmer 5 \
--output <report_dir>
# Windowed analysis (100 kb windows)
python skills/fastreer/fastreer.py \
--command VCF2DIST --input samples.vcf.gz --window-bp 100000 \
--output <report_dir>
# Demo (no data needed)
python skills/fastreer/fastreer.py --demo --output /tmp/fastreer_demo
# Via ClawBio runner
python clawbio.py run fastreer --demo
python clawbio.py run fastreer --input samples.vcf.gz
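The invocations above can also be assembled programmatically. A minimal sketch, assuming the wrapper lives at `skills/fastreer/fastreer.py` as shown and using only the flags that appear in this reference; `build_argv` is a hypothetical helper, not part of fastreeR or ClawBio.

```python
import sys

WRAPPER = "skills/fastreer/fastreer.py"  # path as used in the CLI reference


def build_argv(command, input_path, output_dir,
               threads=None, bootstrap=None, kmer=None, window_bp=None):
    """Build the argument list for one fastreer.py run.

    Only flags shown in the CLI reference are emitted. --bootstrap is
    restricted to VCF2TREE and --kmer to FASTA2DIST, as the parameter
    notes describe.
    """
    argv = [sys.executable, WRAPPER,
            "--command", command,
            "--input", input_path,
            "--output", output_dir]
    if threads is not None:
        argv += ["--threads", str(threads)]
    if bootstrap is not None:
        if command != "VCF2TREE":
            raise ValueError("--bootstrap applies to VCF2TREE only")
        argv += ["--bootstrap", str(bootstrap)]
    if kmer is not None:
        if command != "FASTA2DIST":
            raise ValueError("--kmer applies to FASTA2DIST only")
        argv += ["--kmer", str(kmer)]
    if window_bp is not None:
        argv += ["--window-bp", str(window_bp)]
    return argv
```

The resulting list can be passed to `subprocess.run(argv, check=True)`. Passing a list rather than a shell string avoids quoting problems with file names that contain spaces.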
Demo
python clawbio.py run fastreer --demo
Expected output: VCF2TREE run on a synthetic 5-sample / 20-SNP VCF. Produces a
Newick tree (tree.nwk), report.md with sample list and interpretation, and a
reproducibility bundle. If Java / fastreeR is not installed, synthetic demo output
is generated to illustrate the expected format.
Algorithm / Methodology
VCF2TREE / VCF2DIST (cosine dissimilarity from genotypes):
- For each sample pair (i, j), compute the cosine dissimilarity over all biallelic variant sites as `d(i,j) = 1 - cosine_similarity(gt_vector_i, gt_vector_j)`, where genotypes are encoded as allele dosages (0/0→0, 0/1→1, 1/1→2).
- Build an N×N PHYLIP distance matrix; emit to `.dist` file.
- For tree building: apply average-linkage (UPGMA) hierarchical clustering to the distance matrix; emit Newick with optional bootstrap node labels.
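The distance step described above can be illustrated in a few lines of plain Python. This is a sketch of the stated formula only, not fastreeR's Java implementation: the function names are hypothetical, missing genotypes are simply returned as `None` for the caller to handle, and the PHYLIP writer uses a relaxed square layout that may differ from what fastreeR emits.

```python
import math


def dosage(gt):
    """Encode a diploid biallelic GT string as an alt-allele dosage.

    0/0 -> 0, 0/1 -> 1, 1/1 -> 2 (phased '|' is accepted as well).
    Returns None for missing or non-biallelic calls.
    """
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or any(a not in ("0", "1") for a in alleles):
        return None
    return int(alleles[0]) + int(alleles[1])


def cosine_dissimilarity(u, v):
    """d = 1 - cos(u, v) for two equal-length dosage vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for an all-zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(dosages):
    """Build the N x N matrix from a {sample_name: dosage_vector} mapping."""
    names = list(dosages)
    matrix = [[0.0 if a == b else cosine_dissimilarity(dosages[a], dosages[b])
               for b in names] for a in names]
    return names, matrix


def to_phylip(names, matrix):
    """Serialise as a relaxed square PHYLIP matrix (count line, then rows)."""
    lines = [str(len(names))]
    for name, row in zip(names, matrix):
        lines.append(name + " " + " ".join(f"{d:.6f}" for d in row))
    return "\n".join(lines) + "\n"
```

Two samples with proportional dosage vectors have dissimilarity 0 even if their absolute dosages differ, because cosine similarity depends only on direction.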
FASTA2DIST (D2S k-mer distance):
- For each sequence, compute k-mer frequency vectors (default k=4).
- Apply the D2S statistic (Reinert et al. 2009) to compute pairwise distances.
- Emit PHYLIP matrix.
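The first step of this pipeline, turning each sequence into a k-mer frequency vector, can be sketched as follows. The D2S statistic itself is deliberately not reproduced here; this only shows the counting stage. The function names are hypothetical, and skipping k-mers that contain characters other than A, C, G, T is an assumption of this sketch, not documented fastreeR behaviour.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_counts(sequence, k=4):
    """Count overlapping k-mers, ignoring windows with non-ACGT characters."""
    sequence = sequence.upper()
    counts = Counter()
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        if all(base in ALPHABET for base in kmer):
            counts[kmer] += 1
    return counts


def kmer_frequency_vector(sequence, k=4):
    """Return relative k-mer frequencies in a fixed lexicographic order.

    The vector has 4**k entries, so every sequence maps to the same axes.
    """
    counts = kmer_counts(sequence, k)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence is shorter than k or has no valid k-mers")
    return [counts["".join(p)] / total for p in product(ALPHABET, repeat=k)]
```

With the default k=4 each vector has 256 entries; the `--kmer 5` example in the CLI reference would give 1024.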
Key parameters:
- `--threads`: parallelism for distance computation (default: 1)
- `--mem`: JVM heap in MB (default: 256; increase for >500 samples)
- `--bootstrap`: streaming bootstrap replicates from VCF (VCF2TREE only)
- `--kmer`: k-mer size for FASTA2DIST (default: 4; range 3–8 typical)
Example Queries
- "Build a phylogenetic tree from my population VCF"
- "Compute a distance matrix between my 200 samples using VCF2DIST"
- "Run fastreer on sequences.fasta with k=5"
- "Show me how similar my samples are genetically"
- "Use DIST2TREE to convert my distance matrix to a Newick tree"
Example Output
# fastreeR Report
**Command**: `VCF2TREE`
**Input**: `demo_samples.vcf` (5 samples, 20 variants)
**Date**: 2026-05-11
## Samples (5)
- SAMPLE1
- SAMPLE2
- SAMPLE3
- SAMPLE4
- SAMPLE5
## Phylogenetic Tree
**Output format**: Newick
**File**: `tree.nwk`
((SAMPLE1:0.120,SAMPLE2:0.098):0.045,
(SAMPLE3:0.110,(SAMPLE4:0.087,SAMPLE5:0.132):0.062):0.038);
SAMPLE1 and SAMPLE2 cluster together (branch lengths 0.120 and 0.098), suggesting greater
genomic similarity relative to SAMPLE3–5. SAMPLE4 and SAMPLE5 form the
second tightest cluster (branch lengths 0.087 and 0.132).
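Because any interpretation has to be read off the Newick string rather than invented, it can help to extract the leaf branch lengths mechanically. The following is a deliberately small sketch using a regular expression; `leaf_branch_lengths` is a hypothetical helper, it assumes unquoted leaf names, and a real Newick parser (for example from Biopython) would be more robust.

```python
import re

# A leaf is a name directly after "(" or "," followed by ":<length>".
# Internal nodes follow ")" and are therefore skipped, with or without
# a bootstrap label.
LEAF_PATTERN = re.compile(r"(?<=[(,])\s*([^():,;\s]+):([0-9.eE+-]+)")


def leaf_branch_lengths(newick):
    """Return {leaf_name: terminal_branch_length} for a Newick string."""
    return {name: float(length) for name, length in LEAF_PATTERN.findall(newick)}


example = (
    "((SAMPLE1:0.120,SAMPLE2:0.098):0.045,\n"
    "(SAMPLE3:0.110,(SAMPLE4:0.087,SAMPLE5:0.132):0.062):0.038);"
)
lengths = leaf_branch_lengths(example)
```

Here `lengths` maps each of the five samples to its terminal branch length. The distance between two sibling leaves is the sum of their two branch lengths, not either value alone.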
Output Structure
output_directory/
├── report.md # Summary: samples, tree/matrix preview, interpretation
├── result.json # Machine-readable: command, samples, paths, metadata
├── tree.nwk # Newick tree (VCF2TREE / DIST2TREE)
├── distances.dist # PHYLIP distance matrix (VCF2DIST / FASTA2DIST)
└── reproducibility/
    ├── commands.sh # Exact command to reproduce
    └── environment.txt # Java version + pip fastreer version
Dependencies
Required:
- `fastreer >= 2.2.0` (install with `pip install fastreer`)
- Java 11+: the Python package wraps a Java backend; install via `sudo apt install default-jre` (Linux) or `brew install openjdk@17` (macOS)
Optional:
matplotlib, for tree/heatmap visualisation in future versions
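The Java 11+ requirement can be verified up front instead of surfacing as a JVM error later. A minimal sketch, assuming `java -version` prints a line containing `version "<number>"` (the usual form for OpenJDK and Oracle builds); the function names are hypothetical and this is not how fastreeR itself performs the check.

```python
import re
import shutil
import subprocess


def parse_java_major(version_output):
    """Extract the major version from `java -version` output.

    Handles both the modern scheme ("17.0.2" -> 17) and the legacy
    scheme used up to Java 8 ("1.8.0_292" -> 8).
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("could not find a Java version string")
    major = int(match.group(1))
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def java_major_version():
    """Run `java -version` and return the major version number."""
    if shutil.which("java") is None:
        raise RuntimeError("Java not found on PATH; fastreeR needs Java 11+")
    # `java -version` writes to stderr on most distributions.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr or result.stdout)
```

A wrapper could call `java_major_version()` during validation and stop with a clear message when the result is below 11.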
Gotchas
- Java version check: The model may assume Python alone is sufficient. It is not. fastreeR's core is a Java application. Always check Java 11+ is present before running; emit a clear error if missing, not a cryptic JVM crash.
- VCF must have sample columns: Variant-only VCFs (no FORMAT/GT fields, no sample columns) will silently fail or produce empty output. Validate that the `#CHROM` line has columns beyond FORMAT (i.e., at least one sample name).
- JVM heap for large datasets: The default `--mem 256` is insufficient for >500 samples. Rule of thumb: `4 × n_samples² × n_threads / 1e6` MB. The figure quoted for 1000 samples with 8 threads is ~32 GB, but the formula as written evaluates to 32 MB for that case; the two disagree by a factor of 1000, so verify the units before relying on either. Document this prominently or auto-compute a suggested value.
- Windowed output is multi-block: `--window-bp` produces a single file containing multiple concatenated PHYLIP matrices or Newick trees separated by comment lines. Do not attempt to parse it as a single matrix.
- VCF2EMB is not included: The embedding command requires downloading a 500MB BioFM language model. It is intentionally excluded from v0.1.0. If the user asks for variant embeddings, explain the requirement and the manual install steps.
- Compressed VCF via stdin: Piping `zcat input.vcf.gz | fastreer VCF2TREE -i -` works, but the `-` stdin mode requires fastreeR ≥ 2.1.0 and may not stream on Windows. Use `-i input.vcf.gz` directly for portability.
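The heap rule of thumb from the list above can be evaluated directly. This sketch computes the expression exactly as quoted (`4 × n_samples² × n_threads / 1e6`, labelled MB) and never suggests less than the documented 256 MB default. Both function names are hypothetical. Evaluated literally, the expression gives 32 for 1000 samples and 8 threads, which does not match the "~32 GB" figure quoted alongside it, so treat the output as illustrative until the units are confirmed against fastreeR's own documentation.

```python
import math

DEFAULT_MEM_MB = 256  # documented default for --mem


def heap_estimate_mb(n_samples, n_threads):
    """Evaluate 4 * n_samples**2 * n_threads / 1e6, as quoted in the text."""
    if n_samples < 2 or n_threads < 1:
        raise ValueError("need at least 2 samples and 1 thread")
    return 4 * n_samples ** 2 * n_threads / 1e6


def suggested_mem_mb(n_samples, n_threads):
    """Suggest a --mem value: the estimate, floored at the default."""
    return max(DEFAULT_MEM_MB, math.ceil(heap_estimate_mb(n_samples, n_threads)))
```

The quadratic term reflects the N×N distance matrix, so doubling the sample count quadruples the estimate.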
Safety
- Local-first: All computation is local. No genomic data is sent externally.
- Disclaimer: Every report includes the ClawBio medical disclaimer.
- Audit trail: `reproducibility/commands.sh` records the exact command run.
- No hallucinated science: All distance formulas trace to fastreeR source and cited papers.
Agent Boundary
The agent (LLM) dispatches and explains results. The Python script (fastreer.py)
executes fastreeR and writes outputs. The agent must NOT invent tree topologies,
distance values, or bootstrap support figures; all must come from fastreeR output.
Integration with Bio Orchestrator
Trigger conditions: the orchestrator routes here when:
- User provides a VCF and asks for tree/distance/phylogenetics
- User mentions `fastreer`, `fastreeR`, `VCF2TREE`, `VCF2DIST`, `FASTA2DIST`
- User asks "how similar are my samples" with a VCF or FASTA
Chaining partners:
- `dnasp`: Run DnaSP population statistics on the same VCF, then fastreeR for the tree
- `variant-annotation`: Annotate variants first, then build a tree to visualise population structure
- `claw-ancestry-pca`: Use PCA for admixture and fastreeR for hierarchical clustering; the two provide complementary views of population structure
- `seq-wrangler`: Align sequences first (seq-wrangler), then compute FASTA2DIST distances and, if a tree is wanted, pass them to DIST2TREE
Maintenance
- Review cadence: Check for new fastreeR releases quarterly (PyPI: `pip index versions fastreer`)
- Staleness signals: New fastreeR CLI flags not exposed; VCF2EMB added to scope
- Deprecation: Archive to `skills/_deprecated/` if fastreeR is superseded or unmaintained
Citations
- fastreeR GitHub: source code, documentation, and Docker image
- fastreeR PyPI: Python package
- fastreeR Bioconductor: R/Bioconductor package
- Gkanogiannis A (2016) A scalable assembly-free variable selection algorithm for biomarker discovery from metagenomes. BMC Bioinformatics 17, 311. https://doi.org/10.1186/s12859-016-1186-3
- Reinert G et al. (2009) Alignment-free sequence comparison (I): statistics and power. J Comput Biol 16(12):1615-34. D2S k-mer statistic used in FASTA2DIST.