busco-phylogeny
Generate phylogenies from genome assemblies using BUSCO/compleasm-based single-copy orthologs with scheduler-aware workflow generation
Copy the command below and paste it into a terminal (Mac/Linux) or PowerShell (Windows). It downloads, extracts, and installs the skill automatically.
mkdir -p ~/.claude/skills && cd ~/.claude/skills && curl -L -o busco-phylogeny.zip https://jpskill.com/download/17649.zip && unzip -o busco-phylogeny.zip && rm busco-phylogeny.zip
$d = "$env:USERPROFILE\.claude\skills"; ni -Force -ItemType Directory $d | Out-Null; iwr https://jpskill.com/download/17649.zip -OutFile "$d\busco-phylogeny.zip"; Expand-Archive "$d\busco-phylogeny.zip" -DestinationPath $d -Force; ri "$d\busco-phylogeny.zip"
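When it finishes, restart Claude Code. The skill then activates automatically whenever you make a related request, for example asking for a phylogeny built from genome assemblies.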
- Last updated: 2026-05-18
- Retrieved: 2026-05-18
- Bundled files: 15
BUSCO-based Phylogenomics Workflow Generator
This skill provides phylogenomics expertise for generating comprehensive, scheduler-aware workflows for phylogenetic inference from genome assemblies using single-copy orthologs.
Purpose
This skill helps users generate phylogenies from genome assemblies by:
- Handling mixed input (local files and NCBI accessions)
- Creating scheduler-specific scripts (SLURM, PBS, cloud, local)
- Setting up complete workflows from raw genomes to final trees
- Providing quality control and recommendations
- Supporting flexible software management (bioconda, Docker, custom)
Available Resources
The skill provides access to these bundled resources:
Scripts (scripts/)
- query_ncbi_assemblies.py - Query NCBI for available genome assemblies by taxon name (new!)
- download_ncbi_genomes.py - Download genomes from NCBI using BioProjects or Assembly accessions
- rename_genomes.py - Rename genome files with meaningful sample names (important!)
- generate_qc_report.sh - Generate quality control reports from compleasm results
- extract_orthologs.sh - Extract and reorganize single-copy orthologs
- run_aliscore.sh - Wrapper for Aliscore to identify randomly similar sequences (RSS)
- run_alicut.sh - Wrapper for ALICUT to remove RSS positions from alignments
- run_aliscore_alicut_batch.sh - Batch process all alignments through Aliscore + ALICUT
- convert_fasconcat_to_partition.py - Convert FASconCAT output to IQ-TREE partition format
- predownloaded_aliscore_alicut/ - Pre-tested Aliscore and ALICUT Perl scripts
Templates (templates/)
- slurm/ - SLURM job scheduler templates
- pbs/ - PBS/Torque job scheduler templates
- local/ - Local machine templates (with GNU parallel)
- README.md - Complete template documentation
References (references/)
REFERENCE.md - Detailed technical reference including:
- Sample naming best practices
- BUSCO lineage datasets (complete list)
- Resource recommendations (memory, CPUs, walltime)
- Detailed step-by-step implementation guides
- Quality control guidelines
- Aliscore/ALICUT detailed guide
- Tool citations and download links
- Software installation guide
- Common issues and troubleshooting
Workflow Overview
The complete phylogenomics pipeline follows this sequence:
Input Preparation → Ortholog Identification → Quality Control → Ortholog Extraction → Alignment → Trimming → Concatenation → Phylogenetic Inference
Initial User Questions
When a user requests phylogeny generation, gather the following information systematically:
Step 1: Detect Computing Environment
Before asking questions, attempt to detect the local computing environment:
# Check for job schedulers
command -v sbatch >/dev/null 2>&1 # SLURM
command -v qsub >/dev/null 2>&1 # PBS/Torque
command -v parallel >/dev/null 2>&1 # GNU parallel
Report findings to the user, then confirm: "I detected [X] on this machine. Will you be running the scripts here or on a different system?"
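The same detection can be sketched in Python, the language of the bundled helper scripts. This is a minimal illustration rather than part of the skill: `detect_environment` and `summarize` are hypothetical names, and the only library call involved is the standard `shutil.which`.

```python
import shutil
from typing import Callable, Dict, Optional

# Commands checked in Step 1 and the environment each one suggests.
SCHEDULER_COMMANDS = {
    "sbatch": "SLURM",
    "qsub": "PBS/Torque",
    "parallel": "GNU parallel",
}


def detect_environment(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Dict[str, bool]:
    """Map each environment name to True if its command is found on PATH."""
    return {name: which(cmd) is not None for cmd, name in SCHEDULER_COMMANDS.items()}


def summarize(found: Dict[str, bool]) -> str:
    """Turn the detection result into the sentence reported to the user."""
    detected = [name for name, present in found.items() if present]
    if not detected:
        return "No job scheduler or GNU parallel detected on this machine."
    return "I detected " + ", ".join(detected) + " on this machine."
```

Calling `summarize(detect_environment())` yields the sentence to report. The `which` argument exists only so that the lookup can be replaced in tests.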
Required Information
Ask these questions to gather essential workflow parameters:
1. Computing Environment
   - Where will these scripts run? (SLURM cluster, PBS/Torque cluster, cloud computing, local machine)
2. Input Data
   - Local genome files, NCBI accessions, or both?
   - If NCBI: Do you already have Assembly accessions (GCA_*/GCF_*) or BioProject accessions (PRJNA/PRJEB/PRJDA)?
   - If the user doesn't have accessions: Offer to help find assemblies using query_ncbi_assemblies.py (see "STEP 0A: Query NCBI for Assemblies" below)
   - If local files: What are the file paths?
3. Taxonomic Scope & Dataset Details
   - What taxonomic group? (determines BUSCO lineage dataset)
   - How many taxa/genomes will be analyzed?
   - What is the approximate phylogenetic breadth? (species-level, genus-level, family-level, order-level, etc.)
   - See references/REFERENCE.md for the complete lineage list
4. Environment Management
   - Use unified conda environment (default, recommended), or separate environments per tool?
5. Resource Constraints
   - How many CPU cores/threads to use in total? (Ask user to specify, do not auto-detect)
   - Available memory (RAM) per node/machine?
   - Maximum walltime for jobs?
   - See references/REFERENCE.md for resource recommendations
6. Parallelization Strategy
   Ask the user how they want to handle parallel processing:
   - For job schedulers (SLURM/PBS):
     - Use array jobs for parallel steps? (Recommended: Yes)
     - Which steps to parallelize? (Steps 2, 5, 6, 8C recommended)
   - For local machines:
     - Use GNU parallel for parallel steps? (requires parallel installed)
     - How many concurrent jobs?
   - For all systems:
     - Optimize for maximum throughput or simplicity?
7. Scheduler-Specific Configuration (if using SLURM or PBS)
   - Account/Username for compute time charges
   - Partition/Queue to submit jobs to
   - Email notifications? (address and when: START, END, FAIL, ALL)
   - Job dependencies? (Recommended: Yes for linear workflow)
   - Output log directory? (Default: logs/)
8. Alignment Trimming Preference
   - Aliscore/ALICUT (traditional, thorough), trimAl (fast), BMGE (entropy-based), or ClipKit (modern)?
9. Substitution Model Selection (for IQ-TREE phylogenetic inference)
   Context needed: Taxonomic breadth, number of taxa, evolutionary rates
   Action: Fetch IQ-TREE model documentation and suggest appropriate amino acid substitution models based on dataset characteristics.
   Use the substitution model recommendation system (see "Substitution Model Recommendation" section below).
10. Educational Goals
   - Are you learning bioinformatics and would you like comprehensive explanations of each workflow step?
   - If yes: After completing each major workflow stage, offer to explain what the step accomplishes, why certain choices were made, and what best practices are being followed.
   - Store this preference to use throughout the workflow.
Recommended Directory Structure
Organize analyses with dedicated folders for each pipeline step:
project_name/
├── logs/ # All log files
├── 00_genomes/ # Input genome assemblies
├── 01_busco_results/ # BUSCO/compleasm outputs
├── 02_qc/ # Quality control reports
├── 03_extracted_orthologs/ # Extracted single-copy orthologs
├── 04_alignments/ # Multiple sequence alignments
├── 05_trimmed/ # Trimmed alignments
├── 06_concatenation/ # Supermatrix and partition files
├── 07_partition_search/ # Partition model selection
├── 08_concatenated_tree/ # Concatenated ML tree
├── 09_gene_trees/ # Individual gene trees
├── 10_species_tree/ # ASTRAL species tree
└── scripts/ # All analysis scripts
Benefits: Easy debugging, clear workflow progression, reproducibility, prevents root directory clutter.
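As a small illustration, the layout above could be created with a few lines of Python. This is a sketch under the assumption that the folder names are used exactly as listed; `create_project_layout` is a hypothetical helper, not one of the bundled scripts.

```python
from pathlib import Path

# Folder names copied from the recommended directory structure.
PIPELINE_DIRS = [
    "logs", "00_genomes", "01_busco_results", "02_qc",
    "03_extracted_orthologs", "04_alignments", "05_trimmed",
    "06_concatenation", "07_partition_search", "08_concatenated_tree",
    "09_gene_trees", "10_species_tree", "scripts",
]


def create_project_layout(project_root):
    """Create every pipeline folder under project_root; existing folders are left alone."""
    root = Path(project_root)
    created = []
    for name in PIPELINE_DIRS:
        path = root / name
        path.mkdir(parents=True, exist_ok=True)  # safe to re-run
        created.append(path)
    return created
```

Because `exist_ok=True` is set, re-running the helper on an existing project does not raise or overwrite anything.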
Template System
This skill uses a template-based system to reduce token usage and improve maintainability. Script templates are stored in the templates/ directory and organized by computing environment.
How to Use Templates
When generating scripts for users:
1. Read the appropriate template for their computing environment: Read("templates/slurm/02_compleasm_first.job")
2. Replace placeholders with user-specific values:
   - TOTAL_THREADS → e.g., 64
   - THREADS_PER_JOB → e.g., 16
   - NUM_GENOMES → e.g., 20
   - NUM_LOCI → e.g., 2795
   - LINEAGE → e.g., insecta_odb10
   - MODEL_SET → e.g., LG,WAG,JTT,Q.pfam
3. Present the customized script to the user with setup instructions
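The placeholder substitution described above can be sketched as follows. This is an assumption-laden illustration: it treats placeholders as bare uppercase tokens, as listed, and `fill_template` is a hypothetical helper rather than code shipped with the skill. It refuses to return a script that still contains an unfilled placeholder.

```python
import re

# Placeholder tokens named in the template documentation above.
PLACEHOLDERS = (
    "TOTAL_THREADS", "THREADS_PER_JOB", "NUM_GENOMES",
    "NUM_LOCI", "LINEAGE", "MODEL_SET",
)
# \b keeps LINEAGE from matching inside a longer name such as LINEAGE_NAME.
_PATTERN = re.compile(r"\b(" + "|".join(PLACEHOLDERS) + r")\b")


def fill_template(template_text, values):
    """Replace every placeholder in one pass; raise if any placeholder has no value."""
    used = {match.group(1) for match in _PATTERN.finditer(template_text)}
    missing = sorted(used - set(values))
    if missing:
        raise KeyError("No value supplied for: " + ", ".join(missing))
    return _PATTERN.sub(lambda match: str(values[match.group(1)]), template_text)
```

A single-pass substitution is used so that a value which happens to contain another placeholder name is not substituted a second time.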
Available Templates
Key templates by workflow step:
- Step 0 (setup): Environment setup script in references/REFERENCE.md
- Step 2 (compleasm): 02_compleasm_first, 02_compleasm_parallel
- Step 8A (partition search): 08a_partition_search
- Step 8C (gene trees): 08c_gene_trees_array, 08c_gene_trees_parallel, 08c_gene_trees_serial
See templates/README.md for complete template documentation.
Substitution Model Recommendation
When asked about substitution model selection (Question 9), use this systematic approach:
Step 1: Fetch IQ-TREE Documentation
Use WebFetch to retrieve current model information:
WebFetch(url="https://iqtree.github.io/doc/Substitution-Models",
prompt="Extract all amino acid substitution models with descriptions and usage guidelines")
Step 2: Analyze Dataset Characteristics
Consider these factors from user responses:
- Taxonomic Scope: Species/genus (shallow) vs. family/order (moderate) vs. class/phylum+ (deep)
- Number of Taxa: <20 (small), 20-50 (medium), >50 (large)
- Evolutionary Rates: Fast-evolving, moderate, or slow-evolving
- Sequence Type: Nuclear proteins, mitochondrial, or chloroplast
Step 3: Recommend Models
Provide 3-5 appropriate models based on dataset characteristics. For detailed model recommendation matrices and taxonomically-targeted models, see references/REFERENCE.md section "Substitution Model Recommendation".
General recommendations:
- Nuclear proteins (most common): LG, WAG, JTT, Q.pfam
- Mitochondrial: mtREV, mtZOA, mtMAM, mtART, mtVer, mtInv
- Chloroplast: cpREV
- Taxonomically-targeted: Q.bird, Q.mammal, Q.insect, Q.plant, Q.yeast (when applicable)
Step 4: Present Recommendations
Format recommendations with justifications and explain how models will be used in IQ-TREE steps 8A and 8C.
Step 5: Store Model Set
Store the final comma-separated model list (e.g., "LG,WAG,JTT,Q.pfam") for use in Step 8 template placeholders.
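A minimal sketch of how the stored model set might be assembled from the general recommendations in Step 3. The dictionary keys (`nuclear`, `insect`, and so on) and the function name `build_model_set` are hypothetical. Narrowing a long list down to 3-5 models still requires the judgment described above and is not automated here.

```python
# General recommendations by sequence type, copied from Step 3.
GENERAL_MODELS = {
    "nuclear": ["LG", "WAG", "JTT", "Q.pfam"],
    "mitochondrial": ["mtREV", "mtZOA", "mtMAM", "mtART", "mtVer", "mtInv"],
    "chloroplast": ["cpREV"],
}
# Taxonomically-targeted models, added only when they apply to the dataset.
TARGETED_MODELS = {
    "bird": "Q.bird",
    "mammal": "Q.mammal",
    "insect": "Q.insect",
    "plant": "Q.plant",
    "yeast": "Q.yeast",
}


def build_model_set(sequence_type, targeted_group=None):
    """Return the comma-separated list used for the MODEL_SET placeholder."""
    models = list(GENERAL_MODELS[sequence_type])
    if targeted_group is not None:
        models.append(TARGETED_MODELS[targeted_group])
    return ",".join(models)
```

For an insect dataset of nuclear proteins, `build_model_set("nuclear", "insect")` gives `LG,WAG,JTT,Q.pfam,Q.insect`.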
Workflow Implementation
Once required information is gathered, guide the user through these steps. For each step, use templates where available and refer to references/REFERENCE.md for detailed implementation.
STEP 0: Environment Setup
ALWAYS start by generating a setup script for the user's environment.
Use the unified conda environment setup script from references/REFERENCE.md (Section: "Software Installation Guide"). This creates a single conda environment with all necessary tools:
- compleasm, MAFFT, trimming tools (trimAl, ClipKit, BMGE)
- IQ-TREE, ASTRAL, Perl with BioPerl, GNU parallel
- Downloads and installs Aliscore/ALICUT Perl scripts
Key points:
- Users choose between mamba (faster) or conda
- Users choose between predownloaded Aliscore/ALICUT scripts (tested) or latest from GitHub
- All subsequent steps use conda activate phylo (the unified environment)
See references/REFERENCE.md for the complete setup script template.
STEP 0A: Query NCBI for Assemblies (Optional)
Use this step when: User wants to use NCBI data but doesn't have specific assembly accessions yet.
This optional preliminary step helps users discover available genome assemblies by taxon name before proceeding with the main workflow.
When to Offer This Step
Offer this step when:
- User wants to analyze genomes from NCBI
- User doesn't have specific Assembly or BioProject accessions
- User mentions a taxonomic group (e.g., "I want to build a phylogeny for beetles")
Workflow
1. Ask for focal taxon: Request the taxonomic group of interest
   - Examples: "Coleoptera", "Drosophila", "Apis mellifera"
   - Can be at any taxonomic level (order, family, genus, species)
2. Query NCBI using the script: Use scripts/query_ncbi_assemblies.py to search for assemblies
   # Basic query (returns 20 results by default)
   python scripts/query_ncbi_assemblies.py --taxon "Coleoptera"
   # Query with more results
   python scripts/query_ncbi_assemblies.py --taxon "Drosophila" --max-results 50
   # Query for RefSeq assemblies only (higher quality, GCF_* accessions)
   python scripts/query_ncbi_assemblies.py --taxon "Apis" --refseq-only
   # Save accessions to file for later download
   python scripts/query_ncbi_assemblies.py --taxon "Coleoptera" --save assembly_accessions.txt
3. Present results to user: The script displays:
   - Assembly accession (GCA_* or GCF_*)
   - Organism name
   - Assembly level (Chromosome, Scaffold, Contig)
   - Assembly name
4. Help user select assemblies: Ask user which assemblies they want to include
   - Consider assembly level (Chromosome > Scaffold > Contig)
   - Consider phylogenetic breadth (species coverage)
   - Consider data quality (RefSeq > GenBank when available)
5. Collect selected accessions: Compile the list of chosen assembly accessions
6. Proceed to STEP 1: Use the selected accessions with download_ncbi_genomes.py
Tips for Assembly Selection
- Assembly Level: Chromosome-level assemblies are most complete, followed by Scaffold, then Contig
- RefSeq vs GenBank: RefSeq (GCF_*) assemblies undergo additional curation; GenBank (GCA_*) are submitter-provided
- Taxonomic Sampling: For phylogenetics, aim for representative sampling across the taxonomic group
- Quality over Quantity: Better to have 20 high-quality assemblies than 100 poor-quality ones
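The first two tips can be expressed as a sort order. The sketch below assumes each candidate is a dictionary with `accession` and `level` keys; that record shape is hypothetical and is not the actual output format of query_ncbi_assemblies.py. NCBI also reports a "Complete Genome" level, which is ranked alongside Chromosome here.

```python
# Lower rank sorts first: Chromosome > Scaffold > Contig.
LEVEL_RANK = {"complete genome": 0, "chromosome": 0, "scaffold": 1, "contig": 2}


def rank_assemblies(assemblies):
    """Sort by assembly level, then prefer RefSeq (GCF_) over GenBank (GCA_)."""
    def sort_key(record):
        level = LEVEL_RANK.get(record["level"].lower(), len(LEVEL_RANK))
        genbank_last = 0 if record["accession"].startswith("GCF_") else 1
        return (level, genbank_last, record["accession"])
    return sorted(assemblies, key=sort_key)
```

Taxonomic sampling is deliberately not encoded: choosing representative taxa remains a decision for the user.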
STEP 1: Download NCBI Genomes (if applicable)
If user provided NCBI accessions, use scripts/download_ncbi_genomes.py:
For BioProjects:
python scripts/download_ncbi_genomes.py --bioprojects PRJNA12345 -o genomes.zip
unzip genomes.zip
For Assembly Accessions:
python scripts/download_ncbi_genomes.py --assemblies GCA_123456789.1 -o genomes.zip
unzip genomes.zip
IMPORTANT: After download, genomes must be renamed with meaningful sample names (format: [ACCESSION]_[SPECIES_NAME]). Sample names appear in final phylogenetic trees.
Generate a script that:
- Finds all downloaded FASTA files in ncbi_dataset directory structure
- Moves/renames files to main genomes directory with meaningful names
- Includes any local genome files
- Creates final genome_list.txt with ALL genomes (local + downloaded)
See references/REFERENCE.md section "Sample Naming Best Practices" for detailed guidelines.
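To make the [ACCESSION]_[SPECIES_NAME] format concrete, here is a minimal sketch. The sanitization rule (collapse anything other than letters and digits in the species name to a single underscore) is an assumption chosen to keep names safe in tree files; the bundled rename_genomes.py may apply different rules.

```python
import re


def sample_name(accession, species_name):
    """Build ACCESSION_SPECIES with the species reduced to letters, digits, and underscores."""
    species = re.sub(r"[^A-Za-z0-9]+", "_", species_name.strip()).strip("_")
    return f"{accession}_{species}"
```

For example, `sample_name("GCA_123456789.1", "Apis mellifera")` returns `GCA_123456789.1_Apis_mellifera`.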
STEP 2: Ortholog Identification with compleasm
Activate the unified environment and run compleasm on all genomes to identify single-copy orthologs.
Key considerations:
- First genome must run alone to download lineage database
- Remaining genomes can run in parallel
- Thread allocation: Miniprot scales well up to ~16-32 threads per genome
Threading guidelines: See references/REFERENCE.md for recommended thread allocation table.
Generate scripts using templates:
- SLURM: Read templates 02_compleasm_first.job and 02_compleasm_parallel.job
- PBS: Read templates 02_compleasm_first.job and 02_compleasm_parallel.job
- Local: Read templates 02_compleasm_first.sh and 02_compleasm_parallel.sh
Replace placeholders: TOTAL_THREADS, THREADS_PER_JOB, NUM_GENOMES, LINEAGE
For detailed implementation examples, see references/REFERENCE.md section "Ortholog Identification Implementation".
STEP 3: Quality Control
After compleasm completes, generate QC report using scripts/generate_qc_report.sh:
bash scripts/generate_qc_report.sh qc_report.csv
Provide interpretation:
- >95% complete: Excellent, retain
- 90-95% complete: Good, retain
- 85-90% complete: Acceptable, case-by-case
- 70-85% complete: Questionable, consider excluding
- <70% complete: Poor, recommend excluding
See references/REFERENCE.md section "Quality Control Guidelines" for detailed assessment criteria.
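The interpretation bands above can be written as a simple classifier. The listed ranges share their endpoints (90, 85, 70), so this sketch assumes a boundary value falls into the better category, except 95 itself, which is "Good" because "Excellent" is stated as strictly greater than 95. `classify_completeness` is a hypothetical name.

```python
def classify_completeness(percent_complete):
    """Map a completeness percentage to the QC interpretation used in Step 3."""
    if not 0 <= percent_complete <= 100:
        raise ValueError("completeness must be between 0 and 100")
    if percent_complete > 95:
        return "Excellent, retain"
    if percent_complete >= 90:
        return "Good, retain"
    if percent_complete >= 85:
        return "Acceptable, case-by-case"
    if percent_complete >= 70:
        return "Questionable, consider excluding"
    return "Poor, recommend excluding"
```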
STEP 4: Ortholog Extraction
Use scripts/extract_orthologs.sh to extract single-copy orthologs:
bash scripts/extract_orthologs.sh LINEAGE_NAME
This generates per-locus unaligned FASTA files in single_copy_orthologs/unaligned_aa/.
STEP 5: Alignment with MAFFT
Activate the unified environment (conda activate phylo) which contains MAFFT.
Create locus list, then generate alignment scripts:
cd single_copy_orthologs/unaligned_aa
ls *.fas > locus_names.txt
num_loci=$(wc -l < locus_names.txt)
Generate scheduler-specific scripts:
- SLURM/PBS: Array job with one task per locus
- Local: Sequential processing or GNU parallel
For detailed script templates, see references/REFERENCE.md section "Alignment Implementation".
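With one array task per locus, each task has to look up its own locus in locus_names.txt. A minimal sketch of that lookup, assuming 1-based task IDs (as when a SLURM array is declared as 1-N); `locus_for_task` is a hypothetical helper and not part of the bundled templates.

```python
def locus_for_task(locus_names, task_id):
    """Return the locus file handled by a 1-based array task ID."""
    if not 1 <= task_id <= len(locus_names):
        raise IndexError(f"task {task_id} is outside 1-{len(locus_names)}")
    return locus_names[task_id - 1]
```

In a SLURM job the task ID would typically come from the `SLURM_ARRAY_TASK_ID` environment variable, and `locus_names` from reading locus_names.txt line by line.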
STEP 6: Alignment Trimming
Based on user's preference, provide appropriate trimming method. All tools are available in the unified conda environment.
Options:
- trimAl: Fast (-automated1), recommended for large datasets
- ClipKit: Modern, fast (default smart-gap mode)
- BMGE: Entropy-based (-t AA)
- Aliscore/ALICUT: Traditional, thorough (recommended for phylogenomics)
For Aliscore/ALICUT:
- Perl scripts were installed in STEP 0
- Use scripts/run_aliscore_alicut_batch.sh for batch processing
- Or use array jobs with scripts/run_aliscore.sh and scripts/run_alicut.sh
- Always use -N flag for amino acid sequences
Generate scripts using scheduler-appropriate templates (array jobs for SLURM/PBS, parallel or serial for local).
For detailed implementation of each trimming method, see references/REFERENCE.md section "Alignment Trimming Implementation".
STEP 7: Concatenation and Partition Definition
Download FASconCAT-G (Perl script) and run concatenation:
conda activate phylo # Has Perl installed
wget https://raw.githubusercontent.com/PatrickKueck/FASconCAT-G/master/FASconCAT-G_v1.06.1.pl -O FASconCAT-G.pl
chmod +x FASconCAT-G.pl
cd trimmed_aa
perl ../FASconCAT-G.pl -s -i
Convert to IQ-TREE format using scripts/convert_fasconcat_to_partition.py:
python ../scripts/convert_fasconcat_to_partition.py FcC_info.xls partition_def.txt
Outputs: FcC_supermatrix.fas, FcC_info.xls, partition_def.txt
STEP 8: Phylogenetic Inference
IQ-TREE is already installed in the unified environment. Activate with conda activate phylo.
Part 8A: Partition Model Selection
Use the substitution models selected during initial setup (Question 9).
Generate script using templates:
- Read appropriate template: templates/[slurm|pbs|local]/08a_partition_search.[job|sh]
- Replace MODEL_SET placeholder with user's selected models (e.g., "LG,WAG,JTT,Q.pfam")
For detailed implementation, see references/REFERENCE.md section "Partition Model Selection Implementation".
Part 8B: Concatenated ML Tree
Run IQ-TREE using the best partition scheme from Part 8A:
iqtree -s FcC_supermatrix.fas -spp partition_search.best_scheme.nex \
-nt 18 -safe -pre concatenated_ML_tree -bb 1000 -bnni
Output: concatenated_ML_tree.treefile
Part 8C: Individual Gene Trees
Estimate gene trees for coalescent-based species tree inference.
Generate scripts using templates:
- SLURM/PBS: Read 08c_gene_trees_array.job template
- Local: Read 08c_gene_trees_parallel.sh or 08c_gene_trees_serial.sh template
- Replace NUM_LOCI placeholder
For detailed implementation, see references/REFERENCE.md section "Gene Trees Implementation".
Part 8D: ASTRAL Species Tree
ASTRAL is already installed in the unified conda environment.
conda activate phylo
# Concatenate all gene trees
cat trimmed_aa/*.treefile > all_gene_trees.tre
# Run ASTRAL
astral -i all_gene_trees.tre -o astral_species_tree.tre
Output: astral_species_tree.tre
STEP 9: Generate Methods Paragraph
ALWAYS generate a methods paragraph to help users write their publication methods section.
Create METHODS_PARAGRAPH.md file with:
- Customized text based on tools and parameters used
- Complete citations for all software
- Placeholders for user-specific values (genome count, loci count, thresholds)
- Instructions for adapting to journal requirements
For the complete methods paragraph template, see references/REFERENCE.md section "Methods Paragraph Template".
Pre-fill known values when possible:
- Number of genomes
- BUSCO lineage
- Trimming method used
- Substitution models tested
Final Outputs Summary
Provide users with a summary of outputs:
Phylogenetic Results:
1. concatenated_ML_tree.treefile - ML tree from concatenated supermatrix
2. astral_species_tree.tre - Coalescent species tree
3. *.treefile - Individual gene trees
Data and Quality Control:
4. qc_report.csv - Genome quality statistics
5. FcC_supermatrix.fas - Concatenated alignment
6. partition_search.best_scheme.nex - Selected partitioning scheme
Publication Materials:
7. METHODS_PARAGRAPH.md - Ready-to-use methods section with citations
Visualization tools: FigTree, iTOL, ggtree (R), ete3/toytree (Python)
Script Validation
ALWAYS perform validation checks after generating scripts but before presenting them to the user. This ensures script accuracy, consistency, and proper resource allocation.
Validation Workflow
For each generated script, perform these validation checks in order:
1. Program Option Verification
Purpose: Detect hallucinated or incorrect command-line options that may cause scripts to fail.
Procedure:
- Extract all command invocations from the generated script (e.g., compleasm run, iqtree -s, mafft --auto)
- Compare against reference sources:
  - First check: Compare against corresponding template in templates/ directory
  - Second check: Compare against examples in references/REFERENCE.md
  - Third check: If options differ significantly or are uncertain, perform web search for official documentation
- Common tools to validate:
  - compleasm run - Check -a, -o, -l, -t options
  - iqtree - Verify -s, -p, -m, -bb, -alrt, -nt, -safe options
  - mafft - Check --auto, --thread, --reorder options
  - astral - Verify -i, -o options
  - Trimming tools (trimal, clipkit, BMGE.jar) - Validate options
Action on issues:
- If incorrect options found: Inform user of the issue and ask if they want you to correct it
- If uncertain: Ask user to verify with tool documentation before proceeding
2. Pipeline Continuity Verification
Purpose: Ensure outputs from one step correctly feed into inputs of subsequent steps.
Procedure:
1. Map input/output relationships:
   - Step 2 output (01_busco_results/*_compleasm/) → Step 3 input (QC script)
   - Step 4 output (single_copy_orthologs/) → Step 5 input (MAFFT)
   - Step 5 output (04_alignments/*.fas) → Step 6 input (trimming)
   - Step 6 output (05_trimmed/*.fas) → Step 7 input (FASconCAT-G)
   - Step 7 output (FcC_supermatrix.fas, partition file) → Step 8A input (IQ-TREE)
   - Step 8C output (*.treefile) → Step 8D input (ASTRAL)
2. Check for consistency:
   - File path references match across scripts
   - Directory structure follows recommended layout
   - Glob patterns correctly match expected files
   - Required intermediate files are generated before being used
Action on issues:
- If path mismatches found: Inform user and ask if they want you to correct them
- If directory structure inconsistent: Suggest corrections aligned with recommended structure
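One way to test whether glob patterns match the expected files is to evaluate them against the project directory once earlier steps have produced output. The sketch below is a hypothetical helper (`unmatched_patterns`), not part of the skill, and it only applies where the files already exist on disk.

```python
from pathlib import Path


def unmatched_patterns(project_root, patterns):
    """Return the glob patterns (relative to project_root) that match no existing path."""
    root = Path(project_root)
    return [pattern for pattern in patterns if not any(root.glob(pattern))]
```

An empty return value means every pattern matched at least one path. Any pattern that is returned points to a likely path mismatch between two steps.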
3. Resource Compatibility Check
Purpose: Ensure allocated computational resources are appropriate for the task.
Procedure:
1. Verify resource allocations against recommendations in references/REFERENCE.md:
   - Memory allocation: Check if memory per CPU (typically 6GB for compleasm, 2-4GB for others) is adequate
   - Thread allocation: Verify thread counts are reasonable for the number of genomes/loci
   - Walltime: Ensure walltime is sufficient based on dataset size guidelines
   - Parallelization: Check that threads per job × concurrent jobs ≤ total threads
2. Common issues to check:
   - Compleasm: First job needs full thread allocation (downloads database)
   - IQ-TREE: -nt should match allocated CPUs
   - Gene trees: Ensure enough threads per tree × concurrent trees ≤ total available
   - Memory: Concatenated tree inference may need 8-16GB per CPU for large datasets
3. Validate against user-specified constraints:
   - Total CPUs specified by user
   - Available memory per node
   - Maximum walltime limits
   - Scheduler-specific limits (if mentioned)
Action on issues:
- If resource allocation issues found: Inform user and suggest corrections with justification
- If uncertain about adequacy: Ask user about typical job performance in their environment
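The parallelization rule (threads per job × concurrent jobs ≤ total threads) is easy to check mechanically. A minimal sketch with hypothetical function names; it covers only this one rule, not memory or walltime.

```python
def check_thread_budget(threads_per_job, concurrent_jobs, total_threads):
    """Return a list of problems; empty when the request fits within total_threads."""
    if min(threads_per_job, concurrent_jobs, total_threads) < 1:
        return ["Thread and job counts must all be at least 1."]
    requested = threads_per_job * concurrent_jobs
    if requested > total_threads:
        return [
            f"Requested {requested} threads ({threads_per_job} per job x "
            f"{concurrent_jobs} jobs) but only {total_threads} are available."
        ]
    return []


def max_concurrent_jobs(threads_per_job, total_threads):
    """Largest number of simultaneous jobs that stays within the thread budget."""
    return total_threads // threads_per_job
```

With 64 total threads and 16 threads per job, `max_concurrent_jobs(16, 64)` is 4, so a fifth concurrent job would be reported by `check_thread_budget`.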
Validation Reporting
After completing all validation checks:
1. If all checks pass: Inform user briefly: "Scripts validated successfully - options, pipeline flow, and resources verified."
2. If issues found: Present a structured report:
   **Validation Results**
   ⚠️ Issues found during validation:
   1. [Issue category]: [Description]
      - Current: [What was generated]
      - Suggested: [Recommended fix]
      - Reason: [Why this is an issue]
   Would you like me to apply these corrections?
3. Always ask before correcting: Never silently fix issues - always get user confirmation before applying changes.
4. Document corrections: If corrections are applied, explain what was changed and why.
Communication Guidelines
- Always start with STEP 0: Generate the unified environment setup script
- Always end with STEP 9: Generate the customized methods paragraph
- Always validate scripts: Perform validation checks before presenting scripts to users
- Use unified environment by default: All scripts should use conda activate phylo
- Always ask about CPU allocation: Never auto-detect cores, always ask user
- Recommend optimized workflows: For users with adequate resources, recommend optimized parallel approaches over simple serial approaches
- Be clear and pedagogical: Explain why each step is necessary
- Provide educational explanations when requested: If user answered yes to educational goals (question 10):
- After completing each major workflow stage, ask: "Would you like me to explain this step?"
- If yes, provide moderate-length explanation (1-2 paragraphs) covering:
- What the step accomplishes biologically and computationally
- Significant choices made and their rationale
- Best practices being followed in the workflow
- Examples of "major workflow stages": STEP 0 (setup), STEP 1 (download), STEP 2 (BUSCO), STEP 3 (QC), STEP 5 (alignment), STEP 6 (trimming), STEP 7 (concatenation), STEP 8 (phylogenetic inference)
- Provide complete, ready-to-run scripts: Users should copy-paste and run
- Adapt to user's environment: Always generate scheduler-specific scripts
- Reference supporting files: Direct users to references/REFERENCE.md for details
- Use helper scripts: Leverage provided scripts in scripts/ directory
- Include error checking: Add file existence checks and informative error messages
- Be encouraging: Phylogenomics is complex; maintain supportive tone
Important Notes
Mandatory Steps
- STEP 0 is mandatory: Always generate the environment setup script first
- STEP 9 is mandatory: Always generate the methods paragraph file at the end
Template Usage (IMPORTANT!)
- Prefer templates over inline code: Use templates/ directory for major scripts
- Template workflow:
  - Read: Read("templates/slurm/02_compleasm_first.job")
  - Replace placeholders: TOTAL_THREADS, LINEAGE, NUM_GENOMES, MODEL_SET, etc.
  - Present customized script to user
- Available templates: See templates/README.md for complete list
- Benefits: Reduces token usage, easier maintenance, consistent structure
Script Generation
- Always adapt scripts to user's scheduler (SLURM/PBS/local)
- Replace all placeholders before presenting scripts
- Never auto-detect CPU cores: Always ask user to specify
- Provide parallelization options: For each parallelizable step, offer array job, parallel, and serial options
- Scheduler-specific configuration: For SLURM/PBS, always ask about account, partition, email, etc.
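As an illustration of how the scheduler answers feed into a script, the sketch below builds a SLURM header from them. It is not the bundled template: `slurm_header` and its parameters are hypothetical, and only standard sbatch options are used. sbatch names the start-of-job notification BEGIN, so an answer of START is mapped to BEGIN here.

```python
def slurm_header(account, partition, cpus, mem_gb, walltime,
                 log_dir="logs", email=None, mail_type="END", array_size=None):
    """Build #SBATCH lines from the user's scheduler-specific answers."""
    # sbatch accepts BEGIN (not START) for start-of-job notifications.
    mail_type = {"START": "BEGIN"}.get(mail_type.upper(), mail_type.upper())
    lines = [
        "#!/bin/bash",
        f"#SBATCH --account={account}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --mem={mem_gb}G",
        f"#SBATCH --time={walltime}",
    ]
    if array_size is not None:
        # One task per genome or locus; %A is the array job ID, %a the task index.
        lines.append(f"#SBATCH --array=1-{array_size}")
        lines.append(f"#SBATCH --output={log_dir}/%x_%A_%a.out")
    else:
        lines.append(f"#SBATCH --output={log_dir}/%x_%j.out")
    if email:
        lines.append(f"#SBATCH --mail-user={email}")
        lines.append(f"#SBATCH --mail-type={mail_type}")
    return "\n".join(lines) + "\n"
```

A PBS version would need different directives and is not shown. In both cases the values come from the user's answers, never from auto-detection.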
Parallelization Strategy
- Ask about preferences: Let user choose between throughput optimization vs. simplicity
- Compleasm optimization: For ≥2 genomes and ≥16 cores, recommend two-phase approach
- Use threading guidelines: Refer to references/REFERENCE.md for thread allocation recommendations
- Parallelizable steps: Steps 2 (compleasm), 5 (MAFFT), 6 (trimming), 8C (gene trees)
Substitution Model Selection
- Always recommend models: Use the systematic model recommendation process
- Fetch current documentation: Use WebFetch to get IQ-TREE model information
- Replace MODEL_SET placeholder: In Step 8A templates with comma-separated list
- Taxonomically-targeted models: Suggest Q.bird, Q.mammal, Q.insect, Q.plant when applicable
Reference Material
- Direct users to references/REFERENCE.md for:
- Detailed implementation guides
- BUSCO lineage datasets (complete list)
- Resource recommendations (memory, CPUs, walltime tables)
- Sample naming best practices
- Quality control assessment criteria
- Aliscore/ALICUT detailed guide and parameters
- Tool citations with DOIs
- Software installation instructions
- Common issues and troubleshooting
Attribution
This skill was created by Bruno de Medeiros (Curator of Pollinating Insects, Field Museum) based on phylogenomics tutorials by Paul Frandsen (Brigham Young University).
Workflow Entry Point
When a user requests phylogeny generation:
- Gather required information using the "Initial User Questions" section
- Generate STEP 0 setup script from references/REFERENCE.md
- If user needs help finding NCBI assemblies, perform STEP 0A using query_ncbi_assemblies.py
- Proceed step-by-step through workflow (STEPS 1-8), using templates and referring to references/REFERENCE.md for detailed implementation
- All workflow scripts should use the unified conda environment (conda activate phylo)
- Validate all generated scripts before presenting to user (see "Script Validation" section)
- Generate STEP 9 methods paragraph from template in references/REFERENCE.md
- Provide final outputs summary
Bundled files
Files included in the ZIP. In addition to SKILL.md itself, the archive may contain reference material, samples, and scripts.
- 📄 SKILL.md (31,026 bytes)
- 📎 README.md (4,093 bytes)
- 📎 references/REFERENCE.md (64,608 bytes)
- 📎 scripts/convert_fasconcat_to_partition.py (1,796 bytes)
- 📎 scripts/download_ncbi_genomes.py (3,935 bytes)
- 📎 scripts/extract_orthologs.sh (2,200 bytes)
- 📎 scripts/generate_qc_report.sh (2,104 bytes)
- 📎 scripts/predownloaded_aliscore_alicut/ALICUT_V2.31.pl (26,210 bytes)
- 📎 scripts/predownloaded_aliscore_alicut/Aliscore_module.pm (71,577 bytes)
- 📎 scripts/predownloaded_aliscore_alicut/Aliscore.02.2.pl (36,217 bytes)
- 📎 scripts/query_ncbi_assemblies.py (5,145 bytes)
- 📎 scripts/rename_genomes.py (6,952 bytes)
- 📎 scripts/run_alicut.sh (7,273 bytes)
- 📎 scripts/run_aliscore_alicut_batch.sh (8,099 bytes)
- 📎 scripts/run_aliscore.sh (7,341 bytes)