
busco-phylogeny

Generate phylogenies from genome assemblies using BUSCO/compleasm-based single-copy orthologs with scheduler-aware workflow generation

⚡ Recommended: install with a single command (about 60 seconds)

Copy the command below and paste it into a terminal (Mac/Linux) or PowerShell (Windows). Download, extraction, and placement are fully automatic.

🍎 Mac / 🐧 Linux

mkdir -p ~/.claude/skills && cd ~/.claude/skills && curl -L -o busco-phylogeny.zip https://jpskill.com/download/17649.zip && unzip -o busco-phylogeny.zip && rm busco-phylogeny.zip

🪟 Windows (PowerShell)

$d = "$env:USERPROFILE\.claude\skills"; ni -Force -ItemType Directory $d | Out-Null; iwr https://jpskill.com/download/17649.zip -OutFile "$d\busco-phylogeny.zip"; Expand-Archive "$d\busco-phylogeny.zip" -DestinationPath $d -Force; ri "$d\busco-phylogeny.zip"

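When it finishes, restart Claude Code. The skill then activates automatically when you make a relevant request in plain language, such as "I want to build a phylogeny for beetles".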

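💾 Manual download (for those who find the command difficult)

1. Download busco-phylogeny.zip using the site's download button
2. Double-click the ZIP file to extract it; this creates a busco-phylogeny folder
3. Move that folder to C:\Users\<your name>\.claude\skills\ (Windows) or ~/.claude/skills/ (Mac)
4. Restart Claude Code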


⚠️ Download and use at your own risk. This site accepts no responsibility for the content, behavior, or safety of the skill.

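🎯 What this Skill does

The description below explains what this Skill does for you. It activates automatically when you ask Claude for help in this area.

📦 Installation (3 steps)

1. Download the .skill file using the site's download button
2. Change the file extension from .skill to .zip and extract it (macOS can extract it automatically)
3. Place the extracted folder in .claude/skills/ under your home folder
- macOS / Linux: ~/.claude/skills/
- Windows: %USERPROFILE%\.claude\skills\

Restart Claude Code to finish. You do not need to say "use this Skill"; it is invoked automatically for relevant requests.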


Last updated: 2026-05-18
Retrieved: 2026-05-18
Bundled files: 15


BUSCO-based Phylogenomics Workflow Generator

This skill provides phylogenomics expertise for generating comprehensive, scheduler-aware workflows for phylogenetic inference from genome assemblies using single-copy orthologs.

Purpose

This skill helps users generate phylogenies from genome assemblies by:

Handling mixed input (local files and NCBI accessions)
Creating scheduler-specific scripts (SLURM, PBS, cloud, local)
Setting up complete workflows from raw genomes to final trees
Providing quality control and recommendations
Supporting flexible software management (bioconda, Docker, custom)

Available Resources

The skill provides access to these bundled resources:

Scripts (`scripts/`)

query_ncbi_assemblies.py - Query NCBI for available genome assemblies by taxon name (new!)
download_ncbi_genomes.py - Download genomes from NCBI using BioProjects or Assembly accessions
rename_genomes.py - Rename genome files with meaningful sample names (important!)
generate_qc_report.sh - Generate quality control reports from compleasm results
extract_orthologs.sh - Extract and reorganize single-copy orthologs
run_aliscore.sh - Wrapper for Aliscore to identify randomly similar sequences (RSS)
run_alicut.sh - Wrapper for ALICUT to remove RSS positions from alignments
run_aliscore_alicut_batch.sh - Batch process all alignments through Aliscore + ALICUT
convert_fasconcat_to_partition.py - Convert FASconCAT output to IQ-TREE partition format
predownloaded_aliscore_alicut/ - Pre-tested Aliscore and ALICUT Perl scripts

Templates (`templates/`)

slurm/ - SLURM job scheduler templates
pbs/ - PBS/Torque job scheduler templates
local/ - Local machine templates (with GNU parallel)
README.md - Complete template documentation

References (`references/`)

REFERENCE.md - Detailed technical reference including:
- Sample naming best practices
- BUSCO lineage datasets (complete list)
- Resource recommendations (memory, CPUs, walltime)
- Detailed step-by-step implementation guides
- Quality control guidelines
- Aliscore/ALICUT detailed guide
- Tool citations and download links
- Software installation guide
- Common issues and troubleshooting

Workflow Overview

The complete phylogenomics pipeline follows this sequence:

Input Preparation → Ortholog Identification → Quality Control → Ortholog Extraction → Alignment → Trimming → Concatenation → Phylogenetic Inference

Initial User Questions

When a user requests phylogeny generation, gather the following information systematically:

Step 1: Detect Computing Environment

Before asking questions, attempt to detect the local computing environment:

# Check for job schedulers
command -v sbatch >/dev/null 2>&1  # SLURM
command -v qsub >/dev/null 2>&1    # PBS/Torque
command -v parallel >/dev/null 2>&1  # GNU parallel

Report findings to the user, then confirm: "I detected [X] on this machine. Will you be running the scripts here or on a different system?"
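
The same detection can be sketched in Python, assuming the standard-library `shutil.which` is an acceptable stand-in for `command -v`. The function name `detect_environment` and the labels it returns are illustrative and are not part of the skill's bundled scripts.

```python
import shutil

# Executable that signals each environment, in the order checked above.
PROBES = [("SLURM", "sbatch"), ("PBS/Torque", "qsub"), ("GNU parallel", "parallel")]


def detect_environment(which=shutil.which):
    """Return the labels of every scheduler or tool found on PATH.

    `which` is injectable so the function can be exercised without the
    real executables being installed (hypothetical helper).
    """
    return [label for label, exe in PROBES if which(exe) is not None]


if __name__ == "__main__":
    found = detect_environment()
    print("Detected:", ", ".join(found) if found else "no scheduler or GNU parallel")
```

Returning every match rather than the first one leaves the final choice to the user, which fits the confirmation question above.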

Required Information

Ask these questions to gather essential workflow parameters:

1. Computing Environment
   - Where will these scripts run? (SLURM cluster, PBS/Torque cluster, Cloud computing, Local machine)
2. Input Data
   - Local genome files, NCBI accessions, or both?
   - If NCBI: Do you already have Assembly accessions (GCA_*/GCF_*) or BioProject accessions (PRJNA/PRJEB/PRJDA)?
   - If user doesn't have accessions: Offer to help find assemblies using query_ncbi_assemblies.py (see "STEP 0A: Query NCBI for Assemblies" below)
   - If local files: What are the file paths?
3. Taxonomic Scope & Dataset Details
   - What taxonomic group? (determines BUSCO lineage dataset)
   - How many taxa/genomes will be analyzed?
   - What is the approximate phylogenetic breadth? (species-level, genus-level, family-level, order-level, etc.)
   - See references/REFERENCE.md for complete lineage list
4. Environment Management
   - Use unified conda environment (default, recommended), or separate environments per tool?
5. Resource Constraints
   - How many CPU cores/threads to use in total? (Ask user to specify, do not auto-detect)
   - Available memory (RAM) per node/machine?
   - Maximum walltime for jobs?
   - See references/REFERENCE.md for resource recommendations
6. Parallelization Strategy

   Ask the user how they want to handle parallel processing:
   - For job schedulers (SLURM/PBS):
     - Use array jobs for parallel steps? (Recommended: Yes)
     - Which steps to parallelize? (Steps 2, 5, 6, 8C recommended)
   - For local machines:
     - Use GNU parallel for parallel steps? (requires parallel installed)
     - How many concurrent jobs?
   - For all systems:
     - Optimize for maximum throughput or simplicity?
7. Scheduler-Specific Configuration (if using SLURM or PBS)
   - Account/Username for compute time charges
   - Partition/Queue to submit jobs to
   - Email notifications? (address and when: START, END, FAIL, ALL)
   - Job dependencies? (Recommended: Yes for linear workflow)
   - Output log directory? (Default: logs/)
8. Alignment Trimming Preference
   - Aliscore/ALICUT (traditional, thorough), trimAl (fast), BMGE (entropy-based), or ClipKit (modern)?
9. Substitution Model Selection (for IQ-TREE phylogenetic inference)

   Context needed: Taxonomic breadth, number of taxa, evolutionary rates

   Action: Fetch IQ-TREE model documentation and suggest appropriate amino acid substitution models based on dataset characteristics.

   Use the substitution model recommendation system (see "Substitution Model Recommendation" section below).
10. Educational Goals
   - Are you learning bioinformatics and would you like comprehensive explanations of each workflow step?
   - If yes: After completing each major workflow stage, offer to explain what the step accomplishes, why certain choices were made, and what best practices are being followed.
   - Store this preference to use throughout the workflow.

Recommended Directory Structure

Organize analyses with dedicated folders for each pipeline step:

project_name/
├── logs/                          # All log files
├── 00_genomes/                    # Input genome assemblies
├── 01_busco_results/              # BUSCO/compleasm outputs
├── 02_qc/                         # Quality control reports
├── 03_extracted_orthologs/        # Extracted single-copy orthologs
├── 04_alignments/                 # Multiple sequence alignments
├── 05_trimmed/                    # Trimmed alignments
├── 06_concatenation/              # Supermatrix and partition files
├── 07_partition_search/           # Partition model selection
├── 08_concatenated_tree/          # Concatenated ML tree
├── 09_gene_trees/                 # Individual gene trees
├── 10_species_tree/               # ASTRAL species tree
└── scripts/                       # All analysis scripts

Benefits: Easy debugging, clear workflow progression, reproducibility, prevents root directory clutter.
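
A minimal sketch of creating this layout with Python's `pathlib`, using the folder names shown above. The function name `create_project_layout` is hypothetical; the skill itself does not ship such a helper as far as this document shows.

```python
from pathlib import Path

# Folder names copied from the recommended layout above.
STEP_DIRS = [
    "logs", "00_genomes", "01_busco_results", "02_qc",
    "03_extracted_orthologs", "04_alignments", "05_trimmed",
    "06_concatenation", "07_partition_search", "08_concatenated_tree",
    "09_gene_trees", "10_species_tree", "scripts",
]


def create_project_layout(root):
    """Create every step folder under `root`; safe to re-run."""
    root = Path(root)
    for name in STEP_DIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    return [root / name for name in STEP_DIRS]
```

Calling `create_project_layout("project_name")` a second time changes nothing, so it can sit at the top of a setup script without guarding against existing folders.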

Template System

This skill uses a template-based system to reduce token usage and improve maintainability. Script templates are stored in the templates/ directory and organized by computing environment.

How to Use Templates

When generating scripts for users:

Read the appropriate template for their computing environment:
```
Read("templates/slurm/02_compleasm_first.job")
```
Replace placeholders with user-specific values:
- TOTAL_THREADS → e.g., 64
- THREADS_PER_JOB → e.g., 16
- NUM_GENOMES → e.g., 20
- NUM_LOCI → e.g., 2795
- LINEAGE → e.g., insecta_odb10
- MODEL_SET → e.g., LG,WAG,JTT,Q.pfam
Present the customized script to the user with setup instructions
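
The placeholder replacement can be sketched as follows, assuming placeholders appear in the templates as the bare uppercase tokens listed above (the real templates may use a different syntax). `fill_template` is a hypothetical helper, and the sample command line in the usage note is illustrative rather than copied from a bundled template.

```python
import re

# Placeholder names listed above; the bundled templates may define others.
PLACEHOLDERS = ("TOTAL_THREADS", "THREADS_PER_JOB", "NUM_GENOMES",
                "NUM_LOCI", "LINEAGE", "MODEL_SET")

# \b keeps LINEAGE from matching inside a longer name such as LINEAGE_NAME.
_PATTERN = re.compile(r"\b(" + "|".join(PLACEHOLDERS) + r")\b")


def fill_template(text, values):
    """Replace whole-word placeholders in `text` with entries from `values`.

    Raises KeyError when a placeholder present in the text has no value,
    so a half-filled script is never returned.
    """
    return _PATTERN.sub(lambda match: str(values[match.group(1)]), text)
```

For example, `fill_template("compleasm run -l LINEAGE -t THREADS_PER_JOB", {"LINEAGE": "insecta_odb10", "THREADS_PER_JOB": 16})` yields `compleasm run -l insecta_odb10 -t 16`. Failing loudly on a missing value supports the rule that all placeholders are replaced before a script is presented.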

Available Templates

Key templates by workflow step:

Step 0 (setup): Environment setup script in references/REFERENCE.md
Step 2 (compleasm): 02_compleasm_first, 02_compleasm_parallel
Step 8A (partition search): 08a_partition_search
Step 8C (gene trees): 08c_gene_trees_array, 08c_gene_trees_parallel, 08c_gene_trees_serial

See templates/README.md for complete template documentation.

Substitution Model Recommendation

When asked about substitution model selection (Question 9), use this systematic approach:

Step 1: Fetch IQ-TREE Documentation

Use WebFetch to retrieve current model information:

WebFetch(url="https://iqtree.github.io/doc/Substitution-Models",
         prompt="Extract all amino acid substitution models with descriptions and usage guidelines")

Step 2: Analyze Dataset Characteristics

Consider these factors from user responses:

Taxonomic Scope: Species/genus (shallow) vs. family/order (moderate) vs. class/phylum+ (deep)
Number of Taxa: <20 (small), 20-50 (medium), >50 (large)
Evolutionary Rates: Fast-evolving, moderate, or slow-evolving
Sequence Type: Nuclear proteins, mitochondrial, or chloroplast

Step 3: Recommend Models

Provide 3-5 appropriate models based on dataset characteristics. For detailed model recommendation matrices and taxonomically-targeted models, see references/REFERENCE.md section "Substitution Model Recommendation".

General recommendations:

Nuclear proteins (most common): LG, WAG, JTT, Q.pfam
Mitochondrial: mtREV, mtZOA, mtMAM, mtART, mtVer, mtInv
Chloroplast: cpREV
Taxonomically-targeted: Q.bird, Q.mammal, Q.insect, Q.plant, Q.yeast (when applicable)

Step 4: Present Recommendations

Format recommendations with justifications and explain how models will be used in IQ-TREE steps 8A and 8C.

Step 5: Store Model Set

Store the final comma-separated model list (e.g., "LG,WAG,JTT,Q.pfam") for use in Step 8 template placeholders.
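
As a rough sketch, the general recommendations above can be turned into the stored comma-separated list like this. The dictionary keys (`"nuclear"`, `"insect"`, and so on) and the function `build_model_set` are illustrative names, and appending the taxon-specific model to the general list is an assumption rather than a rule stated by the skill.

```python
# Model names taken from the general recommendations above.
GENERAL_MODELS = {
    "nuclear": ["LG", "WAG", "JTT", "Q.pfam"],
    "mitochondrial": ["mtREV", "mtZOA", "mtMAM", "mtART", "mtVer", "mtInv"],
    "chloroplast": ["cpREV"],
}
TARGETED_MODELS = {
    "bird": "Q.bird", "mammal": "Q.mammal", "insect": "Q.insect",
    "plant": "Q.plant", "yeast": "Q.yeast",
}


def build_model_set(sequence_type, group=None):
    """Return a comma-separated model list for the MODEL_SET placeholder."""
    models = list(GENERAL_MODELS[sequence_type])
    targeted = TARGETED_MODELS.get(group)
    if targeted and targeted not in models:
        models.append(targeted)  # assumption: targeted model is added last
    return ",".join(models)
```

`build_model_set("nuclear", "insect")` gives `LG,WAG,JTT,Q.pfam,Q.insect`, which stays within the suggested range of 3-5 models.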

Workflow Implementation

Once required information is gathered, guide the user through these steps. For each step, use templates where available and refer to references/REFERENCE.md for detailed implementation.

STEP 0: Environment Setup

ALWAYS start by generating a setup script for the user's environment.

Use the unified conda environment setup script from references/REFERENCE.md (Section: "Software Installation Guide"). This creates a single conda environment with all necessary tools:

compleasm, MAFFT, trimming tools (trimAl, ClipKit, BMGE)
IQ-TREE, ASTRAL, Perl with BioPerl, GNU parallel
Downloads and installs Aliscore/ALICUT Perl scripts

Key points:

Users choose between mamba (faster) or conda
Users choose between predownloaded Aliscore/ALICUT scripts (tested) or latest from GitHub
All subsequent steps use conda activate phylo (the unified environment)

See references/REFERENCE.md for the complete setup script template.

STEP 0A: Query NCBI for Assemblies (Optional)

Use this step when: User wants to use NCBI data but doesn't have specific assembly accessions yet.

This optional preliminary step helps users discover available genome assemblies by taxon name before proceeding with the main workflow.

When to Offer This Step

Offer this step when:

User wants to analyze genomes from NCBI
User doesn't have specific Assembly or BioProject accessions
User mentions a taxonomic group (e.g., "I want to build a phylogeny for beetles")

Workflow

Ask for focal taxon: Request the taxonomic group of interest
- Examples: "Coleoptera", "Drosophila", "Apis mellifera"
- Can be at any taxonomic level (order, family, genus, species)

Query NCBI using the script: Use scripts/query_ncbi_assemblies.py to search for assemblies

# Basic query (returns 20 results by default)
python scripts/query_ncbi_assemblies.py --taxon "Coleoptera"

# Query with more results
python scripts/query_ncbi_assemblies.py --taxon "Drosophila" --max-results 50

# Query for RefSeq assemblies only (higher quality, GCF_* accessions)
python scripts/query_ncbi_assemblies.py --taxon "Apis" --refseq-only

# Save accessions to file for later download
python scripts/query_ncbi_assemblies.py --taxon "Coleoptera" --save assembly_accessions.txt

Present results to user: The script displays:
- Assembly accession (GCA_* or GCF_*)
- Organism name
- Assembly level (Chromosome, Scaffold, Contig)
- Assembly name
Help user select assemblies: Ask user which assemblies they want to include
- Consider assembly level (Chromosome > Scaffold > Contig)
- Consider phylogenetic breadth (species coverage)
- Consider data quality (RefSeq > GenBank when available)
Collect selected accessions: Compile the list of chosen assembly accessions
Proceed to STEP 1: Use the selected accessions with download_ncbi_genomes.py

Tips for Assembly Selection

Assembly Level: Chromosome-level assemblies are most complete, followed by Scaffold, then Contig
RefSeq vs GenBank: RefSeq (GCF_*) assemblies undergo additional curation; GenBank (GCA_*) are submitter-provided
Taxonomic Sampling: For phylogenetics, aim for representative sampling across the taxonomic group
Quality over Quantity: Better to have 20 high-quality assemblies than 100 poor-quality ones
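
One way to apply these tips is to sort candidate assemblies before showing them to the user. The sketch below assumes each record is a dict with `"accession"` and `"level"` keys (a hypothetical shape, not the documented output of query_ncbi_assemblies.py) and assumes assembly level outranks RefSeq status when the two disagree.

```python
# Preference order from the tips above: Chromosome > Scaffold > Contig.
LEVEL_RANK = {"Chromosome": 0, "Scaffold": 1, "Contig": 2}


def rank_assemblies(records):
    """Return records sorted so the preferred assemblies come first.

    Unknown assembly levels sort after the three listed levels.
    """
    def sort_key(record):
        level = LEVEL_RANK.get(record["level"], len(LEVEL_RANK))
        refseq_first = 0 if record["accession"].startswith("GCF_") else 1
        return (level, refseq_first, record["accession"])

    return sorted(records, key=sort_key)
```

The ranking only orders candidates; taxonomic sampling still needs a human decision, since the best-assembled genomes may all belong to one clade.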

STEP 1: Download NCBI Genomes (if applicable)

If user provided NCBI accessions, use scripts/download_ncbi_genomes.py:

For BioProjects:

python scripts/download_ncbi_genomes.py --bioprojects PRJNA12345 -o genomes.zip
unzip genomes.zip

For Assembly Accessions:

python scripts/download_ncbi_genomes.py --assemblies GCA_123456789.1 -o genomes.zip
unzip genomes.zip

IMPORTANT: After download, genomes must be renamed with meaningful sample names (format: [ACCESSION]_[SPECIES_NAME]). Sample names appear in final phylogenetic trees.

Generate a script that:

Finds all downloaded FASTA files in ncbi_dataset directory structure
Moves/renames files to main genomes directory with meaningful names
Includes any local genome files
Creates final genome_list.txt with ALL genomes (local + downloaded)

See references/REFERENCE.md section "Sample Naming Best Practices" for detailed guidelines.
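
A small sketch of building a name in the [ACCESSION]_[SPECIES_NAME] format. The cleaning rules here are assumptions, not the behavior of the bundled rename_genomes.py: whitespace becomes underscores and any character other than letters, digits, `.`, `_`, and `-` is dropped, because parentheses, commas, and colons are syntax in Newick tree files.

```python
import re


def sample_name(accession, species):
    """Build an [ACCESSION]_[SPECIES_NAME] label that is safe in tree files."""
    cleaned = re.sub(r"\s+", "_", species.strip())      # spaces -> underscores
    cleaned = re.sub(r"[^A-Za-z0-9._-]", "", cleaned)   # drop Newick-unsafe chars
    return f"{accession}_{cleaned}"
```

With the placeholder accession used above, `sample_name("GCA_123456789.1", "Apis mellifera")` returns `GCA_123456789.1_Apis_mellifera`.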

STEP 2: Ortholog Identification with compleasm

Activate the unified environment and run compleasm on all genomes to identify single-copy orthologs.

Key considerations:

First genome must run alone to download lineage database
Remaining genomes can run in parallel
Thread allocation: Miniprot scales well up to ~16-32 threads per genome

Threading guidelines: See references/REFERENCE.md for recommended thread allocation table.
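
The two-phase idea (first genome alone, the rest in parallel) can be sketched as a planning function. `plan_compleasm` and its return shape are hypothetical; the sketch assumes the first run receives all threads, in line with the resource check later in this document, and that concurrency is simply total threads divided by threads per job.

```python
def plan_compleasm(genomes, total_threads, threads_per_job):
    """Split genomes into a solo first run and a parallel remainder."""
    if not genomes:
        raise ValueError("at least one genome is required")
    if not 1 <= threads_per_job <= total_threads:
        raise ValueError("threads_per_job must be between 1 and total_threads")
    first, rest = genomes[0], genomes[1:]
    return {
        # Phase 1: runs alone so the lineage database is downloaded once.
        "first": (first, total_threads),
        # Phase 2: remaining genomes, each with a fixed thread budget.
        "parallel": [(genome, threads_per_job) for genome in rest],
        "concurrent_jobs": total_threads // threads_per_job,
    }
```

With 64 total threads and 16 threads per job, four genomes can run concurrently in the second phase.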

Generate scripts using templates:

SLURM: Read templates 02_compleasm_first.job and 02_compleasm_parallel.job
PBS: Read templates 02_compleasm_first.job and 02_compleasm_parallel.job
Local: Read templates 02_compleasm_first.sh and 02_compleasm_parallel.sh

Replace placeholders: TOTAL_THREADS, THREADS_PER_JOB, NUM_GENOMES, LINEAGE

For detailed implementation examples, see references/REFERENCE.md section "Ortholog Identification Implementation".

STEP 3: Quality Control

After compleasm completes, generate QC report using scripts/generate_qc_report.sh:

bash scripts/generate_qc_report.sh qc_report.csv

Provide interpretation:

>95% complete: Excellent, retain
90-95% complete: Good, retain
85-90% complete: Acceptable, case-by-case
70-85% complete: Questionable, consider excluding
<70% complete: Poor, recommend excluding

See references/REFERENCE.md section "Quality Control Guidelines" for detailed assessment criteria.
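
The interpretation table above maps directly onto a small classifier. The listed ranges share their endpoints (for example, 90% appears in two rows), so this sketch assumes a boundary value falls into the better category, except 95%, which the table places under "Good" because "Excellent" requires more than 95%. `classify_completeness` is an illustrative name.

```python
def classify_completeness(percent):
    """Map a completeness percentage to the interpretation listed above."""
    if percent > 95:
        return "Excellent, retain"
    if percent >= 90:
        return "Good, retain"
    if percent >= 85:
        return "Acceptable, case-by-case"
    if percent >= 70:
        return "Questionable, consider excluding"
    return "Poor, recommend excluding"
```

The function takes a number only; reading completeness values out of qc_report.csv is left out because the column layout of that file is not described here.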

STEP 4: Ortholog Extraction

Use scripts/extract_orthologs.sh to extract single-copy orthologs:

bash scripts/extract_orthologs.sh LINEAGE_NAME

This generates per-locus unaligned FASTA files in single_copy_orthologs/unaligned_aa/.

STEP 5: Alignment with MAFFT

Activate the unified environment (conda activate phylo) which contains MAFFT.

Create locus list, then generate alignment scripts:

cd single_copy_orthologs/unaligned_aa
ls *.fas > locus_names.txt
num_loci=$(wc -l < locus_names.txt)

Generate scheduler-specific scripts:

SLURM/PBS: Array job with one task per locus
Local: Sequential processing or GNU parallel

For detailed script templates, see references/REFERENCE.md section "Alignment Implementation".

STEP 6: Alignment Trimming

Based on user's preference, provide appropriate trimming method. All tools are available in the unified conda environment.

Options:

trimAl: Fast (-automated1), recommended for large datasets
ClipKit: Modern, fast (default smart-gap mode)
BMGE: Entropy-based (-t AA)
Aliscore/ALICUT: Traditional, thorough (recommended for phylogenomics)

For Aliscore/ALICUT:

Perl scripts were installed in STEP 0
Use scripts/run_aliscore_alicut_batch.sh for batch processing
Or use array jobs with scripts/run_aliscore.sh and scripts/run_alicut.sh
Always use -N flag for amino acid sequences

Generate scripts using scheduler-appropriate templates (array jobs for SLURM/PBS, parallel or serial for local).

For detailed implementation of each trimming method, see references/REFERENCE.md section "Alignment Trimming Implementation".

STEP 7: Concatenation and Partition Definition

Download FASconCAT-G (Perl script) and run concatenation:

conda activate phylo  # Has Perl installed
wget https://raw.githubusercontent.com/PatrickKueck/FASconCAT-G/master/FASconCAT-G_v1.06.1.pl -O FASconCAT-G.pl
chmod +x FASconCAT-G.pl

cd trimmed_aa
perl ../FASconCAT-G.pl -s -i

Convert to IQ-TREE format using scripts/convert_fasconcat_to_partition.py:

python ../scripts/convert_fasconcat_to_partition.py FcC_info.xls partition_def.txt

Outputs: FcC_supermatrix.fas, FcC_info.xls, partition_def.txt

STEP 8: Phylogenetic Inference

IQ-TREE is already installed in the unified environment. Activate with conda activate phylo.

Part 8A: Partition Model Selection

Use the substitution models selected during initial setup (Question 9).

Generate script using templates:

Read appropriate template: templates/[slurm|pbs|local]/08a_partition_search.[job|sh]
Replace MODEL_SET placeholder with user's selected models (e.g., "LG,WAG,JTT,Q.pfam")

For detailed implementation, see references/REFERENCE.md section "Partition Model Selection Implementation".

Part 8B: Concatenated ML Tree

Run IQ-TREE using the best partition scheme from Part 8A:

iqtree -s FcC_supermatrix.fas -spp partition_search.best_scheme.nex \
  -nt 18 -safe -pre concatenated_ML_tree -bb 1000 -bnni

Output: concatenated_ML_tree.treefile

Part 8C: Individual Gene Trees

Estimate gene trees for coalescent-based species tree inference.

Generate scripts using templates:

SLURM/PBS: Read 08c_gene_trees_array.job template
Local: Read 08c_gene_trees_parallel.sh or 08c_gene_trees_serial.sh template
Replace NUM_LOCI placeholder

For detailed implementation, see references/REFERENCE.md section "Gene Trees Implementation".

Part 8D: ASTRAL Species Tree

ASTRAL is already installed in the unified conda environment.

conda activate phylo

# Concatenate all gene trees
cat trimmed_aa/*.treefile > all_gene_trees.tre

# Run ASTRAL
astral -i all_gene_trees.tre -o astral_species_tree.tre

Output: astral_species_tree.tre
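
As an alternative to the `cat` step, the gene trees can be gathered in Python, which makes it easy to skip empty files and count what was collected. This is a sketch under the assumption that each `.treefile` holds one Newick tree; `concatenate_gene_trees` is a hypothetical helper and not one of the bundled scripts.

```python
from pathlib import Path


def concatenate_gene_trees(tree_dir, output):
    """Write every non-empty *.treefile in `tree_dir` to `output`, one per line.

    Returns the number of trees written so it can be compared with the
    expected number of loci.
    """
    trees = []
    for path in sorted(Path(tree_dir).glob("*.treefile")):
        newick = path.read_text().strip()
        if newick:  # skip files left empty by failed runs
            trees.append(newick)
    if not trees:
        raise FileNotFoundError(f"no gene trees found in {tree_dir}")
    Path(output).write_text("\n".join(trees) + "\n")
    return len(trees)
```

A count lower than the number of loci is a cue to check the gene-tree logs before running ASTRAL.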

STEP 9: Generate Methods Paragraph

ALWAYS generate a methods paragraph to help users write their publication methods section.

Create METHODS_PARAGRAPH.md file with:

Customized text based on tools and parameters used
Complete citations for all software
Placeholders for user-specific values (genome count, loci count, thresholds)
Instructions for adapting to journal requirements

For the complete methods paragraph template, see references/REFERENCE.md section "Methods Paragraph Template".

Pre-fill known values when possible:

Number of genomes
BUSCO lineage
Trimming method used
Substitution models tested

Final Outputs Summary

Provide users with a summary of outputs:

Phylogenetic Results:

1. concatenated_ML_tree.treefile - ML tree from concatenated supermatrix
2. astral_species_tree.tre - Coalescent species tree
3. *.treefile - Individual gene trees

Data and Quality Control:

4. qc_report.csv - Genome quality statistics
5. FcC_supermatrix.fas - Concatenated alignment
6. partition_search.best_scheme.nex - Selected partitioning scheme

Publication Materials:

7. METHODS_PARAGRAPH.md - Ready-to-use methods section with citations

Visualization tools: FigTree, iTOL, ggtree (R), ete3/toytree (Python)

Script Validation

ALWAYS perform validation checks after generating scripts but before presenting them to the user. This ensures script accuracy, consistency, and proper resource allocation.

Validation Workflow

For each generated script, perform these validation checks in order:

1. Program Option Verification

Purpose: Detect hallucinated or incorrect command-line options that may cause scripts to fail.

Procedure:

Extract all command invocations from the generated script (e.g., compleasm run, iqtree -s, mafft --auto)
Compare against reference sources:
- First check: Compare against corresponding template in templates/ directory
- Second check: Compare against examples in references/REFERENCE.md
- Third check: If options differ significantly or are uncertain, perform web search for official documentation
Common tools to validate:
- compleasm run - Check -a, -o, -l, -t options
- iqtree - Verify -s, -p, -m, -bb, -alrt, -nt, -safe options
- mafft - Check --auto, --thread, --reorder options
- astral - Verify -i, -o options
- Trimming tools (trimal, clipkit, BMGE.jar) - Validate options

Action on issues:

If incorrect options found: Inform user of the issue and ask if they want you to correct it
If uncertain: Ask user to verify with tool documentation before proceeding

2. Pipeline Continuity Verification

Purpose: Ensure outputs from one step correctly feed into inputs of subsequent steps.

Procedure:

Map input/output relationships:
- Step 2 output (01_busco_results/*_compleasm/) → Step 3 input (QC script)
- Step 4 output (single_copy_orthologs/) → Step 5 input (MAFFT)
- Step 5 output (04_alignments/*.fas) → Step 6 input (trimming)
- Step 6 output (05_trimmed/*.fas) → Step 7 input (FASconCAT-G)
- Step 7 output (FcC_supermatrix.fas, partition file) → Step 8A input (IQ-TREE)
- Step 8C output (*.treefile) → Step 8D input (ASTRAL)
Check for consistency:
- File path references match across scripts
- Directory structure follows recommended layout
- Glob patterns correctly match expected files
- Required intermediate files are generated before being used

Action on issues:

If path mismatches found: Inform user and ask if they want you to correct them
If directory structure inconsistent: Suggest corrections aligned with recommended structure

3. Resource Compatibility Check

Purpose: Ensure allocated computational resources are appropriate for the task.

Procedure:

Verify resource allocations against recommendations in references/REFERENCE.md:
- Memory allocation: Check if memory per CPU (typically 6GB for compleasm, 2-4GB for others) is adequate
- Thread allocation: Verify thread counts are reasonable for the number of genomes/loci
- Walltime: Ensure walltime is sufficient based on dataset size guidelines
- Parallelization: Check that threads per job × concurrent jobs ≤ total threads
Common issues to check:
- Compleasm: First job needs full thread allocation (downloads database)
- IQ-TREE: -nt should match allocated CPUs
- Gene trees: Ensure enough threads per tree × concurrent trees ≤ total available
- Memory: Concatenated tree inference may need 8-16GB per CPU for large datasets
Validate against user-specified constraints:
- Total CPUs specified by user
- Available memory per node
- Maximum walltime limits
- Scheduler-specific limits (if mentioned)
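
The arithmetic behind the parallelization and memory checks can be sketched as below. The function name, parameters, and message wording are illustrative, and the sketch assumes all concurrent jobs share a single node, which will not hold for multi-node array jobs.

```python
def check_allocation(threads_per_job, concurrent_jobs, total_threads,
                     mem_per_cpu_gb, node_mem_gb):
    """Return a list of human-readable issues; an empty list means it fits."""
    issues = []
    used_threads = threads_per_job * concurrent_jobs
    if used_threads > total_threads:
        issues.append(
            f"threads: {threads_per_job} x {concurrent_jobs} = {used_threads} "
            f"exceeds the {total_threads} threads specified by the user"
        )
    needed_mem = used_threads * mem_per_cpu_gb
    if needed_mem > node_mem_gb:
        issues.append(
            f"memory: {needed_mem} GB needed at {mem_per_cpu_gb} GB per CPU, "
            f"but only {node_mem_gb} GB is available per node"
        )
    return issues
```

For instance, four 16-thread compleasm jobs at 6 GB per CPU need 384 GB, so they fit a 512 GB node but not a 256 GB one.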

Action on issues:

If resource allocation issues found: Inform user and suggest corrections with justification
If uncertain about adequacy: Ask user about typical job performance in their environment

Validation Reporting

After completing all validation checks:

If all checks pass: Inform user briefly: "Scripts validated successfully - options, pipeline flow, and resources verified."

If issues found: Present a structured report:

**Validation Results**

⚠️ Issues found during validation:

1. [Issue category]: [Description]
   - Current: [What was generated]
   - Suggested: [Recommended fix]
   - Reason: [Why this is an issue]

Would you like me to apply these corrections?

Always ask before correcting: Never silently fix issues - always get user confirmation before applying changes.
Document corrections: If corrections are applied, explain what was changed and why.
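
A sketch of rendering the structured report above from a list of issues. The dict keys (`category`, `description`, `current`, `suggested`, `reason`) and the name `format_validation_report` are assumptions chosen to mirror the template fields.

```python
SUCCESS = ("Scripts validated successfully - options, pipeline flow, "
           "and resources verified.")


def format_validation_report(issues):
    """Render issues in the report layout shown above, or the success line."""
    if not issues:
        return SUCCESS
    lines = ["**Validation Results**", "", "⚠️ Issues found during validation:", ""]
    for number, issue in enumerate(issues, start=1):
        lines.append(f"{number}. {issue['category']}: {issue['description']}")
        lines.append(f"   - Current: {issue['current']}")
        lines.append(f"   - Suggested: {issue['suggested']}")
        lines.append(f"   - Reason: {issue['reason']}")
        lines.append("")
    lines.append("Would you like me to apply these corrections?")
    return "\n".join(lines)
```

The report always ends with the confirmation question, so corrections are never applied without asking.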

Communication Guidelines

Always start with STEP 0: Generate the unified environment setup script
Always end with STEP 9: Generate the customized methods paragraph
Always validate scripts: Perform validation checks before presenting scripts to users
Use unified environment by default: All scripts should use conda activate phylo
Always ask about CPU allocation: Never auto-detect cores, always ask user
Recommend optimized workflows: For users with adequate resources, recommend optimized parallel approaches over simple serial approaches
Be clear and pedagogical: Explain why each step is necessary
Provide educational explanations when requested: If user answered yes to educational goals (question 10):
- After completing each major workflow stage, ask: "Would you like me to explain this step?"
- If yes, provide moderate-length explanation (1-2 paragraphs) covering:
  - What the step accomplishes biologically and computationally
  - Significant choices made and their rationale
  - Best practices being followed in the workflow
- Examples of "major workflow stages": STEP 0 (setup), STEP 1 (download), STEP 2 (ortholog identification with compleasm), STEP 3 (QC), STEP 5 (alignment), STEP 6 (trimming), STEP 7 (concatenation), STEP 8 (phylogenetic inference)
Provide complete, ready-to-run scripts: Users should copy-paste and run
Adapt to user's environment: Always generate scheduler-specific scripts
Reference supporting files: Direct users to references/REFERENCE.md for details
Use helper scripts: Leverage provided scripts in scripts/ directory
Include error checking: Add file existence checks and informative error messages
Be encouraging: Phylogenomics is complex; maintain supportive tone

Important Notes

Mandatory Steps

STEP 0 is mandatory: Always generate the environment setup script first
STEP 9 is mandatory: Always generate the methods paragraph file at the end

Template Usage (IMPORTANT!)

Prefer templates over inline code: Use templates/ directory for major scripts
Template workflow:
- Read: Read("templates/slurm/02_compleasm_first.job")
- Replace placeholders: TOTAL_THREADS, LINEAGE, NUM_GENOMES, MODEL_SET, etc.
- Present customized script to user
Available templates: See templates/README.md for complete list
Benefits: Reduces token usage, easier maintenance, consistent structure

Script Generation

Always adapt scripts to user's scheduler (SLURM/PBS/local)
Replace all placeholders before presenting scripts
Never auto-detect CPU cores: Always ask user to specify
Provide parallelization options: For each parallelizable step, offer array job, parallel, and serial options
Scheduler-specific configuration: For SLURM/PBS, always ask about account, partition, email, etc.

Parallelization Strategy

Ask about preferences: Let user choose between throughput optimization vs. simplicity
Compleasm optimization: For ≥2 genomes and ≥16 cores, recommend two-phase approach
Use threading guidelines: Refer to references/REFERENCE.md for thread allocation recommendations
Parallelizable steps: Steps 2 (compleasm), 5 (MAFFT), 6 (trimming), 8C (gene trees)

Substitution Model Selection

Always recommend models: Use the systematic model recommendation process
Fetch current documentation: Use WebFetch to get IQ-TREE model information
Replace the MODEL_SET placeholder: In Step 8A templates, substitute a comma-separated list of the recommended models
Taxonomically-targeted models: Suggest Q.bird, Q.mammal, Q.insect, Q.plant when applicable
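
A minimal sketch of how the MODEL_SET value might be assembled follows. The function name `build_model_set`, the taxon-group keys, and the base models in the example call are illustrative assumptions; the actual base list should come from the systematic recommendation process, not from this snippet.

```python
# Taxon-specific models named in the text, keyed by a hypothetical group label.
TAXON_MODELS = {
    "bird": "Q.bird",
    "mammal": "Q.mammal",
    "insect": "Q.insect",
    "plant": "Q.plant",
}


def build_model_set(base_models, taxon_group=None):
    """Join recommended models with commas, adding a taxon-specific one if known."""
    models = list(base_models)
    targeted = TAXON_MODELS.get(taxon_group)
    if targeted and targeted not in models:
        models.append(targeted)
    return ",".join(models)


# Example base list only; not a recommendation for any particular dataset.
print(build_model_set(["LG", "WAG", "JTT"], taxon_group="insect"))
```

The returned string is what would replace the MODEL_SET placeholder in a Step 8A template.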

Reference Material

Direct users to references/REFERENCE.md for:
- Detailed implementation guides
- BUSCO lineage datasets (complete list)
- Resource recommendations (memory, CPUs, walltime tables)
- Sample naming best practices
- Quality control assessment criteria
- Aliscore/ALICUT detailed guide and parameters
- Tool citations with DOIs
- Software installation instructions
- Common issues and troubleshooting

Attribution

This skill was created by Bruno de Medeiros (Curator of Pollinating Insects, Field Museum) based on phylogenomics tutorials by Paul Frandsen (Brigham Young University).

Workflow Entry Point

When a user requests phylogeny generation:

Gather required information using the "Initial User Questions" section
Generate STEP 0 setup script from references/REFERENCE.md
If user needs help finding NCBI assemblies, perform STEP 0A using query_ncbi_assemblies.py
Proceed step-by-step through workflow (STEPS 1-8), using templates and referring to references/REFERENCE.md for detailed implementation
All workflow scripts should use the unified conda environment (conda activate phylo)
Validate all generated scripts before presenting to user (see "Script Validation" section)
Generate STEP 9 methods paragraph from template in references/REFERENCE.md
Provide final outputs summary

Bundled Files

Note: This is the list of files included in the ZIP. In addition to `SKILL.md` itself, it may contain reference materials, samples, and scripts.

📄 SKILL.md (31,026 bytes)
📎 README.md (4,093 bytes)
📎 references/REFERENCE.md (64,608 bytes)
📎 scripts/convert_fasconcat_to_partition.py (1,796 bytes)
📎 scripts/download_ncbi_genomes.py (3,935 bytes)
📎 scripts/extract_orthologs.sh (2,200 bytes)
📎 scripts/generate_qc_report.sh (2,104 bytes)
📎 scripts/predownloaded_aliscore_alicut/ALICUT_V2.31.pl (26,210 bytes)
📎 scripts/predownloaded_aliscore_alicut/Aliscore_module.pm (71,577 bytes)
📎 scripts/predownloaded_aliscore_alicut/Aliscore.02.2.pl (36,217 bytes)
📎 scripts/query_ncbi_assemblies.py (5,145 bytes)
📎 scripts/rename_genomes.py (6,952 bytes)
📎 scripts/run_alicut.sh (7,273 bytes)
📎 scripts/run_aliscore_alicut_batch.sh (8,099 bytes)
📎 scripts/run_aliscore.sh (7,341 bytes)
