🏭 Enterprise-Grade RL Training with miles (MoE FP8)
A Skill for miles, a production fork for FP8/INT4 RL training of large MoE models.
📜 Original English description (for reference)
Provides guidance for enterprise-grade RL training using miles, a production-ready fork of slime. Use when training large MoE models with FP8/INT4, needing train-inference alignment, or requiring speculative RL for maximum throughput.
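⚠️ Download and use at your own risk. This site accepts no responsibility for the content, behavior, or safety of the Skill.
🎯 What this Skill can do
Read the description below to see what this Skill does for you. When you ask Claude for help in this area, the Skill is triggered automatically.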
📦 Installation (3 steps)
1. Click the "Download" button above to get the .skill file
2. Change the file extension from .skill to .zip and extract it (macOS can extract it automatically)
3. Place the extracted folder in .claude/skills/ under your home folder
   - macOS / Linux: ~/.claude/skills/
   - Windows: %USERPROFILE%\.claude\skills\
Restart Claude Code and you are done. The Skill is invoked automatically for related requests, even if you do not say "use this Skill".
- Last updated: 2026-05-17
- Retrieved: 2026-05-17
- Bundled files: 3
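💬 Just say this: sample prompts
- › Using "Enterprise-Grade RL Training with miles (MoE FP8)", show me a minimal sample configuration
- › Explain the main usage and caveats of "Enterprise-Grade RL Training with miles (MoE FP8)"
- › Show me how to integrate "Enterprise-Grade RL Training with miles (MoE FP8)" into an existing project
Paste any of these into Claude Code and the Skill is triggered automatically.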
📖 Original SKILL.md as read by Claude
This is the original text (English or Chinese) intended for the AI (Claude) to read. Japanese translations are being added over time.
miles: Enterprise-Grade RL for Large-Scale Model Training
miles is a high-performance, enterprise-ready RL framework optimized for large-scale model post-training. Built as a production fork of slime, it addresses critical challenges in MoE training stability, low-precision training, and train-inference alignment.
When to Use miles
Choose miles when you need:
- Training 1TB+ MoE models (DeepSeek V3, Qwen3-MoE)
- FP8 or INT4 quantization-aware training
- Bit-wise identical train-inference alignment
- Speculative RL for maximum throughput
- Production stability with enterprise support
Consider alternatives when:
- You want the research-grade original → use slime
- You need flexible backend swapping → use verl
- You want PyTorch-native abstractions → use torchforge
Key Features
Low-Precision Training
- Unified FP8: End-to-end FP8 for both inference and training
- INT4 QAT: 1TB models on single-machine VRAM (H200)
- Rollout Routing Replay (R3): Bit-wise expert alignment for MoE
Performance Optimizations
- Speculative RL: 25%+ rollout speedup with online SFT draft models
- Zero-Copy Weight Sync: CUDA IPC zero-copy mapping
- Partial Rollout: Recycle half-finished trajectories
Train-Inference Alignment
- TIS/MIS: Truncated/Masked Importance Sampling for off-policy correction
- Kernel-level optimization: FlashAttention-3, DeepGEMM integration
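The TIS/MIS bullet above can be illustrated with a short sketch. It assumes the commonly described forms: truncated importance sampling clips the per-token ratio between the training policy and the rollout policy at a cap, while masked importance sampling zeroes out tokens whose ratio exceeds it. The function names and the `cap` parameter are hypothetical and do not correspond to miles flags; the actual implementation may differ.

```python
import math

def tis_weights(train_logp, rollout_logp, cap=2.0):
    """Truncated IS: clip each token's ratio pi_train / pi_rollout at `cap`."""
    return [min(math.exp(t - r), cap) for t, r in zip(train_logp, rollout_logp)]

def mis_weights(train_logp, rollout_logp, cap=2.0):
    """Masked IS: give weight 0 to tokens whose ratio exceeds `cap`."""
    weights = []
    for t, r in zip(train_logp, rollout_logp):
        ratio = math.exp(t - r)
        weights.append(ratio if ratio <= cap else 0.0)
    return weights

# Made-up per-token log probs for one three-token response.
train_logp = [-1.0, -0.5, -2.0]    # from the training engine
rollout_logp = [-1.0, -2.0, -1.5]  # from the inference engine

tis = tis_weights(train_logp, rollout_logp)  # second token is clipped to the cap
mis = mis_weights(train_logp, rollout_logp)  # second token is dropped entirely
```

Both variants bound the influence of tokens on which the two engines disagree strongly; they differ only in whether such tokens contribute a capped gradient or none at all.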
Installation
# Recommended: Docker
docker pull radixark/miles:latest
docker run --rm --gpus all --ipc=host --shm-size=16g \
-it radixark/miles:latest /bin/bash
# From source
git clone https://github.com/radixark/miles.git
cd miles
pip install -r requirements.txt
pip install -e .
Quick Start
miles inherits slime's configuration system. Basic training:
python train.py \
--advantage-estimator grpo \
--model-name qwen3-30b-a3b \
--hf-checkpoint /path/to/qwen3-30b-a3b-hf \
--rollout-batch-size 512 \
--n-samples-per-prompt 8
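The `--advantage-estimator grpo` and `--n-samples-per-prompt 8` flags suggest group-relative advantages, where each sample is scored against the other samples drawn for the same prompt. A minimal sketch of the commonly published GRPO normalization follows. The function name `grpo_advantages`, the `eps` constant, and the reward values are illustrative assumptions; the estimator in miles and slime may differ in detail.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages for the samples of ONE prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Eight samples for one prompt (matching --n-samples-per-prompt 8); rewards are made up.
group_rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
advantages = grpo_advantages(group_rewards)
```

Samples that beat the group mean receive positive advantages and the rest negative ones, so no separate value model is needed. A group in which every sample earns the same reward yields all-zero advantages.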
Workflow 1: Large MoE Training
Use this workflow for training large MoE models like DeepSeek V3 or Qwen3-MoE.
Prerequisites Checklist
- [ ] H100/H200 GPUs with FP8 support
- [ ] MoE model (DeepSeek V3, Qwen3-MoE)
- [ ] Docker environment with miles
Step 1: Environment Setup
# FP8 block scaling (recommended for stability)
export NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
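To show why block scaling tends to help stability, here is a toy sketch of the general idea: each block of values gets its own scale, so an outlier in one block does not consume the dynamic range of the others. The only hard fact used is that 448 is the largest finite magnitude in FP8 E4M3. The function name, the block size, and the data are assumptions for illustration, and the sketch says nothing about what the environment variable above changes internally.

```python
E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def block_scales(values, block_size):
    """One scale per block, chosen so the block's max magnitude maps to E4M3_MAX."""
    scales = []
    for start in range(0, len(values), block_size):
        amax = max(abs(v) for v in values[start:start + block_size])
        scales.append(amax / E4M3_MAX if amax > 0 else 1.0)
    return scales

# First block holds small values, second block holds an outlier (toy data).
values = [0.001, -0.002, 0.0015, 0.0005, 300.0, -448.0, 12.0, 1.0]

per_tensor = block_scales(values, len(values))[0]  # single scale, set by the outlier
per_block = block_scales(values, 4)                # small block keeps its own scale
```

With a single per-tensor scale, the small values are divided by the outlier's scale and land near the bottom of the FP8 range; with per-block scales they are stretched to use the full range.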
Step 2: Configure Training
python train.py \
--actor-num-gpus-per-node 8 \
--rollout-num-gpus 8 \
--hf-checkpoint /path/to/deepseek-v3 \
--advantage-estimator grpo \
--tensor-model-parallel-size 8 \
--expert-model-parallel-size 4 \
--prompt-data /path/to/data.jsonl \
--num-rollout 3000
Verification Checklist
- [ ] Model loads without errors
- [ ] Routing decisions are consistent
- [ ] No NaN/Inf in loss values
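The last checklist item can be automated with a few lines. This sketch assumes the loss values have already been collected as Python floats (for example, parsed from training logs); how miles exposes them is not covered here, and the helper name is hypothetical.

```python
import math

def first_non_finite(losses):
    """Return the index of the first NaN/Inf loss, or None if every loss is finite."""
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            return step
    return None

# Toy loss history: the third value has diverged.
bad_step = first_non_finite([0.92, 0.71, float("nan"), 0.55])
```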
Workflow 2: Speculative RL Training
Use this workflow for maximum rollout throughput with EAGLE speculative decoding.
How Speculative RL Works
- Small draft model generates candidate tokens
- Target model verifies in parallel
- Draft model updated via online SFT to track policy
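The first two steps above can be sketched as a draft-then-verify loop. This is a deliberately simplified greedy variant with toy "models" that map a token prefix to the next token. It is not EAGLE itself and not the SGLang implementation, and all names are hypothetical.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token prefix to the next token

def speculative_step(prefix: List[int], draft: NextToken, target: NextToken,
                     num_draft_tokens: int) -> List[int]:
    """One draft-then-verify step with greedy acceptance; returns the new tokens."""
    # 1. The draft model proposes a short continuation autoregressively.
    proposed: List[int] = []
    for _ in range(num_draft_tokens):
        proposed.append(draft(prefix + proposed))

    # 2. The target model checks every position (one batched pass in practice).
    accepted: List[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token and stop
            return accepted
        accepted.append(tok)

    # All drafts accepted: the same target pass also yields one extra token.
    accepted.append(target(prefix + accepted))
    return accepted

def target(prefix: List[int]) -> int:
    return (prefix[-1] + 1) % 10  # toy target: count upwards modulo 10

def draft(prefix: List[int]) -> int:
    return 0 if prefix[-1] == 3 else (prefix[-1] + 1) % 10  # disagrees after a 3

new_tokens = speculative_step([0], draft, target, num_draft_tokens=4)
```

The accepted tokens always equal what the target model would have produced on its own, so the draft model affects only speed. The more often the draft agrees with the target, the more tokens each verification pass yields, which is why keeping the draft aligned with the moving policy matters.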
Step 1: Enable Speculative Decoding
miles supports EAGLE speculative decoding via SGLang:
python train.py \
--actor-num-gpus-per-node 8 \
--hf-checkpoint /path/to/target-model \
--sglang-speculative-algorithm EAGLE \
--sglang-speculative-num-steps 3 \
--sglang-speculative-eagle-topk 1 \
--sglang-speculative-num-draft-tokens 4 \
--sglang-speculative-draft-model-path /path/to/draft-model \
--advantage-estimator grpo \
--prompt-data /path/to/data.jsonl
Step 2: Enable Online MTP Training (Optional)
For online SFT of draft model during training:
--mtp-num-layers 1 \
--enable-mtp-training \
--mtp-loss-scaling-factor 0.2
Note: Online MTP training requires a torch dist checkpoint with MTP weights. Add --mtp-num-layers 1 during checkpoint conversion from HuggingFace.
Expected Speedup
- Standard rollout: Baseline
- Speculative RL: 25-40% faster rollout
- With partial rollout: Additional 10-15% throughput
Configuration Reference
miles inherits all slime arguments. See slime API Reference for the complete list.
Cluster Resources (from slime)
--actor-num-nodes 1
--actor-num-gpus-per-node 8
--rollout-num-gpus 8
--rollout-num-gpus-per-engine 2
--colocate
Megatron Parallelism (from slime)
--tensor-model-parallel-size 8
--pipeline-model-parallel-size 2
--expert-model-parallel-size 4 # MoE expert parallelism
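As a sanity check for these sizes, the number of training GPUs must be a multiple of the tensor-parallel size times the pipeline-parallel size, and the quotient is the data-parallel size. The helper below is a hypothetical sketch of that arithmetic only; it ignores context parallelism and the additional constraints that expert parallelism imposes.

```python
def data_parallel_size(world_size: int, tp: int, pp: int) -> int:
    """Number of model replicas left after tensor and pipeline parallelism."""
    model_parallel = tp * pp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world size {world_size} is not a multiple of TP*PP={model_parallel}"
        )
    return world_size // model_parallel

# TP=8 and PP=2 need 16 GPUs per replica; 4 nodes x 8 GPUs gives 2 replicas.
dp = data_parallel_size(world_size=4 * 8, tp=8, pp=2)
```

By this arithmetic, a single 8-GPU node cannot host TP=8 together with PP=2, so the values listed in this reference are independent examples rather than one consistent configuration.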
Speculative Decoding (miles-specific)
--sglang-speculative-algorithm EAGLE
--sglang-speculative-num-steps 3
--sglang-speculative-eagle-topk 1
--sglang-speculative-num-draft-tokens 4
--sglang-enable-draft-weights-cpu-backup
--sglang-speculative-draft-model-path /your/draft/model/path
Online MTP Training (miles-specific)
--mtp-num-layers 1
--enable-mtp-training
--mtp-loss-scaling-factor 0.2
Key Features (Conceptual)
The following features are documented in miles, but the specific CLI flags may vary. Consult the miles repository for the latest configuration.
Unified FP8 Pipeline
End-to-end FP8 sampling and training that eliminates quantization-induced discrepancy causing RL collapse in MoE models.
Rollout Routing Replay (R3)
Records expert routing decisions during SGLang inference and replays them during Megatron training for bit-wise expert alignment.
How R3 Works:
- During SGLang inference, expert routing decisions are recorded
- Routing decisions stored in sample.rollout_routed_experts
- During Megatron training, routing is replayed instead of recomputed
- Ensures identical expert selection between train and inference
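The record-and-replay idea above can be sketched in plain Python. This toy version handles a single MoE layer with top-k routing; the function names and logits are illustrative assumptions, and only the field name `rollout_routed_experts` comes from the text.

```python
def route_top_k(router_logits, k):
    """Pick the k experts with the highest router logits (ties broken by index)."""
    ranked = sorted(range(len(router_logits)), key=lambda e: (-router_logits[e], e))
    return sorted(ranked[:k])

def forward_routing(router_logits_per_token, k, replay=None):
    """Return the chosen experts per token, reusing `replay` when it is given."""
    if replay is not None:
        return [list(experts) for experts in replay]  # training: replay, do not recompute
    return [route_top_k(logits, k) for logits in router_logits_per_token]

# Inference: routing is computed and recorded alongside the sample.
inference_logits = [[0.10, 0.90, 0.50, 0.49], [0.70, 0.20, 0.60, 0.10]]
rollout_routed_experts = forward_routing(inference_logits, k=2)

# Training: tiny numeric differences would flip expert 2 and 3 for the first token.
training_logits = [[0.10, 0.90, 0.49, 0.50], [0.70, 0.20, 0.60, 0.10]]
recomputed = forward_routing(training_logits, k=2)
replayed = forward_routing(training_logits, k=2, replay=rollout_routed_experts)
```

Recomputing the routing sends the first token to a different expert than the one used during rollout, whereas replaying the recorded choices keeps both passes on the same experts.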
INT4 Quantization-Aware Training
Enables single-machine deployment of 1TB+ models (e.g., on H200).
Memory Savings with INT4:
| Model Size | BF16 VRAM | INT4 VRAM | Reduction |
|---|---|---|---|
| 70B | 140GB | 45GB | 3.1x |
| 235B | 470GB | 150GB | 3.1x |
| 671B | 1.3TB | 420GB | 3.1x |
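The table can be cross-checked with simple arithmetic: raw weight storage is the parameter count times bits per parameter divided by 8. The sketch below uses a hypothetical helper `weight_gb` and takes 1 GB as 1e9 bytes. The BF16 column matches this raw figure, while the INT4 column is larger than the raw 0.5 bytes per parameter; the source does not say what the difference covers, so treating it as overhead such as quantization scales or unquantized layers is an assumption.

```python
def weight_gb(params_billion, bits_per_param):
    """Raw weight storage in GB (1 GB = 1e9 bytes), ignoring all overhead."""
    return params_billion * bits_per_param / 8

# Model size in billions of parameters -> (BF16 GB, INT4 GB) as listed in the table.
table = {70: (140, 45), 235: (470, 150), 671: (1300, 420)}

for size_b, (bf16_gb, int4_gb) in table.items():
    raw_bf16 = weight_gb(size_b, 16)
    raw_int4 = weight_gb(size_b, 4)
    print(f"{size_b}B: raw BF16 {raw_bf16:.0f} GB, raw INT4 {raw_int4:.1f} GB, "
          f"table reduction {bf16_gb / int4_gb:.1f}x")
```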
Train-Inference Alignment
miles achieves "exactly 0 KL divergence" between training and inference through:
- Flash Attention 3
- DeepGEMM
- Batch-invariant kernels from Thinking Machines Lab
- torch.compile integration
Sample Data Structure
miles uses the same Sample dataclass as slime with the rollout_routed_experts field for MoE routing replay:
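from __future__ import annotations  # keeps `Status` as an unevaluated forward reference
from dataclasses import dataclass

@dataclass
class Sample:
    prompt: str | list[dict]
    tokens: list[int]
    response: str
    reward: float | dict
    loss_mask: list[int]
    status: Status  # defined in slime; see the slime API Reference
    metadata: dict
    rollout_log_probs: list[float]
    rollout_routed_experts: list[list[int]]  # MoE routing for R3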
See slime API Reference for the complete Sample definition.
Common Issues and Solutions
Issue: FP8 Training Collapse
Symptoms: Loss explodes, NaN values
Solutions:
- Use block scaling: export NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1
- Reduce learning rate: --lr 5e-7
- Ensure MoE routing is consistent between train/inference
Issue: Speculative Draft Drift
Symptoms: Low acceptance rate over time
Solutions:
- Enable online MTP training to keep draft model aligned
- Reduce speculative steps: --sglang-speculative-num-steps 2
- Use CPU backup: --sglang-enable-draft-weights-cpu-backup
Issue: Train-Inference Mismatch
Symptoms: Policy divergence, reward collapse
Solutions:
- Use TIS for off-policy correction: --use-tis --tis-threshold 0.9
- Verify log probs match between SGLang and Megatron
- Enable R3 for MoE models
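The log-prob verification step can be approximated offline by comparing per-token log probs from the two engines. In this sketch the rollout values would come from `rollout_log_probs`; how the training-side values are exported is an assumption, and the helper name and summary keys are hypothetical.

```python
def logprob_mismatch(rollout_log_probs, train_log_probs):
    """Summarise per-token disagreement between two log-prob sequences."""
    if not rollout_log_probs or len(rollout_log_probs) != len(train_log_probs):
        raise ValueError("sequences must be non-empty and cover the same tokens")
    diffs = [t - r for r, t in zip(rollout_log_probs, train_log_probs)]
    n = len(diffs)
    return {
        "max_abs_diff": max(abs(d) for d in diffs),
        "mean_abs_diff": sum(abs(d) for d in diffs) / n,
        # Sample estimate of KL(rollout || train) on tokens drawn from the rollout policy
        "kl_estimate": sum(-d for d in diffs) / n,
    }

# Made-up values: the engines disagree on the first token only.
report = logprob_mismatch([-1.0, -2.0], [-1.5, -2.0])
```

When the two engines agree exactly, all three numbers are zero; a growing `max_abs_diff` over training is an early sign of the mismatch described above.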
Supported Models
| Family | Models | MoE Support |
|---|---|---|
| DeepSeek | R1, V3, V3.2 | Full |
| Qwen | 2, 2.5, 3 (including MoE) | Full |
| Llama | 3, 3.1, 3.3, 4 | Dense only |
| Gemma | 2, 3, 3N | Dense only |
| GLM | 4.5, 4.6, 4.7 | Dense only |
| MiniMax | M2, M2.1 | Full |
Resources
- GitHub: https://github.com/radixark/miles
- Introduction Blog: https://lmsys.org/blog/2025-11-19-miles/
- Slime (upstream): https://github.com/THUDM/slime
- SGLang: https://github.com/sgl-project/sglang
Bundled files
※ Files contained in the ZIP. In addition to `SKILL.md` itself, the archive may include reference materials, samples, and scripts.
- 📄 SKILL.md (8,900 bytes)
- 📎 references/api-reference.md (4,137 bytes)
- 📎 references/troubleshooting.md (5,814 bytes)