🌋 verl(Volcano Engine RL)
Volcano Engine の RLHF/GRPO/PPO LLM強化学習フレームワーク verl のSkill。
📺 まず動画で見る(YouTube)
▶ 【衝撃】最強のAIエージェント「Claude Code」の最新機能・使い方・プログラミングをAIで効率化する超実践術を解説! ↗
※ jpskill.com 編集部が参考用に選んだ動画です。動画の内容と Skill の挙動は厳密には一致しないことがあります。
📜 元の英語説明(参考)
Provides guidance for training LLMs with reinforcement learning using verl (Volcano Engine RL). Use when implementing RLHF, GRPO, PPO, or other RL algorithms for LLM post-training at scale with flexible infrastructure backends.
🇯🇵 日本人クリエイター向け解説
Volcano Engine の RLHF/GRPO/PPO LLM強化学習フレームワーク verl のSkill。
※ jpskill.com 編集部が日本のビジネス現場向けに補足した解説です。Skill本体の挙動とは独立した参考情報です。
⚠️ ダウンロード・利用は自己責任でお願いします。当サイトは内容・動作・安全性について責任を負いません。
🎯 このSkillでできること
下記の説明文を読むと、このSkillがあなたに何をしてくれるかが分かります。Claudeにこの分野の依頼をすると、自動で発動します。
📦 インストール方法 (3ステップ)
- 1. 上の「ダウンロード」ボタンを押して .skill ファイルを取得
- 2. ファイル名の拡張子を .skill から .zip に変えて展開(macは自動展開可)
- 3. 展開してできたフォルダを、ホームフォルダの
.claude/skills/に置く- · macOS / Linux:
~/.claude/skills/ - · Windows:
%USERPROFILE%\.claude\skills\
- · macOS / Linux:
Claude Code を再起動すれば完了。「このSkillを使って…」と話しかけなくても、関連する依頼で自動的に呼び出されます。
詳しい使い方ガイドを見る →- 最終更新
- 2026-05-17
- 取得日時
- 2026-05-17
- 同梱ファイル
- 3
💬 こう話しかけるだけ — サンプルプロンプト
- › verl(Volcano Engine RL) を使って、最小構成のサンプルコードを示して
- › verl(Volcano Engine RL) の主な使い方と注意点を教えて
- › verl(Volcano Engine RL) を既存プロジェクトに組み込む方法を教えて
これをClaude Code に貼るだけで、このSkillが自動発動します。
📖 Claude が読む原文 SKILL.md(中身を展開)
この本文は AI(Claude)が読むための原文(英語または中国語)です。日本語訳は順次追加中。
verl: Volcano Engine Reinforcement Learning for LLMs
verl is a flexible, efficient, and production-ready RL training library for large language models from ByteDance's Seed team. It implements the HybridFlow framework (EuroSys 2025) and powers models like Doubao-1.5-pro achieving O1-level performance on math benchmarks.
When to Use verl
Choose verl when you need:
- Production-ready RL training at scale (tested up to 671B parameters)
- Flexibility to swap backends (FSDP ↔ Megatron-LM ↔ vLLM ↔ SGLang)
- Support for multiple RL algorithms (PPO, GRPO, RLOO, REINFORCE++, DAPO)
- Multi-turn rollout with tool calling for agentic workflows
- Vision-language model RL training
Consider alternatives when:
- You need Megatron-native training → use slime or miles
- You want PyTorch-native abstractions with Monarch → use torchforge
- You only need simple SFT/DPO → use TRL or Axolotl
Key Features
- Training backends: FSDP, FSDP2, Megatron-LM
- Rollout engines: vLLM, SGLang, HuggingFace Transformers
- Algorithms: PPO, GRPO, DAPO, RLOO, ReMax, REINFORCE++, SPIN, SPPO
- Models: Qwen-3, Llama-3.1, DeepSeek, Gemma-2 (0.5B to 671B)
- Advanced: LoRA RL, sequence parallelism, expert parallelism, multi-turn tools
Installation
# Option 1: pip install
pip install verl[vllm] # or verl[sglang] for SGLang backend
# Option 2: Docker (recommended for production)
docker pull verlai/verl:vllm011.latest
# Option 3: From source
git clone https://github.com/volcengine/verl.git
cd verl && pip install -e .[vllm,math]
Quick Start: GRPO Training
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=~/data/gsm8k/train.parquet \
actor_rollout_ref.model.path=Qwen/Qwen2.5-7B \
actor_rollout_ref.rollout.n=8 \
actor_rollout_ref.actor.use_kl_loss=True \
trainer.n_gpus_per_node=8
Core Architecture
verl uses a HybridFlow programming model separating control flow from computation:
┌─────────────────────────────────────────────────────────┐
│ Single-Process Controller (Ray) │
│ - Orchestrates: rollout → reward → train → sync │
└─────────────────────┬───────────────────────────────────┘
│
┌─────────────────────▼───────────────────────────────────┐
│ Multi-Process Workers │
│ ├── ActorRolloutRefWorker (policy + generation) │
│ ├── CriticWorker (value estimation, PPO only) │
│ └── RewardManager (model-based or rule-based rewards) │
└─────────────────────────────────────────────────────────┘
Workflow 1: Math Reasoning with GRPO
Use this workflow for training reasoning models on math tasks like GSM8K or MATH.
Prerequisites Checklist
- [ ] GPU cluster with 8+ GPUs (H100 recommended)
- [ ] Dataset in parquet format with
promptandreward_modelcolumns - [ ] Base model from HuggingFace Hub
Step 1: Prepare Dataset
import pandas as pd
data = [
{
"prompt": [{"role": "user", "content": "What is 15 + 27?"}],
"reward_model": {"ground_truth": "42"}
},
# ... more examples
]
df = pd.DataFrame(data)
df.to_parquet("train.parquet")
Step 2: Define Reward Function
# reward_function.py
import re
def compute_reward(responses, ground_truths):
rewards = []
for response, gt in zip(responses, ground_truths):
# Extract answer from response
match = re.search(r'\\boxed{([^}]+)}', response)
if match and match.group(1).strip() == gt.strip():
rewards.append(1.0)
else:
rewards.append(0.0)
return rewards
Step 3: Create Training Config
# config/grpo_math.yaml
algorithm:
adv_estimator: grpo
gamma: 1.0
lam: 1.0
data:
train_files: /path/to/train.parquet
val_files: /path/to/val.parquet
train_batch_size: 256
max_prompt_length: 512
max_response_length: 2048
actor_rollout_ref:
model:
path: Qwen/Qwen2.5-7B-Instruct
actor:
use_kl_loss: true
kl_loss_coef: 0.001
ppo_mini_batch_size: 64
rollout:
name: vllm
n: 8 # samples per prompt
temperature: 0.7
top_p: 0.95
trainer:
total_epochs: 3
n_gpus_per_node: 8
save_freq: 100
Step 4: Launch Training
python3 -m verl.trainer.main_ppo \
--config-path config \
--config-name grpo_math \
trainer.experiment_name=grpo_math_qwen7b
Step 5: Monitor and Validate
- [ ] Check WandB/TensorBoard for loss curves
- [ ] Verify reward is increasing over steps
- [ ] Run evaluation on held-out test set
Workflow 2: PPO with Critic Model
Use this workflow when you need value-based advantage estimation (GAE).
Key Differences from GRPO
- Requires separate critic model
- Uses Generalized Advantage Estimation (GAE)
- Better for tasks with dense rewards
Configuration
algorithm:
adv_estimator: gae # Use GAE instead of GRPO
gamma: 0.99
lam: 0.95
critic:
model:
path: Qwen/Qwen2.5-7B-Instruct # Can be same or different from actor
ppo_mini_batch_size: 64
actor_rollout_ref:
actor:
use_kl_loss: true
kl_loss_coef: 0.02
clip_ratio: 0.2 # PPO clipping
Launch with Critic
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=gae \
critic.model.path=Qwen/Qwen2.5-7B-Instruct \
trainer.n_gpus_per_node=8
Workflow 3: Large-Scale Training with Megatron
Use this workflow for models >70B parameters or when you need expert parallelism.
Prerequisites
- [ ] Install Megatron-LM bridge:
pip install mbridge - [ ] Convert model to Megatron format
- [ ] Multi-node cluster with NVLink/InfiniBand
Configuration for 70B+ Models
actor_rollout_ref:
model:
path: /path/to/megatron/checkpoint
backend: megatron
actor:
strategy: megatron
tensor_model_parallel_size: 8
pipeline_model_parallel_size: 2
rollout:
name: vllm
tensor_parallel_size: 8
Launch Multi-Node
# On head node
ray start --head --port=6379
# On worker nodes
ray start --address='head_ip:6379'
# Launch training
python3 -m verl.trainer.main_ppo \
trainer.nnodes=4 \
trainer.n_gpus_per_node=8
Configuration Reference
Algorithm Selection
| Algorithm | adv_estimator |
Use Case |
|---|---|---|
| GRPO | grpo |
Critic-free, math/reasoning |
| PPO/GAE | gae |
Dense rewards, value estimation |
| REINFORCE++ | reinforce_plus_plus |
Variance reduction |
| RLOO | rloo |
Leave-one-out baseline |
| ReMax | remax |
Maximum reward baseline |
| OPO | opo |
Optimal policy optimization |
Key Parameters
# Rollout parameters
actor_rollout_ref.rollout.n: 8 # Samples per prompt
actor_rollout_ref.rollout.temperature: 0.7 # Sampling temperature
actor_rollout_ref.rollout.top_p: 0.95 # Nucleus sampling
# Training parameters
actor_rollout_ref.actor.lr: 1e-6 # Learning rate
actor_rollout_ref.actor.ppo_mini_batch_size: 64
actor_rollout_ref.actor.clip_ratio: 0.2 # PPO clip range
# KL control
actor_rollout_ref.actor.use_kl_loss: true
actor_rollout_ref.actor.kl_loss_coef: 0.001
algorithm.kl_ctrl.target_kl: 0.1 # For adaptive KL control
Common Issues and Solutions
Issue: OOM During Rollout
Symptoms: CUDA out of memory during generation phase
Solutions:
# Reduce batch size
actor_rollout_ref.rollout.log_prob_micro_batch_size: 4
# Enable gradient checkpointing
actor_rollout_ref.model.enable_gradient_checkpointing: true
# Use FSDP2 with CPU offloading
actor_rollout_ref.actor.strategy: fsdp2
actor_rollout_ref.actor.fsdp_config.offload_policy: true
Issue: Training Instability
Symptoms: Loss spikes, reward collapse
Solutions:
# Reduce learning rate
actor_rollout_ref.actor.lr: 5e-7
# Increase KL penalty
actor_rollout_ref.actor.kl_loss_coef: 0.01
# Enable gradient clipping
actor_rollout_ref.actor.max_grad_norm: 1.0
Issue: Slow Weight Sync
Symptoms: Long pauses between rollout and training
Solutions:
# Use FSDP2 for faster resharding
actor_rollout_ref.actor.strategy=fsdp2
# Enable async weight transfer
trainer.async_weight_update=true
Issue: vLLM Version Mismatch
Symptoms: Import errors or generation failures
Solution: Use compatible versions:
pip install vllm>=0.8.5,<=0.12.0
# Avoid vLLM 0.7.x (known bugs)
Advanced Topics
Multi-Turn Tool Calling
See references/multi-turn.md for agentic workflows with tool use.
Vision-Language Models
actor_rollout_ref:
model:
path: Qwen/Qwen2.5-VL-7B-Instruct
rollout:
name: vllm
enable_vision: true
LoRA Training
actor_rollout_ref:
actor:
lora:
enabled: true
r: 16
alpha: 32
target_modules: ["q_proj", "v_proj"]
Resources
- Documentation: https://verl.readthedocs.io/
- Paper: https://arxiv.org/abs/2409.19256
- GitHub: https://github.com/volcengine/verl
- Recipes: https://github.com/verl-project/verl-recipe (DAPO, GSPO, etc.)
- Community: Slack at verl-project
同梱ファイル
※ ZIPに含まれるファイル一覧。`SKILL.md` 本体に加え、参考資料・サンプル・スクリプトが入っている場合があります。
- 📄 SKILL.md (10,323 bytes)
- 📎 references/api-reference.md (6,941 bytes)
- 📎 references/troubleshooting.md (6,792 bytes)