qwen3-tts-mlx
Local Qwen3-TTS speech synthesis on Apple Silicon via MLX. Use for offline narration, audiobooks, video voiceovers, and multilingual TTS.
下記のコマンドをコピーしてターミナル(Mac/Linux)または PowerShell(Windows)に貼り付けてください。 ダウンロード → 解凍 → 配置まで全自動。
mkdir -p ~/.claude/skills && cd ~/.claude/skills && curl -L -o qwen3-tts-mlx.zip https://jpskill.com/download/10459.zip && unzip -o qwen3-tts-mlx.zip && rm qwen3-tts-mlx.zip
$d = "$env:USERPROFILE\.claude\skills"; ni -Force -ItemType Directory $d | Out-Null; iwr https://jpskill.com/download/10459.zip -OutFile "$d\qwen3-tts-mlx.zip"; Expand-Archive "$d\qwen3-tts-mlx.zip" -DestinationPath $d -Force; ri "$d\qwen3-tts-mlx.zip"
完了後、Claude Code を再起動 → 普通に「動画プロンプト作って」のように話しかけるだけで自動発動します。
💾 手動でダウンロードしたい(コマンドが難しい人向け)
- 1. 下の青いボタンを押して
qwen3-tts-mlx.zipをダウンロード - 2. ZIPファイルをダブルクリックで解凍 →
qwen3-tts-mlxフォルダができる - 3. そのフォルダを
C:\Users\あなたの名前\.claude\skills\(Win)または~/.claude/skills/(Mac)へ移動 - 4. Claude Code を再起動
⚠️ ダウンロード・利用は自己責任でお願いします。当サイトは内容・動作・安全性について責任を負いません。
🎯 このSkillでできること
下記の説明文を読むと、このSkillがあなたに何をしてくれるかが分かります。Claudeにこの分野の依頼をすると、自動で発動します。
📦 インストール方法 (3ステップ)
- 1. 上の「ダウンロード」ボタンを押して .skill ファイルを取得
- 2. ファイル名の拡張子を .skill から .zip に変えて展開(macは自動展開可)
- 3. 展開してできたフォルダを、ホームフォルダの
.claude/skills/に置く- · macOS / Linux:
~/.claude/skills/ - · Windows:
%USERPROFILE%\.claude\skills\
- · macOS / Linux:
Claude Code を再起動すれば完了。「このSkillを使って…」と話しかけなくても、関連する依頼で自動的に呼び出されます。
詳しい使い方ガイドを見る →- 最終更新
- 2026-05-18
- 取得日時
- 2026-05-18
- 同梱ファイル
- 1
📖 Claude が読む原文 SKILL.md(中身を展開)
この本文は AI(Claude)が読むための原文(英語または中国語)です。日本語訳は順次追加中。
Qwen3-TTS MLX
Run Qwen3-TTS locally on Apple Silicon (M1/M2/M3/M4) using MLX. Supports 11 languages, 9 built-in voices, voice cloning, and voice design from text descriptions.
When to Use
- Generate speech fully offline on a Mac
- Produce narration, audiobooks, podcasts, or video voiceovers
- Create multilingual TTS with controllable style and emotion
- Clone any voice from a short audio sample
- Design custom voices from text descriptions
Quick Start
Install
pip install mlx-audio
brew install ffmpeg
Basic Usage
python scripts/run_tts.py custom-voice \
--text "Hello, welcome to local text to speech." \
--voice Ryan \
--output output.wav
With Style Control
python scripts/run_tts.py custom-voice \
--text "Breaking news: local AI model achieves human-level speech." \
--voice Uncle_Fu \
--instruct "news anchor tone, calm and authoritative" \
--output news.wav
Model Variants
| Variant | Model | Size | Memory | Use Case |
|---|---|---|---|---|
| CustomVoice | mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit |
~1GB | ~4GB | Built-in voices + style control (recommended) |
| VoiceDesign | mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-5bit |
~2GB | ~5GB | Create voices from text descriptions |
| Base | mlx-community/Qwen3-TTS-12Hz-0.6B-Base-4bit |
~1GB | ~4GB | Voice cloning from reference audio |
Supported Languages
| Language | Code | Notes |
|---|---|---|
| Auto-detect | auto |
Default, detects from text |
| Chinese | Chinese |
Mandarin |
| English | English |
|
| Japanese | Japanese |
|
| Korean | Korean |
|
| French | French |
|
| German | German |
|
| Spanish | Spanish |
|
| Portuguese | Portuguese |
|
| Italian | Italian |
|
| Russian | Russian |
Built-in Voices
| Voice | Language | Character |
|---|---|---|
| Vivian | Chinese | Female, bright, young |
| Serena | Chinese | Female, gentle, soft |
| Uncle_Fu | Chinese | Male, authoritative, news anchor |
| Dylan | Chinese | Male, Beijing dialect |
| Eric | Chinese | Male, Sichuan dialect |
| Ryan | English | Male, energetic |
| Aiden | English | Male, clear, neutral |
| Ono_Anna | Japanese | Female |
| Sohee | Korean | Female |
Voice Selection Guide:
| Scenario | Recommended Voice |
|---|---|
| Chinese news/narration | Uncle_Fu |
| Chinese casual/lively | Eric |
| Chinese female, professional | Vivian |
| Chinese female, storytelling | Serena |
| English energetic content | Ryan |
| English neutral/educational | Aiden |
| Japanese content | Ono_Anna |
| Korean content | Sohee |
Modes
1) CustomVoice
Use built-in voices with optional emotion/style control via --instruct.
python scripts/run_tts.py custom-voice \
--text "This is amazing news!" \
--voice Vivian \
--instruct "excited and happy" \
--output excited.wav
Style instruction examples:
"calm and warm"- Soft, friendly delivery"news anchor, authoritative"- Professional broadcast style"excited and energetic"- High energy, enthusiastic"sad and melancholic"- Emotional, somber tone"whispering, intimate"- Quiet, close-mic feel
2) VoiceDesign
Create a completely new voice by describing it in natural language.
python scripts/run_tts.py voice-design \
--text "Welcome to our podcast." \
--instruct "warm, mature male narrator with low pitch and gentle tone" \
--output podcast_intro.wav
Voice description examples:
"young cheerful female with high pitch""elderly wise male with deep resonant voice""professional female news anchor, clear articulation""friendly young male, casual and relaxed"
3) VoiceClone
Clone any voice from a reference audio sample (5-10 seconds recommended).
python scripts/run_tts.py voice-clone \
--text "This is my cloned voice speaking new content." \
--ref_audio reference.wav \
--ref_text "The exact transcript of the reference audio" \
--output cloned.wav
Tips for voice cloning:
- Use clean audio without background noise
- 5-10 seconds of speech works best
- Provide accurate transcript of the reference
- Reference and output language should match
CLI Parameters
| Parameter | Required | Default | Description |
|---|---|---|---|
--text |
Yes | - | Text to synthesize |
--voice |
No | Vivian | Built-in voice (CustomVoice only) |
--lang_code |
No | auto | Language code |
--instruct |
No | - | Style control or voice description |
--speed |
No | 1.0 | Speech speed multiplier |
--temperature |
No | 0.7 | Sampling temperature (higher = more variation) |
--model |
No | (per mode) | Override default model |
--output |
No | - | Output file path |
--out-dir |
No | ./outputs | Output directory when --output not set |
--ref_audio |
VoiceClone | - | Reference audio file |
--ref_text |
VoiceClone | - | Reference audio transcript |
Python API
Using generate_audio (recommended)
from mlx_audio.tts.generate import generate_audio
# CustomVoice with style control
generate_audio(
text="Hello from Qwen3-TTS!",
model="mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit",
voice="Ryan",
lang_code="english",
instruct="friendly and warm",
output_path=".",
file_prefix="hello",
audio_format="wav",
join_audio=True,
verbose=True,
)
Using Model directly
from mlx_audio.tts.utils import load
import soundfile as sf
import numpy as np
# Load model
model = load("mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit")
# Generate audio (returns a generator)
audio_chunks = []
for chunk in model.generate_custom_voice(
text="Hello from Qwen3-TTS.",
speaker="Ryan",
language="english",
instruct="clear, steady delivery"
):
if hasattr(chunk, 'audio') and chunk.audio is not None:
audio_chunks.append(chunk.audio)
# Combine and save
audio = np.concatenate(audio_chunks)
sf.write("output.wav", audio, 24000)
VoiceDesign
from mlx_audio.tts.generate import generate_audio
generate_audio(
text="Welcome to the show.",
model="mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-5bit",
instruct="warm, friendly female narrator with medium pitch",
lang_code="english",
output_path=".",
file_prefix="voice_design",
join_audio=True,
)
VoiceClone
from mlx_audio.tts.generate import generate_audio
generate_audio(
text="New content in the cloned voice.",
model="mlx-community/Qwen3-TTS-12Hz-0.6B-Base-4bit",
ref_audio="reference.wav",
ref_text="Transcript of the reference audio",
output_path=".",
file_prefix="cloned",
join_audio=True,
)
Batch Processing
Use scripts/batch_dubbing.py for processing multiple lines:
python scripts/batch_dubbing.py \
--input dubbing.json \
--out-dir outputs
See references/dubbing_format.md for the JSON format.
Performance
| Metric | Value |
|---|---|
| Sample rate | 24,000 Hz |
| Real-time factor | ~0.7x (faster than real-time) |
| Peak memory | ~4-6 GB |
| First run | Downloads model (~1-2GB) |
Troubleshooting
| Issue | Solution |
|---|---|
| Slow generation | Use 4-bit CustomVoice model |
| Unnatural pauses | Add punctuation, keep sentences short |
| Wrong language detected | Specify --lang_code explicitly |
| Voice cloning quality | Use cleaner reference audio, accurate transcript |
| Tokenizer warnings | Harmless, can be ignored |
| Out of memory | Close other apps, use 4-bit model |