jpskill.com
💬 コミュニケーション コミュニティ

qwen3-tts-mlx

Local Qwen3-TTS speech synthesis on Apple Silicon via MLX. Use for offline narration, audiobooks, video voiceovers, and multilingual TTS.

⚡ おすすめ: コマンド1行でインストール(60秒)

下記のコマンドをコピーしてターミナル(Mac/Linux)または PowerShell(Windows)に貼り付けてください。 ダウンロード → 解凍 → 配置まで全自動。

🍎 Mac / 🐧 Linux
mkdir -p ~/.claude/skills && cd ~/.claude/skills && curl -L -o qwen3-tts-mlx.zip https://jpskill.com/download/10459.zip && unzip -o qwen3-tts-mlx.zip && rm qwen3-tts-mlx.zip
🪟 Windows (PowerShell)
$d = "$env:USERPROFILE\.claude\skills"; ni -Force -ItemType Directory $d | Out-Null; iwr https://jpskill.com/download/10459.zip -OutFile "$d\qwen3-tts-mlx.zip"; Expand-Archive "$d\qwen3-tts-mlx.zip" -DestinationPath $d -Force; ri "$d\qwen3-tts-mlx.zip"

完了後、Claude Code を再起動 → 普通に「動画プロンプト作って」のように話しかけるだけで自動発動します。

💾 手動でダウンロードしたい(コマンドが難しい人向け)
  1. 1. 下の青いボタンを押して qwen3-tts-mlx.zip をダウンロード
  2. 2. ZIPファイルをダブルクリックで解凍 → qwen3-tts-mlx フォルダができる
  3. 3. そのフォルダを C:\Users\あなたの名前\.claude\skills\(Win)または ~/.claude/skills/(Mac)へ移動
  4. 4. Claude Code を再起動

⚠️ ダウンロード・利用は自己責任でお願いします。当サイトは内容・動作・安全性について責任を負いません。

🎯 このSkillでできること

下記の説明文を読むと、このSkillがあなたに何をしてくれるかが分かります。Claudeにこの分野の依頼をすると、自動で発動します。

📦 インストール方法 (3ステップ)

  1. 1. 上の「ダウンロード」ボタンを押して .skill ファイルを取得
  2. 2. ファイル名の拡張子を .skill から .zip に変えて展開(macは自動展開可)
  3. 3. 展開してできたフォルダを、ホームフォルダの .claude/skills/ に置く
    • · macOS / Linux: ~/.claude/skills/
    • · Windows: %USERPROFILE%\.claude\skills\

Claude Code を再起動すれば完了。「このSkillを使って…」と話しかけなくても、関連する依頼で自動的に呼び出されます。

詳しい使い方ガイドを見る →
最終更新
2026-05-18
取得日時
2026-05-18
同梱ファイル
1
📖 Claude が読む原文 SKILL.md(中身を展開)

この本文は AI(Claude)が読むための原文(英語または中国語)です。日本語訳は順次追加中。

Qwen3-TTS MLX

Run Qwen3-TTS locally on Apple Silicon (M1/M2/M3/M4) using MLX. Supports 11 languages, 9 built-in voices, voice cloning, and voice design from text descriptions.

When to Use

  • Generate speech fully offline on a Mac
  • Produce narration, audiobooks, podcasts, or video voiceovers
  • Create multilingual TTS with controllable style and emotion
  • Clone any voice from a short audio sample
  • Design custom voices from text descriptions

Quick Start

Install

pip install mlx-audio
brew install ffmpeg

Basic Usage

python scripts/run_tts.py custom-voice \
  --text "Hello, welcome to local text to speech." \
  --voice Ryan \
  --output output.wav

With Style Control

python scripts/run_tts.py custom-voice \
  --text "Breaking news: local AI model achieves human-level speech." \
  --voice Uncle_Fu \
  --instruct "news anchor tone, calm and authoritative" \
  --output news.wav

Model Variants

Variant Model Size Memory Use Case
CustomVoice mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit ~1GB ~4GB Built-in voices + style control (recommended)
VoiceDesign mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-5bit ~2GB ~5GB Create voices from text descriptions
Base mlx-community/Qwen3-TTS-12Hz-0.6B-Base-4bit ~1GB ~4GB Voice cloning from reference audio

Supported Languages

Language Code Notes
Auto-detect auto Default, detects from text
Chinese Chinese Mandarin
English English
Japanese Japanese
Korean Korean
French French
German German
Spanish Spanish
Portuguese Portuguese
Italian Italian
Russian Russian

Built-in Voices

Voice Language Character
Vivian Chinese Female, bright, young
Serena Chinese Female, gentle, soft
Uncle_Fu Chinese Male, authoritative, news anchor
Dylan Chinese Male, Beijing dialect
Eric Chinese Male, Sichuan dialect
Ryan English Male, energetic
Aiden English Male, clear, neutral
Ono_Anna Japanese Female
Sohee Korean Female

Voice Selection Guide:

Scenario Recommended Voice
Chinese news/narration Uncle_Fu
Chinese casual/lively Eric
Chinese female, professional Vivian
Chinese female, storytelling Serena
English energetic content Ryan
English neutral/educational Aiden
Japanese content Ono_Anna
Korean content Sohee

Modes

1) CustomVoice

Use built-in voices with optional emotion/style control via --instruct.

python scripts/run_tts.py custom-voice \
  --text "This is amazing news!" \
  --voice Vivian \
  --instruct "excited and happy" \
  --output excited.wav

Style instruction examples:

  • "calm and warm" - Soft, friendly delivery
  • "news anchor, authoritative" - Professional broadcast style
  • "excited and energetic" - High energy, enthusiastic
  • "sad and melancholic" - Emotional, somber tone
  • "whispering, intimate" - Quiet, close-mic feel

2) VoiceDesign

Create a completely new voice by describing it in natural language.

python scripts/run_tts.py voice-design \
  --text "Welcome to our podcast." \
  --instruct "warm, mature male narrator with low pitch and gentle tone" \
  --output podcast_intro.wav

Voice description examples:

  • "young cheerful female with high pitch"
  • "elderly wise male with deep resonant voice"
  • "professional female news anchor, clear articulation"
  • "friendly young male, casual and relaxed"

3) VoiceClone

Clone any voice from a reference audio sample (5-10 seconds recommended).

python scripts/run_tts.py voice-clone \
  --text "This is my cloned voice speaking new content." \
  --ref_audio reference.wav \
  --ref_text "The exact transcript of the reference audio" \
  --output cloned.wav

Tips for voice cloning:

  • Use clean audio without background noise
  • 5-10 seconds of speech works best
  • Provide accurate transcript of the reference
  • Reference and output language should match

CLI Parameters

Parameter Required Default Description
--text Yes - Text to synthesize
--voice No Vivian Built-in voice (CustomVoice only)
--lang_code No auto Language code
--instruct No - Style control or voice description
--speed No 1.0 Speech speed multiplier
--temperature No 0.7 Sampling temperature (higher = more variation)
--model No (per mode) Override default model
--output No - Output file path
--out-dir No ./outputs Output directory when --output not set
--ref_audio VoiceClone - Reference audio file
--ref_text VoiceClone - Reference audio transcript

Python API

Using generate_audio (recommended)

from mlx_audio.tts.generate import generate_audio

# CustomVoice with style control
generate_audio(
    text="Hello from Qwen3-TTS!",
    model="mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit",
    voice="Ryan",
    lang_code="english",
    instruct="friendly and warm",
    output_path=".",
    file_prefix="hello",
    audio_format="wav",
    join_audio=True,
    verbose=True,
)

Using Model directly

from mlx_audio.tts.utils import load
import soundfile as sf
import numpy as np

# Load model
model = load("mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit")

# Generate audio (returns a generator)
audio_chunks = []
for chunk in model.generate_custom_voice(
    text="Hello from Qwen3-TTS.",
    speaker="Ryan",
    language="english",
    instruct="clear, steady delivery"
):
    if hasattr(chunk, 'audio') and chunk.audio is not None:
        audio_chunks.append(chunk.audio)

# Combine and save
audio = np.concatenate(audio_chunks)
sf.write("output.wav", audio, 24000)

VoiceDesign

from mlx_audio.tts.generate import generate_audio

generate_audio(
    text="Welcome to the show.",
    model="mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-5bit",
    instruct="warm, friendly female narrator with medium pitch",
    lang_code="english",
    output_path=".",
    file_prefix="voice_design",
    join_audio=True,
)

VoiceClone

from mlx_audio.tts.generate import generate_audio

generate_audio(
    text="New content in the cloned voice.",
    model="mlx-community/Qwen3-TTS-12Hz-0.6B-Base-4bit",
    ref_audio="reference.wav",
    ref_text="Transcript of the reference audio",
    output_path=".",
    file_prefix="cloned",
    join_audio=True,
)

Batch Processing

Use scripts/batch_dubbing.py for processing multiple lines:

python scripts/batch_dubbing.py \
  --input dubbing.json \
  --out-dir outputs

See references/dubbing_format.md for the JSON format.

Performance

Metric Value
Sample rate 24,000 Hz
Real-time factor ~0.7x (faster than real-time)
Peak memory ~4-6 GB
First run Downloads model (~1-2GB)

Troubleshooting

Issue Solution
Slow generation Use 4-bit CustomVoice model
Unnatural pauses Add punctuation, keep sentences short
Wrong language detected Specify --lang_code explicitly
Voice cloning quality Use cleaner reference audio, accurate transcript
Tokenizer warnings Harmless, can be ignored
Out of memory Close other apps, use 4-bit model