📦 Other · Community · 🟡 Takes some getting used to · 👤 For a broad range of users

📦 Heartmula


Just by entering lyrics and tags, like Suno

⏱ Assorted manual work: 1 day → 1 hour

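📺 Watch the video first (YouTube)

▶ [Complete Introduction to Claude Code] Usable by anyone / How to make use of Skills / Why executives in particular should use it

Note: This video was selected for reference by the jpskill.com editorial team. The video content and the Skill's behavior may not match exactly.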

📜 Original English description (for reference)

HeartMuLa: Suno-like song generation from lyrics + tags.

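🇯🇵 Explanation for Japanese creators

In a nutshell

Just by entering lyrics and tags, like Suno

Note: This explanation was added by the jpskill.com editorial team for Japanese business settings. It is reference information, independent of how the Skill itself behaves.

⚠️ Download and use at your own risk. This site accepts no responsibility for the content, operation, or safety of the Skill.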

🎯 What this Skill can do

The description below explains what this Skill does for you. It activates automatically when you ask Claude for something in this area.

📦 Installation (3 steps)

  1. Click the "Download" button above to get the .skill file
  2. Change the file extension from .skill to .zip and extract it (macOS can extract it automatically)
  3. Place the extracted folder in .claude/skills/ under your home folder
    • macOS / Linux: ~/.claude/skills/
    • Windows: %USERPROFILE%\.claude\skills\

Restart Claude Code and you are done. You do not need to say "use this Skill…"; it is invoked automatically for related requests.
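The three installation steps can also be scripted. The following is a minimal sketch, not an official installer: it assumes the .skill file is an ordinary zip archive (as the rename-to-.zip step implies) that contains a single top-level folder. The function name `install_skill` is hypothetical.

```python
import zipfile
from pathlib import Path
from typing import Optional


def install_skill(skill_file: Path, skills_dir: Optional[Path] = None) -> Path:
    """Extract a .skill archive into the Claude Code skills directory.

    Assumption: the archive is a plain zip file with one top-level folder.
    Only extract archives from sources you trust.
    """
    if skills_dir is None:
        # Path.home() resolves to ~ on macOS/Linux and %USERPROFILE% on Windows.
        skills_dir = Path.home() / ".claude" / "skills"
    skills_dir.mkdir(parents=True, exist_ok=True)
    # zipfile reads the archive regardless of its file extension,
    # so no rename to .zip is needed here.
    with zipfile.ZipFile(skill_file) as archive:
        archive.extractall(skills_dir)
    return skills_dir
```

Calling `install_skill(Path("heartmula.skill"))` would place the extracted folder under `~/.claude/skills/`; the file name is only an example. Claude Code still needs to be restarted afterwards.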

Last updated: 2026-05-17
Retrieved: 2026-05-17
Bundled files: 1

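💬 Just say this: sample prompts

  • Tell me how to use Heartmula
  • Show me concrete examples of what Heartmula can do
  • Walk a first-time user through the steps for using Heartmula

Paste any of these into Claude Code and this Skill activates automatically.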

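📖 The original SKILL.md that Claude reads (expanded)

This text is the original (English or Chinese) written for the AI (Claude) to read. Japanese translations are being added over time.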

HeartMuLa - Open-Source Music Generation

Overview

HeartMuLa is a family of open-source music foundation models (Apache-2.0) that generate full songs conditioned on lyrics and tags, with multilingual support. It is an open-source alternative comparable to Suno. The family includes:

  • HeartMuLa - Music language model (3B/7B) for generation from lyrics + tags
  • HeartCodec - 12.5Hz music codec for high-fidelity audio reconstruction
  • HeartTranscriptor - Whisper-based lyrics transcription
  • HeartCLAP - Audio-text alignment model

When to Use

  • User wants to generate music/songs from text descriptions
  • User wants an open-source Suno alternative
  • User wants local/offline music generation
  • User asks about HeartMuLa, heartlib, or AI music generation

Hardware Requirements

  • Minimum: 8GB VRAM with --lazy_load true (loads/unloads models sequentially)
  • Recommended: 16GB+ VRAM for comfortable single-GPU usage
  • Multi-GPU: Use --mula_device cuda:0 --codec_device cuda:1 to split across GPUs
  • 3B model with lazy_load peaks at ~6.2GB VRAM
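The requirements above can be read as a small decision rule. The sketch below is illustrative only: the 16 GB threshold and the flag names come from this document (the CPU flags from the GPU / CUDA section), while the function name `suggest_device_flags` and the order of the checks are assumptions.

```python
from typing import List


def suggest_device_flags(vram_gb: List[float]) -> List[str]:
    """Map per-GPU VRAM (in GB) to the CLI flags described in this document.

    Hypothetical helper: the decision order is an assumption, not upstream logic.
    """
    if not vram_gb:
        # No GPU: CPU mode works but is documented as extremely slow.
        return ["--mula_device", "cpu", "--codec_device", "cpu"]
    if len(vram_gb) >= 2:
        # Two or more GPUs: put HeartMuLa and HeartCodec on separate devices.
        return ["--mula_device", "cuda:0", "--codec_device", "cuda:1"]
    if vram_gb[0] >= 16:
        # Recommended tier: both models fit on one GPU.
        return ["--mula_device", "cuda", "--codec_device", "cuda"]
    # Below 16 GB: load and unload models sequentially to save VRAM.
    # Under 8 GB is below the documented minimum and may fail.
    return ["--mula_device", "cuda", "--codec_device", "cuda", "--lazy_load", "true"]
```

For example, `suggest_device_flags([8.0])` returns the single-GPU flags with `--lazy_load true` appended.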

Installation Steps

1. Clone Repository

cd ~/  # or desired directory
git clone https://github.com/HeartMuLa/heartlib.git
cd heartlib

2. Create Virtual Environment (Python 3.10 required)

uv venv --python 3.10 .venv
. .venv/bin/activate
uv pip install -e .

3. Fix Dependency Compatibility Issues

IMPORTANT: As of Feb 2026, the pinned dependencies have conflicts with newer packages. Apply these fixes:

# Upgrade datasets (old version incompatible with current pyarrow)
uv pip install --upgrade datasets

# Upgrade transformers (needed for huggingface-hub 1.x compatibility)
uv pip install --upgrade transformers

4. Patch Source Code (Required for transformers 5.x)

Patch 1 - RoPE cache fix in src/heartlib/heartmula/modeling_heartmula.py:

In the setup_caches method of the HeartMuLa class, add RoPE reinitialization after the reset_caches try/except block and before the with device: block:

# Re-initialize RoPE caches that were skipped during meta-device loading
from torchtune.models.llama3_1._position_embeddings import Llama3ScaledRoPE
for module in self.modules():
    if isinstance(module, Llama3ScaledRoPE) and not module.is_cache_built:
        module.rope_init()
        module.to(device)

Why: from_pretrained first creates the model on the meta device. Llama3ScaledRoPE.rope_init() skips building the cache on meta tensors, and the cache is never rebuilt after the weights are loaded onto the real device.

Patch 2 - HeartCodec loading fix in src/heartlib/pipelines/music_generation.py:

Add ignore_mismatched_sizes=True to ALL HeartCodec.from_pretrained() calls (there are 2: the eager load in __init__ and the lazy load in the codec property).

Why: The VQ codebook's initted buffers have shape [1] in the checkpoint but [] in the model. The data is the same (a one-element tensor versus a 0-d scalar tensor), so the mismatch is safe to ignore.

5. Download Model Checkpoints

cd heartlib  # project root
hf download --local-dir './ckpt' 'HeartMuLa/HeartMuLaGen'
hf download --local-dir './ckpt/HeartMuLa-oss-3B' 'HeartMuLa/HeartMuLa-oss-3B-happy-new-year'
hf download --local-dir './ckpt/HeartCodec-oss' 'HeartMuLa/HeartCodec-oss-20260123'

All 3 can be downloaded in parallel. Total size is several GB.
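Since the three downloads are independent, they can be started together. A minimal sketch, assuming the `hf` CLI is on the PATH and the script is run from the project root; the helper names `download_command` and `download_all` are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import List

# (local directory, Hugging Face repo id) pairs copied from the commands above.
CHECKPOINTS = [
    ("./ckpt", "HeartMuLa/HeartMuLaGen"),
    ("./ckpt/HeartMuLa-oss-3B", "HeartMuLa/HeartMuLa-oss-3B-happy-new-year"),
    ("./ckpt/HeartCodec-oss", "HeartMuLa/HeartCodec-oss-20260123"),
]


def download_command(local_dir: str, repo_id: str) -> List[str]:
    """Build the argument list for one `hf download` call."""
    return ["hf", "download", "--local-dir", local_dir, repo_id]


def download_all(max_workers: int = 3) -> None:
    """Run all downloads concurrently; raises if any command fails."""
    commands = [download_command(d, r) for d, r in CHECKPOINTS]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() forces evaluation so a CalledProcessError propagates.
        list(pool.map(lambda cmd: subprocess.run(cmd, check=True), commands))
```

Call `download_all()` once the virtual environment is active. Threads are enough here because each worker only waits on an external process.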

GPU / CUDA

HeartMuLa uses CUDA by default (--mula_device cuda --codec_device cuda). No extra setup needed if the user has an NVIDIA GPU with PyTorch CUDA support installed.

  • The installed torch==2.4.1 includes CUDA 12.1 support out of the box
  • torchtune may report version 0.4.0+cpu — this is just package metadata, it still uses CUDA via PyTorch
  • To verify GPU is being used, look for "CUDA memory" lines in the output (e.g. "CUDA memory before unloading: 6.20 GB")
  • No GPU? You can run on CPU with --mula_device cpu --codec_device cpu, but expect generation to be extremely slow (potentially 30-60+ minutes for a single song vs ~4 minutes on GPU). CPU mode also requires significant RAM (~12GB+ free). If the user has no NVIDIA GPU, recommend using a cloud GPU service (Google Colab free tier with T4, Lambda Labs, etc.) or the online demo at https://heartmula.github.io/ instead.
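Before starting a long run, it can help to check whether PyTorch actually sees a CUDA device. The sketch below uses `torch.cuda.is_available()` and falls back to CPU when PyTorch is missing; the function name `pick_devices` is hypothetical, and the returned strings are meant for `--mula_device` and `--codec_device`.

```python
import importlib.util
from typing import Tuple


def pick_devices() -> Tuple[str, str]:
    """Return (mula_device, codec_device) based on whether PyTorch sees CUDA."""
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed in this environment; fall back to CPU.
        return ("cpu", "cpu")
    import torch

    if torch.cuda.is_available():
        return ("cuda", "cuda")
    # PyTorch is installed but no usable NVIDIA GPU was found.
    return ("cpu", "cpu")
```

If this returns `("cpu", "cpu")` on a machine that has an NVIDIA GPU, the PyTorch install or driver is the likely cause and worth fixing before generating on CPU.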

Usage

Basic Generation

cd heartlib
. .venv/bin/activate
python ./examples/run_music_generation.py \
  --model_path=./ckpt \
  --version="3B" \
  --lyrics="./assets/lyrics.txt" \
  --tags="./assets/tags.txt" \
  --save_path="./assets/output.mp3" \
  --lazy_load true
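When generating several songs, it may be convenient to build the same command from Python. This sketch only reuses the script path and flags shown above; the function names `build_generation_command` and `generate` are hypothetical, and `python` is assumed to resolve to the activated virtual environment.

```python
import subprocess
from typing import List


def build_generation_command(
    lyrics: str = "./assets/lyrics.txt",
    tags: str = "./assets/tags.txt",
    save_path: str = "./assets/output.mp3",
    model_path: str = "./ckpt",
    version: str = "3B",
    lazy_load: bool = True,
) -> List[str]:
    """Assemble the argument list for examples/run_music_generation.py."""
    return [
        "python",
        "./examples/run_music_generation.py",
        f"--model_path={model_path}",
        f"--version={version}",
        f"--lyrics={lyrics}",
        f"--tags={tags}",
        f"--save_path={save_path}",
        "--lazy_load",
        "true" if lazy_load else "false",
    ]


def generate(cwd: str = ".", **kwargs: object) -> None:
    """Run one generation from the heartlib project root (cwd)."""
    subprocess.run(build_generation_command(**kwargs), cwd=cwd, check=True)
```

For example, `generate(cwd="heartlib", save_path="./assets/song2.mp3")` would write a second output file without touching the first.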

Input Formatting

Tags (comma-separated, no spaces):

piano,happy,wedding,synthesizer,romantic

or

rock,energetic,guitar,drums,male-vocal

Lyrics (use bracketed structural tags):

[Intro]

[Verse]
Your lyrics here...

[Chorus]
Chorus lyrics...

[Bridge]
Bridge lyrics...

[Outro]
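Both input files are plain text, so they are easy to generate. A minimal sketch of the two formats above; the helper names are hypothetical, and replacing spaces inside a tag with hyphens is an assumption based on the `male-vocal` example.

```python
from typing import Iterable, List, Tuple


def format_tags(tags: Iterable[str]) -> str:
    """Join tags with commas and no spaces."""
    cleaned = []
    for tag in tags:
        # Assumption: inner whitespace becomes a hyphen, as in "male-vocal".
        normalized = "-".join(tag.split())
        if normalized:
            cleaned.append(normalized)
    return ",".join(cleaned)


def format_lyrics(sections: List[Tuple[str, str]]) -> str:
    """Render (section name, text) pairs with bracketed structural tags."""
    blocks = []
    for name, text in sections:
        header = f"[{name}]"
        body = text.strip()
        # Sections such as [Intro] or [Outro] may have no lyrics at all.
        blocks.append(f"{header}\n{body}" if body else header)
    return "\n\n".join(blocks) + "\n"


print(format_tags(["rock", " energetic ", "male vocal"]))
# prints: rock,energetic,male-vocal
```

The results can be written to `./assets/tags.txt` and `./assets/lyrics.txt` with `Path.write_text`.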

Key Parameters

| Parameter | Default | Description |
|---|---|---|
| --max_audio_length_ms | 240000 | Max length in ms (240 s = 4 min) |
| --topk | 50 | Top-k sampling |
| --temperature | 1.0 | Sampling temperature |
| --cfg_scale | 1.5 | Classifier-free guidance scale |
| --lazy_load | false | Load/unload models on demand (saves VRAM) |
| --mula_dtype | bfloat16 | Dtype for HeartMuLa (bf16 recommended) |
| --codec_dtype | float32 | Dtype for HeartCodec (fp32 recommended for quality) |

Performance

  • RTF (Real-Time Factor) ≈ 1.0 — a 4-minute song takes ~4 minutes to generate
  • Output: MP3, 48kHz stereo, 128kbps
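The real-time factor gives a quick way to estimate wall-clock time: generation time ≈ RTF × audio duration. A small worked example, using the default --max_audio_length_ms value; the function name is hypothetical, and an RTF of 1.0 is only the approximate GPU figure quoted above.

```python
def estimate_generation_seconds(audio_length_ms: int, rtf: float = 1.0) -> float:
    """Estimate generation time in seconds from audio length and RTF."""
    # RTF = generation time / audio duration, so multiply and convert ms to s.
    return rtf * audio_length_ms / 1000.0


print(estimate_generation_seconds(240000))
# prints: 240.0
```

Actual times depend on the GPU and on whether lazy loading is enabled, so treat the result as a rough planning number.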

Pitfalls

  1. Do NOT use bf16 for HeartCodec — degrades audio quality. Use fp32 (default).
  2. Tags may be ignored — known issue (#90). Lyrics tend to dominate; experiment with tag ordering.
  3. Triton not available on macOS — Linux/CUDA only for GPU acceleration.
  4. RTX 5080 incompatibility reported in upstream issues.
  5. The dependency pin conflicts require the manual upgrades and patches described above.

Links