ai-provider-openai-whisper

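A Skill that uses OpenAI's Audio API to transcribe or translate audio into text. It also covers features useful in business settings, such as speaker identification, timestamps, and prompt-based control.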

📜 Original English description (for reference)

Speech-to-text transcription and translation via OpenAI Audio API -- models, response formats, timestamps, prompting, streaming, chunking, and diarization

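🇯🇵 Notes for Japanese creators

In a nutshell

Note: this commentary was added by the jpskill.com editorial team for Japanese business settings. It is reference information independent of the Skill's actual behavior.

⚡ Recommended: install with a single command (60 seconds)

Copy the command below and paste it into a terminal (Mac/Linux) or PowerShell (Windows). Download, extraction, and placement are fully automatic.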

🍎 Mac / 🐧 Linux

mkdir -p ~/.claude/skills && cd ~/.claude/skills && curl -L -o ai-provider-openai-whisper.zip https://jpskill.com/download/10218.zip && unzip -o ai-provider-openai-whisper.zip && rm ai-provider-openai-whisper.zip

🪟 Windows (PowerShell)

$d = "$env:USERPROFILE\.claude\skills"; ni -Force -ItemType Directory $d | Out-Null; iwr https://jpskill.com/download/10218.zip -OutFile "$d\ai-provider-openai-whisper.zip"; Expand-Archive "$d\ai-provider-openai-whisper.zip" -DestinationPath $d -Force; ri "$d\ai-provider-openai-whisper.zip"

When it finishes, restart Claude Code. After that, an ordinary request such as "transcribe this recording" triggers the Skill automatically.

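💾 Manual download (if the command line is difficult)

1. Click the blue button below to download ai-provider-openai-whisper.zip
2. Double-click the ZIP file to extract it; this creates an ai-provider-openai-whisper folder
3. Move that folder to C:\Users\YourName\.claude\skills\ (Windows) or ~/.claude/skills/ (Mac)
4. Restart Claude Code

⬇ Download as .zip (recommended) ⬇ .skill format (advanced users) Original source ↗

⚠️ Download and use at your own risk. This site accepts no responsibility for the content, behavior, or safety of the Skill.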

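🎯 What this Skill can do

The description below explains what this Skill does for you. It activates automatically when you ask Claude for something in this area.

📦 How to install (3 steps)

1. Click the "Download" button above to get the .skill file
2. Change the file extension from .skill to .zip and extract it (macOS can extract it automatically)
3. Place the extracted folder in .claude/skills/ under your home folder
- macOS / Linux: ~/.claude/skills/
- Windows: %USERPROFILE%\.claude\skills\

Restart Claude Code and you are done. You do not need to say "use this Skill..."; it is invoked automatically for related requests.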


Last updated: 2026-05-18
Retrieved: 2026-05-18
Bundled files: 1


OpenAI Whisper Patterns

Quick Guide: Use client.audio.transcriptions.create() for speech-to-text and client.audio.translations.create() for non-English audio to English text. Choose gpt-4o-transcribe for highest accuracy, gpt-4o-mini-transcribe for cost-efficiency, whisper-1 for timestamps/SRT/VTT, or gpt-4o-transcribe-diarize for speaker identification. Files must be under 25 MB -- chunk larger files. Use prompt to guide vocabulary and style. Streaming is available via stream: true for progressive output on gpt-4o-transcribe models.

<critical_requirements>

CRITICAL: Before Using This Skill

All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, import type, named constants)

(You MUST choose the correct model for the use case -- gpt-4o-transcribe for accuracy, whisper-1 for timestamps/SRT/VTT output, gpt-4o-transcribe-diarize for speaker labels)

(You MUST chunk audio files larger than 25 MB before sending to the API -- the API rejects files exceeding this limit)

(You MUST pass response_format: "verbose_json" when using timestamp_granularities -- timestamps only work with this format on whisper-1)

(You MUST set chunking_strategy: "auto" when using gpt-4o-transcribe-diarize with audio longer than 30 seconds -- the API requires it)

</critical_requirements>

Auto-detection: Whisper, whisper-1, gpt-4o-transcribe, gpt-4o-mini-transcribe, gpt-4o-transcribe-diarize, audio.transcriptions, audio.translations, transcription, speech-to-text, diarization, diarized_json, timestamp_granularities, verbose_json

When to use:

Transcribing audio files (meetings, interviews, podcasts, voice notes) to text
Translating non-English audio to English text
Generating subtitles in SRT or VTT format from audio
Getting word-level or segment-level timestamps for video editing
Identifying speakers in multi-speaker audio (diarization)
Streaming transcription results progressively as the model processes audio

Key patterns covered:

Model selection (whisper-1 vs gpt-4o-transcribe vs gpt-4o-mini-transcribe vs gpt-4o-transcribe-diarize)
Response formats (json, text, srt, vtt, verbose_json, diarized_json)
Timestamps (word-level, segment-level) and subtitle generation
Prompting for vocabulary, acronyms, and style
Chunking large files (> 25 MB) with context preservation
Streaming transcription with stream: true
Translation to English via audio.translations.create()
Speaker diarization with speaker references

When NOT to use:

Text-to-speech (TTS) -- use the OpenAI TTS API (client.audio.speech.create())
Real-time bidirectional voice conversations -- use the OpenAI Realtime API
Transcription with non-OpenAI providers -- use a provider-agnostic speech SDK

Examples Index

Core: Transcription, Translation, Timestamps, Chunking, Streaming, Diarization -- All audio API patterns

<philosophy>

Philosophy

The OpenAI Audio API provides speech-to-text transcription and translation through multiple models optimized for different needs. The API is simple -- you send an audio file and get text back -- but choosing the right model, response format, and parameters is critical for quality results.

Core principles:

Model selection matters -- gpt-4o-transcribe produces the highest accuracy with lower hallucination rates. whisper-1 is the only model supporting SRT/VTT/verbose_json with timestamps. gpt-4o-transcribe-diarize adds speaker identification.
File size is the primary constraint -- The 25 MB limit means you must chunk larger files. Split at sentence boundaries to preserve context.
Prompting improves accuracy -- The prompt parameter guides vocabulary, acronyms, and formatting style. It does not give instructions -- it provides context the model matches against.
Response format determines available features -- Timestamps require verbose_json on whisper-1. Diarization requires diarized_json. SRT/VTT are only on whisper-1.

When to use the Audio API:

You need accurate transcription of recorded audio files
You need subtitles (SRT/VTT) from audio
You need to identify who is speaking in a conversation
You need to translate non-English speech to English text

When NOT to use:

Real-time voice chat -- use the Realtime API instead
Text-to-speech -- use client.audio.speech.create()
You need transcription in a non-English target language (translation only outputs English)

</philosophy>

<patterns>

Core Patterns

Pattern 1: Basic Transcription

Send an audio file and receive text back. The model auto-detects the language.

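import { createReadStream } from "node:fs";
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment
const transcription = await client.audio.transcriptions.create({
  model: "gpt-4o-transcribe",
  file: createReadStream(audioPath), // audioPath: path to a local audio file
});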

Use gpt-4o-transcribe for highest accuracy. Do not use whisper-1 with verbose_json when you only need plain text -- it adds overhead and has higher hallucination rates. See core.md for full examples.

Pattern 2: Model Selection

Each model has distinct capabilities and tradeoffs.

What do you need?
+-- Highest accuracy, plain text -> gpt-4o-transcribe
+-- Cost-efficient, plain text -> gpt-4o-mini-transcribe
+-- Timestamps (word/segment) -> whisper-1 (verbose_json)
+-- SRT or VTT subtitles -> whisper-1 (srt/vtt format)
+-- Speaker identification -> gpt-4o-transcribe-diarize
+-- Streaming output -> gpt-4o-transcribe or gpt-4o-mini-transcribe

Model Capabilities Matrix

Feature	whisper-1	gpt-4o-transcribe	gpt-4o-mini-transcribe	gpt-4o-transcribe-diarize
Response formats	json, text, srt, vtt, verbose_json	json, text	json, text	json, text, diarized_json
Timestamps	word + segment	No	No	No
Streaming	No	Yes	Yes	No
Prompt support	Yes (224 tokens)	Yes	Yes	No
Logprobs	No	Yes	Yes	No
Speaker labels	No	No	No	Yes
Language param	Yes	Yes	Yes	Yes

Pattern 3: Prompting for Vocabulary and Style

The prompt parameter provides context -- not instructions. It guides spelling of names, acronyms, and formatting style. Do not use it to give instructions like "please transcribe carefully" -- it matches style and vocabulary context.

const VOCABULARY_PROMPT = "Kubernetes, kubectl, etcd, NGINX, gRPC, PostgreSQL";

const transcription = await client.audio.transcriptions.create({
  model: "gpt-4o-transcribe",
  file: createReadStream(audioPath),
  prompt: VOCABULARY_PROMPT,
});

Use cases: Acronyms/proper nouns, preserving context across chunks (pass tail of previous transcript), maintaining filler words, writing style guidance. See core.md for detailed vocabulary examples.

Pattern 4: Chunking Large Files

Audio files exceeding 25 MB must be split before transcription. Prefer splitting at sentence boundaries or pauses to preserve context; the fixed-interval ffmpeg split shown below is simpler but can cut mid-sentence. Pass the tail of the previous transcript as prompt for continuity across chunks.

const MAX_FILE_SIZE_BYTES = 25 * 1024 * 1024; // 25 MB
// Split with ffmpeg: ffmpeg -i long.mp3 -f segment -segment_time 600 -c copy chunk_%03d.mp3
// Then transcribe sequentially, passing previous context via prompt

See core.md for the full chunking implementation with size validation and context preservation.
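
As a rough illustration of the sequential step, the sketch below transcribes chunks in order and feeds the tail of each result into the next call as the prompt. It is not the core.md implementation: `needsChunking`, `tailOf`, `transcribeChunks`, and the `PROMPT_TAIL_CHARS` budget are hypothetical, and the actual API call is injected as a function so the logic can be exercised without network access.

```typescript
const MAX_FILE_SIZE_BYTES = 25 * 1024 * 1024; // 25 MB limit from the text above
const PROMPT_TAIL_CHARS = 200; // Hypothetical budget; whisper-1 prompts are limited to roughly 224 tokens.

// Injected transcription call, so this sketch does not depend on the SDK.
type TranscribeChunk = (chunkPath: string, prompt: string | undefined) => Promise<string>;

function needsChunking(fileSizeBytes: number): boolean {
  return fileSizeBytes > MAX_FILE_SIZE_BYTES;
}

// Returns the last maxChars characters, dropping a possibly partial first word.
function tailOf(text: string, maxChars: number): string {
  if (text.length <= maxChars) return text;
  const tail = text.slice(-maxChars);
  const firstSpace = tail.indexOf(" ");
  return firstSpace === -1 ? tail : tail.slice(firstSpace + 1);
}

// Transcribes chunks one by one, passing the previous transcript tail as context.
async function transcribeChunks(
  chunkPaths: string[],
  transcribeChunk: TranscribeChunk,
): Promise<string> {
  const parts: string[] = [];
  let previousTail: string | undefined;
  for (const chunkPath of chunkPaths) {
    const text = (await transcribeChunk(chunkPath, previousTail)).trim();
    parts.push(text);
    previousTail = tailOf(text, PROMPT_TAIL_CHARS);
  }
  return parts.join(" ");
}
```

In practice `transcribeChunk` would wrap `client.audio.transcriptions.create()` with `file: createReadStream(chunkPath)` and the given `prompt`.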

Pattern 5: Streaming Transcription

Stream partial transcription results as the model processes audio. Only gpt-4o-transcribe and gpt-4o-mini-transcribe support stream: true. Listen for transcript.text.delta events for progressive output and transcript.text.done for completion. Do NOT use stream: true with whisper-1 -- it is not supported.

const stream = await client.audio.transcriptions.create({
  model: "gpt-4o-transcribe",
  file: createReadStream(audioPath),
  stream: true,
});
for await (const event of stream) {
  if (event.type === "transcript.text.delta") process.stdout.write(event.delta);
}

See core.md for full streaming and logprob examples.
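
The event handling can also be separated from the API call. The sketch below consumes any async iterable of events and returns the final text. `TranscriptEvent` and `collectTranscript` are hypothetical names, and the event shape is an assumption reduced to the fields this example needs (a `delta` string on delta events, a `text` string on the done event); real SDK events carry more fields.

```typescript
// Assumed minimal event shape; not the SDK's own type.
type TranscriptEvent =
  | { type: "transcript.text.delta"; delta: string }
  | { type: "transcript.text.done"; text: string };

// Collects deltas from an async iterable of events and returns the final text.
async function collectTranscript(
  events: AsyncIterable<TranscriptEvent>,
  onDelta: (delta: string) => void = () => {},
): Promise<string> {
  let assembled = "";
  for await (const event of events) {
    if (event.type === "transcript.text.delta") {
      assembled += event.delta;
      onDelta(event.delta);
    } else if (event.type === "transcript.text.done") {
      // Prefer the final text reported in the done event over the local buffer.
      return event.text;
    }
  }
  // The stream ended without a done event; fall back to what was received.
  return assembled;
}
```

Passing a callback such as `(delta) => process.stdout.write(delta)` reproduces the progressive output shown above while still returning the complete transcript.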

Pattern 6: Translation to English

Translate non-English audio to English text. Only whisper-1 is supported via audio.translations.create(). For same-language transcription, use audio.transcriptions.create() instead. Translation only outputs English -- there is no way to translate to other languages.

const translation = await client.audio.translations.create({
  model: "whisper-1",
  file: createReadStream(audioPath),
});

See core.md for full translation examples including vocabulary prompting.

Pattern 7: Speaker Diarization

Identify who is speaking in multi-speaker audio. Use gpt-4o-transcribe-diarize with response_format: "diarized_json" and chunking_strategy: "auto" (required for audio > 30s). Diarization does not support prompt, logprobs, or timestamp_granularities.

const transcription = await client.audio.transcriptions.create({
  model: "gpt-4o-transcribe-diarize",
  file: createReadStream(audioPath),
  response_format: "diarized_json",
  chunking_strategy: "auto",
});

Optionally supply known_speaker_names and known_speaker_references (2-10 second audio clips as data URLs) to map segments to known speakers (up to 4). See core.md for full diarization examples.
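
Once a diarized response is available, consecutive segments from the same speaker are usually easier to read when merged into turns. The sketch below assumes each segment exposes `speaker`, `start`, `end`, and `text`; that shape is an assumption for illustration and should be checked against the API reference. `mergeSpeakerTurns` and `formatTurns` are hypothetical helpers.

```typescript
// Assumed subset of a diarized_json segment; verify against the API reference.
interface DiarizedSegment {
  speaker: string;
  start: number; // seconds
  end: number; // seconds
  text: string;
}

// Merges consecutive segments spoken by the same speaker into a single turn.
function mergeSpeakerTurns(segments: DiarizedSegment[]): DiarizedSegment[] {
  const turns: DiarizedSegment[] = [];
  for (const segment of segments) {
    const last = turns[turns.length - 1];
    if (last !== undefined && last.speaker === segment.speaker) {
      last.end = segment.end;
      last.text = `${last.text} ${segment.text.trim()}`;
    } else {
      // Copy the segment so the input array is not mutated by later merges.
      turns.push({ ...segment, text: segment.text.trim() });
    }
  }
  return turns;
}

// Renders turns as "SPEAKER: text" lines.
function formatTurns(turns: DiarizedSegment[]): string {
  return turns.map((turn) => `${turn.speaker}: ${turn.text}`).join("\n");
}
```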

</patterns>

<decision_framework>

Decision Framework

Which Model to Choose

What do you need from the transcription?
+-- Just text (highest accuracy) -> gpt-4o-transcribe
+-- Just text (cost-sensitive) -> gpt-4o-mini-transcribe
+-- Word/segment timestamps -> whisper-1 (verbose_json)
+-- SRT or VTT subtitle files -> whisper-1 (srt or vtt)
+-- Speaker identification -> gpt-4o-transcribe-diarize
+-- Progressive/streaming output -> gpt-4o-transcribe or gpt-4o-mini-transcribe (stream: true)

Which Response Format to Use

What output do you need?
+-- Plain text string -> "text"
+-- JSON with text field -> "json" (default)
+-- Subtitles for video -> "srt" or "vtt" (whisper-1 only)
+-- Timestamps (word/segment) -> "verbose_json" (whisper-1 only)
+-- Speaker-labeled segments -> "diarized_json" (gpt-4o-transcribe-diarize only)

Transcription vs Translation

Is the audio in English?
+-- YES -> Use audio.transcriptions.create()
+-- NO -> Do you want the output in the original language?
    +-- YES -> Use audio.transcriptions.create() (auto-detects language)
    +-- NO (want English) -> Use audio.translations.create() (whisper-1 only)
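
The same tree can be written as a function that also makes the unsupported case explicit. This is a sketch under stated assumptions: `RequestPlan` and `planRequest` are hypothetical names, languages are given as ISO 639-1 codes, and `gpt-4o-transcribe` is used as the default transcription model only because the text above recommends it for accuracy.

```typescript
interface RequestPlan {
  endpoint: "audio.transcriptions.create" | "audio.translations.create";
  model: string;
}

// Hypothetical helper: picks the endpoint from source and desired output language.
function planRequest(sourceLanguage: string, targetLanguage: string): RequestPlan {
  if (sourceLanguage === targetLanguage) {
    // Same-language output is transcription.
    return { endpoint: "audio.transcriptions.create", model: "gpt-4o-transcribe" };
  }
  if (targetLanguage === "en") {
    // Translation only outputs English and only supports whisper-1.
    return { endpoint: "audio.translations.create", model: "whisper-1" };
  }
  throw new Error(
    `The Audio API cannot translate ${sourceLanguage} audio into ${targetLanguage}`,
  );
}
```

Throwing for any other target language surfaces the limitation early instead of silently returning English text.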

</decision_framework>

<red_flags>

RED FLAGS

High Priority Issues:

Using timestamp_granularities without response_format: "verbose_json" on whisper-1 (silently ignored)
Sending files larger than 25 MB (API returns error)
Using gpt-4o-transcribe-diarize without chunking_strategy on audio > 30 seconds (API returns error)
Using stream: true with whisper-1 (not supported)
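
These four checks are mechanical enough to run before a request is sent. The following is a minimal sketch: `PlannedRequest` is a hypothetical description of a request (not an SDK type), and `findHighPriorityIssues` simply restates the rules listed above.

```typescript
const MAX_FILE_SIZE_BYTES = 25 * 1024 * 1024;
const DIARIZE_CHUNKING_THRESHOLD_SECONDS = 30;

// Hypothetical pre-flight description of a request; not an SDK type.
interface PlannedRequest {
  model: string;
  fileSizeBytes: number;
  durationSeconds: number;
  responseFormat?: string;
  timestampGranularities?: string[];
  chunkingStrategy?: string;
  stream?: boolean;
}

// Returns one message per high-priority rule the planned request violates.
function findHighPriorityIssues(request: PlannedRequest): string[] {
  const issues: string[] = [];
  const wantsTimestamps = (request.timestampGranularities ?? []).length > 0;
  if (
    wantsTimestamps &&
    (request.model !== "whisper-1" || request.responseFormat !== "verbose_json")
  ) {
    issues.push('timestamp_granularities needs whisper-1 with response_format "verbose_json"');
  }
  if (request.fileSizeBytes > MAX_FILE_SIZE_BYTES) {
    issues.push("file exceeds 25 MB; chunk it before sending");
  }
  if (
    request.model === "gpt-4o-transcribe-diarize" &&
    request.durationSeconds > DIARIZE_CHUNKING_THRESHOLD_SECONDS &&
    request.chunkingStrategy === undefined
  ) {
    issues.push('diarization of audio longer than 30 seconds needs chunking_strategy "auto"');
  }
  if (request.stream === true && request.model === "whisper-1") {
    issues.push("stream: true is not supported on whisper-1");
  }
  return issues;
}
```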

Medium Priority Issues:

Using whisper-1 when gpt-4o-transcribe would produce higher accuracy (whisper-1 has higher hallucination rates)
Not passing language parameter when you know the language (auto-detection may be wrong for short or noisy audio)
Using audio.translations.create() when you want same-language transcription (translation always outputs English)
Splitting audio mid-sentence when chunking (loses context at boundaries)

Common Mistakes:

Treating the prompt parameter as an instruction ("please transcribe carefully") -- it is context for vocabulary and style matching
Using gpt-4o-transcribe when you need SRT/VTT output -- only whisper-1 supports those formats
Expecting gpt-4o-transcribe-diarize to support prompts or logprobs (it does not)
Using the translations endpoint for English audio (it only translates non-English to English)
Not providing previous chunk context when transcribing split files (reduces accuracy at boundaries)

Gotchas & Edge Cases:

The prompt parameter is limited to approximately 224 tokens on whisper-1. Longer prompts are truncated.
whisper-1 can hallucinate text for silent or near-silent audio segments. Use no_speech_prob from verbose_json to detect this.
gpt-4o-transcribe and gpt-4o-mini-transcribe only support json and text response formats -- not srt, vtt, or verbose_json.
The language parameter uses ISO 639-1 codes (e.g., "en", "fr", "ja"). Setting it improves accuracy for short audio.
Supported file formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm. Other formats must be converted first.
gpt-4o-transcribe-diarize labels speakers as "A", "B", "C" unless you provide known_speaker_names and known_speaker_references with short audio clips.
Translation endpoint only supports whisper-1 and only outputs English -- there is no way to translate to other languages via this API.
Streaming transcription emits transcript.text.delta events with a delta string property, plus a final transcript.text.done event.
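
The silent-audio gotcha can be handled by filtering verbose_json segments on `no_speech_prob`. This is a sketch only: `VerboseSegment` lists just the fields used here, `dropLikelySilence` and `joinSegmentText` are hypothetical helpers, and the 0.6 cutoff is an assumed starting point that should be tuned against real recordings.

```typescript
// Subset of the verbose_json segment fields used in this example.
interface VerboseSegment {
  start: number; // seconds
  end: number; // seconds
  text: string;
  no_speech_prob: number;
}

// Assumed cutoff; tune it against your own recordings.
const NO_SPEECH_THRESHOLD = 0.6;

// Keeps only segments the model considers likely to contain speech.
function dropLikelySilence(
  segments: VerboseSegment[],
  threshold: number = NO_SPEECH_THRESHOLD,
): VerboseSegment[] {
  return segments.filter((segment) => segment.no_speech_prob < threshold);
}

// Joins the remaining segment texts into one transcript string.
function joinSegmentText(segments: VerboseSegment[]): string {
  return segments.map((segment) => segment.text.trim()).join(" ");
}
```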

</red_flags>

<critical_reminders>

CRITICAL REMINDERS

All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, import type, named constants)

(You MUST choose the correct model for the use case -- gpt-4o-transcribe for accuracy, whisper-1 for timestamps/SRT/VTT output, gpt-4o-transcribe-diarize for speaker labels)

(You MUST chunk audio files larger than 25 MB before sending to the API -- the API rejects files exceeding this limit)

(You MUST pass response_format: "verbose_json" when using timestamp_granularities -- timestamps only work with this format on whisper-1)

(You MUST set chunking_strategy: "auto" when using gpt-4o-transcribe-diarize with audio longer than 30 seconds -- the API requires it)

Failure to follow these rules will produce failed API calls or degraded transcription quality.

</critical_reminders>