openai-realtime

Build voice-enabled AI applications with the OpenAI Realtime API. Use when a user asks to implement real-time voice conversations, stream audio with WebSockets, build voice assistants, or integrate OpenAI audio capabilities.

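⚡ Recommended: install with one command (60 seconds)

Copy the command below and paste it into a terminal (Mac/Linux) or PowerShell (Windows). It downloads the archive, unzips it, and puts the folder in place automatically.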

🍎 Mac / 🐧 Linux

mkdir -p ~/.claude/skills && cd ~/.claude/skills && curl -L -o openai-realtime.zip https://jpskill.com/download/15202.zip && unzip -o openai-realtime.zip && rm openai-realtime.zip

🪟 Windows (PowerShell)

$d = "$env:USERPROFILE\.claude\skills"; ni -Force -ItemType Directory $d | Out-Null; iwr https://jpskill.com/download/15202.zip -OutFile "$d\openai-realtime.zip"; Expand-Archive "$d\openai-realtime.zip" -DestinationPath $d -Force; ri "$d\openai-realtime.zip"

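When it finishes, restart Claude Code. The Skill then activates automatically when you ask in plain language, for example "build a voice assistant with the OpenAI Realtime API".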

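💾 Prefer a manual download? (for anyone who finds the command difficult)

1. Press the blue button below to download openai-realtime.zip
2. Double-click the ZIP file to unzip it; this creates an openai-realtime folder
3. Move that folder to C:\Users\<your name>\.claude\skills\ (Windows) or ~/.claude/skills/ (Mac)
4. Restart Claude Code

⬇ Download .zip (recommended) ⬇ .skill format (advanced users) Original source ↗

⚠️ Download and use at your own risk. This site accepts no responsibility for the content, behavior, or safety of the Skill.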

🎯 What this Skill can do

The description above tells you what this Skill does for you. It activates automatically when you ask Claude for help in this area.

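📦 How to install (3 steps)

1. Press the "Download" button above to get the .skill file
2. Change the file extension from .skill to .zip and extract it (macOS can extract it automatically)
3. Put the extracted folder in .claude/skills/ under your home folder
- macOS / Linux: ~/.claude/skills/
- Windows: %USERPROFILE%\.claude\skills\

Restart Claude Code and you are done. You do not need to say "use this Skill"; it is invoked automatically for related requests.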
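
To confirm the folder landed where the steps above put it, a quick check along these lines may help. It is a minimal sketch that assumes the archive unpacks to a folder named openai-realtime containing a SKILL.md file; the function names are illustrative and not part of the Skill.

```python
from pathlib import Path


def skill_dir(home: Path, name: str = "openai-realtime") -> Path:
    """Return the folder where the unzipped skill is expected to live."""
    return home / ".claude" / "skills" / name


def is_installed(home: Path, name: str = "openai-realtime") -> bool:
    """True when the skill folder exists and holds a SKILL.md file (assumed layout)."""
    return (skill_dir(home, name) / "SKILL.md").is_file()


if __name__ == "__main__":
    # Path.home() resolves to ~ on macOS/Linux and %USERPROFILE% on Windows.
    print(is_installed(Path.home()))
```

If this prints False after installing, the usual cause is an extra nesting level created while unzipping.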

Last updated: 2026-05-18
Retrieved: 2026-05-18
Bundled files: 1

📜 Original SKILL.md (the text Claude reads)

OpenAI Realtime API — Voice-Native AI Conversations

Overview

You are an expert in the OpenAI Realtime API, the WebSocket-based interface for building voice-native AI applications. You help developers build conversational voice agents that process audio input directly (no separate STT step), generate spoken responses with natural intonation, handle interruptions, and use function calling — all in a single streaming connection with sub-second latency.

Instructions

WebSocket Connection

// Connect to OpenAI Realtime API
import WebSocket from "ws";

const ws = new WebSocket("wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview", {
  headers: {
    "Authorization": `Bearer ${process.env.OPENAI_API_KEY}`,
    "OpenAI-Beta": "realtime=v1",
  },
});

ws.on("open", () => {
  // Configure the session
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      modalities: ["text", "audio"],
      voice: "alloy",                      // e.g. alloy, ash, coral, echo, sage, shimmer (TTS-only fable, onyx, nova are not accepted)
      instructions: `You are a helpful dental clinic receptionist named Ava.
        Be warm, professional, and concise. Use short sentences appropriate for phone calls.
        If asked about medical advice, say you'll transfer to the dentist.`,
      input_audio_format: "pcm16",         // 16-bit PCM, 24kHz
      output_audio_format: "pcm16",
      input_audio_transcription: {
        model: "whisper-1",                 // Also transcribe for logging
      },
      turn_detection: {
        type: "server_vad",                 // Server-side voice activity detection
        threshold: 0.5,                     // Sensitivity (0-1)
        prefix_padding_ms: 300,             // Include 300ms before speech start
        silence_duration_ms: 500,           // 500ms silence = end of turn
      },
      tools: [                              // Function calling tools
        {
          type: "function",
          name: "check_availability",
          description: "Check available appointment slots",
          parameters: {
            type: "object",
            properties: {
              date: { type: "string", description: "Date in YYYY-MM-DD format" },
              procedure: { type: "string", enum: ["cleaning", "filling", "crown", "consultation"] },
            },
            required: ["date", "procedure"],
          },
        },
        {
          type: "function",
          name: "book_appointment",
          description: "Book an appointment for a patient",
          parameters: {
            type: "object",
            properties: {
              patient_name: { type: "string" },
              phone: { type: "string" },
              date: { type: "string" },
              time: { type: "string" },
              procedure: { type: "string" },
            },
            required: ["patient_name", "date", "time", "procedure"],
          },
        },
      ],
    },
  }));
});

// Handle events from OpenAI
ws.on("message", (data) => {
  const event = JSON.parse(data.toString());

  switch (event.type) {
    case "response.audio.delta": {
      // Stream audio chunks to speaker/WebRTC (sendToSpeaker is supplied by your app)
      const audioChunk = Buffer.from(event.delta, "base64");
      sendToSpeaker(audioChunk);
      break;
    }

    case "response.audio_transcript.delta":
      // Real-time transcript of AI's response
      process.stdout.write(event.delta);
      break;

    case "conversation.item.input_audio_transcription.completed":
      // User's speech transcribed
      console.log(`\nUser said: ${event.transcript}`);
      break;

    case "response.function_call_arguments.done":
      // AI wants to call a function; call_id ties the result back to this call
      handleFunctionCall(event.call_id, event.name, JSON.parse(event.arguments))
        .catch(console.error);
      break;

    case "input_audio_buffer.speech_started":
      // User started speaking. Stop local playback here so the AI does not talk over the user.
      console.log("[User interruption detected]");
      break;
  }
});

// Send microphone audio
function sendAudio(pcmBuffer: Buffer) {
  ws.send(JSON.stringify({
    type: "input_audio_buffer.append",
    audio: pcmBuffer.toString("base64"),
  }));
}

// Handle function calls (checkClinicSlots and createAppointment are supplied by your app)
async function handleFunctionCall(callId: string, name: string, args: any) {
  let result: string;

  if (name === "check_availability") {
    const slots = await checkClinicSlots(args.date, args.procedure);
    result = JSON.stringify(slots);
  } else if (name === "book_appointment") {
    const booking = await createAppointment(args);
    result = JSON.stringify(booking);
  } else {
    // Unknown tool name: report it so the model can recover
    result = JSON.stringify({ error: `Unknown function: ${name}` });
  }

  // Send function result back; the AI will speak the response
  ws.send(JSON.stringify({
    type: "conversation.item.create",
    item: {
      type: "function_call_output",
      call_id: callId,
      output: result,
    },
  }));

  // Trigger AI to respond with the function result
  ws.send(JSON.stringify({ type: "response.create" }));
}
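
The function-call round trip in the listing above can be exercised without a socket. The sketch below, in Python, builds the two client events that answer one call: a function_call_output item carrying the same call_id, then response.create. The check_availability body and the HANDLERS table are hypothetical stand-ins for a real clinic backend; only the event shapes come from the listing.

```python
import json
from typing import Any, Callable


def check_availability(date: str, procedure: str) -> dict[str, Any]:
    """Hypothetical backend lookup; returns fixed slots for illustration."""
    return {"date": date, "procedure": procedure, "slots": ["09:00", "14:30"]}


# Maps tool names declared in session.update to local callables (illustrative).
HANDLERS: dict[str, Callable[..., Any]] = {"check_availability": check_availability}


def function_call_reply(call_id: str, name: str, arguments_json: str) -> list[dict[str, Any]]:
    """Build the client events that answer one function call."""
    handler = HANDLERS.get(name)
    if handler is None:
        output: Any = {"error": f"unknown function: {name}"}
    else:
        try:
            output = handler(**json.loads(arguments_json))
        except (TypeError, ValueError) as exc:
            # Bad JSON or arguments that do not match the handler signature.
            output = {"error": str(exc)}
    return [
        {
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": call_id,  # must match the call being answered
                "output": json.dumps(output),
            },
        },
        {"type": "response.create"},  # ask the model to speak the result
    ]
```

Returning an error object instead of raising keeps the conversation going: the model receives the failure as tool output and can apologize or ask again.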

Python SDK

# Using OpenAI Python SDK
import base64

from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def run_voice_agent():
    async with client.beta.realtime.connect(
        model="gpt-4o-realtime-preview"
    ) as connection:
        await connection.session.update(session={
            "modalities": ["text", "audio"],
            "voice": "alloy",
            "instructions": "You are a helpful assistant.",
            "turn_detection": {"type": "server_vad"},
        })

        # Send audio from microphone (base64_audio is base64-encoded PCM16 captured by your app)
        await connection.input_audio_buffer.append(audio=base64_audio)

        # Process events
        async for event in connection:
            if event.type == "response.audio.delta":
                # delta is base64 text; decode it before playback (play_audio is supplied by your app)
                play_audio(base64.b64decode(event.delta))
            elif event.type == "response.done":
                print("AI finished speaking")
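
The snippet above sends one pre-encoded buffer. A streaming client usually sends many small chunks instead. The sketch below shows one way to turn raw samples into input_audio_buffer.append events, assuming mono little-endian PCM16 at 24 kHz, which is my understanding of the pcm16 format. The 100 ms chunk size and the helper names are illustrative choices, not requirements of the API.

```python
import base64
import json
import struct
from typing import Iterator

SAMPLE_RATE_HZ = 24_000  # pcm16 rate stated in the session config
BYTES_PER_SAMPLE = 2     # 16-bit mono


def pcm16_from_floats(samples: list[float]) -> bytes:
    """Convert floats in [-1.0, 1.0] to little-endian 16-bit PCM, clipping overshoot."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    return struct.pack(f"<{len(clipped)}h", *(int(s * 32767) for s in clipped))


def append_events(pcm: bytes, chunk_ms: int = 100) -> Iterator[str]:
    """Yield one JSON input_audio_buffer.append event per chunk of audio."""
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


def decode_delta(delta: str) -> bytes:
    """Decode the base64 payload of a response.audio.delta event to PCM bytes."""
    return base64.b64decode(delta)
```

Each string from append_events can be passed to a WebSocket send call as-is. Smaller chunks lower latency at the price of more messages.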

Key Concepts

Audio-native — The model processes audio directly, understanding tone, emotion, and emphasis (not just text transcription)
Server VAD — OpenAI's server detects when the user starts/stops speaking; no client-side VAD needed
Interruptions — When the user speaks while AI is talking, the response is automatically interrupted
Function calling — Same as Chat Completions function calling, but in real-time during voice conversation
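
On interruptions, audio the client has already received is still sitting in its playback buffer, so the client typically has to discard it itself. The sketch below shows that idea with a hypothetical PlaybackQueue class: it buffers decoded response.audio.delta chunks and drops them when input_audio_buffer.speech_started arrives. The class and its methods are illustrative and not part of any SDK.

```python
import base64
from collections import deque
from typing import Optional


class PlaybackQueue:
    """Holds audio chunks that were received but not yet played."""

    def __init__(self) -> None:
        self._chunks = deque()

    def handle_event(self, event: dict) -> None:
        """Route one server event: queue audio, or flush on user speech."""
        if event["type"] == "response.audio.delta":
            self._chunks.append(base64.b64decode(event["delta"]))
        elif event["type"] == "input_audio_buffer.speech_started":
            # The user is talking: drop everything still waiting to be played.
            self._chunks.clear()

    def next_chunk(self) -> Optional[bytes]:
        """Return the next chunk for the audio device, or None when empty."""
        return self._chunks.popleft() if self._chunks else None
```

An audio callback would call next_chunk() on each tick and play silence when it returns None.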

Examples

Example 1: User asks to set up openai-realtime

User: "Help me set up openai-realtime for my project"

The agent should:

Check system requirements and prerequisites
Install or configure openai-realtime
Set up initial project structure
Verify the setup works correctly

Example 2: User asks to build a feature with openai-realtime

User: "Create a dashboard using openai-realtime"

The agent should:

Scaffold the component or configuration
Connect to the appropriate data source
Implement the requested feature
Test and validate the output

Guidelines

Server VAD for simplicity — Use server_vad turn detection; OpenAI handles speech detection, silence, and interruptions
PCM16 format — Use 16-bit PCM at 24kHz for both input and output; minimal encoding overhead
Short instructions — Keep system instructions concise; the model processes them with every turn
Function calls for actions — Use tools for bookings, lookups, and transfers; the model speaks the result naturally
Input transcription — Enable input_audio_transcription for logging and analytics; small additional cost
Silence threshold tuning — 500ms silence_duration for responsive agents; 1000ms for dictation (avoids mid-sentence cuts)
Voice selection — alloy for a neutral voice; the Realtime API has its own voice list, and the text-to-speech voices nova, onyx, and fable are not on it, so check the current list and test with your use case
Cost awareness — Realtime API audio cost roughly $0.06/min input + $0.24/min output at launch (check current pricing); use for high-value interactions (sales, support), not bulk processing
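
The PCM16 and cost guidelines lend themselves to quick back-of-envelope arithmetic. The sketch below uses the approximate per-minute figures quoted above as constants; real billing may be computed differently, so treat the output as a rough estimate only. The function names are illustrative.

```python
INPUT_USD_PER_MIN = 0.06   # approximate figure quoted in the guideline above
OUTPUT_USD_PER_MIN = 0.24  # approximate figure quoted in the guideline above


def estimate_cost_usd(input_minutes: float, output_minutes: float) -> float:
    """Rough audio cost for a call, split by who was speaking."""
    cost = input_minutes * INPUT_USD_PER_MIN + output_minutes * OUTPUT_USD_PER_MIN
    return round(cost, 4)


def pcm16_bytes_per_second(sample_rate_hz: int = 24_000) -> int:
    """Raw mono PCM16 data rate: two bytes per sample."""
    return sample_rate_hz * 2


def base64_length(raw_bytes: int) -> int:
    """Size of the base64 text that carries raw_bytes of audio (with padding)."""
    return 4 * ((raw_bytes + 2) // 3)


if __name__ == "__main__":
    # A call where the user speaks for 3 minutes and the AI for 2 minutes.
    print(estimate_cost_usd(3, 2))
    print(pcm16_bytes_per_second())
    print(base64_length(pcm16_bytes_per_second()))
```

At 24 kHz the raw stream is 48,000 bytes per second, and base64 encoding adds about a third on top of that before JSON framing.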