observability-setup

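A Skill that sets up OpenTelemetry instrumentation, Grafana visualization, and alerting together, so you can see what is happening across your microservices and troubleshoot problems in production.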

📜 Original English description (for reference)

Set up end-to-end observability for microservices. Use when someone asks to "add tracing", "set up monitoring", "configure OpenTelemetry", "build Grafana dashboards", "distributed tracing", "structured logging", "metrics collection", or "debug production issues". Covers OpenTelemetry instrumentation, collector configuration, Grafana LGTM stack deployment, dashboard provisioning, and alert rules.

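🇯🇵 Notes for Japanese creators

In a nutshell

Note: This commentary was added by the jpskill.com editorial team for Japanese business settings. It is reference information, independent of how the Skill itself behaves.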

⚡ Recommended: one-line install (60 seconds)

Copy the command below and paste it into a terminal (Mac/Linux) or PowerShell (Windows). Download, extraction, and placement are fully automatic.

🍎 Mac / 🐧 Linux

mkdir -p ~/.claude/skills && cd ~/.claude/skills && curl -L -o observability-setup.zip https://jpskill.com/download/15186.zip && unzip -o observability-setup.zip && rm observability-setup.zip

🪟 Windows (PowerShell)

$d = "$env:USERPROFILE\.claude\skills"; ni -Force -ItemType Directory $d | Out-Null; iwr https://jpskill.com/download/15186.zip -OutFile "$d\observability-setup.zip"; Expand-Archive "$d\observability-setup.zip" -DestinationPath $d -Force; ri "$d\observability-setup.zip"

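When it finishes, restart Claude Code. The Skill then triggers automatically when you ask for something in its area, for example "add tracing" or "set up monitoring".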

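💾 Prefer a manual download (if the command line is difficult)

1. Press the blue button below to download observability-setup.zip
2. Double-click the ZIP file to extract it; this creates an observability-setup folder
3. Move that folder to C:\Users\<your name>\.claude\skills\ (Windows) or ~/.claude/skills/ (Mac)
4. Restart Claude Code

⬇ Download as .zip (recommended) ⬇ .skill format (advanced users) Original source ↗

⚠️ Download and use at your own risk. This site accepts no responsibility for the content, behavior, or safety of the Skill.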

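🎯 What this Skill can do

The description below explains what this Skill does for you. It triggers automatically when you ask Claude for something in this area.

📦 How to install (3 steps)

1. Press the "Download" button above to get the .skill file
2. Change the file extension from .skill to .zip and extract it (macOS can extract it automatically)
3. Place the extracted folder in .claude/skills/ under your home folder
- macOS / Linux: ~/.claude/skills/
- Windows: %USERPROFILE%\.claude\skills\

Restart Claude Code and you are done. You do not need to say "use this Skill..."; it is invoked automatically for relevant requests.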

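Last updated: 2026-05-18
Retrieved: 2026-05-18
Bundled files: 1

📖 Skill body (translated)

Note: This section was machine-translated from the original (English/Chinese) with Gemini. Claude itself reads the original. If you find a mistranslation, check the original.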

Observability Setup

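Overview

This skill helps AI agents implement the three pillars of observability (traces, metrics, and structured logs) across microservice architectures. It uses OpenTelemetry as the instrumentation standard and Grafana's LGTM stack (Loki, Grafana, Tempo, Mimir) as the backend, though the patterns apply to any OTel-compatible backend.

Instructions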

OpenTelemetry Instrumentation (Node.js)

Install packages:

npm install @opentelemetry/api @opentelemetry/sdk-node @opentelemetry/sdk-metrics \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http @opentelemetry/exporter-metrics-otlp-http

Create src/instrumentation.ts. This file must be loaded before any other imports:

import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';

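// Fall back to the local Collector's OTLP/HTTP port when the variable is unset;
// otherwise the exporter URL would start with the string "undefined".
const otlpEndpoint = process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4318';

const sdk = new NodeSDK({
  serviceName: process.env.OTEL_SERVICE_NAME || 'my-service',
  traceExporter: new OTLPTraceExporter({ url: otlpEndpoint + '/v1/traces' }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({ url: otlpEndpoint + '/v1/metrics' }),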
    exportIntervalMillis: 15000,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();

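Add to the start script, after compiling to JavaScript (assuming tsc writes to dist/): node --require ./dist/instrumentation.js dist/index.js
Node versions without built-in TypeScript support cannot load a .ts file through --require on their own; to run the sources directly, register a loader first, for example: node --require ts-node/register --require ./src/instrumentation.ts src/index.ts

Add custom spans for business-critical operations: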

// SpanStatusCode is needed for the error branch below.
import { trace, SpanStatusCode } from '@opentelemetry/api';
const tracer = trace.getTracer('payment-service');

async function processPayment(order) {
  return tracer.startActiveSpan('process-payment', async (span) => {
    span.setAttribute('order.id', order.id);
    span.setAttribute('payment.amount_cents', order.totalCents);
    try {
      const result = await paymentGateway.charge(order);
      span.setAttribute('payment.status', result.status);
      return result;
    } catch (error) {
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      throw error;
    } finally {
      span.end();
    }
  });
}

OpenTelemetry Instrumentation (Python)

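Install: pip install opentelemetry-distro opentelemetry-exporter-otlp
Then run opentelemetry-bootstrap -a install to add instrumentation packages for the libraries already in your environment.
Run: opentelemetry-instrument --service_name notification-service python app.py

For structured logs with trace context: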

import structlog
from opentelemetry import trace

def add_trace_context(logger, method_name, event_dict):
    span = trace.get_current_span()
    ctx = span.get_span_context()
    if ctx.is_valid:
        event_dict['trace_id'] = format(ctx.trace_id, '032x')
        event_dict['span_id'] = format(ctx.span_id, '016x')
    return event_dict

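# JSONRenderer emits one JSON object per line; ConsoleRenderer is meant for local development output.
structlog.configure(processors=[add_trace_context, structlog.processors.JSONRenderer()])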

Structured Logging with Trace Correlation

Every log line must include trace_id and span_id. Use JSON format:

{"level":"error","msg":"payment failed","trace_id":"abc123...","span_id":"def456...","service":"payment-service","error":"timeout","timestamp":"2025-01-15T10:30:00Z"}

For Node.js, use pino with a custom mixin:

const pino = require('pino');
const { trace } = require('@opentelemetry/api');

const logger = pino({
  mixin() {
    const span = trace.getActiveSpan();
    if (span) {
      const ctx = span.spanContext();
      return { trace_id: ctx.traceId, span_id: ctx.spanId };
    }
    return {};
  }
});

OTel Collector Configuration

receivers:
  otlp:
    protocols:
      grpc: { endpoint: "0.0.0.0:4317" }
      http: { endpoint: "0.0.0.0:4318" }

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000
  memory_limiter:
    check_interval: 1s  # must be greater than zero; 1s is the commonly recommended value
    limit_mib: 512

exporters:
  otlphttp/tempo:
    endpoint: http://tempo:4318
  prometheusremotewrite:
    endpoint: http://mimir:9009/api/v1/push
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]

Grafana Dashboard Provisioning

Create dashboards as JSON files in grafana/dashboards/. Key panels:

Service Overview Dashboard:

Request rate: sum(rate(http_server_request_duration_seconds_count[5m])) by (service_name)
P95 latency: histogram_quantile(0.95, sum(rate(http_server_request_duration_seconds_bucket[5m])) by (le, service_name))
Error rate: sum(rate(http_server_request_duration_seconds_count{http_status_code=~"5.."}[5m])) / sum(rate(http_server_request_duration_seconds_count[5m]))

Alert Rules:

Error rate > 5% for 5 min → P2
P99 latency > 2s for 5 min → P3
Service health check failing for 1 min → P1

Signal Correlation in Grafana

Loki data source: add a derived field linking to Tempo. For the JSON log lines shown above, use the regex "trace_id":"(\w+)"; the pattern trace_id=(\w+) only matches key=value style logs.
Tempo data source: enable "Trace to logs" linking to Loki with filter {service_name="$service"} | trace_id="$traceId".
Metrics → Traces: use exemplars in Mimir to link metric data points to trace IDs.

Examples

Example 1: Quick OTel setup for Express app

Input: "Add observability to my Express API."

Output: Create src/instrumentation.ts with the Node SDK config above, add the --require flag to the start script, and set OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 and OTEL_SERVICE_NAME=api-gateway in .env.

Example

(The text is truncated here.)

📜 Original SKILL.md (the English/Chinese text Claude reads)

Observability Setup

Overview

This skill helps AI agents implement the three pillars of observability — traces, metrics, and structured logs — across microservice architectures. It uses OpenTelemetry as the instrumentation standard and Grafana's LGTM stack (Loki, Grafana, Tempo, Mimir) as the backend, though patterns apply to any OTel-compatible backend.

Instructions

OpenTelemetry Instrumentation (Node.js)

Install packages:

npm install @opentelemetry/api @opentelemetry/sdk-node @opentelemetry/sdk-metrics \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http @opentelemetry/exporter-metrics-otlp-http

Create src/instrumentation.ts — this file must be loaded BEFORE any other imports:

import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';

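// Fall back to the local Collector's OTLP/HTTP port when the variable is unset;
// otherwise the exporter URL would start with the string "undefined".
const otlpEndpoint = process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4318';

const sdk = new NodeSDK({
  serviceName: process.env.OTEL_SERVICE_NAME || 'my-service',
  traceExporter: new OTLPTraceExporter({ url: otlpEndpoint + '/v1/traces' }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({ url: otlpEndpoint + '/v1/metrics' }),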
    exportIntervalMillis: 15000,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();

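Add to the start script, after compiling to JavaScript (assuming tsc writes to dist/): node --require ./dist/instrumentation.js dist/index.js
Node versions without built-in TypeScript support cannot load a .ts file through --require on their own; to run the sources directly, register a loader first, for example: node --require ts-node/register --require ./src/instrumentation.ts src/index.ts

Add custom spans for business-critical operations:

// SpanStatusCode is needed for the error branch below.
import { trace, SpanStatusCode } from '@opentelemetry/api';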
const tracer = trace.getTracer('payment-service');

async function processPayment(order) {
  return tracer.startActiveSpan('process-payment', async (span) => {
    span.setAttribute('order.id', order.id);
    span.setAttribute('payment.amount_cents', order.totalCents);
    try {
      const result = await paymentGateway.charge(order);
      span.setAttribute('payment.status', result.status);
      return result;
    } catch (error) {
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      throw error;
    } finally {
      span.end();
    }
  });
}

OpenTelemetry Instrumentation (Python)

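Install: pip install opentelemetry-distro opentelemetry-exporter-otlp
Then run opentelemetry-bootstrap -a install to add instrumentation packages for the libraries already in your environment.
Run: opentelemetry-instrument --service_name notification-service python app.py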

For structured logs with trace context:

import structlog
from opentelemetry import trace

def add_trace_context(logger, method_name, event_dict):
    span = trace.get_current_span()
    ctx = span.get_span_context()
    if ctx.is_valid:
        event_dict['trace_id'] = format(ctx.trace_id, '032x')
        event_dict['span_id'] = format(ctx.span_id, '016x')
    return event_dict

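# JSONRenderer emits one JSON object per line; ConsoleRenderer is meant for local development output.
structlog.configure(processors=[add_trace_context, structlog.processors.JSONRenderer()])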

Structured Logging with Trace Correlation

Every log line must include trace_id and span_id. Use JSON format:

{"level":"error","msg":"payment failed","trace_id":"abc123...","span_id":"def456...","service":"payment-service","error":"timeout","timestamp":"2025-01-15T10:30:00Z"}

For Node.js, use pino with a custom mixin:

const pino = require('pino');
const { trace } = require('@opentelemetry/api');

const logger = pino({
  mixin() {
    const span = trace.getActiveSpan();
    if (span) {
      const ctx = span.spanContext();
      return { trace_id: ctx.traceId, span_id: ctx.spanId };
    }
    return {};
  }
});
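
The correlation rule above (every log line carries trace_id and span_id) can also be sketched with the Python standard library alone. This is a minimal illustration, not the Skill's own implementation: `current_span` is a hypothetical stand-in for the OpenTelemetry span context so that the example runs without any third-party package, and `format_line` is a helper invented here for checking the output.

```python
import contextvars
import json
import logging

# Hypothetical stand-in for the active span context: (trace_id, span_id) as ints.
current_span = contextvars.ContextVar("current_span", default=None)


class TraceJsonFormatter(logging.Formatter):
    """Render each record as one JSON object, adding trace fields when a span is active."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        entry = {
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
            "service": self.service,
        }
        span = current_span.get()
        if span is not None:
            trace_id, span_id = span
            # Same hex widths as the structlog processor: 128-bit trace ID, 64-bit span ID.
            entry["trace_id"] = format(trace_id, "032x")
            entry["span_id"] = format(span_id, "016x")
        return json.dumps(entry)


def format_line(level, message, service="payment-service"):
    """Format a single message without attaching handlers (handy for checks)."""
    record = logging.LogRecord("app", level, "app.py", 0, message, None, None)
    return TraceJsonFormatter(service).format(record)
```

In a real service the formatter would be attached to a handler with `handler.setFormatter(TraceJsonFormatter("payment-service"))`, and the IDs would come from the OpenTelemetry API rather than a context variable.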

OTel Collector Configuration

receivers:
  otlp:
    protocols:
      grpc: { endpoint: "0.0.0.0:4317" }
      http: { endpoint: "0.0.0.0:4318" }

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000
  memory_limiter:
    check_interval: 1s  # must be greater than zero; 1s is the commonly recommended value
    limit_mib: 512

exporters:
  otlphttp/tempo:
    endpoint: http://tempo:4318
  prometheusremotewrite:
    endpoint: http://mimir:9009/api/v1/push
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]
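
A Collector pipeline only works if every component it lists is also defined in the matching top-level section, and the memory limiter is normally placed first among the processors so it can shed load before other processing happens. The sketch below checks both points on a config that has already been loaded into a dict; `validate_pipelines` and `CONFIG` are names made up for this illustration, and the dict mirrors only the keys of the YAML above.

```python
def validate_pipelines(config):
    """Return a list of problems found in a Collector config given as a dict."""
    problems = []
    for name, pipeline in config["service"]["pipelines"].items():
        for section in ("receivers", "processors", "exporters"):
            defined = config.get(section, {})
            for component in pipeline.get(section, []):
                if component not in defined:
                    problems.append(f"{name}: {section[:-1]} '{component}' is not defined")
        processors = pipeline.get("processors", [])
        if "memory_limiter" in processors and processors[0] != "memory_limiter":
            problems.append(f"{name}: memory_limiter should be the first processor")
    return problems


# Dict form of the configuration above (component settings omitted for brevity).
CONFIG = {
    "receivers": {"otlp": {}},
    "processors": {"batch": {}, "memory_limiter": {}},
    "exporters": {"otlphttp/tempo": {}, "prometheusremotewrite": {}, "loki": {}},
    "service": {
        "pipelines": {
            "traces": {"receivers": ["otlp"], "processors": ["memory_limiter", "batch"], "exporters": ["otlphttp/tempo"]},
            "metrics": {"receivers": ["otlp"], "processors": ["memory_limiter", "batch"], "exporters": ["prometheusremotewrite"]},
            "logs": {"receivers": ["otlp"], "processors": ["memory_limiter", "batch"], "exporters": ["loki"]},
        }
    },
}

print(validate_pipelines(CONFIG))
```

An empty list means every pipeline refers only to defined components.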

Grafana Dashboard Provisioning

Create dashboards as JSON files in grafana/dashboards/. Key panels:

Service Overview Dashboard:

Request rate: sum(rate(http_server_request_duration_seconds_count[5m])) by (service_name)
P95 latency: histogram_quantile(0.95, sum(rate(http_server_request_duration_seconds_bucket[5m])) by (le, service_name))
Error rate: sum(rate(http_server_request_duration_seconds_count{http_status_code=~"5.."}[5m])) / sum(rate(http_server_request_duration_seconds_count[5m]))
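
To make the panel queries less opaque, here is a rough sketch of the arithmetic behind them: a quantile estimated from cumulative histogram buckets by linear interpolation, and an error rate as a ratio of counts. It approximates what `histogram_quantile` does and is not Prometheus's actual code; the bucket values in the usage note are invented, and returning 0.0 for a service with no traffic is a choice made for this example.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with the +Inf bucket,
    mirroring the `le` label on *_bucket series.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the last finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound


def error_rate(total_requests, server_errors):
    """Fraction of requests that returned a 5xx status; 0.0 when there is no traffic."""
    return server_errors / total_requests if total_requests else 0.0
```

For example, with buckets `[(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]` the 0.95 quantile lands halfway through the 0.5 to 1.0 second bucket, which shows why bucket boundaries limit the precision of any latency panel.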

Alert Rules:

Error rate > 5 % for 5 min → P2
P99 latency > 2s for 5 min → P3
Service health check failing for 1 min → P1
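
The three rules can be read as a small decision table. The sketch below encodes them, assuming the inputs have already been aggregated over the stated windows (for instance by the queries in the dashboard section); `ServiceWindow` and `evaluate` are hypothetical names, and a real setup would express these as Grafana or Mimir alert rules rather than application code.

```python
from dataclasses import dataclass


@dataclass
class ServiceWindow:
    error_rate: float        # fraction of 5xx responses over the last 5 minutes
    p99_latency_s: float     # P99 latency over the last 5 minutes, in seconds
    health_failing_s: float  # how long the health check has been failing, in seconds


def evaluate(window):
    """Return (priority, reason) pairs for every rule the window violates."""
    alerts = []
    if window.health_failing_s >= 60:
        alerts.append(("P1", "health check failing"))
    if window.error_rate > 0.05:
        alerts.append(("P2", "error rate above 5%"))
    if window.p99_latency_s > 2.0:
        alerts.append(("P3", "P99 latency above 2s"))
    return alerts
```

Listing the rules in priority order keeps the most urgent alert first when several fire at once.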

Signal Correlation in Grafana

Loki data source: add a derived field linking to Tempo. For the JSON log lines shown above, use the regex "trace_id":"(\w+)"; the pattern trace_id=(\w+) only matches key=value style logs.
Tempo data source: enable "Trace to logs" linking to Loki with filter {service_name="$service"} | trace_id="$traceId".
Metrics → Traces: use exemplars in Mimir to link metric data points to trace IDs.
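
A derived field works by running a regex over each log line and taking the first capture group as the trace ID, so the pattern has to match the log format actually in use. The sketch below is a quick way to try a pattern against sample lines before configuring Grafana; the two patterns and the `extract_trace_id` helper are illustrative assumptions, one for key=value logs and one for JSON logs.

```python
import re

# Matches key=value style output, e.g. "trace_id=abc123".
KEY_VALUE = re.compile(r'trace_id=(\w+)')
# Matches JSON output, e.g. {"trace_id":"abc123"}.
JSON_FIELD = re.compile(r'"trace_id":\s*"(\w+)"')


def extract_trace_id(line):
    """Return the trace ID found in a log line, or None when neither pattern matches."""
    for pattern in (JSON_FIELD, KEY_VALUE):
        match = pattern.search(line)
        if match:
            return match.group(1)
    return None


print(extract_trace_id('{"level":"error","msg":"payment failed","trace_id":"abc123"}'))
print(extract_trace_id("level=error trace_id=abc123 msg=timeout"))
```

Note that the key=value pattern finds nothing in a JSON line, which is a common reason for a derived-field link not appearing.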

Examples

Example 1 — Quick OTel setup for Express app

Input: "Add observability to my Express API."

Output: Create src/instrumentation.ts with the Node SDK config above, add the --require flag to the start script, set OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 and OTEL_SERVICE_NAME=api-gateway in .env.
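
How the two variables turn into exporter targets can be sketched as follows. This only mirrors the string handling in the Node SDK config shown earlier; `otlp_urls` is a hypothetical helper, and the fallback values are assumptions taken from this example and from the config's 'my-service' default.

```python
def otlp_urls(env):
    """Derive the service name and OTLP/HTTP signal URLs from an environment mapping."""
    base = (env.get("OTEL_EXPORTER_OTLP_ENDPOINT") or "http://localhost:4318").rstrip("/")
    return {
        "service": env.get("OTEL_SERVICE_NAME") or "my-service",
        "traces": f"{base}/v1/traces",
        "metrics": f"{base}/v1/metrics",
    }


print(otlp_urls({
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4318",
    "OTEL_SERVICE_NAME": "api-gateway",
}))
```

Stripping a trailing slash avoids URLs such as `http://localhost:4318//v1/traces` when the endpoint is written with one.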

Example 2 — Docker Compose for LGTM stack

Input: "Give me a docker-compose for the observability backend."

Output:

services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.96.0
    # The contrib image reads its configuration from /etc/otelcol-contrib/config.yaml.
    volumes: ["./otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml"]
    ports: ["4317:4317", "4318:4318"]
  tempo:
    image: grafana/tempo:2.4.0
    command: ["-config.file=/etc/tempo.yaml"]
    volumes: ["./tempo.yaml:/etc/tempo.yaml"]
  loki:
    image: grafana/loki:3.0.0
    ports: ["3100:3100"]
  mimir:
    image: grafana/mimir:2.11.0
    command: ["-config.file=/etc/mimir.yaml"]
    volumes: ["./mimir.yaml:/etc/mimir.yaml"]
  grafana:
    image: grafana/grafana:10.4.0
    ports: ["3000:3000"]
    volumes: ["./grafana/provisioning:/etc/grafana/provisioning", "./grafana/dashboards:/var/lib/grafana/dashboards"]
    environment:
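      # Anonymous Admin access is convenient on a local machine only; do not expose it on a shared network.
      GF_AUTH_ANONYMOUS_ENABLED: "true"
      GF_AUTH_ANONYMOUS_ORG_ROLE: Admin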

Guidelines

Instrument before you optimize. You cannot improve what you cannot see.
Auto-instrumentation first, custom spans second. Auto-instrumentation covers 80 % of what you need. Add custom spans only for business-critical paths.
Always correlate signals. A trace without logs is incomplete. A metric spike without a trace is unactionable.
Set resource limits on the collector. Without memory_limiter, a traffic spike can OOM the collector and create a cascading failure.
Use service.name consistently. It is the primary grouping key across all three signal types. Mismatched names break correlation.
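
The last guideline can be checked mechanically. As a minimal sketch, assuming you can list the service names seen in each signal type (for example from label values in Tempo, Mimir, and Loki), the hypothetical helper below reports names that are missing from at least one signal, which usually points to a spelling mismatch.

```python
def find_mismatched_names(names_by_signal):
    """Map each service name to the signal types where it does not appear."""
    all_names = set().union(*names_by_signal.values())
    mismatched = {}
    for name in sorted(all_names):
        missing = sorted(sig for sig, names in names_by_signal.items() if name not in names)
        if missing:
            mismatched[name] = missing
    return mismatched


# Invented sample data: the metrics pipeline uses an underscore instead of a hyphen.
seen = {
    "traces": {"api-gateway", "payment-service"},
    "metrics": {"api-gateway", "payment_service"},
    "logs": {"api-gateway", "payment-service"},
}
print(find_mismatched_names(seen))
```

With the sample data, both spellings of the payment service are reported, each with the signals it is absent from.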