🛠️ Development / MCP Community

monitoring-setup

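A Skill that sets up health checks, performance metrics, distributed tracing, and related instrumentation so you can see how a service is running. It also wires up alert notifications for anomalies and links them to response runbooks, supporting stable operation of the whole system.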

📜 Original English description (for reference)

Adds structured observability to services including health check endpoints (liveness, readiness, startup), metrics collection (latency, error rates, throughput), distributed tracing with correlation IDs, alert threshold configuration with escalation policies, and runbook links. Use when adding monitoring, setting up observability, creating health checks, configuring alerts, or when the user needs production readiness instrumentation.

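🇯🇵 Commentary for Japanese creators

In a nutshell

* This commentary was added by the jpskill.com editorial team for Japanese business settings. It is reference information and is independent of how the Skill itself behaves.

⚡ Recommended: install with a single command (60 seconds)

Copy the command below and paste it into a terminal (Mac/Linux) or PowerShell (Windows). Download, extraction, and placement are fully automated.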

🍎 Mac / 🐧 Linux

mkdir -p ~/.claude/skills && cd ~/.claude/skills && curl -L -o monitoring-setup.zip https://jpskill.com/download/9131.zip && unzip -o monitoring-setup.zip && rm monitoring-setup.zip

🪟 Windows (PowerShell)

$d = "$env:USERPROFILE\.claude\skills"; ni -Force -ItemType Directory $d | Out-Null; iwr https://jpskill.com/download/9131.zip -OutFile "$d\monitoring-setup.zip"; Expand-Archive "$d\monitoring-setup.zip" -DestinationPath $d -Force; ri "$d\monitoring-setup.zip"

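When it finishes, restart Claude Code. The Skill then activates automatically when you make a related request in plain language, for example "Add monitoring to our Express.js order service".

💾 Manual download (for those who find the command difficult)

1. Press the blue button below to download monitoring-setup.zip
2. Double-click the ZIP file to extract it, which creates a monitoring-setup folder
3. Move that folder to C:\Users\YourName\.claude\skills\ (Windows) or ~/.claude/skills/ (Mac)
4. Restart Claude Code

⬇ Download as .zip (recommended) ⬇ .skill format (for advanced users) Original source ↗

⚠️ Download and use at your own risk. This site accepts no responsibility for the content, behavior, or safety of the Skill.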

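🎯 What this Skill can do

The description below explains what this Skill does for you. It activates automatically when you ask Claude for work in this area.

📦 Installation (3 steps)

1. Press the "Download" button above to get the .skill file
2. Change the file extension from .skill to .zip and extract it (macOS can extract it automatically)
3. Place the extracted folder in .claude/skills/ under your home folder
- macOS / Linux: ~/.claude/skills/
- Windows: %USERPROFILE%\.claude\skills\

Restart Claude Code and you are done. You do not need to say "use this Skill"; it is invoked automatically for related requests.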


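Last updated: 2026-05-18
Retrieved: 2026-05-18
Bundled files: 1

📖 Skill body (Japanese translation)

* The original (English/Chinese) was translated into Japanese with Gemini. Claude itself reads the original. If anything looks mistranslated, check the original text.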

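Monitoring Setup

Overview

Adds production-grade observability to any service by implementing health checks, metrics, traces, and alerts as one cohesive system. The skill treats monitoring as structured output rather than ad-hoc logging, and generates files that integrate with standard observability stacks (Prometheus, Grafana, OpenTelemetry, PagerDuty/OpsGenie).

When to use

When adding monitoring or observability to a service
When creating health check endpoints for Kubernetes or load balancers
When instrumenting metrics (latency, error rate, throughput)
When setting up distributed tracing across services
When configuring alert thresholds and escalation policies
When the user mentions "production readiness", "SLOs", "SLIs", or "runbooks"
When deploying a service that needs operational instrumentation

Do not use when:

The user needs to debug an existing monitoring setup (use systematic-debugging)
The task is configuring a specific vendor dashboard (follow the vendor's documentation)
The user needs log aggregation only (logging is not the focus of this skill)

Workflow

1. Add health check endpoints

Create three distinct health check endpoints. Each serves a different purpose in orchestration systems such as Kubernetes.

Endpoint	Path	Purpose	What to check
Liveness	`GET /healthz`	"Is the process alive?"	The process is running and not deadlocked. Minimal checks only.
Readiness	`GET /readyz`	"Can this instance serve traffic?"	Database connected, cache warm, dependencies reachable.
Startup	`GET /startupz`	"Has initialization completed?"	Migrations run, config loaded, initial data seeded.

Critical distinction: Liveness should never check external dependencies. If the liveness probe checks the database and the DB goes down, Kubernetes restarts healthy pods and makes the outage worse. Liveness = "Is this process fundamentally broken?" Readiness = "Should traffic be routed here?"

Response format: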

{
  "status": "ok",
  "checks": {
    "database": { "status": "ok", "latency_ms": 2 },
    "cache": { "status": "ok", "latency_ms": 1 },
    "external_api": { "status": "degraded", "latency_ms": 450 }
  },
  "version": "1.2.3",
  "uptime_seconds": 84321
}

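Return HTTP 200 when healthy and 503 when unhealthy. Include the status of each individual check so operators can see which dependency is failing.

2. Instrument metrics collection

Use the RED and USE methods to ensure comprehensive coverage.

RED method (for request-driven services):

Metric	What to measure	Prometheus type	Example
Rate	Requests per second	Counter	`http_requests_total{method, path, status}`
Errors	Failed requests per second	Counter	`http_errors_total{method, path, code}`
Duration	Request latency distribution	Histogram	`http_request_duration_seconds{method, path}`

USE method (for resource-driven components):

Metric	What to measure	Prometheus type	Example
Utilization	Share of resource capacity in use	Gauge	`db_pool_utilization_ratio`
Saturation	Queue depth / backpressure	Gauge	`request_queue_length`
Errors	Resource-level error count	Counter	`db_connection_errors_total`

Implementation requirements:

Use the Prometheus client library for the service's language
Expose metrics at GET /metrics in Prometheus exposition format
Use histogram buckets appropriate for the service: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10] seconds for HTTP
Label metrics with method, path (normalized), and status_code
Add a metrics middleware that instruments all requests automatically
Include process metrics (memory, CPU, GC, and event loop lag where applicable)

3. Add distributed tracing

Implement OpenTelemetry-compatible tracing with correlation ID propagation.

Trace context propagation:

Generate a unique trace-id (128-bit hex) for each incoming request that does not have one
Propagate via the W3C Trace Context headers (traceparent, tracestate)
Also support X-Correlation-ID / X-Request-ID for backward compatibility
Pass trace context to all downstream HTTP calls, message queue publishes, and async jobs

Span creation:

Create a root span for each incoming request
Create child spans for database queries, external HTTP calls, cache operations, and message queue operations
Include span attributes: http.method, http.url, http.status_code, db.system, db.statement
On failure, set the span status to ERROR together with the error message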
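Configuration output: generate a trace config file.

// tracing.js - OpenTelemetry configuration
// Load this file before any other module so auto-instrumentation can patch them.
const { NodeSDK } = require("@opentelemetry/sdk-node");
const { getNodeAutoInstrumentations } = require(
  "@opentelemetry/auto-instrumentations-node",
);
const { OTLPTraceExporter } = require(
  "@opentelemetry/exporter-trace-otlp-http",
);

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    // An explicit url is used as-is, so read the per-signal variable, which
    // carries the full path. OTEL_EXPORTER_OTLP_ENDPOINT is a base URL only.
    url: process.env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT ||
      "http://localhost:4318/v1/traces",
  }),
  instrumentations: [getNodeAutoInstrumentations()],
  serviceName: process.env.OTEL_SERVICE_NAME || "my-service",
});

sdk.start();

4. Configure alert thresholds

Define alerts based on SLOs (Service Level Objectives).

(The translated text is truncated at this point.)

📜 Original SKILL.md (the English/Chinese text that Claude reads)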

Monitoring Setup

Overview

Add production-grade observability to any service by instrumenting health checks, metrics, tracing, and alerts as a cohesive system. The skill treats monitoring as structured output — not ad-hoc logging — producing files that integrate with standard observability stacks (Prometheus, Grafana, OpenTelemetry, PagerDuty/OpsGenie).

When to use

When adding monitoring or observability to a service
When creating health check endpoints for Kubernetes or load balancers
When instrumenting metrics (latency, error rates, throughput)
When setting up distributed tracing across services
When configuring alert thresholds and escalation policies
When the user mentions "production readiness", "SLOs", "SLIs", or "runbooks"
When deploying a service and needing operational instrumentation

Do NOT use when:

The user needs to debug an existing monitoring setup (use systematic-debugging)
The task is configuring a specific vendor dashboard (just follow vendor docs)
The user needs log aggregation only (logging is not this skill's focus)

Workflow

1. Add health check endpoints

Create three distinct health check endpoints. Each serves a different purpose in orchestration systems like Kubernetes:

Endpoint	Path	Purpose	What to check
Liveness	`GET /healthz`	"Is the process alive?"	Process is running, not deadlocked. Minimal checks only.
Readiness	`GET /readyz`	"Can this instance serve traffic?"	Database connected, cache warm, dependencies reachable.
Startup	`GET /startupz`	"Has initialization completed?"	Migrations run, config loaded, initial data seeded.

Critical distinction: Liveness should NEVER check external dependencies. If your liveness probe checks the database and the DB goes down, Kubernetes will restart your healthy pods — making an outage worse. Liveness = "is this process fundamentally broken?" Readiness = "should traffic be routed here?"

Response format:

{
  "status": "ok",
  "checks": {
    "database": { "status": "ok", "latency_ms": 2 },
    "cache": { "status": "ok", "latency_ms": 1 },
    "external_api": { "status": "degraded", "latency_ms": 450 }
  },
  "version": "1.2.3",
  "uptime_seconds": 84321
}

Return HTTP 200 for healthy, 503 for unhealthy. Include individual check statuses so operators can see which dependency is failing.
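
The mapping from individual check results to the overall status and HTTP code can be sketched as a small pure function. This is only an illustration: `summarizeHealth` is a hypothetical name, and treating "degraded" as still healthy is an assumption inferred from the sample response above, where the overall status is "ok" even though one dependency is degraded.

```javascript
// Hypothetical helper: fold per-dependency check results into one response.
// Assumption: only "error" fails the probe; "degraded" is reported but passes.
function summarizeHealth(checks, { version = "unknown", uptimeSeconds = 0 } = {}) {
  const failing = Object.entries(checks)
    .filter(([, check]) => check.status === "error")
    .map(([name]) => name);
  const healthy = failing.length === 0;
  return {
    httpStatus: healthy ? 200 : 503,
    body: {
      status: healthy ? "ok" : "unhealthy",
      checks, // individual results stay visible to operators
      version,
      uptime_seconds: uptimeSeconds,
    },
  };
}

const result = summarizeHealth({
  database: { status: "ok", latency_ms: 2 },
  external_api: { status: "degraded", latency_ms: 450 },
});
console.log(result.httpStatus, result.body.status); // prints: 200 ok
```

Keeping this logic separate from the route handler makes the pass/fail policy easy to unit test without a running database or cache.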

2. Instrument metrics collection

Use the RED and USE methods to ensure comprehensive coverage:

RED method (for request-driven services):

Metric	What to measure	Prometheus type	Example
Rate	Requests per second	Counter	`http_requests_total{method, path, status}`
Errors	Failed requests per second	Counter	`http_errors_total{method, path, code}`
Duration	Request latency distribution	Histogram	`http_request_duration_seconds{method, path}`

USE method (for resource-driven components):

Metric	What to measure	Prometheus type	Example
Utilization	% of resource capacity in use	Gauge	`db_pool_utilization_ratio`
Saturation	Queue depth / backpressure	Gauge	`request_queue_length`
Errors	Resource-level error count	Counter	`db_connection_errors_total`

Implementation requirements:

Use Prometheus client library for the service's language
Expose metrics at GET /metrics in Prometheus exposition format
Use histogram buckets appropriate for the service: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10] seconds for HTTP
Label metrics with method, path (normalized), and status_code
Add a metrics middleware that instruments ALL requests automatically
Include process metrics (memory, CPU, GC, event loop lag where applicable)
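
The "path (normalized)" requirement matters because raw URLs containing IDs create one time series per distinct value. A minimal dependency-free sketch of one way to normalize follows; `normalizePath` and its two patterns (numeric segments and UUIDs) are assumptions for illustration, not part of the skill. Where the web framework exposes the matched route pattern, using that instead may be simpler.

```javascript
// Hypothetical helper: collapse high-cardinality path segments before the
// path is used as a metric label. The two patterns below are assumptions.
const UUID = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
const NUMERIC = /^\d+$/;

function normalizePath(rawUrl) {
  const pathOnly = rawUrl.split("?")[0]; // drop the query string
  return pathOnly
    .split("/")
    .map((segment) =>
      NUMERIC.test(segment) || UUID.test(segment) ? ":id" : segment
    )
    .join("/");
}

console.log(normalizePath("/orders/12345/items?page=2")); // /orders/:id/items
```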

3. Add distributed tracing

Implement OpenTelemetry-compatible tracing with correlation ID propagation:

Trace context propagation:

Generate a unique trace-id (128-bit hex) for each incoming request without one
Propagate via W3C Trace Context headers: traceparent, tracestate
Also support X-Correlation-ID / X-Request-ID for backward compatibility
Pass trace context to ALL downstream HTTP calls, message queue publishes, and async jobs
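
To make the propagation rules concrete, here is a rough sketch of handling the `traceparent` header by hand with only the Node.js standard library. In practice the OpenTelemetry SDK does this for you; `nextTraceparent` is a hypothetical helper, it only accepts version `00` headers, and marking newly started traces as sampled (`01`) is an assumption.

```javascript
const { randomBytes } = require("node:crypto");

// traceparent layout: version "00", 32-hex trace-id, 16-hex parent-id, 2-hex flags
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;
const allZero = (hex) => /^0+$/.test(hex);

// Hypothetical helper: reuse the caller's trace-id when the header is valid,
// otherwise start a new trace. A fresh span id is minted for this hop either way.
function nextTraceparent(incoming) {
  const match =
    typeof incoming === "string" ? TRACEPARENT.exec(incoming.trim()) : null;
  // All-zero trace-id or parent-id values are invalid in W3C Trace Context.
  const valid = Boolean(match) && !allZero(match[1]) && !allZero(match[2]);
  const traceId = valid ? match[1] : randomBytes(16).toString("hex"); // 128-bit
  const flags = valid ? match[3] : "01"; // assumption: sample new traces
  const spanId = randomBytes(8).toString("hex"); // 64-bit
  return { traceId, spanId, header: `00-${traceId}-${spanId}-${flags}` };
}

const inbound = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
const ctx = nextTraceparent(inbound);
console.log(ctx.traceId); // 4bf92f3577b34da6a3ce929d0e0e4736
console.log(nextTraceparent(undefined).header); // a newly generated header
```

The returned `header` value is what would be attached to each downstream HTTP call, queue message, or async job so that every hop shares the same trace-id.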

Span creation:

Create a root span for each incoming request
Create child spans for: database queries, external HTTP calls, cache operations, message queue operations
Include span attributes: http.method, http.url, http.status_code, db.system, db.statement
Set span status to ERROR on failures with error message

Configuration output — generate a trace config file:

// tracing.js - OpenTelemetry configuration
// Load this file before any other module so auto-instrumentation can patch them.
const { NodeSDK } = require("@opentelemetry/sdk-node");
const { getNodeAutoInstrumentations } = require(
  "@opentelemetry/auto-instrumentations-node",
);
const { OTLPTraceExporter } = require(
  "@opentelemetry/exporter-trace-otlp-http",
);

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    // An explicit url is used as-is, so read the per-signal variable, which
    // carries the full path. OTEL_EXPORTER_OTLP_ENDPOINT is a base URL only.
    url: process.env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT ||
      "http://localhost:4318/v1/traces",
  }),
  instrumentations: [getNodeAutoInstrumentations()],
  serviceName: process.env.OTEL_SERVICE_NAME || "my-service",
});

sdk.start();

4. Configure alert thresholds

Define alerts based on SLOs (Service Level Objectives), not arbitrary values. The process:

Define SLOs — e.g., "99.9% of requests complete in < 500ms"
Derive SLIs — the metric that measures the SLO (e.g., http_request_duration_seconds)
Set burn rate alerts — alert when you're consuming error budget too fast

Alert threshold guidelines:

SLO Target	Burn Rate 1h	Burn Rate 6h	Burn Rate 24h
99.9%	14.4x	6x	3x
99.5%	14.4x	6x	3x
99.0%	14.4x	6x	3x
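
The multipliers are identical in every row because a burn rate is expressed relative to the error budget; the absolute error rate at which an alert fires still differs per SLO. A minimal sketch of that arithmetic follows. Both function names are illustrative, and the 30-day SLO period is an assumption that the skill does not state.

```javascript
// Error rate at which a burn-rate alert fires: error budget * burn rate.
function burnRateThreshold(sloTarget, burnRate) {
  const errorBudget = 1 - sloTarget; // e.g. 0.001 for a 99.9% SLO
  return errorBudget * burnRate;
}

// Fraction of the whole budget spent if the burn rate holds for the window.
// Assumption: the SLO is evaluated over a 30-day period.
function budgetConsumed(burnRate, windowHours, periodDays = 30) {
  return (burnRate * windowHours) / (periodDays * 24);
}

for (const slo of [0.999, 0.995, 0.99]) {
  const pct = (burnRateThreshold(slo, 14.4) * 100).toFixed(2);
  console.log(`SLO ${(slo * 100).toFixed(1)}%: 1h alert fires above ${pct}% errors`);
}
console.log(budgetConsumed(14.4, 1).toFixed(2)); // 0.02
```

Under the 30-day assumption, 14.4x sustained for 1 hour, 6x for 6 hours, and 3x for 24 hours correspond to roughly 2%, 5%, and 10% of the period's error budget.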

Alert rule format (Prometheus alerting rules):

groups:
  - name: slo-alerts
    rules:
      - alert: HighErrorRate
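        # Fast-burn check for the 1h column above: the long window (1h) shows
        # the burn is sustained, the short window (5m) shows it is still happening.
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > (1 - 0.999) * 14.4
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > (1 - 0.999) * 14.4
        for: 5m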
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Error rate burning through SLO budget at 14.4x"
          description: "Current error rate: {{ $value | humanizePercentage }}"
          runbook: "https://runbooks.example.com/high-error-rate"
          dashboard: "https://grafana.example.com/d/slo-overview"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
          > 0.5
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "P99 latency exceeds 500ms SLO target"
          runbook: "https://runbooks.example.com/high-latency"

Escalation policy:

Severity	Response Time	Notification Channel	Escalation After
critical	5 minutes	PagerDuty page	15 min to lead
warning	30 minutes	Slack #alerts	2 hours to team
info	Next business day	Slack #monitoring	None

Every alert MUST include a runbook annotation linking to resolution steps.
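
One way to enforce this rule is a small lint step over the alert rules before they are deployed. The sketch below assumes the YAML has already been parsed into plain objects (parsing is out of scope here); `lintAlertRule` and its exact checks are hypothetical.

```javascript
// Hypothetical lint: `rule` mirrors one entry under `rules:` in the
// Prometheus rule file, after YAML parsing.
function lintAlertRule(rule) {
  const problems = [];
  const labels = rule.labels || {};
  const annotations = rule.annotations || {};
  if (!["critical", "warning", "info"].includes(labels.severity)) {
    problems.push("labels.severity must be critical, warning, or info");
  }
  if (!labels.team) {
    problems.push("labels.team is missing");
  }
  if (!/^https?:\/\//.test(annotations.runbook || "")) {
    problems.push("annotations.runbook must be a URL");
  }
  return problems;
}

const highLatency = {
  alert: "HighLatency",
  labels: { severity: "warning", team: "platform" },
  annotations: { runbook: "https://runbooks.example.com/high-latency" },
};
console.log(lintAlertRule(highLatency)); // []
console.log(lintAlertRule({ alert: "NoRunbook", labels: { severity: "critical" } }));
```

Running a check like this in CI turns the "must include a runbook" rule from a convention into something that fails a build.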

5. Create runbook templates

Generate a runbook for each alert with this structure:

# Runbook: [Alert Name]

## Alert Details

- **Severity:** critical/warning/info
- **SLO:** Which SLO this protects
- **Dashboard:** Link to relevant Grafana dashboard

## Symptoms

What the operator will observe when this fires.

## Diagnosis Steps

1. Check [specific metric/dashboard]
2. Look for [specific log pattern]
3. Verify [specific dependency]

## Resolution

### If caused by [root cause A]

1. Step-by-step fix

### If caused by [root cause B]

1. Step-by-step fix

## Escalation

- If not resolved in [time]: escalate to [team/person]
- If customer-facing: notify [channel]

6. Generate dashboard configuration

Produce a Grafana dashboard JSON or config covering:

Overview row: Request rate, error rate, latency P50/P95/P99
Health row: Health check status, uptime, version
Resources row: CPU, memory, DB pool utilization, queue depth
SLO row: Error budget remaining, burn rate, SLO compliance

Checklist

[ ] Liveness endpoint at /healthz — checks process only, NOT dependencies
[ ] Readiness endpoint at /readyz — checks all dependencies with individual status
[ ] Startup endpoint at /startupz — checks initialization completion
[ ] Health responses include status, individual checks, version, uptime
[ ] Metrics endpoint at /metrics in Prometheus exposition format
[ ] RED metrics: request rate, error rate, duration histogram
[ ] USE metrics: utilization, saturation, errors for resources
[ ] Metrics middleware instruments all requests automatically
[ ] Trace context propagation via W3C headers (traceparent)
[ ] Correlation ID generated for requests without trace context
[ ] Child spans for DB queries, HTTP calls, cache, message queues
[ ] Alert thresholds derived from SLOs, not arbitrary values
[ ] Every alert has severity, team label, runbook link, and dashboard link
[ ] Escalation policy defined per severity level
[ ] Runbook template generated for each alert
[ ] Dashboard config covers request metrics, health, resources, and SLOs

Example

Input: "Add monitoring to our Express.js order service"

Output files produced:

File	Contents
`health.js`	Liveness, readiness, startup route handlers
`metrics.js`	Prometheus client setup + metrics middleware
`tracing.js`	OpenTelemetry SDK configuration
`alert-rules.yml`	Prometheus alerting rules with SLO-based thresholds
`runbooks/`	One markdown file per alert
`dashboard.json`	Grafana dashboard configuration

Example health endpoint implementation:

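// health.js
const express = require("express");
const router = express.Router();
const { Pool } = require("pg");
const Redis = require("ioredis"); // assumption: ioredis is the Redis client in use

// Assumption: connection settings come from DATABASE_URL and REDIS_URL.
const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const redis = new Redis(process.env.REDIS_URL);

const startTime = Date.now();
let startupComplete = false;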

// Liveness - process alive, no dependency checks
router.get("/healthz", (req, res) => {
  res.status(200).json({
    status: "ok",
    uptime_seconds: Math.floor((Date.now() - startTime) / 1000),
  });
});

// Readiness - can serve traffic
router.get("/readyz", async (req, res) => {
  const checks = {};
  let healthy = true;

  // Check database
  try {
    const start = Date.now();
    await pool.query("SELECT 1");
    checks.database = { status: "ok", latency_ms: Date.now() - start };
  } catch (err) {
    checks.database = { status: "error", error: err.message };
    healthy = false;
  }

  // Check Redis
  try {
    const start = Date.now();
    await redis.ping();
    checks.cache = { status: "ok", latency_ms: Date.now() - start };
  } catch (err) {
    checks.cache = { status: "error", error: err.message };
    healthy = false;
  }

  res.status(healthy ? 200 : 503).json({
    status: healthy ? "ok" : "unhealthy",
    checks,
    version: process.env.APP_VERSION || "unknown",
    uptime_seconds: Math.floor((Date.now() - startTime) / 1000),
  });
});

// Startup - initialization complete
router.get("/startupz", (req, res) => {
  res.status(startupComplete ? 200 : 503).json({
    status: startupComplete ? "ok" : "starting",
    uptime_seconds: Math.floor((Date.now() - startTime) / 1000),
  });
});

function markStartupComplete() {
  startupComplete = true;
}

module.exports = { router, markStartupComplete };

Common mistakes

Mistake	Fix
Liveness checks database/external deps	Liveness = process health only. Move dependency checks to readiness. DB down + liveness fail = cascading restarts.
Using `console.log` instead of metrics	Logs are for debugging, metrics are for monitoring. Use counters/histograms for anything you'd alert on.
Arbitrary alert thresholds ("error > 10")	Derive thresholds from SLOs and burn rates. "10 errors" means nothing without knowing request volume.
No correlation ID propagation	Generate trace ID on ingress, propagate to ALL downstream calls. Without this, distributed debugging is impossible.
Missing runbook links on alerts	Every alert must link to a runbook. An alert without a runbook is just noise that trains operators to ignore alerts.
Single health endpoint for everything	Separate liveness/readiness/startup. Kubernetes uses them differently; conflating them causes incorrect pod lifecycle decisions.
Metrics without labels	Always label with method, path, status. Aggregate metrics hide the signal — you need to slice by dimension.
No histogram buckets for latency	Use histograms, not averages. P99 latency matters more than mean. Configure buckets for your expected range.
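
The last row is easier to see with numbers. The sketch below estimates a quantile from cumulative histogram buckets by linear interpolation inside the matching bucket, which is a simplified version of the idea behind PromQL's `histogram_quantile`, not a reimplementation of it. The function name and the sample bucket counts are made up for illustration.

```javascript
// buckets: [{ le: upperBoundSeconds, count: cumulativeCount }], sorted by le,
// ending with the +Inf bucket (le: Infinity).
function estimateQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN; // no observations
  const rank = q * total;
  let prevLe = 0;
  let prevCount = 0;
  for (const { le, count } of buckets) {
    if (count >= rank) {
      if (le === Infinity) return prevLe; // cannot interpolate into +Inf
      const inBucket = count - prevCount;
      return prevLe + (le - prevLe) * ((rank - prevCount) / inBucket);
    }
    prevLe = le;
    prevCount = count;
  }
  return prevLe;
}

// Made-up data: 1000 requests, most fast, a small slow tail.
const buckets = [
  { le: 0.1, count: 950 },
  { le: 0.5, count: 980 },
  { le: 2.5, count: 1000 },
  { le: Infinity, count: 1000 },
];
console.log(estimateQuantile(0.5, buckets).toFixed(3)); // median estimate
console.log(estimateQuantile(0.99, buckets).toFixed(3)); // P99 estimate
```

With this sample data the median estimate is about 0.05 s while the P99 estimate is about 1.5 s, which is the kind of tail a mean or median hides. The estimate is only as precise as the bucket boundaries, which is why the buckets should cover the expected latency range.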

Quick reference

Component	Output file	Format
Health checks	`health.{js,ts,py}`	Express/Fastify/Flask routes
Metrics	`metrics.{js,ts,py}`	Prometheus client + middleware
Tracing	`tracing.{js,ts,py}`	OpenTelemetry SDK config
Alert rules	`alert-rules.yml`	Prometheus alerting rules
Runbooks	`runbooks/*.md`	Markdown per alert
Dashboard	`dashboard.json`	Grafana dashboard JSON

Key principles

Liveness is sacred — Never put dependency checks in liveness probes. A liveness failure triggers a pod restart. If your DB is down and liveness checks the DB, Kubernetes restarts all pods, making recovery harder. Liveness answers only: "is this process fundamentally broken?"
SLOs drive alerts — Every alert threshold must trace back to a Service Level Objective. "Error rate > 1%" is meaningless without knowing the SLO. Use burn rate alerting: alert when you're consuming error budget faster than sustainable.
Metrics over logs — Anything you would alert on must be a metric, not a log line. Metrics are aggregatable, queryable, and cheap. Log-based alerting is fragile, expensive, and misses patterns that counters catch naturally.
Trace everything cross-service — Every request entering the system gets a trace ID. Every downstream call propagates it. Without end-to-end tracing, debugging distributed systems requires correlating timestamps across log streams — which doesn't scale.
Alerts without runbooks are noise — Every alert must link to a runbook with diagnosis steps and resolution procedures. Operators receiving alerts without context will either ignore them or waste time investigating from scratch. Runbooks encode institutional knowledge.