jpskill.com

eks-observability

EKS observability with metrics, logging, and tracing. Use when setting up monitoring, configuring logging pipelines, implementing distributed tracing, building production dashboards, troubleshooting EKS issues, optimizing observability costs, or establishing SLOs.

⚡ Recommended: install with a single command (60 seconds)

Copy the command below and paste it into a terminal (Mac/Linux) or PowerShell (Windows). It downloads, extracts, and places the files automatically.

🍎 Mac / 🐧 Linux
mkdir -p ~/.claude/skills && cd ~/.claude/skills && curl -L -o eks-observability.zip https://jpskill.com/download/9416.zip && unzip -o eks-observability.zip && rm eks-observability.zip
🪟 Windows (PowerShell)
$d = "$env:USERPROFILE\.claude\skills"; ni -Force -ItemType Directory $d | Out-Null; iwr https://jpskill.com/download/9416.zip -OutFile "$d\eks-observability.zip"; Expand-Archive "$d\eks-observability.zip" -DestinationPath $d -Force; ri "$d\eks-observability.zip"

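When it finishes, restart Claude Code. The Skill then activates automatically when you make a related request in plain language, such as "Set up monitoring for my EKS cluster".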

💾 Manual download (for those who find the command difficult)
  1. Click the blue button below to download eks-observability.zip
  2. Double-click the ZIP file to extract it; this creates an eks-observability folder
  3. Move that folder to C:\Users\YourName\.claude\skills\ (Windows) or ~/.claude/skills/ (Mac)
  4. Restart Claude Code

⚠️ Download and use at your own risk. This site accepts no responsibility for the content, behavior, or safety of the Skill.

🎯 What this Skill can do

The description below shows what this Skill does for you. It activates automatically when you ask Claude for help in this area.

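📦 Installation (3 steps)

  1. Click the "Download" button above to get the .skill file
  2. Change the file extension from .skill to .zip and extract it (macOS can extract it automatically)
  3. Place the extracted folder in .claude/skills/ under your home folder
    • macOS / Linux: ~/.claude/skills/
    • Windows: %USERPROFILE%\.claude\skills\

Restart Claude Code to finish. The Skill is invoked automatically for related requests, even if you do not say "use this Skill".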

Last updated: 2026-05-18
Retrieved: 2026-05-18
Bundled files: 1

📜 Original SKILL.md (the text Claude reads)

EKS Observability

Overview

Complete observability solution for Amazon EKS using AWS-native managed services and open-source tools. This skill implements the three-pillar approach (metrics, logs, traces) with 2025 best practices including ADOT, Amazon Managed Prometheus, Fluent Bit, and OpenTelemetry.

Keywords: EKS monitoring, CloudWatch Container Insights, Prometheus, Grafana, ADOT, Fluent Bit, X-Ray, OpenTelemetry, distributed tracing, log aggregation, metrics collection, observability stack

Status: Production-ready with 2025 best practices

When to Use This Skill

  • Setting up monitoring for EKS clusters
  • Implementing centralized logging pipelines
  • Configuring distributed tracing
  • Building production dashboards in Grafana
  • Troubleshooting application performance
  • Establishing SLOs and error budgets
  • Optimizing observability costs
  • Migrating from X-Ray SDKs to OpenTelemetry
  • Correlating metrics, logs, and traces
  • Setting up alerting and on-call runbooks

The Three-Pillar Approach (2025 Recommendation)

1. Metrics

CloudWatch Container Insights + Amazon Managed Prometheus (AMP)

  • Dual monitoring provides complete visibility
  • CloudWatch for AWS-native integration and quick setup
  • Prometheus for advanced queries and community dashboards
  • Amazon Managed Grafana for visualization

2. Logs

Fluent Bit → CloudWatch Logs

  • Lightweight log forwarder (AWS deprecated FluentD in Feb 2025)
  • DaemonSet deployment for automatic collection
  • Structured logging with JSON parsing
  • Optional aggregation to OpenSearch for analytics

3. Traces

ADOT → AWS X-Ray

  • OpenTelemetry standard (X-Ray SDKs entering maintenance mode 2026)
  • ADOT Collector converts OTLP to X-Ray format
  • Distributed tracing across microservices
  • Integration with CloudWatch ServiceLens

Quick Start Workflow

Step 1: Enable CloudWatch Container Insights

Using EKS Add-on (Recommended):

# CloudWatchAgentServerPolicy is an AWS managed policy, so no custom policy
# has to be created; it is attached directly below.

# Create the IAM role for IRSA.
# --role-name makes the role ARN used by create-addon predictable.
# --role-only creates just the role; the add-on creates the service account.
eksctl create iamserviceaccount \
  --name cloudwatch-agent \
  --namespace amazon-cloudwatch \
  --cluster my-cluster \
  --role-name CloudWatchAgentRole \
  --role-only \
  --attach-policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy \
  --approve

# Install Container Insights add-on
aws eks create-addon \
  --cluster-name my-cluster \
  --addon-name amazon-cloudwatch-observability \
  --service-account-role-arn arn:aws:iam::ACCOUNT_ID:role/CloudWatchAgentRole

Verify Installation:

# Check add-on status
aws eks describe-addon \
  --cluster-name my-cluster \
  --addon-name amazon-cloudwatch-observability

# Verify pods running
kubectl get pods -n amazon-cloudwatch

What You Get:

  • Node-level metrics (CPU, memory, disk, network)
  • Pod-level metrics (resource usage, restart counts)
  • Namespace-level aggregations
  • Automatic CloudWatch Logs integration
  • Pre-built CloudWatch dashboards

Step 2: Deploy Amazon Managed Prometheus

Create AMP Workspace:

# Create workspace
aws amp create-workspace \
  --alias my-cluster-metrics \
  --region us-west-2

# Get workspace ID
WORKSPACE_ID=$(aws amp list-workspaces \
  --alias my-cluster-metrics \
  --query 'workspaces[0].workspaceId' \
  --output text)

# Create IRSA for AMP ingestion
eksctl create iamserviceaccount \
  --name amp-ingest \
  --namespace prometheus \
  --cluster my-cluster \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess \
  --approve

Deploy kube-prometheus-stack:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install with AMP remote write.
# Reuse the amp-ingest service account created above (eksctl already annotated
# it with the IAM role) so the role's trust policy matches the Prometheus pod.
# The --set values are quoted so the shell does not interpret the [0] index.
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace prometheus \
  --create-namespace \
  --set "prometheus.prometheusSpec.remoteWrite[0].url=https://aps-workspaces.us-west-2.amazonaws.com/workspaces/${WORKSPACE_ID}/api/v1/remote_write" \
  --set "prometheus.prometheusSpec.remoteWrite[0].sigv4.region=us-west-2" \
  --set prometheus.serviceAccount.create=false \
  --set prometheus.serviceAccount.name=amp-ingest

What You Get:

  • Prometheus Operator for CRD-based monitoring
  • Node Exporter for hardware metrics
  • kube-state-metrics for cluster state
  • Alertmanager for alert routing
  • 100+ pre-built Grafana dashboards

Step 3: Deploy Fluent Bit for Logging

Create IRSA for Fluent Bit:

eksctl create iamserviceaccount \
  --name fluent-bit \
  --namespace logging \
  --cluster my-cluster \
  --attach-policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy \
  --approve

Deploy Fluent Bit:

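# The cloudWatch.* values below are defined by the aws-for-fluent-bit chart
# in the eks-charts repository, not by the upstream fluent/fluent-bit chart.
helm repo add eks https://aws.github.io/eks-charts
helm repo update

# Reuse the fluent-bit service account created above (already annotated by eksctl).
helm install fluent-bit eks/aws-for-fluent-bit \
  --namespace logging \
  --create-namespace \
  --set serviceAccount.create=false \
  --set serviceAccount.name=fluent-bit \
  --set cloudWatch.enabled=true \
  --set cloudWatch.region=us-west-2 \
  --set cloudWatch.logGroupName=/aws/eks/my-cluster/logs \
  --set cloudWatch.autoCreateGroup=true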

What You Get:

  • Automatic log collection from all pods
  • Structured JSON log parsing
  • CloudWatch Logs integration
  • Multi-line log support
  • Kubernetes metadata enrichment

Step 4: Deploy ADOT for Distributed Tracing

Install ADOT Operator:

# Create IRSA for ADOT
eksctl create iamserviceaccount \
  --name adot-collector \
  --namespace adot \
  --cluster my-cluster \
  --attach-policy-arn arn:aws:iam::aws:policy/AWSXRayDaemonWriteAccess \
  --attach-policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy \
  --approve

# Install ADOT add-on
aws eks create-addon \
  --cluster-name my-cluster \
  --addon-name adot \
  --service-account-role-arn arn:aws:iam::ACCOUNT_ID:role/ADOTCollectorRole

Deploy ADOT Collector:

# adot-collector.yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: adot-collector
  namespace: adot
spec:
  mode: deployment
  serviceAccount: adot-collector
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318

    processors:
      batch:
        timeout: 30s
        send_batch_size: 50
      memory_limiter:
        check_interval: 1s
        limit_mib: 512

    exporters:
      awsxray:
        region: us-west-2
      awsemf:
        region: us-west-2
        namespace: EKS/Observability

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [awsxray]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [awsemf]

# Apply the collector manifest
kubectl apply -f adot-collector.yaml

What You Get:

  • OTLP receiver for OpenTelemetry traces
  • Automatic X-Ray integration
  • Service map visualization
  • Trace sampling and filtering
  • CloudWatch ServiceLens integration

Step 5: Setup Amazon Managed Grafana

Create AMG Workspace:

# Create workspace (via AWS Console recommended)
# Or use AWS CLI:
aws grafana create-workspace \
  --workspace-name my-cluster-grafana \
  --account-access-type CURRENT_ACCOUNT \
  --authentication-providers AWS_SSO \
  --permission-type SERVICE_MANAGED

Add Data Sources:

  1. Navigate to AMG workspace URL
  2. Configuration → Data Sources → Add data source
  3. Add Amazon Managed Service for Prometheus
    • Region: us-west-2
    • Workspace: Select your AMP workspace
  4. Add CloudWatch
    • Default region: us-west-2
    • Namespaces: ContainerInsights, EKS/Observability
  5. Add AWS X-Ray
    • Default region: us-west-2

Import Dashboards:

# EKS Container Insights Dashboard
Dashboard ID: 16028

# Node Exporter Full Dashboard
Dashboard ID: 1860

# Kubernetes Cluster Monitoring
Dashboard ID: 15760

Production Deployment Checklist

Infrastructure

  • [ ] CloudWatch Container Insights enabled (EKS add-on)
  • [ ] Amazon Managed Prometheus workspace created
  • [ ] kube-prometheus-stack deployed with remote write
  • [ ] Fluent Bit DaemonSet running on all nodes
  • [ ] ADOT Collector deployed (deployment or daemonset)
  • [ ] Amazon Managed Grafana workspace created
  • [ ] All IRSA roles configured with least-privilege policies

Configuration

  • [ ] Prometheus scrape configs include all targets
  • [ ] Fluent Bit log groups created and structured
  • [ ] ADOT sampling configured (5-10% for high traffic)
  • [ ] Grafana data sources connected (AMP, CloudWatch, X-Ray)
  • [ ] Log retention policies set (7-90 days typical)
  • [ ] Metric retention configured (AMP default 150 days)

Dashboards

  • [ ] Cluster overview dashboard (nodes, pods, namespaces)
  • [ ] Application performance dashboard (latency, errors, throughput)
  • [ ] Resource utilization dashboard (CPU, memory, disk)
  • [ ] Cost monitoring dashboard (resource waste, right-sizing)
  • [ ] Network performance dashboard (CNI metrics)

Alerting

  • [ ] Critical alerts: Pod crash loops, node not ready
  • [ ] Performance alerts: High latency, error rate spikes
  • [ ] Resource alerts: CPU/memory pressure, disk full
  • [ ] Cost alerts: Budget thresholds, waste detection
  • [ ] SNS topics configured for notifications
  • [ ] PagerDuty/Opsgenie integration (optional)

Application Instrumentation

  • [ ] OpenTelemetry SDK integrated in applications
  • [ ] Trace context propagation configured
  • [ ] Custom metrics exported via OTLP
  • [ ] Structured logging with JSON format
  • [ ] Log correlation with trace IDs
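
The last two checklist items (JSON logs that carry trace IDs) can be sketched with Python's standard logging module. This is a minimal illustration, not the skill's own implementation: the `JsonFormatter` and `build_logger` names are hypothetical, and passing `trace_id` and `span_id` through `extra=` is an assumed convention. In a real service the IDs would come from the active OpenTelemetry span.

```python
import json
import logging
import sys
import time


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    converter = time.gmtime  # emit UTC so the trailing "Z" is accurate

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            # Assumed convention: callers supply these via `extra=`.
            # They are None when the caller does not provide them.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        }
        return json.dumps(payload)


def build_logger(name: str, stream=sys.stdout) -> logging.Logger:
    """Return a logger that writes JSON lines to `stream`."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("checkout")
    log.error(
        "Database connection failed",
        extra={
            "trace_id": "1-67a2f3b1-123456789abcdef012345678",
            "span_id": "abcdef0123456789",
        },
    )
```

Writing one JSON object per line to stdout lets a DaemonSet log forwarder such as Fluent Bit parse the fields without a custom parser.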

Modern Observability Stack (2025)

┌─────────────────────────────────────────────────────────────┐
│                      EKS Cluster                            │
│                                                             │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐     │
│  │ Application  │  │ Application  │  │ Application  │     │
│  │ + OTel SDK   │  │ + OTel SDK   │  │ + OTel SDK   │     │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘     │
│         │                  │                  │             │
│         └──────────────────┴──────────────────┘             │
│                            │                                │
│                   ┌────────▼────────┐                       │
│                   │ ADOT Collector  │                       │
│                   │ (OTel)          │                       │
│                   └────────┬────────┘                       │
│                            │                                │
│         ┌──────────────────┼──────────────────┐            │
│         │                  │                  │            │
│    ┌────▼─────┐      ┌────▼─────┐      ┌────▼─────┐      │
│    │Prometheus│      │Fluent Bit│      │Container │      │
│    │  (local) │      │DaemonSet │      │ Insights │      │
│    └────┬─────┘      └────┬─────┘      └────┬─────┘      │
└─────────┼──────────────────┼──────────────────┼────────────┘
          │                  │                  │
          │                  │                  │
    ┌─────▼─────┐      ┌────▼─────┐      ┌────▼─────┐
    │   AMP     │      │CloudWatch│      │ X-Ray    │
    │(Managed   │      │  Logs    │      │          │
    │Prometheus)│      └────┬─────┘      └────┬─────┘
    └─────┬─────┘           │                  │
          │                 │                  │
          └─────────────────┴──────────────────┘
                            │
                   ┌────────▼────────┐
                   │Amazon Managed   │
                   │    Grafana      │
                   └─────────────────┘

Detailed Documentation

For comprehensive guides on each observability component:

  • Metrics Collection: references/metrics.md

    • CloudWatch Container Insights setup
    • Amazon Managed Prometheus configuration
    • kube-prometheus-stack deployment
    • Custom metrics and ServiceMonitors
    • Cost optimization strategies
  • Centralized Logging: references/logging.md

    • Fluent Bit configuration and parsers
    • CloudWatch Logs integration
    • OpenSearch aggregation (optional)
    • Log retention and lifecycle policies
    • Troubleshooting log collection
  • Distributed Tracing: references/tracing.md

    • ADOT Collector deployment patterns
    • OpenTelemetry SDK instrumentation
    • X-Ray integration and migration
    • Trace sampling strategies
    • ServiceLens and trace analysis

Cost Optimization

Metrics

  • Limit high-cardinality metrics (for example, keep only the 5-10% of labels that queries and alerts actually use)
  • Use metric relabeling to drop unnecessary labels
  • Aggregate metrics before remote write to AMP
  • Set appropriate retention periods (30-90 days typical)
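
To see why dropping labels matters, the sketch below counts distinct time series before and after a label is removed. It is an illustration only: `series_count` is a hypothetical helper, and the label names (`service`, `path`, `request_id`) are made up for the example. In the cluster, the equivalent change would be made with metric relabeling in the Prometheus configuration.

```python
def series_count(samples, drop_labels=()):
    """Count distinct label sets after removing the labels in `drop_labels`.

    `samples` is an iterable of dicts mapping label name to label value.
    Each distinct label set corresponds to one stored time series.
    """
    seen = set()
    for labels in samples:
        kept = tuple(
            sorted((k, v) for k, v in labels.items() if k not in drop_labels)
        )
        seen.add(kept)
    return len(seen)


if __name__ == "__main__":
    # Hypothetical data: 2 paths x 50 request IDs for one service.
    samples = [
        {"service": "api", "path": path, "request_id": str(i)}
        for path in ("/a", "/b")
        for i in range(50)
    ]
    print("before:", series_count(samples))
    print("after dropping request_id:", series_count(samples, ("request_id",)))
```

A single unbounded label such as a request ID multiplies the series count, which is what drives ingestion cost.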

Logs

  • Implement log sampling for verbose applications
  • Use CloudWatch Logs Insights instead of exporting to S3
  • Set aggressive retention for debug logs (7 days)
  • Keep audit logs longer (90+ days)

Traces

  • Sample traces based on traffic (5-10% default)
  • Increase sampling for errors (100%)
  • Use tail-based sampling for important transactions
  • X-Ray retains traces for 30 days and the period is not configurable, so control trace cost through sampling rather than cleanup
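
The sampling rules above (a small share of normal traffic, every error) amount to a decision made once a whole trace is known. The sketch below shows that logic in isolation. It is not how the ADOT Collector is configured, and `keep_trace` is a hypothetical function; hashing the trace ID is one assumed way to make the decision deterministic, so every replica makes the same choice for the same trace.

```python
import hashlib


def keep_trace(trace_id: str, has_error: bool, sample_rate: float = 0.05) -> bool:
    """Decide whether to keep a completed trace.

    Error traces are always kept. Other traces are kept with probability
    `sample_rate`, derived from a hash of the trace ID so the decision is
    stable across processes.
    """
    if has_error:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


if __name__ == "__main__":
    kept = sum(keep_trace(f"trace-{i}", has_error=False) for i in range(10_000))
    print(f"kept {kept} of 10000 non-error traces")
```

Because the decision needs the error status of the finished trace, this corresponds to tail-based rather than head-based sampling.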

Typical Monthly Costs:

  • Small cluster (10 nodes): $50-150/month
  • Medium cluster (50 nodes): $200-500/month
  • Large cluster (200+ nodes): $1000-2000/month

Integration Patterns

Correlation Between Pillars

Metrics → Logs:

# Find pods returning more than 0.1 5xx responses per second
sum by (pod) (rate(http_requests_total{status=~"5.."}[5m])) > 0.1
# Then search CloudWatch Logs for those pod names
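
The second step, searching logs for the pods that the query returned, can be sketched as a small helper that builds a CloudWatch Logs Insights query string. This is an assumption-laden example: `pod_logs_query` is a hypothetical function, and the `kubernetes.pod_name` field depends on how the log forwarder enriches records, so check the field names in your own log group before using it.

```python
import re

# Kubernetes object names: lowercase alphanumerics, "-" and ".".
_POD_NAME = re.compile(r"[a-z0-9]([-a-z0-9.]*[a-z0-9])?")


def pod_logs_query(pod_names, limit=100):
    """Build a Logs Insights query that returns recent logs for the given pods."""
    names = list(pod_names)
    if not names:
        raise ValueError("at least one pod name is required")
    for name in names:
        # Reject anything that is not a plain pod name, so untrusted input
        # cannot change the structure of the query.
        if not _POD_NAME.fullmatch(name):
            raise ValueError(f"invalid pod name: {name!r}")
    quoted = ", ".join(f'"{name}"' for name in names)
    return (
        "fields @timestamp, @message\n"
        f"| filter kubernetes.pod_name in [{quoted}]\n"  # assumed field name
        "| sort @timestamp desc\n"
        f"| limit {int(limit)}"
    )


if __name__ == "__main__":
    print(pod_logs_query(["api-7d9f", "worker-0"], limit=50))
```

The resulting string can be pasted into the Logs Insights console against the cluster's log group.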

Logs → Traces:

Include trace_id in structured logs (X-Ray format: 1-<8 hex digits>-<24 hex digits>):

{
  "timestamp": "2025-01-27T10:30:00Z",
  "level": "error",
  "message": "Database connection failed",
  "trace_id": "1-67a2f3b1-123456789abcdef012345678",
  "span_id": "abcdef0123456789"
}
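
Applications instrumented with OpenTelemetry see a 32-character hexadecimal trace ID, while X-Ray displays the same ID as `1-<8 hex>-<24 hex>`. To search X-Ray for a trace ID found in a log line, the two forms have to be converted. The sketch below shows the mapping as commonly documented for X-Ray; the function names are hypothetical, and newer X-Ray behavior around W3C trace IDs is not covered here.

```python
import re

_OTEL_ID = re.compile(r"[0-9a-f]{32}")
_XRAY_ID = re.compile(r"1-([0-9a-f]{8})-([0-9a-f]{24})")


def otel_to_xray_trace_id(otel_trace_id: str) -> str:
    """Convert a 32-hex-digit OpenTelemetry trace ID to X-Ray format."""
    tid = otel_trace_id.lower()
    if not _OTEL_ID.fullmatch(tid):
        raise ValueError("expected 32 hexadecimal characters")
    # The first 8 hex digits become the epoch segment, the rest the unique part.
    return f"1-{tid[:8]}-{tid[8:]}"


def xray_to_otel_trace_id(xray_trace_id: str) -> str:
    """Convert an X-Ray trace ID back to the 32-hex-digit form."""
    match = _XRAY_ID.fullmatch(xray_trace_id.lower())
    if not match:
        raise ValueError("expected 1-<8 hex>-<24 hex>")
    return match.group(1) + match.group(2)


if __name__ == "__main__":
    print(otel_to_xray_trace_id("67a2f3b1123456789abcdef012345678"))
```

Logging one form consistently (and converting at query time) avoids mismatches when jumping between CloudWatch Logs and the X-Ray console.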

Traces → Metrics:

  • Use trace data to identify slow endpoints
  • Create SLIs from trace latency percentiles
  • Alert on trace error rates

CloudWatch ServiceLens

Unified view combining:

  • X-Ray traces (request flow)
  • CloudWatch metrics (performance)
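  • CloudWatch Logs (detailed context)

# ServiceLens has no dedicated CLI command; it is populated automatically
# once Container Insights and X-Ray are enabled, and is viewed in the
# CloudWatch console. To inspect the underlying service map from the CLI,
# query X-Ray directly:
aws xray get-service-graph \
  --start-time 2025-01-27T00:00:00Z \
  --end-time 2025-01-27T23:59:59Z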

Troubleshooting Quick Reference

| Issue | Cause | Fix |
|---|---|---|
| No metrics in AMP | Missing IRSA or remote write config | Check Prometheus pod logs, verify IAM role |
| Logs not appearing | Fluent Bit not running or wrong IAM | kubectl logs -n logging fluent-bit-xxx |
| Traces not in X-Ray | ADOT not deployed or app not instrumented | Verify ADOT pods, check OTel SDK setup |
| High costs | Too much data ingestion | Enable sampling, reduce log verbosity |
| Missing pod metrics | kube-state-metrics not running | Check kube-prometheus-stack installation |
| Grafana can't connect | Data source IAM permissions | Add CloudWatch/AMP read policies to AMG role |

Production Runbooks

Incident Response

  1. Check Grafana overview dashboard - Identify affected services
  2. Review X-Ray service map - Find bottleneck in request flow
  3. Query CloudWatch Logs Insights - Get detailed error messages
  4. Correlate with metrics spike - Understand timeline and scope
  5. Execute remediation - Scale, restart, or rollback

Performance Investigation

  1. Start with RED metrics (Rate, Errors, Duration)
  2. Check USE metrics (Utilization, Saturation, Errors) for infrastructure
  3. Analyze trace percentiles (p50, p95, p99)
  4. Review log patterns during slow periods
  5. Identify optimization opportunities
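
The RED summary and trace percentiles from steps 1 and 3 can be computed from raw request records as in the sketch below. The `Request` type, the `red_summary` function, and the choice to count status codes of 500 and above as errors are assumptions made for this example. Percentiles use the nearest-rank method, which differs slightly from the interpolation that Prometheus histograms apply.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float
    status: int


def percentile(values, pct: int):
    """Nearest-rank percentile of a non-empty list; `pct` is an integer 1-100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def red_summary(requests, window_s: float) -> dict:
    """Return Rate, Errors, and Duration figures for one time window."""
    total = len(requests)
    if total == 0:
        return {"rate_rps": 0.0, "error_ratio": 0.0,
                "p50_s": None, "p95_s": None, "p99_s": None}
    errors = sum(1 for r in requests if r.status >= 500)
    durations = [r.duration_s for r in requests]
    return {
        "rate_rps": total / window_s,
        "error_ratio": errors / total,
        "p50_s": percentile(durations, 50),
        "p95_s": percentile(durations, 95),
        "p99_s": percentile(durations, 99),
    }


if __name__ == "__main__":
    sample = [Request(i / 100, 500 if i <= 5 else 200) for i in range(1, 101)]
    print(red_summary(sample, window_s=50))
```

Comparing p50 with p99 over the same window shows whether slowness affects most requests or only a tail.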

SLO Implementation

Define SLIs (Service Level Indicators):

# Availability SLI
- metric: probe_success
  target: 99.9%
  window: 30d

# Latency SLI
- metric: http_request_duration_seconds
  percentile: p99
  target: < 500ms
  window: 30d

# Error Rate SLI
- metric: http_requests_total{status=~"5.."}
  target: < 0.1%
  window: 30d

Calculate Error Budget:

Error Budget = 100% - SLO Target
Example: 99.9% SLO = 0.1% error budget
         = 43.2 minutes downtime/month
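
The arithmetic above generalizes to any target and window. The sketch below reproduces the 43.2-minute figure and also reports how much of a request-based budget remains; both function names are hypothetical, and a 30-day window is assumed as the default.

```python
def error_budget_minutes(slo_target_pct: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    budget_fraction = (100.0 - slo_target_pct) / 100.0
    return budget_fraction * window_days * 24 * 60


def budget_remaining(total_requests: int, failed_requests: int,
                     slo_target_pct: float) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    allowed_failures = total_requests * (100.0 - slo_target_pct) / 100.0
    if allowed_failures == 0:
        raise ValueError("no error budget: target is 100% or there is no traffic")
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    print(f"{error_budget_minutes(99.9):.1f} minutes per 30 days")
    print(f"{budget_remaining(1_000_000, 250, 99.9):.2%} of budget remaining")
```

A request-based budget is usually easier to track than downtime minutes, because partial outages consume it proportionally.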

Burn Rate Alerts:

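# Fast burn: 5% of the 30-day budget in 1 hour = burn rate 36
# (threshold shown for a 99.9% SLO, i.e. a 0.001 allowed error ratio)
(1 - slo:availability:ratio_rate_1h) > (36 * 0.001)

# Slow burn: 10% of the 30-day budget in 6 hours = burn rate 12
(1 - slo:availability:ratio_rate_6h) > (12 * 0.001)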
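
A statement such as "N% of the budget in W hours" translates into an alert threshold through the burn rate: the multiple of the allowed error ratio that would consume that share of the budget in that time. The sketch below derives the threshold under the assumption of a 30-day SLO window; the function names are hypothetical.

```python
def burn_rate(budget_fraction: float, alert_window_hours: float,
              slo_window_days: int = 30) -> float:
    """Burn rate that spends `budget_fraction` of the budget in the alert window."""
    return budget_fraction * (slo_window_days * 24) / alert_window_hours


def error_ratio_threshold(slo_target_pct: float, budget_fraction: float,
                          alert_window_hours: float,
                          slo_window_days: int = 30) -> float:
    """Error ratio above which the alert should fire."""
    allowed_error_ratio = (100.0 - slo_target_pct) / 100.0
    rate = burn_rate(budget_fraction, alert_window_hours, slo_window_days)
    return rate * allowed_error_ratio


if __name__ == "__main__":
    for label, fraction, hours in (("fast", 0.05, 1), ("slow", 0.10, 6)):
        print(label,
              "burn rate:", round(burn_rate(fraction, hours), 2),
              "threshold:", round(error_ratio_threshold(99.9, fraction, hours), 4))
```

For a 99.9% SLO this gives error-ratio thresholds of about 0.036 for the fast-burn case and 0.012 for the slow-burn case.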

Best Practices Summary

  1. Use Dual Monitoring: CloudWatch Container Insights + Prometheus
  2. Standardize on OpenTelemetry: Future-proof instrumentation
  3. Enable IRSA for Every Workload: Do not grant pod permissions through node IAM roles
  4. Deploy ADOT Collector: Vendor-neutral observability
  5. Sample Intelligently: 5-10% traces, 100% errors
  6. Structure Your Logs: JSON format with trace correlation
  7. Set Retention Policies: Balance cost and compliance
  8. Build Actionable Dashboards: Focus on SLIs and anomalies
  9. Implement Progressive Alerting: Warn before critical
  10. Regularly Review Costs: Optimize based on actual usage

Stack: CloudWatch Container Insights, AMP, Fluent Bit, ADOT, AMG, X-Ray
Standards: OpenTelemetry, IRSA, EKS Add-ons
Last Updated: January 2025 (2025 Best Practices)