ecs-troubleshooting

ECS troubleshooting and debugging guide covering task failures, service issues, networking problems, and performance diagnostics. Use when diagnosing ECS issues, debugging task failures (STOPPED, PENDING), resolving networking problems, investigating IAM/permissions errors, troubleshooting container health checks, or analyzing ECS service health.

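⚡ Recommended: install with a single command (about 60 seconds)

Copy the command below and paste it into a terminal (Mac/Linux) or PowerShell (Windows). It downloads, extracts, and places the Skill automatically.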

🍎 Mac / 🐧 Linux

mkdir -p ~/.claude/skills && cd ~/.claude/skills && curl -L -o ecs-troubleshooting.zip https://jpskill.com/download/9413.zip && unzip -o ecs-troubleshooting.zip && rm ecs-troubleshooting.zip

🪟 Windows (PowerShell)

$d = "$env:USERPROFILE\.claude\skills"; ni -Force -ItemType Directory $d | Out-Null; iwr https://jpskill.com/download/9413.zip -OutFile "$d\ecs-troubleshooting.zip"; Expand-Archive "$d\ecs-troubleshooting.zip" -DestinationPath $d -Force; ri "$d\ecs-troubleshooting.zip"

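When it finishes, restart Claude Code. The Skill then activates automatically when you ask about a related topic, for example "Why did my ECS task stop?".

💾 Manual download (if you prefer not to use the command line)

1. Download ecs-troubleshooting.zip from this page
2. Double-click the ZIP file to extract it, which creates an ecs-troubleshooting folder
3. Move that folder to C:\Users\<your name>\.claude\skills\ (Windows) or ~/.claude/skills/ (Mac)
4. Restart Claude Code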

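⚠️ Download and use at your own risk. This site accepts no responsibility for the content, behavior, or safety of the Skill.

🎯 What this Skill does

The description explains what this Skill does for you. It activates automatically when you ask Claude for help in this area.

📦 Installation (3 steps)

1. Click the Download button to get the .skill file
2. Change the file extension from .skill to .zip and extract it (macOS can extract it automatically)
3. Place the extracted folder in .claude/skills/ under your home folder
- macOS / Linux: ~/.claude/skills/
- Windows: %USERPROFILE%\.claude\skills\

Restart Claude Code to finish. You do not need to say "use this Skill"; it is invoked automatically for related requests.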

Last updated: 2026-05-18
Retrieved: 2026-05-18
Bundled files: 1

ECS Troubleshooting Guide

Complete guide to diagnosing and resolving common ECS issues.

Quick Diagnostic Commands

# Check service status
aws ecs describe-services \
  --cluster production \
  --services my-service \
  --query 'services[0].{status:status,running:runningCount,desired:desiredCount,events:events[:5]}'

# List stopped tasks (failures)
aws ecs list-tasks \
  --cluster production \
  --service-name my-service \
  --desired-status STOPPED

# Describe stopped task
aws ecs describe-tasks \
  --cluster production \
  --tasks <task-arn> \
  --query 'tasks[0].{status:lastStatus,reason:stoppedReason,containers:containers[*].{name:name,reason:reason,exitCode:exitCode}}'

# View recent logs
aws logs tail /ecs/my-app --since 1h --follow

# Execute into container (debug)
aws ecs execute-command \
  --cluster production \
  --task <task-id> \
  --container my-app \
  --interactive \
  --command "/bin/sh"

Task Failures

Task Status: STOPPED

Symptom

Tasks immediately stop after starting or fail to start.

Diagnostic Steps

import boto3

ecs = boto3.client('ecs')

def diagnose_stopped_task(cluster: str, task_arn: str):
    """Diagnose why a task stopped"""

    response = ecs.describe_tasks(cluster=cluster, tasks=[task_arn])
    task = response['tasks'][0]

    print(f"Task Status: {task['lastStatus']}")
    print(f"Stop Code: {task.get('stopCode', 'N/A')}")
    print(f"Stopped Reason: {task.get('stoppedReason', 'N/A')}")

    for container in task['containers']:
        print(f"\nContainer: {container['name']}")
        print(f"  Status: {container['lastStatus']}")
        print(f"  Exit Code: {container.get('exitCode', 'N/A')}")
        print(f"  Reason: {container.get('reason', 'N/A')}")

Common Causes & Solutions

1. Essential container failed

stoppedReason: "Essential container in task exited"

Solution: Check container logs for application errors

aws logs tail /ecs/my-app --since 30m

2. Task failed to start

stoppedReason: "Task failed to start"

Solution: Check execution role permissions

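# Verify the execution role has a managed policy that allows pulling the image
aws iam list-attached-role-policies --role-name ecsTaskExecutionRole
# Inspect an inline policy, if one is used ("ecr-access" is an example name)
aws iam get-role-policy --role-name ecsTaskExecutionRole --policy-name ecr-access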

3. CannotPullContainerError

reason: "CannotPullContainerError: Error response from daemon"

Solutions:

Check ECR permissions in execution role
Verify image exists: aws ecr describe-images --repository-name my-app
Check VPC endpoints or NAT gateway for private subnets

4. OutOfMemoryError

reason: "OutOfMemoryError: Container killed due to memory usage"
exitCode: 137

Solution: Increase memory in task definition

memory = 2048  # Increase from current value

5. Exit Code 1 (Application Error)

exitCode: 1

Solution: Check application logs for errors

aws logs filter-log-events \
  --log-group-name /ecs/my-app \
  --filter-pattern "ERROR"
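The failure signatures above can be matched mechanically once a task has been described. The following is a minimal sketch, not part of the original guide: `suggest_next_check` is a hypothetical helper that takes one task dict in the shape returned by `describe_tasks` and returns the first checks suggested by the list above. Treating exit code 137 as a possible out-of-memory kill follows the pairing used in this guide; a container can also exit with 137 for other reasons.

```python
def suggest_next_check(task: dict) -> list:
    """Map a stopped task (describe_tasks shape) to the checks listed above.

    Hypothetical helper for illustration; the hint texts mirror this guide.
    """
    hints = []
    stopped_reason = task.get("stoppedReason", "")

    # Task-level signatures
    if "Essential container in task exited" in stopped_reason:
        hints.append("Check container logs for application errors")
    if "Task failed to start" in stopped_reason:
        hints.append("Check execution role permissions")

    # Container-level signatures
    for container in task.get("containers", []):
        name = container.get("name", "?")
        reason = container.get("reason", "")
        exit_code = container.get("exitCode")
        if "CannotPullContainerError" in reason:
            hints.append(f"{name}: check ECR permissions, image existence, and the network path to ECR")
        elif "OutOfMemoryError" in reason or exit_code == 137:
            hints.append(f"{name}: possible OOM kill, consider increasing memory in the task definition")
        elif exit_code == 1:
            hints.append(f"{name}: check application logs for errors")
    return hints
```

It could be fed with `response['tasks'][0]` from the `diagnose_stopped_task` function shown earlier.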

Task Status: PENDING

Symptom

Tasks stuck in PENDING state, not transitioning to RUNNING.

Diagnostic Steps

def diagnose_pending_tasks(cluster: str, service: str):
    """Check why tasks are stuck in PENDING"""

    # List tasks the service wants running (this includes tasks still in PENDING)
    pending = ecs.list_tasks(
        cluster=cluster,
        serviceName=service,
        desiredStatus='RUNNING'
    )

    for task_arn in pending['taskArns']:
        task = ecs.describe_tasks(cluster=cluster, tasks=[task_arn])['tasks'][0]

        if task['lastStatus'] == 'PENDING':
            print(f"Task {task_arn.split('/')[-1]} is PENDING")

            # Check attachments for ENI issues
            for attachment in task.get('attachments', []):
                print(f"  Attachment: {attachment['type']} - {attachment['status']}")
                for detail in attachment.get('details', []):
                    print(f"    {detail['name']}: {detail['value']}")

Common Causes & Solutions

1. No available capacity

Service my-service was unable to place a task because no container instance met all of its requirements

Solutions for Fargate:

Check capacity provider limits
Verify subnet has available IPs
Check if region/AZ has Fargate capacity

2. ENI provisioning issues

Attachment status: PRECREATED

Solutions:

Check security group allows required traffic
Verify subnet has available IPs
Check ENI limits for EC2 instances

3. Image pull taking too long

Container image: pulling

Solutions:

Check image size (use smaller base images)
Verify network connectivity to ECR
Use VPC endpoints for faster pulls
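Two of the causes above come down to "Verify subnet has available IPs". A minimal sketch of that check, assuming the input is the response of the EC2 `describe_subnets` call (which reports `AvailableIpAddressCount` per subnet); the helper name and the `min_free` threshold are illustrative choices, not values from this guide.

```python
def subnets_low_on_ips(describe_subnets_response: dict, min_free: int = 5) -> list:
    """Return (subnet_id, free_ip_count) for subnets below the threshold.

    Each task in awsvpc network mode needs its own ENI, and therefore a free
    IP address in the subnet. `min_free` is an arbitrary illustrative threshold.
    """
    low = []
    for subnet in describe_subnets_response.get("Subnets", []):
        free = subnet.get("AvailableIpAddressCount", 0)
        if free < min_free:
            low.append((subnet.get("SubnetId", "?"), free))
    return low
```

With boto3 this could be called as `subnets_low_on_ips(boto3.client('ec2').describe_subnets(SubnetIds=[...]))`, passing the subnets configured for the service.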

Service Issues

Service Not Starting Tasks

Diagnostic

# Check service events
aws ecs describe-services \
  --cluster production \
  --services my-service \
  --query 'services[0].events[:10]'

Common Events & Solutions

1. "service my-service is unable to place a task"

Check task placement constraints and capacity.

2. "service my-service has reached a steady state"

Service is healthy - tasks are running as expected.

3. "service my-service was unable to place a task because no container instance met all requirements"

For Fargate: Check CPU/memory configurations are valid combinations.
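The Fargate CPU/memory check can be done before registering a task definition. The sketch below encodes the task-size combinations for Linux on Fargate as I understand them from the AWS documentation; treat the table as an assumption and verify it against the current docs, since supported sizes change over time. `is_valid_fargate_size` is a hypothetical helper name.

```python
# CPU units -> allowed memory values in MiB (assumed from AWS documentation).
FARGATE_MEMORY_BY_CPU = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),       # 1 GB to 4 GB, 1 GB steps
    1024: list(range(2048, 8192 + 1, 1024)),      # 2 GB to 8 GB, 1 GB steps
    2048: list(range(4096, 16384 + 1, 1024)),     # 4 GB to 16 GB, 1 GB steps
    4096: list(range(8192, 30720 + 1, 1024)),     # 8 GB to 30 GB, 1 GB steps
    8192: list(range(16384, 61440 + 1, 4096)),    # 16 GB to 60 GB, 4 GB steps
    16384: list(range(32768, 122880 + 1, 8192)),  # 32 GB to 120 GB, 8 GB steps
}


def is_valid_fargate_size(cpu: int, memory_mib: int) -> bool:
    """Return True if (cpu, memory) is a combination listed in the table above."""
    return memory_mib in FARGATE_MEMORY_BY_CPU.get(cpu, [])
```

For example, 256 CPU units with 4096 MiB is rejected by this table, while 1024 CPU units with 2048 MiB is accepted.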

Deployment Stuck

Symptom

Deployment never reaches COMPLETED state.

Diagnostic

def check_deployment_status(cluster: str, service: str):
    """Check deployment progress"""

    response = ecs.describe_services(cluster=cluster, services=[service])
    svc = response['services'][0]

    for deployment in svc['deployments']:
        print(f"\nDeployment: {deployment['id']}")
        print(f"  Status: {deployment['status']}")
        # rolloutState is only returned for rolling-update (ECS) deployments
        print(f"  Rollout State: {deployment.get('rolloutState', 'N/A')}")
        print(f"  Tasks: {deployment['runningCount']}/{deployment['desiredCount']}")

        if deployment.get('rolloutState') == 'IN_PROGRESS':
            reason = deployment.get('rolloutStateReason', '')
            print(f"  Reason: {reason}")

Common Causes

1. Health check failures

rolloutStateReason: "ECS deployment circuit breaker: tasks failed to start"

Solutions:

Check target group health check settings
Increase healthCheckGracePeriodSeconds
Verify application responds on health check path

2. Insufficient capacity

rolloutStateReason: "Service my-service was unable to place a task"

Solutions:

Check subnet IP availability
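Lower minimumHealthyPercent so old tasks can stop before new ones start, or reduce maximumPercent so fewer extra tasks launch at once; both cut the capacity needed during a rollout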
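How much capacity a rolling deployment needs follows from the service's desired count and its deployment percentages. A minimal sketch of that arithmetic, assuming the rounding described in the AWS documentation (minimum healthy count rounded up, maximum count rounded down); `deployment_task_bounds` is a hypothetical helper.

```python
def deployment_task_bounds(desired_count: int,
                           minimum_healthy_percent: int,
                           maximum_percent: int) -> tuple:
    """Return (min_running, max_total) task counts during a rolling deployment.

    Integer arithmetic avoids float rounding: the minimum is rounded up and
    the maximum is rounded down (assumed from the AWS documentation).
    """
    min_running = -(-desired_count * minimum_healthy_percent // 100)  # ceiling
    max_total = desired_count * maximum_percent // 100                # floor
    return min_running, max_total
```

If `max_total` equals `min_running`, the scheduler has no room to start a new task or stop an old one, so the deployment cannot make progress. If `max_total` exceeds what the cluster or subnets can hold, new tasks stay unplaced.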

Networking Issues

Tasks Cannot Connect to Internet

Symptoms

Cannot pull images
Cannot reach external APIs
Timeouts on external calls

Solutions

For private subnets:

# Option 1: NAT Gateway
resource "aws_nat_gateway" "main" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public.id
}

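# Option 2: VPC endpoints (recommended for reaching AWS services such as ECR;
# they do not provide general internet access). Pulling from ECR also needs
# an ecr.dkr interface endpoint and an S3 gateway endpoint.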
resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.us-east-1.ecr.api"
  vpc_endpoint_type = "Interface"
  subnet_ids        = aws_subnet.private[*].id
}

Tasks Cannot Connect to Each Other

Symptom

Service-to-service communication fails.

Diagnostic

# Check security group rules
aws ec2 describe-security-groups \
  --group-ids sg-12345 \
  --query 'SecurityGroups[0].IpPermissions'

Solutions

# Allow traffic between ECS tasks
resource "aws_security_group_rule" "ecs_to_ecs" {
  type                     = "ingress"
  from_port                = 8080
  to_port                  = 8080
  protocol                 = "tcp"
  security_group_id        = aws_security_group.ecs_tasks.id
  source_security_group_id = aws_security_group.ecs_tasks.id
}
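The `IpPermissions` list returned by the diagnostic command can also be checked programmatically. A minimal sketch, assuming the standard `describe-security-groups` response shape; `allows_port_from_group` is a hypothetical helper and only looks at rules that reference a source security group, not CIDR-based rules.

```python
def allows_port_from_group(ip_permissions: list, port: int,
                           source_group_id: str, protocol: str = "tcp") -> bool:
    """Return True if an ingress rule admits `port` from `source_group_id`."""
    for perm in ip_permissions:
        proto = perm.get("IpProtocol")
        if proto == "-1":
            # "-1" means all protocols and all ports; no port range is present
            port_ok = True
        elif proto == protocol:
            port_ok = perm.get("FromPort", -1) <= port <= perm.get("ToPort", -1)
        else:
            continue
        sources = {pair.get("GroupId") for pair in perm.get("UserIdGroupPairs", [])}
        if port_ok and source_group_id in sources:
            return True
    return False
```

For task-to-task traffic, the source group is the tasks' own security group, as in the Terraform rule above.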

Load Balancer Health Checks Failing

Symptom

Target group app-tg: 0 healthy, 3 unhealthy

Diagnostic

# Check target health
aws elbv2 describe-target-health \
  --target-group-arn <target-group-arn>
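When a target group has many targets, grouping the output of the command above by state and reason shows quickly whether all targets fail the same way. A minimal sketch, assuming the `describe_target_health` response shape (`TargetHealthDescriptions` entries with a `TargetHealth` dict); the helper name is hypothetical.

```python
from collections import Counter


def summarize_target_health(response: dict) -> Counter:
    """Count targets per (State, Reason) pair in a describe_target_health response."""
    counts = Counter()
    for description in response.get("TargetHealthDescriptions", []):
        health = description.get("TargetHealth", {})
        key = (health.get("State", "unknown"), health.get("Reason", ""))
        counts[key] += 1
    return counts
```

A single dominant reason usually points at configuration (path, port, security group) rather than at individual tasks.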

Common Causes & Solutions

1. Wrong health check path

health_check {
  path = "/health"  # Must match application endpoint
}

2. Container not listening on expected port

# Verify inside container
aws ecs execute-command --cluster production --task <task-id> \
  --container my-app --interactive --command "netstat -tlnp"

3. Security group blocking ALB

# Allow ALB to reach ECS tasks
resource "aws_security_group_rule" "alb_to_ecs" {
  type                     = "ingress"
  from_port                = 8080
  to_port                  = 8080
  protocol                 = "tcp"
  security_group_id        = aws_security_group.ecs_tasks.id
  source_security_group_id = aws_security_group.alb.id
}

IAM & Permissions Issues

CannotPullContainerError

Symptom

CannotPullContainerError: Error response from daemon: pull access denied

Solution: Task Execution Role

resource "aws_iam_role_policy_attachment" "ecs_task_execution" {
  role       = aws_iam_role.ecs_task_execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

# For cross-account ECR (the repository policy in the other account
# must also grant this role pull access)
resource "aws_iam_role_policy" "cross_account_ecr" {
  role = aws_iam_role.ecs_task_execution.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage"
      ]
      Resource = "arn:aws:ecr:*:OTHER_ACCOUNT:repository/*"
    }]
  })
}
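Whether the cross-account case applies can be read off the image URI, because a private ECR URI embeds the owning account ID. A minimal sketch, assuming the standard `<account>.dkr.ecr.<region>.amazonaws.com/<repository>` form (other partitions such as `amazonaws.com.cn` are not handled); both helper names are hypothetical.

```python
import re

# Matches private ECR image URIs with an optional :tag or @digest suffix.
_ECR_IMAGE_RE = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:@]+)(?:[:@].+)?$"
)


def parse_ecr_image(image: str):
    """Return account, region and repository for an ECR image URI, else None."""
    match = _ECR_IMAGE_RE.match(image)
    return match.groupdict() if match else None


def is_cross_account(image: str, task_account_id: str) -> bool:
    """Return True if the image lives in an ECR registry owned by another account."""
    parsed = parse_ecr_image(image)
    return parsed is not None and parsed["account"] != task_account_id
```

Images that are not hosted in ECR return `None` from `parse_ecr_image`, so `is_cross_account` reports `False` for them.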

Secrets Access Denied

Symptom

ResourceInitializationError: unable to pull secrets

Solution

resource "aws_iam_role_policy" "secrets_access" {
  role = aws_iam_role.ecs_task_execution.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = ["secretsmanager:GetSecretValue"]
        Resource = "arn:aws:secretsmanager:*:*:secret:my-app/*"
      },
      {
        Effect = "Allow"
        Action = ["ssm:GetParameters"]
        Resource = "arn:aws:ssm:*:*:parameter/my-app/*"
      },
      {
        Effect = "Allow"
        Action = ["kms:Decrypt"]
        Resource = aws_kms_key.secrets.arn
      }
    ]
  })
}
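Which of the statements above a task actually needs depends on where its secrets come from. A minimal sketch that inspects the `secrets` entries of a task definition's container definitions: a `valueFrom` that is a Secrets Manager ARN needs `secretsmanager:GetSecretValue`, and anything else is assumed to be an SSM parameter (referenced by ARN or by name) that needs `ssm:GetParameters`. The helper name is hypothetical.

```python
def required_secret_actions(container_definitions: list) -> set:
    """Return the IAM actions the execution role needs to inject the secrets."""
    actions = set()
    for container in container_definitions:
        for secret in container.get("secrets", []):
            value_from = secret.get("valueFrom", "")
            if value_from.startswith("arn:") and ":secretsmanager:" in value_from:
                actions.add("secretsmanager:GetSecretValue")
            else:
                # Assumption: non-Secrets-Manager references are SSM parameters
                actions.add("ssm:GetParameters")
    return actions
```

The container definitions could come from `describe_task_definition(...)['taskDefinition']['containerDefinitions']`.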

Execute Command Not Working

Symptom

SessionManagerPlugin is not found

Execute command is disabled

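Solutions

"SessionManagerPlugin is not found" means the Session Manager plugin is missing on the machine running the AWS CLI; install it there. For "Execute command is disabled":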

1. Enable execute command on service

resource "aws_ecs_service" "app" {
  enable_execute_command = true
}

2. Add SSM permissions to task role

resource "aws_iam_role_policy" "ssm_exec" {
  role = aws_iam_role.ecs_task.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "ssmmessages:CreateControlChannel",
        "ssmmessages:CreateDataChannel",
        "ssmmessages:OpenControlChannel",
        "ssmmessages:OpenDataChannel"
      ]
      Resource = "*"
    }]
  })
}
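Both settings above only take effect for tasks launched after the change, so it helps to confirm the state of a running task. A minimal sketch, assuming a task dict in the `describe_tasks` shape, where `enableExecuteCommand` and each container's `managedAgents` list (with an `ExecuteCommandAgent` entry) are reported; `exec_readiness` is a hypothetical helper.

```python
def exec_readiness(task: dict) -> list:
    """Return a list of problems that would prevent `aws ecs execute-command`."""
    problems = []
    if not task.get("enableExecuteCommand", False):
        problems.append("enableExecuteCommand is false for this task")
    for container in task.get("containers", []):
        agents = {agent.get("name"): agent.get("lastStatus")
                  for agent in container.get("managedAgents", [])}
        status = agents.get("ExecuteCommandAgent")
        if status != "RUNNING":
            name = container.get("name", "?")
            problems.append(f"{name}: ExecuteCommandAgent status is {status}")
    return problems
```

An empty list means the task side looks ready; remaining failures are then more likely on the client side or in the task role permissions.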

Performance Issues

High CPU/Memory Usage

Diagnostic

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client('cloudwatch')

def get_service_metrics(cluster: str, service: str):
    """Print CPU and memory utilization for the last hour"""

    end = datetime.now(timezone.utc)

    for metric in ('CPUUtilization', 'MemoryUtilization'):
        response = cloudwatch.get_metric_statistics(
            Namespace='AWS/ECS',
            MetricName=metric,
            Dimensions=[
                {'Name': 'ClusterName', 'Value': cluster},
                {'Name': 'ServiceName', 'Value': service}
            ],
            StartTime=end - timedelta(hours=1),
            EndTime=end,
            Period=300,
            Statistics=['Average', 'Maximum']
        )

        print(metric)
        for point in sorted(response['Datapoints'], key=lambda x: x['Timestamp']):
            print(f"  {point['Timestamp']}: Avg={point['Average']:.1f}%, Max={point['Maximum']:.1f}%")
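A single spike rarely justifies resizing; sustained load does. A minimal sketch that scans CloudWatch datapoints (dicts with `Timestamp` and `Average`, as returned by `get_metric_statistics`) for a run of consecutive periods above a threshold. The helper name and the default threshold and run length are illustrative assumptions, not recommendations from this guide.

```python
def sustained_high(datapoints: list, threshold: float = 80.0,
                   min_consecutive: int = 3) -> bool:
    """Return True if `Average` stays at or above `threshold` for
    `min_consecutive` consecutive datapoints (ordered by Timestamp)."""
    run = 0
    for point in sorted(datapoints, key=lambda p: p["Timestamp"]):
        run = run + 1 if point["Average"] >= threshold else 0
        if run >= min_consecutive:
            return True
    return False
```

With a 300-second period, three consecutive datapoints correspond to roughly 15 minutes of sustained load.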

Solutions

1. Right-size tasks

# Increase resources
cpu    = "1024"  # from 512
memory = "2048"  # from 1024

2. Enable auto-scaling

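resource "aws_appautoscaling_policy" "cpu" {
  policy_type = "TargetTrackingScaling"
  # name, resource_id, scalable_dimension, service_namespace omitted for brevity
  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value = 70.0
  }
}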

Slow Task Startup

Causes & Solutions

1. Large container image

Use smaller base images (alpine, distroless)
On Fargate (platform version 1.4.0), consider Seekable OCI (SOCI) lazy loading; Fargate does not cache images between task launches

2. Slow application startup

Increase startPeriod in health check
Optimize application initialization

3. Slow secret/config loading

Use VPC endpoints for faster access
Cache configuration at startup

Log Analysis

CloudWatch Logs Queries

# Find errors in the last hour (date -d is GNU date syntax)
aws logs filter-log-events \
  --log-group-name /ecs/my-app \
  --start-time $(date -d '-1 hour' +%s000) \
  --filter-pattern "ERROR"

# Find OOM kills
aws logs filter-log-events \
  --log-group-name /ecs/my-app \
  --filter-pattern "OutOfMemory"

# Find slow requests
aws logs filter-log-events \
  --log-group-name /ecs/my-app \
  --filter-pattern "[timestamp, level, duration>1000, ...]"

CloudWatch Insights

# Top errors by count
fields @timestamp, @message
| filter @message like /ERROR/
| stats count(*) as errorCount by @message
| sort errorCount desc
| limit 10

# Average response time
fields @timestamp, responseTime
| stats avg(responseTime) as avgTime, max(responseTime) as maxTime by bin(5m)

Related Skills

boto3-ecs: SDK patterns
terraform-ecs: Infrastructure as Code
ecs-fargate: Fargate specifics
ecs-deployment: Deployment strategies

Quick Reference

Symptom	First Check	Common Cause
Task STOPPED	`stoppedReason`	Container crash, OOM
Task PENDING	Attachments	ENI/network issues
Deployment stuck	Health checks	ALB health check failing
Cannot pull image	Execution role	Missing ECR permissions
Cannot connect	Security groups	Wrong SG rules