ecs-troubleshooting
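A Skill for troubleshooting and debugging Amazon ECS (Elastic Container Service). It diagnoses and resolves service and container problems such as stopped tasks, networking issues, and IAM permission errors.
📜 Original English description (for reference)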
ECS troubleshooting and debugging guide covering task failures, service issues, networking problems, and performance diagnostics. Use when diagnosing ECS issues, debugging task failures (STOPPED, PENDING), resolving networking problems, investigating IAM/permissions errors, troubleshooting container health checks, or analyzing ECS service health.
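🇯🇵 Notes for Japanese creators
A Skill for troubleshooting and debugging Amazon ECS (Elastic Container Service). It diagnoses and resolves service and container problems such as stopped tasks, networking issues, and IAM permission errors.
※ This note was added by the jpskill.com editorial team for Japanese business settings. It is reference information, independent of how the Skill itself behaves.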
Copy the command below and paste it into a terminal (Mac/Linux) or PowerShell (Windows). It downloads, extracts, and installs the Skill in one step.
mkdir -p ~/.claude/skills && cd ~/.claude/skills && curl -L -o ecs-troubleshooting.zip https://jpskill.com/download/9413.zip && unzip -o ecs-troubleshooting.zip && rm ecs-troubleshooting.zip
$d = "$env:USERPROFILE\.claude\skills"; ni -Force -ItemType Directory $d | Out-Null; iwr https://jpskill.com/download/9413.zip -OutFile "$d\ecs-troubleshooting.zip"; Expand-Archive "$d\ecs-troubleshooting.zip" -DestinationPath $d -Force; ri "$d\ecs-troubleshooting.zip"
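When it finishes, restart Claude Code. After that, an ordinary request such as "my ECS task keeps stopping, help me find out why" triggers the Skill automatically.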
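💾 Manual download (if you prefer not to use the command line)
- 1. Click the blue button below to download ecs-troubleshooting.zip
- 2. Double-click the ZIP file to extract it; this creates an ecs-troubleshooting folder
- 3. Move that folder to C:\Users\<your name>\.claude\skills\ (Windows) or ~/.claude/skills/ (Mac)
- 4. Restart Claude Code
⚠️ Download and use at your own risk. This site accepts no responsibility for the content, behavior, or safety of the Skill.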
🎯 What this Skill can do
The description below explains what this Skill does for you. It is triggered automatically when you ask Claude for help in this area.
📦 How to install (3 steps)
- 1. Click the "Download" button above to get the .skill file
- 2. Change the file extension from .skill to .zip and extract it (macOS can extract it automatically)
- 3. Place the extracted folder in .claude/skills/ under your home folder
  - · macOS / Linux: ~/.claude/skills/
  - · Windows: %USERPROFILE%\.claude\skills\
Restart Claude Code to finish. You do not need to say "use this Skill"; it is invoked automatically for related requests.
- Last updated: 2026-05-18
- Retrieved: 2026-05-18
- Bundled files: 1
ECS Troubleshooting Guide
Complete guide to diagnosing and resolving common ECS issues.
Quick Diagnostic Commands
# Check service status
aws ecs describe-services \
--cluster production \
--services my-service \
--query 'services[0].{status:status,running:runningCount,desired:desiredCount,events:events[:5]}'
# List stopped tasks (failures)
aws ecs list-tasks \
--cluster production \
--service-name my-service \
--desired-status STOPPED
# Describe stopped task
aws ecs describe-tasks \
--cluster production \
--tasks <task-arn> \
--query 'tasks[0].{status:lastStatus,reason:stoppedReason,containers:containers[*].{name:name,reason:reason,exitCode:exitCode}}'
# View recent logs
aws logs tail /ecs/my-app --since 1h --follow
# Execute into container (debug)
aws ecs execute-command \
--cluster production \
--task <task-id> \
--container my-app \
--interactive \
--command "/bin/sh"
Task Failures
Task Status: STOPPED
Symptom
Tasks immediately stop after starting or fail to start.
Diagnostic Steps
import boto3

ecs = boto3.client('ecs')

def diagnose_stopped_task(cluster: str, task_arn: str):
    """Diagnose why a task stopped"""
    response = ecs.describe_tasks(cluster=cluster, tasks=[task_arn])
    task = response['tasks'][0]
    print(f"Task Status: {task['lastStatus']}")
    print(f"Stop Code: {task.get('stopCode', 'N/A')}")
    print(f"Stopped Reason: {task.get('stoppedReason', 'N/A')}")
    # Per-container details usually point at the real failure
    for container in task['containers']:
        print(f"\nContainer: {container['name']}")
        print(f" Status: {container['lastStatus']}")
        print(f" Exit Code: {container.get('exitCode', 'N/A')}")
        print(f" Reason: {container.get('reason', 'N/A')}")
Common Causes & Solutions
1. Essential container failed
stoppedReason: "Essential container in task exited"
Solution: Check container logs for application errors
aws logs tail /ecs/my-app --since 30m
2. Task failed to start
stoppedReason: "Task failed to start"
Solution: Check execution role permissions
# Verify execution role can pull image
aws iam get-role-policy --role-name ecsTaskExecutionRole --policy-name ecr-access
3. CannotPullContainerError
reason: "CannotPullContainerError: Error response from daemon"
Solutions:
- Check ECR permissions in execution role
- Verify image exists:
aws ecr describe-images --repository-name my-app
- Check VPC endpoints or NAT gateway for private subnets
4. OutOfMemoryError
reason: "OutOfMemoryError: Container killed due to memory usage"
exitCode: 137
Solution: Increase memory in task definition
memory = 2048 # Increase from current value
5. Exit Code 1 (Application Error)
exitCode: 1
Solution: Check application logs for errors
aws logs filter-log-events \
--log-group-name /ecs/my-app \
--filter-pattern "ERROR"
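Exit codes such as 137 and 1 above are easier to read once decoded. The following is a minimal sketch, assuming the common shell convention that an exit code above 128 means "terminated by signal (code - 128)"; the helper name explain_exit_code is hypothetical and not part of any AWS SDK.

```python
import signal


def explain_exit_code(exit_code):
    """Return a short hint for a container exit code (hypothetical helper)."""
    if exit_code is None:
        return "no exit code: the container probably never started"
    if exit_code == 0:
        return "clean exit"
    if exit_code > 128:
        # Convention: 128 + N means the process was terminated by signal N.
        try:
            name = signal.Signals(exit_code - 128).name
        except ValueError:
            name = f"signal {exit_code - 128}"
        return f"killed by {name}"
    return "application error: check the container logs"


if __name__ == "__main__":
    for code in (None, 0, 1, 137, 143):
        print(code, "->", explain_exit_code(code))
```

A SIGKILL result (exit code 137) is consistent with an out-of-memory kill but does not prove it; the container `reason` field is the more reliable signal.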
Task Status: PENDING
Symptom
Tasks stuck in PENDING state, not transitioning to RUNNING.
Diagnostic Steps
def diagnose_pending_tasks(cluster: str, service: str):
    """Check why tasks are stuck in PENDING"""
    # List tasks the service wants running; PENDING ones are filtered below
    pending = ecs.list_tasks(
        cluster=cluster,
        serviceName=service,
        desiredStatus='RUNNING'
    )
    for task_arn in pending['taskArns']:
        task = ecs.describe_tasks(cluster=cluster, tasks=[task_arn])['tasks'][0]
        if task['lastStatus'] == 'PENDING':
            print(f"Task {task_arn.split('/')[-1]} is PENDING")
            # Check attachments for ENI issues
            for attachment in task.get('attachments', []):
                print(f" Attachment: {attachment['type']} - {attachment['status']}")
                for detail in attachment.get('details', []):
                    print(f" {detail['name']}: {detail['value']}")
Common Causes & Solutions
1. No available capacity
Service my-service was unable to place a task because no container instance met all of its requirements
Solutions for Fargate:
- Check capacity provider limits
- Verify subnet has available IPs
- Check if region/AZ has Fargate capacity
2. ENI provisioning issues
Attachment status: PRECREATED
Solutions:
- Check security group allows required traffic
- Verify subnet has available IPs
- Check ENI limits for EC2 instances
3. Image pull taking too long
Container image: pulling
Solutions:
- Check image size (use smaller base images)
- Verify network connectivity to ECR
- Use VPC endpoints for faster pulls
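"Verify subnet has available IPs" can be checked mechanically. This sketch assumes you already hold the `Subnets` list returned by EC2 `describe_subnets`, whose entries carry `SubnetId` and `AvailableIpAddressCount`. The function name and the sample subnet IDs are made up for illustration.

```python
def subnets_short_on_ips(subnets, needed_per_subnet=1):
    """Return IDs of subnets with fewer free IPs than needed.

    `subnets` is the 'Subnets' list from an EC2 describe_subnets response.
    In awsvpc network mode each task needs its own ENI, so one free IP per task.
    """
    return [
        s["SubnetId"]
        for s in subnets
        if s["AvailableIpAddressCount"] < needed_per_subnet
    ]


if __name__ == "__main__":
    # Hypothetical sample data shaped like the EC2 response.
    sample = [
        {"SubnetId": "subnet-aaa", "AvailableIpAddressCount": 0},
        {"SubnetId": "subnet-bbb", "AvailableIpAddressCount": 42},
    ]
    print(subnets_short_on_ips(sample, needed_per_subnet=3))
```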
Service Issues
Service Not Starting Tasks
Diagnostic
# Check service events
aws ecs describe-services \
--cluster production \
--services my-service \
--query 'services[0].events[:10]'
Common Events & Solutions
1. "service my-service is unable to place a task"
Check task placement constraints and capacity.
2. "service my-service has reached a steady state"
Service is healthy - tasks are running as expected.
3. "service my-service was unable to place a task because no container instance met all requirements"
For Fargate: Check CPU/memory configurations are valid combinations.
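To make "valid combinations" concrete, here is a small validator. The table reflects the Fargate task sizes as I understand them from the AWS documentation (CPU units mapped to allowed memory in MiB). Treat it as an assumption and confirm against the current docs before relying on it. The names are illustrative.

```python
# CPU units -> allowed memory values in MiB (assumed from AWS docs; verify).
FARGATE_MEMORY_MIB = {
    256: (512, 1024, 2048),
    512: tuple(range(1024, 4096 + 1, 1024)),
    1024: tuple(range(2048, 8192 + 1, 1024)),
    2048: tuple(range(4096, 16384 + 1, 1024)),
    4096: tuple(range(8192, 30720 + 1, 1024)),
    8192: tuple(range(16384, 61440 + 1, 4096)),
    16384: tuple(range(32768, 122880 + 1, 8192)),
}


def is_valid_fargate_size(cpu_units, memory_mib):
    """True if the CPU/memory pair appears in the table above."""
    return memory_mib in FARGATE_MEMORY_MIB.get(cpu_units, ())


if __name__ == "__main__":
    for cpu, mem in [(256, 512), (256, 4096), (1024, 2048), (512, 512)]:
        print(cpu, mem, is_valid_fargate_size(cpu, mem))
```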
Deployment Stuck
Symptom
Deployment never reaches COMPLETED state.
Diagnostic
def check_deployment_status(cluster: str, service: str):
    """Check deployment progress"""
    response = ecs.describe_services(cluster=cluster, services=[service])
    svc = response['services'][0]
    for deployment in svc['deployments']:
        # rolloutState is only returned for rolling-update (ECS) deployments
        rollout_state = deployment.get('rolloutState', 'N/A')
        print(f"\nDeployment: {deployment['id']}")
        print(f" Status: {deployment['status']}")
        print(f" Rollout State: {rollout_state}")
        print(f" Tasks: {deployment['runningCount']}/{deployment['desiredCount']}")
        if rollout_state == 'IN_PROGRESS':
            reason = deployment.get('rolloutStateReason', '')
            print(f" Reason: {reason}")
Common Causes
1. Health check failures
rolloutStateReason: "ECS deployment circuit breaker: tasks failed to start"
Solutions:
- Check target group health check settings
- Increase healthCheckGracePeriodSeconds
- Verify application responds on health check path
2. Insufficient capacity
rolloutStateReason: "Service my-service was unable to place a task"
Solutions:
- Check subnet IP availability
- Reduce maximumPercent so the deployment starts fewer extra tasks at once and needs less spare capacity
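The capacity a rolling deployment needs follows from simple arithmetic. The sketch below assumes the rounding ECS documents for rolling updates: the minimum healthy count is rounded up and the maximum count is rounded down. The function name is hypothetical.

```python
def deployment_task_bounds(desired_count, minimum_healthy_percent=100, maximum_percent=200):
    """Return (min_running, max_running) task counts during a rolling update.

    Integer arithmetic avoids floating-point surprises:
    ceil(a / 100) is -(-a // 100) and floor(a / 100) is a // 100.
    """
    lower = -(-desired_count * minimum_healthy_percent // 100)
    upper = desired_count * maximum_percent // 100
    return lower, upper


if __name__ == "__main__":
    print(deployment_task_bounds(4))            # defaults: 100% / 200%
    print(deployment_task_bounds(3, 50, 150))
```

With 4 desired tasks and the default 200%, the deployment may need room (CPU, memory, subnet IPs) for up to 8 tasks at once; lowering maximumPercent shrinks that peak.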
Networking Issues
Tasks Cannot Connect to Internet
Symptoms
- Cannot pull images
- Cannot reach external APIs
- Timeouts on external calls
Solutions
For private subnets:
# Option 1: NAT Gateway
resource "aws_nat_gateway" "main" {
allocation_id = aws_eip.nat.id
subnet_id = aws_subnet.public.id
}
# Option 2: VPC Endpoints (recommended)
resource "aws_vpc_endpoint" "ecr_api" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.us-east-1.ecr.api"
vpc_endpoint_type = "Interface"
subnet_ids = aws_subnet.private[*].id
}
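The ECR API endpoint alone is usually not enough for image pulls from a private subnet without NAT. As a hedged checklist, the sketch below lists the endpoint service names typically needed: ECR API and Docker registry (interface), S3 (gateway, since image layers are stored in S3), and CloudWatch Logs if the task uses the awslogs driver. The function name is illustrative, and your setup may need more, for example Secrets Manager or SSM.

```python
def ecr_pull_endpoints(region):
    """Map VPC endpoint service names to endpoint types for a region."""
    prefix = f"com.amazonaws.{region}"
    return {
        f"{prefix}.ecr.api": "Interface",  # ECR API calls (auth, metadata)
        f"{prefix}.ecr.dkr": "Interface",  # Docker registry traffic
        f"{prefix}.s3": "Gateway",         # image layers are served from S3
        f"{prefix}.logs": "Interface",     # only if using the awslogs driver
    }


if __name__ == "__main__":
    for name, kind in ecr_pull_endpoints("us-east-1").items():
        print(f"{kind:9} {name}")
```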
Tasks Cannot Connect to Each Other
Symptom
Service-to-service communication fails.
Diagnostic
# Check security group rules
aws ec2 describe-security-groups \
--group-ids sg-12345 \
--query 'SecurityGroups[0].IpPermissions'
Solutions
# Allow traffic between ECS tasks
resource "aws_security_group_rule" "ecs_to_ecs" {
type = "ingress"
from_port = 8080
to_port = 8080
protocol = "tcp"
security_group_id = aws_security_group.ecs_tasks.id
source_security_group_id = aws_security_group.ecs_tasks.id
}
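The `IpPermissions` output from the diagnostic command can be checked programmatically. This is a rough sketch that only considers rules referencing a source security group (not CIDR ranges or prefix lists). The function name and the sample group IDs are assumptions for illustration.

```python
def allows_from_group(ip_permissions, port, source_group_id, protocol="tcp"):
    """True if any ingress rule allows `port` from `source_group_id`.

    `ip_permissions` is SecurityGroups[0].IpPermissions from
    describe-security-groups. CIDR-based rules are ignored here.
    """
    for perm in ip_permissions:
        proto = perm.get("IpProtocol")
        if proto == "-1":  # "-1" means all protocols and all ports
            port_ok = True
        elif proto == protocol:
            port_ok = perm.get("FromPort", 0) <= port <= perm.get("ToPort", 65535)
        else:
            continue
        sources = {pair.get("GroupId") for pair in perm.get("UserIdGroupPairs", [])}
        if port_ok and source_group_id in sources:
            return True
    return False


if __name__ == "__main__":
    # Hypothetical rule: tcp/8080 allowed from sg-12345 (self-referencing).
    rules = [{"IpProtocol": "tcp", "FromPort": 8080, "ToPort": 8080,
              "UserIdGroupPairs": [{"GroupId": "sg-12345"}]}]
    print(allows_from_group(rules, 8080, "sg-12345"))
    print(allows_from_group(rules, 5432, "sg-12345"))
```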
Load Balancer Health Checks Failing
Symptom
Target group app-tg: 0 healthy, 3 unhealthy
Diagnostic
# Check target health
aws elbv2 describe-target-health \
--target-group-arn <target-group-arn>
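The output of that command is easier to act on when grouped by state and reason. The sketch below assumes the `TargetHealthDescriptions` list from the ELBv2 `describe_target_health` response, where each entry has a `TargetHealth` dict with `State` and, for unhealthy targets, a `Reason`. The function name and sample data are illustrative.

```python
from collections import Counter


def summarize_target_health(descriptions):
    """Return (state_counts, reason_counts_for_non_healthy_targets)."""
    states = Counter(d["TargetHealth"]["State"] for d in descriptions)
    reasons = Counter(
        d["TargetHealth"].get("Reason", "unknown")
        for d in descriptions
        if d["TargetHealth"]["State"] != "healthy"
    )
    return dict(states), dict(reasons)


if __name__ == "__main__":
    # Hypothetical sample shaped like the ELBv2 response.
    sample = [
        {"TargetHealth": {"State": "unhealthy", "Reason": "Target.Timeout"}},
        {"TargetHealth": {"State": "unhealthy", "Reason": "Target.Timeout"}},
        {"TargetHealth": {"State": "healthy"}},
    ]
    print(summarize_target_health(sample))
```

A run of timeouts tends to point at security groups or the listening port, while response-code mismatches tend to point at the health check path.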
Common Causes & Solutions
1. Wrong health check path
health_check {
path = "/health" # Must match application endpoint
}
2. Container not listening on expected port
# Verify inside container
aws ecs execute-command --cluster production --task <task-id> \
--container my-app --interactive --command "netstat -tlnp"
3. Security group blocking ALB
# Allow ALB to reach ECS tasks
resource "aws_security_group_rule" "alb_to_ecs" {
type = "ingress"
from_port = 8080
to_port = 8080
protocol = "tcp"
security_group_id = aws_security_group.ecs_tasks.id
source_security_group_id = aws_security_group.alb.id
}
IAM & Permissions Issues
CannotPullContainerError
Symptom
CannotPullContainerError: Error response from daemon: pull access denied
Solution: Task Execution Role
resource "aws_iam_role_policy_attachment" "ecs_task_execution" {
role = aws_iam_role.ecs_task_execution.name
policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}
# For cross-account ECR
resource "aws_iam_role_policy" "cross_account_ecr" {
role = aws_iam_role.ecs_task_execution.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Action = [
"ecr:GetDownloadUrlForLayer",
"ecr:BatchGetImage"
]
Resource = "arn:aws:ecr:*:OTHER_ACCOUNT:repository/*"
}]
})
}
Secrets Access Denied
Symptom
ResourceInitializationError: unable to pull secrets
Solution
resource "aws_iam_role_policy" "secrets_access" {
role = aws_iam_role.ecs_task_execution.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = ["secretsmanager:GetSecretValue"]
Resource = "arn:aws:secretsmanager:*:*:secret:my-app/*"
},
{
Effect = "Allow"
Action = ["ssm:GetParameters"]
Resource = "arn:aws:ssm:*:*:parameter/my-app/*"
},
{
Effect = "Allow"
Action = ["kms:Decrypt"]
Resource = aws_kms_key.secrets.arn
}
]
})
}
Execute Command Not Working
Symptom
SessionManagerPlugin is not found
or
Execute command is disabled
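Solutions
"SessionManagerPlugin is not found" is raised on the client side: install the Session Manager plugin for the AWS CLI on the machine that runs the command. "Execute command is disabled" is fixed on the ECS side: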
1. Enable execute command on service
resource "aws_ecs_service" "app" {
enable_execute_command = true
}
2. Add SSM permissions to task role
resource "aws_iam_role_policy" "ssm_exec" {
role = aws_iam_role.ecs_task.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Action = [
"ssmmessages:CreateControlChannel",
"ssmmessages:CreateDataChannel",
"ssmmessages:OpenControlChannel",
"ssmmessages:OpenDataChannel"
]
Resource = "*"
}]
})
}
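Before retrying, it can help to confirm that the task itself is ready for ECS Exec. This sketch inspects one task dict from `describe_tasks`, assuming the `enableExecuteCommand` flag and the per-container `managedAgents` list (with an agent named `ExecuteCommandAgent`) that the API returns for exec-enabled tasks. The helper name is hypothetical.

```python
def exec_readiness(task):
    """Return a list of problems that would block ECS Exec for this task."""
    problems = []
    if not task.get("enableExecuteCommand", False):
        problems.append("enableExecuteCommand is false for this task")
    for container in task.get("containers", []):
        agents = {
            a.get("name"): a.get("lastStatus")
            for a in container.get("managedAgents", [])
        }
        status = agents.get("ExecuteCommandAgent")
        if status != "RUNNING":
            problems.append(
                f"{container.get('name')}: ExecuteCommandAgent is {status or 'missing'}"
            )
    return problems


if __name__ == "__main__":
    # Hypothetical task started before exec was enabled on the service.
    print(exec_readiness({"containers": [{"name": "my-app"}]}))
```

Enabling execute command on a service affects only tasks started afterwards, so an existing task typically needs a new deployment before it passes this check.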
Performance Issues
High CPU/Memory Usage
Diagnostic
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client('cloudwatch')

def get_service_metrics(cluster: str, service: str):
    """Print CPU utilization for the last hour (use MetricName='MemoryUtilization' for memory)"""
    now = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace='AWS/ECS',
        MetricName='CPUUtilization',
        Dimensions=[
            {'Name': 'ClusterName', 'Value': cluster},
            {'Name': 'ServiceName', 'Value': service}
        ],
        StartTime=now - timedelta(hours=1),
        EndTime=now,
        Period=300,  # 5-minute datapoints
        Statistics=['Average', 'Maximum']
    )
    # Datapoints are returned unordered, so sort by time before printing
    for point in sorted(response['Datapoints'], key=lambda x: x['Timestamp']):
        print(f"{point['Timestamp']}: Avg={point['Average']:.1f}%, Max={point['Maximum']:.1f}%")
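A single spike rarely justifies resizing; sustained load does. As a hedged illustration, the function below scans datapoints shaped like the `Datapoints` list above and reports whether the average stayed at or over a threshold for several consecutive periods. The threshold and run length are arbitrary assumptions, not AWS recommendations.

```python
def sustained_high(datapoints, threshold=80.0, min_points=3):
    """True if 'Average' is >= threshold for min_points consecutive datapoints."""
    run = 0
    for point in sorted(datapoints, key=lambda p: p["Timestamp"]):
        run = run + 1 if point["Average"] >= threshold else 0
        if run >= min_points:
            return True
    return False


if __name__ == "__main__":
    from datetime import datetime, timedelta, timezone

    start = datetime(2024, 1, 1, tzinfo=timezone.utc)
    averages = [40.0, 85.0, 90.0, 88.0, 50.0]  # made-up sample values
    points = [
        {"Timestamp": start + timedelta(minutes=5 * i), "Average": avg}
        for i, avg in enumerate(averages)
    ]
    print(sustained_high(points))
```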
Solutions
1. Right-size tasks
# Increase resources
cpu = "1024" # from 512
memory = "2048" # from 1024
2. Enable auto-scaling
resource "aws_appautoscaling_policy" "cpu" {
target_tracking_scaling_policy_configuration {
target_value = 70.0
}
}
Slow Task Startup
Causes & Solutions
1. Large container image
- Use smaller base images (alpine, distroless)
- On Fargate (platform version 1.4.0), consider lazy loading with Seekable OCI (SOCI) indexes; Fargate does not cache images between task launches
2. Slow application startup
- Increase startPeriod in health check
- Optimize application initialization
3. Slow secret/config loading
- Use VPC endpoints for faster access
- Cache configuration at startup
Log Analysis
CloudWatch Logs Queries
# Find errors in last hour (--start-time is in milliseconds; date -d is GNU date syntax)
aws logs filter-log-events \
--log-group-name /ecs/my-app \
--start-time $(date -d '-1 hour' +%s000) \
--filter-pattern "ERROR"
# Find OOM kills
aws logs filter-log-events \
--log-group-name /ecs/my-app \
--filter-pattern "OutOfMemory"
# Find slow requests
aws logs filter-log-events \
--log-group-name /ecs/my-app \
--filter-pattern "[timestamp, level, duration>1000, ...]"
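The `date -d` form used for `--start-time` is specific to GNU date and fails on macOS. A portable alternative is to compute the millisecond timestamp in Python. This is a small sketch with a hypothetical helper name; the optional `now` parameter exists only to make the function testable.

```python
from datetime import datetime, timedelta, timezone


def epoch_ms(hours_ago, now=None):
    """Milliseconds since the Unix epoch for `hours_ago` hours before `now`."""
    now = now or datetime.now(timezone.utc)
    return int((now - timedelta(hours=hours_ago)).timestamp() * 1000)


if __name__ == "__main__":
    # Pass the printed value to: aws logs filter-log-events --start-time <value>
    print(epoch_ms(1))
```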
CloudWatch Insights
# Top errors by count
fields @timestamp, @message
| filter @message like /ERROR/
| stats count(*) as errorCount by @message
| sort errorCount desc
| limit 10
# Average response time
fields @timestamp, responseTime
| stats avg(responseTime) as avgTime, max(responseTime) as maxTime by bin(5m)
Related Skills
- boto3-ecs: SDK patterns
- terraform-ecs: Infrastructure as Code
- ecs-fargate: Fargate specifics
- ecs-deployment: Deployment strategies
Quick Reference
| Symptom | First Check | Common Cause |
|---|---|---|
| Task STOPPED | stoppedReason | Container crash, OOM |
| Task PENDING | Attachments | ENI/network issues |
| Deployment stuck | Health checks | ALB health check failing |
| Cannot pull image | Execution role | Missing ECR permissions |
| Cannot connect | Security groups | Wrong SG rules |