
ingesting-into-data-lake

Import data into the AWS data lake from S3 files, local uploads, JDBC databases (Oracle, SQL Server, PostgreSQL, MySQL, RDS, Aurora), Amazon Redshift, Snowflake, BigQuery, DynamoDB, or existing Glue catalog tables (migration). Default target is S3 Tables; standard Iceberg on a general purpose bucket is supported where S3 Tables is not adopted. Handles one-time loads, recurring pipelines, migrations. Triggers on: import data, load data, ingest, sync database, migrate table, move data to AWS, set up pipeline, ETL, pull from Snowflake, query BigQuery into S3, export DynamoDB, CTAS, convert to Iceberg. Do NOT use for setting up or troubleshooting Glue connections (use connecting-to-data-source), creating empty tables (use creating-data-lake-table), running queries (use querying-data-lake), finding tables by fuzzy name (use finding-data-lake-assets), catalog audit (use exploring-data-catalog), or SaaS platforms like Salesforce, ServiceNow, SAP, MongoDB, Kafka.

⚡ Recommended: install with a single command (60 seconds)

Copy the command below and paste it into a terminal (Mac/Linux) or PowerShell (Windows). It downloads, extracts, and installs the skill automatically.

🍎 Mac / 🐧 Linux

mkdir -p ~/.claude/skills && cd ~/.claude/skills && curl -L -o ingesting-into-data-lake.zip https://jpskill.com/download/23347.zip && unzip -o ingesting-into-data-lake.zip && rm ingesting-into-data-lake.zip

🪟 Windows (PowerShell)

$d = "$env:USERPROFILE\.claude\skills"; ni -Force -ItemType Directory $d | Out-Null; iwr https://jpskill.com/download/23347.zip -OutFile "$d\ingesting-into-data-lake.zip"; Expand-Archive "$d\ingesting-into-data-lake.zip" -DestinationPath $d -Force; ri "$d\ingesting-into-data-lake.zip"

When it finishes, restart Claude Code. The skill then activates automatically when you ask in plain language, for example "import data from S3".

💾 Manual download (if you prefer not to use the command line)

1. Click the blue button below to download ingesting-into-data-lake.zip
2. Double-click the ZIP file to extract it; this creates an ingesting-into-data-lake folder
3. Move that folder to C:\Users\<your name>\.claude\skills\ (Windows) or ~/.claude/skills/ (Mac)
4. Restart Claude Code

⬇ Download as .zip (recommended) ⬇ .skill format (advanced users) Original source ↗

⚠️ Download and use at your own risk. This site accepts no responsibility for the content, behavior, or safety of the skill.

🎯 What this Skill does

The Skill's description explains what it does for you. It activates automatically when you ask Claude for help in this area.

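📦 Installation (3 steps)

1. Click the "Download" button above to get the .skill file
2. Change the file extension from .skill to .zip and extract it (macOS can extract it automatically)
3. Place the extracted folder in .claude/skills/ under your home folder
- macOS / Linux: ~/.claude/skills/
- Windows: %USERPROFILE%\.claude\skills\

Restart Claude Code to finish. You do not need to say "use this Skill"; it is invoked automatically for related requests.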


Last updated: 2026-05-18
Retrieved: 2026-05-18
Bundled files: 26

📖 Original SKILL.md as read by Claude

This is the original text (English or Chinese) that the AI (Claude) reads. Japanese translations are being added over time.

Ingest into Data Lake

Move data from a source into a queryable table in the data lake. This skill assumes the source connection (if one is needed) already exists. For Glue connection setup or troubleshooting, delegate to connecting-to-data-source.

Philosophy

Default to S3 Tables unless the environment says otherwise. S3 Tables is the recommended target for new data lake work. If the user's catalog inventory shows they haven't adopted S3 Tables, recommend standard Iceberg on their existing general-purpose bucket instead of forcing them to change posture.

Common Tasks

You MUST execute commands using AWS MCP server tools when connected -- they provide validation, sandboxed execution, and audit logging. Fall back to AWS CLI only if MCP is unavailable. You MUST explain each step before executing.

Workflow

1. Verify Dependencies and Context

You MUST check whether AWS MCP tools or the AWS CLI are available, and inform the user if either is missing.
You MUST confirm target AWS region and verify credentials with aws sts get-caller-identity
For SageMaker Unified Studio project roles, note that target tables and connections may be scoped to the project. See the caller ARN detection pattern in querying-data-lake.

2. Classify the Source

User says...	Source type	Reference
"upload my file", "local CSV", "move to S3"	Local file	local-upload.md
"load from S3", "import CSV/JSON/Parquet from s3://"	S3 files	s3-files.md
"import from Oracle/Postgres/MySQL/SQL Server/Redshift/RDS/Aurora"	JDBC	jdbc-ingest.md
"pull from Snowflake", "Snowflake table to S3"	Snowflake	snowflake-ingest.md
"import from BigQuery", "GCP analytics to S3"	BigQuery	bigquery-ingest.md
"export DynamoDB", "DynamoDB to data lake"	DynamoDB	dynamodb-ingest.md
"migrate Glue table", "convert Hive to Iceberg"	Catalog migration	catalog-migration.md

If the user names Salesforce, ServiceNow, SAP, MongoDB, Kafka, or another SaaS/streaming source, decline -- these are not supported in this release.

If the source table is referenced by a fuzzy or business name ("migrate our orders table", "pull from the sales warehouse"), delegate to finding-data-lake-assets to resolve before proceeding.
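The routing in this step can be sketched as a small keyword matcher. This is only an illustration: the keyword lists, their ordering, and the classify_source name are assumptions derived from the table above, not part of the skill itself, and a real request may need a follow-up question rather than keyword matching.

```python
import re
from typing import Optional, Tuple

# Keywords per source type, adapted from the routing table above (assumed lists).
# Order matters: named systems are checked before the generic S3 rule,
# so "Snowflake table to S3" routes to Snowflake rather than S3 files.
ROUTES = [
    (("snowflake",), "Snowflake", "snowflake-ingest.md"),
    (("bigquery", "gcp"), "BigQuery", "bigquery-ingest.md"),
    (("dynamodb",), "DynamoDB", "dynamodb-ingest.md"),
    (("migrate glue", "hive"), "Catalog migration", "catalog-migration.md"),
    (("oracle", "postgres", "postgresql", "mysql", "sql server",
      "redshift", "rds", "aurora"), "JDBC", "jdbc-ingest.md"),
    (("upload", "local", "move to s3"), "Local file", "local-upload.md"),
    (("s3",), "S3 files", "s3-files.md"),
]
UNSUPPORTED = ("salesforce", "servicenow", "sap", "mongodb", "kafka")


def _normalize(text: str) -> str:
    """Lowercase and replace punctuation with spaces so keywords match whole words."""
    return " " + re.sub(r"[^a-z0-9]+", " ", text.lower()).strip() + " "


def classify_source(request: str) -> Optional[Tuple[str, Optional[str]]]:
    """Return (source type, reference file), ("Unsupported", None), or None."""
    text = _normalize(request)
    if any(f" {kw} " in text for kw in UNSUPPORTED):
        return ("Unsupported", None)  # decline: not supported in this release
    for keywords, source_type, reference in ROUTES:
        if any(f" {kw} " in text for kw in keywords):
            return (source_type, reference)
    return None  # unclear: ask the user to name the source
```

Matching on whole words avoids false hits such as "rds" inside "records". A None result means the request did not name a recognizable source, which is the point to ask the user or delegate to finding-data-lake-assets.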

3. Confirm Connection Exists (if applicable)

For JDBC, Snowflake, and BigQuery sources, a Glue connection is required. Check:

aws glue get-connection --name <CONNECTION_NAME> --region <REGION>

If the connection does not exist, stop and delegate to connecting-to-data-source to create and test it. Do not proceed with ingest until the connection is verified.

Local files, S3 files, DynamoDB, and catalog migration do not need a Glue connection.
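As a rough sketch of this gate, the function below decides whether to proceed or delegate. It assumes glue_client behaves like boto3.client("glue"), whose get_connection call raises EntityNotFoundException for a missing connection; the connection_gate name and its return values are hypothetical. It only checks that the connection exists, so testing the connection remains the job of connecting-to-data-source.

```python
# Source types that need a Glue connection, per the step above.
CONNECTION_REQUIRED = {"JDBC", "Snowflake", "BigQuery"}


def connection_gate(source_type: str, connection_name: str, glue_client) -> str:
    """Return "proceed" or the name of the skill to delegate to.

    glue_client is assumed to behave like boto3.client("glue").
    """
    if source_type not in CONNECTION_REQUIRED:
        # Local files, S3 files, DynamoDB, and catalog migration need no connection.
        return "proceed"
    try:
        glue_client.get_connection(Name=connection_name)
    except glue_client.exceptions.EntityNotFoundException:
        return "connecting-to-data-source"  # stop and delegate
    return "proceed"
```

Passing the client in as a parameter keeps the function easy to exercise with a stub before pointing it at a real account.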

4. Clarify the Target

You MUST ask the user (or suggest based on catalog inventory) before creating or writing to any table:

Database/namespace: Does a specific target database exist? Or should one be created?
Table: Existing table (append/merge) or new table (delegate to creating-data-lake-table)?
Format: S3 Tables (default), standard Iceberg, or raw Parquet?

Inventory-aware defaults:

If you have already run exploring-data-catalog or can quickly check, use what exists:

Account has an s3tablescatalog federated catalog and active table buckets: recommend S3 Tables
Account has general-purpose buckets with Iceberg tables and no S3 Tables usage: recommend standard Iceberg on their existing bucket
Account uses Parquet/ORC on S3 without Iceberg metadata: ask whether to adopt Iceberg now (recommend yes) or continue with raw files

Do not force S3 Tables on customers who haven't adopted it. See iceberg-catalog-config-and-usage.md.

Delegations from this step:

Target table doesn't exist -> creating-data-lake-table
Target database named by fuzzy term -> finding-data-lake-assets
User doesn't know what exists -> exploring-data-catalog
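The inventory-aware defaults above amount to a short decision function. The sketch below assumes a hypothetical CatalogInventory summary with four boolean flags; how those flags are gathered (for example via exploring-data-catalog) is outside its scope, and the returned strings are illustrative labels only.

```python
from dataclasses import dataclass


@dataclass
class CatalogInventory:
    """Hypothetical summary of what a catalog check found."""
    has_s3tables_catalog: bool = False      # s3tablescatalog federated catalog present
    has_active_table_buckets: bool = False  # at least one table bucket in use
    has_iceberg_on_gp_bucket: bool = False  # Iceberg tables on a general-purpose bucket
    has_raw_parquet_or_orc: bool = False    # Parquet/ORC files without Iceberg metadata


def recommend_target_format(inv: CatalogInventory) -> str:
    """Map the inventory to a suggestion; the user still confirms the choice."""
    if inv.has_s3tables_catalog and inv.has_active_table_buckets:
        return "S3 Tables"
    if inv.has_iceberg_on_gp_bucket:
        return "standard Iceberg"  # stay on the existing bucket
    if inv.has_raw_parquet_or_orc:
        return "ask: adopt Iceberg (recommended) or keep raw files"
    return "S3 Tables"  # default for new data lake work
```

The result is a suggestion to put to the user, not a decision: the step still requires asking before any table is created or written.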

5. Execute Source Workflow

Read the source-specific reference and follow its phases. Each is self-contained with job templates, gotchas, and troubleshooting:

Local / S3 / JDBC / Snowflake / BigQuery / DynamoDB / catalog migration -- one reference per source

Job configuration and PySpark templates common to all sources (Glue 5.1 or higher) are shared in glue-job-config.md and glue-job-scripts.md.

6. Validate

Run all three checks; do not skip any:

Row count matches expected (source vs target)
Null check on critical columns
Spot-check 3-5 sample rows

See data-quality-validation.md.
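One way to make the three checks repeatable is to generate the queries from the table name. The sketch below is an assumption-laden illustration: build_validation_queries and row_counts_match are hypothetical helpers, the SQL uses double-quoted identifiers, and the statements may need adjusting for the engine that runs them.

```python
from typing import Dict, List


def build_validation_queries(database: str, table: str,
                             critical_columns: List[str],
                             sample_size: int = 5) -> Dict[str, str]:
    """Build the row-count, null-check, and spot-check queries for one table."""
    if not critical_columns:
        raise ValueError("name at least one critical column for the null check")
    target = f'"{database}"."{table}"'
    # COUNT(*) counts every row and COUNT(col) skips NULLs,
    # so the difference is the number of NULLs in that column.
    null_checks = ", ".join(
        f'COUNT(*) - COUNT("{col}") AS "{col}_nulls"' for col in critical_columns
    )
    return {
        "row_count": f"SELECT COUNT(*) AS row_count FROM {target}",
        "null_check": f"SELECT {null_checks} FROM {target}",
        "sample": f"SELECT * FROM {target} LIMIT {sample_size}",
    }


def row_counts_match(source_count: int, target_count: int) -> bool:
    """Compare the count taken at the source with the count in the target table."""
    return source_count == target_count
```

The source-side count has to come from the source system itself, so only the target-side queries are generated here.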

7. Schedule (if recurring)

For recurring pipelines, create a Glue Trigger with a cron schedule. See testing-and-scheduling.md. Use Glue Triggers for simple single-step pipelines; use MWAA for multi-step pipelines with branching.
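For the Glue Trigger case, a daily schedule might be prepared as below. This is a minimal sketch: scheduled_trigger_request is a hypothetical helper and the trigger and job names are placeholders; it only builds the keyword arguments that boto3's Glue create_trigger call accepts and does not contact AWS.

```python
def scheduled_trigger_request(trigger_name: str, job_name: str,
                              hour_utc: int, minute: int = 0) -> dict:
    """Build create_trigger keyword arguments for a once-a-day run (UTC)."""
    if not (0 <= hour_utc <= 23 and 0 <= minute <= 59):
        raise ValueError("hour_utc must be 0-23 and minute 0-59")
    return {
        "Name": trigger_name,
        "Type": "SCHEDULED",
        # Glue cron fields: minutes hours day-of-month month day-of-week year
        "Schedule": f"cron({minute} {hour_utc} * * ? *)",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,  # activate the trigger as soon as it is created
    }


# Usage (placeholder names), assuming boto3 is installed and credentials are set:
#   import boto3
#   request = scheduled_trigger_request("orders-nightly", "orders-ingest", hour_utc=2)
#   boto3.client("glue").create_trigger(**request)
```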

Argument Routing

S3 path only: Infer one-time load, start Step 2 with S3 files
Connection name: Start Step 3 with the named connection
Table name: Start Step 4, ask whether this is source or target
--target flag: Pre-fill the target format in Step 4
No args: Walk through interactively
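The routing rules above could be expressed as a small dispatcher. The sketch assumes the arguments have already been parsed into named values; route_arguments, its parameter names, and the precedence used when several arguments are given together are all assumptions, since the list above does not specify them.

```python
from typing import Optional


def route_arguments(s3_path: Optional[str] = None,
                    connection: Optional[str] = None,
                    table: Optional[str] = None,
                    target: Optional[str] = None) -> dict:
    """Pick the workflow step to start from; precedence here is an assumption."""
    plan = {"start_step": 1, "note": "walk through interactively",
            "target_format": target}  # --target pre-fills the format for Step 4
    if s3_path:
        plan.update(start_step=2, note="one-time load from S3 files")
    elif connection:
        plan.update(start_step=3, note=f"check connection {connection}")
    elif table:
        plan.update(start_step=4, note="ask whether the table is source or target")
    return plan
```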

Gotchas

S3 Tables requires Glue 5.1 or higher and the --datalake-formats iceberg job argument
All spark.sql.catalog.* config MUST go in --conf job arguments, never in spark.conf.set(). Otherwise Glue 5.x throws AnalysisException: Cannot modify the value of a static config. See iceberg-catalog-config-and-usage.md for correct catalog configs.
The warehouse parameter is required in S3 Tables catalog config. Without it Spark fails with "Cannot derive default warehouse location".
Table and column names in S3 Tables MUST be all lowercase
overwritePartitions() only replaces partitions present in the DataFrame -- for full refresh with deletes, use createOrReplace()
Standard Iceberg targets MUST include a LOCATION clause; S3 Tables MUST NOT
DynamoDB does not need a Glue connection -- do not attempt to create one
Connection failures during ingest delegate back to connecting-to-data-source; do not debug network/credentials in this skill
For target tables in SageMaker Unified Studio projects, ensure the project role has write access to the target namespace before the Glue job runs
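Several of these gotchas can be caught before a job is submitted. The sketch below shows one possible way to assemble the job arguments and check names. It relies on the common Glue practice of chaining extra settings inside a single --conf value, and the helper names, catalog name, and warehouse ARN are placeholders. Only two catalog keys are shown; the full set belongs to iceberg-catalog-config-and-usage.md.

```python
from typing import Dict, Iterable, List


def build_default_arguments(spark_conf: Dict[str, str]) -> Dict[str, str]:
    """Build Glue job arguments; extra settings are chained inside one --conf value."""
    if not spark_conf:
        raise ValueError("at least one Spark setting is required")
    pairs = [f"{key}={value}" for key, value in spark_conf.items()]
    return {
        "--datalake-formats": "iceberg",
        "--conf": " --conf ".join(pairs),
    }


def non_lowercase_names(names: Iterable[str]) -> List[str]:
    """Return table or column names that S3 Tables would reject for casing."""
    return [name for name in names if name != name.lower()]


# Placeholder values for illustration only; not a complete catalog config.
CATALOG = "demo_catalog"
example_conf = {
    f"spark.sql.catalog.{CATALOG}": "org.apache.iceberg.spark.SparkCatalog",
    f"spark.sql.catalog.{CATALOG}.warehouse":
        "arn:aws:s3tables:us-east-1:111122223333:bucket/example-table-bucket",
}
example_arguments = build_default_arguments(example_conf)
```

Keeping the catalog settings in the job arguments, rather than in the script, is what avoids the static-config error described above.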

Troubleshooting

Error	Likely cause	Action
Access Denied on S3	Missing IAM permissions	Check Glue role has s3:GetObject, s3:PutObject
Access Denied on S3 Tables	Missing s3tables:* permissions	Add S3 Tables inline policy to Glue role
CTAS timeout	Dataset too large for Athena	Switch to Glue ETL or batch with WHERE filters
JDBC connection timeout/auth failure	Connection-level issue	Delegate to `connecting-to-data-source`
Throughput exceeded (DynamoDB)	Read percent too high	Lower `read.percent` or use native export

See error-handling.md for the full catalog.

References

Source-specific

local-upload.md -- Local files
s3-files.md -- S3 files (CSV, JSON, Parquet, Avro, ORC)
jdbc-ingest.md -- Oracle, SQL Server, PostgreSQL, MySQL, RDS, Aurora, Redshift
snowflake-ingest.md -- Snowflake
bigquery-ingest.md -- BigQuery
dynamodb-ingest.md -- DynamoDB (export and Glue direct read)
catalog-migration.md -- Existing Glue catalog tables (Hive, self-managed Iceberg)

Cross-cutting

iceberg-catalog-config-and-usage.md -- S3 Tables, standard Iceberg, raw files: catalog config, engine access patterns
glue-job-config.md -- Job sizing, monitoring, retry
glue-job-scripts.md -- PySpark templates (append, upsert, custom SQL, full refresh)
incremental-loading.md -- Watermark strategies
testing-and-scheduling.md -- Glue Triggers, MWAA
data-quality-validation.md -- Row counts, null checks, Glue Data Quality
schema-evolution.md -- ALTER TABLE ADD COLUMNS, nested JSON
type-transformations.md -- Type conflict resolution
format-specific-loading.md -- CSV/JSON/Parquet/Avro/ORC specifics
athena-loading.md -- Athena INSERT INTO as simple-load fallback
error-handling.md -- Ingest errors (connection errors delegate to connecting-to-data-source)
upload-options.md -- aws s3 cp vs sync, multipart

Migration-specific

ctas-patterns.md -- Athena CTAS syntax and partition transforms
glue-etl-migration.md -- Large-table migration via Glue 5.1 or higher PySpark
migration-validation.md -- Full validation checklist
migration-troubleshooting.md -- CTAS failures, visibility, partitions

JDBC-specific

jdbc-schema-discovery.md -- Crawler, direct inspection, custom SQL
jdbc-performance.md -- Parallel reads, partitioning

Bundled files

※ List of files included in the ZIP. In addition to `SKILL.md` itself, it may include reference materials, samples, and scripts.

📄 SKILL.md (11,056 bytes)
📎 references/athena-loading.md (2,831 bytes)
📎 references/bigquery-ingest.md (3,794 bytes)
📎 references/catalog-migration.md (7,038 bytes)
📎 references/ctas-patterns.md (3,026 bytes)
📎 references/data-quality-validation.md (12,451 bytes)
📎 references/dynamodb-ingest.md (7,333 bytes)
📎 references/error-handling.md (12,626 bytes)
📎 references/format-specific-loading.md (13,153 bytes)
📎 references/glue-etl-migration.md (4,645 bytes)
📎 references/glue-job-config.md (9,625 bytes)
📎 references/glue-job-scripts.md (9,813 bytes)
📎 references/iceberg-catalog-config-and-usage.md (8,049 bytes)
📎 references/incremental-loading.md (13,381 bytes)
📎 references/jdbc-ingest.md (5,790 bytes)
📎 references/jdbc-performance.md (10,546 bytes)
📎 references/jdbc-schema-discovery.md (13,045 bytes)
📎 references/local-upload.md (4,625 bytes)
📎 references/migration-troubleshooting.md (2,640 bytes)
📎 references/migration-validation.md (2,772 bytes)
📎 references/s3-files.md (5,879 bytes)
📎 references/schema-evolution.md (9,497 bytes)
📎 references/snowflake-ingest.md (3,729 bytes)
📎 references/testing-and-scheduling.md (13,961 bytes)
📎 references/type-transformations.md (7,979 bytes)
📎 references/upload-options.md (1,053 bytes)