data-validator

Validate data quality in CSV, JSON, and database exports by checking for missing values, type mismatches, duplicates, outliers, and schema violations. Use when building ETL pipelines, auditing data imports, checking data freshness, or ensuring data contracts between teams. Trigger words: data quality, validation, null values, duplicates, schema check, data contract, ETL, pipeline, data drift.

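⚡ Recommended: install with a single command (about 60 seconds)

Copy the command below and paste it into a terminal (Mac/Linux) or PowerShell (Windows). It downloads the archive, extracts it, and places the files automatically.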

🍎 Mac / 🐧 Linux

mkdir -p ~/.claude/skills && cd ~/.claude/skills && curl -L -o data-validator.zip https://jpskill.com/download/14818.zip && unzip -o data-validator.zip && rm data-validator.zip

🪟 Windows (PowerShell)

$d = "$env:USERPROFILE\.claude\skills"; ni -Force -ItemType Directory $d | Out-Null; iwr https://jpskill.com/download/14818.zip -OutFile "$d\data-validator.zip"; Expand-Archive "$d\data-validator.zip" -DestinationPath $d -Force; ri "$d\data-validator.zip"

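When it finishes, restart Claude Code. The Skill then activates automatically when you make a related request in plain language, for example "Validate this customer export before importing to our new CRM".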

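💾 Manual download (if you prefer not to use the command line)

1. Download data-validator.zip from https://jpskill.com/download/14818.zip
2. Double-click the ZIP file to extract it; this creates a data-validator folder
3. Move that folder to C:\Users\<your name>\.claude\skills\ (Windows) or ~/.claude/skills/ (Mac)
4. Restart Claude Code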

⚠️ Download and use at your own risk. This site accepts no responsibility for the content, behavior, or safety of the Skill.

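📦 Installation from the .skill file (3 steps)

1. Download the .skill file from the Skill's page on jpskill.com
2. Change the file extension from .skill to .zip and extract it (macOS can extract it automatically)
3. Put the extracted folder in .claude/skills/ under your home folder
- macOS / Linux: ~/.claude/skills/
- Windows: %USERPROFILE%\.claude\skills\

Restart Claude Code to finish. You do not need to say "use this Skill"; it is invoked automatically for related requests.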

Last updated: 2026-05-18
Retrieved: 2026-05-18
Bundled files: 1

📜 Original SKILL.md (the English source text that Claude reads)

Data Validator

Overview

Perform comprehensive data quality checks on datasets — validate schemas, detect anomalies, find duplicates, and enforce data contracts. Essential for ETL pipelines where bad data silently corrupts downstream analytics and dashboards.

Instructions

1. Profile the dataset first

Before validating, understand the data:

- Row count and column count
- Data types per column (string, integer, float, date, boolean)
- Null rates per column
- Unique value counts and cardinality
- Min/max/mean for numeric columns
- Date ranges for temporal columns

Present as a data profile summary:

Dataset Profile: orders_export.csv
Rows: 142,847 | Columns: 12

| Column      | Type    | Nulls | Unique  | Sample Values        |
|-------------|---------|-------|---------|----------------------|
| order_id    | string  | 0%    | 142,847 | ORD-20260217-001     |
| customer_id | integer | 0.3%  | 28,491  | 10042, 10043         |
| amount      | float   | 0%    | 8,234   | 29.99, 149.00        |
| created_at  | date    | 0%    | 89,112  | 2026-02-17T14:23:01Z |
| status      | string  | 0%    | 5       | completed, pending   |
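
The profiling step above can be sketched with the Python standard library alone. This is a minimal illustration rather than the Skill's own implementation: `infer_type`, `profile_csv`, and the three-row `SAMPLE` are hypothetical names and data, and empty cells are assumed to represent nulls.

```python
import csv
import io
from collections import Counter


def infer_type(value: str) -> str:
    """Classify one non-empty CSV cell as integer, float, or string."""
    for caster, name in ((int, "integer"), (float, "float")):
        try:
            caster(value)
            return name
        except ValueError:
            continue
    return "string"


def profile_csv(text: str) -> dict:
    """Return the row count plus per-column null rate, unique count, and types."""
    rows = list(csv.DictReader(io.StringIO(text)))
    profile = {"rows": len(rows), "columns": {}}
    for column in (rows[0].keys() if rows else []):
        values = [row[column] for row in rows]
        non_null = [v for v in values if v != ""]  # assumption: "" means null
        profile["columns"][column] = {
            "null_rate": 1 - len(non_null) / len(values),
            "unique": len(set(non_null)),
            "types": dict(Counter(infer_type(v) for v in non_null)),
        }
    return profile


# Hypothetical sample data, loosely modelled on the table above.
SAMPLE = """order_id,customer_id,amount
ORD-1,10042,29.99
ORD-2,,149.00
ORD-3,10043,abc
"""

profile = profile_csv(SAMPLE)
```

A column whose `types` entry holds more than one key (here `amount`) is a candidate for the type consistency check in the next step.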

2. Run validation checks

Apply these checks systematically:

Completeness: are required fields populated?

- Check null/empty rates against thresholds (e.g., email must be <1% null)
- Flag columns with unexpected null spikes
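
A threshold comparison like the one described above might look as follows. The function name `null_rate_failures`, the sample rows, and the choice to treat both `None` and the empty string as null are assumptions made for illustration.

```python
def null_rate_failures(rows, thresholds):
    """Return {column: null_rate} for columns that reach or exceed their limit.

    rows: list of dicts; None and "" both count as null (an assumption).
    thresholds: {column: limit}; a column passes only if its rate is below the limit,
    which matches a rule such as "email must be <1% null" with limit 0.01.
    """
    failures = {}
    for column, limit in thresholds.items():
        nulls = sum(1 for row in rows if row.get(column) in (None, ""))
        rate = nulls / len(rows) if rows else 0.0
        if rate >= limit:
            failures[column] = rate
    return failures


# Hypothetical rows: two of the four emails are missing.
rows = [
    {"email": "a@example.com"},
    {"email": ""},
    {"email": None},
    {"email": "d@example.com"},
]
failures = null_rate_failures(rows, {"email": 0.01})
```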

Uniqueness: are IDs actually unique?

- Check primary key columns for duplicates
- Report duplicate counts and sample duplicate rows
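
One way to produce both the duplicate counts and the sample rows is a `Counter` over the key column. The helper names and the sample orders below are hypothetical.

```python
from collections import Counter


def find_duplicates(rows, key):
    """Return {key_value: count} for key values that occur more than once."""
    counts = Counter(row[key] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}


def sample_duplicate_rows(rows, key, limit=5):
    """Return up to `limit` rows whose key value is duplicated, in input order."""
    duplicated = find_duplicates(rows, key)
    return [row for row in rows if row[key] in duplicated][:limit]


# Hypothetical rows: one order_id appears three times.
orders = [
    {"order_id": "ORD-20260215-4421", "amount": 10.0},
    {"order_id": "ORD-20260217-001", "amount": 29.99},
    {"order_id": "ORD-20260215-4421", "amount": 10.0},
    {"order_id": "ORD-20260215-4421", "amount": 12.5},
]
duplicates = find_duplicates(orders, "order_id")
samples = sample_duplicate_rows(orders, "order_id", limit=2)
```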

Type consistency: do values match expected types?

- Dates that don't parse, numbers stored as strings
- Mixed types in the same column
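
The three symptoms listed above can each be detected with a few lines. This sketch assumes timestamps are ISO 8601 strings like the `created_at` sample shown earlier; `parse_iso`, `numeric_strings`, and `column_types` are illustrative names.

```python
from datetime import datetime


def parse_iso(value):
    """Return a datetime for an ISO 8601 string, or None if it does not parse."""
    try:
        # Older Python versions reject a trailing "Z", so normalise it first.
        return datetime.fromisoformat(value.replace("Z", "+00:00"))
    except (ValueError, AttributeError):
        return None


def numeric_strings(values):
    """Return string values that would parse as numbers, e.g. "42" stored as text."""
    found = []
    for value in values:
        if isinstance(value, str):
            try:
                float(value)
            except ValueError:
                continue
            found.append(value)
    return found


def column_types(values):
    """Return the set of Python type names present in a column (None ignored)."""
    return {type(v).__name__ for v in values if v is not None}


# Hypothetical column values.
unparseable = [d for d in ["2026-02-17T14:23:01Z", "17/02/2026"] if parse_iso(d) is None]
stored_as_text = numeric_strings([1, "2", "x"])
mixed = column_types([1, "2", 3.0, None])
```

A column is mixed when `column_types` returns more than one name.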

Range validity: are values within expected bounds?

- Negative amounts, future dates, ages over 150
- Enum columns with unexpected values
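
A per-row range check could be written as below. The column names follow the orders example in this document, while `EXPECTED_STATUS` is a hypothetical set (the real one would come from the data contract) and `now` is passed in explicitly so the check is reproducible.

```python
from datetime import datetime, timezone

# Hypothetical expected values; not taken from the source.
EXPECTED_STATUS = {"completed", "pending", "cancelled"}


def range_violations(row, now, expected_status=EXPECTED_STATUS):
    """Return a list of human-readable range problems for one order row."""
    problems = []
    if row["amount"] < 0:
        problems.append("negative amount")
    if row["created_at"] > now:
        problems.append("future date")
    if row["status"] not in expected_status:
        problems.append("unexpected status: " + row["status"])
    return problems


now = datetime(2026, 2, 17, 14, 30, tzinfo=timezone.utc)
good = {
    "amount": 29.99,
    "created_at": datetime(2026, 2, 17, 14, 23, 1, tzinfo=timezone.utc),
    "status": "completed",
}
bad = {
    "amount": -5.0,
    "created_at": datetime(2027, 1, 1, tzinfo=timezone.utc),
    "status": "on_hold",
}
```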

Referential integrity: do foreign keys match?

- customer_id values that don't exist in the customers table
- Orphaned records
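
For in-memory data, an orphan check reduces to a set lookup. In this sketch `orphaned_rows` is a hypothetical helper, and null foreign keys are deliberately skipped on the assumption that the completeness check reports them separately.

```python
def orphaned_rows(child_rows, parent_rows, foreign_key, parent_key):
    """Return child rows whose foreign key has no matching parent row."""
    parent_ids = {row[parent_key] for row in parent_rows}
    return [
        row
        for row in child_rows
        # Nulls are left to the completeness check (an assumption of this sketch).
        if row[foreign_key] is not None and row[foreign_key] not in parent_ids
    ]


# Hypothetical tables.
customers = [{"customer_id": 10042}, {"customer_id": 10043}]
orders = [
    {"order_id": "ORD-1", "customer_id": 10042},
    {"order_id": "ORD-2", "customer_id": 99999},
    {"order_id": "ORD-3", "customer_id": None},
]
orphans = orphaned_rows(orders, customers, "customer_id", "customer_id")
```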

Freshness: is the data up to date?

- Most recent record timestamp vs current time
- Gaps in time series data
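
Both freshness checks are simple arithmetic on sorted timestamps. The sketch below uses hypothetical hourly data and a one-hour maximum gap chosen only for the example.

```python
from datetime import datetime, timedelta, timezone


def staleness(timestamps, now):
    """Return how far the newest record lags behind `now`."""
    return now - max(timestamps)


def find_gaps(timestamps, max_gap):
    """Return (start, end) pairs of consecutive records more than max_gap apart."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_gap]


def at_hour(hour):
    """Build a UTC timestamp on a fixed, hypothetical day."""
    return datetime(2026, 2, 17, hour, tzinfo=timezone.utc)


stamps = [at_hour(9), at_hour(10), at_hour(13), at_hour(14)]
now = datetime(2026, 2, 17, 14, 30, tzinfo=timezone.utc)
lag = staleness(stamps, now)
gaps = find_gaps(stamps, timedelta(hours=1))
```

Here the only gap reported is the three-hour hole between 10:00 and 13:00; intervals of exactly one hour are not flagged.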

3. Report findings

Structure results as a validation report:

## Data Validation Report
Dataset: orders_export.csv | Checked: 2026-02-17 14:30 UTC

### ❌ Failed Checks (3)
1. **Duplicate order_id** — 23 duplicate IDs found (0.016%)
   Sample: ORD-20260215-4421 appears 3 times
2. **Null customer_email** — 2.1% null (threshold: 1%)
   Spike on 2026-02-15 (bulk import batch)
3. **Future dates in created_at** — 7 rows have dates in 2027

### ⚠️ Warnings (2)
1. **Amount outliers** — 4 orders exceed $10,000 (review manually)
2. **Status enum drift** — New value "on_hold" not in expected set

### ✅ Passed Checks (8)
- Amount non-negative: PASS
- Date range valid: PASS (excluding 7 future dates)
...
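
A report in this shape can be assembled from a flat list of check results. The following is a sketch under stated assumptions: `CheckResult`, `render_report`, and `blocks_pipeline` are hypothetical names, the status strings are a convention chosen here, and the emoji markers from the sample report are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    status: str  # "fail" blocks the pipeline, "warn" is reviewed later, "pass" is fine
    detail: str = ""


SECTIONS = [("fail", "Failed Checks"), ("warn", "Warnings"), ("pass", "Passed Checks")]


def render_report(dataset, results):
    """Group results by status and render them as a Markdown-style report."""
    lines = ["## Data Validation Report", f"Dataset: {dataset}"]
    for status, title in SECTIONS:
        group = [r for r in results if r.status == status]
        lines += ["", f"### {title} ({len(group)})"]
        for number, result in enumerate(group, start=1):
            suffix = f": {result.detail}" if result.detail else ""
            lines.append(f"{number}. **{result.name}**{suffix}")
    return "\n".join(lines)


def blocks_pipeline(results):
    """A single hard failure is enough to stop the pipeline."""
    return any(r.status == "fail" for r in results)


results = [
    CheckResult("Duplicate order_id", "fail", "23 duplicate IDs found"),
    CheckResult("Amount outliers", "warn", "4 orders exceed $10,000"),
    CheckResult("Amount non-negative", "pass"),
]
report = render_report("orders_export.csv", results)
```

Keeping the status on each result, rather than in separate lists, makes it harder for one check to appear under both the failed and passed headings.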

4. Suggest fixes

For each failed check, recommend a concrete action:

- A SQL query to find and fix duplicates
- A transform to clean null values
- A filter to remove invalid rows
- An alert to set up for ongoing monitoring
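
As an illustration of the first kind of fix, the queries below find duplicated keys and then keep one copy of each. They are run through Python's `sqlite3` module against a hypothetical in-memory `orders` table; the `rowid` column used to pick the surviving row is specific to SQLite, so other databases would need a different tie-breaker.

```python
import sqlite3

FIND_DUPLICATES = """
SELECT order_id, COUNT(*) AS copies
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1
"""

# Keep the lowest rowid per order_id and delete the rest (SQLite-specific).
DELETE_DUPLICATES = """
DELETE FROM orders
WHERE rowid NOT IN (SELECT MIN(rowid) FROM orders GROUP BY order_id)
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ORD-1", 29.99), ("ORD-2", 149.00), ("ORD-2", 149.00), ("ORD-2", 149.00)],
)

duplicates = conn.execute(FIND_DUPLICATES).fetchall()
conn.execute(DELETE_DUPLICATES)
remaining = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

Running the find query first and reviewing its output before deleting is the safer order, since the delete is not reversible outside a transaction.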

Examples

Example 1: CSV Import Validation

Input: "Validate this customer export before importing to our new CRM"

Output:

## Validation Report: customers_export.csv
Rows: 8,234 | Columns: 9

### ❌ Failed (2)
1. **Email format invalid** — 142 rows (1.7%) have malformed emails
   Pattern: missing @ symbol, spaces in domain
   Find: SELECT * FROM data WHERE email NOT LIKE '%_@_%.__%'

2. **Phone duplicates** — 34 phone numbers shared by multiple customers
   Likely data entry errors or shared business lines
   Fix: Review and merge duplicate customer records

### ✅ Passed (6)
- customer_id unique: PASS
- Required fields (name, email): PASS (99.8%)
- Country codes valid ISO-3166: PASS
- Created dates in range: PASS
- No PII in notes field: PASS
- UTF-8 encoding clean: PASS

Recommendation: Fix the 142 invalid emails before import.
The phone duplicates are non-blocking but should be reviewed.
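
It may help to see what the `LIKE` pattern in the report above actually flags. In the sketch below, run against a hypothetical four-row table in SQLite, the pattern catches an address with no `@` but lets a space inside the domain through, because `_` matches any single character. The regular expression is an assumed, deliberately simple alternative and not a full email validator.

```python
import re
import sqlite3

# Hypothetical addresses: two valid, one without "@", one with a space in the domain.
EMAILS = ["ann@example.com", "annexample.com", "bob@mail .com", "cy@x.io"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (email TEXT)")
conn.executemany("INSERT INTO data VALUES (?)", [(e,) for e in EMAILS])

# The query from the report, restricted to the email column.
flagged_by_like = [
    row[0]
    for row in conn.execute(
        "SELECT email FROM data WHERE email NOT LIKE '%_@_%.__%' ORDER BY rowid"
    )
]

# Stricter check: no whitespace or extra "@" anywhere, and a dot in the domain.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]{2,}$")
flagged_by_regex = [e for e in EMAILS if not EMAIL_RE.match(e)]
```

If spaces in the domain are one of the patterns to catch, the `LIKE` query alone will undercount the affected rows.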

Example 2: Pipeline Data Contract Check

Input: "Check if the analytics events from our tracking pipeline match the expected schema"

Output:

## Schema Contract Validation: events_2026_02_17.json
Events: 1.2M | Schema version: v3.2

### ❌ Contract Violations (1)
1. **Missing required field: session_id**
   - 12,847 events (1.07%) missing session_id
   - All from source: mobile-ios-sdk v4.1.0
   - Started: 2026-02-16 09:00 UTC (correlates with SDK release)

### ⚠️ Warnings (1)
1. **New field detected: device_model**
   - Not in schema v3.2 — likely added in SDK update
   - 100% populated on mobile events, absent on web

### ✅ Contract Compliance: 98.93%

Action: Pin mobile-ios-sdk to v4.0.x or update schema to v3.3
with session_id as optional for mobile sources.
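
A contract check of this kind can be sketched as set arithmetic over each event's field names. The source names schema v3.2 but does not list its fields, so `REQUIRED`, `OPTIONAL`, and the four sample events below are hypothetical. Unknown fields are counted separately so they can be reported as warnings rather than violations.

```python
from collections import Counter

# Hypothetical contract; the real field lists are not given in the source.
REQUIRED = {"event_name", "session_id"}
OPTIONAL = {"source"}


def check_contract(events, required=REQUIRED, optional=OPTIONAL):
    """Count missing required fields, unknown fields, and overall compliance."""
    missing, unknown, violating = Counter(), Counter(), 0
    for event in events:
        fields = set(event)
        absent = required - fields
        missing.update(absent)
        unknown.update(fields - required - optional)
        if absent:
            violating += 1
    total = len(events)
    compliance = 100.0 * (total - violating) / total if total else 100.0
    return {
        "missing": dict(missing),
        "unknown": dict(unknown),
        "compliance_pct": compliance,
    }


events = [
    {"event_name": "view", "session_id": "s1", "source": "web"},
    {"event_name": "view", "session_id": "s2", "source": "web"},
    {"event_name": "tap", "source": "mobile-ios-sdk", "device_model": "hypothetical-phone"},
    {"event_name": "tap", "session_id": "s3", "source": "mobile-ios-sdk",
     "device_model": "hypothetical-phone"},
]
result = check_contract(events)
```

Grouping the violating events by `source` afterwards would reproduce the kind of attribution shown in the report, where all failures trace back to one SDK version.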

Guidelines

- Always profile before validating; you need baselines to detect anomalies
- Distinguish between hard failures (which block the pipeline) and warnings (which can be reviewed later)
- Include sample rows for every finding; abstract stats are hard to act on
- Suggest SQL or code fixes, not just descriptions of problems
- For time series data, check for gaps and seasonality, not just the latest timestamp
- Track validation results over time to detect data drift
- Be specific about which rows/records are affected: "23 duplicates" beats "some duplicates"