multi-ai-verification

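A Skill that verifies code quality, security, test reliability, and more across multiple stages, uses LLMs and agents as judges, scores overall quality from 0 to 100, and aims to ensure a high-quality result.

📜 Original English description (for reference)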

Multi-layer quality assurance with 5-layer verification pyramid (Rules → Functional → Visual → Integration → Quality Scoring). Independent verification with LLM-as-judge and Agent-as-a-Judge patterns. Score 0-100 with ≥90 threshold. Use when verifying code quality, security scanning, preventing test gaming, comprehensive QA, or ensuring production readiness through multi-layer validation.

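🇯🇵 Notes for Japanese creators

In a nutshell

Note: This commentary was added by the jpskill.com editorial team for Japanese business settings. It is reference information, independent of how the Skill itself behaves.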

⚡ Recommended: install with a single command (60 seconds)

Copy the command below and paste it into a terminal (Mac/Linux) or PowerShell (Windows). It downloads, extracts, and places the files automatically.

🍎 Mac / 🐧 Linux

mkdir -p ~/.claude/skills && cd ~/.claude/skills && curl -L -o multi-ai-verification.zip https://jpskill.com/download/9451.zip && unzip -o multi-ai-verification.zip && rm multi-ai-verification.zip

🪟 Windows (PowerShell)

$d = "$env:USERPROFILE\.claude\skills"; ni -Force -ItemType Directory $d | Out-Null; iwr https://jpskill.com/download/9451.zip -OutFile "$d\multi-ai-verification.zip"; Expand-Archive "$d\multi-ai-verification.zip" -DestinationPath $d -Force; ri "$d\multi-ai-verification.zip"

When it finishes, restart Claude Code. The Skill then triggers automatically when you make a relevant request in plain language, such as asking for a final quality check before deployment.

💾 Manual download (for those who find the command difficult)

1. Download multi-ai-verification.zip (https://jpskill.com/download/9451.zip)
2. Double-click the ZIP file to extract it; this creates a multi-ai-verification folder
3. Move that folder to C:\Users\<your name>\.claude\skills\ (Windows) or ~/.claude/skills/ (Mac)
4. Restart Claude Code

⚠️ Download and use at your own risk. This site accepts no responsibility for the content, behavior, or safety of the Skill.

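🎯 What this Skill can do

The description below explains what this Skill does for you. It triggers automatically when you ask Claude for something in this area.

📦 How to install (3 steps)

1. Download the .skill file (the advanced-user format offered alongside the .zip)
2. Change the file extension from .skill to .zip and extract it (macOS can extract it automatically)
3. Place the extracted folder in .claude/skills/ under your home folder
- macOS / Linux: ~/.claude/skills/
- Windows: %USERPROFILE%\.claude\skills\

Restart Claude Code and you are done. You do not need to say "use this Skill"; it is invoked automatically for related requests.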

Last updated: 2026-05-18
Retrieved: 2026-05-18
Bundled files: 1

Multi-AI Verification

Overview

multi-ai-verification provides comprehensive quality assurance through a 5-layer verification pyramid, from automated rules to LLM-as-judge evaluation.

Purpose: Multi-layer independent verification ensuring production-ready quality

Pattern: Task-based (5 independent verification operations, one per layer)

Key Innovation: 5-layer pyramid (95% automated at base → 0% at apex) with independent verification preventing bias and test gaming

Core Principles (validated by tri-AI research):

Multi-Layer Defense - 5 layers catch different types of issues
Independent Verification - Separate agent from implementation/testing
Progressive Automation - Automate what can be automated (95% → 0%)
Quality Scoring - Objective 0-100 scoring with ≥90 threshold
Actionable Feedback - 100% feedback is specific and actionable (What/Where/Why/How/Priority)

Quality Gates: All 5 layers must pass for production approval

When to Use

Use multi-ai-verification for:

Final quality check before commit/deployment
Independent code review (preventing bias)
Security verification (OWASP, vulnerabilities)
Comprehensive QA (all layers)
Test quality verification (prevent gaming)
Production readiness validation

Prerequisites

Required

Code to verify (implementation complete)
Tests available (for functional verification)
Quality standards defined

Recommended

multi-ai-testing - For generating and running tests
multi-ai-implementation - For implementing fixes

Tools Available

Linters (ESLint, Pylint)
Type checkers (TypeScript, mypy)
Coverage tools (c8, pytest-cov)
Security scanners (Semgrep, Bandit)
Test frameworks (Jest, pytest)

The 5-Layer Verification Pyramid

         Layer 5: Quality Scoring
         (LLM-as-Judge, 0-20% automated)
              /\
             /  \
        Layer 4: Integration
        (E2E, System, 20-30% automated)
          /      \
         /        \
    Layer 3: Visual
    (UI, Screenshots, 30-50% automated)
      /          \
     /            \
Layer 2: Functional
(Tests, Coverage, 60-80% automated)
  /              \
 /                \
Layer 1: Rules-Based
(Linting, Types, Schema, 95% automated)

Principle: Fail fast at automated layers (cheap, fast) before expensive LLM-as-judge evaluation
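
The fail-fast principle can be sketched as a small runner that executes layers cheapest-first and stops at the first failure, so the expensive Layer 5 judge only runs once the automated layers have passed. This is a minimal sketch under stated assumptions, not the Skill's implementation: `runLayers` and the stub `check` functions are hypothetical names introduced for illustration.

```javascript
// Hypothetical sketch: run verification layers in order and stop at the first failure.
function runLayers(layers) {
  const results = [];
  for (const layer of layers) {
    const passed = layer.check(); // each check returns true (PASS) or false (FAIL)
    results.push({ name: layer.name, passed });
    if (!passed) break; // fail fast: later, costlier layers are not run
  }
  const approved =
    results.length === layers.length && results.every((r) => r.passed);
  return { approved, results };
}

// Stub checks standing in for real tool runs (assumptions for illustration only).
const outcome = runLayers([
  { name: "Layer 1: Rules-Based", check: () => true },
  { name: "Layer 2: Functional", check: () => false },
  { name: "Layer 3: Visual", check: () => true },
]);
console.log(outcome.results.map((r) => `${r.name}: ${r.passed ? "PASS" : "FAIL"}`));
```

Because Layer 2 fails in this example, Layer 3 is never executed and the run is not approved.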

Verification Operations

Operation 1: Rules-Based Verification (Layer 1)

Purpose: Automated validation of code structure, formatting, types

Automation: 95% automated
Speed: Seconds (fast feedback)
Confidence: High (deterministic)

Process:

Schema Validation (if applicable):

# Validate JSON/YAML against schemas
ajv validate -s plan.schema.json -d plan.json
ajv validate -s task.schema.json -d "tasks/*.json"

Linting:

# JavaScript/TypeScript (quote the glob so ESLint expands it, not the shell)
npx eslint "src/**/*.{ts,tsx,js,jsx}"

# Python (pylint walks the package directory itself)
pylint src/

# Expected: Zero linting errors

Type Checking:

# TypeScript
npx tsc --noEmit

# Python
mypy src/

# Expected: Zero type errors

Format Validation:

# Check formatting (quote the glob so Prettier expands it, not the shell)
npx prettier --check "src/**/*.{ts,tsx}"

# Or auto-fix
npx prettier --write "src/**/*.{ts,tsx}"

Security Scanning (SAST):

# Static security analysis (assumes Semgrep is already installed, e.g. via pip)
semgrep --config=auto src/

# Or for Python
bandit -r src/

# Check for:
# - Hardcoded secrets
# - SQL injection risks
# - XSS vulnerabilities
# - Insecure dependencies

Generate Layer 1 Report:

# Layer 1: Rules-Based Verification

## Schema Validation
✅ plan.json validates
✅ All task files validate

## Linting
✅ 0 linting errors
⚠️ 3 warnings (non-blocking)

## Type Checking
✅ 0 type errors

## Formatting
✅ All files formatted correctly

## Security Scan (SAST)
✅ No critical vulnerabilities
⚠️ 1 medium: Weak password hashing rounds (bcrypt)

**Layer 1 Status**: ✅ PASS (0 critical issues)
**Issues to Address**: 1 medium security issue

Outputs:

Lint report (errors/warnings)
Type check results
Schema validation results
Security scan findings
Layer 1 status (PASS/FAIL)

Validation:

[ ] All automated checks run
[ ] Results documented
[ ] Critical issues = 0 for PASS
[ ] Actionable feedback for warnings

Time Estimate: 15-30 minutes (mostly automated)

Gate 1: ✅ PASS if no critical issues (warnings acceptable)
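
The Gate 1 rule (pass when there are no critical issues, with warnings reported but not blocking) could be expressed roughly as follows. This is a sketch only: the `evaluateGate1` function and the `{ severity, message }` finding shape are assumptions, not output formats of ESLint, Semgrep, or any other tool.

```javascript
// Hypothetical sketch: Gate 1 passes when no finding is critical.
// Non-critical findings are collected as items to address but do not block.
function evaluateGate1(findings) {
  const critical = findings.filter((f) => f.severity === "critical");
  const nonBlocking = findings.filter((f) => f.severity !== "critical");
  return {
    status: critical.length === 0 ? "PASS" : "FAIL",
    criticalCount: critical.length,
    toAddress: nonBlocking.map((f) => `${f.severity}: ${f.message}`),
  };
}

// Findings mirroring the sample report above (the object shape is an assumption).
const gate1 = evaluateGate1([
  { severity: "warning", message: "3 lint warnings" },
  { severity: "medium", message: "Weak password hashing rounds (bcrypt)" },
]);
console.log(gate1.status, gate1.toAddress);
```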

Operation 2: Functional Verification (Layer 2)

Purpose: Validate functionality through test execution and coverage

Automation: 60-80% automated
Speed: Minutes (medium feedback)
Confidence: High (measurable outcomes)

Process:

Execute Complete Test Suite:

# Run all tests with coverage
npm test -- --coverage --verbose

# Capture results
# - Tests passed/failed
# - Coverage metrics
# - Execution time

Validate Example Code (from documentation):

# Extract examples from SKILL.md
# Execute each example automatically
# Verify outputs match expected

# Target: ≥90% examples work

Check Coverage:

# Coverage Report

**Line Coverage**: 87% ✅ (gate: ≥80%)
**Branch Coverage**: 82% ✅
**Function Coverage**: 92% ✅
**Path Coverage**: 74% ⚠️ (below 80%)

**Gate Status**: PASS ✅ (line, branch, and function coverage ≥80%)

**Uncovered Code**:
- src/admin/legacy.ts: 23% (low priority)
- src/utils/deprecated.ts: 15% (deprecated, ok)

Regression Testing (for updates):

# Compare before/after
git diff main...feature --stat

# Run all tests
npm test

# Verify: No new failures (regression prevention)

Performance Validation:

# Run performance tests
npm run test:performance

# Check response times
# Verify: Within acceptable ranges

Generate Layer 2 Report:

# Layer 2: Functional Verification

## Test Execution
✅ 245/245 tests passing (100%)
⏱️ Execution time: 8.3 seconds

## Coverage
✅ Line: 87% (gate: ≥80%)
✅ Branch: 82%
✅ Function: 92%

## Example Validation
✅ 18/20 examples work (90%)
❌ 2 examples fail (outdated)

## Regression
✅ All existing tests still pass

## Performance
✅ All endpoints <200ms

**Layer 2 Status**: ✅ PASS
**Issues**: 2 outdated examples (update docs)

Outputs:

Test execution results
Coverage report
Example validation results
Regression check
Performance metrics
Layer 2 status

Validation:

[ ] All tests executed
[ ] Coverage meets gate (≥80%)
[ ] Examples validated (≥90%)
[ ] No regressions
[ ] Performance acceptable

Time Estimate: 30-60 minutes

Gate 2: ✅ PASS if tests pass + coverage ≥80%
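
As a rough illustration, the Gate 2 condition can be written as a single predicate over the test and coverage numbers. The sketch assumes line coverage is the gated metric (the sample report attaches the ≥80% gate to the line figure); `evaluateGate2` and its input fields are hypothetical names.

```javascript
// Hypothetical sketch: Gate 2 passes only when every test passes
// and line coverage is at least 80%.
const COVERAGE_GATE = 80; // percent, from the gate definition above

function evaluateGate2({ testsPassed, testsTotal, lineCoverage }) {
  const allTestsPass = testsTotal > 0 && testsPassed === testsTotal;
  const coverageOk = lineCoverage >= COVERAGE_GATE;
  return {
    status: allTestsPass && coverageOk ? "PASS" : "FAIL",
    allTestsPass,
    coverageOk,
  };
}

// Numbers taken from the sample Layer 2 report.
const gate2 = evaluateGate2({ testsPassed: 245, testsTotal: 245, lineCoverage: 87 });
console.log(gate2.status);
```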

Operation 3: Visual Verification (Layer 3)

Purpose: Validate UI appearance, layout, accessibility (for UI features)

Automation: 30-50% automated
Speed: Minutes-Hours
Confidence: Medium (subjective elements)

Process:

Screenshot Generation:

# Generate screenshots of UI
npx playwright test --screenshot=on

# Or manually:
# Open application
# Capture screenshots of key views

Visual Comparison (if previous version exists):

# Compare against baseline
npx playwright test --update-snapshots=missing

# Or use Percy/Chromatic for visual regression
npx percy snapshot screenshots/

Layout Validation:

# Visual Checklist

## Layout
- [ ] Components positioned correctly
- [ ] Spacing/margins match mockup
- [ ] Alignment proper
- [ ] No overlapping elements

## Styling
- [ ] Colors match design system
- [ ] Typography correct (fonts, sizes)
- [ ] Icons/images display properly

## Responsiveness
- [ ] Mobile view (320px-480px): ✅
- [ ] Tablet view (768px-1024px): ✅
- [ ] Desktop view (>1024px): ✅

Accessibility Testing:

# Automated accessibility scan (axe-core is a library; its CLI is @axe-core/cli and scans a URL)
npx @axe-core/cli http://localhost:3000

# Check WCAG compliance
npx pa11y http://localhost:3000

# Manual checks:
# - Keyboard navigation
# - Screen reader compatibility
# - Color contrast ratios

Generate Layer 3 Report:

# Layer 3: Visual Verification

## Screenshot Comparison
✅ Login page matches mockup
✅ Dashboard layout correct
⚠️ Profile page: Avatar alignment off by 5px

## Responsiveness
✅ Mobile: All components visible
✅ Tablet: Layout adapts correctly
✅ Desktop: Full functionality

## Accessibility
✅ WCAG 2.1 AA compliance
✅ Keyboard navigation works
⚠️ 2 color contrast warnings (non-critical)

**Layer 3 Status**: ✅ PASS (minor issues acceptable)
**Issues**: Avatar alignment (cosmetic), contrast warnings

Outputs:

Screenshots of UI
Visual comparison results
Responsiveness validation
Accessibility report
Layer 3 status

Validation:

[ ] Screenshots captured
[ ] Visual comparison done (if applicable)
[ ] Layout validated
[ ] Responsiveness tested
[ ] Accessibility checked
[ ] No critical visual issues

Time Estimate: 30-90 minutes (skip if no UI)

Gate 3: ✅ PASS if no critical visual/a11y issues

Operation 4: Integration Verification (Layer 4)

Purpose: Validate system-level integration, data flow, API compatibility

Automation: 20-30% automated
Speed: Hours (complex)
Confidence: Medium-High

Process:

Component Integration Tests:

# Run integration test suite
npm test -- tests/integration/

# Verify components work together
# - Database ← → API
# - API ← → Frontend
# - Frontend ← → User

Data Flow Validation:

# Data Flow Verification

**Flow 1: User Registration**
Frontend form → API endpoint → Validation → Database → Email service
✅ Data flows correctly
✅ No data loss
✅ Transactions atomic

**Flow 2: Authentication**
Login request → API → Database lookup → Token generation → Response
✅ Token generated correctly
✅ Session stored
✅ Response includes token

API Integration Tests:

# Test all API endpoints
npm run test:api

# Verify:
# - All endpoints respond
# - Status codes correct
# - Response formats match spec
# - Error handling works

End-to-End Workflow Tests:

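// Complete user journeys
// Assumes an axios-style client (api) and fixtures (userData, credentials, confirmLink) from test setup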
test('Complete registration and login flow', async () => {
  // 1. Register new user
  const registerResponse = await api.post('/register', userData);
  expect(registerResponse.status).toBe(201);

  // 2. Confirm email
  const confirmResponse = await api.get(confirmLink);
  expect(confirmResponse.status).toBe(200);

  // 3. Login
  const loginResponse = await api.post('/login', credentials);
  expect(loginResponse.status).toBe(200);
  expect(loginResponse.data.token).toBeDefined();

  // 4. Access protected resource
  const profileResponse = await api.get('/profile', {
    headers: { Authorization: `Bearer ${loginResponse.data.token}` }
  });
  expect(profileResponse.status).toBe(200);
});

Dependency Compatibility:

# Check dependencies for known vulnerabilities
npm audit

# List outdated packages (major version bumps may carry breaking changes)
npm outdated

# Verify integration with services
# - Database connection
# - Redis/cache
# - External APIs

Generate Layer 4 Report:

# Layer 4: Integration Verification

## Component Integration
✅ 12/12 integration tests passing
✅ All components integrate correctly

## Data Flow
✅ All 5 data flows validated
✅ No data loss or corruption

## API Integration
✅ All 15 endpoints functional
✅ Response formats correct
✅ Error handling works

## E2E Workflows
✅ 8/8 user journeys complete successfully
✅ No workflow breaks

## Dependencies
✅ 0 critical vulnerabilities
⚠️ 2 moderate (non-blocking)

**Layer 4 Status**: ✅ PASS

Outputs:

Integration test results
Data flow validation
API compatibility report
E2E workflow results
Dependency audit
Layer 4 status

Validation:

[ ] Integration tests pass
[ ] Data flows validated
[ ] APIs integrate correctly
[ ] E2E workflows function
[ ] Dependencies secure

Time Estimate: 45-90 minutes

Gate 4: ✅ PASS if all integration tests pass and there are no critical dependency vulnerabilities

Operation 5: Quality Scoring (Layer 5)

Purpose: Holistic quality assessment using LLM-as-judge and Agent-as-a-Judge patterns

Automation: 0-20% automated
Speed: Hours (expensive)
Confidence: Medium (requires judgment)

Process:

Spawn Independent Quality Assessor (Agent-as-a-Judge):

Key: Use different model family if possible (prevent self-preference bias)

const qualityAssessment = await task({
  description: "Assess code quality holistically",
  prompt: `Evaluate code quality in src/ and tests/.

  DO NOT read implementation conversation history.

  You have access to tools:
  - Read files
  - Execute tests
  - Run linters
  - Query database (if needed)

  Assess 5 dimensions (score each /20):

  1. CORRECTNESS (/20):
     - Logic correctness
     - Edge case handling
     - Error handling completeness
     - Security considerations

  2. FUNCTIONALITY (/20):
     - Meets all requirements
     - User workflows work
     - Performance acceptable
     - No regressions

  3. QUALITY (/20):
     - Code maintainability
     - Best practices followed
     - Anti-patterns avoided
     - Documentation complete

  4. INTEGRATION (/20):
     - Components integrate smoothly
     - API contracts correct
     - Data flow works
     - Backward compatible

  5. SECURITY (/20):
     - No vulnerabilities
     - Input validation
     - Authentication/authorization
     - Data protection

  TOTAL: /100 (sum of 5 dimensions)

  For each dimension, provide:
  - Score (/20)
  - Strengths (what's good)
  - Weaknesses (what needs improvement)
  - Evidence (file:line references)
  - Recommendations (specific, actionable)

  Write comprehensive report to: quality-assessment.md`
});

Multi-Agent Ensemble (for critical features):

3-5 Agent Voting Committee:

// Spawn 3 independent quality assessors
const [judge1, judge2, judge3] = await Promise.all([
  task({description: "Quality Judge 1", prompt: assessmentPrompt}),
  task({description: "Quality Judge 2", prompt: assessmentPrompt}),
  task({description: "Quality Judge 3", prompt: assessmentPrompt})
]);

// Aggregate scores
const scores = {
  correctness: median([judge1.correctness, judge2.correctness, judge3.correctness]),
  functionality: median([...]),
  quality: median([...]),
  integration: median([...]),
  security: median([...])
};

const totalScore = sum(Object.values(scores)); // Total /100

// Check disagreement between judges (range of totals, not a statistical variance)
const totalScores = [judge1.total, judge2.total, judge3.total];
const spread = max(totalScores) - min(totalScores);

if (spread > 15) {
  // High disagreement → spawn 2 more judges (total 5)
  // Use 5-agent ensemble for final score
}

// Final score: computed from the per-dimension medians of 3 or 5 judges
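
The aggregation step above leaves `median` and the per-dimension lists abbreviated. A self-contained sketch of the same idea might look like the following. It assumes each judge's result is a plain object with one numeric field per dimension, and it measures disagreement as the range of the judges' totals (max minus min), compared against the threshold of 15 used above. The `aggregate` function and the sample scores are hypothetical.

```javascript
// Hypothetical sketch of ensemble aggregation; judge results are plain objects here.
const DIMENSIONS = ["correctness", "functionality", "quality", "integration", "security"];

function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

function aggregate(judges) {
  // Per-dimension median across judges, then sum to a /100 total.
  const scores = Object.fromEntries(
    DIMENSIONS.map((d) => [d, median(judges.map((j) => j[d]))])
  );
  const total = Object.values(scores).reduce((a, b) => a + b, 0);
  // Disagreement: range of each judge's own total (max - min).
  const totals = judges.map((j) => DIMENSIONS.reduce((sum, d) => sum + j[d], 0));
  const spread = Math.max(...totals) - Math.min(...totals);
  return { scores, total, spread, needsMoreJudges: spread > 15 };
}

// Illustrative scores only; not results from a real run.
const result = aggregate([
  { correctness: 18, functionality: 19, quality: 17, integration: 18, security: 16 },
  { correctness: 17, functionality: 18, quality: 18, integration: 17, security: 15 },
  { correctness: 19, functionality: 19, quality: 16, integration: 18, security: 17 },
]);
console.log(result.total, result.spread, result.needsMoreJudges);
```

Taking the median per dimension keeps a single outlier judge from moving the score, which is presumably why the original uses it rather than a mean.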

Calibration Against Rubric:

# Scoring Calibration

## Correctness: 18/20 (Excellent)
**20**: Zero errors, all edge cases handled perfectly
**18**: Minor edge case missing, otherwise excellent ✅ (achieved)
**15**: 1-2 significant edge cases missing
**10**: Some logic errors present
**0**: Major functionality broken

**Evidence**: All tests pass, edge cases covered except timezone DST edge case (minor)

## Functionality: 19/20 (Excellent)
[Similar rubric with evidence]

## Quality: 17/20 (Good)
[Similar rubric with evidence]

## Integration: 18/20 (Excellent)
[Similar rubric with evidence]

## Security: 16/20 (Good)
[Similar rubric with evidence]

**Total**: 88/100 ⚠️ (Below ≥90 gate)

Gap Analysis (if <90):

# Quality Gap Analysis

**Current Score**: 88/100
**Target**: ≥90/100
**Gap**: 2 points

## Critical Gaps (Blocking Approval)
None

## High Priority (Should Fix for ≥90)
1. **Security: Weak bcrypt rounds**
   - **What**: bcrypt using 10 rounds (outdated)
   - **Where**: src/auth/hash.ts:15
   - **Why**: Current standard is 12-14 rounds
   - **How**: Change `bcrypt.hash(password, 10)` to `bcrypt.hash(password, 12)`
   - **Priority**: High
   - **Impact**: +2 points → 90/100

## Medium Priority
1. **Quality: Missing JSDoc for 3 functions**
   - Impact: +1 point → 91/100

**Recommendation**: Fix high priority issue to reach ≥90 threshold
**Estimated Effort**: 15 minutes
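
The arithmetic behind the gap analysis can be sketched as picking fixes in priority order until the projected score reaches the target. This is an illustration under assumptions: `planFixes`, the priority ordering table, and the idea that impacts simply add up are not stated by the Skill.

```javascript
// Hypothetical sketch: select fixes in priority order until the projected
// score reaches the target. Assumes fix impacts are additive.
function planFixes(currentScore, target, fixes) {
  const order = { critical: 0, high: 1, medium: 2, low: 3 };
  const sorted = [...fixes].sort((a, b) => order[a.priority] - order[b.priority]);
  const selected = [];
  let projected = currentScore;
  for (const fix of sorted) {
    if (projected >= target) break; // target reached; remaining fixes are optional
    selected.push(fix.name);
    projected += fix.impact;
  }
  return { gap: Math.max(0, target - currentScore), selected, projected };
}

// Values from the sample gap analysis above.
const plan = planFixes(88, 90, [
  { name: "Add JSDoc to 3 functions", priority: "medium", impact: 1 },
  { name: "Increase bcrypt rounds 10 -> 12", priority: "high", impact: 2 },
]);
console.log(plan);
```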

Generate Comprehensive Quality Report:

# Layer 5: Quality Scoring Report

## Executive Summary
**Total Score**: 88/100 ⚠️ (Below ≥90 gate)
**Status**: NEEDS MINOR REVISION

## Dimension Scores
- Correctness: 18/20 ⭐⭐⭐⭐⭐
- Functionality: 19/20 ⭐⭐⭐⭐⭐
- Quality: 17/20 ⭐⭐⭐⭐
- Integration: 18/20 ⭐⭐⭐⭐⭐
- Security: 16/20 ⭐⭐⭐⭐

## Strengths
1. Comprehensive test coverage (87%)
2. All functionality working correctly
3. Clean integration with all components
4. Good error handling

## Weaknesses
1. Bcrypt rounds below current standard (security)
2. Missing documentation for helper functions (quality)
3. One timezone edge case not handled (correctness)

## Recommendations (Prioritized)

### Priority 1 (High - Needed for ≥90)
1. Increase bcrypt rounds: 10 → 12
   - File: src/auth/hash.ts:15
   - Effort: 5 min
   - Impact: +2 points

### Priority 2 (Medium - Nice to Have)
1. Add JSDoc to helper functions
   - Files: src/utils/validation.ts
   - Effort: 30 min
   - Impact: +1 point

2. Handle timezone DST edge case
   - File: src/auth/tokens.ts:78
   - Effort: 20 min
   - Impact: +1 point

**Next Steps**: Apply Priority 1 fix, re-verify to reach ≥90

Outputs:

Quality score (0-100) with dimension breakdown
Calibrated against rubric
Gap analysis
Prioritized recommendations (Critical/High/Medium/Low)
Evidence-based feedback (file:line references)
Action plan to reach ≥90

Validation:

[ ] All 5 dimensions scored
[ ] Scores calibrated against rubric
[ ] Evidence provided for each score
[ ] Gap analysis if <90
[ ] Recommendations actionable
[ ] Ensemble used for critical features (optional)

Time Estimate: 60-120 minutes (ensemble adds 30-60 min)

Gate 5: ✅ PASS if total score ≥90/100

Quality Gates Summary

All 5 Gates Must Pass for production approval:

Gate 1: Rules Pass ✅
   ↓ (Linting, types, schema, security)

Gate 2: Tests Pass ✅
   ↓ (All tests, coverage ≥80%)

Gate 3: Visual OK ✅
   ↓ (UI validated, a11y checked)

Gate 4: Integration OK ✅
   ↓ (E2E works, APIs integrate)

Gate 5: Quality ≥90 ✅
   ↓ (LLM-as-judge score ≥90/100)

✅ PRODUCTION APPROVED

If Any Gate Fails:

Failed Gate → Gap Analysis → Apply Fixes → Re-Verify → Repeat Until Pass

Appendix A: Independence Protocol

How Verification Independence is Maintained

Verification Agent Spawning:

// After implementation and testing complete
const verification = await task({
  description: "Independent quality verification",
  prompt: `Verify code quality independently.

  DO NOT read prior conversation history.

  Review:
  - Code: src/**/*.ts
  - Tests: tests/**/*.test.ts
  - Specs: specs/requirements.md

  Verify against specifications ONLY (not implementation decisions).

  Use tools:
  - Read files to inspect code
  - Run tests to verify functionality
  - Execute linters for quality checks

  Score quality (0-100) with evidence.
  Write report to: independent-verification.md`
});

Bias Prevention Checklist:

[ ] Specifications written BEFORE implementation
[ ] Verification agent prompt has no implementation context
[ ] Agent evaluates against specs, not what code does
[ ] Fresh context (via Task tool)
[ ] Different model family used (if possible)

Validation of Independence:

## Independence Audit

**Expected Behavior**:
- ✅ Verifier finds 1-3 issues (healthy skepticism)
- ✅ Verifier references specifications
- ✅ Verifier uses tools to verify claims

**Warning Signs**:
- ⚠️ Verifier finds 0 issues (possible rubber stamp)
- ⚠️ Verifier doesn't use tools
- ⚠️ Verifier parrots implementation justifications

**If Warning**: Re-verify with stronger independence prompt
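
Some of the audit signals above are simple enough to check mechanically. The sketch below flags three of them; detecting a verifier that parrots implementation justifications would need human or LLM review and is left out. `auditIndependence` and its input fields (`issuesFound`, `toolCalls`, `citesSpecs`) are hypothetical and would have to be extracted from the verifier's report by some other means.

```javascript
// Hypothetical sketch: flag mechanical warning signs from a verification run.
function auditIndependence({ issuesFound, toolCalls, citesSpecs }) {
  const warnings = [];
  if (issuesFound === 0) warnings.push("0 issues found (possible rubber stamp)");
  if (toolCalls === 0) warnings.push("verifier did not use tools");
  if (!citesSpecs) warnings.push("verifier did not reference specifications");
  return { independent: warnings.length === 0, warnings };
}

const healthy = auditIndependence({ issuesFound: 2, toolCalls: 5, citesSpecs: true });
const suspect = auditIndependence({ issuesFound: 0, toolCalls: 0, citesSpecs: true });
console.log(healthy.independent, suspect.warnings);
```

A run that returns warnings would then be repeated with a stronger independence prompt, as described above.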

Appendix B: Operational Scoring Rubrics

Complete Rubrics for All 5 Dimensions

Correctness (/20)

20 (Perfect): Zero logic errors, all edge cases handled, security perfect
18 (Excellent): 1 minor edge case missing, otherwise flawless
15 (Good): 2-3 edge cases missing, no critical errors
12 (Acceptable): Some edge cases missing, 1 minor logic issue
10 (Needs Work): Multiple edge cases missing or 1 significant logic error
5 (Poor): Major logic errors present
0 (Broken): Critical functionality broken

Functionality (/20)

20: All requirements met, exceeds expectations
18: All requirements met, well implemented
15: All requirements met, basic implementation
12: 1 requirement partially missing
10: 2+ requirements partially missing
5: Several requirements not met
0: Core functionality missing

Quality (/20)

20: Exceptional code quality, best practices exemplified
18: High quality, follows best practices
15: Good quality, minor style issues
12: Acceptable quality, several style issues
10: Below standard, needs refactoring
5: Poor quality, significant issues
0: Unmaintainable code

Integration (/20)

20: Perfect integration, all touch points verified
18: Excellent integration, minor docs needed
15: Good integration, all major points work
12: Acceptable, 1-2 integration issues
10: Integration issues present
5: Multiple integration problems
0: Does not integrate

Security (/20)

20: Passes all security scans, OWASP compliant, hardened
18: Passes scans, 1 minor non-critical issue
15: Passes, 2-3 minor issues
12: 1 medium security issue
10: Multiple medium issues
5: 1 critical issue present
0: Multiple critical vulnerabilities

Appendix C: Technical Foundation

Verification Tools

Linting:

ESLint (JavaScript/TypeScript)
Pylint/Ruff (Python)

Type Checking:

TypeScript compiler (tsc)
mypy (Python)

Security (SAST):

Semgrep (multi-language)
Bandit (Python)
npm audit (JavaScript)

Visual Testing:

Playwright (screenshot, visual regression)
Percy/Chromatic (visual diff)
axe-core (accessibility)

Coverage:

c8/nyc (JavaScript)
pytest-cov (Python)

Cost Controls

Budget Caps:

LLM-as-judge: $50/month
Ensemble verification: $20/month
Total verification: $70/month

Optimization:

Cache quality scores for 24h (same code → same score)
Skip Layer 5 for changes <50 lines
Use ensemble (3-5 agents) only for critical features
Use cheaper models for pre-filtering (Haiku for Layer 1-2)
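
Two of these optimizations (skipping Layer 5 for small changes and reusing a score for 24 hours when the code is unchanged) can be sketched as a single decision function. This is an assumption-laden illustration: `decideLayer5`, the content-hash cache key, and the in-memory `Map` are hypothetical choices, not part of the Skill.

```javascript
// Hypothetical sketch: decide whether to run the expensive Layer 5 judge.
const DAY_MS = 24 * 60 * 60 * 1000; // cache lifetime: 24h
const MIN_CHANGED_LINES = 50; // changes smaller than this skip Layer 5

function decideLayer5({ changedLines, codeHash, now }, cache) {
  if (changedLines < MIN_CHANGED_LINES) return { action: "skip" };
  const hit = cache.get(codeHash); // same code hash means same code, so same score
  if (hit && now - hit.scoredAt < DAY_MS) return { action: "cached", score: hit.score };
  return { action: "run" };
}

// Illustrative cache entry; the hash and timestamps are made up.
const scoreCache = new Map([["abc123", { score: 91, scoredAt: 0 }]]);
console.log(decideLayer5({ changedLines: 12, codeHash: "abc123", now: 1000 }, scoreCache));
console.log(decideLayer5({ changedLines: 120, codeHash: "abc123", now: 1000 }, scoreCache));
console.log(decideLayer5({ changedLines: 120, codeHash: "abc123", now: DAY_MS }, scoreCache));
```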

Quick Reference

The 5 Layers

Layer	Purpose	Automation	Time	Tools
1	Rules-based	95%	15-30m	Linters, types, SAST
2	Functional	60-80%	30-60m	Test execution, coverage
3	Visual	30-50%	30-90m	Screenshots, a11y
4	Integration	20-30%	45-90m	E2E, API tests
5	Quality Scoring	0-20%	60-120m	LLM-as-judge, ensemble

Total: roughly 3-6.5 hours for complete 5-layer verification (sum of the per-layer estimates)

Quality Thresholds

≥90: ✅ Excellent (production-ready)
80-89: ⚠️ Good (needs minor improvements)
70-79: ❌ Acceptable (needs work before production)
<70: ❌ Poor (significant rework required)
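
For reference, the threshold bands map onto a simple lookup. The function name `classifyScore` is a hypothetical helper; the band boundaries and labels come from the list above.

```javascript
// Hypothetical helper: map a 0-100 quality score to its threshold band.
function classifyScore(score) {
  if (score >= 90) return "Excellent (production-ready)";
  if (score >= 80) return "Good (needs minor improvements)";
  if (score >= 70) return "Acceptable (needs work before production)";
  return "Poor (significant rework required)";
}

console.log(classifyScore(88));
```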

Gates

All 5 Must Pass:

Rules pass (no critical lint/type/security)
Tests pass + coverage ≥80%
Visual OK (no critical UI issues)
Integration OK (E2E works)
Quality ≥90/100

multi-ai-verification provides comprehensive, multi-layer quality assurance with independent LLM-as-judge evaluation, ensuring production-ready code through systematic verification from automated rules to holistic quality assessment.

For rubrics, see Appendix B. For independence protocol, see Appendix A.
