multi-ai-verification
A Skill that verifies code quality, security, and test reliability in multiple stages, uses LLMs and agents as judges, and scores overall quality from 0 to 100 to ensure a high-quality result.

Original description: Multi-layer quality assurance with 5-layer verification pyramid (Rules → Functional → Visual → Integration → Quality Scoring). Independent verification with LLM-as-judge and Agent-as-a-Judge patterns. Score 0-100 with ≥90 threshold. Use when verifying code quality, security scanning, preventing test gaming, comprehensive QA, or ensuring production readiness through multi-layer validation.

Note: the one-sentence summary above was added by the jpskill.com editorial team for business users in Japan. It is reference information and is independent of the Skill's own behavior.
Copy the command for your platform and paste it into a terminal (macOS/Linux) or PowerShell (Windows). It downloads the archive, extracts it, and places it in the skills folder.

macOS / Linux:

    mkdir -p ~/.claude/skills && cd ~/.claude/skills && curl -L -o multi-ai-verification.zip https://jpskill.com/download/9451.zip && unzip -o multi-ai-verification.zip && rm multi-ai-verification.zip

Windows (PowerShell):

    $d = "$env:USERPROFILE\.claude\skills"; ni -Force -ItemType Directory $d | Out-Null; iwr https://jpskill.com/download/9451.zip -OutFile "$d\multi-ai-verification.zip"; Expand-Archive "$d\multi-ai-verification.zip" -DestinationPath $d -Force; ri "$d\multi-ai-verification.zip"

When it finishes, restart Claude Code. The Skill then activates automatically when you make a related request, such as asking for a quality check before deployment.
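💾 Manual download (if you prefer not to use the command line)
1. Download multi-ai-verification.zip using the download button on the Skill page.
2. Extract the ZIP file. This creates a multi-ai-verification folder.
3. Move that folder to C:\Users\<your-name>\.claude\skills\ (Windows) or ~/.claude/skills/ (macOS).
4. Restart Claude Code.

⚠️ Download and use at your own risk. The site accepts no responsibility for the content, behavior, or safety of the Skill.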
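🎯 What this Skill does
The description below explains what this Skill does for you. Claude invokes it automatically when you ask for work in this area.

📦 Installation (3 steps)
1. Use the download button to get the .skill file.
2. Change the file extension from .skill to .zip and extract it (macOS can extract it automatically).
3. Place the extracted folder in .claude/skills/ under your home folder:
   - macOS / Linux: ~/.claude/skills/
   - Windows: %USERPROFILE%\.claude\skills\

Restart Claude Code to finish. You do not need to say "use this Skill"; it is invoked automatically for related requests.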
- Last updated: 2026-05-18
- Retrieved: 2026-05-18
- Bundled files: 1
Multi-AI Verification
Overview
multi-ai-verification provides comprehensive quality assurance through a 5-layer verification pyramid, from automated rules to LLM-as-judge evaluation.
Purpose: Multi-layer independent verification ensuring production-ready quality
Pattern: Task-based (5 independent verification operations, one per layer)
Key Innovation: 5-layer pyramid (95% automated at base → 0% at apex) with independent verification preventing bias and test gaming
Core Principles (validated by tri-AI research):
- Multi-Layer Defense - 5 layers catch different types of issues
- Independent Verification - Separate agent from implementation/testing
- Progressive Automation - Automate what can be automated (95% → 0%)
- Quality Scoring - Objective 0-100 scoring with ≥90 threshold
- Actionable Feedback - 100% feedback is specific and actionable (What/Where/Why/How/Priority)
Quality Gates: All 5 layers must pass for production approval
When to Use
Use multi-ai-verification for:
- Final quality check before commit/deployment
- Independent code review (preventing bias)
- Security verification (OWASP, vulnerabilities)
- Comprehensive QA (all layers)
- Test quality verification (prevent gaming)
- Production readiness validation
Prerequisites
Required
- Code to verify (implementation complete)
- Tests available (for functional verification)
- Quality standards defined
Recommended
- multi-ai-testing - For generating/running tests
- multi-ai-implementation - For implementing fixes
Tools Available
- Linters (ESLint, Pylint)
- Type checkers (TypeScript, mypy)
- Coverage tools (c8, pytest-cov)
- Security scanners (Semgrep, Bandit)
- Test frameworks (Jest, pytest)
The 5-Layer Verification Pyramid
                  Layer 5: Quality Scoring
              (LLM-as-Judge, 0-20% automated)
                           /\
                          /  \
                   Layer 4: Integration
              (E2E, System, 20-30% automated)
                        /      \
                       /        \
                     Layer 3: Visual
            (UI, Screenshots, 30-50% automated)
                     /            \
                    /              \
                   Layer 2: Functional
            (Tests, Coverage, 60-80% automated)
                  /                  \
                 /                    \
                  Layer 1: Rules-Based
           (Linting, Types, Schema, 95% automated)
Principle: Fail fast at automated layers (cheap, fast) before expensive LLM-as-judge evaluation
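The fail-fast principle can be sketched as a runner that executes the layers from cheapest to most expensive and stops at the first failure, so the Layer 5 judge never runs on code that fails a cheap check. This is only an illustration under assumptions: `LayerCheck`, `runPyramid`, and the stub checks are hypothetical names, not an API the Skill defines.

```typescript
// Hypothetical types for illustration; the Skill does not define this API.
interface LayerCheck {
  layer: number;       // 1 (rules) .. 5 (quality scoring)
  name: string;
  run: () => boolean;  // true = gate passed
}

interface PyramidResult {
  approved: boolean;
  executed: string[];       // names of the layers that actually ran
  failedAt: string | null;  // first failing layer, if any
}

// Run layers in ascending order and stop at the first failure.
function runPyramid(checks: LayerCheck[]): PyramidResult {
  const ordered = checks.slice().sort((a, b) => a.layer - b.layer);
  const executed: string[] = [];
  for (const check of ordered) {
    executed.push(check.name);
    if (!check.run()) {
      return { approved: false, executed, failedAt: check.name };
    }
  }
  return { approved: true, executed, failedAt: null };
}

// Stub checks standing in for real tool invocations.
const pyramidDemo = runPyramid([
  { layer: 5, name: "quality-scoring", run: () => true },
  { layer: 1, name: "rules", run: () => true },
  { layer: 2, name: "functional", run: () => false }, // simulated test failure
  { layer: 3, name: "visual", run: () => true },
  { layer: 4, name: "integration", run: () => true },
]);
console.log(pyramidDemo.failedAt, pyramidDemo.executed);
```

In this run only the rules and functional layers execute; the visual, integration, and scoring layers are never started.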
Verification Operations
Operation 1: Rules-Based Verification (Layer 1)
Purpose: Automated validation of code structure, formatting, types
- Automation: 95% automated
- Speed: Seconds (fast feedback)
- Confidence: High (deterministic)

Process:

1. Schema Validation (if applicable):

       # Validate JSON/YAML against schemas
       ajv validate -s plan.schema.json -d plan.json
       ajv validate -s task.schema.json -d "tasks/*.json"

2. Linting:

       # JavaScript/TypeScript (quote the glob so ESLint expands it, not the shell)
       npx eslint "src/**/*.{ts,tsx,js,jsx}"

       # Python
       pylint --recursive=y src/

       # Expected: Zero linting errors

3. Type Checking:

       # TypeScript
       npx tsc --noEmit

       # Python
       mypy src/

       # Expected: Zero type errors

4. Format Validation:

       # Check formatting
       npx prettier --check "src/**/*.{ts,tsx}"

       # Or auto-fix
       npx prettier --write "src/**/*.{ts,tsx}"

5. Security Scanning (SAST):

       # Static security analysis (Semgrep is a Python package, e.g. pip install semgrep)
       semgrep --config=auto src/

       # Or for Python
       bandit -r src/

       # Check for:
       # - Hardcoded secrets
       # - SQL injection risks
       # - XSS vulnerabilities
       # - Insecure dependencies

6. Generate Layer 1 Report:

       # Layer 1: Rules-Based Verification

       ## Schema Validation
       ✅ plan.json validates
       ✅ All task files validate

       ## Linting
       ✅ 0 linting errors
       ⚠️ 3 warnings (non-blocking)

       ## Type Checking
       ✅ 0 type errors

       ## Formatting
       ✅ All files formatted correctly

       ## Security Scan (SAST)
       ✅ No critical vulnerabilities
       ⚠️ 1 medium: Weak password hashing rounds (bcrypt)

       **Layer 1 Status**: ✅ PASS (0 critical issues)
       **Issues to Address**: 1 medium security issue
Outputs:
- Lint report (errors/warnings)
- Type check results
- Schema validation results
- Security scan findings
- Layer 1 status (PASS/FAIL)
Validation:
- [ ] All automated checks run
- [ ] Results documented
- [ ] Critical issues = 0 for PASS
- [ ] Actionable feedback for warnings
Time Estimate: 15-30 minutes (mostly automated)
Gate 1: ✅ PASS if no critical issues (warnings acceptable)
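The Gate 1 rule (pass when there are no critical issues, with warnings tolerated) could be expressed roughly as below. The `Finding` shape, the three severity labels, and `evaluateGate1` are assumptions made for this sketch; real linters and scanners each report severities in their own format and would need mapping first.

```typescript
// Hypothetical normalized finding; each real tool has its own output format.
type Severity = "critical" | "medium" | "warning";

interface Finding {
  tool: string;      // e.g. "eslint", "semgrep"
  severity: Severity;
  message: string;
}

interface Gate1Result {
  status: "PASS" | "FAIL";
  critical: number;
  toAddress: Finding[]; // non-blocking medium findings that still need follow-up
}

// PASS if and only if no finding is critical; warnings never block.
function evaluateGate1(findings: Finding[]): Gate1Result {
  const critical = findings.filter((f) => f.severity === "critical").length;
  const toAddress = findings.filter((f) => f.severity === "medium");
  return { status: critical === 0 ? "PASS" : "FAIL", critical, toAddress };
}

// Mirrors the sample Layer 1 report: 3 lint warnings and 1 medium SAST finding.
const gate1Demo = evaluateGate1([
  { tool: "eslint", severity: "warning", message: "lint warning 1" },
  { tool: "eslint", severity: "warning", message: "lint warning 2" },
  { tool: "eslint", severity: "warning", message: "lint warning 3" },
  { tool: "semgrep", severity: "medium", message: "Weak password hashing rounds (bcrypt)" },
]);
console.log(gate1Demo.status, gate1Demo.toAddress.length);
```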
Operation 2: Functional Verification (Layer 2)
Purpose: Validate functionality through test execution and coverage
- Automation: 60-80% automated
- Speed: Minutes (medium feedback)
- Confidence: High (measurable outcomes)

Process:

1. Execute Complete Test Suite:

       # Run all tests with coverage
       npm test -- --coverage --verbose

       # Capture results
       # - Tests passed/failed
       # - Coverage metrics
       # - Execution time

2. Validate Example Code (from documentation):

       # Extract examples from SKILL.md
       # Execute each example automatically
       # Verify outputs match expected
       # Target: ≥90% examples work

3. Check Coverage:

       # Coverage Report

       **Line Coverage**: 87% ✅ (gate: ≥80%)
       **Branch Coverage**: 82% ✅
       **Function Coverage**: 92% ✅
       **Path Coverage**: 74% ⚠️ (below 80%; reported but not gated)

       **Gate Status**: PASS ✅ (line, branch, and function coverage all ≥80%)

       **Uncovered Code**:
       - src/admin/legacy.ts: 23% (low priority)
       - src/utils/deprecated.ts: 15% (deprecated, ok)

4. Regression Testing (for updates):

       # Compare before/after
       git diff main...feature --stat

       # Run all tests
       npm test

       # Verify: No new failures (regression prevention)

5. Performance Validation:

       # Run performance tests
       npm run test:performance

       # Check response times
       # Verify: Within acceptable ranges

6. Generate Layer 2 Report:

       # Layer 2: Functional Verification

       ## Test Execution
       ✅ 245/245 tests passing (100%)
       ⏱️ Execution time: 8.3 seconds

       ## Coverage
       ✅ Line: 87% (gate: ≥80%)
       ✅ Branch: 82%
       ✅ Function: 92%

       ## Example Validation
       ✅ 18/20 examples work (90%)
       ❌ 2 examples fail (outdated)

       ## Regression
       ✅ All existing tests still pass

       ## Performance
       ✅ All endpoints <200ms

       **Layer 2 Status**: ✅ PASS
       **Issues**: 2 outdated examples (update docs)
Outputs:
- Test execution results
- Coverage report
- Example validation results
- Regression check
- Performance metrics
- Layer 2 status
Validation:
- [ ] All tests executed
- [ ] Coverage meets gate (≥80%)
- [ ] Examples validated (≥90%)
- [ ] No regressions
- [ ] Performance acceptable
Time Estimate: 30-60 minutes
Gate 2: ✅ PASS if tests pass + coverage ≥80%
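As a rough sketch, Gate 2 can be checked from a test summary and the coverage percentages. `Layer2Results` and `evaluateGate2` are hypothetical names, and applying the 80% threshold to line, branch, and function coverage alike is an assumption; the source states only "coverage ≥80%".

```typescript
// Hypothetical summary of a test run; field names are illustrative.
interface Layer2Results {
  testsPassed: number;
  testsTotal: number;
  coverage: { line: number; branch: number; func: number }; // percentages, 0-100
}

interface Gate2Result {
  status: "PASS" | "FAIL";
  reasons: string[]; // one entry per failed condition
}

// PASS when every test passes and each coverage metric meets the threshold.
function evaluateGate2(r: Layer2Results, minCoverage: number = 80): Gate2Result {
  const reasons: string[] = [];
  if (r.testsPassed < r.testsTotal) {
    reasons.push(`${r.testsTotal - r.testsPassed} test(s) failing`);
  }
  const metrics: [string, number][] = [
    ["line", r.coverage.line],
    ["branch", r.coverage.branch],
    ["function", r.coverage.func],
  ];
  for (const [name, value] of metrics) {
    if (value < minCoverage) {
      reasons.push(`${name} coverage ${value}% is below ${minCoverage}%`);
    }
  }
  return { status: reasons.length === 0 ? "PASS" : "FAIL", reasons };
}

// Figures from the sample Layer 2 report.
const gate2Demo = evaluateGate2({
  testsPassed: 245,
  testsTotal: 245,
  coverage: { line: 87, branch: 82, func: 92 },
});
console.log(gate2Demo.status);
```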
Operation 3: Visual Verification (Layer 3)
Purpose: Validate UI appearance, layout, accessibility (for UI features)
- Automation: 30-50% automated
- Speed: Minutes-Hours
- Confidence: Medium (subjective elements)

Process:

1. Screenshot Generation:

       # Generate screenshots of UI
       npx playwright test --screenshot=on

       # Or manually:
       # Open application
       # Capture screenshots of key views

2. Visual Comparison (if previous version exists):

       # Compare against baseline
       npx playwright test --update-snapshots=missing

       # Or use Percy/Chromatic for visual regression
       npx percy snapshot screenshots/

3. Layout Validation:

       # Visual Checklist

       ## Layout
       - [ ] Components positioned correctly
       - [ ] Spacing/margins match mockup
       - [ ] Alignment proper
       - [ ] No overlapping elements

       ## Styling
       - [ ] Colors match design system
       - [ ] Typography correct (fonts, sizes)
       - [ ] Icons/images display properly

       ## Responsiveness
       - [ ] Mobile view (320px-480px): ✅
       - [ ] Tablet view (768px-1024px): ✅
       - [ ] Desktop view (>1024px): ✅

4. Accessibility Testing:

       # Automated accessibility scan (the axe CLI scans a running page, not source files)
       npx @axe-core/cli http://localhost:3000

       # Check WCAG compliance
       npx pa11y http://localhost:3000

       # Manual checks:
       # - Keyboard navigation
       # - Screen reader compatibility
       # - Color contrast ratios

5. Generate Layer 3 Report:

       # Layer 3: Visual Verification

       ## Screenshot Comparison
       ✅ Login page matches mockup
       ✅ Dashboard layout correct
       ⚠️ Profile page: Avatar alignment off by 5px

       ## Responsiveness
       ✅ Mobile: All components visible
       ✅ Tablet: Layout adapts correctly
       ✅ Desktop: Full functionality

       ## Accessibility
       ✅ WCAG 2.1 AA compliance
       ✅ Keyboard navigation works
       ⚠️ 2 color contrast warnings (non-critical)

       **Layer 3 Status**: ✅ PASS (minor issues acceptable)
       **Issues**: Avatar alignment (cosmetic), contrast warnings
Outputs:
- Screenshots of UI
- Visual comparison results
- Responsiveness validation
- Accessibility report
- Layer 3 status
Validation:
- [ ] Screenshots captured
- [ ] Visual comparison done (if applicable)
- [ ] Layout validated
- [ ] Responsiveness tested
- [ ] Accessibility checked
- [ ] No critical visual issues
Time Estimate: 30-90 minutes (skip if no UI)
Gate 3: ✅ PASS if no critical visual/a11y issues
Operation 4: Integration Verification (Layer 4)
Purpose: Validate system-level integration, data flow, API compatibility
- Automation: 20-30% automated
- Speed: Hours (complex)
- Confidence: Medium-High

Process:

1. Component Integration Tests:

       # Run integration test suite
       npm test -- tests/integration/

       # Verify components work together
       # - Database ← → API
       # - API ← → Frontend
       # - Frontend ← → User

2. Data Flow Validation:

       # Data Flow Verification

       **Flow 1: User Registration**
       Frontend form → API endpoint → Validation → Database → Email service
       ✅ Data flows correctly
       ✅ No data loss
       ✅ Transactions atomic

       **Flow 2: Authentication**
       Login request → API → Database lookup → Token generation → Response
       ✅ Token generated correctly
       ✅ Session stored
       ✅ Response includes token

3. API Integration Tests:

       # Test all API endpoints
       npm run test:api

       # Verify:
       # - All endpoints respond
       # - Status codes correct
       # - Response formats match spec
       # - Error handling works

4. End-to-End Workflow Tests:

       // Complete user journeys
       // api, userData, confirmLink, and credentials come from the test setup (not shown)
       test('Complete registration and login flow', async () => {
         // 1. Register new user
         const registerResponse = await api.post('/register', userData);
         expect(registerResponse.status).toBe(201);

         // 2. Confirm email
         const confirmResponse = await api.get(confirmLink);
         expect(confirmResponse.status).toBe(200);

         // 3. Login
         const loginResponse = await api.post('/login', credentials);
         expect(loginResponse.status).toBe(200);
         expect(loginResponse.data.token).toBeDefined();

         // 4. Access protected resource
         const profileResponse = await api.get('/profile', {
           headers: { Authorization: `Bearer ${loginResponse.data.token}` }
         });
         expect(profileResponse.status).toBe(200);
       });

5. Dependency Compatibility:

       # Check dependencies for known vulnerabilities
       npm audit

       # List outdated packages (review major-version bumps for breaking changes)
       npm outdated

       # Verify integration with services
       # - Database connection
       # - Redis/cache
       # - External APIs

6. Generate Layer 4 Report:

       # Layer 4: Integration Verification

       ## Component Integration
       ✅ 12/12 integration tests passing
       ✅ All components integrate correctly

       ## Data Flow
       ✅ All 5 data flows validated
       ✅ No data loss or corruption

       ## API Integration
       ✅ All 15 endpoints functional
       ✅ Response formats correct
       ✅ Error handling works

       ## E2E Workflows
       ✅ 8/8 user journeys complete successfully
       ✅ No workflow breaks

       ## Dependencies
       ✅ 0 critical vulnerabilities
       ⚠️ 2 moderate (non-blocking)

       **Layer 4 Status**: ✅ PASS
Outputs:
- Integration test results
- Data flow validation
- API compatibility report
- E2E workflow results
- Dependency audit
- Layer 4 status
Validation:
- [ ] Integration tests pass
- [ ] Data flows validated
- [ ] APIs integrate correctly
- [ ] E2E workflows function
- [ ] Dependencies secure
Time Estimate: 45-90 minutes
Gate 4: ✅ PASS if all integration tests pass and no dependency has a critical vulnerability
Operation 5: Quality Scoring (Layer 5)
Purpose: Holistic quality assessment using LLM-as-judge and Agent-as-a-Judge patterns
- Automation: 0-20% automated
- Speed: Hours (expensive)
- Confidence: Medium (requires judgment)

Process:

1. Spawn Independent Quality Assessor (Agent-as-a-Judge):

   Key: Use a different model family if possible (prevents self-preference bias)

       const qualityAssessment = await task({
         description: "Assess code quality holistically",
         prompt: `Evaluate code quality in src/ and tests/.

       DO NOT read implementation conversation history.

       You have access to tools:
       - Read files
       - Execute tests
       - Run linters
       - Query database (if needed)

       Assess 5 dimensions (score each /20):

       1. CORRECTNESS (/20):
          - Logic correctness
          - Edge case handling
          - Error handling completeness
          - Security considerations

       2. FUNCTIONALITY (/20):
          - Meets all requirements
          - User workflows work
          - Performance acceptable
          - No regressions

       3. QUALITY (/20):
          - Code maintainability
          - Best practices followed
          - Anti-patterns avoided
          - Documentation complete

       4. INTEGRATION (/20):
          - Components integrate smoothly
          - API contracts correct
          - Data flow works
          - Backward compatible

       5. SECURITY (/20):
          - No vulnerabilities
          - Input validation
          - Authentication/authorization
          - Data protection

       TOTAL: /100 (sum of 5 dimensions)

       For each dimension, provide:
       - Score (/20)
       - Strengths (what's good)
       - Weaknesses (what needs improvement)
       - Evidence (file:line references)
       - Recommendations (specific, actionable)

       Write comprehensive report to: quality-assessment.md`
       });

2. Multi-Agent Ensemble (for critical features):

   3-5 Agent Voting Committee:

       // median, sum, max, and min are helper functions (not shown)
       // Spawn 3 independent quality assessors
       const [judge1, judge2, judge3] = await Promise.all([
         task({description: "Quality Judge 1", prompt: assessmentPrompt}),
         task({description: "Quality Judge 2", prompt: assessmentPrompt}),
         task({description: "Quality Judge 3", prompt: assessmentPrompt})
       ]);

       // Aggregate scores
       const scores = {
         correctness: median([judge1.correctness, judge2.correctness, judge3.correctness]),
         functionality: median([...]),
         quality: median([...]),
         integration: median([...]),
         security: median([...])
       };

       const totalScore = sum(Object.values(scores)); // Total /100

       // Check disagreement: spread between the highest and lowest total
       const totalScores = [judge1.total, judge2.total, judge3.total];
       const spread = max(totalScores) - min(totalScores);

       if (spread > 15) {
         // High disagreement → spawn 2 more judges (total 5)
         // Use 5-agent ensemble for final score
       }

       // Final score: median of 3 or 5

3. Calibration Against Rubric:

       # Scoring Calibration

       ## Correctness: 18/20 (Excellent)
       **20**: Zero errors, all edge cases handled perfectly
       **18**: Minor edge case missing, otherwise excellent ✅ (achieved)
       **15**: 1-2 significant edge cases missing
       **10**: Some logic errors present
       **0**: Major functionality broken

       **Evidence**: All tests pass, edge cases covered except timezone DST edge case (minor)

       ## Functionality: 19/20 (Excellent)
       [Similar rubric with evidence]

       ## Quality: 17/20 (Good)
       [Similar rubric with evidence]

       ## Integration: 18/20 (Excellent)
       [Similar rubric with evidence]

       ## Security: 16/20 (Good)
       [Similar rubric with evidence]

       **Total**: 88/100 ⚠️ (Below ≥90 gate)

4. Gap Analysis (if <90):

       # Quality Gap Analysis

       **Current Score**: 88/100
       **Target**: ≥90/100
       **Gap**: 2 points

       ## Critical Gaps (Blocking Approval)
       None

       ## High Priority (Should Fix for ≥90)
       1. **Security: Weak bcrypt rounds**
          - **What**: bcrypt using 10 rounds (outdated)
          - **Where**: src/auth/hash.ts:15
          - **Why**: Current standard is 12-14 rounds
          - **How**: Change `bcrypt.hash(password, 10)` to `bcrypt.hash(password, 12)`
          - **Priority**: High
          - **Impact**: +2 points → 90/100

       ## Medium Priority
       1. **Quality: Missing JSDoc for 3 functions**
          - Impact: +1 point → 91/100

       **Recommendation**: Fix high priority issue to reach ≥90 threshold
       **Estimated Effort**: 15 minutes

5. Generate Comprehensive Quality Report:

       # Layer 5: Quality Scoring Report

       ## Executive Summary
       **Total Score**: 88/100 ⚠️ (Below ≥90 gate)
       **Status**: NEEDS MINOR REVISION

       ## Dimension Scores
       - Correctness: 18/20 ⭐⭐⭐⭐⭐
       - Functionality: 19/20 ⭐⭐⭐⭐⭐
       - Quality: 17/20 ⭐⭐⭐⭐
       - Integration: 18/20 ⭐⭐⭐⭐⭐
       - Security: 16/20 ⭐⭐⭐⭐

       ## Strengths
       1. Comprehensive test coverage (87%)
       2. All functionality working correctly
       3. Clean integration with all components
       4. Good error handling

       ## Weaknesses
       1. Bcrypt rounds below current standard (security)
       2. Missing documentation for helper functions (quality)
       3. One timezone edge case not handled (correctness)

       ## Recommendations (Prioritized)

       ### Priority 1 (High - Needed for ≥90)
       1. Increase bcrypt rounds: 10 → 12
          - File: src/auth/hash.ts:15
          - Effort: 5 min
          - Impact: +2 points

       ### Priority 2 (Medium - Nice to Have)
       1. Add JSDoc to helper functions
          - Files: src/utils/validation.ts
          - Effort: 30 min
          - Impact: +1 point
       2. Handle timezone DST edge case
          - File: src/auth/tokens.ts:78
          - Effort: 20 min
          - Impact: +1 point

       **Next Steps**: Apply Priority 1 fix, re-verify to reach ≥90
Outputs:
- Quality score (0-100) with dimension breakdown
- Calibrated against rubric
- Gap analysis
- Prioritized recommendations (Critical/High/Medium/Low)
- Evidence-based feedback (file:line references)
- Action plan to reach ≥90
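One way to turn a gap analysis into an action plan is to take fixes in priority order until the projected score reaches the target. The sketch below assumes each recommendation carries an estimated point impact, as in the sample report; `Fix`, `planFixes`, and the rule that critical fixes are always included are illustrative choices, not behavior the Skill specifies.

```typescript
type Priority = "critical" | "high" | "medium" | "low";

// Hypothetical recommendation record with an estimated score gain.
interface Fix {
  title: string;
  priority: Priority;
  impact: number; // estimated points gained if applied
}

const PRIORITY_RANK: Record<Priority, number> = { critical: 0, high: 1, medium: 2, low: 3 };

// Select fixes by priority (then by larger impact) until the target is reached.
// Critical fixes are always selected, even if the target is already met.
function planFixes(current: number, fixes: Fix[], target: number = 90) {
  const ordered = fixes.slice().sort(
    (a, b) => PRIORITY_RANK[a.priority] - PRIORITY_RANK[b.priority] || b.impact - a.impact
  );
  const selected: string[] = [];
  let projected = current;
  for (const fix of ordered) {
    if (fix.priority !== "critical" && projected >= target) break;
    selected.push(fix.title);
    projected += fix.impact;
  }
  return { selected, projected, reachesTarget: projected >= target };
}

// Figures from the sample gap analysis: 88/100 with three candidate fixes.
const fixPlan = planFixes(88, [
  { title: "Add JSDoc to helper functions", priority: "medium", impact: 1 },
  { title: "Increase bcrypt rounds 10 -> 12", priority: "high", impact: 2 },
  { title: "Handle timezone DST edge case", priority: "medium", impact: 1 },
]);
console.log(fixPlan.selected, fixPlan.projected);
```

With the sample figures, the single high-priority fix is enough to project 90/100, which matches the report's recommendation.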
Validation:
- [ ] All 5 dimensions scored
- [ ] Scores calibrated against rubric
- [ ] Evidence provided for each score
- [ ] Gap analysis if <90
- [ ] Recommendations actionable
- [ ] Ensemble used for critical features (optional)
Time Estimate: 60-120 minutes (ensemble adds 30-60 min)
Gate 5: ✅ PASS if total score ≥90/100
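The ensemble aggregation from the process above can be made concrete as follows: take the per-dimension median across judges, sum the medians for the total, and flag high disagreement when the judges' totals differ by more than 15 points. This is a sketch under assumptions; `JudgeScores` and `aggregateJudges` are hypothetical names, and the judge scores are invented sample data.

```typescript
type Dimension = "correctness" | "functionality" | "quality" | "integration" | "security";
const DIMENSIONS: Dimension[] = ["correctness", "functionality", "quality", "integration", "security"];

// One judge's scores; each dimension is scored out of 20.
type JudgeScores = Record<Dimension, number>;

function median(values: number[]): number {
  const s = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 === 1 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

function totalOf(j: JudgeScores): number {
  return DIMENSIONS.reduce((acc, d) => acc + j[d], 0);
}

interface EnsembleResult {
  perDimension: JudgeScores;
  total: number;            // sum of per-dimension medians, out of 100
  spread: number;           // highest judge total minus lowest judge total
  needsMoreJudges: boolean; // true when spread exceeds maxSpread
  passesGate: boolean;      // total >= 90
}

function aggregateJudges(judges: JudgeScores[], maxSpread: number = 15): EnsembleResult {
  const perDimension = {} as JudgeScores;
  for (const d of DIMENSIONS) {
    perDimension[d] = median(judges.map((j) => j[d]));
  }
  const totals = judges.map(totalOf);
  const spread = Math.max(...totals) - Math.min(...totals);
  const total = totalOf(perDimension);
  return { perDimension, total, spread, needsMoreJudges: spread > maxSpread, passesGate: total >= 90 };
}

// Invented sample scores for three judges (totals 88, 89, 86).
const ensembleDemo = aggregateJudges([
  { correctness: 18, functionality: 19, quality: 17, integration: 18, security: 16 },
  { correctness: 17, functionality: 18, quality: 18, integration: 19, security: 17 },
  { correctness: 19, functionality: 19, quality: 16, integration: 17, security: 15 },
]);
console.log(ensembleDemo.total, ensembleDemo.spread, ensembleDemo.needsMoreJudges);
```

The median is used rather than the mean so that a single outlier judge cannot move a dimension score on its own.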
Quality Gates Summary
All 5 Gates Must Pass for production approval:
Gate 1: Rules Pass ✅
↓ (Linting, types, schema, security)
Gate 2: Tests Pass ✅
↓ (All tests, coverage ≥80%)
Gate 3: Visual OK ✅
↓ (UI validated, a11y checked)
Gate 4: Integration OK ✅
↓ (E2E works, APIs integrate)
Gate 5: Quality ≥90 ✅
↓ (LLM-as-judge score ≥90/100)
✅ PRODUCTION APPROVED
If Any Gate Fails:
Failed Gate → Gap Analysis → Apply Fixes → Re-Verify → Repeat Until Pass
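The recovery loop can be sketched as a bounded retry: verify, and on failure apply fixes for the failed gate and verify again. The `maxRounds` cap is an assumption added here so the loop always terminates (the source says only "repeat until pass"), and `verifyUntilPass` and the simulated score are hypothetical.

```typescript
interface VerifyOutcome {
  passed: boolean;
  failedGate: string | null; // name of the first failing gate, if any
}

interface LoopResult {
  passed: boolean;
  rounds: number;    // number of verification runs performed
  history: string[]; // gates that failed, in order
}

// Verify, fix the failing gate, and re-verify, up to maxRounds verification runs.
function verifyUntilPass(
  verify: () => VerifyOutcome,
  applyFixes: (failedGate: string) => void,
  maxRounds: number = 5 // assumption: a safety bound not present in the source
): LoopResult {
  const history: string[] = [];
  for (let round = 1; round <= maxRounds; round++) {
    const outcome = verify();
    if (outcome.passed) {
      return { passed: true, rounds: round, history };
    }
    const gate = outcome.failedGate === null ? "unknown" : outcome.failedGate;
    history.push(gate);
    applyFixes(gate);
  }
  return { passed: false, rounds: maxRounds, history };
}

// Simulation: the score starts at 88 and each fix round adds 2 points.
let simulatedScore = 88;
const loopDemo = verifyUntilPass(
  () => (simulatedScore >= 90
    ? { passed: true, failedGate: null }
    : { passed: false, failedGate: "Gate 5" }),
  () => { simulatedScore += 2; }
);
console.log(loopDemo.passed, loopDemo.rounds, loopDemo.history);
```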
Appendix A: Independence Protocol
How Verification Independence is Maintained
Verification Agent Spawning:
    // After implementation and testing complete
    const verification = await task({
      description: "Independent quality verification",
      prompt: `Verify code quality independently.

    DO NOT read prior conversation history.

    Review:
    - Code: src/**/*.ts
    - Tests: tests/**/*.test.ts
    - Specs: specs/requirements.md

    Verify against specifications ONLY (not implementation decisions).

    Use tools:
    - Read files to inspect code
    - Run tests to verify functionality
    - Execute linters for quality checks

    Score quality (0-100) with evidence.

    Write report to: independent-verification.md`
    });
Bias Prevention Checklist:
- [ ] Specifications written BEFORE implementation
- [ ] Verification agent prompt has no implementation context
- [ ] Agent evaluates against specs, not what code does
- [ ] Fresh context (via Task tool)
- [ ] Different model family used (if possible)
Validation of Independence:
## Independence Audit
**Expected Behavior**:
- ✅ Verifier finds 1-3 issues (healthy skepticism)
- ✅ Verifier references specifications
- ✅ Verifier uses tools to verify claims
**Warning Signs**:
- ⚠️ Verifier finds 0 issues (possible rubber stamp)
- ⚠️ Verifier doesn't use tools
- ⚠️ Verifier parrots implementation justifications
**If Warning**: Re-verify with stronger independence prompt
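The warning signs in the audit above lend themselves to a simple automated check on a verifier's report. The sketch below is hypothetical: `VerifierReport` and its fields are invented for illustration, and detecting a verifier that "parrots implementation justifications" is approximated here by checking whether it cites the specification at all.

```typescript
// Hypothetical summary extracted from a verifier's report.
interface VerifierReport {
  issuesFound: number;
  toolCalls: number;            // how many times the verifier used tools
  citesSpecification: boolean;  // whether findings reference the specs
}

// Return a list of warning signs; an empty list means no sign was detected.
function auditIndependence(report: VerifierReport): string[] {
  const warnings: string[] = [];
  if (report.issuesFound === 0) {
    warnings.push("Verifier found 0 issues (possible rubber stamp)");
  }
  if (report.toolCalls === 0) {
    warnings.push("Verifier did not use tools");
  }
  if (!report.citesSpecification) {
    warnings.push("Verifier did not reference the specifications");
  }
  return warnings;
}

const suspicious = auditIndependence({ issuesFound: 0, toolCalls: 0, citesSpecification: true });
const healthy = auditIndependence({ issuesFound: 2, toolCalls: 7, citesSpecification: true });
console.log(suspicious.length, healthy.length);
```

Any non-empty result would trigger the re-verification step with a stronger independence prompt.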
Appendix B: Operational Scoring Rubrics
Complete Rubrics for All 5 Dimensions
Correctness (/20)
- 20 (Perfect): Zero logic errors, all edge cases handled, security perfect
- 18 (Excellent): 1 minor edge case missing, otherwise flawless
- 15 (Good): 2-3 edge cases missing, no critical errors
- 12 (Acceptable): Some edge cases missing, 1 minor logic issue
- 10 (Needs Work): Multiple edge cases missing or 1 significant logic error
- 5 (Poor): Major logic errors present
- 0 (Broken): Critical functionality broken

Functionality (/20)
- 20: All requirements met, exceeds expectations
- 18: All requirements met, well implemented
- 15: All requirements met, basic implementation
- 12: 1 requirement partially missing
- 10: 2+ requirements partially missing
- 5: Several requirements not met
- 0: Core functionality missing

Quality (/20)
- 20: Exceptional code quality, best practices exemplified
- 18: High quality, follows best practices
- 15: Good quality, minor style issues
- 12: Acceptable quality, several style issues
- 10: Below standard, needs refactoring
- 5: Poor quality, significant issues
- 0: Unmaintainable code

Integration (/20)
- 20: Perfect integration, all touch points verified
- 18: Excellent integration, minor docs needed
- 15: Good integration, all major points work
- 12: Acceptable, 1-2 integration issues
- 10: Integration issues present
- 5: Multiple integration problems
- 0: Does not integrate

Security (/20)
- 20: Passes all security scans, OWASP compliant, hardened
- 18: Passes scans, 1 minor non-critical issue
- 15: Passes, 2-3 minor issues
- 12: 1 medium security issue
- 10: Multiple medium issues
- 5: 1 critical issue present
- 0: Multiple critical vulnerabilities
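Scores that fall between rubric anchors (for example 17 or 19) need a rule for which label applies. A minimal sketch, assuming a score takes the label of the highest anchor it meets or exceeds, and assuming the Correctness labels (Perfect, Excellent, Good, and so on) apply to every dimension; `rubricLabel` is a hypothetical helper.

```typescript
interface RubricAnchor {
  score: number;
  label: string;
}

// Anchors from the Correctness rubric, highest first.
const ANCHORS: RubricAnchor[] = [
  { score: 20, label: "Perfect" },
  { score: 18, label: "Excellent" },
  { score: 15, label: "Good" },
  { score: 12, label: "Acceptable" },
  { score: 10, label: "Needs Work" },
  { score: 5, label: "Poor" },
  { score: 0, label: "Broken" },
];

// Return the label of the highest anchor that the score meets or exceeds.
function rubricLabel(score: number): string {
  if (!(score >= 0 && score <= 20)) {
    throw new RangeError("score must be between 0 and 20");
  }
  for (const anchor of ANCHORS) {
    if (score >= anchor.score) return anchor.label;
  }
  return ANCHORS[ANCHORS.length - 1].label; // not reached: the last anchor is 0
}

console.log(rubricLabel(19), rubricLabel(17), rubricLabel(16));
```

Under this rule 19 maps to Excellent and both 17 and 16 map to Good, which agrees with the labels used in the sample calibration.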
Appendix C: Technical Foundation
Verification Tools
Linting:
- ESLint (JavaScript/TypeScript)
- Pylint/Ruff (Python)
Type Checking:
- TypeScript compiler (tsc)
- mypy (Python)
Security (SAST):
- Semgrep (multi-language)
- Bandit (Python)
- npm audit (JavaScript)
Visual Testing:
- Playwright (screenshot, visual regression)
- Percy/Chromatic (visual diff)
- axe-core (accessibility)
Coverage:
- c8/nyc (JavaScript)
- pytest-cov (Python)
Cost Controls
Budget Caps:
- LLM-as-judge: $50/month
- Ensemble verification: $20/month
- Total verification: $70/month
Optimization:
- Cache quality scores for 24h (same code → same score)
- Skip Layer 5 for changes <50 lines
- Use ensemble (3-5 agents) only for critical features
- Use cheaper models for pre-filtering (Haiku for Layer 1-2)
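Two of the optimizations above, the 24-hour score cache and skipping Layer 5 for small changes, can be combined into one decision function. This is a sketch under assumptions: `ScoreCache`, `decideLayer5`, and the use of a content hash as the cache key are illustrative, and timestamps are passed in explicitly so the logic stays easy to test.

```typescript
const CACHE_TTL_MS = 24 * 60 * 60 * 1000; // "cache quality scores for 24h"
const MIN_LINES_FOR_LAYER5 = 50;          // "skip Layer 5 for changes <50 lines"

interface CachedScore {
  score: number;
  storedAt: number; // epoch milliseconds
}

type Layer5Decision =
  | { action: "skip" }
  | { action: "cached"; score: number }
  | { action: "run" };

// In-memory cache keyed by a hash of the code under review (hypothetical key choice).
class ScoreCache {
  private entries: { [codeHash: string]: CachedScore | undefined } = {};

  set(codeHash: string, score: number, now: number): void {
    this.entries[codeHash] = { score, storedAt: now };
  }

  // Returns the cached score, or null if missing or older than the TTL.
  get(codeHash: string, now: number): number | null {
    const entry = this.entries[codeHash];
    if (entry === undefined || now - entry.storedAt >= CACHE_TTL_MS) return null;
    return entry.score;
  }
}

function decideLayer5(changedLines: number, codeHash: string, cache: ScoreCache, now: number): Layer5Decision {
  if (changedLines < MIN_LINES_FOR_LAYER5) return { action: "skip" };
  const cached = cache.get(codeHash, now);
  if (cached !== null) return { action: "cached", score: cached };
  return { action: "run" };
}

const demoCache = new ScoreCache();
const t0 = 1700000000000; // arbitrary fixed timestamp for the example
demoCache.set("abc123", 91, t0);
console.log(decideLayer5(120, "abc123", demoCache, t0 + 60 * 60 * 1000).action);
```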
Quick Reference
The 5 Layers
| Layer | Purpose | Automation | Time | Tools |
|---|---|---|---|---|
| 1 | Rules-based | 95% | 15-30m | Linters, types, SAST |
| 2 | Functional | 60-80% | 30-60m | Test execution, coverage |
| 3 | Visual | 30-50% | 30-90m | Screenshots, a11y |
| 4 | Integration | 20-30% | 45-90m | E2E, API tests |
| 5 | Quality Scoring | 0-20% | 60-120m | LLM-as-judge, ensemble |
Total: about 3 to 6.5 hours for complete 5-layer verification (the sum of the per-layer estimates, 180 to 390 minutes)
Quality Thresholds
- ≥90: ✅ Excellent (production-ready)
- 80-89: ⚠️ Good (needs minor improvements)
- 70-79: ❌ Acceptable (needs work before production)
- <70: ❌ Poor (significant rework required)
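The threshold bands translate directly into a small classifier. `classifyScore` and the `Band` type are hypothetical names; the cut-offs are the ones listed above.

```typescript
type Band = "Excellent" | "Good" | "Acceptable" | "Poor";

interface ScoreClassification {
  band: Band;
  productionReady: boolean; // only the >=90 band is production-ready
}

// Map a 0-100 total score to its quality band.
function classifyScore(total: number): ScoreClassification {
  if (!(total >= 0 && total <= 100)) {
    throw new RangeError("total must be between 0 and 100");
  }
  if (total >= 90) return { band: "Excellent", productionReady: true };
  if (total >= 80) return { band: "Good", productionReady: false };
  if (total >= 70) return { band: "Acceptable", productionReady: false };
  return { band: "Poor", productionReady: false };
}

console.log(classifyScore(88).band, classifyScore(90).productionReady);
```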
Gates
All 5 Must Pass:
- Rules pass (no critical lint/type/security)
- Tests pass + coverage ≥80%
- Visual OK (no critical UI issues)
- Integration OK (E2E works)
- Quality ≥90/100
multi-ai-verification provides comprehensive, multi-layer quality assurance with independent LLM-as-judge evaluation, ensuring production-ready code through systematic verification from automated rules to holistic quality assessment.
For rubrics, see Appendix B. For independence protocol, see Appendix A.