deepeval

DeepEval is an open-source framework for unit testing LLM applications. This Skill gives expert guidance on writing test cases, defining custom metrics, and integrating LLM quality checks into CI/CD pipelines through a pytest-like interface.

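⚡ Recommended: one-command install (about 60 seconds)

Copy the command below and paste it into a terminal (Mac/Linux) or PowerShell (Windows). It downloads the archive, unpacks it, and places the files automatically.

🍎 Mac / 🐧 Linux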

mkdir -p ~/.claude/skills && cd ~/.claude/skills && curl -L -o deepeval.zip https://jpskill.com/download/14825.zip && unzip -o deepeval.zip && rm deepeval.zip

🪟 Windows (PowerShell)

$d = "$env:USERPROFILE\.claude\skills"; ni -Force -ItemType Directory $d | Out-Null; iwr https://jpskill.com/download/14825.zip -OutFile "$d\deepeval.zip"; Expand-Archive "$d\deepeval.zip" -DestinationPath $d -Force; ri "$d\deepeval.zip"

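When it finishes, restart Claude Code. The Skill then activates automatically when you ask for a related task in plain language, for example "write DeepEval tests for my chatbot".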

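💾 Manual download (if you prefer not to use the command line)

1. Download deepeval.zip with the download button on this page
2. Double-click the ZIP file to unpack it; this creates a deepeval folder
3. Move that folder to C:\Users\YourName\.claude\skills\ (Windows) or ~/.claude/skills/ (Mac)
4. Restart Claude Code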

⚠️ Download and use at your own risk. This site accepts no responsibility for the content, behavior, or safety of the Skill.

🎯 What this Skill can do

The description below explains what this Skill does for you. It activates automatically when you ask Claude for help in this area.

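📦 Installation (3 steps)

1. Use the download button above to get the .skill file
2. Change the file extension from .skill to .zip and extract it (macOS can extract it automatically)
3. Put the extracted folder in .claude/skills/ under your home folder
- macOS / Linux: ~/.claude/skills/
- Windows: %USERPROFILE%\.claude\skills\

Restart Claude Code to finish. You do not need to say "use this Skill"; it is invoked automatically for related requests.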

Last updated: 2026-05-18
Retrieved: 2026-05-18
Bundled files: 1

📜 Original SKILL.md (the text Claude reads)

DeepEval — LLM Testing & Evaluation Framework

Overview

DeepEval is an open-source framework for unit testing LLM applications. It helps developers write test cases, define custom metrics, and integrate LLM quality checks into CI/CD pipelines using a pytest-like interface.

Instructions

Basic Test Cases

Write unit tests for LLM outputs using built-in metrics:

# tests/test_chatbot.py — Unit tests for a customer support chatbot
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    HallucinationMetric,
    ToxicityMetric,
)

def test_answer_is_relevant():
    """Verify the chatbot answers the actual question asked."""
    test_case = LLMTestCase(
        input="How do I cancel my subscription?",
        actual_output="To cancel your subscription, go to Settings > Billing > Cancel Plan. Your access continues until the end of the billing period.",
        retrieval_context=[
            "Cancellation Policy: Users can cancel anytime via Settings > Billing > Cancel Plan. Access remains active until the current billing period ends.",
            "Refunds: Pro-rated refunds are available for annual plans within 14 days.",
        ],
    )

    metric = AnswerRelevancyMetric(
        threshold=0.7,      # Minimum score to pass (0.0 to 1.0)
        model="gpt-4o",     # Judge model for evaluation
    )
    assert_test(test_case, [metric])


def test_answer_is_faithful_to_context():
    """Ensure the chatbot doesn't hallucinate beyond retrieved documents."""
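    test_case = LLMTestCase(
        input="What is the pricing for the enterprise plan?",
        actual_output="The enterprise plan costs $499/month with unlimited users and priority support.",
        # FaithfulnessMetric reads retrieval_context (what the retriever returned)
        retrieval_context=[
            "Enterprise Plan: $499/month. Includes unlimited users, priority support, SSO, and custom integrations.",
        ],
        # HallucinationMetric reads context (the ground-truth documents), so supply it too
        context=[
            "Enterprise Plan: $499/month. Includes unlimited users, priority support, SSO, and custom integrations.",
        ],
    )

    faithfulness = FaithfulnessMetric(threshold=0.8)
    # For hallucination the threshold is a ceiling: the test passes when the score is at or below 0.5
    hallucination = HallucinationMetric(threshold=0.5)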
    assert_test(test_case, [faithfulness, hallucination])


def test_response_is_not_toxic():
    """Guard against toxic or inappropriate responses."""
    test_case = LLMTestCase(
        input="Your product is terrible and I hate it",
        actual_output="I'm sorry to hear about your frustration. Let me help resolve your issue. Could you describe what went wrong?",
    )

    toxicity = ToxicityMetric(threshold=0.5)
    assert_test(test_case, [toxicity])
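
The tests above read thresholds in two directions: for relevancy and faithfulness the threshold is a minimum score, while for hallucination and toxicity a lower score is better and the threshold acts as a maximum. The sketch below only illustrates that rule; `metric_passes` is a hypothetical helper, not part of DeepEval.

```python
# Hypothetical helper (not a DeepEval API): shows how a threshold is read
# for "higher is better" versus "lower is better" metrics.
def metric_passes(score: float, threshold: float, higher_is_better: bool = True) -> bool:
    """Return True when a score satisfies its threshold."""
    if higher_is_better:
        return score >= threshold   # relevancy, faithfulness: threshold is a floor
    return score <= threshold       # hallucination, toxicity: threshold is a ceiling


print(metric_passes(0.82, 0.7))                          # relevancy-style check
print(metric_passes(0.10, 0.5, higher_is_better=False))  # toxicity-style check
print(metric_passes(0.65, 0.5, higher_is_better=False))  # score above the ceiling
```

Keeping the direction in mind matters when tuning: raising a relevancy threshold makes a test stricter, while raising a toxicity threshold makes it more lenient.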

Conversational Testing

Test multi-turn conversations for coherence and context retention:

# tests/test_conversation.py: Multi-turn conversation quality
from deepeval import assert_test  # needed for the assert_test call at the end of the test
from deepeval.test_case import ConversationalTestCase, LLMTestCase
from deepeval.metrics import ConversationRelevancyMetric, ConversationCompletenessMetric

def test_multi_turn_conversation():
    """Verify the assistant maintains context across turns."""
    conversation = ConversationalTestCase(
        turns=[
            LLMTestCase(
                input="I want to upgrade to the Pro plan",
                actual_output="I'd be happy to help you upgrade to Pro! Your current plan is Basic. The Pro plan is $29/month with advanced analytics and priority support. Shall I proceed?",
            ),
            LLMTestCase(
                input="Yes, please proceed",
                actual_output="Done! Your account has been upgraded to Pro ($29/month). The new features are available immediately. Your next billing date is April 1st.",
            ),
            LLMTestCase(
                input="What features did I just get?",
                actual_output="With your new Pro plan, you now have access to: advanced analytics dashboards, priority email support (< 4hr response), custom integrations via API, and team collaboration tools for up to 10 members.",
                retrieval_context=[
                    "Pro Plan Features: Advanced analytics, priority support (4hr SLA), API access, team collaboration (10 seats).",
                ],
            ),
        ],
    )

    relevancy = ConversationRelevancyMetric(threshold=0.7)
    completeness = ConversationCompletenessMetric(threshold=0.7)
    assert_test(conversation, [relevancy, completeness])

Custom Metrics

Define evaluation criteria specific to your domain:

# metrics/brand_voice.py — Custom metric for brand consistency
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class BrandVoiceMetric(BaseMetric):
    """Evaluate if responses match the company's brand voice guidelines.

    Scores how well the output follows the defined tone, vocabulary,
    and communication style of the brand.
    """

    def __init__(self, brand_guidelines: str, threshold: float = 0.7):
        self.threshold = threshold
        self.brand_guidelines = brand_guidelines

    def measure(self, test_case: LLMTestCase) -> float:
        # Use an LLM to judge brand voice adherence
        from deepeval.models import GPTModel
        judge = GPTModel(model="gpt-4o")

        prompt = f"""Evaluate how well this response follows the brand voice guidelines.

Brand Guidelines:
{self.brand_guidelines}

User Input: {test_case.input}
Response: {test_case.actual_output}

Score from 0.0 (completely off-brand) to 1.0 (perfectly on-brand).
Explain your reasoning, then provide the score on the last line as just a number."""

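        result = judge.generate(prompt)
        # Some DeepEval versions may return (text, cost) rather than a plain string
        if isinstance(result, tuple):
            result = result[0]
        # Extract score from last line (raises ValueError if the judge adds extra text there)
        lines = result.strip().split('\n')
        self.score = float(lines[-1].strip())
        self.reason = '\n'.join(lines[:-1])
        self.success = self.score >= self.threshold
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        # Async entry point used when metrics run asynchronously; reuse the sync logic
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success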

    @property
    def __name__(self):
        return "Brand Voice"


# Usage in tests
brand_metric = BrandVoiceMetric(
    brand_guidelines="""
    - Friendly but professional tone
    - Use 'we' not 'I'
    - Avoid jargon; explain technical terms
    - Maximum 3 sentences per paragraph
    - Always offer next steps
    """,
    threshold=0.75,
)
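
The `measure` method above converts the judge's last line with `float(...)`, which fails if the judge writes something like "Score: 0.85". A more forgiving parser might look like the sketch below. `parse_judge_score` is a hypothetical helper introduced here for illustration, and it assumes the judge was asked for a score between 0.0 and 1.0 on the final line.

```python
import re


def parse_judge_score(text: str) -> tuple[float, str]:
    """Hypothetical helper: read a 0.0-1.0 score from the last non-empty line.

    Returns (score, reasoning). Raises ValueError when no usable number is found.
    """
    lines = [line for line in text.strip().splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty judge reply")
    match = re.search(r"\d+(?:\.\d+)?", lines[-1])  # first number on the last line
    if match is None:
        raise ValueError(f"no score on last line: {lines[-1]!r}")
    score = float(match.group())
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    return score, "\n".join(lines[:-1])


score, reason = parse_judge_score("Friendly tone.\nUses 'we'.\n\nScore: 0.85")
print(score)
print(reason)
```

Raising on an unparseable reply, rather than defaulting to 0.0, keeps a formatting slip by the judge from being recorded as a failing score.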

Bulk Evaluation with Datasets

Run evaluations at scale:

# eval/run_benchmark.py — Evaluate across a full test dataset
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.dataset import EvaluationDataset
import json

# Load test cases from a JSON file
with open("eval/test_cases.json") as f:
    raw_cases = json.load(f)

test_cases = [
    LLMTestCase(
        input=case["question"],
        actual_output=case["answer"],
        expected_output=case.get("expected_answer"),
        retrieval_context=case.get("contexts", []),
    )
    for case in raw_cases
]

dataset = EvaluationDataset(test_cases=test_cases)

# Run evaluation — results are displayed in a table and optionally
# pushed to the DeepEval dashboard (Confident AI)
results = evaluate(
    test_cases=dataset.test_cases,   # pass the list of cases held by the dataset
    metrics=[
        AnswerRelevancyMetric(threshold=0.7),
        FaithfulnessMetric(threshold=0.8),
    ],
    print_results=True,        # Show results table in terminal
)

# Access individual results programmatically
for result in results.test_results:
    if not result.success:
        print(f"FAILED: {result.input[:50]}...")
        for metric_result in result.metrics_data:
            if not metric_result.success:
                print(f"  {metric_result.name}: {metric_result.score:.2f} (reason: {metric_result.reason})")
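
The loader above indexes `question` and `answer` directly and treats `expected_answer` and `contexts` as optional, so a malformed record in eval/test_cases.json would surface as a bare KeyError partway through. A pre-check along these lines can report the offending record first. `validate_cases` and the sample records are hypothetical, written only to illustrate the file shape the loader implies.

```python
import json

REQUIRED_KEYS = ("question", "answer")  # keys the loader reads with case[...]


def validate_cases(raw_cases: list[dict]) -> list[dict]:
    """Hypothetical pre-check: fail early with the index of any malformed record."""
    cleaned = []
    for i, case in enumerate(raw_cases):
        missing = [key for key in REQUIRED_KEYS if not case.get(key)]
        if missing:
            raise ValueError(f"case {i} is missing {missing}")
        cleaned.append({
            "question": case["question"],
            "answer": case["answer"],
            "expected_answer": case.get("expected_answer"),  # optional
            "contexts": list(case.get("contexts", [])),      # optional, defaults to []
        })
    return cleaned


# Illustrative records in the shape the loader expects
sample = json.loads("""
[
  {"question": "How do I cancel?", "answer": "Go to Settings > Billing.",
   "contexts": ["Cancel via Settings > Billing > Cancel Plan."]},
  {"question": "Is there a refund?", "answer": "Yes, within 14 days.",
   "expected_answer": "Pro-rated refunds within 14 days for annual plans."}
]
""")
cases = validate_cases(sample)
print(len(cases), cases[1]["contexts"])
```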

Red Teaming and Safety

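Test LLMs against adversarial inputs. The RedTeamer arguments below are illustrative; the red-teaming API has changed between DeepEval releases, so check them against your installed version: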

# tests/test_safety.py — Adversarial testing for LLM safety
from deepeval.metrics import (
    ToxicityMetric,
    BiasMetric,
)
from deepeval.red_teaming import RedTeamer

# Automated red teaming — generates adversarial prompts
red_teamer = RedTeamer(
    target_model="gpt-4o",
    attacks=[
        "prompt-injection",       # Attempts to override system prompt
        "jailbreak",              # Tries to bypass safety guardrails
        "pii-extraction",         # Attempts to extract personal data
        "harmful-content",        # Requests for dangerous information
    ],
    attack_count=20,              # Generate 20 attack attempts per category
)

results = red_teamer.scan()

# Check vulnerability scores
for vulnerability in results.vulnerabilities:
    print(f"{vulnerability.type}: {vulnerability.score:.2f} "
          f"({vulnerability.attacks_succeeded}/{vulnerability.attacks_total} succeeded)")

Installation & CLI

# Install DeepEval
pip install deepeval

# Run tests with pytest (deepeval is a pytest plugin)
deepeval test run tests/test_chatbot.py

# Run with verbose output showing per-metric scores
deepeval test run tests/ -v

# Login to Confident AI dashboard (optional, for tracking)
deepeval login

Examples

Example 1: Setting up an evaluation pipeline for a RAG application

User request:

I have a RAG chatbot that answers questions from our docs. Set up DeepEval to evaluate answer quality.

The agent creates an evaluation suite with appropriate metrics (faithfulness, relevance, answer correctness), configures test datasets from real user questions, runs baseline evaluations, and sets up CI integration so evaluations run on every prompt or retrieval change.

Example 2: Comparing model performance across prompts

User request:

We're testing GPT-4o vs Claude on our customer support prompts. Set up a comparison with DeepEval.

The agent creates a structured experiment with the existing prompt set, configures both model providers, defines scoring criteria specific to customer support (accuracy, tone, completeness), runs the comparison, and generates a summary report with statistical significance indicators.
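
The source does not say which statistical test backs the "significance indicators". One simple option for paired per-prompt scores is an exact sign test, sketched below with the standard library only. The function name and the score lists are illustrative assumptions, not output from a real run.

```python
from math import comb


def sign_test_p_value(scores_a: list[float], scores_b: list[float]) -> float:
    """Two-sided exact sign test on paired per-prompt scores (ties are dropped)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must be paired and equal in length")
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(b > a for a, b in zip(scores_a, scores_b))
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # every pair tied: no evidence either way
    k = min(wins_a, wins_b)
    # P(X <= k) under Binomial(n, 0.5), doubled for a two-sided test and capped at 1
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Made-up scores for the same eight prompts under two models
model_a = [0.90, 0.80, 0.85, 0.70, 0.95, 0.80, 0.75, 0.90]
model_b = [0.70, 0.60, 0.80, 0.65, 0.90, 0.70, 0.70, 0.85]
print(f"{sign_test_p_value(model_a, model_b):.4f}")  # prints 0.0078
```

The sign test ignores how large each difference is, so it is conservative; it suits small prompt sets where judge scores are noisy and a distributional assumption would be hard to defend.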

Guidelines

Test the full pipeline — Don't just test the LLM; test retrieval + generation + post-processing together
Threshold tuning — Start with low thresholds (0.5), measure baseline, then raise gradually
CI/CD integration — Run deepeval test run in your CI pipeline; fail builds on quality regressions
Adversarial testing — Red team your LLM before production; focus on prompt injection and PII leaks
Version test sets — Track test cases in git; add new cases when you find production failures
Multiple metrics per test — Combine faithfulness + relevancy + toxicity for comprehensive coverage
Custom metrics for business — Standard metrics miss domain needs (brand voice, compliance, format)
Judge model selection — Use GPT-4o or Claude as judge; cheaper models tend to produce less reliable evaluations
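
The threshold-tuning guideline (start at 0.5, measure a baseline, raise gradually) can be sketched as a small calculation over baseline scores. The helper names, the 0.05 step, the 90% pass-rate target, and the scores are all assumptions chosen for illustration, and the sketch applies to "higher is better" metrics only.

```python
def pass_rate(scores: list[float], threshold: float) -> float:
    """Fraction of baseline scores that would pass at the given threshold."""
    if not scores:
        raise ValueError("need at least one baseline score")
    return sum(s >= threshold for s in scores) / len(scores)


def suggest_threshold(scores: list[float], start: float = 0.5,
                      step: float = 0.05, min_pass_rate: float = 0.9) -> float:
    """Raise the threshold from `start` while the baseline still passes often enough."""
    threshold = start
    while True:
        candidate = round(threshold + step, 2)
        if candidate > 1.0 or pass_rate(scores, candidate) < min_pass_rate:
            return threshold
        threshold = candidate


# Made-up baseline scores from one metric across ten test cases
baseline = [0.91, 0.84, 0.78, 0.88, 0.95, 0.73, 0.81, 0.90, 0.86, 0.79]
print(suggest_threshold(baseline))  # prints 0.75
```

With these numbers the threshold stops at 0.75 because moving to 0.80 would fail three of the ten baseline cases. Rerunning the calculation after each prompt or retrieval improvement gives a concrete basis for raising the bar.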