ragas

Expert guidance for Ragas, a framework for evaluating Retrieval-Augmented Generation (RAG) pipelines. This Skill helps developers measure and improve the quality of their RAG systems across retrieval accuracy, answer faithfulness, and response relevance.

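⚡ Recommended: install with a single command (about 60 seconds)

Copy the command below and paste it into a terminal (Mac/Linux) or PowerShell (Windows). It downloads the archive, extracts it, and places the files automatically.

🍎 Mac / 🐧 Linux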

mkdir -p ~/.claude/skills && cd ~/.claude/skills && curl -L -o ragas.zip https://jpskill.com/download/15313.zip && unzip -o ragas.zip && rm ragas.zip

🪟 Windows (PowerShell)

$d = "$env:USERPROFILE\.claude\skills"; ni -Force -ItemType Directory $d | Out-Null; iwr https://jpskill.com/download/15313.zip -OutFile "$d\ragas.zip"; Expand-Archive "$d\ragas.zip" -DestinationPath $d -Force; ri "$d\ragas.zip"

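When it finishes, restart Claude Code. The Skill then activates automatically when you make a related request in plain language, for example "Evaluate my RAG pipeline with Ragas".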

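💾 Manual download (if you prefer not to use the command line)

1. Download ragas.zip using the .zip download button on the page
2. Double-click the ZIP file to extract it; this creates a ragas folder
3. Move that folder to C:\Users\YourName\.claude\skills\ (Windows) or ~/.claude/skills/ (Mac)
4. Restart Claude Code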


⚠️ Download and use at your own risk. This site accepts no responsibility for the content, behavior, or safety of the Skill.

🎯 What this Skill can do

The description below explains what this Skill does for you. It activates automatically when you ask Claude for help in this area.

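📦 Installation (3 steps)

1. Use the download button above to get the .skill file
2. Change the file extension from .skill to .zip and extract it (macOS can extract it automatically)
3. Place the extracted folder in .claude/skills/ under your home folder
- macOS / Linux: ~/.claude/skills/
- Windows: %USERPROFILE%\.claude\skills\

Restart Claude Code to finish. You do not need to say "use this Skill"; it is invoked automatically for related requests.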
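
A quick way to confirm the files landed in the right place is to look for the Skill folder under the skills directory. The sketch below assumes the extracted folder is named ragas and that the single bundled file is SKILL.md; both names are assumptions, so adjust them if your archive differs.

```python
from pathlib import Path


def skill_installed(skills_dir: Path, name: str = "ragas", marker: str = "SKILL.md") -> bool:
    """Return True if <skills_dir>/<name>/<marker> exists.

    The folder name and marker file name are assumptions about the archive layout.
    """
    return (Path(skills_dir) / name / marker).is_file()


if __name__ == "__main__":
    # Path.home() resolves to the home folder on macOS, Linux, and Windows.
    default_dir = Path.home() / ".claude" / "skills"
    print("installed" if skill_installed(default_dir) else "not found")
```

If this prints "not found", check whether the archive extracted into a nested folder or under a different name.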


Last updated: 2026-05-18
Retrieved: 2026-05-18
Bundled files: 1

📜 Original SKILL.md (the source text Claude reads)

Ragas — RAG Evaluation Framework

Overview

Ragas is a framework for evaluating Retrieval-Augmented Generation (RAG) pipelines. It helps developers measure and improve the quality of their RAG systems across retrieval accuracy, answer faithfulness, and response relevance.

Instructions

Basic Evaluation

Evaluate a RAG pipeline with standard metrics:

# evaluate_rag.py — Run Ragas evaluation on a RAG pipeline
from ragas import evaluate
from ragas.metrics import (
    faithfulness,          # Is the answer grounded in retrieved context?
    answer_relevancy,      # Does the answer address the question?
    context_precision,     # Are retrieved docs relevant and well-ranked?
    context_recall,        # Did retrieval find all necessary information?
)
from datasets import Dataset

# Prepare evaluation dataset — each row is one question with ground truth
eval_data = {
    "question": [
        "What is the refund policy for annual subscriptions?",
        "How do I reset my password?",
        "What integrations are available with Slack?",
    ],
    "answer": [
        "Annual subscriptions can be refunded within 30 days of purchase.",
        "Click 'Forgot Password' on the login page and follow the email link.",
        "We offer native Slack integration with channel notifications and slash commands.",
    ],
    "contexts": [
        # Retrieved documents for each question
        [
            "Refund Policy: Annual plans are eligible for a full refund within 30 days. Monthly plans are non-refundable.",
            "Billing FAQ: Contact support@example.com for billing inquiries.",
        ],
        [
            "Password Reset: Navigate to login page, click 'Forgot Password', enter your email.",
            "Security: All password reset links expire after 24 hours.",
        ],
        [
            "Integrations: Connect with Slack, Teams, and Discord. Slack supports notifications and /commands.",
            "API: Use webhooks for custom integrations with any platform.",
        ],
    ],
    "ground_truth": [
        "Annual subscriptions are eligible for a full refund within 30 days of purchase. Monthly plans cannot be refunded.",
        "Go to the login page, click 'Forgot Password', enter your email address, and follow the reset link sent to your inbox.",
        "Slack integration includes channel notifications, slash commands, and is available as a native integration.",
    ],
}

dataset = Dataset.from_dict(eval_data)

# Run evaluation across all metrics
results = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)

# Results are a dict of metric_name → score (0.0 to 1.0)
print(results)
# {'faithfulness': 0.95, 'answer_relevancy': 0.88, 'context_precision': 0.92, 'context_recall': 0.85}

# Convert to pandas DataFrame for per-question analysis
df = results.to_pandas()
print(df[['question', 'faithfulness', 'answer_relevancy']].to_string())
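
To make the four scores less of a black box, the sketch below computes simplified versions of faithfulness, context recall, and context precision from hand-labeled verdicts. In Ragas the verdicts come from an LLM judge; the function names and the boolean inputs here are illustrative assumptions, not the library's API.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of claims in the answer that the retrieved context supports."""
    return sum(claim_supported) / len(claim_supported) if claim_supported else 0.0


def context_recall_score(truth_claim_found: list[bool]) -> float:
    """Fraction of ground-truth claims that can be attributed to the retrieved context."""
    return sum(truth_claim_found) / len(truth_claim_found) if truth_claim_found else 0.0


def context_precision_score(chunk_relevant: list[bool]) -> float:
    """Mean of precision@k over the ranks k that hold a relevant chunk.

    Relevant chunks ranked near the top raise the score; relevant chunks
    buried under irrelevant ones lower it.
    """
    hits = 0
    precisions = []
    for k, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Refund question from the example above, labeled by hand:
# the answer makes one claim (refundable within 30 days) and the context supports it.
faith = faithfulness_score([True])
# The ground truth has two claims (30-day refund, monthly plans not refundable); both are in the context.
recall = context_recall_score([True, True])
# Chunk 1 (refund policy) is relevant, chunk 2 (billing FAQ) is not.
precision_good_rank = context_precision_score([True, False])
# Same chunks in the opposite order: the relevant one is ranked second.
precision_bad_rank = context_precision_score([False, True])
print(faith, recall, precision_good_rank, precision_bad_rank)
```

The last two values differ only because of ranking, which is why context precision is sensitive to the order in which the retriever returns its chunks.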

Custom Test Sets with Synthetic Data

Generate evaluation datasets from your own documents:

# generate_testset.py — Create synthetic Q&A pairs from documents
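# Note: ragas.testset.evolutions and TestsetGenerator.from_langchain belong to the
# Ragas 0.1.x API; later releases changed the testset generator interface.
from ragas.testset import TestsetGenerator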
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.document_loaders import DirectoryLoader

# Load your knowledge base documents
loader = DirectoryLoader("./docs/", glob="**/*.md")
documents = loader.load()

# Configure the generator with an LLM
generator = TestsetGenerator.from_langchain(
    generator_llm=ChatOpenAI(model="gpt-4o"),
    critic_llm=ChatOpenAI(model="gpt-4o"),
    embeddings=OpenAIEmbeddings(),
)

# Generate test set with different question complexities
# simple: straightforward factual questions
# reasoning: questions requiring inference across a single document
# multi_context: questions needing information from multiple documents
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=50,                      # Generate 50 Q&A pairs
    distributions={
        simple: 0.4,                   # 40% simple factual
        reasoning: 0.3,               # 30% reasoning required
        multi_context: 0.3,           # 30% multi-document synthesis
    },
)

# Export for reuse across evaluation runs
test_df = testset.to_pandas()
test_df.to_csv("eval_testset.csv", index=False)
print(f"Generated {len(test_df)} test questions")
print(f"Distribution: {test_df['evolution_type'].value_counts().to_dict()}")
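
The distributions argument is easiest to reason about as a split of test_size. The helper below is a hypothetical illustration of that arithmetic, not part of Ragas, and the generator's own sampling may not hit these counts exactly.

```python
import math


def allocate(test_size: int, distributions: dict[str, float]) -> dict[str, int]:
    """Split test_size across question types using the largest-remainder method."""
    if not math.isclose(sum(distributions.values()), 1.0):
        raise ValueError("distribution weights must sum to 1.0")
    exact = {name: test_size * weight for name, weight in distributions.items()}
    counts = {name: math.floor(value) for name, value in exact.items()}
    leftover = test_size - sum(counts.values())
    # Hand out the remaining questions to the types with the largest fractional part.
    by_remainder = sorted(exact, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


print(allocate(50, {"simple": 0.4, "reasoning": 0.3, "multi_context": 0.3}))
```

Comparing these target counts with the evolution_type value counts printed above shows how closely a generated test set follows the requested mix.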

Evaluating Specific Components

Isolate and measure individual RAG components:

# eval_retriever.py — Evaluate retrieval quality independently
from ragas.metrics import (
    context_precision,     # Are top results relevant? (ranking quality)
    context_recall,        # Are all relevant docs retrieved? (coverage)
    context_entity_recall, # Are key entities from ground truth in context?
)
from ragas import evaluate
from datasets import Dataset

def evaluate_retriever(retriever, test_questions, ground_truths):
    """Evaluate retriever independently from the generator.

    Args:
        retriever: Your retrieval function that takes a query and returns docs
        test_questions: List of test queries
        ground_truths: List of expected answers for recall measurement
    """
    contexts = []
    for question in test_questions:
        # Assumes the retriever returns document objects that expose .page_content
        docs = retriever.retrieve(question, top_k=5)
        contexts.append([doc.page_content for doc in docs])

    dataset = Dataset.from_dict({
        "question": test_questions,
        "contexts": contexts,
        "ground_truth": ground_truths,
    })

    results = evaluate(
        dataset=dataset,
        metrics=[context_precision, context_recall, context_entity_recall],
    )

    return results


# eval_generator.py — Evaluate answer generation quality
from ragas import evaluate
from datasets import Dataset
from ragas.metrics import (
    faithfulness,          # Hallucination detection
    answer_relevancy,      # Does it answer the question?
    answer_similarity,     # Semantic similarity to ground truth
    answer_correctness,    # Factual correctness vs ground truth
)

def evaluate_generator(rag_pipeline, test_questions, contexts, ground_truths):
    """Evaluate the generation component with fixed retrieval context.

    Args:
        rag_pipeline: Your generation function
        test_questions: List of test queries
        contexts: Pre-retrieved contexts (fixed to isolate generator)
        ground_truths: Expected correct answers
    """
    answers = []
    for question, ctx in zip(test_questions, contexts):
        answer = rag_pipeline.generate(question, context=ctx)
        answers.append(answer)

    dataset = Dataset.from_dict({
        "question": test_questions,
        "answer": answers,
        "contexts": contexts,
        "ground_truth": ground_truths,
    })

    return evaluate(
        dataset=dataset,
        metrics=[faithfulness, answer_relevancy, answer_similarity, answer_correctness],
    )
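
Both helpers assume a particular interface: retriever.retrieve(question, top_k=...) returning objects with a page_content attribute, and rag_pipeline.generate(question, context=...) returning a string. The stand-ins below are hypothetical and only show how to adapt your own components to that shape and what the resulting columns look like; no Ragas call is made.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Minimal document object exposing the page_content attribute."""
    page_content: str


class KeywordRetriever:
    """Toy retriever: ranks documents by how many query words they contain."""

    def __init__(self, corpus: list[str]):
        self.corpus = corpus

    def retrieve(self, question: str, top_k: int = 5) -> list[Doc]:
        words = set(question.lower().split())
        ranked = sorted(
            self.corpus,
            key=lambda text: len(words & set(text.lower().split())),
            reverse=True,
        )
        return [Doc(text) for text in ranked[:top_k]]


class EchoPipeline:
    """Toy generator: answers with the first context chunk."""

    def generate(self, question: str, context: list[str]) -> str:
        return context[0] if context else "I don't know."


corpus = [
    "Password reset links expire after 24 hours.",
    "Annual plans are eligible for a full refund within 30 days.",
]
questions = ["Are annual plans eligible for a refund?"]

retriever, pipeline = KeywordRetriever(corpus), EchoPipeline()
contexts = [[d.page_content for d in retriever.retrieve(q, top_k=1)] for q in questions]
answers = [pipeline.generate(q, context=c) for q, c in zip(questions, contexts)]

# Column-oriented rows in the layout the evaluation helpers pass to Dataset.from_dict.
rows = {"question": questions, "answer": answers, "contexts": contexts}
print(rows)
```

Wrapping a real retriever or generator in a thin adapter like this keeps the evaluation helpers unchanged when the underlying components are swapped.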

CI Integration

Run Ragas evaluations in your CI pipeline:

# tests/test_rag_quality.py — Pytest integration for continuous RAG evaluation
import pytest
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from datasets import Dataset
import json

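# Minimum acceptable score per metric; a test fails when its metric falls below the threshold.
# The rag_pipeline and test_dataset fixtures are assumed to be defined elsewhere (for example in
# conftest.py), with test_dataset loading the test set created by generate_testset.py.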
QUALITY_THRESHOLDS = {
    "faithfulness": 0.85,          # Minimum acceptable faithfulness
    "answer_relevancy": 0.80,     # Minimum answer relevance
    "context_precision": 0.75,    # Minimum retrieval precision
}

@pytest.fixture(scope="session")
def eval_results(rag_pipeline, test_dataset):
    """Run evaluation once per test session and share results."""
    questions, ground_truths = test_dataset

    # Run the full RAG pipeline on test questions
    answers, contexts = [], []
    for q in questions:
        result = rag_pipeline.query(q)
        answers.append(result["answer"])
        contexts.append(result["sources"])

    dataset = Dataset.from_dict({
        "question": questions,
        "answer": answers,
        "contexts": contexts,
        "ground_truth": ground_truths,
    })

    results = evaluate(
        dataset=dataset,
        metrics=[faithfulness, answer_relevancy, context_precision],
    )

    # Save results for reporting
    with open("rag_eval_results.json", "w") as f:
        json.dump(dict(results), f, indent=2)

    return results


def test_faithfulness_above_threshold(eval_results):
    """Ensure answers are grounded in retrieved context (no hallucinations)."""
    score = eval_results["faithfulness"]
    assert score >= QUALITY_THRESHOLDS["faithfulness"], (
        f"Faithfulness {score:.2f} below threshold {QUALITY_THRESHOLDS['faithfulness']}"
    )


def test_answer_relevancy_above_threshold(eval_results):
    """Ensure answers actually address the questions asked."""
    score = eval_results["answer_relevancy"]
    assert score >= QUALITY_THRESHOLDS["answer_relevancy"], (
        f"Answer relevancy {score:.2f} below threshold {QUALITY_THRESHOLDS['answer_relevancy']}"
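    )


def test_context_precision_above_threshold(eval_results):
    """Ensure retrieved contexts are relevant and well ranked."""
    score = eval_results["context_precision"]
    assert score >= QUALITY_THRESHOLDS["context_precision"], (
        f"Context precision {score:.2f} below threshold {QUALITY_THRESHOLDS['context_precision']}"
    )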


def test_no_regression(eval_results):
    """Compare against previous run to catch regressions."""
    try:
        with open("rag_eval_baseline.json") as f:
            baseline = json.load(f)
    except FileNotFoundError:
        pytest.skip("No baseline found — first run")

    for metric, score in eval_results.items():
        if metric in baseline:
            regression = baseline[metric] - score
            assert regression < 0.05, (   # Allow a drop of less than 0.05 in absolute score
                f"{metric} regressed by {regression:.2f} "
                f"(baseline: {baseline[metric]:.2f}, current: {score:.2f})"
            )
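
test_no_regression reads rag_eval_baseline.json, but nothing above writes it. One possible approach, sketched here with a hypothetical promote_baseline helper, is to copy the latest results over the baseline only when no metric has dropped by the tolerance or more. The file names follow the test module; the promotion policy itself is an assumption.

```python
import json
from pathlib import Path


def find_regressions(baseline: dict, current: dict, tolerance: float = 0.05) -> dict:
    """Return {metric: drop} for metrics that fell by at least `tolerance`."""
    return {
        metric: baseline[metric] - score
        for metric, score in current.items()
        if metric in baseline and baseline[metric] - score >= tolerance
    }


def promote_baseline(results_path: Path, baseline_path: Path, tolerance: float = 0.05) -> dict:
    """Overwrite the baseline with the latest results unless something regressed."""
    current = json.loads(Path(results_path).read_text())
    baseline_file = Path(baseline_path)
    baseline = json.loads(baseline_file.read_text()) if baseline_file.exists() else {}
    regressions = find_regressions(baseline, current, tolerance)
    if not regressions:
        baseline_file.write_text(json.dumps(current, indent=2))
    return regressions
```

Running promote_baseline(Path("rag_eval_results.json"), Path("rag_eval_baseline.json")) as a final CI step on the main branch would keep the baseline in step with accepted changes.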

Custom Metrics

Define domain-specific evaluation criteria:

# custom_metrics.py — Create metrics specific to your use case
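from dataclasses import dataclass

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
from ragas.metrics.base import MetricWithLLM

# Note: the MetricWithLLM interface (including the _ascore signature) differs
# between Ragas releases; check it against the version you have installed.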

@dataclass
class ToneConsistency(MetricWithLLM):
    """Evaluate if the answer maintains the expected brand tone.

    Useful for customer-facing RAG applications where tone matters
    as much as factual accuracy.
    """
    name: str = "tone_consistency"
    expected_tone: str = "professional and empathetic"

    async def _ascore(self, row, callbacks=None):
        prompt = f"""Rate how well this answer maintains a {self.expected_tone} tone.

        Question: {row['question']}
        Answer: {row['answer']}

        Score from 0.0 (completely wrong tone) to 1.0 (perfect tone).
        Return ONLY the numeric score."""

        result = await self.llm.agenerate_text(prompt)
        try:
            return float(result.generations[0][0].text.strip())
        except ValueError:
            return 0.0


# Use the custom metric alongside standard ones (dataset is the Dataset built in evaluate_rag.py)
tone_metric = ToneConsistency(expected_tone="friendly and technical")
results = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, tone_metric],
)
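
The float(...) call in _ascore only succeeds when the model returns a bare number. A more forgiving parser, sketched below as a hypothetical helper, pulls the first number out of the reply and clamps it to the 0.0 to 1.0 range. Falling back to 0.0 mirrors the original except branch; returning NaN instead would be another reasonable choice.

```python
import re

# Matches forms such as "0.8", "1", "1." and ".5", with an optional leading minus sign.
_NUMBER = re.compile(r"-?(?:\d+\.?\d*|\.\d+)")


def parse_score(text: str, default: float = 0.0) -> float:
    """Extract the first number from an LLM reply and clamp it to [0.0, 1.0]."""
    match = _NUMBER.search(text)
    if match is None:
        return default
    return min(1.0, max(0.0, float(match.group())))


print(parse_score("0.8"))                 # bare number
print(parse_score("Score: 0.75 (good)"))  # number wrapped in prose
print(parse_score("no idea"))             # no number at all, falls back to the default
print(parse_score("7"))                   # out of range, clamped to the upper bound
```

Taking the first number is a simplification: a reply such as "on a scale of 0 to 1, I rate this 0.8" would be read as 0, so a strict output format in the prompt still matters.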

Installation

pip install ragas

# With LangChain integration for testset generation
# (quoted so that shells such as zsh do not try to expand the brackets)
pip install "ragas[langchain]"

# With all optional dependencies
pip install "ragas[all]"
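
Some of the import paths used in this document are version-dependent, so it helps to know which Ragas release pip actually installed. This sketch uses only the standard library; the helper name is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print("ragas:", installed_version("ragas") or "not installed")
```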

Examples

Example 1: Setting up an evaluation pipeline for a RAG application

User request:

I have a RAG chatbot that answers questions from our docs. Set up Ragas to evaluate answer quality.

The agent creates an evaluation suite with appropriate metrics (faithfulness, relevance, answer correctness), configures test datasets from real user questions, runs baseline evaluations, and sets up CI integration so evaluations run on every prompt or retrieval change.
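
For the test datasets built from real user questions mentioned in Example 1, one plausible preparation step is to deduplicate logged questions and keep only those with a reviewed reference answer. The record fields (question, reference) and the helper name are assumptions about your logging format.

```python
def build_eval_columns(records: list[dict]) -> dict[str, list[str]]:
    """Turn logged Q&A records into the question / ground_truth columns used above.

    Skips records without a reference answer and drops repeated questions
    (compared case-insensitively, ignoring surrounding whitespace).
    """
    seen = set()
    questions, ground_truths = [], []
    for record in records:
        question = (record.get("question") or "").strip()
        reference = (record.get("reference") or "").strip()
        key = question.lower()
        if not question or not reference or key in seen:
            continue
        seen.add(key)
        questions.append(question)
        ground_truths.append(reference)
    return {"question": questions, "ground_truth": ground_truths}


logs = [
    {"question": "How do I reset my password?", "reference": "Use 'Forgot Password' on the login page."},
    {"question": "how do I reset my password? ", "reference": "Duplicate entry."},
    {"question": "Is there a Teams integration?", "reference": ""},
]
print(build_eval_columns(logs))
```

Only the first record survives here: the second is a duplicate of it and the third has no reference answer yet.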

Example 2: Comparing model performance across prompts

User request:

We're testing GPT-4o vs Claude on our customer support prompts. Set up a comparison with Ragas.

The agent creates a structured experiment with the existing prompt set, configures both model providers, defines scoring criteria specific to customer support (accuracy, tone, completeness), runs the comparison, and generates a summary report with statistical significance indicators.
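
One way to produce the statistical significance indicators mentioned in Example 2 is an exact paired sign-flip permutation test on per-question scores. This is a sketch of one standard technique, not necessarily what Ragas or the agent does; the score lists are made up for illustration, and the exhaustive enumeration is only practical for small question sets (roughly 20 or fewer).

```python
from itertools import product


def paired_permutation_p(scores_a: list[float], scores_b: list[float]) -> float:
    """Two-sided exact sign-flip test on paired per-question score differences.

    Under the null hypothesis that both models are equally good, each
    difference is equally likely to be positive or negative.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must be scored on the same questions")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return extreme / total


# Illustrative per-question scores for two models on the same 8 questions.
model_a = [0.9, 0.8, 0.95, 0.7, 0.85, 0.9, 0.8, 0.75]
model_b = [0.8, 0.7, 0.80, 0.6, 0.70, 0.8, 0.7, 0.60]
p_value = paired_permutation_p(model_a, model_b)
print(f"mean difference: {sum(model_a) / 8 - sum(model_b) / 8:.3f}, p = {p_value:.4f}")
```

A paired test is used because both models answer the same questions, so per-question difficulty cancels out in the differences.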

Guidelines

Evaluate before optimizing — Establish baseline scores before changing retrieval or generation parameters
Test set diversity — Include simple, reasoning, and multi-context questions; real user queries are best
Component isolation — Evaluate retriever and generator separately to identify which part needs improvement
Track over time — Store results in CI; catch regressions before they reach production
Custom metrics for domain — Standard metrics miss domain-specific quality requirements (tone, compliance, format)
Sufficient test size — Use 50+ questions for stable metrics; small sets produce noisy scores
Ground truth quality — Evaluation is only as good as your reference answers; invest in accurate ground truths
Multiple LLM judges — Cross-validate with different judge models to reduce evaluation bias
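
The guideline on test size can be made concrete with the standard error of the mean: for the same spread of per-question scores, the uncertainty of the average shrinks with the square root of the number of questions. The scores below are synthetic and only illustrate the arithmetic.

```python
import math
import statistics


def mean_with_standard_error(scores: list[float]) -> tuple[float, float]:
    """Return (mean, standard error of the mean) for per-question scores."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to estimate the spread")
    return statistics.mean(scores), statistics.stdev(scores) / math.sqrt(len(scores))


# Same repeating pattern of scores, evaluated on 10 and on 50 questions.
pattern = [1.0, 0.5, 1.0, 0.75, 0.5]
for n in (10, 50):
    scores = [pattern[i % len(pattern)] for i in range(n)]
    mean, se = mean_with_standard_error(scores)
    # A rough 95% interval is the mean plus or minus 2 standard errors.
    print(f"n={n}: mean={mean:.2f}, standard error={se:.3f}")
```

With the smaller set, a change of a few hundredths between two runs sits well inside the noise, which is why small test sets make regressions hard to distinguish from chance.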