# SelfCheckGPT

SelfCheckGPT is a zero-resource, black-box hallucination detection framework for generative large language models. It works by sampling multiple responses from the same LLM and measuring consistency between the original passage and the sampled passages: facts that are inconsistent across samples indicate potential hallucinations. The package implements five variants: BERTScore, MQAG (Multiple-choice Question Answering and Generation), N-gram, NLI (Natural Language Inference), and LLM Prompting.

The core principle is that factual information tends to appear consistently across multiple stochastic samples from an LLM, while hallucinated content varies. Because the approach requires no external knowledge base or ground-truth data, it is applicable to any text generated by an LLM. SelfCheckGPT achieves state-of-the-art results on the WikiBio GPT-3 hallucination detection benchmark, with the LLM Prompt variant using GPT-3.5-turbo performing best.

## SelfCheckBERTScore - Hallucination Detection via Semantic Similarity

SelfCheckBERTScore measures hallucination by computing BERTScore between each sentence and the sampled passages. For each sentence, it finds the best-matching sentence in each sample and returns `1.0 - bertscore`, so higher scores indicate potential hallucinations (less semantic similarity with the samples).

```python
import torch
import spacy
from selfcheckgpt.modeling_selfcheck import SelfCheckBERTScore

# Initialize spacy for sentence tokenization
nlp = spacy.load("en_core_web_sm")

# Initialize SelfCheck-BERTScore
# rescale_with_baseline=True is recommended for meaningful score ranges
selfcheck_bertscore = SelfCheckBERTScore(
    default_model="en",  # uses roberta-large for English
    rescale_with_baseline=True
)

# Original LLM response to evaluate
passage = "Michael Alan Weiner (born March 31, 1942) is an American radio host. He is the host of The Savage Nation."
sentences = [sent.text.strip() for sent in nlp(passage).sents]
# Output: ['Michael Alan Weiner (born March 31, 1942) is an American radio host.',
#          'He is the host of The Savage Nation.']

# Multiple stochastic samples from the same LLM (different generations)
sampled_passages = [
    "Michael Alan Weiner (born March 31, 1942) is an American radio host. He is the host of The Savage Country.",
    "Michael Alan Weiner (born January 13, 1960) is a Canadian radio host. He works at The New York Times.",
    "Michael Alan Weiner (born March 31, 1942) is an American radio host. He obtained his PhD from MIT."
]

# Get sentence-level hallucination scores
# Score range: [0.0, 1.0], higher = more likely hallucination
sent_scores = selfcheck_bertscore.predict(
    sentences=sentences,
    sampled_passages=sampled_passages
)
print(f"Sentence scores: {sent_scores}")
# Output: [0.0695562 0.45590915]
# First sentence is factual (low score), second shows inconsistency (higher score)
```

## SelfCheckNLI - Hallucination Detection via Natural Language Inference

SelfCheckNLI uses a DeBERTa-v3-large model fine-tuned on Multi-NLI to detect contradictions between sentences and sampled passages. For each sentence-sample pair, it computes the probability of contradiction and averages across all samples. This is the recommended approach for the best accuracy-speed tradeoff.

```python
import torch
from selfcheckgpt.modeling_selfcheck import SelfCheckNLI

# Initialize with GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
selfcheck_nli = SelfCheckNLI(device=device)
# Uses potsawee/deberta-v3-large-mnli by default

# Sentences to evaluate
sentences = [
    "Michael Alan Weiner (born March 31, 1942) is an American radio host.",
    "He is the host of The Savage Nation."
]

# Sampled passages for consistency check
sampled_passages = [
    "Michael Alan Weiner (born March 31, 1942) is an American radio host. He is the host of The Savage Country.",
    "Michael Alan Weiner (born January 13, 1960) is a Canadian radio host. He works at The New York Times.",
    "Michael Alan Weiner (born March 31, 1942) is an American radio host. He obtained his PhD from MIT."
]

# Predict hallucination scores
# Returns P(contradiction | sentence, sample) averaged over samples
# Score range: [0.0, 1.0], higher = more likely hallucination
sent_scores = selfcheck_nli.predict(
    sentences=sentences,
    sampled_passages=sampled_passages
)
print(f"NLI scores: {sent_scores}")
# Output: [0.334014 0.975106]
# Second sentence shows high contradiction probability
```

## SelfCheckMQAG - Hallucination Detection via Question Answering

SelfCheckMQAG generates multiple-choice questions from each sentence, then checks whether the answers stay consistent when the original passage vs. the sampled passages is used as context. Inconsistent answers indicate potential hallucinations. Three scoring methods are available: counting, bayes, and bayes_with_alpha.

```python
import torch
from selfcheckgpt.modeling_selfcheck import SelfCheckMQAG

# Initialize (loads multiple models: G1, G2, Answering, Answerability)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
selfcheck_mqag = SelfCheckMQAG(device=device)

# Text to evaluate
passage = "Michael Alan Weiner (born March 31, 1942) is an American radio host. He is the host of The Savage Nation."
sentences = [
    "Michael Alan Weiner (born March 31, 1942) is an American radio host.",
    "He is the host of The Savage Nation."
]

# Sampled passages
sampled_passages = [
    "Michael Alan Weiner (born March 31, 1942) is an American radio host. He is the host of The Savage Country.",
    "Michael Alan Weiner (born January 13, 1960) is a Canadian radio host. He works at The New York Times.",
    "Michael Alan Weiner (born March 31, 1942) is an American radio host. He obtained his PhD from MIT."
]

# Predict with bayes_with_alpha scoring (recommended)
# Score range: [0.0, 1.0], higher = more likely hallucination
sent_scores = selfcheck_mqag.predict(
    sentences=sentences,
    passage=passage,
    sampled_passages=sampled_passages,
    num_questions_per_sent=5,           # questions generated per sentence
    scoring_method='bayes_with_alpha',  # options: 'counting', 'bayes', 'bayes_with_alpha'
    beta1=0.8, beta2=0.8
)
print(f"MQAG scores: {sent_scores}")
# Output: [0.30990949 0.42376232]

# Alternative: simple counting method (requires AT parameter)
sent_scores_counting = selfcheck_mqag.predict(
    sentences=sentences,
    passage=passage,
    sampled_passages=sampled_passages,
    num_questions_per_sent=5,
    scoring_method='counting',
    AT=0.5  # answerability threshold
)
```

## SelfCheckNgram - Hallucination Detection via N-gram Models

SelfCheckNgram builds an n-gram language model over the passage and sampled passages, then evaluates each sentence's likelihood under that model. Unlike the other methods, its scores are unbounded (higher = less likely under the model = more likely hallucination). It returns both sentence-level and document-level scores.

```python
from selfcheckgpt.modeling_selfcheck import SelfCheckNgram

# Initialize n-gram model
# n=1 for unigram, n=2 for bigram, etc.
selfcheck_unigram = SelfCheckNgram(n=1, lowercase=True)
selfcheck_bigram = SelfCheckNgram(n=2, lowercase=True)

passage = "Michael Alan Weiner (born March 31, 1942) is an American radio host. He is the host of The Savage Nation."
sentences = [
    "Michael Alan Weiner (born March 31, 1942) is an American radio host.",
    "He is the host of The Savage Nation."
]
sampled_passages = [
    "Michael Alan Weiner (born March 31, 1942) is an American radio host. He is the host of The Savage Country.",
    "Michael Alan Weiner (born January 13, 1960) is a Canadian radio host. He works at The New York Times.",
    "Michael Alan Weiner (born March 31, 1942) is an American radio host. He obtained his PhD from MIT."
]

# Get n-gram based scores
# Score range: [0.0, +inf), higher = more likely hallucination
ngram_scores = selfcheck_unigram.predict(
    sentences=sentences,
    passage=passage,
    sampled_passages=sampled_passages
)
print("Sentence-level scores:")
print(f"  avg_neg_logprob: {ngram_scores['sent_level']['avg_neg_logprob']}")
print(f"  max_neg_logprob: {ngram_scores['sent_level']['max_neg_logprob']}")
print("\nDocument-level scores:")
print(f"  avg_neg_logprob: {ngram_scores['doc_level']['avg_neg_logprob']}")
print(f"  avg_max_neg_logprob: {ngram_scores['doc_level']['avg_max_neg_logprob']}")
# Output:
# Sentence-level scores:
#   avg_neg_logprob: [3.184312, 3.279774]
#   max_neg_logprob: [3.476098, 4.574710]
# Document-level scores:
#   avg_neg_logprob: 3.218678904916201
#   avg_max_neg_logprob: 4.025404834169327
```

## SelfCheckLLMPrompt - Hallucination Detection via Open-Source LLM Prompting

SelfCheckLLMPrompt uses an open-source LLM (Llama 2, Mistral, etc.) to assess whether each sentence is supported by the sampled passages. The LLM is prompted to answer "Yes" or "No" for each sentence-sample pair, and the resulting scores are averaged. Custom prompt templates are supported.

```python
import torch
from selfcheckgpt.modeling_selfcheck import SelfCheckLLMPrompt

# Initialize with a HuggingFace model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
selfcheck_prompt = SelfCheckLLMPrompt(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    device=device
)

# Default prompt template:
# "Context: {context}\n\nSentence: {sentence}\n\nIs the sentence supported by the context above? Answer Yes or No.\n\nAnswer: "

# Optionally set a custom prompt template
custom_template = """Given the following context:
{context}

Evaluate this sentence: {sentence}

Does the context support this sentence? Respond with only Yes or No.

Answer: """
selfcheck_prompt.set_prompt_template(custom_template)

sentences = [
    "Michael Alan Weiner (born March 31, 1942) is an American radio host.",
    "He is the host of The Savage Nation."
]
sampled_passages = [
    "Michael Alan Weiner (born March 31, 1942) is an American radio host. He is the host of The Savage Country.",
    "Michael Alan Weiner (born January 13, 1960) is a Canadian radio host. He works at The New York Times.",
    "Michael Alan Weiner (born March 31, 1942) is an American radio host. He obtained his PhD from MIT."
]

# Predict with progress bar
# Score mapping: Yes -> 0.0, No -> 1.0, N/A -> 0.5
sent_scores = selfcheck_prompt.predict(
    sentences=sentences,
    sampled_passages=sampled_passages,
    verbose=True  # show progress bar
)
print(f"LLM Prompt scores: {sent_scores}")
# Output: [0.33333333, 0.66666667]
```

## SelfCheckAPIPrompt - Hallucination Detection via API-Based LLM Prompting

SelfCheckAPIPrompt uses API-based LLMs (OpenAI GPT, Groq) for the consistency check. This variant achieves the best benchmark performance, especially with GPT-3.5-turbo. It requires the appropriate API keys, set as environment variables or passed directly.

```python
from selfcheckgpt.modeling_selfcheck_apiprompt import SelfCheckAPIPrompt

# Option 1: OpenAI (requires OPENAI_API_KEY environment variable)
selfcheck_openai = SelfCheckAPIPrompt(
    client_type="openai",
    model="gpt-3.5-turbo"
)

# Option 2: Groq (pass API key directly)
selfcheck_groq = SelfCheckAPIPrompt(
    client_type="groq",
    model="llama3-70b-8192",
    api_key="your-groq-api-key"
)

sentences = [
    "Michael Alan Weiner (born March 31, 1942) is an American radio host.",
    "He is the host of The Savage Nation."
]
sampled_passages = [
    "Michael Alan Weiner (born March 31, 1942) is an American radio host. He is the host of The Savage Country.",
    "Michael Alan Weiner (born January 13, 1960) is a Canadian radio host. He works at The New York Times.",
    "Michael Alan Weiner (born March 31, 1942) is an American radio host. He obtained his PhD from MIT."
]

# Custom prompt template
selfcheck_openai.set_prompt_template(
    "Context: {context}\n\nSentence: {sentence}\n\nIs the sentence supported by the context above? Answer Yes or No.\n\nAnswer: "
)

# Predict hallucination scores
sent_scores = selfcheck_openai.predict(
    sentences=sentences,
    sampled_passages=sampled_passages,
    verbose=True
)
print(f"API Prompt scores: {sent_scores}")
# Score mapping: Yes -> 0.0, No -> 1.0, N/A -> 0.5
```

## MQAG - Multiple-Choice Question Answering and Generation

The MQAG class provides standalone multiple-choice question generation and answering. It can generate questions from text, answer them given a context, and compute MQAG scores comparing candidate vs. reference texts. It is useful for evaluating summaries or comparing generated content against source documents.

```python
import torch
from selfcheckgpt.modeling_mqag import MQAG

# Initialize MQAG
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
mqag = MQAG(
    g1_model_type='race',  # options: 'race' or 'squad'
    device=device
)

# Example text for question generation
context = """The Apollo 11 mission was the first crewed mission to land on the Moon.
Commander Neil Armstrong and lunar module pilot Buzz Aldrin landed the Apollo Lunar
Module Eagle on July 20, 1969. Armstrong became the first person to step onto the
lunar surface six hours later, and Aldrin joined him 19 minutes after that."""

# Generate multiple-choice questions
questions = mqag.generate(
    context=context,
    do_sample=True,   # True for sampling, False for beam search
    num_questions=3   # number of questions to generate
)
for i, q in enumerate(questions):
    print(f"Q{i+1}: {q['question']}")
    print(f"Options: {q['options']}")
    print()
# Output example:
# Q1: Who was the commander of the Apollo 11 mission?
# Options: ['Neil Armstrong', 'Buzz Aldrin', 'Michael Collins', 'John Glenn']

# Answer questions given a context
answer_probs = mqag.answer(
    questions=questions,
    context=context
)
print(f"Answer probabilities shape: {answer_probs.shape}")
# Output: (num_questions, 4) - probability distribution over 4 options

# MQAG Score: compare candidate text against reference
candidate = "Neil Armstrong was the first person to walk on the Moon during Apollo 11."
reference = context
distances = mqag.score(
    candidate=candidate,
    reference=reference,
    num_questions=5,
    verbose=True  # prints question-by-question analysis
)
print("\nMQAG Distances:")
print(f"  KL Divergence: {distances['kl_div']:.4f}")
print(f"  Counting: {distances['counting']:.4f}")
print(f"  Hellinger: {distances['hellinger']:.4f}")
print(f"  Total Variation: {distances['total_variation']:.4f}")
# Lower distances indicate better alignment between candidate and reference
```

## Loading the WikiBio GPT-3 Hallucination Dataset

The WikiBio GPT-3 hallucination dataset provides 238 annotated passages for benchmarking hallucination detection methods. Each passage includes GPT-3 generated text, sentence-level human annotations, and pre-generated stochastic samples for self-checking.
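For benchmarking, the three-way sentence annotations are usually collapsed into binary labels before computing a detection metric such as AUC-PR, with both inaccurate classes counted as hallucinations. A minimal sketch of that mapping (assuming scikit-learn is installed; `to_binary_labels` is an illustrative helper, not part of the package):

```python
from sklearn.metrics import average_precision_score

def to_binary_labels(annotations):
    """Map WikiBio annotations to binary hallucination labels:
    'accurate' -> 0; 'minor_inaccurate' and 'major_inaccurate' -> 1."""
    return [0 if a == "accurate" else 1 for a in annotations]

# Hypothetical sentence scores from any SelfCheckGPT variant (higher = more suspicious)
annotations = ["accurate", "major_inaccurate", "accurate", "minor_inaccurate"]
sent_scores = [0.10, 0.92, 0.25, 0.71]

labels = to_binary_labels(annotations)
auc_pr = average_precision_score(labels, sent_scores)
print(f"AUC-PR (non-factual): {auc_pr:.3f}")
# Output: AUC-PR (non-factual): 1.000  (both hallucinated sentences ranked above both accurate ones)
```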
```python
# Option 1: Load via HuggingFace datasets
from datasets import load_dataset

dataset = load_dataset("potsawee/wiki_bio_gpt3_hallucination")
example = dataset['evaluation'][0]

print("GPT-3 generated text:")
print(example['gpt3_text'])
print("\nSentences:")
for i, sent in enumerate(example['gpt3_sentences']):
    annotation = example['annotation'][i]
    print(f"  [{annotation}] {sent}")
    # annotation: 'accurate', 'minor_inaccurate', 'major_inaccurate'

print(f"\nNumber of sampled passages: {len(example['gpt3_text_samples'])}")

# Option 2: Load from JSON file (manual download)
import json
with open("dataset.json", "r") as f:
    dataset = json.loads(f.read())

# Dataset structure:
# - gpt3_text: GPT-3 generated passage
# - wiki_bio_text: original Wikipedia passage (ground truth)
# - gpt3_sentences: gpt3_text split into sentences
# - annotation: human labels per sentence
# - wiki_bio_test_idx: ID from the wikibio dataset
# - gpt3_text_samples: list of sampled passages for self-checking
```

## Complete Hallucination Detection Pipeline

This example demonstrates a complete pipeline for detecting hallucinations in LLM-generated text, combining scores from multiple SelfCheckGPT variants for more robust detection.

```python
import torch
import spacy
from selfcheckgpt.modeling_selfcheck import (
    SelfCheckNLI,
    SelfCheckBERTScore,
    SelfCheckNgram
)

# Setup
torch.manual_seed(42)  # for reproducibility
nlp = spacy.load("en_core_web_sm")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Initialize multiple checkers
selfcheck_nli = SelfCheckNLI(device=device)
selfcheck_bertscore = SelfCheckBERTScore(rescale_with_baseline=True)
selfcheck_ngram = SelfCheckNgram(n=1)

def detect_hallucinations(passage: str, sampled_passages: list, threshold: float = 0.5):
    """
    Comprehensive hallucination detection using multiple methods.

    Args:
        passage: Original LLM-generated text
        sampled_passages: List of stochastic samples from the same LLM
        threshold: Score threshold for flagging hallucinations

    Returns:
        Dictionary with sentence-level analysis
    """
    # Tokenize into sentences
    sentences = [sent.text.strip() for sent in nlp(passage).sents]

    # Get scores from multiple methods
    scores_nli = selfcheck_nli.predict(sentences, sampled_passages)
    scores_bert = selfcheck_bertscore.predict(sentences, sampled_passages)
    scores_ngram = selfcheck_ngram.predict(sentences, passage, sampled_passages)

    # Analyze each sentence
    results = []
    for i, sent in enumerate(sentences):
        nli_score = scores_nli[i]
        bert_score = scores_bert[i]
        ngram_score = scores_ngram['sent_level']['avg_neg_logprob'][i]

        # Normalize the ngram score to the [0, 1] range (approximate)
        ngram_normalized = min(ngram_score / 5.0, 1.0)

        # Ensemble score (weighted average)
        ensemble = 0.5 * nli_score + 0.3 * bert_score + 0.2 * ngram_normalized

        results.append({
            'sentence': sent,
            'nli_score': float(nli_score),
            'bertscore': float(bert_score),
            'ngram_score': float(ngram_score),
            'ensemble_score': float(ensemble),
            'is_hallucination': ensemble > threshold
        })

    return {
        'sentences': results,
        'document_ngram_score': scores_ngram['doc_level']['avg_neg_logprob']
    }

# Example usage
passage = """Albert Einstein was born in 1879 in Germany. He developed the theory of relativity.
Einstein won the Nobel Prize in Chemistry in 1921. He later moved to the United States."""

samples = [
    "Albert Einstein was born in 1879 in Ulm, Germany. He is famous for the theory of relativity. Einstein won the Nobel Prize in Physics in 1921.",
    "Albert Einstein, born 1879, was a German physicist. His theory of relativity revolutionized physics. He received the Nobel Prize for his work on the photoelectric effect.",
    "Einstein was born in Germany in 1879. He created the special and general theories of relativity. He won the 1921 Nobel Prize in Physics."
]

results = detect_hallucinations(passage, samples)

print("Hallucination Analysis:")
print("=" * 60)
for r in results['sentences']:
    status = "HALLUCINATION" if r['is_hallucination'] else "OK"
    print(f"\n[{status}] {r['sentence']}")
    print(f"  NLI: {r['nli_score']:.3f}, BERT: {r['bertscore']:.3f}, "
          f"Ngram: {r['ngram_score']:.3f}, Ensemble: {r['ensemble_score']:.3f}")
# The "Nobel Prize in Chemistry" sentence is flagged as a hallucination
# (it should be Physics, which is inconsistent with the samples)
```

## Summary

SelfCheckGPT provides a versatile toolkit for detecting hallucinations in LLM-generated text without requiring external knowledge bases. The primary use cases include: (1) quality assurance for AI-generated content by identifying potentially fabricated information, (2) building guardrails for production LLM applications that need factual accuracy, (3) research benchmarking for developing new hallucination detection methods, and (4) evaluating text summarization and question-answering systems through the MQAG framework.

Integration typically involves generating multiple stochastic samples from the target LLM (more samples generally improve detection at higher compute cost), then applying one or more SelfCheckGPT variants. For production deployments, SelfCheckNLI offers the best accuracy-speed tradeoff for local inference, while SelfCheckAPIPrompt with GPT-3.5-turbo achieves the highest accuracy when API costs are acceptable. An ensemble combining multiple methods provides the most robust detection. All methods return sentence-level scores, enabling fine-grained identification of which specific claims may be hallucinated.
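The score mapping used by the prompt-based variants (Yes -> 0.0, No -> 1.0, anything else -> 0.5, averaged over samples) can be sketched in plain Python. `aggregate_prompt_answers` is an illustrative helper, not part of the package:

```python
def aggregate_prompt_answers(answers_per_sample):
    """Map per-sample Yes/No judgements for one sentence to a hallucination score.
    'Yes' (supported) -> 0.0, 'No' (unsupported) -> 1.0, anything else -> 0.5;
    the final score is the mean over samples."""
    mapping = {"yes": 0.0, "no": 1.0}
    scores = [mapping.get(a.strip().lower(), 0.5) for a in answers_per_sample]
    return sum(scores) / len(scores)

# One sentence checked against three sampled passages:
print(aggregate_prompt_answers(["Yes", "No", "Yes"]))  # 1/3 ~ 0.33: mostly supported
print(aggregate_prompt_answers(["No", "No", "N/A"]))   # 2.5/3 ~ 0.83: likely hallucinated
```

This also makes the example outputs above interpretable: with three samples, the only possible per-sentence scores are multiples of 1/6, e.g. `0.33333333` means one "No" out of three.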