# SelfCheckGPT

SelfCheckGPT is a zero-resource, black-box hallucination detection framework for generative large language models. It works by sampling multiple responses from the same LLM and measuring consistency between the original passage and the sampled passages: facts that are inconsistent across samples indicate potential hallucinations. The package implements five variants: BERTScore, MQAG (Multiple-choice Question Answering and Generation), N-gram, NLI (Natural Language Inference), and LLM Prompting.

The core principle is that factual information tends to appear consistently across multiple stochastic samples from an LLM, while hallucinated content varies. Because the approach requires no external knowledge base or ground-truth data, it is applicable to any text generated by an LLM. SelfCheckGPT achieves state-of-the-art results on the WikiBio GPT-3 hallucination detection benchmark, with the LLM Prompt variant using GPT-3.5-turbo performing best.

## SelfCheckBERTScore - Hallucination Detection via Semantic Similarity

SelfCheckBERTScore measures hallucination by computing BERTScore between each sentence and the sampled passages. For each sentence, it finds the best-matching sentence in each sample and returns `1.0 - bertscore`, so higher scores indicate potential hallucinations (less semantic similarity with the samples).

```python
import torch
import spacy
from selfcheckgpt.modeling_selfcheck import SelfCheckBERTScore

# Initialize spacy for sentence tokenization
nlp = spacy.load("en_core_web_sm")

# Initialize SelfCheck-BERTScore
# rescale_with_baseline=True is recommended for meaningful score ranges
selfcheck_bertscore = SelfCheckBERTScore(
    default_model="en",  # uses roberta-large for English
    rescale_with_baseline=True
)

# Original LLM response to evaluate
passage = "Michael Alan Weiner (born March 31, 1942) is an American radio host. He is the host of The Savage Nation."
sentences = [sent.text.strip() for sent in nlp(passage).sents]
# Output: ['Michael Alan Weiner (born March 31, 1942) is an American radio host.',
#          'He is the host of The Savage Nation.']

# Multiple stochastic samples from the same LLM (different generations)
sampled_passages = [
    "Michael Alan Weiner (born March 31, 1942) is an American radio host. He is the host of The Savage Country.",
    "Michael Alan Weiner (born January 13, 1960) is a Canadian radio host. He works at The New York Times.",
    "Michael Alan Weiner (born March 31, 1942) is an American radio host. He obtained his PhD from MIT."
]

# Get sentence-level hallucination scores
# Score range: [0.0, 1.0], higher = more likely hallucination
sent_scores = selfcheck_bertscore.predict(
    sentences=sentences,
    sampled_passages=sampled_passages
)
print(f"Sentence scores: {sent_scores}")
# Output: [0.0695562 0.45590915]
# First sentence is factual (low score), second shows inconsistency (higher score)
```

## SelfCheckNLI - Hallucination Detection via Natural Language Inference

SelfCheckNLI uses a DeBERTa-v3-large model fine-tuned on Multi-NLI to detect contradictions between sentences and sampled passages. For each sentence-sample pair, it computes the probability of contradiction and averages across all samples. This is the recommended approach for the best accuracy-speed tradeoff.

```python
import torch
from selfcheckgpt.modeling_selfcheck import SelfCheckNLI

# Initialize with GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
selfcheck_nli = SelfCheckNLI(device=device)
# Uses potsawee/deberta-v3-large-mnli by default

# Sentences to evaluate
sentences = [
    "Michael Alan Weiner (born March 31, 1942) is an American radio host.",
    "He is the host of The Savage Nation."
]

# Sampled passages for consistency check
sampled_passages = [
    "Michael Alan Weiner (born March 31, 1942) is an American radio host. He is the host of The Savage Country.",
    "Michael Alan Weiner (born January 13, 1960) is a Canadian radio host. He works at The New York Times.",
    "Michael Alan Weiner (born March 31, 1942) is an American radio host. He obtained his PhD from MIT."
]

# Predict hallucination scores
# Returns P(contradiction | sentence, sample) averaged over samples
# Score range: [0.0, 1.0], higher = more likely hallucination
sent_scores = selfcheck_nli.predict(
    sentences=sentences,
    sampled_passages=sampled_passages
)
print(f"NLI scores: {sent_scores}")
# Output: [0.334014 0.975106]
# Second sentence shows high contradiction probability
```

## SelfCheckMQAG - Hallucination Detection via Question Answering

SelfCheckMQAG generates multiple-choice questions from each sentence, then checks whether the answers stay consistent when the original passage vs. the sampled passages is used as context. Inconsistent answers indicate potential hallucinations. Three scoring methods are available: counting, bayes, and bayes_with_alpha.

```python
import torch
from selfcheckgpt.modeling_selfcheck import SelfCheckMQAG

# Initialize (loads multiple models: G1, G2, Answering, Answerability)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
selfcheck_mqag = SelfCheckMQAG(device=device)

# Text to evaluate
passage = "Michael Alan Weiner (born March 31, 1942) is an American radio host. He is the host of The Savage Nation."
sentences = [
    "Michael Alan Weiner (born March 31, 1942) is an American radio host.",
    "He is the host of The Savage Nation."
]

# Sampled passages
sampled_passages = [
    "Michael Alan Weiner (born March 31, 1942) is an American radio host. He is the host of The Savage Country.",
    "Michael Alan Weiner (born January 13, 1960) is a Canadian radio host. He works at The New York Times.",
    "Michael Alan Weiner (born March 31, 1942) is an American radio host. He obtained his PhD from MIT."
]

# Predict with bayes_with_alpha scoring (recommended)
# Score range: [0.0, 1.0], higher = more likely hallucination
sent_scores = selfcheck_mqag.predict(
    sentences=sentences,
    passage=passage,
    sampled_passages=sampled_passages,
    num_questions_per_sent=5,           # questions generated per sentence
    scoring_method='bayes_with_alpha',  # options: 'counting', 'bayes', 'bayes_with_alpha'
    beta1=0.8, beta2=0.8
)
print(f"MQAG scores: {sent_scores}")
# Output: [0.30990949 0.42376232]

# Alternative: simple counting method (requires AT parameter)
sent_scores_counting = selfcheck_mqag.predict(
    sentences=sentences,
    passage=passage,
    sampled_passages=sampled_passages,
    num_questions_per_sent=5,
    scoring_method='counting',
    AT=0.5  # answerability threshold
)
```

## SelfCheckNgram - Hallucination Detection via N-gram Models

SelfCheckNgram builds an n-gram language model over the passage and sampled passages, then evaluates each sentence's likelihood under that model. Unlike the other methods, its scores are unbounded (higher = less likely under the model = more likely hallucination). It returns both sentence-level and document-level scores.

```python
from selfcheckgpt.modeling_selfcheck import SelfCheckNgram

# Initialize n-gram model
# n=1 for unigram, n=2 for bigram, etc.
selfcheck_unigram = SelfCheckNgram(n=1, lowercase=True)
selfcheck_bigram = SelfCheckNgram(n=2, lowercase=True)

passage = "Michael Alan Weiner (born March 31, 1942) is an American radio host. He is the host of The Savage Nation."
sentences = [
    "Michael Alan Weiner (born March 31, 1942) is an American radio host.",
    "He is the host of The Savage Nation."
]
sampled_passages = [
    "Michael Alan Weiner (born March 31, 1942) is an American radio host. He is the host of The Savage Country.",
    "Michael Alan Weiner (born January 13, 1960) is a Canadian radio host. He works at The New York Times.",
    "Michael Alan Weiner (born March 31, 1942) is an American radio host. He obtained his PhD from MIT."
]

# Get n-gram based scores
# Score range: [0.0, +inf), higher = more likely hallucination
ngram_scores = selfcheck_unigram.predict(
    sentences=sentences,
    passage=passage,
    sampled_passages=sampled_passages
)
print("Sentence-level scores:")
print(f"  avg_neg_logprob: {ngram_scores['sent_level']['avg_neg_logprob']}")
print(f"  max_neg_logprob: {ngram_scores['sent_level']['max_neg_logprob']}")
print("\nDocument-level scores:")
print(f"  avg_neg_logprob: {ngram_scores['doc_level']['avg_neg_logprob']}")
print(f"  avg_max_neg_logprob: {ngram_scores['doc_level']['avg_max_neg_logprob']}")
# Output:
# Sentence-level scores:
#   avg_neg_logprob: [3.184312, 3.279774]
#   max_neg_logprob: [3.476098, 4.574710]
# Document-level scores:
#   avg_neg_logprob: 3.218678904916201
#   avg_max_neg_logprob: 4.025404834169327
```

## SelfCheckLLMPrompt - Hallucination Detection via Open-Source LLM Prompting

SelfCheckLLMPrompt uses an open-source LLM (Llama 2, Mistral, etc.) to assess whether each sentence is supported by the sampled passages. The LLM is prompted to answer "Yes" or "No" for each sentence-sample pair, and the resulting scores are averaged. Custom prompt templates are supported.

```python
import torch
from selfcheckgpt.modeling_selfcheck import SelfCheckLLMPrompt

# Initialize with a HuggingFace model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
selfcheck_prompt = SelfCheckLLMPrompt(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    device=device
)

# Default prompt template:
# "Context: {context}\n\nSentence: {sentence}\n\nIs the sentence supported by the context above? Answer Yes or No.\n\nAnswer: "

# Optionally set a custom prompt template
custom_template = """Given the following context:
{context}

Evaluate this sentence: {sentence}

Does the context support this sentence? Respond with only Yes or No.

Answer: """
selfcheck_prompt.set_prompt_template(custom_template)

sentences = [
    "Michael Alan Weiner (born March 31, 1942) is an American radio host.",
    "He is the host of The Savage Nation."
]
sampled_passages = [
    "Michael Alan Weiner (born March 31, 1942) is an American radio host. He is the host of The Savage Country.",
    "Michael Alan Weiner (born January 13, 1960) is a Canadian radio host. He works at The New York Times.",
    "Michael Alan Weiner (born March 31, 1942) is an American radio host. He obtained his PhD from MIT."
]

# Predict with progress bar
# Score mapping: Yes -> 0.0, No -> 1.0, N/A -> 0.5
sent_scores = selfcheck_prompt.predict(
    sentences=sentences,
    sampled_passages=sampled_passages,
    verbose=True  # show progress bar
)
print(f"LLM Prompt scores: {sent_scores}")
# Output: [0.33333333, 0.66666667]
```

## SelfCheckAPIPrompt - Hallucination Detection via API-Based LLM Prompting

SelfCheckAPIPrompt uses API-based LLMs (OpenAI GPT, Groq) for the consistency check. This variant achieves the best benchmark performance, especially with GPT-3.5-turbo. It requires the appropriate API keys, set as environment variables or passed directly.

```python
from selfcheckgpt.modeling_selfcheck_apiprompt import SelfCheckAPIPrompt

# Option 1: OpenAI (requires OPENAI_API_KEY environment variable)
selfcheck_openai = SelfCheckAPIPrompt(
    client_type="openai",
    model="gpt-3.5-turbo"
)

# Option 2: Groq (pass API key directly)
selfcheck_groq = SelfCheckAPIPrompt(
    client_type="groq",
    model="llama3-70b-8192",
    api_key="your-groq-api-key"
)

sentences = [
    "Michael Alan Weiner (born March 31, 1942) is an American radio host.",
    "He is the host of The Savage Nation."
]
sampled_passages = [
    "Michael Alan Weiner (born March 31, 1942) is an American radio host. He is the host of The Savage Country.",
    "Michael Alan Weiner (born January 13, 1960) is a Canadian radio host. He works at The New York Times.",
    "Michael Alan Weiner (born March 31, 1942) is an American radio host. He obtained his PhD from MIT."
]

# Custom prompt template
selfcheck_openai.set_prompt_template(
    "Context: {context}\n\nSentence: {sentence}\n\nIs the sentence supported by the context above? Answer Yes or No.\n\nAnswer: "
)

# Predict hallucination scores
sent_scores = selfcheck_openai.predict(
    sentences=sentences,
    sampled_passages=sampled_passages,
    verbose=True
)
print(f"API Prompt scores: {sent_scores}")
# Score mapping: Yes -> 0.0, No -> 1.0, N/A -> 0.5
```

## MQAG - Multiple-Choice Question Answering and Generation

The MQAG class provides standalone multiple-choice question generation and answering. It can generate questions from text, answer them given a context, and compute MQAG scores comparing candidate vs. reference texts. It is useful for evaluating summaries or comparing generated content against source documents.

```python
import torch
from selfcheckgpt.modeling_mqag import MQAG

# Initialize MQAG
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
mqag = MQAG(
    g1_model_type='race',  # options: 'race' or 'squad'
    device=device
)

# Example text for question generation
context = """The Apollo 11 mission was the first crewed mission to land on the Moon.
Commander Neil Armstrong and lunar module pilot Buzz Aldrin landed the Apollo Lunar
Module Eagle on July 20, 1969. Armstrong became the first person to step onto the
lunar surface six hours later, and Aldrin joined him 19 minutes after that."""

# Generate multiple-choice questions
questions = mqag.generate(
    context=context,
    do_sample=True,   # True for sampling, False for beam search
    num_questions=3   # number of questions to generate
)
for i, q in enumerate(questions):
    print(f"Q{i+1}: {q['question']}")
    print(f"Options: {q['options']}")
    print()
# Output example:
# Q1: Who was the commander of the Apollo 11 mission?
# Options: ['Neil Armstrong', 'Buzz Aldrin', 'Michael Collins', 'John Glenn']

# Answer questions given a context
answer_probs = mqag.answer(
    questions=questions,
    context=context
)
print(f"Answer probabilities shape: {answer_probs.shape}")
# Output: (num_questions, 4) - probability distribution over 4 options

# MQAG Score: compare candidate text against reference
candidate = "Neil Armstrong was the first person to walk on the Moon during Apollo 11."
reference = context
distances = mqag.score(
    candidate=candidate,
    reference=reference,
    num_questions=5,
    verbose=True  # prints question-by-question analysis
)
print("\nMQAG Distances:")
print(f"  KL Divergence: {distances['kl_div']:.4f}")
print(f"  Counting: {distances['counting']:.4f}")
print(f"  Hellinger: {distances['hellinger']:.4f}")
print(f"  Total Variation: {distances['total_variation']:.4f}")
# Lower distances indicate better alignment between candidate and reference
```

## Loading the WikiBio GPT-3 Hallucination Dataset

The WikiBio GPT-3 hallucination dataset provides 238 annotated passages for benchmarking hallucination detection methods. Each passage includes GPT-3 generated text, sentence-level human annotations, and pre-generated stochastic samples for self-checking.
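For benchmarking, the three-way sentence annotations are usually collapsed into binary labels before computing a detection metric such as AUC-PR, with both inaccurate classes counted as hallucinations. A minimal sketch of that mapping (assuming scikit-learn is installed; `to_binary_labels` is an illustrative helper, not part of the package):

```python
from sklearn.metrics import average_precision_score

def to_binary_labels(annotations):
    """Map WikiBio annotations to binary hallucination labels:
    'accurate' -> 0; 'minor_inaccurate' and 'major_inaccurate' -> 1."""
    return [0 if a == "accurate" else 1 for a in annotations]

# Hypothetical sentence scores from any SelfCheckGPT variant (higher = more suspicious)
annotations = ["accurate", "major_inaccurate", "accurate", "minor_inaccurate"]
sent_scores = [0.10, 0.92, 0.25, 0.71]

labels = to_binary_labels(annotations)
auc_pr = average_precision_score(labels, sent_scores)
print(f"AUC-PR (non-factual): {auc_pr:.3f}")
# Output: AUC-PR (non-factual): 1.000  (both hallucinated sentences ranked above both accurate ones)
```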
```python
# Option 1: Load via HuggingFace datasets
from datasets import load_dataset

dataset = load_dataset("potsawee/wiki_bio_gpt3_hallucination")
example = dataset['evaluation'][0]

print("GPT-3 generated text:")
print(example['gpt3_text'])
print("\nSentences:")
for i, sent in enumerate(example['gpt3_sentences']):
    annotation = example['annotation'][i]
    print(f"  [{annotation}] {sent}")
    # annotation: 'accurate', 'minor_inaccurate', 'major_inaccurate'

print(f"\nNumber of sampled passages: {len(example['gpt3_text_samples'])}")

# Option 2: Load from JSON file (manual download)
import json
with open("dataset.json", "r") as f:
    dataset = json.loads(f.read())

# Dataset structure:
# - gpt3_text: GPT-3 generated passage
# - wiki_bio_text: original Wikipedia passage (ground truth)
# - gpt3_sentences: gpt3_text split into sentences
# - annotation: human labels per sentence
# - wiki_bio_test_idx: ID from the wikibio dataset
# - gpt3_text_samples: list of sampled passages for self-checking
```

## Complete Hallucination Detection Pipeline

This example demonstrates a complete pipeline for detecting hallucinations in LLM-generated text, combining scores from multiple SelfCheckGPT variants for more robust detection.

```python
import torch
import spacy
from selfcheckgpt.modeling_selfcheck import (
    SelfCheckNLI,
    SelfCheckBERTScore,
    SelfCheckNgram
)

# Setup
torch.manual_seed(42)  # for reproducibility
nlp = spacy.load("en_core_web_sm")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Initialize multiple checkers
selfcheck_nli = SelfCheckNLI(device=device)
selfcheck_bertscore = SelfCheckBERTScore(rescale_with_baseline=True)
selfcheck_ngram = SelfCheckNgram(n=1)

def detect_hallucinations(passage: str, sampled_passages: list, threshold: float = 0.5):
    """
    Comprehensive hallucination detection using multiple methods.

    Args:
        passage: Original LLM-generated text
        sampled_passages: List of stochastic samples from the same LLM
        threshold: Score threshold for flagging hallucinations

    Returns:
        Dictionary with sentence-level analysis
    """
    # Tokenize into sentences
    sentences = [sent.text.strip() for sent in nlp(passage).sents]

    # Get scores from multiple methods
    scores_nli = selfcheck_nli.predict(sentences, sampled_passages)
    scores_bert = selfcheck_bertscore.predict(sentences, sampled_passages)
    scores_ngram = selfcheck_ngram.predict(sentences, passage, sampled_passages)

    # Analyze each sentence
    results = []
    for i, sent in enumerate(sentences):
        nli_score = scores_nli[i]
        bert_score = scores_bert[i]
        ngram_score = scores_ngram['sent_level']['avg_neg_logprob'][i]

        # Normalize the ngram score to the [0, 1] range (approximate)
        ngram_normalized = min(ngram_score / 5.0, 1.0)

        # Ensemble score (weighted average)
        ensemble = 0.5 * nli_score + 0.3 * bert_score + 0.2 * ngram_normalized

        results.append({
            'sentence': sent,
            'nli_score': float(nli_score),
            'bertscore': float(bert_score),
            'ngram_score': float(ngram_score),
            'ensemble_score': float(ensemble),
            'is_hallucination': ensemble > threshold
        })

    return {
        'sentences': results,
        'document_ngram_score': scores_ngram['doc_level']['avg_neg_logprob']
    }

# Example usage
passage = """Albert Einstein was born in 1879 in Germany. He developed the theory of relativity.
Einstein won the Nobel Prize in Chemistry in 1921. He later moved to the United States."""

samples = [
    "Albert Einstein was born in 1879 in Ulm, Germany. He is famous for the theory of relativity. Einstein won the Nobel Prize in Physics in 1921.",
    "Albert Einstein, born 1879, was a German physicist. His theory of relativity revolutionized physics. He received the Nobel Prize for his work on the photoelectric effect.",
    "Einstein was born in Germany in 1879. He created the special and general theories of relativity. He won the 1921 Nobel Prize in Physics."
]

results = detect_hallucinations(passage, samples)

print("Hallucination Analysis:")
print("=" * 60)
for r in results['sentences']:
    status = "HALLUCINATION" if r['is_hallucination'] else "OK"
    print(f"\n[{status}] {r['sentence']}")
    print(f"  NLI: {r['nli_score']:.3f}, BERT: {r['bertscore']:.3f}, "
          f"Ngram: {r['ngram_score']:.3f}, Ensemble: {r['ensemble_score']:.3f}")
# The "Nobel Prize in Chemistry" sentence is flagged as a hallucination
# (it should be Physics, which is inconsistent with the samples)
```

## Summary

SelfCheckGPT provides a versatile toolkit for detecting hallucinations in LLM-generated text without requiring external knowledge bases. The primary use cases include: (1) quality assurance for AI-generated content by identifying potentially fabricated information, (2) building guardrails for production LLM applications that need factual accuracy, (3) research benchmarking for developing new hallucination detection methods, and (4) evaluating text summarization and question-answering systems through the MQAG framework.

Integration typically involves generating multiple stochastic samples from the target LLM (more samples generally improve detection at higher compute cost), then applying one or more SelfCheckGPT variants. For production deployments, SelfCheckNLI offers the best accuracy-speed tradeoff for local inference, while SelfCheckAPIPrompt with GPT-3.5-turbo achieves the highest accuracy when API costs are acceptable. An ensemble combining multiple methods provides the most robust detection. All methods return sentence-level scores, enabling fine-grained identification of which specific claims may be hallucinated.
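The score mapping used by the prompt-based variants (Yes -> 0.0, No -> 1.0, anything else -> 0.5, averaged over samples) can be sketched in plain Python. `aggregate_prompt_answers` is an illustrative helper, not part of the package:

```python
def aggregate_prompt_answers(answers_per_sample):
    """Map per-sample Yes/No judgements for one sentence to a hallucination score.
    'Yes' (supported) -> 0.0, 'No' (unsupported) -> 1.0, anything else -> 0.5;
    the final score is the mean over samples."""
    mapping = {"yes": 0.0, "no": 1.0}
    scores = [mapping.get(a.strip().lower(), 0.5) for a in answers_per_sample]
    return sum(scores) / len(scores)

# One sentence checked against three sampled passages:
print(aggregate_prompt_answers(["Yes", "No", "Yes"]))  # 1/3 ~ 0.33: mostly supported
print(aggregate_prompt_answers(["No", "No", "N/A"]))   # 2.5/3 ~ 0.83: likely hallucinated
```

This also makes the example outputs above interpretable: with three samples, the only possible per-sentence scores are multiples of 1/6, e.g. `0.33333333` means one "No" out of three.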