Repository: https://github.com/yale-lily/qmsum
# QMSum

QMSum is a human-annotated benchmark dataset for query-based multi-domain meeting summarization, introduced in a NAACL 2021 paper. The dataset consists of 1,808 query-summary pairs over 232 meetings spanning three domains: Academic (ICSI and AMI meeting corpora), Product (design meetings), and Committee (parliamentary committee meetings). Unlike traditional meeting summarization datasets, QMSum focuses on query-based summarization, where users can ask both general questions (summarize the whole meeting) and specific questions (summarize a particular topic or a speaker's contribution).

The dataset addresses the challenge of long-form meeting summarization by providing topic segmentation and relevant text span annotations. Each meeting includes topic lists with text span mappings, general queries for whole-meeting summaries, and specific queries targeting particular discussion topics or speakers. This structure enables both extractive span localization and abstractive summary generation research, making QMSum valuable for developing hierarchical and query-focused summarization models.

## Data Format and Structure

QMSum data is stored in JSON format with four main components: topic lists, general queries, specific queries, and meeting transcripts. Each meeting file contains the complete dialogue transcript with speaker attribution and structured query-answer pairs.

```json
{
  "topic_list": [
    {
      "topic": "Introduction of petitions and prioritization of governmental matters",
      "relevant_text_span": [["0", "19"]]
    },
    {
      "topic": "Financial assistance for vulnerable Canadians during the pandemic",
      "relevant_text_span": [["21", "57"], ["113", "119"], ["191", "217"]]
    }
  ],
  "general_query_list": [
    {
      "query": "Summarize the whole meeting.",
      "answer": "The meeting of the standing committee took place to discuss matters pertinent to the Coronavirus pandemic..."
    }
  ],
  "specific_query_list": [
    {
      "query": "Summarize the discussion about introduction of petitions.",
      "answer": "The Chair brought the meeting to order, announcing that the purpose was to discuss COVID-19's impact...",
      "relevant_text_span": [["0", "19"]]
    },
    {
      "query": "What did Paul-Hus think about the introduction of petitions?",
      "answer": "Mr. Paul-Hus thought that the government should not take firearms away from law-abiding citizens...",
      "relevant_text_span": [["9", "18"]]
    }
  ],
  "meeting_transcripts": [
    {
      "speaker": "The Chair (Hon. Anthony Rota)",
      "content": "I call the meeting to order. Welcome to the third meeting of the House of Commons Special Committee..."
    },
    {
      "speaker": "Mr. Garnett Genuis",
      "content": "Mr. Chair, I'm pleased to be presenting two petitions today..."
    }
  ]
}
```

## Loading QMSum Data

Load the dataset using Python's built-in `json` library. The data is available in both JSON and JSONL formats across train/validation/test splits.

```python
import json

def load_qmsum_data(split='train', domain='ALL'):
    """
    Load QMSum data for the specified split and domain.

    Args:
        split: 'train', 'val', or 'test'
        domain: 'ALL', 'Academic', 'Product', or 'Committee'

    Returns:
        List of meeting dictionaries
    """
    data_path = f'data/{domain}/jsonl/{split}.jsonl'
    data = []
    with open(data_path) as f:
        for line in f:
            data.append(json.loads(line))
    return data

# Load training data
train_data = load_qmsum_data('train', 'ALL')
print(f'Total {len(train_data)} meetings in training set')
# Output: Total 162 meetings in training set

# Access meeting components
meeting = train_data[0]
print(f"Topics: {len(meeting['topic_list'])}")
print(f"General queries: {len(meeting['general_query_list'])}")
print(f"Specific queries: {len(meeting['specific_query_list'])}")
print(f"Transcript turns: {len(meeting['meeting_transcripts'])}")
```

## Text Preprocessing

Clean meeting transcripts by removing speech recognition artifacts and noise markers.
Tokenize text for model input using NLTK.

```python
from nltk import word_tokenize

def tokenize(sent):
    """Tokenize and lowercase a sentence."""
    return ' '.join(word_tokenize(sent.lower()))

def clean_data(text):
    """Remove speech recognition noise markers from text."""
    noise_markers = [
        '{ vocalsound } ', '{ disfmarker } ', '{ pause } ',
        '{ nonvocalsound } ', '{ gap } '
    ]
    for marker in noise_markers:
        text = text.replace(marker, '')
    # Normalize abbreviations
    abbreviations = {
        'a_m_i_': 'ami', 'l_c_d_': 'lcd',
        'p_m_s': 'pms', 't_v_': 'tv'
    }
    for abbr, expanded in abbreviations.items():
        text = text.replace(abbr, expanded)
    return text

# Example usage
raw_text = "{ vocalsound } I think we should { disfmarker } use l_c_d_ screens"
cleaned = clean_data(raw_text)
tokenized = tokenize(cleaned)
print(tokenized)
# Output: i think we should use lcd screens
```

## Processing Data for BART Model

Convert QMSum data into source-target pairs suitable for sequence-to-sequence models such as BART. The source combines the query with the meeting content.

```python
def process_for_bart(data, use_gold_spans=False):
    """
    Process QMSum data for a BART summarization model.

    Args:
        data: List of meeting dictionaries
        use_gold_spans: If True, use only the relevant text spans for specific queries

    Returns:
        List of {'src': source_text, 'tgt': target_summary} dictionaries
    """
    bart_data = []
    for meeting in data:
        # Build full meeting content
        full_src = []
        for turn in meeting['meeting_transcripts']:
            speaker = turn['speaker'].lower()
            content = tokenize(turn['content'])
            full_src.append(f"{speaker}: {content}")
        full_src_text = ' '.join(full_src)

        # Process general queries (use the full meeting)
        for query_item in meeting['general_query_list']:
            query = tokenize(query_item['query'])
            target = tokenize(query_item['answer'])
            src = clean_data(f'<s> {query} </s> {full_src_text} </s>')
            bart_data.append({'src': src, 'tgt': target})

        # Process specific queries
        for query_item in meeting['specific_query_list']:
            query = tokenize(query_item['query'])
            target = tokenize(query_item['answer'])
            if use_gold_spans:
                # Extract only the relevant spans
                src_parts = []
                for span in query_item['relevant_text_span']:
                    start, end = int(span[0]), int(span[1])
                    for idx in range(start, end + 1):
                        turn = meeting['meeting_transcripts'][idx]
                        speaker = turn['speaker'].lower()
                        content = tokenize(turn['content'])
                        src_parts.append(f"{speaker}: {content}")
                src_text = ' '.join(src_parts)
            else:
                src_text = full_src_text
            src = clean_data(f'<s> {query} </s> {src_text} </s>')
            bart_data.append({'src': src, 'tgt': target})
    return bart_data

# Process training data
train_data = load_qmsum_data('train', 'ALL')
bart_train = process_for_bart(train_data, use_gold_spans=False)
bart_train_gold = process_for_bart(train_data, use_gold_spans=True)
print(f'Total {len(bart_train)} query-summary pairs')
# Output: Total 1257 query-summary pairs

# Save processed data
with open('bart_train.jsonl', 'w') as f:
    for item in bart_train:
        print(json.dumps(item), file=f)
```

## Extracting Relevant Text Spans

Access and extract the gold-standard relevant text spans for specific queries.
Spans indicate which transcript turns are relevant for answering a query.

```python
def extract_gold_spans(meeting, query_idx):
    """
    Extract gold-standard relevant text spans for a specific query.

    Args:
        meeting: Meeting dictionary
        query_idx: Index into specific_query_list

    Returns:
        List of transcript turn dictionaries
    """
    query_item = meeting['specific_query_list'][query_idx]
    spans = query_item['relevant_text_span']
    relevant_turns = []
    for span in spans:
        start_idx, end_idx = int(span[0]), int(span[1])
        for idx in range(start_idx, end_idx + 1):
            turn = meeting['meeting_transcripts'][idx]
            relevant_turns.append({
                'index': idx,
                'speaker': turn['speaker'],
                'content': turn['content']
            })
    return relevant_turns

# Example: extract spans for the first specific query
meeting = train_data[0]
query = meeting['specific_query_list'][0]
print(f"Query: {query['query']}")
print(f"Answer: {query['answer'][:100]}...")
print(f"Span indices: {query['relevant_text_span']}")

relevant = extract_gold_spans(meeting, 0)
print(f"\nRelevant turns ({len(relevant)} total):")
for turn in relevant[:3]:
    print(f"  [{turn['index']}] {turn['speaker']}: {turn['content'][:50]}...")
```

## Working with Domain-Specific Data

QMSum contains meetings from three distinct domains. Load and analyze data from a specific domain for domain-focused experiments.

```python
def load_domain_data(domain):
    """
    Load all splits for a specific domain.

    Args:
        domain: 'Academic', 'Product', or 'Committee'

    Returns:
        Dictionary with train/val/test splits
    """
    return {
        'train': load_qmsum_data('train', domain),
        'val': load_qmsum_data('val', domain),
        'test': load_qmsum_data('test', domain)
    }

def get_domain_statistics(domain_data):
    """Calculate statistics for domain data."""
    stats = {}
    for split, meetings in domain_data.items():
        n_meetings = len(meetings)
        n_general = sum(len(m['general_query_list']) for m in meetings)
        n_specific = sum(len(m['specific_query_list']) for m in meetings)
        avg_turns = sum(len(m['meeting_transcripts']) for m in meetings) / n_meetings
        stats[split] = {
            'meetings': n_meetings,
            'general_queries': n_general,
            'specific_queries': n_specific,
            'avg_transcript_turns': round(avg_turns, 1)
        }
    return stats

# Load and analyze each domain
for domain in ['Academic', 'Product', 'Committee']:
    data = load_domain_data(domain)
    stats = get_domain_statistics(data)
    print(f"\n{domain} Domain:")
    for split, s in stats.items():
        print(f"  {split}: {s['meetings']} meetings, "
              f"{s['general_queries'] + s['specific_queries']} queries")
```

## Using Extracted Spans from Locator Model

Load spans pre-extracted by the Locator model to use as summarization input in place of full meeting transcripts.

```python
def load_extracted_spans(split='train'):
    """
    Load spans extracted by the Locator model.

    Args:
        split: 'train', 'val', or 'test'

    Returns:
        List of extracted span strings
    """
    spans = []
    with open(f'extracted_span/{split}.txt', 'r') as f:
        for line in f:
            spans.append(line.strip())
    return spans

def load_model_outputs():
    """
    Load HMNet model predictions and references.

    Returns:
        Tuple of (predictions, references) lists
    """
    with open('model_output/preds.txt', 'r') as f:
        preds = [line.strip() for line in f]
    with open('model_output/refs.txt', 'r') as f:
        refs = [line.strip() for line in f]
    return preds, refs

# Load extracted spans for training
train_spans = load_extracted_spans('train')
print(f"Loaded {len(train_spans)} extracted spans")

# Load model outputs for evaluation
preds, refs = load_model_outputs()
print(f"Loaded {len(preds)} predictions and {len(refs)} references")
# HMNet ROUGE scores: R-1/R-2/R-L = 36.51/11.41/31.60
```

## Evaluating Summarization with ROUGE

Evaluate generated summaries against reference summaries using ROUGE metrics, the standard evaluation for QMSum.

```python
from rouge_score import rouge_scorer

def evaluate_summaries(predictions, references):
    """
    Calculate ROUGE scores for generated summaries.

    Args:
        predictions: List of generated summary strings
        references: List of reference summary strings

    Returns:
        Dictionary with ROUGE-1, ROUGE-2, and ROUGE-L F1 scores
    """
    scorer = rouge_scorer.RougeScorer(
        ['rouge1', 'rouge2', 'rougeL'], use_stemmer=True
    )
    scores = {'rouge1': [], 'rouge2': [], 'rougeL': []}
    for pred, ref in zip(predictions, references):
        score = scorer.score(ref, pred)
        for key in scores:
            scores[key].append(score[key].fmeasure)
    return {
        key: round(sum(vals) / len(vals) * 100, 2)
        for key, vals in scores.items()
    }

# Example evaluation
preds, refs = load_model_outputs()
results = evaluate_summaries(preds, refs)
print(f"ROUGE-1: {results['rouge1']}")
print(f"ROUGE-2: {results['rouge2']}")
print(f"ROUGE-L: {results['rougeL']}")
# Expected HMNet output: 36.51/11.41/31.60
```

## Summary

QMSum serves as a comprehensive benchmark for developing and evaluating query-based meeting summarization systems.
The dataset's unique features include multi-domain coverage (academic, product design, and parliamentary meetings), hierarchical topic segmentation with text span annotations, and both general and specific query types. Researchers can use QMSum to train and evaluate a range of approaches, including extractive span localization (Locator models), abstractive summarization (BART, PGNet), and hierarchical meeting understanding (HMNet).

The dataset supports multiple research paradigms: full-meeting summarization using all transcript content, gold-span summarization using the annotated relevant text spans, and pipeline approaches combining span extraction with abstractive generation. With 1,808 query-summary pairs and detailed annotations, QMSum enables rigorous evaluation of a model's ability to understand long-form dialogue, identify relevant information, and generate coherent query-focused summaries. The provided data processing utilities facilitate integration with popular sequence-to-sequence frameworks and support reproducible research in meeting summarization.
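To make the end-to-end flow concrete without touching the dataset files, the sketch below builds query-source pairs from a single in-memory toy meeting that mirrors the QMSum JSON schema described above. The meeting content, speakers, and queries here are invented for illustration, and the NLTK tokenization and noise-marker cleaning steps from the preprocessing section are deliberately omitted to keep the example dependency-free.

```python
# Toy meeting in the QMSum schema (all content invented for illustration).
toy_meeting = {
    "topic_list": [
        {"topic": "Agenda review", "relevant_text_span": [["0", "1"]]}
    ],
    "general_query_list": [
        {"query": "Summarize the whole meeting.",
         "answer": "The group reviewed the agenda."}
    ],
    "specific_query_list": [
        {"query": "Summarize the discussion about the agenda.",
         "answer": "The chair listed the agenda items.",
         "relevant_text_span": [["0", "1"]]}
    ],
    "meeting_transcripts": [
        {"speaker": "Chair", "content": "Let's review the agenda."},
        {"speaker": "Member", "content": "Sounds good."}
    ],
}

def build_pairs(meeting, use_gold_spans=True):
    """Turn one meeting into (source, target) pairs, following the same
    query-plus-transcript layout as process_for_bart (minus tokenization)."""
    turns = [f"{t['speaker'].lower()}: {t['content']}"
             for t in meeting["meeting_transcripts"]]
    full_src = " ".join(turns)
    pairs = []
    # General queries always see the full meeting.
    for q in meeting["general_query_list"]:
        pairs.append((f"<s> {q['query']} </s> {full_src} </s>", q["answer"]))
    # Specific queries optionally see only their annotated gold spans.
    for q in meeting["specific_query_list"]:
        if use_gold_spans:
            parts = []
            for start, end in ((int(s), int(e))
                               for s, e in q["relevant_text_span"]):
                parts.extend(turns[start:end + 1])  # span indices are inclusive
            src_text = " ".join(parts)
        else:
            src_text = full_src
        pairs.append((f"<s> {q['query']} </s> {src_text} </s>", q["answer"]))
    return pairs

pairs = build_pairs(toy_meeting)
print(len(pairs))  # 2: one general pair and one specific pair
```

Note the inclusive span indices (`end + 1`), matching how the gold `relevant_text_span` annotations index transcript turns in the dataset.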