# QMSum
QMSum is a human-annotated benchmark dataset for query-based multi-domain meeting summarization, introduced in a NAACL 2021 paper. The dataset consists of 1,808 query-summary pairs over 232 meetings spanning three domains: Academic (ICSI research meetings), Product (AMI product-design meetings), and Committee (parliamentary committee meetings). Unlike traditional meeting summarization datasets, QMSum focuses on query-based summarization, where users can ask both general questions (summarize the whole meeting) and specific questions (summarize a particular topic or a speaker's contribution).
The dataset addresses the challenge of long-form meeting summarization by providing topic segmentation and relevant text span annotations. Each meeting includes topic lists with text span mappings, general queries for whole-meeting summaries, and specific queries targeting particular discussion topics or speakers. This structure enables both extractive span localization and abstractive summary generation research, making QMSum valuable for developing hierarchical and query-focused summarization models.
## Data Format and Structure
QMSum data is stored in JSON format with four main components: topic lists, general queries, specific queries, and meeting transcripts. Each meeting file contains complete dialogue transcripts with speaker attribution and structured query-answer pairs.
```json
{
"topic_list": [
{
"topic": "Introduction of petitions and prioritization of governmental matters",
"relevant_text_span": [["0","19"]]
},
{
"topic": "Financial assistance for vulnerable Canadians during the pandemic",
"relevant_text_span": [["21","57"], ["113","119"], ["191","217"]]
}
],
"general_query_list": [
{
"query": "Summarize the whole meeting.",
"answer": "The meeting of the standing committee took place to discuss matters pertinent to the Coronavirus pandemic..."
}
],
"specific_query_list": [
{
"query": "Summarize the discussion about introduction of petitions.",
"answer": "The Chair brought the meeting to order, announcing that the purpose was to discuss COVID-19's impact...",
"relevant_text_span": [["0","19"]]
},
{
"query": "What did Paul-Hus think about the introduction of petitions?",
"answer": "Mr. Paul-Hus thought that the government should not take firearms away from law-abiding citizens...",
"relevant_text_span": [["9","18"]]
}
],
"meeting_transcripts": [
{
"speaker": "The Chair (Hon. Anthony Rota)",
"content": "I call the meeting to order. Welcome to the third meeting of the House of Commons Special Committee..."
},
{
"speaker": "Mr. Garnett Genuis",
"content": "Mr. Chair, I'm pleased to be presenting two petitions today..."
}
]
}
```
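Each `relevant_text_span` entry is an inclusive `[start, end]` pair of turn indices into `meeting_transcripts`, with both bounds stored as strings. A minimal sketch for checking that every annotated span stays within the transcript (the helper name `validate_spans` is illustrative, not part of the dataset tooling):
```python
def validate_spans(meeting):
    """Check that every relevant_text_span indexes real transcript turns.

    Span bounds are stored as strings and are inclusive on both ends.
    """
    n_turns = len(meeting['meeting_transcripts'])
    annotated = meeting['topic_list'] + meeting['specific_query_list']
    for item in annotated:
        for start, end in item['relevant_text_span']:
            assert 0 <= int(start) <= int(end) < n_turns, (start, end)
```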
## Loading QMSum Data
Load the dataset using Python's built-in JSON library. The data is available in both JSON and JSONL formats across train/validation/test splits.
```python
import json
# Load data from JSONL format
def load_qmsum_data(split='train', domain='ALL'):
"""
Load QMSum data for specified split and domain.
Args:
split: 'train', 'val', or 'test'
domain: 'ALL', 'Academic', 'Product', or 'Committee'
Returns:
List of meeting dictionaries
"""
data_path = f'data/{domain}/jsonl/{split}.jsonl'
data = []
with open(data_path) as f:
for line in f:
data.append(json.loads(line))
return data
# Load training data
train_data = load_qmsum_data('train', 'ALL')
print(f'Total {len(train_data)} meetings in the train set')
# Output: Total 162 meetings in the train set
# Access meeting components
meeting = train_data[0]
print(f"Topics: {len(meeting['topic_list'])}")
print(f"General queries: {len(meeting['general_query_list'])}")
print(f"Specific queries: {len(meeting['specific_query_list'])}")
print(f"Transcript turns: {len(meeting['meeting_transcripts'])}")
```
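As a sanity check, summing the general and specific queries across the three splits should recover the 1,808 query-summary pairs cited above. A short sketch, assuming the same `data/ALL/jsonl` layout used by `load_qmsum_data`:
```python
# Count query-summary pairs across all splits; the paper reports 1,808 total
total = 0
for split in ['train', 'val', 'test']:
    for m in load_qmsum_data(split, 'ALL'):
        total += len(m['general_query_list']) + len(m['specific_query_list'])
print(f'Total query-summary pairs: {total}')  # expected: 1808
```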
## Text Preprocessing
Clean meeting transcripts by removing the transcription noise markers (e.g., `{ vocalsound }`, `{ disfmarker }`) carried over from the source corpora, and tokenize text for model input using NLTK.
```python
from nltk import word_tokenize  # requires the punkt model: nltk.download('punkt')
def tokenize(sent):
"""Tokenize and lowercase a sentence."""
tokens = ' '.join(word_tokenize(sent.lower()))
return tokens
def clean_data(text):
"""Remove speech recognition noise markers from text."""
noise_markers = [
'{ vocalsound } ', '{ disfmarker } ', '{ pause } ',
'{ nonvocalsound } ', '{ gap } '
]
for marker in noise_markers:
text = text.replace(marker, '')
# Normalize abbreviations
abbreviations = {
'a_m_i_': 'ami', 'l_c_d_': 'lcd',
'p_m_s': 'pms', 't_v_': 'tv'
}
for abbr, expanded in abbreviations.items():
text = text.replace(abbr, expanded)
return text
# Example usage
raw_text = "{ vocalsound } I think we should { disfmarker } use l_c_d_ screens"
cleaned = clean_data(raw_text)
tokenized = tokenize(cleaned)
print(tokenized)
# Output: i think we should use lcd screens
```
## Processing Data for BART Model
Convert QMSum data into source-target pairs suitable for sequence-to-sequence models like BART. The source combines the query with meeting content.
```python
def process_for_bart(data, use_gold_spans=False):
"""
Process QMSum data for BART summarization model.
Args:
data: List of meeting dictionaries
use_gold_spans: If True, use only relevant text spans for specific queries
Returns:
List of {'src': source_text, 'tgt': target_summary} dictionaries
"""
bart_data = []
for meeting in data:
# Build full meeting content
full_src = []
for turn in meeting['meeting_transcripts']:
speaker = turn['speaker'].lower()
content = tokenize(turn['content'])
full_src.append(f"{speaker}: {content}")
full_src_text = ' '.join(full_src)
# Process general queries (use full meeting)
for query_item in meeting['general_query_list']:
query = tokenize(query_item['query'])
target = tokenize(query_item['answer'])
src = clean_data(f' {query} {full_src_text} ')
bart_data.append({'src': src, 'tgt': target})
# Process specific queries
for query_item in meeting['specific_query_list']:
query = tokenize(query_item['query'])
target = tokenize(query_item['answer'])
if use_gold_spans:
# Extract only relevant spans
src_parts = []
for span in query_item['relevant_text_span']:
start, end = int(span[0]), int(span[1])
for idx in range(start, end + 1):
turn = meeting['meeting_transcripts'][idx]
speaker = turn['speaker'].lower()
content = tokenize(turn['content'])
src_parts.append(f"{speaker}: {content}")
src_text = ' '.join(src_parts)
else:
src_text = full_src_text
src = clean_data(f' {query} {src_text} ')
bart_data.append({'src': src, 'tgt': target})
return bart_data
# Process training data
train_data = load_qmsum_data('train', 'ALL')
bart_train = process_for_bart(train_data, use_gold_spans=False)
bart_train_gold = process_for_bart(train_data, use_gold_spans=True)
print(f'Total {len(bart_train)} query-summary pairs in the train set')
# Output: Total 1257 query-summary pairs in the train set
# Save processed data
with open('bart_train.jsonl', 'w') as f:
for item in bart_train:
print(json.dumps(item), file=f)
```
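BART's encoder accepts at most 1,024 subword positions, so the full-meeting sources built above generally need truncation (or chunking) before encoding. A minimal sketch that trims to a whitespace-token budget; the helper is illustrative, and the model tokenizer's subword count will differ from the whitespace count used here:
```python
def truncate_source(src, max_tokens=1024):
    """Trim a whitespace-tokenized source string to a fixed token budget.

    Whitespace tokens are only a rough proxy for BART subword tokens;
    real pipelines should truncate with the model tokenizer instead.
    """
    return ' '.join(src.split()[:max_tokens])

bart_train_trunc = [
    {'src': truncate_source(item['src']), 'tgt': item['tgt']}
    for item in bart_train
]
```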
## Extracting Relevant Text Spans
Access and extract the gold-standard relevant text spans for specific queries. Spans indicate which transcript turns are relevant for answering a query.
```python
def extract_gold_spans(meeting, query_idx):
"""
Extract gold-standard relevant text spans for a specific query.
Args:
meeting: Meeting dictionary
query_idx: Index into specific_query_list
Returns:
List of transcript turn dictionaries
"""
query_item = meeting['specific_query_list'][query_idx]
spans = query_item['relevant_text_span']
relevant_turns = []
for span in spans:
start_idx, end_idx = int(span[0]), int(span[1])
for idx in range(start_idx, end_idx + 1):
turn = meeting['meeting_transcripts'][idx]
relevant_turns.append({
'index': idx,
'speaker': turn['speaker'],
'content': turn['content']
})
return relevant_turns
# Example: Extract spans for first specific query
meeting = train_data[0]
query = meeting['specific_query_list'][0]
print(f"Query: {query['query']}")
print(f"Answer: {query['answer'][:100]}...")
print(f"Span indices: {query['relevant_text_span']}")
relevant = extract_gold_spans(meeting, 0)
print(f"\nRelevant turns ({len(relevant)} total):")
for turn in relevant[:3]:
print(f" [{turn['index']}] {turn['speaker']}: {turn['content'][:50]}...")
```
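The same gold spans can be flattened into turn-level binary labels, the natural supervision signal for training an extractive span localizer such as the Locator. A sketch (the helper name `span_labels` is not from the dataset tooling):
```python
def span_labels(meeting, query_idx):
    """Build turn-level 0/1 relevance labels for one specific query.

    labels[i] == 1 iff transcript turn i falls inside one of the query's
    gold relevant_text_span ranges (inclusive on both ends).
    """
    labels = [0] * len(meeting['meeting_transcripts'])
    for span in meeting['specific_query_list'][query_idx]['relevant_text_span']:
        for idx in range(int(span[0]), int(span[1]) + 1):
            labels[idx] = 1
    return labels
```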
## Working with Domain-Specific Data
QMSum contains meetings from three distinct domains. Load and analyze data from specific domains for domain-focused experiments.
```python
def load_domain_data(domain):
"""
Load all splits for a specific domain.
Args:
domain: 'Academic', 'Product', or 'Committee'
Returns:
Dictionary with train/val/test splits
"""
return {
'train': load_qmsum_data('train', domain),
'val': load_qmsum_data('val', domain),
'test': load_qmsum_data('test', domain)
}
def get_domain_statistics(domain_data):
"""Calculate statistics for domain data."""
stats = {}
for split, meetings in domain_data.items():
n_meetings = len(meetings)
n_general = sum(len(m['general_query_list']) for m in meetings)
n_specific = sum(len(m['specific_query_list']) for m in meetings)
avg_turns = sum(len(m['meeting_transcripts']) for m in meetings) / n_meetings
stats[split] = {
'meetings': n_meetings,
'general_queries': n_general,
'specific_queries': n_specific,
'avg_transcript_turns': round(avg_turns, 1)
}
return stats
# Load and analyze each domain
for domain in ['Academic', 'Product', 'Committee']:
data = load_domain_data(domain)
stats = get_domain_statistics(data)
print(f"\n{domain} Domain:")
for split, s in stats.items():
print(f" {split}: {s['meetings']} meetings, "
f"{s['general_queries'] + s['specific_queries']} queries")
```
## Using Extracted Spans from Locator Model
Load spans pre-extracted by the Locator model and use them as summarization input in place of full meeting transcripts.
```python
def load_extracted_spans(split='train'):
"""
Load spans extracted by the Locator model.
Args:
split: 'train', 'val', or 'test'
Returns:
List of extracted span strings
"""
spans = []
with open(f'extracted_span/{split}.txt', 'r') as f:
for line in f:
spans.append(line.strip())
return spans
def load_model_outputs():
"""
Load HMNet model predictions and references.
Returns:
Tuple of (predictions, references) lists
"""
with open('model_output/preds.txt', 'r') as f:
preds = [line.strip() for line in f]
with open('model_output/refs.txt', 'r') as f:
refs = [line.strip() for line in f]
return preds, refs
# Load extracted spans for training
train_spans = load_extracted_spans('train')
print(f"Loaded {len(train_spans)} extracted spans")
# Load model outputs for evaluation
preds, refs = load_model_outputs()
print(f"Loaded {len(preds)} predictions and {len(refs)} references")
# HMNet ROUGE scores: R-1/R-2/R-L = 36.51/11.41/31.60
```
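To feed Locator output to BART in place of gold spans, pair each extracted span with its query and reuse the formatting from `process_for_bart`. A sketch that assumes the i-th line of `extracted_span/{split}.txt` aligns with the i-th specific query of the split, in meeting order; that alignment is an assumption about the file layout, not something documented here:
```python
def bart_from_extracted_spans(data, spans):
    """Pair Locator-extracted spans with specific queries for BART.

    Assumes spans[i] corresponds to the i-th specific query across the
    split's meetings, in order (an assumption about the file layout).
    """
    pairs, i = [], 0
    for meeting in data:
        for query_item in meeting['specific_query_list']:
            query = tokenize(query_item['query'])
            target = tokenize(query_item['answer'])
            pairs.append({'src': clean_data(f' {query} {spans[i]} '),
                          'tgt': target})
            i += 1
    return pairs
```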
## Evaluating Summarization with ROUGE
Evaluate generated summaries against reference summaries using ROUGE, the standard metric for QMSum. Note that different ROUGE implementations disagree by small margins, so scores from the rouge_score package below may not exactly reproduce numbers computed with other toolkits.
```python
from rouge_score import rouge_scorer
def evaluate_summaries(predictions, references):
"""
Calculate ROUGE scores for generated summaries.
Args:
predictions: List of generated summary strings
references: List of reference summary strings
Returns:
Dictionary with ROUGE-1, ROUGE-2, ROUGE-L F1 scores
"""
scorer = rouge_scorer.RougeScorer(
['rouge1', 'rouge2', 'rougeL'],
use_stemmer=True
)
scores = {'rouge1': [], 'rouge2': [], 'rougeL': []}
for pred, ref in zip(predictions, references):
score = scorer.score(ref, pred)
for key in scores:
scores[key].append(score[key].fmeasure)
return {
'rouge1': round(sum(scores['rouge1']) / len(scores['rouge1']) * 100, 2),
'rouge2': round(sum(scores['rouge2']) / len(scores['rouge2']) * 100, 2),
'rougeL': round(sum(scores['rougeL']) / len(scores['rougeL']) * 100, 2)
}
# Example evaluation
preds, refs = load_model_outputs()
results = evaluate_summaries(preds, refs)
print(f"ROUGE-1: {results['rouge1']}")
print(f"ROUGE-2: {results['rouge2']}")
print(f"ROUGE-L: {results['rougeL']}")
# Expected HMNet output: 36.51/11.41/31.60
```
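rouge_score computes ROUGE-L over the whole text; many reported R-L numbers instead use summary-level LCS, which the package exposes as `rougeLsum` and which expects sentences separated by newlines. A sketch with crude regex sentence splitting:
```python
import re
from rouge_score import rouge_scorer

def to_sentences(text):
    # rougeLsum expects one sentence per line; this regex split is a rough stand-in
    return '\n'.join(re.split(r'(?<=[.!?]) +', text))

scorer = rouge_scorer.RougeScorer(['rougeLsum'], use_stemmer=True)
preds, refs = load_model_outputs()
score = scorer.score(to_sentences(refs[0]), to_sentences(preds[0]))
print(round(score['rougeLsum'].fmeasure * 100, 2))
```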
## Summary
QMSum serves as a comprehensive benchmark for developing and evaluating query-based meeting summarization systems. The dataset's unique features include multi-domain coverage (academic, product design, and parliamentary meetings), hierarchical topic segmentation with text span annotations, and both general and specific query types. Researchers can use QMSum to train and evaluate various approaches including extractive span localization (Locator models), abstractive summarization (BART, PGNet), and hierarchical meeting understanding (HMNet).
The dataset supports multiple research paradigms: full-meeting summarization using all transcript content, gold-span summarization using annotated relevant text spans, and pipeline approaches combining span extraction with abstractive generation. With 1,808 query-summary pairs and detailed annotations, QMSum enables rigorous evaluation of models' abilities to understand long-form dialogue, identify relevant information, and generate coherent query-focused summaries. The provided data processing utilities facilitate integration with popular sequence-to-sequence frameworks and support reproducible research in meeting summarization.