# Whisper

https://github.com/openai/whisper

Whisper is OpenAI's general-purpose speech recognition model built on a Transformer sequence-to-sequence architecture. It is trained on 680,000 hours of multilingual and multitask supervised data collected from the web, enabling robust speech recognition across 99 languages. The model can perform multilingual speech recognition, speech translation (any language to English), spoken language identification, and voice activity detection through a unified token-based approach. The library provides both a Python API and a command-line interface for transcribing audio files.

Whisper offers six model sizes (tiny, base, small, medium, large, turbo), with English-only variants available for the smaller models, allowing users to balance accuracy against speed and resource requirements. The turbo model is an optimized version of large-v3 with 8x faster inference while maintaining high accuracy for transcription tasks.

## Load Model

The `load_model` function downloads and initializes a Whisper ASR model. It automatically selects the appropriate device (CUDA if available, otherwise CPU) and caches downloaded model weights in `~/.cache/whisper`. Available models include `tiny`, `tiny.en`, `base`, `base.en`, `small`, `small.en`, `medium`, `medium.en`, `large`, `large-v1`, `large-v2`, `large-v3`, and `turbo`.

```python
import whisper

# Load the turbo model (fastest, recommended for most use cases)
model = whisper.load_model("turbo")

# Load a specific model with custom device
model = whisper.load_model("large-v3", device="cuda")

# Load English-only model for better performance on English audio
model = whisper.load_model("base.en")

# Load model with custom download directory
model = whisper.load_model("medium", download_root="/path/to/models")

# Load from a local checkpoint file
model = whisper.load_model("/path/to/custom_model.pt")

# List all available models
print(whisper.available_models())
# Output: ['tiny.en', 'tiny', 'base.en', 'base', 'small.en', 'small',
#          'medium.en', 'medium', 'large-v1', 'large-v2', 'large-v3',
#          'large', 'large-v3-turbo', 'turbo']
```

## Transcribe Audio

The `transcribe` function processes an audio file using a sliding 30-second window and performs autoregressive sequence-to-sequence predictions on each window. It returns a dictionary containing the full transcription text, detailed segments with timestamps, and the detected language. The function supports various parameters for controlling temperature-based sampling, compression-ratio thresholds, and word-level timestamp extraction.

```python
import whisper

model = whisper.load_model("turbo")

# Basic transcription
result = model.transcribe("audio.mp3")
print(result["text"])
# Output: " The quick brown fox jumps over the lazy dog."

# Access detailed segments with timestamps
for segment in result["segments"]:
    print(f"[{segment['start']:.2f}s -> {segment['end']:.2f}s] {segment['text']}")
# Output: [0.00s -> 2.50s] The quick brown fox
#         [2.50s -> 4.80s] jumps over the lazy dog.
# Transcription with word-level timestamps
result = model.transcribe("audio.mp3", word_timestamps=True)
for segment in result["segments"]:
    for word in segment["words"]:
        print(f"{word['word']} ({word['start']:.2f}s - {word['end']:.2f}s)")

# Specify language explicitly (skip auto-detection)
result = model.transcribe("german_audio.mp3", language="de")
print(f"Detected language: {result['language']}")

# Translation to English
result = model.transcribe("japanese_audio.mp3", task="translate")
print(result["text"])  # English translation

# Use initial prompt for context (helps with proper nouns, technical terms)
result = model.transcribe(
    "meeting.mp3",
    initial_prompt="Meeting transcript for Acme Corp discussing the Q4 roadmap."
)

# Transcribe specific time ranges
result = model.transcribe(
    "long_audio.mp3",
    clip_timestamps="30.0,60.0,120.0,180.0"  # Process 30-60s and 120-180s
)

# Fine-tune decoding parameters
result = model.transcribe(
    "audio.mp3",
    temperature=0.0,                  # Greedy decoding (deterministic)
    beam_size=5,                      # Beam search width
    best_of=5,                        # Number of candidates for sampling
    compression_ratio_threshold=2.4,  # Retry if too repetitive
    logprob_threshold=-1.0,           # Retry if confidence too low
    no_speech_threshold=0.6,          # Silence detection threshold
    condition_on_previous_text=True,  # Use context across windows
    fp16=True,                        # Use half-precision (faster on GPU)
    verbose=True                      # Print progress
)
```

## Load Audio

The `load_audio` function reads an audio file and converts it to a mono waveform at a 16 kHz sample rate, the format expected by Whisper models. It uses ffmpeg under the hood to handle various audio formats, including MP3, WAV, FLAC, M4A, and video files.

```python
import whisper
import numpy as np

# Load audio file as numpy array
audio = whisper.load_audio("speech.mp3")
print(f"Audio shape: {audio.shape}")  # (num_samples,)
print(f"Duration: {len(audio) / 16000:.2f} seconds")

# Load with custom sample rate (not recommended; Whisper expects 16kHz)
audio = whisper.load_audio("speech.wav", sr=16000)

# Audio is returned as float32 normalized to [-1, 1]
print(f"Dtype: {audio.dtype}")  # float32
print(f"Range: [{audio.min():.3f}, {audio.max():.3f}]")
```

## Pad or Trim Audio

The `pad_or_trim` function adjusts audio arrays or tensors to exactly 30 seconds (480,000 samples at 16 kHz), the input size expected by the Whisper encoder. It pads shorter audio with silence and trims longer audio.

```python
import whisper
import torch

# Load and prepare audio for the model
audio = whisper.load_audio("short_clip.mp3")
print(f"Original length: {len(audio)} samples")

# Pad or trim to 30 seconds
audio = whisper.pad_or_trim(audio)
print(f"Adjusted length: {len(audio)} samples")  # 480000

# Works with torch tensors too
audio_tensor = torch.from_numpy(audio)
audio_tensor = whisper.pad_or_trim(audio_tensor)
print(f"Tensor shape: {audio_tensor.shape}")  # torch.Size([480000])

# Pad/trim mel spectrograms along the frame axis
mel = whisper.log_mel_spectrogram(audio)
mel = whisper.pad_or_trim(mel, 3000)  # 3000 frames for 30 seconds
print(f"Mel shape: {mel.shape}")  # torch.Size([80, 3000]) or torch.Size([128, 3000])
```

## Log-Mel Spectrogram

The `log_mel_spectrogram` function computes the log-Mel spectrogram representation of audio, the input format for the Whisper encoder. It applies an STFT with a Hann window, projects to the mel scale using precomputed filter banks, and applies log scaling with clamping.
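Each 30-second window thus corresponds to 3,000 mel frames. A quick sanity check of that geometry, sketched against the constants defined in `whisper.audio` (their names and values are taken from the library source at the time of writing):

```python
from whisper.audio import CHUNK_LENGTH, HOP_LENGTH, N_FRAMES, SAMPLE_RATE

# One window: 30 s of 16 kHz audio -> 480,000 samples
n_samples = CHUNK_LENGTH * SAMPLE_RATE
print(n_samples)  # 480000

# A 160-sample STFT hop yields 100 frames per second,
# so each window produces 3000 mel frames
print(n_samples // HOP_LENGTH)  # 3000
print(N_FRAMES)                 # 3000
```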
```python
import whisper
import torch

model = whisper.load_model("turbo")

# Compute mel spectrogram from audio file path
mel = whisper.log_mel_spectrogram("audio.mp3")
print(f"Mel shape: {mel.shape}")  # torch.Size([80, num_frames])

# Compute from numpy array
audio = whisper.load_audio("audio.mp3")
mel = whisper.log_mel_spectrogram(audio)

# Specify number of mel bands (80 for most models, 128 for large-v3/turbo)
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels)
print(f"N_mels: {model.dims.n_mels}")  # 128 for turbo, 80 for older models

# Move to specific device
mel = whisper.log_mel_spectrogram(audio, device="cuda")
print(f"Device: {mel.device}")  # cuda:0

# Add padding for processing in sliding windows
mel = whisper.log_mel_spectrogram(audio, padding=480000)  # Pad 30 seconds
```

## Detect Language

The `detect_language` method identifies the spoken language in the audio by analyzing the first 30 seconds. It returns the detected language token and a dictionary of probabilities for all supported languages. This is useful when processing multilingual content or validating language assumptions.

```python
import whisper

model = whisper.load_model("turbo")

# Load and prepare audio
audio = whisper.load_audio("multilingual_audio.mp3")
audio = whisper.pad_or_trim(audio)

# Compute mel spectrogram
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

# Detect language
_, probs = model.detect_language(mel)

# Get the most likely language
detected_lang = max(probs, key=probs.get)
confidence = probs[detected_lang]
print(f"Detected: {detected_lang} (confidence: {confidence:.2%})")
# Output: Detected: en (confidence: 98.45%)

# Show top 5 language probabilities
sorted_probs = sorted(probs.items(), key=lambda x: x[1], reverse=True)[:5]
for lang, prob in sorted_probs:
    print(f"  {lang}: {prob:.2%}")
# Output: en: 98.45%
#         de: 0.82%
#         nl: 0.31%
#         fr: 0.15%
#         es: 0.09%
```

## Decode Audio

The `decode` function provides low-level access to the decoder for processing 30-second mel spectrogram segments. It returns a `DecodingResult` containing the transcribed text, tokens, language, and quality metrics such as average log probability and compression ratio. Use `DecodingOptions` to configure decoding behavior.
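These quality metrics drive the temperature-fallback behavior of `transcribe`: when a decode looks repetitive or low-confidence, it is retried at a higher temperature. Before the full example below, here is a minimal sketch of a similar retry loop (the thresholds mirror the `transcribe` defaults listed earlier; this is an illustration, not the library's exact implementation):

```python
import whisper

model = whisper.load_model("turbo")
audio = whisper.pad_or_trim(whisper.load_audio("audio.mp3"))
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

# Escalate the sampling temperature while the quality metrics look poor,
# similar in spirit to the fallback inside model.transcribe()
result = None
for temperature in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
    options = whisper.DecodingOptions(
        temperature=temperature,
        beam_size=5 if temperature == 0 else None,  # beam search only at t=0
        fp16=(model.device.type == "cuda"),         # half precision on GPU only
    )
    result = whisper.decode(model, mel, options)
    if result.compression_ratio <= 2.4 and result.avg_logprob >= -1.0:
        break  # decoding looks reliable; stop escalating

print(result.text)
```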
```python
import whisper

model = whisper.load_model("turbo")

# Prepare a 30-second audio segment
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

# Basic decoding with default options
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)
print(f"Text: {result.text}")
print(f"Language: {result.language}")
print(f"Tokens: {result.tokens[:10]}...")  # First 10 token IDs
print(f"Avg log prob: {result.avg_logprob:.3f}")
print(f"No speech prob: {result.no_speech_prob:.3f}")
print(f"Compression ratio: {result.compression_ratio:.2f}")

# Decoding with custom options
options = whisper.DecodingOptions(
    language="en",                   # Force English
    task="transcribe",               # or "translate" for X->English
    temperature=0.0,                 # Greedy decoding
    beam_size=5,                     # Use beam search
    best_of=None,                    # Mutually exclusive with beam_size
    fp16=True,                       # Half precision
    without_timestamps=False,        # Include timestamp tokens
    max_initial_timestamp=1.0,       # Max offset for first timestamp
    suppress_tokens="-1",            # Suppress non-speech tokens
    suppress_blank=True,             # Suppress blank at start
    prompt="Previous context here",  # Provide context
    prefix="The speaker said:",      # Force output prefix
)
result = whisper.decode(model, mel, options)

# Batch decoding (multiple segments at once)
mel_batch = mel.unsqueeze(0).repeat(4, 1, 1)  # 4 copies
results = whisper.decode(model, mel_batch, options)
for i, r in enumerate(results):
    print(f"Segment {i}: {r.text}")
```

## Command-Line Interface

Whisper provides a command-line tool for transcribing audio files directly from the terminal. It supports multiple input files, several output formats (txt, vtt, srt, tsv, json), and the same options available in the Python API.

```bash
# Basic transcription (uses the turbo model by default)
whisper audio.mp3

# Transcribe multiple files
whisper audio1.mp3 audio2.wav audio3.flac

# Specify model and output format
whisper audio.mp3 --model large-v3 --output_format srt

# Transcribe non-English audio
whisper japanese.mp3 --language Japanese

# Translate to English
whisper french.mp3 --model medium --language French --task translate

# Generate all output formats
whisper audio.mp3 --output_format all --output_dir ./transcripts/

# Enable word-level timestamps
whisper audio.mp3 --word_timestamps True

# Generate highlighted subtitles (word-by-word highlighting)
whisper audio.mp3 --word_timestamps True --highlight_words True

# Control subtitle formatting
whisper audio.mp3 --word_timestamps True \
    --max_line_width 42 \
    --max_line_count 2

# Use GPU with specific device
whisper audio.mp3 --device cuda

# Use CPU with specific thread count
whisper audio.mp3 --device cpu --threads 4

# Custom decoding parameters
whisper audio.mp3 \
    --temperature 0 \
    --beam_size 5 \
    --best_of 5 \
    --compression_ratio_threshold 2.4 \
    --logprob_threshold -1.0 \
    --no_speech_threshold 0.6

# Provide initial prompt for better accuracy
whisper meeting.mp3 --initial_prompt "Meeting about Project Alpha with CEO John Smith"

# Process specific time clips
whisper podcast.mp3 --clip_timestamps "0,300,600,900"

# Reduce hallucinations in silent sections
whisper audio.mp3 --word_timestamps True --hallucination_silence_threshold 2.0

# View all options
whisper --help
```

## Output Writers

Whisper includes built-in writers for exporting transcription results to various formats: plain text, VTT (WebVTT), SRT (SubRip), TSV (tab-separated values), and JSON.
The `get_writer` function returns the appropriate writer class for the specified format.

```python
import whisper
from whisper.utils import get_writer

model = whisper.load_model("turbo")
result = model.transcribe("audio.mp3", word_timestamps=True)

# Get a writer for a specific format
writer = get_writer("srt", output_dir="./output")
writer(result, "audio.mp3")
# Creates: ./output/audio.srt

# Write all formats at once
writer = get_writer("all", output_dir="./output")
writer(result, "audio.mp3")
# Creates: audio.txt, audio.vtt, audio.srt, audio.tsv, audio.json

# Use word-level highlighting in VTT/SRT
writer = get_writer("vtt", output_dir="./output")
writer(result, "audio.mp3", highlight_words=True)

# Control subtitle line formatting
writer = get_writer("srt", output_dir="./output")
writer(
    result,
    "audio.mp3",
    max_line_width=42,    # Characters per line
    max_line_count=2,     # Lines per subtitle
    max_words_per_line=8  # Words per line (alternative to width)
)

# Direct file writing with WriteSRT
from whisper.utils import WriteSRT
import io

srt_writer = WriteSRT(output_dir=".")
with io.StringIO() as f:
    srt_writer.write_result(result, file=f)
    srt_content = f.getvalue()
print(srt_content[:500])
```

## Tokenizer

The `Tokenizer` class provides text encoding/decoding using tiktoken, with special handling for Whisper's task-specific tokens (language codes, timestamps, and transcribe/translate markers). It supports 99 languages and includes methods for word-level token splitting.

```python
from whisper.tokenizer import get_tokenizer, LANGUAGES

# Get tokenizer for a multilingual model
tokenizer = get_tokenizer(multilingual=True, language="en", task="transcribe")

# Encode text to tokens
tokens = tokenizer.encode("Hello, world!")
print(f"Tokens: {tokens}")  # [15947, 11, 1002, 0]

# Decode tokens back to text
text = tokenizer.decode(tokens)
print(f"Text: {text}")  # "Hello, world!"

# Access special tokens
print(f"SOT (start of transcript): {tokenizer.sot}")
print(f"EOT (end of transcript): {tokenizer.eot}")
print(f"Timestamp begin: {tokenizer.timestamp_begin}")
print(f"No timestamps token: {tokenizer.no_timestamps}")
print(f"Transcribe token: {tokenizer.transcribe}")
print(f"Translate token: {tokenizer.translate}")

# Get SOT sequence for a task
print(f"SOT sequence: {tokenizer.sot_sequence}")
# (50258, 50259, 50359) for English transcription

# Decode with timestamps
tokens_with_timestamps = [50364, 15947, 11, 1002, 50414]  # Example
text = tokenizer.decode_with_timestamps(tokens_with_timestamps)
print(text)  # "<|0.00|>Hello, world!<|1.00|>"

# Get all supported languages
print(f"Supported languages: {len(LANGUAGES)}")  # 99
print(f"Language codes: {list(LANGUAGES.keys())[:10]}")
# ['en', 'zh', 'de', 'es', 'ru', 'ko', 'fr', 'ja', 'pt', 'tr']

# Get language token
lang_token = tokenizer.to_language_token("fr")
print(f"French token: {lang_token}")

# Split tokens into words
tokens = tokenizer.encode(" The quick brown fox")
words, word_tokens = tokenizer.split_to_word_tokens(tokens)
for word, toks in zip(words, word_tokens):
    print(f"'{word}' -> {toks}")
```

## Model Architecture

The Whisper model consists of an `AudioEncoder` (CNN + Transformer) and a `TextDecoder` (Transformer). The `Whisper` class combines these components and provides methods for embedding audio, generating logits, and managing KV-caching for efficient autoregressive decoding.
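The encoder and decoder are also exposed directly as `model.encoder` and `model.decoder`; the walkthrough below uses the higher-level helpers (`embed_audio`, `logits`), but calling the modules yourself is equivalent. A minimal sketch (token ID 50258 is the multilingual `<|startoftranscript|>` token, as in the SOT sequence shown in the Tokenizer section):

```python
import torch
import whisper

model = whisper.load_model("base", device="cpu")  # CPU keeps everything float32

audio = whisper.pad_or_trim(whisper.load_audio("audio.mp3"))
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).unsqueeze(0)

with torch.no_grad():
    features = model.encoder(mel)             # same result as model.embed_audio(mel)
    tokens = torch.tensor([[50258]])          # <|startoftranscript|>
    logits = model.decoder(tokens, features)  # same result as model.logits(...)

print(features.shape, logits.shape)
```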
```python
import whisper
import torch

model = whisper.load_model("turbo")

# Inspect model dimensions
dims = model.dims
print(f"Mel bands: {dims.n_mels}")           # 128 (turbo) or 80 (others)
print(f"Audio context: {dims.n_audio_ctx}")  # 1500 frames
print(f"Audio state: {dims.n_audio_state}")  # Hidden dimension
print(f"Audio heads: {dims.n_audio_head}")   # Attention heads
print(f"Audio layers: {dims.n_audio_layer}") # Encoder layers
print(f"Vocab size: {dims.n_vocab}")         # ~51865 for multilingual
print(f"Text context: {dims.n_text_ctx}")    # 448 tokens
print(f"Text state: {dims.n_text_state}")    # Hidden dimension
print(f"Text heads: {dims.n_text_head}")     # Attention heads
print(f"Text layers: {dims.n_text_layer}")   # Decoder layers

# Check model properties
print(f"Device: {model.device}")
print(f"Multilingual: {model.is_multilingual}")
print(f"Num languages: {model.num_languages}")

# Encode audio to features
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio, n_mels=dims.n_mels)
mel = mel.unsqueeze(0).to(model.device).half()

# Get audio embeddings from encoder
audio_features = model.embed_audio(mel)
print(f"Audio features shape: {audio_features.shape}")
# torch.Size([1, 1500, audio_state])

# Forward pass through decoder to get logits
tokens = torch.tensor([[50258, 50259, 50359]]).to(model.device)  # SOT sequence
logits = model.logits(tokens, audio_features)
print(f"Logits shape: {logits.shape}")  # torch.Size([1, 3, vocab_size])

# Full forward pass (encoder + decoder)
output = model(mel, tokens)
print(f"Output shape: {output.shape}")  # Same as logits

# Install KV-cache hooks for efficient generation
cache, hooks = model.install_kv_cache_hooks()
# Use the cache during iterative decoding, then clean up:
for hook in hooks:
    hook.remove()
```

Whisper is well suited to applications requiring robust speech-to-text conversion, including transcription services, subtitle generation, voice assistants, meeting summarization, and accessibility tools. Its multilingual capabilities make it suitable for global applications, while the translation feature enables cross-language content processing. The model handles diverse acoustic conditions, including background noise, accents, and technical vocabulary when properly prompted.

Integration patterns typically involve loading a model once and reusing it across transcriptions, leveraging GPU acceleration where available for throughput-sensitive applications. For real-time or streaming use cases, developers process audio in 30-second chunks and manage segment boundaries manually, as sketched below. The various output writers facilitate integration with video editing software, web players (WebVTT), and media workflows (SRT). Word-level timestamps enable advanced features such as karaoke-style highlighting, precise audio editing, and speaker diarization when combined with external tools.
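A minimal sketch of that chunked pattern, assuming fixed 30-second windows with no overlap or boundary stitching (a production pipeline would also carry context between windows; `transcribe_in_chunks` is an illustrative helper, not a library API):

```python
import whisper

model = whisper.load_model("turbo")

CHUNK = 30 * 16000  # 30 seconds at Whisper's 16 kHz sample rate

def transcribe_in_chunks(path: str) -> str:
    """Transcribe a long file one fixed 30-second window at a time."""
    audio = whisper.load_audio(path)  # float32 mono at 16 kHz
    texts = []
    for start in range(0, len(audio), CHUNK):
        window = whisper.pad_or_trim(audio[start:start + CHUNK])
        mel = whisper.log_mel_spectrogram(
            window, n_mels=model.dims.n_mels
        ).to(model.device)
        options = whisper.DecodingOptions(fp16=(model.device.type == "cuda"))
        texts.append(whisper.decode(model, mel, options).text)
    return " ".join(texts)

print(transcribe_in_chunks("long_audio.mp3"))
```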