https://github.com/openai/whisper
# Whisper

Whisper is OpenAI's general-purpose speech recognition model built on a Transformer sequence-to-sequence architecture. It is trained on a large dataset of diverse audio and performs multilingual speech recognition, speech translation to English, spoken language identification, and voice activity detection. The model processes audio in 30-second windows, converting speech to text with optional word-level timestamps.

The library provides both a Python API and a command-line interface for transcribing audio files. It supports six model sizes (tiny, base, small, medium, large, turbo) with varying speed/accuracy tradeoffs, and includes English-only variants for improved performance on English content. The model automatically detects the spoken language when not specified and can output transcriptions in multiple formats, including plain text, VTT, SRT, TSV, and JSON.

## Load a Whisper Model

The `load_model` function downloads and initializes a Whisper ASR model. It automatically selects CUDA if available, downloads model weights to a cache directory, and returns a model instance ready for inference.

```python
import whisper

# Load a model by name (downloads automatically if needed)
model = whisper.load_model("turbo")

# Load with specific device
model = whisper.load_model("medium", device="cuda")

# Load with custom download directory
model = whisper.load_model("small", download_root="/path/to/models")

# Load a custom model checkpoint
model = whisper.load_model("/path/to/custom_model.pt")

# List all available models
print(whisper.available_models())
# Output: ['tiny.en', 'tiny', 'base.en', 'base', 'small.en', 'small',
#          'medium.en', 'medium', 'large-v1', 'large-v2', 'large-v3',
#          'large', 'large-v3-turbo', 'turbo']
```

## Transcribe Audio Files

The `transcribe` function is the high-level API for converting audio files to text. It handles audio loading, language detection, and sliding-window processing automatically, returning a dictionary with the full text, timestamped segments, and detected language.

```python
import whisper

model = whisper.load_model("turbo")

# Basic transcription
result = model.transcribe("audio.mp3")
print(result["text"])
# Output: " The quick brown fox jumps over the lazy dog."

# Transcription with verbose output
result = model.transcribe("audio.mp3", verbose=True)
# Prints: [00:00.000 --> 00:02.500] The quick brown fox jumps over the lazy dog.

# Specify language explicitly
result = model.transcribe("japanese_audio.wav", language="Japanese")

# Enable word-level timestamps
result = model.transcribe("audio.mp3", word_timestamps=True)
for segment in result["segments"]:
    for word in segment["words"]:
        print(f"{word['word']}: {word['start']:.2f}s - {word['end']:.2f}s")
# Output:
# The: 0.00s - 0.20s
# quick: 0.20s - 0.45s
# brown: 0.45s - 0.70s

# Translate non-English speech to English (use non-turbo models)
model_medium = whisper.load_model("medium")
result = model_medium.transcribe(
    "german_audio.mp3",
    language="German",
    task="translate"
)

# Provide initial prompt for context (improves accuracy for domain-specific terms)
result = model.transcribe(
    "medical_lecture.mp3",
    initial_prompt="This is a medical lecture about cardiology and heart disease."
)

# Process specific clips within audio
result = model.transcribe(
    "long_audio.mp3",
    clip_timestamps="10.5,30.0,45.0,60.0"  # Process 10.5-30s and 45-60s
)

# Skip hallucinations in silent regions
result = model.transcribe(
    "audio_with_silence.mp3",
    word_timestamps=True,
    hallucination_silence_threshold=2.0  # Skip silent periods > 2 seconds
)

# Access full transcription result structure
print(result.keys())
# Output: dict_keys(['text', 'segments', 'language'])
print(result["language"])
# Output: 'en'

# Each segment contains timing and metadata
segment = result["segments"][0]
print(segment.keys())
# Output: dict_keys(['id', 'seek', 'start', 'end', 'text', 'tokens',
#                    'temperature', 'avg_logprob', 'compression_ratio',
#                    'no_speech_prob', 'words'])
```

## Load and Process Audio

The `load_audio` function reads audio files using ffmpeg, converting them to a 16 kHz mono waveform. The `pad_or_trim` function adjusts audio length to exactly 30 seconds (480,000 samples) as required by the encoder, and `log_mel_spectrogram` converts waveforms to mel spectrograms for model input.

```python
import whisper
import numpy as np

# Load audio file (requires ffmpeg in PATH)
audio = whisper.load_audio("audio.mp3")
print(type(audio), audio.shape, audio.dtype)
# Output: <class 'numpy.ndarray'> (48000,) float32

# Load with a custom sample rate (kept in a separate variable so the
# 16 kHz waveform used below is unaffected)
audio_8k = whisper.load_audio("audio.mp3", sr=8000)

# Pad or trim to 30 seconds (480,000 samples at 16kHz)
audio = whisper.pad_or_trim(audio)
print(audio.shape)
# Output: (480000,)

# Compute log-Mel spectrogram
mel = whisper.log_mel_spectrogram(audio)
print(mel.shape)
# Output: torch.Size([80, 3000])

# Or compute directly from file path
mel = whisper.log_mel_spectrogram("audio.mp3")

# Use 128 mel filters (for large-v3 and turbo models)
model = whisper.load_model("turbo")
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels)
print(mel.shape)
# Output: torch.Size([128, 3000])

# Move to model device for inference
mel = mel.to(model.device)
```

## Detect Language

The `detect_language` function identifies the spoken language in audio by analyzing the first 30 seconds. It returns probability distributions over all supported languages, enabling language-conditional processing.

```python
import whisper

model = whisper.load_model("turbo")

# Load and prepare audio
audio = whisper.load_audio("unknown_language.mp3")
audio = whisper.pad_or_trim(audio)

# Create mel spectrogram and move to model device
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

# Detect language
_, probs = model.detect_language(mel)

# Get the most probable language
detected_language = max(probs, key=probs.get)
print(f"Detected language: {detected_language}")
# Output: Detected language: en

# View top language probabilities
sorted_probs = sorted(probs.items(), key=lambda x: x[1], reverse=True)
for lang, prob in sorted_probs[:5]:
    print(f"{lang}: {prob:.4f}")
# Output:
# en: 0.9823
# de: 0.0089
# fr: 0.0045
# es: 0.0021
# it: 0.0008

# All supported languages
from whisper.tokenizer import LANGUAGES
print(list(LANGUAGES.keys())[:10])
# Output: ['en', 'zh', 'de', 'es', 'ru', 'ko', 'fr', 'ja', 'pt', 'tr']
```

## Decode Audio Segments

The `decode` function provides low-level access to transcribe 30-second mel spectrogram segments with fine-grained control over decoding parameters. It returns a `DecodingResult` with text, tokens, and probability metrics.
```python
import whisper
from whisper import DecodingOptions

model = whisper.load_model("turbo")

# Prepare audio
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

# Basic decoding with default options
options = DecodingOptions()
result = whisper.decode(model, mel, options)
print(result.text)
# Output: " The quick brown fox jumps over the lazy dog."

# Greedy decoding (temperature=0)
options = DecodingOptions(temperature=0.0, language="en")
result = whisper.decode(model, mel, options)

# Beam search decoding
options = DecodingOptions(
    beam_size=5,
    patience=1.0,
    language="en"
)
result = whisper.decode(model, mel, options)

# Sampling with temperature
options = DecodingOptions(
    temperature=0.4,
    best_of=5,  # Sample 5 candidates, return best
    language="en"
)
result = whisper.decode(model, mel, options)

# Translation to English
options = DecodingOptions(
    task="translate",
    language="de"  # Source language is German
)
result = whisper.decode(model, mel, options)

# Decode without timestamps
options = DecodingOptions(without_timestamps=True)
result = whisper.decode(model, mel, options)

# Provide prompt context
options = DecodingOptions(
    prompt="The following is a conversation about artificial intelligence."
)
result = whisper.decode(model, mel, options)

# Access detailed decoding results
print(f"Text: {result.text}")
print(f"Language: {result.language}")
print(f"Avg log probability: {result.avg_logprob:.4f}")
print(f"No speech probability: {result.no_speech_prob:.4f}")
print(f"Compression ratio: {result.compression_ratio:.4f}")
print(f"Temperature used: {result.temperature}")
print(f"Token count: {len(result.tokens)}")

# DecodingOptions dataclass fields
print(DecodingOptions.__dataclass_fields__.keys())
# Output: dict_keys(['task', 'language', 'temperature', 'sample_len',
#                    'best_of', 'beam_size', 'patience', 'length_penalty',
#                    'prompt', 'prefix', 'suppress_tokens', 'suppress_blank',
#                    'without_timestamps', 'max_initial_timestamp', 'fp16'])
```

## Command-Line Interface

Whisper provides a CLI for transcribing audio files directly from the terminal. It supports multiple input files, various output formats, and all model configuration options.
```bash
# Basic transcription with turbo model (default)
whisper audio.mp3

# Specify a different model
whisper audio.mp3 --model medium

# Process multiple files
whisper audio1.mp3 audio2.wav audio3.flac --model base

# Specify language for faster processing
whisper japanese.wav --language Japanese

# Translate to English (don't use turbo for translation)
whisper german.mp3 --model medium --language German --task translate

# Choose output format
whisper audio.mp3 --output_format srt
whisper audio.mp3 --output_format vtt
whisper audio.mp3 --output_format txt
whisper audio.mp3 --output_format json
whisper audio.mp3 --output_format tsv
whisper audio.mp3 --output_format all  # Generate all formats

# Specify output directory
whisper audio.mp3 --output_dir ./transcripts

# Enable word-level timestamps
whisper audio.mp3 --word_timestamps True

# Word timestamps with highlighting in subtitles
whisper audio.mp3 --word_timestamps True --highlight_words True --output_format srt

# Control subtitle formatting
whisper audio.mp3 --word_timestamps True --max_line_width 42 --max_line_count 2

# Provide initial prompt for context
whisper medical.mp3 --initial_prompt "Medical terminology: myocardial infarction, arrhythmia"

# Adjust decoding parameters
whisper audio.mp3 --temperature 0.2 --beam_size 5
whisper audio.mp3 --best_of 5  # For temperature > 0

# Control quality thresholds
whisper audio.mp3 --compression_ratio_threshold 2.4 --logprob_threshold -1.0

# Process specific time clips
whisper long_audio.mp3 --clip_timestamps "0,60,120,180"

# Use specific device
whisper audio.mp3 --device cuda
whisper audio.mp3 --device cpu

# Control threading for CPU inference
whisper audio.mp3 --device cpu --threads 4

# Skip hallucinations in silence
whisper audio.mp3 --word_timestamps True --hallucination_silence_threshold 2.0

# Disable verbose output
whisper audio.mp3 --verbose False

# Use FP32 instead of FP16 (slower but may be needed on some hardware)
whisper audio.mp3 --fp16 False

# View all options
whisper --help
```

## Tokenizer and Text Processing

The `Tokenizer` class wraps tiktoken for encoding and decoding text with Whisper's vocabulary. It handles special tokens for timestamps, language codes, and task specifiers, and provides word-level token splitting for timestamp alignment.

```python
from whisper.tokenizer import get_tokenizer, LANGUAGES, TO_LANGUAGE_CODE

# Get tokenizer for multilingual model
tokenizer = get_tokenizer(multilingual=True, language="en", task="transcribe")

# Encode text to tokens
tokens = tokenizer.encode("Hello, world!")
print(tokens)
# Output: [15947, 11, 1002, 0]

# Decode tokens back to text
text = tokenizer.decode(tokens)
print(text)
# Output: 'Hello, world!'

# Access special tokens
print(f"End of text token: {tokenizer.eot}")
print(f"Start of transcript: {tokenizer.sot}")
print(f"Transcribe token: {tokenizer.transcribe}")
print(f"Translate token: {tokenizer.translate}")
print(f"No speech token: {tokenizer.no_speech}")
print(f"Timestamp begin: {tokenizer.timestamp_begin}")

# Start-of-transcript sequence for current task
print(f"SOT sequence: {tokenizer.sot_sequence}")
# Output: SOT sequence: (50258, 50259, 50359)

# Get tokenizer for English-only model
tokenizer_en = get_tokenizer(multilingual=False)

# Available languages
print(f"Number of languages: {len(LANGUAGES)}")
print(list(LANGUAGES.items())[:5])
# Output: [('en', 'english'), ('zh', 'chinese'), ('de', 'german'),
#          ('es', 'spanish'), ('ru', 'russian')]

# Language code lookup
print(TO_LANGUAGE_CODE["german"])
# Output: 'de'
print(TO_LANGUAGE_CODE["mandarin"])
# Output: 'zh'

# Split tokens into words (for word timestamps)
tokens = tokenizer.encode(" The quick brown fox")
words, word_tokens = tokenizer.split_to_word_tokens(tokens)
print(words)
# Output: [' The', ' quick', ' brown', ' fox']
print(word_tokens)
# Output: [[440], [2068], [3147], [21831]]

# Decode with timestamps preserved
tokens_with_ts = [50364, 15947, 50414]  # <|0.00|> Hello <|1.00|>
text_with_ts = tokenizer.decode_with_timestamps(tokens_with_ts)
print(text_with_ts)
# Output: '<|0.00|>Hello<|1.00|>'

# Get non-speech tokens to suppress
non_speech = tokenizer.non_speech_tokens
print(f"Non-speech token count: {len(non_speech)}")
```

## Output Writers

The utils module provides writer classes for generating transcription outputs in multiple formats, including TXT, VTT (WebVTT), SRT (SubRip), TSV, and JSON.

```python
from whisper.utils import get_writer, WriteTXT, WriteVTT, WriteSRT, WriteJSON, WriteTSV
import whisper

model = whisper.load_model("turbo")
result = model.transcribe("audio.mp3", word_timestamps=True)

# Get a specific format writer
txt_writer = get_writer("txt", output_dir="./output")
txt_writer(result, "audio.mp3")
# Creates: ./output/audio.txt

# Generate VTT subtitles
vtt_writer = get_writer("vtt", output_dir="./output")
vtt_writer(result, "audio.mp3")
# Creates: ./output/audio.vtt

# Generate SRT subtitles with word highlighting
srt_writer = get_writer("srt", output_dir="./output")
srt_writer(result, "audio.mp3", highlight_words=True)
# Creates: ./output/audio.srt

# Control subtitle line formatting
srt_writer(
    result,
    "audio.mp3",
    max_line_width=42,
    max_line_count=2,
    max_words_per_line=8
)

# Generate JSON output
json_writer = get_writer("json", output_dir="./output")
json_writer(result, "audio.mp3")
# Creates: ./output/audio.json

# Generate TSV output (tab-separated with millisecond timestamps)
tsv_writer = get_writer("tsv", output_dir="./output")
tsv_writer(result, "audio.mp3")
# Creates: ./output/audio.tsv

# Generate all formats at once
all_writer = get_writer("all", output_dir="./output")
all_writer(result, "audio.mp3")
# Creates: audio.txt, audio.vtt, audio.srt, audio.tsv, audio.json

# Utility function for timestamp formatting
from whisper.utils import format_timestamp
print(format_timestamp(125.5))
# Output: '02:05.500'
print(format_timestamp(125.5, always_include_hours=True))
# Output: '00:02:05.500'
print(format_timestamp(125.5, decimal_marker=","))
# Output: '02:05,500'
```

## Model Architecture and Configuration

The `Whisper` model class implements an encoder-decoder Transformer architecture.
The `ModelDimensions` dataclass defines model hyperparameters, and the model provides methods for encoding audio and generating text.

```python
import whisper
import torch

model = whisper.load_model("turbo")

# Access model dimensions
dims = model.dims
print(f"Mel filters: {dims.n_mels}")
print(f"Audio context: {dims.n_audio_ctx}")
print(f"Audio state: {dims.n_audio_state}")
print(f"Audio heads: {dims.n_audio_head}")
print(f"Audio layers: {dims.n_audio_layer}")
print(f"Vocabulary size: {dims.n_vocab}")
print(f"Text context: {dims.n_text_ctx}")
print(f"Text state: {dims.n_text_state}")
print(f"Text heads: {dims.n_text_head}")
print(f"Text layers: {dims.n_text_layer}")

# Check model properties
print(f"Device: {model.device}")
print(f"Is multilingual: {model.is_multilingual}")
print(f"Number of languages: {model.num_languages}")

# Encode audio directly
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio, n_mels=dims.n_mels)
mel = mel.unsqueeze(0).to(model.device)  # Add batch dimension

# Get audio features from encoder
audio_features = model.encoder(mel.half())
print(f"Audio features shape: {audio_features.shape}")
# Output: torch.Size([1, 1500, 1280])

# Forward pass through full model
tokens = torch.tensor([[50258, 50259, 50359]]).to(model.device)  # SOT sequence
logits = model.decoder(tokens, audio_features)
print(f"Logits shape: {logits.shape}")
# Output: torch.Size([1, 3, vocab_size])

# Get logits for next token prediction
logits = model.logits(tokens, audio_features)

# Model sizes and key dimensions (reference table, showing a subset of fields)
MODEL_DIMS = {
    "tiny":   {"n_audio_layer": 4,  "n_text_layer": 4,  "n_audio_state": 384},
    "base":   {"n_audio_layer": 6,  "n_text_layer": 6,  "n_audio_state": 512},
    "small":  {"n_audio_layer": 12, "n_text_layer": 12, "n_audio_state": 768},
    "medium": {"n_audio_layer": 24, "n_text_layer": 24, "n_audio_state": 1024},
    "large":  {"n_audio_layer": 32, "n_text_layer": 32, "n_audio_state": 1280},
    "turbo":  {"n_audio_layer": 32, "n_text_layer": 4,  "n_audio_state": 1280},
}
```

## Summary

Whisper is ideal for building speech-to-text applications including transcription services, subtitle generation, voice assistants, meeting-note automation, and multilingual content processing. The high-level `transcribe()` API handles most use cases with automatic language detection and sliding-window processing, while the low-level `decode()` function enables custom decoding strategies for specialized applications.

Integration patterns typically involve loading a model once at startup, then calling `transcribe()` for each audio file or stream. For real-time applications, audio can be chunked into 30-second segments and processed with `decode()` directly, as in the sketch below. The library integrates well with audio processing pipelines through numpy arrays and torch tensors, and supports multiple output formats for downstream subtitle rendering or text analysis systems.
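As a minimal sketch of that chunked pattern, assuming a long pre-recorded file (the file name `long_recording.mp3` and the fixed chunking strategy are illustrative, not library conventions): naive fixed-size chunks can split words at boundaries, which `transcribe()` avoids with its sliding window, so this approach fits only when per-chunk control is actually needed.

```python
import whisper

# Illustrative chunked decoding: split a long recording into 30-second
# windows and decode each one with the low-level API.
CHUNK_SAMPLES = 30 * 16000  # 30 s at Whisper's 16 kHz input rate

model = whisper.load_model("turbo")
audio = whisper.load_audio("long_recording.mp3")  # hypothetical file

texts = []
for start in range(0, len(audio), CHUNK_SAMPLES):
    # Each chunk is padded/trimmed to exactly 30 s, as the encoder requires
    chunk = whisper.pad_or_trim(audio[start:start + CHUNK_SAMPLES])
    mel = whisper.log_mel_spectrogram(chunk, n_mels=model.dims.n_mels).to(model.device)
    result = whisper.decode(model, mel, whisper.DecodingOptions(language="en"))
    texts.append(result.text)

print(" ".join(texts))
```

A streaming variant would buffer incoming samples until a full 30-second window accumulates and then decode it the same way; the loop body is unchanged.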