# Whisper

https://github.com/openai/whisper

Whisper is OpenAI's general-purpose speech recognition model built on a Transformer sequence-to-sequence architecture. It is trained on 680,000 hours of multilingual and multitask supervised data collected from the web, enabling robust speech recognition across 99 languages. The model can perform multilingual speech recognition, speech translation (any language to English), spoken language identification, and voice activity detection through a unified token-based approach. The library provides both a Python API and a command-line interface for transcribing audio files.

Whisper offers six model sizes (tiny, base, small, medium, large, turbo), with English-only variants available for the smaller models, allowing users to balance accuracy against speed and resource requirements. The turbo model is an optimized version of large-v3 with 8x faster inference while maintaining high accuracy for transcription tasks.

## Load Model

The `load_model` function downloads and initializes a Whisper ASR model. It automatically selects the appropriate device (CUDA if available, otherwise CPU) and caches downloaded model weights in `~/.cache/whisper`. Available models include `tiny`, `tiny.en`, `base`, `base.en`, `small`, `small.en`, `medium`, `medium.en`, `large`, `large-v1`, `large-v2`, `large-v3`, and `turbo`.

```python
import whisper

# Load the turbo model (fastest, recommended for most use cases)
model = whisper.load_model("turbo")

# Load a specific model with custom device
model = whisper.load_model("large-v3", device="cuda")

# Load English-only model for better performance on English audio
model = whisper.load_model("base.en")

# Load model with custom download directory
model = whisper.load_model("medium", download_root="/path/to/models")

# Load from a local checkpoint file
model = whisper.load_model("/path/to/custom_model.pt")

# List all available models
print(whisper.available_models())
# Output: ['tiny.en', 'tiny', 'base.en', 'base', 'small.en', 'small',
#          'medium.en', 'medium', 'large-v1', 'large-v2', 'large-v3',
#          'large', 'large-v3-turbo', 'turbo']
```

## Transcribe Audio

The `transcribe` function processes an audio file using a sliding 30-second window and performs autoregressive sequence-to-sequence predictions on each window. It returns a dictionary containing the full transcription text, detailed segments with timestamps, and the detected language. The function supports various parameters for controlling temperature-based sampling, compression-ratio thresholds, and word-level timestamp extraction.

```python
import whisper

model = whisper.load_model("turbo")

# Basic transcription
result = model.transcribe("audio.mp3")
print(result["text"])
# Output: " The quick brown fox jumps over the lazy dog."

# Access detailed segments with timestamps
for segment in result["segments"]:
    print(f"[{segment['start']:.2f}s -> {segment['end']:.2f}s] {segment['text']}")
# Output: [0.00s -> 2.50s] The quick brown fox
#         [2.50s -> 4.80s] jumps over the lazy dog.
# Transcription with word-level timestamps
result = model.transcribe("audio.mp3", word_timestamps=True)
for segment in result["segments"]:
    for word in segment["words"]:
        print(f"{word['word']} ({word['start']:.2f}s - {word['end']:.2f}s)")

# Specify language explicitly (skip auto-detection)
result = model.transcribe("german_audio.mp3", language="de")
print(f"Detected language: {result['language']}")

# Translation to English
result = model.transcribe("japanese_audio.mp3", task="translate")
print(result["text"])  # English translation

# Use initial prompt for context (helps with proper nouns, technical terms)
result = model.transcribe(
    "meeting.mp3",
    initial_prompt="Meeting transcript for Acme Corp discussing the Q4 roadmap."
)

# Transcribe specific time ranges
result = model.transcribe(
    "long_audio.mp3",
    clip_timestamps="30.0,60.0,120.0,180.0"  # Process 30-60s and 120-180s
)

# Fine-tune decoding parameters
result = model.transcribe(
    "audio.mp3",
    temperature=0.0,                  # Greedy decoding (deterministic)
    beam_size=5,                      # Beam search width
    best_of=5,                        # Number of candidates for sampling
    compression_ratio_threshold=2.4,  # Retry if too repetitive
    logprob_threshold=-1.0,           # Retry if confidence too low
    no_speech_threshold=0.6,          # Silence detection threshold
    condition_on_previous_text=True,  # Use context across windows
    fp16=True,                        # Use half-precision (faster on GPU)
    verbose=True                      # Print progress
)
```

## Load Audio

The `load_audio` function reads an audio file and converts it to a mono waveform at a 16 kHz sample rate, the format expected by Whisper models. It uses ffmpeg under the hood to handle various audio formats, including MP3, WAV, FLAC, M4A, and video files.

```python
import whisper
import numpy as np

# Load audio file as numpy array
audio = whisper.load_audio("speech.mp3")
print(f"Audio shape: {audio.shape}")  # (num_samples,)
print(f"Duration: {len(audio) / 16000:.2f} seconds")

# Load with custom sample rate (not recommended; Whisper expects 16kHz)
audio = whisper.load_audio("speech.wav", sr=16000)

# Audio is returned as float32 normalized to [-1, 1]
print(f"Dtype: {audio.dtype}")  # float32
print(f"Range: [{audio.min():.3f}, {audio.max():.3f}]")
```

## Pad or Trim Audio

The `pad_or_trim` function adjusts audio arrays or tensors to exactly 30 seconds (480,000 samples at 16 kHz), the input size expected by the Whisper encoder. It pads shorter audio with silence and trims longer audio.

```python
import whisper
import torch

# Load and prepare audio for the model
audio = whisper.load_audio("short_clip.mp3")
print(f"Original length: {len(audio)} samples")

# Pad or trim to 30 seconds
audio = whisper.pad_or_trim(audio)
print(f"Adjusted length: {len(audio)} samples")  # 480000

# Works with torch tensors too
audio_tensor = torch.from_numpy(audio)
audio_tensor = whisper.pad_or_trim(audio_tensor)
print(f"Tensor shape: {audio_tensor.shape}")  # torch.Size([480000])

# Pad/trim mel spectrograms along the frame axis
mel = whisper.log_mel_spectrogram(audio)
mel = whisper.pad_or_trim(mel, 3000)  # 3000 frames for 30 seconds
print(f"Mel shape: {mel.shape}")  # torch.Size([80, 3000]) or torch.Size([128, 3000])
```

## Log-Mel Spectrogram

The `log_mel_spectrogram` function computes the log-Mel spectrogram representation of audio, the input format for the Whisper encoder. It applies an STFT with a Hann window, projects to the mel scale using precomputed filter banks, and applies log scaling with clamping.
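Each 30-second window thus corresponds to 3,000 mel frames. A quick sanity check of that geometry, sketched against the constants defined in `whisper.audio` (their names and values are taken from the library source at the time of writing):

```python
from whisper.audio import CHUNK_LENGTH, HOP_LENGTH, N_FRAMES, SAMPLE_RATE

# One window: 30 s of 16 kHz audio -> 480,000 samples
n_samples = CHUNK_LENGTH * SAMPLE_RATE
print(n_samples)  # 480000

# A 160-sample STFT hop yields 100 frames per second,
# so each window produces 3000 mel frames
print(n_samples // HOP_LENGTH)  # 3000
print(N_FRAMES)                 # 3000
```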
```python
import whisper
import torch

model = whisper.load_model("turbo")

# Compute mel spectrogram from audio file path
mel = whisper.log_mel_spectrogram("audio.mp3")
print(f"Mel shape: {mel.shape}")  # torch.Size([80, num_frames])

# Compute from numpy array
audio = whisper.load_audio("audio.mp3")
mel = whisper.log_mel_spectrogram(audio)

# Specify number of mel bands (80 for most models, 128 for large-v3/turbo)
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels)
print(f"N_mels: {model.dims.n_mels}")  # 128 for turbo, 80 for older models

# Move to specific device
mel = whisper.log_mel_spectrogram(audio, device="cuda")
print(f"Device: {mel.device}")  # cuda:0

# Add padding for processing in sliding windows
mel = whisper.log_mel_spectrogram(audio, padding=480000)  # Pad 30 seconds
```

## Detect Language

The `detect_language` method identifies the spoken language in the audio by analyzing the first 30 seconds. It returns the detected language token and a dictionary of probabilities for all supported languages. This is useful when processing multilingual content or validating language assumptions.

```python
import whisper

model = whisper.load_model("turbo")

# Load and prepare audio
audio = whisper.load_audio("multilingual_audio.mp3")
audio = whisper.pad_or_trim(audio)

# Compute mel spectrogram
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

# Detect language
_, probs = model.detect_language(mel)

# Get the most likely language
detected_lang = max(probs, key=probs.get)
confidence = probs[detected_lang]
print(f"Detected: {detected_lang} (confidence: {confidence:.2%})")
# Output: Detected: en (confidence: 98.45%)

# Show top 5 language probabilities
sorted_probs = sorted(probs.items(), key=lambda x: x[1], reverse=True)[:5]
for lang, prob in sorted_probs:
    print(f"  {lang}: {prob:.2%}")
# Output: en: 98.45%
#         de: 0.82%
#         nl: 0.31%
#         fr: 0.15%
#         es: 0.09%
```

## Decode Audio

The `decode` function provides low-level access to the decoder for processing 30-second mel spectrogram segments. It returns a `DecodingResult` containing the transcribed text, tokens, language, and quality metrics such as average log probability and compression ratio. Use `DecodingOptions` to configure decoding behavior.
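These quality metrics drive the temperature-fallback behavior of `transcribe`: when a decode looks repetitive or low-confidence, it is retried at a higher temperature. Before the full example below, here is a minimal sketch of a similar retry loop (the thresholds mirror the `transcribe` defaults listed earlier; this is an illustration, not the library's exact implementation):

```python
import whisper

model = whisper.load_model("turbo")
audio = whisper.pad_or_trim(whisper.load_audio("audio.mp3"))
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

# Escalate the sampling temperature while the quality metrics look poor,
# similar in spirit to the fallback inside model.transcribe()
result = None
for temperature in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
    options = whisper.DecodingOptions(
        temperature=temperature,
        beam_size=5 if temperature == 0 else None,  # beam search only at t=0
        fp16=(model.device.type == "cuda"),         # half precision on GPU only
    )
    result = whisper.decode(model, mel, options)
    if result.compression_ratio <= 2.4 and result.avg_logprob >= -1.0:
        break  # decoding looks reliable; stop escalating

print(result.text)
```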
```python
import whisper

model = whisper.load_model("turbo")

# Prepare a 30-second audio segment
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

# Basic decoding with default options
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)
print(f"Text: {result.text}")
print(f"Language: {result.language}")
print(f"Tokens: {result.tokens[:10]}...")  # First 10 token IDs
print(f"Avg log prob: {result.avg_logprob:.3f}")
print(f"No speech prob: {result.no_speech_prob:.3f}")
print(f"Compression ratio: {result.compression_ratio:.2f}")

# Decoding with custom options
options = whisper.DecodingOptions(
    language="en",                   # Force English
    task="transcribe",               # or "translate" for X->English
    temperature=0.0,                 # Greedy decoding
    beam_size=5,                     # Use beam search
    best_of=None,                    # Mutually exclusive with beam_size
    fp16=True,                       # Half precision
    without_timestamps=False,        # Include timestamp tokens
    max_initial_timestamp=1.0,       # Max offset for first timestamp
    suppress_tokens="-1",            # Suppress non-speech tokens
    suppress_blank=True,             # Suppress blank at start
    prompt="Previous context here",  # Provide context
    prefix="The speaker said:",      # Force output prefix
)
result = whisper.decode(model, mel, options)

# Batch decoding (multiple segments at once)
mel_batch = mel.unsqueeze(0).repeat(4, 1, 1)  # 4 copies
results = whisper.decode(model, mel_batch, options)
for i, r in enumerate(results):
    print(f"Segment {i}: {r.text}")
```

## Command-Line Interface

Whisper provides a command-line tool for transcribing audio files directly from the terminal. It supports multiple input files, several output formats (txt, vtt, srt, tsv, json), and the same options available in the Python API.

```bash
# Basic transcription (uses the turbo model by default)
whisper audio.mp3

# Transcribe multiple files
whisper audio1.mp3 audio2.wav audio3.flac

# Specify model and output format
whisper audio.mp3 --model large-v3 --output_format srt

# Transcribe non-English audio
whisper japanese.mp3 --language Japanese

# Translate to English
whisper french.mp3 --model medium --language French --task translate

# Generate all output formats
whisper audio.mp3 --output_format all --output_dir ./transcripts/

# Enable word-level timestamps
whisper audio.mp3 --word_timestamps True

# Generate highlighted subtitles (word-by-word highlighting)
whisper audio.mp3 --word_timestamps True --highlight_words True

# Control subtitle formatting
whisper audio.mp3 --word_timestamps True \
    --max_line_width 42 \
    --max_line_count 2

# Use GPU with specific device
whisper audio.mp3 --device cuda

# Use CPU with specific thread count
whisper audio.mp3 --device cpu --threads 4

# Custom decoding parameters
whisper audio.mp3 \
    --temperature 0 \
    --beam_size 5 \
    --best_of 5 \
    --compression_ratio_threshold 2.4 \
    --logprob_threshold -1.0 \
    --no_speech_threshold 0.6

# Provide initial prompt for better accuracy
whisper meeting.mp3 --initial_prompt "Meeting about Project Alpha with CEO John Smith"

# Process specific time clips
whisper podcast.mp3 --clip_timestamps "0,300,600,900"

# Reduce hallucinations in silent sections
whisper audio.mp3 --word_timestamps True --hallucination_silence_threshold 2.0

# View all options
whisper --help
```

## Output Writers

Whisper includes built-in writers for exporting transcription results to various formats: plain text, VTT (WebVTT), SRT (SubRip), TSV (tab-separated values), and JSON.
The `get_writer` function returns the appropriate writer class for the specified format.

```python
import whisper
from whisper.utils import get_writer

model = whisper.load_model("turbo")
result = model.transcribe("audio.mp3", word_timestamps=True)

# Get a writer for a specific format
writer = get_writer("srt", output_dir="./output")
writer(result, "audio.mp3")
# Creates: ./output/audio.srt

# Write all formats at once
writer = get_writer("all", output_dir="./output")
writer(result, "audio.mp3")
# Creates: audio.txt, audio.vtt, audio.srt, audio.tsv, audio.json

# Use word-level highlighting in VTT/SRT
writer = get_writer("vtt", output_dir="./output")
writer(result, "audio.mp3", highlight_words=True)

# Control subtitle line formatting
writer = get_writer("srt", output_dir="./output")
writer(
    result,
    "audio.mp3",
    max_line_width=42,    # Characters per line
    max_line_count=2,     # Lines per subtitle
    max_words_per_line=8  # Words per line (alternative to width)
)

# Direct file writing with WriteSRT
from whisper.utils import WriteSRT
import io

srt_writer = WriteSRT(output_dir=".")
with io.StringIO() as f:
    srt_writer.write_result(result, file=f)
    srt_content = f.getvalue()
print(srt_content[:500])
```

## Tokenizer

The `Tokenizer` class provides text encoding/decoding using tiktoken, with special handling for Whisper's task-specific tokens (language codes, timestamps, and transcribe/translate markers). It supports 99 languages and includes methods for word-level token splitting.

```python
from whisper.tokenizer import get_tokenizer, LANGUAGES

# Get tokenizer for a multilingual model
tokenizer = get_tokenizer(multilingual=True, language="en", task="transcribe")

# Encode text to tokens
tokens = tokenizer.encode("Hello, world!")
print(f"Tokens: {tokens}")  # [15947, 11, 1002, 0]

# Decode tokens back to text
text = tokenizer.decode(tokens)
print(f"Text: {text}")  # "Hello, world!"

# Access special tokens
print(f"SOT (start of transcript): {tokenizer.sot}")
print(f"EOT (end of transcript): {tokenizer.eot}")
print(f"Timestamp begin: {tokenizer.timestamp_begin}")
print(f"No timestamps token: {tokenizer.no_timestamps}")
print(f"Transcribe token: {tokenizer.transcribe}")
print(f"Translate token: {tokenizer.translate}")

# Get SOT sequence for a task
print(f"SOT sequence: {tokenizer.sot_sequence}")
# (50258, 50259, 50359) for English transcription

# Decode with timestamps
tokens_with_timestamps = [50364, 15947, 11, 1002, 50414]  # Example
text = tokenizer.decode_with_timestamps(tokens_with_timestamps)
print(text)  # "<|0.00|>Hello, world!<|1.00|>"

# Get all supported languages
print(f"Supported languages: {len(LANGUAGES)}")  # 99
print(f"Language codes: {list(LANGUAGES.keys())[:10]}")
# ['en', 'zh', 'de', 'es', 'ru', 'ko', 'fr', 'ja', 'pt', 'tr']

# Get language token
lang_token = tokenizer.to_language_token("fr")
print(f"French token: {lang_token}")

# Split tokens into words
tokens = tokenizer.encode(" The quick brown fox")
words, word_tokens = tokenizer.split_to_word_tokens(tokens)
for word, toks in zip(words, word_tokens):
    print(f"'{word}' -> {toks}")
```

## Model Architecture

The Whisper model consists of an `AudioEncoder` (CNN + Transformer) and a `TextDecoder` (Transformer). The `Whisper` class combines these components and provides methods for embedding audio, generating logits, and managing KV-caching for efficient autoregressive decoding.
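The encoder and decoder are also exposed directly as `model.encoder` and `model.decoder`; the walkthrough below uses the higher-level helpers (`embed_audio`, `logits`), but calling the modules yourself is equivalent. A minimal sketch (token ID 50258 is the multilingual `<|startoftranscript|>` token, as in the SOT sequence shown in the Tokenizer section):

```python
import torch
import whisper

model = whisper.load_model("base", device="cpu")  # CPU keeps everything float32

audio = whisper.pad_or_trim(whisper.load_audio("audio.mp3"))
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).unsqueeze(0)

with torch.no_grad():
    features = model.encoder(mel)             # same result as model.embed_audio(mel)
    tokens = torch.tensor([[50258]])          # <|startoftranscript|>
    logits = model.decoder(tokens, features)  # same result as model.logits(...)

print(features.shape, logits.shape)
```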
```python
import whisper
import torch

model = whisper.load_model("turbo")

# Inspect model dimensions
dims = model.dims
print(f"Mel bands: {dims.n_mels}")           # 128 (turbo) or 80 (others)
print(f"Audio context: {dims.n_audio_ctx}")  # 1500 frames
print(f"Audio state: {dims.n_audio_state}")  # Hidden dimension
print(f"Audio heads: {dims.n_audio_head}")   # Attention heads
print(f"Audio layers: {dims.n_audio_layer}") # Encoder layers
print(f"Vocab size: {dims.n_vocab}")         # ~51865 for multilingual
print(f"Text context: {dims.n_text_ctx}")    # 448 tokens
print(f"Text state: {dims.n_text_state}")    # Hidden dimension
print(f"Text heads: {dims.n_text_head}")     # Attention heads
print(f"Text layers: {dims.n_text_layer}")   # Decoder layers

# Check model properties
print(f"Device: {model.device}")
print(f"Multilingual: {model.is_multilingual}")
print(f"Num languages: {model.num_languages}")

# Encode audio to features
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio, n_mels=dims.n_mels)
mel = mel.unsqueeze(0).to(model.device).half()

# Get audio embeddings from encoder
audio_features = model.embed_audio(mel)
print(f"Audio features shape: {audio_features.shape}")
# torch.Size([1, 1500, audio_state])

# Forward pass through decoder to get logits
tokens = torch.tensor([[50258, 50259, 50359]]).to(model.device)  # SOT sequence
logits = model.logits(tokens, audio_features)
print(f"Logits shape: {logits.shape}")  # torch.Size([1, 3, vocab_size])

# Full forward pass (encoder + decoder)
output = model(mel, tokens)
print(f"Output shape: {output.shape}")  # Same as logits

# Install KV-cache hooks for efficient generation
cache, hooks = model.install_kv_cache_hooks()
# Use the cache during iterative decoding, then clean up:
for hook in hooks:
    hook.remove()
```

Whisper is well suited to applications requiring robust speech-to-text conversion, including transcription services, subtitle generation, voice assistants, meeting summarization, and accessibility tools. Its multilingual capabilities make it suitable for global applications, while the translation feature enables cross-language content processing. The model handles diverse acoustic conditions, including background noise, accents, and technical vocabulary when properly prompted.

Integration patterns typically involve loading a model once and reusing it across transcriptions, leveraging GPU acceleration where available for throughput-sensitive applications. For real-time or streaming use cases, developers process audio in 30-second chunks and manage segment boundaries manually, as sketched below. The various output writers facilitate integration with video editing software, web players (WebVTT), and media workflows (SRT). Word-level timestamps enable advanced features such as karaoke-style highlighting, precise audio editing, and speaker diarization when combined with external tools.
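A minimal sketch of that chunked pattern, assuming fixed 30-second windows with no overlap or boundary stitching (a production pipeline would also carry context between windows; `transcribe_in_chunks` is an illustrative helper, not a library API):

```python
import whisper

model = whisper.load_model("turbo")

CHUNK = 30 * 16000  # 30 seconds at Whisper's 16 kHz sample rate

def transcribe_in_chunks(path: str) -> str:
    """Transcribe a long file one fixed 30-second window at a time."""
    audio = whisper.load_audio(path)  # float32 mono at 16 kHz
    texts = []
    for start in range(0, len(audio), CHUNK):
        window = whisper.pad_or_trim(audio[start:start + CHUNK])
        mel = whisper.log_mel_spectrogram(
            window, n_mels=model.dims.n_mels
        ).to(model.device)
        options = whisper.DecodingOptions(fp16=(model.device.type == "cuda"))
        texts.append(whisper.decode(model, mel, options).text)
    return " ".join(texts)

print(transcribe_in_chunks("long_audio.mp3"))
```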