# Silero VAD

Silero VAD is a pre-trained, enterprise-grade Voice Activity Detector that identifies speech segments in audio. It provides highly accurate speech detection across 6000+ languages, with support for both 8kHz and 16kHz sampling rates. The library is lightweight (2MB JIT model), fast (less than 1ms per audio chunk on CPU), and runs on any platform supporting PyTorch or ONNX Runtime.

The library offers multiple integration options: a pip-installable Python package, torch.hub loading, and ONNX models for cross-platform deployment. It includes utilities for batch processing full audio files, real-time streaming detection, and audio manipulation (collecting or dropping speech segments). Silero VAD is commonly used for voice interfaces, telephony automation, data cleaning, and IoT/edge voice detection applications.

## load_silero_vad

Loads the Silero VAD model from the package. Supports both JIT (TorchScript) and ONNX model formats. The JIT model requires PyTorch >= 1.12.0, while the ONNX model requires onnxruntime >= 1.16.1. ONNX models support opset versions 15 and 16.

```python
from silero_vad import load_silero_vad

# Load JIT model (default)
model = load_silero_vad()

# Load ONNX model for cross-platform deployment
model_onnx = load_silero_vad(onnx=True)

# Load ONNX model with specific opset version
model_onnx_15 = load_silero_vad(onnx=True, opset_version=15)
```

## read_audio

Reads an audio file and returns a torch.Tensor at the specified sampling rate. Automatically handles resampling and stereo-to-mono conversion, and works with multiple audio backends (FFmpeg, sox, soundfile, torchcodec).
```python
from silero_vad import read_audio

# Read audio file at 16kHz (default)
wav = read_audio('audio.wav')
# wav shape: torch.Size([num_samples])

# Read audio file at 8kHz
wav_8k = read_audio('audio.wav', sampling_rate=8000)

# Read and process any supported format
wav_mp3 = read_audio('audio.mp3', sampling_rate=16000)
```

## get_speech_timestamps

Analyzes audio and returns a list of dictionaries containing start and end timestamps of speech segments. This is the primary function for processing complete audio files. Returns coordinates in samples by default, or seconds when `return_seconds=True`.

```python
from silero_vad import load_silero_vad, read_audio, get_speech_timestamps

model = load_silero_vad()
wav = read_audio('audio.wav', sampling_rate=16000)

# Basic usage - returns sample coordinates
speech_timestamps = get_speech_timestamps(wav, model)
# Output: [{'start': 1024, 'end': 48000}, {'start': 64000, 'end': 96000}]

# Return timestamps in seconds
speech_timestamps = get_speech_timestamps(
    wav, model, return_seconds=True, sampling_rate=16000
)
# Output: [{'start': 0.1, 'end': 3.0}, {'start': 4.0, 'end': 6.0}]

# Fine-tuned detection parameters
speech_timestamps = get_speech_timestamps(
    wav,
    model,
    threshold=0.5,                       # Speech probability threshold (default: 0.5)
    sampling_rate=16000,                 # Sample rate (8000 or 16000)
    min_speech_duration_ms=250,          # Minimum speech chunk duration
    max_speech_duration_s=float('inf'),  # Maximum speech chunk duration
    min_silence_duration_ms=100,         # Minimum silence to end speech
    speech_pad_ms=30,                    # Padding around speech chunks
    return_seconds=True,
    visualize_probs=False,               # Plot speech probabilities
    neg_threshold=0.35,                  # Threshold to exit speech state
    progress_tracking_callback=lambda p: print(f"Progress: {p:.1f}%")
)
```

## VADIterator

A class for streaming/real-time voice activity detection. Processes audio in chunks and returns speech start/end events as they occur. Ideal for live microphone input and real-time applications.
```python
from silero_vad import load_silero_vad, read_audio, VADIterator

model = load_silero_vad()

# Initialize iterator with custom parameters
vad_iterator = VADIterator(
    model,
    threshold=0.5,                # Speech probability threshold
    sampling_rate=16000,          # Sample rate (8000 or 16000)
    min_silence_duration_ms=100,  # Silence duration to end speech
    speech_pad_ms=30              # Padding around detected speech
)

# Process audio in chunks (simulating streaming)
wav = read_audio('audio.wav', sampling_rate=16000)
window_size_samples = 512  # 512 for 16kHz, 256 for 8kHz

for i in range(0, len(wav), window_size_samples):
    chunk = wav[i:i + window_size_samples]
    if len(chunk) < window_size_samples:
        break
    speech_dict = vad_iterator(chunk, return_seconds=True)
    if speech_dict:
        if 'start' in speech_dict:
            print(f"Speech started at {speech_dict['start']}s")
        if 'end' in speech_dict:
            print(f"Speech ended at {speech_dict['end']}s")

# Reset states for next audio
vad_iterator.reset_states()
```

## Model Direct Inference

The model can be called directly on audio chunks to get speech probabilities. Each call returns a probability (0-1) indicating likelihood of speech. Remember to reset model states between different audio files.

```python
from silero_vad import load_silero_vad, read_audio

model = load_silero_vad()
wav = read_audio('audio.wav', sampling_rate=16000)

# Process audio in chunks and get probabilities
speech_probs = []
window_size_samples = 512  # 512 for 16kHz, 256 for 8kHz

for i in range(0, len(wav), window_size_samples):
    chunk = wav[i:i + window_size_samples]
    if len(chunk) < window_size_samples:
        break
    speech_prob = model(chunk, 16000).item()
    speech_probs.append(speech_prob)

print(f"First 10 probabilities: {speech_probs[:10]}")
# Output: [0.01, 0.02, 0.45, 0.89, 0.95, 0.92, 0.88, 0.12, 0.03, 0.01]

# Reset states before processing new audio
model.reset_states()
```

## audio_forward (ONNX Model)

Processes an entire audio file at once using the ONNX model wrapper.
Returns speech probabilities for all chunks. Each output corresponds to a 32ms window (512 samples at 16kHz).

```python
from silero_vad import load_silero_vad, read_audio

model = load_silero_vad(onnx=True)
wav = read_audio('audio.wav', sampling_rate=16000)

# Process entire audio at once
# Returns probabilities for each ~32ms window
probabilities = model.audio_forward(wav.unsqueeze(0), sr=16000)
# probabilities shape: torch.Size([1, num_chunks])

print(f"Audio duration: {len(wav)/16000:.2f}s")
print(f"Number of probability windows: {probabilities.shape[1]}")
print(f"Probabilities: {probabilities[0, :10].tolist()}")
```

## collect_chunks

Extracts and concatenates speech segments from audio based on timestamp coordinates. Useful for creating audio containing only speech portions.

```python
from silero_vad import load_silero_vad, read_audio, get_speech_timestamps, collect_chunks, save_audio

model = load_silero_vad()
wav = read_audio('audio.wav', sampling_rate=16000)

# Get speech timestamps in samples
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)

# Collect only speech portions (sample coordinates)
speech_only = collect_chunks(speech_timestamps, wav)

# Save extracted speech
save_audio('speech_only.wav', speech_only, sampling_rate=16000)

# Using second-based coordinates
speech_timestamps_sec = get_speech_timestamps(wav, model, return_seconds=True)
speech_only = collect_chunks(
    speech_timestamps_sec, wav, seconds=True, sampling_rate=16000
)
```

## drop_chunks

Removes speech segments from audio based on timestamp coordinates. Useful for extracting non-speech portions (silence, noise, music).
```python
from silero_vad import load_silero_vad, read_audio, get_speech_timestamps, save_audio
from silero_vad.utils_vad import drop_chunks

model = load_silero_vad()
wav = read_audio('audio.wav', sampling_rate=16000)

# Get speech timestamps
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)

# Drop speech portions, keeping only non-speech
non_speech = drop_chunks(speech_timestamps, wav)

# Save non-speech audio
save_audio('non_speech.wav', non_speech, sampling_rate=16000)

# Using second-based coordinates
speech_timestamps_sec = get_speech_timestamps(wav, model, return_seconds=True)
non_speech = drop_chunks(
    speech_timestamps_sec, wav, seconds=True, sampling_rate=16000
)
```

## save_audio

Saves a torch.Tensor as a WAV audio file. Handles tensor dimension conversion automatically.

```python
from silero_vad import save_audio
import torch

# Save 1D tensor
audio_tensor = torch.randn(16000)  # 1 second of audio at 16kHz
save_audio('output.wav', audio_tensor, sampling_rate=16000)

# Save at different sample rate
save_audio('output_8k.wav', audio_tensor[:8000], sampling_rate=8000)
```

## torch.hub Loading

Alternative method to load the model via torch.hub without installing the pip package. Returns both the model and utility functions.

```python
import torch
torch.set_num_threads(1)

# Load model and utilities from torch.hub
model, utils = torch.hub.load(
    repo_or_dir='snakers4/silero-vad',
    model='silero_vad',
    force_reload=False,
    onnx=False  # Set True for ONNX model
)

# Unpack utilities
(get_speech_timestamps,
 save_audio,
 read_audio,
 VADIterator,
 collect_chunks) = utils

# Use as normal
wav = read_audio('audio.wav')
speech_timestamps = get_speech_timestamps(wav, model, return_seconds=True)
print(speech_timestamps)
```

## TinySileroVAD (Tinygrad Model)

Experimental tinygrad implementation for minimal-dependency deployments. Requires the tinygrad library and loading weights from safetensors format.
```python
from tinygrad import Tensor
from tinygrad.nn.state import safe_load, load_state_dict
from silero_vad.tinygrad_model import TinySileroVAD
from silero_vad import read_audio
import numpy as np

# Load tinygrad model
tiny_model = TinySileroVAD()
state_dict = safe_load('silero_vad_16k.safetensors')
load_state_dict(tiny_model, state_dict)

# Process audio
wav = read_audio('audio.wav', sampling_rate=16000)
num_samples = 512
context_size = 64
context = Tensor(np.zeros((1, context_size))).float()
state = None

# Prepare audio with padding
import torch
if wav.shape[0] % num_samples:
    pad_num = num_samples - (wav.shape[0] % num_samples)
    wav = torch.nn.functional.pad(wav.unsqueeze(0), (0, pad_num), 'constant', value=0.0)
else:
    wav = wav.unsqueeze(0)
wav = torch.nn.functional.pad(wav, (context_size, 0))
wav_tg = Tensor(wav.numpy()).float()

# Process chunks
outs = []
for i in range(context_size, wav_tg.shape[1], num_samples):
    chunk = wav_tg[:, i-context_size:i+num_samples]
    out, state = tiny_model(chunk, state)
    outs.append(out)

# Concatenate predictions
predictions = outs[0].cat(*outs[1:], dim=1).numpy()
print(f"Predictions shape: {predictions.shape}")
```

## Summary

Silero VAD provides a complete toolkit for voice activity detection in Python applications. The primary use case is splitting long audio recordings into speech segments using `get_speech_timestamps()`, which handles all the complexity of threshold-based detection, silence handling, and speech padding.

For real-time applications like voice assistants or telephony systems, the `VADIterator` class enables streaming detection with minimal latency, processing audio in small chunks (32ms at 16kHz) and returning speech boundary events.

Integration patterns typically involve loading the model once at application startup, then either batch-processing audio files through `get_speech_timestamps()` or setting up a `VADIterator` for continuous streaming.
The model supports both PyTorch JIT and ONNX formats, enabling deployment across Python backends, mobile applications, web browsers (via ONNX Runtime Web), and embedded systems. Community examples demonstrate integration with C++, Rust, Go, Java, C#, and JavaScript, making Silero VAD highly portable for production voice-enabled applications.
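Both the `VADIterator` and direct-inference examples above iterate over the waveform in fixed windows (512 samples at 16kHz, 256 at 8kHz) and discard a trailing partial window. That shared pattern can be factored into a small helper; the `frame_audio` function below is an illustrative sketch under those assumptions, not part of the silero_vad API.

```python
import torch

def frame_audio(wav: torch.Tensor, window_size: int = 512) -> list:
    """Split a 1-D waveform into fixed-size windows, dropping any
    trailing partial window (as the chunk loops above do)."""
    return [wav[i:i + window_size]
            for i in range(0, len(wav) - window_size + 1, window_size)]

# One second of 16kHz audio -> 16000 // 512 = 31 full windows
wav = torch.zeros(16000)
frames = frame_audio(wav, window_size=512)
print(len(frames), len(frames[0]))  # 31 512
```

Each frame can then be fed to `vad_iterator(frame)` or `model(frame, 16000)` exactly as in the streaming examples, without repeating the slicing logic at every call site.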