# Silero VAD

Silero VAD is a pre-trained, enterprise-grade Voice Activity Detector that identifies speech segments in audio. It provides highly accurate speech detection across 6000+ languages, with support for both 8kHz and 16kHz sampling rates. The library is lightweight (2MB JIT model), fast (less than 1ms per audio chunk on CPU), and runs on any platform supporting PyTorch or ONNX Runtime.

The library offers multiple integration options: a pip-installable Python package, torch.hub loading, and ONNX models for cross-platform deployment. It includes utilities for batch processing full audio files, real-time streaming detection, and audio manipulation (collecting or dropping speech segments). Silero VAD is commonly used for voice interfaces, telephony automation, data cleaning, and IoT/edge voice detection applications.

## load_silero_vad

Loads the Silero VAD model from the package. Supports both JIT (TorchScript) and ONNX model formats. The JIT model requires PyTorch >= 1.12.0, while the ONNX model requires onnxruntime >= 1.16.1. ONNX models support opset versions 15 and 16.

```python
from silero_vad import load_silero_vad

# Load JIT model (default)
model = load_silero_vad()

# Load ONNX model for cross-platform deployment
model_onnx = load_silero_vad(onnx=True)

# Load ONNX model with specific opset version
model_onnx_15 = load_silero_vad(onnx=True, opset_version=15)
```

## read_audio

Reads an audio file and returns a torch.Tensor at the specified sampling rate. Automatically handles resampling and stereo-to-mono conversion, and works with multiple audio backends (FFmpeg, sox, soundfile, torchcodec).
```python
from silero_vad import read_audio

# Read audio file at 16kHz (default)
wav = read_audio('audio.wav')
# wav shape: torch.Size([num_samples])

# Read audio file at 8kHz
wav_8k = read_audio('audio.wav', sampling_rate=8000)

# Read and process any supported format
wav_mp3 = read_audio('audio.mp3', sampling_rate=16000)
```

## get_speech_timestamps

Analyzes audio and returns a list of dictionaries containing start and end timestamps of speech segments. This is the primary function for processing complete audio files. Returns coordinates in samples by default, or seconds when `return_seconds=True`.

```python
from silero_vad import load_silero_vad, read_audio, get_speech_timestamps

model = load_silero_vad()
wav = read_audio('audio.wav', sampling_rate=16000)

# Basic usage - returns sample coordinates
speech_timestamps = get_speech_timestamps(wav, model)
# Output: [{'start': 1024, 'end': 48000}, {'start': 64000, 'end': 96000}]

# Return timestamps in seconds
speech_timestamps = get_speech_timestamps(
    wav, model, return_seconds=True, sampling_rate=16000
)
# Output: [{'start': 0.1, 'end': 3.0}, {'start': 4.0, 'end': 6.0}]

# Fine-tuned detection parameters
speech_timestamps = get_speech_timestamps(
    wav,
    model,
    threshold=0.5,                       # Speech probability threshold (default: 0.5)
    sampling_rate=16000,                 # Sample rate (8000 or 16000)
    min_speech_duration_ms=250,          # Minimum speech chunk duration
    max_speech_duration_s=float('inf'),  # Maximum speech chunk duration
    min_silence_duration_ms=100,         # Minimum silence to end speech
    speech_pad_ms=30,                    # Padding around speech chunks
    return_seconds=True,
    visualize_probs=False,               # Plot speech probabilities
    neg_threshold=0.35,                  # Threshold to exit speech state
    progress_tracking_callback=lambda p: print(f"Progress: {p:.1f}%")
)
```

## VADIterator

A class for streaming/real-time voice activity detection. Processes audio in chunks and returns speech start/end events as they occur. Ideal for live microphone input and real-time applications.
```python
from silero_vad import load_silero_vad, read_audio, VADIterator

model = load_silero_vad()

# Initialize iterator with custom parameters
vad_iterator = VADIterator(
    model,
    threshold=0.5,                # Speech probability threshold
    sampling_rate=16000,          # Sample rate (8000 or 16000)
    min_silence_duration_ms=100,  # Silence duration to end speech
    speech_pad_ms=30              # Padding around detected speech
)

# Process audio in chunks (simulating streaming)
wav = read_audio('audio.wav', sampling_rate=16000)
window_size_samples = 512  # 512 for 16kHz, 256 for 8kHz

for i in range(0, len(wav), window_size_samples):
    chunk = wav[i:i + window_size_samples]
    if len(chunk) < window_size_samples:
        break
    speech_dict = vad_iterator(chunk, return_seconds=True)
    if speech_dict:
        if 'start' in speech_dict:
            print(f"Speech started at {speech_dict['start']}s")
        if 'end' in speech_dict:
            print(f"Speech ended at {speech_dict['end']}s")

# Reset states for next audio
vad_iterator.reset_states()
```

## Model Direct Inference

The model can be called directly on audio chunks to get speech probabilities. Each call returns a probability (0-1) indicating likelihood of speech. Remember to reset model states between different audio files.

```python
from silero_vad import load_silero_vad, read_audio

model = load_silero_vad()
wav = read_audio('audio.wav', sampling_rate=16000)

# Process audio in chunks and get probabilities
speech_probs = []
window_size_samples = 512  # 512 for 16kHz, 256 for 8kHz

for i in range(0, len(wav), window_size_samples):
    chunk = wav[i:i + window_size_samples]
    if len(chunk) < window_size_samples:
        break
    speech_prob = model(chunk, 16000).item()
    speech_probs.append(speech_prob)

print(f"First 10 probabilities: {speech_probs[:10]}")
# Output: [0.01, 0.02, 0.45, 0.89, 0.95, 0.92, 0.88, 0.12, 0.03, 0.01]

# Reset states before processing new audio
model.reset_states()
```

## audio_forward (ONNX Model)

Processes an entire audio file at once using the ONNX model wrapper.
Returns speech probabilities for all chunks. Each output corresponds to a 32ms window (512 samples at 16kHz).

```python
from silero_vad import load_silero_vad, read_audio

model = load_silero_vad(onnx=True)
wav = read_audio('audio.wav', sampling_rate=16000)

# Process entire audio at once
# Returns probabilities for each ~32ms window
probabilities = model.audio_forward(wav.unsqueeze(0), sr=16000)
# probabilities shape: torch.Size([1, num_chunks])

print(f"Audio duration: {len(wav)/16000:.2f}s")
print(f"Number of probability windows: {probabilities.shape[1]}")
print(f"Probabilities: {probabilities[0, :10].tolist()}")
```

## collect_chunks

Extracts and concatenates speech segments from audio based on timestamp coordinates. Useful for creating audio containing only speech portions.

```python
from silero_vad import load_silero_vad, read_audio, get_speech_timestamps, collect_chunks, save_audio

model = load_silero_vad()
wav = read_audio('audio.wav', sampling_rate=16000)

# Get speech timestamps in samples
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)

# Collect only speech portions (sample coordinates)
speech_only = collect_chunks(speech_timestamps, wav)

# Save extracted speech
save_audio('speech_only.wav', speech_only, sampling_rate=16000)

# Using second-based coordinates
speech_timestamps_sec = get_speech_timestamps(wav, model, return_seconds=True)
speech_only = collect_chunks(
    speech_timestamps_sec, wav, seconds=True, sampling_rate=16000
)
```

## drop_chunks

Removes speech segments from audio based on timestamp coordinates. Useful for extracting non-speech portions (silence, noise, music).
```python
from silero_vad import load_silero_vad, read_audio, get_speech_timestamps, save_audio
from silero_vad.utils_vad import drop_chunks

model = load_silero_vad()
wav = read_audio('audio.wav', sampling_rate=16000)

# Get speech timestamps
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)

# Drop speech portions, keeping only non-speech
non_speech = drop_chunks(speech_timestamps, wav)

# Save non-speech audio
save_audio('non_speech.wav', non_speech, sampling_rate=16000)

# Using second-based coordinates
speech_timestamps_sec = get_speech_timestamps(wav, model, return_seconds=True)
non_speech = drop_chunks(
    speech_timestamps_sec, wav, seconds=True, sampling_rate=16000
)
```

## save_audio

Saves a torch.Tensor as a WAV audio file. Handles tensor dimension conversion automatically.

```python
from silero_vad import save_audio
import torch

# Save 1D tensor
audio_tensor = torch.randn(16000)  # 1 second of audio at 16kHz
save_audio('output.wav', audio_tensor, sampling_rate=16000)

# Save at different sample rate
save_audio('output_8k.wav', audio_tensor[:8000], sampling_rate=8000)
```

## torch.hub Loading

Alternative method to load the model via torch.hub without installing the pip package. Returns both the model and utility functions.

```python
import torch
torch.set_num_threads(1)

# Load model and utilities from torch.hub
model, utils = torch.hub.load(
    repo_or_dir='snakers4/silero-vad',
    model='silero_vad',
    force_reload=False,
    onnx=False  # Set True for ONNX model
)

# Unpack utilities
(get_speech_timestamps,
 save_audio,
 read_audio,
 VADIterator,
 collect_chunks) = utils

# Use as normal
wav = read_audio('audio.wav')
speech_timestamps = get_speech_timestamps(wav, model, return_seconds=True)
print(speech_timestamps)
```

## TinySileroVAD (Tinygrad Model)

Experimental tinygrad implementation for minimal-dependency deployments. Requires the tinygrad library and loading weights from safetensors format.
```python
from tinygrad import Tensor
from tinygrad.nn.state import safe_load, load_state_dict
from silero_vad.tinygrad_model import TinySileroVAD
from silero_vad import read_audio
import numpy as np

# Load tinygrad model
tiny_model = TinySileroVAD()
state_dict = safe_load('silero_vad_16k.safetensors')
load_state_dict(tiny_model, state_dict)

# Process audio
wav = read_audio('audio.wav', sampling_rate=16000)
num_samples = 512
context_size = 64
context = Tensor(np.zeros((1, context_size))).float()
state = None

# Prepare audio with padding
import torch
if wav.shape[0] % num_samples:
    pad_num = num_samples - (wav.shape[0] % num_samples)
    wav = torch.nn.functional.pad(wav.unsqueeze(0), (0, pad_num), 'constant', value=0.0)
else:
    wav = wav.unsqueeze(0)
wav = torch.nn.functional.pad(wav, (context_size, 0))
wav_tg = Tensor(wav.numpy()).float()

# Process chunks
outs = []
for i in range(context_size, wav_tg.shape[1], num_samples):
    chunk = wav_tg[:, i-context_size:i+num_samples]
    out, state = tiny_model(chunk, state)
    outs.append(out)

# Concatenate predictions
predictions = outs[0].cat(*outs[1:], dim=1).numpy()
print(f"Predictions shape: {predictions.shape}")
```

## Summary

Silero VAD provides a complete toolkit for voice activity detection in Python applications. The primary use case is splitting long audio recordings into speech segments using `get_speech_timestamps()`, which handles all the complexity of threshold-based detection, silence handling, and speech padding.

For real-time applications like voice assistants or telephony systems, the `VADIterator` class enables streaming detection with minimal latency, processing audio in small chunks (32ms at 16kHz) and returning speech boundary events.

Integration patterns typically involve loading the model once at application startup, then either batch-processing audio files through `get_speech_timestamps()` or setting up a `VADIterator` for continuous streaming.
The model supports both PyTorch JIT and ONNX formats, enabling deployment across Python backends, mobile applications, web browsers (via ONNX Runtime Web), and embedded systems. Community examples demonstrate integration with C++, Rust, Go, Java, C#, and JavaScript, making Silero VAD highly portable for production voice-enabled applications.
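Both the `VADIterator` and direct-inference examples above iterate over the waveform in fixed windows (512 samples at 16kHz, 256 at 8kHz) and discard a trailing partial window. That shared pattern can be factored into a small helper; the `frame_audio` function below is an illustrative sketch under those assumptions, not part of the silero_vad API.

```python
import torch

def frame_audio(wav: torch.Tensor, window_size: int = 512) -> list:
    """Split a 1-D waveform into fixed-size windows, dropping any
    trailing partial window (as the chunk loops above do)."""
    return [wav[i:i + window_size]
            for i in range(0, len(wav) - window_size + 1, window_size)]

# One second of 16kHz audio -> 16000 // 512 = 31 full windows
wav = torch.zeros(16000)
frames = frame_audio(wav, window_size=512)
print(len(frames), len(frames[0]))  # 31 512
```

Each frame can then be fed to `vad_iterator(frame)` or `model(frame, 16000)` exactly as in the streaming examples, without repeating the slicing logic at every call site.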