# onnx-asr

onnx-asr is a lightweight Python package for Automatic Speech Recognition (ASR) using ONNX models. It provides a fast, easy-to-use pure Python interface with minimal dependencies: only NumPy and ONNX Runtime are required, with no need for PyTorch, Transformers, or FFmpeg. The package supports modern ASR architectures including NVIDIA NeMo Parakeet/Canary, GigaChat GigaAM, Kaldi/Vosk, T-Tech T-one, and OpenAI Whisper models.

The library runs on a wide range of hardware, from IoT/edge devices to servers with powerful GPUs, and supports Windows, Linux, and macOS on x86 and ARM CPUs. Hardware acceleration is available through CUDA, TensorRT, CoreML, DirectML, ROCm, and WebGPU.

Key features include loading models from Hugging Face or local directories (including quantized versions), accepting WAV files or NumPy arrays with built-in resampling, batch processing, long-form recognition using Voice Activity Detection (VAD), and returning token-level timestamps and log probabilities.

## load_model

Load an ASR model from Hugging Face or a local directory. This is the primary entry point for speech recognition. It supports various model architectures including NeMo Conformer/Parakeet/Canary, GigaAM, Kaldi/Vosk, T-one, and Whisper models. The function handles model downloading, preprocessor setup, and runtime configuration automatically.
```python
import onnx_asr

# Load model from Hugging Face (downloads automatically on first use)
model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3")

# Recognize speech from a WAV file
result = model.recognize("audio.wav")
print(result)  # Output: "hello world this is a test"

# Load with int8 quantization for faster inference
model_quantized = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3", quantization="int8")

# Load from a local directory
model_local = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3", "models/parakeet-v3")

# Load a custom model from a Hugging Face repository
model_custom = onnx_asr.load_model("istupakov/canary-180m-flash-onnx")

# Load a Whisper model
model_whisper = onnx_asr.load_model("onnx-community/whisper-large-v3-turbo")

# Configure the TensorRT provider for GPU acceleration
providers = [
    ("TensorrtExecutionProvider", {
        "trt_max_workspace_size": 6 * 1024**3,
        "trt_fp16_enable": True,
    })
]
model_gpu = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3", providers=providers)
```

## recognize

Perform speech recognition on audio input. Accepts WAV file paths, NumPy arrays, or lists for batch processing. The method supports channel selection for multi-channel audio and language specification for multilingual models such as Whisper and Canary.
```python
import onnx_asr
import numpy as np

model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3")

# Single-file recognition
text = model.recognize("audio.wav")
print(text)  # Output: "the quick brown fox jumps over the lazy dog"

# Batch recognition of multiple files
results = model.recognize(["file1.wav", "file2.wav", "file3.wav"])
print(results)  # Output: ["transcript one", "transcript two", "transcript three"]

# Recognition from a NumPy array (16 kHz mono float32)
sample_rate = 16000
duration = 3.0
waveform = np.random.randn(int(sample_rate * duration)).astype(np.float32)
text = model.recognize(waveform, sample_rate=16000)

# Handle multi-channel audio by averaging the channels
text = model.recognize("stereo_audio.wav", channel="mean")

# Or select a specific channel (0-indexed)
text = model.recognize("stereo_audio.wav", channel=0)

# For multilingual models (Whisper, Canary), specify the language
whisper = onnx_asr.load_model("onnx-community/whisper-large-v3-turbo")
text = whisper.recognize("french_audio.wav", language="fr")

# Canary model with punctuation and capitalization
canary = onnx_asr.load_model("nemo-canary-1b-v2")
text = canary.recognize("audio.wav", language="en", pnc=True)
```

## with_timestamps

Enable timestamped recognition to get token-level timing, log probabilities, and individual tokens. Returns a TimestampedResult object containing the text, a timestamps array, a tokens list, and a logprobs list for detailed analysis of the recognition output.
```python
import onnx_asr

model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3").with_timestamps()

# Get detailed recognition results
result = model.recognize("audio.wav")
print(f"Text: {result.text}")                   # Output: "hello world"
print(f"Tokens: {result.tokens}")               # Output: [' hello', ' world']
print(f"Timestamps: {result.timestamps}")       # Output: [0.32, 0.64]
print(f"Log probabilities: {result.logprobs}")  # Output: [-0.123, -0.089]

# Batch processing with timestamps
results = model.recognize(["file1.wav", "file2.wav"])
for r in results:
    print(f"{r.text} at times {r.timestamps}")
```

## load_vad

Load a Voice Activity Detection (VAD) model for processing long audio files. VAD segments audio into speech chunks before recognition, enabling processing of recordings longer than the ASR model's maximum duration (typically 20-30 seconds). Supports Silero VAD and PyAnnote segmentation models.

```python
import onnx_asr

# Load a VAD model (Silero is the default)
vad = onnx_asr.load_vad("silero")

# Alternative: PyAnnote segmentation
vad_pyannote = onnx_asr.load_vad("onnx-community/pyannote-segmentation-3.0")

# Load with quantization
vad_quantized = onnx_asr.load_vad("silero", quantization="int8")
```

## with_vad

Combine an ASR model with VAD for long-form audio recognition. Returns an iterator of SegmentResult objects, each containing the start time, end time, and transcribed text of a detected speech segment. Supports configurable VAD parameters for threshold, duration, and padding.
```python
import onnx_asr

vad = onnx_asr.load_vad("silero")
model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3").with_vad(vad)

# Recognize a long audio file with automatic segmentation
for segment in model.recognize("long_recording.wav"):
    print(f"[{segment.start:5.1f}s - {segment.end:5.1f}s]: {segment.text}")
# Output:
# [  0.5s -   3.2s]: hello and welcome to our presentation
# [  4.1s -   8.7s]: today we will discuss speech recognition
# [ 10.2s -  15.6s]: let us begin with the fundamentals

# Configure VAD parameters
model_custom_vad = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3").with_vad(
    vad,
    batch_size=8,                 # Parallel segment processing
    threshold=0.5,                # Speech detection threshold
    min_speech_duration_ms=250,   # Minimum speech segment duration
    max_speech_duration_s=20,     # Maximum speech segment duration
    min_silence_duration_ms=100,  # Minimum silence to split segments
    speech_pad_ms=30,             # Padding around speech segments
)

# Get timestamps with VAD
model_vad_timestamps = (
    onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3").with_vad(vad).with_timestamps()
)
for segment in model_vad_timestamps.recognize("audio.wav"):
    print(f"[{segment.start:.1f}s - {segment.end:.1f}s]: {segment.text}")
    print(f"  Tokens: {segment.tokens}")
    print(f"  Token times: {segment.timestamps}")
```

## Using soundfile for Audio Input

Read audio files with the soundfile library for formats beyond standard WAV. Convert the audio to a float32 NumPy array and pass it to the recognize method with the appropriate sample rate.

```python
import onnx_asr
import soundfile as sf

model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3")

# Read an audio file with soundfile (supports FLAC, OGG, etc.)
waveform, sample_rate = sf.read("audio.flac", dtype="float32")

# Pass to the model with its sample rate
text = model.recognize(waveform, sample_rate=sample_rate)
print(text)

# For stereo audio, average the channels
waveform, sample_rate = sf.read("stereo.wav", dtype="float32")
text = model.recognize(waveform, sample_rate=sample_rate, channel="mean")
```

## CLI Usage

Use the command-line interface for quick speech recognition from the terminal. It supports model selection, file paths, quantization, language options, and VAD processing.

```bash
# Basic recognition
onnx-asr nemo-parakeet-tdt-0.6b-v3 audio.wav

# Multiple files
onnx-asr nemo-parakeet-tdt-0.6b-v3 file1.wav file2.wav file3.wav

# With quantization
onnx-asr nemo-parakeet-tdt-0.6b-v3 audio.wav -q int8

# With a local model path
onnx-asr nemo-parakeet-tdt-0.6b-v3 audio.wav -p ./models/parakeet

# With VAD for long audio
onnx-asr nemo-parakeet-tdt-0.6b-v3 long_recording.wav --vad silero

# Multilingual model with a language option
onnx-asr onnx-community/whisper-large-v3-turbo audio.wav --lang fr

# Show help
onnx-asr -h

# Show version
onnx-asr --version
```

## Gradio Web Interface

Create a simple web interface for speech recognition using Gradio. The interface accepts audio input from a microphone or file upload and displays the transcribed text.

```python
import onnx_asr
import gradio as gr

model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3")

def recognize(audio):
    if not audio:
        return None
    sample_rate, waveform = audio
    # Normalize int16 audio to float32
    waveform = waveform / 2**15
    return model.recognize(waveform, sample_rate=sample_rate, channel="mean")

demo = gr.Interface(
    fn=recognize,
    inputs=gr.Audio(sources=["microphone", "upload"]),
    outputs="text",
    title="Speech Recognition with onnx-asr",
    description="Upload audio or record from microphone",
)

demo.launch()
```

## Manager Class

Use the Manager class for advanced control over model creation with shared runtime configuration.
It enables creating multiple ASR, VAD, and speaker embedding models with consistent ONNX session options and preprocessor settings.

```python
import onnx_asr
from onnx_asr.loader import Manager

# Create a manager with custom providers
manager = Manager(
    providers=[
        ("CUDAExecutionProvider", {"device_id": 0}),
        "CPUExecutionProvider",
    ],
    preprocessor_config={
        "max_concurrent_workers": 4,
        "use_numpy_preprocessors": True,
    },
)

# Create models using the manager
asr = manager.create_asr("nemo-parakeet-tdt-0.6b-v3")
vad = manager.create_vad("silero")

# Use the models
text = asr.recognize("audio.wav")
print(text)

# Create a model with VAD
asr_with_vad = asr.with_vad(vad)
for segment in asr_with_vad.recognize("long_audio.wav"):
    print(f"[{segment.start:.1f}s - {segment.end:.1f}s]: {segment.text}")
```

## Supported Models Reference

Complete list of supported model names that can be downloaded automatically from Hugging Face and used with load_model.

```python
import onnx_asr

# GigaAM models (Russian)
model = onnx_asr.load_model("gigaam-v2-ctc")       # GigaAM v2, CTC decoder
model = onnx_asr.load_model("gigaam-v2-rnnt")      # GigaAM v2, RNN-T decoder
model = onnx_asr.load_model("gigaam-v3-ctc")       # GigaAM v3, CTC decoder
model = onnx_asr.load_model("gigaam-v3-rnnt")      # GigaAM v3, RNN-T decoder
model = onnx_asr.load_model("gigaam-v3-e2e-ctc")   # GigaAM v3, end-to-end CTC
model = onnx_asr.load_model("gigaam-v3-e2e-rnnt")  # GigaAM v3, end-to-end RNN-T

# NeMo FastConformer (Russian)
model = onnx_asr.load_model("nemo-fastconformer-ru-ctc")
model = onnx_asr.load_model("nemo-fastconformer-ru-rnnt")

# NeMo Parakeet (English)
model = onnx_asr.load_model("nemo-parakeet-ctc-0.6b")
model = onnx_asr.load_model("nemo-parakeet-rnnt-0.6b")
model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v2")

# NeMo Parakeet v3 (multilingual)
model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3")

# NeMo Canary (multilingual)
model = onnx_asr.load_model("nemo-canary-1b-v2")

# Vosk (Russian)
model = onnx_asr.load_model("alphacep/vosk-model-ru")
model = onnx_asr.load_model("alphacep/vosk-model-small-ru")

# T-Tech T-one (Russian)
model = onnx_asr.load_model("t-tech/t-one")

# Whisper models
model = onnx_asr.load_model("whisper-base")
model = onnx_asr.load_model("onnx-community/whisper-tiny")
model = onnx_asr.load_model("onnx-community/whisper-base")
model = onnx_asr.load_model("onnx-community/whisper-small")
model = onnx_asr.load_model("onnx-community/whisper-large-v3-turbo")

# VAD models
vad = onnx_asr.load_vad("silero")
vad = onnx_asr.load_vad("onnx-community/pyannote-segmentation-3.0")
```

## Summary

onnx-asr provides a streamlined solution for integrating automatic speech recognition into Python applications. Typical use cases include transcribing audio files and streams, building voice-enabled applications, creating speech-to-text pipelines, and processing recorded meetings or calls.

The core integration pattern is to load a model with `load_model()`, optionally combine it with VAD using `with_vad()` for long-form audio, and call `recognize()` with file paths or NumPy arrays. For detailed analysis, chain `with_timestamps()` to get token-level timing and probabilities.

The package suits production environments thanks to its minimal dependencies, cross-platform support, and hardware acceleration options. Common patterns include batch processing multiple files for throughput, streaming recognition with VAD segmentation, and building web interfaces with Gradio. The Manager class enables sharing runtime configuration across multiple models in complex applications. Performance can be optimized through quantized models (`quantization="int8"`), TensorRT acceleration on NVIDIA GPUs, and parallel preprocessing with configurable worker counts.
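When feeding raw PCM buffers to `recognize()` (for example from a microphone callback, as in the Gradio example above), the int16-to-float32 normalization and channel averaging can be done up front with plain NumPy. A minimal sketch of that preprocessing; the helper name `pcm16_to_float_mono` is illustrative and not part of onnx-asr:

```python
import numpy as np

def pcm16_to_float_mono(pcm: np.ndarray) -> np.ndarray:
    """Convert int16 PCM (mono or channels-last stereo) to float32 mono in [-1, 1)."""
    x = pcm.astype(np.float32) / 2**15  # same scaling as the Gradio example
    if x.ndim == 2:                     # (samples, channels) -> average the channels
        x = x.mean(axis=1)
    return x

# Example: one second of stereo int16 audio at 16 kHz,
# left channel at half scale, right channel silent
stereo = np.zeros((16000, 2), dtype=np.int16)
stereo[:, 0] = 16384
mono = pcm16_to_float_mono(stereo)
print(mono.dtype, mono.shape)  # float32 (16000,)
# mono can now be passed to model.recognize(mono, sample_rate=16000)
```

Alternatively, pass the raw array and let `channel="mean"` handle the averaging, as shown in the recognize section.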