# onnx-asr

onnx-asr is a lightweight Python package for Automatic Speech Recognition (ASR) using ONNX models. It provides a fast, easy-to-use pure Python interface with minimal dependencies: only NumPy and ONNX Runtime are required, with no need for PyTorch, Transformers, or FFmpeg. The package supports modern ASR architectures including NVIDIA NeMo Parakeet/Canary, GigaChat GigaAM, Kaldi/Vosk, T-Tech T-one, and OpenAI Whisper models.

The library runs on a wide range of hardware, from IoT/edge devices to servers with powerful GPUs, and supports Windows, Linux, and macOS on x86 and ARM CPUs. Hardware acceleration is available through CUDA, TensorRT, CoreML, DirectML, ROCm, and WebGPU.

Key features include loading models from Hugging Face or local directories (including quantized versions), accepting WAV files or NumPy arrays with built-in resampling, batch processing, long-form recognition using Voice Activity Detection (VAD), and returning token-level timestamps and log probabilities.

## load_model

Load an ASR model from Hugging Face or a local directory. This is the primary entry point for speech recognition. It supports various model architectures including NeMo Conformer/Parakeet/Canary, GigaAM, Kaldi/Vosk, T-one, and Whisper models. The function handles model downloading, preprocessor setup, and runtime configuration automatically.
```python
import onnx_asr

# Load model from Hugging Face (downloads automatically on first use)
model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3")

# Recognize speech from a WAV file
result = model.recognize("audio.wav")
print(result)  # Output: "hello world this is a test"

# Load with int8 quantization for faster inference
model_quantized = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3", quantization="int8")

# Load from a local directory
model_local = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3", "models/parakeet-v3")

# Load a custom model from a Hugging Face repository
model_custom = onnx_asr.load_model("istupakov/canary-180m-flash-onnx")

# Load a Whisper model
model_whisper = onnx_asr.load_model("onnx-community/whisper-large-v3-turbo")

# Configure the TensorRT provider for GPU acceleration
providers = [
    ("TensorrtExecutionProvider", {
        "trt_max_workspace_size": 6 * 1024**3,
        "trt_fp16_enable": True,
    })
]
model_gpu = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3", providers=providers)
```

## recognize

Perform speech recognition on audio input. Accepts WAV file paths, NumPy arrays, or lists for batch processing. The method supports channel selection for multi-channel audio and language specification for multilingual models such as Whisper and Canary.
```python
import onnx_asr
import numpy as np

model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3")

# Single-file recognition
text = model.recognize("audio.wav")
print(text)  # Output: "the quick brown fox jumps over the lazy dog"

# Batch recognition of multiple files
results = model.recognize(["file1.wav", "file2.wav", "file3.wav"])
print(results)  # Output: ["transcript one", "transcript two", "transcript three"]

# Recognition from a NumPy array (16 kHz mono float32)
sample_rate = 16000
duration = 3.0
waveform = np.random.randn(int(sample_rate * duration)).astype(np.float32)
text = model.recognize(waveform, sample_rate=16000)

# Handle multi-channel audio by averaging the channels
text = model.recognize("stereo_audio.wav", channel="mean")

# Or select a specific channel (0-indexed)
text = model.recognize("stereo_audio.wav", channel=0)

# For multilingual models (Whisper, Canary), specify the language
whisper = onnx_asr.load_model("onnx-community/whisper-large-v3-turbo")
text = whisper.recognize("french_audio.wav", language="fr")

# Canary model with punctuation and capitalization
canary = onnx_asr.load_model("nemo-canary-1b-v2")
text = canary.recognize("audio.wav", language="en", pnc=True)
```

## with_timestamps

Enable timestamped recognition to get token-level timing, log probabilities, and individual tokens. Returns a TimestampedResult object containing the text, a timestamps array, a tokens list, and a logprobs list for detailed analysis of the recognition output.
```python
import onnx_asr

model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3").with_timestamps()

# Get detailed recognition results
result = model.recognize("audio.wav")
print(f"Text: {result.text}")                   # Output: "hello world"
print(f"Tokens: {result.tokens}")               # Output: [' hello', ' world']
print(f"Timestamps: {result.timestamps}")       # Output: [0.32, 0.64]
print(f"Log probabilities: {result.logprobs}")  # Output: [-0.123, -0.089]

# Batch processing with timestamps
results = model.recognize(["file1.wav", "file2.wav"])
for r in results:
    print(f"{r.text} at times {r.timestamps}")
```

## load_vad

Load a Voice Activity Detection (VAD) model for processing long audio files. VAD segments audio into speech chunks before recognition, enabling processing of recordings longer than the ASR model's maximum duration (typically 20-30 seconds). Supports Silero VAD and PyAnnote segmentation models.

```python
import onnx_asr

# Load a VAD model (Silero is the default)
vad = onnx_asr.load_vad("silero")

# Alternative: PyAnnote segmentation
vad_pyannote = onnx_asr.load_vad("onnx-community/pyannote-segmentation-3.0")

# Load with quantization
vad_quantized = onnx_asr.load_vad("silero", quantization="int8")
```

## with_vad

Combine an ASR model with VAD for long-form audio recognition. Returns an iterator of SegmentResult objects, each containing the start time, end time, and transcribed text of a detected speech segment. Supports configurable VAD parameters for threshold, duration, and padding.
```python
import onnx_asr

vad = onnx_asr.load_vad("silero")
model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3").with_vad(vad)

# Recognize a long audio file with automatic segmentation
for segment in model.recognize("long_recording.wav"):
    print(f"[{segment.start:5.1f}s - {segment.end:5.1f}s]: {segment.text}")
# Output:
# [  0.5s -   3.2s]: hello and welcome to our presentation
# [  4.1s -   8.7s]: today we will discuss speech recognition
# [ 10.2s -  15.6s]: let us begin with the fundamentals

# Configure VAD parameters
model_custom_vad = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3").with_vad(
    vad,
    batch_size=8,                 # Parallel segment processing
    threshold=0.5,                # Speech detection threshold
    min_speech_duration_ms=250,   # Minimum speech segment duration
    max_speech_duration_s=20,     # Maximum speech segment duration
    min_silence_duration_ms=100,  # Minimum silence to split segments
    speech_pad_ms=30,             # Padding around speech segments
)

# Get timestamps with VAD
model_vad_timestamps = (
    onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3").with_vad(vad).with_timestamps()
)
for segment in model_vad_timestamps.recognize("audio.wav"):
    print(f"[{segment.start:.1f}s - {segment.end:.1f}s]: {segment.text}")
    print(f"  Tokens: {segment.tokens}")
    print(f"  Token times: {segment.timestamps}")
```

## Using soundfile for Audio Input

Read audio files with the soundfile library for formats beyond standard WAV. Convert the audio to a float32 NumPy array and pass it to the recognize method with the appropriate sample rate.

```python
import onnx_asr
import soundfile as sf

model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3")

# Read an audio file with soundfile (supports FLAC, OGG, etc.)
waveform, sample_rate = sf.read("audio.flac", dtype="float32")

# Pass to the model with its sample rate
text = model.recognize(waveform, sample_rate=sample_rate)
print(text)

# For stereo audio, average the channels
waveform, sample_rate = sf.read("stereo.wav", dtype="float32")
text = model.recognize(waveform, sample_rate=sample_rate, channel="mean")
```

## CLI Usage

Use the command-line interface for quick speech recognition from the terminal. It supports model selection, file paths, quantization, language options, and VAD processing.

```bash
# Basic recognition
onnx-asr nemo-parakeet-tdt-0.6b-v3 audio.wav

# Multiple files
onnx-asr nemo-parakeet-tdt-0.6b-v3 file1.wav file2.wav file3.wav

# With quantization
onnx-asr nemo-parakeet-tdt-0.6b-v3 audio.wav -q int8

# With a local model path
onnx-asr nemo-parakeet-tdt-0.6b-v3 audio.wav -p ./models/parakeet

# With VAD for long audio
onnx-asr nemo-parakeet-tdt-0.6b-v3 long_recording.wav --vad silero

# Multilingual model with a language option
onnx-asr onnx-community/whisper-large-v3-turbo audio.wav --lang fr

# Show help
onnx-asr -h

# Show version
onnx-asr --version
```

## Gradio Web Interface

Create a simple web interface for speech recognition using Gradio. The interface accepts audio input from a microphone or file upload and displays the transcribed text.

```python
import onnx_asr
import gradio as gr

model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3")

def recognize(audio):
    if not audio:
        return None
    sample_rate, waveform = audio
    # Normalize int16 audio to float32
    waveform = waveform / 2**15
    return model.recognize(waveform, sample_rate=sample_rate, channel="mean")

demo = gr.Interface(
    fn=recognize,
    inputs=gr.Audio(sources=["microphone", "upload"]),
    outputs="text",
    title="Speech Recognition with onnx-asr",
    description="Upload audio or record from microphone",
)

demo.launch()
```

## Manager Class

Use the Manager class for advanced control over model creation with shared runtime configuration.
It enables creating multiple ASR, VAD, and speaker embedding models with consistent ONNX session options and preprocessor settings.

```python
import onnx_asr
from onnx_asr.loader import Manager

# Create a manager with custom providers
manager = Manager(
    providers=[
        ("CUDAExecutionProvider", {"device_id": 0}),
        "CPUExecutionProvider",
    ],
    preprocessor_config={
        "max_concurrent_workers": 4,
        "use_numpy_preprocessors": True,
    },
)

# Create models using the manager
asr = manager.create_asr("nemo-parakeet-tdt-0.6b-v3")
vad = manager.create_vad("silero")

# Use the models
text = asr.recognize("audio.wav")
print(text)

# Create a model with VAD
asr_with_vad = asr.with_vad(vad)
for segment in asr_with_vad.recognize("long_audio.wav"):
    print(f"[{segment.start:.1f}s - {segment.end:.1f}s]: {segment.text}")
```

## Supported Models Reference

Complete list of supported model names that can be downloaded automatically from Hugging Face and used with load_model.

```python
import onnx_asr

# GigaAM models (Russian)
model = onnx_asr.load_model("gigaam-v2-ctc")       # GigaAM v2, CTC decoder
model = onnx_asr.load_model("gigaam-v2-rnnt")      # GigaAM v2, RNN-T decoder
model = onnx_asr.load_model("gigaam-v3-ctc")       # GigaAM v3, CTC decoder
model = onnx_asr.load_model("gigaam-v3-rnnt")      # GigaAM v3, RNN-T decoder
model = onnx_asr.load_model("gigaam-v3-e2e-ctc")   # GigaAM v3, end-to-end CTC
model = onnx_asr.load_model("gigaam-v3-e2e-rnnt")  # GigaAM v3, end-to-end RNN-T

# NeMo FastConformer (Russian)
model = onnx_asr.load_model("nemo-fastconformer-ru-ctc")
model = onnx_asr.load_model("nemo-fastconformer-ru-rnnt")

# NeMo Parakeet (English)
model = onnx_asr.load_model("nemo-parakeet-ctc-0.6b")
model = onnx_asr.load_model("nemo-parakeet-rnnt-0.6b")
model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v2")

# NeMo Parakeet v3 (multilingual)
model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3")

# NeMo Canary (multilingual)
model = onnx_asr.load_model("nemo-canary-1b-v2")

# Vosk (Russian)
model = onnx_asr.load_model("alphacep/vosk-model-ru")
model = onnx_asr.load_model("alphacep/vosk-model-small-ru")

# T-Tech T-one (Russian)
model = onnx_asr.load_model("t-tech/t-one")

# Whisper models
model = onnx_asr.load_model("whisper-base")
model = onnx_asr.load_model("onnx-community/whisper-tiny")
model = onnx_asr.load_model("onnx-community/whisper-base")
model = onnx_asr.load_model("onnx-community/whisper-small")
model = onnx_asr.load_model("onnx-community/whisper-large-v3-turbo")

# VAD models
vad = onnx_asr.load_vad("silero")
vad = onnx_asr.load_vad("onnx-community/pyannote-segmentation-3.0")
```

## Summary

onnx-asr provides a streamlined solution for integrating automatic speech recognition into Python applications. Typical use cases include transcribing audio files and streams, building voice-enabled applications, creating speech-to-text pipelines, and processing recorded meetings or calls.

The core integration pattern is to load a model with `load_model()`, optionally combine it with VAD using `with_vad()` for long-form audio, and call `recognize()` with file paths or NumPy arrays. For detailed analysis, chain `with_timestamps()` to get token-level timing and probabilities.

The package suits production environments thanks to its minimal dependencies, cross-platform support, and hardware acceleration options. Common patterns include batch processing multiple files for throughput, streaming recognition with VAD segmentation, and building web interfaces with Gradio. The Manager class enables sharing runtime configuration across multiple models in complex applications. Performance can be optimized through quantized models (`quantization="int8"`), TensorRT acceleration on NVIDIA GPUs, and parallel preprocessing with configurable worker counts.
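When feeding raw PCM buffers to `recognize()` (for example from a microphone callback, as in the Gradio example above), the int16-to-float32 normalization and channel averaging can be done up front with plain NumPy. A minimal sketch of that preprocessing; the helper name `pcm16_to_float_mono` is illustrative and not part of onnx-asr:

```python
import numpy as np

def pcm16_to_float_mono(pcm: np.ndarray) -> np.ndarray:
    """Convert int16 PCM (mono or channels-last stereo) to float32 mono in [-1, 1)."""
    x = pcm.astype(np.float32) / 2**15  # same scaling as the Gradio example
    if x.ndim == 2:                     # (samples, channels) -> average the channels
        x = x.mean(axis=1)
    return x

# Example: one second of stereo int16 audio at 16 kHz,
# left channel at half scale, right channel silent
stereo = np.zeros((16000, 2), dtype=np.int16)
stereo[:, 0] = 16384
mono = pcm16_to_float_mono(stereo)
print(mono.dtype, mono.shape)  # float32 (16000,)
# mono can now be passed to model.recognize(mono, sample_rate=16000)
```

Alternatively, pass the raw array and let `channel="mean"` handle the averaging, as shown in the recognize section.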