# Speech Recognition

SpeechRecognition is a Python library for performing speech recognition, with support for several engines and APIs, both online and offline. It provides a unified interface to multiple speech recognition services, including Google Speech Recognition, Google Cloud Speech-to-Text, CMU Sphinx, Wit.ai, Microsoft Azure Speech, Houndify, IBM Speech to Text, Vosk, OpenAI Whisper (local and API), and the Groq Whisper API.

The library is designed around two core concepts: audio sources (such as microphones and audio files) and recognizers that process the audio. It handles audio capture, format conversion, and communication with the various speech-to-text backends transparently, making it easy to switch between recognition engines without significantly changing your application code.

## Recognizer Class

The `Recognizer` class is the main entry point for speech recognition functionality. It contains methods for recording audio, adjusting for ambient noise, and recognizing speech using various engines.

```python
import speech_recognition as sr

# Create a Recognizer instance
r = sr.Recognizer()

# Configure recognition settings
r.energy_threshold = 300  # Minimum audio energy to consider for recording
r.dynamic_energy_threshold = True  # Auto-adjust threshold based on ambient noise
r.pause_threshold = 0.8  # Seconds of silence before a phrase is considered complete
r.operation_timeout = None  # Timeout for API requests (None = no timeout)
```

## Microphone Audio Source

The `Microphone` class represents a physical microphone and is used as an audio source for real-time speech recognition. Requires PyAudio to be installed.
```python
import speech_recognition as sr

r = sr.Recognizer()

# Use the default microphone
with sr.Microphone() as source:
    print("Say something!")
    audio = r.listen(source)

# Use a specific microphone by device index
with sr.Microphone(device_index=3, sample_rate=16000, chunk_size=1024) as source:
    audio = r.listen(source)

# List all available microphones
for index, name in enumerate(sr.Microphone.list_microphone_names()):
    print(f"Microphone {index}: {name}")

# Find working microphones (those currently hearing sounds)
working_mics = sr.Microphone.list_working_microphones()
for device_index, name in working_mics.items():
    print(f"Working microphone {device_index}: {name}")
```

## AudioFile Audio Source

The `AudioFile` class allows you to use WAV, AIFF, and FLAC audio files as audio sources for speech recognition.

```python
import speech_recognition as sr

r = sr.Recognizer()

# From a file path
with sr.AudioFile("audio.wav") as source:
    audio = r.record(source)  # Read the entire file
    print(f"Audio duration: {source.DURATION} seconds")

# Record a specific portion of the audio
with sr.AudioFile("audio.wav") as source:
    audio = r.record(source, duration=5)  # First 5 seconds

with sr.AudioFile("audio.wav") as source:
    audio = r.record(source, offset=2, duration=5)  # 5 seconds starting at 2s

# Using the AudioData.from_file() shortcut
audio = sr.AudioData.from_file("audio.wav")
```

## Adjust for Ambient Noise

Calibrates the recognizer's energy threshold to account for ambient noise levels. Should be called before listening for speech in noisy environments.
```python
import speech_recognition as sr

r = sr.Recognizer()

with sr.Microphone() as source:
    # Calibrate for 1 second (default)
    r.adjust_for_ambient_noise(source)

    # Or calibrate for longer in very noisy environments
    r.adjust_for_ambient_noise(source, duration=2)

    print("Say something!")
    audio = r.listen(source)
```

## Listen Method

Records a single phrase from an audio source, automatically detecting when speech starts and ends based on energy thresholds.

```python
import speech_recognition as sr

r = sr.Recognizer()

with sr.Microphone() as source:
    # Basic listening
    audio = r.listen(source)

    # With timeout (raises WaitTimeoutError if no speech is detected)
    try:
        audio = r.listen(source, timeout=5)
    except sr.WaitTimeoutError:
        print("No speech detected within timeout")

    # With a phrase time limit (cuts off after the limit)
    audio = r.listen(source, phrase_time_limit=10)
```

## Background Listening

Spawns a background thread that continuously listens for phrases and processes them via a callback function.

```python
import speech_recognition as sr
import time

def callback(recognizer, audio):
    try:
        text = recognizer.recognize_google(audio)
        print(f"You said: {text}")
    except sr.UnknownValueError:
        print("Could not understand audio")
    except sr.RequestError as e:
        print(f"Request error: {e}")

r = sr.Recognizer()
m = sr.Microphone()

with m as source:
    r.adjust_for_ambient_noise(source)

# Start background listening
stop_listening = r.listen_in_background(m, callback)

# Do other work while listening continues
print("Listening in background... (say something)")
time.sleep(30)

# Stop background listening
stop_listening(wait_for_stop=True)
```

## Google Speech Recognition

Uses the free Google Speech Recognition API. Works out of the box with a default API key (for testing only).
```python
import speech_recognition as sr

r = sr.Recognizer()
audio = sr.AudioData.from_file("audio.wav")

try:
    # Using the default API key (for testing)
    text = r.recognize_google(audio)
    print(f"Google thinks you said: {text}")

    # With a custom API key
    text = r.recognize_google(audio, key="YOUR_API_KEY")

    # Specify a language
    text = r.recognize_google(audio, language="fr-FR")  # French

    # With profanity filter (0 = off, 1 = on)
    text = r.recognize_google(audio, pfilter=1)

    # Get the full response
    result = r.recognize_google(audio, show_all=True)
    print(result)
except sr.UnknownValueError:
    print("Google could not understand the audio")
except sr.RequestError as e:
    print(f"Could not request results from Google; {e}")
```

## Google Cloud Speech-to-Text

Uses the Google Cloud Speech-to-Text V1 API. Requires a Google Cloud account and credentials.

```python
import speech_recognition as sr

r = sr.Recognizer()
audio = sr.AudioData.from_file("audio.wav")

try:
    # Using Application Default Credentials (ADC)
    text = r.recognize_google_cloud(audio)
    print(f"Google Cloud thinks you said: {text}")

    # With an explicit credentials file
    text = r.recognize_google_cloud(
        audio,
        credentials_json_path="/path/to/credentials.json"
    )

    # With language and model options
    text = r.recognize_google_cloud(
        audio,
        language_code="en-US",
        model="latest_long",
        use_enhanced=True
    )

    # With preferred phrases for better recognition
    text = r.recognize_google_cloud(
        audio,
        preferred_phrases=["hello world", "speech recognition"]
    )

    # Get the full response
    response = r.recognize_google_cloud(audio, show_all=True)
except sr.UnknownValueError:
    print("Google Cloud could not understand the audio")
except sr.RequestError as e:
    print(f"Request error: {e}")
```

## OpenAI Whisper (Local)

Uses OpenAI's Whisper model locally for offline speech recognition. Requires the whisper package.
```python
import speech_recognition as sr

r = sr.Recognizer()
audio = sr.AudioData.from_file("audio.wav")

try:
    # Basic usage with the base model
    text = r.recognize_whisper(audio)
    print(f"Whisper thinks you said: {text}")

    # Use different model sizes: tiny, base, small, medium, large
    text = r.recognize_whisper(audio, model="medium")

    # Specify a language (improves accuracy)
    text = r.recognize_whisper(audio, model="base", language="english")

    # Translate to English
    text = r.recognize_whisper(audio, model="base", task="translate")

    # Get the full response with timestamps
    result = r.recognize_whisper(audio, model="base", show_dict=True)
    print(result["text"])
    for segment in result["segments"]:
        print(f"[{segment['start']:.2f}s - {segment['end']:.2f}s] {segment['text']}")
except sr.UnknownValueError:
    print("Whisper could not understand the audio")
except sr.RequestError as e:
    print(f"Whisper error: {e}")
```

## Faster Whisper (Local)

Uses the faster-whisper implementation for more efficient local speech recognition.

```python
import speech_recognition as sr

r = sr.Recognizer()
audio = sr.AudioData.from_file("audio.wav")

try:
    # Basic usage
    text = r.recognize_faster_whisper(audio)
    print(f"Faster Whisper: {text}")

    # With model and language
    text = r.recognize_faster_whisper(
        audio,
        model="base",
        language="en"
    )

    # Get detailed output
    result = r.recognize_faster_whisper(audio, model="base", show_dict=True)
except sr.RequestError as e:
    print(f"Error: {e}")
```

## OpenAI Whisper API

Uses OpenAI's cloud-based Whisper API. Requires an OpenAI API key.
```python
import os
import speech_recognition as sr

# Set the API key
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"

r = sr.Recognizer()
audio = sr.AudioData.from_file("audio.wav")

try:
    # Basic usage
    text = r.recognize_openai(audio)
    print(f"OpenAI Whisper API: {text}")

    # With a specific model
    text = r.recognize_openai(audio, model="whisper-1")

    # With a language hint and prompt
    text = r.recognize_openai(
        audio,
        model="whisper-1",
        language="en",
        prompt="This is a technical discussion about programming."
    )
except sr.RequestError as e:
    print(f"OpenAI API error: {e}")

# For self-hosted OpenAI-compatible endpoints (vLLM, Ollama, etc.)
os.environ["OPENAI_BASE_URL"] = "http://localhost:8000/v1"
os.environ["OPENAI_API_KEY"] = "dummy"
text = r.recognize_openai(audio)
```

## Groq Whisper API

Uses Groq's fast Whisper API for cloud-based speech recognition.

```python
import os
import speech_recognition as sr

# Set the API key
os.environ["GROQ_API_KEY"] = "your-groq-api-key"

r = sr.Recognizer()
audio = sr.AudioData.from_file("audio.wav")

try:
    # Basic usage with the default model
    text = r.recognize_groq(audio)
    print(f"Groq Whisper: {text}")

    # With a specific model
    text = r.recognize_groq(audio, model="whisper-large-v3-turbo")
    # or
    text = r.recognize_groq(audio, model="whisper-large-v3")

    # With options
    text = r.recognize_groq(
        audio,
        model="whisper-large-v3-turbo",
        language="en",
        temperature=0.0
    )
except sr.RequestError as e:
    print(f"Groq API error: {e}")
```

## Vosk (Offline)

Uses Vosk for offline speech recognition. Requires downloading a Vosk model.
```python
import speech_recognition as sr

# First, download the Vosk model using the CLI:
#   sprc download vosk

r = sr.Recognizer()
audio = sr.AudioData.from_file("audio.wav")

try:
    # Basic usage
    text = r.recognize_vosk(audio)
    print(f"Vosk thinks you said: {text}")

    # Get verbose output
    result = r.recognize_vosk(audio, verbose=True)
    print(result)  # Returns a dict with a 'text' key
except sr.UnknownValueError:
    print("Vosk could not understand the audio")
except sr.RequestError as e:
    print(f"Vosk error: {e}")
```

## CMU Sphinx (Offline)

Uses CMU PocketSphinx for offline speech recognition. Requires the pocketsphinx package.

```python
import speech_recognition as sr

r = sr.Recognizer()
audio = sr.AudioData.from_file("audio.wav")

try:
    # Basic usage
    text = r.recognize_sphinx(audio)
    print(f"Sphinx thinks you said: {text}")

    # Specify a language
    text = r.recognize_sphinx(audio, language="en-US")

    # With keyword spotting (phrase, sensitivity)
    text = r.recognize_sphinx(
        audio,
        keyword_entries=[
            ("hello", 1.0),
            ("goodbye", 0.8)
        ]
    )

    # With a custom grammar file
    text = r.recognize_sphinx(audio, grammar="commands.gram")

    # Get the full decoder object
    decoder = r.recognize_sphinx(audio, show_all=True)
except sr.UnknownValueError:
    print("Sphinx could not understand the audio")
except sr.RequestError as e:
    print(f"Sphinx error: {e}")
```

## Wit.ai Recognition

Uses Wit.ai for speech recognition. Requires a Wit.ai account and API key.

```python
import speech_recognition as sr

r = sr.Recognizer()
audio = sr.AudioData.from_file("audio.wav")

WIT_AI_KEY = "YOUR_WIT_AI_API_KEY"

try:
    text = r.recognize_wit(audio, key=WIT_AI_KEY)
    print(f"Wit.ai thinks you said: {text}")

    # Get the full response
    result = r.recognize_wit(audio, key=WIT_AI_KEY, show_all=True)
except sr.UnknownValueError:
    print("Wit.ai could not understand the audio")
except sr.RequestError as e:
    print(f"Wit.ai error: {e}")
```

## Microsoft Azure Speech

Uses Microsoft Azure Speech Services for recognition.
```python
import speech_recognition as sr

r = sr.Recognizer()
audio = sr.AudioData.from_file("audio.wav")

AZURE_KEY = "YOUR_AZURE_SPEECH_KEY"

try:
    # Basic usage
    text, confidence = r.recognize_azure(audio, key=AZURE_KEY)
    print(f"Azure: {text} (confidence: {confidence})")

    # With options
    text, confidence = r.recognize_azure(
        audio,
        key=AZURE_KEY,
        language="en-US",
        location="westus",
        profanity="masked"  # or "removed", "raw"
    )

    # Get the full response
    result = r.recognize_azure(audio, key=AZURE_KEY, show_all=True)
except sr.UnknownValueError:
    print("Azure could not understand the audio")
except sr.RequestError as e:
    print(f"Azure error: {e}")
```

## Houndify Recognition

Uses the Houndify API for speech recognition.

```python
import speech_recognition as sr

r = sr.Recognizer()
audio = sr.AudioData.from_file("audio.wav")

HOUNDIFY_CLIENT_ID = "YOUR_CLIENT_ID"
HOUNDIFY_CLIENT_KEY = "YOUR_CLIENT_KEY"

try:
    text, confidence = r.recognize_houndify(
        audio,
        client_id=HOUNDIFY_CLIENT_ID,
        client_key=HOUNDIFY_CLIENT_KEY
    )
    print(f"Houndify: {text} (confidence: {confidence})")

    # Get the full response
    result = r.recognize_houndify(
        audio,
        client_id=HOUNDIFY_CLIENT_ID,
        client_key=HOUNDIFY_CLIENT_KEY,
        show_all=True
    )
except sr.UnknownValueError:
    print("Houndify could not understand the audio")
except sr.RequestError as e:
    print(f"Houndify error: {e}")
```

## IBM Speech to Text

Uses the IBM Watson Speech to Text service.
```python
import speech_recognition as sr

r = sr.Recognizer()
audio = sr.AudioData.from_file("audio.wav")

IBM_API_KEY = "YOUR_IBM_API_KEY"

try:
    text, confidence = r.recognize_ibm(audio, key=IBM_API_KEY)
    print(f"IBM: {text} (confidence: {confidence})")

    # With a language
    text, confidence = r.recognize_ibm(
        audio,
        key=IBM_API_KEY,
        language="en-US"
    )

    # Get the full response
    result = r.recognize_ibm(audio, key=IBM_API_KEY, show_all=True)
except sr.UnknownValueError:
    print("IBM could not understand the audio")
except sr.RequestError as e:
    print(f"IBM error: {e}")
```

## AudioData Class

The `AudioData` class represents captured audio data. It provides methods for converting the audio to various formats.

```python
import speech_recognition as sr

r = sr.Recognizer()

# Capture audio from the microphone
with sr.Microphone() as source:
    audio = r.listen(source)

# Or load from a file
audio = sr.AudioData.from_file("audio.wav")

# Get an audio segment
segment = audio.get_segment(start_ms=1000, end_ms=5000)  # 1s to 5s

# Export to various formats
raw_data = audio.get_raw_data()
wav_data = audio.get_wav_data()
aiff_data = audio.get_aiff_data()
flac_data = audio.get_flac_data()

# With conversion options
wav_data = audio.get_wav_data(convert_rate=16000, convert_width=2)

# Save to files
with open("output.raw", "wb") as f:
    f.write(audio.get_raw_data())
with open("output.wav", "wb") as f:
    f.write(audio.get_wav_data())
with open("output.aiff", "wb") as f:
    f.write(audio.get_aiff_data())
with open("output.flac", "wb") as f:
    f.write(audio.get_flac_data())
```

## CLI Tool

The SpeechRecognition library includes a command-line interface for downloading models.

```bash
# Download the Vosk model
sprc download vosk

# Download a specific Vosk model
sprc download vosk --url https://alphacephei.com/vosk/models/vosk-model-en-us-0.22.zip

# Quick test from the command line
python -m speech_recognition
```

## Exception Handling

The library defines several exception types for proper error handling.
```python
import speech_recognition as sr

r = sr.Recognizer()

try:
    with sr.Microphone() as source:
        audio = r.listen(source, timeout=5)
    text = r.recognize_google(audio)
    print(text)
except sr.WaitTimeoutError:
    print("Listening timed out while waiting for phrase to start")
except sr.UnknownValueError:
    print("Speech was unintelligible")
except sr.RequestError as e:
    print(f"API request failed: {e}")
except AttributeError:
    print("PyAudio not installed (required for Microphone)")
```

## Summary

SpeechRecognition is ideal for applications requiring speech-to-text capabilities, including voice assistants, transcription tools, voice-controlled interfaces, accessibility applications, and any project that needs to convert spoken language to text. The library's strength lies in its unified API, which abstracts away the differences between speech recognition backends, allowing developers to switch between services easily or provide fallback options.

Common integration patterns include using local engines like Vosk or Whisper for offline-first applications, Google Speech Recognition for quick prototyping, and cloud services like Google Cloud, OpenAI, or Azure for production deployments requiring high accuracy. The background listening feature enables continuous speech monitoring, while audio file support makes the library suitable for batch transcription tasks.

For best results, always calibrate for ambient noise before listening, handle exceptions appropriately, and choose the recognition engine based on your specific requirements for accuracy, latency, cost, and offline capability.
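Because every `recognize_*` method takes the same `AudioData` object and signals failure by raising, the fallback pattern mentioned above is easy to sketch engine-agnostically. `transcribe_with_fallback` below is a hypothetical helper (not part of the library): it tries each engine in order and returns the first transcript, assuming each recognizer callable raises on failure (e.g. `sr.UnknownValueError` or `sr.RequestError`).

```python
def transcribe_with_fallback(audio, engines):
    """Try each (name, recognize_fn) pair in order; return (name, transcript)
    from the first engine that succeeds.

    Each recognize_fn takes an AudioData object and returns a transcript
    string, raising an exception (such as sr.UnknownValueError or
    sr.RequestError) when it cannot transcribe or reach its API.
    """
    errors = {}
    for name, recognize in engines:
        try:
            return name, recognize(audio)
        except Exception as exc:  # record the failure and move to the next engine
            errors[name] = exc
    raise RuntimeError(f"All engines failed: {errors}")
```

A typical wiring, assuming `r` is a `Recognizer` and `audio` an `AudioData` instance, would pass `[("google", r.recognize_google), ("sphinx", r.recognize_sphinx)]` so an offline engine takes over whenever the network request fails.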