# Speech Recognition

SpeechRecognition is a Python library for performing speech recognition, with support for several engines and APIs, both online and offline. It provides a unified interface to multiple speech recognition services, including Google Speech Recognition, Google Cloud Speech-to-Text, CMU Sphinx, Wit.ai, Microsoft Azure Speech, Houndify, IBM Speech to Text, Vosk, OpenAI Whisper (local and API), and the Groq Whisper API.

The library is designed around two core concepts: audio sources (such as microphones and audio files) and recognizers that process the audio. It handles audio capture, format conversion, and communication with the various speech-to-text backends transparently, making it easy to switch between recognition engines without significantly changing your application code.

## Recognizer Class

The `Recognizer` class is the main entry point for speech recognition functionality. It contains methods for recording audio, adjusting for ambient noise, and recognizing speech using various engines.

```python
import speech_recognition as sr

# Create a Recognizer instance
r = sr.Recognizer()

# Configure recognition settings
r.energy_threshold = 300  # Minimum audio energy to consider for recording
r.dynamic_energy_threshold = True  # Auto-adjust threshold based on ambient noise
r.pause_threshold = 0.8  # Seconds of silence before a phrase is considered complete
r.operation_timeout = None  # Timeout for API requests (None = no timeout)
```

## Microphone Audio Source

The `Microphone` class represents a physical microphone and is used as an audio source for real-time speech recognition. Requires PyAudio to be installed.
```python
import speech_recognition as sr

r = sr.Recognizer()

# Use the default microphone
with sr.Microphone() as source:
    print("Say something!")
    audio = r.listen(source)

# Use a specific microphone by device index
with sr.Microphone(device_index=3, sample_rate=16000, chunk_size=1024) as source:
    audio = r.listen(source)

# List all available microphones
for index, name in enumerate(sr.Microphone.list_microphone_names()):
    print(f"Microphone {index}: {name}")

# Find working microphones (those currently hearing sounds)
working_mics = sr.Microphone.list_working_microphones()
for device_index, name in working_mics.items():
    print(f"Working microphone {device_index}: {name}")
```

## AudioFile Audio Source

The `AudioFile` class allows you to use WAV, AIFF, and FLAC audio files as audio sources for speech recognition.

```python
import speech_recognition as sr

r = sr.Recognizer()

# From a file path
with sr.AudioFile("audio.wav") as source:
    audio = r.record(source)  # Read the entire file
    print(f"Audio duration: {source.DURATION} seconds")

# Record a specific portion of the audio
with sr.AudioFile("audio.wav") as source:
    audio = r.record(source, duration=5)  # First 5 seconds

with sr.AudioFile("audio.wav") as source:
    audio = r.record(source, offset=2, duration=5)  # 5 seconds starting at 2s

# Using the AudioData.from_file() shortcut
audio = sr.AudioData.from_file("audio.wav")
```

## Adjust for Ambient Noise

Calibrates the recognizer's energy threshold to account for ambient noise levels. Should be called before listening for speech in noisy environments.
```python
import speech_recognition as sr

r = sr.Recognizer()

with sr.Microphone() as source:
    # Calibrate for 1 second (default)
    r.adjust_for_ambient_noise(source)

    # Or calibrate for longer in very noisy environments
    r.adjust_for_ambient_noise(source, duration=2)

    print("Say something!")
    audio = r.listen(source)
```

## Listen Method

Records a single phrase from an audio source, automatically detecting when speech starts and ends based on energy thresholds.

```python
import speech_recognition as sr

r = sr.Recognizer()

with sr.Microphone() as source:
    # Basic listening
    audio = r.listen(source)

    # With timeout (raises WaitTimeoutError if no speech is detected)
    try:
        audio = r.listen(source, timeout=5)
    except sr.WaitTimeoutError:
        print("No speech detected within timeout")

    # With a phrase time limit (cuts off after the limit)
    audio = r.listen(source, phrase_time_limit=10)
```

## Background Listening

Spawns a background thread that continuously listens for phrases and processes them via a callback function.

```python
import speech_recognition as sr
import time

def callback(recognizer, audio):
    try:
        text = recognizer.recognize_google(audio)
        print(f"You said: {text}")
    except sr.UnknownValueError:
        print("Could not understand audio")
    except sr.RequestError as e:
        print(f"Request error: {e}")

r = sr.Recognizer()
m = sr.Microphone()

with m as source:
    r.adjust_for_ambient_noise(source)

# Start background listening
stop_listening = r.listen_in_background(m, callback)

# Do other work while listening continues
print("Listening in background... (say something)")
time.sleep(30)

# Stop background listening
stop_listening(wait_for_stop=True)
```

## Google Speech Recognition

Uses the free Google Speech Recognition API. Works out of the box with a default API key (for testing only).
```python
import speech_recognition as sr

r = sr.Recognizer()
audio = sr.AudioData.from_file("audio.wav")

try:
    # Using the default API key (for testing)
    text = r.recognize_google(audio)
    print(f"Google thinks you said: {text}")

    # With a custom API key
    text = r.recognize_google(audio, key="YOUR_API_KEY")

    # Specify a language
    text = r.recognize_google(audio, language="fr-FR")  # French

    # With profanity filter (0 = off, 1 = on)
    text = r.recognize_google(audio, pfilter=1)

    # Get the full response
    result = r.recognize_google(audio, show_all=True)
    print(result)
except sr.UnknownValueError:
    print("Google could not understand the audio")
except sr.RequestError as e:
    print(f"Could not request results from Google; {e}")
```

## Google Cloud Speech-to-Text

Uses the Google Cloud Speech-to-Text V1 API. Requires a Google Cloud account and credentials.

```python
import speech_recognition as sr

r = sr.Recognizer()
audio = sr.AudioData.from_file("audio.wav")

try:
    # Using Application Default Credentials (ADC)
    text = r.recognize_google_cloud(audio)
    print(f"Google Cloud thinks you said: {text}")

    # With an explicit credentials file
    text = r.recognize_google_cloud(
        audio,
        credentials_json_path="/path/to/credentials.json"
    )

    # With language and model options
    text = r.recognize_google_cloud(
        audio,
        language_code="en-US",
        model="latest_long",
        use_enhanced=True
    )

    # With preferred phrases for better recognition
    text = r.recognize_google_cloud(
        audio,
        preferred_phrases=["hello world", "speech recognition"]
    )

    # Get the full response
    response = r.recognize_google_cloud(audio, show_all=True)
except sr.UnknownValueError:
    print("Google Cloud could not understand the audio")
except sr.RequestError as e:
    print(f"Request error: {e}")
```

## OpenAI Whisper (Local)

Uses OpenAI's Whisper model locally for offline speech recognition. Requires the whisper package.
```python
import speech_recognition as sr

r = sr.Recognizer()
audio = sr.AudioData.from_file("audio.wav")

try:
    # Basic usage with the base model
    text = r.recognize_whisper(audio)
    print(f"Whisper thinks you said: {text}")

    # Use different model sizes: tiny, base, small, medium, large
    text = r.recognize_whisper(audio, model="medium")

    # Specify a language (improves accuracy)
    text = r.recognize_whisper(audio, model="base", language="english")

    # Translate to English
    text = r.recognize_whisper(audio, model="base", task="translate")

    # Get the full response with timestamps
    result = r.recognize_whisper(audio, model="base", show_dict=True)
    print(result["text"])
    for segment in result["segments"]:
        print(f"[{segment['start']:.2f}s - {segment['end']:.2f}s] {segment['text']}")
except sr.UnknownValueError:
    print("Whisper could not understand the audio")
except sr.RequestError as e:
    print(f"Whisper error: {e}")
```

## Faster Whisper (Local)

Uses the faster-whisper implementation for more efficient local speech recognition.

```python
import speech_recognition as sr

r = sr.Recognizer()
audio = sr.AudioData.from_file("audio.wav")

try:
    # Basic usage
    text = r.recognize_faster_whisper(audio)
    print(f"Faster Whisper: {text}")

    # With model and language
    text = r.recognize_faster_whisper(
        audio,
        model="base",
        language="en"
    )

    # Get detailed output
    result = r.recognize_faster_whisper(audio, model="base", show_dict=True)
except sr.RequestError as e:
    print(f"Error: {e}")
```

## OpenAI Whisper API

Uses OpenAI's cloud-based Whisper API. Requires an OpenAI API key.
```python
import os
import speech_recognition as sr

# Set the API key
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"

r = sr.Recognizer()
audio = sr.AudioData.from_file("audio.wav")

try:
    # Basic usage
    text = r.recognize_openai(audio)
    print(f"OpenAI Whisper API: {text}")

    # With a specific model
    text = r.recognize_openai(audio, model="whisper-1")

    # With a language hint and prompt
    text = r.recognize_openai(
        audio,
        model="whisper-1",
        language="en",
        prompt="This is a technical discussion about programming."
    )
except sr.RequestError as e:
    print(f"OpenAI API error: {e}")

# For self-hosted OpenAI-compatible endpoints (vLLM, Ollama, etc.)
os.environ["OPENAI_BASE_URL"] = "http://localhost:8000/v1"
os.environ["OPENAI_API_KEY"] = "dummy"
text = r.recognize_openai(audio)
```

## Groq Whisper API

Uses Groq's fast Whisper API for cloud-based speech recognition.

```python
import os
import speech_recognition as sr

# Set the API key
os.environ["GROQ_API_KEY"] = "your-groq-api-key"

r = sr.Recognizer()
audio = sr.AudioData.from_file("audio.wav")

try:
    # Basic usage with the default model
    text = r.recognize_groq(audio)
    print(f"Groq Whisper: {text}")

    # With a specific model
    text = r.recognize_groq(audio, model="whisper-large-v3-turbo")
    # or
    text = r.recognize_groq(audio, model="whisper-large-v3")

    # With options
    text = r.recognize_groq(
        audio,
        model="whisper-large-v3-turbo",
        language="en",
        temperature=0.0
    )
except sr.RequestError as e:
    print(f"Groq API error: {e}")
```

## Vosk (Offline)

Uses Vosk for offline speech recognition. Requires downloading a Vosk model.
```python
import speech_recognition as sr

# First, download the Vosk model using the CLI:
#   sprc download vosk

r = sr.Recognizer()
audio = sr.AudioData.from_file("audio.wav")

try:
    # Basic usage
    text = r.recognize_vosk(audio)
    print(f"Vosk thinks you said: {text}")

    # Get verbose output
    result = r.recognize_vosk(audio, verbose=True)
    print(result)  # Returns a dict with a 'text' key
except sr.UnknownValueError:
    print("Vosk could not understand the audio")
except sr.RequestError as e:
    print(f"Vosk error: {e}")
```

## CMU Sphinx (Offline)

Uses CMU PocketSphinx for offline speech recognition. Requires the pocketsphinx package.

```python
import speech_recognition as sr

r = sr.Recognizer()
audio = sr.AudioData.from_file("audio.wav")

try:
    # Basic usage
    text = r.recognize_sphinx(audio)
    print(f"Sphinx thinks you said: {text}")

    # Specify a language
    text = r.recognize_sphinx(audio, language="en-US")

    # With keyword spotting (phrase, sensitivity)
    text = r.recognize_sphinx(
        audio,
        keyword_entries=[
            ("hello", 1.0),
            ("goodbye", 0.8)
        ]
    )

    # With a custom grammar file
    text = r.recognize_sphinx(audio, grammar="commands.gram")

    # Get the full decoder object
    decoder = r.recognize_sphinx(audio, show_all=True)
except sr.UnknownValueError:
    print("Sphinx could not understand the audio")
except sr.RequestError as e:
    print(f"Sphinx error: {e}")
```

## Wit.ai Recognition

Uses Wit.ai for speech recognition. Requires a Wit.ai account and API key.

```python
import speech_recognition as sr

r = sr.Recognizer()
audio = sr.AudioData.from_file("audio.wav")

WIT_AI_KEY = "YOUR_WIT_AI_API_KEY"

try:
    text = r.recognize_wit(audio, key=WIT_AI_KEY)
    print(f"Wit.ai thinks you said: {text}")

    # Get the full response
    result = r.recognize_wit(audio, key=WIT_AI_KEY, show_all=True)
except sr.UnknownValueError:
    print("Wit.ai could not understand the audio")
except sr.RequestError as e:
    print(f"Wit.ai error: {e}")
```

## Microsoft Azure Speech

Uses Microsoft Azure Speech Services for recognition.
```python
import speech_recognition as sr

r = sr.Recognizer()
audio = sr.AudioData.from_file("audio.wav")

AZURE_KEY = "YOUR_AZURE_SPEECH_KEY"

try:
    # Basic usage
    text, confidence = r.recognize_azure(audio, key=AZURE_KEY)
    print(f"Azure: {text} (confidence: {confidence})")

    # With options
    text, confidence = r.recognize_azure(
        audio,
        key=AZURE_KEY,
        language="en-US",
        location="westus",
        profanity="masked"  # or "removed", "raw"
    )

    # Get the full response
    result = r.recognize_azure(audio, key=AZURE_KEY, show_all=True)
except sr.UnknownValueError:
    print("Azure could not understand the audio")
except sr.RequestError as e:
    print(f"Azure error: {e}")
```

## Houndify Recognition

Uses the Houndify API for speech recognition.

```python
import speech_recognition as sr

r = sr.Recognizer()
audio = sr.AudioData.from_file("audio.wav")

HOUNDIFY_CLIENT_ID = "YOUR_CLIENT_ID"
HOUNDIFY_CLIENT_KEY = "YOUR_CLIENT_KEY"

try:
    text, confidence = r.recognize_houndify(
        audio,
        client_id=HOUNDIFY_CLIENT_ID,
        client_key=HOUNDIFY_CLIENT_KEY
    )
    print(f"Houndify: {text} (confidence: {confidence})")

    # Get the full response
    result = r.recognize_houndify(
        audio,
        client_id=HOUNDIFY_CLIENT_ID,
        client_key=HOUNDIFY_CLIENT_KEY,
        show_all=True
    )
except sr.UnknownValueError:
    print("Houndify could not understand the audio")
except sr.RequestError as e:
    print(f"Houndify error: {e}")
```

## IBM Speech to Text

Uses the IBM Watson Speech to Text service.
```python
import speech_recognition as sr

r = sr.Recognizer()
audio = sr.AudioData.from_file("audio.wav")

IBM_API_KEY = "YOUR_IBM_API_KEY"

try:
    text, confidence = r.recognize_ibm(audio, key=IBM_API_KEY)
    print(f"IBM: {text} (confidence: {confidence})")

    # With a language
    text, confidence = r.recognize_ibm(
        audio,
        key=IBM_API_KEY,
        language="en-US"
    )

    # Get the full response
    result = r.recognize_ibm(audio, key=IBM_API_KEY, show_all=True)
except sr.UnknownValueError:
    print("IBM could not understand the audio")
except sr.RequestError as e:
    print(f"IBM error: {e}")
```

## AudioData Class

The `AudioData` class represents captured audio data. It provides methods for converting the audio to various formats.

```python
import speech_recognition as sr

r = sr.Recognizer()

# Capture audio from the microphone
with sr.Microphone() as source:
    audio = r.listen(source)

# Or load from a file
audio = sr.AudioData.from_file("audio.wav")

# Get an audio segment
segment = audio.get_segment(start_ms=1000, end_ms=5000)  # 1s to 5s

# Export to various formats
raw_data = audio.get_raw_data()
wav_data = audio.get_wav_data()
aiff_data = audio.get_aiff_data()
flac_data = audio.get_flac_data()

# With conversion options
wav_data = audio.get_wav_data(convert_rate=16000, convert_width=2)

# Save to files
with open("output.raw", "wb") as f:
    f.write(audio.get_raw_data())
with open("output.wav", "wb") as f:
    f.write(audio.get_wav_data())
with open("output.aiff", "wb") as f:
    f.write(audio.get_aiff_data())
with open("output.flac", "wb") as f:
    f.write(audio.get_flac_data())
```

## CLI Tool

The SpeechRecognition library includes a command-line interface for downloading models.

```bash
# Download the Vosk model
sprc download vosk

# Download a specific Vosk model
sprc download vosk --url https://alphacephei.com/vosk/models/vosk-model-en-us-0.22.zip

# Quick test from the command line
python -m speech_recognition
```

## Exception Handling

The library defines several exception types for proper error handling.
```python
import speech_recognition as sr

r = sr.Recognizer()

try:
    with sr.Microphone() as source:
        audio = r.listen(source, timeout=5)
    text = r.recognize_google(audio)
    print(text)
except sr.WaitTimeoutError:
    print("Listening timed out while waiting for phrase to start")
except sr.UnknownValueError:
    print("Speech was unintelligible")
except sr.RequestError as e:
    print(f"API request failed: {e}")
except AttributeError:
    print("PyAudio not installed (required for Microphone)")
```

## Summary

SpeechRecognition is ideal for applications requiring speech-to-text capabilities, including voice assistants, transcription tools, voice-controlled interfaces, accessibility applications, and any project that needs to convert spoken language to text. The library's strength lies in its unified API, which abstracts away the differences between speech recognition backends, allowing developers to switch between services easily or provide fallback options.

Common integration patterns include using local engines like Vosk or Whisper for offline-first applications, Google Speech Recognition for quick prototyping, and cloud services like Google Cloud, OpenAI, or Azure for production deployments requiring high accuracy. The background listening feature enables continuous speech monitoring, while audio file support makes the library suitable for batch transcription tasks.

For best results, always calibrate for ambient noise before listening, handle exceptions appropriately, and choose the recognition engine based on your specific requirements for accuracy, latency, cost, and offline capability.
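Because every `recognize_*` method takes the same `AudioData` object and signals failure by raising, the fallback pattern mentioned above is easy to sketch engine-agnostically. `transcribe_with_fallback` below is a hypothetical helper (not part of the library): it tries each engine in order and returns the first transcript, assuming each recognizer callable raises on failure (e.g. `sr.UnknownValueError` or `sr.RequestError`).

```python
def transcribe_with_fallback(audio, engines):
    """Try each (name, recognize_fn) pair in order; return (name, transcript)
    from the first engine that succeeds.

    Each recognize_fn takes an AudioData object and returns a transcript
    string, raising an exception (such as sr.UnknownValueError or
    sr.RequestError) when it cannot transcribe or reach its API.
    """
    errors = {}
    for name, recognize in engines:
        try:
            return name, recognize(audio)
        except Exception as exc:  # record the failure and move to the next engine
            errors[name] = exc
    raise RuntimeError(f"All engines failed: {errors}")
```

A typical wiring, assuming `r` is a `Recognizer` and `audio` an `AudioData` instance, would pass `[("google", r.recognize_google), ("sphinx", r.recognize_sphinx)]` so an offline engine takes over whenever the network request fails.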