Cartesia Python
https://github.com/cartesia-ai/cartesia-python
# Cartesia Python SDK

The Cartesia Python library provides convenient access to the Cartesia REST API for text-to-speech (TTS), speech-to-text (STT), voice cloning, and voice changing capabilities. The library offers both synchronous and asynchronous clients powered by httpx, with WebSocket support for real-time streaming applications. It includes comprehensive type definitions for all request parameters and response fields, making it easy to integrate AI-powered voice generation into Python 3.9+ applications.

The SDK supports multiple audio output formats (WAV, raw PCM, MP3) with configurable sample rates and encodings. Key features include voice cloning from audio clips, voice localization to different languages, pronunciation dictionaries for custom word pronunciations, and fine-tuning capabilities for creating custom voice models. The library handles authentication, retries, timeouts, and pagination automatically, providing a seamless developer experience.

## Installation

Install the Cartesia SDK from PyPI.

```bash
# Basic installation
pip install cartesia

# With WebSocket support for real-time streaming
pip install 'cartesia[websockets]'

# With aiohttp for improved async performance
pip install 'cartesia[aiohttp]'
```

## Client Initialization

Initialize the synchronous or async Cartesia client with your API key.

```python
import os
from cartesia import Cartesia, AsyncCartesia

# Synchronous client
client = Cartesia(api_key=os.getenv("CARTESIA_API_KEY"))

# Async client
async_client = AsyncCartesia(api_key=os.getenv("CARTESIA_API_KEY"))

# With custom configuration
client = Cartesia(
    api_key=os.getenv("CARTESIA_API_KEY"),
    timeout=30.0,  # Request timeout in seconds
    max_retries=3,  # Number of retry attempts
    base_url="https://api.cartesia.ai",  # Custom base URL
)
```

## Text-to-Speech Generation (tts.generate)

Generate complete audio files from text using the TTS API. Returns a binary response that can be saved directly to a file.
```python
import os
from cartesia import Cartesia

client = Cartesia(api_key=os.getenv("CARTESIA_API_KEY"))

# Generate speech and save to WAV file
response = client.tts.generate(
    model_id="sonic-3",
    transcript="Hello, world! This is a demonstration of text-to-speech synthesis.",
    voice={"mode": "id", "id": "6ccbfb76-1fc6-48f7-b71d-91ac6298247b"},
    output_format={
        "container": "wav",
        "encoding": "pcm_f32le",
        "sample_rate": 44100,
    },
    language="en",  # Optional language code
    generation_config={  # Sonic-3 specific options
        "speed": 1.0,
        "emotion": "neutral",
    },
)

# Save audio to file
response.write_to_file("output.wav")
print("Audio saved to output.wav")

# Or iterate over bytes for custom processing
# (process_audio_chunk is a placeholder for your own handler)
for chunk in response.iter_bytes():
    process_audio_chunk(chunk)
```

## Text-to-Speech SSE Streaming (tts.generate_sse)

Stream audio generation via Server-Sent Events for lower latency and real-time playback.

```python
import os
from cartesia import Cartesia

client = Cartesia(api_key=os.getenv("CARTESIA_API_KEY"))

# Stream TTS with word timestamps
stream = client.tts.generate_sse(
    model_id="sonic-3",
    transcript="The quick brown fox jumps over the lazy dog.",
    voice={"mode": "id", "id": "6ccbfb76-1fc6-48f7-b71d-91ac6298247b"},
    output_format={
        "container": "raw",
        "encoding": "pcm_f32le",
        "sample_rate": 44100,
    },
    add_timestamps=True,  # Enable word-level timestamps
)

audio_chunks = []
for event in stream:
    if event.type == "chunk" and event.audio:
        audio_chunks.append(event.audio)
    elif event.type == "timestamps":
        wt = event.word_timestamps
        print(f"Words: {wt.words}")
        print(f"Start times: {wt.start}")
        print(f"End times: {wt.end}")
    elif event.type == "done":
        break
    elif event.type == "error":
        raise RuntimeError(f"TTS Error: {event.error}")

# Save raw audio
with open("output.pcm", "wb") as f:
    f.write(b"".join(audio_chunks))
print("Play with: ffplay -f f32le -ar 44100 output.pcm")
```

## WebSocket TTS Streaming (tts.websocket_connect)

Use WebSocket connections for real-time bidirectional streaming, ideal for voice agents and LLM integrations.

```python
import os
from cartesia import Cartesia

client = Cartesia(api_key=os.getenv("CARTESIA_API_KEY"))

# Basic WebSocket usage
with client.tts.websocket_connect() as connection:
    connection.send({
        "model_id": "sonic-3",
        "transcript": "Hello from WebSocket streaming!",
        "voice": {"mode": "id", "id": "6ccbfb76-1fc6-48f7-b71d-91ac6298247b"},
        "output_format": {
            "container": "raw",
            "encoding": "pcm_f32le",
            "sample_rate": 44100,
        },
    })

    with open("ws_output.pcm", "wb") as f:
        for response in connection:
            if response.type == "chunk" and response.audio:
                f.write(response.audio)
            elif response.done:
                break

# Streaming with continuations (for LLM output)
with client.tts.websocket_connect() as connection:
    ctx = connection.context(
        model_id="sonic-3",
        voice={"mode": "id", "id": "6ccbfb76-1fc6-48f7-b71d-91ac6298247b"},
        output_format={
            "container": "raw",
            "encoding": "pcm_f32le",
            "sample_rate": 44100,
        },
    )

    # Stream text parts (simulating LLM output)
    for part in ["The road ", "goes ever ", "on and ", "on."]:
        ctx.push(part)
    ctx.no_more_inputs()  # Signal end of input

    # Receive audio chunks
    with open("continuation_output.pcm", "wb") as f:
        for response in ctx.receive():
            if response.type == "chunk" and response.audio:
                f.write(response.audio)
```

## Async TTS Operations

Use the async client for concurrent operations and integration with async frameworks.
```python
import asyncio
import os
from cartesia import AsyncCartesia


async def generate_speech():
    client = AsyncCartesia(api_key=os.getenv("CARTESIA_API_KEY"))

    # Generate audio asynchronously
    response = await client.tts.generate(
        model_id="sonic-3",
        transcript="Async text-to-speech generation.",
        voice={"mode": "id", "id": "6ccbfb76-1fc6-48f7-b71d-91ac6298247b"},
        output_format={
            "container": "wav",
            "encoding": "pcm_f32le",
            "sample_rate": 44100,
        },
    )
    await response.write_to_file("async_output.wav")


async def concurrent_websocket():
    client = AsyncCartesia(api_key=os.getenv("CARTESIA_API_KEY"))

    async with client.tts.websocket_connect() as connection:
        # Create multiple concurrent contexts
        ctx1 = connection.context(
            model_id="sonic-3",
            voice={"mode": "id", "id": "6ccbfb76-1fc6-48f7-b71d-91ac6298247b"},
            output_format={"container": "raw", "encoding": "pcm_f32le", "sample_rate": 44100},
        )
        ctx2 = connection.context(
            model_id="sonic-3",
            voice={"mode": "id", "id": "6ccbfb76-1fc6-48f7-b71d-91ac6298247b"},
            output_format={"container": "raw", "encoding": "pcm_f32le", "sample_rate": 44100},
        )

        # Send to both contexts
        await ctx1.push("First context speaking.")
        await ctx1.no_more_inputs()
        await ctx2.push("Second context speaking.")
        await ctx2.no_more_inputs()

        # Collect audio from both contexts, one after the other
        async for response in ctx1.receive():
            if response.type == "chunk" and response.audio:
                # Process ctx1 audio
                pass
        async for response in ctx2.receive():
            if response.type == "chunk" and response.audio:
                # Process ctx2 audio
                pass


asyncio.run(generate_speech())
# concurrent_websocket() is run the same way: asyncio.run(concurrent_websocket())
```

## Speech-to-Text Transcription (stt.transcribe)

Transcribe audio files to text with optional word-level timestamps.
```python
import os
from cartesia import Cartesia

client = Cartesia(api_key=os.getenv("CARTESIA_API_KEY"))

# Transcribe audio file with word timestamps
with open("recording.wav", "rb") as audio_file:
    response = client.stt.transcribe(
        file=audio_file,
        model="ink-whisper",
        language="en",
        timestamp_granularities=["word"],  # Get word-level timestamps
    )

print(f"Transcription: {response.text}")
print(f"Duration: {response.duration} seconds")

# Access word timestamps
if response.words:
    for word in response.words:
        print(f"  '{word.word}': {word.start:.2f}s - {word.end:.2f}s")

# Transcribe with specific encoding
# ("recording.pcm" is a placeholder path to raw 16 kHz PCM audio)
with open("recording.pcm", "rb") as raw_file:
    audio_bytes = raw_file.read()

response = client.stt.transcribe(
    file=audio_bytes,
    model="ink-whisper",
    encoding="pcm_s16le",
    sample_rate=16000,
)
```

## Voice Management (voices)

List, get, clone, update, localize, and delete voices.

```python
import os
from cartesia import Cartesia

client = Cartesia(api_key=os.getenv("CARTESIA_API_KEY"))

# List all voices with pagination
all_voices = []
for voice in client.voices.list(limit=50):
    all_voices.append(voice)
    print(f"Voice: {voice.name} (ID: {voice.id})")

# Get a specific voice
voice = client.voices.get("voice-id-here")
print(f"Name: {voice.name}, Language: {voice.language}")

# Clone a voice from audio clip (5 seconds recommended)
with open("sample_voice.wav", "rb") as clip:
    cloned_voice = client.voices.clone(
        clip=clip,
        name="My Custom Voice",
        description="Cloned from sample recording",
        language="en",
    )
print(f"Cloned voice ID: {cloned_voice.id}")

# Update voice metadata
updated = client.voices.update(
    "voice-id-here",
    name="Updated Voice Name",
    description="New description",
    gender="female",  # Optional: male, female, or None
)

# Localize a voice to another language
localized = client.voices.localize(
    voice_id="original-voice-id",
    name="Spanish Voice",
    description="Localized to Spanish",
    language="es",
    original_speaker_gender="female",
    dialect="mx",  # Mexican Spanish
)

# Delete a voice
client.voices.delete("voice-id-to-delete")
```

## Voice Changer (voice_changer)

Change the voice of existing audio while preserving intonation.

```python
import os
from cartesia import Cartesia

client = Cartesia(api_key=os.getenv("CARTESIA_API_KEY"))

# Change voice of an audio file
with open("input_speech.wav", "rb") as audio_clip:
    response = client.voice_changer.change_voice_bytes(
        clip=audio_clip,
        voice_id="target-voice-id",
        output_format_container="wav",
        output_format_encoding="pcm_f32le",
        output_format_sample_rate=44100,
    )

# Save the voice-changed audio
response.write_to_file("changed_voice.wav")
print("Voice changed audio saved to changed_voice.wav")
```

## Audio Infill (tts.infill)

Generate audio that connects two existing audio segments with natural transitions.

```python
import os
from pathlib import Path
from cartesia import Cartesia

client = Cartesia(api_key=os.getenv("CARTESIA_API_KEY"))

# Create infill audio between two clips
response = client.tts.infill(
    model_id="sonic-3",
    language="en",
    transcript="inserted text to speak",
    left_audio=Path("left_segment.wav"),
    right_audio=Path("right_segment.wav"),
    voice_id="6ccbfb76-1fc6-48f7-b71d-91ac6298247b",
    output_format={
        "container": "wav",
        "encoding": "pcm_f32le",
        "sample_rate": 44100,
    },
)

response.write_to_file("infilled_audio.wav")
print("Infill audio saved to infilled_audio.wav")
```

## Pronunciation Dictionaries (pronunciation_dicts)

Create and manage custom pronunciation dictionaries for domain-specific terms.
```python
import os
from cartesia import Cartesia

client = Cartesia(api_key=os.getenv("CARTESIA_API_KEY"))

# Create a pronunciation dictionary
pdict = client.pronunciation_dicts.create(
    name="Technical Terms",
    items=[
        {"text": "API", "pronunciation": "A P I"},
        {"text": "SDK", "pronunciation": "S D K"},
        {"text": "Cartesia", "pronunciation": "kar-TEE-zhuh"},
    ],
)
print(f"Dictionary ID: {pdict.id}")

# Use dictionary in TTS generation
response = client.tts.generate(
    model_id="sonic-3",
    transcript="The Cartesia SDK provides a powerful API.",
    voice={"mode": "id", "id": "voice-id"},
    output_format={"container": "wav", "encoding": "pcm_f32le", "sample_rate": 44100},
    pronunciation_dict_id=pdict.id,
)

# Update dictionary
updated = client.pronunciation_dicts.update(
    pdict.id,
    items=[
        {"text": "API", "pronunciation": "A P I"},
        {"text": "TTS", "pronunciation": "text to speech"},
    ],
)

# List all dictionaries
for d in client.pronunciation_dicts.list():
    print(f"Dictionary: {d.name} (ID: {d.id})")

# Delete dictionary
client.pronunciation_dicts.delete(pdict.id)
```

## Datasets and Fine-Tuning

Create datasets and fine-tune custom voice models.
```python
import os
from pathlib import Path
from cartesia import Cartesia

client = Cartesia(api_key=os.getenv("CARTESIA_API_KEY"))

# Create a dataset
dataset = client.datasets.create(
    name="Custom Voice Dataset",
    description="Training data for custom voice model",
)
print(f"Dataset ID: {dataset.id}")

# Upload files to dataset
client.datasets.files.upload(
    id=dataset.id,
    file=Path("/path/to/audio_sample.wav"),
)

# List files in dataset
for file in client.datasets.files.list(dataset.id):
    print(f"File: {file.name}")

# Create a fine-tune job
fine_tune = client.fine_tunes.create(
    name="My Custom Voice Model",
    description="Fine-tuned voice model",
    dataset=dataset.id,
    model_id="sonic-3",
    language="en",
)
print(f"Fine-tune ID: {fine_tune.id}, Status: {fine_tune.status}")

# Check fine-tune status
status = client.fine_tunes.retrieve(fine_tune.id)
print(f"Status: {status.status}")

# List voices from completed fine-tune
for voice in client.fine_tunes.list_voices(fine_tune.id):
    print(f"Fine-tuned voice: {voice.name} (ID: {voice.id})")
```

## Error Handling

Handle API errors with specific exception types.
```python
import os
import cartesia
from cartesia import Cartesia

client = Cartesia(api_key=os.getenv("CARTESIA_API_KEY"))

try:
    response = client.tts.generate(
        model_id="sonic-3",
        transcript="Hello, world!",
        voice={"mode": "id", "id": "invalid-voice-id"},
        output_format={"container": "wav", "encoding": "pcm_f32le", "sample_rate": 44100},
    )
except cartesia.BadRequestError as e:
    print(f"Bad request (400): {e}")
except cartesia.AuthenticationError as e:
    print(f"Authentication failed (401): {e}")
except cartesia.NotFoundError as e:
    print(f"Resource not found (404): {e}")
except cartesia.RateLimitError as e:
    print(f"Rate limited (429): {e}")
    # Implement backoff/retry logic
except cartesia.APIConnectionError as e:
    print(f"Connection error: {e.__cause__}")
except cartesia.APIStatusError as e:
    print(f"API error {e.status_code}: {e.response}")
```

## Summary

The Cartesia Python SDK is ideal for building voice-enabled applications including voice agents, audiobook generation, accessibility tools, real-time transcription systems, and content creation platforms. Its WebSocket support with context-based streaming makes it particularly well-suited for latency-sensitive applications like conversational AI where text is generated progressively by an LLM. The SDK integrates seamlessly with existing Python async frameworks and provides comprehensive type hints for IDE autocompletion.

Key integration patterns include: using `tts.generate()` for batch audio generation, `tts.generate_sse()` for streaming with timestamps, `tts.websocket_connect()` with contexts for real-time LLM-to-voice pipelines, and `stt.transcribe()` for audio-to-text conversion. The voice management APIs enable dynamic voice selection, cloning, and localization, while pronunciation dictionaries ensure accurate rendering of domain-specific terminology.
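The error-handling example leaves "Implement backoff/retry logic" as a comment. One way to fill that gap at the application level is a small exponential-backoff wrapper, sketched below. The helper `call_with_backoff` and its parameters are hypothetical names introduced for illustration and are not part of the Cartesia SDK; the block uses only the Python standard library, so it applies to any callable.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on the listed exception types with capped exponential backoff.

    Hypothetical helper (not an SDK function). `sleep` is injectable so the
    waiting behaviour can be replaced in tests.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error to the caller
            # Delay ceiling doubles each attempt (0.5s, 1s, 2s, ...) up to max_delay;
            # a random wait below that ceiling ("full jitter") spreads out retries.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

With the client from the error-handling example, this might be used as `call_with_backoff(lambda: client.tts.generate(...), retry_on=(cartesia.RateLimitError,))`. The client's own `max_retries` setting still applies to each individual call, so a wrapper like this only adds a second, slower layer on top of it.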