### Install Python Dependencies

Source: https://github.com/fluidinference/fluidaudio/blob/main/Scripts/nemo_ami_benchmark/README.md

Sets up a Python virtual environment and installs necessary libraries including PyTorch, Torchaudio, Torchcodec, NeMo toolkit, and pyannote.metrics.

```bash
python3.10 -m venv .venv
source .venv/bin/activate
pip install torch torchaudio torchcodec
# Quote the extras specifier so shells such as zsh do not treat [asr] as a glob
pip install "nemo_toolkit[asr]" pyannote.metrics
```

--------------------------------

### Install swift-format for Swift <6

Source: https://github.com/fluidinference/fluidaudio/blob/main/CONTRIBUTING.md

Instructions for users with Swift versions older than 6 to install swift-format manually. This involves cloning the repository, building the release version, and copying the executable to your PATH.

```bash
# For Swift <6, install swift-format separately:
# git clone https://github.com/apple/swift-format
# cd swift-format && swift build -c release
# cp .build/release/swift-format /usr/local/bin/
```

--------------------------------

### Real-time Audio Capture and Diarization

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Diarization/GettingStarted.md

Set up real-time audio capture using AVAudioEngine and process it with DiarizerManager for live diarization. This example demonstrates installing an audio tap and handling diarization results asynchronously.
```swift
import AVFoundation
import FluidAudio

class RealTimeDiarizer {
    private let audioEngine = AVAudioEngine()
    private let diarizer: DiarizerManager
    private var audioStream: AudioStream

    init() async throws {
        let models = try await DiarizerModels.downloadIfNeeded()
        diarizer = DiarizerManager()
        diarizer.initialize(models: models)

        audioStream = AudioStream(
            chunkDuration: 5.0,  // 5 second chunks work well
            chunkSkip: 3.0,  // 3.0 second delay between chunks works well
            streamStartTime: 0.0,
            chunkingStrategy: .useFixedSkip  // ensure chunks are evenly spaced
        )

        audioStream.bind { chunk, _ in
            Task {
                do {
                    let result = try diarizer.performCompleteDiarization(chunk)
                    await handleResults(result)
                } catch {
                    print("Diarization error: \(error)")
                }
            }
        }
    }

    func startCapture() throws {
        let inputNode = audioEngine.inputNode
        let recordingFormat = inputNode.outputFormat(forBus: 0)

        // Install tap to capture audio
        inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { [weak self] buffer, _ in
            guard let self = self else { return }
            try? self.audioStream.write(from: buffer)
        }

        audioEngine.prepare()
        try audioEngine.start()
    }

    @MainActor
    private func handleResults(_ result: DiarizationResult) {
        for segment in result.segments {
            print("Speaker \(segment.speakerId): \(segment.startTimeSeconds)s - \(segment.endTimeSeconds)s")
        }
    }
}
```

--------------------------------

### Install Dependencies

Source: https://github.com/fluidinference/fluidaudio/blob/main/Scripts/voice_cloning/README.md

Install required libraries for spectral analysis and optional plotting.

```bash
pip install librosa numpy

# Or minimal (scipy fallback):
pip install scipy numpy

# Optional for plotting:
pip install matplotlib
```

--------------------------------

### Quick Test CLI Commands

Source: https://github.com/fluidinference/fluidaudio/blob/main/Sources/FluidAudioCLI/README.md

Basic commands for testing the CLI installation using sample audio files.
```bash
# Test with included sample files
swift run fluidaudiocli transcribe medical.wav
swift run fluidaudiocli process IS1001a.Mix-Headset.wav --threshold 0.7
```

--------------------------------

### Text-to-Speech Benchmark Setup

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Benchmarks.md

Command to run Text-to-Speech benchmarks, testing inference speed across different pipelines (PyTorch CPU, MPS, MLX, Swift Core ML).

```bash
KPipeline benchmark for voice af_heart (warm-up took 0.175s) using hexgrad/kokoro
```

--------------------------------

### LSEENDFeatureProvider Usage Example

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Diarization/LS-EEND.md

Provides an example of how to use the LSEENDFeatureProvider to process audio streams and get model inputs.

```APIDOC
```swift
// Push raw audio. eagerPreprocessing: true (default) runs STFT/log-mel/CMN immediately.
try feeder.enqueueAudio(samples, withSampleRate: 16_000)

// Or push from a file (returns sample count read).
let count = try feeder.enqueueAudioFile(at: audioURL)

// Pad the tail before the final predict pass.
try feeder.drainRightContextWithSilence()

// Pull ready chunks until none remain.
while let input = try feeder.emitNextChunk() {
    let probs = try model.predict(from: input)
    // probs has shape (chunkSize - input.warmupFrames) * metadata.maxSpeakers
}
```
```

--------------------------------

### CLI Usage

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/TTS/StyleTTS2.md

Example of how to use the StyleTTS2 CLI for text-to-speech synthesis.

```APIDOC
## CLI Usage

### Description
Use the `fluidaudiocli` tool to perform text-to-speech synthesis with StyleTTS2.

### Command
```bash
swift run fluidaudiocli tts "Hello from StyleTTS2." \
  --backend styletts2 \
  --reference path/to/speaker.wav \
  --output ~/Desktop/styletts2-demo.wav
```

### Parameters
- `--backend` (string): Specifies the TTS backend to use.
  Set to `styletts2`.
- `--reference` (path): Required. Path to the reference audio file (e.g., WAV, AIFF, CAF, m4a). The file will be resampled to 24 kHz mono.
- `--output` (path): Path to save the synthesized audio file.

### Optional Flags
- `--styletts2-alpha`: Controls the blend weight for the diffusion-sampled style versus the reference encoder. Default is `0.3`.
- `--styletts2-beta`: Controls the blend weight for the prosody. Default is `0.7`.
- `--styletts2-ipa`: Skips the lexicon and G2P pipeline, allowing direct input of an IPA string.
- `--seed`: Sets the random seed for the diffusion sampler for reproducible results. Default is `0`.
```

--------------------------------

### Voice Cloning Workflow

Source: https://github.com/fluidinference/fluidaudio/blob/main/Scripts/voice_cloning/README.md

Example workflow demonstrating voice cloning via CLI followed by spectral evaluation.

```bash
# 1. Clone a voice using FluidAudio CLI
fluidaudio tts "Hello, this is a test." --backend pocket --clone-voice speaker.wav -o output.wav

# 2. Evaluate the result
python Tools/voice_cloning/evaluate_voice.py speaker.wav output.wav --plot
```

--------------------------------

### Install React Native/Expo Wrapper for FluidAudio

Source: https://github.com/fluidinference/fluidaudio/blob/main/README.md

Install the React Native and Expo wrapper for FluidAudio using npm.

```bash
npm install @fluidinference/react-native-fluidaudio
```

--------------------------------

### Initialize LSEENDDiarizer with Async Convenience

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Diarization/LS-EEND.md

Initializes the LSEENDDiarizer asynchronously, handling model download and setup in one step. Use this for a straightforward setup.

```swift
// Async convenience: downloads (or reuses cached) model and initializes in one step.
let diarizer = try await LSEENDDiarizer(
    variant: .dihard3,
    stepSize: .step100ms,
    timelineConfig: nil  // optional DiarizerTimelineConfig
)
```

--------------------------------

### Install Rust/Tauri Wrapper for FluidAudio

Source: https://github.com/fluidinference/fluidaudio/blob/main/README.md

Add the Rust/Tauri wrapper for FluidAudio to your project using cargo.

```bash
cargo add fluidaudio-rs
```

--------------------------------

### Manual Audio Source Control

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Diarization/GettingStarted.md

This example demonstrates how to manually construct an audio source using `StreamingAudioSourceFactory` for advanced control over memory management and audio loading. This is useful for processing the same file multiple times, measuring loading time separately, or implementing custom logic.

```APIDOC
## Manual Audio Source Control

This approach allows for fine-grained control over memory management and audio loading by manually constructing the audio source using `StreamingAudioSourceFactory`.

### Use Cases
- Process the same file multiple times without reloading.
- Measure audio loading time separately from diarization time.
- Implement custom cleanup or caching logic.

### Code Example
```swift
import FluidAudio

let config = OfflineDiarizerConfig()
try await manager.prepareModels()

let factory = StreamingAudioSourceFactory()
let (source, loadDuration) = try factory.makeDiskBackedSource(
    from: URL(fileURLWithPath: "meeting.wav"),
    targetSampleRate: config.segmentation.sampleRate
)
defer { source.cleanup() }

let result = try await manager.process(
    audioSource: source,
    audioLoadingSeconds: loadDuration
)
```

**Note:** For most use cases, the simpler `manager.process(url)` API is recommended.
```

--------------------------------

### CLI Usage

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/ASR/Cohere.md

Examples of how to use the Cohere Transcribe model from the command line interface.

```APIDOC
## CLI Usage

### Single-precision (FP16 or INT8 in one dir)

```bash
swift run -c release fluidaudiocli cohere-transcribe audio.wav \
  --model-dir /path/to/cohere-fp16 \
  --language en
```

### Mixed precision (INT8 encoder + FP16 decoder)

```bash
swift run -c release fluidaudiocli cohere-transcribe audio.wav \
  --encoder-dir /path/to/q8 \
  --decoder-dir /path/to/f16 \
  --vocab-dir /path/to/f16 \
  --language en
```
```

--------------------------------

### Transcribe Audio with CLI (Bash)

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/ASR/Qwen3-ASR.md

Provides command-line examples for transcribing audio files using the fluidaudiocli tool. Supports auto-detection of language, specifying a language hint, and using a local model directory.

```bash
# Transcribe a file (auto-detect language)
swift run -c release fluidaudiocli qwen3-transcribe audio.wav

# Transcribe with language hint
swift run -c release fluidaudiocli qwen3-transcribe audio.wav --language zh

# Transcribe with local model
swift run -c release fluidaudiocli qwen3-transcribe audio.wav --model-dir /path/to/model
```

--------------------------------

### Format and Build Project

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/ModelConversion.md

Run Swift formatting tools to ensure code style consistency and then build and test the project to verify the integration.

```bash
swift format --in-place --recursive --configuration .swift-format Sources/ Tests/
swift build
swift test
```

--------------------------------

### Build Swift Project (Release)

Source: https://github.com/fluidinference/fluidaudio/blob/main/CLAUDE.md

Compile the Swift project in release mode, recommended for performance benchmarks.
```bash
swift build -c release
```

--------------------------------

### Run Various Benchmarks with FluidAudio CLI

Source: https://github.com/fluidinference/fluidaudio/blob/main/CLAUDE.md

Execute benchmarks for Sortformer, Qwen3, CTC earnings, and G2P models.

```bash
swift run fluidaudiocli sortformer-benchmark
swift run fluidaudiocli qwen3-benchmark
swift run fluidaudiocli ctc-earnings-benchmark
swift run fluidaudiocli g2p-benchmark
```

--------------------------------

### Swift API Initialization and Synthesis

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/TTS/PocketTTS.md

Demonstrates how to initialize the PocketTtsManager with a specific language and synthesize text to audio.

```APIDOC
## Swift API

### Initialization and Synthesis

This snippet shows how to create a `PocketTtsManager` instance for a specific language, initialize it, and then synthesize text into audio.

```swift
let manager = PocketTtsManager(language: .spanish)
try await manager.initialize()
let audio = try await manager.synthesize(text: "Hola mundo")
```

**Notes:**
- `PocketTtsManager.language` is immutable after instantiation. To support multiple languages, create separate manager instances for each language.
```

--------------------------------

### iOS ASR Test App Example

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/ASR/TDT-CTC-110M.md

Example Swift code for an iOS test app demonstrating ASR transcription using FluidAudio. It includes model auto-download, manager initialization, and transcription of audio samples.
```swift
import SwiftUI
import FluidAudio

struct ContentView: View {
    @State private var transcript: String = ""
    @State private var isTesting: Bool = false

    // Marked `throws` because the calls below use `try` without a do/catch.
    func runTest() async throws {
        // Auto-download models on device
        let models = try await AsrModels.downloadAndLoad(
            to: nil,  // Uses default cache
            version: .tdtCtc110m
        )

        // Initialize manager
        let manager = AsrManager()
        try await manager.loadModels(models)

        // Load test audio
        let audioSamples: [Float] = ... // Load from bundle or record

        // Transcribe
        let result = try await manager.transcribe(audioSamples)
        transcript = result.text
    }
}
```

--------------------------------

### Build Project with Swift

Source: https://github.com/fluidinference/fluidaudio/blob/main/AGENTS.md

Use 'swift build' to compile the project. For a release build, use the '-c release' flag.

```bash
swift build
```

```bash
swift build -c release
```

--------------------------------

### Initialize and Load Qwen3-ASR Models (Swift)

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/ASR/Qwen3-ASR.md

Shows the Swift API for initializing the Qwen3AsrManager and loading the necessary CoreML models. The models are automatically downloaded if they are not found locally.

```swift
import FluidAudio

// Initialize manager
let manager = Qwen3AsrManager()

// Load models (auto-downloads if needed)
let modelDir = try await Qwen3AsrModels.download()
try await manager.loadModels(from: modelDir)

// Transcribe audio samples (16kHz mono Float32)
let text = try await manager.transcribe(audioSamples: samples)
```

--------------------------------

### Get Speaker IDs

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Diarization/SpeakerManager.md

Retrieve a sorted array of all speaker IDs.
```swift
let ids = speakerManager.speakerIds
// Returns: [String] - sorted array of speaker IDs
```

--------------------------------

### Get Speaker Count

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Diarization/SpeakerManager.md

Retrieve the total number of tracked speakers.

```swift
print("Active speakers: \(speakerManager.speakerCount)")
```

--------------------------------

### SwiftUI Integration

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Diarization/GettingStarted.md

Provides an example of integrating the diarization functionality into a SwiftUI view.

```APIDOC
## SwiftUI Integration

```swift
import SwiftUI
import FluidAudio

struct DiarizationView: View {
    @StateObject private var processor = DiarizationProcessor()

    var body: some View {
        VStack {
            Text("Speakers: \(processor.speakerCount)")

            List(processor.activeSpeakers) { speaker in
                HStack {
                    Circle()
                        .fill(speaker.isSpeaking ? Color.green : Color.gray)
                        .frame(width: 10, height: 10)
                    Text(speaker.name)
                    Spacer()
                    Text("\(speaker.duration, specifier: "%.1f")s")
                }
            }

            Button(processor.isProcessing ? "Stop" : "Start") {
                processor.toggleProcessing()
            }
        }
    }
}

@MainActor
class DiarizationProcessor: ObservableObject {
    @Published var speakerCount = 0
    @Published var activeSpeakers: [SpeakerDisplay] = []
    @Published var isProcessing = false

    private var diarizer: DiarizerManager?
    func toggleProcessing() {
        if isProcessing {
            stopProcessing()
        } else {
            startProcessing()
        }
    }

    private func startProcessing() {
        Task {
            let models = try await DiarizerModels.downloadIfNeeded()
            diarizer = DiarizerManager()  // Default config
            diarizer?.initialize(models: models)
            isProcessing = true

            // Start audio capture and process chunks
            AudioCapture.start { [weak self] chunk in
                self?.processChunk(chunk)
            }
        }
    }

    private func processChunk(_ audio: [Float]) {
        Task {
            guard let diarizer = diarizer else { return }

            let result = try diarizer.performCompleteDiarization(audio)
            speakerCount = diarizer.speakerManager.speakerCount

            // Update UI with current speakers.
            // The speaker ID is named explicitly so the inner closure's $0
            // (a segment) does not shadow it.
            activeSpeakers = diarizer.speakerManager.speakerIds.compactMap { id in
                guard let speaker = diarizer.speakerManager.getSpeaker(for: id) else {
                    return nil
                }
                return SpeakerDisplay(
                    id: id,
                    name: speaker.name,
                    duration: speaker.duration,
                    isSpeaking: result.segments.contains { $0.speakerId == id }
                )
            }
        }
    }
}
```
```

--------------------------------

### Speaker Count and IDs

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Diarization/SpeakerManager.md

Operations to get the total number of speakers and their IDs.

```APIDOC
## GET /speaker/count

### Description
Get the total number of tracked speakers.

### Method
GET

### Endpoint
/speaker/count

### Response
#### Success Response (200)
- **count** (integer) - The total number of speakers.

#### Response Example
```json
{
  "count": 15
}
```
```

```APIDOC
## GET /speaker/ids

### Description
Get all speaker IDs as a sorted array.

### Method
GET

### Endpoint
/speaker/ids

### Response
#### Success Response (200)
- **ids** (array) - A sorted array of speaker IDs (strings).

#### Response Example
```json
[
  "speaker_1",
  "speaker_10",
  "speaker_2"
]
```
```

--------------------------------

### Run Qwen3-ASR AISHELL Benchmark

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Benchmarks.md

Execute the AISHELL benchmark for the Qwen3-ASR model using the command-line interface. Ensure the release build and specify the dataset.

```bash
# Run AISHELL-1 benchmark
swift run -c release fluidaudiocli qwen3-benchmark --dataset aishell
```

--------------------------------

### Get Current Speaker Names

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Diarization/SpeakerManager.md

Retrieve a sorted list of all speaker IDs.

```swift
let names = speakerManager.getCurrentSpeakerNames()
// Returns: [String] - sorted speaker IDs
```

--------------------------------

### Swift API Usage

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/ASR/Cohere.md

Example of how to integrate and use the Cohere Transcribe pipeline in Swift.

```APIDOC
## Swift API

```swift
import CoreML
import FluidAudio

let encoderDir = URL(fileURLWithPath: "/path/to/q8")
let decoderDir = URL(fileURLWithPath: "/path/to/f16")

let models = try await CoherePipeline.loadModels(
    encoderDir: encoderDir,
    decoderDir: decoderDir,
    vocabDir: decoderDir
)

let pipeline = CoherePipeline()
let result = try await pipeline.transcribe(
    audio: samples,  // 16 kHz mono Float32, up to 35 s
    models: models,
    language: .english
)
print(result.text)
```

`TranscriptionResult` also exposes `encoderSeconds`, `decoderSeconds`, and `totalSeconds` for per-stage profiling.
```

--------------------------------

### Transcribe Audio with Language Hint (Swift)

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/ASR/Qwen3-ASR.md

Demonstrates how to transcribe audio samples using the Qwen3AsrManager, with options for automatic language detection or specifying a language hint for improved accuracy.
```swift
// Auto-detect language (default)
let text = try await manager.transcribe(audioSamples: samples)

// Specify language for better accuracy
// (distinct names so the three calls can share one scope)
let chineseText = try await manager.transcribe(audioSamples: samples, language: .chinese)
let japaneseText = try await manager.transcribe(audioSamples: samples, language: .japanese)
```

--------------------------------

### ARPA Model File Format

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/CtcDecoderGuide.md

Example structure of an ARPA language model file.

```text
\data\
ngram 1=4
ngram 2=2

\1-grams:
-1.0 patient -0.5
-1.5 diabetes 0.0
-2.0 hypertension 0.0

\2-grams:
-0.3 patient diabetes
-0.5 patient hypertension

\end\
```

--------------------------------

### CLI Usage

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/ASR/Qwen3-ASR.md

Instructions for using the Qwen3-ASR model via the command-line interface (CLI), including transcription with and without language hints, and specifying a local model directory.

```APIDOC
## CLI Usage

### Description
This section provides command-line instructions for transcribing audio files using the Qwen3-ASR model.

### Basic Transcription (Auto-detect Language)
To transcribe an audio file and automatically detect the language:
```bash
swift run -c release fluidaudiocli qwen3-transcribe audio.wav
```

### Transcription with Language Hint
To improve accuracy, you can specify the language of the audio file:
```bash
# Transcribe with language hint (e.g., Chinese)
swift run -c release fluidaudiocli qwen3-transcribe audio.wav --language zh

# Transcribe with language hint (e.g., Japanese)
swift run -c release fluidaudiocli qwen3-transcribe audio.wav --language ja
```

### Using a Local Model
If you have downloaded the model files, you can specify the directory containing them:
```bash
# Transcribe with a local model directory
swift run -c release fluidaudiocli qwen3-transcribe audio.wav --model-dir /path/to/model
```
```

--------------------------------

### Get Global Speaker Statistics

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Diarization/SpeakerManager.md

Retrieve aggregate statistics across all tracked speakers.

```swift
let stats = speakerManager.getGlobalSpeakerStats()
print("Total speakers: \(stats.totalSpeakers)")
print("Total duration: \(stats.totalDuration)s")
print("Average confidence: \(stats.averageConfidence)")
print("Speakers with history: \(stats.speakersWithHistory)")
```

--------------------------------

### Download Models and Datasets

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/ASR/benchmarks100.md

Use this command to download necessary models and datasets for benchmarking. Requires an active internet connection.

```bash
./Scripts/parakeet_subset_benchmark.sh --download
```

--------------------------------

### Implement Real-time Diarization in Swift

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Diarization/SpeakerManager.md

Demonstrates initializing the diarizer and processing audio chunks to identify speakers. Requires downloading models before initialization.
```swift
class RealtimeDiarizer {
    let diarizer: DiarizerManager
    let speakerManager: SpeakerManager

    init() async throws {
        let models = try await DiarizerModels.downloadIfNeeded()
        diarizer = DiarizerManager()
        diarizer.initialize(models: models)
        speakerManager = diarizer.speakerManager
    }

    func processChunk(_ audio: [Float]) throws {
        let result = try diarizer.performCompleteDiarization(audio)

        for segment in result.segments {
            // Get speaker with updated name
            if let speaker = speakerManager.getSpeaker(for: segment.speakerId) {
                print("\(speaker.name): '\(segment.text ?? "")' at \(segment.startTimeSeconds)s")
            }
        }
    }

    func enrollSpeaker(name: String, audio: [Float]) throws {
        let result = try diarizer.performCompleteDiarization(audio)

        if let firstSegment = result.segments.first,
           let speaker = speakerManager.getSpeaker(for: firstSegment.speakerId) {
            speaker.name = name
            print("Enrolled \(name) as \(speaker.id)")
        }
    }
}
```

--------------------------------

### FluidAudio CLI Available Options

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/CtcDecoderGuide.md

List of available command-line arguments for the ctc-decode-benchmark tool.

```text
--audio                   Audio file (WAV, 16kHz recommended)
--arpa                    ARPA language model file
--reference               Reference text for WER calculation
--ctc-variant 06b|110m    CTC model variant (default: 06b)
--lm-weight               LM scaling factor (default: 0.3)
--beam-width              Beam width (default: 100)
--word-bonus              Per-word insertion bonus (default: 0.0)
--token-candidates        Top-K tokens per frame (default: 40)
```

--------------------------------

### Diarizer Protocol - Speaker Enrollment

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/API.md

Allows for the enrollment of known speakers before starting the diarization process.

```APIDOC
## Diarizer Protocol - Speaker Enrollment

### Description
Enroll known speakers to improve diarization accuracy by providing labeled audio data beforehand.

### Method
- `enrollSpeaker(withAudio:sourceSampleRate:named:...)`

### Notes
This method should be called before initiating streaming or offline processing to associate specific audio segments with speaker identities.
```

--------------------------------

### Performance Optimization

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Diarization/GettingStarted.md

Provides examples of how to optimize diarization performance by adjusting configuration parameters for lower latency.

```APIDOC
## Performance Optimization

```swift
let config = DiarizerConfig(
    clusteringThreshold: 0.7,
    minSpeechDuration: 1.0,
    minSilenceGap: 0.5
)

// Lower latency for real-time
// (named separately so both configurations can coexist in one scope)
let lowLatencyConfig = DiarizerConfig(
    clusteringThreshold: 0.7,
    minSpeechDuration: 0.5,  // Faster response
    minSilenceGap: 0.3  // Quicker speaker switches
)
```
```

--------------------------------

### Initialize and Run Diarization

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Diarization/GettingStarted.md

This Swift code demonstrates the basic workflow for speaker diarization. It includes downloading necessary models, initializing the diarizer, converting audio to the required format, and performing diarization on audio samples or slices. Ensure models are downloaded once before initialization.

```swift
import FluidAudio

// 1. Download models (one-time setup)
let models = try await DiarizerModels.downloadIfNeeded()

// 2. Initialize with default config
let diarizer = DiarizerManager()
diarizer.initialize(models: models)

// 3. Normalize any audio file to 16kHz mono Float32 using AudioConverter
let converter = AudioConverter()
let url = URL(fileURLWithPath: "path/to/audio.wav")
let audioSamples = try converter.resampleAudioFile(url)

// 4. Run diarization (accepts any RandomAccessCollection)
let result = try diarizer.performCompleteDiarization(audioSamples)

// Alternative: Use ArraySlice for zero-copy processing
let audioSlice = audioSamples[1000..<5000]  // No memory copy!
let sliceResult = try diarizer.performCompleteDiarization(audioSlice)

// 5. Get results
for segment in result.segments {
    print("Speaker \(segment.speakerId): \(segment.startTimeSeconds)s - \(segment.endTimeSeconds)s")
}
```

--------------------------------

### Run G2P Benchmark

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/ModelConversion.md

Executes the Grapheme-to-Phoneme benchmark using the release configuration.

```bash
swift run -c release fluidaudiocli g2p-benchmark
```

--------------------------------

### Evaluation Output Example

Source: https://github.com/fluidinference/fluidaudio/blob/main/Scripts/voice_cloning/README.md

Sample output generated by the evaluation script showing duration and similarity metrics.

```text
Reference: speaker.wav
Synthesized: output.wav
Reference duration: 5.23s
Synthesized duration: 2.15s
Computing spectral similarity...
Mel Similarity: 0.9234
MFCC Similarity: 0.8876
MFCC Std Similarity: 0.8543
Combined Score: 0.8951
Quality: Good
```

--------------------------------

### Diarizer Properties

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Diarization/LS-EEND.md

Access properties of the diarizer instance to get information about the current state and model configuration.

```APIDOC
## Properties

### `timeline`
- **Type:** `DiarizerTimeline`
- **Description:** Accumulated finalized results of the diarization.

### `isAvailable`
- **Type:** `Bool`
- **Description:** Indicates whether the model is loaded and ready.

### `numFramesProcessed`
- **Type:** `Int`
- **Description:** The total number of committed frames that have been processed.

### `targetSampleRate`
- **Type:** `Int?`
- **Description:** The expected input sample rate for the model (e.g., 8000 Hz). Optional.

### `modelFrameHz`
- **Type:** `Double?`
- **Description:** The output frame rate of the model, typically around 10.0 Hz. Optional.

### `numSpeakers`
- **Type:** `Int?`
- **Description:** The maximum number of speakers the model can output (`maxSpeakers`). Optional.
```

--------------------------------

### Initialize and Use Qwen3AsrManager

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/API.md

Initialize the Qwen3AsrManager with a model directory and configuration. Transcribe audio samples or from an audio file URL. Supports multi-language and experimental high-accuracy models.

```swift
let manager = try await Qwen3AsrManager(modelDir: modelsURL, configuration: .default)
let transcript = try await manager.transcribe(audioSamples)

// Or transcribe from a file URL:
// let transcript = try await manager.transcribe(audioFileURL)
```

--------------------------------

### MagpieTtsManager Warmup

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/TTS/Magpie.md

Explains how to use the `warmup()` function to mitigate cold-start performance issues by pre-compiling CoreML graphs on the ANE.

```APIDOC
## Warmup MagpieTtsManager

### Description
Prepares the Magpie TTS CoreML graphs for immediate use after system sleep or long idle periods, preventing user-visible stalls during the first `synthesize` call.

### Method
`public func warmup() async throws`

### Parameters
None.

### Throws
- `.notInitialized`: If called before `MagpieTtsManager` is initialized.

### Usage
Call `warmup()` from your application's wake-handler (e.g., `NSApplication.didBecomeActiveNotification`) to ensure the ANE graphs are ready. It is safe to call repeatedly.

### Request Example
```swift
// In your app's wake-handler:
NotificationCenter.default.addObserver(
    forName: NSApplication.didBecomeActiveNotification,
    object: nil,
    queue: nil
) { _ in
    Task { try? await manager.warmup() }  // Ensure manager is accessible
}
// Note: Wrap in Task if you don't want to block the calling context.
```

### Implementation Details
`warmup()` performs a short, throwaway synthesis to trigger ANE graph compilation and specialization. This process takes approximately 1.5–2 seconds on an M2 chip.
```

--------------------------------

### Extension Operations

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Diarization/SpeakerManager.md

Additional operations provided via extensions, such as reassigning segments and getting global statistics.

```APIDOC
## POST /speaker/reassignSegment

### Description
Move a segment embedding from one speaker to another. Useful for correcting misclassified segments in post-processing.

### Method
POST

### Endpoint
/speaker/reassignSegment

### Request Body
- **segmentId** (string) - Required - The ID of the segment to reassign.
- **fromSpeakerId** (string) - Required - The ID of the speaker the segment currently belongs to.
- **toSpeakerId** (string) - Required - The ID of the speaker the segment should be reassigned to.

### Response
#### Success Response (200)
- **success** (boolean) - True if the reassignment was successful, false otherwise.

#### Response Example
```json
{
  "success": true
}
```
```

```APIDOC
## GET /speaker/names

### Description
Get a sorted list of all speaker IDs (names).

### Method
GET

### Endpoint
/speaker/names

### Response
#### Success Response (200)
- **names** (array) - A sorted array of speaker IDs (strings).

#### Response Example
```json
[
  "Alice",
  "Bob",
  "Charlie"
]
```
```

```APIDOC
## GET /speaker/globalStats

### Description
Get aggregate statistics across all speakers.

### Method
GET

### Endpoint
/speaker/globalStats

### Response
#### Success Response (200)
- **totalSpeakers** (integer) - Number of tracked speakers.
- **totalDuration** (float) - Combined speech duration across all speakers in seconds.
- **averageConfidence** (float) - Normalized confidence score (0.0-1.0).
- **speakersWithHistory** (integer) - Number of speakers with raw embedding history.

#### Response Example
```json
{
  "totalSpeakers": 25,
  "totalDuration": 1500.75,
  "averageConfidence": 0.85,
  "speakersWithHistory": 20
}
```
```

--------------------------------

### Creating an LSEENDInput

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Diarization/LS-EEND.md

Illustrates how to initialize an LSEENDInput object, either with fresh state or by resuming from a previous state.

```APIDOC
## LSEENDInput

`MLFeatureProvider` that carries the per-chunk inputs and recurrent state for one `LSEENDModel.predict` call.

```swift
// Fresh state
let input = try LSEENDInput(from: model.metadata)

// Resume from a snapshot (named differently so both forms compile in one scope)
let resumedInput = try LSEENDInput(from: model.metadata, state: existingState)
```
```

--------------------------------

### CLI Usage for PocketTTS

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/TTS/PocketTTS.md

Provides examples of using the `fluidaudio` command-line interface to perform text-to-speech synthesis with PocketTTS.

```APIDOC
### CLI Usage

This section details how to use the `fluidaudio` command-line tool with the PocketTTS backend for text-to-speech conversion.

```bash
# Default (English)
fluidaudio tts "Hello world" --backend pocket --output en.wav

# Spanish (6L)
fluidaudio tts "Hola mundo" --backend pocket --language spanish --output es.wav

# French (24L only)
fluidaudio tts "Bonjour" --backend pocket --language french_24l --output fr.wav
```
```

--------------------------------

### Initialize and Synthesize with KokoroAneManager

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/API.md

Initialize the ANE-resident Kokoro manager and perform basic text-to-speech synthesis. The output is a WAV file.

```swift
let manager = KokoroAneManager()
try await manager.initialize()
let wav = try await manager.synthesize(text: "Hello from FluidAudio!")
try wav.write(to: URL(fileURLWithPath: "/tmp/demo.wav"))
```

--------------------------------

### Run Nemotron Speech Streaming 0.6B Benchmark with Limited Files

Source: https://github.com/fluidinference/fluidaudio/blob/main/benchmarks.md

Execute the Nemotron Speech Streaming 0.6B benchmark on a restricted number of files for faster execution.

```bash
.build/release/fluidaudiocli nemotron-benchmark --subset test-clean --max-files 100
```

--------------------------------

### Benchmark Qwen3 ASR

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/CLI.md

Run the benchmark for the Qwen3 ASR model.

```bash
swift run fluidaudiocli qwen3-benchmark
```

--------------------------------

### Real-time Diarization with Speaker Names

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Diarization/SpeakerManager.md

Example usage of the RealtimeDiarizer class for processing audio chunks and enrolling speakers.

```APIDOC
## Real-time Diarization with Speaker Names

This section demonstrates how to use the `RealtimeDiarizer` class to perform real-time speaker diarization and enroll new speakers.

### Class: `RealtimeDiarizer`

#### Initialization
```swift
init() async throws
```
Initializes the diarizer by downloading necessary models and setting up the `DiarizerManager` and `SpeakerManager`.

#### Processing Audio Chunks
```swift
func processChunk(_ audio: [Float]) throws
```
Processes a chunk of audio data to perform diarization. It iterates through the resulting segments, retrieves speaker names using the `SpeakerManager`, and prints the speaker, their transcribed text, and start time.

#### Enrolling Speakers
```swift
func enrollSpeaker(name: String, audio: [Float]) throws
```
Enrolls a new speaker by processing a provided audio sample.
It assigns the given name to the detected speaker and prints a confirmation message.
```

--------------------------------

### Swift API - Voice Cloning and File Output

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/TTS/PocketTTS.md

Illustrates advanced Swift API usage for cloning a voice from an audio source and synthesizing directly to a file.

```APIDOC
## Usage

### Advanced Swift API Features

This section covers using the PocketTTS Swift API for voice cloning and synthesizing audio directly to a specified file URL.

```swift
import FluidAudio

let manager = PocketTtsManager()
try await manager.initialize()

// Using built-in voices
let audioData = try await manager.synthesize(text: "Hello, world!")

// Using cloned voice (distinct variable name so both calls compile in one scope)
let voiceData = try await manager.cloneVoice(from: speakerAudioURL)
let clonedAudioData = try await manager.synthesize(text: "Hello, world!", voiceData: voiceData)

// Synthesize directly to a file
try await manager.synthesizeToFile(
    text: "Hello, world!",
    outputURL: URL(fileURLWithPath: "/tmp/output.wav")
)
```
```

--------------------------------

### CLI Decoding Benchmark Output

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/CtcDecoderGuide.md

Example output showing WER improvement when using an ARPA language model.

```text
Greedy: "patient has die beetus" (15.2% WER)
Beam (no LM): "patient has die beetus" (14.1% WER)
Beam + LM: "patient has diabetes" ✅ (9.4% WER)

🎯 LM Improvement: 38% reduction in WER
```

--------------------------------

### CLI Usage for PocketTTS

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/TTS/PocketTTS.md

Examples of using the FluidAudio CLI to synthesize speech with PocketTTS, specifying different languages and output files.

```bash
# Default (English)
fluidaudio tts "Hello world" --backend pocket --output en.wav

# Spanish (6L)
fluidaudio tts "Hola mundo" --backend pocket --language spanish --output es.wav

# French (24L only)
fluidaudio tts "Bonjour" --backend pocket --language french_24l --output fr.wav
```

--------------------------------

### Complete Workflow CLI Commands

Source: https://github.com/fluidinference/fluidaudio/blob/main/Sources/FluidAudioCLI/README.md

A sequence of commands demonstrating the full workflow from dataset download to multi-stream transcription.

```bash
# 1. Download dataset
swift run fluidaudiocli download --dataset ami-sdm

# 2. Run diarization benchmark
swift run fluidaudiocli diarization-benchmark --dataset ami-sdm --output results.json

# 3. Process individual file
swift run fluidaudiocli process audio.wav --threshold 0.7

# 4. Transcribe audio
swift run fluidaudiocli transcribe audio.wav --config low-latency

# 5. Multi-stream transcription
swift run fluidaudiocli multi-stream mic.wav system.wav
```

--------------------------------

### Run LS-EEND Benchmark (AMI variant, 500ms step)

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Benchmarks.md

Executes the LS-EEND benchmark using the AMI variant with a 500ms step size. This configuration commits approximately 500ms of audio per CoreML call. Use `--auto-download` to fetch models.

```bash
swift run fluidaudiocli lseend-benchmark --variant ami --step-size 500ms --auto-download
```

--------------------------------

### Train and Use Custom ARPA Language Model

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/CtcDecoderGuide.md

Commands to install KenLM, train a bigram model from a corpus, and use it with the FluidAudio CLI.

```bash
# Install KenLM
brew install kenlm

# Collect domain text (medical, legal, financial, etc.)
cat medical_transcripts/*.txt > corpus.txt

# Train bigram language model
lmplz -o 2 < corpus.txt > medical.arpa

# Use with FluidAudio
swift run fluidaudiocli ctc-decode-benchmark \
  --audio speech.wav \
  --arpa medical.arpa
```

--------------------------------

### Perform Inference with LSEENDModel

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Diarization/LS-EEND.md

Get speaker probability predictions from the LSEENDModel for a given input. The output is a flat array of probabilities with sigmoid applied.

```swift
let probs = try model.predict(from: input)
// probs is flat [Float], row-major, shape (chunkSize - input.warmupFrames) * metadata.maxSpeakers,
// with sigmoid already applied.
```

--------------------------------

### Download VAD Datasets with CLI

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/VAD/GettingStarted.md

Fetch datasets for benchmarking VAD using the `download` CLI command with the `--dataset vad` argument.

```bash
swift run fluidaudiocli download --dataset vad
```

--------------------------------

### VadManager - Streaming VAD

Source: https://context7.com/fluidinference/fluidaudio/llms.txt

Process audio chunks in real-time for streaming voice activity detection, emitting events for speech start and end transitions.

```APIDOC
## VadManager.processStreamingChunk

### Description
Streaming VAD with speech start/end events. Processes audio chunk-by-chunk in real time, emitting `VadStreamEvent` values for `speechStart` and `speechEnd` transitions using Silero-style hysteresis.

### Method
`processStreamingChunk(chunk: [Float], state: VadStreamState, config: VadStreamConfig, returnSeconds: Bool, timeResolution: Int)`

### Parameters
#### Path Parameters
- None

#### Query Parameters
- None

#### Request Body
- `chunk` ([Float]): A chunk of audio samples (16 kHz, 4096 samples).
- `state` (VadStreamState): The current state of the streaming VAD.
- `config` (VadStreamConfig): Configuration for streaming VAD. Defaults to `.default`.
- `returnSeconds` (Bool): If true, time values are returned in seconds.
- `timeResolution` (Int): The time resolution for events.

### Request Example
```swift
import FluidAudio

Task {
    let manager = try await VadManager()
    var state = await manager.makeStreamState()

    for chunk in microphoneChunks { // [Float] at 16 kHz, 4096 samples each
        let result = try await manager.processStreamingChunk(
            chunk,
            state: state,
            config: .default,
            returnSeconds: true,
            timeResolution: 2
        )
        state = result.state
        print(String(format: "Probability: %.3f", result.probability))

        if let event = result.event {
            switch event.kind {
            case .speechStart:
                print("Speech started at \(event.time ?? 0)s")
            case .speechEnd:
                print("Speech ended at \(event.time ?? 0)s")
            }
        }
    }
}
```

### Response
#### Success Response (200)
- `result` (VadStreamResult): An object containing:
  - `probability` (Float): The speech probability for the current chunk.
  - `event` (VadStreamEvent?): A `VadStreamEvent` if a speech start or end transition occurred.
  - `state` (VadStreamState): The updated state for the next chunk.

#### Response Example
```
// Output example:
// Probability: 0.012
// Probability: 0.965
// Speech started at 0.26s
// Probability: 0.978
// Probability: 0.023
// Speech ended at 1.56s
```
```

--------------------------------

### Run FLEURS Benchmark for All Languages

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Benchmarks.md

Executes the FLEURS benchmark for all supported languages using the Qwen3 model. Ensure the Swift toolchain is installed and the project is set up.

```bash
swift run -c release fluidaudiocli qwen3-benchmark --dataset fleurs --languages all
```

--------------------------------

### CTC Earnings Benchmark Output

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Benchmarks.md

Example output from the CTC earnings benchmark, showing Word Error Rate (WER) and dictionary metrics for individual files.

```text
Data directory: /Users//Library/Application Support/FluidAudio/earnings22-kws/test-dataset
Output file: ctc_earnings_benchmark.json
TDT version: v2
CTC model: /Users//Library/Application Support/FluidAudio/Models/parakeet-ctc-110m-coreml
Loading TDT models (v2) for transcription...
TDT models loaded successfully
Loading CTC models from: /Users//Library/Application Support/FluidAudio/Models/parakeet-ctc-110m-coreml
Loaded CTC vocabulary with 1024 tokens, variant: Parakeet CTC 110M (hybrid)
Created CTC spotter with blankId=1024
Processing 773 test files...
[ 1/772] 4329526_chunk0 WER: 10.3% Dict: 1/1
[ 2/772] 4329526_chunk109 WER: 12.5% Dict: 2/2
[ 3/772] 4329526_chunk118 WER: 3.1% Dict: 3/3
[ 4/772] 4329526_chunk132 WER: 8.1% Dict: 1/1
[ 5/772] 4329526_chunk135 WER: 25.7% Dict: 1/1
[ 6/772] 4329526_chunk16 WER: 8.6% Dict: 1/1
...
[767/772] 4485206_chunk_86 WER: 5.0% Dict: 2/2
[768/772] 4485206_chunk_88 WER: 8.3% Dict: 2/2
[769/772] 4485206_chunk_92 WER: 14.7% Dict: 4/4
[770/772] 4485206_chunk_97 WER: 30.5% Dict: 1/1
[771/772] 4485206_chunk_98 WER: 18.6% Dict: 4/4
[772/772] 4485206_chunk_99 WER: 22.0% Dict: 1/1
```
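
The per-file entries in the output above follow a regular `WER: <n>% Dict: <hits>/<total>` pattern, so a quick summary can be computed with standard tools. The following is a minimal sketch and not part of the FluidAudio CLI: `summarize_wer` is a hypothetical helper name, and it assumes the log has been saved with one entry per line. It reports the unweighted mean of the per-file WER values, which can differ from a corpus-level WER weighted by reference word count.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not a FluidAudio command): summarize per-file entries
# of the form "[ 1/772] 4329526_chunk0 WER: 10.3% Dict: 1/1".
summarize_wer() {
  awk '
    /WER:/ {
      # Locate values by label, since the "[ 1/772]" prefix may split into one or two fields.
      for (i = 1; i < NF; i++) {
        if ($i == "WER:")  { w = $(i + 1); sub(/%/, "", w); wer += w; n++ }
        if ($i == "Dict:") { split($(i + 1), d, "/"); hit += d[1]; total += d[2] }
      }
    }
    END {
      if (n > 0) printf "files=%d mean_wer=%.2f%% dict=%d/%d\n", n, wer / n, hit, total
    }
  ' "$1"
}

# Three entries copied from the sample output above, written to a temporary file.
log="$(mktemp)"
cat > "$log" <<'EOF'
[ 1/772] 4329526_chunk0 WER: 10.3% Dict: 1/1
[ 2/772] 4329526_chunk109 WER: 12.5% Dict: 2/2
[ 3/772] 4329526_chunk118 WER: 3.1% Dict: 3/3
EOF

summarize_wer "$log"   # prints: files=3 mean_wer=8.63% dict=6/6
```

Against a saved benchmark log, the call would be `summarize_wer benchmark.log`, where the file name is illustrative.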