### Install Python Dependencies

Source: https://github.com/fluidinference/fluidaudio/blob/main/Scripts/nemo_ami_benchmark/README.md

Sets up a Python virtual environment and installs necessary libraries including PyTorch, Torchaudio, Torchcodec, NeMo toolkit, and pyannote.metrics.

```bash
python3.10 -m venv .venv
source .venv/bin/activate

pip install torch torchaudio torchcodec
# Quote the extras spec so shells such as zsh do not treat [asr] as a glob pattern
pip install "nemo_toolkit[asr]" pyannote.metrics
```

--------------------------------

### Install swift-format for Swift <6

Source: https://github.com/fluidinference/fluidaudio/blob/main/CONTRIBUTING.md

Instructions for users with Swift versions older than 6 to install swift-format manually. This involves cloning the repository, building the release version, and copying the executable to your PATH.

```bash
# For Swift <6, install swift-format separately:
git clone https://github.com/apple/swift-format
cd swift-format && swift build -c release
cp .build/release/swift-format /usr/local/bin/
```

--------------------------------

### Real-time Audio Capture and Diarization

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Diarization/GettingStarted.md

Set up real-time audio capture using AVAudioEngine and process it with DiarizerManager for live diarization. This example demonstrates installing an audio tap and handling diarization results asynchronously.

```swift
import AVFoundation
import FluidAudio

class RealTimeDiarizer {
    private let audioEngine = AVAudioEngine()
    private let diarizer: DiarizerManager
    private var audioStream: AudioStream
    
    init() async throws {
        let models = try await DiarizerModels.downloadIfNeeded()
        diarizer = DiarizerManager()
        diarizer.initialize(models: models)
        audioStream = AudioStream(
            chunkDuration: 5.0, // 5 second chunks work well
            chunkSkip: 3.0, // 3.0 second delay between chunks works well
            streamStartTime: 0.0,
            chunkingStrategy: .useFixedSkip // ensure chunks are evenly spaced
        )
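        // Capture self weakly and reference members explicitly inside the escaping closures
        audioStream.bind { [weak self] chunk, _ in
            guard let self = self else { return }
            Task {
                do {
                    let result = try self.diarizer.performCompleteDiarization(chunk)
                    await self.handleResults(result)
                } catch {
                    print("Diarization error: \(error)")
                }
            }
        }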
    }

    func startCapture() throws {
        let inputNode = audioEngine.inputNode
        let recordingFormat = inputNode.outputFormat(forBus: 0)

        // Install tap to capture audio
        inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { [weak self] buffer, _ in
            guard let self = self else { return }
            try? self.audioStream.write(from: buffer)
        }

        audioEngine.prepare()
        try audioEngine.start()
    }

    @MainActor
    private func handleResults(_ result: DiarizationResult) {
        for segment in result.segments {
            print("Speaker \(segment.speakerId): \(segment.startTimeSeconds)s - \(segment.endTimeSeconds)s")
        }
    }
}
```
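
The `chunkDuration: 5.0` and `chunkSkip: 3.0` values above imply that consecutive chunks share 2 seconds of audio. The following is a minimal sketch of that arithmetic, assuming a fixed-skip strategy starts each chunk `chunkSkip` seconds after the previous one. `ChunkWindow` and `fixedSkipWindows` are hypothetical helpers for illustration and are not part of FluidAudio.

```swift
import Foundation

/// One analysis window, in seconds from the start of the stream (hypothetical type).
struct ChunkWindow: Equatable {
    let start: Double
    let end: Double
}

/// Lists the full-length windows a fixed-skip chunker would cover in `totalDuration`
/// seconds of audio. A trailing partial window is ignored in this sketch.
func fixedSkipWindows(totalDuration: Double, chunkDuration: Double, chunkSkip: Double) -> [ChunkWindow] {
    precondition(chunkDuration > 0 && chunkSkip > 0, "durations must be positive")
    var windows: [ChunkWindow] = []
    var start = 0.0
    while start + chunkDuration <= totalDuration {
        windows.append(ChunkWindow(start: start, end: start + chunkDuration))
        start += chunkSkip
    }
    return windows
}

let windows = fixedSkipWindows(totalDuration: 14.0, chunkDuration: 5.0, chunkSkip: 3.0)
for window in windows {
    print("\(window.start)s - \(window.end)s")
}

// Seconds shared by two consecutive windows.
let overlapSeconds = 5.0 - 3.0
print("Overlap: \(overlapSeconds)s")
```

A smaller skip increases overlap, which gives the diarizer more shared context between chunks at the cost of processing the same audio more than once.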

--------------------------------

### Install Dependencies

Source: https://github.com/fluidinference/fluidaudio/blob/main/Scripts/voice_cloning/README.md

Install required libraries for spectral analysis and optional plotting.

```bash
pip install librosa numpy
# Or minimal (scipy fallback):
pip install scipy numpy

# Optional for plotting:
pip install matplotlib
```

--------------------------------

### Quick Test CLI Commands

Source: https://github.com/fluidinference/fluidaudio/blob/main/Sources/FluidAudioCLI/README.md

Basic commands for testing the CLI installation using sample audio files.

```bash
# Test with included sample files
swift run fluidaudiocli transcribe medical.wav
swift run fluidaudiocli process IS1001a.Mix-Headset.wav --threshold 0.7
```

--------------------------------

### Text-to-Speech Benchmark Setup

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Benchmarks.md

Sample header line from a Text-to-Speech benchmark run, which compares inference speed across pipelines (PyTorch CPU, MPS, MLX, Swift Core ML). The line below is log output, not a command.

```text
KPipeline benchmark for voice af_heart (warm-up took 0.175s) using hexgrad/kokoro
```

--------------------------------

### LSEENDFeatureProvider Usage Example

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Diarization/LS-EEND.md

Provides an example of how to use the LSEENDFeatureProvider to process audio streams and get model inputs.

```APIDOC
```swift
// Push raw audio. eagerPreprocessing: true (default) runs STFT/log-mel/CMN immediately.
try feeder.enqueueAudio(samples, withSampleRate: 16_000)

// Or push from a file (returns sample count read).
let count = try feeder.enqueueAudioFile(at: audioURL)

// Pad the tail before the final predict pass.
try feeder.drainRightContextWithSilence()

// Pull ready chunks until none remain.
while let input = try feeder.emitNextChunk() {
    let probs = try model.predict(from: input)
    // probs has shape (chunkSize - input.warmupFrames) * metadata.maxSpeakers
}
```
```

--------------------------------

### CLI Usage

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/TTS/StyleTTS2.md

Example of how to use the StyleTTS2 CLI for text-to-speech synthesis.

```APIDOC
## CLI Usage

### Description
Use the `fluidaudiocli` tool to perform text-to-speech synthesis with StyleTTS2.

### Command
```bash
swift run fluidaudiocli tts "Hello from StyleTTS2." \
    --backend styletts2 \
    --reference path/to/speaker.wav \
    --output ~/Desktop/styletts2-demo.wav
```

### Parameters
- `--backend` (string): Specifies the TTS backend to use. Set to `styletts2`.
- `--reference` (path): Required. Path to the reference audio file (e.g., WAV, AIFF, CAF, m4a). The file will be resampled to 24 kHz mono.
- `--output` (path): Path to save the synthesized audio file.

### Optional Flags
- `--styletts2-alpha <f>`: Controls the blend weight for the diffusion-sampled style versus the reference encoder. Default is `0.3`.
- `--styletts2-beta <f>`: Controls the blend weight for the prosody. Default is `0.7`.
- `--styletts2-ipa <s>`: Skips the lexicon and G2P pipeline, allowing direct input of an IPA string.
- `--seed <u64>`: Sets the random seed for the diffusion sampler for reproducible results. Default is `0`.
```

--------------------------------

### Voice Cloning Workflow

Source: https://github.com/fluidinference/fluidaudio/blob/main/Scripts/voice_cloning/README.md

Example workflow demonstrating voice cloning via CLI followed by spectral evaluation.

```bash
# 1. Clone a voice using FluidAudio CLI
fluidaudio tts "Hello, this is a test." --backend pocket --clone-voice speaker.wav -o output.wav

# 2. Evaluate the result
python Tools/voice_cloning/evaluate_voice.py speaker.wav output.wav --plot
```

--------------------------------

### Install React Native/Expo Wrapper for FluidAudio

Source: https://github.com/fluidinference/fluidaudio/blob/main/README.md

Install the React Native and Expo wrapper for FluidAudio using npm.

```bash
npm install @fluidinference/react-native-fluidaudio
```

--------------------------------

### Initialize LSEENDDiarizer with Async Convenience

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Diarization/LS-EEND.md

Initializes the LSEENDDiarizer asynchronously, handling model download and setup in one step. Use this for a straightforward setup.

```swift
// Async convenience: downloads (or reuses cached) model and initializes in one step.
let diarizer = try await LSEENDDiarizer(
    variant: .dihard3,
    stepSize: .step100ms,
    timelineConfig: nil                 // optional DiarizerTimelineConfig
)
```

--------------------------------

### Install Rust/Tauri Wrapper for FluidAudio

Source: https://github.com/fluidinference/fluidaudio/blob/main/README.md

Add the Rust/Tauri wrapper for FluidAudio to your project using cargo.

```bash
cargo add fluidaudio-rs
```

--------------------------------

### Manual Audio Source Control

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Diarization/GettingStarted.md

This example demonstrates how to manually construct an audio source using `StreamingAudioSourceFactory` for advanced control over memory management and audio loading. This is useful for processing the same file multiple times, measuring loading time separately, or implementing custom logic.

```APIDOC
## Manual Audio Source Control

This approach allows for fine-grained control over memory management and audio loading by manually constructing the audio source using `StreamingAudioSourceFactory`.

### Use Cases

- Process the same file multiple times without reloading.
- Measure audio loading time separately from diarization time.
- Implement custom cleanup or caching logic.

### Code Example

```swift
import FluidAudio

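let config = OfflineDiarizerConfig()
// Assumed: `manager` is an OfflineDiarizerManager built from this config.
let manager = OfflineDiarizerManager(config: config)
try await manager.prepareModels()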

let factory = StreamingAudioSourceFactory()
let (source, loadDuration) = try factory.makeDiskBackedSource(
    from: URL(fileURLWithPath: "meeting.wav"),
    targetSampleRate: config.segmentation.sampleRate
)
defer { source.cleanup() }

let result = try await manager.process(
    audioSource: source,
    audioLoadingSeconds: loadDuration
)
```

**Note:** For most use cases, the simpler `manager.process(url)` API is recommended.
```

--------------------------------

### CLI Usage

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/ASR/Cohere.md

Examples of how to use the Cohere Transcribe model from the command line interface.

```APIDOC
## CLI Usage

### Single-precision (FP16 or INT8 in one dir)
```bash
swift run -c release fluidaudiocli cohere-transcribe audio.wav \
    --model-dir /path/to/cohere-fp16 \
    --language en
```

### Mixed precision (INT8 encoder + FP16 decoder)
```bash
swift run -c release fluidaudiocli cohere-transcribe audio.wav \
    --encoder-dir /path/to/q8 \
    --decoder-dir /path/to/f16 \
    --vocab-dir /path/to/f16 \
    --language en
```
```

--------------------------------

### Transcribe Audio with CLI (Bash)

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/ASR/Qwen3-ASR.md

Provides command-line examples for transcribing audio files using the fluidaudiocli tool. Supports auto-detection of language, specifying a language hint, and using a local model directory.

```bash
# Transcribe a file (auto-detect language)
swift run -c release fluidaudiocli qwen3-transcribe audio.wav

# Transcribe with language hint
swift run -c release fluidaudiocli qwen3-transcribe audio.wav --language zh

# Transcribe with local model
swift run -c release fluidaudiocli qwen3-transcribe audio.wav --model-dir /path/to/model
```

--------------------------------

### Format and Build Project

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/ModelConversion.md

Run Swift formatting tools to ensure code style consistency and then build and test the project to verify the integration.

```bash
swift format --in-place --recursive --configuration .swift-format Sources/ Tests/
swift build
swift test
```

--------------------------------

### Build Swift Project (Release)

Source: https://github.com/fluidinference/fluidaudio/blob/main/CLAUDE.md

Compile the Swift project in release mode, recommended for performance benchmarks.

```bash
swift build -c release
```

--------------------------------

### Run Various Benchmarks with FluidAudio CLI

Source: https://github.com/fluidinference/fluidaudio/blob/main/CLAUDE.md

Execute benchmarks for Sortformer, Qwen3, CTC earnings, and G2P models.

```bash
swift run fluidaudiocli sortformer-benchmark
swift run fluidaudiocli qwen3-benchmark
swift run fluidaudiocli ctc-earnings-benchmark
swift run fluidaudiocli g2p-benchmark
```

--------------------------------

### Swift API Initialization and Synthesis

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/TTS/PocketTTS.md

Demonstrates how to initialize the PocketTtsManager with a specific language and synthesize text to audio.

```APIDOC
## Swift API

### Initialization and Synthesis

This snippet shows how to create a `PocketTtsManager` instance for a specific language, initialize it, and then synthesize text into audio.

```swift
let manager = PocketTtsManager(language: .spanish)
try await manager.initialize()
let audio = try await manager.synthesize(text: "Hola mundo")
```

**Notes:**
- `PocketTtsManager.language` is immutable after instantiation. To support multiple languages, create separate manager instances for each language.
```

--------------------------------

### iOS ASR Test App Example

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/ASR/TDT-CTC-110M.md

Example Swift code for an iOS test app demonstrating ASR transcription using FluidAudio. It includes model auto-download, manager initialization, and transcription of audio samples.

```swift
import SwiftUI
import FluidAudio

struct ContentView: View {
    @State private var transcript: String = ""
    @State private var isTesting: Bool = false

    // `throws` lets the `try await` calls below propagate errors to the caller
    func runTest() async throws {
        // Auto-download models on device
        let models = try await AsrModels.downloadAndLoad(
            to: nil,  // Uses default cache
            version: .tdtCtc110m
        )

        // Initialize manager
        let manager = AsrManager()
        try await manager.loadModels(models)

        // Load test audio
        let audioSamples: [Float] = ... // Load from bundle or record

        // Transcribe
        let result = try await manager.transcribe(audioSamples)
        transcript = result.text
    }
}
```

--------------------------------

### Build Project with Swift

Source: https://github.com/fluidinference/fluidaudio/blob/main/AGENTS.md

Use 'swift build' to compile the project. For a release build, use the '-c release' flag.

```bash
swift build
```

```bash
swift build -c release
```

--------------------------------

### Initialize and Load Qwen3-ASR Models (Swift)

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/ASR/Qwen3-ASR.md

Shows the Swift API for initializing the Qwen3AsrManager and loading the necessary CoreML models. The models are automatically downloaded if they are not found locally.

```swift
import FluidAudio

// Initialize manager
let manager = Qwen3AsrManager()

// Load models (auto-downloads if needed)
let modelDir = try await Qwen3AsrModels.download()
try await manager.loadModels(from: modelDir)

// Transcribe audio samples (16kHz mono Float32)
let text = try await manager.transcribe(audioSamples: samples)
```

--------------------------------

### Get Speaker IDs

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Diarization/SpeakerManager.md

Retrieve a sorted array of all speaker IDs.

```swift
let ids = speakerManager.speakerIds
// Returns: [String] - sorted array of speaker IDs
```

--------------------------------

### Get Speaker Count

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Diarization/SpeakerManager.md

Retrieve the total number of tracked speakers.

```swift
print("Active speakers: \(speakerManager.speakerCount)")
```

--------------------------------

### SwiftUI Integration

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Diarization/GettingStarted.md

Provides an example of integrating the diarization functionality into a SwiftUI view.

```APIDOC
## SwiftUI Integration

```swift
import SwiftUI
import FluidAudio

struct DiarizationView: View {
    @StateObject private var processor = DiarizationProcessor()

    var body: some View {
        VStack {
            Text("Speakers: \(processor.speakerCount)")

            List(processor.activeSpeakers) { speaker in
                HStack {
                    Circle()
                        .fill(speaker.isSpeaking ? Color.green : Color.gray)
                        .frame(width: 10, height: 10)
                    Text(speaker.name)
                    Spacer()
                    Text("\(speaker.duration, specifier: "%.1f")s")
                }
            }

            Button(processor.isProcessing ? "Stop" : "Start") {
                processor.toggleProcessing()
            }
        }
    }
}

@MainActor
class DiarizationProcessor: ObservableObject {
    @Published var speakerCount = 0
    @Published var activeSpeakers: [SpeakerDisplay] = []
    @Published var isProcessing = false

    private var diarizer: DiarizerManager?

    func toggleProcessing() {
        if isProcessing {
            stopProcessing()
        } else {
            startProcessing()
        }
    }

    private func startProcessing() {
        Task {
            let models = try await DiarizerModels.downloadIfNeeded()
            diarizer = DiarizerManager()  // Default config
            diarizer?.initialize(models: models)
            isProcessing = true

            // Start audio capture and process chunks
            AudioCapture.start { [weak self] chunk in
                self?.processChunk(chunk)
            }
        }
    }

    private func processChunk(_ audio: [Float]) {
        Task {
            guard let diarizer = diarizer else { return }

            let result = try diarizer.performCompleteDiarization(audio)
            speakerCount = diarizer.speakerManager.speakerCount

            // Update UI with current speakers
            activeSpeakers = diarizer.speakerManager.speakerIds.compactMap { id -> SpeakerDisplay? in
                guard let speaker = diarizer.speakerManager.getSpeaker(for: id) else {
                    return nil
                }
                return SpeakerDisplay(
                    id: id,
                    name: speaker.name,
                    duration: speaker.duration,
                    // Compare each segment's speaker against the outer `id`, not the segment itself
                    isSpeaking: result.segments.contains { $0.speakerId == id }
                )
            }
        }
    }
}
```
```

--------------------------------

### Speaker Count and IDs

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Diarization/SpeakerManager.md

Operations to get the total number of speakers and their IDs.

```APIDOC
## speakerCount

### Description
Get the total number of tracked speakers.

### Access
Swift property `speakerManager.speakerCount`. `SpeakerManager` is an in-process Swift API, not an HTTP endpoint.

### Returns
- **count** (integer) - The total number of speakers.

#### Example Value
```json
15
```
```

```APIDOC
## speakerIds

### Description
Get all speaker IDs as a sorted array.

### Access
Swift property `speakerManager.speakerIds`. `SpeakerManager` is an in-process Swift API, not an HTTP endpoint.

### Returns
- **ids** (array) - A sorted array of speaker IDs (strings).

#### Example Value
```json
[
  "speaker_1",
  "speaker_10",
  "speaker_2"
]
```
```
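
The example IDs above are in plain string order, so `speaker_10` lands before `speaker_2`. If a display needs numeric order instead, a numeric-aware comparison can re-sort the array. This is a small sketch that assumes IDs shaped like the example values; the actual ID format used by `SpeakerManager` may differ.

```swift
import Foundation

// Example IDs, shaped like the sample values above (an assumption, not a guaranteed format).
let ids = ["speaker_10", "speaker_2", "speaker_1"]

// Default String ordering compares character by character, so "10" sorts before "2".
let lexicographic = ids.sorted()

// Foundation's `.numeric` option compares runs of digits by numeric value.
let natural = ids.sorted { $0.compare($1, options: .numeric) == .orderedAscending }

print(lexicographic)
print(natural)
```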

--------------------------------

### Run Qwen3-ASR AISHELL Benchmark

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Benchmarks.md

Run the AISHELL-1 benchmark for the Qwen3-ASR model from the command line. Use a release build (`-c release`) and select the dataset with `--dataset aishell`.

```bash
# Run AISHELL-1 benchmark
swift run -c release fluidaudiocli qwen3-benchmark --dataset aishell
```

--------------------------------

### Get Current Speaker Names

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Diarization/SpeakerManager.md

Retrieve a sorted list of all speaker IDs.

```swift
let names = speakerManager.getCurrentSpeakerNames()
// Returns: [String] - sorted speaker IDs
```

--------------------------------

### Swift API Usage

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/ASR/Cohere.md

Example of how to integrate and use the Cohere Transcribe pipeline in Swift.

```APIDOC
## Swift API

```swift
import CoreML
import FluidAudio

let encoderDir = URL(fileURLWithPath: "/path/to/q8")
let decoderDir = URL(fileURLWithPath: "/path/to/f16")

let models = try await CoherePipeline.loadModels(
    encoderDir: encoderDir,
    decoderDir: decoderDir,
    vocabDir: decoderDir
)

let pipeline = CoherePipeline()
let result = try await pipeline.transcribe(
    audio: samples,        // 16 kHz mono Float32, up to 35 s
    models: models,
    language: .english
)
print(result.text)
```

`TranscriptionResult` also exposes `encoderSeconds`, `decoderSeconds`, and
`totalSeconds` for per-stage profiling.
```
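
Since the pipeline above accepts up to 35 s of 16 kHz audio per call, longer recordings need to be split first. The following is a naive sketch that cuts at fixed sample counts. `splitIntoWindows` is a hypothetical helper, not a FluidAudio API, and a fixed cut can land in the middle of a word.

```swift
/// Splits `samples` into consecutive slices of at most `maxSeconds` each.
/// Slices share storage with the original array, so no samples are copied.
func splitIntoWindows(
    _ samples: [Float],
    sampleRate: Int = 16_000,
    maxSeconds: Int = 35
) -> [ArraySlice<Float>] {
    let windowSize = sampleRate * maxSeconds
    precondition(windowSize > 0, "window size must be positive")
    return stride(from: 0, to: samples.count, by: windowSize).map { start in
        samples[start..<min(start + windowSize, samples.count)]
    }
}

// 90 seconds of silence at 16 kHz as stand-in audio.
let ninetySeconds = [Float](repeating: 0, count: 90 * 16_000)
let pieces = splitIntoWindows(ninetySeconds)
print(pieces.map { $0.count })
```

Each slice can be wrapped in `Array(...)` when an API expects `[Float]`. Splitting on detected silence instead of fixed offsets would likely give cleaner transcripts.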

--------------------------------

### Transcribe Audio with Language Hint (Swift)

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/ASR/Qwen3-ASR.md

Demonstrates how to transcribe audio samples using the Qwen3AsrManager, with options for automatic language detection or specifying a language hint for improved accuracy.

```swift
// Auto-detect language (default)
let text = try await manager.transcribe(audioSamples: samples)

// Specify language for better accuracy
let chineseText = try await manager.transcribe(audioSamples: samples, language: .chinese)
let japaneseText = try await manager.transcribe(audioSamples: samples, language: .japanese)
```

--------------------------------

### ARPA Model File Format

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/CtcDecoderGuide.md

Example structure of an ARPA language model file.

```text
\data\
ngram 1=3
ngram 2=2

\1-grams:
-1.0    patient     -0.5
-1.5    diabetes    0.0
-2.0    hypertension 0.0

\2-grams:
-0.3    patient     diabetes
-0.5    patient     hypertension

\end\
```
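
To make the layout concrete, here is a minimal sketch of reading such a file: each entry line holds a log10 probability, the n-gram words, and an optional back-off weight. `ArpaEntry` and `parseArpa` are hypothetical names for illustration; FluidAudio's own loader may work differently.

```swift
import Foundation

/// One parsed n-gram entry (hypothetical type).
struct ArpaEntry {
    let logProb: Double   // log10 probability
    let words: [String]   // the n-gram itself
    let backoff: Double?  // back-off weight, absent on highest-order entries
}

/// Parses ARPA text into entries grouped by n-gram order.
func parseArpa(_ text: String) -> [Int: [ArpaEntry]] {
    var entries: [Int: [ArpaEntry]] = [:]
    var order: Int? = nil  // nil until an "\N-grams:" header is seen
    for rawLine in text.split(separator: "\n") {
        let line = rawLine.trimmingCharacters(in: .whitespaces)
        if line.isEmpty || line == "\\data\\" { continue }
        if line == "\\end\\" { break }
        if line.hasPrefix("\\"), line.hasSuffix("-grams:") {
            order = Int(line.dropFirst().dropLast("-grams:".count))
            continue
        }
        guard let n = order else { continue }  // skips "ngram N=count" header lines
        let fields = line.split(whereSeparator: { $0 == " " || $0 == "\t" }).map(String.init)
        guard fields.count >= n + 1, let logProb = Double(fields[0]) else { continue }
        let words = Array(fields[1...n])
        let backoff = fields.count > n + 1 ? Double(fields[n + 1]) : nil
        entries[n, default: []].append(ArpaEntry(logProb: logProb, words: words, backoff: backoff))
    }
    return entries
}

let sample = #"""
\data\
ngram 1=3
ngram 2=2

\1-grams:
-1.0    patient     -0.5
-1.5    diabetes    0.0
-2.0    hypertension 0.0

\2-grams:
-0.3    patient     diabetes
-0.5    patient     hypertension

\end\
"""#

let parsed = parseArpa(sample)
print("unigrams: \(parsed[1]?.count ?? 0), bigrams: \(parsed[2]?.count ?? 0)")
```

The sketch ignores the declared counts in the `\data\` header; a stricter parser would check them against the number of entries read.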

--------------------------------

### CLI Usage

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/ASR/Qwen3-ASR.md

Instructions for using the Qwen3-ASR model via the command-line interface (CLI), including transcription with and without language hints, and specifying a local model directory.

```APIDOC
## CLI Usage

### Description
This section provides command-line instructions for transcribing audio files using the Qwen3-ASR model.

### Basic Transcription (Auto-detect Language)
To transcribe an audio file and automatically detect the language:

```bash
swift run -c release fluidaudiocli qwen3-transcribe audio.wav
```

### Transcription with Language Hint
To improve accuracy, you can specify the language of the audio file:

```bash
# Transcribe with language hint (e.g., Chinese)
swift run -c release fluidaudiocli qwen3-transcribe audio.wav --language zh

# Transcribe with language hint (e.g., Japanese)
swift run -c release fluidaudiocli qwen3-transcribe audio.wav --language ja
```

### Using a Local Model
If you have downloaded the model files, you can specify the directory containing them:

```bash
# Transcribe with a local model directory
swift run -c release fluidaudiocli qwen3-transcribe audio.wav --model-dir /path/to/model
```
```

--------------------------------

### Get Global Speaker Statistics

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Diarization/SpeakerManager.md

Retrieve aggregate statistics across all tracked speakers.

```swift
let stats = speakerManager.getGlobalSpeakerStats()
print("Total speakers: \(stats.totalSpeakers)")
print("Total duration: \(stats.totalDuration)s")
print("Average confidence: \(stats.averageConfidence)")
print("Speakers with history: \(stats.speakersWithHistory)")
```

--------------------------------

### Download Models and Datasets

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/ASR/benchmarks100.md

Use this command to download necessary models and datasets for benchmarking. Requires an active internet connection.

```bash
./Scripts/parakeet_subset_benchmark.sh --download
```

--------------------------------

### Implement Real-time Diarization in Swift

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Diarization/SpeakerManager.md

Demonstrates initializing the diarizer and processing audio chunks to identify speakers. Requires downloading models before initialization.

```swift
class RealtimeDiarizer {
    let diarizer: DiarizerManager
    let speakerManager: SpeakerManager

    init() async throws {
        let models = try await DiarizerModels.downloadIfNeeded()
        diarizer = DiarizerManager()
        diarizer.initialize(models: models)
        speakerManager = diarizer.speakerManager
    }

    func processChunk(_ audio: [Float]) throws {
        let result = try diarizer.performCompleteDiarization(audio)

        for segment in result.segments {
            // Get speaker with updated name
            if let speaker = speakerManager.getSpeaker(for: segment.speakerId) {
                print("\(speaker.name): '\(segment.text ?? "")' at \(segment.startTimeSeconds)s")
            }
        }
    }

    func enrollSpeaker(name: String, audio: [Float]) throws {
        let result = try diarizer.performCompleteDiarization(audio)

        if let firstSegment = result.segments.first,
           let speaker = speakerManager.getSpeaker(for: firstSegment.speakerId) {
            speaker.name = name
            print("Enrolled \(name) as \(speaker.id)")
        }
    }
}
```

--------------------------------

### FluidAudio CLI Available Options

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/CtcDecoderGuide.md

List of available command-line arguments for the ctc-decode-benchmark tool.

```text
--audio <file>           Audio file (WAV, 16kHz recommended)
--arpa <file>            ARPA language model file
--reference <text>       Reference text for WER calculation
--ctc-variant 06b|110m   CTC model variant (default: 06b)
--lm-weight <float>      LM scaling factor (default: 0.3)
--beam-width <int>       Beam width (default: 100)
--word-bonus <float>     Per-word insertion bonus (default: 0.0)
--token-candidates <int> Top-K tokens per frame (default: 40)
```
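
For intuition about `--lm-weight` and `--word-bonus`, beam-search decoders with a language model conventionally score a hypothesis as acoustic log-probability plus a weighted LM log-probability plus a per-word bonus. The sketch below shows that conventional formula only. It is an assumption that the benchmark combines the terms this way, and `hypothesisScore` is a hypothetical function.

```swift
/// Conventional shallow-fusion score for one beam hypothesis (illustrative only).
/// Defaults mirror the option defaults listed above.
func hypothesisScore(
    acousticLogProb: Double,
    lmLogProb: Double,
    wordCount: Int,
    lmWeight: Double = 0.3,
    wordBonus: Double = 0.0
) -> Double {
    acousticLogProb + lmWeight * lmLogProb + wordBonus * Double(wordCount)
}

// Hypothesis A is slightly better acoustically, but the LM finds it unlikely.
let scoreA = hypothesisScore(acousticLogProb: -4.0, lmLogProb: -6.0, wordCount: 2)
// Hypothesis B is slightly worse acoustically, but the LM prefers it.
let scoreB = hypothesisScore(acousticLogProb: -4.2, lmLogProb: -2.0, wordCount: 2)

print(scoreB > scoreA ? "LM rescoring picks B" : "LM rescoring picks A")
```

Under this formula a larger LM weight pulls results toward in-vocabulary phrases, while a positive word bonus offsets the LM's tendency to favor shorter outputs.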

--------------------------------

### Diarizer Protocol - Speaker Enrollment

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/API.md

Allows for the enrollment of known speakers before starting the diarization process.

```APIDOC
## Diarizer Protocol - Speaker Enrollment

### Description
Enroll known speakers to improve diarization accuracy by providing labeled audio data beforehand.

### Method
- `enrollSpeaker(withAudio:sourceSampleRate:named:...)`

### Notes
This method should be called before initiating streaming or offline processing to associate specific audio segments with speaker identities.
```

--------------------------------

### Performance Optimization

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Diarization/GettingStarted.md

Provides examples of how to optimize diarization performance by adjusting configuration parameters for lower latency.

```APIDOC
## Performance Optimization

```swift
let config = DiarizerConfig(
    clusteringThreshold: 0.7,
    minSpeechDuration: 1.0,
    minSilenceGap: 0.5
)

// Lower latency for real-time
let lowLatencyConfig = DiarizerConfig(
    clusteringThreshold: 0.7,
    minSpeechDuration: 0.5,    // Faster response
    minSilenceGap: 0.3         // Quicker speaker switches
)
```
```
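
To illustrate why smaller `minSpeechDuration` and `minSilenceGap` values respond faster, the sketch below applies the usual meaning of such thresholds to a list of segments: same-speaker segments separated by a gap shorter than `minSilenceGap` are merged, and segments shorter than `minSpeechDuration` are dropped. This is a generic interpretation, not FluidAudio's implementation; `SpeechSegment` and `smooth` are hypothetical.

```swift
/// Stand-in segment type for this sketch (hypothetical).
struct SpeechSegment: Equatable {
    let speakerId: String
    var start: Double
    var end: Double
}

/// Merges close same-speaker segments, then drops segments that are too short.
func smooth(
    _ segments: [SpeechSegment],
    minSpeechDuration: Double,
    minSilenceGap: Double
) -> [SpeechSegment] {
    var merged: [SpeechSegment] = []
    for segment in segments.sorted(by: { $0.start < $1.start }) {
        if var last = merged.last,
           last.speakerId == segment.speakerId,
           segment.start - last.end < minSilenceGap {
            // Gap is too short to count as silence: extend the previous segment.
            last.end = max(last.end, segment.end)
            merged[merged.count - 1] = last
        } else {
            merged.append(segment)
        }
    }
    return merged.filter { $0.end - $0.start >= minSpeechDuration }
}

let raw = [
    SpeechSegment(speakerId: "A", start: 0.0, end: 0.4),
    SpeechSegment(speakerId: "A", start: 0.6, end: 1.5),
    SpeechSegment(speakerId: "B", start: 2.0, end: 2.2),
    SpeechSegment(speakerId: "A", start: 3.0, end: 4.5),
]
let smoothed = smooth(raw, minSpeechDuration: 0.5, minSilenceGap: 0.3)
for segment in smoothed {
    print("\(segment.speakerId): \(segment.start)s - \(segment.end)s")
}
```

Under this reading, lowering either threshold lets short turns and quick speaker changes through sooner, at the risk of more spurious segments.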

--------------------------------

### Initialize and Run Diarization

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Diarization/GettingStarted.md

This Swift code demonstrates the basic workflow for speaker diarization. It includes downloading necessary models, initializing the diarizer, converting audio to the required format, and performing diarization on audio samples or slices. Ensure models are downloaded once before initialization.

```swift
import FluidAudio

// 1. Download models (one-time setup)
let models = try await DiarizerModels.downloadIfNeeded()

// 2. Initialize with default config
let diarizer = DiarizerManager()
diarizer.initialize(models: models)

// 3. Normalize any audio file to 16kHz mono Float32 using AudioConverter
let converter = AudioConverter()
let url = URL(fileURLWithPath: "path/to/audio.wav")
let audioSamples = try converter.resampleAudioFile(url)

// 4. Run diarization (accepts any RandomAccessCollection<Float>)
let result = try diarizer.performCompleteDiarization(audioSamples)

// Alternative: Use ArraySlice for zero-copy processing
let audioSlice = audioSamples[1000..<5000]  // No memory copy!
let sliceResult = try diarizer.performCompleteDiarization(audioSlice)

// 5. Get results
for segment in result.segments {
    print("Speaker \(segment.speakerId): \(segment.startTimeSeconds)s - \(segment.endTimeSeconds)s")
}
```
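
A common next step after step 5 is to total the speaking time per speaker. The sketch below uses a local stand-in `Segment` struct that only mirrors the field names printed above (`speakerId`, `startTimeSeconds`, `endTimeSeconds`). The `Float` field types and the `speakingTime` helper are assumptions for illustration.

```swift
/// Stand-in for a diarization segment; only the field names come from the example above.
struct Segment {
    let speakerId: String
    let startTimeSeconds: Float
    let endTimeSeconds: Float
}

/// Sums segment durations per speaker ID.
func speakingTime(in segments: [Segment]) -> [String: Float] {
    segments.reduce(into: [:]) { totals, segment in
        totals[segment.speakerId, default: 0] += segment.endTimeSeconds - segment.startTimeSeconds
    }
}

let demoSegments = [
    Segment(speakerId: "1", startTimeSeconds: 0, endTimeSeconds: 4),
    Segment(speakerId: "2", startTimeSeconds: 4, endTimeSeconds: 6),
    Segment(speakerId: "1", startTimeSeconds: 6, endTimeSeconds: 9),
]
let totals = speakingTime(in: demoSegments)
for (speaker, seconds) in totals.sorted(by: { $0.key < $1.key }) {
    print("Speaker \(speaker): \(seconds)s")
}
```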

--------------------------------

### Run G2P Benchmark

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/ModelConversion.md

Executes the Grapheme-to-Phoneme benchmark using the release configuration.

```bash
swift run -c release fluidaudiocli g2p-benchmark
```

--------------------------------

### Evaluation Output Example

Source: https://github.com/fluidinference/fluidaudio/blob/main/Scripts/voice_cloning/README.md

Sample output generated by the evaluation script showing duration and similarity metrics.

```text
Reference:   speaker.wav
Synthesized: output.wav

Reference duration:   5.23s
Synthesized duration: 2.15s

Computing spectral similarity...

  Mel Similarity:      0.9234
  MFCC Similarity:     0.8876
  MFCC Std Similarity: 0.8543
  Combined Score:      0.8951
  Quality:             Good
```

--------------------------------

### Diarizer Properties

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Diarization/LS-EEND.md

Access properties of the diarizer instance to get information about the current state and model configuration.

```APIDOC
## Properties

### `timeline`
- **Type:** `DiarizerTimeline`
- **Description:** Accumulated finalized results of the diarization.

### `isAvailable`
- **Type:** `Bool`
- **Description:** Indicates whether the model is loaded and ready.

### `numFramesProcessed`
- **Type:** `Int`
- **Description:** The total number of committed frames that have been processed.

### `targetSampleRate`
- **Type:** `Int?`
- **Description:** The expected input sample rate for the model (e.g., 8000 Hz). Optional.

### `modelFrameHz`
- **Type:** `Double?`
- **Description:** The output frame rate of the model, typically around 10.0 Hz. Optional.

### `numSpeakers`
- **Type:** `Int?`
- **Description:** The maximum number of speakers the model can output (`maxSpeakers`). Optional.
```
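
`numFramesProcessed` and `modelFrameHz` together give the amount of audio covered so far. A minimal sketch, assuming frames are evenly spaced at `modelFrameHz`; `processedSeconds` is a hypothetical helper that takes the two property values as plain arguments.

```swift
/// Converts a committed frame count into seconds of audio.
/// Returns nil when the frame rate is unknown, for example before a model is loaded.
func processedSeconds(numFramesProcessed: Int, modelFrameHz: Double?) -> Double? {
    guard let hz = modelFrameHz, hz > 0 else { return nil }
    return Double(numFramesProcessed) / hz
}

// 250 frames at 10 Hz correspond to 25 seconds of audio.
let covered = processedSeconds(numFramesProcessed: 250, modelFrameHz: 10.0)
let unknown = processedSeconds(numFramesProcessed: 250, modelFrameHz: nil)
print(covered ?? -1, unknown == nil)
```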

--------------------------------

### Initialize and Use Qwen3AsrManager

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/API.md

Initialize the Qwen3AsrManager with a model directory and configuration, then transcribe either in-memory audio samples or an audio file URL. Supports multi-language and experimental high-accuracy models.

```swift
let manager = try await Qwen3AsrManager(modelDir: modelsURL, configuration: .default)
let transcript = try await manager.transcribe(audioSamples)
// Or transcribe from a file URL:
// let transcript = try await manager.transcribe(audioFileURL)
```

--------------------------------

### MagpieTtsManager Warmup

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/TTS/Magpie.md

Explains how to use the `warmup()` function to mitigate cold-start performance issues by pre-compiling CoreML graphs on the ANE.

```APIDOC
## Warmup MagpieTtsManager

### Description
Prepares the Magpie TTS CoreML graphs for immediate use after system sleep or long idle periods, preventing user-visible stalls during the first `synthesize` call.

### Method
`public func warmup() async throws`

### Parameters
None.

### Throws
- `.notInitialized`: If called before `MagpieTtsManager` is initialized.

### Usage
Call `warmup()` from your application's wake-handler (e.g., `NSApplication.didBecomeActiveNotification`) to ensure the ANE graphs are ready. It is safe to call repeatedly.

### Request Example
```swift
// In your app's wake-handler:
NotificationCenter.default.addObserver(
    forName: NSApplication.didBecomeActiveNotification, object: nil, queue: nil
) { _ in
    Task { try? await manager.warmup() } // Ensure manager is accessible
}

// Note: Wrap in Task if you don't want to block the calling context.
```

### Implementation Details
`warmup()` performs a short, throwaway synthesis to trigger ANE graph compilation and specialization. This process takes approximately 1.5–2 seconds on an M2 chip.
```

--------------------------------

### Extension Operations

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Diarization/SpeakerManager.md

Additional operations provided via extensions, such as reassigning segments and getting global statistics.

```APIDOC
## POST /speaker/reassignSegment

### Description
Move a segment embedding from one speaker to another. Useful for correcting misclassified segments in post-processing.

### Method
POST

### Endpoint
/speaker/reassignSegment

### Request Body
- **segmentId** (string) - Required - The ID of the segment to reassign.
- **fromSpeakerId** (string) - Required - The ID of the speaker the segment currently belongs to.
- **toSpeakerId** (string) - Required - The ID of the speaker the segment should be reassigned to.

### Response
#### Success Response (200)
- **success** (boolean) - True if the reassignment was successful, false otherwise.

#### Response Example
```json
{
  "success": true
}
```
```

```APIDOC
## GET /speaker/names

### Description
Get a sorted list of all speaker IDs (names).

### Method
GET

### Endpoint
/speaker/names

### Response
#### Success Response (200)
- **names** (array) - A sorted array of speaker IDs (strings).

#### Response Example
```json
[
  "Alice",
  "Bob",
  "Charlie"
]
```
```

```APIDOC
## GET /speaker/globalStats

### Description
Get aggregate statistics across all speakers.

### Method
GET

### Endpoint
/speaker/globalStats

### Response
#### Success Response (200)
- **totalSpeakers** (integer) - Number of tracked speakers.
- **totalDuration** (float) - Combined speech duration across all speakers in seconds.
- **averageConfidence** (float) - Normalized confidence score (0.0-1.0).
- **speakersWithHistory** (integer) - Number of speakers with raw embedding history.

#### Response Example
```json
{
  "totalSpeakers": 25,
  "totalDuration": 1500.75,
  "averageConfidence": 0.85,
  "speakersWithHistory": 20
}
```
```

--------------------------------

### Creating an LSEENDInput

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Diarization/LS-EEND.md

Illustrates how to initialize an LSEENDInput object, either with fresh state or by resuming from a previous state.

```APIDOC
## LSEENDInput

`MLFeatureProvider` that carries the per-chunk inputs and recurrent state for one `LSEENDModel.predict` call.

```swift
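// The two initializers are alternatives; distinct names keep the snippet compilable in one scope.
let freshInput = try LSEENDInput(from: model.metadata)                          // fresh state
let resumedInput = try LSEENDInput(from: model.metadata, state: existingState)  // resume from snapshot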
```
```

--------------------------------

### CLI Usage for PocketTTS

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/TTS/PocketTTS.md

Provides examples of using the `fluidaudio` command-line interface to perform text-to-speech synthesis with PocketTTS.

```APIDOC
### CLI Usage

This section details how to use the `fluidaudio` command-line tool with the PocketTTS backend for text-to-speech conversion.

```bash
# Default (English)
fluidaudio tts "Hello world" --backend pocket --output en.wav

# Spanish (6L)
fluidaudio tts "Hola mundo" --backend pocket --language spanish --output es.wav

# French (24L only)
fluidaudio tts "Bonjour" --backend pocket --language french_24l --output fr.wav
```
```

--------------------------------

### Initialize and Synthesize with KokoroAneManager

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/API.md

Initialize the ANE-resident Kokoro manager, synthesize speech from text, and write the resulting WAV data to a file.

```swift
let manager = KokoroAneManager()
try await manager.initialize()

let wav = try await manager.synthesize(text: "Hello from FluidAudio!")
try wav.write(to: URL(fileURLWithPath: "/tmp/demo.wav"))
```

--------------------------------

### Run Nemotron Speech Streaming 0.6B Benchmark with Limited Files

Source: https://github.com/fluidinference/fluidaudio/blob/main/benchmarks.md

Execute the Nemotron Speech Streaming 0.6B benchmark on a restricted number of files for faster execution.

```bash
.build/release/fluidaudiocli nemotron-benchmark --subset test-clean --max-files 100
```

--------------------------------

### Benchmark Qwen3 ASR

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/CLI.md

Run the benchmark for the Qwen3 ASR model.

```bash
swift run fluidaudiocli qwen3-benchmark
```

--------------------------------

### Real-time Diarization with Speaker Names

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Diarization/SpeakerManager.md

Example usage of the RealtimeDiarizer class for processing audio chunks and enrolling speakers.

```APIDOC
## Real-time Diarization with Speaker Names

This section demonstrates how to use the `RealtimeDiarizer` class to perform real-time speaker diarization and enroll new speakers.

### Class: `RealtimeDiarizer`

#### Initialization
```swift
init() async throws
```
Initializes the diarizer by downloading necessary models and setting up the `DiarizerManager` and `SpeakerManager`.

#### Processing Audio Chunks
```swift
func processChunk(_ audio: [Float]) throws
```
Processes a chunk of audio data to perform diarization. It iterates through the resulting segments, retrieves speaker names using the `SpeakerManager`, and prints the speaker, their transcribed text, and start time.

#### Enrolling Speakers
```swift
func enrollSpeaker(name: String, audio: [Float]) throws
```
Enrolls a new speaker by processing a provided audio sample. It assigns the given name to the detected speaker and prints a confirmation message.
```

--------------------------------

### Swift API - Voice Cloning and File Output

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/TTS/PocketTTS.md

Illustrates advanced Swift API usage for cloning a voice from an audio source and synthesizing directly to a file.

```APIDOC
## Usage

### Advanced Swift API Features

This section covers using the PocketTTS Swift API for voice cloning and synthesizing audio directly to a specified file URL.

```swift
import FluidAudio

let manager = PocketTtsManager()
try await manager.initialize()

// Using built-in voices
let audioData = try await manager.synthesize(text: "Hello, world!")

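// Using a cloned voice (speakerAudioURL points at a reference recording of the target speaker)
let voiceData = try await manager.cloneVoice(from: speakerAudioURL)
let clonedAudioData = try await manager.synthesize(text: "Hello, world!", voiceData: voiceData)

// Synthesizing directly to a file
try await manager.synthesizeToFile(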
    text: "Hello, world!",
    outputURL: URL(fileURLWithPath: "/tmp/output.wav")
)
```
```

--------------------------------

### CLI Decoding Benchmark Output

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/CtcDecoderGuide.md

Example output showing WER improvement when using an ARPA language model.

```text
Greedy:         "patient has die beetus"     (15.2% WER)
Beam (no LM):   "patient has die beetus"     (14.1% WER)
Beam + LM:      "patient has diabetes" ✅    (9.4% WER)

🎯 LM Improvement: 38% reduction in WER
```
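
The 38% figure in the sample output is a relative reduction measured against the greedy baseline, not a drop of 38 percentage points. A minimal sketch of that arithmetic, using the numbers printed above (the helper name `relativeWERReduction` is illustrative and not part of the FluidAudio CLI):

```swift
import Foundation

/// Relative WER reduction from `baseline` to `improved`, as a fraction of the baseline.
func relativeWERReduction(baseline: Double, improved: Double) -> Double {
    precondition(baseline > 0, "baseline WER must be positive")
    return (baseline - improved) / baseline
}

// Figures taken from the sample output above.
let greedyWER = 15.2      // greedy decoding, percent
let beamWithLMWER = 9.4   // beam search with ARPA LM, percent

let reduction = relativeWERReduction(baseline: greedyWER, improved: beamWithLMWER)
print(String(format: "%.0f%% relative reduction", reduction * 100))
// prints "38% relative reduction"
```

The absolute improvement in the same example is 5.8 percentage points (15.2% to 9.4%).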

--------------------------------

### Complete Workflow CLI Commands

Source: https://github.com/fluidinference/fluidaudio/blob/main/Sources/FluidAudioCLI/README.md

A sequence of commands demonstrating the full workflow from dataset download to multi-stream transcription.

```bash
# 1. Download dataset
swift run fluidaudiocli download --dataset ami-sdm

# 2. Run diarization benchmark
swift run fluidaudiocli diarization-benchmark --dataset ami-sdm --output results.json

# 3. Process individual file
swift run fluidaudiocli process audio.wav --threshold 0.7

# 4. Transcribe audio
swift run fluidaudiocli transcribe audio.wav --config low-latency

# 5. Multi-stream transcription
swift run fluidaudiocli multi-stream mic.wav system.wav
```

--------------------------------

### Run LS-EEND Benchmark (AMI variant, 500ms step)

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Benchmarks.md

Executes the LS-EEND benchmark using the AMI variant with a 500ms step size. This configuration commits approximately 500ms of audio per CoreML call. Use `--auto-download` to fetch models.

```bash
swift run fluidaudiocli lseend-benchmark --variant ami --step-size 500ms --auto-download
```

--------------------------------

### Train and Use Custom ARPA Language Model

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/CtcDecoderGuide.md

Commands to install KenLM, train a bigram model from a corpus, and use it with the FluidAudio CLI.

```bash
# Install KenLM
brew install kenlm

# Collect domain text (medical, legal, financial, etc.)
cat medical_transcripts/*.txt > corpus.txt

# Train bigram language model
lmplz -o 2 < corpus.txt > medical.arpa

# Use with FluidAudio
swift run fluidaudiocli ctc-decode-benchmark \
    --audio speech.wav \
    --arpa medical.arpa
```

--------------------------------

### Perform Inference with LSEENDModel

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Diarization/LS-EEND.md

Get speaker probability predictions from the LSEENDModel for a given input. The output is a flat array of probabilities with sigmoid applied.

```swift
let probs = try model.predict(from: input)
// probs is flat [Float], row-major, shape (chunkSize - input.warmupFrames) * metadata.maxSpeakers,
// with sigmoid already applied.
```
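
Because the result is a flat row-major buffer, the probability for speaker `s` at frame `f` sits at index `f * maxSpeakers + s`. The sketch below shows that indexing on a hand-written array. The helper `rows(fromFlat:speakerCount:)` and the sample values are hypothetical, and `speakerCount` stands in for `metadata.maxSpeakers`.

```swift
/// Splits a flat row-major buffer into one row per frame.
/// `speakerCount` stands in for `metadata.maxSpeakers` (assumption for this sketch).
func rows(fromFlat probs: [Float], speakerCount: Int) -> [[Float]] {
    precondition(speakerCount > 0 && probs.count % speakerCount == 0,
                 "buffer length must be a multiple of speakerCount")
    return stride(from: 0, to: probs.count, by: speakerCount).map { start in
        Array(probs[start..<start + speakerCount])
    }
}

// Hypothetical values: 3 frames x 2 speakers.
let flat: [Float] = [0.9, 0.1, 0.8, 0.2, 0.3, 0.7]
let frames = rows(fromFlat: flat, speakerCount: 2)

// frames[2][1] is speaker 1 at frame 2, i.e. flat[2 * 2 + 1].
print(frames[2][1])
```

Copying into nested arrays is convenient for inspection. For hot paths, indexing the flat buffer directly avoids the extra allocations.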

--------------------------------

### Download VAD Datasets with CLI

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/VAD/GettingStarted.md

Fetch datasets for benchmarking VAD using the `download` CLI command with the `--dataset vad` argument.

```bash
swift run fluidaudiocli download --dataset vad
```

--------------------------------

### VadManager - Streaming VAD

Source: https://context7.com/fluidinference/fluidaudio/llms.txt

Process audio chunks in real-time for streaming voice activity detection, emitting events for speech start and end transitions.

```APIDOC
## VadManager.processStreamingChunk

### Description
Streaming VAD with speech start/end events. Processes audio chunk-by-chunk in real time, emitting `VadStreamEvent` values for `speechStart` and `speechEnd` transitions using Silero-style hysteresis.

### Method
`processStreamingChunk(_ chunk: [Float], state: VadStreamState, config: VadStreamConfig, returnSeconds: Bool, timeResolution: Int)`

### Parameters
- `chunk` ([Float]): A chunk of audio samples (4096 samples at 16 kHz, about 256 ms). Passed without an argument label, as in the example below.
- `state` (VadStreamState): The current state of the streaming VAD.
- `config` (VadStreamConfig): Configuration for streaming VAD. Defaults to `.default`.
- `returnSeconds` (Bool): If true, time values are returned in seconds.
- `timeResolution` (Int): The time resolution for events.

### Request Example
```swift
import FluidAudio

Task {
    let manager = try await VadManager()
    var state = await manager.makeStreamState()

    for chunk in microphoneChunks {   // [Float] at 16 kHz, 4096 samples each
        let result = try await manager.processStreamingChunk(
            chunk,
            state: state,
            config: .default,
            returnSeconds: true,
            timeResolution: 2
        )
        state = result.state

        print(String(format: "Probability: %.3f", result.probability))
        if let event = result.event {
            switch event.kind {
            case .speechStart:
                print("Speech started at \(event.time ?? 0)s")
            case .speechEnd:
                print("Speech ended at \(event.time ?? 0)s")
            }
        }
    }
}
```

### Response
#### Return Value
- `result` (VadStreamResult): An object containing:
  - `probability` (Float): The speech probability for the current chunk.
  - `event` (VadStreamEvent?): A `VadStreamEvent` if a speech start or end transition occurred.
  - `state` (VadStreamState): The updated state for the next chunk.

#### Response Example
```
// Output example:
// Probability: 0.012
// Probability: 0.965
// Speech started at 0.26s
// Probability: 0.978
// Probability: 0.023
// Speech ended at 1.56s
```
```
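
Each 4096-sample chunk at 16 kHz covers 4096 / 16000 = 0.256 s, which matches the granularity of the event times in the sample output. The hysteresis idea can be sketched independently of FluidAudio: speech starts when the probability rises above a higher threshold and ends only when it falls below a lower one, so values between the two do not toggle the state. The type names and threshold values below are illustrative assumptions, not FluidAudio defaults or its implementation.

```swift
/// Transition reported by the sketch detector.
enum SpeechTransition { case speechStart, speechEnd }

/// Minimal two-threshold hysteresis over per-chunk speech probabilities.
/// Threshold values are illustrative, not FluidAudio defaults.
struct HysteresisDetector {
    var startThreshold: Float = 0.5
    var endThreshold: Float = 0.35
    var isSpeaking = false

    /// Returns a transition when the state changes, or nil otherwise.
    mutating func update(probability: Float) -> SpeechTransition? {
        if !isSpeaking && probability >= startThreshold {
            isSpeaking = true
            return .speechStart
        }
        if isSpeaking && probability < endThreshold {
            isSpeaking = false
            return .speechEnd
        }
        return nil
    }
}

var detector = HysteresisDetector()
let probabilities: [Float] = [0.01, 0.97, 0.45, 0.98, 0.02]
let transitions = probabilities.map { detector.update(probability: $0) }

// 0.45 lies between the two thresholds, so the detector stays in the speaking state
// and no event is produced for that chunk.
for (index, transition) in transitions.enumerated() {
    if let transition = transition {
        print("chunk \(index): \(transition)")
    }
}
```

With a single threshold at 0.5, the 0.45 chunk would have ended the speech segment and the following chunk would have started a new one. The second, lower threshold is what suppresses that flicker.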

--------------------------------

### Run FLEURS Benchmark for All Languages

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Benchmarks.md

Executes the FLEURS benchmark for all supported languages using the Qwen3 model. Ensure the Swift toolchain is installed and the project is set up.

```bash
swift run -c release fluidaudiocli qwen3-benchmark --dataset fleurs --languages all
```

--------------------------------

### CTC Earnings Benchmark Output

Source: https://github.com/fluidinference/fluidaudio/blob/main/Documentation/Benchmarks.md

Example output from the CTC earnings benchmark, showing Word Error Rate (WER) and dictionary metrics for individual files.

```text
  Data directory: /Users/<user>/Library/Application Support/FluidAudio/earnings22-kws/test-dataset
  Output file: ctc_earnings_benchmark.json
  TDT version: v2
  CTC model: /Users/<user>/Library/Application Support/FluidAudio/Models/parakeet-ctc-110m-coreml
Loading TDT models (v2) for transcription...
TDT models loaded successfully
Loading CTC models from: /Users/<user>/Library/Application Support/FluidAudio/Models/parakeet-ctc-110m-coreml
Loaded CTC vocabulary with 1024 tokens, variant: Parakeet CTC 110M (hybrid)
Created CTC spotter with blankId=1024
Processing 773 test files...
[  1/772] 4329526_chunk0            WER:  10.3%  Dict: 1/1
[  2/772] 4329526_chunk109          WER:  12.5%  Dict: 2/2
[  3/772] 4329526_chunk118          WER:   3.1%  Dict: 3/3
[  4/772] 4329526_chunk132          WER:   8.1%  Dict: 1/1
[  5/772] 4329526_chunk135          WER:  25.7%  Dict: 1/1
[  6/772] 4329526_chunk16           WER:   8.6%  Dict: 1/1
...
[767/772] 4485206_chunk_86          WER:   5.0%  Dict: 2/2
[768/772] 4485206_chunk_88          WER:   8.3%  Dict: 2/2
[769/772] 4485206_chunk_92          WER:  14.7%  Dict: 4/4
[770/772] 4485206_chunk_97          WER:  30.5%  Dict: 1/1
[771/772] 4485206_chunk_98          WER:  18.6%  Dict: 4/4
[772/772] 4485206_chunk_99          WER:  22.0%  Dict: 1/1
```
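
For quick aggregation, the per-file lines in a log like the one above can be parsed with plain string handling. This is a rough sketch that assumes the exact line layout shown in the sample. `FileResult` and `parseResultLine` are hypothetical helpers, not part of FluidAudio, and the JSON output file is likely the more robust source for real analysis.

```swift
import Foundation

/// One parsed per-file result line from the benchmark log.
struct FileResult {
    let name: String
    let werPercent: Double
    let dictFound: Int
    let dictTotal: Int
}

/// Parses a line such as `[  1/772] 4329526_chunk0   WER:  10.3%  Dict: 1/1`.
/// Returns nil for lines that do not match this layout.
func parseResultLine(_ line: String) -> FileResult? {
    guard let close = line.firstIndex(of: "]") else { return nil }
    let fields = line[line.index(after: close)...].split(separator: " ")
    // Expected fields: name, "WER:", "10.3%", "Dict:", "1/1"
    guard fields.count == 5, fields[1] == "WER:", fields[3] == "Dict:",
          fields[2].hasSuffix("%"),
          let wer = Double(fields[2].dropLast()) else { return nil }
    let dict = fields[4].split(separator: "/").compactMap { Int($0) }
    guard dict.count == 2 else { return nil }
    return FileResult(name: String(fields[0]), werPercent: wer,
                      dictFound: dict[0], dictTotal: dict[1])
}

// Two result lines and one status line copied from the sample output.
let sampleLines = [
    "[  1/772] 4329526_chunk0            WER:  10.3%  Dict: 1/1",
    "[  2/772] 4329526_chunk109          WER:  12.5%  Dict: 2/2",
    "Loading TDT models (v2) for transcription...",
]

let results = sampleLines.compactMap(parseResultLine)
let meanWER = results.map(\.werPercent).reduce(0, +) / Double(results.count)
print("parsed \(results.count) lines, mean WER \(String(format: "%.1f", meanWER))%")
```

Note that a mean over per-file WER values weights short and long files equally, so it will generally differ from a corpus-level WER computed over all words.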