### Output Format Example

Source: https://github.com/ufal/simulstreaming/blob/main/README.md

This example illustrates the space-separated output format from SimulStreaming, showing emission time, start and end timestamps of the line in the original audio, and the transcribed text. The first column (emission time) is omitted in server output.

```text
1200.0000 0 1200  And so
2400.0000 1200 2400  my fellow Americans
3600.0000 2400 3600  ,
4800.0000 3600 4800  ask not
6000.0000 4800 6000  what
7200.0000 6000 7200  your country can do
8400.0000 7200 8400  for you,
9600.0000 8400 9600  ask what you
10800.0000 9600 10800  can do for your country
11000.0000 10800 11000  .
```
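
A minimal sketch of reading these lines back in Python. The names `Segment` and `parse_line` are illustrative and not part of SimulStreaming; the sketch assumes the numeric columns are separated by single spaces and that everything after them is the text.

```python
from typing import NamedTuple, Optional


class Segment(NamedTuple):
    emission_ms: Optional[float]  # None for server output, which omits this column
    start_ms: float
    end_ms: float
    text: str


def parse_line(line: str, has_emission_time: bool = True) -> Segment:
    """Parse one output line into a Segment."""
    n_numeric = 3 if has_emission_time else 2
    # Split off only the numeric columns; the remainder is the text,
    # which may itself contain spaces.
    parts = line.rstrip("\n").split(" ", n_numeric)
    numbers = [float(p) for p in parts[:n_numeric]]
    text = parts[n_numeric].strip() if len(parts) > n_numeric else ""
    if has_emission_time:
        return Segment(numbers[0], numbers[1], numbers[2], text)
    return Segment(None, numbers[0], numbers[1], text)


# Simulation output (with emission time) and server output (without it).
print(parse_line("1200.0000 0 1200  And so"))
print(parse_line("0 1720 And so", has_emission_time=False))
```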

--------------------------------

### Install Dependencies with Pip

Source: https://github.com/ufal/simulstreaming/blob/main/README.md

Installs the necessary dependencies for the direct speech-to-text Whisper part of SimulStreaming using pip. The comments in the requirements.txt file provide details about the origin of each dependency.

```shell
pip install -r requirements.txt
```

--------------------------------

### Debugging Simulation with Start Timestamp

Source: https://context7.com/ufal/simulstreaming/llms.txt

Starts the audio processing simulation from a specific timestamp. This is useful for debugging specific sections of audio without processing the entire file from the beginning. The first update will contain all audio up to the specified start time.

```bash
# Start at specific timestamp for debugging
python3 simulstreaming_whisper.py audio.wav \
    --start_at 120.0 \
    --language en
```

--------------------------------

### Linux Client Example for Real-time Audio Streaming

Source: https://github.com/ufal/simulstreaming/blob/main/README.md

This snippet demonstrates how to stream real-time audio from a microphone to a SimulStreaming server using `arecord` and `nc` (netcat) on Linux. It captures raw audio (S16_LE, 16000 Hz, mono) and sends it to `localhost` on port 43001.

```bash
arecord -f S16_LE -c1 -r 16000 -t raw -D default | nc localhost 43001
```
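
Where `nc` is not available, the sending half can be approximated in Python. This is a sketch under assumptions: the helper names `sine_pcm_s16le`, `send_pcm`, and `stream_to_server` are hypothetical, a generated test tone stands in for microphone input, and the server is assumed to accept the same raw S16_LE, 16 kHz, mono byte stream that `arecord` produces.

```python
import array
import math
import socket
import sys

SAMPLE_RATE = 16000  # Hz, mono, matching the arecord flags above


def sine_pcm_s16le(seconds: float, freq: float = 440.0) -> bytes:
    """Return a mono sine tone as raw 16-bit little-endian PCM bytes."""
    n = int(seconds * SAMPLE_RATE)
    samples = array.array("h")
    for i in range(n):
        value = 0.3 * 32767 * math.sin(2 * math.pi * freq * i / SAMPLE_RATE)
        samples.append(int(value))
    if sys.byteorder == "big":
        samples.byteswap()  # ensure little-endian byte order on the wire
    return samples.tobytes()


def send_pcm(sock: socket.socket, pcm: bytes, chunk_bytes: int = 3200) -> int:
    """Send PCM bytes in fixed-size chunks; returns the number of bytes sent."""
    sent = 0
    for offset in range(0, len(pcm), chunk_bytes):
        chunk = pcm[offset:offset + chunk_bytes]
        sock.sendall(chunk)
        sent += len(chunk)
    return sent


def stream_to_server(pcm: bytes, host: str = "localhost", port: int = 43001) -> int:
    """Connect to a server that is assumed to be listening and send the audio."""
    with socket.create_connection((host, port)) as sock:
        return send_pcm(sock, pcm)
```

Calling `stream_to_server(sine_pcm_s16le(1.0))` would send one second of tone; with the default `chunk_bytes` of 3200, each chunk carries 100 ms of audio.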

--------------------------------

### Real-Time Server Mode for Live Microphone Input

Source: https://context7.com/ufal/simulstreaming/llms.txt

Starts a TCP server that accepts a raw audio stream from a microphone for real-time processing. This section includes instructions for starting the server and connecting a client.

```APIDOC
## Real-Time Server Mode for Live Microphone Input

### Description
Starts a TCP server that accepts a raw audio stream from a microphone for real-time simultaneous speech translation and transcription.

### Method
1. Start server: `python3` command-line execution
2. Connect client: `arecord` and `nc` command-line execution

### Endpoint
- **Server:** `simulstreaming_whisper_server.py`
- **Client Connection:** `localhost:43001` (default)

### Parameters
#### Server Command-line Arguments
- **`--host`** (string) - Optional - Host address for the server (e.g., `localhost`).
- **`--port`** (integer) - Optional - Port number for the server (e.g., `43001`).
- **`--language`** (string) - Optional - Source language code (e.g., `de` for German).
- **`--task`** (string) - Optional - Task to perform (e.g., `translate`, `transcribe`). Defaults to `transcribe`.
- **`--model_path`** (string) - Optional - Path to the Whisper model file (e.g., `./large-v3.pt`).
- **`--warmup-file`** (string) - Optional - Audio file for initial warm-up (e.g., `jfk.wav`).
- **`--vac`** (flag) - Optional - Enables voice activity detection.
- **`--beams`** (integer) - Optional - Number of beams for beam search (e.g., `5`).
- **`--frame_threshold`** (float) - Optional - AlignAtt frame threshold: decoding stops once attention reaches within this many frames of the end of the audio (e.g., `25`).

#### Client Command-line Arguments (Linux bash)
- **`arecord`**: Captures audio from the microphone.
  - **`-f S16_LE`**: Specifies audio format (16-bit Little Endian).
  - **`-c1`**: Sets the number of channels to 1 (mono).
  - **`-r 16000`**: Sets the sample rate to 16000 Hz.
  - **`-t raw`**: Specifies raw audio data.
  - **`-D default`**: Uses the default audio input device.
- **`nc`**: Netcat utility to send data to the server.
  - **`localhost`**: The hostname of the server.
  - **`43001`**: The port of the server.

### Request Example
```bash
# Start server (Python)
python3 simulstreaming_whisper_server.py \
    --host localhost \
    --port 43001 \
    --language de \
    --task translate \
    --model_path ./large-v3.pt \
    --warmup-file jfk.wav \
    --vac \
    --beams 5 \
    --frame_threshold 25

# Connect client and stream audio (Linux bash)
# Send 16kHz mono S16_LE format from microphone to server
arecord -f S16_LE -c1 -r 16000 -t raw -D default | nc localhost 43001
```

### Response
#### Success Response (Server Output)
- **Output Format** (string) - `start_ms end_ms text` - Provides timestamps and transcribed/translated text for each segment received from the client.

#### Response Example
```
0 1720 And so
1720 3400 my fellow Americans
```
```

--------------------------------

### SimulStreaming: Real-Time Server & Client (Python/Bash)

Source: https://context7.com/ufal/simulstreaming/llms.txt

This section provides instructions for setting up a real-time translation server using SimulStreaming and connecting to it with a client. The server runs in Python, accepting raw audio streams via TCP. The client example uses `arecord` and `nc` on Linux to stream microphone input to the server.

```bash
# Start server (Python)
python3 simulstreaming_whisper_server.py \
    --host localhost \
    --port 43001 \
    --language de \
    --task translate \
    --model_path ./large-v3.pt \
    --warmup-file jfk.wav \
    --vac \
    --beams 5 \
    --frame_threshold 25
```

```bash
# Connect client and stream audio (Linux bash)
# Send 16kHz mono S16_LE format from microphone to server
arecord -f S16_LE -c1 -r 16000 -t raw -D default | nc localhost 43001

# Server output format (start_ms end_ms text):
# 0 1720 And so
# 1720 3400 my fellow Americans
```

--------------------------------

### Transcription with Context and Terminology Injection

Source: https://context7.com/ufal/simulstreaming/llms.txt

Performs transcription while injecting domain-specific terminology and maintaining context across processing windows. This example uses static terminology and a scrolling context.

```APIDOC
## Transcription with Context and Terminology Injection

### Description
Transcribes audio while injecting domain-specific terminology and maintaining context across processing windows. This example utilizes static terminology and scrolling context.

### Method
`python3` command-line execution

### Endpoint
`simulstreaming_whisper.py`

### Parameters
#### Command-line Arguments
- **`conference_audio.wav`** (string) - Required - Path to the input audio file.
- **`--language`** (string) - Optional - Source language code (e.g., `en` for English).
- **`--task`** (string) - Optional - Task to perform (e.g., `transcribe`, `translate`). Defaults to `transcribe`.
- **`--static_init_prompt`** (string) - Optional - A comma-separated list of static terms to inject (e.g., `"COVID-19, RNA, mRNA, spike protein"`).
- **`--init_prompt`** (string) - Optional - An initial prompt to set the context (e.g., `"The speaker is discussing vaccine development."`).
- **`--max_context_tokens`** (integer) - Optional - Maximum number of tokens to maintain for context (e.g., `100`).
- **`--beams`** (integer) - Optional - Number of beams for beam search (e.g., `3`).
- **`--audio_max_len`** (float) - Optional - Maximum audio length in seconds for processing (e.g., `30.0`).
- **`--min-chunk-size`** (float) - Optional - Minimum chunk size in seconds (e.g., `1.5`).

### Request Example
```bash
python3 simulstreaming_whisper.py conference_audio.wav \
    --language en \
    --task transcribe \
    --static_init_prompt "COVID-19, RNA, mRNA, spike protein" \
    --init_prompt "The speaker is discussing vaccine development." \
    --max_context_tokens 100 \
    --beams 3 \
    --audio_max_len 30.0 \
    --min-chunk-size 1.5
```

### Response
#### Success Response
- **Output** (string) - The transcribed text will include the specified terminology and reflect the maintained context. The exact format depends on the internal processing.

#### Response Example
```
# Output includes terminology from static prompt maintained throughout
# Context tokens scroll while static prompt remains constant
```
```

--------------------------------

### PaddedAlignAttWhisper Direct Usage

Source: https://context7.com/ufal/simulstreaming/llms.txt

Provides low-level access to the AlignAtt policy for custom implementations. This example shows how to configure and initialize the PaddedAlignAttWhisper model and process audio incrementally.

```APIDOC
## PaddedAlignAttWhisper Direct Usage

### Description
Offers low-level access to the AlignAtt policy for custom implementations. This example demonstrates how to configure and initialize the `PaddedAlignAttWhisper` model and process audio incrementally.

### Method
Python script execution

### Endpoint
N/A (Library usage)

### Parameters
#### Python Libraries
- `simul_whisper.config`: For `AlignAttConfig`.
- `simul_whisper.simul_whisper`: For `PaddedAlignAttWhisper`.
- `torch`: For tensor operations.

#### `AlignAttConfig` Parameters
- **`model_path`** (string) - Path to the Whisper model file.
- **`segment_length`** (float) - Length of audio segments in seconds.
- **`frame_threshold`** (float) - AlignAtt frame threshold: decoding stops once attention reaches within this many frames of the end of the audio.
- **`language`** (string) - Source language code.
- **`audio_max_len`** (float) - Maximum audio length in seconds for processing.
- **`audio_min_len`** (float) - Minimum audio length in seconds for processing.
- **`decoder_type`** (string) - Type of decoder (e.g., `beam`).
- **`beam_size`** (integer) - Size of the beam for beam search.
- **`task`** (string) - Task to perform (`translate` or `transcribe`).
- **`init_prompt`** (string) - Initial prompt for context or terminology.
- **`max_context_tokens`** (integer) - Maximum number of tokens for context.

### Request Example
```python
from simul_whisper.config import AlignAttConfig
from simul_whisper.simul_whisper import PaddedAlignAttWhisper
import torch

# Configure AlignAtt policy
cfg = AlignAttConfig(
    model_path='./large-v3.pt',
    segment_length=1.2,
    frame_threshold=25,
    language='en',
    audio_max_len=30.0,
    audio_min_len=1.0,
    decoder_type='beam',
    beam_size=5,
    task='translate',
    init_prompt='Domain-specific context here',
    max_context_tokens=100
)

# Initialize model
model = PaddedAlignAttWhisper(cfg)

# Process audio incrementally
audio_segment = torch.randn(16000)  # 1 second at 16kHz
model.insert_audio(audio_segment)

# Inference with AlignAtt policy
tokens, generation_progress = model.infer(is_last=False)

# Print results
print("Tokens:", tokens)
print("Generation Progress:", generation_progress)
```

### Response
#### Success Response
- **`tokens`** (list) - List of generated token IDs.
- **`generation_progress`** (object/dict) - Information about the generation progress (structure may vary).

#### Response Example
```
Tokens: [token_ids]
Generation Progress: { 'progress_details': ... }
```
```

--------------------------------

### End-of-Word Detection with CIF Model

Source: https://context7.com/ufal/simulstreaming/llms.txt

Uses a CIF (Continuous Integrate-and-Fire) model for end-of-word detection to prevent partial word outputs at chunk boundaries. Includes examples for using the CIF model, passing `--no-never_fire`, and forcing `--never_fire`. Note that CIF models are not yet available for large-v3.

```bash
# Use CIF model to detect end-of-word boundaries
python3 simulstreaming_whisper.py audio.wav \
    --language en \
    --task transcribe \
    --cif_ckpt_path ./cif_models/large-v2.pt \
    --model_path ./large-v2.pt

# Without CIF: no checkpoint is passed, so the last word is always truncated
python3 simulstreaming_whisper.py audio.wav \
    --no-never_fire

# Force never truncate last word
python3 simulstreaming_whisper.py audio.wav \
    --never_fire
```
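
The interaction between the CIF checkpoint and the `never_fire` override can be summarized as a small decision function. This is a sketch of one reading of the flags described here, not the repository's code; `should_truncate_last_word`, `apply_truncation`, and the `cif_end_of_word` argument are hypothetical names.

```python
from typing import Optional


def should_truncate_last_word(never_fire: bool, cif_end_of_word: Optional[bool]) -> bool:
    """Decide whether to drop the last decoded word of a chunk.

    cif_end_of_word is the (hypothetical) CIF verdict for the chunk:
    True/False when a CIF model is loaded, None when no checkpoint is set.
    """
    if never_fire:
        return False            # override: never truncate
    if cif_end_of_word is None:
        return True             # no CIF model: always truncate
    return not cif_end_of_word  # CIF loaded: truncate only when mid-word


def apply_truncation(text: str, truncate: bool) -> str:
    """Remove the last space-separated word when truncate is True."""
    if not truncate:
        return text
    head, _, _ = text.rpartition(" ")
    return head


# The chunk ends mid-word according to the CIF verdict, so the last word is held back.
print(apply_truncation("ask what you", should_truncate_last_word(False, False)))
```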

--------------------------------

### TokenBuffer for Context Management

Source: https://context7.com/ufal/simulstreaming/llms.txt

Demonstrates the usage of `TokenBuffer` for managing rolling context windows with static and dynamic prompts. It handles tokenization, device placement, prefix tokens, appending text, converting to tensors, trimming old words, and appending new tokens. Requires `token_buffer` and `torch` libraries.

```python
from token_buffer import TokenBuffer
import torch

# Assumes `tokenizer` is an already-initialized Whisper tokenizer (not defined here)
# Create empty buffer with prefix tokens
buffer = TokenBuffer.empty(
    tokenizer=tokenizer,
    device=torch.device('cuda'),
    prefix_token_ids=[50361]  # sot_prev token
)

# Add static terminology that never scrolls
static_prompt = "COVID-19, mRNA, vaccine"
buffer = TokenBuffer.from_text(
    static_prompt,
    tokenizer=tokenizer,
    device=torch.device('cuda'),
    prefix_token_ids=[50361]
)

# Append dynamic context
buffer.text += " The research focuses on spike proteins"

# Convert to tensor for model input
context_tensor = buffer.as_tensor_beam(beam=5)

# Trim oldest words when exceeding max tokens
tokens_removed = buffer.trim_words(num=2, after=len(static_prompt))

# Append new tokens from model output
new_token_ids = [1234, 5678]
buffer.append_token_ids(new_token_ids)
```
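
The idea of a static prompt that stays fixed while the dynamic context scrolls can be shown without the library. The class below is a toy illustration, not `TokenBuffer` itself: `RollingContext` is a hypothetical name, and it counts whitespace-separated words as a stand-in for tokens.

```python
class RollingContext:
    """Toy context buffer: a fixed static prompt plus a scrolling tail of words."""

    def __init__(self, static_prompt: str, max_words: int):
        self.static_prompt = static_prompt
        self.max_words = max_words  # budget for the scrolling part only
        self.dynamic_words = []

    def append(self, text: str) -> int:
        """Append text and trim the oldest words; returns how many were dropped."""
        self.dynamic_words.extend(text.split())
        overflow = max(0, len(self.dynamic_words) - self.max_words)
        del self.dynamic_words[:overflow]
        return overflow

    @property
    def text(self) -> str:
        """Static prompt followed by the surviving dynamic words."""
        return " ".join([self.static_prompt] + self.dynamic_words).strip()


ctx = RollingContext("COVID-19, mRNA, vaccine", max_words=4)
dropped = ctx.append("The research focuses on spike proteins")
print(dropped, "|", ctx.text)
```

Only the dynamic tail is trimmed, so the terminology in the static prompt remains available to every decoding step.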

--------------------------------

### SimulStreaming: Transcribe with Context/Terminology (Python)

Source: https://context7.com/ufal/simulstreaming/llms.txt

This code snippet shows how to transcribe audio while injecting domain-specific terminology and maintaining context across processing windows. It uses the `--static_init_prompt` and `--init_prompt` arguments for terminology and context, respectively, along with settings for context token management and audio chunking.

```bash
python3 simulstreaming_whisper.py conference_audio.wav \
    --language en \
    --task transcribe \
    --static_init_prompt "COVID-19, RNA, mRNA, spike protein" \
    --init_prompt "The speaker is discussing vaccine development." \
    --max_context_tokens 100 \
    --beams 3 \
    --audio_max_len 30.0 \
    --min-chunk-size 1.5

# Output includes terminology from static prompt maintained throughout
# Context tokens scroll while static prompt remains constant
```

--------------------------------

### Convert TXT Output to Instance Logs

Source: https://github.com/ufal/simulstreaming/blob/main/translate/README.txt

Scripts to convert the default text output of `simul_llm_translate.py` into instance log format. `txt-to-instances.py` is for En->De, and `zh-ja-txt-to-instances-nobreaking+nospaces.py` is for En->Zh and Ja, handling space and newline removal for the latter.

```bash
python3 zh-ja-txt-to-instances-nobreaking+nospaces.py < res/ja/asr.ch-1.4-frame-15-beam-1+eurollm.ch-4.unaware.gputype-any.i-1.model-eurollm-9b.language-ja/2022.acl-long.590.txt 2022.acl-long.590.wav > inst.log
```

--------------------------------

### SimulWhisper ASR Backend Integration (Python)

Source: https://context7.com/ufal/simulstreaming/llms.txt

This Python snippet demonstrates the core Automatic Speech Recognition (ASR) backend integration using the AlignAtt policy within SimulStreaming. It shows how to create an ASR factory, configure arguments using `argparse`, and process audio chunks using an online processor.

```python
from simulstreaming_whisper import simul_asr_factory, simulwhisper_args
import argparse
import numpy as np

# Create ASR factory with configuration
parser = argparse.ArgumentParser()
parser.add_argument('--min-chunk-size', type=float, default=1.2)
parser.add_argument('--lan', type=str, default='en')
parser.add_argument('--task', type=str, default='transcribe')
parser.add_argument('--vac', action='store_true')
parser.add_argument('--log-level', default='INFO')
simulwhisper_args(parser)

args = parser.parse_args(['--model_path', './large-v3.pt',
                          '--beams', '5',
                          '--frame_threshold', '25'])
args.logdir = None

# Factory returns ASR and online processor
asr, online_processor = simul_asr_factory(args)

# Process audio chunks
audio_chunk = np.random.randn(16000).astype(np.float32)  # 1 second
online_processor.insert_audio_chunk(audio_chunk)
result = online_processor.process_iter()

# Result structure:
# {'start': 0.0, 'end': 1.0, 'text': 'transcribed text',
#  'tokens': [token_ids], 'words': [word_level_timestamps]}
```

--------------------------------

### Voice Activity Controller (VAC) Integration (Code)

Source: https://context7.com/ufal/simulstreaming/llms.txt

Demonstrates programmatic integration of the Voice Activity Controller (VAC) by wrapping an existing online ASR processor with `VACOnlineASRProcessor`. The wrapper exposes the same interface, so VAC's silence detection can be used without changing application logic.

```python
# In code integration; assumes `online_processor` and `audio` already exist
from whisper_streaming.vac_online_processor import VACOnlineASRProcessor

# Wrap online processor with VAC
online_with_vac = VACOnlineASRProcessor(
    min_chunk_size=1.2,
    online_processor=online_processor
)

# Use same interface
online_with_vac.insert_audio_chunk(audio)
result = online_with_vac.process_iter()
```
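
The wrapping pattern itself can be illustrated with a toy gate. This sketch is not `VACOnlineASRProcessor`: it swaps the real voice-activity model for a plain RMS energy threshold, and `EnergyGate` and `EchoProcessor` are hypothetical names used only to show how a wrapper can keep the same `insert_audio_chunk` / `process_iter` interface while skipping silence.

```python
import math


class EnergyGate:
    """Forward only chunks whose RMS energy reaches a threshold."""

    def __init__(self, online_processor, rms_threshold: float = 0.01):
        self.online = online_processor
        self.rms_threshold = rms_threshold
        self.pending = False  # True when speech was inserted since the last process_iter

    def insert_audio_chunk(self, chunk) -> None:
        rms = math.sqrt(sum(x * x for x in chunk) / len(chunk)) if len(chunk) else 0.0
        if rms >= self.rms_threshold:
            self.online.insert_audio_chunk(chunk)
            self.pending = True

    def process_iter(self):
        if not self.pending:
            return None  # only silence arrived: skip the wrapped processor
        self.pending = False
        return self.online.process_iter()


class EchoProcessor:
    """Stand-in for a real online processor, used only to exercise the gate."""

    def __init__(self):
        self.samples = 0

    def insert_audio_chunk(self, chunk) -> None:
        self.samples += len(chunk)

    def process_iter(self):
        return {"samples_seen": self.samples}


gate = EnergyGate(EchoProcessor())
gate.insert_audio_chunk([0.0] * 160)  # silence: dropped
print(gate.process_iter())
gate.insert_audio_chunk([0.5] * 160)  # loud chunk: forwarded
print(gate.process_iter())
```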

--------------------------------

### PaddedAlignAttWhisper Direct Usage (Python)

Source: https://context7.com/ufal/simulstreaming/llms.txt

This Python code provides low-level access to the AlignAtt policy for custom implementations within SimulStreaming. It demonstrates initializing the `PaddedAlignAttWhisper` model with specific configuration parameters and processing audio segments incrementally for inference.

```python
from simul_whisper.config import AlignAttConfig
from simul_whisper.simul_whisper import PaddedAlignAttWhisper
import torch

# Configure AlignAtt policy
cfg = AlignAttConfig(
    model_path='./large-v3.pt',
    segment_length=1.2,
    frame_threshold=25,
    language='en',
    audio_max_len=30.0,
    audio_min_len=1.0,
    decoder_type='beam',
    beam_size=5,
    task='translate',
    init_prompt='Domain-specific context here',
    max_context_tokens=100
)

# Initialize model
model = PaddedAlignAttWhisper(cfg)

# Process audio incrementally
audio_segment = torch.randn(16000)  # 1 second at 16kHz
model.insert_audio(audio_segment)

# Inference with AlignAtt policy
tokens, generation_progress = model.infer(is_last=False)
```
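
The role of `frame_threshold` can be illustrated without loading Whisper. The sketch below is a toy reading of the AlignAtt stopping rule, not the repository's implementation: `alignatt_accept` and `one_hot` are hypothetical helpers, and `attn_per_token` stands in for per-token cross-attention weights over audio frames. A token is accepted only if the frame it attends to most lies before the last `frame_threshold` frames. Whisper's encoder emits 1500 frames per 30-second window, so one frame is 20 ms and a threshold of 25 would correspond to roughly 0.5 s of audio.

```python
def alignatt_accept(attn_per_token, frame_threshold):
    """Return indices of tokens accepted before the policy stops.

    attn_per_token: one list of attention weights per candidate token,
    each covering all audio frames (hypothetical input for illustration).
    """
    accepted = []
    for idx, weights in enumerate(attn_per_token):
        num_frames = len(weights)
        most_attended = max(range(num_frames), key=weights.__getitem__)
        # Stop once a token mostly attends to the last `frame_threshold` frames:
        # the audio it depends on may still be incomplete.
        if most_attended >= num_frames - frame_threshold:
            break
        accepted.append(idx)
    return accepted


def one_hot(index, num_frames=10):
    """Attention row that puts all weight on a single frame."""
    return [1.0 if j == index else 0.0 for j in range(num_frames)]


# Tokens attend to frames 2, 5, 8, 3; with a threshold of 3 the third token
# falls inside the last three frames, so decoding stops there.
print(alignatt_accept([one_hot(2), one_hot(5), one_hot(8), one_hot(3)], 3))
```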

--------------------------------

### Clone EuroLLM Hugging Face Model

Source: https://github.com/ufal/simulstreaming/blob/main/translate/README.txt

Clones the EuroLLM-9B-Instruct model from Hugging Face, over either HTTPS or SSH. This is the first step to obtain the necessary model files for translation.

```bash
git clone https://huggingface.co/utter-project/EuroLLM-9B-Instruct
```

```bash
git clone git@hf.co:utter-project/EuroLLM-9B-Instruct
```

--------------------------------

### SimulWhisperASR Backend Integration

Source: https://context7.com/ufal/simulstreaming/llms.txt

Demonstrates the integration of the core ASR backend that implements the AlignAtt policy with the Whisper model using Python. This includes setting up the factory and processing audio chunks.

```APIDOC
## SimulWhisperASR Backend Integration

### Description
Provides a Python code example for integrating the core ASR backend that implements the AlignAtt policy with the Whisper model. This snippet shows how to create an ASR factory and process audio chunks.

### Method
Python script execution

### Endpoint
N/A (Library usage)

### Parameters
#### Python Libraries
- `simulstreaming_whisper`: For ASR factory and arguments.
- `argparse`: For parsing command-line arguments.
- `numpy`: For audio chunk manipulation.

#### `simulwhisper_args` Configuration
- **`--model_path`** (string) - Path to the Whisper model file.
- **`--beams`** (integer) - Number of beams for beam search.
- **`--frame_threshold`** (float) - AlignAtt frame threshold: decoding stops once attention reaches within this many frames of the end of the audio.
- **`--min-chunk-size`** (float) - Minimum chunk size in seconds.
- **`--lan`** (string) - Source language code.
- **`--task`** (string) - Task to perform (`transcribe` or `translate`).
- **`--vac`** (flag) - Enables voice activity detection.
- **`--log-level`** (string) - Logging level (e.g., `INFO`).

### Request Example
```python
from simulstreaming_whisper import simul_asr_factory, simulwhisper_args
import argparse
import numpy as np

# Create ASR factory with configuration
parser = argparse.ArgumentParser()
parser.add_argument('--min-chunk-size', type=float, default=1.2)
parser.add_argument('--lan', type=str, default='en')
parser.add_argument('--task', type=str, default='transcribe')
parser.add_argument('--vac', action='store_true')
parser.add_argument('--log-level', default='INFO')
simulwhisper_args(parser)

args = parser.parse_args(['--model_path', './large-v3.pt',
                          '--beams', '5',
                          '--frame_threshold', '25'])
args.logdir = None

# Factory returns ASR and online processor
asr, online_processor = simul_asr_factory(args)

# Process audio chunks
audio_chunk = np.random.randn(16000).astype(np.float32)  # 1 second
online_processor.insert_audio_chunk(audio_chunk)
result = online_processor.process_iter()

# Print result
print(result)
```

### Response
#### Success Response
- **`result`** (dict) - Contains transcription details:
  - **`start`** (float) - Start timestamp of the segment.
  - **`end`** (float) - End timestamp of the segment.
  - **`text`** (string) - Transcribed or translated text.
  - **`tokens`** (list) - List of token IDs.
  - **`words`** (list) - List of word-level timestamps.

#### Response Example
```json
{
  "start": 0.0,
  "end": 1.0,
  "text": "transcribed text",
  "tokens": [token_ids],
  "words": [word_level_timestamps]
}
```
```

--------------------------------

### Voice Activity Controller (VAC) Integration

Source: https://context7.com/ufal/simulstreaming/llms.txt

Shows how to enable the Voice Activity Controller (VAC) with the Whisper ASR processor for automatic silence detection. This reduces latency by skipping silent audio. Requires `torchaudio`. VAC can be enabled via command-line arguments or by wrapping an online processor.

```bash
# Enable VAC in file simulation
python3 simulstreaming_whisper.py audio.wav \
    --language en \
    --task transcribe \
    --vac \
    --vac-chunk-size 0.04 \
    --min-chunk-size 1.2
```

--------------------------------

### Run Simultaneous LLM Translation

Source: https://github.com/ufal/simulstreaming/blob/main/translate/README.txt

Executes the `simul_llm_translate.py` script for simultaneous translation using EuroLLM. It supports input from text files with timestamps or instance log formats. Key parameters control chunk size, language, context length, and buffer trimming.

```bash
cat gold-asr-dir//2022.acl-long.110.txt | python3 simul_llm_translate.py --min-chunk-size 1 --language de --language-specific-len-threshold --max-context-length 80 --buffer_trimming sentences
```

```bash
python3 simul_llm_translate.py --input-instance gold-asr-dir//2022.acl-long.110.instance.log --min-chunk-size 1 --language de --max-context-length 300 | tee out
```

```bash
python3 simul_llm_translate.py --input-instance $input \
--min-chunk-size $ch \
--language $language \
--language-specific-len-threshold --buffer_trimming $trim \
--max-context-length $max_context_len
```

--------------------------------

### SimulStreaming: Translate Audio File (Python)

Source: https://context7.com/ufal/simulstreaming/llms.txt

This snippet demonstrates how to perform real-time simultaneous translation from an audio file using SimulStreaming. It utilizes the Whisper model with the AlignAtt policy and supports various command-line arguments for language, task, model path, beam search, and chunking.

```bash
python3 simulstreaming_whisper.py audio.wav \
    --language cs \
    --task translate \
    --comp_unaware \
    --model_path ./large-v3.pt \
    --beams 5 \
    --frame_threshold 25 \
    --min-chunk-size 1.2 \
    --vac

# Expected output format (emission_time start_ms end_ms text):
# 1200.0000 0 1200  And so
# 2400.0000 1200 2400  my fellow Americans
# 3600.0000 2400 3600  ,
# 4800.0000 3600 4800  ask not
```
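
In the sample output, the emission time equals the end timestamp on every line, which is consistent with a run where processing time is not counted. A small sketch for checking this on saved output; `emission_delays` is a hypothetical helper, and the sample lines are copied from the expected output shown here.

```python
def emission_delays(lines):
    """Return emission_time - end_ms (in ms) for each 'emission start end text' line."""
    delays = []
    for line in lines:
        emission, _start, end = line.split(" ", 3)[:3]
        delays.append(float(emission) - float(end))
    return delays


sample = [
    "1200.0000 0 1200  And so",
    "2400.0000 1200 2400  my fellow Americans",
    "3600.0000 2400 3600  ,",
    "4800.0000 3600 4800  ask not",
]
print(emission_delays(sample))
```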

--------------------------------

### Computationally Aware Simulation Mode

Source: https://context7.com/ufal/simulstreaming/llms.txt

Runs the simulation in 'computationally aware' mode, where latency includes processing time. This provides a realistic measure of real-world latency. The output timestamp reflects the emission time after computation.

```bash
# Computationally aware (default): includes processing time in latency
python3 simulstreaming_whisper.py audio.wav \
    --language en \
    --task translate \
    --min-chunk-size 1.2
```

--------------------------------

### Automatic Language Detection

Source: https://context7.com/ufal/simulstreaming/llms.txt

Enables automatic language detection for speech input when the source language is not specified. The model analyzes audio features to identify the language, creating the appropriate tokenizer for subsequent processing. Works for both transcription and translation tasks.

```bash
# Enable automatic language detection
python3 simulstreaming_whisper.py audio.wav \
    --language auto \
    --task translate \
    --beams 5
```

--------------------------------

### Speech-to-Text Translation from Audio File

Source: https://context7.com/ufal/simulstreaming/llms.txt

Simulates real-time simultaneous translation from an audio file. This command translates Czech audio to English using the large-v3 Whisper model with specified parameters for beam search, chunk size, and prompt injection.

```APIDOC
## Speech-to-Text Translation from Audio File

### Description
Real-time simulation of simultaneous translation from an audio file. This example demonstrates translating Czech audio to English using the large-v3 Whisper model.

### Method
`python3` command-line execution

### Endpoint
`simulstreaming_whisper.py`

### Parameters
#### Command-line Arguments
- **`audio.wav`** (string) - Required - Path to the input audio file.
- **`--language`** (string) - Optional - Source language code (e.g., `cs` for Czech).
- **`--task`** (string) - Optional - Task to perform (e.g., `translate`, `transcribe`). Defaults to `transcribe`.
- **`--comp_unaware`** (flag) - Optional - Enables computationally unaware simulation mode.
- **`--model_path`** (string) - Optional - Path to the Whisper model file (e.g., `./large-v3.pt`).
- **`--beams`** (integer) - Optional - Number of beams for beam search (e.g., `5`).
- **`--frame_threshold`** (float) - Optional - AlignAtt frame threshold: decoding stops once attention reaches within this many frames of the end of the audio (e.g., `25`).
- **`--min-chunk-size`** (float) - Optional - Minimum chunk size in seconds (e.g., `1.2`).
- **`--vac`** (flag) - Optional - Enables voice activity detection.

### Request Example
```bash
python3 simulstreaming_whisper.py audio.wav \
    --language cs \
    --task translate \
    --comp_unaware \
    --model_path ./large-v3.pt \
    --beams 5 \
    --frame_threshold 25 \
    --min-chunk-size 1.2 \
    --vac
```

### Response
#### Success Response
- **Output Format** (string) - `emission_time start_ms end_ms text` - Provides timestamps and transcribed/translated text for each segment.

#### Response Example
```
1200.0000 0 1200  And so
2400.0000 1200 2400  my fellow Americans
3600.0000 2400 3600  ,
4800.0000 3600 4800  ask not
```
```

--------------------------------

### Cascaded LLM Translation Pipeline

Source: https://context7.com/ufal/simulstreaming/llms.txt

Performs speech-to-text transcription using Whisper, followed by LLM-based translation. It processes audio files and outputs translated text. Requires `simulstreaming_whisper.py` and `translate/simul_llm_translate.py` scripts. Outputs translated text with timestamps.

```bash
python3 simulstreaming_whisper.py audio.wav \
    --language en \
    --task transcribe \
    --comp_unaware \
    > asr_output.txt

python3 translate/simul_llm_translate.py \
    --lan de \
    --min-chunk-size 3 \
    --min-len 5 \
    --language-specific-len-threshold \
    --sys_prompt "You are simultaneous interpreter from English to German." \
    --init_prompt_src "Welcome to the conference." \
    --init_prompt_tgt "Willkommen zur Konferenz." \
    --max-context-length 4096 \
    < asr_output.txt
```

--------------------------------

### Run SLAAL for Translation Evaluation

Source: https://github.com/ufal/simulstreaming/blob/main/translate/README.txt

Shell scripts to run SLAAL (StreamLAAL, a latency metric for long-form simultaneous translation) for the entire development set. These scripts process documents, generate instance logs, and align candidates with reference translations using mwerSegmenter.

```bash
./slaal-de.sh de-output/2022.acl-long.110.txt de-output/2022.acl-long.110.mw-segments > de-output/2022.acl-long.110.slaal
```

```bash
./slaal-de.sh de-output/ > de-output/slaal
```

--------------------------------

### Computationally Unaware Simulation Mode

Source: https://context7.com/ufal/simulstreaming/llms.txt

Runs the simulation in 'computationally unaware' mode, measuring only policy latency, excluding actual processing time. This is useful for determining the theoretical minimum latency of the policy. The timer effectively stops during computation.

```bash
# Computationally unaware: measures only policy latency
python3 simulstreaming_whisper.py audio.wav \
    --language en \
    --task translate \
    --comp_unaware \
    --min-chunk-size 1.2
```
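
The difference between the two simulation modes can be seen in a toy model of the clock. This is a sketch under simplifying assumptions, not the simulator's code: `simulate_emissions` is a hypothetical function, every update is assumed to cost a fixed `proc_ms`, and all times are integer milliseconds.

```python
def simulate_emissions(audio_ms, min_chunk_ms, proc_ms, comp_aware):
    """Return the emission time (ms) of each update in a toy simulation."""
    emissions = []
    processed = 0  # audio consumed so far (ms)
    clock = 0      # simulated wall clock (ms), used in aware mode only
    while processed < audio_ms:
        if comp_aware:
            # Wait until at least min_chunk_ms of new audio has arrived,
            # take everything that arrived meanwhile, then pay for compute.
            clock = max(clock, processed + min_chunk_ms)
            end = min(clock, audio_ms)
            clock += proc_ms
            emissions.append(clock)
        else:
            # The timer is stopped during computation: emission = chunk end.
            end = min(processed + min_chunk_ms, audio_ms)
            emissions.append(end)
        processed = end
    return emissions


print("unaware:", simulate_emissions(3600, 1200, 500, comp_aware=False))
print("aware:  ", simulate_emissions(3600, 1200, 500, comp_aware=True))
```

In the unaware run each emission coincides with the end of its chunk, while in the aware run every emission is shifted by the processing cost, and slow updates cause the next chunk to grow because more audio has arrived in the meantime.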

--------------------------------

### Convert Hugging Face Model to CTranslate2

Source: https://github.com/ufal/simulstreaming/blob/main/translate/README.txt

Converts a Hugging Face model to the CTranslate2 format using the `ct2-transformers-converter` tool. CTranslate2 is a fast inference engine for Transformer models.

```bash
ct2-transformers-converter --model EuroLLM-9B-Instruct/ --output_dir ct2_EuroLLM-9B-Instruct
```

--------------------------------

### Decode Tokens and Refresh Segment

Source: https://context7.com/ufal/simulstreaming/llms.txt

Decodes a sequence of tokens into human-readable text and refreshes the model's segment for subsequent processing. Assumes a `model` object that exposes a `tokenizer` attribute and a `refresh_segment` method.

```python
text = model.tokenizer.decode(tokens)
print(f"Output: {text}")

model.refresh_segment(complete=False)
```
