### Output Format Example

Source: https://github.com/ufal/simulstreaming/blob/main/README.md

This example illustrates the space-separated output format from SimulStreaming: emission time, start and end timestamps of the line in the original audio, and the transcribed text. The first column (emission time) is omitted in server output.

```text
1200.0000 0 1200 And so
2400.0000 1200 2400 my fellow Americans
3600.0000 2400 3600 ,
4800.0000 3600 4800 ask not
6000.0000 4800 6000 what
7200.0000 6000 7200 your country can do
8400.0000 7200 8400 for you,
9600.0000 8400 9600 ask what you
10800.0000 9600 10800 can do for your country
11000.0000 10800 11000 .
```

--------------------------------

### Install Dependencies with Pip

Source: https://github.com/ufal/simulstreaming/blob/main/README.md

Installs the necessary dependencies for the direct speech-to-text Whisper part of SimulStreaming using pip. The comments in the requirements.txt file provide details about the origin of each dependency.

```shell
pip install -r requirements.txt
```

--------------------------------

### Debugging Simulation with Start Timestamp

Source: https://context7.com/ufal/simulstreaming/llms.txt

Starts the audio processing simulation from a specific timestamp. This is useful for debugging specific sections of audio without processing the entire file from the beginning. The first update will contain all audio up to the specified start time.

```bash
# Start at specific timestamp for debugging
python3 simulstreaming_whisper.py audio.wav \
  --start_at 120.0 \
  --language en
```

--------------------------------

### Linux Client Example for Real-time Audio Streaming

Source: https://github.com/ufal/simulstreaming/blob/main/README.md

This snippet demonstrates how to stream real-time audio from a microphone to a SimulStreaming server using `arecord` and `nc` (netcat) on Linux. It specifies the audio format (S16_LE, 16000 Hz, mono) and sends the stream to localhost on port 43001.
```bash
arecord -f S16_LE -c1 -r 16000 -t raw -D default | nc localhost 43001
```

--------------------------------

### Real-Time Server Mode for Live Microphone Input

Source: https://context7.com/ufal/simulstreaming/llms.txt

Starts a TCP server that accepts a raw audio stream from a microphone for real-time processing. This section includes instructions for starting the server and connecting a client.

```APIDOC
## Real-Time Server Mode for Live Microphone Input

### Description
Starts a TCP server that accepts a raw audio stream from a microphone for real-time simultaneous speech translation and transcription.

### Method
1. Start server: `python3` command-line execution
2. Connect client: `arecord` and `nc` command-line execution

### Endpoint
- **Server:** `simulstreaming_whisper_server.py`
- **Client Connection:** `localhost:43001` (default)

### Parameters
#### Server Command-line Arguments
- **`--host`** (string) - Optional - Host address for the server (e.g., `localhost`).
- **`--port`** (integer) - Optional - Port number for the server (e.g., `43001`).
- **`--language`** (string) - Optional - Source language code (e.g., `de` for German).
- **`--task`** (string) - Optional - Task to perform (e.g., `translate`, `transcribe`). Defaults to `transcribe`.
- **`--model_path`** (string) - Optional - Path to the Whisper model file (e.g., `./large-v3.pt`).
- **`--warmup-file`** (string) - Optional - Audio file for initial warm-up (e.g., `jfk.wav`).
- **`--vac`** (flag) - Optional - Enables voice activity detection.
- **`--beams`** (integer) - Optional - Number of beams for beam search (e.g., `5`).
- **`--frame_threshold`** (float) - Optional - Threshold for frame processing (e.g., `25`).

#### Client Command-line Arguments (Linux bash)
- **`arecord`**: Captures audio from the microphone.
- **`-f S16_LE`**: Specifies audio format (16-bit Little Endian).
- **`-c1`**: Sets the number of channels to 1 (mono).
- **`-r 16000`**: Sets the sample rate to 16000 Hz.
- **`-t raw`**: Specifies raw audio data.
- **`-D default`**: Uses the default audio input device.
- **`nc`**: Netcat utility to send data to the server.
- **`localhost`**: The hostname of the server.
- **`43001`**: The port of the server.

### Request Example
```bash
# Start server (Python)
python3 simulstreaming_whisper_server.py \
  --host localhost \
  --port 43001 \
  --language de \
  --task translate \
  --model_path ./large-v3.pt \
  --warmup-file jfk.wav \
  --vac \
  --beams 5 \
  --frame_threshold 25

# Connect client and stream audio (Linux bash)
# Send 16kHz mono S16_LE format from microphone to server
arecord -f S16_LE -c1 -r 16000 -t raw -D default | nc localhost 43001
```

### Response
#### Success Response (Server Output)
- **Output Format** (string) - `start_ms end_ms text` - Provides timestamps and transcribed/translated text for each segment received from the client.

#### Response Example
```
0 1720 And so
1720 3400 my fellow Americans
```
```

--------------------------------

### SimulStreaming: Real-Time Server & Client (Python/Bash)

Source: https://context7.com/ufal/simulstreaming/llms.txt

This section provides instructions for setting up a real-time translation server using SimulStreaming and connecting to it with a client. The server runs in Python, accepting raw audio streams via TCP. The client example uses `arecord` and `nc` on Linux to stream microphone input to the server.
```bash
# Start server (Python)
python3 simulstreaming_whisper_server.py \
  --host localhost \
  --port 43001 \
  --language de \
  --task translate \
  --model_path ./large-v3.pt \
  --warmup-file jfk.wav \
  --vac \
  --beams 5 \
  --frame_threshold 25
```

```bash
# Connect client and stream audio (Linux bash)
# Send 16kHz mono S16_LE format from microphone to server
arecord -f S16_LE -c1 -r 16000 -t raw -D default | nc localhost 43001

# Server output format (start_ms end_ms text):
# 0 1720 And so
# 1720 3400 my fellow Americans
```

--------------------------------

### Transcription with Context and Terminology Injection

Source: https://context7.com/ufal/simulstreaming/llms.txt

Performs transcription while injecting domain-specific terminology and maintaining context across processing windows. This example uses static terminology and a scrolling context.

```APIDOC
## Transcription with Context and Terminology Injection

### Description
Transcribes audio while injecting domain-specific terminology and maintaining context across processing windows. This example utilizes static terminology and scrolling context.

### Method
`python3` command-line execution

### Endpoint
`simulstreaming_whisper.py`

### Parameters
#### Command-line Arguments
- **`conference_audio.wav`** (string) - Required - Path to the input audio file.
- **`--language`** (string) - Optional - Source language code (e.g., `en` for English).
- **`--task`** (string) - Optional - Task to perform (e.g., `transcribe`, `translate`). Defaults to `transcribe`.
- **`--static_init_prompt`** (string) - Optional - A comma-separated list of static terms to inject (e.g., `"COVID-19, RNA, mRNA, spike protein"`).
- **`--init_prompt`** (string) - Optional - An initial prompt to set the context (e.g., `"The speaker is discussing vaccine development."`).
- **`--max_context_tokens`** (integer) - Optional - Maximum number of tokens to maintain for context (e.g., `100`).
- **`--beams`** (integer) - Optional - Number of beams for beam search (e.g., `3`).
- **`--audio_max_len`** (float) - Optional - Maximum audio length in seconds for processing (e.g., `30.0`).
- **`--min-chunk-size`** (float) - Optional - Minimum chunk size in seconds (e.g., `1.5`).

### Request Example
```bash
python3 simulstreaming_whisper.py conference_audio.wav \
  --language en \
  --task transcribe \
  --static_init_prompt "COVID-19, RNA, mRNA, spike protein" \
  --init_prompt "The speaker is discussing vaccine development." \
  --max_context_tokens 100 \
  --beams 3 \
  --audio_max_len 30.0 \
  --min-chunk-size 1.5
```

### Response
#### Success Response
- **Output** (string) - The transcribed text will include the specified terminology and reflect the maintained context. The exact format depends on the internal processing.

#### Response Example
```
# Output includes terminology from static prompt maintained throughout
# Context tokens scroll while static prompt remains constant
```
```

--------------------------------

### PaddedAlignAttWhisper Direct Usage

Source: https://context7.com/ufal/simulstreaming/llms.txt

Provides low-level access to the AlignAtt policy for custom implementations. This example shows how to configure and initialize the PaddedAlignAttWhisper model and process audio incrementally.

```APIDOC
## PaddedAlignAttWhisper Direct Usage

### Description
Offers low-level access to the AlignAtt policy for custom implementations. This example demonstrates how to configure and initialize the `PaddedAlignAttWhisper` model and process audio incrementally.

### Method
Python script execution

### Endpoint
N/A (Library usage)

### Parameters
#### Python Libraries
- `simul_whisper.config`: For `AlignAttConfig`.
- `simul_whisper.simul_whisper`: For `PaddedAlignAttWhisper`.
- `torch`: For tensor operations.

#### `AlignAttConfig` Parameters
- **`model_path`** (string) - Path to the Whisper model file.
- **`segment_length`** (float) - Length of audio segments in seconds.
- **`frame_threshold`** (float) - Threshold for frame processing.
- **`language`** (string) - Source language code.
- **`audio_max_len`** (float) - Maximum audio length in seconds for processing.
- **`audio_min_len`** (float) - Minimum audio length in seconds for processing.
- **`decoder_type`** (string) - Type of decoder (e.g., `beam`).
- **`beam_size`** (integer) - Size of the beam for beam search.
- **`task`** (string) - Task to perform (`translate` or `transcribe`).
- **`init_prompt`** (string) - Initial prompt for context or terminology.
- **`max_context_tokens`** (integer) - Maximum number of tokens for context.

### Request Example
```python
from simul_whisper.config import AlignAttConfig
from simul_whisper.simul_whisper import PaddedAlignAttWhisper
import torch

# Configure AlignAtt policy
cfg = AlignAttConfig(
    model_path='./large-v3.pt',
    segment_length=1.2,
    frame_threshold=25,
    language='en',
    audio_max_len=30.0,
    audio_min_len=1.0,
    decoder_type='beam',
    beam_size=5,
    task='translate',
    init_prompt='Domain-specific context here',
    max_context_tokens=100
)

# Initialize model
model = PaddedAlignAttWhisper(cfg)

# Process audio incrementally
audio_segment = torch.randn(16000)  # 1 second at 16kHz
model.insert_audio(audio_segment)

# Inference with AlignAtt policy
tokens, generation_progress = model.infer(is_last=False)

# Print results
print("Tokens:", tokens)
print("Generation Progress:", generation_progress)
```

### Response
#### Success Response
- **`tokens`** (list) - List of generated token IDs.
- **`generation_progress`** (object/dict) - Information about the generation progress (structure may vary).

#### Response Example
```
Tokens: [token_ids]
Generation Progress: {
  'progress_details': ...
}
```
```

--------------------------------

### End-of-Word Detection with CIF Model

Source: https://context7.com/ufal/simulstreaming/llms.txt

Utilizes a CIF (Continuous Integrate-and-Fire) model for end-of-word detection to prevent partial word outputs at segment boundaries. Includes examples for using the CIF model, disabling `never_fire`, and forcing `never_fire`. Note that CIF models are not yet available for large-v3.

```bash
# Use CIF model to detect end-of-word boundaries
python3 simulstreaming_whisper.py audio.wav \
  --language en \
  --task transcribe \
  --cif_ckpt_path ./cif_models/large-v2.pt \
  --model_path ./large-v2.pt

# Explicitly disable never_fire: with a CIF checkpoint set,
# the CIF model decides whether the last word is truncated
python3 simulstreaming_whisper.py audio.wav \
  --cif_ckpt_path ./cif_models/large-v2.pt \
  --no-never_fire

# Force never truncate last word
python3 simulstreaming_whisper.py audio.wav \
  --never_fire
```

--------------------------------

### TokenBuffer for Context Management

Source: https://context7.com/ufal/simulstreaming/llms.txt

Demonstrates the usage of `TokenBuffer` for managing rolling context windows with static and dynamic prompts. It handles tokenization, device placement, prefix tokens, appending text, converting to tensors, trimming old words, and appending new tokens. Requires `token_buffer` and `torch` libraries.
```python
from token_buffer import TokenBuffer
import torch

# Create empty buffer with prefix tokens
buffer = TokenBuffer.empty(
    tokenizer=tokenizer,
    device=torch.device('cuda'),
    prefix_token_ids=[50361]  # sot_prev token
)

# Add static terminology that never scrolls
static_prompt = "COVID-19, mRNA, vaccine"
buffer = TokenBuffer.from_text(
    static_prompt,
    tokenizer=tokenizer,
    device=torch.device('cuda'),
    prefix_token_ids=[50361]
)

# Append dynamic context
buffer.text += " The research focuses on spike proteins"

# Convert to tensor for model input
context_tensor = buffer.as_tensor_beam(beam=5)

# Trim oldest words when exceeding max tokens
tokens_removed = buffer.trim_words(num=2, after=len(static_prompt))

# Append new tokens from model output
new_token_ids = [1234, 5678]
buffer.append_token_ids(new_token_ids)
```

--------------------------------

### SimulStreaming: Transcribe with Context/Terminology (Python)

Source: https://context7.com/ufal/simulstreaming/llms.txt

This code snippet shows how to transcribe audio while injecting domain-specific terminology and maintaining context across processing windows. It uses the `--static_init_prompt` and `--init_prompt` arguments for terminology and context, respectively, along with settings for context token management and audio chunking.

```bash
python3 simulstreaming_whisper.py conference_audio.wav \
  --language en \
  --task transcribe \
  --static_init_prompt "COVID-19, RNA, mRNA, spike protein" \
  --init_prompt "The speaker is discussing vaccine development." \
  --max_context_tokens 100 \
  --beams 3 \
  --audio_max_len 30.0 \
  --min-chunk-size 1.5

# Output includes terminology from static prompt maintained throughout
# Context tokens scroll while static prompt remains constant
```

--------------------------------

### Convert TXT Output to Instance Logs

Source: https://github.com/ufal/simulstreaming/blob/main/translate/README.txt

Scripts to convert the default text output of `simul_llm_translate.py` into instance log format.
`txt-to-instances.py` is for En->De, and `zh-ja-txt-to-instances-nobreaking+nospaces.py` is for En->Zh and En->Ja; the latter also removes spaces and newlines.

```bash
python3 zh-ja-txt-to-instances-nobreaking+nospaces.py < res/ja/asr.ch-1.4-frame-15-beam-1+eurollm.ch-4.unaware.gputype-any.i-1.model-eurollm-9b.language-ja/2022.acl-long.590.txt 2022.acl-long.590.wav > inst.log
```

--------------------------------

### SimulWhisper ASR Backend Integration (Python)

Source: https://context7.com/ufal/simulstreaming/llms.txt

This Python snippet demonstrates the core Automatic Speech Recognition (ASR) backend integration using the AlignAtt policy within SimulStreaming. It shows how to create an ASR factory, configure arguments using `argparse`, and process audio chunks using an online processor.

```python
from simulstreaming_whisper import simul_asr_factory, simulwhisper_args
import argparse

# Create ASR factory with configuration
parser = argparse.ArgumentParser()
parser.add_argument('--min-chunk-size', type=float, default=1.2)
parser.add_argument('--lan', type=str, default='en')
parser.add_argument('--task', type=str, default='transcribe')
parser.add_argument('--vac', action='store_true')
parser.add_argument('--log-level', default='INFO')
simulwhisper_args(parser)

args = parser.parse_args(['--model_path', './large-v3.pt',
                          '--beams', '5',
                          '--frame_threshold', '25'])
args.logdir = None

# Factory returns ASR and online processor
asr, online_processor = simul_asr_factory(args)

# Process audio chunks
import numpy as np
audio_chunk = np.random.randn(16000).astype(np.float32)  # 1 second
online_processor.insert_audio_chunk(audio_chunk)
result = online_processor.process_iter()

# Result structure:
# {'start': 0.0, 'end': 1.0, 'text': 'transcribed text',
#  'tokens': [token_ids], 'words': [word_level_timestamps]}
```

--------------------------------

### Voice Activity Controller (VAC) Integration (Code)

Source: https://context7.com/ufal/simulstreaming/llms.txt

Demonstrates
programmatic integration of Voice Activity Detection (VAC) by wrapping an existing online ASR processor with `VACOnlineASRProcessor`. This allows for seamless use of VAC's silence detection capabilities within the application logic.

```python
# In code integration
from whisper_streaming.vac_online_processor import VACOnlineASRProcessor

# Wrap online processor with VAC
online_with_vac = VACOnlineASRProcessor(
    min_chunk_size=1.2,
    online_processor=online_processor
)

# Use same interface
online_with_vac.insert_audio_chunk(audio)
result = online_with_vac.process_iter()
```

--------------------------------

### PaddedAlignAttWhisper Direct Usage (Python)

Source: https://context7.com/ufal/simulstreaming/llms.txt

This Python code provides low-level access to the AlignAtt policy for custom implementations within SimulStreaming. It demonstrates initializing the `PaddedAlignAttWhisper` model with specific configuration parameters and processing audio segments incrementally for inference.

```python
from simul_whisper.config import AlignAttConfig
from simul_whisper.simul_whisper import PaddedAlignAttWhisper
import torch

# Configure AlignAtt policy
cfg = AlignAttConfig(
    model_path='./large-v3.pt',
    segment_length=1.2,
    frame_threshold=25,
    language='en',
    audio_max_len=30.0,
    audio_min_len=1.0,
    decoder_type='beam',
    beam_size=5,
    task='translate',
    init_prompt='Domain-specific context here',
    max_context_tokens=100
)

# Initialize model
model = PaddedAlignAttWhisper(cfg)

# Process audio incrementally
audio_segment = torch.randn(16000)  # 1 second at 16kHz
model.insert_audio(audio_segment)

# Inference with AlignAtt policy
tokens, generation_progress = model.infer(is_last=False)
```

--------------------------------

### Clone EuroLLM Hugging Face Model

Source: https://github.com/ufal/simulstreaming/blob/main/translate/README.txt

Clones the EuroLLM-9B-Instruct model from Hugging Face. This is the first step to obtain the necessary model files for translation.
```bash
git clone https://huggingface.co/utter-project/EuroLLM-9B-Instruct
```

```bash
git clone git@hf.co:utter-project/EuroLLM-9B-Instruct
```

--------------------------------

### SimulWhisperASR Backend Integration

Source: https://context7.com/ufal/simulstreaming/llms.txt

Demonstrates the integration of the core ASR backend that implements the AlignAtt policy with the Whisper model using Python. This includes setting up the factory and processing audio chunks.

```APIDOC
## SimulWhisperASR Backend Integration

### Description
Provides a Python code example for integrating the core ASR backend that implements the AlignAtt policy with the Whisper model. This snippet shows how to create an ASR factory and process audio chunks.

### Method
Python script execution

### Endpoint
N/A (Library usage)

### Parameters
#### Python Libraries
- `simulstreaming_whisper`: For ASR factory and arguments.
- `argparse`: For parsing command-line arguments.
- `numpy`: For audio chunk manipulation.

#### `simulwhisper_args` Configuration
- **`--model_path`** (string) - Path to the Whisper model file.
- **`--beams`** (integer) - Number of beams for beam search.
- **`--frame_threshold`** (float) - Threshold for frame processing.
- **`--min-chunk-size`** (float) - Minimum chunk size in seconds.
- **`--lan`** (string) - Source language code.
- **`--task`** (string) - Task to perform (`transcribe` or `translate`).
- **`--vac`** (flag) - Enables voice activity detection.
- **`--log-level`** (string) - Logging level (e.g., `INFO`).

### Request Example
```python
from simulstreaming_whisper import simul_asr_factory, simulwhisper_args
import argparse
import numpy as np

# Create ASR factory with configuration
parser = argparse.ArgumentParser()
parser.add_argument('--min-chunk-size', type=float, default=1.2)
parser.add_argument('--lan', type=str, default='en')
parser.add_argument('--task', type=str, default='transcribe')
parser.add_argument('--vac', action='store_true')
parser.add_argument('--log-level', default='INFO')
simulwhisper_args(parser)

args = parser.parse_args(['--model_path', './large-v3.pt',
                          '--beams', '5',
                          '--frame_threshold', '25'])
args.logdir = None

# Factory returns ASR and online processor
asr, online_processor = simul_asr_factory(args)

# Process audio chunks
audio_chunk = np.random.randn(16000).astype(np.float32)  # 1 second
online_processor.insert_audio_chunk(audio_chunk)
result = online_processor.process_iter()

# Print result
print(result)
```

### Response
#### Success Response
- **`result`** (dict) - Contains transcription details:
  - **`start`** (float) - Start timestamp of the segment.
  - **`end`** (float) - End timestamp of the segment.
  - **`text`** (string) - Transcribed or translated text.
  - **`tokens`** (list) - List of token IDs.
  - **`words`** (list) - List of word-level timestamps.

#### Response Example
```json
{
  "start": 0.0,
  "end": 1.0,
  "text": "transcribed text",
  "tokens": [token_ids],
  "words": [word_level_timestamps]
}
```
```

--------------------------------

### Voice Activity Controller (VAC) Integration

Source: https://context7.com/ufal/simulstreaming/llms.txt

Shows how to integrate Voice Activity Detection (VAC) with the Whisper ASR processor for automatic silence detection. This improves latency by avoiding processing of silence. Requires `torchaudio`. It can be enabled via command-line arguments or by wrapping an online processor.
```bash
# Enable VAC in file simulation
python3 simulstreaming_whisper.py audio.wav \
  --language en \
  --task transcribe \
  --vac \
  --vac-chunk-size 0.04 \
  --min-chunk-size 1.2
```

--------------------------------

### Run Simultaneous LLM Translation

Source: https://github.com/ufal/simulstreaming/blob/main/translate/README.txt

Executes the `simul_llm_translate.py` script for simultaneous translation using EuroLLM. It supports input from text files with timestamps or instance log formats. Key parameters control chunk size, language, context length, and buffer trimming.

```bash
cat gold-asr-dir//2022.acl-long.110.txt | python3 simul_llm_translate.py --min-chunk-size 1 --language de --language-specific-len-threshold --max-context-length 80 --buffer_trimming sentences
```

```bash
python3 simul_llm_translate.py --input-instance gold-asr-dir//2022.acl-long.110.instance.log --min-chunk-size 1 --language de --max-context-length 300 | tee out
```

```bash
python3 simul_llm_translate.py --input-instance $input \
  --min-chunk-size $ch \
  --language $language \
  --language-specific-len-threshold --buffer_trimming $trim \
  --max-context-length $max_context_len
```

--------------------------------

### SimulStreaming: Translate Audio File (Python)

Source: https://context7.com/ufal/simulstreaming/llms.txt

This snippet demonstrates how to perform real-time simultaneous translation from an audio file using SimulStreaming. It utilizes the Whisper model with the AlignAtt policy and supports various command-line arguments for language, task, model path, beam search, and chunking.
```bash
python3 simulstreaming_whisper.py audio.wav \
  --language cs \
  --task translate \
  --comp_unaware \
  --model_path ./large-v3.pt \
  --beams 5 \
  --frame_threshold 25 \
  --min-chunk-size 1.2 \
  --vac

# Expected output format (emission_time start_ms end_ms text):
# 1200.0000 0 1200 And so
# 2400.0000 1200 2400 my fellow Americans
# 3600.0000 2400 3600 ,
# 4800.0000 3600 4800 ask not
```

--------------------------------

### Computationally Aware Simulation Mode

Source: https://context7.com/ufal/simulstreaming/llms.txt

Runs the simulation in 'computationally aware' mode, where latency includes processing time. This provides a realistic measure of real-world latency. The output timestamp reflects the emission time after computation.

```bash
# Computationally aware (default): includes processing time in latency
python3 simulstreaming_whisper.py audio.wav \
  --language en \
  --task translate \
  --min-chunk-size 1.2
```

--------------------------------

### Automatic Language Detection

Source: https://context7.com/ufal/simulstreaming/llms.txt

Enables automatic language detection for speech input when the source language is not specified. The model analyzes audio features to identify the language, creating the appropriate tokenizer for subsequent processing. Works for both transcription and translation tasks.

```bash
# Enable automatic language detection
python3 simulstreaming_whisper.py audio.wav \
  --language auto \
  --task translate \
  --beams 5
```

--------------------------------

### Speech-to-Text Translation from Audio File

Source: https://context7.com/ufal/simulstreaming/llms.txt

Simulates real-time simultaneous translation from an audio file. This command translates Czech audio to English using the large-v3 Whisper model with specified parameters for beam search, chunk size, and prompt injection.

```APIDOC
## Speech-to-Text Translation from Audio File

### Description
Real-time simulation of simultaneous translation from an audio file.
This example demonstrates translating Czech audio to English using the large-v3 Whisper model.

### Method
`python3` command-line execution

### Endpoint
`simulstreaming_whisper.py`

### Parameters
#### Command-line Arguments
- **`audio.wav`** (string) - Required - Path to the input audio file.
- **`--language`** (string) - Optional - Source language code (e.g., `cs` for Czech).
- **`--task`** (string) - Optional - Task to perform (e.g., `translate`, `transcribe`). Defaults to `transcribe`.
- **`--comp_unaware`** (flag) - Optional - Enables computationally unaware simulation mode.
- **`--model_path`** (string) - Optional - Path to the Whisper model file (e.g., `./large-v3.pt`).
- **`--beams`** (integer) - Optional - Number of beams for beam search (e.g., `5`).
- **`--frame_threshold`** (float) - Optional - Threshold for frame processing (e.g., `25`).
- **`--min-chunk-size`** (float) - Optional - Minimum chunk size in seconds (e.g., `1.2`).
- **`--vac`** (flag) - Optional - Enables voice activity detection.

### Request Example
```bash
python3 simulstreaming_whisper.py audio.wav \
  --language cs \
  --task translate \
  --comp_unaware \
  --model_path ./large-v3.pt \
  --beams 5 \
  --frame_threshold 25 \
  --min-chunk-size 1.2 \
  --vac
```

### Response
#### Success Response
- **Output Format** (string) - `emission_time start_ms end_ms text` - Provides timestamps and transcribed/translated text for each segment.

#### Response Example
```
1200.0000 0 1200 And so
2400.0000 1200 2400 my fellow Americans
3600.0000 2400 3600 ,
4800.0000 3600 4800 ask not
```
```

--------------------------------

### Cascaded LLM Translation Pipeline

Source: https://context7.com/ufal/simulstreaming/llms.txt

Performs speech-to-text transcription using Whisper, followed by LLM-based translation. It processes audio files and outputs translated text with timestamps. Requires the `simulstreaming_whisper.py` and `translate/simul_llm_translate.py` scripts.
```bash
python3 simulstreaming_whisper.py audio.wav \
  --language en \
  --task transcribe \
  --comp_unaware \
  > asr_output.txt

python3 translate/simul_llm_translate.py \
  --lan de \
  --min-chunk-size 3 \
  --min-len 5 \
  --language-specific-len-threshold \
  --sys_prompt "You are simultaneous interpreter from English to German." \
  --init_prompt_src "Welcome to the conference." \
  --init_prompt_tgt "Willkommen zur Konferenz." \
  --max-context-length 4096 \
  < asr_output.txt
```

--------------------------------

### Run SLAAL for Translation Evaluation

Source: https://github.com/ufal/simulstreaming/blob/main/translate/README.txt

Shell scripts to run the SLAAL evaluation, either for a single document or for the entire development set. These scripts process documents, generate instance logs, and align candidates with reference translations using MWERSegmenter.

```bash
./slaal-de.sh de-output/2022.acl-long.110.txt de-output/2022.acl-long.110.mw-segments > de-output/2022.acl-long.110.slaal
```

```bash
./slaal-de.sh de-output/ > de-output/slaal
```

--------------------------------

### Computationally Unaware Simulation Mode

Source: https://context7.com/ufal/simulstreaming/llms.txt

Runs the simulation in 'computationally unaware' mode, measuring only policy latency, excluding actual processing time. This is useful for determining the theoretical minimum latency of the policy. The timer effectively stops during computation.

```bash
# Computationally unaware: measures only policy latency
python3 simulstreaming_whisper.py audio.wav \
  --language en \
  --task translate \
  --comp_unaware \
  --min-chunk-size 1.2
```

--------------------------------

### Convert Hugging Face Model to CTranslate2

Source: https://github.com/ufal/simulstreaming/blob/main/translate/README.txt

Converts a Hugging Face model to the CTranslate2 format using the `ct2-transformers-converter` tool. CTranslate2 is a fast inference engine for Transformer models.
```bash
ct2-transformers-converter --model EuroLLM-9B-Instruct/ --output_dir ct2_EuroLLM-9B-Instruct
```

--------------------------------

### Decode Tokens and Refresh Segment

Source: https://context7.com/ufal/simulstreaming/llms.txt

Decodes a sequence of tokens into human-readable text and refreshes the model's segment for subsequent processing. Assumes a `model` object with a `tokenizer` attribute and a `refresh_segment` method.

```python
text = model.tokenizer.decode(tokens)
print(f"Output: {text}")
model.refresh_segment(complete=False)
```
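
--------------------------------

### Parsing the Output Format (Illustrative Sketch)

The space-separated output format described in the Output Format Example section (emission time, start, end, text, with the emission time omitted in server output) can be post-processed with a few lines of Python. The sketch below is not part of SimulStreaming: `Segment` and `parse_line` are hypothetical names introduced here for illustration, and it assumes that all numeric columns are in milliseconds, as the `start_ms end_ms` labels used elsewhere in this document suggest.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    """One output line. Hypothetical helper type, not a SimulStreaming class."""
    emission_ms: Optional[float]  # None for server output, which omits this column
    start_ms: float
    end_ms: float
    text: str


def parse_line(line: str, has_emission_time: bool = True) -> Segment:
    """Parse 'emission start end text' (file simulation) or 'start end text' (server)."""
    n_numeric = 3 if has_emission_time else 2
    # Split off the numeric columns only; the remainder keeps its internal spaces.
    parts = line.strip().split(maxsplit=n_numeric)
    if len(parts) < n_numeric:
        raise ValueError(f"expected at least {n_numeric} columns: {line!r}")
    numbers = [float(p) for p in parts[:n_numeric]]
    text = parts[n_numeric] if len(parts) > n_numeric else ""
    if has_emission_time:
        emission_ms, start_ms, end_ms = numbers
    else:
        emission_ms = None
        start_ms, end_ms = numbers
    return Segment(emission_ms, start_ms, end_ms, text)


if __name__ == "__main__":
    # Lines copied from the examples in this document.
    print(parse_line("1200.0000 0 1200 And so"))
    print(parse_line("1720 3400 my fellow Americans", has_emission_time=False))
```

Passing `has_emission_time=False` handles server output. Splitting with `maxsplit` keeps multi-word text such as `my fellow Americans` in a single field.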