### Install Dependencies

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/3dspeaker/language-identification/README.md

Install the required packages by running the following command.

```sh
pip install -r requirements.txt
```

--------------------------------

### Install Dependencies and Run Script

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/ava-asd/talknet/README.md

Install the project dependencies with pip and execute the main run script. ffmpeg must be installed separately, either with apt-get or with conda, as in the second and third blocks.

```sh
pip install -r requirements.txt
bash run.sh
```

```sh
sudo apt-get update
sudo apt-get install ffmpeg
```

```sh
conda install ffmpeg
```

--------------------------------

### Install ModelScope and Run Inference

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/3dspeaker/sv-cam++/README.md

Install the ModelScope library and run inference using a pretrained CAM++ model. Specify the model ID and the path to your audio files.

```sh
# Install modelscope
pip install modelscope
# CAM++ trained on 3D-Speaker
model_id=damo/speech_campplus_sv_zh-cn_3dspeaker_16k
# CAM++ trained on 200k labeled speakers
model_id=damo/speech_campplus_sv_zh-cn_16k-common
# Run inference
python speakerlab/bin/infer_sv.py --model_id $model_id --wavs $wav_path
```

--------------------------------

### Install 3D-Speaker Toolkit

Source: https://github.com/modelscope/3d-speaker/blob/main/README.md

Clone the repository, create and activate a conda environment, and install the required dependencies.

```sh
git clone https://github.com/modelscope/3D-Speaker.git && cd 3D-Speaker
conda create -n 3D-Speaker python=3.8
conda activate 3D-Speaker
pip install -r requirements.txt
```

--------------------------------

### Install ModelScope for Inference

Source: https://github.com/modelscope/3d-speaker/blob/main/README.md

Install the modelscope library to use pretrained models for inference.

```sh
# Install modelscope
pip install modelscope
```

--------------------------------

### Install 3D-Speaker and Dependencies

Source: https://context7.com/modelscope/3d-speaker/llms.txt

Commands to clone the repository, set up the conda environment, and install required packages.

```bash
# Clone the repository
git clone https://github.com/modelscope/3D-Speaker.git && cd 3D-Speaker

# Create and activate conda environment
conda create -n 3D-Speaker python=3.8
conda activate 3D-Speaker

# Install dependencies
pip install -r requirements.txt

# Install modelscope for pretrained model access
pip install modelscope
```

--------------------------------

### Install ModelScope and Run RDINO Inference

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/3dspeaker/sv-rdino/README.md

Installs the ModelScope library and demonstrates how to run inference with the RDINO pretrained model for speaker verification. Ensure you have audio files to test with.

```sh
# Install modelscope
pip install modelscope
# RDINO trained on 3D-Speaker
model_id=damo/speech_rdino_ecapa_tdnn_sv_zh-cn_3dspeaker_16k
# Run inference
python speakerlab/bin/infer_sv_rdino.py --model_id $model_id --wavs $wav_path
```

--------------------------------

### Install ModelScope

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/voxceleb/README.md

Install the ModelScope library using pip. This is a prerequisite for using pretrained models for inference.

```sh
pip install modelscope
```

--------------------------------

### Install ModelScope and Run Inference

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/cnceleb/sv-eres2net/README.md

Install the ModelScope library and use the provided script to run inference with a pretrained ERes2Net model. Specify the model ID and the path to your audio files.

```sh
# Install modelscope
pip install modelscope
# ERes2Net trained on CNCeleb
model_id=damo/speech_eres2net_base_sv_zh-cn_cnceleb_16k
# ERes2Net trained on 200k labeled speakers
model_id=damo/speech_eres2net_sv_zh-cn_16k-common
# Run inference
python speakerlab/bin/infer_sv.py --model_id $model_id --wavs $wav_path
```

--------------------------------

### Run Multimodal Diarization

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/3dspeaker/speaker-diarization/README.md

Install ffmpeg and execute the multimodal diarization pipeline.

```sh
sudo apt-get update
sudo apt-get install ffmpeg
bash run_video.sh
```

--------------------------------

### Install ModelScope and Run ERes2Net Inference

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/3dspeaker/sv-eres2net/README.md

Install the ModelScope library and then use the provided script to run inference with a specified ERes2Net model ID and audio file path. Ensure you have the correct model ID for the desired pretrained model.

```sh
# Install modelscope
pip install modelscope
# ERes2Net trained on 3D-Speaker
model_id=damo/speech_eres2net_large_sv_zh-cn_3dspeaker_16k
# ERes2Net trained on 200k labeled speakers
model_id=damo/speech_eres2net_sv_zh-cn_16k-common
# Run inference
python speakerlab/bin/infer_sv.py --model_id $model_id --wavs $wav_path
```

--------------------------------

### Install ModelScope and Run Inference

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/voxceleb/sv-eres2netv2/README.md

Install the ModelScope library and use the provided script to extract speaker embeddings from audio files using a pretrained ERes2NetV2 model. Ensure you have the model ID and the path to your audio files.

```sh
# Install modelscope
pip install modelscope
# ERes2NetV2 trained on 200k labeled speakers
model_id=iic/speech_eres2netv2_sv_zh-cn_16k-common
# Run inference
python speakerlab/bin/infer_sv.py --model_id $model_id --wavs $wav_path
```

--------------------------------

### Run SDPN Inference

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/voxceleb/sv-sdpn/README.md

Install the modelscope library and execute the inference script using the specified pretrained model ID.

```sh
# Install modelscope
pip install modelscope
# SDPN trained on VoxCeleb
model_id=iic/speech_sdpn_ecapa_tdnn_sv_en_voxceleb_16k
# Run inference
python speakerlab/bin/infer_sv_ssl.py --model_id $model_id
```

--------------------------------

### Install ModelScope and Run ECAPA-TDNN Inference

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/cnceleb/sv-ecapa/README.md

Install the ModelScope library and use the provided script to extract speaker embeddings with the ECAPA-TDNN pretrained model. Ensure you have the path to your audio files ready.

```sh
# Install modelscope
pip install modelscope
# ECAPA-TDNN trained on CNCeleb
model_id=damo/speech_ecapa-tdnn_sv_zh-cn_cnceleb_16k
# Run inference
python speakerlab/bin/infer_sv.py --model_id $model_id --wavs $wav_path
```

--------------------------------

### Install and Run ECAPA-TDNN Inference

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/3dspeaker/sv-ecapa/README.md

Installs the ModelScope library and runs inference using a pretrained ECAPA-TDNN model to extract speaker embeddings. Ensure you have the audio file path ready.

```sh
# Install modelscope
pip install modelscope
# ECAPA-TDNN trained on 3D-Speaker
model_id=damo/speech_ecapa-tdnn_sv_zh-cn_3dspeaker_16k
# Run inference
python speakerlab/bin/infer_sv.py --model_id $model_id --wavs $wav_path
```

--------------------------------

### Wav.scp File Format

Source: https://github.com/modelscope/3d-speaker/blob/main/runtime/onnxruntime/README.md

Example format for the wav.scp file, which maps utterance IDs to the paths of WAV audio files.

```text
utt_id_1 /path/to/wav_1.wav
utt_id_2 /path/to/wav_2.wav
...
```
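The mapping above can be loaded with a few lines of Python. This is a minimal sketch, assuming each non-empty line holds an utterance ID followed by whitespace and a path; `read_wav_scp` is a hypothetical helper written for illustration, not a function from the 3D-Speaker code base.

```python
import tempfile
from pathlib import Path


def read_wav_scp(scp_path):
    """Parse a wav.scp file into a dict of {utt_id: wav_path}."""
    table = {}
    with open(scp_path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            # Split on the first run of whitespace only, so paths may contain spaces
            parts = line.split(maxsplit=1)
            if len(parts) != 2:
                raise ValueError(f"line {line_no}: expected '<utt_id> <path>', got {line!r}")
            utt_id, wav_path = parts
            if utt_id in table:
                raise ValueError(f"line {line_no}: duplicate utterance ID {utt_id!r}")
            table[utt_id] = wav_path
    return table


# Demo on a throwaway file that mirrors the format shown above
with tempfile.TemporaryDirectory() as tmp:
    scp = Path(tmp) / "wav.scp"
    scp.write_text(
        "utt_id_1 /path/to/wav_1.wav\nutt_id_2 /path/to/wav_2.wav\n",
        encoding="utf-8",
    )
    table = read_wav_scp(scp)

print(table["utt_id_1"])
```

Rejecting duplicate IDs is a design choice of this sketch; a real pipeline might prefer to keep the last entry instead.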

--------------------------------

### Build ONNX Runtime

Source: https://github.com/modelscope/3d-speaker/blob/main/runtime/onnxruntime/README.md

Commands to build the C++ ONNX Runtime engine located in `runtime/onnxruntime/`. Ensure cmake and gcc are installed. The build output goes to the `build` directory; the folder name can be changed.

```shell
cd runtime/onnxruntime/
mkdir build/ # you can change the folder name
cd build/
cmake ..
make
```

--------------------------------

### Perform inference with Modelscope pretrained models

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/3dspeaker/README.md

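Install the ModelScope library and run the inference scripts for CAM++, ERes2Net, or RDINO models. Keep only the `model_id` assignment for the model you want, since each later assignment to `model_id` overwrites the earlier ones.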

```sh
# Install modelscope
pip install modelscope
# CAM++ trained on 3D-Speaker
model_id=iic/speech_campplus_sv_zh-cn_3dspeaker_16k
# CAM++ trained on 200k labeled speakers
model_id=iic/speech_campplus_sv_zh-cn_16k-common
# ERes2Net trained on 3D-Speaker
model_id=iic/speech_eres2net_large_sv_zh-cn_3dspeaker_16k
# ERes2Net trained on 200k labeled speakers
model_id=iic/speech_eres2net_sv_zh-cn_16k-common
# Run CAM++ or ERes2Net inference
python speakerlab/bin/infer_sv.py --model_id $model_id --wavs $wav_path

# RDINO trained on 3D-Speaker
model_id=iic/speech_rdino_ecapa_tdnn_sv_zh-cn_3dspeaker_16k
# Run RDINO inference
python speakerlab/bin/infer_sv_rdino.py --model_id $model_id --wavs $wav_path
```

--------------------------------

### YAML Training Configuration for CAM++

Source: https://context7.com/modelscope/3d-speaker/llms.txt

An example YAML configuration file for training the CAM++ model. It specifies parameters for training, audio processing, model architecture, loss function (ArcMarginLoss with margin scheduling), and optimizer (SGD).

```yaml
# Training configuration for CAM++ model

# Basic training parameters
num_epoch: 60
save_epoch_freq: 5
log_batch_freq: 100
batch_size: 256
num_workers: 16

# Audio parameters
wav_len: 3.0          # Duration in seconds
sample_rate: 16000
aug_prob: 0.2         # Augmentation probability
speed_pertub: True    # Enable speed perturbation

# Model parameters
fbank_dim: 80
embedding_size: 512
num_classes: 5994     # Number of speakers in training set

# Learning rate
lr: 0.1
min_lr: 1e-4

# Model architecture
embedding_model:
  obj: speakerlab.models.campplus.DTDNN.CAMPPlus
  args:
    feat_dim: 80
    embedding_size: 512

# Loss function with margin scheduling
loss:
  obj: speakerlab.loss.margin_loss.ArcMarginLoss
  args:
    scale: 32.0
    margin: 0.2
    easy_margin: False

margin_scheduler:
  obj: speakerlab.process.scheduler.MarginScheduler
  args:
    initial_margin: 0.0
    final_margin: 0.2
    increase_start_epoch: 15
    fix_epoch: 25

# Optimizer
optimizer:
  obj: torch.optim.SGD
  args:
    lr: 0.1
    momentum: 0.9
    nesterov: True
    weight_decay: 0.0001

```
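The `margin_scheduler` entries describe a margin that starts at `initial_margin`, begins to grow at `increase_start_epoch`, and is held at `final_margin` from `fix_epoch` onward. The sketch below illustrates that shape with a linear ramp. The linear curve and the function name `margin_at_epoch` are assumptions made for illustration; the actual `MarginScheduler` may use a different growth curve and may update per step, not per epoch.

```python
def margin_at_epoch(epoch, initial_margin=0.0, final_margin=0.2,
                    increase_start_epoch=15, fix_epoch=25):
    """Illustrative margin schedule: flat, then a linear ramp, then flat."""
    if epoch < increase_start_epoch:
        return initial_margin          # margin not yet increased
    if epoch >= fix_epoch:
        return final_margin            # margin fixed for the rest of training
    # Fraction of the ramp completed so far (assumed linear)
    progress = (epoch - increase_start_epoch) / (fix_epoch - increase_start_epoch)
    return initial_margin + progress * (final_margin - initial_margin)


for epoch in (0, 15, 20, 25, 59):
    print(epoch, round(margin_at_epoch(epoch), 3))
```

Starting with a zero margin keeps early training close to plain softmax, and the harder angular margin is only introduced once the embeddings have started to separate.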

--------------------------------

### Perform Inference with Pretrained Models

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/cnceleb/README.md

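Install ModelScope and run inference using specific pretrained model IDs for speaker verification tasks. Keep only the `model_id` assignment for the model you want, since each later assignment to `model_id` overwrites the earlier ones.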

```sh
# Install modelscope
pip install modelscope
# CAM++ trained on CN-Celeb
model_id=iic/speech_campplus_sv_cn_cnceleb_16k
# CAM++ trained on 200k labeled speakers
model_id=iic/speech_campplus_sv_zh-cn_16k-common
# ERes2Net-base trained on CN-Celeb
model_id=iic/speech_eres2net_base_sv_zh-cn_cnceleb_16k
# ERes2Net-large trained on CN-Celeb
model_id=iic/speech_eres2net_large_sv_zh-cn_cnceleb_16k
# ERes2Net trained on 200k labeled speakers
model_id=iic/speech_eres2net_sv_zh-cn_16k-common
# Run CAM++ or ERes2Net inference
python speakerlab/bin/infer_sv.py --model_id $model_id --wavs $wav_path

# RDINO trained on CN-Celeb
model_id=iic/speech_rdino_ecapa_tdnn_sv_zh-cn_cnceleb_16k
# Run RDINO inference
python speakerlab/bin/infer_sv_rdino.py --model_id $model_id --wavs $wav_path
```

--------------------------------

### CAM++ Model Initialization and Forward Pass

Source: https://context7.com/modelscope/3d-speaker/llms.txt

Initialize and perform a forward pass with the CAM++ model. This demonstrates setting up the model with specified dimensions and passing random input data to get embeddings.

```python
from speakerlab.models.campplus.DTDNN import CAMPPlus
import torch

# Initialize CAM++ model
model = CAMPPlus(
    feat_dim=80,          # Input feature dimension (Fbank)
    embedding_size=192,   # Output embedding dimension
    growth_rate=32,       # DenseNet growth rate
    bn_size=4,            # Bottleneck size multiplier
    init_channels=128,    # Initial channel count
    memory_efficient=True # Use checkpointing for memory efficiency
)

# Forward pass (input: [batch, time, features])
x = torch.randn(16, 300, 80)  # 16 samples, 300 frames, 80 mel bins
embedding = model(x)  # Output: [16, 192]

print(f"Embedding shape: {embedding.shape}")
print(f"Model parameters: {sum(p.numel() for p in model.parameters())/1e6:.2f}M")
```
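Speaker verification commonly compares two such embeddings with cosine similarity and accepts the pair when the score exceeds a threshold. The following is a minimal, dependency-free sketch of that comparison. `cosine_score` and the threshold value are hypothetical and chosen for illustration; they are not taken from the 3D-Speaker scripts.

```python
import math


def cosine_score(emb_a, emb_b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(a * a for a in emb_a))
    norm_b = math.sqrt(sum(b * b for b in emb_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings"; real CAM++ outputs would have 192 dimensions
enrol = [0.2, 0.9, -0.1]
test = [0.25, 0.8, -0.05]
threshold = 0.5  # hypothetical decision threshold, must be tuned on held-out trials

score = cosine_score(enrol, test)
print(f"score={score:.3f} same_speaker={score >= threshold}")
```

With tensors from the model above, the same quantity can be obtained from `torch.nn.functional.cosine_similarity`.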

--------------------------------

### Extract Embeddings with CAM++

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/cnceleb/sv-cam++/README.md

Instructions for installing ModelScope and running inference using a specified model ID and audio path.

```sh
# Install modelscope
pip install modelscope
# CAM++ trained on CNCeleb
model_id=damo/speech_campplus_sv_cn_cnceleb_16k
# CAM++ trained on 200k labeled speakers
model_id=damo/speech_campplus_sv_zh-cn_16k-common
# Run inference
python speakerlab/bin/infer_sv.py --model_id $model_id --wavs $wav_path
```

--------------------------------

### Run RDINO inference for speaker embedding extraction

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/voxceleb/sv-rdino/README.md

Install the modelscope library and execute the inference script using the specified model ID and input audio path.

```sh
# Install modelscope
pip install modelscope
# RDINO trained on VoxCeleb
model_id=damo/speech_rdino_ecapa_tdnn_sv_en_voxceleb_16k
# Run inference
python speakerlab/bin/infer_sv_rdino.py --model_id $model_id --wavs $wav_path
```

--------------------------------

### Extract Speaker Embeddings with Res2Net

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/3dspeaker/sv-res2net/README.md

Use this command to install the ModelScope library and run inference on audio files using the pretrained Res2Net model.

```sh
# Install modelscope
pip install modelscope
# Res2Net trained on 3D-Speaker-Dataset
model_id=iic/speech_res2net_sv_zh-cn_3dspeaker_16k
# Run inference
python speakerlab/bin/infer_sv.py --model_id $model_id --wavs $wav_path
```

--------------------------------

### Extract speaker embeddings using ECAPA-TDNN

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/voxceleb/sv-ecapa/README.md

Use this command to install the ModelScope library and run inference on audio files using the pretrained ECAPA-TDNN model.

```sh
# Install modelscope
pip install modelscope
# ECAPA-TDNN trained on VoxCeleb
model_id=damo/speech_ecapa-tdnn_sv_en_voxceleb_16k
# Run inference
python speakerlab/bin/infer_sv.py --model_id $model_id --wavs $wav_path
```

--------------------------------

### Extract speaker embeddings using ERes2Net

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/voxceleb/sv-eres2net/README.md

Use the provided shell commands to install ModelScope and run inference on audio files using pretrained ERes2Net models.

```sh
# Install modelscope
pip install modelscope
# ERes2Net trained on VoxCeleb
model_id=damo/speech_eres2net_sv_en_voxceleb_16k
# ERes2Net trained on 200k labeled speakers
model_id=damo/speech_eres2net_sv_zh-cn_16k-common
# Run inference
python speakerlab/bin/infer_sv.py --model_id $model_id --wavs $wav_path
```

--------------------------------

### Extract Speaker Embeddings with ResNet

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/3dspeaker/sv-resnet/README.md

Use this command to install the ModelScope library and run inference on a specified audio file using the ResNet34 model trained on the 3D-Speaker dataset.

```sh
# Install modelscope
pip install modelscope
# ResNet34 trained on 3D-Speaker-Dataset
model_id=iic/speech_resnet34_sv_zh-cn_3dspeaker_16k
# Run inference
python speakerlab/bin/infer_sv.py --model_id $model_id --wavs $wav_path
```

--------------------------------

### Download Alimeeting Data

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/semantic_speaker/bert/README.md

Use wget to download the Alimeeting dataset archives. Ensure sufficient disk space as files are large.

```shell
# Alimeeting data download
wget https://speech-lab-share-data.oss-cn-shanghai.aliyuncs.com/AliMeeting/openlr/Train_Ali_far.tar.gz # ([73.24G] (AliMeeting Train set, 8-channel microphone array speech) )
wget https://speech-lab-share-data.oss-cn-shanghai.aliyuncs.com/AliMeeting/openlr/Train_Ali_near.tar.gz # ([22.85G] (AliMeeting Train set, headset microphone speech) )
wget https://speech-lab-share-data.oss-cn-shanghai.aliyuncs.com/AliMeeting/openlr/Eval_Ali.tar.gz # ([3.42G] (AliMeeting Eval set, 8-channel microphone array speech, headset microphone speech) )
wget https://speech-lab-share-data.oss-cn-shanghai.aliyuncs.com/AliMeeting/openlr/Test_Ali.tar.gz # ([8.90G] (AliMeeting Test set, 8-channel microphone array speech, headset microphone speech) )
```

--------------------------------

### Add Executable: read_and_describe_wav

Source: https://github.com/modelscope/3d-speaker/blob/main/runtime/onnxruntime/bin/CMakeLists.txt

Defines the 'read_and_describe_wav' executable and links it with the 'utils' library.

```cmake
add_executable(read_and_describe_wav read_and_describe_wav.cpp)
target_link_libraries(read_and_describe_wav utils)
```

--------------------------------

### Resume Training from Checkpoint

Source: https://context7.com/modelscope/3d-speaker/llms.txt

Resume training for a CAM++ model from a previously saved checkpoint. Ensure the experiment directory and configuration are correctly specified.

```bash
torchrun --nproc_per_node=4 speakerlab/bin/train.py \
    --config conf/cam++.yaml \
    --gpu 0 1 2 3 \
    --resume True \
    --exp_dir exp/cam++
```

--------------------------------

### Process Wav List with SSL Models

Source: https://context7.com/modelscope/3d-speaker/llms.txt

Process a list of WAV files for speaker verification using SSL models. Input can be a text file containing a list of WAV files.

```bash
python speakerlab/bin/infer_sv_ssl.py \
    --model_id iic/speech_sdpn_ecapa_tdnn_sv_en_voxceleb_16k \
    --wavs wav_list.txt
```
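A list file such as `wav_list.txt` can be generated from a directory of recordings. The sketch below assumes the list format is one WAV path per line, which the source does not spell out, so verify it against the inference script before relying on it. `write_wav_list` is a hypothetical helper.

```python
from pathlib import Path


def write_wav_list(audio_dir, list_path):
    """Write the absolute path of every .wav file under audio_dir, one per line.

    Returns the number of paths written.
    """
    # rglob searches subdirectories too; sorting keeps the output reproducible
    wav_paths = sorted(p.resolve() for p in Path(audio_dir).rglob("*.wav"))
    Path(list_path).write_text(
        "".join(f"{p}\n" for p in wav_paths), encoding="utf-8"
    )
    return len(wav_paths)
```

For example, `write_wav_list("data/my_audio", "wav_list.txt")` would index a hypothetical `data/my_audio` folder, and the resulting file could then be passed to `--wavs`.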

--------------------------------

### Learning Rate Scheduler Configuration

Source: https://context7.com/modelscope/3d-speaker/llms.txt

Configuration for a warmup cosine learning rate scheduler. Specifies minimum and maximum learning rates, and the number of warmup epochs.

```yaml
lr_scheduler:
  obj: speakerlab.process.scheduler.WarmupCosineScheduler
  args:
    min_lr: 1e-4
    max_lr: 0.1
    warmup_epoch: 5
```
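To make the three parameters concrete, the sketch below computes a warmup-then-cosine learning rate per epoch. It assumes a linear warmup from `min_lr` to `max_lr`, a cosine decay back to `min_lr`, and a 60-epoch run. The function `warmup_cosine_lr` and these details are illustrative assumptions, so the library's `WarmupCosineScheduler` may differ, for example by updating every step.

```python
import math


def warmup_cosine_lr(epoch, total_epochs=60, min_lr=1e-4, max_lr=0.1, warmup_epoch=5):
    """Illustrative schedule: linear warmup, then cosine decay to min_lr."""
    if epoch < warmup_epoch:
        # Linear ramp from min_lr up to max_lr over the warmup epochs
        return min_lr + (max_lr - min_lr) * epoch / warmup_epoch
    # Fraction of the decay phase completed, clamped to [0, 1]
    progress = (epoch - warmup_epoch) / max(1, total_epochs - warmup_epoch)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 2, 5, 30, 60):
    print(epoch, f"{warmup_cosine_lr(epoch):.5f}")
```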

--------------------------------

### Run Speaker Verification Experiments on VoxCeleb

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/voxceleb/README.md

Change into the experiment directory of the model you want and execute its run.sh script. Each block below runs one speaker verification recipe on the VoxCeleb dataset; the paths are relative to the repository root.

```sh
cd egs/voxceleb/sv-eres2net/
bash run.sh
```

```sh
cd egs/voxceleb/sv-cam++/
bash run.sh
```

```sh
cd egs/voxceleb/sv-ecapa/
bash run.sh
```

```sh
cd egs/voxceleb/sv-resnet/
bash run.sh
```

```sh
cd egs/voxceleb/sv-res2net/
bash run.sh
```

```sh
cd egs/voxceleb/sv-rdino/
bash run.sh
```

--------------------------------

### Configure Dialogue Detection Execution

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/semantic_speaker/bert/README.md

Run the dialogue detection Python script with BERT-based model parameters and dataset file paths.

```shell
   python bin/run_dialogue_detection.py \
         --model_name_or_path bert-base-chinese \
         --max_seq_length 128 --pad_to_max_length \
         --train_file $json_path/train.dialogue_detection.json \
         --validation_file $json_path/valid.dialogue_detection.json \
         --test_file $json_path/test.dialogue_detection.json \
         --do_train --do_eval --do_predict \
         --per_device_train_batch_size 128 --per_device_eval_batch_size 128 --num_train_epochs 5 \
         --output_dir $output_path --overwrite_output_dir
```

--------------------------------

### Export Speaker Embedding Model to ONNX

Source: https://github.com/modelscope/3d-speaker/blob/main/runtime/onnxruntime/README.md

Use this script to export a speaker embedding model to ONNX format. Ensure ONNX is installed in your Python environment. You can specify different model IDs and output file paths.

```shell
python speakerlab/bin/export_speaker_embedding_onnx.py \
    --experiment_path your/experiment_path/ \
    --model_id iic/speech_eres2net_sv_en_voxceleb_16k \
    --target_onnx_file path/to/save/onnx_model
```

--------------------------------

### Run Inference with Pretrained Models

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/voxceleb/README.md

Use the provided Python script to run inference with various pretrained speaker verification models from ModelScope. Ensure you have the correct model_id and the path to your audio files.

```sh
# CAM++ trained on VoxCeleb
model_id=iic/speech_campplus_sv_en_voxceleb_16k
# Speaker verification: ERes2Net on VoxCeleb
model_id=iic/speech_eres2net_sv_en_voxceleb_16k
# Speaker verification: ERes2Net-large on VoxCeleb
model_id=iic/speech_eres2net_large_sv_en_voxceleb_16k
# Speaker verification: ResNet on VoxCeleb
model_id=iic/speech_resnet_sv_en_voxceleb_16k
# Speaker verification: Res2Net on VoxCeleb
model_id=iic/speech_res2net_sv_en_voxceleb_16k
# Run CAM++ or ERes2Net inference
python speakerlab/bin/infer_sv.py --model_id $model_id --wavs $wav_path
```

```sh
# RDINO trained on VoxCeleb
model_id=iic/speech_rdino_ecapa_tdnn_sv_en_voxceleb_16k
# Run rdino inference
python speakerlab/bin/infer_sv_rdino.py --model_id $model_id --wavs $wav_path
```

--------------------------------

### Add Executable: make_fbank_feature

Source: https://github.com/modelscope/3d-speaker/blob/main/runtime/onnxruntime/bin/CMakeLists.txt

Defines the 'make_fbank_feature' executable and links it with 'utils' and 'feature' libraries.

```cmake
add_executable(make_fbank_feature make_fbank_feature.cpp)
target_link_libraries(make_fbank_feature PUBLIC utils feature)
```

--------------------------------

### Run 3D-Speaker experiments

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/3dspeaker/README.md

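Execute the speaker verification training and evaluation scripts for the various models. Each `cd` / `bash run.sh` pair is an independent recipe, and the `cd` paths are relative to the repository root, so return to the root before running the next pair.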

```sh
# Speaker verification: ERes2Net on 3D-Speaker
cd egs/3dspeaker/sv-eres2net/
bash run.sh
# Speaker verification: CAM++ on 3D-Speaker
cd egs/3dspeaker/sv-cam++/
bash run.sh
# Speaker verification: ECAPA-TDNN on 3D-Speaker
cd egs/3dspeaker/sv-ecapa/
bash run.sh
# Speaker verification: ResNet on 3D-Speaker
cd egs/3dspeaker/sv-resnet/
bash run.sh
# Speaker verification: Res2Net on 3D-Speaker
cd egs/3dspeaker/sv-res2net/
bash run.sh
# Self-supervised speaker verification: RDINO on 3D-Speaker
cd egs/3dspeaker/sv-rdino/
bash run.sh
```

--------------------------------

### Project CMake Configuration

Source: https://github.com/modelscope/3d-speaker/blob/main/runtime/onnxruntime/CMakeLists.txt

Defines the project requirements, C++ standard, and includes necessary build modules for the SpeakerLabEngines project.

```cmake
cmake_minimum_required(VERSION 3.23)
project(SpeakerLabEngines VERSION 0.1)

set(CMAKE_CXX_STANDARD 20)

option(USE_CUDA "Build with CUDA support" OFF)

include_directories(${PROJECT_SOURCE_DIR})

# Fetch third-party library
#include(ExternalProject)
include(FetchContent)
set(FETCHCONTENT_QUIET OFF)
set(FETCHCONTENT_BASE_DIR ${CMAKE_SOURCE_DIR}/third_party)
if (NOT EXISTS ${FETCHCONTENT_BASE_DIR})
    file(MAKE_DIRECTORY ${FETCHCONTENT_BASE_DIR})
endif ()
list(APPEND CMAKE_MODULE_PATH ${CMAKE_CURRENT_SOURCE_DIR}/cmake)

include(cmake/build_json.cmake)
include(cmake/build_onnx.cmake)

add_subdirectory(utils)
add_subdirectory(bin)
add_subdirectory(feature)
add_subdirectory(model)
```

--------------------------------

### Train CAM++ Model

Source: https://context7.com/modelscope/3d-speaker/llms.txt

Distributed training of the CAM++ model on the 3D-Speaker dataset using 4 GPUs. Specify configuration, data paths, and experiment directory.

```bash
torchrun --nproc_per_node=4 speakerlab/bin/train.py \
    --config conf/cam++.yaml \
    --gpu 0 1 2 3 \
    --data data/3dspeaker/train/train.csv \
    --noise data/musan/wav.scp \
    --reverb data/rirs/wav.scp \
    --exp_dir exp/cam++
```

--------------------------------

### Data Augmentation for Speaker Training

Source: https://context7.com/modelscope/3d-speaker/llms.txt

Initializes an audio augmentation pipeline for speaker verification training, applying noise and reverberation based on provided probabilities and file lists. It also includes a WavReader for loading audio with optional speed perturbation and large margin training modes.

```python
from speakerlab.process.processor import SpkVeriAug, WavReader
import torch

# Initialize augmentation pipeline
augmenter = SpkVeriAug(
    aug_prob=0.6,                        # Probability of applying augmentation
    noise_file='data/musan/wav.scp',     # Path to noise file list
    reverb_file='data/rirs/wav.scp'      # Path to RIR file list
)

# Initialize wav reader with speed perturbation
wav_reader = WavReader(
    sample_rate=16000,
    duration=3.0,           # Chunk duration in seconds
    speed_pertub=True,      # Enable speed perturbation (0.9x, 1.0x, 1.1x)
    lm=True                 # Large margin training mode
)

# Load and augment audio
wav, speed_idx = wav_reader('audio.wav')  # wav: [chunk_samples], speed_idx: 0,1,2
augmented_wav = augmenter(wav)
```

--------------------------------

### Configure Speaker Turn Detection Execution

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/semantic_speaker/bert/README.md

Run the speaker turn detection Python script with specific column mappings and model parameters.

```shell
   python bin/run_speaker_turn_detection.py \
          --model_name_or_path bert-base-chinese \
          --max_seq_length 128 --pad_to_max_length \
          --train_file $json_path/train.speaker_turn_detection.json \
          --validation_file $json_path/valid.speaker_turn_detection.json \
          --test_file $json_path/test.speaker_turn_detection.json \
          --do_train --do_eval --do_predict \
          --text_column_name sentence --label_column_name change_point_list --label_num 2 \
          --per_device_train_batch_size 64 --per_device_eval_batch_size 64 --num_train_epochs 5 \
          --output_dir $output_path --overwrite_output_dir
```

--------------------------------

### Compute Scores with Multiple Trials and Custom DCF

Source: https://context7.com/modelscope/3d-speaker/llms.txt

Compute speaker verification scores and metrics using multiple trial files and custom DCF parameters. This allows for detailed analysis across different conditions.

```bash
python speakerlab/bin/compute_score_metrics.py \
    --enrol_data exp/eres2net/embeddings \
    --test_data exp/eres2net/embeddings \
    --scores_dir exp/eres2net/scores \
    --trials trials/cross_device trials/cross_distance trials/cross_dialect \
    --p_target 0.01 \
    --c_miss 1 \
    --c_fa 1
```
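The `--p_target`, `--c_miss`, and `--c_fa` flags correspond to the parameters of the standard detection cost function, `DCF = c_miss * P_miss * p_target + c_fa * P_fa * (1 - p_target)`. The sketch below evaluates that cost and searches for the minimum over thresholds. The helper names are hypothetical, and normalising by the best trivial system is a common convention assumed here, not something confirmed for `compute_score_metrics.py`.

```python
def detection_cost(p_miss, p_fa, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Detection cost for given miss and false-alarm rates."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)


def min_dcf(target_scores, nontarget_scores, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Minimum normalised DCF over all decision thresholds (brute force)."""
    # Candidate thresholds: every observed score, plus +inf to reject everything
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    thresholds.append(float("inf"))
    best = float("inf")
    for t in thresholds:
        # A trial is accepted when its score is >= t
        p_miss = sum(s < t for s in target_scores) / len(target_scores)
        p_fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        best = min(best, detection_cost(p_miss, p_fa, p_target, c_miss, c_fa))
    # Normalise by the cheaper of "always reject" and "always accept"
    default_cost = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return best / default_cost


# Toy scores: well separated, so a zero-cost threshold exists
targets = [0.9, 0.8, 0.75]
nontargets = [0.1, 0.2, 0.3]
print(min_dcf(targets, nontargets))
```

The brute-force loop is quadratic in the number of trials; it is meant to show the definition, not to replace a sorted cumulative-sum implementation on large trial lists.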

--------------------------------

### Complete Training Pipeline Script

Source: https://context7.com/modelscope/3d-speaker/llms.txt

A bash script for the end-to-end training pipeline of speaker verification models on the 3D-Speaker dataset. It includes stages for data preparation, index creation, model training, embedding extraction, and metric computation.

```bash
#!/bin/bash
# Complete training pipeline for CAM++ on 3D-Speaker dataset

set -e
data=data
exp=exp
exp_name=cam++
gpus="0 1 2 3"

# Stage 1: Prepare dataset
echo "Stage 1: Preparing 3D Speaker dataset..."
./local/prepare_data.sh --stage 1 --stop_stage 3 --data ${data}

# Stage 2: Create training data index
echo "Stage 2: Preparing training data index files..."
python local/prepare_data_csv.py --data_dir $data/3dspeaker/train

# Stage 3: Train speaker embedding model
echo "Stage 3: Training the speaker model..."
num_gpu=$(echo $gpus | awk -F ' ' '{print NF}')
torchrun --nproc_per_node=$num_gpu speakerlab/bin/train.py \
    --config conf/cam++.yaml \
    --gpu $gpus \
    --data $data/3dspeaker/train/train.csv \
    --noise $data/musan/wav.scp \
    --reverb $data/rirs/wav.scp \
    --exp_dir $exp/$exp_name

# Stage 4: Extract test embeddings
echo "Stage 4: Extracting speaker embeddings..."
torchrun --nproc_per_node=8 speakerlab/bin/extract.py \
    --exp_dir $exp/$exp_name \
    --data $data/3dspeaker/test/wav.scp \
    --use_gpu --gpu $gpus

# Stage 5: Compute evaluation metrics
echo "Stage 5: Computing score metrics..."
trials="$data/3dspeaker/trials/trials_cross_device"
trials="$trials $data/3dspeaker/trials/trials_cross_distance"
trials="$trials $data/3dspeaker/trials/trials_cross_dialect"
python speakerlab/bin/compute_score_metrics.py \
    --enrol_data $exp/$exp_name/embeddings \
    --test_data $exp/$exp_name/embeddings \
    --scores_dir $exp/$exp_name/scores \
    --trials $trials
```

--------------------------------

### Run Speaker Diarization Experiments

Source: https://github.com/modelscope/3d-speaker/blob/main/README.md

Execute shell scripts to run audio and multimodal speaker diarization experiments.

```sh
# Audio and multimodal Speaker diarization:
cd egs/3dspeaker/speaker-diarization/
bash run_audio.sh
bash run_video.sh
```

--------------------------------

### Run Speaker-Turn Detection Task

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/semantic_speaker/bert/README.md

Execute the speaker-turn detection script using the specified output directory.

```shell
   bash run_speaker_turn_detection.sh exp/
```

--------------------------------

### Configure CMake for ONNX Runtime Integration

Source: https://github.com/modelscope/3d-speaker/blob/main/runtime/onnxruntime/model/CMakeLists.txt

Defines a static library named 'model' and links it against ONNX Runtime dependencies.

```cmake
add_library(model STATIC speaker_embedding_model.cpp)

target_include_directories(model PUBLIC ${ONNX_RUNTIME_INCLUDE_DIRS})
target_link_directories(model PUBLIC ${ONNX_RUNTIME_LIB_DIRS})
target_link_libraries(model PRIVATE onnxruntime)
```

--------------------------------

### Run Audio-only Diarization

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/3dspeaker/speaker-diarization/README.md

Execute the audio-only diarization pipeline using shell scripts.

```sh
bash run_audio.sh
# Use the funasr model to transcribe into Chinese text.
bash run_audio.sh --stop_stage 8
```

--------------------------------

### ERes2Net Model Initialization and Forward Pass

Source: https://context7.com/modelscope/3d-speaker/llms.txt

Initialize and perform a forward pass with the ERes2Net model. This shows how to set up the model with different configurations and obtain speaker embeddings from input features.

```python
from speakerlab.models.eres2net.ERes2Net import ERes2Net
import torch

# Initialize ERes2Net model
model = ERes2Net(
    feat_dim=80,           # Input feature dimension
    embedding_size=192,    # Output embedding dimension
    m_channels=32,         # Base channel multiplier (32 for base, 64 for large)
    num_blocks=[3, 4, 6, 3],  # Blocks per stage
    pooling_func='TSTP',   # Temporal statistics pooling
    two_emb_layer=False    # Single embedding layer
)

# Forward pass
x = torch.randn(10, 300, 80)  # [batch, time, features]
embedding = model(x)  # Output: [10, 192]

print(f"Embedding shape: {embedding.shape}")
print(f"Model parameters: {sum(p.numel() for p in model.parameters())/1e6:.2f}M")
# ERes2Net-base: 6.61M params
# ERes2Net-large (m_channels=64): 22.46M params
```

--------------------------------

### Speaker Clustering with CommonClustering

Source: https://context7.com/modelscope/3d-speaker/llms.txt

Initializes a common clustering backend that supports spectral clustering, UMAP+HDBSCAN, or AHC. It takes speaker embeddings as input and returns cluster labels. Parameters like `min_num_spks`, `max_num_spks`, and `mer_cos` control the clustering process.

```python
from speakerlab.process.cluster import CommonClustering, SpectralCluster
import numpy as np

# Initialize clustering backend
cluster = CommonClustering(
    cluster_type='spectral',  # 'spectral', 'umap_hdbscan', or 'AHC'
    min_num_spks=1,           # Minimum number of speakers
    max_num_spks=15,          # Maximum number of speakers
    min_cluster_size=4,       # Minimum cluster size
    mer_cos=0.8,              # Merge threshold for similar speakers
    pval=0.012                # P-value for affinity pruning
)

# Cluster embeddings (auto-detect number of speakers)
embeddings = np.random.randn(100, 192)  # 100 segments, 192-dim embeddings
labels = cluster(embeddings)
print(f"Detected {labels.max() + 1} speakers")

# Cluster with known speaker count
labels = cluster(embeddings, speaker_num=4)

# Use spectral clustering directly
spectral = SpectralCluster(
    min_num_spks=1,
    max_num_spks=10,
    pval=0.02
)
labels = spectral(embeddings)
```
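
To give some intuition for a merge threshold such as `mer_cos`, the sketch below merges clusters whose centroids have cosine similarity at or above a threshold. It is a simplified greedy scheme written for illustration. `merge_similar_clusters` is a hypothetical helper and should not be read as the logic inside `CommonClustering`.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def merge_similar_clusters(centroids, threshold):
    """Map each cluster index to a merged index (hypothetical helper).

    A cluster joins the first earlier representative whose centroid has
    cosine similarity >= threshold; otherwise it starts a new group.
    """
    representatives = []  # centroids that started a group
    mapping = []
    for centroid in centroids:
        for group_id, rep in enumerate(representatives):
            if cosine(centroid, rep) >= threshold:
                mapping.append(group_id)
                break
        else:
            mapping.append(len(representatives))
            representatives.append(centroid)
    return mapping

centroids = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
print(merge_similar_clusters(centroids, threshold=0.8))  # [0, 0, 1]
```

In this toy setup, a higher threshold makes merging more conservative, so more clusters survive as separate speakers.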

--------------------------------

### Run Language Identification Experiment

Source: https://github.com/modelscope/3d-speaker/blob/main/README.md

Execute a shell script to run language identification experiments.

```sh
# Language identification
cd egs/3dspeaker/language-identification
bash run.sh
```

--------------------------------

### Run Speaker Verification Experiments

Source: https://github.com/modelscope/3d-speaker/blob/main/README.md

Execute shell scripts to run speaker verification experiments with different models (ERes2NetV2, CAM++, ECAPA-TDNN, SDPN) on various datasets (3D-Speaker, VoxCeleb).

```sh
# Speaker verification: ERes2NetV2 on 3D-Speaker dataset
cd egs/3dspeaker/sv-eres2netv2/
bash run.sh
```

```sh
# Speaker verification: CAM++ on 3D-Speaker dataset
cd egs/3dspeaker/sv-cam++/
bash run.sh
```

```sh
# Speaker verification: ECAPA-TDNN on 3D-Speaker dataset
cd egs/3dspeaker/sv-ecapa/
bash run.sh
```

```sh
# Self-supervised speaker verification: SDPN on VoxCeleb dataset
cd egs/voxceleb/sv-sdpn/
bash run.sh
```

--------------------------------

### Train ERes2Net Model

Source: https://context7.com/modelscope/3d-speaker/llms.txt

Distributed training of the ERes2Net model on the VoxCeleb dataset using 8 GPUs. Specify configuration, data paths, and experiment directory.

```bash
torchrun --nproc_per_node=8 speakerlab/bin/train.py \
    --config conf/eres2net.yaml \
    --gpu 0 1 2 3 4 5 6 7 \
    --data data/voxceleb/train/train.csv \
    --exp_dir exp/eres2net
```

--------------------------------

### Download Aishell-4 Data

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/semantic_speaker/bert/README.md

Use wget to download the Aishell-4 dataset archives: three training archives and one test archive. The approximate size of each archive is noted in the comments.

```shell
# Aishell-4 data download. Other mirrors are listed on the OpenSLR website (https://www.openslr.org).
wget https://us.openslr.org/resources/111/train_L.tar.gz # [7.0G]
wget https://us.openslr.org/resources/111/train_M.tar.gz # [25G]
wget https://us.openslr.org/resources/111/train_S.tar.gz # [14G]
wget https://us.openslr.org/resources/111/test.tar.gz # [5.2G]
```

--------------------------------

### Quick Audio Diarization Inference

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/3dspeaker/speaker-diarization/README.md

Run inference directly using the Python script for audio diarization.

```sh
# audio-only diarization
python speakerlab/bin/infer_diarization.py --wav [wav_list OR wav_path] --out_dir [out_dir]
# enable overlap detection
python speakerlab/bin/infer_diarization.py --wav [wav_list OR wav_path] --out_dir [out_dir] --include_overlap --hf_access_token [hf_access_token]
# for more configurable parameters, you can refer to speakerlab/bin/infer_diarization.py
```

--------------------------------

### ONNX Runtime Download URL Configuration

Source: https://github.com/modelscope/3d-speaker/blob/main/runtime/onnxruntime/README.md

CMake script snippet to set the ONNX Runtime download URL based on the operating system (Windows, macOS, Linux).

```cmake
if (WIN32)
    set(ONNX_RUNTIME_URL "https://github.com/microsoft/onnxruntime/releases/download/v${ONNX_RUNTIME_VERSION}/onnxruntime-win-x64-${ONNX_RUNTIME_VERSION}.zip")
elseif(APPLE)
    if(CMAKE_SYSTEM_PROCESSOR MATCHES "arm64")
        set(ONNX_RUNTIME_URL "https://github.com/microsoft/onnxruntime/releases/download/v${ONNX_RUNTIME_VERSION}/onnxruntime-osx-arm64-${ONNX_RUNTIME_VERSION}.tgz")
    else ()
        set(ONNX_RUNTIME_URL "https://github.com/microsoft/onnxruntime/releases/download/v${ONNX_RUNTIME_VERSION}/onnxruntime-osx-x86_64-${ONNX_RUNTIME_VERSION}.tgz")
    endif ()
elseif(UNIX AND NOT APPLE)
    set(ONNX_RUNTIME_URL "https://github.com/microsoft/onnxruntime/releases/download/v${ONNX_RUNTIME_VERSION}/onnxruntime-linux-x64-${ONNX_RUNTIME_VERSION}.tgz")
else()
    message(FATAL_ERROR "Unsupported operating system")
endif()
```
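
The same platform branching can be mirrored in Python, for example in a helper script that downloads the archive before configuring the build. This is only a sketch: `onnxruntime_url` is a hypothetical function, the version string in the usage line is an arbitrary example, and the checks are simplified (exact string comparison instead of CMake's `MATCHES`, and `Linux` instead of every non-Apple Unix).

```python
def onnxruntime_url(version, system, machine):
    """Build the release URL following the CMake branches (illustrative only).

    `system` and `machine` follow the values returned by
    platform.system() and platform.machine().
    """
    base = f"https://github.com/microsoft/onnxruntime/releases/download/v{version}"
    if system == "Windows":
        name = f"onnxruntime-win-x64-{version}.zip"
    elif system == "Darwin":
        arch = "arm64" if machine == "arm64" else "x86_64"
        name = f"onnxruntime-osx-{arch}-{version}.tgz"
    elif system == "Linux":
        name = f"onnxruntime-linux-x64-{version}.tgz"
    else:
        raise ValueError(f"Unsupported operating system: {system}")
    return f"{base}/{name}"

# "1.16.3" is an example value, not a version taken from the project.
print(onnxruntime_url("1.16.3", "Linux", "x86_64"))
```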

--------------------------------

### SDPN Model Inference

Source: https://context7.com/modelscope/3d-speaker/llms.txt

Perform speaker verification inference using the SDPN model. Requires specifying the model ID and a list of audio files.

```bash
python speakerlab/bin/infer_sv_ssl.py \
    --model_id iic/speech_sdpn_ecapa_tdnn_sv_en_voxceleb_16k \
    --wavs speaker1.wav speaker2.wav
```

--------------------------------

### Add Static Library to CMake

Source: https://github.com/modelscope/3d-speaker/blob/main/runtime/onnxruntime/utils/CMakeLists.txt

Defines a static library named 'utils' and includes 'wav_reader.cpp' as its source file. This is used for building reusable components in C++ projects.

```cmake
add_library(utils STATIC
    wav_reader.cpp
)
```

--------------------------------

### Verify ONNX Model Export with ONNX Runtime

Source: https://github.com/modelscope/3d-speaker/blob/main/runtime/onnxruntime/README.md

This function verifies the exported ONNX model by comparing its output with the original PyTorch model's output using ONNX Runtime. It loads the model, runs inference with both, and prints the cosine similarity between the results. The expected result should be close to 1.0.

```python
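import onnxruntime as ort
import torch

# get_args, logger, export_onnx_file, and the build_model_from_* helpers
# are expected to be defined elsewhere in the same export script.
def main():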
    args = get_args()
    logger.info(f"{args}")

    model_id = args.model_id
    experiment_path = args.experiment_path
    target_onnx_file = args.target_onnx_file
    if model_id is not None:
        speaker_embedding_model = build_model_from_modelscope_id(
            model_id, experiment_path
        )
    else:
        speaker_embedding_model = build_model_from_custom_work_path(
            experiment_path
        )

    logger.info(f"Load speaker embedding finished, export to onnx")
    # let function `export_onnx_file` return the random tensor
    inputs = export_onnx_file(speaker_embedding_model, target_onnx_file)

    with torch.no_grad():
        res0 = speaker_embedding_model(inputs)
    ort_sess = ort.InferenceSession(target_onnx_file)
    res1 = ort_sess.run(None, {'feature': inputs.numpy()})[0]
    res1 = torch.from_numpy(res1) # Here, convert it to torch.tensor

    cos = torch.nn.CosineSimilarity(dim=1, eps=1e-6)
    print(cos(res0, res1)) # The expected result should be tensor([1.0000])
```
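
The final check boils down to a cosine similarity between the PyTorch and ONNX Runtime embeddings. A dependency-free sketch of that comparison follows. The helper names, the example vectors, and the `1e-3` tolerance are assumptions chosen for illustration, and the `eps` handling only loosely mirrors `torch.nn.CosineSimilarity`.

```python
import math

def cosine_similarity(a, b, eps=1e-6):
    """Cosine similarity of two vectors, guarding against zero norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)

def embeddings_match(a, b, tol=1e-3):
    """Return True when the similarity is within `tol` of 1.0 (assumed tolerance)."""
    return abs(1.0 - cosine_similarity(a, b)) <= tol

# Made-up vectors standing in for the two model outputs.
pytorch_out = [0.12, -0.40, 0.88]
onnx_out = [0.1200001, -0.3999999, 0.8800002]
print(embeddings_match(pytorch_out, onnx_out))
```

Turning the printed similarity into a boolean check like this makes it easier to fail an automated export step when the two outputs diverge.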

--------------------------------

### Run Speaker Verification Experiments

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/cnceleb/README.md

Execute training or evaluation scripts for various speaker verification models on the CN-Celeb dataset.

```sh
# The paths below are relative to the repository root; return there before switching recipes.
# Speaker verification: ERes2Net on CN-Celeb
cd egs/cnceleb/sv-eres2net/
bash run.sh
# Speaker verification: CAM++ on CN-Celeb
cd egs/cnceleb/sv-cam++/
bash run.sh
# Speaker verification: ECAPA-TDNN on CN-Celeb
cd egs/cnceleb/sv-ecapa/
bash run.sh
# Speaker verification: ResNet on CN-Celeb
cd egs/cnceleb/sv-resnet/
bash run.sh
# Speaker verification: Res2Net on CN-Celeb
cd egs/cnceleb/sv-res2net/
bash run.sh
# Self-supervised speaker verification: RDINO on CN-Celeb
cd egs/cnceleb/sv-rdino/
bash run.sh
```

--------------------------------

### Define Feature Library and Dependencies in CMake

Source: https://github.com/modelscope/3d-speaker/blob/main/runtime/onnxruntime/feature/CMakeLists.txt

Configures the feature library as a static library and links it with the utils dependency.

```cmake
add_library(feature STATIC
    feature_basic.cpp
    feature_fbank.h
    feature_functions.cpp
    feature_fbank.cpp
    feature_common.cpp
)
target_link_libraries(feature PUBLIC utils)
```

--------------------------------

### Extract Speaker Embeddings Command

Source: https://github.com/modelscope/3d-speaker/blob/main/runtime/onnxruntime/README.md

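Command to extract speaker embeddings using the compiled `extract_speaker_embedding` binary. The arguments are, in order: the fbank configuration file, the ONNX model file, the input wav.scp file, the output embedding scp file, and the directory in which to save the embeddings.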

```shell
./extract_speaker_embedding path/to/fbank_config.json path/to/your/onnx_file /path/to/your/wav.scp /path/to/embedding_scp_file /path/to/save/embeddings/
```
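
If you need to generate the wav.scp file, the sketch below builds its lines from a list of audio paths. It assumes the Kaldi-style convention of one utterance ID followed by the file path per line, which is not confirmed by the snippet above, so check the runtime README for the exact format. `build_wav_scp_lines` and the example paths are hypothetical.

```python
from pathlib import PurePosixPath

def build_wav_scp_lines(wav_paths):
    """Return 'utt_id path' lines, using the file stem as the utterance ID.

    Assumed format (Kaldi-style); adjust if the binary expects otherwise.
    """
    lines = []
    for path in sorted(PurePosixPath(p) for p in wav_paths):
        lines.append(f"{path.stem} {path}")
    return lines

lines = build_wav_scp_lines(["/data/wavs/spk2_utt1.wav", "/data/wavs/spk1_utt1.wav"])
print("\n".join(lines))
# spk1_utt1 /data/wavs/spk1_utt1.wav
# spk2_utt1 /data/wavs/spk2_utt1.wav
```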

--------------------------------

### Project Citations

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/semantic_speaker/bert/README.md

BibTeX entries for the relevant research papers and datasets used in the project.

```bibtex
@inproceedings{Luyao2023ACL,
	author       = {Luyao Cheng and Siqi Zheng and Qinglin Zhang and Hui Wang and Yafeng Chen and Qian Chen},
	title        = {Exploring Speaker-Related Information in Spoken Language Understanding for Better Speaker Diarization},
	booktitle    = {Findings of the {ACL} 2023, Toronto, Canada, July 9-14, 2023},
	pages        = {14068--14077},
	year         = {2023},
}
@article{Cheng2023ImprovingSD,
  title={Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation},
  author={Luyao Cheng and Siqi Zheng and Qinglin Zhang and Haibo Wang and Yafeng Chen and Qian Chen and Shiliang Zhang},
  journal={ArXiv},
  year={2023},
  volume={abs/2309.10456},
}
```

```bibtex
@inproceedings{AISHELL-4_2021,
    title={AISHELL-4: An Open Source Dataset for Speech Enhancement, Separation, Recognition and Speaker Diarization in Conference Scenario},
    author={Yihui Fu and Luyao Cheng and Shubo Lv and Yukai Jv and Yuxiang Kong and Zhuo Chen and Yanxin Hu and Lei Xie and Jian Wu and Hui Bu and Xin Xu and Jun Du and Jingdong Chen},
    booktitle={Interspeech},
    url={https://arxiv.org/abs/2104.03603},
    year={2021}
}

@inproceedings{Yu2022M2MeT,
    title={M2{M}e{T}: The {ICASSP} 2022 Multi-Channel Multi-Party Meeting Transcription Challenge},
    author={Yu, Fan and Zhang, Shiliang and Fu, Yihui and Xie, Lei and Zheng, Siqi and Du, Zhihao and Huang, Weilong and Guo, Pengcheng and Yan, Zhijie and Ma, Bin and Xu, Xin and Bu, Hui},
    booktitle={Proc. ICASSP},
    year={2022},
    organization={IEEE}
}
```

--------------------------------

### Integrate Diarization in Python

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/3dspeaker/speaker-diarization/README.md

Use the Diarization3Dspeaker class to integrate the pipeline into custom Python scripts.

```python
from speakerlab.bin.infer_diarization import Diarization3Dspeaker
wav_path = "audio.wav"
pipeline = Diarization3Dspeaker()
print(pipeline(wav_path, wav_fs=None, speaker_num=None)) # can also accept WAV data as input
```
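
Once you have the diarization result, a common next step is to aggregate it per speaker. The sketch below assumes the result is a list of `[start_seconds, end_seconds, speaker_id]` entries. That shape is an assumption for illustration, so inspect the actual return value of `Diarization3Dspeaker` before relying on it. `speaking_time_per_speaker` is a hypothetical helper.

```python
from collections import defaultdict

def speaking_time_per_speaker(segments):
    """Sum segment durations per speaker.

    Assumes each segment is [start_seconds, end_seconds, speaker_id].
    """
    totals = defaultdict(float)
    for start, end, speaker in segments:
        totals[speaker] += end - start
    return dict(totals)

# Made-up segments standing in for a pipeline result.
segments = [[0.0, 2.5, 0], [2.5, 4.0, 1], [4.0, 5.0, 0]]
print(speaking_time_per_speaker(segments))  # {0: 3.5, 1: 1.5}
```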

--------------------------------

### Link Library in CMake

Source: https://github.com/modelscope/3d-speaker/blob/main/runtime/onnxruntime/utils/CMakeLists.txt

Calls `target_link_libraries` on the `utils` target, passing `${CMAKE_CURRENT_SOURCE_DIR}` (the directory that contains the current CMakeLists.txt) as a PUBLIC link item. Note that `utils` is the target being configured here, not a library being linked into another target.

```cmake
target_link_libraries(utils PUBLIC ${CMAKE_CURRENT_SOURCE_DIR})
```

--------------------------------

### RDINO Model Inference

Source: https://context7.com/modelscope/3d-speaker/llms.txt

Perform speaker verification inference using the RDINO model. Requires specifying the model ID and a list of audio files.

```bash
python speakerlab/bin/infer_sv_ssl.py \
    --model_id damo/speech_rdino_ecapa_tdnn_sv_en_voxceleb_16k \
    --wavs speaker1.wav speaker2.wav
```

--------------------------------

### Run Classic Language Identification

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/3dspeaker/language-identification/README.md

Execute the classic language identification script, which uses only ERes2Net or CAM++ to extract speaker embeddings.

```sh
bash run.sh
```