### Install Dependencies

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/3dspeaker/language-identification/README.md

Install the required packages by running the following command.

```sh
pip install -r requirements.txt
```

--------------------------------

### Install Dependencies and Run Script

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/ava-asd/talknet/README.md

Installs project dependencies using pip and executes the main run script. Ensure ffmpeg is installed separately.

```sh
pip install -r requirements.txt
bash run.sh
```

```sh
sudo apt-get update
sudo apt-get install ffmpeg
```

```sh
conda install ffmpeg
```

--------------------------------

### Install ModelScope and Run Inference

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/3dspeaker/sv-cam++/README.md

Install the ModelScope library and run inference using a pretrained CAM++ model. Specify the model ID and the path to your audio files.

```sh
# Install modelscope
pip install modelscope
# CAM++ trained on 3D-Speaker
model_id=damo/speech_campplus_sv_zh-cn_3dspeaker_16k
# CAM++ trained on 200k labeled speakers
model_id=damo/speech_campplus_sv_zh-cn_16k-common
# Run inference
python speakerlab/bin/infer_sv.py --model_id $model_id --wavs $wav_path
```

--------------------------------

### Install 3D-Speaker Toolkit

Source: https://github.com/modelscope/3d-speaker/blob/main/README.md

Clone the repository, create and activate a conda environment, and install the required dependencies.

```sh
git clone https://github.com/modelscope/3D-Speaker.git && cd 3D-Speaker
conda create -n 3D-Speaker python=3.8
conda activate 3D-Speaker
pip install -r requirements.txt
```

--------------------------------

### Install ModelScope for Inference

Source: https://github.com/modelscope/3d-speaker/blob/main/README.md

Install the modelscope library to use pretrained models for inference.

```sh
# Install modelscope
pip install modelscope
```

--------------------------------

### Install 3D-Speaker and Dependencies

Source: https://context7.com/modelscope/3d-speaker/llms.txt

Commands to clone the repository, set up the conda environment, and install required packages.

```bash
# Clone the repository
git clone https://github.com/modelscope/3D-Speaker.git && cd 3D-Speaker

# Create and activate conda environment
conda create -n 3D-Speaker python=3.8
conda activate 3D-Speaker

# Install dependencies
pip install -r requirements.txt

# Install modelscope for pretrained model access
pip install modelscope
```

--------------------------------

### Install ModelScope and Run RDINO Inference

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/3dspeaker/sv-rdino/README.md

Installs the ModelScope library and demonstrates how to run inference with the RDINO pretrained model for speaker verification. Ensure you have audio files to test with.

```sh
# Install modelscope
pip install modelscope
# RDINO trained on 3D-Speaker
model_id=damo/speech_rdino_ecapa_tdnn_sv_zh-cn_3dspeaker_16k
# Run inference
python speakerlab/bin/infer_sv_rdino.py --model_id $model_id --wavs $wav_path
```

--------------------------------

### Install ModelScope

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/voxceleb/README.md

Install the ModelScope library using pip. This is a prerequisite for using pretrained models for inference.

```sh
pip install modelscope
```

--------------------------------

### Install ModelScope and Run Inference

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/cnceleb/sv-eres2net/README.md

Install the ModelScope library and use the provided script to run inference with a pretrained ERes2Net model. Specify the model ID and the path to your audio files.

```sh
# Install modelscope
pip install modelscope
# ERes2Net trained on CNCeleb
model_id=damo/speech_eres2net_base_sv_zh-cn_cnceleb_16k
# ERes2Net trained on 200k labeled speakers
model_id=damo/speech_eres2net_sv_zh-cn_16k-common
# Run inference
python speakerlab/bin/infer_sv.py --model_id $model_id --wavs $wav_path
```

--------------------------------

### Run Multimodal Diarization

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/3dspeaker/speaker-diarization/README.md

Install ffmpeg and execute the multimodal diarization pipeline.

```sh
sudo apt-get update
sudo apt-get install ffmpeg
bash run_video.sh
```

--------------------------------

### Install ModelScope and Run ERes2Net Inference

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/3dspeaker/sv-eres2net/README.md

Install the ModelScope library and then use the provided script to run inference with a specified ERes2Net model ID and audio file path. Ensure you have the correct model ID for the desired pretrained model.

```sh
# Install modelscope
pip install modelscope
# ERes2Net trained on 3D-Speaker
model_id=damo/speech_eres2net_large_sv_zh-cn_3dspeaker_16k
# ERes2Net trained on 200k labeled speakers
model_id=damo/speech_eres2net_sv_zh-cn_16k-common
# Run inference
python speakerlab/bin/infer_sv.py --model_id $model_id --wavs $wav_path
```

--------------------------------

### Install ModelScope and Run Inference

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/voxceleb/sv-eres2netv2/README.md

Install the ModelScope library and use the provided script to extract speaker embeddings from audio files using a pretrained ERes2NetV2 model. Ensure you have the model ID and the path to your audio files.

```sh
# Install modelscope
pip install modelscope
# ERes2NetV2 trained on 200k labeled speakers
model_id=iic/speech_eres2netv2_sv_zh-cn_16k-common
# Run inference
python speakerlab/bin/infer_sv.py --model_id $model_id --wavs $wav_path
```

--------------------------------

### Run SDPN Inference

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/voxceleb/sv-sdpn/README.md

Install the modelscope library and execute the inference script using the specified pretrained model ID.

```sh
# Install modelscope
pip install modelscope
# SDPN trained on VoxCeleb
model_id=iic/speech_sdpn_ecapa_tdnn_sv_en_voxceleb_16k
# Run inference
python speakerlab/bin/infer_sv_ssl.py --model_id $model_id
```

--------------------------------

### Install ModelScope and Run ECAPA-TDNN Inference

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/cnceleb/sv-ecapa/README.md

Install the ModelScope library and use the provided script to extract speaker embeddings with the ECAPA-TDNN pretrained model. Ensure you have the path to your audio files ready.

```sh
# Install modelscope
pip install modelscope
# ECAPA-TDNN trained on CNCeleb
model_id=damo/speech_ecapa-tdnn_sv_zh-cn_cnceleb_16k
# Run inference
python speakerlab/bin/infer_sv.py --model_id $model_id --wavs $wav_path
```

--------------------------------

### Install and Run ECAPA-TDNN Inference

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/3dspeaker/sv-ecapa/README.md

Installs the ModelScope library and runs inference using a pretrained ECAPA-TDNN model to extract speaker embeddings. Ensure you have the audio file path ready.

```sh
# Install modelscope
pip install modelscope
# ECAPA-TDNN trained on 3D-Speaker
model_id=damo/speech_ecapa-tdnn_sv_zh-cn_3dspeaker_16k
# Run inference
python speakerlab/bin/infer_sv.py --model_id $model_id --wavs $wav_path
```

--------------------------------

### Wav.scp File Format

Source: https://github.com/modelscope/3d-speaker/blob/main/runtime/onnxruntime/README.md

Example format for the wav.scp file, which maps utterance IDs to the paths of WAV audio files.

```text
utt_id_1 /path/to/wav_1.wav
utt_id_2 /path/to/wav_2.wav
....
```

--------------------------------

### Build ONNX Runtime

Source: https://github.com/modelscope/3d-speaker/blob/main/runtime/onnxruntime/README.md

Commands to build ONNX Runtime. Ensure cmake and gcc are installed. The build output will be in the 'build' directory.

```shell
cd runtime/onnxruntime/
mkdir build/ # you can change the folder name
cd build/
cmake ..
make
```

--------------------------------

### Perform inference with Modelscope pretrained models

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/3dspeaker/README.md

Install the Modelscope library and run inference scripts for CAM++, ERes2Net, or RDINO models.

```sh
# Install modelscope
pip install modelscope
# CAM++ trained on 3D-Speaker
model_id=iic/speech_campplus_sv_zh-cn_3dspeaker_16k
# CAM++ trained on 200k labeled speakers
model_id=iic/speech_campplus_sv_zh-cn_16k-common
# ERes2Net trained on 3D-Speaker
model_id=iic/speech_eres2net_large_sv_zh-cn_3dspeaker_16k
# ERes2Net trained on 200k labeled speakers
model_id=iic/speech_eres2net_sv_zh-cn_16k-common
# Run CAM++ or ERes2Net inference
python speakerlab/bin/infer_sv.py --model_id $model_id --wavs $wav_path
# RDINO trained on 3D-Speaker
model_id=iic/speech_rdino_ecapa_tdnn_sv_zh-cn_3dspeaker_16k
# Run RDINO inference
python speakerlab/bin/infer_sv_rdino.py --model_id $model_id --wavs $wav_path
```

--------------------------------

### YAML Training Configuration for CAM++

Source: https://context7.com/modelscope/3d-speaker/llms.txt

An example YAML configuration file for training the CAM++ model. It specifies parameters for training, audio processing, model architecture, loss function (ArcMarginLoss with margin scheduling), and optimizer (SGD).

```yaml
# Training configuration for CAM++ model

# Basic training parameters
num_epoch: 60
save_epoch_freq: 5
log_batch_freq: 100
batch_size: 256
num_workers: 16

# Audio parameters
wav_len: 3.0          # Duration in seconds
sample_rate: 16000
aug_prob: 0.2         # Augmentation probability
speed_pertub: True    # Enable speed perturbation

# Model parameters
fbank_dim: 80
embedding_size: 512
num_classes: 5994     # Number of speakers in training set

# Learning rate
lr: 0.1
min_lr: 1e-4

# Model architecture
embedding_model:
  obj: speakerlab.models.campplus.DTDNN.CAMPPlus
  args:
    feat_dim: 80
    embedding_size: 512

# Loss function with margin scheduling
loss:
  obj: speakerlab.loss.margin_loss.ArcMarginLoss
  args:
    scale: 32.0
    margin: 0.2
    easy_margin: False

margin_scheduler:
  obj: speakerlab.process.scheduler.MarginScheduler
  args:
    initial_margin: 0.0
    final_margin: 0.2
    increase_start_epoch: 15
    fix_epoch: 25

# Optimizer
optimizer:
  obj: torch.optim.SGD
  args:
    lr: 0.1
    momentum: 0.9
    nesterov: True
    weight_decay: 0.0001
```

--------------------------------

### Perform Inference with Pretrained Models

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/cnceleb/README.md

Install Modelscope and run inference using specific pretrained model IDs for speaker verification tasks.

```sh
# Install modelscope
pip install modelscope
# CAM++ trained on CN-Celeb
model_id=iic/speech_campplus_sv_cn_cnceleb_16k
# CAM++ trained on 200k labeled speakers
model_id=iic/speech_campplus_sv_zh-cn_16k-common
# ERes2Net-base trained on CN-Celeb
model_id=iic/speech_eres2net_base_sv_zh-cn_cnceleb_16k
# ERes2Net-large trained on CN-Celeb
model_id=iic/speech_eres2net_large_sv_zh-cn_cnceleb_16k
# ERes2Net trained on 200k labeled speakers
model_id=iic/speech_eres2net_sv_zh-cn_16k-common
# Run CAM++ or ERes2Net inference
python speakerlab/bin/infer_sv.py --model_id $model_id --wavs $wav_path
# RDINO trained on CN-Celeb
model_id=iic/speech_rdino_ecapa_tdnn_sv_zh-cn_cnceleb_16k
# Run RDINO inference
python speakerlab/bin/infer_sv_rdino.py --model_id $model_id --wavs $wav_path
```

--------------------------------

### CAM++ Model Initialization and Forward Pass

Source: https://context7.com/modelscope/3d-speaker/llms.txt

Initialize and perform a forward pass with the CAM++ model. This demonstrates setting up the model with specified dimensions and passing random input data to get embeddings.

```python
from speakerlab.models.campplus.DTDNN import CAMPPlus
import torch

# Initialize CAM++ model
model = CAMPPlus(
    feat_dim=80,            # Input feature dimension (Fbank)
    embedding_size=192,     # Output embedding dimension
    growth_rate=32,         # DenseNet growth rate
    bn_size=4,              # Bottleneck size multiplier
    init_channels=128,      # Initial channel count
    memory_efficient=True   # Use checkpointing for memory efficiency
)

# Forward pass (input: [batch, time, features])
x = torch.randn(16, 300, 80)  # 16 samples, 300 frames, 80 mel bins
embedding = model(x)          # Output: [16, 192]

print(f"Embedding shape: {embedding.shape}")
print(f"Model parameters: {sum(p.numel() for p in model.parameters())/1e6:.2f}M")
```

--------------------------------

### Extract Embeddings with CAM++

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/cnceleb/sv-cam++/README.md

Instructions for installing ModelScope and running inference using a specified model ID and audio path.

```sh
# Install modelscope
pip install modelscope
# CAM++ trained on CNCeleb
model_id=damo/speech_campplus_sv_cn_cnceleb_16k
# CAM++ trained on 200k labeled speakers
model_id=damo/speech_campplus_sv_zh-cn_16k-common
# Run inference
python speakerlab/bin/infer_sv.py --model_id $model_id --wavs $wav_path
```

--------------------------------

### Run RDINO inference for speaker embedding extraction

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/voxceleb/sv-rdino/README.md

Install the modelscope library and execute the inference script using the specified model ID and input audio path.

```sh
# Install modelscope
pip install modelscope
# RDINO trained on VoxCeleb
model_id=damo/speech_rdino_ecapa_tdnn_sv_en_voxceleb_16k
# Run inference
python speakerlab/bin/infer_sv_rdino.py --model_id $model_id --wavs $wav_path
```

--------------------------------

### Extract Speaker Embeddings with Res2Net

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/3dspeaker/sv-res2net/README.md

Use this command to install the ModelScope library and run inference on audio files using the pretrained Res2Net model.

```sh
# Install modelscope
pip install modelscope
# Res2Net trained on 3D-Speaker-Dataset
model_id=iic/speech_res2net_sv_zh-cn_3dspeaker_16k
# Run inference
python speakerlab/bin/infer_sv.py --model_id $model_id --wavs $wav_path
```

--------------------------------

### Extract speaker embeddings using ECAPA-TDNN

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/voxceleb/sv-ecapa/README.md

Use this command to install the ModelScope library and run inference on audio files using the pretrained ECAPA-TDNN model.

```sh
# Install modelscope
pip install modelscope
# ECAPA-TDNN trained on VoxCeleb
model_id=damo/speech_ecapa-tdnn_sv_en_voxceleb_16k
# Run inference
python speakerlab/bin/infer_sv.py --model_id $model_id --wavs $wav_path
```

--------------------------------

### Extract speaker embeddings using ERes2Net

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/voxceleb/sv-eres2net/README.md

Use the provided shell commands to install ModelScope and run inference on audio files using pretrained ERes2Net models.

```sh
# Install modelscope
pip install modelscope
# ERes2Net trained on VoxCeleb
model_id=damo/speech_eres2net_sv_en_voxceleb_16k
# ERes2Net trained on 200k labeled speakers
model_id=damo/speech_eres2net_sv_zh-cn_16k-common
# Run inference
python speakerlab/bin/infer_sv.py --model_id $model_id --wavs $wav_path
```

--------------------------------

### Extract Speaker Embeddings with ResNet

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/3dspeaker/sv-resnet/README.md

Use this command to install the ModelScope library and run inference on a specified audio file using the ResNet34 model trained on the 3D-Speaker dataset.

```sh
# Install modelscope
pip install modelscope
# ResNet34 trained on 3D-Speaker-Dataset
model_id=iic/speech_resnet34_sv_zh-cn_3dspeaker_16k
# Run inference
python speakerlab/bin/infer_sv.py --model_id $model_id --wavs $wav_path
```

--------------------------------

### Download Alimeeting Data

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/semantic_speaker/bert/README.md

Use wget to download the Alimeeting dataset archives. Ensure sufficient disk space as files are large.

```shell
# Alimeeting data download
# [73.24G] AliMeeting Train set, 8-channel microphone array speech
wget https://speech-lab-share-data.oss-cn-shanghai.aliyuncs.com/AliMeeting/openlr/Train_Ali_far.tar.gz
# [22.85G] AliMeeting Train set, headset microphone speech
wget https://speech-lab-share-data.oss-cn-shanghai.aliyuncs.com/AliMeeting/openlr/Train_Ali_near.tar.gz
# [3.42G] AliMeeting Eval set, 8-channel microphone array speech, headset microphone speech
wget https://speech-lab-share-data.oss-cn-shanghai.aliyuncs.com/AliMeeting/openlr/Eval_Ali.tar.gz
# [8.90G] AliMeeting Test set, 8-channel microphone array speech, headset microphone speech
wget https://speech-lab-share-data.oss-cn-shanghai.aliyuncs.com/AliMeeting/openlr/Test_Ali.tar.gz
```

--------------------------------

### Add Executable: read_and_describe_wav

Source: https://github.com/modelscope/3d-speaker/blob/main/runtime/onnxruntime/bin/CMakeLists.txt

Defines the 'read_and_describe_wav' executable and links it with the 'utils' library.

```cmake
add_executable(read_and_describe_wav read_and_describe_wav.cpp)
target_link_libraries(read_and_describe_wav utils)
```

--------------------------------

### Resume Training from Checkpoint

Source: https://context7.com/modelscope/3d-speaker/llms.txt

Resume training for a CAM++ model from a previously saved checkpoint. Ensure the experiment directory and configuration are correctly specified.

```bash
torchrun --nproc_per_node=4 speakerlab/bin/train.py \
  --config conf/cam++.yaml \
  --gpu 0 1 2 3 \
  --resume True \
  --exp_dir exp/cam++
```

--------------------------------

### Process Wav List with SSL Models

Source: https://context7.com/modelscope/3d-speaker/llms.txt

Process a list of WAV files for speaker verification using SSL models. Input can be a text file containing a list of WAV files.

```bash
python speakerlab/bin/infer_sv_ssl.py \
  --model_id iic/speech_sdpn_ecapa_tdnn_sv_en_voxceleb_16k \
  --wavs wav_list.txt
```

--------------------------------

### Learning Rate Scheduler Configuration

Source: https://context7.com/modelscope/3d-speaker/llms.txt

Configuration for a warmup cosine learning rate scheduler. Specifies minimum and maximum learning rates, and the number of warmup epochs.

```yaml
lr_scheduler:
  obj: speakerlab.process.scheduler.WarmupCosineScheduler
  args:
    min_lr: 1e-4
    max_lr: 0.1
    warmup_epoch: 5
```

--------------------------------

### Run Speaker Verification Experiments on VoxCeleb

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/voxceleb/README.md

Navigate to the specific experiment directory and execute the run.sh script for each model. This is used for setting up and running different speaker verification models on the VoxCeleb dataset.

```sh
cd egs/voxceleb/sv-eres2net/
bash run.sh
```

```sh
cd egs/voxceleb/sv-cam++/
bash run.sh
```

```sh
cd egs/voxceleb/sv-ecapa/
bash run.sh
```

```sh
cd egs/voxceleb/sv-resnet/
bash run.sh
```

```sh
cd egs/voxceleb/sv-res2net/
bash run.sh
```

```sh
cd egs/voxceleb/sv-rdino/
bash run.sh
```

--------------------------------

### Configure Dialogue Detection Execution

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/semantic_speaker/bert/README.md

Run the dialogue detection Python script with BERT-based model parameters and dataset file paths.

```shell
python bin/run_dialogue_detection.py \
  --model_name_or_path bert-base-chinese \
  --max_seq_length 128 --pad_to_max_length \
  --train_file $json_path/train.dialogue_detection.json \
  --validation_file $json_path/valid.dialogue_detection.json \
  --test_file $json_path/test.dialogue_detection.json \
  --do_train --do_eval --do_predict \
  --per_device_train_batch_size 128 --per_device_eval_batch_size 128 --num_train_epochs 5 \
  --output_dir $output_path --overwrite_output_dir
```

--------------------------------

### Export Speaker Embedding Model to ONNX

Source: https://github.com/modelscope/3d-speaker/blob/main/runtime/onnxruntime/README.md

Use this script to export a speaker embedding model to ONNX format. Ensure ONNX is installed in your Python environment. You can specify different model IDs and output file paths.

```shell
python speakerlab/bin/export_speaker_embedding_onnx.py \
  --experiment_path your/experiment_path/ \
  --model_id iic/speech_eres2net_sv_en_voxceleb_16k \
  --target_onnx_file path/to/save/onnx_model
```

--------------------------------

### Run Inference with Pretrained Models

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/voxceleb/README.md

Use the provided Python script to run inference with various pretrained speaker verification models from ModelScope. Ensure you have the correct model_id and the path to your audio files.

```sh
# CAM++ trained on VoxCeleb
model_id=iic/speech_campplus_sv_en_voxceleb_16k
# ERes2Net trained on VoxCeleb
model_id=iic/speech_eres2net_sv_en_voxceleb_16k
# ERes2Net-large trained on VoxCeleb
model_id=iic/speech_eres2net_large_sv_en_voxceleb_16k
# ResNet trained on VoxCeleb
model_id=iic/speech_resnet_sv_en_voxceleb_16k
# Res2Net trained on VoxCeleb
model_id=iic/speech_res2net_sv_en_voxceleb_16k
# Run inference with the selected model
python speakerlab/bin/infer_sv.py --model_id $model_id --wavs $wav_path
```

```sh
# RDINO trained on VoxCeleb
model_id=iic/speech_rdino_ecapa_tdnn_sv_en_voxceleb_16k
# Run RDINO inference
python speakerlab/bin/infer_sv_rdino.py --model_id $model_id --wavs $wav_path
```

--------------------------------

### Add Executable: make_fbank_feature

Source: https://github.com/modelscope/3d-speaker/blob/main/runtime/onnxruntime/bin/CMakeLists.txt

Defines the 'make_fbank_feature' executable and links it with 'utils' and 'feature' libraries.

```cmake
add_executable(make_fbank_feature make_fbank_feature.cpp)
target_link_libraries(make_fbank_feature PUBLIC utils feature)
```

--------------------------------

### Run 3D-Speaker experiments

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/3dspeaker/README.md

Execute speaker verification training and evaluation scripts for various models using the provided shell commands.

```sh
# Speaker verification: ERes2Net on 3D-Speaker
cd egs/3dspeaker/sv-eres2net/
bash run.sh
# Speaker verification: CAM++ on 3D-Speaker
cd egs/3dspeaker/sv-cam++/
bash run.sh
# Speaker verification: ECAPA-TDNN on 3D-Speaker
cd egs/3dspeaker/sv-ecapa/
bash run.sh
# Speaker verification: ResNet on 3D-Speaker
cd egs/3dspeaker/sv-resnet/
bash run.sh
# Speaker verification: Res2Net on 3D-Speaker
cd egs/3dspeaker/sv-res2net/
bash run.sh
# Self-supervised speaker verification: RDINO on 3D-Speaker
cd egs/3dspeaker/sv-rdino/
bash run.sh
```

--------------------------------

### Project CMake Configuration

Source: https://github.com/modelscope/3d-speaker/blob/main/runtime/onnxruntime/CMakeLists.txt

Defines the project requirements, C++ standard, and includes necessary build modules for the SpeakerLabEngines project.

```cmake
cmake_minimum_required(VERSION 3.23)
project(SpeakerLabEngines VERSION 0.1)

set(CMAKE_CXX_STANDARD 20)
option(USE_CUDA "Build with CUDA support" OFF)

include_directories(${PROJECT_SOURCE_DIR})

# Fetch third-party library
#include(ExternalProject)
include(FetchContent)
set(FETCHCONTENT_QUIET OFF)
set(FETCHCONTENT_BASE_DIR ${CMAKE_SOURCE_DIR}/third_party)
if (NOT EXISTS ${FETCHCONTENT_BASE_DIR})
    file(MAKE_DIRECTORY ${FETCHCONTENT_BASE_DIR})
endif ()

list(APPEND CMAKE_MODULE_PATH ${CMAKE_CURRENT_SOURCE_DIR}/cmake)
include(cmake/build_json.cmake)
include(cmake/build_onnx.cmake)

add_subdirectory(utils)
add_subdirectory(bin)
add_subdirectory(feature)
add_subdirectory(model)
```

--------------------------------

### Train CAM++ Model

Source: https://context7.com/modelscope/3d-speaker/llms.txt

Distributed training of the CAM++ model on the 3D-Speaker dataset using 4 GPUs. Specify configuration, data paths, and experiment directory.

```bash
torchrun --nproc_per_node=4 speakerlab/bin/train.py \
  --config conf/cam++.yaml \
  --gpu 0 1 2 3 \
  --data data/3dspeaker/train/train.csv \
  --noise data/musan/wav.scp \
  --reverb data/rirs/wav.scp \
  --exp_dir exp/cam++
```

--------------------------------

### Data Augmentation for Speaker Training

Source: https://context7.com/modelscope/3d-speaker/llms.txt

Initializes an audio augmentation pipeline for speaker verification training, applying noise and reverberation based on provided probabilities and file lists. It also includes a WavReader for loading audio with optional speed perturbation and large margin training modes.

```python
from speakerlab.process.processor import SpkVeriAug, WavReader
import torch

# Initialize augmentation pipeline
augmenter = SpkVeriAug(
    aug_prob=0.6,                     # Probability of applying augmentation
    noise_file='data/musan/wav.scp',  # Path to noise file list
    reverb_file='data/rirs/wav.scp'   # Path to RIR file list
)

# Initialize wav reader with speed perturbation
wav_reader = WavReader(
    sample_rate=16000,
    duration=3.0,       # Chunk duration in seconds
    speed_pertub=True,  # Enable speed perturbation (0.9x, 1.0x, 1.1x)
    lm=True             # Large margin training mode
)

# Load and augment audio
wav, speed_idx = wav_reader('audio.wav')  # wav: [chunk_samples], speed_idx: 0,1,2
augmented_wav = augmenter(wav)
```

--------------------------------

### Configure Speaker Turn Detection Execution

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/semantic_speaker/bert/README.md

Run the speaker turn detection Python script with specific column mappings and model parameters.

```shell
python bin/run_speaker_turn_detection.py \
  --model_name_or_path bert-base-chinese \
  --max_seq_length 128 --pad_to_max_length \
  --train_file $json_path/train.speaker_turn_detection.json \
  --validation_file $json_path/valid.speaker_turn_detection.json \
  --test_file $json_path/test.speaker_turn_detection.json \
  --do_train --do_eval --do_predict \
  --text_column_name sentence --label_column_name change_point_list --label_num 2 \
  --per_device_train_batch_size 64 --per_device_eval_batch_size 64 --num_train_epochs 5 \
  --output_dir $output_path --overwrite_output_dir
```

--------------------------------

### Compute Scores with Multiple Trials and Custom DCF

Source: https://context7.com/modelscope/3d-speaker/llms.txt

Compute speaker verification scores and metrics using multiple trial files and custom DCF parameters. This allows for detailed analysis across different conditions.

```bash
python speakerlab/bin/compute_score_metrics.py \
  --enrol_data exp/eres2net/embeddings \
  --test_data exp/eres2net/embeddings \
  --scores_dir exp/eres2net/scores \
  --trials trials/cross_device trials/cross_distance trials/cross_dialect \
  --p_target 0.01 \
  --c_miss 1 \
  --c_fa 1
```

--------------------------------

### Complete Training Pipeline Script

Source: https://context7.com/modelscope/3d-speaker/llms.txt

A bash script for the end-to-end training pipeline of speaker verification models on the 3D-Speaker dataset. It includes stages for data preparation, index creation, model training, embedding extraction, and metric computation.

```bash
#!/bin/bash
# Complete training pipeline for CAM++ on 3D-Speaker dataset

set -e

data=data
exp=exp
exp_name=cam++
gpus="0 1 2 3"

# Stage 1: Prepare dataset
echo "Stage 1: Preparing 3D Speaker dataset..."
./local/prepare_data.sh --stage 1 --stop_stage 3 --data ${data}

# Stage 2: Create training data index
echo "Stage 2: Preparing training data index files..."
python local/prepare_data_csv.py --data_dir $data/3dspeaker/train

# Stage 3: Train speaker embedding model
echo "Stage 3: Training the speaker model..."
num_gpu=$(echo $gpus | awk -F ' ' '{print NF}')
torchrun --nproc_per_node=$num_gpu speakerlab/bin/train.py \
  --config conf/cam++.yaml \
  --gpu $gpus \
  --data $data/3dspeaker/train/train.csv \
  --noise $data/musan/wav.scp \
  --reverb $data/rirs/wav.scp \
  --exp_dir $exp/$exp_name

# Stage 4: Extract test embeddings
echo "Stage 4: Extracting speaker embeddings..."
torchrun --nproc_per_node=8 speakerlab/bin/extract.py \
  --exp_dir $exp/$exp_name \
  --data $data/3dspeaker/test/wav.scp \
  --use_gpu --gpu $gpus

# Stage 5: Compute evaluation metrics
echo "Stage 5: Computing score metrics..."
trials="$data/3dspeaker/trials/trials_cross_device"
trials="$trials $data/3dspeaker/trials/trials_cross_distance"
trials="$trials $data/3dspeaker/trials/trials_cross_dialect"
python speakerlab/bin/compute_score_metrics.py \
  --enrol_data $exp/$exp_name/embeddings \
  --test_data $exp/$exp_name/embeddings \
  --scores_dir $exp/$exp_name/scores \
  --trials $trials
```

--------------------------------

### Run Speaker Diarization Experiments

Source: https://github.com/modelscope/3d-speaker/blob/main/README.md

Execute shell scripts to run audio and multimodal speaker diarization experiments.

```sh
# Audio and multimodal Speaker diarization:
cd egs/3dspeaker/speaker-diarization/
bash run_audio.sh
bash run_video.sh
```

--------------------------------

### Run Speaker-Turn Detection Task

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/semantic_speaker/bert/README.md

Execute the speaker-turn detection script using the specified output directory.

```shell
bash run_speaker_turn_detection.sh exp/
```

--------------------------------

### Configure CMake for ONNX Runtime Integration

Source: https://github.com/modelscope/3d-speaker/blob/main/runtime/onnxruntime/model/CMakeLists.txt

Defines a static library named 'model' and links it against ONNX Runtime dependencies.

```cmake
add_library(model STATIC speaker_embedding_model.cpp)
target_include_directories(model PUBLIC ${ONNX_RUNTIME_INCLUDE_DIRS})
target_link_directories(model PUBLIC ${ONNX_RUNTIME_LIB_DIRS})
target_link_libraries(model PRIVATE onnxruntime)
```

--------------------------------

### Run Audio-only Diarization

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/3dspeaker/speaker-diarization/README.md

Execute the audio-only diarization pipeline using shell scripts.

```sh
bash run_audio.sh
# Use the funasr model to transcribe into Chinese text.
bash run_audio.sh --stop_stage 8
```

--------------------------------

### ERes2Net Model Initialization and Forward Pass

Source: https://context7.com/modelscope/3d-speaker/llms.txt

Initialize and perform a forward pass with the ERes2Net model. This shows how to set up the model with different configurations and obtain speaker embeddings from input features.

```python
from speakerlab.models.eres2net.ERes2Net import ERes2Net
import torch

# Initialize ERes2Net model
model = ERes2Net(
    feat_dim=80,              # Input feature dimension
    embedding_size=192,       # Output embedding dimension
    m_channels=32,            # Base channel multiplier (32 for base, 64 for large)
    num_blocks=[3, 4, 6, 3],  # Blocks per stage
    pooling_func='TSTP',      # Temporal statistics pooling
    two_emb_layer=False       # Single embedding layer
)

# Forward pass
x = torch.randn(10, 300, 80)  # [batch, time, features]
embedding = model(x)          # Output: [10, 192]

print(f"Embedding shape: {embedding.shape}")
print(f"Model parameters: {sum(p.numel() for p in model.parameters())/1e6:.2f}M")
# ERes2Net-base: 6.61M params
# ERes2Net-large (m_channels=64): 22.46M params
```

--------------------------------

### Speaker Clustering with CommonClustering

Source: https://context7.com/modelscope/3d-speaker/llms.txt

Initializes a common clustering backend that supports spectral clustering, UMAP+HDBSCAN, or AHC. It takes speaker embeddings as input and returns cluster labels. Parameters like `min_num_spks`, `max_num_spks`, and `mer_cos` control the clustering process.

```python
from speakerlab.process.cluster import CommonClustering, SpectralCluster
import numpy as np

# Initialize clustering backend
cluster = CommonClustering(
    cluster_type='spectral',  # 'spectral', 'umap_hdbscan', or 'AHC'
    min_num_spks=1,           # Minimum number of speakers
    max_num_spks=15,          # Maximum number of speakers
    min_cluster_size=4,       # Minimum cluster size
    mer_cos=0.8,              # Merge threshold for similar speakers
    pval=0.012                # P-value for affinity pruning
)

# Cluster embeddings (auto-detect number of speakers)
embeddings = np.random.randn(100, 192)  # 100 segments, 192-dim embeddings
labels = cluster(embeddings)
print(f"Detected {labels.max() + 1} speakers")

# Cluster with known speaker count
labels = cluster(embeddings, speaker_num=4)

# Use spectral clustering directly
spectral = SpectralCluster(
    min_num_spks=1,
    max_num_spks=10,
    pval=0.02
)
labels = spectral(embeddings)
```

--------------------------------

### Run Language Identification Experiment

Source: https://github.com/modelscope/3d-speaker/blob/main/README.md

Execute a shell script to run language identification experiments.

```sh
# Language identification
cd egs/3dspeaker/language-identification
bash run.sh
```

--------------------------------

### Run Speaker Verification Experiments

Source: https://github.com/modelscope/3d-speaker/blob/main/README.md

Execute shell scripts to run speaker verification experiments with different models (ERes2NetV2, CAM++, ECAPA-TDNN, SDPN) on various datasets (3D-Speaker, VoxCeleb).

```sh
# Speaker verification: ERes2NetV2 on 3D-Speaker dataset
cd egs/3dspeaker/sv-eres2netv2/
bash run.sh
```

```sh
# Speaker verification: CAM++ on 3D-Speaker dataset
cd egs/3dspeaker/sv-cam++/
bash run.sh
```

```sh
# Speaker verification: ECAPA-TDNN on 3D-Speaker dataset
cd egs/3dspeaker/sv-ecapa/
bash run.sh
```

```sh
# Self-supervised speaker verification: SDPN on VoxCeleb dataset
cd egs/voxceleb/sv-sdpn/
bash run.sh
```

--------------------------------

### Train ERes2Net Model

Source: https://context7.com/modelscope/3d-speaker/llms.txt

Distributed training of the ERes2Net model on the VoxCeleb dataset using 8 GPUs. Specify the configuration file, training data path, and experiment directory.

```bash
torchrun --nproc_per_node=8 speakerlab/bin/train.py \
    --config conf/eres2net.yaml \
    --gpu 0 1 2 3 4 5 6 7 \
    --data data/voxceleb/train/train.csv \
    --exp_dir exp/eres2net
```

--------------------------------

### Download Aishell-4 Data

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/semantic_speaker/bert/README.md

Use wget to download the Aishell-4 dataset archives. Different sizes are available; select based on your needs.

```shell
# Aishell-4 data download.
# Alternative download links are listed on the OpenSLR website (https://www.openslr.org).
wget https://us.openslr.org/resources/111/train_L.tar.gz  # [7.0G]
wget https://us.openslr.org/resources/111/train_M.tar.gz  # [25G]
wget https://us.openslr.org/resources/111/train_S.tar.gz  # [14G]
wget https://us.openslr.org/resources/111/test.tar.gz     # [5.2G]
```

--------------------------------

### Quick Audio Diarization Inference

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/3dspeaker/speaker-diarization/README.md

Run audio diarization inference directly with the provided Python script.

```sh
# audio-only diarization
python speakerlab/bin/infer_diarization.py --wav [wav_list OR wav_path] --out_dir [out_dir]

# enable overlap detection
python speakerlab/bin/infer_diarization.py --wav [wav_list OR wav_path] --out_dir [out_dir] --include_overlap --hf_access_token [hf_access_token]

# for more configurable parameters, refer to speakerlab/bin/infer_diarization.py
```

--------------------------------

### ONNX Runtime Download URL Configuration

Source: https://github.com/modelscope/3d-speaker/blob/main/runtime/onnxruntime/README.md

CMake script snippet to set the ONNX Runtime download URL based on the operating system (Windows, macOS, Linux).

```cmake
if (WIN32)
    set(ONNX_RUNTIME_URL "https://github.com/microsoft/onnxruntime/releases/download/v${ONNX_RUNTIME_VERSION}/onnxruntime-win-x64-${ONNX_RUNTIME_VERSION}.zip")
elseif(APPLE)
    if(CMAKE_SYSTEM_PROCESSOR MATCHES "arm64")
        set(ONNX_RUNTIME_URL "https://github.com/microsoft/onnxruntime/releases/download/v${ONNX_RUNTIME_VERSION}/onnxruntime-osx-arm64-${ONNX_RUNTIME_VERSION}.tgz")
    else ()
        set(ONNX_RUNTIME_URL "https://github.com/microsoft/onnxruntime/releases/download/v${ONNX_RUNTIME_VERSION}/onnxruntime-osx-x86_64-${ONNX_RUNTIME_VERSION}.tgz")
    endif ()
elseif(UNIX AND NOT APPLE)
    set(ONNX_RUNTIME_URL "https://github.com/microsoft/onnxruntime/releases/download/v${ONNX_RUNTIME_VERSION}/onnxruntime-linux-x64-${ONNX_RUNTIME_VERSION}.tgz")
else()
    message(FATAL_ERROR "Unsupported operating system")
endif()
```

--------------------------------

### SDPN Model Inference

Source: https://context7.com/modelscope/3d-speaker/llms.txt

Perform speaker verification inference using the SDPN model. Requires specifying the model ID and a list of audio files.
```bash
python speakerlab/bin/infer_sv_ssl.py \
    --model_id iic/speech_sdpn_ecapa_tdnn_sv_en_voxceleb_16k \
    --wavs speaker1.wav speaker2.wav
```

--------------------------------

### Add Static Library to CMake

Source: https://github.com/modelscope/3d-speaker/blob/main/runtime/onnxruntime/utils/CMakeLists.txt

Defines a static library named 'utils' with 'wav_reader.cpp' as its only source file, so other targets can reuse the WAV-reading code.

```cmake
add_library(utils STATIC
    wav_reader.cpp
)
```

--------------------------------

### Verify ONNX Model Export with ONNX Runtime

Source: https://github.com/modelscope/3d-speaker/blob/main/runtime/onnxruntime/README.md

This function verifies the exported ONNX model by comparing its output with the original PyTorch model's output. It loads the model, exports it, runs the same random input through both PyTorch and ONNX Runtime, and prints the cosine similarity between the two results, which should be close to 1.0.

```python
import onnxruntime as ort
import torch

# get_args, logger, build_model_from_modelscope_id,
# build_model_from_custom_work_path, and export_onnx_file are assumed
# to come from the surrounding export script.


def main():
    args = get_args()
    logger.info(f"{args}")

    model_id = args.model_id
    experiment_path = args.experiment_path
    target_onnx_file = args.target_onnx_file

    if model_id is not None:
        speaker_embedding_model = build_model_from_modelscope_id(
            model_id, experiment_path
        )
    else:
        speaker_embedding_model = build_model_from_custom_work_path(
            experiment_path
        )
    logger.info("Load speaker embedding finished, export to onnx")

    # let function `export_onnx_file` return the random tensor
    inputs = export_onnx_file(speaker_embedding_model, target_onnx_file)

    # Reference output from the PyTorch model
    with torch.no_grad():
        res0 = speaker_embedding_model(inputs)

    # Output from the exported model via ONNX Runtime
    ort_sess = ort.InferenceSession(target_onnx_file)
    res1 = ort_sess.run(None, {'feature': inputs.numpy()})[0]
    res1 = torch.from_numpy(res1)  # Here, convert it to torch.tensor

    cos = torch.nn.CosineSimilarity(dim=1, eps=1e-6)
    print(cos(res0, res1))  # The expected result should be tensor([1.0000])
```

--------------------------------

### Run Speaker Verification Experiments

Source:
https://github.com/modelscope/3d-speaker/blob/main/egs/cnceleb/README.md

Execute training or evaluation scripts for various speaker verification models on the CN-Celeb dataset.

```sh
# Speaker verification: ERes2Net on CN-Celeb
cd egs/cnceleb/sv-eres2net/
bash run.sh

# Speaker verification: CAM++ on CN-Celeb
cd egs/cnceleb/sv-cam++/
bash run.sh

# Speaker verification: ECAPA-TDNN on CN-Celeb
cd egs/cnceleb/sv-ecapa/
bash run.sh

# Speaker verification: ResNet on CN-Celeb
cd egs/cnceleb/sv-resnet/
bash run.sh

# Speaker verification: Res2Net on CN-Celeb
cd egs/cnceleb/sv-res2net/
bash run.sh

# Self-supervised speaker verification: RDINO on CN-Celeb
cd egs/cnceleb/sv-rdino/
bash run.sh
```

--------------------------------

### Define Feature Library and Dependencies in CMake

Source: https://github.com/modelscope/3d-speaker/blob/main/runtime/onnxruntime/feature/CMakeLists.txt

Configures the feature library as a static library and links it with the utils dependency.

```cmake
add_library(feature STATIC
    feature_basic.cpp
    feature_fbank.h
    feature_functions.cpp
    feature_fbank.cpp
    feature_common.cpp
)

target_link_libraries(feature PUBLIC utils)
```

--------------------------------

### Extract Speaker Embeddings Command

Source: https://github.com/modelscope/3d-speaker/blob/main/runtime/onnxruntime/README.md

Command to extract speaker embeddings using the pre-compiled binary. Requires a wav.scp file and a fbank configuration.

```shell
./extract_speaker_embedding path/to/fbank_config.json path/to/your/onnx_file /path/to/your/wav.scp /path/to/embedding_scp_file /path/to/save/embeddings/
```

--------------------------------

### Project Citations

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/semantic_speaker/bert/README.md

BibTeX entries for the relevant research papers and datasets used in the project.
```latex
@inproceedings{Luyao2023ACL,
  author    = {Luyao Cheng and Siqi Zheng and Qinglin Zhang and Hui Wang and Yafeng Chen and Qian Chen},
  title     = {Exploring Speaker-Related Information in Spoken Language Understanding for Better Speaker Diarization},
  booktitle = {Findings of the {ACL} 2023, Toronto, Canada, July 9-14, 2023},
  pages     = {14068--14077},
  year      = {2023},
}

@article{Cheng2023ImprovingSD,
  title   = {Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation},
  author  = {Luyao Cheng and Siqi Zheng and Qinglin Zhang and Haibo Wang and Yafeng Chen and Qian Chen and Shiliang Zhang},
  journal = {ArXiv},
  year    = {2023},
  volume  = {abs/2309.10456},
}
```

```latex
@inproceedings{AISHELL-4_2021,
  title     = {AISHELL-4: An Open Source Dataset for Speech Enhancement, Separation, Recognition and Speaker Diarization in Conference Scenario},
  author    = {Yihui Fu and Luyao Cheng and Shubo Lv and Yukai Jv and Yuxiang Kong and Zhuo Chen and Yanxin Hu and Lei Xie and Jian Wu and Hui Bu and Xin Xu and Jun Du and Jingdong Chen},
  booktitle = {Interspeech},
  url       = {https://arxiv.org/abs/2104.03603},
  year      = {2021}
}

@inproceedings{Yu2022M2MeT,
  title        = {M2{M}e{T}: The {ICASSP} 2022 Multi-Channel Multi-Party Meeting Transcription Challenge},
  author       = {Yu, Fan and Zhang, Shiliang and Fu, Yihui and Xie, Lei and Zheng, Siqi and Du, Zhihao and Huang, Weilong and Guo, Pengcheng and Yan, Zhijie and Ma, Bin and Xu, Xin and Bu, Hui},
  booktitle    = {Proc. ICASSP},
  year         = {2022},
  organization = {IEEE}
}
```

--------------------------------

### Integrate Diarization in Python

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/3dspeaker/speaker-diarization/README.md

Use the Diarization3Dspeaker class to integrate the pipeline into custom Python scripts.
```python
from speakerlab.bin.infer_diarization import Diarization3Dspeaker

wav_path = "audio.wav"
pipeline = Diarization3Dspeaker()
print(pipeline(wav_path, wav_fs=None, speaker_num=None))  # can also accept WAV data as input
```

--------------------------------

### Link Library in CMake

Source: https://github.com/modelscope/3d-speaker/blob/main/runtime/onnxruntime/utils/CMakeLists.txt

Passes `${CMAKE_CURRENT_SOURCE_DIR}`, the directory containing the current `CMakeLists.txt`, as a `PUBLIC` link item of the 'utils' static library target, so that targets linking against 'utils' inherit it.

```cmake
target_link_libraries(utils PUBLIC ${CMAKE_CURRENT_SOURCE_DIR})
```

--------------------------------

### RDINO Model Inference

Source: https://context7.com/modelscope/3d-speaker/llms.txt

Perform speaker verification inference using the RDINO model. Requires specifying the model ID and a list of audio files.

```bash
python speakerlab/bin/infer_sv_ssl.py \
    --model_id damo/speech_rdino_ecapa_tdnn_sv_en_voxceleb_16k \
    --wavs speaker1.wav speaker2.wav
```

--------------------------------

### Run Classic Language Identification

Source: https://github.com/modelscope/3d-speaker/blob/main/egs/3dspeaker/language-identification/README.md

Execute the classic language identification script, which uses only eres2net/cam++ to extract speaker embeddings.

```sh
bash run.sh
```
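
--------------------------------

### Cosine Similarity Between Two Embeddings (Illustrative)

The ONNX export check in this document compares two embedding tensors with cosine similarity, which is also a common way to compare speaker embeddings in general. The sketch below shows that computation in plain Python, as a rough illustration rather than the toolkit's own scoring code. The helper name `cosine_score` and the toy 4-dimensional vectors are hypothetical; real embeddings from the models above would be, for example, 192-dimensional.

```python
import math


def cosine_score(a, b, eps=1e-6):
    """Cosine similarity between two equal-length vectors (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # eps keeps the denominator non-zero when either vector is all zeros
    return dot / max(norm_a * norm_b, eps)


# Toy vectors standing in for real speaker embeddings (assumption for illustration)
emb_a = [0.5, -1.0, 2.0, 0.0]
emb_b = [1.0, -2.0, 4.0, 0.0]  # same direction as emb_a, so the score is close to 1
emb_c = [2.0, 1.0, 0.0, 3.0]   # orthogonal to emb_a, so the score is 0

print(f"a vs b: {cosine_score(emb_a, emb_b):.4f}")
print(f"a vs c: {cosine_score(emb_a, emb_c):.4f}")
```

A score near 1 means the two vectors point in nearly the same direction; any decision threshold for "same speaker" depends on the model and data and is not specified here.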