### Install emotion2vec from Source Code

Source: https://github.com/ddlbojack/emotion2vec/blob/main/README.md

Installs the emotion2vec library from its source code. Requires Python 3.8+ and PyTorch 1.13+. This process involves cloning the repository, installing fairseq, and downloading pre-trained checkpoints.

```bash
pip install fairseq
git clone https://github.com/ddlBoJack/emotion2vec.git
```

--------------------------------

### Install FunASR for Emotion2Vec+

Source: https://github.com/ddlbojack/emotion2vec/blob/main/README.md

This snippet shows how to install the FunASR library, which is required for using the emotion2vec+ models for speech emotion recognition. It uses pip for installation.

```bash
pip install -U funasr
```

--------------------------------

### FunASR Support for Kaldi-style Wav.scp

Source: https://github.com/ddlbojack/emotion2vec/blob/main/README.md

This example illustrates how FunASR supports processing multiple audio files using a Kaldi-style wav.scp file. This format allows for efficient batch processing of audio data by listing audio file names and their corresponding paths.

```text
wav_name1 wav_path1.wav
wav_name2 wav_path2.wav
...
```

--------------------------------

### Source Code Feature Extraction (CLI)

Source: https://context7.com/ddlbojack/emotion2vec/llms.txt

This section provides a command-line interface (CLI) example for extracting features using the emotion2vec model directly from its source code, typically for integration with fairseq.

```APIDOC
## Source Code Feature Extraction (CLI)

### Description
This command-line script allows for direct feature extraction from audio files using the emotion2vec model's source code. It's useful for advanced users or when integrating with frameworks like fairseq. The extracted features are saved as numpy arrays.

### Method
CLI Command (Bash)

### Endpoint
`scripts/extract_features.py`

### Parameters
#### Command-line Arguments
- **--source_file** (string) - Required - Path to the input audio file.
- **--target_file** (string) - Required - Path to save the output numpy file.
- **--model_dir** (string) - Required - Directory containing the model files.
- **--checkpoint_dir** (string) - Required - Path to the model checkpoint file (e.g., `.pt`).
- **--granularity** (string) - Optional - Feature extraction level. Options: `utterance` or `frame`. Defaults to `utterance`.

### Request Example
```bash
#!/bin/bash
export CUDA_VISIBLE_DEVICES=0

python scripts/extract_features.py \
    --source_file='/path/to/your/audio.wav' \
    --target_file='/path/to/save/features.npy' \
    --model_dir='./upstream' \
    --checkpoint_dir='/path/to/emotion2vec_base.pt' \
    --granularity='utterance'
```

### Response
#### Success Response
- A `.npy` file is created at the specified `--target_file` path containing the extracted emotion features.

#### Response Example
(No direct output, but a file is created)
```
# Output file: /path/to/save/features.npy
# Content will be a numpy array (e.g., shape [768] or [T, 768])
```

### Error Handling
- Ensure all paths and directories are correct.
- CUDA availability is recommended for performance.
```

--------------------------------

### Load SSL Features and Create Dataset (Python)

Source: https://context7.com/ddlbojack/emotion2vec/llms.txt

Loads pre-extracted emotion2vec features from disk and prepares them for use with PyTorch's Dataset. It handles feature files, lengths, and emotion labels, returning a dictionary containing features, sample sizes, offsets, and integer labels.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


def load_ssl_features(feature_path, label_dict, max_speech_seq_len=None):
    """
    Load pre-extracted emotion2vec features from disk.

    Expected files:
    - {feature_path}.npy: Concatenated features array
    - {feature_path}.lengths: One length per line
    - {feature_path}.emo: Labels in format "utt_id emotion"

    Returns: dict with 'feats', 'sizes', 'offsets', 'labels', 'num'
    """
    # Assuming load_dataset is defined elsewhere and handles file loading
    # data, sizes, offsets, labels = load_dataset(
    #     feature_path, labels='emo', min_length=1, max_length=max_speech_seq_len
    # )

    # Placeholder for the actual load_dataset call: random data whose sizes
    # and offsets are consistent with each other
    sizes = np.random.randint(10, 100, 10)
    offsets = np.concatenate([[0], np.cumsum(sizes)[:-1]])
    data = np.random.rand(int(sizes.sum()), 768)
    names = list(label_dict.keys())
    labels = [names[i % len(names)] for i in range(10)]

    labels = [label_dict[elem] for elem in labels]
    return {
        "feats": data,       # [total_frames, 768] numpy array
        "sizes": sizes,      # [num_samples] frame counts
        "offsets": offsets,  # [num_samples] start indices
        "labels": labels,    # [num_samples] integer labels
        "num": len(labels)
    }


class SpeechDataset(Dataset):
    """Dataset for frame-level emotion features with variable lengths."""

    def __init__(self, feats, sizes, offsets, labels=None):
        self.feats = feats      # [total_frames, 768]
        self.sizes = sizes      # Length of each sample
        self.offsets = offsets  # Start offset of each sample
        self.labels = labels

    def __getitem__(self, index):
        offset = self.offsets[index]
        end = self.sizes[index] + offset
        feats = torch.from_numpy(self.feats[offset:end, :].copy()).float()
        return {
            "id": index,
            "feats": feats,  # [seq_len, 768]
            "target": self.labels[index] if self.labels else None
        }

    def __len__(self):
        return len(self.sizes)

    def collator(self, samples):
        """Batch collator with padding."""
        feats = [s["feats"] for s in samples]
        sizes = [f.shape[0] for f in feats]
        labels = torch.tensor([s["target"] for s in samples])

        # Pad to max length in batch
        target_size = max(sizes)
        collated_feats = torch.zeros(len(feats), target_size, feats[0].size(-1))
        padding_mask = torch.zeros(len(feats), target_size, dtype=torch.bool)
        for i, (feat, size) in enumerate(zip(feats, sizes)):
            collated_feats[i, :size] = feat
            padding_mask[i, size:] = True

        return {
            "id": torch.LongTensor([s["id"] for s in samples]),
            "net_input": {"feats": collated_feats, "padding_mask": padding_mask},
            "labels": labels
        }


# Example usage:
label_dict = {'ang': 0, 'hap': 1, 'neu': 2, 'sad': 3}
dataset = load_ssl_features("/path/to/iemocap_features", label_dict)

# Assuming train_valid_test_iemocap_dataloader is defined elsewhere
# train_loader, val_loader, test_loader = train_valid_test_iemocap_dataloader(
#     dataset, batch_size=128, test_start=0, test_end=1085, eval_is_test=False
# )

# Placeholder for dataloader iteration
# for batch in train_loader:
#     feats = batch["net_input"]["feats"]        # [128, max_len, 768]
#     mask = batch["net_input"]["padding_mask"]  # [128, max_len]
#     labels = batch["labels"]                   # [128]
#     print(f"Batch shapes: feats={feats.shape}, labels={labels.shape}")
#     break
```

--------------------------------

### Perform Emotion Recognition with FunASR

Source: https://context7.com/ddlbojack/emotion2vec/llms.txt

Demonstrates how to initialize an emotion2vec+ model and perform 9-class emotion classification on a single audio file. It supports both utterance-level classification and optional embedding extraction.

```python
from funasr import AutoModel

model = AutoModel(
    model="iic/emotion2vec_plus_large",
    hub="ms",
)

wav_file = f"{model.model_path}/example/test.wav"
result = model.generate(
    wav_file,
    output_dir="./outputs",
    granularity="utterance",
    extract_embedding=False
)

print(f"Predicted emotion: {result[0]['labels']}")
print(f"Confidence scores: {result[0]['scores']}")
```

--------------------------------

### Batch Processing with Kaldi-style wav.scp

Source: https://context7.com/ddlbojack/emotion2vec/llms.txt

This section demonstrates how to process multiple audio files in batch using a Kaldi-style `wav.scp` file for efficient inference with emotion2vec models.
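
A Kaldi-style `wav.scp` is plain text with one `name path` pair per line, so it can be produced and checked with the standard library alone. The sketch below is illustrative only: the helper names `write_wav_scp` and `read_wav_scp` are hypothetical (they are not part of FunASR), and it assumes utterance names contain no whitespace.

```python
import os
import tempfile


def write_wav_scp(entries, scp_path):
    """Write (name, wav_path) pairs, one tab-separated pair per line."""
    with open(scp_path, "w", encoding="utf-8") as f:
        for name, wav_path in entries:
            f.write(f"{name}\t{wav_path}\n")


def read_wav_scp(scp_path):
    """Parse a wav.scp file into an ordered {name: wav_path} dict."""
    mapping = {}
    with open(scp_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            # Split on the first run of whitespace only, so paths may contain spaces
            name, wav_path = line.split(maxsplit=1)
            mapping[name] = wav_path
    return mapping


# Placeholder paths: nothing is read from these audio files here
entries = [
    ("audio_001", "/path/to/audio1.wav"),
    ("audio_002", "/path/to/audio2.wav"),
]
scp_path = os.path.join(tempfile.mkdtemp(), "wav.scp")
write_wav_scp(entries, scp_path)
parsed = read_wav_scp(scp_path)
print(parsed)
```

Reading the manifest back before inference is a cheap way to catch malformed lines; the resulting `scp_path` is the kind of path that would then be handed to `model.generate`.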

```APIDOC
## Batch Processing with Kaldi-style wav.scp

### Description
This feature allows for efficient batch processing of multiple audio files by referencing them in a `wav.scp` file. The `generate` method accepts a path to this file, enabling large-scale inference or embedding extraction.

### Method
Python call: `model.generate` (a library method, not an HTTP request)

### Endpoint
`AutoModel.generate` (within FunASR library)

### Parameters
#### `AutoModel` Arguments
- **model** (string) - Required - The name of the pre-trained model to use.
- **hub** (string) - Required - The model hub to use (`ms` or `modelscope`, `hf` or `huggingface`).

#### `generate` Arguments
- **wav_scp_path** (string) - Required - Path to the `wav.scp` file, passed as the first positional argument.
- **granularity** (string) - Optional - `utterance` or `frame`.
- **extract_embedding** (boolean) - Optional - Set to `True` to also extract embeddings.

### Request Example
```python
from funasr import AutoModel

# Create a dummy wav.scp file
wav_scp_content = """audio_001\t/path/to/audio1.wav
audio_002\t/path/to/audio2.wav
audio_003\t/path/to/audio3.wav
"""
with open("wav.scp", "w") as f:
    f.write(wav_scp_content)

model = AutoModel(
    model="iic/emotion2vec_plus_large",
    hub="ms",
)

results = model.generate(
    "wav.scp",  # Path to the wav.scp file
    output_dir="./batch_outputs",
    granularity="utterance",
    extract_embedding=True
)

for i, result in enumerate(results):
    print(f"Audio {i}: Label={result['labels']}, Scores={result['scores']}")
    if 'feats' in result:
        print(f"  Embedding shape: {result['feats'].shape}")
```

### Response
#### Success Response
- A list of dictionaries, where each dictionary corresponds to an audio file in the `wav.scp` and contains prediction results (`labels`, `scores`) and optionally embeddings (`feats`).

#### Response Example
```json
[
  {
    "labels": [4],
    "scores": [[0.01, ..., 0.85, ...]],
    "feats": [0.1, -0.2, ..., 0.5]
  },
  {
    "labels": [6],
    "scores": [[0.02, ..., 0.75, ...]],
    "feats": [-0.3, 0.4, ..., -0.1]
  }
]
```

### Error Handling
- Errors related to file access or model inference will be reported.
```

--------------------------------

### Hydra Configuration for IEMOCAP Downstream Task (YAML)

Source: https://context7.com/ddlbojack/emotion2vec/llms.txt

Demonstrates Hydra configuration for the IEMOCAP downstream task using YAML files. It shows default settings for dataset, optimization, and model parameters, and how to override them via the command line.

```yaml
# iemocap_downstream/config/default.yaml
common:
  seed: 42

dataset:
  _name: IEMOCAP
  feat_path: /path/to/emotion2vec_features
  test_ratio: 0.2
  batch_size: 128
  fold: 5
  eval_is_test: False  # If True, use test set as validation

optimization:
  epoch: 100
  lr: 5e-4
  weight_decay: 1e-5
  label_smooth: 0.0

model:
  _name: BaseModel
```

```bash
# Override config via command line
python main.py \
    dataset.feat_path=/new/path/to/features \
    dataset.batch_size=64 \
    optimization.lr=1e-4 \
    optimization.epoch=50
```

--------------------------------

### Inference with emotion2vec+ Models using FunASR

Source: https://github.com/ddlbojack/emotion2vec/blob/main/README.md

This Python script demonstrates how to perform speech emotion recognition using various emotion2vec+ models provided by FunASR. It shows how to load a model, specify the model ID, and run inference on a given audio file, outputting emotion labels and scores. The script also highlights different model versions and the option to extract embeddings.
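
The nine class indices used by the emotion2vec+ models (listed in the docstring of this section's script) can be kept in a small lookup for turning a score vector into a readable label. This is a minimal sketch, not FunASR functionality: `top_emotion` is a hypothetical helper, and it assumes `scores` is a plain list with one float per class, ordered by class index.

```python
# Class names in index order, copied from the 9-class list for the emotion2vec+ models
EMOTIONS = [
    "angry", "disgusted", "fearful", "happy", "neutral",
    "other", "sad", "surprised", "unknown",
]


def top_emotion(scores):
    """Return (index, name, score) for the highest-scoring class.

    Hypothetical helper: assumes one score per class, ordered by class index.
    """
    if len(scores) != len(EMOTIONS):
        raise ValueError(f"expected {len(EMOTIONS)} scores, got {len(scores)}")
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, EMOTIONS[best], scores[best]


# Made-up scores for illustration, not real model output
example_scores = [0.01, 0.02, 0.03, 0.04, 0.85, 0.01, 0.01, 0.02, 0.01]
print(top_emotion(example_scores))  # (4, 'neutral', 0.85)
```

If the installed FunASR version returns scores in a different container or order, adapt the lookup before relying on it.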

```python
'''
Using the finetuned emotion recognition model

rec_result contains {'feats', 'labels', 'scores'}
    extract_embedding=False: 9-class emotions with scores
    extract_embedding=True: 9-class emotions with scores, along with features

9-class emotions:
iic/emotion2vec_plus_seed, iic/emotion2vec_plus_base, iic/emotion2vec_plus_large (May. 2024 release)
iic/emotion2vec_base_finetuned (Jan. 2024 release)
    0: angry
    1: disgusted
    2: fearful
    3: happy
    4: neutral
    5: other
    6: sad
    7: surprised
    8: unknown
'''

from funasr import AutoModel

# model="iic/emotion2vec_base"
# model="iic/emotion2vec_base_finetuned"
# model="iic/emotion2vec_plus_seed"
# model="iic/emotion2vec_plus_base"
model_id = "iic/emotion2vec_plus_large"

model = AutoModel(
    model=model_id,
    hub="ms",  # "ms" or "modelscope" for China mainland users; "hf" or "huggingface" for other overseas users
)

wav_file = f"{model.model_path}/example/test.wav"
rec_result = model.generate(wav_file, output_dir="./outputs", granularity="utterance", extract_embedding=False)
print(rec_result)
```

--------------------------------

### Batch Process Audio Files with wav.scp

Source: https://context7.com/ddlbojack/emotion2vec/llms.txt

Uses a Kaldi-style manifest file to process multiple audio files in a single batch, returning both classification results and embeddings.

```python
from funasr import AutoModel

wav_scp_content = """audio_001\t/path/to/audio1.wav\naudio_002\t/path/to/audio2.wav\naudio_003\t/path/to/audio3.wav\n"""
with open("wav.scp", "w") as f:
    f.write(wav_scp_content)

model = AutoModel(model="iic/emotion2vec_plus_large", hub="ms")

results = model.generate(
    "wav.scp",
    output_dir="./batch_outputs",
    granularity="utterance",
    extract_embedding=True
)

for i, result in enumerate(results):
    print(f"Audio {i}: Label={result['labels']}, Scores={result['scores']}")
    if 'feats' in result:
        print(f"  Embedding shape: {result['feats'].shape}")
```

--------------------------------

### Run IEMOCAP Downstream Training Pipeline

Source: https://context7.com/ddlbojack/emotion2vec/llms.txt

A shell script to execute 5-fold cross-validation training on the IEMOCAP dataset using extracted emotion2vec features.

```bash
#!/bin/bash
export CUDA_VISIBLE_DEVICES=0

python main.py \
    dataset._name=IEMOCAP \
    dataset.feat_path=/path/to/emotion2vec_features \
    model._name=BaseModel \
    dataset.batch_size=128 \
    optimization.epoch=100 \
    optimization.lr=5e-4 \
    dataset.eval_is_test=false
```

--------------------------------

### Python Emotion Prediction with Emotion2Vec

Source: https://github.com/ddlbojack/emotion2vec/blob/main/iemocap_downstream/inference.ipynb

This Python script loads a pre-trained Emotion2Vec model, prepares sample input features and padding masks, performs inference to predict an emotion, and then decodes the prediction into a human-readable label. It requires PyTorch and a custom BaseModel class.

```python
import torch
from model import BaseModel

label_dict = {'ang': 0, 'hap': 1, 'neu': 2, 'sad': 3}
idx2label = {v: k for k, v in label_dict.items()}

model = BaseModel(input_dim=768, output_dim=len(label_dict))
ckpt = torch.load('outputs/2024-01-14/22-57-42/model_1.pth')
model.load_state_dict(ckpt)

feat = torch.randn(1, 100, 768)
padding_mask = torch.zeros(1, 100).bool()
outputs = model(feat, padding_mask)
_, predict = torch.max(outputs.data, dim=1)
print(idx2label[predict.item()])
```

--------------------------------

### Train IEMOCAP Downstream Classifier - Python

Source: https://context7.com/ddlbojack/emotion2vec/llms.txt

Entry point for training a downstream classifier on the IEMOCAP dataset using emotion2vec features. It handles data loading, model initialization, training loop, and evaluation with 5-fold cross-validation. Dependencies include hydra, omegaconf, and torch.

```python
import hydra
from omegaconf import DictConfig
import torch
from torch import nn, optim

from data import load_ssl_features, train_valid_test_iemocap_dataloader
from model import BaseModel
from utils import train_one_epoch, validate_and_test


@hydra.main(config_path='config', config_name='default.yaml')
def train_iemocap(cfg: DictConfig):
    torch.manual_seed(cfg.common.seed)

    # IEMOCAP 4-class emotion labels
    label_dict = {'ang': 0, 'hap': 1, 'neu': 2, 'sad': 3}
    n_samples = [1085, 1023, 1151, 1031, 1241]  # Samples per session

    for fold in range(5):  # 5-fold cross-validation
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        # Load pre-extracted emotion2vec features
        dataset = load_ssl_features(cfg.dataset.feat_path, label_dict)

        # Create data loaders for this fold
        test_len = n_samples[fold]
        test_idx_start = sum(n_samples[:fold])
        test_idx_end = test_idx_start + test_len
        train_loader, val_loader, test_loader = train_valid_test_iemocap_dataloader(
            dataset, cfg.dataset.batch_size, test_idx_start, test_idx_end,
            eval_is_test=cfg.dataset.eval_is_test
        )

        # Initialize model: 768-dim input → 4 emotion classes
        model = BaseModel(input_dim=768, output_dim=len(label_dict)).to(device)
        optimizer = optim.RMSprop(model.parameters(), lr=cfg.optimization.lr, momentum=0.9)
        scheduler = optim.lr_scheduler.CyclicLR(
            optimizer, base_lr=cfg.optimization.lr, max_lr=1e-3, step_size_up=10
        )
        criterion = nn.CrossEntropyLoss()

        # Training loop
        best_val_wa = 0
        for epoch in range(cfg.optimization.epoch):
            train_loss = train_one_epoch(model, optimizer, criterion, train_loader, device)
            scheduler.step()
            val_wa, val_ua, val_f1 = validate_and_test(model, val_loader, device, num_classes=4)

            if val_wa > best_val_wa:
                best_val_wa = val_wa
                torch.save(model.state_dict(), f"model_{fold+1}.pth")

        # Final test evaluation
        model.load_state_dict(torch.load(f"model_{fold+1}.pth"))
        test_wa, test_ua, test_f1 = validate_and_test(model, test_loader, device, num_classes=4)
        print(f"Fold {fold+1}: WA={test_wa:.2f}%, UA={test_ua:.2f}%, F1={test_f1:.2f}%")


if __name__ == '__main__':
    train_iemocap()
```

--------------------------------

### Extract Features using FunASR

Source: https://github.com/ddlbojack/emotion2vec/blob/main/README.md

Extracts emotion features from audio files using the FunASR library. The model is automatically downloaded. It supports both utterance-level and frame-level feature extraction. Input can be a single WAV file or a list of files in wav.scp format.

```python
from funasr import AutoModel

model_id = "iic/emotion2vec_base"
model = AutoModel(
    model=model_id,
    hub="ms",  # "ms" or "modelscope" for China mainland users; "hf" or "huggingface" for other overseas users
)

wav_file = f"{model.model_path}/example/test.wav"
rec_result = model.generate(wav_file, output_dir="./outputs", granularity="utterance")
print(rec_result)
```

--------------------------------

### Extract Features via Shell Script

Source: https://context7.com/ddlbojack/emotion2vec/llms.txt

Provides a command-line interface for feature extraction using the source code, saving output as numpy arrays.

```bash
#!/bin/bash
export CUDA_VISIBLE_DEVICES=0

python scripts/extract_features.py \
    --source_file='/path/to/audio.wav' \
    --target_file='/path/to/output.npy' \
    --model_dir='./upstream' \
    --checkpoint_dir='/path/to/emotion2vec_base.pt' \
    --granularity='utterance'
```

--------------------------------

### FunASR Emotion Recognition API

Source: https://context7.com/ddlbojack/emotion2vec/llms.txt

This API allows for speech emotion recognition using pre-trained emotion2vec+ models. It supports 9-class emotion classification and can extract embeddings if needed.

```APIDOC
## FunASR Emotion Recognition API

### Description
This API provides a simple interface for performing speech emotion recognition using various pre-trained emotion2vec+ models. It supports 9-class emotion classification and can optionally extract 768-dimensional embeddings.

### Method
Python call: `model.generate` (a library method, not an HTTP request)

### Endpoint
`AutoModel.generate` (within FunASR library)

### Parameters
#### `AutoModel` Arguments
- **model** (string) - Required - The name of the pre-trained model to use (e.g., `iic/emotion2vec_plus_large`).
- **hub** (string) - Required - The model hub to use (`ms` for ModelScope, `hf` for Hugging Face).

#### `generate` Arguments
- **granularity** (string) - Optional - The level of detail for the output. Options: `utterance` or `frame`. Defaults to `utterance`.
- **extract_embedding** (boolean) - Optional - Whether to extract embeddings. Defaults to `False`.

### Request Example
```python
from funasr import AutoModel

model = AutoModel(
    model="iic/emotion2vec_plus_large",
    hub="ms",
)

wav_file = "/path/to/your/audio.wav"
result = model.generate(
    wav_file,
    output_dir="./outputs",
    granularity="utterance",
    extract_embedding=False
)

print(result)
```

### Response
#### Success Response
- **labels** (list of int) - Predicted emotion class indices.
- **scores** (list of list of float) - Confidence scores for each emotion class.
- **feats** (numpy.ndarray) - (Optional) 768-dimensional embeddings if `extract_embedding` is True.

#### Response Example
```json
{
  "labels": [4],
  "scores": [[0.01, 0.02, 0.03, 0.04, 0.85, 0.01, 0.01, 0.02, 0.01]]
}
```

### Error Handling
- Invalid model name or hub will raise an error.
- File not found for `wav_file` will raise an error.
```

--------------------------------

### Extract Emotion Features from Audio

Source: https://context7.com/ddlbojack/emotion2vec/llms.txt

This function loads a pre-trained emotion2vec model and processes a 16kHz mono WAV file to extract emotion embeddings. It supports both frame-level and utterance-level aggregation and saves the results as a NumPy array. It requires a CUDA device, because the model and the input are moved to the GPU.

```python
from dataclasses import dataclass

import numpy as np
import soundfile as sf
import torch
import torch.nn.functional as F
import fairseq


@dataclass
class UserDirModule:
    # Minimal holder for the `user_dir` attribute that fairseq's import_user_module reads
    user_dir: str


def extract_emotion_features(source_file, target_file, model_dir, checkpoint_dir, granularity="utterance"):
    # Register the custom model code found in model_dir with fairseq
    model_path = UserDirModule(model_dir)
    fairseq.utils.import_user_module(model_path)

    # Load the checkpoint and switch the model to inference mode on the GPU
    model, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([checkpoint_dir])
    model = model[0]
    model.eval()
    model.cuda()

    wav, sr = sf.read(source_file)
    assert sr == 16000, f"Sample rate must be 16kHz, got {sr}"

    with torch.no_grad():
        source = torch.from_numpy(wav).float().cuda().view(1, -1)
        feats = model.extract_features(source, padding_mask=None)['x'].squeeze(0).cpu().numpy()

    # Utterance level: average the frame features over time
    if granularity == 'utterance':
        feats = np.mean(feats, axis=0)

    np.save(target_file, feats)
    return feats
```

--------------------------------

### Extract Emotion Embeddings with FunASR

Source: https://context7.com/ddlbojack/emotion2vec/llms.txt

Shows how to extract 768-dimensional emotion embeddings at either the utterance or frame level using the base emotion2vec model.

```python
from funasr import AutoModel

model = AutoModel(
    model="iic/emotion2vec_base",
    hub="ms",
)

wav_file = f"{model.model_path}/example/test.wav"

# Utterance-level
result = model.generate(wav_file, output_dir="./outputs", granularity="utterance")
print(f"Utterance embedding shape: {result[0]['feats'].shape}")

# Frame-level
result_frames = model.generate(wav_file, output_dir="./outputs", granularity="frame")
print(f"Frame embeddings shape: {result_frames[0]['feats'].shape}")
```

--------------------------------

### Data2VecMultiModel Feature Extraction Interface

Source: https://context7.com/ddlbojack/emotion2vec/llms.txt

Defines the interface for the Data2VecMultiModel, which processes audio tensors to produce emotion embeddings. It returns a dictionary containing the encoder outputs and metadata.

```python
import torch


# Interface sketch: this stub only documents the method signature. The usage
# lines below assume the real model class and a `cfg` obtained elsewhere
# (for example from a loaded checkpoint); they do not run against the stub.
class Data2VecMultiModel:
    def extract_features(self, source, mode=None, padding_mask=None, mask=False, remove_extra_tokens=True) -> dict:
        """Extract emotion representations from audio."""
        pass


model = Data2VecMultiModel(cfg)
model.eval()
model.cuda()

audio = torch.randn(1, 16000).cuda()
result = model.extract_features(audio, padding_mask=None)
embeddings = result['x']
```

--------------------------------

### FunASR Emotion Embedding Extraction API

Source: https://context7.com/ddlbojack/emotion2vec/llms.txt

This API focuses on extracting 768-dimensional emotion embeddings from speech using the emotion2vec_base model. Embeddings can be extracted at utterance or frame level.

```APIDOC
## FunASR Emotion Embedding Extraction API

### Description
This API is designed to extract 768-dimensional emotion embeddings from audio files using the `emotion2vec_base` model. These embeddings can be used for various downstream tasks. Embeddings can be extracted at either the utterance level (averaged) or frame level.

### Method
Python call: `model.generate` (a library method, not an HTTP request)

### Endpoint
`AutoModel.generate` (within FunASR library)

### Parameters
#### `AutoModel` Arguments
- **model** (string) - Required - The name of the pre-trained model to use (e.g., `iic/emotion2vec_base`).
- **hub** (string) - Required - The model hub to use (`ms` for ModelScope, `hf` for Hugging Face).

#### `generate` Arguments
- **granularity** (string) - Required - The level at which to extract features. Options: `utterance` or `frame`.

### Request Example
```python
from funasr import AutoModel

model = AutoModel(
    model="iic/emotion2vec_base",
    hub="ms",
)

wav_file = "/path/to/your/audio.wav"

# Extract utterance-level embeddings
result_utterance = model.generate(
    wav_file,
    output_dir="./outputs",
    granularity="utterance"
)
print(f"Utterance embedding shape: {result_utterance[0]['feats'].shape}")

# Extract frame-level embeddings
result_frames = model.generate(
    wav_file,
    output_dir="./outputs",
    granularity="frame"
)
print(f"Frame embeddings shape: {result_frames[0]['feats'].shape}")
```

### Response
#### Success Response
- **feats** (numpy.ndarray) - A numpy array containing the emotion embeddings. Shape is `[768]` for utterance level or `[T, 768]` for frame level, where T is the number of frames.

#### Response Example
```json
{
  "feats": [0.1, -0.2, ..., 0.5] // Example for utterance level
}
```

### Error Handling
- Invalid model name or hub will raise an error.
- File not found for `wav_file` will raise an error.
```

--------------------------------

### BaseModel Downstream Classifier - Python

Source: https://context7.com/ddlbojack/emotion2vec/llms.txt

Defines a simple classification head for emotion recognition using emotion2vec embeddings. It consists of two linear layers with padding-aware mean pooling between them. Dependencies include torch.

```python
import torch
from torch import nn


class BaseModel(nn.Module):
    """
    Simple downstream classifier for emotion2vec features.

    Architecture: Linear(768→256) → ReLU → MeanPool → Linear(256→num_classes)
    """

    def __init__(self, input_dim=768, output_dim=4):
        super().__init__()
        self.pre_net = nn.Linear(input_dim, 256)
        self.post_net = nn.Linear(256, output_dim)
        self.activate = nn.ReLU()

    def forward(self, x, padding_mask=None):
        """
        Args:
            x: [batch, seq_len, 768] emotion2vec frame features
            padding_mask: [batch, seq_len] boolean (True=padded positions)

        Returns:
            logits: [batch, num_classes] emotion class logits
        """
        if padding_mask is None:
            # No mask given: treat every frame as valid
            padding_mask = torch.zeros(x.shape[:2], dtype=torch.bool, device=x.device)

        x = self.activate(self.pre_net(x))  # [batch, seq_len, 256]

        # Padding-aware mean pooling
        x = x * (1 - padding_mask.unsqueeze(-1).float())
        x = x.sum(dim=1) / (1 - padding_mask.float()).sum(dim=1, keepdim=True)

        x = self.post_net(x)  # [batch, num_classes]
        return x


# Example usage
model = BaseModel(input_dim=768, output_dim=4)

# Input: batch of frame-level emotion2vec features
batch_feats = torch.randn(32, 100, 768)  # [batch, seq_len, 768]
padding_mask = torch.zeros(32, 100, dtype=torch.bool)
padding_mask[:, 80:] = True  # Mark last 20 frames as padding

logits = model(batch_feats, padding_mask)
predictions = torch.argmax(logits, dim=1)
print(f"Predictions shape: {predictions.shape}")  # [32]
```
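
The padding-aware mean pooling used by `BaseModel` can be checked by hand on a tiny input. The sketch below re-implements only that pooling step with plain Python lists instead of torch, so the arithmetic is easy to follow; `masked_mean_pool` is a hypothetical name and is not part of the repository.

```python
def masked_mean_pool(frames, padding_mask):
    """Average frame vectors over time, skipping padded frames.

    frames: list of T feature vectors (each a list of floats)
    padding_mask: list of T booleans, True marks a padded frame
    """
    valid = [vec for vec, is_pad in zip(frames, padding_mask) if not is_pad]
    if not valid:
        raise ValueError("every frame is masked; the mean is undefined")
    dim = len(valid[0])
    # Sum each feature dimension over the valid frames, then divide by their count
    return [sum(vec[d] for vec in valid) / len(valid) for d in range(dim)]


# Three frames with two features each; the last frame is padding and must not count
frames = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [False, False, True]
pooled = masked_mean_pool(frames, mask)
print(pooled)  # [2.0, 3.0]
```

Dividing by the number of valid frames rather than by the padded sequence length is what keeps short utterances from being pulled toward zero when they are batched with longer ones.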