### Install emotion2vec from Source Code

Source: https://github.com/ddlbojack/emotion2vec/blob/main/README.md

Installs the emotion2vec library from its source code. Requires Python 3.8+ and PyTorch 1.13+. This process involves cloning the repository, installing fairseq, and downloading pre-trained checkpoints.

```bash
pip install fairseq
git clone https://github.com/ddlBoJack/emotion2vec.git
```

--------------------------------

### Install FunASR for Emotion2Vec+

Source: https://github.com/ddlbojack/emotion2vec/blob/main/README.md

This snippet shows how to install the FunASR library, which is required for using the emotion2vec+ models for speech emotion recognition. It uses pip for installation.

```bash
pip install -U funasr
```

--------------------------------

### FunASR Support for Kaldi-style Wav.scp

Source: https://github.com/ddlbojack/emotion2vec/blob/main/README.md

This example illustrates how FunASR supports processing multiple audio files using a Kaldi-style wav.scp file. This format allows for efficient batch processing of audio data by listing audio file names and their corresponding paths.

```text
wav_name1 wav_path1.wav
wav_name2 wav_path2.wav
...
```
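
As a minimal sketch of the format above, the following stdlib-only Python writes and reads a manifest with one `name path` pair per line. The helper names `write_wav_scp` and `read_wav_scp` are illustrative and not part of FunASR, and splitting each line on the first run of whitespace is an assumption about the layout.

```python
from pathlib import Path
import tempfile


def write_wav_scp(entries, scp_path):
    """Write (utterance_name, wav_path) pairs, one per line."""
    lines = [f"{name} {path}" for name, path in entries]
    Path(scp_path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def read_wav_scp(scp_path):
    """Parse a manifest back into a {name: path} dict, skipping blank lines."""
    table = {}
    for line in Path(scp_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Split on the first whitespace run only, so the path is kept whole
        name, path = line.split(maxsplit=1)
        table[name] = path.strip()
    return table


entries = [("wav_name1", "wav_path1.wav"), ("wav_name2", "wav_path2.wav")]
scp_file = Path(tempfile.mkdtemp()) / "wav.scp"
write_wav_scp(entries, scp_file)
manifest = read_wav_scp(scp_file)
print(manifest)
```

Reading the manifest back before inference is a cheap way to catch blank or malformed lines early.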

--------------------------------

### Source Code Feature Extraction (CLI)

Source: https://context7.com/ddlbojack/emotion2vec/llms.txt

This section provides a command-line interface (CLI) example for extracting features using the emotion2vec model directly from its source code, typically for integration with fairseq.

```APIDOC
## Source Code Feature Extraction (CLI)

### Description
This command-line script allows for direct feature extraction from audio files using the emotion2vec model's source code. It's useful for advanced users or when integrating with frameworks like fairseq. The extracted features are saved as numpy arrays.

### Method
CLI Command (Bash)

### Endpoint
`scripts/extract_features.py`

### Parameters
#### Command-line Arguments
- **--source_file** (string) - Required - Path to the input audio file.
- **--target_file** (string) - Required - Path to save the output numpy file.
- **--model_dir** (string) - Required - Directory containing the model files.
- **--checkpoint_dir** (string) - Required - Path to the model checkpoint file (e.g., `.pt`).
- **--granularity** (string) - Optional - Feature extraction level. Options: `utterance` or `frame`. Defaults to `utterance`.

### Request Example
```bash
#!/bin/bash
export CUDA_VISIBLE_DEVICES=0

python scripts/extract_features.py \
    --source_file='/path/to/your/audio.wav' \
    --target_file='/path/to/save/features.npy' \
    --model_dir='./upstream' \
    --checkpoint_dir='/path/to/emotion2vec_base.pt' \
    --granularity='utterance'
```

### Response
#### Success Response
- A `.npy` file is created at the specified `--target_file` path containing the extracted emotion features.

#### Response Example
(No direct output, but a file is created)
```
# Output file: /path/to/save/features.npy
# Content will be a numpy array (e.g., shape [768] or [T, 768])
```

### Error Handling
- Ensure all paths and directories are correct.
- CUDA availability is recommended for performance.
```

--------------------------------

### Load SSL Features and Create Dataset (Python)

Source: https://context7.com/ddlbojack/emotion2vec/llms.txt

Loads pre-extracted emotion2vec features from disk and prepares them for use with PyTorch's Dataset. It handles feature files, lengths, and emotion labels, returning a dictionary containing features, sample sizes, offsets, and integer labels.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

def load_ssl_features(feature_path, label_dict, max_speech_seq_len=None):
    """
    Load pre-extracted emotion2vec features from disk.

    Expected files:
        - {feature_path}.npy: Concatenated features array
        - {feature_path}.lengths: One length per line
        - {feature_path}.emo: Labels in format "utt_id emotion"

    Returns:
        dict with 'feats', 'sizes', 'offsets', 'labels', 'num'
    """
    # The repository's own loader (assumed to be defined elsewhere) would be called as:
    # data, sizes, offsets, labels = load_dataset(
    #     feature_path, labels='emo', min_length=1, max_length=max_speech_seq_len
    # )
    # Placeholder: random data with the same layout, built so that sizes,
    # offsets, and the feature array stay mutually consistent
    num_samples = 10
    sizes = np.random.randint(10, 100, num_samples)
    offsets = np.concatenate(([0], np.cumsum(sizes)[:-1]))
    data = np.random.rand(int(sizes.sum()), 768)
    keys = list(label_dict.keys())
    labels = [keys[i % len(keys)] for i in range(num_samples)]

    labels = [label_dict[elem] for elem in labels]

    return {
        "feats": data,      # [total_frames, 768] numpy array
        "sizes": sizes,     # [num_samples] frame counts
        "offsets": offsets, # [num_samples] start indices
        "labels": labels,   # [num_samples] integer labels
        "num": len(labels)
    }

class SpeechDataset(Dataset):
    """Dataset for frame-level emotion features with variable lengths."""

    def __init__(self, feats, sizes, offsets, labels=None):
        self.feats = feats      # [total_frames, 768]
        self.sizes = sizes      # Length of each sample
        self.offsets = offsets  # Start offset of each sample
        self.labels = labels

    def __getitem__(self, index):
        offset = self.offsets[index]
        end = self.sizes[index] + offset
        feats = torch.from_numpy(self.feats[offset:end, :].copy()).float()

        return {
            "id": index,
            "feats": feats,  # [seq_len, 768]
            "target": self.labels[index] if self.labels is not None else None
        }

    def __len__(self):
        return len(self.sizes)

    def collator(self, samples):
        """Batch collator with padding."""
        feats = [s["feats"] for s in samples]
        sizes = [f.shape[0] for f in feats]
        labels = torch.tensor([s["target"] for s in samples])

        # Pad to max length in batch
        target_size = max(sizes)
        collated_feats = torch.zeros(len(feats), target_size, feats[0].size(-1))
        padding_mask = torch.zeros(len(feats), target_size, dtype=torch.bool)

        for i, (feat, size) in enumerate(zip(feats, sizes)):
            collated_feats[i, :size] = feat
            padding_mask[i, size:] = True

        return {
            "id": torch.LongTensor([s["id"] for s in samples]),
            "net_input": {"feats": collated_feats, "padding_mask": padding_mask},
            "labels": labels
        }

# Example usage:
label_dict = {'ang': 0, 'hap': 1, 'neu': 2, 'sad': 3}
dataset = load_ssl_features("/path/to/iemocap_features", label_dict)

# Assuming train_valid_test_iemocap_dataloader is defined elsewhere
# train_loader, val_loader, test_loader = train_valid_test_iemocap_dataloader(
#     dataset, batch_size=128, test_start=0, test_end=1085, eval_is_test=False
# )

# Placeholder for dataloader iteration
# for batch in train_loader:
#     feats = batch["net_input"]["feats"]  # [128, max_len, 768]
#     mask = batch["net_input"]["padding_mask"]  # [128, max_len]
#     labels = batch["labels"]  # [128]
#     print(f"Batch shapes: feats={feats.shape}, labels={labels.shape}")
#     break

```
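
The relationship between the per-utterance lengths and the start offsets used by `SpeechDataset` can be sketched without any dependencies: each offset is the cumulative sum of the lengths before it. `build_offsets` and `frame_range` are hypothetical helper names, and the frame counts are made-up values.

```python
from itertools import accumulate


def build_offsets(sizes):
    """Start index of each utterance inside the concatenated feature array."""
    if not sizes:
        return []
    return [0] + list(accumulate(sizes))[:-1]


def frame_range(sizes, offsets, index):
    """Half-open [start, end) row range for one utterance."""
    start = offsets[index]
    return start, start + sizes[index]


sizes = [120, 85, 240]            # frames per utterance (made-up values)
offsets = build_offsets(sizes)
print(offsets)                          # [0, 120, 205]
print(frame_range(sizes, offsets, 2))   # (205, 445)
```

The last range ends at the total frame count, which is a quick consistency check against the number of rows in the feature array.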

--------------------------------

### Perform Emotion Recognition with FunASR

Source: https://context7.com/ddlbojack/emotion2vec/llms.txt

Demonstrates how to initialize an emotion2vec+ model and perform 9-class emotion classification on a single audio file. It supports both utterance-level classification and optional embedding extraction.

```python
from funasr import AutoModel

model = AutoModel(
    model="iic/emotion2vec_plus_large",
    hub="ms",
)

wav_file = f"{model.model_path}/example/test.wav"
result = model.generate(
    wav_file,
    output_dir="./outputs",
    granularity="utterance",
    extract_embedding=False
)

print(f"Predicted emotion: {result[0]['labels']}")
print(f"Confidence scores: {result[0]['scores']}")
```
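
To turn the scores into a single prediction, one option is to take the index of the highest score and map it to a class name. The sketch below is an assumption-laden illustration: it assumes `scores` holds one value per class, in the 9-class index order given in the README docstring (0 = angry through 8 = unknown), as a flat or singly nested plain list. `top_emotion` and `flatten` are hypothetical helpers, and `fake_entry` is a hand-written stand-in rather than real model output.

```python
EMOTIONS = ["angry", "disgusted", "fearful", "happy", "neutral",
            "other", "sad", "surprised", "unknown"]


def flatten(values):
    """Flatten one optional level of nesting, e.g. [[...]] -> [...]."""
    flat = []
    for v in values:
        if isinstance(v, (list, tuple)):
            flat.extend(v)
        else:
            flat.append(v)
    return flat


def top_emotion(entry):
    """Return (class_index, class_name, score) for the highest score."""
    scores = [float(s) for s in flatten(entry["scores"])]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, EMOTIONS[best], scores[best]


# Hand-written stand-in for one element of the list returned by model.generate
fake_entry = {"scores": [0.01, 0.02, 0.03, 0.04, 0.85, 0.01, 0.01, 0.02, 0.01]}
print(top_emotion(fake_entry))  # (4, 'neutral', 0.85)
```

Inspect the real `result[0]` before relying on this, since the exact structure of `labels` and `scores` may differ between FunASR versions.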

--------------------------------

### Batch Processing with Kaldi-style wav.scp

Source: https://context7.com/ddlbojack/emotion2vec/llms.txt

This section demonstrates how to process multiple audio files in batch using a Kaldi-style `wav.scp` file for efficient inference with emotion2vec models.

```APIDOC
## Batch Processing with Kaldi-style wav.scp

### Description
This feature allows for efficient batch processing of multiple audio files by referencing them in a `wav.scp` file. The `generate` method can accept a path to this file, enabling large-scale inference or embedding extraction.

### Method
Python method call (`model.generate`)

### Endpoint
`AutoModel.generate` in the FunASR library (a local call, not an HTTP endpoint)

### Parameters
#### Path Parameters
- **wav_scp_path** (string) - Required - Path to the `wav.scp` file.

#### Query Parameters
- **model** (string) - Required - The name of the pre-trained model to use.
- **hub** (string) - Required - The model hub to use (`ms` or `modelscope`, `hf` or `huggingface`).
- **granularity** (string) - Optional - `utterance` or `frame`.
- **extract_embedding** (boolean) - Optional - Set to `True` to also extract embeddings.

### Request Example
```python
from funasr import AutoModel

# Write an example wav.scp file (one "name<TAB>path" entry per line, no leading blank line)
wav_scp_content = (
    "audio_001\t/path/to/audio1.wav\n"
    "audio_002\t/path/to/audio2.wav\n"
    "audio_003\t/path/to/audio3.wav\n"
)
with open("wav.scp", "w") as f:
    f.write(wav_scp_content)

model = AutoModel(
    model="iic/emotion2vec_plus_large",
    hub="ms",
)

results = model.generate(
    "wav.scp",  # Path to the wav.scp file
    output_dir="./batch_outputs",
    granularity="utterance",
    extract_embedding=True
)

for i, result in enumerate(results):
    print(f"Audio {i}: Label={result['labels']}, Scores={result['scores']}")
    if 'feats' in result:
        print(f"  Embedding shape: {result['feats'].shape}")
```

### Response
#### Success Response
- A list of dictionaries, where each dictionary corresponds to an audio file in the `wav.scp` and contains prediction results (`labels`, `scores`) and optionally embeddings (`feats`).

#### Response Example
```json
[
  {
    "labels": [4],
    "scores": [[0.01, ..., 0.85, ...]],
    "feats": [0.1, -0.2, ..., 0.5]
  },
  {
    "labels": [6],
    "scores": [[0.02, ..., 0.75, ...]],
    "feats": [-0.3, 0.4, ..., -0.1]
  }
]
```

### Error Handling
- Errors related to file access or model inference will be reported.
```

--------------------------------

### Hydra Configuration for IEMOCAP Downstream Task (YAML)

Source: https://context7.com/ddlbojack/emotion2vec/llms.txt

Demonstrates Hydra configuration for the IEMOCAP downstream task using YAML files. It shows default settings for dataset, optimization, and model parameters, and how to override them via the command line.

```yaml
# iemocap_downstream/config/default.yaml
common:
  seed: 42

dataset:
  _name: IEMOCAP
  feat_path: /path/to/emotion2vec_features
  test_ratio: 0.2
  batch_size: 128
  fold: 5
  eval_is_test: False  # If True, use test set as validation

optimization:
  epoch: 100
  lr: 5e-4
  weight_decay: 1e-5
  label_smooth: 0.0

model:
  _name: BaseModel

```

```bash
# Override config via command line
python main.py \
    dataset.feat_path=/new/path/to/features \
    dataset.batch_size=64 \
    optimization.lr=1e-4 \
    optimization.epoch=50

```
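
To illustrate how the dotted command-line overrides map onto the nested YAML keys, here is a stdlib-only sketch. It is not Hydra's implementation: `apply_overrides` is a hypothetical helper, and parsing values with `ast.literal_eval` (falling back to a plain string) is a simplifying assumption.

```python
import ast
import copy


def apply_overrides(config, overrides):
    """Return a copy of a nested dict with 'a.b=value' overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, raw_value = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = result
        for key in parents:
            node = node[key]          # KeyError if the section does not exist
        try:
            node[leaf] = ast.literal_eval(raw_value)
        except (ValueError, SyntaxError):
            node[leaf] = raw_value    # keep non-literal values (e.g. paths) as strings
    return result


defaults = {
    "dataset": {"feat_path": "/path/to/emotion2vec_features", "batch_size": 128},
    "optimization": {"epoch": 100, "lr": 5e-4},
}
cfg = apply_overrides(defaults, [
    "dataset.feat_path=/new/path/to/features",
    "dataset.batch_size=64",
    "optimization.lr=1e-4",
    "optimization.epoch=50",
])
print(cfg)
```

Each override names a section and a key separated by a dot, so `dataset.batch_size=64` replaces the `batch_size` entry under `dataset` and leaves every other default untouched.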

--------------------------------

### Inference with emotion2vec+ Models using FunASR

Source: https://github.com/ddlbojack/emotion2vec/blob/main/README.md

This Python script demonstrates how to perform speech emotion recognition using various emotion2vec+ models provided by FunASR. It shows how to load a model, specify the model ID, and run inference on a given audio file, outputting emotion labels and scores. The script also highlights different model versions and the option to extract embeddings.

```python
'''
Using the finetuned emotion recognition model

rec_result contains {'feats', 'labels', 'scores'}
	extract_embedding=False: 9-class emotions with scores
	extract_embedding=True: 9-class emotions with scores, along with features

9-class emotions: 
iic/emotion2vec_plus_seed, iic/emotion2vec_plus_base, iic/emotion2vec_plus_large (May. 2024 release)
iic/emotion2vec_base_finetuned (Jan. 2024 release)
    0: angry
    1: disgusted
    2: fearful
    3: happy
    4: neutral
    5: other
    6: sad
    7: surprised
    8: unknown
'''

from funasr import AutoModel

# model="iic/emotion2vec_base"
# model="iic/emotion2vec_base_finetuned"
# model="iic/emotion2vec_plus_seed"
# model="iic/emotion2vec_plus_base"
model_id = "iic/emotion2vec_plus_large"

model = AutoModel(
    model=model_id,
    hub="ms",  # "ms" or "modelscope" for China mainland users; "hf" or "huggingface" for other overseas users
)

wav_file = f"{model.model_path}/example/test.wav"
rec_result = model.generate(wav_file, output_dir="./outputs", granularity="utterance", extract_embedding=False)
print(rec_result)
```

--------------------------------

### Batch Process Audio Files with wav.scp

Source: https://context7.com/ddlbojack/emotion2vec/llms.txt

Utilizes Kaldi-style manifest files to process multiple audio files efficiently in a single batch, returning both classification results and embeddings.

```python
from funasr import AutoModel

wav_scp_content = """audio_001\t/path/to/audio1.wav\naudio_002\t/path/to/audio2.wav\naudio_003\t/path/to/audio3.wav\n"""

with open("wav.scp", "w") as f:
    f.write(wav_scp_content)

model = AutoModel(model="iic/emotion2vec_plus_large", hub="ms")

results = model.generate(
    "wav.scp",
    output_dir="./batch_outputs",
    granularity="utterance",
    extract_embedding=True
)

for i, result in enumerate(results):
    print(f"Audio {i}: Label={result['labels']}, Scores={result['scores']}")
    if 'feats' in result:
        print(f"  Embedding shape: {result['feats'].shape}")
```

--------------------------------

### Run IEMOCAP Downstream Training Pipeline

Source: https://context7.com/ddlbojack/emotion2vec/llms.txt

A shell script to execute 5-fold cross-validation training on the IEMOCAP dataset using extracted emotion2vec features.

```bash
#!/bin/bash
export CUDA_VISIBLE_DEVICES=0
python main.py \
    dataset._name=IEMOCAP \
    dataset.feat_path=/path/to/emotion2vec_features \
    model._name=BaseModel \
    dataset.batch_size=128 \
    optimization.epoch=100 \
    optimization.lr=5e-4 \
    dataset.eval_is_test=false
```

--------------------------------

### Python Emotion Prediction with Emotion2Vec

Source: https://github.com/ddlbojack/emotion2vec/blob/main/iemocap_downstream/inference.ipynb

This Python script loads a trained downstream classifier checkpoint (`BaseModel`), builds random stand-in features and a padding mask, runs inference, and decodes the predicted index into a human-readable emotion label. It requires PyTorch and the repository's `BaseModel` class.

```python
import torch
from model import BaseModel

label_dict={'ang': 0, 'hap': 1, 'neu': 2, 'sad': 3}
idx2label = {v: k for k, v in label_dict.items()}
model = BaseModel(input_dim=768, output_dim=len(label_dict))

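# Load on CPU so the example also runs without a GPU
ckpt = torch.load('outputs/2024-01-14/22-57-42/model_1.pth', map_location='cpu')
model.load_state_dict(ckpt)
model.eval()

# Random stand-in for frame-level features: [batch, seq_len, 768]
feat = torch.randn(1, 100, 768)
padding_mask = torch.zeros(1, 100).bool()  # False = real frame, True = padding
with torch.no_grad():
    outputs = model(feat, padding_mask)

# Pick the class with the highest logit and map it back to its label
_, predict = torch.max(outputs, dim=1)
print(idx2label[predict.item()])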
```

--------------------------------

### Train IEMOCAP Downstream Classifier - Python

Source: https://context7.com/ddlbojack/emotion2vec/llms.txt

Entry point for training a downstream classifier on the IEMOCAP dataset using emotion2vec features. It handles data loading, model initialization, training loop, and evaluation with 5-fold cross-validation. Dependencies include hydra, omegaconf, and torch.

```python
import hydra
from omegaconf import DictConfig
import torch
from torch import nn, optim
from data import load_ssl_features, train_valid_test_iemocap_dataloader
from model import BaseModel
from utils import train_one_epoch, validate_and_test

@hydra.main(config_path='config', config_name='default.yaml')
def train_iemocap(cfg: DictConfig):
    torch.manual_seed(cfg.common.seed)

    # IEMOCAP 4-class emotion labels
    label_dict = {'ang': 0, 'hap': 1, 'neu': 2, 'sad': 3}
    n_samples = [1085, 1023, 1151, 1031, 1241]  # Samples per session

    for fold in range(5):  # 5-fold cross-validation
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        # Load pre-extracted emotion2vec features
        dataset = load_ssl_features(cfg.dataset.feat_path, label_dict)

        # Create data loaders for this fold
        test_len = n_samples[fold]
        test_idx_start = sum(n_samples[:fold])
        test_idx_end = test_idx_start + test_len

        train_loader, val_loader, test_loader = train_valid_test_iemocap_dataloader(
            dataset, cfg.dataset.batch_size, test_idx_start, test_idx_end,
            eval_is_test=cfg.dataset.eval_is_test
        )

        # Initialize model: 768-dim input → 4 emotion classes
        model = BaseModel(input_dim=768, output_dim=len(label_dict)).to(device)
        optimizer = optim.RMSprop(model.parameters(), lr=cfg.optimization.lr, momentum=0.9)
        scheduler = optim.lr_scheduler.CyclicLR(
            optimizer, base_lr=cfg.optimization.lr, max_lr=1e-3, step_size_up=10
        )
        criterion = nn.CrossEntropyLoss()

        # Training loop
        best_val_wa = 0
        for epoch in range(cfg.optimization.epoch):
            train_loss = train_one_epoch(model, optimizer, criterion, train_loader, device)
            scheduler.step()

            val_wa, val_ua, val_f1 = validate_and_test(model, val_loader, device, num_classes=4)

            if val_wa > best_val_wa:
                best_val_wa = val_wa
                torch.save(model.state_dict(), f"model_{fold+1}.pth")

        # Final test evaluation
        model.load_state_dict(torch.load(f"model_{fold+1}.pth"))
        test_wa, test_ua, test_f1 = validate_and_test(model, test_loader, device, num_classes=4)
        print(f"Fold {fold+1}: WA={test_wa:.2f}%, UA={test_ua:.2f}%, F1={test_f1:.2f}%")

if __name__ == '__main__':
    train_iemocap()

```
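
The fold boundaries in the training loop follow directly from the per-session sample counts: each fold holds out one session as the test set. The sketch below reproduces that index arithmetic on its own, assuming the `(start, end)` pair is treated as a half-open range; `fold_ranges` is a hypothetical helper name.

```python
from itertools import accumulate

n_samples = [1085, 1023, 1151, 1031, 1241]  # per-session counts from the training script


def fold_ranges(counts):
    """Half-open (start, end) test index range for each fold."""
    ends = list(accumulate(counts))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


ranges = fold_ranges(n_samples)
for fold, (start, end) in enumerate(ranges, start=1):
    print(f"Fold {fold}: test indices [{start}, {end})")
```

With these counts the five test ranges cover indices 0 through 5530 with no overlap, so every sample is used for testing exactly once.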

--------------------------------

### Extract Features using FunASR

Source: https://github.com/ddlbojack/emotion2vec/blob/main/README.md

Extracts emotion features from audio files using the FunASR library. The model is automatically downloaded. It supports both utterance-level and frame-level feature extraction. Input can be a single WAV file or a list of files in wav.scp format.

```python
from funasr import AutoModel

model_id = "iic/emotion2vec_base"
model = AutoModel(
    model=model_id,
    hub="ms",  # "ms" or "modelscope" for China mainland users; "hf" or "huggingface" for other overseas users
)

wav_file = f"{model.model_path}/example/test.wav"
rec_result = model.generate(wav_file, output_dir="./outputs", granularity="utterance")
print(rec_result)
```

--------------------------------

### Extract Features via Shell Script

Source: https://context7.com/ddlbojack/emotion2vec/llms.txt

Provides a command-line interface for feature extraction using the source code, saving output as numpy arrays.

```bash
#!/bin/bash
export CUDA_VISIBLE_DEVICES=0

python scripts/extract_features.py \
    --source_file='/path/to/audio.wav' \
    --target_file='/path/to/output.npy' \
    --model_dir='./upstream' \
    --checkpoint_dir='/path/to/emotion2vec_base.pt' \
    --granularity='utterance'
```

--------------------------------

### FunASR Emotion Recognition API

Source: https://context7.com/ddlbojack/emotion2vec/llms.txt

This API allows for speech emotion recognition using pre-trained emotion2vec+ models. It supports 9-class emotion classification and can extract embeddings if needed.

```APIDOC
## FunASR Emotion Recognition API

### Description
This API provides a simple interface for performing speech emotion recognition using various pre-trained emotion2vec+ models. It supports 9-class emotion classification and can optionally extract 768-dimensional embeddings.

### Method
Python method call (`model.generate`)

### Endpoint
`AutoModel.generate` in the FunASR library (a local call, not an HTTP endpoint)

### Parameters
#### Query Parameters
- **model** (string) - Required - The name of the pre-trained model to use (e.g., `iic/emotion2vec_plus_large`).
- **hub** (string) - Required - The model hub to use (`ms` for ModelScope, `hf` for Hugging Face).
- **granularity** (string) - Optional - The level of detail for the output. Options: `utterance` or `frame`. Defaults to `utterance`.
- **extract_embedding** (boolean) - Optional - Whether to extract embeddings. Defaults to `False`.

### Request Example
```python
from funasr import AutoModel

model = AutoModel(
    model="iic/emotion2vec_plus_large",
    hub="ms",
)
wav_file = "/path/to/your/audio.wav"
result = model.generate(
    wav_file,
    output_dir="./outputs",
    granularity="utterance",
    extract_embedding=False
)
print(result)
```

### Response
#### Success Response
- **labels** (list of int) - Predicted emotion class indices.
- **scores** (list of list of float) - Confidence scores for each emotion class.
- **feats** (numpy.ndarray) - (Optional) 768-dimensional embeddings if `extract_embedding` is True.

#### Response Example
```json
{
  "labels": [4],
  "scores": [[0.01, 0.02, 0.03, 0.04, 0.85, 0.01, 0.01, 0.02, 0.01]]
}
```

### Error Handling
- Invalid model name or hub will raise an error.
- File not found for `wav_file` will raise an error.
```

--------------------------------

### Extract Emotion Features from Audio

Source: https://context7.com/ddlbojack/emotion2vec/llms.txt

This function loads a pre-trained emotion2vec model and processes a 16kHz mono WAV file to extract emotion embeddings. It supports both frame-level and utterance-level aggregation and saves the results as a NumPy array.

```python
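from dataclasses import dataclass

import fairseq
import numpy as np
import soundfile as sf
import torch
import torch.nn.functional as F


@dataclass
class UserDirModule:
    """Minimal args object; fairseq reads its `user_dir` attribute."""
    user_dir: str


def extract_emotion_features(source_file, target_file, model_dir, checkpoint_dir, granularity="utterance"):
    # Register the custom model code in model_dir with fairseq
    fairseq.utils.import_user_module(UserDirModule(model_dir))
    model, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([checkpoint_dir])
    model = model[0]
    model.eval()
    model.cuda()

    wav, sr = sf.read(source_file)
    assert sr == 16000, f"Sample rate must be 16kHz, got {sr}"
    assert wav.ndim == 1, "Audio must be mono"

    with torch.no_grad():
        source = torch.from_numpy(wav).float().cuda()
        # Normalize the waveform when the checkpoint's task config asks for it
        if getattr(task.cfg, "normalize", False):
            source = F.layer_norm(source, source.shape)
        source = source.view(1, -1)
        feats = model.extract_features(source, padding_mask=None)['x'].squeeze(0).cpu().numpy()
        if granularity == 'utterance':
            feats = np.mean(feats, axis=0)  # [T, 768] -> [768]
    np.save(target_file, feats)
    return feats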
```

--------------------------------

### Extract Emotion Embeddings with FunASR

Source: https://context7.com/ddlbojack/emotion2vec/llms.txt

Shows how to extract 768-dimensional emotion embeddings at either the utterance or frame level using the base emotion2vec model.

```python
from funasr import AutoModel

model = AutoModel(
    model="iic/emotion2vec_base",
    hub="ms",
)

wav_file = f"{model.model_path}/example/test.wav"

# Utterance-level
result = model.generate(wav_file, output_dir="./outputs", granularity="utterance")
print(f"Utterance embedding shape: {result[0]['feats'].shape}")

# Frame-level
result_frames = model.generate(wav_file, output_dir="./outputs", granularity="frame")
print(f"Frame embeddings shape: {result_frames[0]['feats'].shape}")
```

--------------------------------

### Data2VecMultiModel Feature Extraction Interface

Source: https://context7.com/ddlbojack/emotion2vec/llms.txt

Defines the interface for the Data2VecMultiModel, which processes audio tensors to produce emotion embeddings. It returns a dictionary containing the encoder outputs and metadata.

```python
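import torch

# Interface sketch only: the real class lives in the repository's model code
# and is normally built by fairseq when a checkpoint is loaded.
class Data2VecMultiModel:
    def extract_features(self, source, mode=None, padding_mask=None, mask=False, remove_extra_tokens=True) -> dict:
        """Extract emotion representations from audio."""
        pass

# Illustrative usage; `cfg` is assumed to be a model config obtained elsewhere
model = Data2VecMultiModel(cfg)
model.eval()
model.cuda()
audio = torch.randn(1, 16000).cuda()  # 1 second of 16 kHz audio
result = model.extract_features(audio, padding_mask=None)
embeddings = result['x']  # encoder outputs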

--------------------------------

### FunASR Emotion Embedding Extraction API

Source: https://context7.com/ddlbojack/emotion2vec/llms.txt

This API focuses on extracting 768-dimensional emotion embeddings from speech using the emotion2vec_base model. Embeddings can be extracted at utterance or frame level.

```APIDOC
## FunASR Emotion Embedding Extraction API

### Description
This API is designed to extract 768-dimensional emotion embeddings from audio files using the `emotion2vec_base` model. These embeddings can be used for various downstream tasks. Embeddings can be extracted at either the utterance level (averaged) or frame level.

### Method
Python method call (`model.generate`)

### Endpoint
`AutoModel.generate` in the FunASR library (a local call, not an HTTP endpoint)

### Parameters
#### Query Parameters
- **model** (string) - Required - The name of the pre-trained model to use (e.g., `iic/emotion2vec_base`).
- **hub** (string) - Required - The model hub to use (`ms` for ModelScope, `hf` for Hugging Face).
- **granularity** (string) - Required - The level at which to extract features. Options: `utterance` or `frame`.

### Request Example
```python
from funasr import AutoModel

model = AutoModel(
    model="iic/emotion2vec_base",
    hub="ms",
)
wav_file = "/path/to/your/audio.wav"

# Extract utterance-level embeddings
result_utterance = model.generate(
    wav_file,
    output_dir="./outputs",
    granularity="utterance"
)
print(f"Utterance embedding shape: {result_utterance[0]['feats'].shape}")

# Extract frame-level embeddings
result_frames = model.generate(
    wav_file,
    output_dir="./outputs",
    granularity="frame"
)
print(f"Frame embeddings shape: {result_frames[0]['feats'].shape}")
```

### Response
#### Success Response
- **feats** (numpy.ndarray) - A numpy array containing the emotion embeddings. Shape is `[768]` for utterance level or `[T, 768]` for frame level, where T is the number of frames.

#### Response Example
```json
{
  "feats": [0.1, -0.2, ..., 0.5] // Example for utterance level
}
```

### Error Handling
- Invalid model name or hub will raise an error.
- File not found for `wav_file` will raise an error.
```

--------------------------------

### BaseModel Downstream Classifier - Python

Source: https://context7.com/ddlbojack/emotion2vec/llms.txt

Defines a simple linear classification head for emotion recognition using emotion2vec embeddings. It includes padding-aware mean pooling and a two-layer network. Dependencies include torch.

```python
import torch
from torch import nn

class BaseModel(nn.Module):
    """
    Simple downstream classifier for emotion2vec features.
    Architecture: Linear(768→256) → ReLU → MeanPool → Linear(256→num_classes)
    """

    def __init__(self, input_dim=768, output_dim=4):
        super().__init__()
        self.pre_net = nn.Linear(input_dim, 256)
        self.post_net = nn.Linear(256, output_dim)
        self.activate = nn.ReLU()

    def forward(self, x, padding_mask=None):
        """
        Args:
            x: [batch, seq_len, 768] emotion2vec frame features
            padding_mask: [batch, seq_len] boolean (True=padded positions)
        Returns:
            logits: [batch, num_classes] emotion class logits
        """
        x = self.activate(self.pre_net(x))  # [batch, seq_len, 256]

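        # Padding-aware mean pooling; with no mask, every frame counts
        if padding_mask is None:
            padding_mask = torch.zeros(x.shape[:2], dtype=torch.bool, device=x.device)
        keep = (~padding_mask).unsqueeze(-1).float()  # [batch, seq_len, 1]
        x = (x * keep).sum(dim=1) / keep.sum(dim=1)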

        x = self.post_net(x)  # [batch, num_classes]
        return x

# Example usage
model = BaseModel(input_dim=768, output_dim=4)

# Input: batch of frame-level emotion2vec features
batch_feats = torch.randn(32, 100, 768)  # [batch, seq_len, 768]
padding_mask = torch.zeros(32, 100, dtype=torch.bool)
padding_mask[:, 80:] = True  # Mark last 20 frames as padding

logits = model(batch_feats, padding_mask)
predictions = torch.argmax(logits, dim=1)
print(f"Predictions shape: {predictions.shape}")  # [32]

```
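
The effect of padding-aware pooling is easiest to see on a tiny hand-computed case. The following dependency-free sketch averages only the unpadded positions of one sequence and compares the result with a naive mean over all positions; `masked_mean` is a hypothetical helper, not part of the repository.

```python
def masked_mean(frames, padding_mask):
    """Average feature vectors over positions where padding_mask is False."""
    kept = [f for f, is_pad in zip(frames, padding_mask) if not is_pad]
    if not kept:
        raise ValueError("every position is padding")
    return [sum(column) / len(kept) for column in zip(*kept)]


# One sequence of four 2-dim frames; the last two are zero padding
frames = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0]]
padding_mask = [False, False, True, True]

print(masked_mean(frames, padding_mask))   # [2.0, 3.0]
print(masked_mean(frames, [False] * 4))    # [1.0, 1.5], padding drags the mean down
```

Without the mask, shorter utterances in a batch would be biased toward zero in proportion to how much padding they received.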
