### Install DeBERTa as a Pip Package

Source: https://github.com/microsoft/deberta/blob/master/README.md

This command installs the DeBERTa library as a Python package, making it available for use in your projects.

```bash
pip install deberta
```

--------------------------------

### DeBERTa Multiple Choice Model

Source: https://context7.com/microsoft/deberta/llms.txt

Demonstrates how to set up and use the MultiChoiceModel for tasks requiring selection from multiple options, such as RACE or SWAG. Includes examples for training and inference.

```python
import torch
from DeBERTa.apps.models import MultiChoiceModel
from DeBERTa.deberta import ModelConfig

config = ModelConfig()
config.hidden_size = 768
config.num_hidden_layers = 12

model = MultiChoiceModel(
    config=config,
    num_labels=4,  # Number of choices
    drop_out=0.1
)

# Input shape: [batch_size, num_choices, seq_length]
batch_size = 2
num_choices = 4
seq_length = 128

# Each choice is a separate sequence (e.g., context + answer option)
input_ids = torch.randint(0, 50265, (batch_size, num_choices, seq_length))
input_mask = torch.ones((batch_size, num_choices, seq_length), dtype=torch.long)
type_ids = torch.zeros((batch_size, num_choices, seq_length), dtype=torch.long)
labels = torch.randint(0, num_choices, (batch_size,))  # Correct choice index

# Training
model.train()
output = model(
    input_ids=input_ids,
    type_ids=type_ids,
    input_mask=input_mask,
    labels=labels
)

logits = output['logits']  # Shape: [batch_size, num_choices]
loss = output['loss']

# Inference
model.eval()
with torch.no_grad():
    output = model(input_ids=input_ids, input_mask=input_mask)
    choice_predictions = output['logits'].argmax(dim=-1)
```

--------------------------------

### Configure Distributed Training

Source: https://context7.com/microsoft/deberta/llms.txt

Provides a template for using the DistributedTrainer class to handle multi-GPU training. It includes setup for arguments, data preparation, custom loss functions, and checkpoint management.

```python
import argparse

import torch
from DeBERTa.training import DistributedTrainer, set_random_seed
from DeBERTa.apps.models import SequenceClassificationModel
from DeBERTa.deberta import ModelConfig

args = argparse.Namespace(seed=42, rank=0, world_size=1, train_batch_size=32, accumulative_update=2, num_train_epochs=3, output_dir='/output/path', fp16=False)
set_random_seed(args.seed)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = SequenceClassificationModel(ModelConfig(), num_labels=2, pre_trained='base')
# data_fn, loss_fn, and eval_fn are user-supplied callables (data preparation, custom loss,
# and evaluation); they must be defined before the trainer is constructed.
trainer = DistributedTrainer(args=args, output_dir=args.output_dir, model=model, device=device, data_fn=data_fn, loss_fn=loss_fn, eval_fn=eval_fn, dump_interval=10000, name='classification')
trainer.train()
```

--------------------------------

### Integrate DeBERTa Encoder into PyTorch Model

Source: https://github.com/microsoft/deberta/blob/master/README.md

This snippet demonstrates how to replace the encoder of a custom PyTorch model with DeBERTa. It shows the initialization of the DeBERTa model and how to pass input IDs through it to get encodings. Dependencies include PyTorch and the DeBERTa library.

```python
from DeBERTa import deberta
import torch

class MyModel(torch.nn.Module):
  def __init__(self):
    super().__init__()
    # Your existing model code
    self.deberta = deberta.DeBERTa(pre_trained='base') # Or 'large' 'base-mnli' 'large-mnli' 'xlarge' 'xlarge-mnli' 'xlarge-v2' 'xxlarge-v2'
    # Your existing model code
    # Do initialization as before
    # 
    self.deberta.apply_state() # Apply the pre-trained model of DeBERTa at the end of the constructor
    # 
  def forward(self, input_ids):
    # The inputs to DeBERTa forward are
    # `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] with the word token indices in the vocabulary
    # `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token types indices selected in [0, 1]. 
    #    Type 0 corresponds to a `sentence A` and type 1 corresponds to a `sentence B` token (see BERT paper for more details).
    # `attention_mask`: an optional parameter for input mask or attention mask. 
    #   - If it's an input mask, then it will be torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [0, 1]. 
    #      It's a mask to be used if the input sequence length is smaller than the max input sequence length in the current batch. 
    #      It's the mask that we typically use for attention when a batch has varying length sentences.
    #   - If it's an attention mask, then it will be a torch.LongTensor of shape [batch_size, sequence_length, sequence_length]. 
    #      In this case, it's a mask indicating which tokens in the sequence should be attended by other tokens in the sequence.
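    # `output_all_encoded_layers`: whether to output the results of all encoder layers; defaults to True
    # Call the instance created in __init__ (not the `deberta` module) and take the last layer.
    # In releases where the forward pass returns a dict, use `['hidden_states'][-1]` instead.
    encoding = self.deberta(input_ids)[-1]
    return encoding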
```
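The `input_ids`, `token_type_ids`, and input-mask conventions described in the comments above can be illustrated without the library. The following is a minimal sketch in plain Python: `build_pair_features` is a hypothetical helper, and the special-token IDs (`cls_id=1`, `sep_id=2`, `pad_id=0`) are placeholder values chosen for illustration, not DeBERTa's real vocabulary IDs.

```python
def build_pair_features(ids_a, ids_b, max_seq_len, cls_id=1, sep_id=2, pad_id=0):
    """Hypothetical helper: lay out `[CLS] A [SEP] B [SEP]` and pad to max_seq_len.

    The special-token IDs are placeholders; real values come from the tokenizer vocabulary.
    """
    # Reserve three slots for [CLS] and the two [SEP] tokens.
    budget = max_seq_len - 3
    if budget < 0:
        raise ValueError("max_seq_len must be at least 3")
    # Sentence A is kept first; sentence B gets whatever room is left.
    ids_a = list(ids_a)[:budget]
    ids_b = list(ids_b)[:budget - len(ids_a)]

    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Type 0 covers [CLS], sentence A, and the first [SEP]; type 1 covers sentence B and the last [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    # 1 marks real tokens, 0 marks padding.
    input_mask = [1] * len(input_ids)

    padding = max_seq_len - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * padding,
        "token_type_ids": token_type_ids + [0] * padding,
        "input_mask": input_mask + [0] * padding,
    }


features = build_pair_features([11, 12, 13], [21, 22], max_seq_len=10)
for name, values in features.items():
    print(name, values)
```

Each list can then be wrapped in `torch.tensor(..., dtype=torch.long)` and stacked into a batch before being passed to the encoder.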

--------------------------------

### DeBERTa Sequence Classification Model

Source: https://context7.com/microsoft/deberta/llms.txt

Shows how to initialize and use the SequenceClassificationModel for tasks like sentiment analysis or text classification. It includes examples for both training with labels and inference without labels, as well as a regression task.

```python
import torch
from DeBERTa.apps.models import SequenceClassificationModel
from DeBERTa.deberta import ModelConfig

# Initialize classification model
config = ModelConfig()
config.hidden_size = 768
config.num_hidden_layers = 12
config.num_attention_heads = 12

model = SequenceClassificationModel(
    config=config,
    num_labels=3,           # Number of classes
    drop_out=0.1,           # Classification dropout
    pre_trained='base'      # Load pre-trained weights
)

# Prepare batch input
batch_size = 8
seq_length = 128
input_ids = torch.randint(0, 50265, (batch_size, seq_length))
input_mask = torch.ones((batch_size, seq_length), dtype=torch.long)
type_ids = torch.zeros((batch_size, seq_length), dtype=torch.long)
labels = torch.randint(0, 3, (batch_size,))  # Class labels

# Forward pass with labels (training)
model.train()
output = model(
    input_ids=input_ids,
    type_ids=type_ids,
    input_mask=input_mask,
    labels=labels
)

logits = output['logits']  # Shape: [batch_size, num_labels]
loss = output['loss']      # CrossEntropy loss
print(f"Loss: {loss.item()}")

# Inference (no labels)
model.eval()
with torch.no_grad():
    output = model(
        input_ids=input_ids,
        type_ids=type_ids,
        input_mask=input_mask
    )
    predictions = output['logits'].argmax(dim=-1)
    print(f"Predictions: {predictions}")

# Regression task (num_labels=1)
regression_model = SequenceClassificationModel(
    config=config,
    num_labels=1,
    pre_trained='base'
)
labels = torch.randn(batch_size)  # Continuous labels
output = regression_model(input_ids=input_ids, input_mask=input_mask, labels=labels)
# Uses MSE loss for regression
```

--------------------------------

### Running DeBERTa Experiments from Command Line

Source: https://github.com/microsoft/deberta/blob/master/README.md

Command-line instructions for downloading the GLUE datasets and running a DeBERTa experiment through the `DeBERTa.apps.run` module (`run.py`).

#### 1. Get the data
```bash
cache_dir=/tmp/DeBERTa/
cd experiments/glue
./download_data.sh  $cache_dir/glue_tasks
```

#### 2. Run task
```bash
task=STS-B
OUTPUT=/tmp/DeBERTa/exps/$task
export OMP_NUM_THREADS=1
python3 -m DeBERTa.apps.run --task_name $task --do_train  \
  --data_dir $cache_dir/glue_tasks/$task \
  --eval_batch_size 128 \
  --predict_batch_size 128 \
  --output_dir $OUTPUT \
  --scale_steps 250 \
  --loss_scale 16384 \
  --accumulative_update 1 \
  --num_train_epochs 6 \
  --warmup 100 \
  --learning_rate 2e-5 \
  --train_batch_size 32 \
  --max_seq_len 128
```

--------------------------------

### Run Command-Line Fine-tuning

Source: https://context7.com/microsoft/deberta/llms.txt

Illustrates how to execute fine-tuning experiments for tasks like GLUE using the DeBERTa command-line interface. It covers data downloading and parameter configuration for training and evaluation.

```bash
cache_dir=/tmp/DeBERTa
cd experiments/glue
./download_data.sh $cache_dir/glue_tasks
python3 -m DeBERTa.apps.run --task_name SST-2 --do_train --do_eval --data_dir $cache_dir/glue_tasks/SST-2 --output_dir /tmp/DeBERTa/output/sst2 --init_model base --max_seq_length 128 --train_batch_size 32 --eval_batch_size 128 --num_train_epochs 3 --learning_rate 2e-5 --warmup_proportion 0.1 --cls_drop_out 0.1
```

--------------------------------

### DeBERTa Model Initialization

Source: https://context7.com/microsoft/deberta/llms.txt

Initializes the core DeBERTa encoder model from a pre-trained model identifier, a custom configuration, or a local checkpoint path.

```python
import torch
from DeBERTa import deberta
```

#### Pre-trained Models
Initialize with published pre-trained weights. Common identifiers include 'base', 'large', 'xlarge', 'xlarge-v2', 'xxlarge-v2', 'deberta-v3-small', 'deberta-v3-base', 'deberta-v3-large', and 'mdeberta-v3-base'.
```python
# Initialize with pre-trained model 'base'
model = deberta.DeBERTa(pre_trained='base')
```

#### Custom Configuration
Initialize with a custom `ModelConfig` object, allowing fine-grained control over model architecture.
```python
# Initialize with custom configuration
config = deberta.ModelConfig()
config.hidden_size = 768
config.num_hidden_layers = 12
config.num_attention_heads = 12
config.intermediate_size = 3072
config.hidden_dropout_prob = 0.1
config.attention_probs_dropout_prob = 0.1
config.max_position_embeddings = 512

model = deberta.DeBERTa(config=config)
```

#### Local Checkpoint
Initialize by loading weights from a local model checkpoint file.
```python
# Load from local checkpoint
model = deberta.DeBERTa(pre_trained='/path/to/model/checkpoint')
```

--------------------------------

### Fine-tune DeBERTa on GLUE Tasks via CLI

Source: https://context7.com/microsoft/deberta/llms.txt

Demonstrates how to use the DeBERTa CLI to fine-tune models on benchmarks like MNLI, QNLI, and RTE. It supports configurations for batch size, sequence length, and mixed-precision training.

```bash
python3 -m DeBERTa.apps.run --task_name MNLI --do_train --do_eval --data_dir $cache_dir/glue_tasks/MNLI --output_dir /tmp/DeBERTa/output/mnli --init_model large --max_seq_length 256 --train_batch_size 16 --accumulative_update 2 --num_train_epochs 3 --learning_rate 1e-5 --fp16 True
```

--------------------------------

### Initialize DeBERTa Model

Source: https://context7.com/microsoft/deberta/llms.txt

Demonstrates initializing the DeBERTa encoder model using pre-trained weights, custom configurations, or local checkpoints. It supports various model sizes and multilingual variants.

```python
import torch
from DeBERTa import deberta

# Initialize with pre-trained model
# Available: 'base', 'large', 'xlarge', 'xlarge-v2', 'xxlarge-v2',
# 'deberta-v3-small', 'deberta-v3-base', 'deberta-v3-large', 'mdeberta-v3-base'
model = deberta.DeBERTa(pre_trained='base')

# Initialize with custom configuration
config = deberta.ModelConfig()
config.hidden_size = 768
config.num_hidden_layers = 12
config.num_attention_heads = 12
config.intermediate_size = 3072
config.hidden_dropout_prob = 0.1
config.attention_probs_dropout_prob = 0.1
config.max_position_embeddings = 512

model = deberta.DeBERTa(config=config)

# Load from local checkpoint
model = deberta.DeBERTa(pre_trained='/path/to/model/checkpoint')
```

--------------------------------

### Run DeBERTa GLUE Task Experiment

Source: https://github.com/microsoft/deberta/blob/master/README.md

This command runs DeBERTa training for a specified GLUE task (STS-B here). It sets the `OMP_NUM_THREADS` environment variable and passes arguments that configure the task, data directory, batch sizes, output directory, training epochs, learning rate, and sequence length. It assumes `cache_dir` was already set during the data download step.

```bash
task=STS-B 
OUTPUT=/tmp/DeBERTa/exps/$task
export OMP_NUM_THREADS=1
python3 -m DeBERTa.apps.run --task_name $task --do_train  \
  --data_dir $cache_dir/glue_tasks/$task \
  --eval_batch_size 128 \
  --predict_batch_size 128 \
  --output_dir $OUTPUT \
  --scale_steps 250 \
  --loss_scale 16384 \
  --accumulative_update 1 \
  --num_train_epochs 6 \
  --warmup 100 \
  --learning_rate 2e-5 \
  --train_batch_size 32 \
  --max_seq_len 128
```

--------------------------------

### Using DeBERTa Tokenizer

Source: https://github.com/microsoft/deberta/blob/master/README.md

This code demonstrates how to load the vocabulary and tokenizer for DeBERTa, tokenize example text, add special tokens, convert tokens to IDs, and prepare input features, including padding and the input mask.

```python
from DeBERTa import deberta
import torch

vocab_path, vocab_type = deberta.load_vocab(pretrained_id='base')
tokenizer = deberta.tokenizers[vocab_type](vocab_path)
max_seq_len = 512
tokens = tokenizer.tokenize('Examples input text of DeBERTa')
# Truncate long sequence
tokens = tokens[:max_seq_len - 2]
# Add special tokens to the `tokens`
tokens = ['[CLS]'] + tokens + ['[SEP]']
input_ids = tokenizer.convert_tokens_to_ids(tokens)
input_mask = [1]*len(input_ids)
# padding
paddings = max_seq_len - len(input_ids)
input_ids = input_ids + [0]*paddings
input_mask = input_mask + [0]*paddings
features = {
    'input_ids': torch.tensor(input_ids, dtype=torch.int),
    'input_mask': torch.tensor(input_mask, dtype=torch.int)
}
```

--------------------------------

### Download Data for DeBERTa GLUE Experiments

Source: https://github.com/microsoft/deberta/blob/master/README.md

This bash script downloads the necessary datasets for running DeBERTa experiments on GLUE tasks. It requires a cache directory path as an argument and changes the current directory to the experiments/glue directory before executing the download script.

```bash
cache_dir=/tmp/DeBERTa/
cd experiments/glue
./download_data.sh  $cache_dir/glue_tasks
```

--------------------------------

### Configure Distributed Training Environment Variables

Source: https://github.com/microsoft/deberta/blob/master/experiments/language_model/README.md

This section outlines the environment variables required for distributed training across multiple nodes. It includes setting the total number of nodes, master node address and port, and the rank of the current node.

```bash
# Example for Node 0
export WORLD_SIZE=2
export MASTER_ADDR=node0
export MASTER_PORT=7488
export RANK=0
./rtd.sh deberta-v3-xsmall

# Example for Node 1
export WORLD_SIZE=2
export MASTER_ADDR=node0
export MASTER_PORT=7488
export RANK=1
./rtd.sh deberta-v3-xsmall
```
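How a launcher might consume these variables can be sketched in plain Python. `read_dist_env` is a hypothetical helper and is not part of the DeBERTa repository; the single-node defaults are assumptions chosen for illustration, and the port default simply reuses the value from the example above.

```python
import os


def read_dist_env(env=None):
    """Hypothetical helper: read and validate the distributed-training variables."""
    env = os.environ if env is None else env
    world_size = int(env.get("WORLD_SIZE", "1"))   # total number of nodes
    rank = int(env.get("RANK", "0"))               # rank of the current node
    master_addr = env.get("MASTER_ADDR", "localhost")
    master_port = int(env.get("MASTER_PORT", "7488"))
    # Every node must have a unique rank in [0, world_size).
    if not 0 <= rank < world_size:
        raise ValueError(f"RANK={rank} is outside [0, {world_size})")
    return {
        "world_size": world_size,
        "rank": rank,
        "master_addr": master_addr,
        "master_port": master_port,
    }


# Settings for node 1 in the two-node example above.
node1 = read_dist_env({
    "WORLD_SIZE": "2",
    "MASTER_ADDR": "node0",
    "MASTER_PORT": "7488",
    "RANK": "1",
})
print(node1)
```

Validating the rank up front surfaces a misconfigured node before any process tries to connect to the master address.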

--------------------------------

### Loading Pre-trained DeBERTa Models and Vocabularies

Source: https://context7.com/microsoft/deberta/llms.txt

This snippet explains how to load pre-trained DeBERTa models and their corresponding vocabularies from the HuggingFace Hub or custom file paths. It includes listing available pre-trained models, loading model states and configurations, and loading vocabulary files, differentiating between vocabulary types for different DeBERTa versions.

```python
from DeBERTa.deberta import load_model_state, load_vocab, pretrained_models

# List available pre-trained models
print("Available models:")
for name in pretrained_models.keys():
    print(f"  - {name}")
# base, large, xlarge, base-mnli, large-mnli, xlarge-mnli
# xlarge-v2, xxlarge-v2, xlarge-v2-mnli, xxlarge-v2-mnli
# deberta-v3-small, deberta-v3-base, deberta-v3-large
# deberta-v3-xsmall, mdeberta-v3-base

# Load model state and config
model_state, model_config = load_model_state('base')
print(f"Config: {model_config}")

# Load from custom path
model_state, model_config = load_model_state('/path/to/pytorch_model.bin')

# Load vocabulary
vocab_path, vocab_type = load_vocab(pretrained_id='base')
print(f"Vocab path: {vocab_path}")
print(f"Vocab type: {vocab_type}")  # 'gpt2' for v1, 'spm' for v2/v3
```

--------------------------------

### DeBERTa Tokenizer Initialization and Text Processing

Source: https://context7.com/microsoft/deberta/llms.txt

Illustrates how to initialize DeBERTa tokenizers (GPT2Tokenizer for V1, SPMTokenizer for V2/V3) and process text into token IDs. It covers loading vocabularies, tokenizing text, converting tokens to IDs, and preparing input for the model, including padding.

```python
from DeBERTa import deberta

# Load vocabulary and get tokenizer type for a pre-trained model
vocab_path, vocab_type = deberta.load_vocab(pretrained_id='base')
tokenizer = deberta.tokenizers[vocab_type](vocab_path)

# For V2/V3 models (uses SentencePiece)
vocab_path, vocab_type = deberta.load_vocab(pretrained_id='deberta-v3-base')
tokenizer = deberta.tokenizers[vocab_type](vocab_path)  # SPMTokenizer

# Tokenize text
text = "Hello, this is an example of DeBERTa tokenization."
tokens = tokenizer.tokenize(text)
print(f"Tokens: {tokens}")

# Convert tokens to IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(f"Token IDs: {token_ids}")

# Build full input with special tokens
max_seq_len = 128
tokens = tokenizer.tokenize(text)
tokens = tokens[:max_seq_len - 2]  # Reserve space for [CLS] and [SEP]
tokens = ['[CLS]'] + tokens + ['[SEP]']

input_ids = tokenizer.convert_tokens_to_ids(tokens)
input_mask = [1] * len(input_ids)

# Pad to max length
padding_length = max_seq_len - len(input_ids)
input_ids = input_ids + [0] * padding_length
input_mask = input_mask + [0] * padding_length
```

--------------------------------

### Tokenizer Initialization and Text Processing

Source: https://context7.com/microsoft/deberta/llms.txt

Shows how to initialize DeBERTa tokenizers (GPT2Tokenizer for V1, SPMTokenizer for V2/V3) and convert text into token IDs suitable for model input.

```python
from DeBERTa import deberta
```

#### Initializing Tokenizers
DeBERTa offers two tokenizer types. The `load_vocab` function returns the vocabulary path and tokenizer type for a pre-trained model ID.

```python
# Load vocabulary and get tokenizer type for a pre-trained model (V1 example)
vocab_path, vocab_type = deberta.load_vocab(pretrained_id='base')
tokenizer = deberta.tokenizers[vocab_type](vocab_path)

# For V2/V3 models (uses SentencePiece)
vocab_path, vocab_type = deberta.load_vocab(pretrained_id='deberta-v3-base')
tokenizer = deberta.tokenizers[vocab_type](vocab_path)  # This will be an SPMTokenizer instance
```

#### Tokenizing Text
Convert raw text into tokens and then into numerical IDs.
```python
# Tokenize text
text = "Hello, this is an example of DeBERTa tokenization."
tokens = tokenizer.tokenize(text)
print(f"Tokens: {tokens}")

# Convert tokens to IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(f"Token IDs: {token_ids}")
```

#### Building Full Input for Model
Prepare input sequences by adding special tokens ([CLS], [SEP]) and padding to a maximum sequence length.
```python
# Build full input with special tokens and padding
max_seq_len = 128
text = "Hello, this is an example of DeBERTa tokenization."
tokens = tokenizer.tokenize(text)
tokens = tokens[:max_seq_len - 2]  # Reserve space for [CLS] and [SEP]
tokens = ['[CLS]'] + tokens + ['[SEP]']

input_ids = tokenizer.convert_tokens_to_ids(tokens)
input_mask = [1] * len(input_ids)

# Pad to max length
padding_length = max_seq_len - len(input_ids)
input_ids = input_ids + [0] * padding_length
input_mask = input_mask + [0] * padding_length

print(f"Input IDs: {input_ids}")
print(f"Attention Mask: {input_mask}")
```

--------------------------------

### Load DeBERTa Model and Vocabulary

Source: https://context7.com/microsoft/deberta/llms.txt

Demonstrates how to load pre-trained model states and vocabulary files from a specific cache directory. This is useful for managing large model assets in custom environments.

```python
from DeBERTa.deberta import load_model_state, load_vocab

model_state, model_config = load_model_state('deberta-v3-base', cache_dir='/custom/cache/dir', no_cache=False)
vocab_path, vocab_type = load_vocab(pretrained_id='deberta-v3-base', cache_dir='/custom/cache/dir')
```

--------------------------------

### Continuously Train DeBERTaV3 Models with RTD

Source: https://github.com/microsoft/deberta/blob/master/experiments/language_model/README.md

This script continues training DeBERTaV3 models on the replaced token detection (RTD) task. It takes the model size as its single argument, and the generator and discriminator initialization checkpoints may also need to be specified. Check the script for the detailed configuration of the initialization models.

```bash
#!/bin/bash

if [ $# -ne 1 ]; then
    echo "Usage: $0 <model_size>-continue"
    exit 1
fi

MODEL_SIZE=$1

if [[ "$MODEL_SIZE" == "deberta-v3-xsmall-continue" ]]; then
    echo "Continuously training DeBERTaV3 XSmall model with RTD..."
    # Specify generator and discriminator checkpoints here
    # Example: python train.py --task rtd --model deberta-v3-xsmall --continue_train --generator_checkpoint path/to/generator.bin --discriminator_checkpoint path/to/discriminator.bin
elif [[ "$MODEL_SIZE" == "deberta-v3-small-continue" ]]; then
    echo "Continuously training DeBERTaV3 Small model with RTD..."
    # Specify generator and discriminator checkpoints here
elif [[ "$MODEL_SIZE" == "deberta-v3-large-continue" ]]; then
    echo "Continuously training DeBERTaV3 Large model with RTD..."
    # Specify generator and discriminator checkpoints here
else
    echo "Unknown model size for continue training: $MODEL_SIZE"
    exit 1
fi
```

--------------------------------

### DeBERTa Model Configuration Management

Source: https://context7.com/microsoft/deberta/llms.txt

This snippet illustrates how to manage DeBERTa model configurations using the `ModelConfig` class. It covers creating default configurations, modifying hyperparameters, loading configurations from JSON files, converting to dictionaries, saving as JSON strings, and creating configurations from dictionaries. This is essential for customizing DeBERTa models.

```python
from DeBERTa.deberta import ModelConfig

# Create default configuration
config = ModelConfig()

# Modify base parameters
config.hidden_size = 1024
config.num_hidden_layers = 24
config.num_attention_heads = 16
config.intermediate_size = 4096
config.hidden_dropout_prob = 0.1
config.attention_probs_dropout_prob = 0.1
config.max_position_embeddings = 512
config.vocab_size = 128000

# DeBERTa-specific parameters
config.relative_attention = True           # Enable disentangled attention
config.position_biased_input = True        # Add position to content embedding
config.pos_att_type = 'p2c|c2p'           # Position attention types: p2c, c2p, p2p
config.max_relative_positions = 256        # Max relative position distance
config.position_buckets = 256              # Use log bucket encoding for positions

# Load from JSON file
config = ModelConfig.from_json_file('/path/to/config.json')

# Convert to dictionary
config_dict = config.to_dict()

# Save as JSON string
json_str = config.to_json_string()
print(json_str)

# Create from dictionary
config = ModelConfig.from_dict({
    'hidden_size': 768,
    'num_hidden_layers': 12,
    'num_attention_heads': 12,
    'intermediate_size': 3072,
    'hidden_act': 'gelu',
    'hidden_dropout_prob': 0.1,
    'attention_probs_dropout_prob': 0.1,
    'max_position_embeddings': 512,
    'type_vocab_size': 0,
    'relative_attention': True,
    'pos_att_type': 'p2c|c2p'
})
```
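A configuration file for `ModelConfig.from_json_file` can be prepared with the standard library alone. The sketch below writes the same hyperparameters as the dictionary above to a temporary `config.json` and reads them back. It assumes the JSON file is a flat object whose keys match the `ModelConfig` attribute names, which is consistent with the `from_dict` example but is not verified here. The `ModelConfig` call is left as a comment so the block runs without the library installed.

```python
import json
import os
import tempfile

# Same hyperparameters as the from_dict example.
hyperparameters = {
    "hidden_size": 768,
    "num_hidden_layers": 12,
    "num_attention_heads": 12,
    "intermediate_size": 3072,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "attention_probs_dropout_prob": 0.1,
    "max_position_embeddings": 512,
    "type_vocab_size": 0,
    "relative_attention": True,
    "pos_att_type": "p2c|c2p",
}

# Write the configuration to a temporary directory.
config_path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(config_path, "w", encoding="utf-8") as f:
    json.dump(hyperparameters, f, indent=2, sort_keys=True)

# Read it back to confirm that the file round-trips.
with open(config_path, encoding="utf-8") as f:
    reloaded = json.load(f)

print(reloaded == hyperparameters)

# With the library installed, the file could then be loaded as:
# config = ModelConfig.from_json_file(config_path)
```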

--------------------------------

### DeBERTa Named Entity Recognition (NER) Model

Source: https://context7.com/microsoft/deberta/llms.txt

Illustrates the initialization and usage of the NERModel for token-level classification tasks like NER and POS tagging. It covers preparing inputs, performing a training forward pass with labels, and inference.

```python
import torch
from DeBERTa.apps.models import NERModel
from DeBERTa.deberta import ModelConfig

# Initialize NER model
config = ModelConfig()
config.hidden_size = 768
config.num_hidden_layers = 12

# NER with BIO tagging: O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC, etc.
num_labels = 9  # Number of entity tags

model = NERModel(
    config=config,
    num_labels=num_labels,
    drop_out=0.1
)

# Prepare inputs
batch_size = 4
seq_length = 64
input_ids = torch.randint(0, 50265, (batch_size, seq_length))
input_mask = torch.ones((batch_size, seq_length), dtype=torch.long)
type_ids = torch.zeros((batch_size, seq_length), dtype=torch.long)

# Labels: tag ID for each token, -1 for padding/ignored tokens
labels = torch.randint(-1, num_labels, (batch_size, seq_length))

# Training forward pass
model.train()
output = model(
    input_ids=input_ids,
    type_ids=type_ids,
    input_mask=input_mask,
    labels=labels
)

logits = output['logits']  # Shape: [batch_size, seq_length, num_labels]
loss = output['loss']
print(f"NER Loss: {loss.item()}")

# Inference
model.eval()
with torch.no_grad():
    output = model(input_ids=input_ids, input_mask=input_mask)
    tag_predictions = output['logits'].argmax(dim=-1)  # [batch_size, seq_length]
```

--------------------------------

### DeBERTa Tokenization and Decoding

Source: https://context7.com/microsoft/deberta/llms.txt

Demonstrates how to access special token IDs (CLS, SEP, MASK, PAD) from the tokenizer and decode token IDs back into human-readable text.

```python
# Assumes `tokenizer` and `tokens` from the tokenizer examples above
cls_id = tokenizer.vocab['[CLS]']
sep_id = tokenizer.vocab['[SEP]']
mask_id = tokenizer.vocab['[MASK]']
pad_id = tokenizer.vocab['[PAD]']

decoded = tokenizer.decode(tokens)
print(f"Decoded: {decoded}")
```

--------------------------------

### Masked Language Model Pre-training with DeBERTa

Source: https://context7.com/microsoft/deberta/llms.txt

This snippet demonstrates how to perform Masked Language Model (MLM) pre-training using DeBERTa's enhanced mask decoder. It initializes the `MaskedLanguageModel` with a specified configuration, prepares input data including masked tokens and labels, and performs a forward pass to obtain MLM loss.

```python
import torch
from DeBERTa.apps.models import MaskedLanguageModel
from DeBERTa.deberta import ModelConfig

config = ModelConfig()
config.hidden_size = 768
config.num_hidden_layers = 12
config.vocab_size = 50265
config.max_position_embeddings = 512

model = MaskedLanguageModel(config=config)

batch_size = 4
seq_length = 128

input_ids = torch.randint(0, 50265, (batch_size, seq_length))
input_mask = torch.ones((batch_size, seq_length), dtype=torch.long)

# Labels: token IDs for masked positions, 0 for non-masked positions
# Non-zero values indicate positions to predict
labels = torch.zeros((batch_size, seq_length), dtype=torch.long)
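num_masked = 15  # 15 of 128 positions (~12%)
for i in range(batch_size):
    # Sample distinct positions (skipping the first and last token) so that a label
    # is never overwritten with the [MASK] ID by a repeated position
    mask_positions = torch.randperm(seq_length - 2)[:num_masked] + 1
    labels[i, mask_positions] = input_ids[i, mask_positions]
    input_ids[i, mask_positions] = 50264  # [MASK] token ID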

# Forward pass
output = model(
    input_ids=input_ids,
    input_mask=input_mask,
    labels=labels
)

lm_logits = output['logits']   # Logits for masked positions
lm_labels = output['labels']   # Flattened labels for masked positions
loss = output['loss']          # MLM loss

print(f"MLM Loss: {loss.mean().item()}")
```
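The label convention used above (original token ID at masked positions, 0 elsewhere) can be sketched in plain Python. `mask_tokens` is a hypothetical helper, not a DeBERTa function. The 15% ratio is a common MLM default assumed here, and the mask ID 50264 is taken from the example above rather than looked up in a vocabulary.

```python
import random


def mask_tokens(token_ids, mask_id=50264, ratio=0.15, seed=None):
    """Hypothetical helper: mask distinct positions and build MLM labels."""
    rng = random.Random(seed)
    # Candidate positions exclude the first and last token (typically [CLS] and [SEP]).
    candidates = range(1, len(token_ids) - 1)
    num_masked = max(1, round(len(candidates) * ratio)) if len(candidates) else 0
    # Sampling without replacement guarantees that no position is masked twice.
    positions = rng.sample(candidates, num_masked)

    masked = list(token_ids)
    labels = [0] * len(token_ids)
    for pos in positions:
        labels[pos] = token_ids[pos]   # remember the original token
        masked[pos] = mask_id          # replace it with [MASK]
    return masked, labels


original = list(range(100, 120))       # 20 placeholder token IDs
masked, labels = mask_tokens(original, seed=0)
print(masked)
print(labels)
```

Because a label of 0 means "not masked" under this convention, the sketch assumes that token ID 0 never needs to be predicted.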

--------------------------------

### Implement Adversarial Training with SiFT

Source: https://context7.com/microsoft/deberta/llms.txt

Shows how to apply Scale-invariant Fine-Tuning (SiFT) to a DeBERTa model to improve robustness. It involves hooking a perturbation layer and calculating an adversarial loss component during the training loop.

```python
import torch
from DeBERTa.deberta import DeBERTa
from DeBERTa.sift import AdversarialLearner, hook_sift_layer

model = DeBERTa(pre_trained='base')
adv_modules = hook_sift_layer(model, hidden_size=768, learning_rate=1e-4, init_perturbation=1e-2, target_module='embeddings.LayerNorm')
adv = AdversarialLearner(model, adv_modules)

def logits_fn(model, **data):
    outputs = model(**data)
    return outputs['hidden_states'][-1][:, 0]

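# `logits`, `standard_loss`, `input_ids`, and `attention_mask` come from the regular
# forward pass in the surrounding training loop.
adv_loss = adv.loss(logits, logits_fn, loss_fn='symmetric-kl', input_ids=input_ids, attention_mask=attention_mask)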
total_loss = standard_loss + 1.0 * adv_loss
```
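For intuition about the `symmetric-kl` option, the textbook symmetric KL divergence between two categorical distributions can be computed by hand. This is a plain-Python sketch of the general formula, KL(p‖q) + KL(q‖p). Whether the library's `symmetric-kl` loss matches this form exactly is an assumption, so treat the block as an illustration only. `softmax` and `symmetric_kl` are local helpers, not DeBERTa functions, and the logits and `standard_loss` value are made-up placeholders.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def symmetric_kl(logits_p, logits_q):
    """Textbook symmetric KL: KL(p || q) + KL(q || p)."""
    p, q = softmax(logits_p), softmax(logits_q)
    kl_pq = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    kl_qp = sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))
    return kl_pq + kl_qp


clean = [2.0, 0.5, -1.0]        # placeholder logits from the unperturbed input
perturbed = [1.8, 0.7, -0.9]    # placeholder logits after perturbation
adv_loss = symmetric_kl(clean, perturbed)

standard_loss = 0.42            # placeholder task loss
total_loss = standard_loss + 1.0 * adv_loss
print(adv_loss, total_loss)
```

The divergence is zero when the perturbation leaves the predictions unchanged, so the adversarial term only penalizes predictions that shift under perturbation.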

--------------------------------

### DeBERTa Model Architecture Components

Source: https://github.com/microsoft/deberta/blob/master/docs/source/modules/deberta.md

Overview of the primary modules and classes used within the DeBERTa implementation.

```APIDOC
## DeBERTa Model Components

### Description
The DeBERTa model consists of several core components including DisentangledSelfAttention, ContextPooler, and BertEncoder. These modules work together to provide improved performance over standard BERT architectures.

### Key Classes
- **DeBERTa**: The main model class.
- **DisentangledSelfAttention**: Implements the disentangled attention mechanism.
- **ContextPooler**: Handles the pooling of hidden states.
- **BertEncoder**: The stack of transformer layers.
- **StableDropout**: A robust dropout implementation for stability.

### Configuration
- **ModelConfig**: Defines the hyperparameters for the model.
- **PoolConfig**: Defines the pooling configuration settings.
```

--------------------------------

### Perform Adversarial Training with SiFT

Source: https://context7.com/microsoft/deberta/llms.txt

Configures the DeBERTa training process to include adversarial training (SiFT) to improve model robustness. It utilizes symmetric-kl loss and specific perturbation parameters.

```bash
python3 -m DeBERTa.apps.run --task_name RTE --do_train --data_dir $cache_dir/glue_tasks/RTE --output_dir /tmp/DeBERTa/output/rte --init_model large --vat_lambda 1.0 --vat_learning_rate 1e-4 --vat_init_perturbation 1e-2 --vat_loss_fn symmetric-kl
```

--------------------------------

### Integrating DeBERTa as an Encoder

Source: https://github.com/microsoft/deberta/blob/master/README.md

This snippet shows how to modify an existing PyTorch model to use DeBERTa as its encoder. It includes initializing the DeBERTa model and defining the forward pass.

```python
from DeBERTa import deberta
import torch

class MyModel(torch.nn.Module):
  def __init__(self):
    super().__init__()
    # Your existing model code
    self.deberta = deberta.DeBERTa(pre_trained='base') # Or 'large', 'base-mnli', etc.
    # Your existing model code
    # do initialization as before
    self.deberta.apply_state() # Apply the pre-trained model of DeBERTa at the end of the constructor

  def forward(self, input_ids):
    # The inputs to DeBERTa forward are:
    # `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
    # `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length]
    # `attention_mask`: an optional parameter for input mask or attention mask.
    # `output_all_encoded_layers`: whether to output the results of all encoder layers; defaults to True
    # Call the instance created in __init__ (not the `deberta` module) and take the last layer.
    # In releases where the forward pass returns a dict, use `['hidden_states'][-1]` instead.
    encoding = self.deberta(input_ids)[-1]
    return encoding
```

--------------------------------

### Pre-train BERT-like Model with MLM using mlm.sh

Source: https://github.com/microsoft/deberta/blob/master/experiments/language_model/README.md

`mlm.sh` trains a BERT-like model with the Masked Language Modeling (MLM) task. It supports either a standard BERT model or a DeBERTa model with disentangled attention, selected by the model type passed as the script's only argument. The listing below is a skeleton in which the actual training commands are left as placeholders.

```bash
#!/bin/bash

if [ $# -ne 1 ]; then
    echo "Usage: $0 <model_type>"
    exit 1
fi

MODEL_TYPE=$1

if [ "$MODEL_TYPE" = "bert-base" ]; then
    echo "Training BERT base model with MLM..."
    # Add BERT base MLM training command here
elif [ "$MODEL_TYPE" = "deberta-base" ]; then
    echo "Training DeBERTa base model with MLM and Disentangled Attention..."
    # Add DeBERTa base MLM training command here
else
    echo "Unknown model type: $MODEL_TYPE"
    exit 1
fi

# Example placeholder for training command
# python train.py --task mlm --model $MODEL_TYPE --data path/to/data
```
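For orientation, the MLM objective hides a fraction of the input tokens and trains the model to recover them. The sketch below shows the masking step in plain Python, assuming the commonly used BERT recipe: select about 15% of tokens, then replace 80% of those with `[MASK]`, 10% with a random token, and leave 10% unchanged. The function `mask_tokens`, its parameters, and the toy vocabulary are hypothetical and are not taken from `mlm.sh`.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels); labels hold the original token at selected positions, else None."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if tok in ('[CLS]', '[SEP]') or rng.random() >= mask_prob:
            masked.append(tok)       # position not selected for prediction
            labels.append(None)
            continue
        labels.append(tok)           # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            masked.append('[MASK]')           # 80%: replace with the mask token
        elif roll < 0.9:
            masked.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            masked.append(tok)                # 10%: keep the original token
    return masked, labels

tokens = ['[CLS]', 'the', 'cat', 'sat', 'on', 'the', 'mat', '[SEP]']
# A high mask_prob is used here only so that the toy example masks something
masked, labels = mask_tokens(tokens, vocab=['dog', 'ran', 'hat'], mask_prob=0.5)
print(masked)
print(labels)
```

The training loss is then computed only at positions where `labels` is not `None`.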

--------------------------------

### Context Pooler for Encoder Output

Source: https://context7.com/microsoft/deberta/llms.txt

Demonstrates the `ContextPooler` class, which extracts a fixed-size representation from the encoder's output, typically from the [CLS] token. The example configures the pooler either with a dedicated `PoolConfig` or with one derived from a `ModelConfig`, then applies it to encoder outputs to obtain a pooled representation.

```python
import torch
from DeBERTa.deberta import ContextPooler, PoolConfig, ModelConfig

# Option 1: create a pooler configuration directly
config = PoolConfig()
config.hidden_size = 768
config.dropout = 0.1
config.hidden_act = 'gelu'

# Option 2: derive it from a model config (this one is used below)
model_config = ModelConfig()
model_config.hidden_size = 768
model_config.pooling = {
    'hidden_size': 768,
    'hidden_act': 'gelu',
    'dropout': 0.1
}
pool_config = PoolConfig(model_config)

# Initialize pooler
pooler = ContextPooler(pool_config)

# Use with encoder output
encoder_output = torch.randn(4, 128, 768)  # [batch, seq_len, hidden]
pooled = pooler(encoder_output)             # [batch, hidden]
print(f"Pooled shape: {pooled.shape}")     # torch.Size([4, 768])

# Get output dimension
out_dim = pooler.output_dim()  # 768
```
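Conceptually, [CLS]-style pooling selects the first token's hidden vector from each sequence and passes it through a dense layer and an activation. The plain-Python sketch below illustrates that idea on nested lists. It is an illustration under stated assumptions rather than the `ContextPooler` source: `cls_pool`, `gelu`, `weight`, and `bias` are hypothetical names, and dropout is omitted, as it would be at inference time.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def cls_pool(encoder_output, weight, bias):
    """Pool [batch][seq_len][hidden] into [batch][out_dim] using the first token.

    weight is [out_dim][hidden] and bias is [out_dim].
    """
    pooled = []
    for sequence in encoder_output:
        first_token = sequence[0]  # hidden vector at position 0 ([CLS])
        row = []
        for w_row, b in zip(weight, bias):
            z = sum(w * h for w, h in zip(w_row, first_token)) + b
            row.append(gelu(z))
        pooled.append(row)
    return pooled

# Toy input: batch of 2 sequences, 3 tokens each, hidden size 2
encoder_output = [
    [[1.0, 0.0], [0.3, 0.4], [0.5, 0.6]],
    [[0.0, -1.0], [0.7, 0.8], [0.9, 1.0]],
]
identity = [[1.0, 0.0], [0.0, 1.0]]
pooled = cls_pool(encoder_output, identity, [0.0, 0.0])
print(len(pooled), len(pooled[0]))  # batch size and output dimension
```

Only position 0 of each sequence contributes to the result, so the output size is independent of the sequence length.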

--------------------------------

### Integrate DeBERTa into Existing Python Code

Source: https://github.com/microsoft/deberta/blob/master/README.md

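Applying a pre-trained DeBERTa model to an existing codebase takes two changes, summarized in the comments below: use DeBERTa as the encoder of your model, and switch to the tokenizer built into DeBERTa.

```python
# To apply DeBERTa to your existing code, you need to make two changes to your code:
# 1. Change your model to consume DeBERTa as the encoder.
# 2. Change your tokenizer to the tokenizer built into DeBERTa.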
```

--------------------------------

### Integrate DeBERTa with HuggingFace Transformers

Source: https://context7.com/microsoft/deberta/llms.txt

Shows how to load DeBERTa models using HuggingFace Transformers, perform tokenization, extract embeddings, and fine-tune for sequence classification tasks.

```python
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification
import torch

model_name = "microsoft/deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Tokenize a single sentence into PyTorch tensors
text = "DeBERTa achieves state-of-the-art results on NLU benchmarks."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

# Extract embeddings without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
    last_hidden_state = outputs.last_hidden_state  # [batch_size, seq_length, hidden_size]
    cls_embedding = last_hidden_state[:, 0]        # Hidden state of the first ([CLS]) token

# Sequence classification: the classification head is newly initialized and needs fine-tuning
classifier = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
labels = torch.tensor([1])
outputs = classifier(**inputs, labels=labels)
loss = outputs.loss  # Call loss.backward() inside a training loop to fine-tune
```

--------------------------------

### Tokenize Text with DeBERTa Tokenizer

Source: https://github.com/microsoft/deberta/blob/master/README.md

This code snippet shows how to load and use the tokenizer built into the DeBERTa library. It covers loading the vocabulary, tokenizing input text, truncating sequences, adding special tokens, converting tokens to IDs, and padding sequences to a fixed length. The output is a dictionary of tensors suitable for model input.

```python
from DeBERTa import deberta
import torch

vocab_path, vocab_type = deberta.load_vocab(pretrained_id='base')
tokenizer = deberta.tokenizers[vocab_type](vocab_path)
# We apply the same schema of special tokens as BERT, e.g. [CLS], [SEP], [MASK]
max_seq_len = 512
tokens = tokenizer.tokenize('Examples input text of DeBERTa')
# Truncate long sequence
tokens = tokens[:max_seq_len - 2]
# Add special tokens to the `tokens`
tokens = ['[CLS]'] + tokens + ['[SEP]']
input_ids = tokenizer.convert_tokens_to_ids(tokens)
input_mask = [1] * len(input_ids)
# padding
paddings = max_seq_len - len(input_ids)
input_ids = input_ids + [0] * paddings
input_mask = input_mask + [0] * paddings
features = {
    'input_ids': torch.tensor(input_ids, dtype=torch.int),
    'input_mask': torch.tensor(input_mask, dtype=torch.int)
}
```
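The truncate, wrap, and pad steps above can be collected into a small helper. The following is a sketch in plain Python that works on token IDs and is independent of the DeBERTa tokenizer. The name `build_features` and the example IDs are hypothetical, and `pad_id=0` simply mirrors the padding value used above.

```python
def build_features(token_ids, cls_id, sep_id, max_seq_len=512, pad_id=0):
    """Truncate, add special tokens, and pad to max_seq_len.

    Returns (input_ids, input_mask), both lists of length max_seq_len.
    """
    # Reserve two positions for the [CLS] and [SEP] ids
    body = token_ids[:max_seq_len - 2]
    input_ids = [cls_id] + body + [sep_id]
    input_mask = [1] * len(input_ids)  # 1 marks real tokens

    padding = max_seq_len - len(input_ids)
    input_ids += [pad_id] * padding
    input_mask += [0] * padding        # 0 marks padding
    return input_ids, input_mask

# Hypothetical ids: 1 = [CLS], 2 = [SEP]
ids, mask = build_features([11, 12, 13], cls_id=1, sep_id=2, max_seq_len=8)
print(ids)   # [1, 11, 12, 13, 2, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

Keeping the mask alongside the IDs lets the model ignore padded positions during attention.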

--------------------------------

### Pre-train ELECTRA-like Model with RTD using rtd.sh

Source: https://github.com/microsoft/deberta/blob/master/experiments/language_model/README.md

This script trains an ELECTRA-like model using the Replaced Token Detection (RTD) task. It allows training various sizes of DeBERTaV3 models, specifying backbone and embedding parameters. The script takes the model size as an argument.

```bash
#!/bin/bash

if [ $# -ne 1 ]; then
    echo "Usage: $0 <model_size>"
    exit 1
fi

MODEL_SIZE=$1

if [ "$MODEL_SIZE" = "deberta-v3-xsmall" ]; then
    echo "Training DeBERTaV3 XSmall model with RTD..."
    # Add DeBERTaV3 XSmall RTD training command here
    # Example: python train.py --task rtd --model deberta-v3-xsmall --backbone_layers 12 --backbone_hidden 256 --embedding_vocab 128000 --embedding_size 32M
elif [ "$MODEL_SIZE" = "deberta-v3-base" ]; then
    echo "Training DeBERTaV3 Base model with RTD..."
    # Add DeBERTaV3 Base RTD training command here
    # Example: python train.py --task rtd --model deberta-v3-base --backbone_layers 12 --backbone_hidden 768 --embedding_vocab 128000 --embedding_size 96M
elif [ "$MODEL_SIZE" = "deberta-v3-large" ]; then
    echo "Training DeBERTaV3 Large model with RTD..."
    # Add DeBERTaV3 Large RTD training command here
    # Example: python train.py --task rtd --model deberta-v3-large --backbone_layers 24 --backbone_hidden 1024 --embedding_vocab 128000 --embedding_size 128M
else
    echo "Unknown model size: $MODEL_SIZE"
    exit 1
fi
```
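To make the RTD objective concrete: in ELECTRA-style training a generator replaces some input tokens, and the discriminator predicts for each position whether the token was replaced. The sketch below builds those per-token labels in plain Python. `rtd_labels` is a hypothetical helper for illustration and is not part of `rtd.sh` or the DeBERTa package.

```python
def rtd_labels(original_tokens, corrupted_tokens):
    """Return 1 where the token was replaced and 0 where it matches the original."""
    if len(original_tokens) != len(corrupted_tokens):
        raise ValueError("sequences must have the same length")
    return [int(o != c) for o, c in zip(original_tokens, corrupted_tokens)]

original = ['the', 'chef', 'cooked', 'the', 'meal']
corrupted = ['the', 'chef', 'ate', 'the', 'meal']  # hypothetical generator output
print(rtd_labels(original, corrupted))  # [0, 0, 1, 0, 0]
```

Because every position receives a label, the discriminator gets a training signal from all tokens rather than only the masked ones.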

--------------------------------

### Integrate SiFT into DeBERTa Training

Source: https://github.com/microsoft/deberta/blob/master/DeBERTa/sift/README.md

This Python code demonstrates how to integrate SiFT modules into an existing DeBERTa model for adversarial learning. It involves hooking SiFT layers and creating an AdversarialLearner to augment the loss function with adversarial components. This is useful for enhancing model robustness during training.

```python
from DeBERTa.sift import hook_sift_layer, AdversarialLearner

# `model` is assumed to be an already-created DeBERTa task model whose forward
# pass returns (logits, loss); `data` is a dict of keyword inputs for one batch.

# Hook the SiFT layer into the model and create the adversarial learner
adv_modules = hook_sift_layer(model, hidden_size=768)
adv = AdversarialLearner(model, adv_modules)

# Helper that returns only the logits from the model's output
def logits_fn(model, *wargs, **kwargs):
    logits, _ = model(*wargs, **kwargs)
    return logits

logits, loss = model(**data)

# Add the adversarial loss to the task loss
loss = loss + adv.loss(logits, logits_fn, **data)
# The other steps are the same as in general training.
```
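As background for the hook above, SiFT applies its perturbation to normalized word embeddings rather than to the raw embeddings. The sketch below illustrates that idea for a single embedding vector in plain Python, using a random perturbation as a stand-in for the learned adversarial one. `layer_norm`, `perturb_normalized`, and the `scale` default are hypothetical and do not reproduce `hook_sift_layer`.

```python
import math
import random

def layer_norm(vector, eps=1e-7):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(vector) / len(vector)
    var = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(var + eps) for x in vector]

def perturb_normalized(vector, scale=1e-2, seed=0):
    """Normalize first, then add a small random perturbation bounded by `scale`."""
    rng = random.Random(seed)
    normalized = layer_norm(vector)
    return [x + scale * rng.uniform(-1.0, 1.0) for x in normalized]

small = [0.1, 0.2, 0.3, 0.4]
large = [100.0, 200.0, 300.0, 400.0]  # same direction, 1000x the magnitude

# After normalization both inputs look alike, so a perturbation of a given
# size has a comparable effect regardless of the original embedding magnitude.
print(perturb_normalized(small))
print(perturb_normalized(large))
```

This is the sense in which the perturbation is scale-invariant: its relative strength does not depend on how large the embedding values happen to be.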

--------------------------------

### Export DeBERTa to ONNX

Source: https://context7.com/microsoft/deberta/llms.txt

Exports a trained DeBERTa model to the ONNX format for optimized production deployment.

```bash
python3 -m DeBERTa.apps.run --task_name SST-2 --do_eval --export_onnx_model True --data_dir $cache_dir/glue_tasks/SST-2 --output_dir /tmp/DeBERTa/output/sst2
```

--------------------------------

### DeBERTa Model Forward Pass

Source: https://context7.com/microsoft/deberta/llms.txt

Shows how to perform a forward pass with the DeBERTa model, processing input token IDs and masks to obtain encoded representations from all transformer layers or just the last layer. It also demonstrates how to retrieve attention matrices.

```python
import torch
from DeBERTa import deberta

# Initialize model
bert = deberta.DeBERTa(pre_trained='base')
bert.eval()

# Prepare input
batch_size = 2
seq_length = 128
input_ids = torch.randint(0, 50265, (batch_size, seq_length))
attention_mask = torch.ones((batch_size, seq_length), dtype=torch.long)
token_type_ids = torch.zeros((batch_size, seq_length), dtype=torch.long)

# Forward pass - get all encoder layers
outputs = bert(
    input_ids=input_ids,
    attention_mask=attention_mask,
    token_type_ids=token_type_ids,
    output_all_encoded_layers=True,
    return_att=False
)

# Access outputs
encoder_layers = outputs['hidden_states']  # List of tensors for each layer
last_hidden_state = encoder_layers[-1]     # Shape: [batch_size, seq_length, hidden_size]
position_embeddings = outputs['position_embeddings']

# Get only last layer output
outputs = bert(
    input_ids=input_ids,
    attention_mask=attention_mask,
    output_all_encoded_layers=False
)
final_output = outputs['hidden_states']  # Single tensor: [batch_size, seq_length, hidden_size]

# Get attention matrices
outputs = bert(
    input_ids=input_ids,
    attention_mask=attention_mask,
    return_att=True
)
attention_weights = outputs.get('attention_probs')  # Attention matrices from each layer
```