### Install DeBERTa as a Pip Package

Source: https://github.com/microsoft/deberta/blob/master/README.md

This command installs the DeBERTa library as a Python package, making it available for use in your projects.

```bash
pip install deberta
```

--------------------------------

### DeBERTa Multiple Choice Model

Source: https://context7.com/microsoft/deberta/llms.txt

Demonstrates how to set up and use the MultiChoiceModel for tasks requiring selection from multiple options, such as RACE or SWAG. Includes examples for training and inference.

```python
import torch
from DeBERTa.apps.models import MultiChoiceModel
from DeBERTa.deberta import ModelConfig

config = ModelConfig()
config.hidden_size = 768
config.num_hidden_layers = 12

model = MultiChoiceModel(
    config=config,
    num_labels=4,  # Number of choices
    drop_out=0.1
)

# Input shape: [batch_size, num_choices, seq_length]
batch_size = 2
num_choices = 4
seq_length = 128

# Each choice is a separate sequence (e.g., context + answer option)
input_ids = torch.randint(0, 50265, (batch_size, num_choices, seq_length))
input_mask = torch.ones((batch_size, num_choices, seq_length), dtype=torch.long)
type_ids = torch.zeros((batch_size, num_choices, seq_length), dtype=torch.long)
labels = torch.randint(0, num_choices, (batch_size,))  # Correct choice index

# Training
model.train()
output = model(
    input_ids=input_ids,
    type_ids=type_ids,
    input_mask=input_mask,
    labels=labels
)
logits = output['logits']  # Shape: [batch_size, num_choices]
loss = output['loss']

# Inference
model.eval()
with torch.no_grad():
    output = model(input_ids=input_ids, input_mask=input_mask)
    choice_predictions = output['logits'].argmax(dim=-1)
```

--------------------------------

### Configure Distributed Training

Source: https://context7.com/microsoft/deberta/llms.txt

Provides a template for using the DistributedTrainer class to handle multi-GPU training. It includes setup for arguments, data preparation, custom loss functions, and checkpoint management.
```python
import argparse

from DeBERTa.training import DistributedTrainer, set_random_seed
from DeBERTa.apps.models import SequenceClassificationModel
from DeBERTa.deberta import ModelConfig

args = argparse.Namespace(
    seed=42, rank=0, world_size=1,
    train_batch_size=32, accumulative_update=2,
    num_train_epochs=3, output_dir='/output/path', fp16=False
)
set_random_seed(args.seed)

model = SequenceClassificationModel(ModelConfig(), num_labels=2, pre_trained='base')

# This template does not define `device`, `data_fn`, `loss_fn`, or `eval_fn`;
# supply your own target device and data, loss, and evaluation callables.
trainer = DistributedTrainer(
    args=args, output_dir=args.output_dir, model=model, device=device,
    data_fn=data_fn, loss_fn=loss_fn, eval_fn=eval_fn,
    dump_interval=10000, name='classification'
)
trainer.train()
```

--------------------------------

### Integrate DeBERTa Encoder into PyTorch Model

Source: https://github.com/microsoft/deberta/blob/master/README.md

This snippet demonstrates how to replace the encoder of a custom PyTorch model with DeBERTa. It shows the initialization of the DeBERTa model and how to pass input IDs through it to get encodings. Dependencies include PyTorch and the DeBERTa library.

```python
from DeBERTa import deberta
import torch

class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Your existing model code
        self.deberta = deberta.DeBERTa(pre_trained='base')  # Or 'large' 'base-mnli' 'large-mnli' 'xlarge' 'xlarge-mnli' 'xlarge-v2' 'xxlarge-v2'
        # Your existing model code
        # Do initialization as before
        #
        self.deberta.apply_state()  # Apply the pre-trained model of DeBERTa at the end of the constructor
        #

    def forward(self, input_ids):
        # The inputs to DeBERTa forward are
        # `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] with the word token indices in the vocabulary
        # `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token types indices selected in [0, 1].
        #    Type 0 corresponds to a `sentence A` and type 1 corresponds to a `sentence B` token (see BERT paper for more details).
        # `attention_mask`: an optional parameter for input mask or attention mask.
        #   - If it's an input mask, then it will be torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [0, 1].
        #      It's a mask to be used if the input sequence length is smaller than the max input sequence length in the current batch.
        #      It's the mask that we typically use for attention when a batch has varying length sentences.
        #   - If it's an attention mask then it will be torch.LongTensor of shape [batch_size, sequence_length, sequence_length].
        #      In this case, it's a mask indicating which tokens in the sequence should be attended by other tokens in the sequence.
        # `output_all_encoded_layers`: whether to output results of all encoder layers; defaults to True
        #
        # Call the encoder instance, not the `deberta` module. This assumes the
        # forward pass returns a dict whose 'hidden_states' entry lists the
        # per-layer outputs, as the SiFT example in this document does.
        encoding = self.deberta(input_ids)['hidden_states'][-1]
```

--------------------------------

### DeBERTa Sequence Classification Model

Source: https://context7.com/microsoft/deberta/llms.txt

Shows how to initialize and use the SequenceClassificationModel for tasks like sentiment analysis or text classification. It includes examples for both training with labels and inference without labels, as well as a regression task.
```python
import torch
from DeBERTa.apps.models import SequenceClassificationModel
from DeBERTa.deberta import ModelConfig

# Initialize classification model
config = ModelConfig()
config.hidden_size = 768
config.num_hidden_layers = 12
config.num_attention_heads = 12

model = SequenceClassificationModel(
    config=config,
    num_labels=3,       # Number of classes
    drop_out=0.1,       # Classification dropout
    pre_trained='base'  # Load pre-trained weights
)

# Prepare batch input
batch_size = 8
seq_length = 128
input_ids = torch.randint(0, 50265, (batch_size, seq_length))
input_mask = torch.ones((batch_size, seq_length), dtype=torch.long)
type_ids = torch.zeros((batch_size, seq_length), dtype=torch.long)
labels = torch.randint(0, 3, (batch_size,))  # Class labels

# Forward pass with labels (training)
model.train()
output = model(
    input_ids=input_ids,
    type_ids=type_ids,
    input_mask=input_mask,
    labels=labels
)
logits = output['logits']  # Shape: [batch_size, num_labels]
loss = output['loss']      # CrossEntropy loss
print(f"Loss: {loss.item()}")

# Inference (no labels)
model.eval()
with torch.no_grad():
    output = model(
        input_ids=input_ids,
        type_ids=type_ids,
        input_mask=input_mask
    )
    predictions = output['logits'].argmax(dim=-1)
    print(f"Predictions: {predictions}")

# Regression task (num_labels=1)
regression_model = SequenceClassificationModel(
    config=config,
    num_labels=1,
    pre_trained='base'
)
labels = torch.randn(batch_size)  # Continuous labels
output = regression_model(input_ids=input_ids, input_mask=input_mask, labels=labels)
# Uses MSE loss for regression
```

--------------------------------

### Running DeBERTa Experiments from Command Line

Source: https://github.com/microsoft/deberta/blob/master/README.md

Instructions for downloading data and running DeBERTa experiments for GLUE tasks.
```APIDOC
## Running DeBERTa Experiments from Command Line

### Description
This section provides the command-line instructions to download datasets for GLUE tasks and to run DeBERTa experiments using the `run.py` script.

### Method
N/A (Command Line)

### Endpoint
N/A

### Parameters
N/A

### Request Example

#### 1. Get the data
```bash
cache_dir=/tmp/DeBERTa/
cd experiments/glue
./download_data.sh $cache_dir/glue_tasks
```

#### 2. Run task
```bash
task=STS-B
OUTPUT=/tmp/DeBERTa/exps/$task
export OMP_NUM_THREADS=1
python3 -m DeBERTa.apps.run --task_name $task --do_train \
  --data_dir $cache_dir/glue_tasks/$task \
  --eval_batch_size 128 \
  --predict_batch_size 128 \
  --output_dir $OUTPUT \
  --scale_steps 250 \
  --loss_scale 16384 \
  --accumulative_update 1 \
  --num_train_epochs 6 \
  --warmup 100 \
  --learning_rate 2e-5 \
  --train_batch_size 32 \
  --max_seq_len 128
```

### Response
N/A

### Response Example
N/A
```

--------------------------------

### Run Command-Line Fine-tuning

Source: https://context7.com/microsoft/deberta/llms.txt

Illustrates how to execute fine-tuning experiments for tasks like GLUE using the DeBERTa command-line interface. It covers data downloading and parameter configuration for training and evaluation.

```bash
cache_dir=/tmp/DeBERTa
cd experiments/glue
./download_data.sh $cache_dir/glue_tasks

python3 -m DeBERTa.apps.run --task_name SST-2 --do_train --do_eval \
  --data_dir $cache_dir/glue_tasks/SST-2 \
  --output_dir /tmp/DeBERTa/output/sst2 \
  --init_model base \
  --max_seq_length 128 \
  --train_batch_size 32 \
  --eval_batch_size 128 \
  --num_train_epochs 3 \
  --learning_rate 2e-5 \
  --warmup_proportion 0.1 \
  --cls_drop_out 0.1
```

--------------------------------

### DeBERTa Model Initialization

Source: https://context7.com/microsoft/deberta/llms.txt

Demonstrates how to initialize the DeBERTa model using pre-trained weights, custom configurations, or local checkpoints.

```APIDOC
## DeBERTa Model Initialization

### Description
Initializes the core DeBERTa encoder model.
Supports initialization with various pre-trained model configurations, custom configurations, or local checkpoint paths.

### Method
```python
import torch
from DeBERTa import deberta
```

### Initialization Options

#### Pre-trained Models
Initialize with readily available pre-trained weights. Common identifiers include 'base', 'large', 'xlarge', 'xlarge-v2', 'xxlarge-v2', 'deberta-v3-small', 'deberta-v3-base', 'deberta-v3-large', and 'mdeberta-v3-base'.

```python
# Initialize with pre-trained model 'base'
model = deberta.DeBERTa(pre_trained='base')
```

#### Custom Configuration
Initialize with a custom `ModelConfig` object, allowing fine-grained control over model architecture.

```python
# Initialize with custom configuration
config = deberta.ModelConfig()
config.hidden_size = 768
config.num_hidden_layers = 12
config.num_attention_heads = 12
config.intermediate_size = 3072
config.hidden_dropout_prob = 0.1
config.attention_probs_dropout_prob = 0.1
config.max_position_embeddings = 512

model = deberta.DeBERTa(config=config)
```

#### Local Checkpoint
Initialize by loading weights from a local model checkpoint file.

```python
# Load from local checkpoint
model = deberta.DeBERTa(pre_trained='/path/to/model/checkpoint')
```
```

--------------------------------

### Fine-tune DeBERTa on GLUE Tasks via CLI

Source: https://context7.com/microsoft/deberta/llms.txt

Demonstrates how to use the DeBERTa CLI to fine-tune models on benchmarks like MNLI, QNLI, and RTE. It supports configurations for batch size, sequence length, and mixed-precision training.
```bash
# Assumes $cache_dir is already set, as in the data download step
python3 -m DeBERTa.apps.run --task_name MNLI --do_train --do_eval \
  --data_dir $cache_dir/glue_tasks/MNLI \
  --output_dir /tmp/DeBERTa/output/mnli \
  --init_model large \
  --max_seq_length 256 \
  --train_batch_size 16 \
  --accumulative_update 2 \
  --num_train_epochs 3 \
  --learning_rate 1e-5 \
  --fp16 True
```

--------------------------------

### Initialize DeBERTa Model

Source: https://context7.com/microsoft/deberta/llms.txt

Demonstrates initializing the DeBERTa encoder model using pre-trained weights, custom configurations, or local checkpoints. It supports various model sizes and multilingual variants.

```python
import torch
from DeBERTa import deberta

# Initialize with pre-trained model
# Available: 'base', 'large', 'xlarge', 'xlarge-v2', 'xxlarge-v2',
#            'deberta-v3-small', 'deberta-v3-base', 'deberta-v3-large', 'mdeberta-v3-base'
model = deberta.DeBERTa(pre_trained='base')

# Initialize with custom configuration
config = deberta.ModelConfig()
config.hidden_size = 768
config.num_hidden_layers = 12
config.num_attention_heads = 12
config.intermediate_size = 3072
config.hidden_dropout_prob = 0.1
config.attention_probs_dropout_prob = 0.1
config.max_position_embeddings = 512

model = deberta.DeBERTa(config=config)

# Load from local checkpoint
model = deberta.DeBERTa(pre_trained='/path/to/model/checkpoint')
```

--------------------------------

### Run DeBERTa GLUE Task Experiment

Source: https://github.com/microsoft/deberta/blob/master/README.md

This command executes a DeBERTa model training or evaluation run for a specified GLUE task. It requires setting environment variables like OMP_NUM_THREADS, and provides numerous arguments to configure the task, data directory, batch sizes, output directory, training epochs, learning rate, and sequence length.
```bash
task=STS-B
OUTPUT=/tmp/DeBERTa/exps/$task
export OMP_NUM_THREADS=1
python3 -m DeBERTa.apps.run --task_name $task --do_train \
  --data_dir $cache_dir/glue_tasks/$task \
  --eval_batch_size 128 \
  --predict_batch_size 128 \
  --output_dir $OUTPUT \
  --scale_steps 250 \
  --loss_scale 16384 \
  --accumulative_update 1 \
  --num_train_epochs 6 \
  --warmup 100 \
  --learning_rate 2e-5 \
  --train_batch_size 32 \
  --max_seq_len 128
```

--------------------------------

### Using DeBERTa Tokenizer

Source: https://github.com/microsoft/deberta/blob/master/README.md

This section explains how to load and use the DeBERTa tokenizer for preparing input data.

```APIDOC
## Using DeBERTa Tokenizer

### Description
This code demonstrates how to load the vocabulary and tokenizer for DeBERTa, tokenize example text, add special tokens, convert tokens to IDs, and prepare input features including padding and attention masks.

### Method
N/A (Code Example)

### Endpoint
N/A

### Parameters
N/A

### Request Example
```python
from DeBERTa import deberta
import torch

vocab_path, vocab_type = deberta.load_vocab(pretrained_id='base')
tokenizer = deberta.tokenizers[vocab_type](vocab_path)

max_seq_len = 512
tokens = tokenizer.tokenize('Examples input text of DeBERTa')
# Truncate long sequence
tokens = tokens[:max_seq_len - 2]
# Add special tokens to the `tokens`
tokens = ['[CLS]'] + tokens + ['[SEP]']
input_ids = tokenizer.convert_tokens_to_ids(tokens)
input_mask = [1]*len(input_ids)
# padding
paddings = max_seq_len - len(input_ids)
input_ids = input_ids + [0]*paddings
input_mask = input_mask + [0]*paddings
features = {
    'input_ids': torch.tensor(input_ids, dtype=torch.int),
    'input_mask': torch.tensor(input_mask, dtype=torch.int)
}
```

### Response
N/A

### Response Example
N/A
```

--------------------------------

### Download Data for DeBERTa GLUE Experiments

Source: https://github.com/microsoft/deberta/blob/master/README.md

This bash script downloads the necessary datasets for running DeBERTa experiments on GLUE
tasks. It requires a cache directory path as an argument and changes the current directory to the experiments/glue directory before executing the download script.

```bash
cache_dir=/tmp/DeBERTa/
cd experiments/glue
./download_data.sh $cache_dir/glue_tasks
```

--------------------------------

### Configure Distributed Training Environment Variables

Source: https://github.com/microsoft/deberta/blob/master/experiments/language_model/README.md

This section outlines the environment variables required for distributed training across multiple nodes. It includes setting the total number of nodes, master node address and port, and the rank of the current node.

```bash
# Example for Node 0
export WORLD_SIZE=2
export MASTER_ADDR=node0
export MASTER_PORT=7488
export RANK=0
./rtd.sh deberta-v3-xsmall

# Example for Node 1
export WORLD_SIZE=2
export MASTER_ADDR=node0
export MASTER_PORT=7488
export RANK=1
./rtd.sh deberta-v3-xsmall
```

--------------------------------

### Loading Pre-trained DeBERTa Models and Vocabularies

Source: https://context7.com/microsoft/deberta/llms.txt

This snippet explains how to load pre-trained DeBERTa models and their corresponding vocabularies from the HuggingFace Hub or custom file paths. It includes listing available pre-trained models, loading model states and configurations, and loading vocabulary files, differentiating between vocabulary types for different DeBERTa versions.
```python
from DeBERTa.deberta import load_model_state, load_vocab, pretrained_models

# List available pre-trained models
print("Available models:")
for name in pretrained_models.keys():
    print(f" - {name}")
# base, large, xlarge, base-mnli, large-mnli, xlarge-mnli
# xlarge-v2, xxlarge-v2, xlarge-v2-mnli, xxlarge-v2-mnli
# deberta-v3-small, deberta-v3-base, deberta-v3-large
# deberta-v3-xsmall, mdeberta-v3-base

# Load model state and config
model_state, model_config = load_model_state('base')
print(f"Config: {model_config}")

# Load from custom path
model_state, model_config = load_model_state('/path/to/pytorch_model.bin')

# Load vocabulary
vocab_path, vocab_type = load_vocab(pretrained_id='base')
print(f"Vocab path: {vocab_path}")
print(f"Vocab type: {vocab_type}")  # 'gpt2' for v1, 'spm' for v2/v3
```

--------------------------------

### DeBERTa Tokenizer Initialization and Text Processing

Source: https://context7.com/microsoft/deberta/llms.txt

Illustrates how to initialize DeBERTa tokenizers (GPT2Tokenizer for V1, SPMTokenizer for V2/V3) and process text into token IDs. It covers loading vocabularies, tokenizing text, converting tokens to IDs, and preparing input for the model, including padding.

```python
from DeBERTa import deberta

# Load vocabulary and get tokenizer type for a pre-trained model
vocab_path, vocab_type = deberta.load_vocab(pretrained_id='base')
tokenizer = deberta.tokenizers[vocab_type](vocab_path)

# For V2/V3 models (uses SentencePiece)
vocab_path, vocab_type = deberta.load_vocab(pretrained_id='deberta-v3-base')
tokenizer = deberta.tokenizers[vocab_type](vocab_path)  # SPMTokenizer

# Tokenize text
text = "Hello, this is an example of DeBERTa tokenization."
tokens = tokenizer.tokenize(text)
print(f"Tokens: {tokens}")

# Convert tokens to IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(f"Token IDs: {token_ids}")

# Build full input with special tokens
max_seq_len = 128
tokens = tokenizer.tokenize(text)
tokens = tokens[:max_seq_len - 2]  # Reserve space for [CLS] and [SEP]
tokens = ['[CLS]'] + tokens + ['[SEP]']
input_ids = tokenizer.convert_tokens_to_ids(tokens)
input_mask = [1] * len(input_ids)

# Pad to max length
padding_length = max_seq_len - len(input_ids)
input_ids = input_ids + [0] * padding_length
input_mask = input_mask + [0] * padding_length
```

--------------------------------

### Tokenizer Initialization and Text Processing

Source: https://context7.com/microsoft/deberta/llms.txt

Details on initializing DeBERTa tokenizers and processing text for model input.

```APIDOC
## Tokenizer Initialization and Text Processing

### Description
Provides instructions for initializing DeBERTa tokenizers (GPT2Tokenizer for V1, SPMTokenizer for V2/V3) and converting text into token IDs suitable for model input.

### Method
```python
from DeBERTa import deberta
```

### Usage

#### Initializing Tokenizers
DeBERTa offers two tokenizer types. The `load_vocab` function helps retrieve the vocabulary path and type based on a pre-trained model ID.

```python
# Load vocabulary and get tokenizer type for a pre-trained model (V1 example)
vocab_path, vocab_type = deberta.load_vocab(pretrained_id='base')
tokenizer = deberta.tokenizers[vocab_type](vocab_path)

# For V2/V3 models (uses SentencePiece)
vocab_path, vocab_type = deberta.load_vocab(pretrained_id='deberta-v3-base')
tokenizer = deberta.tokenizers[vocab_type](vocab_path)  # This will be an SPMTokenizer instance
```

#### Tokenizing Text
Convert raw text into tokens and then into numerical IDs.

```python
# Tokenize text
text = "Hello, this is an example of DeBERTa tokenization."
tokens = tokenizer.tokenize(text)
print(f"Tokens: {tokens}")

# Convert tokens to IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(f"Token IDs: {token_ids}")
```

#### Building Full Input for Model
Prepare input sequences by adding special tokens ([CLS], [SEP]) and padding to a maximum sequence length.

```python
# Build full input with special tokens and padding
max_seq_len = 128
text = "Hello, this is an example of DeBERTa tokenization."
tokens = tokenizer.tokenize(text)
tokens = tokens[:max_seq_len - 2]  # Reserve space for [CLS] and [SEP]
tokens = ['[CLS]'] + tokens + ['[SEP]']
input_ids = tokenizer.convert_tokens_to_ids(tokens)
input_mask = [1] * len(input_ids)

# Pad to max length
padding_length = max_seq_len - len(input_ids)
input_ids = input_ids + [0] * padding_length
input_mask = input_mask + [0] * padding_length

print(f"Input IDs: {input_ids}")
print(f"Attention Mask: {input_mask}")
```
```

--------------------------------

### Load DeBERTa Model and Vocabulary

Source: https://context7.com/microsoft/deberta/llms.txt

Demonstrates how to load pre-trained model states and vocabulary files from a specific cache directory. This is useful for managing large model assets in custom environments.

```python
from DeBERTa.deberta import load_model_state, load_vocab

model_state, model_config = load_model_state('deberta-v3-base', cache_dir='/custom/cache/dir', no_cache=False)
vocab_path, vocab_type = load_vocab(pretrained_id='deberta-v3-base', cache_dir='/custom/cache/dir')
```

--------------------------------

### Continuously Train DeBERTaV3 Models with RTD

Source: https://github.com/microsoft/deberta/blob/master/experiments/language_model/README.md

This script continues training DeBERTaV3 models with the Replaced Token Detection (RTD) task. It takes the model size as its single argument; the initialization checkpoints for the generator and discriminator are configured inside the script, so check it for details.
```bash
#!/bin/bash

if [ $# -ne 1 ]; then
  echo "Usage: $0 MODEL_SIZE-continue"
  exit 1
fi

MODEL_SIZE=$1

if [[ "$MODEL_SIZE" == "deberta-v3-xsmall-continue" ]]; then
  echo "Continuously training DeBERTaV3 XSmall model with RTD..."
  # Specify generator and discriminator checkpoints here
  # Example: python train.py --task rtd --model deberta-v3-xsmall --continue_train --generator_checkpoint path/to/generator.bin --discriminator_checkpoint path/to/discriminator.bin
elif [[ "$MODEL_SIZE" == "deberta-v3-small-continue" ]]; then
  echo "Continuously training DeBERTaV3 Small model with RTD..."
  # Specify generator and discriminator checkpoints here
elif [[ "$MODEL_SIZE" == "deberta-v3-large-continue" ]]; then
  echo "Continuously training DeBERTaV3 Large model with RTD..."
  # Specify generator and discriminator checkpoints here
else
  echo "Unknown model size for continue training: $MODEL_SIZE"
  exit 1
fi
```

--------------------------------

### DeBERTa Model Configuration Management

Source: https://context7.com/microsoft/deberta/llms.txt

This snippet illustrates how to manage DeBERTa model configurations using the `ModelConfig` class. It covers creating default configurations, modifying hyperparameters, loading configurations from JSON files, converting to dictionaries, saving as JSON strings, and creating configurations from dictionaries. This is essential for customizing DeBERTa models.
```python
from DeBERTa.deberta import ModelConfig

# Create default configuration
config = ModelConfig()

# Modify base parameters
config.hidden_size = 1024
config.num_hidden_layers = 24
config.num_attention_heads = 16
config.intermediate_size = 4096
config.hidden_dropout_prob = 0.1
config.attention_probs_dropout_prob = 0.1
config.max_position_embeddings = 512
config.vocab_size = 128000

# DeBERTa-specific parameters
config.relative_attention = True     # Enable disentangled attention
config.position_biased_input = True  # Add position to content embedding
config.pos_att_type = 'p2c|c2p'      # Position attention types: p2c, c2p, p2p
config.max_relative_positions = 256  # Max relative position distance
config.position_buckets = 256        # Use log bucket encoding for positions

# Load from JSON file
config = ModelConfig.from_json_file('/path/to/config.json')

# Convert to dictionary
config_dict = config.to_dict()

# Save as JSON string
json_str = config.to_json_string()
print(json_str)

# Create from dictionary
config = ModelConfig.from_dict({
    'hidden_size': 768,
    'num_hidden_layers': 12,
    'num_attention_heads': 12,
    'intermediate_size': 3072,
    'hidden_act': 'gelu',
    'hidden_dropout_prob': 0.1,
    'attention_probs_dropout_prob': 0.1,
    'max_position_embeddings': 512,
    'type_vocab_size': 0,
    'relative_attention': True,
    'pos_att_type': 'p2c|c2p'
})
```

--------------------------------

### DeBERTa Named Entity Recognition (NER) Model

Source: https://context7.com/microsoft/deberta/llms.txt

Illustrates the initialization and usage of the NERModel for token-level classification tasks like NER and POS tagging. It covers preparing inputs, performing a training forward pass with labels, and inference.

```python
import torch
from DeBERTa.apps.models import NERModel
from DeBERTa.deberta import ModelConfig

# Initialize NER model
config = ModelConfig()
config.hidden_size = 768
config.num_hidden_layers = 12

# NER with BIO tagging: O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC, etc.
num_labels = 9  # Number of entity tags

model = NERModel(
    config=config,
    num_labels=num_labels,
    drop_out=0.1
)

# Prepare inputs
batch_size = 4
seq_length = 64
input_ids = torch.randint(0, 50265, (batch_size, seq_length))
input_mask = torch.ones((batch_size, seq_length), dtype=torch.long)
type_ids = torch.zeros((batch_size, seq_length), dtype=torch.long)

# Labels: tag ID for each token, -1 for padding/ignored tokens
labels = torch.randint(-1, num_labels, (batch_size, seq_length))

# Training forward pass
model.train()
output = model(
    input_ids=input_ids,
    type_ids=type_ids,
    input_mask=input_mask,
    labels=labels
)
logits = output['logits']  # Shape: [batch_size, seq_length, num_labels]
loss = output['loss']
print(f"NER Loss: {loss.item()}")

# Inference
model.eval()
with torch.no_grad():
    output = model(input_ids=input_ids, input_mask=input_mask)
    tag_predictions = output['logits'].argmax(dim=-1)  # [batch_size, seq_length]
```

--------------------------------

### DeBERTa Tokenization and Decoding

Source: https://context7.com/microsoft/deberta/llms.txt

Demonstrates how to access special token IDs (CLS, SEP, MASK, PAD) from the tokenizer and decode token IDs back into human-readable text.

```python
cls_id = tokenizer.vocab['[CLS]']
sep_id = tokenizer.vocab['[SEP]']
mask_id = tokenizer.vocab['[MASK]']
pad_id = tokenizer.vocab['[PAD]']

decoded = tokenizer.decode(tokens)
print(f"Decoded: {decoded}")
```

--------------------------------

### Masked Language Model Pre-training with DeBERTa

Source: https://context7.com/microsoft/deberta/llms.txt

This snippet demonstrates how to perform Masked Language Model (MLM) pre-training using DeBERTa's enhanced mask decoder. It initializes the `MaskedLanguageModel` with a specified configuration, prepares input data including masked tokens and labels, and performs a forward pass to obtain MLM loss.
```python
import torch
from DeBERTa.apps.models import MaskedLanguageModel
from DeBERTa.deberta import ModelConfig

config = ModelConfig()
config.hidden_size = 768
config.num_hidden_layers = 12
config.vocab_size = 50265
config.max_position_embeddings = 512

model = MaskedLanguageModel(config=config)

batch_size = 4
seq_length = 128
input_ids = torch.randint(0, 50265, (batch_size, seq_length))
input_mask = torch.ones((batch_size, seq_length), dtype=torch.long)

# Labels: token IDs for masked positions, 0 for non-masked positions
# Non-zero values indicate positions to predict
labels = torch.zeros((batch_size, seq_length), dtype=torch.long)
num_masked = 15  # 15 of 128 tokens, roughly 12% masking
for i in range(batch_size):
    # Sample distinct positions (skipping the first and last token) so that
    # no position is masked twice and its label overwritten with the mask ID
    positions = torch.randperm(seq_length - 2)[:num_masked] + 1
    labels[i, positions] = input_ids[i, positions]
    input_ids[i, positions] = 50264  # [MASK] token ID

# Forward pass
output = model(
    input_ids=input_ids,
    input_mask=input_mask,
    labels=labels
)
lm_logits = output['logits']  # Logits for masked positions
lm_labels = output['labels']  # Flattened labels for masked positions
loss = output['loss']         # MLM loss
print(f"MLM Loss: {loss.mean().item()}")
```

--------------------------------

### Implement Adversarial Training with SiFT

Source: https://context7.com/microsoft/deberta/llms.txt

Shows how to apply Scale-Invariant Fine-Tuning (SiFT) to a DeBERTa model to improve robustness. It involves hooking a perturbation layer and calculating an adversarial loss component during the training loop.
```python
import torch
from DeBERTa.deberta import DeBERTa
from DeBERTa.sift import AdversarialLearner, hook_sift_layer

model = DeBERTa(pre_trained='base')
adv_modules = hook_sift_layer(model, hidden_size=768, learning_rate=1e-4,
                              init_perturbation=1e-2, target_module='embeddings.LayerNorm')
adv = AdversarialLearner(model, adv_modules)

def logits_fn(model, **data):
    outputs = model(**data)
    return outputs['hidden_states'][-1][:, 0]

# `logits`, `standard_loss`, `input_ids`, and `attention_mask` come from the
# surrounding training step, which this snippet does not show.
adv_loss = adv.loss(logits, logits_fn, loss_fn='symmetric-kl',
                    input_ids=input_ids, attention_mask=attention_mask)
total_loss = standard_loss + 1.0 * adv_loss
```

--------------------------------

### DeBERTa Model Architecture Components

Source: https://github.com/microsoft/deberta/blob/master/docs/source/modules/deberta.md

Overview of the primary modules and classes used within the DeBERTa implementation.

```APIDOC
## DeBERTa Model Components

### Description
The DeBERTa model consists of several core components including DisentangledSelfAttention, ContextPooler, and BertEncoder. These modules work together to provide improved performance over standard BERT architectures.

### Key Classes
- **DeBERTa**: The main model class.
- **DisentangledSelfAttention**: Implements the disentangled attention mechanism.
- **ContextPooler**: Handles the pooling of hidden states.
- **BertEncoder**: The stack of transformer layers.
- **StableDropout**: A robust dropout implementation for stability.

### Configuration
- **ModelConfig**: Defines the hyperparameters for the model.
- **PoolConfig**: Defines the pooling configuration settings.
```

--------------------------------

### Perform Adversarial Training with SiFT

Source: https://context7.com/microsoft/deberta/llms.txt

Configures the DeBERTa training process to include adversarial training (SiFT) to improve model robustness. It utilizes symmetric-kl loss and specific perturbation parameters.
```bash
# Assumes $cache_dir is already set, as in the data download step
python3 -m DeBERTa.apps.run --task_name RTE --do_train \
  --data_dir $cache_dir/glue_tasks/RTE \
  --output_dir /tmp/DeBERTa/output/rte \
  --init_model large \
  --vat_lambda 1.0 \
  --vat_learning_rate 1e-4 \
  --vat_init_perturbation 1e-2 \
  --vat_loss_fn symmetric-kl
```

--------------------------------

### Integrating DeBERTa as an Encoder

Source: https://github.com/microsoft/deberta/blob/master/README.md

This section demonstrates how to replace the encoder in your existing PyTorch model with DeBERTa.

```APIDOC
## Integrating DeBERTa as an Encoder

### Description
This code snippet shows how to modify a PyTorch model to use DeBERTa as its encoder. It includes initializing the DeBERTa model and defining the forward pass.

### Method
N/A (Code Example)

### Endpoint
N/A

### Parameters
N/A

### Request Example
```python
from DeBERTa import deberta
import torch

class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Your existing model code
        self.deberta = deberta.DeBERTa(pre_trained='base')  # Or 'large', 'base-mnli', etc.
        # Your existing model code
        # do initialization as before
        self.deberta.apply_state()  # Apply the pre-trained model of DeBERTa at the end of the constructor

    def forward(self, input_ids):
        # The inputs to DeBERTa forward are:
        # `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
        # `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length]
        # `attention_mask`: an optional parameter for input mask or attention mask.
        # `output_all_encoded_layers`: whether to output results of all encoder layers; defaults to True
        #
        # Call the encoder instance, not the `deberta` module. This assumes the
        # forward pass returns a dict whose 'hidden_states' entry lists the
        # per-layer outputs, as the SiFT example in this document does.
        encoding = self.deberta(input_ids)['hidden_states'][-1]
```

### Response
N/A

### Response Example
N/A
```

--------------------------------

### Pre-train BERT-like Model with MLM using mlm.sh

Source: https://github.com/microsoft/deberta/blob/master/experiments/language_model/README.md

This script trains a BERT-like model using the Masked Language Modeling (MLM) task.
It supports training standard BERT models or DeBERTa models with disentangled attention. The script takes the model type as an argument.

```bash
#!/bin/bash

if [ $# -ne 1 ]; then
  echo "Usage: $0 MODEL_TYPE"
  exit 1
fi

MODEL_TYPE=$1

if [ "$MODEL_TYPE" = "bert-base" ]; then
  echo "Training BERT base model with MLM..."
  # Add BERT base MLM training command here
elif [ "$MODEL_TYPE" = "deberta-base" ]; then
  echo "Training DeBERTa base model with MLM and Disentangled Attention..."
  # Add DeBERTa base MLM training command here
else
  echo "Unknown model type: $MODEL_TYPE"
  exit 1
fi

# Example placeholder for training command
# python train.py --task mlm --model $MODEL_TYPE --data path/to/data
```

--------------------------------

### Context Pooler for Encoder Output

Source: https://context7.com/microsoft/deberta/llms.txt

This snippet demonstrates the `ContextPooler` class, which extracts a fixed-size representation from the encoder's output, typically from the [CLS] token. It shows how to configure and initialize the pooler, either with a dedicated `PoolConfig` or one derived from a `ModelConfig`, and how to apply it to encoder outputs to obtain a pooled representation.
```python
import torch
from DeBERTa.deberta import ContextPooler, PoolConfig, DeBERTa

# Create pooler configuration
config = PoolConfig()
config.hidden_size = 768
config.dropout = 0.1
config.hidden_act = 'gelu'

# Or create from model config
from DeBERTa.deberta import ModelConfig

model_config = ModelConfig()
model_config.hidden_size = 768
model_config.pooling = {
    'hidden_size': 768,
    'hidden_act': 'gelu',
    'dropout': 0.1
}
pool_config = PoolConfig(model_config)

# Initialize pooler
pooler = ContextPooler(pool_config)

# Use with encoder output
encoder_output = torch.randn(4, 128, 768)  # [batch, seq_len, hidden]
pooled = pooler(encoder_output)  # [batch, hidden]
print(f"Pooled shape: {pooled.shape}")  # torch.Size([4, 768])

# Get output dimension
out_dim = pooler.output_dim()  # 768
```

--------------------------------

### Integrate DeBERTa into Existing Python Code

Source: https://github.com/microsoft/deberta/blob/master/README.md

This Python code demonstrates how to apply a pre-trained DeBERTa model to your existing codebase. It requires modifications to your code to load and utilize the model's capabilities.

```python
# To apply DeBERTa to your existing code, you need to make two changes to your code,
```

--------------------------------

### Integrate DeBERTa with HuggingFace Transformers

Source: https://context7.com/microsoft/deberta/llms.txt

Shows how to load DeBERTa models using HuggingFace Transformers, perform tokenization, extract embeddings, and fine-tune for sequence classification tasks.

```python
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification
import torch

model_name = "microsoft/deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

text = "DeBERTa achieves state-of-the-art results on NLU benchmarks."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    outputs = model(**inputs)
    last_hidden_state = outputs.last_hidden_state
    pooler_output = last_hidden_state[:, 0]

classifier = AutoModelForSequenceClassification.from_pretrained("microsoft/deberta-v3-base", num_labels=2)
labels = torch.tensor([1])
outputs = classifier(**inputs, labels=labels)
loss = outputs.loss
```

--------------------------------

### Tokenize Text with DeBERTa Tokenizer

Source: https://github.com/microsoft/deberta/blob/master/README.md

This code snippet shows how to load and use the tokenizer built into the DeBERTa library. It covers loading the vocabulary, tokenizing input text, truncating sequences, adding special tokens, converting tokens to IDs, and padding sequences to a fixed length. The output is a dictionary of tensors suitable for model input.

```python
from DeBERTa import deberta
import torch

vocab_path, vocab_type = deberta.load_vocab(pretrained_id='base')
tokenizer = deberta.tokenizers[vocab_type](vocab_path)

# We apply the same schema of special tokens as BERT, e.g. [CLS], [SEP], [MASK]
max_seq_len = 512
tokens = tokenizer.tokenize('Examples input text of DeBERTa')

# Truncate long sequence
tokens = tokens[:max_seq_len - 2]

# Add special tokens to the `tokens`
tokens = ['[CLS]'] + tokens + ['[SEP]']
input_ids = tokenizer.convert_tokens_to_ids(tokens)
input_mask = [1] * len(input_ids)

# padding
paddings = max_seq_len - len(input_ids)
input_ids = input_ids + [0] * paddings
input_mask = input_mask + [0] * paddings

features = {
    'input_ids': torch.tensor(input_ids, dtype=torch.int),
    'input_mask': torch.tensor(input_mask, dtype=torch.int)
}
```

--------------------------------

### Pre-train ELECTRA-like Model with RTD using rtd.sh

Source: https://github.com/microsoft/deberta/blob/master/experiments/language_model/README.md

This script trains an ELECTRA-like model using the Replaced Token Detection (RTD) task.
It allows training various sizes of DeBERTaV3 models, specifying backbone and embedding parameters. The script takes the model size as its only argument.

```bash
#!/bin/bash

# Require exactly one argument: the model size
if [ $# -ne 1 ]; then
  echo "Usage: $0 <model_size>"
  exit 1
fi

MODEL_SIZE=$1

if [ "$MODEL_SIZE" = "deberta-v3-xsmall" ]; then
  echo "Training DeBERTaV3 XSmall model with RTD..."
  # Add DeBERTaV3 XSmall RTD training command here
  # Example: python train.py --task rtd --model deberta-v3-xsmall --backbone_layers 12 --backbone_hidden 256 --embedding_vocab 128000 --embedding_size 32M
elif [ "$MODEL_SIZE" = "deberta-v3-base" ]; then
  echo "Training DeBERTaV3 Base model with RTD..."
  # Add DeBERTaV3 Base RTD training command here
  # Example: python train.py --task rtd --model deberta-v3-base --backbone_layers 12 --backbone_hidden 768 --embedding_vocab 128000 --embedding_size 96M
elif [ "$MODEL_SIZE" = "deberta-v3-large" ]; then
  echo "Training DeBERTaV3 Large model with RTD..."
  # Add DeBERTaV3 Large RTD training command here
  # Example: python train.py --task rtd --model deberta-v3-large --backbone_layers 24 --backbone_hidden 1024 --embedding_vocab 128000 --embedding_size 128M
else
  echo "Unknown model size: $MODEL_SIZE"
  exit 1
fi
```

--------------------------------

### Integrate SiFT into DeBERTa Training

Source: https://github.com/microsoft/deberta/blob/master/DeBERTa/sift/README.md

This Python code demonstrates how to integrate SiFT modules into an existing DeBERTa model for adversarial learning. It hooks SiFT layers into the model and creates an `AdversarialLearner` that adds an adversarial term to the training loss, which is intended to make the model more robust.
```python
from transformers import DebertaModel, DebertaConfig
# Import path assumed from the source location (DeBERTa/sift); adjust it if your install differs
from DeBERTa.sift import hook_sift_layer, AdversarialLearner

# Assuming 'model' is a pre-initialized DeBERTa model and 'data' is your training data
# Example: model = DebertaModel(DebertaConfig())

# Create DeBERTa model
adv_modules = hook_sift_layer(model, hidden_size=768)
adv = AdversarialLearner(model, adv_modules)

def logits_fn(model, *wargs, **kwargs):
    logits, _ = model(*wargs, **kwargs)
    return logits

# Assuming 'data' is a dictionary containing inputs for the model
# Example: data = {'input_ids': ..., 'attention_mask': ...}
logits, loss = model(**data)
loss = loss + adv.loss(logits, logits_fn, **data)
# The other steps are the same as in general training.
```

--------------------------------

### Export DeBERTa to ONNX

Source: https://context7.com/microsoft/deberta/llms.txt

Exports a trained DeBERTa model to the ONNX format for optimized production deployment.

```bash
python3 -m DeBERTa.apps.run \
  --task_name SST-2 \
  --do_eval \
  --export_onnx_model True \
  --data_dir $cache_dir/glue_tasks/SST-2 \
  --output_dir /tmp/DeBERTa/output/sst2
```

--------------------------------

### DeBERTa Model Forward Pass

Source: https://context7.com/microsoft/deberta/llms.txt

Shows how to perform a forward pass with the DeBERTa model, processing input token IDs and masks to obtain encoded representations from all transformer layers or just the last layer. It also demonstrates how to retrieve attention matrices.
```python
import torch
from DeBERTa import deberta

# Initialize model
bert = deberta.DeBERTa(pre_trained='base')
bert.eval()

# Prepare input
batch_size = 2
seq_length = 128
input_ids = torch.randint(0, 50265, (batch_size, seq_length))
attention_mask = torch.ones((batch_size, seq_length), dtype=torch.long)
token_type_ids = torch.zeros((batch_size, seq_length), dtype=torch.long)

# Forward pass - get all encoder layers
outputs = bert(
    input_ids=input_ids,
    attention_mask=attention_mask,
    token_type_ids=token_type_ids,
    output_all_encoded_layers=True,
    return_att=False
)

# Access outputs
encoder_layers = outputs['hidden_states']  # List of tensors for each layer
last_hidden_state = encoder_layers[-1]  # Shape: [batch_size, seq_length, hidden_size]
position_embeddings = outputs['position_embeddings']

# Get only last layer output
outputs = bert(
    input_ids=input_ids,
    attention_mask=attention_mask,
    output_all_encoded_layers=False
)
final_output = outputs['hidden_states']  # Single tensor: [batch_size, seq_length, hidden_size]

# Get attention matrices
outputs = bert(
    input_ids=input_ids,
    attention_mask=attention_mask,
    return_att=True
)
attention_weights = outputs.get('attention_probs')  # Attention matrices from each layer
```
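
The forward-pass example above uses random token IDs with an all-ones mask. With real text, the sequences in a batch usually differ in length, so each one is truncated or padded to a common length and the mask marks which positions hold real tokens. The following is a minimal sketch in plain Python with no DeBERTa or PyTorch dependency. The helper name `pad_batch` and the token IDs are hypothetical, and the pad ID of 0 is an assumption that mirrors the padding step in the tokenizer example rather than a confirmed library constant.

```python
from typing import Dict, List


def pad_batch(
    sequences: List[List[int]], max_seq_len: int, pad_id: int = 0
) -> Dict[str, List[List[int]]]:
    """Truncate and right-pad token ID lists, building a matching 0/1 mask.

    `pad_batch` and `pad_id=0` are illustrative assumptions, not library API.
    """
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:max_seq_len]          # truncate sequences that are too long
        n_pad = max_seq_len - len(seq)   # number of padding positions to add
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two sequences of arbitrary, made-up token IDs with different lengths
batch = pad_batch([[1, 7, 8, 2], [1, 9, 2]], max_seq_len=5)
print(batch["input_ids"])       # [[1, 7, 8, 2, 0], [1, 9, 2, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1, 0], [1, 1, 1, 0, 0]]
```

Both lists are rectangular, so they can be converted with `torch.tensor(..., dtype=torch.long)` and passed as `input_ids` and `attention_mask` in place of the random tensors.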