### Install Dependencies

Source: https://github.com/microsoft/codebert/blob/master/CodeBERT/code2nl/README.md

Packages required to run the code2nl experiments; `torch` and `transformers` are pinned to specific versions.

```shell
pip install torch==1.4.0
pip install transformers==2.5.0
pip install filelock
```

--------------------------------

### Fine-tuning Configuration

Source: https://github.com/microsoft/codebert/blob/master/CodeReviewer/README.md

Example bash script configuration for distributed training of the quality estimation model.

```bash
mnt_dir="/home/codereview"

# You may change the following block for multiple gpu training
MASTER_HOST=localhost && echo MASTER_HOST: ${MASTER_HOST}
MASTER_PORT=23333 && echo MASTER_PORT: ${MASTER_PORT}
RANK=0 && echo RANK: ${RANK}
PER_NODE_GPU=1 && echo PER_NODE_GPU: ${PER_NODE_GPU}
WORLD_SIZE=1 && echo WORLD_SIZE: ${WORLD_SIZE}
NODES=1 && echo NODES: ${NODES}
NCCL_DEBUG=INFO

bash test_nltk.sh


# Change the arguments as required:
#   model_name_or_path, load_model_path: the path of the model to be finetuned
#   eval_file: the path of the evaluation data
#   output_dir: the directory to save finetuned model (not used at infer/test time)
#   out_file: the path of the output file
#   train_file_name: can be a directory containing files named with "train*.jsonl"

python -m torch.distributed.launch --nproc_per_node ${PER_NODE_GPU} --node_rank=${RANK} --nnodes=${NODES} --master_addr=${MASTER_HOST} --master_port=${MASTER_PORT} ../run_finetune_cls.py  \
  --train_epochs 30 \
  --model_name_or_path microsoft/codereviewer \
  --output_dir ../../save/cls \
  --train_filename ../../dataset/Diff_Quality_Estimation \
  --dev_filename ../../dataset/Diff_Quality_Estimation/cls-valid.jsonl \
  --max_source_length 512 \
  --max_target_length 128 \
  --train_batch_size 12 \
  --learning_rate 3e-4 \
  --gradient_accumulation_steps 3 \
  --mask_rate 0.15 \
  --save_steps 3600 \
  --log_steps 100 \
  --train_steps 120000 \
  --gpu_per_node=${PER_NODE_GPU} \
  --node_index=${RANK} \
  --seed 2233
```
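For orientation, the number of samples behind one optimizer update can be worked out from these flags. The sketch below assumes that `--train_batch_size` is a per-GPU value and that gradient accumulation and data parallelism multiply it; the helper `effective_batch_size` is hypothetical and not part of the repository.

```python
def effective_batch_size(per_gpu_batch: int, accumulation_steps: int,
                         gpus_per_node: int = 1, nodes: int = 1) -> int:
    """Samples contributing to one optimizer update (assumes per-GPU batching)."""
    return per_gpu_batch * accumulation_steps * gpus_per_node * nodes


# Values taken from the script above: batch 12, accumulation 3, one GPU, one node.
single_gpu = effective_batch_size(12, 3, gpus_per_node=1, nodes=1)
print(single_gpu)  # 36
```

Raising `PER_NODE_GPU` or `NODES` scales this number, so the learning rate may need to be revisited when moving to multi-GPU training.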

--------------------------------

### Initialize UniXcoder Model

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/README.md

Set up the UniXcoder model instance and move it to the appropriate device.

```python
import torch
from unixcoder import UniXcoder

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = UniXcoder("microsoft/unixcoder-base")
model.to(device)
```

--------------------------------

### Code-to-Code Search Demo Script

Source: https://github.com/microsoft/codebert/blob/master/CodeExecutor/README.md

Example configuration for the run.sh script to perform code-to-code search.

```bash
# Change the arguments as required:
#   trace_file: the path to the prediction file either downloaded or generated in the last step

source_lang=python
target_lang=python
python run.py \
    --model_name_or_path microsoft/unixcoder-base  \
    --query_data_file ../data/code_to_code_search_test.json \
    --candidate_data_file ../data/code_to_code_search_test.json \
    --trace_file ../data/code_to_code_search_preds.txt \
    --query_lang ${source_lang} \
    --candidate_lang ${target_lang} \
    --code_length 512 \
    --eval_batch_size 256
```

--------------------------------

### Similarity Output Example

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/README.md

Example output for similarity calculations.

```python
tensor([[0.3002]], device='cuda:0', grad_fn=<ViewBackward>)
tensor([[0.1881]], device='cuda:0', grad_fn=<ViewBackward>)
```
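Scores of this kind are typically cosine similarities, that is, the dot product of two L2-normalized embeddings. The dependency-free sketch below illustrates the computation on toy vectors; `cosine_similarity` is a hypothetical helper and the numbers are made up, not UniXcoder outputs.

```python
import math


def cosine_similarity(a, b):
    """Dot product of the two vectors after L2 normalization."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" (real UniXcoder embeddings have 768 dimensions).
code_vec = [1.0, 2.0, 2.0]
query_vec = [2.0, 1.0, 2.0]
print(round(cosine_similarity(code_vec, query_vec), 4))  # 0.8889
```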

--------------------------------

### Fine-Tune UniXcoder for Clone Detection (Training)

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/downstream-tasks/clone-detection/BCB/README.md

Run this shell command to fine-tune `microsoft/unixcoder-base` on BigCloneBench. The dataset must already be downloaded into `dataset/` and the dependencies installed.

```shell
python run.py \
    --output_dir saved_models \
    --model_name_or_path microsoft/unixcoder-base \
    --do_train \
    --train_data_file dataset/train.txt \
    --eval_data_file dataset/valid.txt \
    --num_train_epochs 1 \
    --block_size 512 \
    --train_batch_size 16 \
    --eval_batch_size 32 \
    --learning_rate 5e-5 \
    --max_grad_norm 1.0 \
    --seed 123456 

```

--------------------------------

### Install UniXcoder Dependencies

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/downstream-tasks/code-summarization/README.md

Install the libraries required by the UniXcoder code summarization task using pip. No versions are pinned.

```bash
pip install torch
pip install transformers
```

--------------------------------

### Configure LongCoder for Code Completion

Source: https://context7.com/microsoft/codebert/llms.txt

Setup for the LongCoder model, including attention window configuration and Seq2Seq model initialization.

```python
import torch
from transformers import LongformerConfig, RobertaTokenizer
from longcoder import LongcoderModel
from model import Seq2Seq

# Load LongCoder model
config = LongformerConfig.from_pretrained("microsoft/longcoder-base")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/longcoder-base")

# Configure attention window for long sequences
config.attention_window = [512] * len(config.attention_window)
config.is_decoder_only = True

# Build encoder with LongCoder architecture
encoder = LongcoderModel.from_pretrained("microsoft/longcoder-base", config=config)

# Define end-of-line token for code
eos_ids = [tokenizer.convert_tokens_to_ids('Ċ')]

# Create Seq2Seq model for code completion
model = Seq2Seq(
    encoder=encoder,
    decoder=encoder,
    config=config,
    tokenizer=tokenizer,
    beam_size=5,
    max_length=128,
    sos_id=tokenizer.cls_token_id,
    eos_id=eos_ids
)
```

--------------------------------

### Demo Data Structure

Source: https://github.com/microsoft/codebert/blob/master/CodeReviewer/README.md

Example JSON structure representing the input format for code review tasks.

```python
{
    "old_file": "import torch",  # f1
    "diff_hunk": "@@ -1 +1,2 @@\n import torch\n +import torch.nn as nn",  # f1->f2
    "comment": "I don't think we need to import torch.nn here.",  # requirements for f2->f3
    "target": "import torch"  # f3
}
```
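Because the CodeReviewer datasets are stored as `.jsonl` files, a record like this ends up as a single JSON line. The sketch below checks that the four demo fields are present and serializes the record; `to_jsonl_line` and `REQUIRED_KEYS` are hypothetical names, and whether the real dataset files use exactly these keys is an assumption.

```python
import json

# Field names taken from the demo record above.
REQUIRED_KEYS = {"old_file", "diff_hunk", "comment", "target"}


def to_jsonl_line(record: dict) -> str:
    """Validate the demo fields and serialize the record as one JSON line."""
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return json.dumps(record)


record = {
    "old_file": "import torch",
    "diff_hunk": "@@ -1 +1,2 @@\n import torch\n +import torch.nn as nn",
    "comment": "I don't think we need to import torch.nn here.",
    "target": "import torch",
}
line = to_jsonl_line(record)
print("\n" in line)  # False: json.dumps escapes newlines inside values
```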

--------------------------------

### Prediction File Format

Source: https://github.com/microsoft/codebert/blob/master/GraphCodeBERT/clonedetection/README.md

Example of the expected tab-separated format for prediction files.

```text
13653451	21955002	0
1188160	8831513	1
1141235	14322332	0
16765164	17526811	1
```
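A file in this format can be read with a few lines of Python. The sketch below is illustrative only: `parse_predictions` is a hypothetical helper, the two IDs are kept as strings, and the last column is read as an integer label.

```python
def parse_predictions(text: str):
    """Parse tab-separated `id1<TAB>id2<TAB>label` lines into tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        id1, id2, label = line.split("\t")
        rows.append((id1, id2, int(label)))
    return rows


sample = "13653451\t21955002\t0\n1188160\t8831513\t1\n"
print(parse_predictions(sample))
# [('13653451', '21955002', 0), ('1188160', '8831513', 1)]
```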

--------------------------------

### Embedding Output Example

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/README.md

Example output shape and tensor values for a code embedding.

```python
torch.Size([1, 768])
tensor([[ 8.6533e-01, -1.9796e+00, -8.6849e-01,  4.2652e-01, -5.3696e-01,
         -1.5521e-01,  5.3770e-01,  3.4199e-01,  3.6305e-01, -3.9391e-01,
         -1.1816e+00,  2.6010e+00, -7.7133e-01,  1.8441e+00,  2.3645e+00,
         ...,
         -2.9188e+00,  1.2555e+00, -1.9953e+00, -1.9795e+00,  1.7279e+00,
          6.4590e-01, -5.2769e-02,  2.4965e-01,  2.3962e-02,  5.9996e-02,
          2.5659e+00,  3.6533e+00,  2.0301e+00]], device='cuda:0',
       grad_fn=<DivBackward0>)
```

--------------------------------

### Fine-Tune and Evaluate UniXcoder

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/downstream-tasks/clone-detection/POJ-104/README.md

Run these commands to fine-tune `microsoft/unixcoder-base` on the POJ-104 JSONL files and then evaluate it on the validation and test sets. Ensure torch and transformers are installed in the environment.

```shell
# Training
python run.py \
    --output_dir saved_models \
    --model_name_or_path microsoft/unixcoder-base \
    --do_train \
    --train_data_file dataset/train.jsonl \
    --eval_data_file dataset/valid.jsonl \
    --test_data_file dataset/test.jsonl \
    --num_train_epochs 2 \
    --block_size 400 \
    --train_batch_size 8 \
    --eval_batch_size 16 \
    --learning_rate 2e-5 \
    --max_grad_norm 1.0 \
    --seed 123456
    
# Evaluating
python run.py \
    --output_dir saved_models \
    --model_name_or_path microsoft/unixcoder-base \
    --do_eval \
    --do_test \
    --eval_data_file dataset/valid.jsonl \
    --test_data_file dataset/test.jsonl \
    --num_train_epochs 2 \
    --block_size 400 \
    --train_batch_size 8 \
    --eval_batch_size 16 \
    --learning_rate 2e-5 \
    --max_grad_norm 1.0 \
    --seed 123456
```

--------------------------------

### Evaluate LongCoder Model Performance

Source: https://github.com/microsoft/codebert/blob/master/LongCoder/README.md

This script evaluates the fine-tuned LongCoder model on test datasets. It requires the path to the best performing model checkpoint. Configure parameters such as language, batch size, and sequence lengths to match the fine-tuning setup.

```shell
lang=csharp #csharp, python, java
batch_size=16
beam_size=5
source_length=3968
target_length=128
global_length=64
window_size=512
output_dir=saved_models/$lang
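reload_model=$output_dir/checkpoint-best-acc/model.bin
epochs=10 # not set in the original script; assumed value so --num_train_epochs receives an argument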

python run.py \
--do_test \
--lang $lang \
--load_model_path $reload_model \
--output_dir $output_dir \
--model_name_or_path microsoft/longcoder-base \
--filename microsoft/LCC_$lang \
--max_source_length $source_length \
--max_target_length $target_length \
--max_global_length $global_length \
--window_size $window_size \
--beam_size $beam_size \
--train_batch_size $batch_size \
--eval_batch_size $batch_size \
--num_train_epochs $epochs 2>&1| tee $output_dir/test.log
```
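As a quick sanity check on the lengths above, the source and target budgets add up to 4096 tokens, which is an exact multiple of the 512-token attention window. The sketch below only does the arithmetic; reading the two lengths as one shared context is an assumption, not something stated in the script.

```python
# Values copied from the evaluation script above.
source_length = 3968
target_length = 128
window_size = 512

total_length = source_length + target_length
print(total_length)                 # 4096
print(total_length % window_size)   # 0, i.e. a whole number of windows
print(total_length // window_size)  # 8
```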

--------------------------------

### Initialize CodeReviewer Model

Source: https://context7.com/microsoft/codebert/llms.txt

Sets up the configuration and loads the pre-trained CodeReviewer model.

```python
import torch
from transformers import T5Config, RobertaTokenizer
from models import ReviewerModel, build_or_load_gen_model

# Load CodeReviewer model
class Args:
    model_name_or_path = "microsoft/codereviewer"
    load_model_path = None
    local_rank = 0

args = Args()
config, model, tokenizer = build_or_load_gen_model(args)
```

--------------------------------

### Download and Preprocess POJ-104 Dataset

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/downstream-tasks/clone-detection/POJ-104/README.md

Use these commands to navigate to the dataset directory, download the archive, and run the preprocessing script.

```bash
cd dataset
pip install gdown
gdown "https://drive.google.com/uc?id=0B2i-vWnOu7MxVlJwQXN6eVNONUU"
tar -xvf programs.tar.gz
python preprocess.py
cd ..
```

--------------------------------

### Download and Prepare Code Summarization Dataset

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/downstream-tasks/code-summarization/README.md

Use these bash commands to download the dataset, unzip language-specific code files, remove archives and intermediate files, and run the preprocessing script.

```bash
wget https://github.com/microsoft/CodeXGLUE/raw/main/Code-Text/code-to-text/dataset.zip
unzip dataset.zip
rm dataset.zip
cd dataset
wget https://zenodo.org/record/7857872/files/python.zip
wget https://zenodo.org/record/7857872/files/java.zip
wget https://zenodo.org/record/7857872/files/ruby.zip
wget https://zenodo.org/record/7857872/files/javascript.zip
wget https://zenodo.org/record/7857872/files/go.zip
wget https://zenodo.org/record/7857872/files/php.zip

unzip python.zip
unzip java.zip
unzip ruby.zip
unzip javascript.zip
unzip go.zip
unzip php.zip
rm *.zip
rm *.pkl

python preprocess.py
rm -r */final
cd ..
```

--------------------------------

### Download Dataset

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/downstream-tasks/code-generation/README.md

Commands to create a directory and download the required training, development, and test JSON files.

```bash
mkdir dataset
cd dataset
wget https://github.com/microsoft/CodeXGLUE/raw/main/Text-Code/text-to-code/dataset/concode/train.json
wget https://github.com/microsoft/CodeXGLUE/raw/main/Text-Code/text-to-code/dataset/concode/dev.json
wget https://github.com/microsoft/CodeXGLUE/raw/main/Text-Code/text-to-code/dataset/concode/test.json
cd ..
```

--------------------------------

### Run Fine-tuning Script

Source: https://github.com/microsoft/codebert/blob/master/CodeReviewer/README.md

Commands to navigate to the script directory and execute the fine-tuning process.

```bash
# prepare model checkpoint and datasets
cd code/sh
# adjust the arguments in the *sh* scripts
bash finetune-cls.sh
```

--------------------------------

### Download and Preprocess Dataset

Source: https://github.com/microsoft/codebert/blob/master/GraphCodeBERT/codesearch/README.md

Commands to extract the dataset and execute the preprocessing script.

```shell
unzip dataset.zip
cd dataset
bash run.sh 
cd ..
```

--------------------------------

### Download and Preprocess Datasets

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/downstream-tasks/code-completion/README.md

These commands download and preprocess the javaCorpus and py150 datasets for code completion. They require `unzip`, `wget`, `bash`, and `python`.

```bash
unzip dataset.zip

cd dataset/javaCorpus/
bash download.sh
python preprocess.py --base_dir=token_completion --output_dir=./
wget https://github.com/microsoft/CodeXGLUE/raw/main/Code-Code/CodeCompletion-line/dataset/javaCorpus/line_completion/test.json

cd ../py150
bash download.sh
python preprocess.py --base_dir=py150_files --output_dir=./
wget https://github.com/microsoft/CodeXGLUE/raw/main/Code-Code/CodeCompletion-line/dataset/py150/line_completion/test.json

cd ../..
```

--------------------------------

### Download BigCloneBench Dataset

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/downstream-tasks/clone-detection/BCB/README.md

Use these bash commands to download the BigCloneBench dataset and its associated files into a 'dataset' directory.

```bash
mkdir dataset
cd dataset
wget https://github.com/microsoft/CodeXGLUE/raw/main/Code-Code/Clone-detection-BigCloneBench/dataset/data.jsonl
wget https://github.com/microsoft/CodeXGLUE/raw/main/Code-Code/Clone-detection-BigCloneBench/dataset/test.txt
wget https://github.com/microsoft/CodeXGLUE/raw/main/Code-Code/Clone-detection-BigCloneBench/dataset/train.txt
wget https://github.com/microsoft/CodeXGLUE/raw/main/Code-Code/Clone-detection-BigCloneBench/dataset/valid.txt
cd ..

```

--------------------------------

### Implement GraphCodeBERT for Code Search

Source: https://context7.com/microsoft/codebert/llms.txt

Defines a PyTorch module that integrates data flow information into the encoder for semantic code search. Includes a command-line interface example for training.

```python
import torch
import torch.nn as nn

class GraphCodeBERTModel(nn.Module):
    """GraphCodeBERT model with data flow integration for code search."""

    def __init__(self, encoder):
        super(GraphCodeBERTModel, self).__init__()
        self.encoder = encoder

    def forward(self, code_inputs=None, attn_mask=None, position_idx=None, nl_inputs=None):
        if code_inputs is not None:
            # Process code with data flow graph
            nodes_mask = position_idx.eq(0)
            token_mask = position_idx.ge(2)
            inputs_embeddings = self.encoder.embeddings.word_embeddings(code_inputs)

            # Compute node embeddings from token embeddings using data flow
            nodes_to_token_mask = nodes_mask[:, :, None] & token_mask[:, None, :] & attn_mask
            nodes_to_token_mask = nodes_to_token_mask / (nodes_to_token_mask.sum(-1) + 1e-10)[:, :, None]
            avg_embeddings = torch.einsum("abc,acd->abd", nodes_to_token_mask, inputs_embeddings)
            inputs_embeddings = inputs_embeddings * (~nodes_mask)[:, :, None] + avg_embeddings * nodes_mask[:, :, None]

            return self.encoder(inputs_embeds=inputs_embeddings, attention_mask=attn_mask,
                              position_ids=position_idx)[1]
        else:
            # Process natural language query
            return self.encoder(nl_inputs, attention_mask=nl_inputs.ne(1))[1]

# Usage for code search
# python run.py \
#     --output_dir saved_models \
#     --model_name_or_path microsoft/graphcodebert-base \
#     --do_train \
#     --train_data_file dataset/train.jsonl \
#     --eval_data_file dataset/valid.jsonl \
#     --code_length 256 \
#     --nl_length 128 \
#     --train_batch_size 32 \
#     --eval_batch_size 64 \
#     --learning_rate 2e-5 \
#     --num_train_epochs 10
```
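The two masks in `forward` depend on how `position_idx` is encoded: `eq(0)` selects data flow nodes and `ge(2)` selects code tokens. The plain-Python sketch below builds such an index list under the assumption that padding positions carry the value 1 (the RoBERTa padding index) and token positions start at 2; `build_position_idx` is a hypothetical helper, not a function from the repository.

```python
PAD_POSITION = 1  # assumed: RoBERTa-style padding index


def build_position_idx(num_tokens: int, num_nodes: int, total_length: int):
    """Token positions start at 2, data flow nodes use 0, padding uses 1."""
    if num_tokens + num_nodes > total_length:
        raise ValueError("sequence does not fit into total_length")
    tokens = [i + PAD_POSITION + 1 for i in range(num_tokens)]
    nodes = [0] * num_nodes
    padding = [PAD_POSITION] * (total_length - num_tokens - num_nodes)
    return tokens + nodes + padding


position_idx = build_position_idx(num_tokens=4, num_nodes=2, total_length=8)
nodes_mask = [p == 0 for p in position_idx]  # mirrors position_idx.eq(0)
token_mask = [p >= 2 for p in position_idx]  # mirrors position_idx.ge(2)
print(position_idx)  # [2, 3, 4, 5, 0, 0, 1, 1]
```

With this encoding, padding positions fall into neither mask, so they contribute nothing to the averaged node embeddings.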

--------------------------------

### Download and Preprocess Data

Source: https://github.com/microsoft/codebert/blob/master/CodeBERT/codesearch/README.md

Commands to download the preprocessed training/validation datasets and run the local preprocessing script for test data.

```shell
mkdir data data/codesearch
cd data/codesearch
gdown "https://drive.google.com/uc?id=1xgSR34XO8xXZg4cZScDYj2eGerBE9iGo"
unzip codesearch_data.zip
rm codesearch_data.zip
cd ../../codesearch
python process_data.py
cd ..
```

--------------------------------

### Implement GraphCodeBERT for Clone Detection

Source: https://context7.com/microsoft/codebert/llms.txt

Defines a classifier model that compares two code snippets using GraphCodeBERT embeddings. Includes a command-line interface example for training and evaluation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import CrossEntropyLoss

class CloneDetectionModel(nn.Module):
    """GraphCodeBERT-based code clone detection model."""

    def __init__(self, encoder, config, tokenizer, args):
        super(CloneDetectionModel, self).__init__()
        self.encoder = encoder
        self.config = config
        self.tokenizer = tokenizer
        self.classifier = nn.Sequential(
            nn.Linear(config.hidden_size * 2, config.hidden_size),
            nn.Dropout(config.hidden_dropout_prob),
            nn.Tanh(),
            nn.Dropout(config.hidden_dropout_prob),
            nn.Linear(config.hidden_size, 2)
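        )

    def _encode_with_dataflow(self, inputs_ids, position_idx, attn_mask):
        # Sketch, not part of the original snippet: mirrors the node-averaging step
        # of the code search model and assumes `encoder` is a RobertaModel-style
        # module and `attn_mask` is a boolean tensor.
        nodes_mask = position_idx.eq(0)
        token_mask = position_idx.ge(2)
        embeddings = self.encoder.embeddings.word_embeddings(inputs_ids)
        nodes_to_token_mask = nodes_mask[:, :, None] & token_mask[:, None, :] & attn_mask
        nodes_to_token_mask = nodes_to_token_mask / (nodes_to_token_mask.sum(-1) + 1e-10)[:, :, None]
        avg_embeddings = torch.einsum("abc,acd->abd", nodes_to_token_mask, embeddings)
        embeddings = embeddings * (~nodes_mask)[:, :, None] + avg_embeddings * nodes_mask[:, :, None]
        hidden = self.encoder(inputs_embeds=embeddings, attention_mask=attn_mask,
                              position_ids=position_idx)[0]
        # Pair the first-token vectors of both snippets: (bs * 2, h) -> (bs, 2 * h)
        return hidden[:, 0, :].reshape(-1, hidden.size(-1) * 2)

    def forward(self, inputs_ids_1, position_idx_1, attn_mask_1,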
                inputs_ids_2, position_idx_2, attn_mask_2, labels=None):
        bs, l = inputs_ids_1.size()

        # Concatenate both code snippets for batch processing
        inputs_ids = torch.cat((inputs_ids_1.unsqueeze(1), inputs_ids_2.unsqueeze(1)), 1).view(bs * 2, l)
        position_idx = torch.cat((position_idx_1.unsqueeze(1), position_idx_2.unsqueeze(1)), 1).view(bs * 2, l)
        attn_mask = torch.cat((attn_mask_1.unsqueeze(1), attn_mask_2.unsqueeze(1)), 1).view(bs * 2, l, l)

        # Get embeddings with data flow
        outputs = self._encode_with_dataflow(inputs_ids, position_idx, attn_mask)
        logits = self.classifier(outputs)
        prob = F.softmax(logits, dim=-1)

        if labels is not None:
            loss = CrossEntropyLoss()(logits, labels)
            return loss, prob
        return prob

# Training command
# python run.py \
#     --output_dir saved_models/clone_detection \
#     --model_name_or_path microsoft/graphcodebert-base \
#     --do_train --do_eval --do_test \
#     --train_data_file dataset/train.txt \
#     --eval_data_file dataset/valid.txt \
#     --test_data_file dataset/test.txt \
#     --block_size 400 \
#     --train_batch_size 16 \
#     --eval_batch_size 32 \
#     --learning_rate 5e-5 \
#     --num_train_epochs 2
```

--------------------------------

### Pre-training Configuration

Source: https://github.com/microsoft/codebert/blob/master/CodeExecutor/README.md

Bash script that runs stage 3 of CodeExecutor pre-training with PyTorch distributed training on 8 GPUs per node, starting from the stage-2 checkpoint. It sets the data paths, output and cache directories, batch sizes, learning rate, and optimizer settings.

```bash
# Change the arguments as required:
#   output_dir: the output directory to save inference results
#   data_cache_dir: the output directory to save the data cache 
#   train_data_path: the path of the pre-training file
#   eval_data_path: the path of the test file
#   model_name_or_path: the path of the model to be evaluated

PER_NODE_GPU=8
python -m torch.distributed.launch --nproc_per_node=${PER_NODE_GPU} run.py \
    --output_dir ../saved_models/pretrain_codeexecutor_stage_3 \
    --data_cache_dir ../saved_models/pretrain_codeexecutor_stage_3 \
    --train_data_path /drive/pretrain_codenetmut.json \
    --another_train_data_path /drive/pretrain_tutorial.json \
    --third_train_data_path /drive/single_line_hard_3_million.json \
    --eval_data_path ../data/codenetmut_test.json \
    --model_name_or_path ../saved_models/pretrain_codeexecutor_stage_2 \
    --block_size 1024 \
    --per_gpu_train_batch_size 4 \
    --per_gpu_eval_batch_size 8 \
    --gradient_accumulation_steps 8 \
    --learning_rate 4e-4 \
    --node_index=0 \
    --gpu_per_node $PER_NODE_GPU \
    --weight_decay 0.01 \
    --adam_epsilon 1e-6 \
    --max_grad_norm 1.0 \
    --max_steps 1000000 \
    --warmup_steps 10000 \
    --save_steps 5000 \
    --seed 123
```
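To get a feel for what `--warmup_steps 10000` and `--max_steps 1000000` imply, the sketch below evaluates a linear warmup followed by linear decay. That schedule shape is an assumption (it is a common default for this kind of script, but it is not confirmed by the text above), and `linear_warmup_decay` is a hypothetical helper.

```python
def linear_warmup_decay(step: int, base_lr: float = 4e-4,
                        warmup_steps: int = 10_000,
                        max_steps: int = 1_000_000) -> float:
    """Assumed schedule: linear ramp to base_lr, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, max_steps - step)
    return base_lr * remaining / (max_steps - warmup_steps)


# Learning rate at the start, mid-warmup, peak, mid-decay, and the final step.
for step in (0, 5_000, 10_000, 505_000, 1_000_000):
    print(step, linear_warmup_decay(step))
```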

--------------------------------

### Load CodeBERT Base Model and Tokenizer

Source: https://context7.com/microsoft/codebert/llms.txt

Loads the CodeBERT base model and tokenizer for natural language and code processing. Ensure PyTorch and Hugging Face Transformers are installed. The model can be moved to a CUDA-enabled GPU if available.

```python
import torch
from transformers import RobertaTokenizer, RobertaModel

# Load CodeBERT model and tokenizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")
model.to(device)

# Tokenize natural language and code
nl_tokens = tokenizer.tokenize("return maximum value")
code_tokens = tokenizer.tokenize("def max(a,b): if a>b: return a else return b")

# Combine NL and code tokens with special tokens
tokens = [tokenizer.cls_token] + nl_tokens + [tokenizer.sep_token] + code_tokens + [tokenizer.eos_token]
tokens_ids = tokenizer.convert_tokens_to_ids(tokens)

# Get embeddings
context_embeddings = model(torch.tensor(tokens_ids)[None, :].to(device))[0]
# Output shape: torch.Size([1, 23, 768])
print(f"Embedding shape: {context_embeddings.shape}")
```

--------------------------------

### Download CSN Dataset

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/downstream-tasks/code-search/README.md

Downloads the CodeSearchNet (CSN) dataset archive, extracts it into `dataset/CSN`, and runs its setup script.

```bash
cd dataset
wget https://github.com/microsoft/CodeBERT/raw/master/GraphCodeBERT/codesearch/dataset.zip
unzip dataset.zip && rm -r dataset.zip && mv dataset CSN && cd CSN
bash run.sh 
cd ../..
```

--------------------------------

### Download AdvTest Dataset

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/downstream-tasks/code-search/README.md

Downloads and preprocesses the AdvTest dataset for code search.

```bash
mkdir dataset && cd dataset
wget https://github.com/microsoft/CodeXGLUE/raw/main/Text-Code/NL-code-search-Adv/dataset.zip
unzip dataset.zip && rm -r dataset.zip && mv dataset AdvTest && cd AdvTest
wget https://zenodo.org/record/7857872/files/python.zip
unzip python.zip && python preprocess.py && rm -r python && rm -r *.pkl && rm python.zip
cd ../..
```

--------------------------------

### Pre-training Script

Source: https://github.com/microsoft/codebert/blob/master/CodeExecutor/README.md

This bash script starts pre-training of the CodeExecutor model. Adjust arguments such as output directories, data paths, and model paths in `run.sh` before running it.

```bash
# prepare model checkpoint and datasets
cd pretrain
bash run.sh
```

--------------------------------

### Fine-Tune and Evaluate Model

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/downstream-tasks/code-generation/README.md

Shell commands to execute training and testing phases using the run.py script with specified model parameters.

```shell
# Training
python run.py \
	--do_train \
	--do_eval \
	--model_name_or_path microsoft/unixcoder-base \
	--train_filename dataset/train.json \
	--dev_filename dataset/dev.json \
	--output_dir saved_models \
	--max_source_length 350 \
	--max_target_length 150 \
	--beam_size 3 \
	--train_batch_size 32 \
	--eval_batch_size 32 \
	--learning_rate 5e-5 \
	--gradient_accumulation_steps 1 \
	--num_train_epochs 30 

# Output results
python run.py \
	--do_test \
	--model_name_or_path microsoft/unixcoder-base \
	--test_filename dataset/test.json \
	--output_dir saved_models \
	--max_source_length 350 \
	--max_target_length 150 \
	--beam_size 3 \
	--train_batch_size 32 \
	--eval_batch_size 32 \
	--learning_rate 5e-5 \
	--gradient_accumulation_steps 1 \
	--num_train_epochs 30
```

--------------------------------

### Download and Preprocess CodeNet Dataset

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/downstream-tasks/zero-shot-search/README.md

Use this bash script to download the CodeNet dataset and preprocess it for zero-shot code search. Run it from the directory that contains `dataset`, since the first command changes into that directory.

```bash
cd dataset
wget https://dax-cdn.cdn.appdomain.cloud/dax-project-codenet/1.0.0/Project_CodeNet.tar.gz
tar -xvf Project_CodeNet.tar.gz
python preprocess.py
cd ..
```

--------------------------------

### Run Inference and Evaluation

Source: https://github.com/microsoft/codebert/blob/master/CodeBERT/code2nl/README.md

Commands to perform inference and evaluate the fine-tuned model using a checkpoint.

```shell
lang=php #programming language
beam_size=10
batch_size=128
source_length=256
target_length=128
output_dir=model/$lang
data_dir=../data/code2nl/CodeSearchNet
dev_file=$data_dir/$lang/valid.jsonl
test_file=$data_dir/$lang/test.jsonl
test_model=$output_dir/checkpoint-best-bleu/pytorch_model.bin #checkpoint for test

python run.py --do_test --model_type roberta --model_name_or_path microsoft/codebert-base --load_model_path $test_model --dev_filename $dev_file --test_filename $test_file --output_dir $output_dir --max_source_length $source_length --max_target_length $target_length --beam_size $beam_size --eval_batch_size $batch_size
```

--------------------------------

### Configure UniXcoder Training Parameters

Source: https://context7.com/microsoft/codebert/llms.txt

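Tail of a commented-out training command showing common hyperparameters. The `--max_global_length` and `--window_size` flags match the LongCoder evaluation script rather than a UniXcoder task, so treat the UniXcoder label as uncertain.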

```bash
#     --max_target_length 128 \
#     --max_global_length 64 \
#     --window_size 512 \
#     --beam_size 5 \
#     --train_batch_size 16 \
#     --eval_batch_size 16 \
#     --learning_rate 2e-4 \
#     --num_train_epochs 10
```

--------------------------------

### Train Quality Estimation Model

Source: https://context7.com/microsoft/codebert/llms.txt

Command to initiate training for the quality estimation task using the specified dataset and model parameters.

```bash
# python -m torch.distributed.launch --nproc_per_node 1 run_finetune_cls.py \
#     --train_epochs 30 \
#     --model_name_or_path microsoft/codereviewer \
#     --output_dir ./save/cls \
#     --train_filename ./dataset/Diff_Quality_Estimation \
#     --dev_filename ./dataset/Diff_Quality_Estimation/cls-valid.jsonl \
#     --max_source_length 512 \
#     --max_target_length 128 \
#     --train_batch_size 12 \
#     --learning_rate 3e-4 \
#     --gradient_accumulation_steps 3
```

--------------------------------

### Train CodeExecutor Model

Source: https://context7.com/microsoft/codebert/llms.txt

Command to run pre-training for the CodeExecutor model.

```bash
# python run.py \
#     --do_train \
#     --model_name_or_path microsoft/unixcoder-base \
#     --output_dir ./saved_models \
#     --train_data_file ./data/train.jsonl \
#     --eval_data_file ./data/valid.jsonl \
#     --max_source_length 512 \
#     --max_target_length 256 \
#     --train_batch_size 8 \
#     --learning_rate 5e-5 \
#     --num_train_epochs 10
```

--------------------------------

### Run Inference

Source: https://github.com/microsoft/codebert/blob/master/GraphCodeBERT/refinement/README.md

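Run the fine-tuned model on the test set from a saved checkpoint. The variables `$scale`, `$output_dir`, `$pretrained_model`, `$source_length`, `$target_length`, and `$beam_size` are not defined here and must already be set in the shell.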

```bash
batch_size=64
dev_file=data/$scale/valid.buggy-fixed.buggy,data/$scale/valid.buggy-fixed.fixed
test_file=data/$scale/test.buggy-fixed.buggy,data/$scale/test.buggy-fixed.fixed
load_model_path=$output_dir/checkpoint-best-bleu/pytorch_model.bin #checkpoint for test

python run.py --do_test --model_type roberta --model_name_or_path $pretrained_model --tokenizer_name microsoft/graphcodebert-base --config_name microsoft/graphcodebert-base --load_model_path $load_model_path --dev_filename $dev_file --test_filename $test_file --output_dir $output_dir --max_source_length $source_length --max_target_length $target_length --beam_size $beam_size --eval_batch_size $batch_size 2>&1| tee $output_dir/test.log
```
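Note that `dev_file` and `test_file` each pack two paths into one comma-separated value: the buggy file first, then the fixed file. The sketch below shows how such a value can be taken apart; `split_file_pair` is a hypothetical helper, and `small` is a made-up stand-in for `$scale`.

```python
def split_file_pair(spec: str):
    """Split 'source_path,target_path' into its two components."""
    parts = spec.split(",")
    if len(parts) != 2:
        raise ValueError(f"expected exactly two comma-separated paths, got {len(parts)}")
    return parts[0], parts[1]


scale = "small"  # hypothetical value for $scale
dev_file = f"data/{scale}/valid.buggy-fixed.buggy,data/{scale}/valid.buggy-fixed.fixed"
buggy_path, fixed_path = split_file_pair(dev_file)
print(buggy_path)  # data/small/valid.buggy-fixed.buggy
print(fixed_path)  # data/small/valid.buggy-fixed.fixed
```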

--------------------------------

### Train UniXcoder Model for Code Search

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/downstream-tasks/code-search/README.md

Fine-tunes `microsoft/unixcoder-base` on the CSN data for the language set in `lang`, using the run.py script.

```bash
lang=python
python run.py \
    --output_dir saved_models/CSN/$lang \
    --model_name_or_path microsoft/unixcoder-base  \
    --do_train \
    --train_data_file dataset/CSN/$lang/train.jsonl \
    --eval_data_file dataset/CSN/$lang/valid.jsonl \
    --codebase_file dataset/CSN/$lang/codebase.jsonl \
    --num_train_epochs 10 \
    --code_length 256 \
    --nl_length 128 \
    --train_batch_size 64 \
    --eval_batch_size 64 \
    --learning_rate 2e-5 \
    --seed 123456
```

--------------------------------

### Run Inference

Source: https://github.com/microsoft/codebert/blob/master/GraphCodeBERT/translation/README.md

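Perform inference on the test dataset using a trained model checkpoint. The variables `$source`, `$target`, `$output_dir`, `$pretrained_model`, `$source_length`, `$target_length`, and `$beam_size` are not defined here and must already be set in the shell.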

```bash
batch_size=64
dev_file=data/valid.java-cs.txt.$source,data/valid.java-cs.txt.$target
test_file=data/test.java-cs.txt.$source,data/test.java-cs.txt.$target
load_model_path=$output_dir/checkpoint-best-bleu/pytorch_model.bin #checkpoint for test

python run.py \
--do_test \
--model_type roberta \
--source_lang $source \
--model_name_or_path $pretrained_model \
--tokenizer_name microsoft/graphcodebert-base \
--config_name microsoft/graphcodebert-base \
--load_model_path $load_model_path \
--dev_filename $dev_file \
--test_filename $test_file \
--output_dir $output_dir \
--max_source_length $source_length \
--max_target_length $target_length \
--beam_size $beam_size \
--eval_batch_size $batch_size 2>&1| tee $output_dir/test.log
```

--------------------------------

### Fine-Tune Model

Source: https://github.com/microsoft/codebert/blob/master/CodeBERT/code2nl/README.md

Configuration and execution command for fine-tuning the model on the specified dataset.

```shell
cd code2nl

lang=php #programming language
lr=5e-5
batch_size=64
beam_size=10
source_length=256
target_length=128
data_dir=../data/code2nl/CodeSearchNet
output_dir=model/$lang
train_file=$data_dir/$lang/train.jsonl
dev_file=$data_dir/$lang/valid.jsonl
eval_steps=1000 #400 for ruby, 600 for javascript, 1000 for others
train_steps=50000 #20000 for ruby, 30000 for javascript, 50000 for others
pretrained_model=microsoft/codebert-base #Roberta: roberta-base

python run.py --do_train --do_eval --model_type roberta --model_name_or_path $pretrained_model --train_filename $train_file --dev_filename $dev_file --output_dir $output_dir --max_source_length $source_length --max_target_length $target_length --beam_size $beam_size --train_batch_size $batch_size --eval_batch_size $batch_size --learning_rate $lr --train_steps $train_steps --eval_steps $eval_steps
```
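The inline comments on `eval_steps` and `train_steps` give different values per language. The lookup below restates them as a small table; `STEP_SCHEDULE`, `DEFAULT_STEPS`, and `steps_for` are hypothetical names used only for illustration.

```python
# (eval_steps, train_steps) per language, as given in the script comments.
STEP_SCHEDULE = {
    "ruby": (400, 20_000),
    "javascript": (600, 30_000),
}
DEFAULT_STEPS = (1_000, 50_000)  # "others" in the comments


def steps_for(lang: str):
    """Return (eval_steps, train_steps) for a language."""
    return STEP_SCHEDULE.get(lang, DEFAULT_STEPS)


print(steps_for("php"))   # (1000, 50000)
print(steps_for("ruby"))  # (400, 20000)
```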

--------------------------------

### Demo Data for Zero-Shot Code Search

Source: https://github.com/microsoft/codebert/blob/master/CodeExecutor/README.md

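This JSON-style record shows a simplified demo data structure for the zero-shot code-to-code search task, including the original code and the code with a test case provided. The `#` annotations are explanatory only and are not valid JSON; remove them before parsing.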

```json
{
    "id": 0,  
    "code_id": "s204511158", 
    "problem_id": 340, # solve which problem
    "original_code": "s = list(input())", # code without providing the test case
    "code": "s = ['x', 'y', 'z']",  # code provided with a test case
    "code_tokens": ["<0>", "s", "=", "[", "'x'", ",", "'y'", ",", "'z'", "]"],
    "trace": ["<line> <0> <state> s : [ x , y , z ] </state>"],
    "trace_tokens": ["<line>", "<0>", "<state>", "s", ":", "[", "x", ",", "y", ",", "z", "]", "</state>"]
}
```
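
In the sample record, `trace_tokens` appears to be the whitespace-split form of the `trace` string. The sketch below checks that on the record shown above. It is an observation about this one example, not a documented guarantee of the dataset.

```python
import json

# The demo record from above, written as plain JSON (no inline comments).
record_text = """
{
    "id": 0,
    "code_id": "s204511158",
    "problem_id": 340,
    "original_code": "s = list(input())",
    "code": "s = ['x', 'y', 'z']",
    "code_tokens": ["<0>", "s", "=", "[", "'x'", ",", "'y'", ",", "'z'", "]"],
    "trace": ["<line> <0> <state> s : [ x , y , z ] </state>"],
    "trace_tokens": ["<line>", "<0>", "<state>", "s", ":", "[", "x", ",", "y", ",", "z", "]", "</state>"]
}
"""

record = json.loads(record_text)

# Assumption: splitting each trace line on whitespace reproduces trace_tokens.
split_trace = [tok for line in record["trace"] for tok in line.split()]
matches = split_trace == record["trace_tokens"]
print(matches)
```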

--------------------------------

### Execute UniXcoder Code Search Tasks

Source: https://context7.com/microsoft/codebert/llms.txt

Commands for zero-shot evaluation on AdvTest and for fine-tuning on AdvTest and CodeSearchNet (Python).

```bash
# Zero-shot code search (no training required)
python run.py \
    --output_dir saved_models/AdvTest \
    --model_name_or_path microsoft/unixcoder-base \
    --do_zero_shot --do_test \
    --test_data_file dataset/AdvTest/test.jsonl \
    --codebase_file dataset/AdvTest/test.jsonl \
    --code_length 256 \
    --nl_length 128 \
    --eval_batch_size 64 \
    --seed 123456

# Fine-tuning on AdvTest dataset
python run.py \
    --output_dir saved_models/AdvTest \
    --model_name_or_path microsoft/unixcoder-base \
    --do_train \
    --train_data_file dataset/AdvTest/train.jsonl \
    --eval_data_file dataset/AdvTest/valid.jsonl \
    --codebase_file dataset/AdvTest/valid.jsonl \
    --num_train_epochs 2 \
    --code_length 256 \
    --nl_length 128 \
    --train_batch_size 64 \
    --eval_batch_size 64 \
    --learning_rate 2e-5 \
    --seed 123456

# Fine-tuning on CodeSearchNet for Python
lang=python
python run.py \
    --output_dir saved_models/CSN/$lang \
    --model_name_or_path microsoft/unixcoder-base \
    --do_train \
    --train_data_file dataset/CSN/$lang/train.jsonl \
    --eval_data_file dataset/CSN/$lang/valid.jsonl \
    --codebase_file dataset/CSN/$lang/codebase.jsonl \
    --num_train_epochs 10 \
    --code_length 256 \
    --nl_length 128 \
    --train_batch_size 64 \
    --eval_batch_size 64 \
    --learning_rate 2e-5 \
    --seed 123456
```

--------------------------------

### Download UniXcoder Class

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/README.md

Command to download the necessary Python class for UniXcoder.

```shell
wget https://raw.githubusercontent.com/microsoft/CodeBERT/master/UniXcoder/unixcoder.py
```

--------------------------------

### Fine-Tune UniXcoder for Clone Detection (Evaluation)

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/downstream-tasks/clone-detection/BCB/README.md

Use this shell command to evaluate the fine-tuned UniXcoder model on the test dataset. It mirrors the training command but passes the `--do_test` flag.

```shell
python run.py \
    --output_dir saved_models \
    --model_name_or_path microsoft/unixcoder-base \
    --do_test \
    --test_data_file dataset/test.txt \
    --num_train_epochs 1 \
    --block_size 512 \
    --train_batch_size 16 \
    --eval_batch_size 32 \
    --learning_rate 5e-5 \
    --max_grad_norm 1.0 \
    --seed 123456
```

--------------------------------

### Unzip Dataset

Source: https://github.com/microsoft/codebert/blob/master/GraphCodeBERT/translation/README.md

Extract the dataset files from the compressed archive.

```bash
unzip data.zip
```

--------------------------------

### Fine-tune GraphCodeBERT

Source: https://github.com/microsoft/codebert/blob/master/GraphCodeBERT/clonedetection/README.md

Execute the training process for the GraphCodeBERT model on the clone detection dataset.

```shell
mkdir -p saved_models
python run.py \
    --output_dir=saved_models \
    --config_name=microsoft/graphcodebert-base \
    --model_name_or_path=microsoft/graphcodebert-base \
    --tokenizer_name=microsoft/graphcodebert-base \
    --do_train \
    --train_data_file=dataset/train.txt \
    --eval_data_file=dataset/valid.txt \
    --test_data_file=dataset/test.txt \
    --epoch 1 \
    --code_length 512 \
    --data_flow_length 128 \
    --train_batch_size 16 \
    --eval_batch_size 32 \
    --learning_rate 2e-5 \
    --max_grad_norm 1.0 \
    --evaluate_during_training \
    --seed 123456 2>&1| tee saved_models/train.log
```

--------------------------------

### Load CodeBERT Model

Source: https://github.com/microsoft/codebert/blob/master/README.md

Initializes the CodeBERT tokenizer and model using the Hugging Face transformers library.

```python
import torch
from transformers import RobertaTokenizer, RobertaConfig, RobertaModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")
model.to(device)
```

--------------------------------

### Fine-tune GraphCodeBERT

Source: https://github.com/microsoft/codebert/blob/master/GraphCodeBERT/refinement/README.md

Configure hyperparameters and execute the training script for the code refinement model.

```bash
scale=small
lr=1e-4
batch_size=32
beam_size=10
source_length=320
target_length=256
output_dir=saved_models/$scale/
train_file=data/$scale/train.buggy-fixed.buggy,data/$scale/train.buggy-fixed.fixed
dev_file=data/$scale/valid.buggy-fixed.buggy,data/$scale/valid.buggy-fixed.fixed
epochs=50 
pretrained_model=microsoft/graphcodebert-base

mkdir -p $output_dir
python run.py \
    --do_train --do_eval \
    --model_type roberta \
    --model_name_or_path $pretrained_model \
    --tokenizer_name microsoft/graphcodebert-base \
    --config_name microsoft/graphcodebert-base \
    --train_filename $train_file \
    --dev_filename $dev_file \
    --output_dir $output_dir \
    --max_source_length $source_length --max_target_length $target_length \
    --beam_size $beam_size \
    --train_batch_size $batch_size --eval_batch_size $batch_size \
    --learning_rate $lr \
    --num_train_epochs $epochs 2>&1 | tee $output_dir/train.log
```
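
`train_file` and `dev_file` each pack two paths into one comma-separated value: the buggy source file and the fixed target file. The sketch below shows one way such a value could be unpacked into pairs. `read_pairs` is a hypothetical helper, and the assumption that the two files are aligned line by line is not stated in this section.

```python
import os
import tempfile


def read_pairs(filename_pair: str):
    """Split 'source_path,target_path' and yield (source, target) line pairs.

    Hypothetical helper; assumes both files hold one example per line.
    """
    src_path, tgt_path = filename_pair.split(",")
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        for buggy, fixed in zip(src, tgt):
            yield buggy.strip(), fixed.strip()


# Tiny stand-in files so the sketch runs without the real dataset.
tmp_dir = tempfile.mkdtemp()
buggy_path = os.path.join(tmp_dir, "train.buggy-fixed.buggy")
fixed_path = os.path.join(tmp_dir, "train.buggy-fixed.fixed")
with open(buggy_path, "w", encoding="utf-8") as f:
    f.write("int a = b ;\n")
with open(fixed_path, "w", encoding="utf-8") as f:
    f.write("int a = c ;\n")

pairs = list(read_pairs(f"{buggy_path},{fixed_path}"))
print(pairs)
```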

--------------------------------

### Fine-Tune Code Search

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/downstream-tasks/code-search/README.md

Commands for training and evaluating models on AdvTest and CosQA datasets.

```shell
# Training
python run.py \
    --output_dir saved_models/AdvTest \
    --model_name_or_path microsoft/unixcoder-base  \
    --do_train \
    --train_data_file dataset/AdvTest/train.jsonl \
    --eval_data_file dataset/AdvTest/valid.jsonl \
    --codebase_file dataset/AdvTest/valid.jsonl \
    --num_train_epochs 2 \
    --code_length 256 \
    --nl_length 128 \
    --train_batch_size 64 \
    --eval_batch_size 64 \
    --learning_rate 2e-5 \
    --seed 123456
    
# Evaluating
python run.py \
    --output_dir saved_models/AdvTest \
    --model_name_or_path microsoft/unixcoder-base  \
    --do_test \
    --test_data_file dataset/AdvTest/test.jsonl \
    --codebase_file dataset/AdvTest/test.jsonl \
    --num_train_epochs 2 \
    --code_length 256 \
    --nl_length 128 \
    --train_batch_size 64 \
    --eval_batch_size 64 \
    --learning_rate 2e-5 \
    --seed 123456
```

```bash
# Training
python run.py \
    --output_dir saved_models/cosqa \
    --model_name_or_path microsoft/unixcoder-base  \
    --do_train \
    --train_data_file dataset/cosqa/cosqa-retrieval-train-19604.json \
    --eval_data_file dataset/cosqa/cosqa-retrieval-dev-500.json \
    --codebase_file dataset/cosqa/code_idx_map.txt \
    --num_train_epochs 10 \
    --code_length 256 \
    --nl_length 128 \
    --train_batch_size 64 \
    --eval_batch_size 64 \
    --learning_rate 2e-5 \
    --seed 123456

# Evaluating
python run.py \
    --output_dir saved_models/cosqa \
    --model_name_or_path microsoft/unixcoder-base  \
    --do_eval \
    --do_test \
    --eval_data_file dataset/cosqa/cosqa-retrieval-dev-500.json \
    --test_data_file dataset/cosqa/cosqa-retrieval-test-500.json \
    --codebase_file dataset/cosqa/code_idx_map.txt \
    --num_train_epochs 10 \
    --code_length 256 \
    --nl_length 128 \
    --train_batch_size 64 \
    --eval_batch_size 64 \
    --learning_rate 2e-5 \
    --seed 123456
```

--------------------------------

### Unzip Dataset

Source: https://github.com/microsoft/codebert/blob/master/GraphCodeBERT/clonedetection/README.md

Extract the dataset archive before processing.

```bash
unzip dataset.zip
```

--------------------------------

### Download CosQA Dataset

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/downstream-tasks/code-search/README.md

Downloads the required JSON and text files for the CosQA dataset.

```bash
cd dataset
mkdir cosqa && cd cosqa
wget https://github.com/Jun-jie-Huang/CoCLR/raw/main/data/search/code_idx_map.txt
wget https://github.com/Jun-jie-Huang/CoCLR/raw/main/data/search/cosqa-retrieval-dev-500.json
wget https://github.com/Jun-jie-Huang/CoCLR/raw/main/data/search/cosqa-retrieval-test-500.json
wget https://github.com/Jun-jie-Huang/CoCLR/raw/main/data/search/cosqa-retrieval-train-19604.json
cd ../..
```
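
On systems without `wget`, the same four files could be fetched with Python's standard library. This is a sketch under that assumption: `cosqa_urls` and `download_cosqa` are hypothetical names, and the URLs are the ones listed above.

```python
import os
import urllib.request

BASE_URL = "https://github.com/Jun-jie-Huang/CoCLR/raw/main/data/search"
FILES = [
    "code_idx_map.txt",
    "cosqa-retrieval-dev-500.json",
    "cosqa-retrieval-test-500.json",
    "cosqa-retrieval-train-19604.json",
]


def cosqa_urls():
    """Build the download URL for each CosQA file."""
    return [f"{BASE_URL}/{name}" for name in FILES]


def download_cosqa(dest_dir="dataset/cosqa"):
    """Download every file into dest_dir, creating the directory if needed."""
    os.makedirs(dest_dir, exist_ok=True)
    for url in cosqa_urls():
        target = os.path.join(dest_dir, url.rsplit("/", 1)[-1])
        urllib.request.urlretrieve(url, target)


# download_cosqa() is not called here because it needs network access.
print(cosqa_urls())
```

Calling `download_cosqa()` from the repository root would place the files under `dataset/cosqa`, matching the paths used by the training commands.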

--------------------------------

### Fine-Tune CodeBERT Model

Source: https://github.com/microsoft/codebert/blob/master/CodeBERT/codesearch/README.md

Command to initiate fine-tuning of the CodeBERT model for a specific programming language on GPU hardware.

```shell
cd codesearch

lang=php #fine-tuning a language-specific model for each programming language 
pretrained_model=microsoft/codebert-base  #Roberta: roberta-base

python run_classifier.py \
--model_type roberta \
--task_name codesearch \
--do_train \
--do_eval \
--eval_all_checkpoints \
--train_file train.txt \
--dev_file valid.txt \
--max_seq_length 200 \
--per_gpu_train_batch_size 32 \
--per_gpu_eval_batch_size 32 \
--learning_rate 1e-5 \
--num_train_epochs 8 \
--gradient_accumulation_steps 1 \
--overwrite_output_dir \
--data_dir ../data/codesearch/train_valid/$lang \
--output_dir ./models/$lang  \
--model_name_or_path $pretrained_model
```

--------------------------------

### Perform Encoder-Decoder Tasks with UniXcoder

Source: https://context7.com/microsoft/codebert/llms.txt

Demonstrates function name prediction, API recommendation, and code summarization using mask tokens in encoder-decoder mode.

```python
import torch
from unixcoder import UniXcoder

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = UniXcoder("microsoft/unixcoder-base")
model.to(device)

# Function name prediction with <mask0>
context = """
def <mask0>(data, file_path):
    data = json.dumps(data)
    with open(file_path, 'w') as f:
        f.write(data)
"""

tokens_ids = model.tokenize([context], max_length=512, mode="<encoder-decoder>")
source_ids = torch.tensor(tokens_ids).to(device)
prediction_ids = model.generate(source_ids, decoder_only=False, beam_size=3, max_length=128)
predictions = model.decode(prediction_ids)

# Extract function name predictions
names = [x.replace("<mask0>", "").strip() for x in predictions[0]]
print(names)  # Output: ['write_json', 'write_file', 'to_json']

# API recommendation
api_context = """
def write_json(data, file_path):
    data = <mask0>(data)
    with open(file_path, 'w') as f:
        f.write(data)
"""

tokens_ids = model.tokenize([api_context], max_length=512, mode="<encoder-decoder>")
source_ids = torch.tensor(tokens_ids).to(device)
prediction_ids = model.generate(source_ids, decoder_only=False, beam_size=3, max_length=128)
predictions = model.decode(prediction_ids)

apis = [x.replace("<mask0>", "").strip() for x in predictions[0]]
print(apis)  # Output: ['json.dumps', 'json.loads', 'str']

# Code summarization
summary_context = """
# <mask0>
def write_json(data, file_path):
    data = json.dumps(data)
    with open(file_path, 'w') as f:
        f.write(data)
"""

tokens_ids = model.tokenize([summary_context], max_length=512, mode="<encoder-decoder>")
source_ids = torch.tensor(tokens_ids).to(device)
prediction_ids = model.generate(source_ids, decoder_only=False, beam_size=3, max_length=128)
predictions = model.decode(prediction_ids)

summaries = [x.replace("<mask0>", "").strip() for x in predictions[0]]
print(summaries)  # Output: ['Write JSON to file', 'Write json to file', 'Write a json file']
```
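
All three tasks post-process the beam outputs the same way: drop the `<mask0>` placeholder and trim whitespace. The sketch below isolates that step so it runs without the model. The strings in `raw_predictions` are made up for illustration and are not real model output, and `clean_predictions` is a hypothetical helper name.

```python
def clean_predictions(beams, mask_token="<mask0>"):
    """Remove the mask placeholder and surrounding whitespace from each beam."""
    return [text.replace(mask_token, "").strip() for text in beams]


# Hypothetical decoded beams: one list of strings per input.
raw_predictions = [["<mask0>write_json", " <mask0> write_file ", "<mask0>to_json"]]

names = clean_predictions(raw_predictions[0])
print(names)
```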

--------------------------------

### Download Dataset

Source: https://github.com/microsoft/codebert/blob/master/CodeBERT/code2nl/README.md

Commands to download and extract the cleaned CodeSearchNet dataset.

```shell
pip install gdown
mkdir -p data/code2nl
cd data/code2nl
gdown "https://drive.google.com/uc?id=1rd2Tc6oUWBo7JouwexW3ksQ0PaOhUr6h"
unzip Cleaned_CodeSearchNet.zip
rm Cleaned_CodeSearchNet.zip
cd ../..
```
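
Before fine-tuning, it can help to confirm that the extracted files are where the training variables expect them. The sketch below assumes the archive unpacks to `data/code2nl/CodeSearchNet/<lang>/` with `train.jsonl` and `valid.jsonl`, as implied by the paths in the Fine-Tune Model section; `expected_files` and `missing_files` are hypothetical helpers.

```python
import os
import tempfile


def expected_files(lang, data_dir="data/code2nl/CodeSearchNet"):
    """Paths the fine-tuning variables point at for one language."""
    return [os.path.join(data_dir, lang, name) for name in ("train.jsonl", "valid.jsonl")]


def missing_files(lang, data_dir="data/code2nl/CodeSearchNet"):
    """Return the expected paths that do not exist on disk."""
    return [path for path in expected_files(lang, data_dir) if not os.path.isfile(path)]


# Demonstration on a throwaway directory that contains only train.jsonl.
demo_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_dir, "php"))
open(os.path.join(demo_dir, "php", "train.jsonl"), "w").close()

missing = missing_files("php", demo_dir)
print(missing)
```

An empty list from `missing_files("php")`, run from the repository root, would indicate that both files are in place.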