### Install Dependencies

Source: https://github.com/microsoft/codebert/blob/master/CodeBERT/code2nl/README.md

Required packages for running the experiments.

```shell
pip install torch==1.4.0
pip install transformers==2.5.0
pip install filelock
```

--------------------------------

### Fine-tuning Configuration

Source: https://github.com/microsoft/codebert/blob/master/CodeReviewer/README.md

Example bash script configuration for distributed training of the quality estimation model.

```bash
mnt_dir="/home/codereview"

# You may change the following block for multiple gpu training
MASTER_HOST=localhost && echo MASTER_HOST: ${MASTER_HOST}
MASTER_PORT=23333 && echo MASTER_PORT: ${MASTER_PORT}
RANK=0 && echo RANK: ${RANK}
PER_NODE_GPU=1 && echo PER_NODE_GPU: ${PER_NODE_GPU}
WORLD_SIZE=1 && echo WORLD_SIZE: ${WORLD_SIZE}
NODES=1 && echo NODES: ${NODES}
NCCL_DEBUG=INFO bash test_nltk.sh

# Change the arguments as required:
#   model_name_or_path, load_model_path: the path of the model to be finetuned
#   eval_file: the path of the evaluation data
#   output_dir: the directory to save finetuned model (not used at infer/test time)
#   out_file: the path of the output file
#   train_file_name: can be a directory containing files named with "train*.jsonl"

python -m torch.distributed.launch --nproc_per_node ${PER_NODE_GPU} --node_rank=${RANK} --nnodes=${NODES} --master_addr=${MASTER_HOST} --master_port=${MASTER_PORT} ../run_finetune_cls.py \
  --train_epochs 30 \
  --model_name_or_path microsoft/codereviewer \
  --output_dir ../../save/cls \
  --train_filename ../../dataset/Diff_Quality_Estimation \
  --dev_filename ../../dataset/Diff_Quality_Estimation/cls-valid.jsonl \
  --max_source_length 512 \
  --max_target_length 128 \
  --train_batch_size 12 \
  --learning_rate 3e-4 \
  --gradient_accumulation_steps 3 \
  --mask_rate 0.15 \
  --save_steps 3600 \
  --log_steps 100 \
  --train_steps 120000 \
  --gpu_per_node=${PER_NODE_GPU} \
  --node_index=${RANK} \
  --seed 2233
```

--------------------------------

### Initialize UniXcoder Model

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/README.md

Set up the UniXcoder model instance and move it to the appropriate device.

```python
import torch
from unixcoder import UniXcoder

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = UniXcoder("microsoft/unixcoder-base")
model.to(device)
```

--------------------------------

### Code-to-Code Search Demo Script

Source: https://github.com/microsoft/codebert/blob/master/CodeExecutor/README.md

Example configuration for the run.sh script to perform code-to-code search.

```bash
# Change the arguments as required:
#   trace_file: the path to the prediction file either downloaded or generated in the last step
source_lang=python
target_lang=python
python run.py \
  --model_name_or_path microsoft/unixcoder-base \
  --query_data_file ../data/code_to_code_search_test.json \
  --candidate_data_file ../data/code_to_code_search_test.json \
  --trace_file ../data/code_to_code_search_preds.txt \
  --query_lang ${source_lang} \
  --candidate_lang ${target_lang} \
  --code_length 512 \
  --eval_batch_size 256
```

--------------------------------

### Similarity Output Example

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/README.md

Example output for similarity calculations.

```python
tensor([[0.3002]], device='cuda:0', grad_fn=)
tensor([[0.1881]], device='cuda:0', grad_fn=)
```

--------------------------------

### Fine-Tune CodeBERT for Clone Detection (Training)

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/downstream-tasks/clone-detection/BCB/README.md

Run this shell command to fine-tune the `microsoft/unixcoder-base` checkpoint for clone detection. Download the dataset and install the dependencies first.
```shell
python run.py \
  --output_dir saved_models \
  --model_name_or_path microsoft/unixcoder-base \
  --do_train \
  --train_data_file dataset/train.txt \
  --eval_data_file dataset/valid.txt \
  --num_train_epochs 1 \
  --block_size 512 \
  --train_batch_size 16 \
  --eval_batch_size 32 \
  --learning_rate 5e-5 \
  --max_grad_norm 1.0 \
  --seed 123456
```

--------------------------------

### Install CodeBERT Dependencies

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/downstream-tasks/code-summarization/README.md

Install the required libraries for CodeBERT using pip.

```bash
pip install torch
pip install transformers
```

--------------------------------

### Configure LongCoder for Code Completion

Source: https://context7.com/microsoft/codebert/llms.txt

Setup for the LongCoder model, including attention window configuration and Seq2Seq model initialization.

```python
import torch
from transformers import LongformerConfig, RobertaTokenizer
from longcoder import LongcoderModel
from model import Seq2Seq

# Load LongCoder model
config = LongformerConfig.from_pretrained("microsoft/longcoder-base")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/longcoder-base")

# Configure attention window for long sequences
config.attention_window = [512] * len(config.attention_window)
config.is_decoder_only = True

# Build encoder with LongCoder architecture
encoder = LongcoderModel.from_pretrained("microsoft/longcoder-base", config=config)

# Define end-of-line token for code
eos_ids = [tokenizer.convert_tokens_to_ids('Ċ')]

# Create Seq2Seq model for code completion
model = Seq2Seq(
    encoder=encoder,
    decoder=encoder,
    config=config,
    tokenizer=tokenizer,
    beam_size=5,
    max_length=128,
    sos_id=tokenizer.cls_token_id,
    eos_id=eos_ids
)
```

--------------------------------

### Demo Data Structure

Source: https://github.com/microsoft/codebert/blob/master/CodeReviewer/README.md

Example JSON structure representing the input format for code review tasks.
```python
{
    "old_file": "import torch",  # f1
    "diff_hunk": "@@ -1 +1,2 @@\n import torch\n +import torch.nn as nn",  # f1->f2
    "comment": "I don't think we need to import torch.nn here.",  # requirements for f2->f3
    "target": "import torch"  # f3
}
```

--------------------------------

### Prediction File Format

Source: https://github.com/microsoft/codebert/blob/master/GraphCodeBERT/clonedetection/README.md

Example of the expected tab-separated format for prediction files.

```text
13653451	21955002	0
1188160	8831513	1
1141235	14322332	0
16765164	17526811	1
```

--------------------------------

### Embedding Output Example

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/README.md

Example output shape and tensor values for a code embedding.

```python
torch.Size([1, 768])
tensor([[ 8.6533e-01, -1.9796e+00, -8.6849e-01,  4.2652e-01, -5.3696e-01,
         -1.5521e-01,  5.3770e-01,  3.4199e-01,  3.6305e-01, -3.9391e-01,
         -1.1816e+00,  2.6010e+00, -7.7133e-01,  1.8441e+00,  2.3645e+00,
         ...,
         -2.9188e+00,  1.2555e+00, -1.9953e+00, -1.9795e+00,  1.7279e+00,
          6.4590e-01, -5.2769e-02,  2.4965e-01,  2.3962e-02,  5.9996e-02,
          2.5659e+00,  3.6533e+00,  2.0301e+00]], device='cuda:0', grad_fn=)
```

--------------------------------

### Fine-Tune and Evaluate CodeBERT

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/downstream-tasks/clone-detection/POJ-104/README.md

Execute these commands to train the model on the provided JSONL files and perform evaluation. Ensure torch and transformers are installed in the environment.
```shell
# Training
python run.py \
  --output_dir saved_models \
  --model_name_or_path microsoft/unixcoder-base \
  --do_train \
  --train_data_file dataset/train.jsonl \
  --eval_data_file dataset/valid.jsonl \
  --test_data_file dataset/test.jsonl \
  --num_train_epochs 2 \
  --block_size 400 \
  --train_batch_size 8 \
  --eval_batch_size 16 \
  --learning_rate 2e-5 \
  --max_grad_norm 1.0 \
  --seed 123456

# Evaluating
python run.py \
  --output_dir saved_models \
  --model_name_or_path microsoft/unixcoder-base \
  --do_eval \
  --do_test \
  --eval_data_file dataset/valid.jsonl \
  --test_data_file dataset/test.jsonl \
  --num_train_epochs 2 \
  --block_size 400 \
  --train_batch_size 8 \
  --eval_batch_size 16 \
  --learning_rate 2e-5 \
  --max_grad_norm 1.0 \
  --seed 123456
```

--------------------------------

### Evaluate LongCoder Model Performance

Source: https://github.com/microsoft/codebert/blob/master/LongCoder/README.md

This script evaluates the fine-tuned LongCoder model on test datasets. It requires the path to the best performing model checkpoint. Configure parameters such as language, batch size, and sequence lengths to match the fine-tuning setup.
```shell
lang=csharp #csharp, python, java
batch_size=16
beam_size=5
source_length=3968
target_length=128
global_length=64
window_size=512
epochs=10 # assumption: not set in the original snippet; use the value from your fine-tuning run
output_dir=saved_models/$lang
reload_model=$output_dir/checkpoint-best-acc/model.bin

python run.py \
  --do_test \
  --lang $lang \
  --load_model_path $reload_model \
  --output_dir $output_dir \
  --model_name_or_path microsoft/longcoder-base \
  --filename microsoft/LCC_$lang \
  --max_source_length $source_length \
  --max_target_length $target_length \
  --max_global_length $global_length \
  --window_size $window_size \
  --beam_size $beam_size \
  --train_batch_size $batch_size \
  --eval_batch_size $batch_size \
  --num_train_epochs $epochs 2>&1| tee $output_dir/test.log
```

--------------------------------

### Initialize CodeReviewer Model

Source: https://context7.com/microsoft/codebert/llms.txt

Sets up the configuration and loads the pre-trained CodeReviewer model.

```python
import torch
from transformers import T5Config, RobertaTokenizer
from models import ReviewerModel, build_or_load_gen_model

# Load CodeReviewer model
class Args:
    model_name_or_path = "microsoft/codereviewer"
    load_model_path = None
    local_rank = 0

args = Args()
config, model, tokenizer = build_or_load_gen_model(args)
```

--------------------------------

### Download and Preprocess POJ-104 Dataset

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/downstream-tasks/clone-detection/POJ-104/README.md

Use these commands to navigate to the dataset directory, download the archive, and run the preprocessing script.

```bash
cd dataset
pip install gdown
gdown https://drive.google.com/uc?id=0B2i-vWnOu7MxVlJwQXN6eVNONUU
tar -xvf programs.tar.gz
python preprocess.py
cd ..
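
# Optional check (an addition, not part of the upstream README): the
# fine-tuning commands read dataset/train.jsonl, dataset/valid.jsonl and
# dataset/test.jsonl, so report any that preprocess.py did not produce.
for f in train.jsonl valid.jsonl test.jsonl; do
  [ -s "dataset/$f" ] || echo "missing or empty: dataset/$f"
done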
```

--------------------------------

### Download and Prepare Code Summarization Dataset

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/downstream-tasks/code-summarization/README.md

Use these bash commands to download the dataset, unzip language-specific code files, remove archives and intermediate files, and run the preprocessing script.

```bash
wget https://github.com/microsoft/CodeXGLUE/raw/main/Code-Text/code-to-text/dataset.zip
unzip dataset.zip
rm dataset.zip
cd dataset
wget https://zenodo.org/record/7857872/files/python.zip
wget https://zenodo.org/record/7857872/files/java.zip
wget https://zenodo.org/record/7857872/files/ruby.zip
wget https://zenodo.org/record/7857872/files/javascript.zip
wget https://zenodo.org/record/7857872/files/go.zip
wget https://zenodo.org/record/7857872/files/php.zip

unzip python.zip
unzip java.zip
unzip ruby.zip
unzip javascript.zip
unzip go.zip
unzip php.zip
rm *.zip
rm *.pkl

python preprocess.py
rm -r */final
cd ..
```

--------------------------------

### Download Dataset

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/downstream-tasks/code-generation/README.md

Commands to create a directory and download the required training, development, and test JSON files.

```bash
mkdir dataset
cd dataset
wget https://github.com/microsoft/CodeXGLUE/raw/main/Text-Code/text-to-code/dataset/concode/train.json
wget https://github.com/microsoft/CodeXGLUE/raw/main/Text-Code/text-to-code/dataset/concode/dev.json
wget https://github.com/microsoft/CodeXGLUE/raw/main/Text-Code/text-to-code/dataset/concode/test.json
cd ..
```

--------------------------------

### Run Fine-tuning Script

Source: https://github.com/microsoft/codebert/blob/master/CodeReviewer/README.md

Commands to navigate to the script directory and execute the fine-tuning process.
```bash
# prepare model checkpoint and datasets
cd code/sh
# adjust the arguments in the *sh* scripts
bash finetune-cls.sh
```

--------------------------------

### Download and Preprocess Dataset

Source: https://github.com/microsoft/codebert/blob/master/GraphCodeBERT/codesearch/README.md

Commands to extract the dataset and execute the preprocessing script.

```shell
unzip dataset.zip
cd dataset
bash run.sh
cd ..
```

--------------------------------

### Download and Preprocess Datasets

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/downstream-tasks/code-completion/README.md

This script downloads and preprocesses datasets for code completion. It requires unzip, bash, wget, and python.

```bash
unzip dataset.zip

cd dataset/javaCorpus/
bash download.sh
python preprocess.py --base_dir=token_completion --output_dir=./
wget https://github.com/microsoft/CodeXGLUE/raw/main/Code-Code/CodeCompletion-line/dataset/javaCorpus/line_completion/test.json

cd ../py150
bash download.sh
python preprocess.py --base_dir=py150_files --output_dir=./
wget https://github.com/microsoft/CodeXGLUE/raw/main/Code-Code/CodeCompletion-line/dataset/py150/line_completion/test.json

cd ../..
```

--------------------------------

### Download BigCloneBench Dataset

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/downstream-tasks/clone-detection/BCB/README.md

Use these bash commands to download the BigCloneBench dataset and its associated files into a 'dataset' directory.

```bash
mkdir dataset
cd dataset
wget https://github.com/microsoft/CodeXGLUE/raw/main/Code-Code/Clone-detection-BigCloneBench/dataset/data.jsonl
wget https://github.com/microsoft/CodeXGLUE/raw/main/Code-Code/Clone-detection-BigCloneBench/dataset/test.txt
wget https://github.com/microsoft/CodeXGLUE/raw/main/Code-Code/Clone-detection-BigCloneBench/dataset/train.txt
wget https://github.com/microsoft/CodeXGLUE/raw/main/Code-Code/Clone-detection-BigCloneBench/dataset/valid.txt
cd ..
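
# Optional check (an addition, not part of the upstream README). It assumes
# each line of train.txt holds three tab-separated fields (two ids and a
# label); count_malformed is a hypothetical helper name.
count_malformed() {
  awk -F'\t' 'NF != 3 { bad++ } END { print bad + 0 }' "$1"
}
if [ -f dataset/train.txt ]; then
  echo "malformed lines in train.txt: $(count_malformed dataset/train.txt)"
fi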
```

--------------------------------

### Implement GraphCodeBERT for Code Search

Source: https://context7.com/microsoft/codebert/llms.txt

Defines a PyTorch module that integrates data flow information into the encoder for semantic code search. Includes a command-line interface example for training.

```python
import torch
import torch.nn as nn


class GraphCodeBERTModel(nn.Module):
    """GraphCodeBERT model with data flow integration for code search."""

    def __init__(self, encoder):
        super(GraphCodeBERTModel, self).__init__()
        self.encoder = encoder

    def forward(self, code_inputs=None, attn_mask=None, position_idx=None, nl_inputs=None):
        if code_inputs is not None:
            # Process code with data flow graph
            nodes_mask = position_idx.eq(0)
            token_mask = position_idx.ge(2)
            inputs_embeddings = self.encoder.embeddings.word_embeddings(code_inputs)

            # Compute node embeddings from token embeddings using data flow
            nodes_to_token_mask = nodes_mask[:, :, None] & token_mask[:, None, :] & attn_mask
            nodes_to_token_mask = nodes_to_token_mask / (nodes_to_token_mask.sum(-1) + 1e-10)[:, :, None]
            avg_embeddings = torch.einsum("abc,acd->abd", nodes_to_token_mask, inputs_embeddings)
            inputs_embeddings = inputs_embeddings * (~nodes_mask)[:, :, None] + avg_embeddings * nodes_mask[:, :, None]

            return self.encoder(inputs_embeds=inputs_embeddings, attention_mask=attn_mask, position_ids=position_idx)[1]
        else:
            # Process natural language query
            return self.encoder(nl_inputs, attention_mask=nl_inputs.ne(1))[1]


# Usage for code search
# python run.py \
#   --output_dir saved_models \
#   --model_name_or_path microsoft/graphcodebert-base \
#   --do_train \
#   --train_data_file dataset/train.jsonl \
#   --eval_data_file dataset/valid.jsonl \
#   --code_length 256 \
#   --nl_length 128 \
#   --train_batch_size 32 \
#   --eval_batch_size 64 \
#   --learning_rate 2e-5 \
#   --num_train_epochs 10
```

--------------------------------

### Download and Preprocess Data

Source: https://github.com/microsoft/codebert/blob/master/CodeBERT/codesearch/README.md

Commands to download the preprocessed training/validation datasets and run the local preprocessing script for test data.

```shell
mkdir data data/codesearch
cd data/codesearch
gdown https://drive.google.com/uc?id=1xgSR34XO8xXZg4cZScDYj2eGerBE9iGo
unzip codesearch_data.zip
rm codesearch_data.zip
cd ../../codesearch
python process_data.py
cd ..
```

--------------------------------

### Implement GraphCodeBERT for Clone Detection

Source: https://context7.com/microsoft/codebert/llms.txt

Defines a classifier model that compares two code snippets using GraphCodeBERT embeddings. Includes a command-line interface example for training and evaluation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import CrossEntropyLoss


class CloneDetectionModel(nn.Module):
    """GraphCodeBERT-based code clone detection model."""

    def __init__(self, encoder, config, tokenizer, args):
        super(CloneDetectionModel, self).__init__()
        self.encoder = encoder
        self.config = config
        self.tokenizer = tokenizer
        self.classifier = nn.Sequential(
            nn.Linear(config.hidden_size * 2, config.hidden_size),
            nn.Dropout(config.hidden_dropout_prob),
            nn.Tanh(),
            nn.Dropout(config.hidden_dropout_prob),
            nn.Linear(config.hidden_size, 2)
        )

    def forward(self, inputs_ids_1, position_idx_1, attn_mask_1,
                inputs_ids_2, position_idx_2, attn_mask_2, labels=None):
        bs, l = inputs_ids_1.size()

        # Concatenate both code snippets for batch processing
        inputs_ids = torch.cat((inputs_ids_1.unsqueeze(1), inputs_ids_2.unsqueeze(1)), 1).view(bs * 2, l)
        position_idx = torch.cat((position_idx_1.unsqueeze(1), position_idx_2.unsqueeze(1)), 1).view(bs * 2, l)
        attn_mask = torch.cat((attn_mask_1.unsqueeze(1), attn_mask_2.unsqueeze(1)), 1).view(bs * 2, l, l)

        # Get embeddings with data flow
        # (_encode_with_dataflow is not defined in this excerpt)
        outputs = self._encode_with_dataflow(inputs_ids, position_idx, attn_mask)
        logits = self.classifier(outputs)
        prob = F.softmax(logits, dim=-1)

        if labels is not None:
            loss = CrossEntropyLoss()(logits, labels)
            return loss, prob
        return prob


# Training command
# python run.py \
#   --output_dir saved_models/clone_detection \
#   --model_name_or_path microsoft/graphcodebert-base \
#   --do_train --do_eval --do_test \
#   --train_data_file dataset/train.txt \
#   --eval_data_file dataset/valid.txt \
#   --test_data_file dataset/test.txt \
#   --block_size 400 \
#   --train_batch_size 16 \
#   --eval_batch_size 32 \
#   --learning_rate 5e-5 \
#   --num_train_epochs 2
```

--------------------------------

### Pre-training Configuration

Source: https://github.com/microsoft/codebert/blob/master/CodeExecutor/README.md

This bash script configures and runs pre-training of the CodeExecutor model with PyTorch distributed training. It sets the data paths, model checkpoint, batch sizes, learning rate, and optimization settings.

```bash
# Change the arguments as required:
#   output_dir: the output directory to save inference results
#   data_cache_dir: the output directory to save the data cache
#   train_data_path: the path of the pre-training file
#   eval_data_path: the path of the test file
#   model_name_or_path: the path of the model to be evaluated

PER_NODE_GPU=8
python -m torch.distributed.launch --nproc_per_node=${PER_NODE_GPU} run.py \
  --output_dir ../saved_models/pretrain_codeexecutor_stage_3 \
  --data_cache_dir ../saved_models/pretrain_codeexecutor_stage_3 \
  --train_data_path /drive/pretrain_codenetmut.json \
  --another_train_data_path /drive/pretrain_tutorial.json \
  --third_train_data_path /drive/single_line_hard_3_million.json \
  --eval_data_path ../data/codenetmut_test.json \
  --model_name_or_path ../saved_models/pretrain_codeexecutor_stage_2 \
  --block_size 1024 \
  --per_gpu_train_batch_size 4 \
  --per_gpu_eval_batch_size 8 \
  --gradient_accumulation_steps 8 \
  --learning_rate 4e-4 \
  --node_index=0 \
  --gpu_per_node $PER_NODE_GPU \
  --weight_decay 0.01 \
  --adam_epsilon 1e-6 \
  --max_grad_norm 1.0 \
  --max_steps 1000000 \
  --warmup_steps 10000 \
  --save_steps 5000 \
  --seed 123
```

--------------------------------

### Load CodeBERT Base Model and Tokenizer

Source: https://context7.com/microsoft/codebert/llms.txt

Loads the CodeBERT base model and tokenizer for natural language and code processing. Ensure PyTorch and Hugging Face Transformers are installed. The model can be moved to a CUDA-enabled GPU if available.

```python
import torch
from transformers import RobertaTokenizer, RobertaModel

# Load CodeBERT model and tokenizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")
model.to(device)

# Tokenize natural language and code
nl_tokens = tokenizer.tokenize("return maximum value")
code_tokens = tokenizer.tokenize("def max(a,b): if a>b: return a else return b")

# Combine NL and code tokens with special tokens
tokens = [tokenizer.cls_token] + nl_tokens + [tokenizer.sep_token] + code_tokens + [tokenizer.eos_token]
tokens_ids = tokenizer.convert_tokens_to_ids(tokens)

# Get embeddings
context_embeddings = model(torch.tensor(tokens_ids)[None, :].to(device))[0]

# Output shape: torch.Size([1, 23, 768])
print(f"Embedding shape: {context_embeddings.shape}")
```

--------------------------------

### Download CSN Dataset

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/downstream-tasks/code-search/README.md

Downloads and initializes the CSN dataset.

```bash
cd dataset
wget https://github.com/microsoft/CodeBERT/raw/master/GraphCodeBERT/codesearch/dataset.zip
unzip dataset.zip && rm -r dataset.zip && mv dataset CSN && cd CSN
bash run.sh
cd ../..
```

--------------------------------

### Download AdvTest Dataset

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/downstream-tasks/code-search/README.md

Downloads and preprocesses the AdvTest dataset for code search.
```bash
mkdir dataset && cd dataset
wget https://github.com/microsoft/CodeXGLUE/raw/main/Text-Code/NL-code-search-Adv/dataset.zip
unzip dataset.zip && rm -r dataset.zip && mv dataset AdvTest && cd AdvTest
wget https://zenodo.org/record/7857872/files/python.zip
unzip python.zip && python preprocess.py && rm -r python && rm -r *.pkl && rm python.zip
cd ../..
```

--------------------------------

### Pre-training Script

Source: https://github.com/microsoft/codebert/blob/master/CodeExecutor/README.md

This bash script starts pre-training of the CodeExecutor model. Adjust arguments such as the output directories, data paths, and model paths as needed.

```bash
# prepare model checkpoint and datasets
cd pretrain
bash run.sh
```

--------------------------------

### Fine-Tune and Evaluate Model

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/downstream-tasks/code-generation/README.md

Shell commands to execute training and testing phases using the run.py script with specified model parameters.
```shell
# Training
python run.py \
  --do_train \
  --do_eval \
  --model_name_or_path microsoft/unixcoder-base \
  --train_filename dataset/train.json \
  --dev_filename dataset/dev.json \
  --output_dir saved_models \
  --max_source_length 350 \
  --max_target_length 150 \
  --beam_size 3 \
  --train_batch_size 32 \
  --eval_batch_size 32 \
  --learning_rate 5e-5 \
  --gradient_accumulation_steps 1 \
  --num_train_epochs 30

# Output results
python run.py \
  --do_test \
  --model_name_or_path microsoft/unixcoder-base \
  --test_filename dataset/test.json \
  --output_dir saved_models \
  --max_source_length 350 \
  --max_target_length 150 \
  --beam_size 3 \
  --train_batch_size 32 \
  --eval_batch_size 32 \
  --learning_rate 5e-5 \
  --gradient_accumulation_steps 1 \
  --num_train_epochs 30
```

--------------------------------

### Download and Preprocess CodeNet Dataset

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/downstream-tasks/zero-shot-search/README.md

Use this bash script to download the CodeNet dataset and preprocess it for use with the code search model. The script first changes into the 'dataset' directory, so run it from that directory's parent.

```bash
cd dataset
wget https://dax-cdn.cdn.appdomain.cloud/dax-project-codenet/1.0.0/Project_CodeNet.tar.gz
tar -xvf Project_CodeNet.tar.gz
python preprocess.py
cd ..
```

--------------------------------

### Run Inference and Evaluation

Source: https://github.com/microsoft/codebert/blob/master/CodeBERT/code2nl/README.md

Commands to perform inference and evaluate the fine-tuned model using a checkpoint.
```shell
lang=php #programming language
beam_size=10
batch_size=128
source_length=256
target_length=128
output_dir=model/$lang
data_dir=../data/code2nl/CodeSearchNet
dev_file=$data_dir/$lang/valid.jsonl
test_file=$data_dir/$lang/test.jsonl
test_model=$output_dir/checkpoint-best-bleu/pytorch_model.bin #checkpoint for test

python run.py --do_test --model_type roberta --model_name_or_path microsoft/codebert-base --load_model_path $test_model --dev_filename $dev_file --test_filename $test_file --output_dir $output_dir --max_source_length $source_length --max_target_length $target_length --beam_size $beam_size --eval_batch_size $batch_size
```

--------------------------------

### Configure UniXcoder Training Parameters

Source: https://context7.com/microsoft/codebert/llms.txt

Common hyperparameters for training and evaluation processes.

```bash
#   --max_target_length 128 \
#   --max_global_length 64 \
#   --window_size 512 \
#   --beam_size 5 \
#   --train_batch_size 16 \
#   --eval_batch_size 16 \
#   --learning_rate 2e-4 \
#   --num_train_epochs 10
```

--------------------------------

### Train Quality Estimation Model

Source: https://context7.com/microsoft/codebert/llms.txt

Command to initiate training for the quality estimation task using the specified dataset and model parameters.

```bash
# python -m torch.distributed.launch --nproc_per_node 1 run_finetune_cls.py \
#   --train_epochs 30 \
#   --model_name_or_path microsoft/codereviewer \
#   --output_dir ./save/cls \
#   --train_filename ./dataset/Diff_Quality_Estimation \
#   --dev_filename ./dataset/Diff_Quality_Estimation/cls-valid.jsonl \
#   --max_source_length 512 \
#   --max_target_length 128 \
#   --train_batch_size 12 \
#   --learning_rate 3e-4 \
#   --gradient_accumulation_steps 3
```

--------------------------------

### Train CodeExecutor Model

Source: https://context7.com/microsoft/codebert/llms.txt

Command to run pre-training for the CodeExecutor model.
```bash
# python run.py \
#   --do_train \
#   --model_name_or_path microsoft/unixcoder-base \
#   --output_dir ./saved_models \
#   --train_data_file ./data/train.jsonl \
#   --eval_data_file ./data/valid.jsonl \
#   --max_source_length 512 \
#   --max_target_length 256 \
#   --train_batch_size 8 \
#   --learning_rate 5e-5 \
#   --num_train_epochs 10
```

--------------------------------

### Run Inference

Source: https://github.com/microsoft/codebert/blob/master/GraphCodeBERT/refinement/README.md

Execute the model on the test dataset using a saved checkpoint.

```bash
batch_size=64
dev_file=data/$scale/valid.buggy-fixed.buggy,data/$scale/valid.buggy-fixed.fixed
test_file=data/$scale/test.buggy-fixed.buggy,data/$scale/test.buggy-fixed.fixed
load_model_path=$output_dir/checkpoint-best-bleu/pytorch_model.bin #checkpoint for test

python run.py --do_test --model_type roberta --model_name_or_path $pretrained_model --tokenizer_name microsoft/graphcodebert-base --config_name microsoft/graphcodebert-base --load_model_path $load_model_path --dev_filename $dev_file --test_filename $test_file --output_dir $output_dir --max_source_length $source_length --max_target_length $target_length --beam_size $beam_size --eval_batch_size $batch_size 2>&1| tee $output_dir/test.log
```

--------------------------------

### Train CodeBERT Model

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/downstream-tasks/code-search/README.md

Executes the training process for a specified language using the run.py script.
```bash
lang=python
python run.py \
  --output_dir saved_models/CSN/$lang \
  --model_name_or_path microsoft/unixcoder-base \
  --do_train \
  --train_data_file dataset/CSN/$lang/train.jsonl \
  --eval_data_file dataset/CSN/$lang/valid.jsonl \
  --codebase_file dataset/CSN/$lang/codebase.jsonl \
  --num_train_epochs 10 \
  --code_length 256 \
  --nl_length 128 \
  --train_batch_size 64 \
  --eval_batch_size 64 \
  --learning_rate 2e-5 \
  --seed 123456
```

--------------------------------

### Run Inference

Source: https://github.com/microsoft/codebert/blob/master/GraphCodeBERT/translation/README.md

Perform inference on the test dataset using a trained model checkpoint.

```bash
batch_size=64
dev_file=data/valid.java-cs.txt.$source,data/valid.java-cs.txt.$target
test_file=data/test.java-cs.txt.$source,data/test.java-cs.txt.$target
load_model_path=$output_dir/checkpoint-best-bleu/pytorch_model.bin #checkpoint for test

python run.py \
  --do_test \
  --model_type roberta \
  --source_lang $source \
  --model_name_or_path $pretrained_model \
  --tokenizer_name microsoft/graphcodebert-base \
  --config_name microsoft/graphcodebert-base \
  --load_model_path $load_model_path \
  --dev_filename $dev_file \
  --test_filename $test_file \
  --output_dir $output_dir \
  --max_source_length $source_length \
  --max_target_length $target_length \
  --beam_size $beam_size \
  --eval_batch_size $batch_size 2>&1| tee $output_dir/test.log
```

--------------------------------

### Fine-Tune Model

Source: https://github.com/microsoft/codebert/blob/master/CodeBERT/code2nl/README.md

Configuration and execution command for fine-tuning the model on the specified dataset.
```shell
cd code2nl

lang=php #programming language
lr=5e-5
batch_size=64
beam_size=10
source_length=256
target_length=128
data_dir=../data/code2nl/CodeSearchNet
output_dir=model/$lang
train_file=$data_dir/$lang/train.jsonl
dev_file=$data_dir/$lang/valid.jsonl
eval_steps=1000 #400 for ruby, 600 for javascript, 1000 for others
train_steps=50000 #20000 for ruby, 30000 for javascript, 50000 for others
pretrained_model=microsoft/codebert-base #Roberta: roberta-base

python run.py --do_train --do_eval --model_type roberta --model_name_or_path $pretrained_model --train_filename $train_file --dev_filename $dev_file --output_dir $output_dir --max_source_length $source_length --max_target_length $target_length --beam_size $beam_size --train_batch_size $batch_size --eval_batch_size $batch_size --learning_rate $lr --train_steps $train_steps --eval_steps $eval_steps
```

--------------------------------

### Demo Data for Zero-Shot Code Search

Source: https://github.com/microsoft/codebert/blob/master/CodeExecutor/README.md

This JSON object shows a simplified demo data structure for the zero-shot code-to-code search task, including original code and code provided with a test case.

```json
{
    "id": 0,
    "code_id": "s204511158",
    "problem_id": 340, # solve which problem
    "original_code": "s = list(input())", # code without providing the test case
    "code": "s = ['x', 'y', 'z']", # code provided with a test case
    "code_tokens": ["<0>", "s", "=", "[", "'x'", ",", "'y'", ",", "'z'", "]"],
    "trace": [" <0> s : [ x , y , z ] "],
    "trace_tokens": ["", "<0>", "", "s", ":", "[", "x", ",", "y", ",", "z", "]", ""]
}
```

--------------------------------

### Execute UniXcoder Code Search Tasks

Source: https://context7.com/microsoft/codebert/llms.txt

Commands for zero-shot evaluation and fine-tuning on specific datasets.
```bash
# Zero-shot code search (no training required)
# python run.py \
#   --output_dir saved_models/AdvTest \
#   --model_name_or_path microsoft/unixcoder-base \
#   --do_zero_shot --do_test \
#   --test_data_file dataset/AdvTest/test.jsonl \
#   --codebase_file dataset/AdvTest/test.jsonl \
#   --code_length 256 \
#   --nl_length 128 \
#   --eval_batch_size 64 \
#   --seed 123456

# Fine-tuning on AdvTest dataset
# python run.py \
#   --output_dir saved_models/AdvTest \
#   --model_name_or_path microsoft/unixcoder-base \
#   --do_train \
#   --train_data_file dataset/AdvTest/train.jsonl \
#   --eval_data_file dataset/AdvTest/valid.jsonl \
#   --codebase_file dataset/AdvTest/valid.jsonl \
#   --num_train_epochs 2 \
#   --code_length 256 \
#   --nl_length 128 \
#   --train_batch_size 64 \
#   --eval_batch_size 64 \
#   --learning_rate 2e-5 \
#   --seed 123456

# Fine-tuning on CodeSearchNet for Python
# lang=python
# python run.py \
#   --output_dir saved_models/CSN/$lang \
#   --model_name_or_path microsoft/unixcoder-base \
#   --do_train \
#   --train_data_file dataset/CSN/$lang/train.jsonl \
#   --eval_data_file dataset/CSN/$lang/valid.jsonl \
#   --codebase_file dataset/CSN/$lang/codebase.jsonl \
#   --num_train_epochs 10 \
#   --code_length 256 \
#   --nl_length 128 \
#   --train_batch_size 64 \
#   --eval_batch_size 64 \
#   --learning_rate 2e-5 \
#   --seed 123456
```

--------------------------------

### Download UniXcoder Class

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/README.md

Command to download the necessary Python class for UniXcoder.

```shell
wget https://raw.githubusercontent.com/microsoft/CodeBERT/master/UniXcoder/unixcoder.py
```

--------------------------------

### Fine-Tune CodeBERT for Clone Detection (Evaluation)

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/downstream-tasks/clone-detection/BCB/README.md

Use this shell command to evaluate the fine-tuned model on the test dataset. It mirrors the training command but passes '--do_test' and '--test_data_file' instead of '--do_train' and the training and validation files.
```shell
python run.py \
    --output_dir saved_models \
    --model_name_or_path microsoft/unixcoder-base \
    --do_test \
    --test_data_file dataset/test.txt \
    --num_train_epochs 1 \
    --block_size 512 \
    --train_batch_size 16 \
    --eval_batch_size 32 \
    --learning_rate 5e-5 \
    --max_grad_norm 1.0 \
    --seed 123456
```

--------------------------------

### Unzip Dataset

Source: https://github.com/microsoft/codebert/blob/master/GraphCodeBERT/translation/README.md

Extract the dataset files from the compressed archive.

```bash
unzip data.zip
```

--------------------------------

### Fine-tune GraphCodeBERT

Source: https://github.com/microsoft/codebert/blob/master/GraphCodeBERT/clonedetection/README.md

Execute the training process for the GraphCodeBERT model on the clone detection dataset.

```shell
mkdir saved_models
python run.py \
    --output_dir=saved_models \
    --config_name=microsoft/graphcodebert-base \
    --model_name_or_path=microsoft/graphcodebert-base \
    --tokenizer_name=microsoft/graphcodebert-base \
    --do_train \
    --train_data_file=dataset/train.txt \
    --eval_data_file=dataset/valid.txt \
    --test_data_file=dataset/test.txt \
    --epoch 1 \
    --code_length 512 \
    --data_flow_length 128 \
    --train_batch_size 16 \
    --eval_batch_size 32 \
    --learning_rate 2e-5 \
    --max_grad_norm 1.0 \
    --evaluate_during_training \
    --seed 123456 2>&1| tee saved_models/train.log
```

--------------------------------

### Load CodeBERT Model

Source: https://github.com/microsoft/codebert/blob/master/README.md

Initializes the CodeBERT tokenizer and model using the Hugging Face transformers library.
```python
import torch
from transformers import RobertaTokenizer, RobertaConfig, RobertaModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")
model.to(device)
```

--------------------------------

### Fine-tune GraphCodeBERT

Source: https://github.com/microsoft/codebert/blob/master/GraphCodeBERT/refinement/README.md

Configure hyperparameters and execute the training script for the code refinement model.

```bash
scale=small
lr=1e-4
batch_size=32
beam_size=10
source_length=320
target_length=256
output_dir=saved_models/$scale/
train_file=data/$scale/train.buggy-fixed.buggy,data/$scale/train.buggy-fixed.fixed
dev_file=data/$scale/valid.buggy-fixed.buggy,data/$scale/valid.buggy-fixed.fixed
epochs=50
pretrained_model=microsoft/graphcodebert-base

mkdir -p $output_dir
python run.py --do_train --do_eval --model_type roberta --model_name_or_path $pretrained_model --tokenizer_name microsoft/graphcodebert-base --config_name microsoft/graphcodebert-base --train_filename $train_file --dev_filename $dev_file --output_dir $output_dir --max_source_length $source_length --max_target_length $target_length --beam_size $beam_size --train_batch_size $batch_size --eval_batch_size $batch_size --learning_rate $lr --num_train_epochs $epochs 2>&1| tee $output_dir/train.log
```

--------------------------------

### Fine-Tune Code Search

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/downstream-tasks/code-search/README.md

Commands for training and evaluating models on AdvTest and CosQA datasets.
```shell
# Training
python run.py \
    --output_dir saved_models/AdvTest \
    --model_name_or_path microsoft/unixcoder-base \
    --do_train \
    --train_data_file dataset/AdvTest/train.jsonl \
    --eval_data_file dataset/AdvTest/valid.jsonl \
    --codebase_file dataset/AdvTest/valid.jsonl \
    --num_train_epochs 2 \
    --code_length 256 \
    --nl_length 128 \
    --train_batch_size 64 \
    --eval_batch_size 64 \
    --learning_rate 2e-5 \
    --seed 123456

# Evaluating
python run.py \
    --output_dir saved_models/AdvTest \
    --model_name_or_path microsoft/unixcoder-base \
    --do_test \
    --test_data_file dataset/AdvTest/test.jsonl \
    --codebase_file dataset/AdvTest/test.jsonl \
    --num_train_epochs 2 \
    --code_length 256 \
    --nl_length 128 \
    --train_batch_size 64 \
    --eval_batch_size 64 \
    --learning_rate 2e-5 \
    --seed 123456
```

```bash
# Training
python run.py \
    --output_dir saved_models/cosqa \
    --model_name_or_path microsoft/unixcoder-base \
    --do_train \
    --train_data_file dataset/cosqa/cosqa-retrieval-train-19604.json \
    --eval_data_file dataset/cosqa/cosqa-retrieval-dev-500.json \
    --codebase_file dataset/cosqa/code_idx_map.txt \
    --num_train_epochs 10 \
    --code_length 256 \
    --nl_length 128 \
    --train_batch_size 64 \
    --eval_batch_size 64 \
    --learning_rate 2e-5 \
    --seed 123456

# Evaluating
python run.py \
    --output_dir saved_models/cosqa \
    --model_name_or_path microsoft/unixcoder-base \
    --do_eval \
    --do_test \
    --eval_data_file dataset/cosqa/cosqa-retrieval-dev-500.json \
    --test_data_file dataset/cosqa/cosqa-retrieval-test-500.json \
    --codebase_file dataset/cosqa/code_idx_map.txt \
    --num_train_epochs 10 \
    --code_length 256 \
    --nl_length 128 \
    --train_batch_size 64 \
    --eval_batch_size 64 \
    --learning_rate 2e-5 \
    --seed 123456
```

--------------------------------

### Unzip Dataset

Source: https://github.com/microsoft/codebert/blob/master/GraphCodeBERT/clonedetection/README.md

Extract the dataset archive before processing.
```bash
unzip dataset.zip
```

--------------------------------

### Download CosQA Dataset

Source: https://github.com/microsoft/codebert/blob/master/UniXcoder/downstream-tasks/code-search/README.md

Downloads the required JSON and text files for the CosQA dataset.

```bash
cd dataset
mkdir cosqa && cd cosqa
wget https://github.com/Jun-jie-Huang/CoCLR/raw/main/data/search/code_idx_map.txt
wget https://github.com/Jun-jie-Huang/CoCLR/raw/main/data/search/cosqa-retrieval-dev-500.json
wget https://github.com/Jun-jie-Huang/CoCLR/raw/main/data/search/cosqa-retrieval-test-500.json
wget https://github.com/Jun-jie-Huang/CoCLR/raw/main/data/search/cosqa-retrieval-train-19604.json
cd ../..
```

--------------------------------

### Fine-Tune CodeBERT Model

Source: https://github.com/microsoft/codebert/blob/master/CodeBERT/codesearch/README.md

Command to initiate fine-tuning of the CodeBERT model for a specific programming language on GPU hardware.

```shell
cd codesearch

lang=php #fine-tuning a language-specific model for each programming language
pretrained_model=microsoft/codebert-base #Roberta: roberta-base

python run_classifier.py \
    --model_type roberta \
    --task_name codesearch \
    --do_train \
    --do_eval \
    --eval_all_checkpoints \
    --train_file train.txt \
    --dev_file valid.txt \
    --max_seq_length 200 \
    --per_gpu_train_batch_size 32 \
    --per_gpu_eval_batch_size 32 \
    --learning_rate 1e-5 \
    --num_train_epochs 8 \
    --gradient_accumulation_steps 1 \
    --overwrite_output_dir \
    --data_dir ../data/codesearch/train_valid/$lang \
    --output_dir ./models/$lang \
    --model_name_or_path $pretrained_model
```

--------------------------------

### Perform Encoder-Decoder Tasks with UniXcoder

Source: https://context7.com/microsoft/codebert/llms.txt

Demonstrates function name prediction, API recommendation, and code summarization using mask tokens in encoder-decoder mode.
```python
import torch
from unixcoder import UniXcoder

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = UniXcoder("microsoft/unixcoder-base")
model.to(device)

# Function name prediction with <mask0>
context = """
def <mask0>(data, file_path):
    data = json.dumps(data)
    with open(file_path, 'w') as f:
        f.write(data)
"""
tokens_ids = model.tokenize([context], max_length=512, mode="<encoder-decoder>")
source_ids = torch.tensor(tokens_ids).to(device)
prediction_ids = model.generate(source_ids, decoder_only=False, beam_size=3, max_length=128)
predictions = model.decode(prediction_ids)

# Extract function name predictions
names = [x.replace("<mask0>", "").strip() for x in predictions[0]]
print(names)
# Output: ['write_json', 'write_file', 'to_json']

# API recommendation
api_context = """
def write_json(data, file_path):
    data = <mask0>(data)
    with open(file_path, 'w') as f:
        f.write(data)
"""
tokens_ids = model.tokenize([api_context], max_length=512, mode="<encoder-decoder>")
source_ids = torch.tensor(tokens_ids).to(device)
prediction_ids = model.generate(source_ids, decoder_only=False, beam_size=3, max_length=128)
predictions = model.decode(prediction_ids)
apis = [x.replace("<mask0>", "").strip() for x in predictions[0]]
print(apis)
# Output: ['json.dumps', 'json.loads', 'str']

# Code summarization
summary_context = """
# <mask0>
def write_json(data, file_path):
    data = json.dumps(data)
    with open(file_path, 'w') as f:
        f.write(data)
"""
tokens_ids = model.tokenize([summary_context], max_length=512, mode="<encoder-decoder>")
source_ids = torch.tensor(tokens_ids).to(device)
prediction_ids = model.generate(source_ids, decoder_only=False, beam_size=3, max_length=128)
predictions = model.decode(prediction_ids)
summaries = [x.replace("<mask0>", "").strip() for x in predictions[0]]
print(summaries)
# Output: ['Write JSON to file', 'Write json to file', 'Write a json file']
```

--------------------------------

### Download Dataset

Source: https://github.com/microsoft/codebert/blob/master/CodeBERT/code2nl/README.md

Commands to download and extract the cleaned
CodeSearchNet dataset.

```shell
pip install gdown
mkdir data data/code2nl
cd data/code2nl
gdown https://drive.google.com/uc?id=1rd2Tc6oUWBo7JouwexW3ksQ0PaOhUr6h
unzip Cleaned_CodeSearchNet.zip
rm Cleaned_CodeSearchNet.zip
cd ../..
```
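
After extraction, it can help to confirm that the files parse before starting a long training run. The following is a minimal sketch, not part of the repository: `preview_jsonl` is a hypothetical helper, the field names `code_tokens` and `docstring_tokens` are assumptions about the record layout, and the sample path is inferred from the `data_dir` and `lang` values used in the code2nl fine-tuning script.

```python
import json
from pathlib import Path


def preview_jsonl(path, limit=3):
    """Return up to `limit` (code, docstring) pairs from a JSON Lines file."""
    pairs = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Assumed field names; adjust if your copy of the data differs.
            code = " ".join(record.get("code_tokens", []))
            doc = " ".join(record.get("docstring_tokens", []))
            pairs.append((code, doc))
            if len(pairs) >= limit:
                break
    return pairs


if __name__ == "__main__":
    # Assumed location after running the download commands from the repo root.
    sample = Path("data/code2nl/CodeSearchNet/php/train.jsonl")
    if sample.exists():
        for code, doc in preview_jsonl(sample):
            print(doc, "<-", code[:80])
    else:
        print(f"{sample} not found; run the download commands first")
```

If the printed pairs come back empty, the field names in your copy of the data likely differ from the ones assumed here.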