### Install CodeGeeX from Source

Source: https://context7.com/thudm/codegeex/llms.txt

Clone the repository and install the package using pip. Alternatively, use the pre-built Docker image for a quick setup.

```bash
git clone git@github.com:THUDM/CodeGeeX.git
cd CodeGeeX
pip install -e .
```

```bash
# Or use the pre-built Docker image (requires nvidia-docker)
docker pull codegeex/codegeex:latest
docker run --gpus '"device=0,1"' -it --ipc=host --name=codegeex codegeex/codegeex
```

--------------------------------

### Install CodeGeeX from Source

Source: https://github.com/thudm/codegeex/blob/main/README.md

Clone the repository and install the package in editable mode. Requires Python 3.7+, CUDA 11+, PyTorch 1.10+, and DeepSpeed 0.6+.

```bash
git clone git@github.com:THUDM/CodeGeeX.git
cd CodeGeeX
pip install -e .
```

--------------------------------

### Go Bubble Sort Example (Stealth Mode)

Source: https://github.com/thudm/codegeex/blob/main/vscode-extension/README.md

Demonstrates code generation in Stealth mode. The generated code appears in gray and can be inserted by pressing Tab. Note that modifying code before generation finishes may cause bugs.

```go
package main

import "fmt"

func main() {
    // CodeGeeX will generate the following code
    // when you stop writing.
    // Press Tab to insert the generated code.

    // Example: Bubble sort implementation in Go
    arr := []int{64, 34, 25, 12, 22, 11, 90}
    fmt.Println("Unsorted array: ", arr)

    // Bubble sort logic
    n := len(arr)
    for i := 0; i < n-1; i++ {
        for j := 0; j < n-i-1; j++ {
            if arr[j] > arr[j+1] {
                arr[j], arr[j+1] = arr[j+1], arr[j]
            }
        }
    }
    fmt.Println("Sorted array: ", arr)
}
```

--------------------------------

### Launch 8-GPU Pre-training with DeepSpeed ZeRO-2

Source: https://context7.com/thudm/codegeex/llms.txt

Launches an 8-GPU pre-training job using DeepSpeed ZeRO-2. See configs/codegeex_13b.sh for the full list of arguments.
```bash
deepspeed --num_gpus 8 codegeex/megatron/tools/pretrain_codegeex.py \
    --num-layers 40 \
    --hidden-size 5120 \
    --num-attention-heads 40 \
    --seq-length 2048 \
    --max-position-embeddings 2048 \
    --micro-batch-size 4 \
    --global-batch-size 512 \
    --lr 1e-4 \
    --train-iters 500000 \
    --lr-decay-iters 480000 \
    --data-path /data/code_corpus_text_document \
    --vocab-file tokenizer/vocab.json \
    --merge-file tokenizer/merges.txt \
    --data-impl mmap \
    --split 949,50,1 \
    --distributed-backend nccl \
    --fp16 \
    --deepspeed \
    --deepspeed_config configs/ds_config.json \
    --tokenizer-type GPT2BPETokenizer \
    --save /checkpoints/codegeex_13b \
    --load /checkpoints/codegeex_13b
```

--------------------------------

### Run CodeGeeX with Docker

Source: https://github.com/thudm/codegeex/blob/main/README.md

Pull the latest CodeGeeX Docker image and run it. Use the --gpus flag to enable GPU support, specifying device IDs.

```bash
docker pull codegeex/codegeex:latest
# To enable GPU support, specify device IDs with --gpus
docker run --gpus '"device=0,1"' -it --ipc=host --name=codegeex codegeex/codegeex
```

--------------------------------

### Building CodeGeeX Docker Image

Source: https://github.com/thudm/codegeex/blob/main/codegeex/benchmark/README.md

Build the Docker image from the provided Dockerfile if you prefer to customize the environment.

```bash
cd codegeex/docker
docker build [OPTIONS] .
```

--------------------------------

### Generate Samples with LM Evaluation Harness

Source: https://context7.com/thudm/codegeex/llms.txt

Use this wrapper with the lm-evaluation-harness to tokenize a context string, run generation, and decode the generated text. The model and tokenizer must be initialized before calling it.
```python
from codegeex.megatron.code_generation_utils import generate_samples_eval
from codegeex.megatron import get_args, get_tokenizer

args = get_args()
args.seq_length = 2048
args.temperature = 1.0
args.top_k = 0
args.top_p = 1.0
args.greedy = False
args.beam_search = False

tokenizer = get_tokenizer()
context = "# language: Python\ndef add(a, b):\n    \"\"\"Return the sum of a and b.\"\"\"\n"

# `model` is assumed to be an already loaded CodeGeeX model
generated_text = generate_samples_eval(
    model,
    context=context,
    max_gen_length=64,
    eos_token_id=tokenizer.eod,
)
print("Generated:", generated_text)
# Expected output:
#     return a + b
```

--------------------------------

### Running CodeGeeX Docker Container

Source: https://github.com/thudm/codegeex/blob/main/codegeex/benchmark/README.md

Launch a container from the CodeGeeX Docker image, mounting local directories as needed.

```bash
docker run -it --gpus all --mount type=bind,source=,target= [OPTIONS]
```

--------------------------------

### Run Inference with Quantization

Source: https://github.com/thudm/codegeex/blob/main/README.md

Run inference with quantization, which requires more than 15GB of GPU RAM. Specify the GPU ID and the path to the prompt file.

```bash
# With quantization (with more than 15GB RAM)
bash ./scripts/test_inference_quantized.sh ./tests/test_prompt.txt
```

--------------------------------

### Download and Merge Model Weights

Source: https://context7.com/thudm/codegeex/llms.txt

Download the split model checkpoint with aria2c, then merge the parts and extract the archive. The result is a directory containing model state files.
```bash
# Download all parts in parallel
aria2c -x 16 -s 16 -j 4 --continue=true -i urls.txt

# Merge the parts into one archive, then extract it
cat codegeex_13b.tar.gz.* > codegeex_13b.tar.gz
tar xvf codegeex_13b.tar.gz
# Result: directory containing mp_rank_00_model_states.pt (and others for MP)
```

--------------------------------

### Python Code Explanation Template

Source: https://github.com/thudm/codegeex/blob/main/vscode-extension/README.md

Use this template in prompt mode to explain Python code line by line. The `` tag is where the selected code will be inserted.

```python
# language: Python

def sum_squares(lst):
    sum = 0
    for i in range(len(lst)):
        if i % 3 == 0:
            lst[i] = lst[i]**2
        elif i % 4 == 0:
            lst[i] = lst[i]**3
        sum += lst[i]
    return sum

# Explain the code line by line
def sum_squares(lst):
    # initialize sum
    sum = 0
    # loop through the list
    for i in range(len(lst)):
        # if the index is a multiple of 3
        if i % 3 == 0:
            # square the entry
            lst[i] = lst[i]**2
        # if the index is a multiple of 4
        elif i % 4 == 0:
            # cube the entry
            lst[i] = lst[i]**3
        # add the entry to the sum
        sum += lst[i]
    # return the sum
    return sum

# Explain the code line by line
```

--------------------------------

### Download Model Weights with aria2c

Source: https://github.com/thudm/codegeex/blob/main/README.md

Use aria2c to download model weights from a list of URLs provided in urls.txt. Ensure sufficient disk space for the ~26GB checkpoint.

```bash
aria2c -x 16 -s 16 -j 4 --continue=true -i urls.txt
```

--------------------------------

### Pulling CodeGeeX Docker Image

Source: https://github.com/thudm/codegeex/blob/main/codegeex/benchmark/README.md

Use this command to pull the pre-built Docker image containing the required environments for evaluation.

```bash
docker pull rishubi/codegeex:latest
```

--------------------------------

### Single-GPU Inference with CodeGeeX

Source: https://context7.com/thudm/codegeex/llms.txt

Run code generation from a prompt file on a single NVIDIA GPU. Quantized mode reduces memory requirements.
Multi-GPU inference requires checkpoint conversion first.

```bash
# Write a prompt to a file (single quotes keep the docstring quotes intact)
echo '# language: Python
def bubble_sort(arr):
    """Sort a list using bubble sort algorithm."""
' > tests/test_prompt.txt

# Standard inference (>27 GB GPU RAM)
bash ./scripts/test_inference.sh 0 ./tests/test_prompt.txt

# Quantized inference (>15 GB GPU RAM)
bash ./scripts/test_inference_quantized.sh 0 ./tests/test_prompt.txt

# Multi-GPU inference (first convert checkpoint, then run)
bash ./scripts/convert_ckpt_parallel.sh /path/to/ckpt /path/to/mp_ckpt 2
bash ./scripts/test_inference_parallel.sh 2 ./tests/test_prompt.txt
```

--------------------------------

### Build Cross-lingual Translation Prompts

Source: https://context7.com/thudm/codegeex/llms.txt

Load source and target language HumanEval-X datasets to construct code translation prompts. The prompt format includes source language declaration, solution, and target language declaration.

```python
from codegeex.benchmark.utils import read_translation_dataset

dataset = read_translation_dataset(
    data_file_src="codegeex/benchmark/humaneval-x/python/data/humaneval_python.jsonl.gz",
    data_file_tgt="codegeex/benchmark/humaneval-x/cpp/data/humaneval_cpp.jsonl.gz",
    lang_src="python",
    lang_tgt="cpp",
    dataset_type="humaneval",
)

sample = list(dataset.values())[0]
print(sample["prompt"])
# code translation
# Python:
# def has_close_elements(numbers: List[float], threshold: float) -> bool:
#     for idx, elem in enumerate(numbers):
#         ...
# C++:
# bool has_close_elements(vector<float> numbers, float threshold) {
```

--------------------------------

### Run Inference on Single GPU

Source: https://github.com/thudm/codegeex/blob/main/README.md

Execute inference on a single GPU with more than 27GB of RAM. Specify the GPU ID and the path to the prompt file.
```bash
# On a single GPU (with more than 27GB RAM)
bash ./scripts/test_inference.sh ./tests/test_prompt.txt
```

--------------------------------

### Sliding Window for Pre-training Data

Source: https://context7.com/thudm/codegeex/llms.txt

Generates overlapping prompt-code token pairs from a long source file for pre-training. Ensures each window fits within `seq_len` and meets `minimum_code_len`. Requires `stream_jsonl` and `sliding_window` utilities.

```python
from codegeex.data.data_utils import stream_jsonl
from codegeex.data.data_utils import sliding_window

# Tokenize a long source file (tokens are assumed to be pre-encoded int lists)
prompt_tokens = [1, 2, 3, 4, 5]      # e.g., language tag tokens
code_tokens = list(range(100, 600))  # 500 tokens of code

windows = list(sliding_window(
    prompt_tokens=prompt_tokens,
    code_tokens=code_tokens,
    seq_len=128,
    sliding_stride=64,
    minimum_code_len=8,
))

print(f"Total windows: {len(windows)}")
for i, (p, c) in enumerate(windows[:3]):
    print(f"Window {i}: prompt_len={len(p)}, code_len={len(c)}, total={len(p)+len(c)}")
```

--------------------------------

### Evaluating Generated Codes

Source: https://github.com/thudm/codegeex/blob/main/codegeex/benchmark/README.md

Execute this script to evaluate generated code samples using the HumanEval-X benchmark. Ensure you understand the risks associated with running generated code.

```bash
bash scripts/evaluate_humaneval_x.sh
```

--------------------------------

### Convert Checkpoint for Model Parallelism

Source: https://github.com/thudm/codegeex/blob/main/README.md

Convert a checkpoint to be partitioned for model parallelism. This is a prerequisite for running inference on multiple GPUs with limited RAM per GPU.
```bash
# On multiple GPUs (with more than 6GB RAM, need to first convert ckpt to MP_SIZE partitions)
bash ./scripts/convert_ckpt_parallel.sh
```

--------------------------------

### generate_samples_eval — LM Evaluation Harness Integration

Source: https://context7.com/thudm/codegeex/llms.txt

A convenience wrapper compatible with the EleutherAI lm-evaluation-harness. It tokenizes a context string, runs generation up to max_gen_length new tokens, and returns the decoded generated text.

```APIDOC
## `generate_samples_eval` — LM Evaluation Harness Integration

`generate_samples_eval` is a convenience wrapper compatible with the [EleutherAI lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). It tokenizes a context string, runs generation up to `max_gen_length` new tokens, and returns the decoded generated text.

```python
from codegeex.megatron.code_generation_utils import generate_samples_eval
from codegeex.megatron import get_args, get_tokenizer

args = get_args()
args.seq_length = 2048
args.temperature = 1.0
args.top_k = 0
args.top_p = 1.0
args.greedy = False
args.beam_search = False

tokenizer = get_tokenizer()
context = "# language: Python\ndef add(a, b):\n    \"\"\"Return the sum of a and b.\"\"\"\n"

generated_text = generate_samples_eval(
    model,
    context=context,
    max_gen_length=64,
    eos_token_id=tokenizer.eod,
)
print("Generated:", generated_text)
# Expected output:
#     return a + b
```
```

--------------------------------

### CodeGeeXModel PyTorch Class for Inference

Source: https://context7.com/thudm/codegeex/llms.txt

Instantiate and use the `CodeGeeXModel` for inference. This involves initializing Megatron-LM, loading a checkpoint, preparing tokenized input, and generating logits.
```python
import torch
from codegeex.megatron.model import CodeGeeXModel
from codegeex.megatron import initialize_megatron, get_tokenizer

# Initialize Megatron-LM runtime (sets up distributed, tokenizer, args)
initialize_megatron(args_defaults={"tokenizer_type": "GPT2BPETokenizer"})

# Instantiate model for inference (no distributed output splitting)
model = CodeGeeXModel(num_tokentypes=0, parallel_output=False)
model = model.half().cuda()

# Load checkpoint
state_dict = torch.load("mp_rank_00_model_states.pt", map_location="cpu")
if "module" in state_dict:
    state_dict = state_dict["module"]
model.load_state_dict(state_dict)
model.eval()

# Prepare a tokenized prompt
tokenizer = get_tokenizer()
prompt = "# language: Python\ndef fibonacci(n):\n    \"\"\"Return the nth Fibonacci number.\"\"\"\n"
tokens = tokenizer.tokenize(prompt)  # list of int token IDs

# Build a minimal batch tensor
input_ids = torch.cuda.LongTensor([tokens])  # (1, seq_len)
position_ids = torch.arange(len(tokens)).unsqueeze(0).cuda()  # (1, seq_len)
attention_mask = torch.tril(torch.ones(1, 1, len(tokens), len(tokens), dtype=torch.bool)).cuda()

with torch.no_grad():
    logits = model(input_ids, position_ids, attention_mask)

# logits shape: (1, seq_len, vocab_size)
next_token_id = logits[0, -1, :].argmax().item()
print("Next token:", tokenizer.detokenize([next_token_id]))
```

--------------------------------

### REST API for Multilingual Code Generation

Source: https://context7.com/thudm/codegeex/llms.txt

Integrate with the Tianqi platform's REST API for code generation. Authenticate using an API key and secret. The API returns multiple completions for a given prompt.
```python
import json
import requests

API_KEY = "YOUR_API_KEY"
API_SECRET = "YOUR_API_SECRET"

# Code generation endpoint
url = "https://tianqi.aminer.cn/api/v2/multilingual_code_generate"
headers = {"Content-Type": "application/json"}
payload = {
    "apikey": API_KEY,
    "apisecret": API_SECRET,
    "prompt": (
        "from typing import List\n\n"
        "def has_close_elements(numbers: List[float], threshold: float) -> bool:\n"
        "    \"\"\" Check if any two numbers in the list are closer than threshold.\n"
        "    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\n"
        "    False\n"
        "    \"\"\"\n"
    ),
    "n": 3,  # number of completions to return
    "lang": "Python",
}

response = requests.post(url, headers=headers, data=json.dumps(payload))
response.raise_for_status()
result = response.json()

# result format: {"result": [{"code": " ..."}, {"code": " ..."}, ...]}
for i, item in enumerate(result.get("result", [])):
    print(f"--- Completion {i+1} ---\n{item['code']}\n")
```

--------------------------------

### Parallel Nucleus Sampling Generation

Source: https://context7.com/thudm/codegeex/llms.txt

Generate multiple independent completions in parallel using nucleus (top-p) sampling with `generate_nuclear_sampling`. It returns a list of `Handle` objects, each containing a token sequence and its log-probability score. Ensure `get_args` and `get_tokenizer` are imported.
```Python
from codegeex.megatron.code_generation_utils import generate_nuclear_sampling, Handle
from codegeex.megatron import get_args, get_tokenizer

args = get_args()
args.seq_length = 2048
args.out_seq_length = 128

tokenizer = get_tokenizer()
prompt = "# language: Python\ndef merge_sorted_lists(a, b):\n    \"\"\"Merge two sorted lists.\"\"\"\n"
context_tokens = tokenizer.tokenize(prompt)
n_prompt = len(context_tokens)

handles = generate_nuclear_sampling(
    model,
    context_tokens=context_tokens,
    num_samples=10,
    temperature=0.8,
    top_p=0.95,
    top_k=0,
)

finished = [h for h in handles if h.is_finished()]
print(f"Finished: {len(finished)}/{len(handles)}")
for h in finished[:3]:
    code = tokenizer.detokenize(h.tokens[n_prompt:])
    print(f"Score: {h.score:.3f}\n{code}\n")
```

--------------------------------

### REST API — `multilingual_code_generate`

Source: https://context7.com/thudm/codegeex/llms.txt

The Tianqi platform exposes a JSON REST endpoint for code generation. Authenticate with an API key/secret obtained from tianqi.aminer.cn.

```APIDOC
## REST API — `multilingual_code_generate`

The Tianqi platform exposes a JSON REST endpoint for code generation. Authenticate with an API key/secret obtained from [tianqi.aminer.cn](https://tianqi.aminer.cn/open/).
```python
import json
import requests

API_KEY = "YOUR_API_KEY"
API_SECRET = "YOUR_API_SECRET"

# Code generation endpoint
url = "https://tianqi.aminer.cn/api/v2/multilingual_code_generate"
headers = {"Content-Type": "application/json"}
payload = {
    "apikey": API_KEY,
    "apisecret": API_SECRET,
    "prompt": (
        "from typing import List\n\n"
        "def has_close_elements(numbers: List[float], threshold: float) -> bool:\n"
        "    \"\"\" Check if any two numbers in the list are closer than threshold.\n"
        "    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\n"
        "    False\n"
        "    \"\"\"\n"
    ),
    "n": 3,  # number of completions to return
    "lang": "Python",
}

response = requests.post(url, headers=headers, data=json.dumps(payload))
response.raise_for_status()
result = response.json()

# result format: {"result": [{"code": " ..."}, {"code": " ..."}, ...]}
for i, item in enumerate(result.get("result", [])):
    print(f"--- Completion {i+1} ---\n{item['code']}\n")
```
```

--------------------------------

### Stream and Write JSONL Datasets

Source: https://context7.com/thudm/codegeex/llms.txt

Utilize `stream_jsonl` for lazy reading of `.jsonl` or `.jsonl.gz` files and `write_jsonl` for writing iterables of dictionaries. These are essential for handling HumanEval-X datasets and generation outputs.

```python
from codegeex.data.data_utils import stream_jsonl, write_jsonl

# Read a gzip-compressed HumanEval-X dataset
data_file = "codegeex/benchmark/humaneval-x/python/data/humaneval_python.jsonl.gz"
problems = {task["task_id"]: task for task in stream_jsonl(data_file)}

# Inspect a sample
sample = problems["Python/0"]
print("task_id:", sample["task_id"])
print("prompt:\n", sample["prompt"])
print("canonical_solution:\n", sample["canonical_solution"])
# task_id: Python/0
# prompt: from typing import List\ndef has_close_elements(...) -> bool:\n ...
# canonical_solution: for idx, elem in enumerate(numbers): ...

# Write generated completions to a gzip output file
generations = [
    {"task_id": "Python/0", "generation": "    for i in range(len(numbers)):\n        ..."},
    {"task_id": "Python/1", "generation": "    return numbers[0] if len(numbers) == 1 else ..."},
]
write_jsonl("my_generations.jsonl.gz", generations)
```

--------------------------------

### Python Docstring Generation Template

Source: https://github.com/thudm/codegeex/blob/main/vscode-extension/README.md

This template can be used in prompt mode for generating Python docstrings. The `` tag is where the selected code will be inserted.

```python
def add_binary(a, b):
    '''
    Returns the sum of two decimal numbers in binary digits.

    Parameters:
        a (int): A decimal integer
        b (int): Another decimal integer

    Returns:
        binary_sum (str): Binary string of the sum of a and b
    '''
    binary_sum = bin(a+b)[2:]
    return binary_sum
```

--------------------------------

### Extract Model Weights

Source: https://github.com/thudm/codegeex/blob/main/README.md

Concatenate split model weight files and then extract the tar archive to obtain the full model weights.

```bash
cat codegeex_13b.tar.gz.* > codegeex_13b.tar.gz
tar xvf codegeex_13b.tar.gz
```

--------------------------------

### sliding_window

Source: https://context7.com/thudm/codegeex/llms.txt

Generates overlapping (prompt_tokens, code_tokens) pairs from a long source file for pre-training data preparation. Ensures each window fits within a specified sequence length.

```APIDOC
## sliding_window

### Description
Generates overlapping (prompt_tokens, code_tokens) pairs from a long source file for pre-training data preparation. Ensures each window fits within a specified sequence length.

### Parameters
- **prompt_tokens** (list of integers) - Required - Tokens representing the prompt.
- **code_tokens** (list of integers) - Required - Tokens representing the code.
- **seq_len** (integer) - Required - The maximum sequence length for each window.
- **sliding_stride** (integer) - Required - The stride for sliding the window.
- **minimum_code_len** (integer) - Required - The minimum length of code tokens required in a window.
```

--------------------------------

### Run Inference on Multiple GPUs

Source: https://github.com/thudm/codegeex/blob/main/README.md

Execute inference across multiple GPUs using a pre-converted checkpoint. Specify the model parallelism size and the path to the prompt file.

```bash
bash ./scripts/test_inference_parallel.sh ./tests/test_prompt.txt
```

--------------------------------

### read_translation_dataset — Cross-lingual Translation Prompt Builder

Source: https://context7.com/thudm/codegeex/llms.txt

read_translation_dataset loads a source and target language HumanEval-X dataset and constructs the "code translation" prompt format: source language declaration + solution, followed by the target language declaration.

```APIDOC
## `read_translation_dataset` — Cross-lingual Translation Prompt Builder

`read_translation_dataset` loads a source and target language HumanEval-X dataset and constructs the "code translation" prompt format: source language declaration + solution, followed by the target language declaration.

```python
from codegeex.benchmark.utils import read_translation_dataset

dataset = read_translation_dataset(
    data_file_src="codegeex/benchmark/humaneval-x/python/data/humaneval_python.jsonl.gz",
    data_file_tgt="codegeex/benchmark/humaneval-x/cpp/data/humaneval_cpp.jsonl.gz",
    lang_src="python",
    lang_tgt="cpp",
    dataset_type="humaneval",
)

sample = list(dataset.values())[0]
print(sample["prompt"])
# code translation
# Python:
# def has_close_elements(numbers: List[float], threshold: float) -> bool:
#     for idx, elem in enumerate(numbers):
#         ...
# C++:
# bool has_close_elements(vector<float> numbers, float threshold) {
```
```

--------------------------------

### generate_nuclear_sampling

Source: https://context7.com/thudm/codegeex/llms.txt

`generate_nuclear_sampling` generates `num_samples` independent completions in parallel using nucleus (top-p) sampling, returning a list of `Handle` objects each containing a token sequence and log-probability score.

```APIDOC
## `generate_nuclear_sampling` — Parallel Nucleus Sampling

`generate_nuclear_sampling` generates `num_samples` independent completions in parallel using nucleus (top-p) sampling, returning a list of `Handle` objects each containing a token sequence and log-probability score.

```python
from codegeex.megatron.code_generation_utils import generate_nuclear_sampling, Handle
from codegeex.megatron import get_args, get_tokenizer

args = get_args()
args.seq_length = 2048
args.out_seq_length = 128

tokenizer = get_tokenizer()
prompt = "# language: Python\ndef merge_sorted_lists(a, b):\n    \"\"\"Merge two sorted lists.\"\"\"\n"
context_tokens = tokenizer.tokenize(prompt)
n_prompt = len(context_tokens)

handles = generate_nuclear_sampling(
    model,
    context_tokens=context_tokens,
    num_samples=10,
    temperature=0.8,
    top_p=0.95,
    top_k=0,
)

finished = [h for h in handles if h.is_finished()]
print(f"Finished: {len(finished)}/{len(handles)}")
for h in finished[:3]:
    code = tokenizer.detokenize(h.tokens[n_prompt:])
    print(f"Score: {h.score:.3f}\n{code}\n")
```
```

--------------------------------

### is_code_generation_finished / cleanup_code

Source: https://context7.com/thudm/codegeex/llms.txt

Detects natural completion boundaries in code generation and trims trailing content. `is_code_generation_finished` identifies the boundary, and `cleanup_code` performs the trimming.

```APIDOC
## is_code_generation_finished / cleanup_code

### Description
Detects natural completion boundaries in code generation and trims trailing content.
`is_code_generation_finished` identifies the boundary, and `cleanup_code` performs the trimming.

### Parameters for `is_code_generation_finished`
- **code** (string) - Required - The code string to check.
- **language_type** (string) - Required - The programming language of the code (e.g., "python", "java").
- **dataset** (string) - Required - The dataset used for generation (e.g., "humaneval").

### Parameters for `cleanup_code`
- **code** (string) - Required - The code string to clean.
- **language_type** (string) - Required - The programming language of the code (e.g., "python", "java").
- **dataset** (string) - Required - The dataset used for generation (e.g., "humaneval").
```

--------------------------------

### stream_jsonl / write_jsonl — JSONL Dataset Utilities

Source: https://context7.com/thudm/codegeex/llms.txt

stream_jsonl lazily reads .jsonl or .jsonl.gz files as dictionaries, while write_jsonl writes an iterable of dicts to the same formats. These are the primary I/O primitives for HumanEval-X datasets and generation outputs.

```APIDOC
## `stream_jsonl` / `write_jsonl` — JSONL Dataset Utilities

`stream_jsonl` lazily reads `.jsonl` or `.jsonl.gz` files as dictionaries, while `write_jsonl` writes an iterable of dicts to the same formats. These are the primary I/O primitives for HumanEval-X datasets and generation outputs.

```python
from codegeex.data.data_utils import stream_jsonl, write_jsonl

# Read a gzip-compressed HumanEval-X dataset
data_file = "codegeex/benchmark/humaneval-x/python/data/humaneval_python.jsonl.gz"
problems = {task["task_id"]: task for task in stream_jsonl(data_file)}

# Inspect a sample
sample = problems["Python/0"]
print("task_id:", sample["task_id"])
print("prompt:\n", sample["prompt"])
print("canonical_solution:\n", sample["canonical_solution"])
# task_id: Python/0
# prompt: from typing import List\ndef has_close_elements(...) -> bool:\n ...
# canonical_solution: for idx, elem in enumerate(numbers): ...

# Write generated completions to a gzip output file
generations = [
    {"task_id": "Python/0", "generation": "    for i in range(len(numbers)):\n        ..."},
    {"task_id": "Python/1", "generation": "    return numbers[0] if len(numbers) == 1 else ..."},
]
write_jsonl("my_generations.jsonl.gz", generations)
```
```

--------------------------------

### Beam Search Decoding

Source: https://context7.com/thudm/codegeex/llms.txt

Use `beam_search` to run beam search decoding and return the top-scoring sequences. The function takes the model, the context tokens, and the number of beams. Note that it does not support model parallelism. Ensure `get_args` and `get_tokenizer` are imported.

```Python
from codegeex.megatron.code_generation_utils import beam_search, Beam
from codegeex.megatron import get_args, get_tokenizer

args = get_args()
args.seq_length = 2048
args.out_seq_length = 256
args.beam_warmup = False
args.evaluation = True  # use is_code_generation_finished() as stop criterion

tokenizer = get_tokenizer()
prompt = "// language: Java\npublic static int factorial(int n) {\n    // Return n!\n"
context_tokens = tokenizer.tokenize(prompt)

beams = beam_search(model, context_tokens=context_tokens, num_beams=5)

for rank, beam in enumerate(beams):
    code = tokenizer.detokenize(beam.tokens[len(context_tokens):])
    print(f"Beam {rank} | score={beam.score:.4f}\n{code}\n")
# Example output:
# Beam 0 | score=-3.2145
#     if (n <= 1) return 1;
#     return n * factorial(n - 1);
# }
```

--------------------------------

### Code Generation Stopping and Cleanup

Source: https://context7.com/thudm/codegeex/llms.txt

Detects natural completion boundaries in model output and trims trailing content. `is_code_generation_finished` checks for completion, and `cleanup_code` performs the trimming. Supports Python and Java.
```python
from codegeex.benchmark.utils import is_code_generation_finished, cleanup_code

# --- Python: stop at a new top-level statement ---
code_py = "    result = []\n    for x in nums:\n        result.append(x)\n    return result\n\ndef helper():\n    pass"
print(is_code_generation_finished(code_py, language_type="python", dataset="humaneval"))
# True (new top-level def detected)

cleaned_py = cleanup_code(code_py, language_type="python", dataset="humaneval")
print(cleaned_py)
# "    result = []\n    for x in nums:\n        result.append(x)\n    return result"

# --- Java: stop when braces balance (one extra closing brace) ---
code_java = "    return n <= 1 ? n : fib(n-1) + fib(n-2);\n}"
print(is_code_generation_finished(code_java, language_type="java", dataset="humaneval"))
# True (opens=0 + 1 == closes=1)

cleaned_java = cleanup_code(code_java, language_type="java", dataset="humaneval")
print(cleaned_java)
# "    return n <= 1 ? n : fib(n-1) + fib(n-2);\n}"
```

--------------------------------

### Sampling Filter for Logits

Source: https://context7.com/thudm/codegeex/llms.txt

Use `top_k_logits` to filter logits in-place, retaining only top-k tokens or those within a cumulative top-p probability mass. This is applied before softmax sampling. Ensure PyTorch and `torch.nn.functional` are imported.
```Python
import torch
import torch.nn.functional as F
from codegeex.megatron.code_generation_utils import top_k_logits

# Simulated logits for a vocabulary of 50400 tokens
logits = torch.randn(1, 50400).cuda()

# Apply top-k=50, top-p=0.9 filtering
filtered = top_k_logits(logits.clone(), top_k=50, top_p=0.9)

# Sample from filtered distribution
probs = F.softmax(filtered, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
print("Sampled token id:", next_token.item())

# Pure top-p (nucleus) sampling with no top-k cap
filtered_p = top_k_logits(logits.clone(), top_k=0, top_p=0.95)
probs_p = F.softmax(filtered_p, dim=-1)
# Count the logits that survived filtering (the rest were pushed to -inf)
print("Unfiltered logits remaining:", (filtered_p > -1e9).sum().item())
```

--------------------------------

### beam_search

Source: https://context7.com/thudm/codegeex/llms.txt

`beam_search` implements a greedy beam search over the model's token distribution, returning the top-scoring complete or partial beam sequences. Note: does not support model parallelism.

```APIDOC
## `beam_search` — Beam Search Decoding

`beam_search` implements a greedy beam search over the model's token distribution, returning the top-scoring complete or partial beam sequences. Note: does not support model parallelism.

```python
from codegeex.megatron.code_generation_utils import beam_search, Beam
from codegeex.megatron import get_args, get_tokenizer

args = get_args()
args.seq_length = 2048
args.out_seq_length = 256
args.beam_warmup = False
args.evaluation = True  # use is_code_generation_finished() as stop criterion

tokenizer = get_tokenizer()
prompt = "// language: Java\npublic static int factorial(int n) {\n    // Return n!"
context_tokens = tokenizer.tokenize(prompt)

beams = beam_search(model, context_tokens=context_tokens, num_beams=5)

for rank, beam in enumerate(beams):
    code = tokenizer.detokenize(beam.tokens[len(context_tokens):])
    print(f"Beam {rank} | score={beam.score:.4f}\n{code}\n")
# Example output:
# Beam 0 | score=-3.2145
#     if (n <= 1) return 1;
#     return n * factorial(n - 1);
# }
```
```

--------------------------------

### evaluate_functional_correctness — HumanEval-X Evaluation Pipeline

Source: https://context7.com/thudm/codegeex/llms.txt

evaluate_functional_correctness runs the full evaluation pipeline: it loads a JSONL file of model generations, executes each against the language-specific test suite in parallel, and computes the unbiased pass@k metric for k ∈ {1, 10, 100}.

```APIDOC
## `evaluate_functional_correctness` — HumanEval-X Evaluation Pipeline

`evaluate_functional_correctness` runs the full evaluation pipeline: it loads a JSONL file of model generations, executes each against the language-specific test suite in parallel, and computes the unbiased pass@k metric for k ∈ {1, 10, 100}.

```python
from codegeex.benchmark.evaluate_humaneval_x import evaluate_functional_correctness
```
```

--------------------------------

### Evaluate Functional Correctness

Source: https://context7.com/thudm/codegeex/llms.txt

Execute the full evaluation pipeline for HumanEval-X datasets. This function loads model generations, runs tests in parallel, and computes pass@k metrics.

```python
from codegeex.benchmark.evaluate_humaneval_x import evaluate_functional_correctness
```

--------------------------------

### top_k_logits

Source: https://context7.com/thudm/codegeex/llms.txt

`top_k_logits` filters a logit tensor in-place to keep only the top-k tokens and/or tokens within a cumulative top-p probability mass, setting all other logits to `-inf` before softmax sampling.
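To make the filtering rule concrete, here is a minimal pure-Python sketch of top-k followed by top-p filtering on a plain list of logits. It is an illustration only, not CodeGeeX's implementation: the name `filter_logits` is hypothetical, it returns a copy instead of modifying its input, and details such as tie handling or whether the token that crosses the `top_p` threshold is kept are assumptions that may differ from the real `top_k_logits`.

```python
import math


def filter_logits(logits, top_k=0, top_p=0.0):
    """Hypothetical helper: return a copy of `logits` with rejected entries set to -inf."""
    neg_inf = float("-inf")
    out = list(logits)

    if top_k > 0:
        # Keep the k largest logits (ties with the k-th value are also kept).
        kth = sorted(out, reverse=True)[min(top_k, len(out)) - 1]
        out = [x if x >= kth else neg_inf for x in out]

    if top_p > 0.0:
        # Softmax over the surviving logits; exp(-inf) evaluates to 0.0.
        m = max(out)
        exps = [math.exp(x - m) for x in out]
        total = sum(exps)
        probs = [e / total for e in exps]

        # Walk tokens from most to least probable until the kept mass reaches top_p.
        # Assumption: the token that crosses the threshold is kept.
        keep, cumulative = set(), 0.0
        for i in sorted(range(len(out)), key=lambda j: probs[j], reverse=True):
            keep.add(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        out = [x if i in keep else neg_inf for i, x in enumerate(out)]

    return out


print(filter_logits([2.0, 1.0, 0.0, -1.0], top_k=3, top_p=0.8))
# [2.0, 1.0, -inf, -inf]
```

In this example top-k first drops the smallest logit, and the two most probable remaining tokens already cover more than 80% of the probability mass, so only they survive. Sampling would then apply softmax to the filtered values, which assigns zero probability to every `-inf` entry.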
```APIDOC
## `top_k_logits` — Sampling Filter

`top_k_logits` filters a logit tensor in-place to keep only the top-k tokens and/or tokens within a cumulative top-p probability mass, setting all other logits to `-inf` before softmax sampling.

```python
import torch
import torch.nn.functional as F
from codegeex.megatron.code_generation_utils import top_k_logits

# Simulated logits for a vocabulary of 50400 tokens
logits = torch.randn(1, 50400).cuda()

# Apply top-k=50, top-p=0.9 filtering
filtered = top_k_logits(logits.clone(), top_k=50, top_p=0.9)

# Sample from filtered distribution
probs = F.softmax(filtered, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
print("Sampled token id:", next_token.item())

# Pure top-p (nucleus) sampling with no top-k cap
filtered_p = top_k_logits(logits.clone(), top_k=0, top_p=0.95)
probs_p = F.softmax(filtered_p, dim=-1)
# Filtered-out logits are set to -inf, so this counts the tokens that were kept
print("Logits kept after filtering:", (filtered_p > -1e9).sum().item())
```
```

--------------------------------

### estimate_pass_at_k

Source: https://context7.com/thudm/codegeex/llms.txt

Computes the unbiased pass@k estimator from Codex. Given n total samples and c correct samples per problem, it estimates the probability that at least one of k random draws is correct.

```APIDOC
## estimate_pass_at_k

### Description
Computes the unbiased pass@k estimator from Codex. Given n total samples and c correct samples per problem, it estimates the probability that at least one of k random draws is correct.

### Parameters
- **num_samples** (integer) - Required - Total number of samples per problem.
- **num_correct** (numpy.array) - Required - Array of correct counts per problem.
- **k** (integer) - Required - The value of k for the pass@k metric.
```

--------------------------------

### Evaluate Python Generations

Source: https://context7.com/thudm/codegeex/llms.txt

Evaluates the functional correctness of Python code generations against a problem set.
Specify the input generation file, the problem file, and the output directories. The same evaluation can also be invoked from bash.

```python
from codegeex.benchmark.evaluate_humaneval_x import evaluate_functional_correctness

evaluate_functional_correctness(
    input_file="my_generations.jsonl",  # {"task_id": "Python/0", "generation": "..."}
    problem_file="codegeex/benchmark/humaneval-x/python/data/humaneval_python.jsonl.gz",
    tmp_dir="./tmp_eval/",
    n_workers=16,
    timeout=10.0,
    out_dir="./results/",
    k=[1, 10, 100],
    example_test=False,  # True = use public test cases only
)
```

```bash
bash scripts/evaluate_humaneval_x.sh my_generations.jsonl python 16
```

--------------------------------

### get_token_stream

Source: https://context7.com/thudm/codegeex/llms.txt

`get_token_stream` is the central generation function that returns a Python generator yielding successive token tensors as the model decodes. Supports greedy, top-k, top-p (nucleus), temperature sampling, and beam search via runtime args.

```APIDOC
## `get_token_stream` — Autoregressive Generation Iterator

`get_token_stream` is the central generation function that returns a Python generator yielding successive token tensors as the model decodes. Supports greedy, top-k, top-p (nucleus), temperature sampling, and beam search via runtime args.

```python
import copy
import torch
from codegeex.megatron import get_args, get_tokenizer
from codegeex.megatron.code_generation_utils import get_token_stream

tokenizer = get_tokenizer()
args = get_args()
args.out_seq_length = 128  # max new tokens to generate
args.temperature = 0.8
args.top_k = 0
args.top_p = 0.95
args.greedy = False
args.beam_search = False

# Single-quoted so the embedded triple-quoted docstring does not end the string
prompt = '# language: Python\ndef is_prime(n):\n    """Check if n is a prime number."""\n'
context_tokens = tokenizer.tokenize(prompt)
n_prompt = len(context_tokens)

# Batched generation: micro_batch_size copies of the same prompt
# (`model` is assumed to be a CodeGeeX model that has already been loaded)
micro_batch_size = 4
token_stream = get_token_stream(
    model,
    [copy.deepcopy(context_tokens) for _ in range(micro_batch_size)],
    return_scores=True,
    prompt_length=n_prompt,
    micro_batch_size=micro_batch_size,
    bad_ids=None,  # token IDs to suppress, e.g. [tokenizer.eod]
    temperature=0.8,
    topp=0.95,
    topk=0,
)

# Consume the stream; last iteration holds the full generated sequences
for generated_tokens, (lengths, scores) in token_stream:
    pass

# Decode each sample
for i in range(micro_batch_size):
    toks = generated_tokens[i].cpu().numpy().tolist()
    code = tokenizer.detokenize(toks[n_prompt:])
    print(f"--- Sample {i} (score={scores[i]:.3f}) ---\n{code}\n")
```
```

--------------------------------

### Autoregressive Generation Iterator

Source: https://context7.com/thudm/codegeex/llms.txt

Use `get_token_stream` to generate successive token tensors for autoregressive decoding. It supports several decoding methods, including greedy, top-k, top-p, and beam search, configurable via runtime arguments. Make sure the imports, tokenizer, and arguments are initialized first.

```Python
import copy
import torch
from codegeex.megatron import get_args, get_tokenizer
from codegeex.megatron.code_generation_utils import get_token_stream

tokenizer = get_tokenizer()
args = get_args()
args.out_seq_length = 128  # max new tokens to generate
args.temperature = 0.8
args.top_k = 0
args.top_p = 0.95
args.greedy = False
args.beam_search = False

# Single-quoted so the embedded triple-quoted docstring does not end the string
prompt = '# language: Python\ndef is_prime(n):\n    """Check if n is a prime number."""\n'
context_tokens = tokenizer.tokenize(prompt)
n_prompt = len(context_tokens)

# Batched generation: micro_batch_size copies of the same prompt
# (`model` is assumed to be a CodeGeeX model that has already been loaded)
micro_batch_size = 4
token_stream = get_token_stream(
    model,
    [copy.deepcopy(context_tokens) for _ in range(micro_batch_size)],
    return_scores=True,
    prompt_length=n_prompt,
    micro_batch_size=micro_batch_size,
    bad_ids=None,  # token IDs to suppress, e.g. [tokenizer.eod]
    temperature=0.8,
    topp=0.95,
    topk=0,
)

# Consume the stream; last iteration holds the full generated sequences
for generated_tokens, (lengths, scores) in token_stream:
    pass

# Decode each sample
for i in range(micro_batch_size):
    toks = generated_tokens[i].cpu().numpy().tolist()
    code = tokenizer.detokenize(toks[n_prompt:])
    print(f"--- Sample {i} (score={scores[i]:.3f}) ---\n{code}\n")
```

--------------------------------

### evaluate_functional_correctness

Source: https://context7.com/thudm/codegeex/llms.txt

Evaluates the functional correctness of generated Python code against a set of problems. It takes an input file of generations and a problem file, then outputs results to a specified directory.

```APIDOC
## evaluate_functional_correctness

### Description
Evaluates the functional correctness of generated Python code against a set of problems. It takes an input file of generations and a problem file, then outputs results to a specified directory.

### Parameters
- **input_file** (string) - Required - Path to the JSONL file containing code generations.
- **problem_file** (string) - Required - Path to the JSONL.gz file containing the problems.
- **tmp_dir** (string) - Optional - Directory for temporary files during evaluation.
- **n_workers** (integer) - Optional - Number of worker processes to use for evaluation.
- **timeout** (float) - Optional - Timeout in seconds for each test case.
- **out_dir** (string) - Optional - Directory to save the evaluation results.
- **k** (list of integers) - Optional - List of k values for pass@k calculation.
- **example_test** (boolean) - Optional - If True, use only public test cases.
```

--------------------------------

### Estimate Pass@k Metric

Source: https://context7.com/thudm/codegeex/llms.txt

Computes the unbiased pass@k estimator. Use this to estimate the probability that at least one of k random draws is correct, given total samples and per-problem correct counts. Requires numpy.

```python
import numpy as np
from codegeex.benchmark.metric import estimate_pass_at_k

# Example: 200 samples per problem, with varying correct counts
num_samples = 200
num_correct = np.array([20, 50, 100, 180, 0, 200])  # per-problem correct counts

for k in [1, 10, 100]:
    scores = estimate_pass_at_k(num_samples, num_correct, k)
    print(f"pass@{k}: {scores.mean():.4f} (per-problem: {np.round(scores, 3)})")
```

--------------------------------

### Check for Close Elements in Array (Java)

Source: https://github.com/thudm/codegeex/blob/main/tests/test_prompt.txt

Checks if any two elements in an integer array are within a specified threshold of each other. This is useful for proximity checks.

```Java
public class Solution {
    public static boolean hasCloseElements(int[] nums, int threshold) {
        for (int i = 0; i < nums.length - 1; i++) {
            for (int j = i + 1; j < nums.length; j++) {
                if (Math.abs(nums[i] - nums[j]) < threshold) {
                    return true;
                }
            }
        }
        return false;
    }
}
```
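
--------------------------------

### Sketch: pass@k Formula in Plain Python

The `estimate_pass_at_k` entries above describe the unbiased pass@k estimator only in words. As a rough illustration, the per-problem quantity can be written as 1 - C(n - c, k) / C(n, k), where n is the number of samples and c the number of correct ones. The sketch below uses only the standard library; the name `pass_at_k` is hypothetical and is not part of the CodeGeeX API, and the library's own implementation may compute the same value differently.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Hypothetical helper: unbiased pass@k for a single problem.

    n: total samples drawn, c: samples that passed, k: draw size.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k draw contains a correct one
        return 1.0
    # Probability that a random size-k subset holds at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    n = 200
    # Same per-problem correct counts as the estimate_pass_at_k example
    for c in (20, 50, 100, 180, 0, 200):
        values = ", ".join(f"pass@{k}={pass_at_k(n, c, k):.3f}" for k in (1, 10, 100))
        print(f"c={c:3d}: {values}")
```

Averaging these per-problem values over all problems gives the benchmark-level pass@k score.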