### Install BigCodeBench Dependencies

Source: https://github.com/bigcode-project/bigcodebench/blob/main/ADVANCED_USAGE.md

Installs the necessary Python dependencies for BigCodeBench evaluation by referencing a requirements file from a GitHub repository. This command should be run before executing the evaluation locally.

```bash
pip install -r https://raw.githubusercontent.com/bigcode-project/bigcodebench/main/Requirements/requirements-eval.txt
```

--------------------------------

### Install BigCodeBench and Dependencies

Source: https://github.com/bigcode-project/bigcodebench/blob/main/README.md

Installs the BigCodeBench package and essential dependencies like flash-attn for code generation. This setup is crucial for utilizing the project's code generation and evaluation features.

```bash
pip install bigcodebench --upgrade
pip install packaging ninja
pip install flash-attn --no-build-isolation
```

--------------------------------

### Setup BigCodeBench as Local Repository (Bash)

Source: https://github.com/bigcode-project/bigcodebench/blob/main/ADVANCED_USAGE.md

Configures the BigCodeBench project for local use by cloning the repository, navigating into the directory, setting the PYTHONPATH environment variable, and installing the package in editable mode. This allows for direct modification and testing of the library's code.

```bash
git clone https://github.com/bigcode-project/bigcodebench.git
cd bigcodebench
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -e .
```

--------------------------------

### Install BigCodeBench Evaluation Requirements (Bash)

Source: https://github.com/bigcode-project/bigcodebench/blob/main/ADVANCED_USAGE.md

Installs the necessary Python packages for evaluating BigCodeBench locally. It's recommended to perform this installation in an isolated environment to avoid dependency conflicts. The command fetches the requirements file from a remote URL.

```bash
pip install -I -r https://raw.githubusercontent.com/bigcode-project/bigcodebench/main/Requirements/requirements-eval.txt
```

--------------------------------

### BigCodeBench Docker with Other Backend Authentication

Source: https://github.com/bigcode-project/bigcodebench/blob/main/ADVANCED_USAGE.md

Examples of running BigCodeBench via Docker for backends that require API keys or specific authentication tokens, such as OpenAI, Anthropic, Mistral, and Google. The relevant environment variable for the API key is passed to the container.

```bash
docker run -e OPENAI_API_KEY=$OPENAI_API_KEY -v $(pwd):/app -t bigcodebench/bigcodebench-generate:latest # omit other arguments
```

```bash
docker run -e ANTHROPIC_KEY=$ANTHROPIC_KEY -v $(pwd):/app -t bigcodebench/bigcodebench-generate:latest # omit other arguments
```

```bash
docker run -e MISTRAL_KEY=$MISTRAL_KEY -v $(pwd):/app -t bigcodebench/bigcodebench-generate:latest # omit other arguments
```

```bash
docker run -e GOOGLE_API_KEY=$GOOGLE_API_KEY -v $(pwd):/app -t bigcodebench/bigcodebench-generate:latest # omit other arguments
```

--------------------------------

### Generate Code Samples using BigCodeBench Docker (CPU)

Source: https://github.com/bigcode-project/bigcodebench/blob/main/ADVANCED_USAGE.md

This command uses a pre-built Docker image for BigCodeBench to generate code samples on CPUs. It maps the current directory to the container and allows specification of generation parameters.

```bash
docker run -v $(pwd):/app -t bigcodebench/bigcodebench-generate:latest \
  --model [model_name] \
  --split [complete|instruct] \
  --subset [full|hard] \
  [--greedy] \
  --bs [bs] \
  --temperature [temp] \
  --n_samples [n_samples] \
  --resume \
  --backend [vllm|hf|openai|mistral|anthropic|google]
```

--------------------------------

### Run BigCodeBench Result Analysis

Source: https://github.com/bigcode-project/bigcodebench/blob/main/ADVANCED_USAGE.md

This command initiates the analysis of BigCodeBench evaluation results, including metrics like Elo Rating and Task Solve Rate. It requires all `samples_eval_results.json` files to be placed in a 'results' folder within the 'analysis' directory.

```bash
cd analysis
python get_results.py
```

--------------------------------

### Run BigCodeBench Evaluation Locally

Source: https://github.com/bigcode-project/bigcodebench/blob/main/ADVANCED_USAGE.md

Executes the BigCodeBench evaluation locally after installing dependencies. This command includes options to specify the execution mode, data split, subset, and sample file. Additional flags like `--no-gt` and `--save_pass_rate` allow for skipping ground truth checks and saving evaluation results, respectively.

```bash
bigcodebench.evaluate --execution local --split [complete|instruct] --subset [full|hard] --samples samples-sanitized-calibrated.jsonl
bigcodebench.evaluate --execution local --split [complete|instruct] --subset [full|hard] --samples samples-sanitized-calibrated.jsonl --no-gt
bigcodebench.evaluate --execution local --split [complete|instruct] --subset [full|hard] --samples samples-sanitized-calibrated.jsonl --save_pass_rate
```

--------------------------------

### Local Evaluation of BigCodeBench Samples

Source: https://github.com/bigcode-project/bigcodebench/blob/main/ADVANCED_USAGE.md

This command initiates a local evaluation of BigCodeBench code samples using Docker. It maps the current directory to the container and specifies the split and subset for evaluation.
Optional arguments for memory and execution-time limits are listed in the comments.

```bash
# Mount the current directory to the container
# If you want to change the RAM address space limit (in MB, 30 GB by default): `--max-as-limit XXX`
# If you want to change the RAM data segment limit (in MB, 30 GB by default): `--max-data-limit`
# If you want to change the RAM stack limit (in MB, 10 MB by default): `--max-stack-limit`
# If you want to increase the execution time limit (in seconds, 240 seconds by default): `--min-time-limit`
docker run -v $(pwd):/app bigcodebench/bigcodebench-evaluate:latest --execution local --split [complete|instruct] --subset [full|hard] --samples samples-sanitized-calibrated.jsonl
```

--------------------------------

### Syntax Check with BigCodeBench Docker

Source: https://github.com/bigcode-project/bigcodebench/blob/main/ADVANCED_USAGE.md

This command runs the `bigcodebench.syncheck` utility within a pre-built Docker image. It allows for syntax validation of samples stored in a JSONL file in an isolated environment.

```bash
docker run -it --entrypoint bigcodebench.syncheck -v $(pwd):/app bigcodebench/bigcodebench-evaluate:latest --samples samples.jsonl
```

--------------------------------

### Install Nightly BigCodeBench from Source

Source: https://github.com/bigcode-project/bigcodebench/blob/main/README.md

Installs the latest development version of BigCodeBench directly from its GitHub repository. This command is useful for accessing the most recent features or bug fixes before they are officially released.

```bash
pip install "git+https://github.com/bigcode-project/bigcodebench.git" --upgrade
```

--------------------------------

### Evaluate Solutions Locally with BigCodeBench

Source: https://context7.com/bigcode-project/bigcodebench/llms.txt

This command installs necessary dependencies and then executes pre-generated code solutions in a local, sandboxed environment.
It requires a `requirements-eval.txt` file and outputs pass rates for specified `pass_k` values.

```bash
pip install -I -r https://raw.githubusercontent.com/bigcode-project/bigcodebench/main/Requirements/requirements-eval.txt

bigcodebench.evaluate \
  --execution local \
  --split complete \
  --subset full \
  --samples bcb_results/model-output-sanitized-calibrated.jsonl \
  --parallel 16 \
  --min_time_limit 1 \
  --max_as_limit 30720 \
  --pass_k "1,5,10" \
  --save_pass_rate
```

--------------------------------

### Generate Code Samples with BigCodeBench CLI

Source: https://github.com/bigcode-project/bigcodebench/blob/main/ADVANCED_USAGE.md

This command generates code samples using the BigCodeBench tool. It supports various models and backends, with options for greedy decoding, temperature, number of samples, and tensor parallel size. The output is stored in a JSONL file.

```bash
bigcodebench.generate \
  --model [model_name] \
  --split [complete|instruct] \
  --subset [full|hard] \
  [--greedy] \
  --bs [bs] \
  --temperature [temp] \
  --n_samples [n_samples] \
  --resume \
  --backend [vllm|openai|mistral|anthropic|google|hf] \
  --tp [TENSOR_PARALLEL_SIZE] \
  [--trust_remote_code] \
  [--base_url [base_url]] \
  [--tokenizer_name [tokenizer_name]]
```

--------------------------------

### Generate Code Samples using BigCodeBench Docker (GPU)

Source: https://github.com/bigcode-project/bigcodebench/blob/main/ADVANCED_USAGE.md

This command utilizes a pre-built Docker image for BigCodeBench to generate code samples, specifically configured for GPU usage. It requires mapping the current directory and specifies various generation parameters.

```bash
docker run --gpus '"device=$CUDA_VISIBLE_DEVICES"' -v $(pwd):/app -t bigcodebench/bigcodebench-generate:latest \
  --model [model_name] \
  --split [complete|instruct] \
  --subset [full|hard] \
  [--greedy] \
  --bs [bs] \
  --temperature [temp] \
  --n_samples [n_samples] \
  --resume \
  --backend [vllm|openai|mistral|anthropic|google|hf] \
  --tp [TENSOR_PARALLEL_SIZE]
```

--------------------------------

### Syntax Check of BigCodeBench Samples (JSONL)

Source: https://github.com/bigcode-project/bigcodebench/blob/main/ADVANCED_USAGE.md

This command uses `bigcodebench.syncheck` to validate the syntax of code samples stored in a JSONL file. It helps identify and report erroneous code snippets before or after post-processing.

```bash
# If you are storing codes in jsonl:
bigcodebench.syncheck --samples samples.jsonl
```

--------------------------------

### Syntax Check of BigCodeBench Samples (Directories)

Source: https://github.com/bigcode-project/bigcodebench/blob/main/ADVANCED_USAGE.md

This command uses `bigcodebench.syncheck` to validate the syntax of code samples stored in directories. It is useful for checking code generated and organized in a file structure.

```bash
# If you are storing codes in directories:
bigcodebench.syncheck --samples /path/to/vicuna-[??]b_temp_[??]
```

--------------------------------

### Upgrade BigCodeBench Package (Bash)

Source: https://github.com/bigcode-project/bigcodebench/blob/main/ADVANCED_USAGE.md

Upgrades the bigcodebench Python package to the latest version. This command ensures you have the most recent features and bug fixes for the library. It's a straightforward pip installation command.

```bash
pip install bigcodebench --upgrade
```

--------------------------------

### BigCodeBench Docker with HuggingFace Token Authentication

Source: https://github.com/bigcode-project/bigcodebench/blob/main/ADVANCED_USAGE.md

This command demonstrates how to run BigCodeBench within a Docker container when using gated or private HuggingFace models.
It requires passing the Hugging Face Hub token as an environment variable.

```bash
docker run -e HUGGING_FACE_HUB_TOKEN=$token -v $(pwd):/app -t bigcodebench/bigcodebench-generate:latest # omit other arguments
```

--------------------------------

### Clean Up BigCodeBench Processes and Temporary Files

Source: https://github.com/bigcode-project/bigcodebench/blob/main/ADVANCED_USAGE.md

This script is recommended for cleaning up the environment after a BigCodeBench evaluation. It terminates the current user's running processes whose command name matches 'bigcodebench' and then deletes everything under `/tmp`, not only files created by BigCodeBench.

```bash
# Kill the current user's processes whose command name contains 'bigcodebench'
pids=$(ps -u $(id -u) -o pid,comm | grep 'bigcodebench' | awk '{print $1}'); if [ -n "$pids" ]; then echo $pids | xargs -r kill; fi;
# Remove all files under /tmp
rm -rf /tmp/*
```

--------------------------------

### Run BigCodeBench Evaluation with Docker

Source: https://github.com/bigcode-project/bigcodebench/blob/main/ADVANCED_USAGE.md

This command runs the BigCodeBench evaluation using Docker. It mounts the current directory to the container and specifies parameters for execution mode, data split, subset, and sample file. The `--check-gt-only` flag is used to only verify ground truths.

```bash
docker run -v $(pwd):/app bigcodebench/bigcodebench-evaluate:latest --execution local --split [complete|instruct] --subset [full|hard] --samples samples-sanitized-calibrated.jsonl --check-gt-only
```

--------------------------------

### Customize Pass@k Metric

Source: https://github.com/bigcode-project/bigcodebench/blob/main/ADVANCED_USAGE.md

This command-line flag allows users to customize the 'k' value for the Pass@k metric. Multiple 'k' values can be specified as a comma-separated string, for example, to evaluate Pass@1 and Pass@100.

```bash
--pass_k 1,100
```

--------------------------------

### Inspect Failed BigCodeBench Samples

Source: https://github.com/bigcode-project/bigcodebench/blob/main/ADVANCED_USAGE.md

Command to inspect failed samples from a BigCodeBench evaluation.
It requires the evaluation results file and specifies the data split and subset. The `--in_place` flag can be used to re-run the inspection directly.

```bash
bigcodebench.inspect --eval_results sample-sanitized-calibrated_eval_results.json --split complete --subset hard
bigcodebench.inspect --eval_results sample-sanitized-calibrated_eval_results.json --split complete --subset hard --in_place
```

--------------------------------

### Implement a Custom Model Provider for BigCodeBench

Source: https://context7.com/bigcode-project/bigcodebench/llms.txt

This Python code demonstrates how to create a custom model backend for BigCodeBench by subclassing `DecoderBase`. It includes methods for initializing the model, generating code completions, and specifying whether the model performs direct completion. The provider is then instantiated via the `make_model` function. The snippet does not show a registration step, so the `make_model` call assumes the `custom` backend has already been wired to `CustomLLM`.

```python
from bigcodebench.provider import DecoderBase
from typing import List


class CustomLLM(DecoderBase):
    def __init__(self, name: str, subset: str, split: str, **kwargs):
        super().__init__(name, subset, split, **kwargs)
        # Initialize your model here
        self.model = self.load_model()

    def load_model(self):
        # Load your custom model
        # (YourModelClass is a placeholder for your own model wrapper)
        return YourModelClass(self.name)

    def codegen(self, prompts: List[str], do_sample: bool = True, num_samples: int = 1) -> List[List[str]]:
        """Generate code completions for given prompts."""
        all_outputs = []
        for prompt in prompts:
            outputs = []
            for _ in range(num_samples):
                # Generate code with your model
                generated = self.model.generate(
                    prompt,
                    max_tokens=self.max_new_tokens,
                    temperature=self.temperature if do_sample else 0.0
                )
                outputs.append(generated)
            all_outputs.append(outputs)
        return all_outputs

    def is_direct_completion(self) -> bool:
        """Return True if model does direct completion (base models)."""
        return self.direct_completion


# Register and use the custom provider
# (assumes backend="custom" has been mapped to CustomLLM)
from bigcodebench.provider import make_model

model = make_model(
    model="custom/my-model",
    backend="custom",
    subset="full",
    split="instruct",
    temperature=0.7,
    max_new_tokens=1280
)
```

--------------------------------

### Evaluate Solutions using E2B Sandbox with BigCodeBench

Source: https://context7.com/bigcode-project/bigcodebench/llms.txt

This command evaluates code using a secure remote E2B sandbox. It requires setting the `E2B_API_KEY` environment variable and specifies the execution environment, dataset splits, and parallel processing configuration. The E2B sandbox offers network isolation, resource limits, and a clean environment for each execution.

```bash
# Set up E2B API key
export E2B_API_KEY=your-e2b-api-key

# Evaluate using E2B sandbox
bigcodebench.evaluate \
  --execution e2b \
  --split instruct \
  --subset hard \
  --samples bcb_results/samples-sanitized-calibrated.jsonl \
  --e2b_endpoint bigcodebench_evaluator \
  --parallel 8 \
  --calibrated
```

--------------------------------

### Evaluate Models using Google Gemini API

Source: https://context7.com/bigcode-project/bigcodebench/llms.txt

This command evaluates models using the Google Gemini API. It requires a GOOGLE_API_KEY to be set in the environment. The evaluation uses the 'gemini-2.0-flash-exp' model, 'google' backend, 'instruct' split, 'hard' subset, and 'gradio' execution.

```bash
export GOOGLE_API_KEY=your-key

bigcodebench.evaluate \
  --model gemini-2.0-flash-exp \
  --backend google \
  --split instruct \
  --subset hard \
  --execution gradio
```

--------------------------------

### Evaluate Generated Code Samples with BigCodeBench

Source: https://github.com/bigcode-project/bigcodebench/blob/main/README.md

Evaluates generated code samples using the BigCodeBench CLI. It allows specifying the model, execution backend, data split, subset, and inference backend, enabling flexible and customized evaluation workflows.

```bash
bigcodebench.evaluate \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --execution [e2b|gradio|local] \
  --split [complete|instruct] \
  --subset [full|hard] \
  --backend [vllm|openai|anthropic|google|mistral|hf|hf-inference]
```

--------------------------------

### Evaluate Models using OpenAI API

Source: https://context7.com/bigcode-project/bigcodebench/llms.txt

This command evaluates models using the OpenAI API. It requires an OPENAI_API_KEY to be set in the environment. The evaluation uses the 'gpt-4-turbo' model, 'openai' backend, 'instruct' split, 'hard' subset, and 'gradio' execution.

```bash
export OPENAI_API_KEY=sk-your-key

bigcodebench.evaluate \
  --model gpt-4-turbo \
  --backend openai \
  --split instruct \
  --subset hard \
  --execution gradio
```

--------------------------------

### Generate Solutions for a Range of Tasks

Source: https://context7.com/bigcode-project/bigcodebench/llms.txt

This command generates solutions for a specified range of tasks (0-50). It uses the 'vllm' backend and 'greedy' decoding strategy. The generation is for the 'instruct' split and 'full' subset.

```bash
# Evaluate a range of tasks using id_range during generation
bigcodebench.generate \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --split instruct \
  --subset full \
  --backend vllm \
  --id_range 0-50 \
  --greedy

# This generates solutions only for the first 50 tasks
```

--------------------------------

### Run BigCodeBench Evaluation Using Docker

Source: https://context7.com/bigcode-project/bigcodebench/llms.txt

These commands demonstrate how to run BigCodeBench evaluations within Docker containers for consistent and reproducible environments. The first command pulls the evaluation image and runs a local evaluation. The second command shows how to generate code using a specific model within Docker, including GPU support and token configuration.

```bash
# Pull the evaluation image
docker pull bigcodebench/bigcodebench-evaluate:latest

# Run evaluation in Docker with local samples
docker run \
  -v $(pwd):/app \
  bigcodebench/bigcodebench-evaluate:latest \
  --execution local \
  --split instruct \
  --subset full \
  --samples samples-sanitized-calibrated.jsonl \
  --parallel 8 \
  --min_time_limit 1

# Generate code using Docker with GPU support
docker run \
  --gpus '"device=0,1"' \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  -v $(pwd):/app \
  bigcodebench/bigcodebench-generate:latest \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --split complete \
  --subset hard \
  --backend vllm \
  --tp 2 \
  --greedy

# Results are saved to the mounted /app directory
```

--------------------------------

### Set Google API Key for BigCodeBench

Source: https://github.com/bigcode-project/bigcodebench/blob/main/README.md

Configures the API key for accessing Google's AI models, such as Gemini, for use in BigCodeBench evaluations.

```bash
export GOOGLE_API_KEY=
```

--------------------------------

### Set Hugging Face Inference API Key

Source: https://github.com/bigcode-project/bigcodebench/blob/main/README.md

Sets the API key for the Hugging Face Serverless Inference API, allowing BigCodeBench to utilize models hosted on Hugging Face for evaluation.

```bash
export HF_INFERENCE_API_KEY=
```

--------------------------------

### Evaluate Models using Hugging Face Inference API

Source: https://context7.com/bigcode-project/bigcodebench/llms.txt

This command evaluates models using the Hugging Face Inference API. It requires an HF_INFERENCE_API_KEY to be set in the environment. The evaluation uses the 'meta-llama/Meta-Llama-3.1-70B-Instruct' model, 'hf-inference' backend, 'instruct' split, 'hard' subset, and 'gradio' execution.

```bash
export HF_INFERENCE_API_KEY=hf_your-key

bigcodebench.evaluate \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --backend hf-inference \
  --split instruct \
  --subset hard \
  --execution gradio
```

--------------------------------

### Evaluate Models using Anthropic API

Source: https://context7.com/bigcode-project/bigcodebench/llms.txt

This command evaluates models using the Anthropic API. It requires an ANTHROPIC_API_KEY to be set in the environment. The evaluation uses the 'claude-3-5-sonnet-20241022' model, 'anthropic' backend, 'instruct' split, 'hard' subset, and 'gradio' execution.

```bash
export ANTHROPIC_API_KEY=sk-ant-your-key

bigcodebench.evaluate \
  --model claude-3-5-sonnet-20241022 \
  --backend anthropic \
  --split instruct \
  --subset hard \
  --execution gradio
```

--------------------------------

### Evaluate Models using Mistral API

Source: https://context7.com/bigcode-project/bigcodebench/llms.txt

This command evaluates models using the Mistral API. It requires a MISTRAL_API_KEY to be set in the environment. The evaluation uses the 'mistral-large-latest' model, 'mistral' backend, 'instruct' split, 'hard' subset, and 'gradio' execution.

```bash
export MISTRAL_API_KEY=your-key

bigcodebench.evaluate \
  --model mistral-large-latest \
  --backend mistral \
  --split instruct \
  --subset hard \
  --execution gradio
```

--------------------------------

### Set OpenAI API Key for BigCodeBench

Source: https://github.com/bigcode-project/bigcodebench/blob/main/README.md

Sets the API key for accessing OpenAI services, which can be used as a backend for code generation or evaluation within BigCodeBench.

```bash
export OPENAI_API_KEY=
```

--------------------------------

### Set E2B API Key for BigCodeBench

Source: https://github.com/bigcode-project/bigcodebench/blob/main/README.md

Configures the necessary API key for using the E2B sandbox environment within BigCodeBench. This is required for evaluations that leverage E2B for code execution.

```bash
export E2B_API_KEY=
```

--------------------------------

### Inspect Failed Test Cases with BigCodeBench

Source: https://context7.com/bigcode-project/bigcodebench/llms.txt

This command generates a detailed report for failed test cases from evaluation results. It creates an `inspect/` directory containing lists of failed tasks, generated solutions for these tasks, and a categorized analysis of error types. The `--in_place` flag updates the report directly.

```bash
# Generate detailed failure report
bigcodebench.inspect \
  --eval_results bcb_results/model-output-sanitized-calibrated_eval_results.json \
  --split complete \
  --subset hard

# Creates inspect/ directory with:
# - inspect/failed_tasks.json: List of failed task IDs
# - inspect/BigCodeBench_X.py: Generated solutions for each failed task
# - inspect/error_analysis.json: Categorized error types

# Re-run inspection to update results in place
bigcodebench.inspect \
  --eval_results bcb_results/model-output-sanitized-calibrated_eval_results.json \
  --split complete \
  --subset hard \
  --in_place
```

--------------------------------

### End-to-End Evaluation with bigcodebench.evaluate

Source: https://context7.com/bigcode-project/bigcodebench/llms.txt

Performs end-to-end evaluation by combining code generation and execution for supported model backends. It requires API keys for cloud services and outputs JSONL files with results and pass@k metrics. Supports various execution environments like Gradio.

```bash
# Evaluate GPT-4 on the instruct split using remote Gradio execution
export OPENAI_API_KEY=sk-your-key-here

bigcodebench.evaluate \
  --model gpt-4 \
  --split instruct \
  --subset hard \
  --backend openai \
  --execution gradio \
  --temperature 0.0 \
  --n_samples 1 \
  --max_new_tokens 1280

# Output files in bcb_results/:
# - gpt-4--main--bigcodebench-hard-instruct--openai-0.0-1-sanitized_calibrated.jsonl
# - gpt-4--main--bigcodebench-hard-instruct--openai-0.0-1-sanitized_calibrated_eval_results.json
# - gpt-4--main--bigcodebench-hard-instruct--openai-0.0-1-sanitized_calibrated_pass_at_k.json

# Console output:
# BigCodeBench-Instruct (Hard)
# Groundtruth pass rate: 1.000
# pass@1: 0.568
```

--------------------------------

### Programmatic Dataset Access with get_bigcodebench

Source: https://context7.com/bigcode-project/bigcodebench/llms.txt

Loads benchmark tasks programmatically for custom evaluation pipelines using Python. The `get_bigcodebench` function allows loading the full dataset or specific subsets like 'hard'. It provides access to task details including prompts, entry points, canonical solutions, and test cases.

```python
from bigcodebench.data import get_bigcodebench

# Load the full benchmark dataset
problems = get_bigcodebench(subset="full")

# Access a specific task
task = problems["BigCodeBench/0"]
print(f"Task ID: {task['task_id']}")
print(f"Entry Point: {task['entry_point']}")
print(f"Complete Prompt:\n{task['complete_prompt']}")
print(f"Instruct Prompt:\n{task['instruct_prompt']}")
print(f"Canonical Solution:\n{task['canonical_solution']}")
print(f"Test Cases:\n{task['test']}")

# Load only the harder subset
hard_problems = get_bigcodebench(subset="hard")
print(f"Hard subset contains {len(hard_problems)} tasks")

# Example output:
# Task ID: BigCodeBench/0
# Entry Point: task_func
# Complete Prompt:
# import numpy as np
# def task_func(arr: np.ndarray) -> np.ndarray:
#     """
#     Normalize a numpy array to have zero mean and unit variance.
#     ...
# Canonical Solution:
# def task_func(arr):
#     return (arr - np.mean(arr)) / np.std(arr)
```

--------------------------------

### Set Anthropic API Key for BigCodeBench

Source: https://github.com/bigcode-project/bigcodebench/blob/main/README.md

Configures the API key for interacting with Anthropic's models, enabling their use within the BigCodeBench evaluation framework.

```bash
export ANTHROPIC_API_KEY=
```

--------------------------------

### Set Mistral API Key for BigCodeBench

Source: https://github.com/bigcode-project/bigcodebench/blob/main/README.md

Sets the API key required to use Mistral AI models via their API within the BigCodeBench evaluation process.

```bash
export MISTRAL_API_KEY=
```

--------------------------------

### Code Sanitization with bigcodebench.sanitize

Source: https://context7.com/bigcode-project/bigcodebench/llms.txt

Post-processes LLM-generated code to extract relevant functions and remove extraneous content like print statements, test code, and markdown blocks. The `--calibrate` flag enables calibration, and `--parallel` specifies the number of parallel processes for faster sanitization. It outputs a sanitized JSONL file.

```bash
# Sanitize a JSONL file of generated solutions
bigcodebench.sanitize \
  --samples bcb_results/model-output.jsonl \
  --calibrate \
  --parallel 32

# Creates: bcb_results/model-output-sanitized-calibrated.jsonl

# The sanitizer removes:
# - Extra print statements
# - Test code after function definitions
# - Unused helper functions
# - Markdown code blocks

# Before sanitization:
# ```python
# def task_func(x):
#     return x * 2
# print(task_func(5))  # test
# ```

# After sanitization:
# def task_func(x):
#     return x * 2
```

--------------------------------

### Local Code Generation with bigcodebench.generate

Source: https://context7.com/bigcode-project/bigcodebench/llms.txt

Generates code solutions from an LLM without executing them, using specified backends like vLLM.
Requires setting up CUDA_VISIBLE_DEVICES for multi-GPU usage and outputs a JSONL file containing task IDs and generated solutions. Supports resuming interrupted generation processes.

```bash
# Generate solutions using vLLM backend with a local model
export CUDA_VISIBLE_DEVICES=0,1

bigcodebench.generate \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --split complete \
  --subset full \
  --backend vllm \
  --tp 2 \
  --temperature 0.8 \
  --n_samples 10 \
  --bs 4 \
  --max_new_tokens 1280 \
  --resume

# Generates: bcb_results/meta-llama--Meta-Llama-3.1-8B-Instruct--main--bigcodebench-complete--vllm-0.8-10-sanitized_calibrated.jsonl

# Sample output format:
# {"task_id": "BigCodeBench/0", "solution": "import numpy as np\ndef task_func(...):\n ...", "raw_solution": "..."}
# {"task_id": "BigCodeBench/1", "solution": "import pandas as pd\ndef task_func(...):\n ...", "raw_solution": "..."}
```

--------------------------------

### Selective Task Evaluation

Source: https://context7.com/bigcode-project/bigcodebench/llms.txt

This command evaluates only specific tasks identified by their IDs. It uses a local execution backend and specifies the 'complete' split and 'full' subset. The tasks to be evaluated are 'BigCodeBench/10', 'BigCodeBench/11', and 'BigCodeBench/12'.

```bash
# Evaluate only tasks 10, 11, and 12
bigcodebench.evaluate \
  --samples model-output-sanitized-calibrated.jsonl \
  --execution local \
  --split complete \
  --subset full \
  --selective_evaluate "BigCodeBench/10,BigCodeBench/11,BigCodeBench/12"
```

--------------------------------

### Syntax Validation with bigcodebench.syncheck

Source: https://context7.com/bigcode-project/bigcodebench/llms.txt

Verifies the Python syntax validity of generated code within a JSONL file before evaluation. It flags tasks with syntax errors, providing line numbers and the specific error encountered, which is crucial for filtering out invalid code submissions.

```bash
# Check syntax of all solutions in a JSONL file
bigcodebench.syncheck --samples bcb_results/model-output-sanitized.jsonl

# Example output for files with syntax errors:
# Task BigCodeBench/15 has syntax error:
#   File "", line 5
#     def task_func(x)
#                    ^
# SyntaxError: invalid syntax
#
# Task BigCodeBench/42 has syntax error:
```
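
The kind of check described above can be approximated outside the CLI. The following is a minimal sketch and not the `bigcodebench.syncheck` implementation: it assumes each JSONL record carries `task_id` and `solution` fields, as in the sample output format of `bigcodebench.generate`, and relies only on Python's standard `ast.parse`. The helper name `find_syntax_errors` and the in-memory sample records are hypothetical.

```python
import ast
import json
from typing import Dict, Iterable


def find_syntax_errors(lines: Iterable[str]) -> Dict[str, str]:
    """Map task_id -> error message for every solution that fails to parse."""
    errors: Dict[str, str] = {}
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines in the JSONL stream
        record = json.loads(raw)
        try:
            # ast.parse only compiles to a syntax tree; it never runs the code
            ast.parse(record["solution"])
        except SyntaxError as exc:
            errors[record["task_id"]] = f"line {exc.lineno}: {exc.msg}"
    return errors


# Two hypothetical in-memory records standing in for a samples file
samples = [
    json.dumps({"task_id": "BigCodeBench/0",
                "solution": "def task_func(x):\n    return x * 2\n"}),
    json.dumps({"task_id": "BigCodeBench/15",
                "solution": "def task_func(x)\n    return x\n"}),
]

report = find_syntax_errors(samples)
for task_id, message in report.items():
    print(f"Task {task_id} has syntax error: {message}")
```

To run it against a real file, pass an open file handle instead of the list, for example `find_syntax_errors(open("samples.jsonl"))`. Parsing rather than executing keeps the check safe for untrusted model output, but it only catches syntax errors, not runtime failures.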