### Install BigCodeBench Dependencies

Source: https://github.com/bigcode-project/bigcodebench/blob/main/ADVANCED_USAGE.md

Installs the necessary Python dependencies for BigCodeBench evaluation by referencing a requirements file from a GitHub repository. This command should be run before executing the evaluation locally.

```bash
pip install -r https://raw.githubusercontent.com/bigcode-project/bigcodebench/main/Requirements/requirements-eval.txt
```

--------------------------------

### Install BigCodeBench and Dependencies

Source: https://github.com/bigcode-project/bigcodebench/blob/main/README.md

Installs the BigCodeBench package, then `flash-attn` for code generation. `packaging` and `ninja` are installed first because the `flash-attn` build relies on them.

```bash
pip install bigcodebench --upgrade
pip install packaging ninja
pip install flash-attn --no-build-isolation
```

--------------------------------

### Setup BigCodeBench as Local Repository (Bash)

Source: https://github.com/bigcode-project/bigcodebench/blob/main/ADVANCED_USAGE.md

Configures the BigCodeBench project for local use by cloning the repository, navigating into the directory, setting the PYTHONPATH environment variable, and installing the package in editable mode. This allows for direct modification and testing of the library's code.

```bash
git clone https://github.com/bigcode-project/bigcodebench.git
cd bigcodebench
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -e .
```

--------------------------------

### Install BigCodeBench Evaluation Requirements (Bash)

Source: https://github.com/bigcode-project/bigcodebench/blob/main/ADVANCED_USAGE.md

Installs the necessary Python packages for evaluating BigCodeBench locally. It's recommended to perform this installation in an isolated environment to avoid dependency conflicts. The command fetches the requirements file from a remote URL.

```bash
pip install -I -r https://raw.githubusercontent.com/bigcode-project/bigcodebench/main/Requirements/requirements-eval.txt
```

--------------------------------

### BigCodeBench Docker with Other Backend Authentication

Source: https://github.com/bigcode-project/bigcodebench/blob/main/ADVANCED_USAGE.md

Examples of running BigCodeBench via Docker for backends that require an API key, such as OpenAI, Anthropic, Mistral, and Google. Each command forwards the backend's API-key environment variable from the host into the container with `-e`.

```bash
docker run -e OPENAI_API_KEY=$OPENAI_API_KEY -v $(pwd):/app -t bigcodebench/bigcodebench-generate:latest # omit other arguments
```

```bash
docker run -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY -v $(pwd):/app -t bigcodebench/bigcodebench-generate:latest # omit other arguments
```

```bash
docker run -e MISTRAL_API_KEY=$MISTRAL_API_KEY -v $(pwd):/app -t bigcodebench/bigcodebench-generate:latest # omit other arguments
```

```bash
docker run -e GOOGLE_API_KEY=$GOOGLE_API_KEY -v $(pwd):/app -t bigcodebench/bigcodebench-generate:latest # omit other arguments
```

--------------------------------

### Generate Code Samples using BigCodeBench Docker (CPU)

Source: https://github.com/bigcode-project/bigcodebench/blob/main/ADVANCED_USAGE.md

This command uses a pre-built Docker image for BigCodeBench to generate code samples on CPUs. It maps the current directory to the container and allows specification of generation parameters.

```bash
docker run -v $(pwd):/app -t bigcodebench/bigcodebench-generate:latest \
    --model [model_name] \
    --split [complete|instruct] \
    --subset [full|hard] \
    [--greedy] \
    --bs [bs] \
    --temperature [temp] \
    --n_samples [n_samples] \
    --resume \
    --backend [vllm|hf|openai|mistral|anthropic|google]
```

--------------------------------

### Run BigCodeBench Result Analysis

Source: https://github.com/bigcode-project/bigcodebench/blob/main/ADVANCED_USAGE.md

This command initiates the analysis of BigCodeBench evaluation results, including metrics like Elo Rating and Task Solve Rate. It requires all `samples_eval_results.json` files to be placed in a 'results' folder within the 'analysis' directory.

```bash
cd analysis
python get_results.py
```

--------------------------------

### Run BigCodeBench Evaluation Locally

Source: https://github.com/bigcode-project/bigcodebench/blob/main/ADVANCED_USAGE.md

Executes the BigCodeBench evaluation locally after installing dependencies. This command includes options to specify the execution mode, data split, subset, and sample file. Additional flags like `--no-gt` and `--save_pass_rate` allow for skipping ground truth checks and saving evaluation results, respectively.

```bash
bigcodebench.evaluate --execution local --split [complete|instruct] --subset [full|hard] --samples samples-sanitized-calibrated.jsonl
bigcodebench.evaluate --execution local --split [complete|instruct] --subset [full|hard] --samples samples-sanitized-calibrated.jsonl --no-gt
bigcodebench.evaluate --execution local --split [complete|instruct] --subset [full|hard] --samples samples-sanitized-calibrated.jsonl --save_pass_rate
```

--------------------------------

### Local Evaluation of BigCodeBench Samples

Source: https://github.com/bigcode-project/bigcodebench/blob/main/ADVANCED_USAGE.md

This command evaluates BigCodeBench code samples inside the evaluation Docker image, using local execution within the container. It mounts the current directory into the container and specifies the split and subset to evaluate. The comments list optional flags for changing the memory limits and the execution time limit.

```bash
# Mount the current directory to the container
# If you want to change the RAM address space limit (in MB, 30 GB by default): `--max-as-limit XXX`
# If you want to change the RAM data segment limit (in MB, 30 GB by default): `--max-data-limit`
# If you want to change the RAM stack limit (in MB, 10 MB by default): `--max-stack-limit`
# If you want to increase the execution time limit (in seconds, 240 seconds by default): `--min-time-limit`
docker run -v $(pwd):/app bigcodebench/bigcodebench-evaluate:latest --execution local --split [complete|instruct] --subset [full|hard] --samples samples-sanitized-calibrated.jsonl
```
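
The limits above are given in MB, which is easy to misread when the underlying operating-system limits are expressed in bytes. The following is a minimal sketch of that conversion, assuming the flags correspond to Unix resource limits (address space, data segment, stack); how BigCodeBench applies them internally is not described here, and `limits_in_bytes` and `run_limited` are hypothetical helper names.

```python
import resource  # Unix-only standard-library module
import subprocess
from typing import Dict, List

MB = 1024 * 1024


def limits_in_bytes(max_as_limit: int = 30 * 1024,
                    max_data_limit: int = 30 * 1024,
                    max_stack_limit: int = 10) -> Dict[int, int]:
    """Map rlimit constants to byte values, given limits in MB (defaults as listed above)."""
    return {
        resource.RLIMIT_AS: max_as_limit * MB,
        resource.RLIMIT_DATA: max_data_limit * MB,
        resource.RLIMIT_STACK: max_stack_limit * MB,
    }


def run_limited(argv: List[str], limits: Dict[int, int],
                timeout: float) -> subprocess.CompletedProcess:
    """Run argv in a child process with the given rlimits and a wall-clock timeout."""
    def apply_limits() -> None:
        # Runs in the child just before exec, so the parent stays unrestricted.
        for kind, value in limits.items():
            resource.setrlimit(kind, (value, value))

    return subprocess.run(argv, preexec_fn=apply_limits, timeout=timeout,
                          capture_output=True, text=True)


limits = limits_in_bytes()
print(limits[resource.RLIMIT_AS] // MB, "MB address space")
```

Calling `run_limited` with a Python interpreter command and `timeout=240` would mirror the default time limit mentioned in the comments, under the same assumptions.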

--------------------------------

### Syntax Check with BigCodeBench Docker

Source: https://github.com/bigcode-project/bigcodebench/blob/main/ADVANCED_USAGE.md

This command runs the `bigcodebench.syncheck` utility within a pre-built Docker image. It allows for syntax validation of samples stored in a JSONL file in an isolated environment.

```bash
docker run -it --entrypoint bigcodebench.syncheck -v $(pwd):/app bigcodebench/bigcodebench-evaluate:latest --samples samples.jsonl
```

--------------------------------

### Install Nightly BigCodeBench from Source

Source: https://github.com/bigcode-project/bigcodebench/blob/main/README.md

Installs the latest development version of BigCodeBench directly from its GitHub repository. This command is useful for accessing the most recent features or bug fixes before they are officially released.

```bash
pip install "git+https://github.com/bigcode-project/bigcodebench.git" --upgrade
```

--------------------------------

### Evaluate Solutions Locally with BigCodeBench

Source: https://context7.com/bigcode-project/bigcodebench/llms.txt

This command installs necessary dependencies and then executes pre-generated code solutions in a local, sandboxed environment. It requires a `requirements-eval.txt` file and outputs pass rates for specified `pass_k` values.

```bash
pip install -I -r https://raw.githubusercontent.com/bigcode-project/bigcodebench/main/Requirements/requirements-eval.txt

bigcodebench.evaluate \
  --execution local \
  --split complete \
  --subset full \
  --samples bcb_results/model-output-sanitized-calibrated.jsonl \
  --parallel 16 \
  --min_time_limit 1 \
  --max_as_limit 30720 \
  --pass_k "1,5,10" \
  --save_pass_rate
```

--------------------------------

### Generate Code Samples with BigCodeBench CLI

Source: https://github.com/bigcode-project/bigcodebench/blob/main/ADVANCED_USAGE.md

This command generates code samples using the BigCodeBench tool. It supports various models and backends, with options for greedy decoding, temperature, number of samples, and tensor parallel size. The output is stored in a JSONL file.

```bash
bigcodebench.generate \
    --model [model_name] \
    --split [complete|instruct] \
    --subset [full|hard] \
    [--greedy] \
    --bs [bs] \
    --temperature [temp] \
    --n_samples [n_samples] \
    --resume \
    --backend [vllm|openai|mistral|anthropic|google|hf] \
    --tp [TENSOR_PARALLEL_SIZE] \
    [--trust_remote_code] \
    [--base_url [base_url]] \
    [--tokenizer_name [tokenizer_name]]
```

--------------------------------

### Generate Code Samples using BigCodeBench Docker (GPU)

Source: https://github.com/bigcode-project/bigcodebench/blob/main/ADVANCED_USAGE.md

This command utilizes a pre-built Docker image for BigCodeBench to generate code samples, specifically configured for GPU usage. It requires mapping the current directory and specifies various generation parameters.

```bash
docker run --gpus "\"device=$CUDA_VISIBLE_DEVICES\"" -v $(pwd):/app -t bigcodebench/bigcodebench-generate:latest \
    --model [model_name] \
    --split [complete|instruct] \
    --subset [full|hard] \
    [--greedy] \
    --bs [bs] \
    --temperature [temp] \
    --n_samples [n_samples] \
    --resume \
    --backend [vllm|openai|mistral|anthropic|google|hf] \
    --tp [TENSOR_PARALLEL_SIZE]
```

--------------------------------

### Syntax Check of BigCodeBench Samples (JSONL)

Source: https://github.com/bigcode-project/bigcodebench/blob/main/ADVANCED_USAGE.md

This command uses `bigcodebench.syncheck` to validate the syntax of code samples stored in a JSONL file. It helps identify and report erroneous code snippets before or after post-processing.

```bash
# If you are storing codes in jsonl:
bigcodebench.syncheck --samples samples.jsonl
```
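
To make the idea of a syntax check concrete, here is a rough Python sketch that parses each sample with the standard-library `ast` module and reports the ones that fail. It is not the implementation of `bigcodebench.syncheck`; it assumes each JSONL record carries `task_id` and `solution` fields, and `find_syntax_errors` is a hypothetical helper name.

```python
import ast
import json
from typing import Iterable, List, Tuple


def find_syntax_errors(jsonl_lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (task_id, error message) for every sample that fails to parse."""
    errors = []
    for line in jsonl_lines:
        if not line.strip():
            continue  # tolerate blank lines
        sample = json.loads(line)
        try:
            ast.parse(sample["solution"])
        except SyntaxError as exc:
            errors.append((sample["task_id"], f"line {exc.lineno}: {exc.msg}"))
    return errors


# Two made-up samples: the second is missing the colon after the signature.
samples = [
    json.dumps({"task_id": "BigCodeBench/0",
                "solution": "def task_func(x):\n    return x * 2\n"}),
    json.dumps({"task_id": "BigCodeBench/1",
                "solution": "def task_func(x)\n    return x\n"}),
]
for task_id, message in find_syntax_errors(samples):
    print(f"{task_id}: {message}")
```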

--------------------------------

### Syntax Check of BigCodeBench Samples (Directories)

Source: https://github.com/bigcode-project/bigcodebench/blob/main/ADVANCED_USAGE.md

This command uses `bigcodebench.syncheck` to validate the syntax of code samples stored in directories. It is useful for checking code generated and organized in a file structure.

```bash
# If you are storing codes in directories:
bigcodebench.syncheck --samples /path/to/vicuna-[??]b_temp_[??]
```

--------------------------------

### Upgrade BigCodeBench Package (Bash)

Source: https://github.com/bigcode-project/bigcodebench/blob/main/ADVANCED_USAGE.md

Upgrades the `bigcodebench` Python package to the latest released version, picking up the most recent features and bug fixes.

```bash
pip install bigcodebench --upgrade
```

--------------------------------

### BigCodeBench Docker with HuggingFace Token Authentication

Source: https://github.com/bigcode-project/bigcodebench/blob/main/ADVANCED_USAGE.md

This command demonstrates how to run BigCodeBench within a Docker container when using gated or private HuggingFace models. It requires passing the Hugging Face Hub token as an environment variable.

```bash
docker run -e HUGGING_FACE_HUB_TOKEN=$token -v $(pwd):/app -t bigcodebench/bigcodebench-generate:latest # omit other arguments
```

--------------------------------

### Clean Up BigCodeBench Processes and Temporary Files

Source: https://github.com/bigcode-project/bigcodebench/blob/main/ADVANCED_USAGE.md

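This script is recommended for cleaning up the environment after a BigCodeBench evaluation. It terminates the current user's running processes whose command name contains `bigcodebench`, then deletes everything under `/tmp`, including files that belong to other programs, so run it only where that is safe.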

```bash
pids=$(ps -u $(id -u) -o pid,comm | grep 'bigcodebench' | awk '{print $1}'); if [ -n "$pids" ]; then echo $pids | xargs -r kill; fi;
rm -rf /tmp/*
```

--------------------------------

### Run BigCodeBench Evaluation with Docker

Source: https://github.com/bigcode-project/bigcodebench/blob/main/ADVANCED_USAGE.md

This command runs the BigCodeBench evaluation using Docker. It mounts the current directory to the container and specifies parameters for execution mode, data split, subset, and sample file. The `--check-gt-only` flag is used to only verify ground truths.

```bash
docker run -v $(pwd):/app bigcodebench/bigcodebench-evaluate:latest --execution local --split [complete|instruct] --subset [full|hard] --samples samples-sanitized-calibrated.jsonl --check-gt-only
```

--------------------------------

### Customize Pass@k Metric

Source: https://github.com/bigcode-project/bigcodebench/blob/main/ADVANCED_USAGE.md

This command-line flag allows users to customize the 'k' value for the Pass@k metric. Multiple 'k' values can be specified as a comma-separated string, for example, to evaluate Pass@1 and Pass@100.

```bash
--pass_k 1,100
```
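
For readers unfamiliar with the metric, the sketch below computes Pass@k for a single task using the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples and c the number that pass. That BigCodeBench uses exactly this estimator is an assumption, and `pass_at_k` and `parse_pass_k` are hypothetical helper names, not part of the package.

```python
from math import comb
from typing import List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that k samples drawn
    # without replacement contain no correct solution.
    return 1.0 - comb(n - c, k) / comb(n, k)


def parse_pass_k(value: str) -> List[int]:
    """Turn a comma-separated string such as "1,100" into [1, 100]."""
    return [int(part) for part in value.split(",") if part.strip()]


# Hypothetical task: 10 samples generated, 3 of them pass all tests.
for k in parse_pass_k("1,5,10"):
    print(f"pass@{k}: {pass_at_k(10, 3, k):.3f}")
```

Note that k cannot exceed the number of samples per task, so a value such as Pass@100 only makes sense when at least 100 samples were generated.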

--------------------------------

### Inspect Failed BigCodeBench Samples

Source: https://github.com/bigcode-project/bigcodebench/blob/main/ADVANCED_USAGE.md

Inspects the failed samples from a BigCodeBench evaluation. It requires the evaluation results file along with the data split and subset that were evaluated. Adding `--in_place` re-runs the inspection in place.

```bash
bigcodebench.inspect --eval_results sample-sanitized-calibrated_eval_results.json --split complete --subset hard
bigcodebench.inspect --eval_results sample-sanitized-calibrated_eval_results.json --split complete --subset hard --in_place
```

--------------------------------

### Implement a Custom Model Provider for BigCodeBench

Source: https://context7.com/bigcode-project/bigcodebench/llms.txt

This Python code demonstrates how to create a custom model backend for BigCodeBench by subclassing `DecoderBase`. It includes methods for initializing the model, generating code completions, and specifying whether the model performs direct completion. The final call to `make_model` assumes a `custom` backend has already been wired up to this class; the registration step itself is not shown.

```python
from bigcodebench.provider import DecoderBase
from typing import List

class CustomLLM(DecoderBase):
    def __init__(self, name: str, subset: str, split: str, **kwargs):
        super().__init__(name, subset, split, **kwargs)
        # Initialize your model here
        self.model = self.load_model()

    def load_model(self):
        # Load your custom model (YourModelClass is a placeholder for your own class)
        return YourModelClass(self.name)

    def codegen(self, prompts: List[str], do_sample: bool = True,
                num_samples: int = 1) -> List[List[str]]:
        """Generate code completions for given prompts."""
        all_outputs = []
        for prompt in prompts:
            outputs = []
            for _ in range(num_samples):
                # Generate code with your model
                generated = self.model.generate(
                    prompt,
                    max_tokens=self.max_new_tokens,
                    temperature=self.temperature if do_sample else 0.0
                )
                outputs.append(generated)
            all_outputs.append(outputs)
        return all_outputs

    def is_direct_completion(self) -> bool:
        """Return True if model does direct completion (base models)."""
        return self.direct_completion

# Use the custom provider (assumes a "custom" backend has been registered with make_model)
from bigcodebench.provider import make_model

model = make_model(
    model="custom/my-model",
    backend="custom",
    subset="full",
    split="instruct",
    temperature=0.7,
    max_new_tokens=1280
)
```

--------------------------------

### Evaluate Solutions using E2B Sandbox with BigCodeBench

Source: https://context7.com/bigcode-project/bigcodebench/llms.txt

This command evaluates code using a secure remote E2B sandbox. It requires setting the `E2B_API_KEY` environment variable and specifies the execution environment, dataset splits, and parallel processing configuration. The E2B sandbox offers network isolation, resource limits, and a clean environment for each execution.

```bash
# Set up E2B API key
export E2B_API_KEY=your-e2b-api-key

# Evaluate using E2B sandbox
bigcodebench.evaluate \
  --execution e2b \
  --split instruct \
  --subset hard \
  --samples bcb_results/samples-sanitized-calibrated.jsonl \
  --e2b_endpoint bigcodebench_evaluator \
  --parallel 8 \
  --calibrated
```

--------------------------------

### Evaluate Models using Google Gemini API

Source: https://context7.com/bigcode-project/bigcodebench/llms.txt

This command evaluates models using the Google Gemini API. It requires a GOOGLE_API_KEY to be set in the environment. The evaluation uses the 'gemini-2.0-flash-exp' model, 'google' backend, 'instruct' split, 'hard' subset, and 'gradio' execution.

```bash
export GOOGLE_API_KEY=your-key
bigcodebench.evaluate \
  --model gemini-2.0-flash-exp \
  --backend google \
  --split instruct \
  --subset hard \
  --execution gradio
```

--------------------------------

### Evaluate Generated Code Samples with BigCodeBench

Source: https://github.com/bigcode-project/bigcodebench/blob/main/README.md

Evaluates generated code samples using the BigCodeBench CLI. It allows specifying the model, execution backend, data split, subset, and inference backend, enabling flexible and customized evaluation workflows.

```bash
bigcodebench.evaluate \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --execution [e2b|gradio|local] \
  --split [complete|instruct] \
  --subset [full|hard] \
  --backend [vllm|openai|anthropic|google|mistral|hf|hf-inference]
```

--------------------------------

### Evaluate Models using OpenAI API

Source: https://context7.com/bigcode-project/bigcodebench/llms.txt

This command evaluates models using the OpenAI API. It requires an OPENAI_API_KEY to be set in the environment. The evaluation uses the 'gpt-4-turbo' model, 'openai' backend, 'instruct' split, 'hard' subset, and 'gradio' execution.

```bash
export OPENAI_API_KEY=sk-your-key
bigcodebench.evaluate \
  --model gpt-4-turbo \
  --backend openai \
  --split instruct \
  --subset hard \
  --execution gradio
```

--------------------------------

### Generate Solutions for a Range of Tasks

Source: https://context7.com/bigcode-project/bigcodebench/llms.txt

This command generates solutions for a specified range of tasks (0-50). It uses the 'vllm' backend and 'greedy' decoding strategy. The generation is for the 'instruct' split and 'full' subset.

```bash
# Generate solutions for a range of tasks using id_range
bigcodebench.generate \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --split instruct \
  --subset full \
  --backend vllm \
  --id_range 0-50 \
  --greedy

# This generates solutions only for the first 50 tasks
```
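
The comment above says a range of 0-50 covers the first 50 tasks, which implies a half-open interval (IDs 0 through 49). The sketch below illustrates that reading in plain Python; the half-open behavior is inferred from that comment only, and `task_number` and `select_id_range` are hypothetical helper names.

```python
from typing import Iterable, List


def task_number(task_id: str) -> int:
    """Extract 7 from "BigCodeBench/7"."""
    return int(task_id.rsplit("/", 1)[1])


def select_id_range(task_ids: Iterable[str], low: int, high: int) -> List[str]:
    """Keep task IDs whose number satisfies low <= number < high (assumed half-open)."""
    return [task_id for task_id in task_ids if low <= task_number(task_id) < high]


# Made-up list of 100 task IDs in the BigCodeBench/<n> format.
all_ids = [f"BigCodeBench/{i}" for i in range(100)]
selected = select_id_range(all_ids, 0, 50)
print(len(selected), selected[0], selected[-1])
```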

--------------------------------

### Run BigCodeBench Evaluation Using Docker

Source: https://context7.com/bigcode-project/bigcodebench/llms.txt

These commands demonstrate how to run BigCodeBench evaluations within Docker containers for consistent and reproducible environments. The first command pulls the evaluation image and runs a local evaluation. The second command shows how to generate code using a specific model within Docker, including GPU support and token configuration.

```bash
# Pull the evaluation image
docker pull bigcodebench/bigcodebench-evaluate:latest

# Run evaluation in Docker with local samples
docker run \
  -v $(pwd):/app \
  bigcodebench/bigcodebench-evaluate:latest \
  --execution local \
  --split instruct \
  --subset full \
  --samples samples-sanitized-calibrated.jsonl \
  --parallel 8 \
  --min_time_limit 1

# Generate code using Docker with GPU support
docker run \
  --gpus '"device=0,1"' \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  -v $(pwd):/app \
  bigcodebench/bigcodebench-generate:latest \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --split complete \
  --subset hard \
  --backend vllm \
  --tp 2 \
  --greedy

# Results are saved to the mounted /app directory
```

--------------------------------

### Set Google API Key for BigCodeBench

Source: https://github.com/bigcode-project/bigcodebench/blob/main/README.md

Configures the API key for accessing Google's AI models, such as Gemini, for use in BigCodeBench evaluations.

```bash
export GOOGLE_API_KEY=<your_google_api_key>
```

--------------------------------

### Set Hugging Face Inference API Key

Source: https://github.com/bigcode-project/bigcodebench/blob/main/README.md

Sets the API key for the Hugging Face Serverless Inference API, allowing BigCodeBench to utilize models hosted on Hugging Face for evaluation.

```bash
export HF_INFERENCE_API_KEY=<your_hf_api_key>
```

--------------------------------

### Evaluate Models using Hugging Face Inference API

Source: https://context7.com/bigcode-project/bigcodebench/llms.txt

This command evaluates models using the Hugging Face Inference API. It requires an HF_INFERENCE_API_KEY to be set in the environment. The evaluation uses the 'meta-llama/Meta-Llama-3.1-70B-Instruct' model, 'hf-inference' backend, 'instruct' split, 'hard' subset, and 'gradio' execution.

```bash
export HF_INFERENCE_API_KEY=hf_your-key
bigcodebench.evaluate \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --backend hf-inference \
  --split instruct \
  --subset hard \
  --execution gradio
```

--------------------------------

### Evaluate Models using Anthropic API

Source: https://context7.com/bigcode-project/bigcodebench/llms.txt

This command evaluates models using the Anthropic API. It requires an ANTHROPIC_API_KEY to be set in the environment. The evaluation uses the 'claude-3-5-sonnet-20241022' model, 'anthropic' backend, 'instruct' split, 'hard' subset, and 'gradio' execution.

```bash
export ANTHROPIC_API_KEY=sk-ant-your-key
bigcodebench.evaluate \
  --model claude-3-5-sonnet-20241022 \
  --backend anthropic \
  --split instruct \
  --subset hard \
  --execution gradio
```

--------------------------------

### Evaluate Models using Mistral API

Source: https://context7.com/bigcode-project/bigcodebench/llms.txt

This command evaluates models using the Mistral API. It requires a MISTRAL_API_KEY to be set in the environment. The evaluation uses the 'mistral-large-latest' model, 'mistral' backend, 'instruct' split, 'hard' subset, and 'gradio' execution.

```bash
export MISTRAL_API_KEY=your-key
bigcodebench.evaluate \
  --model mistral-large-latest \
  --backend mistral \
  --split instruct \
  --subset hard \
  --execution gradio
```

--------------------------------

### Set OpenAI API Key for BigCodeBench

Source: https://github.com/bigcode-project/bigcodebench/blob/main/README.md

Sets the API key for accessing OpenAI services, which can be used as a backend for code generation or evaluation within BigCodeBench.

```bash
export OPENAI_API_KEY=<your_openai_api_key>
```

--------------------------------

### Set E2B API Key for BigCodeBench

Source: https://github.com/bigcode-project/bigcodebench/blob/main/README.md

Configures the necessary API key for using the E2B sandbox environment within BigCodeBench. This is required for evaluations that leverage E2B for code execution.

```bash
export E2B_API_KEY=<your_e2b_api_key>
```

--------------------------------

### Inspect Failed Test Cases with BigCodeBench

Source: https://context7.com/bigcode-project/bigcodebench/llms.txt

This command generates a detailed report for failed test cases from evaluation results. It creates an `inspect/` directory containing lists of failed tasks, generated solutions for these tasks, and a categorized analysis of error types. The `--in_place` flag updates the report directly.

```bash
# Generate detailed failure report
bigcodebench.inspect \
  --eval_results bcb_results/model-output-sanitized-calibrated_eval_results.json \
  --split complete \
  --subset hard

# Creates inspect/ directory with:
# - inspect/failed_tasks.json: List of failed task IDs
# - inspect/BigCodeBench_X.py: Generated solutions for each failed task
# - inspect/error_analysis.json: Categorized error types

# Re-run inspection to update results in place
bigcodebench.inspect \
  --eval_results bcb_results/model-output-sanitized-calibrated_eval_results.json \
  --split complete \
  --subset hard \
  --in_place
```

--------------------------------

### End-to-End Evaluation with bigcodebench.evaluate

Source: https://context7.com/bigcode-project/bigcodebench/llms.txt

Performs end-to-end evaluation, combining code generation and execution for supported model backends. Cloud backends require the corresponding API key. The run writes the sanitized samples as a JSONL file plus JSON files with the evaluation results and pass@k metrics. Execution can run remotely, for example via Gradio.

```bash
# Evaluate GPT-4 on the instruct split using remote Gradio execution
export OPENAI_API_KEY=sk-your-key-here

bigcodebench.evaluate \
  --model gpt-4 \
  --split instruct \
  --subset hard \
  --backend openai \
  --execution gradio \
  --temperature 0.0 \
  --n_samples 1 \
  --max_new_tokens 1280

# Output files in bcb_results/:
# - gpt-4--main--bigcodebench-hard-instruct--openai-0.0-1-sanitized_calibrated.jsonl
# - gpt-4--main--bigcodebench-hard-instruct--openai-0.0-1-sanitized_calibrated_eval_results.json
# - gpt-4--main--bigcodebench-hard-instruct--openai-0.0-1-sanitized_calibrated_pass_at_k.json

# Console output:
# BigCodeBench-Instruct (Hard)
# Groundtruth pass rate: 1.000
# pass@1: 0.568

```

--------------------------------

### Programmatic Dataset Access with get_bigcodebench

Source: https://context7.com/bigcode-project/bigcodebench/llms.txt

Loads benchmark tasks programmatically for custom evaluation pipelines using Python. The `get_bigcodebench` function allows loading the full dataset or specific subsets like 'hard'. It provides access to task details including prompts, entry points, canonical solutions, and test cases.

```python
from bigcodebench.data import get_bigcodebench

# Load the full benchmark dataset
problems = get_bigcodebench(subset="full")

# Access a specific task
task = problems["BigCodeBench/0"]
print(f"Task ID: {task['task_id']}")
print(f"Entry Point: {task['entry_point']}")
print(f"Complete Prompt:\n{task['complete_prompt']}")
print(f"Instruct Prompt:\n{task['instruct_prompt']}")
print(f"Canonical Solution:\n{task['canonical_solution']}")
print(f"Test Cases:\n{task['test']}")

# Load only the harder subset
hard_problems = get_bigcodebench(subset="hard")
print(f"Hard subset contains {len(hard_problems)} tasks")

# Example output (illustrative only; the actual content of BigCodeBench/0 may differ):
# Task ID: BigCodeBench/0
# Entry Point: task_func
# Complete Prompt:
# import numpy as np
# def task_func(arr: np.ndarray) -> np.ndarray:
#     """
#     Normalize a numpy array to have zero mean and unit variance.
#     ...
# Canonical Solution:
# def task_func(arr):
#     return (arr - np.mean(arr)) / np.std(arr)

```

--------------------------------

### Set Anthropic API Key for BigCodeBench

Source: https://github.com/bigcode-project/bigcodebench/blob/main/README.md

Configures the API key for interacting with Anthropic's models, enabling their use within the BigCodeBench evaluation framework.

```bash
export ANTHROPIC_API_KEY=<your_anthropic_api_key>
```

--------------------------------

### Set Mistral API Key for BigCodeBench

Source: https://github.com/bigcode-project/bigcodebench/blob/main/README.md

Sets the API key required to use Mistral AI models via their API within the BigCodeBench evaluation process.

```bash
export MISTRAL_API_KEY=<your_mistral_api_key>
```

--------------------------------

### Code Sanitization with bigcodebench.sanitize

Source: https://context7.com/bigcode-project/bigcodebench/llms.txt

Post-processes LLM-generated code to extract relevant functions and remove extraneous content like print statements, test code, and markdown blocks. The `--calibrate` flag enables calibration, and `--parallel` specifies the number of parallel processes for faster sanitization. It outputs a sanitized JSONL file.

```bash
# Sanitize a JSONL file of generated solutions
bigcodebench.sanitize \
  --samples bcb_results/model-output.jsonl \
  --calibrate \
  --parallel 32

# Creates: bcb_results/model-output-sanitized-calibrated.jsonl
# The sanitizer removes:
# - Extra print statements
# - Test code after function definitions
# - Unused helper functions
# - Markdown code blocks

# Before sanitization:
# ```python
# def task_func(x):
#     return x * 2
# print(task_func(5))  # test
# ```

# After sanitization:
# def task_func(x):
#     return x * 2

```

--------------------------------

### Local Code Generation with bigcodebench.generate

Source: https://context7.com/bigcode-project/bigcodebench/llms.txt

Generates code solutions from an LLM without executing them, using specified backends like vLLM. Requires setting up CUDA_VISIBLE_DEVICES for multi-GPU usage and outputs a JSONL file containing task IDs and generated solutions. Supports resuming interrupted generation processes.

```bash
# Generate solutions using vLLM backend with a local model
export CUDA_VISIBLE_DEVICES=0,1

bigcodebench.generate \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --split complete \
  --subset full \
  --backend vllm \
  --tp 2 \
  --temperature 0.8 \
  --n_samples 10 \
  --bs 4 \
  --max_new_tokens 1280 \
  --resume

# Generates: bcb_results/meta-llama--Meta-Llama-3.1-8B-Instruct--main--bigcodebench-complete--vllm-0.8-10-sanitized_calibrated.jsonl
# Sample output format:
# {"task_id": "BigCodeBench/0", "solution": "import numpy as np\ndef task_func(...):\n    ...", "raw_solution": "..."}
# {"task_id": "BigCodeBench/1", "solution": "import pandas as pd\ndef task_func(...):\n    ...", "raw_solution": "..."}

```

--------------------------------

### Selective Task Evaluation

Source: https://context7.com/bigcode-project/bigcodebench/llms.txt

This command evaluates only specific tasks identified by their IDs. It uses a local execution backend and specifies the 'complete' split and 'full' subset. The tasks to be evaluated are 'BigCodeBench/10', 'BigCodeBench/11', and 'BigCodeBench/12'.

```bash
# Evaluate only tasks 10, 11, and 12
bigcodebench.evaluate \
  --samples model-output-sanitized-calibrated.jsonl \
  --execution local \
  --split complete \
  --subset full \
  --selective_evaluate "BigCodeBench/10,BigCodeBench/11,BigCodeBench/12"
```
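
The same selection can be reproduced offline, for instance to preview which records a selective run would cover. The sketch below filters JSONL records by a comma-separated list of task IDs. It mirrors the format of the flag value only, not the package's internal logic, and `filter_samples` is a hypothetical helper name.

```python
import json
from typing import Iterable, List


def filter_samples(jsonl_lines: Iterable[str], selective_evaluate: str) -> List[dict]:
    """Keep only records whose task_id is named in the comma-separated string."""
    wanted = {task_id.strip() for task_id in selective_evaluate.split(",")
              if task_id.strip()}
    selected = []
    for line in jsonl_lines:
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        if record["task_id"] in wanted:
            selected.append(record)
    return selected


# Made-up records for tasks 9 through 13.
lines = [json.dumps({"task_id": f"BigCodeBench/{i}", "solution": "pass"})
         for i in range(9, 14)]
kept = filter_samples(lines, "BigCodeBench/10,BigCodeBench/11,BigCodeBench/12")
print([record["task_id"] for record in kept])
```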

--------------------------------

### Syntax Validation with bigcodebench.syncheck

Source: https://context7.com/bigcode-project/bigcodebench/llms.txt

Verifies the Python syntax validity of generated code within a JSONL file before evaluation. It flags tasks with syntax errors, providing line numbers and the specific error encountered, which is crucial for filtering out invalid code submissions.

```bash
# Check syntax of all solutions in a JSONL file
bigcodebench.syncheck --samples bcb_results/model-output-sanitized.jsonl

# Example output for files with syntax errors:
# Task BigCodeBench/15 has syntax error:
#   File "<string>", line 5
#     def task_func(x)
#                     ^
# SyntaxError: invalid syntax
# 
# Task BigCodeBench/42 has syntax error:

```