### Setup Environment for StructEval

Source: https://github.com/c-box/structeval/blob/main/struct_benchmark/README.md

Installs necessary dependencies for StructEval evaluation using conda and pip. Ensure you are in the 'struct_benchmark' directory.

```bash
cd struct_benchmark
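conda create --name structbench python=3.10 pytorch torchvision pytorch-cuda -c nvidia -c pytorch -y
# Activate the new environment so pip installs into it rather than the base environment
conda activate structbench
pip install -e 'git+https://github.com/c-box/opencompass.git#egg=opencompass'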
```

--------------------------------

### Full Generation Pipeline Example

Source: https://context7.com/c-box/structeval/llms.txt

A complete end-to-end example demonstrating the sequence of running the Bloom generation, concept generation, and data combination steps using bash convenience scripts. Set the BENCHMARK and SPLIT variables accordingly.

```bash
cd struct_generate

BENCHMARK=demo   # use "demo" for a quick test with provided sample data
SPLIT=test

# Step 1: Extract test objectives and retrieve Wikipedia evidence
bash scripts/run_bloom_generate.bash $BENCHMARK $SPLIT
# → runs topic_extract.py then bloom_generation.py

# Step 2: Extract key concepts and generate concept questions
bash scripts/run_concept_generation.bash $BENCHMARK $SPLIT
# → runs concept_generation.py (uncomment the concept_extract.py step in the script when needed)

# Step 3: Combine and filter to produce the final benchmark
bash scripts/run_data_combine.bash $BENCHMARK $SPLIT
# → outputs to struct_data/demo/struct_test_gpt-4o-mini.json

# A complete running example with all intermediate files is at:
# processed_data/example/
```

--------------------------------

### Data Format Example

Source: https://github.com/c-box/structeval/blob/main/struct_generate/README.md

Example JSON format for input data. Ensure your data files are named `0_{split}_with_idx.json` and placed in the appropriate directory.

```json
{"question": "Which of the following is true regarding reflexes?", "subject": "clinical_knowledge", "choices": ["A positive babinski reflex is the same as a normal flexor response in the assessment of the plantar reflex", "An extensor plantar response indicates a lower motor neurone lesion", "The root value of the ankle reflex is S1", "The root value of the knee reflex is L1, L2"], "answer": 2, "idx": 0}
```

--------------------------------

### Prepare Environment with Conda

Source: https://github.com/c-box/structeval/blob/main/struct_generate/README.md

Steps to set up the Conda environment for StructEval. Ensure you have Anaconda installed.

```bash
mkdir ~/anaconda3/envs/structeval
tar -xzvf asset/structeval.tar.gz -C ~/anaconda3/envs/structeval
```

```bash
conda info -e
conda activate structeval
```

--------------------------------

### Seed Instance JSONL Format Example

Source: https://context7.com/c-box/structeval/llms.txt

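This is an example of the JSONL format required for input seed instances. In the file, each line holds one question record with its subject, choices, zero-based answer index, and instance index; the record below is pretty-printed across several lines for readability only.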

```json
{
  "question": "Which of the following is true regarding reflexes?",
  "subject": "clinical_knowledge",
  "choices": [
    "A positive babinski reflex is the same as a normal flexor response in the assessment of the plantar reflex",
    "An extensor plantar response indicates a lower motor neurone lesion",
    "The root value of the ankle reflex is S1",
    "The root value of the knee reflex is L1, L2"
  ],
  "answer": 2,
  "idx": 0
}
```
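A minimal sketch of writing and reading seed instances in this one-record-per-line layout, using only the standard library. The helpers `write_jsonl` and `read_jsonl` and the `REQUIRED_KEYS` check are illustrative assumptions, not part of StructEval, and the choice texts are shortened.

```python
import json
import os
import tempfile

# Fields present in the seed example; treating all of them as required is an assumption.
REQUIRED_KEYS = {"question", "subject", "choices", "answer", "idx"}


def write_jsonl(records, path):
    """Write one compact JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read one JSON object per non-empty line and check its fields."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue
            record = json.loads(line)
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
            if not 0 <= record["answer"] < len(record["choices"]):
                raise ValueError(f"line {line_no}: answer index out of range")
            records.append(record)
    return records


seed = {
    "question": "Which of the following is true regarding reflexes?",
    "subject": "clinical_knowledge",
    "choices": [
        "A positive babinski reflex...",
        "An extensor plantar response...",
        "The root value of the ankle reflex is S1",
        "The root value of the knee reflex is L1, L2",
    ],
    "answer": 2,
    "idx": 0,
}

# Write to a temporary directory so the sketch does not touch real data.
seed_path = os.path.join(tempfile.mkdtemp(), "0_test_with_idx.json")
write_jsonl([seed], seed_path)
loaded = read_jsonl(seed_path)
print(loaded[0]["choices"][loaded[0]["answer"]])
```

Validating on load surfaces a malformed record at its line number instead of failing later in the pipeline.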

--------------------------------

### Configure Models and Datasets for Evaluation

Source: https://github.com/c-box/structeval/blob/main/struct_benchmark/README.md

Python script to import datasets and model configurations for evaluation. This example configures evaluation for MMLU, ARC-Challenge, and OpenbookQA datasets using the Llama3-8B model.

```python
from mmengine.config import read_base

with read_base():
    from ..data_config.struct_arc_challenge.struct_arc_challenge_v1_ppl import struct_arc_challenge_v1_datasets
    from ..data_config.struct_openbook.struct_openbook_v1_ppl import struct_openbookqa_v1_datasets
    from ..data_config.struct_mmlu.struct_mmlu_v1_ppl import struct_mmlu_V1_datasets
    from ..model_configs.hf_llama.hf_llama3_8b import models as hf_llama3_8b_model
    
datasets = [*struct_arc_challenge_v1_datasets, *struct_openbookqa_v1_datasets, *struct_mmlu_V1_datasets]
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
```
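The last line of the config collects every variable whose name ends in `_model` and flattens the lists into one. The sketch below applies the same expression to an explicit dictionary instead of `locals()`; the variable names and their contents are hypothetical stand-ins for the imported config lists.

```python
# Stand-ins for what the imports would bind; names and contents are hypothetical.
namespace = {
    "hf_llama3_8b_model": [{"abbr": "llama-3-8b-hf"}],
    "hf_other_7b_model": [{"abbr": "other-7b-hf"}],
    "struct_mmlu_V1_datasets": [{"abbr": "struct_mmlu_clinical_knowledge"}],
}

# Same expression as in the config: sum(list_of_lists, []) concatenates the lists.
models = sum([v for k, v in namespace.items() if k.endswith("_model")], [])

print([m["abbr"] for m in models])
```

This is why each model import is aliased with a `_model` suffix: adding another model only requires another aliased import, and the dataset lists are ignored because their names do not match.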

--------------------------------

### Setup Conda Environment for StructEval Benchmark Generation

Source: https://context7.com/c-box/structeval/llms.txt

This command sequence sets up the necessary Conda environment for StructEval's benchmark generation module. It involves creating and activating a new environment from a tarball and configuring the OpenAI API key.

```bash
mkdir ~/anaconda3/envs/structeval
tar -xzvf asset/structeval.tar.gz -C ~/anaconda3/envs/structeval
conda activate structeval
```

```python
from openai import OpenAI
import os

os.environ["OPENAI_API_KEY"] = "sk-your-api-key-here"
client = OpenAI(
    base_url="https://api.openai.com/v1",
)
```

--------------------------------

### Elasticsearch Client Usage

Source: https://context7.com/c-box/structeval/llms.txt

This Python script demonstrates how to initialize and use the ESClient to query an Elasticsearch index. It shows examples of searching by entity title and by topic name with boosted title matching.

```python
from common_utils.es_client import ESClient

es = ESClient(config_file="config/es_config.yaml")

# Search by entity title
results = es.search(
    index="wikipedia-monthly-enwiki",
    body={
        "query": {
            "match": {"title": "Ankle jerk reflex"}
        }
    }
)

hit = results["hits"]["hits"][0]["_source"]
print(hit["title"])   # "Ankle jerk reflex"
print(hit["url"])     # "https://en.wikipedia.org/wiki/Ankle_jerk_reflex"
print(hit["text"][:200])  # first 200 characters of the Wikipedia page text

# Search by topic name + description (boosted title match)
results = es.search(
    index="wikipedia-monthly-enwiki",
    body={
        "query": {
            "bool": {
                "should": [
                    {"match": {"title": {"query": "Reflex", "boost": 3}}},
                    {"match": {"text": {"query": "involuntary response stimulus", "boost": 1}}}
                ]
            }
        }
    }
)
```
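The boosted query body can be built by a small helper so that the topic name and description are not hard-coded. This is a sketch only: `build_topic_query` and `first_source` are hypothetical helpers, not part of `ESClient`, and the response shape is assumed to match the `results["hits"]["hits"]` access shown in the example.

```python
def build_topic_query(name, description, title_boost=3, text_boost=1):
    """Return a bool/should query body that favours title matches."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {"title": {"query": name, "boost": title_boost}}},
                    {"match": {"text": {"query": description, "boost": text_boost}}},
                ]
            }
        }
    }


def first_source(results):
    """Return the top hit's _source, or None when nothing matched."""
    hits = results.get("hits", {}).get("hits", [])
    return hits[0]["_source"] if hits else None


body = build_topic_query("Reflex", "involuntary response stimulus")

# Hand-written responses standing in for real search results.
fake_results = {"hits": {"hits": [{"_source": {"title": "Reflex"}}]}}
empty_results = {"hits": {"hits": []}}

print(first_source(fake_results))
print(first_source(empty_results))
```

Checking for an empty hit list avoids the `IndexError` that indexing `[0]` directly would raise when a query matches nothing.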

--------------------------------

### Elasticsearch Index Mapping

Source: https://github.com/c-box/structeval/blob/main/struct_generate/README.md

Example JSON configuration for an Elasticsearch index mapping. This is required for concept-related instance generation based on Wikipedia.

```json
{
  "wikipedia-monthly-enwiki" : {
    "mappings" : {
      "properties" : {
        "id" : {
          "type" : "text"
        },
        "text" : {
          "type" : "text"
        },
        "title" : {
          "type" : "text"
        },
        "url" : {
          "type" : "text"
        }
      }
    }
  }
}
```

--------------------------------

### Topic Extraction Intermediate Output Example

Source: https://context7.com/c-box/structeval/llms.txt

This JSON structure shows the expected intermediate output after running the topic extraction script. It includes the original question data augmented with identified topic information and Wikipedia context.

```json
[
  {
    "question": "Which of the following is true regarding reflexes?",
    "subject": "clinical_knowledge",
    "choices": [...],
    "answer": 2,
    "idx": 0,
    "topic": {"name": "Reflex", "description": "Involuntary physiological response to a stimulus"},
    "topic_match": true,
    "topic_wiki_info": {
      "wiki_id": "25427",
      "wiki_name": "Reflex",
      "wiki_intro": "A reflex, or reflex action, is an involuntary ...",
      "related_content": "..."
    }
  }
]
```
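A short sketch of how such records might be inspected once loaded into Python. It separates records by the `topic_match` flag and collects the matched Wikipedia page names; `split_by_topic_match` is a hypothetical helper, the second record is invented for illustration, and whether later pipeline steps skip unmatched records is an assumption.

```python
def split_by_topic_match(records):
    """Separate records whose topic was matched to a Wikipedia page from the rest."""
    matched = [r for r in records if r.get("topic_match")]
    unmatched = [r for r in records if not r.get("topic_match")]
    return matched, unmatched


# Trimmed records in the shape shown above (JSON true/false become True/False).
records = [
    {
        "idx": 0,
        "topic": {"name": "Reflex"},
        "topic_match": True,
        "topic_wiki_info": {"wiki_id": "25427", "wiki_name": "Reflex"},
    },
    {
        "idx": 1,
        "topic": {"name": "Hypothetical unmatched topic"},
        "topic_match": False,
    },
]

matched, unmatched = split_by_topic_match(records)
wiki_names = {r["idx"]: r["topic_wiki_info"]["wiki_name"] for r in matched}

print(wiki_names)
print([r["idx"] for r in unmatched])
```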

--------------------------------

### Format Seed Questions for Prompts

Source: https://context7.com/c-box/structeval/llms.txt

The `build_example` function formats a dictionary into a string suitable for prompt construction. Use `with_answer=True` to include the correct answer.

```python
from common_utils.utils import build_example

# Format a seed question for prompt construction
example_str = build_example(
    data[0],
    with_answer=True,   # include the correct answer
    with_explain=False  # omit explanation field
)
# Output:
# "Question: Which of the following is true regarding reflexes?
#  A. A positive babinski reflex...
#  B. An extensor plantar response...
#  C. The root value of the ankle reflex is S1
#  D. The root value of the knee reflex is L1, L2
#  Answer: C. The root value of the ankle reflex is S1"
```
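To make the output layout concrete, here is a rough re-creation of the formatting under stated assumptions. `format_example` is a hypothetical function written for this illustration; it is not the repository's `build_example`, and it ignores the `with_explain` option.

```python
from string import ascii_uppercase


def format_example(item, with_answer=True):
    """Render a question dict as 'Question / A. / B. / ... / Answer' lines."""
    lines = [f"Question: {item['question']}"]
    for letter, choice in zip(ascii_uppercase, item["choices"]):
        lines.append(f"{letter}. {choice}")
    if with_answer:
        idx = item["answer"]  # zero-based index into choices
        lines.append(f"Answer: {ascii_uppercase[idx]}. {item['choices'][idx]}")
    return "\n".join(lines)


item = {
    "question": "Which of the following is true regarding reflexes?",
    "choices": [
        "A positive babinski reflex...",
        "An extensor plantar response...",
        "The root value of the ankle reflex is S1",
        "The root value of the knee reflex is L1, L2",
    ],
    "answer": 2,
}

formatted = format_example(item)
print(formatted)
```

The answer index `2` maps to the letter `C`, which is how the zero-based `answer` field in the seed data lines up with the lettered options in the prompt.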

--------------------------------

### Run StructEval Evaluations

Source: https://context7.com/c-box/structeval/llms.txt

These bash commands demonstrate how to execute various evaluation scenarios using the `run.py` script. They cover instruct/chat model evaluation on StructMMLU, base model evaluation on all StructEval benchmarks in PPL mode, and specific benchmark evaluations.

```bash
cd struct_benchmark

# Evaluate a single model on StructMMLU (instruct/chat models)
python run.py eval_config/eval_struct_mmlu_v1_instruct.py \
    -w output/struct_mmlu_v1_instruct
# Results saved to: struct_benchmark/output/struct_mmlu_v1_instruct/

# Evaluate a base model on all 3 StructEval benchmarks (PPL mode)
python run.py eval_config/eval_struct_all_v1_ppl.py \
    -w output/struct_all_v1_ppl
# Results saved to: struct_benchmark/output/struct_all_v1_ppl/

# Evaluate on ARC-Challenge only
python run.py eval_config/eval_struct_arc_challenge_v1_ppl.py \
    -w output/struct_arc_challenge_v1_ppl

# Evaluate on OpenBookQA only
python run.py eval_config/eval_struct_openbookqa_v1_ppl.py \
    -w output/struct_openbookqa_v1_ppl
```

--------------------------------

### Run Concept-based Generation

Source: https://github.com/c-box/structeval/blob/main/struct_generate/README.md

Execute the script to generate test instances based on essential concepts. The script takes the benchmark name and the split as positional arguments; the example uses the `demo` benchmark and the `test` split.

```bash
bash scripts/run_concept_generation.bash demo test
```

--------------------------------

### Run Bloom-based Generation

Source: https://github.com/c-box/structeval/blob/main/struct_generate/README.md

Execute the script to generate test instances based on Bloom's Taxonomy. Replace `{benchmark_name}` and `{split}` with your specific values.

```bash
bash scripts/run_bloom_generate.bash {benchmark_name} {split}
```

```bash
bash scripts/run_bloom_generate.bash demo test
```

--------------------------------

### Run StructEval Evaluation Command

Source: https://github.com/c-box/structeval/blob/main/README.md

This bash command initiates the evaluation process for StructEval benchmarks after setting up the configuration. Navigate to the 'struct_benchmark' directory before running this command. The evaluation results will be saved in the specified output directory.

```bash
cd struct_benchmark
python run.py eval_config/eval_struct_mmlu_v1_instruct.py -w output/struct_mmlu_v1_instruct
```

--------------------------------

### Run Bloom's Taxonomy Question Generation

Source: https://context7.com/c-box/structeval/llms.txt

This command initiates the question generation process using `bloom_generation.py`. It takes the topic-matched data and generates questions across six cognitive levels of Bloom's Taxonomy, applying a RAG filter for answerability.

```bash
cd struct_generate
python bloom_generation.py --benchmark arc_challenge --split test --model-type gpt-4o-mini
```

--------------------------------

### Run HF InternLM Chat Model Benchmarks

Source: https://github.com/c-box/structeval/blob/main/struct_benchmark/model_configs/hf_internlm/README.md

Execute benchmark tests for the hf_internlm2_chat_7b model against multiple datasets. Use the --debug flag for detailed output.

```bash
python3 run.py --models hf_internlm2_chat_7b --datasets mmlu_gen_4d595a --debug
```

```bash
python3 run.py --models hf_internlm2_chat_7b --datasets cmmlu_gen_c13365 --debug
```

```bash
python3 run.py --models hf_internlm2_chat_7b --datasets ceval_internal_gen_2daf24 --debug
```

```bash
python3 run.py --models hf_internlm2_chat_7b --datasets GaokaoBench_no_subjective_gen_4c31db --debug
```

```bash
python3 run.py --models hf_internlm2_chat_7b --datasets triviaqa_wiki_1shot_gen_eaf81e --debug
```

```bash
python3 run.py --models hf_internlm2_chat_7b --datasets nq_open_1shot_gen_01cf41 --debug
```

```bash
python3 run.py --models hf_internlm2_chat_7b --datasets race_gen_69ee4f --debug
```

```bash
python3 run.py --models hf_internlm2_chat_7b --datasets winogrande_5shot_gen_b36770 --debug
```

```bash
python3 run.py --models hf_internlm2_chat_7b --datasets hellaswag_10shot_gen_e42710 --debug
```

```bash
python3 run.py --models hf_internlm2_chat_7b --datasets bbh_gen_5b92b0 --debug
```

```bash
python3 run.py --models hf_internlm2_chat_7b --datasets gsm8k_gen_1d7fe4 --debug
```

```bash
python3 run.py --models hf_internlm2_chat_7b --datasets math_0shot_gen_393424 --debug
```

```bash
python3 run.py --models hf_internlm2_chat_7b --datasets TheoremQA_5shot_gen_6f0af8 --debug
```

```bash
python3 run.py --models hf_internlm2_chat_7b --datasets humaneval_gen_8e312c --debug
```

```bash
python3 run.py --models hf_internlm2_chat_7b --datasets sanitized_mbpp_mdblock_gen_a447ff --debug
```

```bash
python3 run.py --models hf_internlm2_chat_7b --datasets lcbench_gen_5ff288 --debug
```

```bash
python3 run.py --models hf_internlm2_chat_7b --datasets gpqa_gen_4baadb --debug
```

```bash
python3 run.py --models hf_internlm2_chat_7b --datasets IFEval_gen_3321a3 --debug
```

--------------------------------

### Run Data Combine

Source: https://github.com/c-box/structeval/blob/main/struct_generate/README.md

Execute the script to combine generated test instances from both Bloom-based and Concept-based modules. The script takes the benchmark name and the split as positional arguments; the example uses the `demo` benchmark and the `test` split.

```bash
bash scripts/run_data_combine.bash demo test
```

--------------------------------

### StructEval Benchmark Configuration

Source: https://context7.com/c-box/structeval/llms.txt

This file defines the configuration for evaluating models on StructMMLU benchmarks using the OpenCompass 2.0 framework. It specifies parameters for the evaluation process.

```python
# struct_benchmark/eval_config/eval_struct_mmlu_v1_instruct.py
```

--------------------------------

### Evaluate Llama-3-8b-instruct on StructMMLU

Source: https://github.com/c-box/structeval/blob/main/README.md

This Python config sets up an evaluation of the 'llama-3-8b-instruct' model on the 'StructMMLU' dataset using OpenCompass 2.0. It imports the dataset and model configurations, then collects them into the `datasets` and `models` lists.

```python
from mmengine.config import read_base
with read_base():
    from ..data_config.struct_mmlu.struct_mmlu_v1_instruct import struct_mmlu_V1_datasets
    from ..model_configs.hf_llama.hf_llama3_8b_instruct import models as hf_llama3_8b_instruct_model
datasets = [*struct_mmlu_V1_datasets]
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
```

--------------------------------

### Evaluate Qwen1.5 Base Models

Source: https://github.com/c-box/structeval/blob/main/struct_benchmark/model_configs/qwen/README.md

Commands to evaluate Qwen1.5 base models across various datasets, one dataset per command. Ensure the correct model and dataset identifiers are used.

```bash
python3 run.py --models hf_qwen1_5_7b --datasets mmlu_ppl_ac766d --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b --datasets cmmlu_ppl_041cbf --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b --datasets ceval_internal_ppl_93e5ce --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b --datasets GaokaoBench_no_subjective_gen_d21e37 --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b --datasets triviaqa_wiki_1shot_gen_20a989 --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b --datasets nq_open_1shot_gen_20a989 --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b --datasets race_ppl_abed12 --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b --datasets winogrande_5shot_ll_252f01 --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b --datasets hellaswag_10shot_ppl_59c85e --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b --datasets bbh_gen_98fba6 --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b --datasets gsm8k_gen_17d0dc --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b --datasets math_4shot_base_gen_db136b --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b --datasets TheoremQA_5shot_gen_6f0af8 --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b --datasets deprecated_humaneval_gen_d2537e --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b --datasets sanitized_mbpp_gen_742f0c --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b --datasets lcbench_gen_5ff288 --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b --datasets gpqa_ppl_6bf57a --debug
```

--------------------------------

### Run StructEval Benchmark Generation Scripts

Source: https://github.com/c-box/structeval/blob/main/README.md

Execute these bash commands to generate benchmarks using StructEval. Navigate to the 'struct_generate' directory first. These scripts handle Bloom's Taxonomy generation, concept generation, and data combination.

```bash
cd struct_generate
bash scripts/run_bloom_generate.bash demo test
bash scripts/run_concept_generation.bash demo test
bash scripts/run_data_combine.bash demo test
```

--------------------------------

### Utility Functions

Source: https://context7.com/c-box/structeval/llms.txt

Core helper functions for data I/O, prompt formatting, and token management.

These helpers are defined in `common_utils/utils.py` and are used throughout the pipeline.

```python
from common_utils.utils import (
    load_file, load_json_dic, save_json_dic, save_jsonl_data,
    build_example, get_text_chunks, parse_json_response,
    find_answer, random_select_choice, set_seed, token_count
)

# Load JSONL seed data
data = load_file("processed_data/demo/0_test_with_idx.json")
# Returns: list of dicts, one per line

# Load / save JSON checkpoint files
checkpoint = load_json_dic("processed_data/demo/1_test_with_topic.json")
save_json_dic(checkpoint, "processed_data/demo/1_test_with_topic.json")

# Format a seed question for prompt construction
example_str = build_example(
    data[0],
    with_answer=True,   # include the correct answer
    with_explain=False  # omit explanation field
)
# Output:
# "Question: Which of the following is true regarding reflexes?
#  A. A positive babinski reflex...
#  B. An extensor plantar response...
#  C. The root value of the ankle reflex is S1
#  D. The root value of the knee reflex is L1, L2
#  Answer: C. The root value of the ankle reflex is S1"

# Split a long Wikipedia page text into token-bounded chunks
chunks = get_text_chunks(wiki_text, chunk_size=256)
# Returns: list of strings, each ≤ 256 tokens, split on sentence boundaries

# Parse a GPT response that wraps JSON in a code block
questions = parse_json_response("```json\n[{\"level\": \"remembering\", ...}]\n```")

# Randomly permute answer choices to prevent position bias
shuffled = random_select_choice(question_dict)
# Swaps answer key contents so "answer" field always points to correct option
```

--------------------------------

### StructMMLU PPL Evaluation Configuration

Source: https://context7.com/c-box/structeval/llms.txt

This Python script configures the StructMMLU dataset for perplexity-based evaluation. It defines input/output columns, prompt templates, and specifies the use of ZeroRetriever and PPLInferencer.

```python
# struct_benchmark/data_config/struct_mmlu/struct_mmlu_v1_ppl.py
# PPL (perplexity-based) evaluation config for all 57 MMLU subjects

from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import PPLInferencer
from opencompass.openicl.icl_evaluator import AccwithDetailsEvaluator
from opencompass.datasets import StructMMLU_V1

mmlu_reader_cfg = dict(
    input_columns=['input', 'A', 'B', 'C', 'D'],
    output_column='target',
    train_split='dev',
    test_split='test'
)

# Build dataset config for a single subject
_name = 'clinical_knowledge'
_hint = f'The following are multiple choice questions (with answers) about {_name.replace("_", " ")}.\n\n'
question_overall = '{input}\nA. {A}\nB. {B}\nC. {C}\nD. {D}'

mmlu_infer_cfg = dict(
    ice_template=dict(
        type=PromptTemplate,
        template={opt: f'{question_overall}\nAnswer: {opt}\n' for opt in ['A', 'B', 'C', 'D']},
    ),
    prompt_template=dict(
        type=PromptTemplate,
        template={opt: f'{_hint}</E>{question_overall}\nAnswer: {opt}' for opt in ['A', 'B', 'C', 'D']},
        ice_token='</E>',
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=PPLInferencer),
)

dataset_cfg = dict(
    abbr=f'struct_mmlu_{_name}',
    type=StructMMLU_V1,
    path='./struct_data/struct_mmlu',  # path to generated struct_data
    name=_name,
    reader_cfg=mmlu_reader_cfg,
    infer_cfg=mmlu_infer_cfg,
    eval_cfg=dict(evaluator=dict(type=AccwithDetailsEvaluator)),
)
```
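The config builds one prompt per candidate answer. As we understand perplexity-based evaluation, the model scores each candidate prompt and the option with the lowest perplexity is taken as the prediction. The sketch below renders the four prompts with plain string formatting and picks an answer from made-up token log-probabilities. `render`, `perplexity`, and `pick_answer` are hypothetical helpers, not OpenCompass APIs, and replacing `</E>` with an empty string is an assumption about how a zero-shot retriever behaves.

```python
import math

question_overall = "{input}\nA. {A}\nB. {B}\nC. {C}\nD. {D}"
hint = "The following are multiple choice questions (with answers) about clinical knowledge.\n\n"
options = ["A", "B", "C", "D"]

# One template per candidate answer; "</E>" marks where in-context examples would go.
templates = {opt: f"{hint}</E>{question_overall}\nAnswer: {opt}" for opt in options}

row = {
    "input": "Which of the following is true regarding reflexes?",
    "A": "A positive babinski reflex...",
    "B": "An extensor plantar response...",
    "C": "The root value of the ankle reflex is S1",
    "D": "The root value of the knee reflex is L1, L2",
}


def render(template, row, in_context=""):
    """Fill the question fields, then substitute the in-context example slot."""
    return template.format(**row).replace("</E>", in_context)


def perplexity(token_logprobs):
    """Perplexity is exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pick_answer(logprobs_by_option):
    """Return the option with the lowest perplexity, plus all scores."""
    scores = {opt: perplexity(lp) for opt, lp in logprobs_by_option.items()}
    return min(scores, key=scores.get), scores


prompts = {opt: render(tpl, row) for opt, tpl in templates.items()}

# Made-up log-probabilities standing in for real model outputs.
fake_logprobs = {
    "A": [-2.0, -1.5],
    "B": [-1.8, -2.2],
    "C": [-0.4, -0.6],
    "D": [-1.0, -3.0],
}
predicted, scores = pick_answer(fake_logprobs)
print(prompts["C"])
print(predicted)
```

Because each option gets its own full prompt, this style of evaluation needs one scoring pass per option and no text generation, which is why it suits base models.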

--------------------------------

### Evaluate Qwen1.5 Chat Models

Source: https://github.com/c-box/structeval/blob/main/struct_benchmark/model_configs/qwen/README.md

Commands to evaluate Qwen1.5 chat models across various datasets, one dataset per command. Ensure the correct model and dataset identifiers are used.

```bash
python3 run.py --models hf_qwen1_5_7b_chat --datasets mmlu_gen_4d595a --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b_chat --datasets cmmlu_gen_c13365 --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b_chat --datasets ceval_internal_gen_2daf24 --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b_chat --datasets GaokaoBench_no_subjective_gen_4c31db --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b_chat --datasets triviaqa_wiki_1shot_gen_eaf81e --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b_chat --datasets nq_open_1shot_gen_01cf41 --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b_chat --datasets race_gen_69ee4f --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b_chat --datasets winogrande_5shot_gen_b36770 --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b_chat --datasets hellaswag_10shot_gen_e42710 --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b_chat --datasets bbh_gen_5b92b0 --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b_chat --datasets gsm8k_gen_1d7fe4 --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b_chat --datasets math_0shot_gen_393424 --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b_chat --datasets TheoremQA_5shot_gen_6f0af8 --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b_chat --datasets humaneval_gen_8e312c --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b_chat --datasets sanitized_mbpp_mdblock_gen_a447ff --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b_chat --datasets lcbench_gen_5ff288 --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b_chat --datasets gpqa_gen_4baadb --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b_chat --datasets IFEval_gen_3321a3 --debug
```

--------------------------------

### Run StructEval Evaluation

Source: https://github.com/c-box/structeval/blob/main/struct_benchmark/README.md

Executes the evaluation script using a specified configuration file and output directory. The results will be saved in the designated output path.

```bash
python run.py eval_config/eval_struct_all_v1_ppl.py -w output/struct_all_v1_ppl
```

--------------------------------

### Merge Bloom and Concept Outputs for Final Benchmark

Source: https://context7.com/c-box/structeval/llms.txt

Combines outputs from Bloom-based and concept-based generation pipelines into a final structured benchmark file, retaining only RAG-verified questions. Specify the benchmark and split. This script is also used for MMLU benchmarks.

```bash
cd struct_generate

python data_combine.py \
    --benchmark arc_challenge \
    --split test
```

```bash
# Or use the convenience script:
bash scripts/run_data_combine.bash arc_challenge test
```

```bash
# For MMLU (all 57 subjects):
python data_combine.py --benchmark mmlu --split test
```

--------------------------------

### Evaluate Llama-3-8b-instruct on StructMMLU

Source: https://context7.com/c-box/structeval/llms.txt

This Python script configures and prepares datasets and models for evaluation. It reads base configurations and aggregates dataset and model definitions.

```python
from mmengine.config import read_base

with read_base():
    from ..data_config.struct_mmlu.struct_mmlu_v1_instruct import struct_mmlu_V1_datasets
    from ..model_configs.hf_llama.hf_llama3_8b_instruct import models as hf_llama3_8b_instruct_model

datasets = [*struct_mmlu_V1_datasets]
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
```

--------------------------------

### Evaluate InternLM2 Base Models on Benchmarks

Source: https://github.com/c-box/structeval/blob/main/struct_benchmark/model_configs/hf_internlm/README.md

These commands execute evaluations for the 'hf_internlm2_7b' base model across a wide range of datasets. Ensure the 'run.py' script and specified datasets are available in your environment.

```bash
python3 run.py --models hf_internlm2_7b --datasets mmlu_ppl_ac766d --debug
```

```bash
python3 run.py --models hf_internlm2_7b --datasets cmmlu_ppl_041cbf --debug
```

```bash
python3 run.py --models hf_internlm2_7b --datasets ceval_internal_ppl_93e5ce --debug
```

```bash
python3 run.py --models hf_internlm2_7b --datasets GaokaoBench_no_subjective_gen_d21e37 --debug
```

```bash
python3 run.py --models hf_internlm2_7b --datasets triviaqa_wiki_1shot_gen_20a989 --debug
```

```bash
python3 run.py --models hf_internlm2_7b --datasets nq_open_1shot_gen_20a989 --debug
```

```bash
python3 run.py --models hf_internlm2_7b --datasets race_ppl_abed12 --debug
```

```bash
python3 run.py --models hf_internlm2_7b --datasets winogrande_5shot_ll_252f01 --debug
```

```bash
python3 run.py --models hf_internlm2_7b --datasets hellaswag_10shot_ppl_59c85e --debug
```

```bash
python3 run.py --models hf_internlm2_7b --datasets bbh_gen_98fba6 --debug
```

```bash
python3 run.py --models hf_internlm2_7b --datasets gsm8k_gen_17d0dc --debug
```

```bash
python3 run.py --models hf_internlm2_7b --datasets math_4shot_base_gen_db136b --debug
```

```bash
python3 run.py --models hf_internlm2_7b --datasets TheoremQA_5shot_gen_6f0af8 --debug
```

```bash
python3 run.py --models hf_internlm2_7b --datasets deprecated_humaneval_gen_d2537e --debug
```

```bash
python3 run.py --models hf_internlm2_7b --datasets sanitized_mbpp_gen_742f0c --debug
```

```bash
python3 run.py --models hf_internlm2_7b --datasets lcbench_gen_5ff288 --debug
```

```bash
python3 run.py --models hf_internlm2_7b --datasets gpqa_ppl_6bf57a --debug
```

--------------------------------

### Run Bloom-based Generation with RAG Filtering

Source: https://context7.com/c-box/structeval/llms.txt

Executes the Bloom generation process combined with RAG filtering. Specify the benchmark, split, and model type. The output file includes Bloom questions with RAG verification.

```bash
python bloom_generation.py \
    --benchmark arc_challenge \
    --split test \
    --model-type gpt-4o-mini
```

```bash
bash scripts/run_bloom_generate.bash arc_challenge test
```

--------------------------------

### OpenAI API Wrappers

Source: https://context7.com/c-box/structeval/llms.txt

Single and multithreaded OpenAI API wrappers with retry logic.

`query_gpt` and `multi_query_gpt` are defined in `common_utils/prompt_utils.py`. They retry failed calls up to 30 times with exponential backoff.

```python
from common_utils.prompt_utils import query_gpt, multi_query_gpt, entity_match_multithreaded

# Single synchronous query
response = query_gpt(
    query="What is the capital of France?",
    temp=0.0,            # 0 for deterministic, 0.7 for generation
    model_type="gpt-4o-mini"
)
# Returns: "Paris"

# Parallel multi-query (uses threading; preserves input order)
prompts = [
    "What is the capital of France?",
    "What is the capital of Germany?",
    "What is the capital of Japan?"
]
responses = multi_query_gpt(prompts, temp=0.0, model_type="gpt-4o-mini")
# Returns: ["Paris", "Berlin", "Tokyo"]

# Multithreaded entity matching: determines whether two mentions refer to the same entity
entity_infos = [
    ("Ankle reflex", "a deep tendon reflex",
     "Ankle jerk reflex", "The ankle jerk reflex, also known as..."),
    ("Babinski sign", "a neurological reflex",
     "Plantar reflex", "The plantar reflex is an important..."),
]
matches = entity_match_multithreaded(entity_infos)
# Returns: [True, True]  (list of booleans in input order)
```

--------------------------------

### Run Topic Extraction for Benchmark Generation

Source: https://context7.com/c-box/structeval/llms.txt

This command executes the `topic_extract.py` script to identify the core topic of seed questions. It supports single benchmarks or MMLU, allowing configuration of ranking methods and paragraph retrieval parameters.

```bash
cd struct_generate

# For a single benchmark (e.g., arc_challenge)
python topic_extract.py \
    --benchmark arc_challenge \
    --split test \
    --rank-method bge \
    --para-num 3 \
    --chunk-size 256

# For MMLU (iterates over all 57 subjects automatically)
python topic_extract.py \
    --benchmark mmlu \
    --split test \
    --rank-method bge \
    --para-num 3
```
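The `--chunk-size 256` option suggests that retrieved page text is cut into bounded chunks before paragraphs are ranked. A rough sketch of sentence-boundary chunking follows. It counts whitespace-separated words as a stand-in for tokens, and `chunk_text` is a hypothetical helper, not the repository's `get_text_chunks`.

```python
import re


def chunk_text(text, chunk_size=256, count_tokens=lambda s: len(s.split())):
    """Greedily pack whole sentences into chunks of at most chunk_size tokens.

    A single sentence longer than chunk_size is kept whole as its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and current_len + n > chunk_size:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "One two three. Four five six. Seven eight nine. Ten."
chunks = chunk_text(sample, chunk_size=6)
print(chunks)
```

Passing a real tokenizer's counting function as `count_tokens` would make the budget match model tokens rather than words.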

--------------------------------

### Elasticsearch Client Configuration

Source: https://context7.com/c-box/structeval/llms.txt

This YAML file configures the connection parameters for the Elasticsearch client. It specifies host, username, password, and various connection settings like timeouts and sniffing behavior.

```yaml
# struct_generate/config/es_config.yaml
es_config:
  hosts:
    - "http://localhost:9200"
  username: "elastic"
  password: "your-password"
  timeout: 600
  sniff_on_start: false
  sniff_on_connection_fail: false
  sniff_timeout: 10
  sniffer_timeout: 60
```

--------------------------------

### Extract Key Concepts and Wikipedia Information

Source: https://context7.com/c-box/structeval/llms.txt

Extracts up to 5 important entities from seed questions, retrieves their Wikipedia pages, and uses GPT for entity matching. Requires specifying the benchmark, split, rank-method, and paragraph number. The output file contains extracted entities with matching status and Wikipedia details.

```bash
cd struct_generate

python concept_extract.py \
    --benchmark arc_challenge \
    --split test \
    --rank-method bge \
    --para-num 1
```

--------------------------------

### Generate Concept-Based Multiple-Choice Questions

Source: https://context7.com/c-box/structeval/llms.txt

Generates multiple-choice questions for each matched entity, grounded in Wikipedia content and filtered by RAG. Use this script to test model understanding of concepts. Specify the benchmark, split, and model type. Convenience scripts are also available.

```bash
cd struct_generate

python concept_generation.py \
    --benchmark arc_challenge \
    --split test \
    --model-type gpt-4o-mini
```

```bash
# Or use the convenience script:
bash scripts/run_concept_generation.bash arc_challenge test
```

--------------------------------

### Synchronous GPT API Query

Source: https://context7.com/c-box/structeval/llms.txt

The `query_gpt` function provides a synchronous wrapper for making a single request to the OpenAI API. It allows specifying the query, temperature for randomness, and the model type.

```python
from common_utils.prompt_utils import query_gpt

# Single synchronous query
response = query_gpt(
    query="What is the capital of France?",
    temp=0.0,            # 0 for deterministic, 0.7 for generation
    model_type="gpt-4o-mini"
)
# Returns: "Paris"
```

--------------------------------

### Load and Save JSON Data

Source: https://context7.com/c-box/structeval/llms.txt

Use `load_file` to read JSONL seed data (one dict per line), `load_json_dic` to load a JSON file into a dictionary, and `save_json_dic` to write a dictionary back to a JSON file. The latter two are useful for managing pipeline checkpoints or configuration.

```python
from common_utils.utils import load_file, load_json_dic, save_json_dic, save_jsonl_data

# Load JSONL seed data
data = load_file("processed_data/demo/0_test_with_idx.json")
# Returns: list of dicts, one per line

# Load / save JSON checkpoint files
checkpoint = load_json_dic("processed_data/demo/1_test_with_topic.json")
save_json_dic(checkpoint, "processed_data/demo/1_test_with_topic.json")
```
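For readers without the repository at hand, the behavior described above can be approximated with the standard library alone. The function names `read_jsonl`, `read_json`, and `write_json` are hypothetical stand-ins, not the library's helpers, and the real implementations may differ in details such as encoding or indentation.

```python
import json
import os
import tempfile
from typing import Any, Dict, List


def read_jsonl(path: str) -> List[Dict[str, Any]]:
    """Read a JSON Lines file: one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def read_json(path: str) -> Dict[str, Any]:
    """Read a whole file as a single JSON document."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def write_json(obj: Dict[str, Any], path: str) -> None:
    """Write a dictionary to disk as one JSON document."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(obj, f, ensure_ascii=False, indent=2)


with tempfile.TemporaryDirectory() as tmp:
    # Seed data: JSONL despite the .json extension, as in the pipeline's file names.
    seed_path = os.path.join(tmp, "0_test_with_idx.json")
    with open(seed_path, "w", encoding="utf-8") as f:
        f.write('{"question": "q0", "idx": 0}\n{"question": "q1", "idx": 1}\n')
    seeds = read_jsonl(seed_path)

    # Checkpoint: a single JSON dictionary, written and read back.
    ckpt_path = os.path.join(tmp, "1_test_with_topic.json")
    write_json({"0": {"topic": "reflexes"}}, ckpt_path)
    checkpoint = read_json(ckpt_path)
```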

--------------------------------

### Parallel GPT API Queries

Source: https://context7.com/c-box/structeval/llms.txt

Use `multi_query_gpt` for making multiple GPT API requests concurrently using threading. It preserves the order of responses corresponding to the input prompts.

```python
from common_utils.prompt_utils import multi_query_gpt

# Parallel multi-query (uses threading; preserves input order)
prompts = [
    "What is the capital of France?",
    "What is the capital of Germany?",
    "What is the capital of Japan?"
]
responses = multi_query_gpt(prompts, temp=0.0, model_type="gpt-4o-mini")
# Returns: ["Paris", "Berlin", "Tokyo"]
```
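Order preservation under threading can be illustrated with the standard library. The sketch below is not the implementation of `multi_query_gpt`; it uses `ThreadPoolExecutor.map`, which returns results in input order regardless of which task finishes first. The names `ordered_parallel_map` and `fake_query` are hypothetical, and `fake_query` stands in for a real API call.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Simulated latencies: the first prompt is the slowest to complete.
DELAYS = {"slow": 0.05, "medium": 0.02, "fast": 0.0}


def fake_query(prompt: str) -> str:
    """Stand-in for an API call; sleeps, then returns a canned answer."""
    time.sleep(DELAYS[prompt])
    return f"answer to {prompt}"


def ordered_parallel_map(fn: Callable[[str], str], items: List[str],
                         max_workers: int = 4) -> List[str]:
    """Run fn over items concurrently and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # not in the order in which the tasks complete.
        return list(pool.map(fn, items))


results = ordered_parallel_map(fake_query, ["slow", "medium", "fast"])
print(results)
```

Threads suit this workload because each request spends most of its time waiting on the network rather than computing.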

--------------------------------

### Randomly Select Answer Choices

Source: https://context7.com/c-box/structeval/llms.txt

Use `random_select_choice` to shuffle the order of answer choices in a question dictionary. This helps prevent response bias related to answer position.

```python
from common_utils.utils import random_select_choice

# Randomly permute answer choices to prevent position bias
shuffled = random_select_choice(question_dict)
# The "answer" field is updated so that it still points to the correct option
```
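A minimal sketch of the idea follows. It assumes the question dictionary uses the format shown in the data format example, with a `choices` list and an integer `answer` index. The function name `shuffle_choices` is hypothetical, and the library's own function may behave differently in detail.

```python
import random
from typing import Any, Dict, Optional


def shuffle_choices(question: Dict[str, Any],
                    rng: Optional[random.Random] = None) -> Dict[str, Any]:
    """Return a copy of the question with choices permuted and answer re-indexed."""
    rng = rng or random.Random()
    # order[new_position] = old_position
    order = list(range(len(question["choices"])))
    rng.shuffle(order)
    shuffled = dict(question)  # shallow copy; the input is left unchanged
    shuffled["choices"] = [question["choices"][i] for i in order]
    # The correct option moved to the position where its old index now sits.
    shuffled["answer"] = order.index(question["answer"])
    return shuffled


question = {
    "question": "Which of the following is true regarding reflexes?",
    "choices": ["option A", "option B", "option C", "option D"],
    "answer": 2,
}
result = shuffle_choices(question, random.Random(0))
print(result["choices"], result["answer"])
```

Passing a seeded `random.Random` keeps the permutation reproducible across runs.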

--------------------------------

### Token-Aware Text Chunking

Source: https://context7.com/c-box/structeval/llms.txt

Use `get_text_chunks` to split a long text into smaller segments, ensuring each chunk is within a specified token limit and split at sentence boundaries.

```python
from common_utils.utils import get_text_chunks

# Split a long Wikipedia page text into token-bounded chunks
chunks = get_text_chunks(wiki_text, chunk_size=256)
# Returns: list of strings, each ≤ 256 tokens, split on sentence boundaries
```
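The chunking strategy can be sketched as a greedy pass over sentences. This is an illustration only, not the code behind `get_text_chunks`: the function names are hypothetical, sentences are split with a simple regular expression, and tokens are approximated by whitespace-separated words.

```python
import re
from typing import Callable, List


def whitespace_token_count(text: str) -> int:
    """Crude token estimate: whitespace-separated words (an assumption)."""
    return len(text.split())


def chunk_by_sentences(text: str, chunk_size: int,
                       count_tokens: Callable[[str], int] = whitespace_token_count
                       ) -> List[str]:
    """Greedily pack whole sentences into chunks of at most chunk_size tokens.

    A single sentence longer than chunk_size becomes its own oversized chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Close the current chunk if adding this sentence would exceed the budget.
        if current and current_tokens + n > chunk_size:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One two three. Four five six seven. Eight nine. Ten."
chunks = chunk_by_sentences(text, chunk_size=5)
print(chunks)
```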

--------------------------------

### Calculate Token Count

Source: https://context7.com/c-box/structeval/llms.txt

The `token_count` function in `common_utils.utils` returns the number of tokens in a given text. Only the import is shown here; the tokenizer it relies on is not documented in this snippet.

```python
from common_utils.utils import token_count
```
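Since no usage is given, the following sketch shows what a token counter of this kind might look like. It is an assumption, not the library's code: the name `count_tokens` is hypothetical, the default falls back to counting whitespace-separated words, and a real tokenizer can be supplied through the `encode` argument.

```python
from typing import Callable, Optional, Sequence


def count_tokens(text: str,
                 encode: Optional[Callable[[str], Sequence]] = None) -> int:
    """Count tokens using the given encoder, or words if none is supplied."""
    if encode is None:
        return len(text.split())  # rough approximation without a tokenizer
    return len(encode(text))


# With tiktoken installed, an encoder could be plugged in like this
# (the encoding name is an assumption):
#   import tiktoken
#   enc = tiktoken.get_encoding("cl100k_base")
#   count_tokens("some text", encode=enc.encode)

n_words = count_tokens("The root value of the ankle reflex is S1")
n_chars = count_tokens("S1", encode=list)  # toy encoder: one token per character
print(n_words, n_chars)
```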

--------------------------------

### Retrieve Wikipedia Content by Entity

Source: https://context7.com/c-box/structeval/llms.txt

The `wiki_retrieve_by_entity` function fetches and ranks relevant Wikipedia paragraphs for a given topic. It supports different ranking methods like BGE or GPT and allows configuration of chunk size and number of documents.

```python
from common_utils.wiki_search import wiki_retrieve_by_entity
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import argparse

# Set up args (mirrors CLI arguments in topic_extract.py / concept_extract.py)
args = argparse.Namespace(
    rank_method="bge",       # "bge" or "gpt"
    para_num=1,              # number of top paragraphs to return
    chunk_size=256,          # token size per Wikipedia chunk
    use_openai_chunk=True    # use tiktoken-based chunking
)

# Load BGE reranker (required when rank_method="bge")
bge_model = AutoModelForSequenceClassification.from_pretrained("BAAI/bge-reranker-large").cuda()
bge_tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-reranker-large")
bge_model.eval()

# Retrieve relevant Wikipedia content
wiki_info = wiki_retrieve_by_entity(
    args=args,
    topic_name="Ankle jerk reflex",
    topic_des="A deep tendon reflex elicited by striking the Achilles tendon",
    seed_question="The root value of the ankle reflex is S1",
    bge_model=bge_model,
    beg_tokenizer=bge_tokenizer,
    doc_num=1
)

# Returns:
# {
#   "wiki_id": "1234567",
#   "wiki_name": "Ankle jerk reflex",
#   "wiki_intro": "The ankle jerk reflex, also known as the Achilles reflex...",
#   "wiki_page": "<full page text>",
#   "related_content": "<intro paragraph + top-ranked paragraph>"
# }
```
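To make the shape of `related_content` concrete, the sketch below ranks candidate paragraphs against the seed question and joins the intro with the top `para_num` paragraphs. It is a simplified stand-in, not the library's retrieval code: a toy word-overlap score replaces the BGE or GPT reranker, and the names `overlap_score` and `build_related_content` are hypothetical.

```python
import re
from typing import Callable, List


def overlap_score(query: str, paragraph: str) -> float:
    """Toy relevance score: fraction of query words that occur in the paragraph."""
    q = set(re.findall(r"\w+", query.lower()))
    p = set(re.findall(r"\w+", paragraph.lower()))
    return len(q & p) / len(q) if q else 0.0


def build_related_content(intro: str, paragraphs: List[str], query: str,
                          para_num: int = 1,
                          score: Callable[[str, str], float] = overlap_score) -> str:
    """Join the intro with the para_num highest-scoring paragraphs."""
    ranked = sorted(paragraphs, key=lambda para: score(query, para), reverse=True)
    return "\n".join([intro] + ranked[:para_num])


intro = "The ankle jerk reflex is a deep tendon reflex."
paragraphs = [
    "The reflex was first described in the nineteenth century.",
    "The ankle reflex is mediated by the S1 spinal root.",
    "Absent reflexes can indicate peripheral neuropathy.",
]
related = build_related_content(
    intro, paragraphs, query="The root value of the ankle reflex is S1"
)
print(related)
```

Swapping `score` for a function that calls a reranker model would keep the rest of the logic unchanged.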

--------------------------------

### Wikipedia Retrieval

Source: https://context7.com/c-box/structeval/llms.txt

Retrieves and ranks relevant Wikipedia paragraphs for a given entity.

```APIDOC
## `wiki_retrieve_by_entity` (`common_utils/wiki_search.py`)

Retrieves and ranks the most relevant Wikipedia paragraphs for a given entity name and description, using Elasticsearch for retrieval and either BGE or GPT for passage reranking.
```

--------------------------------

### StructEval Citation

Source: https://github.com/c-box/structeval/blob/main/README.md

BibTeX entry for citing the StructEval paper. Include this in your academic work when referencing the framework.

```bibtex
@misc{cao2024structevaldeepenbroadenlarge,
      title={StructEval: Deepen and Broaden Large Language Model Assessment via Structured Evaluation},
      author={Boxi Cao and Mengjie Ren and Hongyu Lin and Xianpei Han and Feng Zhang and Junfeng Zhan and Le Sun},
      year={2024},
      eprint={2408.03281},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2408.03281},
}
```

--------------------------------

### Multithreaded Entity Matching

Source: https://context7.com/c-box/structeval/llms.txt

The `entity_match_multithreaded` function performs entity matching in parallel using threads. It compares pairs of mentions and their descriptions to determine if they refer to the same entity.

```python
from common_utils.prompt_utils import entity_match_multithreaded

# Multithreaded entity matching — determines if two mentions refer to the same entity
entity_infos = [
    ("Ankle reflex", "a deep tendon reflex",
     "Ankle jerk reflex", "The ankle jerk reflex, also known as..."),
    ("Babinski sign", "a neurological reflex",
     "Plantar reflex", "The plantar reflex is an important..."),
]
matches = entity_match_multithreaded(entity_infos)
# Returns: [True, True]  — list of booleans in input order
```

--------------------------------

### Parse JSON from GPT Responses

Source: https://context7.com/c-box/structeval/llms.txt

The `parse_json_response` function extracts and parses JSON content that might be embedded within a code block in a GPT model's output.

```python
from common_utils.utils import parse_json_response

# Parse a GPT response that wraps JSON in a code block
questions = parse_json_response("```json\n[{\"level\": \"remembering\", ...}]\n```")
```
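The extraction step can be sketched with a regular expression and the standard `json` module. This is an illustration under stated assumptions, not the code behind `parse_json_response`: the name `extract_json` is hypothetical, and the fallback of parsing the whole reply when no code fence is found is a design choice of this sketch.

```python
import json
import re
from typing import Any

FENCE = "`" * 3  # three backticks, built here so the example stays easy to read


def extract_json(response: str) -> Any:
    """Parse JSON from a reply, unwrapping a fenced code block if present."""
    pattern = FENCE + r"(?:json)?\s*(.*?)" + FENCE
    match = re.search(pattern, response, flags=re.DOTALL)
    # Fall back to the whole reply when the model did not use a code fence.
    payload = match.group(1) if match else response
    return json.loads(payload.strip())


reply = (
    "Here are the questions:\n"
    + FENCE + "json\n"
    + '[{"level": "remembering", "question": "sample question"}]\n'
    + FENCE
)
questions = extract_json(reply)
print(questions[0]["level"])
```

`json.loads` raises `json.JSONDecodeError` on malformed output, so callers may want to catch it and retry the query.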
