### Setup Environment for StructEval

Source: https://github.com/c-box/structeval/blob/main/struct_benchmark/README.md

Installs necessary dependencies for StructEval evaluation using conda and pip. Ensure you are in the 'struct_benchmark' directory.

```bash
cd struct_benchmark
conda create --name structbench python=3.10 pytorch torchvision pytorch-cuda -c nvidia -c pytorch -y
pip install -e 'git+https://github.com/c-box/opencompass.git#egg=opencompass'
```

--------------------------------

### Full Generation Pipeline Example

Source: https://context7.com/c-box/structeval/llms.txt

A complete end-to-end example demonstrating the sequence of running the Bloom generation, concept generation, and data combination steps using bash convenience scripts. Set the BENCHMARK and SPLIT variables accordingly.

```bash
cd struct_generate

BENCHMARK=demo   # use "demo" for a quick test with provided sample data
SPLIT=test

# Step 1: Extract test objectives and retrieve Wikipedia evidence
bash scripts/run_bloom_generate.bash $BENCHMARK $SPLIT
# → runs topic_extract.py then bloom_generation.py

# Step 2: Extract key concepts and generate concept questions
bash scripts/run_concept_generation.bash $BENCHMARK $SPLIT
# → runs concept_generation.py (concept_extract.py is commented-in when needed)

# Step 3: Combine and filter to produce the final benchmark
bash scripts/run_data_combine.bash $BENCHMARK $SPLIT
# → outputs to struct_data/demo/struct_test_gpt-4o-mini.json

# A complete running example with all intermediate files is at:
# processed_data/example/
```

--------------------------------

### Data Format Example

Source: https://github.com/c-box/structeval/blob/main/struct_generate/README.md

Example JSON format for input data. Ensure your data files are named `0_{split}_with_idx.json` and placed in the appropriate directory.
```json
{"question": "Which of the following is true regarding reflexes?", "subject": "clinical_knowledge", "choices": ["A positive babinski reflex is the same as a normal flexor response in the assessment of the plantar reflex", "An extensor plantar response indicates a lower motor neurone lesion", "The root value of the ankle reflex is S1", "The root value of the knee reflex is L1, L2"], "answer": 2, "idx": 0}
```

--------------------------------

### Prepare Environment with Conda

Source: https://github.com/c-box/structeval/blob/main/struct_generate/README.md

Steps to set up the Conda environment for StructEval. Ensure you have Anaconda installed.

```bash
mkdir ~/anaconda3/envs/structeval
tar -xzvf asset/structeval.tar.gz -C ~/anaconda3/envs/structeval
```

```bash
conda info -e
conda activate structeval
```

--------------------------------

### Seed Instance JSONL Format Example

Source: https://context7.com/c-box/structeval/llms.txt

This is an example of the JSONL format required for input seed instances. Each line represents a single question with its choices, correct answer, and an index.

```json
{
  "question": "Which of the following is true regarding reflexes?",
  "subject": "clinical_knowledge",
  "choices": [
    "A positive babinski reflex is the same as a normal flexor response in the assessment of the plantar reflex",
    "An extensor plantar response indicates a lower motor neurone lesion",
    "The root value of the ankle reflex is S1",
    "The root value of the knee reflex is L1, L2"
  ],
  "answer": 2,
  "idx": 0
}
```

--------------------------------

### Configure Models and Datasets for Evaluation

Source: https://github.com/c-box/structeval/blob/main/struct_benchmark/README.md

Python script to import datasets and model configurations for evaluation. This example configures evaluation for MMLU, ARC-Challenge, and OpenbookQA datasets using the Llama3-8B model.
```python
from mmengine.config import read_base

with read_base():
    from ..data_config.struct_arc_challenge.struct_arc_challenge_v1_ppl import struct_arc_challenge_v1_datasets
    from ..data_config.struct_openbook.struct_openbook_v1_ppl import struct_openbookqa_v1_datasets
    from ..data_config.struct_mmlu.struct_mmlu_v1_ppl import struct_mmlu_V1_datasets
    from model_configs.hf_llama.hf_llama3_8b import models as hf_llama3_8b_model

datasets = [*struct_arc_challenge_v1_datasets, *struct_openbookqa_v1_datasets, *struct_mmlu_V1_datasets]
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
```

--------------------------------

### Setup Conda Environment for StructEval Benchmark Generation

Source: https://context7.com/c-box/structeval/llms.txt

This command sequence sets up the necessary Conda environment for StructEval's benchmark generation module. It involves creating and activating a new environment from a tarball and configuring the OpenAI API key.

```bash
mkdir ~/anaconda3/envs/structeval
tar -xzvf asset/structeval.tar.gz -C ~/anaconda3/envs/structeval
conda activate structeval
```

```python
from openai import OpenAI
import os

os.environ["OPENAI_API_KEY"] = "sk-your-api-key-here"
client = OpenAI(
    base_url="https://api.openai.com/v1",
)
```

--------------------------------

### Elasticsearch Client Usage

Source: https://context7.com/c-box/structeval/llms.txt

This Python script demonstrates how to initialize and use the ESClient to query an Elasticsearch index. It shows examples of searching by entity title and by topic name with boosted title matching.
```python
from common_utils.es_client import ESClient

es = ESClient(config_file="config/es_config.yaml")

# Search by entity title
results = es.search(
    index="wikipedia-monthly-enwiki",
    body={
        "query": {
            "match": {"title": "Ankle jerk reflex"}
        }
    }
)

hit = results["hits"]["hits"][0]["_source"]
print(hit["title"])       # "Ankle jerk reflex"
print(hit["url"])         # "https://en.wikipedia.org/wiki/Ankle_jerk_reflex"
print(hit["text"][:200])  # full Wikipedia page text

# Search by topic name + description (boosted title match)
results = es.search(
    index="wikipedia-monthly-enwiki",
    body={
        "query": {
            "bool": {
                "should": [
                    {"match": {"title": {"query": "Reflex", "boost": 3}}},
                    {"match": {"text": {"query": "involuntary response stimulus", "boost": 1}}}
                ]
            }
        }
    }
)
```

--------------------------------

### ElasticSearch Index Mapping

Source: https://github.com/c-box/structeval/blob/main/struct_generate/README.md

Example JSON configuration for an ElasticSearch index mapping. This is required for concept-related instance generation based on Wikipedia.

```json
{
  "wikipedia-monthly-enwiki" : {
    "mappings" : {
      "properties" : {
        "id" : { "type" : "text" },
        "text" : { "type" : "text" },
        "title" : { "type" : "text" },
        "url" : { "type" : "text" }
      }
    }
  }
}
```

--------------------------------

### Topic Extraction Intermediate Output Example

Source: https://context7.com/c-box/structeval/llms.txt

This JSON structure shows the expected intermediate output after running the topic extraction script. It includes the original question data augmented with identified topic information and Wikipedia context.
```json
[
  {
    "question": "Which of the following is true regarding reflexes?",
    "subject": "clinical_knowledge",
    "choices": [...],
    "answer": 2,
    "idx": 0,
    "topic": {"name": "Reflex", "description": "Involuntary physiological response to a stimulus"},
    "topic_match": true,
    "topic_wiki_info": {
      "wiki_id": "25427",
      "wiki_name": "Reflex",
      "wiki_intro": "A reflex, or reflex action, is an involuntary ...",
      "related_content": "..."
    }
  }
]
```

--------------------------------

### Format Seed Questions for Prompts

Source: https://context7.com/c-box/structeval/llms.txt

The `build_example` function formats a dictionary into a string suitable for prompt construction. Use `with_answer=True` to include the correct answer.

```python
from common_utils.utils import build_example

# Format a seed question for prompt construction
example_str = build_example(
    data[0],
    with_answer=True,    # include the correct answer
    with_explain=False   # omit explanation field
)
# Output:
# "Question: Which of the following is true regarding reflexes?
# A. A positive babinski reflex...
# B. An extensor plantar response...
# C. The root value of the ankle reflex is S1
# D. The root value of the knee reflex is L1, L2
# Answer: C. The root value of the ankle reflex is S1"
```

--------------------------------

### Run StructEval Evaluations

Source: https://context7.com/c-box/structeval/llms.txt

These bash commands demonstrate how to execute various evaluation scenarios using the `run.py` script. They cover instruct/chat model evaluation on StructMMLU, base model evaluation on all StructEval benchmarks in PPL mode, and specific benchmark evaluations.
```bash
cd struct_benchmark

# Evaluate a single model on StructMMLU (instruct/chat models)
python run.py eval_config/eval_struct_mmlu_v1_instruct.py \
    -w output/struct_mmlu_v1_instruct
# Results saved to: struct_benchmark/output/struct_mmlu_v1_instruct/

# Evaluate a base model on all 3 StructEval benchmarks (PPL mode)
python run.py eval_config/eval_struct_all_v1_ppl.py \
    -w output/struct_all_v1_ppl
# Results saved to: struct_benchmark/output/struct_all_v1_ppl/

# Evaluate on ARC-Challenge only
python run.py eval_config/eval_struct_arc_challenge_v1_ppl.py \
    -w output/struct_arc_challenge_v1_ppl

# Evaluate on OpenBookQA only
python run.py eval_config/eval_struct_openbookqa_v1_ppl.py \
    -w output/struct_openbookqa_v1_ppl
```

--------------------------------

### Run Concept-based Generation

Source: https://github.com/c-box/structeval/blob/main/struct_generate/README.md

Execute the script to generate test instances based on essential concepts. Replace `{benchmark_name}` and `{split}` with your specific values.

```bash
bash scripts/run_concept_generation.bash demo test
```

--------------------------------

### Run Bloom-based Generation

Source: https://github.com/c-box/structeval/blob/main/struct_generate/README.md

Execute the script to generate test instances based on Bloom's Taxonomy. Replace `{benchmark_name}` and `{split}` with your specific values.

```bash
bash scripts/run_bloom_generate.bash {benchmark_name} {split}
```

```bash
bash scripts/run_bloom_generate.bash demo test
```

--------------------------------

### Run StructEval Evaluation Command

Source: https://github.com/c-box/structeval/blob/main/README.md

This bash command initiates the evaluation process for StructEval benchmarks after setting up the configuration. Navigate to the 'struct_benchmark' directory before running this command. The evaluation results will be saved in the specified output directory.
```bash
cd struct_benchmark
python run.py eval_config/eval_struct_mmlu_v1_instruct.py -w output/struct_mmlu_v1_instruct
```

--------------------------------

### Run Bloom's Taxonomy Question Generation

Source: https://context7.com/c-box/structeval/llms.txt

This command initiates the question generation process using `bloom_generation.py`. It takes the topic-matched data and generates questions across six cognitive levels of Bloom's Taxonomy, applying a RAG filter for answerability.

```bash
cd struct_generate
```

--------------------------------

### Run HF InternLM Chat Model Benchmarks

Source: https://github.com/c-box/structeval/blob/main/struct_benchmark/model_configs/hf_internlm/README.md

Execute benchmark tests for the hf_internlm2_chat_7b model against multiple datasets. Use the --debug flag for detailed output.

```bash
python3 run.py --models hf_internlm2_chat_7b --datasets mmlu_gen_4d595a --debug
```

```bash
python3 run.py --models hf_internlm2_chat_7b --datasets cmmlu_gen_c13365 --debug
```

```bash
python3 run.py --models hf_internlm2_chat_7b --datasets ceval_internal_gen_2daf24 --debug
```

```bash
python3 run.py --models hf_internlm2_chat_7b --datasets GaokaoBench_no_subjective_gen_4c31db --debug
```

```bash
python3 run.py --models hf_internlm2_chat_7b --datasets triviaqa_wiki_1shot_gen_eaf81e --debug
```

```bash
python3 run.py --models hf_internlm2_chat_7b --datasets nq_open_1shot_gen_01cf41 --debug
```

```bash
python3 run.py --models hf_internlm2_chat_7b --datasets race_gen_69ee4f --debug
```

```bash
python3 run.py --models hf_internlm2_chat_7b --datasets winogrande_5shot_gen_b36770 --debug
```

```bash
python3 run.py --models hf_internlm2_chat_7b --datasets hellaswag_10shot_gen_e42710 --debug
```

```bash
python3 run.py --models hf_internlm2_chat_7b --datasets bbh_gen_5b92b0 --debug
```

```bash
python3 run.py --models hf_internlm2_chat_7b --datasets gsm8k_gen_1d7fe4 --debug
```

```bash
python3 run.py --models hf_internlm2_chat_7b --datasets math_0shot_gen_393424 --debug
```

```bash
python3 run.py --models hf_internlm2_chat_7b --datasets TheoremQA_5shot_gen_6f0af8 --debug
```

```bash
python3 run.py --models hf_internlm2_chat_7b --datasets humaneval_gen_8e312c --debug
```

```bash
python3 run.py --models hf_internlm2_chat_7b --datasets sanitized_mbpp_mdblock_gen_a447ff --debug
```

```bash
python3 run.py --models hf_internlm2_chat_7b --datasets lcbench_gen_5ff288 --debug
```

```bash
python3 run.py --models hf_internlm2_chat_7b --datasets gpqa_gen_4baadb --debug
```

```bash
python3 run.py --models hf_internlm2_chat_7b --datasets IFEval_gen_3321a3 --debug
```

--------------------------------

### Run Data Combine

Source: https://github.com/c-box/structeval/blob/main/struct_generate/README.md

Execute the script to combine generated test instances from both Bloom-based and Concept-based modules. Replace `{benchmark_name}` and `{split}` with your specific values.

```bash
bash scripts/run_data_combine.bash demo test
```

--------------------------------

### StructEval Benchmark Configuration

Source: https://context7.com/c-box/structeval/llms.txt

This file defines the configuration for evaluating models on StructMMLU benchmarks using the OpenCompass 2.0 framework. It specifies parameters for the evaluation process.

```python
# struct_benchmark/eval_config/eval_struct_mmlu_v1_instruct.py
```

--------------------------------

### Evaluate Llama-3-8b-instruct on StructMMLU

Source: https://github.com/c-box/structeval/blob/main/README.md

This Python script demonstrates how to configure and run an evaluation for the 'llama-3-8b-instruct' model on the 'StructMMLU' dataset using Opencompass 2.0. Ensure you have the necessary model and dataset configurations imported.
```python
from mmengine.config import read_base

with read_base():
    from ..data_config.struct_mmlu.struct_mmlu_v1_instruct import struct_mmlu_V1_datasets
    from ..model_configs.hf_llama.hf_llama3_8b_instruct import models as hf_llama3_8b_instruct_model

datasets = [*struct_mmlu_V1_datasets]
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
```

--------------------------------

### Evaluate Qwen1.5 Base Models

Source: https://github.com/c-box/structeval/blob/main/struct_benchmark/model_configs/qwen/README.md

Command to evaluate Qwen1.5 base models across various datasets. Ensure the correct model and dataset identifiers are used.

```bash
python3 run.py --models hf_qwen1_5_7b --datasets mmlu_ppl_ac766d --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b --datasets cmmlu_ppl_041cbf --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b --datasets ceval_internal_ppl_93e5ce --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b --datasets GaokaoBench_no_subjective_gen_d21e37 --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b --datasets triviaqa_wiki_1shot_gen_20a989 --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b --datasets nq_open_1shot_gen_20a989 --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b --datasets race_ppl_abed12 --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b --datasets winogrande_5shot_ll_252f01 --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b --datasets hellaswag_10shot_ppl_59c85e --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b --datasets bbh_gen_98fba6 --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b --datasets gsm8k_gen_17d0dc --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b --datasets math_4shot_base_gen_db136b --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b --datasets TheoremQA_5shot_gen_6f0af8 --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b --datasets deprecated_humaneval_gen_d2537e --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b --datasets sanitized_mbpp_gen_742f0c --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b --datasets lcbench_gen_5ff288 --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b --datasets gpqa_ppl_6bf57a --debug
```

--------------------------------

### Run StructEval Benchmark Generation Scripts

Source: https://github.com/c-box/structeval/blob/main/README.md

Execute these bash commands to generate benchmarks using StructEval. Navigate to the 'struct_generate' directory first. These scripts handle Bloom's Taxonomy generation, concept generation, and data combination.

```bash
cd struct_generate
bash scripts/run_bloom_generate.bash demo test
bash scripts/run_concept_generation.bash demo test
bash scripts/run_data_combine.bash demo test
```

--------------------------------

### Utility Functions

Source: https://context7.com/c-box/structeval/llms.txt

Core helper functions for data I/O, prompt formatting, and token management.

```APIDOC
## Utility Functions (`common_utils/utils.py`)

Core helper functions used throughout the pipeline for data I/O, prompt formatting, and token management.

```python
from common_utils.utils import (
    load_file, load_json_dic, save_json_dic, save_jsonl_data,
    build_example, get_text_chunks, parse_json_response,
    find_answer, random_select_choice, set_seed, token_count
)

# Load JSONL seed data
data = load_file("processed_data/demo/0_test_with_idx.json")
# Returns: list of dicts, one per line

# Load / save JSON checkpoint files
checkpoint = load_json_dic("processed_data/demo/1_test_with_topic.json")
save_json_dic(checkpoint, "processed_data/demo/1_test_with_topic.json")

# Format a seed question for prompt construction
example_str = build_example(
    data[0],
    with_answer=True,    # include the correct answer
    with_explain=False   # omit explanation field
)
# Output:
# "Question: Which of the following is true regarding reflexes?\n# A. A positive babinski reflex...\n# B. An extensor plantar response...\n# C. The root value of the ankle reflex is S1\n# D. The root value of the knee reflex is L1, L2\n# Answer: C. The root value of the ankle reflex is S1"

# Split a long Wikipedia page text into token-bounded chunks
chunks = get_text_chunks(wiki_text, chunk_size=256)
# Returns: list of strings, each ≤ 256 tokens, split on sentence boundaries

# Parse a GPT response that wraps JSON in a code block
questions = parse_json_response("```json\n[{\"level\": \"remembering\", ...}]\n```")

# Randomly permute answer choices to prevent position bias
shuffled = random_select_choice(question_dict)
# Swaps answer key contents so "answer" field always points to correct option
```
```

--------------------------------

### StructMMLU PPL Evaluation Configuration

Source: https://context7.com/c-box/structeval/llms.txt

This Python script configures the StructMMLU dataset for perplexity-based evaluation. It defines input/output columns, prompt templates, and specifies the use of ZeroRetriever and PPLInferencer.

```python
# struct_benchmark/data_config/struct_mmlu/struct_mmlu_v1_ppl.py
# PPL (perplexity-based) evaluation config for all 57 MMLU subjects

from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import PPLInferencer
from opencompass.openicl.icl_evaluator import AccwithDetailsEvaluator
from opencompass.datasets import StructMMLU_V1

mmlu_reader_cfg = dict(
    input_columns=['input', 'A', 'B', 'C', 'D'],
    output_column='target',
    train_split='dev',
    test_split='test'
)

# Build dataset config for a single subject
_name = 'clinical_knowledge'
_hint = f'The following are multiple choice questions (with answers) about {_name.replace("_", " ")}.\n\n'
question_overall = '{input}\nA. {A}\nB. {B}\nC. {C}\nD. {D}'

mmlu_infer_cfg = dict(
    ice_template=dict(
        type=PromptTemplate,
        template={opt: f'{question_overall}\nAnswer: {opt}\n' for opt in ['A', 'B', 'C', 'D']},
    ),
    prompt_template=dict(
        type=PromptTemplate,
        template={opt: f'{_hint}{question_overall}\nAnswer: {opt}' for opt in ['A', 'B', 'C', 'D']},
        ice_token='',
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=PPLInferencer),
)

dataset_cfg = dict(
    abbr=f'struct_mmlu_{_name}',
    type=StructMMLU_V1,
    path='./struct_data/struct_mmlu',  # path to generated struct_data
    name=_name,
    reader_cfg=mmlu_reader_cfg,
    infer_cfg=mmlu_infer_cfg,
    eval_cfg=dict(evaluator=dict(type=AccwithDetailsEvaluator)),
)
```

--------------------------------

### Evaluate Qwen1.5 Chat Models

Source: https://github.com/c-box/structeval/blob/main/struct_benchmark/model_configs/qwen/README.md

Command to evaluate Qwen1.5 chat models across various datasets. Ensure the correct model and dataset identifiers are used.

```bash
python3 run.py --models hf_qwen1_5_7b_chat --datasets mmlu_gen_4d595a --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b_chat --datasets cmmlu_gen_c13365 --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b_chat --datasets ceval_internal_gen_2daf24 --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b_chat --datasets GaokaoBench_no_subjective_gen_4c31db --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b_chat --datasets triviaqa_wiki_1shot_gen_eaf81e --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b_chat --datasets nq_open_1shot_gen_01cf41 --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b_chat --datasets race_gen_69ee4f --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b_chat --datasets winogrande_5shot_gen_b36770 --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b_chat --datasets hellaswag_10shot_gen_e42710 --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b_chat --datasets bbh_gen_5b92b0 --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b_chat --datasets gsm8k_gen_1d7fe4 --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b_chat --datasets math_0shot_gen_393424 --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b_chat --datasets TheoremQA_5shot_gen_6f0af8 --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b_chat --datasets humaneval_gen_8e312c --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b_chat --datasets sanitized_mbpp_mdblock_gen_a447ff --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b_chat --datasets lcbench_gen_5ff288 --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b_chat --datasets gpqa_gen_4baadb --debug
```

```bash
python3 run.py --models hf_qwen1_5_7b_chat --datasets IFEval_gen_3321a3 --debug
```

--------------------------------

### Run StructEval Evaluation

Source: https://github.com/c-box/structeval/blob/main/struct_benchmark/README.md

Executes the evaluation script using a specified configuration file and output directory. The results will be saved in the designated output path.

```bash
python run.py eval_config/eval_struct_all_v1_ppl.py -w output/struct_all_v1_ppl
```

--------------------------------

### Merge Bloom and Concept Outputs for Final Benchmark

Source: https://context7.com/c-box/structeval/llms.txt

Combines outputs from Bloom-based and concept-based generation pipelines into a final structured benchmark file, retaining only RAG-verified questions. Specify the benchmark and split. This script is also used for MMLU benchmarks.

```bash
cd struct_generate

python data_combine.py \
    --benchmark arc_challenge \
    --split test
```

```bash
# Or use the convenience script:
bash scripts/run_data_combine.bash arc_challenge test
```

```bash
# For MMLU (all 57 subjects):
python data_combine.py --benchmark mmlu --split test
```

--------------------------------

### Evaluate Llama-3-8b-instruct on StructMMLU

Source: https://context7.com/c-box/structeval/llms.txt

This Python script configures and prepares datasets and models for evaluation.
It reads base configurations and aggregates dataset and model definitions.

```python
from mmengine.config import read_base

with read_base():
    from ..data_config.struct_mmlu.struct_mmlu_v1_instruct import struct_mmlu_V1_datasets
    from ..model_configs.hf_llama.hf_llama3_8b_instruct import models as hf_llama3_8b_instruct_model

datasets = [*struct_mmlu_V1_datasets]
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
```

--------------------------------

### Evaluate InternLM2 Base Models on Benchmarks

Source: https://github.com/c-box/structeval/blob/main/struct_benchmark/model_configs/hf_internlm/README.md

These commands execute evaluations for the 'hf_internlm2_7b' base model across a wide range of datasets. Ensure the 'run.py' script and specified datasets are available in your environment.

```bash
python3 run.py --models hf_internlm2_7b --datasets mmlu_ppl_ac766d --debug
```

```bash
python3 run.py --models hf_internlm2_7b --datasets cmmlu_ppl_041cbf --debug
```

```bash
python3 run.py --models hf_internlm2_7b --datasets ceval_internal_ppl_93e5ce --debug
```

```bash
python3 run.py --models hf_internlm2_7b --datasets GaokaoBench_no_subjective_gen_d21e37 --debug
```

```bash
python3 run.py --models hf_internlm2_7b --datasets triviaqa_wiki_1shot_gen_20a989 --debug
```

```bash
python3 run.py --models hf_internlm2_7b --datasets nq_open_1shot_gen_20a989 --debug
```

```bash
python3 run.py --models hf_internlm2_7b --datasets race_ppl_abed12 --debug
```

```bash
python3 run.py --models hf_internlm2_7b --datasets winogrande_5shot_ll_252f01 --debug
```

```bash
python3 run.py --models hf_internlm2_7b --datasets hellaswag_10shot_ppl_59c85e --debug
```

```bash
python3 run.py --models hf_internlm2_7b --datasets bbh_gen_98fba6 --debug
```

```bash
python3 run.py --models hf_internlm2_7b --datasets gsm8k_gen_17d0dc --debug
```

```bash
python3 run.py --models hf_internlm2_7b --datasets math_4shot_base_gen_db136b --debug
```

```bash
python3 run.py --models hf_internlm2_7b --datasets TheoremQA_5shot_gen_6f0af8 --debug
```

```bash
python3 run.py --models hf_internlm2_7b --datasets deprecated_humaneval_gen_d2537e --debug
```

```bash
python3 run.py --models hf_internlm2_7b --datasets sanitized_mbpp_gen_742f0c --debug
```

```bash
python3 run.py --models hf_internlm2_7b --datasets lcbench_gen_5ff288 --debug
```

```bash
python3 run.py --models hf_internlm2_7b --datasets gpqa_ppl_6bf57a --debug
```

--------------------------------

### Run Bloom-based Generation with RAG Filtering

Source: https://context7.com/c-box/structeval/llms.txt

Executes the Bloom generation process combined with RAG filtering. Specify the benchmark, split, and model type. The output file includes Bloom questions with RAG verification.

```bash
python bloom_generation.py \
    --benchmark arc_challenge \
    --split test \
    --model-type gpt-4o-mini
```

```bash
bash scripts/run_bloom_generate.bash arc_challenge test
```

--------------------------------

### OpenAI API Wrappers

Source: https://context7.com/c-box/structeval/llms.txt

Single and multithreaded OpenAI API wrappers with retry logic.

```APIDOC
## `query_gpt` / `multi_query_gpt` (`common_utils/prompt_utils.py`)

Single and multithreaded OpenAI API wrappers with retry logic (up to 30 attempts with exponential backoff).

```python
from common_utils.prompt_utils import query_gpt, multi_query_gpt, entity_match_multithreaded

# Single synchronous query
response = query_gpt(
    query="What is the capital of France?",
    temp=0.0,                # 0 for deterministic, 0.7 for generation
    model_type="gpt-4o-mini"
)
# Returns: "Paris"

# Parallel multi-query (uses threading; preserves input order)
prompts = [
    "What is the capital of France?",
    "What is the capital of Germany?",
    "What is the capital of Japan?"
]
responses = multi_query_gpt(prompts, temp=0.0, model_type="gpt-4o-mini")
# Returns: ["Paris", "Berlin", "Tokyo"]

# Multithreaded entity matching — determines if two mentions refer to the same entity
entity_infos = [
    ("Ankle reflex", "a deep tendon reflex", "Ankle jerk reflex", "The ankle jerk reflex, also known as..."),
    ("Babinski sign", "a neurological reflex", "Plantar reflex", "The plantar reflex is an important..."),
]
matches = entity_match_multithreaded(entity_infos)
# Returns: [True, True] — list of booleans in input order
```
```

--------------------------------

### Run Topic Extraction for Benchmark Generation

Source: https://context7.com/c-box/structeval/llms.txt

This command executes the `topic_extract.py` script to identify the core topic of seed questions. It supports single benchmarks or MMLU, allowing configuration of ranking methods and paragraph retrieval parameters.

```bash
cd struct_generate

# For a single benchmark (e.g., arc_challenge)
python topic_extract.py \
    --benchmark arc_challenge \
    --split test \
    --rank-method bge \
    --para-num 3 \
    --chunk-size 256

# For MMLU (iterates over all 57 subjects automatically)
python topic_extract.py \
    --benchmark mmlu \
    --split test \
    --rank-method bge \
    --para-num 3
```

--------------------------------

### Elasticsearch Client Configuration

Source: https://context7.com/c-box/structeval/llms.txt

This YAML file configures the connection parameters for the Elasticsearch client. It specifies host, username, password, and various connection settings like timeouts and sniffing behavior.
```yaml
# struct_generate/config/es_config.yaml
es_config:
  hosts:
    - "http://localhost:9200"
  username: "elastic"
  password: "your-password"
  timeout: 600
  sniff_on_start: false
  sniff_on_connection_fail: false
  sniff_timeout: 10
  sniffer_timeout: 60
```

--------------------------------

### Extract Key Concepts and Wikipedia Information

Source: https://context7.com/c-box/structeval/llms.txt

Extracts up to 5 important entities from seed questions, retrieves their Wikipedia pages, and uses GPT for entity matching. Requires specifying the benchmark, split, rank-method, and paragraph number. The output file contains extracted entities with matching status and Wikipedia details.

```bash
cd struct_generate

python concept_extract.py \
    --benchmark arc_challenge \
    --split test \
    --rank-method bge \
    --para-num 1
```

--------------------------------

### Generate Concept-Based Multiple-Choice Questions

Source: https://context7.com/c-box/structeval/llms.txt

Generates multiple-choice questions for each matched entity, grounded in Wikipedia content and filtered by RAG. Use this script to test model understanding of concepts. Specify the benchmark, split, and model type. Convenience scripts are also available.

```bash
cd struct_generate

python concept_generation.py \
    --benchmark arc_challenge \
    --split test \
    --model-type gpt-4o-mini
```

```bash
# Or use the convenience script:
bash scripts/run_concept_generation.bash arc_challenge test
```

--------------------------------

### Synchronous GPT API Query

Source: https://context7.com/c-box/structeval/llms.txt

The `query_gpt` function provides a synchronous wrapper for making a single request to the OpenAI API. It allows specifying the query, temperature for randomness, and the model type.
```python
from common_utils.prompt_utils import query_gpt

# Single synchronous query
response = query_gpt(
    query="What is the capital of France?",
    temp=0.0,                # 0 for deterministic, 0.7 for generation
    model_type="gpt-4o-mini"
)
# Returns: "Paris"
```

--------------------------------

### Load and Save JSON Data

Source: https://context7.com/c-box/structeval/llms.txt

Use `load_json_dic` to load JSON data from a file and `save_json_dic` to save a dictionary to a JSON file. These are useful for managing checkpoints or configuration.

```python
from common_utils.utils import load_file, load_json_dic, save_json_dic, save_jsonl_data

# Load JSONL seed data
data = load_file("processed_data/demo/0_test_with_idx.json")
# Returns: list of dicts, one per line

# Load / save JSON checkpoint files
checkpoint = load_json_dic("processed_data/demo/1_test_with_topic.json")
save_json_dic(checkpoint, "processed_data/demo/1_test_with_topic.json")
```

--------------------------------

### Parallel GPT API Queries

Source: https://context7.com/c-box/structeval/llms.txt

Use `multi_query_gpt` for making multiple GPT API requests concurrently using threading. It preserves the order of responses corresponding to the input prompts.

```python
from common_utils.prompt_utils import multi_query_gpt

# Parallel multi-query (uses threading; preserves input order)
prompts = [
    "What is the capital of France?",
    "What is the capital of Germany?",
    "What is the capital of Japan?"
]
responses = multi_query_gpt(prompts, temp=0.0, model_type="gpt-4o-mini")
# Returns: ["Paris", "Berlin", "Tokyo"]
```

--------------------------------

### Randomly Select Answer Choices

Source: https://context7.com/c-box/structeval/llms.txt

Use `random_select_choice` to shuffle the order of answer choices in a question dictionary. This helps prevent response bias related to answer position.
```python
from common_utils.utils import random_select_choice

# Randomly permute answer choices to prevent position bias
shuffled = random_select_choice(question_dict)
# Swaps answer key contents so "answer" field always points to correct option
```

--------------------------------

### Token-Aware Text Chunking

Source: https://context7.com/c-box/structeval/llms.txt

Use `get_text_chunks` to split a long text into smaller segments, ensuring each chunk is within a specified token limit and split at sentence boundaries.

```python
from common_utils.utils import get_text_chunks

# Split a long Wikipedia page text into token-bounded chunks
chunks = get_text_chunks(wiki_text, chunk_size=256)
# Returns: list of strings, each ≤ 256 tokens, split on sentence boundaries
```

--------------------------------

### Calculate Token Count

Source: https://context7.com/c-box/structeval/llms.txt

The `token_count` function is available for counting the tokens in a given text. The source shows only its import, so its signature and the tokenizer it relies on are not documented here.

```python
from common_utils.utils import token_count
```

--------------------------------

### Retrieve Wikipedia Content by Entity

Source: https://context7.com/c-box/structeval/llms.txt

The `wiki_retrieve_by_entity` function fetches and ranks relevant Wikipedia paragraphs for a given topic. It supports different ranking methods like BGE or GPT and allows configuration of chunk size and number of documents.
```python
from common_utils.wiki_search import wiki_retrieve_by_entity
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import argparse

# Set up args (mirrors CLI arguments in topic_extract.py / concept_extract.py)
args = argparse.Namespace(
    rank_method="bge",      # "bge" or "gpt"
    para_num=1,             # number of top paragraphs to return
    chunk_size=256,         # token size per Wikipedia chunk
    use_openai_chunk=True   # use tiktoken-based chunking
)

# Load BGE reranker (required when rank_method="bge")
bge_model = AutoModelForSequenceClassification.from_pretrained("BAAI/bge-reranker-large").cuda()
bge_tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-reranker-large")
bge_model.eval()

# Retrieve relevant Wikipedia content
wiki_info = wiki_retrieve_by_entity(
    args=args,
    topic_name="Ankle jerk reflex",
    topic_des="A deep tendon reflex elicited by striking the Achilles tendon",
    seed_question="The root value of the ankle reflex is S1",
    bge_model=bge_model,
    beg_tokenizer=bge_tokenizer,
    doc_num=1
)
# Returns:
# {
#   "wiki_id": "1234567",
#   "wiki_name": "Ankle jerk reflex",
#   "wiki_intro": "The ankle jerk reflex, also known as the Achilles reflex...",
#   "wiki_page": "",
#   "related_content": ""
# }
```

--------------------------------

### Wikipedia Retrieval

Source: https://context7.com/c-box/structeval/llms.txt

Retrieves and ranks relevant Wikipedia paragraphs for a given entity.

```APIDOC
## `wiki_retrieve_by_entity` (`common_utils/wiki_search.py`)

Retrieves and ranks the most relevant Wikipedia paragraphs for a given entity name and description, using ElasticSearch for retrieval and either BGE or GPT for passage reranking.
```

--------------------------------

### StructEval Citation

Source: https://github.com/c-box/structeval/blob/main/README.md

BibTeX entry for citing the StructEval paper. Include this in your academic work when referencing the framework.
```bibtex
@misc{cao2024structevaldeepenbroadenlarge,
      title={StructEval: Deepen and Broaden Large Language Model Assessment via Structured Evaluation},
      author={Boxi Cao and Mengjie Ren and Hongyu Lin and Xianpei Han and Feng Zhang and Junfeng Zhan and Le Sun},
      year={2024},
      eprint={2408.03281},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2408.03281},
}
```

--------------------------------

### Multithreaded Entity Matching

Source: https://context7.com/c-box/structeval/llms.txt

The `entity_match_multithreaded` function performs entity matching in parallel using threads. It compares pairs of mentions and their descriptions to determine if they refer to the same entity.

```python
from common_utils.prompt_utils import entity_match_multithreaded

# Multithreaded entity matching: determines if two mentions refer to the same entity
entity_infos = [
    ("Ankle reflex", "a deep tendon reflex", "Ankle jerk reflex", "The ankle jerk reflex, also known as..."),
    ("Babinski sign", "a neurological reflex", "Plantar reflex", "The plantar reflex is an important..."),
]
matches = entity_match_multithreaded(entity_infos)
# Returns: [True, True] (list of booleans in input order)
```

--------------------------------

### Parse JSON from GPT Responses

Source: https://context7.com/c-box/structeval/llms.txt

The `parse_json_response` function extracts and parses JSON content that might be embedded within a code block in a GPT model's output.

```python
from common_utils.utils import parse_json_response

# Parse a GPT response that wraps JSON in a code block
questions = parse_json_response("```json\n[{\"level\": \"remembering\", ...}]\n```")
```
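--------------------------------

### Sketch: Extracting JSON from a Fenced Model Response

The source does not show how `parse_json_response` works internally, so the following is only a minimal sketch of the general technique: find a fenced code block in the model output, fall back to the raw text if none is present, and hand the payload to `json.loads`. The helper name `extract_fenced_json` and the choice to return `None` on a parse failure are assumptions made for illustration, not the StructEval implementation.

```python
import json
import re

# Build the fence marker programmatically so this example contains no literal fence.
FENCE = "`" * 3

# Matches an optional "json" language tag, then lazily captures the fenced payload.
_FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def extract_fenced_json(response: str):
    """Hypothetical helper: parse JSON from a model response, or return None."""
    match = _FENCE_RE.search(response)
    # Use the fenced payload when present; otherwise try the whole response.
    payload = match.group(1) if match else response.strip()
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        # Assumption: callers prefer None over an exception on malformed output.
        return None


demo = (
    "Here are the questions:\n"
    + FENCE + "json\n"
    + '[{"level": "remembering", "question": "Q1"}]\n'
    + FENCE
)
print(extract_fenced_json(demo))
```

Returning `None` instead of raising keeps a batch of generations running when one response is malformed; a caller that needs to retry or log failures can check for `None` explicitly.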