### Install AutoRound Kernel from Source

Source: https://github.com/intel/auto-round/blob/main/auto_round_extension/ark/README.md

Follow these steps to build and install the AutoRound Kernel library from its source code.

```bash
python setup.py bdist_wheel
pip install dist/*
```

--------------------------------

### Install AutoRound from Source (GPU/CPU)

Source: https://github.com/intel/auto-round/blob/main/AGENTS.md

Install the AutoRound library from source for GPU/CPU support. The `--no-build-isolation` flag is required if PyTorch is already installed.

```bash
pip install --no-build-isolation -e .
```

--------------------------------

### Install Auto-Round

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Install the auto-round library using pip. This is the first step before proceeding with quantization.

```bash
pip install auto-round
```

--------------------------------

### Install AutoRound XPU Variant

Source: https://github.com/intel/auto-round/blob/main/AGENTS.md

Install the XPU-specific variant of the AutoRound library. Ensure Intel PyTorch is installed first, then proceed with the standard installation.

```bash
pip install torch --index-url https://download.pytorch.org/whl/xpu
pip install --no-build-isolation .
```

--------------------------------

### Enable vLLM-Ext at Runtime

Source: https://github.com/intel/auto-round/blob/main/auto_round_extension/vllm_ext/README.md

Start the vLLM server with the `VLLM_ENABLE_AR_EXT` environment variable set to 1 to activate the auto-round extension.

```bash
VLLM_ENABLE_AR_EXT=1 vllm serve ...
```

--------------------------------

### Build and Install vLLM Extension

Source: https://github.com/intel/auto-round/blob/main/auto_round_extension/vllm_ext/README.md

Clone the vLLM fork at the `fused-moe-ar` branch and install it with pip using precompiled extensions. The `-vvv` flag enables verbose output for debugging.
```bash
git clone --branch fused-moe-ar https://github.com/yiliu30/vllm-fork.git
# Enter the cloned repository before installing
cd vllm-fork
VLLM_USE_PRECOMPILED=1 pip install --editable . -vvv
```

--------------------------------

### Build Auto Round from Source

Source: https://github.com/intel/auto-round/blob/main/README.md

Instructions for building Auto Round from source for CPU/GPU, HPU, or XPU. For HPU, a specific setup command is required.

```bash
# CPU(Xeon)/GPU(CUDA)
pip install .

# HPU(Gaudi)
python setup.py install hpu

# XPU(Intel GPU)
pip install torch --index-url https://download.pytorch.org/whl/xpu
pip install .
```

--------------------------------

### Project and Dependency Setup

Source: https://github.com/intel/auto-round/blob/main/auto_round_extension/ark/auto_round_kernel/bestla/CMakeLists.txt

Initializes the CMake project, sets the minimum version, and includes the modules needed for SIMD and SYCL support. It also fetches the xbyak library and makes it available.

```cmake
cmake_minimum_required(VERSION 3.12)
project(bestla LANGUAGES CXX VERSION 0.1.0)
if(BTLA_SYCL)
  include(cmake/sycl.cmake)
endif()
include(cmake/FindSIMD.cmake)
file(GLOB headers ${PROJECT_NAME}/*.h ${PROJECT_NAME}/*.hpp)
FetchContent_Declare(
  xbyak
  GIT_REPOSITORY https://github.com/herumi/xbyak.git
  GIT_TAG v7.06
)
FetchContent_MakeAvailable(xbyak)
add_library(${PROJECT_NAME} INTERFACE)
target_link_libraries(${PROJECT_NAME} INTERFACE xbyak)
add_library(neural_speed::${PROJECT_NAME} ALIAS ${PROJECT_NAME})
target_include_directories(
  ${PROJECT_NAME} INTERFACE
  "$"
  "$"
)
```

--------------------------------

### Install Auto Round from PyPI

Source: https://github.com/intel/auto-round/blob/main/README.md

Install the Auto Round package for CPU/GPU, nightly builds, HPU, or XPU. For HPU, installation must be done inside a specific Docker container.

```bash
# CPU(Xeon)/GPU(CUDA)
pip install auto-round

# CPU(Xeon)/GPU(CUDA) nightly
pip install auto-round-nightly

# HPU(Gaudi)
# install inside the hpu docker container, e.g. vault.habana.ai/gaudi-docker/1.23.0/ubuntu24.04/habanalabs/pytorch-installer-2.9.0:latest
pip install auto-round-hpu

# XPU(Intel GPU)
pip install torch --index-url https://download.pytorch.org/whl/xpu
pip install auto-round
```

--------------------------------

### Install AutoRound HPU Variant

Source: https://github.com/intel/auto-round/blob/main/AGENTS.md

Install the HPU-specific variant of the AutoRound library. This can be done using the `BUILD_HPU_ONLY` environment variable or by running the setup script directly.

```bash
BUILD_HPU_ONLY=1 pip install --no-build-isolation .
# or:
python setup.py hpu install
```

--------------------------------

### Install AutoRound Kernel via Pip

Source: https://github.com/intel/auto-round/blob/main/auto_round_extension/ark/README.md

Use this command to install the AutoRound Kernel library using pip.

```bash
pip install auto-round-lib
```

--------------------------------

### Minimal QuantLinear Usage Example

Source: https://github.com/intel/auto-round/blob/main/auto_round_extension/ark/README.md

Demonstrates the basic lifecycle of the QuantLinear module for inference. Load the quantized tensors and call `post_init()` before the forward pass.

```python
from auto_round_kernel.qlinear import QuantLinear

qlinear = QuantLinear(
    bits=4,
    group_size=128,
    sym=True,
    in_features=in_features,
    out_features=out_features,
    bias=bias is not None,
    weight_dtype=weight_dtype,
)

# Load qweight, qzeros, scales, and bias from checkpoint.
qlinear.post_init()

# Run inference
y = qlinear(x)
```

--------------------------------

### Specify Inference Backend with AutoRound

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Use AutoRoundConfig to specify a preferred backend, such as 'ark' for CPU and Intel GPU. Ensure the corresponding libraries are installed.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoRoundConfig

model_name = "OPEA/Qwen2.5-1.5B-Instruct-int4-sym-inc"
quantization_config = AutoRoundConfig(backend="ark")
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="cpu", quantization_config=quantization_config, torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50, do_sample=False)[0]))
```

--------------------------------

### Install Dependencies

Source: https://github.com/intel/auto-round/blob/main/test/README.md

Install project dependencies and pytest using pip.

```sh
pip install -r ../requirements.txt
pip install pytest
```

--------------------------------

### Basic Quantization Test

Source: https://github.com/intel/auto-round/blob/main/test/README.md

Example of a basic test case for a new quantization method using AutoRound.

```python
# test_cpu/quantization/test_new_method.py
import pytest
from auto_round import AutoRound

from ...helpers import opt_name_or_path


class TestNewQuantMethod:
    def test_quantization(self, tiny_opt_model_path, dataloader):
        """Test new quantization method."""
        autoround = AutoRound(model=tiny_opt_model_path, bits=4, group_size=128, iters=2, dataset=dataloader)
        autoround.quantize()
        assert autoround is not None
```

--------------------------------

### vLLM Model Inference

Source: https://github.com/intel/auto-round/blob/main/README.md

Demonstrates model inference with the vLLM library. This example loads a quantized model and generates text from the provided prompts and sampling parameters. Ensure the model path is correctly specified.
```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95)
model_name = "Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound"
llm = LLM(model=model_name)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

--------------------------------

### Apply AutoScheme with Fixed Layer Configuration

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

This example applies AutoScheme while fixing the quantization scheme for specific layers through the `layer_config` parameter. It is useful for controlling quantization on a per-layer basis.

```python
from auto_round import AutoRound, AutoScheme

model_name = "Qwen/Qwen3-8B"
avg_bits = 3.0
scheme = AutoScheme(avg_bits=avg_bits, options=("GGUF:Q2_K_S", "GGUF:Q4_K_S"), ignore_scale_zp_bits=True)
layer_config = {"lm_head": "GGUF:Q6_K"}

ar = AutoRound(model=model_name, scheme=scheme, layer_config=layer_config, iters=0)
ar.quantize_and_save()
```

--------------------------------

### Model Inference Test with Helpers

Source: https://github.com/intel/auto-round/blob/main/test/README.md

Example of using helper functions for model path resolution and inference within a test.
```python
from ...helpers import model_infer, opt_name_or_path, get_model_path


def test_model_inference(tiny_opt_model_path):
    # Use predefined model path
    model_name = opt_name_or_path

    # Or resolve custom model path
    custom_model = get_model_path("custom/model-name")

    # Run inference using helper
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(tiny_opt_model_path)
    tokenizer = AutoTokenizer.from_pretrained(tiny_opt_model_path)
    output = model_infer(model, tokenizer, "Hello world")
```

--------------------------------

### Quantize VLM Model with AutoRound

Source: https://github.com/intel/auto-round/blob/main/README.md

Example of quantizing a Vision-Language Model (VLM) using AutoRound. This snippet loads a VLM, applies the specified quantization scheme, and saves the quantized model to an output directory. Note that quantizing non-text modules is an experimental feature.

```python
from auto_round import AutoRound

# Load the model
model_name_or_path = "Qwen/Qwen2.5-VL-7B-Instruct"

# Quantize the model
ar = AutoRound(model_name_or_path, scheme="W4A16")
output_dir = "./qmodel"
ar.quantize_and_save(output_dir)
```

--------------------------------

### AutoRound Command Line

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Use the AutoRound recipe for a good balance of accuracy and tuning cost. Recommended for most scenarios.

```bash
auto-round --model Qwen/Qwen3-0.6B --scheme "W4A16" --format "auto_gptq,auto_awq,auto_round"
```

--------------------------------

### Build and Run Bestla Benchmark

Source: https://github.com/intel/auto-round/blob/main/auto_round_extension/ark/auto_round_kernel/bestla/README.md

Compile the benchmark using CMake and then execute it. Ensure all necessary flags are set for a complete build.

```shell
mkdir build
cd build
cmake .. -DBTLA_UT_BENCHMARK=ON -DBTLA_UT_ALL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build . -j
./bin/bestla_benchmark
```

--------------------------------

### AutoRoundLight Command Line

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Use the AutoRoundLight recipe for the best speed, suitable for 4-bit settings and larger models. May reduce accuracy for small models or 2-bit quantization.

```bash
auto-round-light --model Qwen/Qwen3-0.6B --scheme "W4A16" --format "auto_gptq,auto_awq,auto_round"
```

--------------------------------

### Compile Benchmark Executable

Source: https://github.com/intel/auto-round/blob/main/auto_round_extension/ark/auto_round_kernel/bestla/CMakeLists.txt

Sets up the build for the benchmark executable, including source file selection, OpenMP support, and platform-specific linker options.

```cmake
if(BTLA_UT_BENCHMARK)
  file(GLOB ut_headers ${PROJECT_NAME}/ut/*.h)
  include_directories(${PROJECT_NAME})
  if(NOT BTLA_SYCL)
    list(REMOVE_ITEM benchmark_srcs ${CMAKE_CURRENT_SOURCE_DIR}/${PROJECT_NAME}/ut/sycl_benchmark.cpp)
  endif()
  add_executable(${PROJECT_NAME}_benchmark ${benchmark_srcs} ${headers} ${ut_headers})
  if(BTLA_UT_OPENMP)
    include(FindOpenMP)
    target_compile_definitions(${PROJECT_NAME} INTERFACE BTLA_USE_OPENMP)
    target_link_libraries(${PROJECT_NAME}_benchmark PRIVATE OpenMP::OpenMP_CXX)
  endif()
  if(NOT WIN32)
    target_link_options(${PROJECT_NAME}_benchmark PRIVATE -lpthread)
  else()
    target_link_options(${PROJECT_NAME}_benchmark PUBLIC /STACK:5242880)
  endif()
  target_link_libraries(${PROJECT_NAME}_benchmark PRIVATE ${PROJECT_NAME} ${sycl_libs} dnnl)
  target_compile_options(${PROJECT_NAME}_benchmark PRIVATE -w)

  # Add SYCL target for Intel GPUs with XMX/2D block IO support (required for sycl-tla flash attention)
  if(BTLA_SYCL AND ARK_SYCL_TLA)
    # Header-only consumption of sycl-tla (do NOT build sycl-tla as a subproject).
    set(SYCL_TLA_GIT_REPOSITORY "https://github.com/intel/sycl-tla.git" CACHE STRING "sycl-tla git repository")
    set(SYCL_TLA_GIT_TAG "main" CACHE STRING "sycl-tla git tag/commit")
    FetchContent_Declare(
      sycl_tla
      GIT_REPOSITORY ${SYCL_TLA_GIT_REPOSITORY}
      GIT_TAG ${SYCL_TLA_GIT_TAG}
    )
    FetchContent_GetProperties(sycl_tla)
    if(NOT sycl_tla_POPULATED)
      FetchContent_Populate(sycl_tla)
    endif()
    set(_sycl_tla_include_dirs
      ${sycl_tla_SOURCE_DIR}/include
      ${sycl_tla_SOURCE_DIR}/applications
      ${sycl_tla_SOURCE_DIR}/tools/util/include
      ${sycl_tla_SOURCE_DIR}/examples/common
      ${sycl_tla_SOURCE_DIR}/examples/06_bmg_flash_attention
      ${sycl_tla_SOURCE_DIR}/examples/12_xe20_moe_gemm_cute_interface
    )
    foreach(_inc_dir IN LISTS _sycl_tla_include_dirs)
      if(EXISTS "${_inc_dir}")
        target_include_directories(${PROJECT_NAME}_benchmark PRIVATE "${_inc_dir}")
      endif()
    endforeach()

    # AOT compile target for Intel GPUs
    # Use intel_gpu_pvc for Data Center GPU Max series, or intel_gpu_bmg_g21 for Battlemage
    set(DPCPP_SYCL_TARGET "intel_gpu_bmg_g21" CACHE STRING "SYCL target (intel_gpu_pvc, intel_gpu_bmg_g21)")

    # Map target to device name for -Xs flag
    if(DPCPP_SYCL_TARGET STREQUAL "intel_gpu_bmg_g21" OR DPCPP_SYCL_TARGET STREQUAL "bmg")
      set(SYCL_DEVICE_NAME "bmg_g21")
    elseif(DPCPP_SYCL_TARGET STREQUAL "intel_gpu_pvc" OR DPCPP_SYCL_TARGET STREQUAL "pvc")
      set(SYCL_DEVICE_NAME "pvc")
    else()
      set(SYCL_DEVICE_NAME "${DPCPP_SYCL_TARGET}")
    endif()

    target_compile_definitions(${PROJECT_NAME}_benchmark PRIVATE ARK_SYCL_TLA=1 CUTLASS_ENABLE_SYCL=1 SYCL_INTEL_TARGET=1)

    # Compile flags (no AOT, JIT at runtime)
    target_compile_options(${PROJECT_NAME}_benchmark PRIVATE -fsycl -fno-sycl-instrument-device-code)

    # Link flags: use spir64 (JIT) with device hint and enable required SPIR-V extensions
    target_link_options(${PROJECT_NAME}_benchmark PRIVATE
      -fsycl
      -fsycl-targets=spir64
      "-Xs" "-device ${SYCL_DEVICE_NAME}"
      -Xspirv-translator "-spirv-ext=+SPV_INTEL_split_barrier,+SPV_INTEL_2d_block_io,+SPV_INTEL_subgroup_matrix_multiply_accumulate")
  endif()
endif(BTLA_UT_BENCHMARK)
```

--------------------------------

### AutoRoundBest Command Line

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Use the AutoRoundBest recipe for the highest accuracy, especially for 2-bit quantization. This is slower than the standard AutoRound recipe.

```bash
auto-round-best --model Qwen/Qwen3-0.6B --scheme "W4A16" --format "auto_gptq,auto_awq,auto_round"
```

--------------------------------

### Initialize AutoRound with Multi-GPU Device Map

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Instantiate AutoRound with multiple GPUs for tuning by passing a comma-separated string of device IDs.

```python
from auto_round import AutoRound

model_name_or_path = "Qwen/Qwen3-0.6B"
ar = AutoRound(
    model=model_name_or_path,
    device_map="0,1,2,3",
)
```

--------------------------------

### Configure CUTLASS for SYCL

Source: https://github.com/intel/auto-round/blob/main/auto_round_extension/ark/auto_round_kernel/CMakeLists.txt

Sets CUTLASS build options to enable SYCL support and disable benchmarks, examples, tests, the library, and tools. Also enables exporting compile commands.

```cmake
set(CUTLASS_ENABLE_SYCL ON)
set(CUTLASS_ENABLE_BENCHMARKS OFF)
set(CUTLASS_ENABLE_EXAMPLES OFF)
set(CUTLASS_ENABLE_TESTS OFF)
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
set(CUTLASS_ENABLE_LIBRARY OFF)
set(CUTLASS_ENABLE_TOOLS OFF)
set(CUTLASS_ENABLE_GDC_FOR_SM100_DEFAULT OFF CACHE BOOL "DISABLE CUDA")
```

--------------------------------

### Run All Tests

Source: https://github.com/intel/auto-round/blob/main/test/README.md

Execute all tests in the project.

```sh
pytest
```

--------------------------------

### Load and Quantize Model with AutoRound

Source: https://github.com/intel/auto-round/blob/main/README.md

Demonstrates loading a model and performing quantization using the AutoRound library.
Specifies the quantization scheme and output directory. Supports various model formats.

```python
from auto_round import AutoRound

# Load a model (supports FP8/BF16/FP16/FP32)
model_name_or_path = "Qwen/Qwen3-0.6B"

# Available schemes: "W2A16", "W3A16", "W4A16", "W8A16", "NVFP4", "MXFP4" (no real kernels), "GGUF:Q4_K_M", etc.
ar = AutoRound(model_name_or_path, scheme="W4A16")

# Highest accuracy (4–5× slower).
# `low_gpu_mem_usage=True` saves ~20GB VRAM but runs ~30% slower.
# ar = AutoRound(model_name_or_path, nsamples=512, iters=1000, low_gpu_mem_usage=True)

# Faster quantization (2–3× speedup) with slight accuracy drop at W4G128.
# ar = AutoRound(model_name_or_path, nsamples=128, iters=50, lr=5e-3)

# Supported formats: "auto_round" (default), "auto_gptq", "auto_awq", "llm_compressor", "gguf:q4_k_m", etc.
ar.quantize_and_save(output_dir="./qmodel", format="auto_round")
```

--------------------------------

### AutoRoundOptRTN Command Line

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Use the AutoRoundOptRTN recipe for optimized RTN without gradient computation. It is calibration-free and faster than AutoRound while still offering good accuracy.

```bash
auto-round-opt-rtn --model Qwen/Qwen3-0.6B --scheme "W4A16" --format "auto_round"
```

--------------------------------

### Run AutoRound CLI with Multi-GPU Device Map

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Execute the AutoRound command-line interface, specifying multiple GPUs for tuning via the `--device_map` argument.

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 auto-round --model "Qwen/Qwen3-0.6B" --scheme "W4A16" --device_map "auto"
```

--------------------------------

### Load and Generate with Transformers on Various Backends

Source: https://github.com/intel/auto-round/blob/main/README.md

Load a quantized model using the Transformers library, which supports automatic backend selection for CPU, Intel GPU, Gaudi, and CUDA.
Avoid manually moving the model to different devices during inference.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
```

--------------------------------

### AutoRoundRTN Command Line

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Use the AutoRoundRTN recipe for pure RTN without optimization. It is the fastest and uses the least memory but typically has lower accuracy.

```bash
auto-round-rtn --model Qwen/Qwen3-0.6B --scheme "W4A16" --format "auto_round"
```

--------------------------------

### Adjust the Activation Quantization Scale Factor

Source: https://github.com/intel/auto-round/blob/main/docs/environments_CN.md

Use a shell command to set the scale factor applied to the activation min/max values during activation quantization.

```bash
export AR_ACT_SCALE=0.9
```

--------------------------------

### Download Models with ModelScope

Source: https://github.com/intel/auto-round/blob/main/docs/environments_CN.md

Use a shell command to configure AutoRound to download models from ModelScope.

```bash
export AR_USE_MODELSCOPE=true
```

--------------------------------

### Multi-GPU Evaluation with vLLM Backend (Manual Configuration)

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Manually configure multi-GPU evaluation for vLLM using `CUDA_VISIBLE_DEVICES` and `--vllm_args` for fine-grained control over tensor parallelism and GPU memory utilization.
```bash
CUDA_VISIBLE_DEVICES=0,1 auto-round "your_model_path" --eval --tasks lambada_openai --eval_backend vllm --vllm_args="tensor_parallel_size=2,gpu_memory_utilization=0.8"
```

--------------------------------

### Auto Round Light Speed Recipe

Source: https://github.com/intel/auto-round/blob/main/README.md

Use the 'auto-round-light' recipe for a 2-3X speedup in quantization. Expect a slight accuracy drop at W4 and a more significant drop at W2.

```bash
auto-round-light \
    --model Qwen/Qwen3-0.6B \
    --scheme "W4A16"
```

--------------------------------

### Single GPU Evaluation with HF Backend

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Evaluate a model using the default HF backend. Specify the model, the quantization bit width, the desired formats, and the evaluation tasks.

```bash
auto-round --model Qwen/Qwen3-0.6B --bits 4 --format "auto_round,auto_gptq" --tasks mmlu
```

--------------------------------

### AutoRoundLight API Usage

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Instantiate AutoRound with parameters for the AutoRoundLight recipe. This is optimized for speed and recommended for 4-bit settings and larger models.

```python
from auto_round import AutoRound

model_name_or_path = "Qwen/Qwen3-0.6B"

ar = AutoRound(
    model=model_name_or_path,
    scheme="W4A16",
    iters=50,
    lr=5e-3,
)

output_dir = "./tmp_autoround"
ar.quantize_and_save(output_dir, format="auto_round")
```

--------------------------------

### Adjust the Dynamo Cache Size Limit

Source: https://github.com/intel/auto-round/blob/main/docs/environments_CN.md

Use a shell command to adjust the minimum value of `torch._dynamo` parameters such as `cache_size_limit`.

```bash
export AR_DYNAMO_CACHE_SIZE_LIMIT=32
```

--------------------------------

### AWQ Algorithm CLI Usage

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Use this command to apply the AWQ algorithm for quantization via the command line. Specify the model, quantization scheme, algorithm, and output format.
```bash
auto-round --model Qwen/Qwen3-0.6B --scheme "W4A16" --algorithm awq --format "auto_round"
```

--------------------------------

### CLI Usage for Model-Free Mode with Advanced Configuration

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Configure Model-Free mode with a custom group size, asymmetric quantization, per-layer bit-width overrides, and ignored layers. This allows fine-grained control over quantization.

```bash
# With per-layer configuration and ignored layers
auto_round meta-llama/Llama-3.2-1B-Instruct \
    --model_free \
    --scheme W4A16 \
    --group_size 32 \
    --asym \
    --layer_config "{k_proj:{bits:8},v_proj:{bits:8}}" \
    --ignore_layers "mlp" \
    --output_dir ./int4-llama
```

--------------------------------

### Customized Data Preparation

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Prepare a custom dataset as a list of strings for auto-round quantization. Samples shorter than the sequence length are dropped, and longer samples are truncated.

```python
def customized_data():
    # Important notice!!! AutoRound will drop data < args.seqlen and truncate data to args.seqlen
    data = ["AutoRound is an advanced quantization algorithm for low-bits LLM inference" * 240]
    return data
```

--------------------------------

### Tiny Model Creation and Saving

Source: https://github.com/intel/auto-round/blob/main/test/README.md

Utilities to create and save smaller versions of models for testing.

```python
get_tiny_model(model_path, num_layers=2)  # Create tiny model by slicing layers
save_tiny_model(model_path, save_path)  # Save tiny model to disk
```

--------------------------------

### Multi-GPU Evaluation with HF Backend

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Evaluate a model across multiple GPUs using the HF backend. Use `--device_map` to specify the GPUs and `--eval` to enable evaluation mode.
```bash
auto-round --model="your_model_path" --eval --device_map 0,1 --tasks lambada_openai --eval_bs 16
```

--------------------------------

### Disable OffloadManager Weight Offloading

Source: https://github.com/intel/auto-round/blob/main/docs/environments_CN.md

Use a shell command to force-disable weight offloading in AutoRound's OffloadManager.

```bash
export AR_DISABLE_OFFLOAD=1
```

--------------------------------

### Run Tests with Verbose Output

Source: https://github.com/intel/auto-round/blob/main/test/README.md

Execute tests and display detailed output. The `-s` flag turns off output capture so that stdout/stderr is shown as the tests run.

```sh
pytest -v -s
```

--------------------------------

### Multi-GPU Evaluation with vLLM Backend (Device Map)

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Perform multi-GPU evaluation with vLLM by specifying the device map. This is an alternative to manual environment variable configuration.

```bash
auto-round "your_model_path" --eval --device_map 0,1 --tasks lambada_openai --eval_backend vllm
```

--------------------------------

### Single GPU Evaluation with vLLM Backend

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Evaluate a model using the vLLM backend. This requires specifying the evaluation backend.

```bash
auto-round --model Qwen/Qwen3-0.6B --bits 4 --format "auto_round,auto_gptq" --tasks mmlu --eval_backend vllm
```

--------------------------------

### Benchmark Source Files

Source: https://github.com/intel/auto-round/blob/main/auto_round_extension/ark/auto_round_kernel/bestla/CMakeLists.txt

Defines the source files for the benchmark executables, including general benchmarks and SYCL-specific benchmarks. The commented-out lines are flash attention benchmark files that are currently not built.
```cmake
set(benchmark_srcs ${CMAKE_CURRENT_SOURCE_DIR}/${PROJECT_NAME}/ut/bestla_benchmark.cpp)
list(APPEND benchmark_srcs ${CMAKE_CURRENT_SOURCE_DIR}/${PROJECT_NAME}/ut/sycl_benchmark.cpp)
# Flash attention benchmarks are in separate files to avoid header conflicts
#list(APPEND benchmark_srcs ${CMAKE_CURRENT_SOURCE_DIR}/${PROJECT_NAME}/ut/sycl_tla_flash_attn_prefill_bench.cpp)
#list(APPEND benchmark_srcs ${CMAKE_CURRENT_SOURCE_DIR}/${PROJECT_NAME}/ut/sycl_tla_flash_attn_decode_bench.cpp)
```

--------------------------------

### Disable Dataset Subprocess Preprocessing

Source: https://github.com/intel/auto-round/blob/main/docs/environments_CN.md

Use a shell command to disable AutoRound's dataset preprocessing in a subprocess.

```bash
export AR_DISABLE_DATASET_SUBPROCESS=true
```

--------------------------------

### AutoRoundBest API Usage

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Instantiate AutoRound with parameters for the AutoRoundBest recipe. This is suitable for achieving the highest accuracy, especially with 2-bit quantization.

```python
from auto_round import AutoRound

model_name_or_path = "Qwen/Qwen3-0.6B"
ar = AutoRound(model=model_name_or_path, scheme="W4A16", nsamples=512, iters=1000, low_gpu_mem_usage=True)

output_dir = "./tmp_autoround"
ar.quantize_and_save(output_dir, format="auto_round")
```

--------------------------------

### AWQ Algorithm API Usage

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Instantiate AutoRound with the AWQ algorithm, then quantize and save the model. Ensure the output directory is specified.
```python
from auto_round import AutoRound

ar = AutoRound(
    "Qwen/Qwen3-0.6B",
    scheme="INT8",
    algorithm="awq",
)

output_dir = "./tmp_awq"
ar.quantize_and_save(output_dir, format="auto_round:llm_compressor")
```

--------------------------------

### Set the Working Directory

Source: https://github.com/intel/auto-round/blob/main/docs/environments_CN.md

Use a shell command to specify a custom working directory for AutoRound.

```bash
export AR_WORK_SPACE=/path/to/custom/workspace
```

--------------------------------

### Basic CLI Usage for Model Quantization

Source: https://github.com/intel/auto-round/blob/main/README.md

Perform model quantization using the auto-round CLI. Set the model, quantization scheme, format, and output directory. ModelScope is supported for model downloads by setting AR_USE_MODELSCOPE=1.

```bash
auto-round \
    --model Qwen/Qwen3-0.6B \
    --scheme "W4A16" \
    --format "auto_round" \
    --output_dir ./tmp_autoround
```

--------------------------------

### Enable Compile Packing

Source: https://github.com/intel/auto-round/blob/main/docs/environments_CN.md

Use a shell command to enable AutoRound's compile-packing optimization.

```bash
export AR_ENABLE_COMPILE_PACKING=1
```

--------------------------------

### Configure SYCL TLA FetchContent

Source: https://github.com/intel/auto-round/blob/main/auto_round_extension/ark/auto_round_kernel/CMakeLists.txt

Declares the sycl-tla library as a FetchContent dependency, pinned to a Git repository and tag. This is used for SYCL-based builds.

```cmake
set(SYCL_TLA_GIT_REPOSITORY "https://github.com/luoyu-intel/sycl-tla.git" CACHE STRING "sycl-tla git repository")
set(SYCL_TLA_GIT_TAG "260409" CACHE STRING "sycl-tla git tag/commit")
FetchContent_Declare(
  sycl_tla
  GIT_REPOSITORY ${SYCL_TLA_GIT_REPOSITORY}
  GIT_TAG ${SYCL_TLA_GIT_TAG}
)
```

--------------------------------

### Customized Data Preparation with Tokenizer

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Prepare a custom dataset using a tokenizer to convert text data into token IDs. Ensure data is processed according to the specified sequence length.
```python
def customized_data_with_tokenizer(tokenizer, seqlen=2048):
    # Important notice!!! AutoRound will drop data < args.seqlen
    data = ["AutoRound is an advanced quantization algorithm for low-bits LLM inference" * 240]
    tokens = []
    for d in data:
        token = tokenizer(d, truncation=True, max_length=seqlen, return_tensors="pt").data
        tokens.append(token)
    return tokens
```

--------------------------------

### Disable Activation Min-Max Scale Tuning

Source: https://github.com/intel/auto-round/blob/main/docs/environments_CN.md

Use a shell command to disable tuning of the min/max scale parameters in activation quantization.

```bash
export AR_ENABLE_ACT_MINMAX_TUNING=1
```

--------------------------------

### Quantize Diffusion Model using CLI

Source: https://github.com/intel/auto-round/blob/main/auto_round/compressors/diffusion/README.md

This bash command quantizes a diffusion model using the auto-round command-line interface. Specify the model, scheme, format, batch size, dataset, and output directory as arguments.

```bash
auto-round \
    --model black-forest-labs/FLUX.1-dev \
    --scheme MXFP8 \
    --format fake \
    --batch_size 1 \
    --dataset coco2014 \
    --output_dir ./tmp_autoround
```

--------------------------------

### Quantize MLLM using Command-Line Interface

Source: https://github.com/intel/auto-round/blob/main/auto_round/compressors/mllm/README.md

Quantize a multimodal model directly from the terminal using the 'auto-round' command. Specify the model, quantization scheme, desired output format, and output directory. Multiple formats can be exported.

```bash
auto-round \
    --model Qwen/Qwen2-VL-2B-Instruct \
    --scheme w4a16 \
    --format "auto_round" \
    --output_dir ./tmp_autoround
```

--------------------------------

### Auto Round Pure RTN Recipe

Source: https://github.com/intel/auto-round/blob/main/README.md

Use the 'auto-round-rtn' recipe for the fastest quantization with pure Round-to-Nearest mode (iters=0, no AutoRound optimization). It routes to model-free mode for supported INT WOQ schemes.
```bash
auto-round-rtn \
    --model Qwen/Qwen3-0.6B \
    --scheme "W4A16"
```

--------------------------------

### DataLoader Utility

Source: https://github.com/intel/auto-round/blob/main/test/README.md

Simple dataloader for calibration datasets.

```python
DataLoader()  # Simple dataloader for calibration datasets
```
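--------------------------------

### Round-to-Nearest Quantization Sketch (Illustrative)

The pure RTN recipe above skips tuning and rounds each weight to the nearest point on the quantization grid. The sketch below illustrates that idea for symmetric, group-wise integer quantization in plain Python. It is not AutoRound's implementation: the function names `rtn_quantize_group`, `rtn_quantize`, and `rtn_dequantize` are hypothetical, and the max-abs scale rule is an assumed, commonly used choice rather than something taken from the library.

```python
def rtn_quantize_group(weights, bits=4):
    """Symmetric round-to-nearest for one group (hypothetical helper, not AutoRound API)."""
    qmax = 2 ** (bits - 1) - 1   # 7 for 4-bit
    qmin = -(2 ** (bits - 1))    # -8 for 4-bit
    max_abs = max(abs(w) for w in weights)
    # Assumed scale rule: map the largest magnitude in the group onto qmax.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each weight to the nearest grid point, then clamp to the integer range.
    q = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return q, scale


def rtn_quantize(weights, bits=4, group_size=128):
    """Split a flat weight list into groups and quantize each group independently."""
    return [
        rtn_quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


def rtn_dequantize(groups):
    """Map the integer codes back to floats using each group's scale."""
    return [q * scale for codes, scale in groups for q in codes]


groups = rtn_quantize([3.5, -1.0, 0.5, 0.0], bits=4, group_size=4)
print(groups)  # [([7, -2, 1, 0], 0.5)]
```

A smaller `group_size` gives each scale fewer weights to cover, which usually lowers rounding error at the cost of storing more scales. The tuned recipes in this document differ from this sketch in that they optimize the rounding rather than taking the nearest grid point directly.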