### Install AutoRound Kernel from Source

Source: https://github.com/intel/auto-round/blob/main/auto_round_extension/ark/README.md

Follow these steps to build and install the AutoRound Kernel library from its source code.

```bash
python setup.py bdist_wheel && pip install dist/*
```

--------------------------------

### Install AutoRound from Source (GPU/CPU)

Source: https://github.com/intel/auto-round/blob/main/AGENTS.md

Install the AutoRound library from source for GPU/CPU support. The `--no-build-isolation` flag is required if PyTorch is already installed.

```bash
pip install --no-build-isolation -e .
```

--------------------------------

### Install Auto-Round

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Install the auto-round library using pip. This is the first step before proceeding with quantization.

```bash
pip install auto-round
```

--------------------------------

### Install AutoRound XPU Variant

Source: https://github.com/intel/auto-round/blob/main/AGENTS.md

Install the XPU-specific variant of the AutoRound library. Ensure Intel PyTorch is installed first, then proceed with the standard installation.

```bash
pip install torch --index-url https://download.pytorch.org/whl/xpu
pip install --no-build-isolation .
```

--------------------------------

### Enable vLLM-Ext at Runtime

Source: https://github.com/intel/auto-round/blob/main/auto_round_extension/vllm_ext/README.md

Start the vLLM server with the `VLLM_ENABLE_AR_EXT` environment variable set to `1` to activate the auto-round extension.

```bash
VLLM_ENABLE_AR_EXT=1 vllm serve ...
```

--------------------------------

### Build and Install vLLM Extension

Source: https://github.com/intel/auto-round/blob/main/auto_round_extension/vllm_ext/README.md

Clone the vLLM fork at the `fused-moe-ar` branch, change into the checkout, and install it in editable mode with precompiled extensions enabled. The `-vvv` flag prints verbose output for debugging.

```bash
git clone --branch fused-moe-ar https://github.com/yiliu30/vllm-fork.git
cd vllm-fork
VLLM_USE_PRECOMPILED=1 pip install --editable . -vvv
```

--------------------------------

### Build Auto Round from Source

Source: https://github.com/intel/auto-round/blob/main/README.md

Instructions for building Auto Round from source for CPU/GPU, HPU, or XPU. For HPU, a specific setup command is required.

```bash
# CPU(Xeon)/GPU(CUDA)
pip install .

# HPU(Gaudi)
python setup.py install hpu

# XPU(Intel GPU)
pip install torch --index-url https://download.pytorch.org/whl/xpu
pip install .
```

--------------------------------

### Project and Dependency Setup

Source: https://github.com/intel/auto-round/blob/main/auto_round_extension/ark/auto_round_kernel/bestla/CMakeLists.txt

Initializes the CMake project, sets the minimum version, and includes necessary modules for SIMD and SYCL support. It also fetches and makes available the xbyak library.

```cmake
cmake_minimum_required(VERSION 3.12)
project(bestla LANGUAGES CXX VERSION 0.1.0)

if(BTLA_SYCL)
  include(cmake/sycl.cmake)
endif()
include(cmake/FindSIMD.cmake)

file(GLOB headers ${PROJECT_NAME}/*.h ${PROJECT_NAME}/*.hpp)

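# FetchContent_Declare needs the FetchContent module; including it again is harmless if a parent scope already did.
include(FetchContent)
FetchContent_Declare(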
    xbyak
    GIT_REPOSITORY https://github.com/herumi/xbyak.git
    GIT_TAG v7.06
)
FetchContent_MakeAvailable(xbyak)

add_library(${PROJECT_NAME} INTERFACE)
target_link_libraries(${PROJECT_NAME} INTERFACE xbyak)
add_library(neural_speed::${PROJECT_NAME} ALIAS ${PROJECT_NAME})
target_include_directories(
  ${PROJECT_NAME} INTERFACE
  "$<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}>"
  "$<INSTALL_INTERFACE:${CMAKE_INSTALL_INCLUDEDIR}>"
)
```

--------------------------------

### Install Auto Round from PyPI

Source: https://github.com/intel/auto-round/blob/main/README.md

Install the Auto Round package for CPU/GPU, nightly builds, HPU, or XPU. For HPU, installation must be done inside a specific Docker container.

```bash
# CPU(Xeon)/GPU(CUDA)
pip install auto-round

# CPU(Xeon)/GPU(CUDA) nightly
pip install auto-round-nightly

# HPU(Gaudi)
# install inside the hpu docker container, e.g. vault.habana.ai/gaudi-docker/1.23.0/ubuntu24.04/habanalabs/pytorch-installer-2.9.0:latest  
pip install auto-round-hpu

# XPU(Intel GPU)
pip install torch --index-url https://download.pytorch.org/whl/xpu
pip install auto-round
```

--------------------------------

### Install AutoRound HPU Variant

Source: https://github.com/intel/auto-round/blob/main/AGENTS.md

Install the HPU-specific variant of the AutoRound library. This can be done using the `BUILD_HPU_ONLY` environment variable or by running the setup script directly.

```bash
BUILD_HPU_ONLY=1 pip install --no-build-isolation .
# or:
python setup.py hpu install
```

--------------------------------

### Install AutoRound Kernel via Pip

Source: https://github.com/intel/auto-round/blob/main/auto_round_extension/ark/README.md

Use this command to install the AutoRound Kernel library using pip.

```bash
pip install auto-round-lib
```

--------------------------------

### Minimal QuantLinear Usage Example

Source: https://github.com/intel/auto-round/blob/main/auto_round_extension/ark/README.md

Demonstrates the basic lifecycle of using the QuantLinear module for inference. Ensure quantized tensors are loaded and post_init() is called before forward pass.

```python
from auto_round_kernel.qlinear import QuantLinear

qlinear = QuantLinear(
    bits=4,
    group_size=128,
    sym=True,
    in_features=in_features,
    out_features=out_features,
    bias=bias is not None,
    weight_dtype=weight_dtype,
)
# Load qweight, qzeros, scales, and bias from checkpoint.
qlinear.post_init()

# Run inference
y = qlinear(x)
```

--------------------------------

### Specify Inference Backend with AutoRound

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Use AutoRoundConfig to specify a preferred backend like 'ark' for CPU and Intel GPU. Ensure corresponding libraries are installed.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoRoundConfig

model_name = "OPEA/Qwen2.5-1.5B-Instruct-int4-sym-inc"
quantization_config = AutoRoundConfig(backend="ark")
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="cpu", quantization_config=quantization_config, torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50, do_sample=False)[0]))
```

--------------------------------

### Install Dependencies

Source: https://github.com/intel/auto-round/blob/main/test/README.md

Install project dependencies and pytest using pip.

```sh
pip install -r ../requirements.txt
pip install pytest
```

--------------------------------

### Basic Quantization Test

Source: https://github.com/intel/auto-round/blob/main/test/README.md

Example of a basic test case for a new quantization method using AutoRound.

```python
# test_cpu/quantization/test_new_method.py
import pytest
from auto_round import AutoRound
from ...helpers import opt_name_or_path


class TestNewQuantMethod:
    def test_quantization(self, tiny_opt_model_path, dataloader):
        """Test new quantization method."""
        autoround = AutoRound(model=tiny_opt_model_path, bits=4, group_size=128, iters=2, dataset=dataloader)
        autoround.quantize()
        assert autoround is not None
```

--------------------------------

### vLLM Model Inference

Source: https://github.com/intel/auto-round/blob/main/README.md

Demonstrates how to perform model inference using the vLLM library. This example loads a quantized model and generates text based on provided prompts and sampling parameters. Ensure the model path is correctly specified.

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95)
model_name = "Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound"
llm = LLM(model=model_name)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

--------------------------------

### Apply AutoScheme with Fixed Layer Configuration

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

This example demonstrates how to apply AutoScheme while fixing the quantization scheme for specific layers using the `layer_config` parameter. It's useful for fine-tuning quantization on a per-layer basis.

```python
from auto_round import AutoRound, AutoScheme

model_name = "Qwen/Qwen3-8B"
avg_bits = 3.0
scheme = AutoScheme(avg_bits=avg_bits, options=("GGUF:Q2_K_S", "GGUF:Q4_K_S"), ignore_scale_zp_bits=True)
layer_config = {"lm_head": "GGUF:Q6_K"}

ar = AutoRound(model=model_name, scheme=scheme, layer_config=layer_config, iters=0)
ar.quantize_and_save()
```
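
To make the `avg_bits` target concrete, the sketch below computes a parameter-weighted average bit width for a hypothetical assignment of layers to 2-bit and 4-bit options. The layer names, parameter counts, and the `average_bits` helper are illustrative assumptions, not AutoScheme's actual search. Scale and zero-point overhead is left out, mirroring `ignore_scale_zp_bits=True`.

```python
def average_bits(assignment):
    """Parameter-weighted mean bit width over (num_params, bits) pairs."""
    total_params = sum(n for n, _ in assignment.values())
    total_bits = sum(n * bits for n, bits in assignment.values())
    return total_bits / total_params


# Hypothetical layers: name -> (parameter count, nominal weight bits).
assignment = {
    "layers.0.q_proj": (4_000_000, 4),
    "layers.0.k_proj": (1_000_000, 4),
    "layers.0.up_proj": (3_000_000, 2),
    "layers.0.down_proj": (2_000_000, 2),
}

avg = average_bits(assignment)
print(f"average bits: {avg:.2f}")
```

Here half of the parameters sit at 4 bits and half at 2 bits, so the mix lands on the 3.0-bit budget used in the example above.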

--------------------------------

### Model Inference Test with Helpers

Source: https://github.com/intel/auto-round/blob/main/test/README.md

Example of using helper functions for model path resolution and inference within a test.

```python
from ...helpers import model_infer, opt_name_or_path, get_model_path


def test_model_inference(tiny_opt_model_path):
    # Use predefined model path
    model_name = opt_name_or_path

    # Or resolve custom model path
    custom_model = get_model_path("custom/model-name")

    # Run inference using helper
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(tiny_opt_model_path)
    tokenizer = AutoTokenizer.from_pretrained(tiny_opt_model_path)
    output = model_infer(model, tokenizer, "Hello world")
```

--------------------------------

### Quantize VLM Model with AutoRound

Source: https://github.com/intel/auto-round/blob/main/README.md

Example of quantizing a Vision-Language Model (VLM) using AutoRound. This snippet shows how to load a VLM and apply a specified quantization scheme, saving the quantized model to an output directory. Note that quantizing non-text modules is an experimental feature.

```python
from auto_round import AutoRound

# Load the model
model_name_or_path = "Qwen/Qwen2.5-VL-7B-Instruct"
# Quantize the model
ar = AutoRound(model_name_or_path, scheme="W4A16")
output_dir = "./qmodel"
ar.quantize_and_save(output_dir)
```

--------------------------------

### AutoRound Command Line

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Use the AutoRound recipe for a good balance of accuracy and tuning cost. Recommended for most scenarios.

```bash
auto-round --model Qwen/Qwen3-0.6B  --scheme "W4A16"  --format "auto_gptq,auto_awq,auto_round"
```

--------------------------------

### Build and Run Bestla Benchmark

Source: https://github.com/intel/auto-round/blob/main/auto_round_extension/ark/auto_round_kernel/bestla/README.md

Compile the benchmark using CMake and then execute it. Ensure all necessary flags are set for a complete build.

```shell
mkdir build
cd build
cmake .. -DBTLA_UT_BENCHMARK=ON -DBTLA_UT_ALL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build . -j
./bin/bestla_benchmark
```

--------------------------------

### AutoRoundLight Command Line

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Use the AutoRoundLight recipe for the fastest tuning. It suits 4-bit settings and larger models, but may reduce accuracy for small models or 2-bit quantization.

```bash
auto-round-light --model Qwen/Qwen3-0.6B  --scheme "W4A16"  --format "auto_gptq,auto_awq,auto_round"
```

--------------------------------

### Compile Benchmark Executable

Source: https://github.com/intel/auto-round/blob/main/auto_round_extension/ark/auto_round_kernel/bestla/CMakeLists.txt

Sets up the build for the benchmark executable, including source file selection, OpenMP support, and platform-specific linker options.

```cmake
if(BTLA_UT_BENCHMARK)
  file(GLOB ut_headers ${PROJECT_NAME}/ut/*.h)
  include_directories(${PROJECT_NAME})
  if(NOT BTLA_SYCL)
    list(REMOVE_ITEM benchmark_srcs ${CMAKE_CURRENT_SOURCE_DIR}/${PROJECT_NAME}/ut/sycl_benchmark.cpp)
  endif()
  add_executable(${PROJECT_NAME}_benchmark ${benchmark_srcs} ${headers} ${ut_headers})
  if(BTLA_UT_OPENMP)
    include(FindOpenMP)
    target_compile_definitions(${PROJECT_NAME} INTERFACE BTLA_USE_OPENMP)
    target_link_libraries(${PROJECT_NAME}_benchmark PRIVATE OpenMP::OpenMP_CXX)
  endif()
  if(NOT WIN32)
    target_link_options(${PROJECT_NAME}_benchmark PRIVATE -lpthread)
  else()
    target_link_options(${PROJECT_NAME}_benchmark PUBLIC /STACK:5242880)
  endif()
  target_link_libraries(${PROJECT_NAME}_benchmark PRIVATE ${PROJECT_NAME} ${sycl_libs} dnnl)
  target_compile_options(${PROJECT_NAME}_benchmark PRIVATE -w)
  # Add SYCL target for Intel GPUs with XMX/2D block IO support (required for sycl-tla flash attention)
  if(BTLA_SYCL AND ARK_SYCL_TLA)
    # Header-only consumption of sycl-tla (do NOT build sycl-tla as a subproject).
    set(SYCL_TLA_GIT_REPOSITORY "https://github.com/intel/sycl-tla.git" CACHE STRING "sycl-tla git repository")
    set(SYCL_TLA_GIT_TAG "main" CACHE STRING "sycl-tla git tag/commit")

    FetchContent_Declare(
      sycl_tla
      GIT_REPOSITORY ${SYCL_TLA_GIT_REPOSITORY}
      GIT_TAG ${SYCL_TLA_GIT_TAG}
    )
    FetchContent_GetProperties(sycl_tla)
    if(NOT sycl_tla_POPULATED)
      FetchContent_Populate(sycl_tla)
    endif()

    set(_sycl_tla_include_dirs
      ${sycl_tla_SOURCE_DIR}/include
      ${sycl_tla_SOURCE_DIR}/applications
      ${sycl_tla_SOURCE_DIR}/tools/util/include
      ${sycl_tla_SOURCE_DIR}/examples/common
      ${sycl_tla_SOURCE_DIR}/examples/06_bmg_flash_attention
      ${sycl_tla_SOURCE_DIR}/examples/12_xe20_moe_gemm_cute_interface
    )
    foreach(_inc_dir IN LISTS _sycl_tla_include_dirs)
      if(EXISTS "${_inc_dir}")
        target_include_directories(${PROJECT_NAME}_benchmark PRIVATE "${_inc_dir}")
      endif()
    endforeach()

    # AOT compile target for Intel GPUs
    # Use intel_gpu_pvc for Data Center GPU Max series, or intel_gpu_bmg_g21 for Battlemage
    set(DPCPP_SYCL_TARGET "intel_gpu_bmg_g21" CACHE STRING "SYCL target (intel_gpu_pvc, intel_gpu_bmg_g21)")
    
    # Map target to device name for -Xs flag
    if(DPCPP_SYCL_TARGET STREQUAL "intel_gpu_bmg_g21" OR DPCPP_SYCL_TARGET STREQUAL "bmg")
      set(SYCL_DEVICE_NAME "bmg_g21")
    elseif(DPCPP_SYCL_TARGET STREQUAL "intel_gpu_pvc" OR DPCPP_SYCL_TARGET STREQUAL "pvc")
      set(SYCL_DEVICE_NAME "pvc")
    else()
      set(SYCL_DEVICE_NAME "${DPCPP_SYCL_TARGET}")
    endif()
    
    target_compile_definitions(${PROJECT_NAME}_benchmark PRIVATE ARK_SYCL_TLA=1 CUTLASS_ENABLE_SYCL=1 SYCL_INTEL_TARGET=1)
    # Compile flags (no AOT, JIT at runtime)
    target_compile_options(${PROJECT_NAME}_benchmark PRIVATE 
      -fsycl
      -fno-sycl-instrument-device-code)
    # Link flags: use spir64 (JIT) with device hint and enable required SPIR-V extensions
    target_link_options(${PROJECT_NAME}_benchmark PRIVATE 
      -fsycl
      -fsycl-targets=spir64
      "-Xs" "-device ${SYCL_DEVICE_NAME}"
      -Xspirv-translator
      "-spirv-ext=+SPV_INTEL_split_barrier,+SPV_INTEL_2d_block_io,+SPV_INTEL_subgroup_matrix_multiply_accumulate")
  endif()
endif(BTLA_UT_BENCHMARK)
```

--------------------------------

### AutoRoundBest Command Line

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Use the AutoRoundBest recipe for the highest accuracy, especially for 2-bit quantization. This is slower than the standard AutoRound recipe.

```bash
auto-round-best --model Qwen/Qwen3-0.6B  --scheme "W4A16"  --format "auto_gptq,auto_awq,auto_round"
```

--------------------------------

### Initialize AutoRound with Multi-GPU Device Map

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Instantiate AutoRound specifying multiple GPUs for tuning using a comma-separated string of device IDs.

```python
from auto_round import AutoRound

model_name_or_path = "Qwen/Qwen3-0.6B"
ar = AutoRound(
    model=model_name_or_path,
    device_map="0,1,2,3"
)
```
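
The comma-separated form is simply a list of device indices. As a rough illustration of how such a string can be interpreted, the hypothetical helper below (not part of AutoRound) splits it into integer IDs.

```python
def parse_device_ids(device_map: str) -> list[int]:
    """Split a string such as "0,1,2,3" into integer device indices."""
    return [int(part) for part in device_map.split(",") if part.strip()]


device_ids = parse_device_ids("0,1,2,3")
print(device_ids)
```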

--------------------------------

### Configure CUTLASS for SYCL

Source: https://github.com/intel/auto-round/blob/main/auto_round_extension/ark/auto_round_kernel/CMakeLists.txt

Sets CUTLASS build options to enable SYCL support and disable benchmarks, examples, tests, and tools. Also enables exporting compile commands.

```cmake
set(CUTLASS_ENABLE_SYCL ON)
set(CUTLASS_ENABLE_BENCHMARKS OFF)
set(CUTLASS_ENABLE_EXAMPLES OFF)
set(CUTLASS_ENABLE_TESTS OFF)
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
set(CUTLASS_ENABLE_LIBRARY OFF)
set(CUTLASS_ENABLE_TOOLS OFF)
set(CUTLASS_ENABLE_GDC_FOR_SM100_DEFAULT
    OFF
    CACHE BOOL "DISABLE CUDA")
```

--------------------------------

### Run All Tests

Source: https://github.com/intel/auto-round/blob/main/test/README.md

Execute all tests in the project.

```sh
pytest
```

--------------------------------

### Load and Quantize Model with AutoRound

Source: https://github.com/intel/auto-round/blob/main/README.md

Demonstrates loading a model and performing quantization using the AutoRound library. Specifies the quantization scheme and output directory. Supports various model formats.

```python
from auto_round import AutoRound

# Load a model (supports FP8/BF16/FP16/FP32)
model_name_or_path = "Qwen/Qwen3-0.6B"

# Available schemes: "W2A16", "W3A16", "W4A16", "W8A16", "NVFP4", "MXFP4" (no real kernels), "GGUF:Q4_K_M", etc.
ar = AutoRound(model_name_or_path, scheme="W4A16")

# Highest accuracy (4–5× slower).
# `low_gpu_mem_usage=True` saves ~20GB VRAM but runs ~30% slower.
# ar = AutoRound(model_name_or_path, nsamples=512, iters=1000, low_gpu_mem_usage=True)

# Faster quantization (2–3× speedup) with slight accuracy drop at W4G128.
# ar = AutoRound(model_name_or_path, nsamples=128, iters=50, lr=5e-3)

# Supported formats: "auto_round" (default), "auto_gptq", "auto_awq", "llm_compressor", "gguf:q4_k_m", etc.
ar.quantize_and_save(output_dir="./qmodel", format="auto_round")
```
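
For intuition about the storage cost of a scheme such as `W4A16` with group size 128, the sketch below estimates bits per weight when each group also stores one scale. The 16-bit scale width and the `effective_bits` helper are assumptions for illustration; real packed formats may carry extra metadata such as zero points.

```python
def effective_bits(weight_bits: int, group_size: int,
                   scale_bits: int = 16, zp_bits: int = 0) -> float:
    """Bits per weight once per-group metadata is spread over the group."""
    return weight_bits + (scale_bits + zp_bits) / group_size


w4g128 = effective_bits(weight_bits=4, group_size=128)
print(f"W4, group size 128: {w4g128} bits per weight")

# Rough size of a hypothetical block of one billion quantized weights.
num_weights = 1_000_000_000
size_bytes = num_weights * w4g128 / 8
print(f"approx. {size_bytes / 1e9:.3f} GB")
```

Smaller groups raise the per-weight overhead because each scale is shared by fewer weights.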

--------------------------------

### AutoRoundOptRTN Command Line

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Use the AutoRoundOptRTN recipe for optimized RTN without gradient computation. It's calibration-free and faster than AutoRound, offering good accuracy.

```bash
auto-round-opt-rtn --model Qwen/Qwen3-0.6B  --scheme "W4A16"  --format "auto_round"
```

--------------------------------

### Run AutoRound CLI with Multi-GPU Device Map

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Run the AutoRound command-line interface on multiple GPUs. `CUDA_VISIBLE_DEVICES` restricts which GPUs are visible to the process, and `--device_map "auto"` leaves device placement to AutoRound.

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 auto-round --model "Qwen/Qwen3-0.6B" --scheme "W4A16" --device_map "auto"
```

--------------------------------

### Load and Generate with Transformers on Various Backends

Source: https://github.com/intel/auto-round/blob/main/README.md

Load a quantized model using the Transformers library, supporting automatic backend selection for CPU, Intel GPU, Gaudi, and CUDA. Avoid manually moving the model to different devices during inference.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
```

--------------------------------

### AutoRoundRTN Command Line

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Use the AutoRoundRTN recipe for pure RTN without optimization. It's the fastest and uses the least memory but typically has lower accuracy.

```bash
auto-round-rtn --model Qwen/Qwen3-0.6B  --scheme "W4A16"  --format "auto_round"
```
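
To show what "pure RTN" means numerically, here is a minimal sketch of symmetric round-to-nearest quantization for a single weight group. It uses one common convention (scale chosen so the largest magnitude maps to the top integer level) and plain Python lists. The helper names are hypothetical, and AutoRound's actual RTN path may compute scales and pack values differently.

```python
def rtn_quantize_group(weights, bits=4):
    """Symmetric round-to-nearest for one group; returns (integer codes, scale)."""
    qmax = 2 ** (bits - 1) - 1   # 7 for 4-bit
    qmin = -(2 ** (bits - 1))    # -8 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


group = [0.7, -0.3, 0.12, 0.0]
codes, scale = rtn_quantize_group(group, bits=4)
restored = dequantize(codes, scale)
print(codes)
print(restored)
```

Each restored value differs from the original by at most half a quantization step. Tuning-based recipes try to reduce that rounding error, which plain RTN accepts as-is.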

--------------------------------

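### Adjust the Activation Quantization Scale Factor

Source: https://github.com/intel/auto-round/blob/main/docs/environments_CN.md

Use a shell command to set the scaling factor applied to the activation min/max values during activation quantization.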

```bash
export AR_ACT_SCALE=0.9
```

--------------------------------

### Download Models with ModelScope

Source: https://github.com/intel/auto-round/blob/main/docs/environments_CN.md

Use a shell command to configure AutoRound to download models from ModelScope.

```bash
export AR_USE_MODELSCOPE=true
```
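
Settings like this are ordinary environment variables, so they can also be set from Python before the library is imported. The snippet below sketches that pattern. The `env_flag` helper and its set of accepted truthy strings are assumptions for illustration, not AutoRound's actual parsing logic.

```python
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Interpret an environment variable as a boolean switch."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


# Set the variable in-process, before importing the library that reads it.
os.environ["AR_USE_MODELSCOPE"] = "true"

use_modelscope = env_flag("AR_USE_MODELSCOPE")
print(use_modelscope)
```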

--------------------------------

### Multi-GPU Evaluation with vLLM Backend (Manual Configuration)

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Manually configure multi-GPU evaluation for vLLM using `CUDA_VISIBLE_DEVICES` and `--vllm_args` for fine-grained control over tensor parallelism and GPU memory utilization.

```bash
CUDA_VISIBLE_DEVICES=0,1 auto-round "your_model_path" --eval --tasks lambada_openai --eval_backend vllm --vllm_args="tensor_parallel_size=2,gpu_memory_utilization=0.8"
```
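
The `--vllm_args` value is a comma-separated list of `key=value` pairs. As an illustration of how such a string maps to keyword arguments, the hypothetical parser below converts it to a dictionary with typed values. This is a sketch of the format only, not the CLI's actual implementation.

```python
import ast


def parse_kv_args(arg_string: str) -> dict:
    """Turn "a=1,b=0.5" into {"a": 1, "b": 0.5}; non-literal values stay strings."""
    result = {}
    for item in arg_string.split(","):
        key, _, value = item.partition("=")
        key, value = key.strip(), value.strip()
        try:
            result[key] = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            result[key] = value
    return result


vllm_kwargs = parse_kv_args("tensor_parallel_size=2,gpu_memory_utilization=0.8")
print(vllm_kwargs)
```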

--------------------------------

### Auto Round Light Speed Recipe

Source: https://github.com/intel/auto-round/blob/main/README.md

Utilize the 'auto-round-light' recipe for a 2-3X speedup in quantization. Expect a slight accuracy drop at W4 and a more significant drop at W2.

```bash
auto-round-light \
    --model Qwen/Qwen3-0.6B \
    --scheme "W4A16" 
```

--------------------------------

### Single GPU Evaluation with HF Backend

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Evaluate a model using the default HF backend. Specify the model, bits for quantization, desired formats, and evaluation tasks.

```bash
auto-round --model Qwen/Qwen3-0.6B --bits 4 --format "auto_round,auto_gptq" --tasks mmlu
```

--------------------------------

### AutoRoundLight API Usage

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Instantiate AutoRound with parameters for the AutoRoundLight recipe. This is optimized for speed and recommended for 4-bit settings and larger models.

```python
from auto_round import AutoRound

model_name_or_path = "Qwen/Qwen3-0.6B"

ar = AutoRound(
    model=model_name_or_path,
    scheme="W4A16",
    iters=50,
    lr=5e-3,
)

output_dir = "./tmp_autoround"
ar.quantize_and_save(output_dir, format="auto_round")
```

--------------------------------

### Adjust the Dynamo Cache Size Limit

Source: https://github.com/intel/auto-round/blob/main/docs/environments_CN.md

Use a shell command to adjust the minimum value of torch._dynamo parameters such as `cache_size_limit`.

```bash
export AR_DYNAMO_CACHE_SIZE_LIMIT=32
```

--------------------------------

### AWQ Algorithm CLI Usage

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Use this command to apply the AWQ algorithm for quantization via the command line. Specify the model, quantization scheme, algorithm, and output format.

```bash
auto-round --model Qwen/Qwen3-0.6B --scheme "W4A16" --algorithm awq --format "auto_round"
```

--------------------------------

### CLI Usage for Model-Free Mode with Advanced Configuration

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Configure Model-Free mode with custom group size, asymmetric quantization, per-layer bit-width overrides, and ignored layers. This allows fine-grained control over quantization.

```bash
# With per-layer configuration and ignored layers
auto_round meta-llama/Llama-3.2-1B-Instruct \
  --model_free \
  --scheme W4A16 \
  --group_size 32 \
  --asym \
  --layer_config "{k_proj:{bits:8},v_proj:{bits:8}}" \
  --ignore_layers "mlp" \
  --output_dir ./int4-llama
```

--------------------------------

### Customized Data Preparation

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Prepare a custom dataset as a list of strings for auto-round quantization. Data shorter than the sequence length will be dropped, and longer data will be truncated.

```python
def customized_data():
    # Important Notice!!! AutoRound will drop data < args.seqlen and truncate data to args.seqlen
    data = ["AutoRound is an advanced quantization algorithm for low-bits LLM inference" * 240]
    return data
```
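
The drop-and-truncate rule stated in the comment can be pictured with token lists. The sketch below is a stand-alone illustration of that rule using a hypothetical `apply_seqlen_rule` helper; it is not AutoRound's own preprocessing code.

```python
def apply_seqlen_rule(samples, seqlen):
    """Drop samples shorter than seqlen and truncate the rest to seqlen tokens."""
    return [tokens[:seqlen] for tokens in samples if len(tokens) >= seqlen]


# Three fake tokenized samples with 5, 8, and 12 tokens.
samples = [list(range(5)), list(range(8)), list(range(12))]
kept = apply_seqlen_rule(samples, seqlen=8)
print([len(s) for s in kept])
```

Only the 8-token and 12-token samples survive, and both end up exactly 8 tokens long. This is why the example above repeats its sentence 240 times: a short string would be dropped.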

--------------------------------

### Tiny Model Creation and Saving

Source: https://github.com/intel/auto-round/blob/main/test/README.md

Utilities to create and save smaller versions of models for testing.

```python
get_tiny_model(model_path, num_layers=2)  # Create tiny model by slicing layers
save_tiny_model(model_path, save_path)  # Save tiny model to disk
```

--------------------------------

### Multi-GPU Evaluation with HF Backend

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Evaluate a model across multiple GPUs using the HF backend. Use `--device_map` to specify the GPUs and `--eval` to enable evaluation mode.

```bash
auto-round --model="your_model_path" --eval --device_map 0,1 --tasks lambada_openai --eval_bs 16
```

--------------------------------

### Disable OffloadManager Weight Offloading

Source: https://github.com/intel/auto-round/blob/main/docs/environments_CN.md

Use a shell command to force-disable weight offloading in AutoRound's OffloadManager.

```bash
export AR_DISABLE_OFFLOAD=1
```

--------------------------------

### Run Tests with Verbose Output

Source: https://github.com/intel/auto-round/blob/main/test/README.md

Execute tests and display detailed output, including captured stdout/stderr.

```sh
pytest -v -s
```

--------------------------------

### Multi-GPU Evaluation with vLLM Backend (Device Map)

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Perform multi-GPU evaluation with vLLM by specifying the device map. This is an alternative to manual environment variable configuration.

```bash
auto-round "your_model_path" --eval --device_map 0,1 --tasks lambada_openai --eval_backend vllm
```

--------------------------------

### Single GPU Evaluation with vLLM Backend

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Evaluate a model using the vLLM backend. This requires specifying the evaluation backend.

```bash
auto-round --model Qwen/Qwen3-0.6B --bits 4 --format "auto_round,auto_gptq" --tasks mmlu --eval_backend vllm
```

--------------------------------

### Benchmark Source Files

Source: https://github.com/intel/auto-round/blob/main/auto_round_extension/ark/auto_round_kernel/bestla/CMakeLists.txt

Defines the source files for benchmark executables, including general benchmarks and SYCL-specific benchmarks. Commented-out lines indicate potential future additions or alternative benchmark files.

```cmake
set(benchmark_srcs ${CMAKE_CURRENT_SOURCE_DIR}/${PROJECT_NAME}/ut/bestla_benchmark.cpp)
list(APPEND benchmark_srcs ${CMAKE_CURRENT_SOURCE_DIR}/${PROJECT_NAME}/ut/sycl_benchmark.cpp)
# Flash attention benchmarks are in separate files to avoid header conflicts
#list(APPEND benchmark_srcs ${CMAKE_CURRENT_SOURCE_DIR}/${PROJECT_NAME}/ut/sycl_tla_flash_attn_prefill_bench.cpp)
#list(APPEND benchmark_srcs ${CMAKE_CURRENT_SOURCE_DIR}/${PROJECT_NAME}/ut/sycl_tla_flash_attn_decode_bench.cpp)
```

--------------------------------

### Disable Dataset Subprocess Preprocessing

Source: https://github.com/intel/auto-round/blob/main/docs/environments_CN.md

Use a shell command to disable AutoRound's dataset preprocessing in a subprocess.

```bash
export AR_DISABLE_DATASET_SUBPROCESS=true
```

--------------------------------

### AutoRoundBest API Usage

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Instantiate AutoRound with parameters for the AutoRoundBest recipe. This is suitable for achieving the highest accuracy, especially with 2-bit quantization.

```python
from auto_round import AutoRound

model_name_or_path = "Qwen/Qwen3-0.6B"
ar = AutoRound(model=model_name_or_path, scheme="W4A16", nsamples=512, iters=1000, low_gpu_mem_usage=True)

output_dir = "./tmp_autoround"
ar.quantize_and_save(output_dir, format="auto_round")
```

--------------------------------

### AWQ Algorithm API Usage

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Instantiate AutoRound with AWQ algorithm and quantize and save the model. Ensure the output directory is specified.

```python
from auto_round import AutoRound

ar = AutoRound(
    "Qwen/Qwen3-0.6B",
    scheme="INT8",
    algorithm="awq",
)

output_dir = "./tmp_awq"
ar.quantize_and_save(output_dir, format="auto_round:llm_compressor")
```

--------------------------------

### Set the Working Directory

Source: https://github.com/intel/auto-round/blob/main/docs/environments_CN.md

Use a shell command to specify a custom working directory for AutoRound.

```bash
export AR_WORK_SPACE=/path/to/custom/workspace
```

--------------------------------

### Basic CLI Usage for Model Quantization

Source: https://github.com/intel/auto-round/blob/main/README.md

Perform model quantization using the auto-round CLI. Set the model, quantization scheme, format, and output directory. ModelScope is supported for model downloads by setting AR_USE_MODELSCOPE=1.

```bash
auto-round \
    --model Qwen/Qwen3-0.6B \
    --scheme "W4A16" \
    --format "auto_round" \
    --output_dir ./tmp_autoround
```
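
When the CLI is driven from a script, building the argument list in Python avoids shell-quoting mistakes. The sketch below assembles the same command using only the flags shown above. The `build_auto_round_command` helper is hypothetical, and the command is printed rather than executed.

```python
import shlex


def build_auto_round_command(model, scheme, fmt, output_dir):
    """Return the auto-round invocation as an argument list."""
    return [
        "auto-round",
        "--model", model,
        "--scheme", scheme,
        "--format", fmt,
        "--output_dir", output_dir,
    ]


cmd = build_auto_round_command(
    "Qwen/Qwen3-0.6B", "W4A16", "auto_round", "./tmp_autoround"
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```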

--------------------------------

### Enable Compile Packing

Source: https://github.com/intel/auto-round/blob/main/docs/environments_CN.md

Use a shell command to enable AutoRound's compile-packing optimization.

```bash
export AR_ENABLE_COMPILE_PACKING=1
```

--------------------------------

### Configure SYCL TLA FetchContent

Source: https://github.com/intel/auto-round/blob/main/auto_round_extension/ark/auto_round_kernel/CMakeLists.txt

Fetches the sycl-tla library from a Git repository and tag. This is used for SYCL-based builds.

```cmake
set(SYCL_TLA_GIT_REPOSITORY "https://github.com/luoyu-intel/sycl-tla.git" CACHE STRING "sycl-tla git repository")
set(SYCL_TLA_GIT_TAG "260409" CACHE STRING "sycl-tla git tag/commit")

FetchContent_Declare(
  sycl_tla
  GIT_REPOSITORY ${SYCL_TLA_GIT_REPOSITORY}
  GIT_TAG ${SYCL_TLA_GIT_TAG}
)
```

--------------------------------

### Customized Data Preparation with Tokenizer

Source: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md

Prepare a custom dataset using a tokenizer to convert text data into token IDs. Ensure data is processed according to the specified sequence length.

```python
def customized_data_with_tokenizer(tokenizer, seqlen=2048):
    # Important notice: AutoRound will drop data shorter than args.seqlen
    data = ["AutoRound is an advanced quantization algorithm for low-bits LLM inference" * 240]
    tokens = []
    for d in data:
        token = tokenizer(d, truncation=True, max_length=seqlen, return_tensors="pt").data
        tokens.append(token)
    return tokens
```

--------------------------------

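### Toggle Activation Min-Max Scale Parameter Tuning

Source: https://github.com/intel/auto-round/blob/main/docs/environments_CN.md

Use a shell command to control tuning of the min/max scale parameters in activation quantization. The source describes this setting as disabling the tuning, although the variable name `AR_ENABLE_ACT_MINMAX_TUNING` suggests that a value of `1` enables it; check the linked document for the intended value.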

```bash
export AR_ENABLE_ACT_MINMAX_TUNING=1
```

--------------------------------

### Quantize Diffusion Model using CLI

Source: https://github.com/intel/auto-round/blob/main/auto_round/compressors/diffusion/README.md

Quantize a diffusion model with the `auto-round` command-line interface by passing the model, scheme, format, batch size, dataset, and output directory as arguments.

```bash
auto-round \
    --model black-forest-labs/FLUX.1-dev \
    --scheme MXFP8 \
    --format fake \
    --batch_size 1 \
    --dataset coco2014 \
    --output_dir ./tmp_autoround
```

--------------------------------

### Quantize MLLM using Command-Line Interface

Source: https://github.com/intel/auto-round/blob/main/auto_round/compressors/mllm/README.md

Execute quantization for a multimodal model directly from the terminal using the 'auto-round' command. Specify the model, quantization scheme, desired output format, and output directory. Multiple formats can be exported.

```bash
auto-round \
    --model Qwen/Qwen2-VL-2B-Instruct \
    --scheme w4a16 \
    --format "auto_round" \
    --output_dir ./tmp_autoround
```

--------------------------------

### Auto Round Pure RTN Recipe

Source: https://github.com/intel/auto-round/blob/main/README.md

Use the 'auto-round-rtn' recipe for the fastest quantization with pure Round-to-Nearest mode (iters=0, no AutoRound optimization). It routes to model-free mode for supported INT WOQ schemes.

```bash
auto-round-rtn \
    --model Qwen/Qwen3-0.6B \
    --scheme "W4A16"
```

--------------------------------

### DataLoader Utility

Source: https://github.com/intel/auto-round/blob/main/test/README.md

Simple dataloader for calibration datasets.

```python
DataLoader()  # Simple dataloader for calibration datasets
```