### Install GPTQModel from Source

Source: https://github.com/modelcloud/gptqmodel/blob/main/README.md

Install GPTQModel directly from its source repository. Ensure python3-dev is installed for source builds. Optional modules can also be included.

```bash
# clone repo
git clone https://github.com/ModelCloud/GPTQModel.git && cd GPTQModel

# python3-dev is required for some source installs
apt install python3-dev

# pip: install from source
# You can install optional modules like vllm, sglang, bitblas.
# Example: pip install -v ".[vllm,sglang,bitblas]"
pip install -v .
```

--------------------------------

### Authoring Surfaces - Python DSL Example

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

A concise Python DSL example for defining quantization rules for weights.

```python
Rule(
    match="*",
    weight={
        "quantize": gptq(bits=4, sym=True, group_size=128),
        "export": {"format": "gptq"},
    },
)
```

--------------------------------

### Install Evalution for Benchmarking

Source: https://github.com/modelcloud/gptqmodel/blob/main/README.md

Installs the Evalution library, a benchmarking toolkit for LLMs, which integrates with GPTQModel.

```bash
# install Evalution
pip install Evalution
```

--------------------------------

### Authoring Surfaces - YAML Example

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

The equivalent YAML configuration for defining quantization rules for weights, matching the Python DSL example.

```yaml
match: "*"
weight:
  quantize:
    method: gptq
    bits: 4
    sym: true
    group_size: 128
  export:
    format: gptq
```

--------------------------------

### Define Aliases and Actions in Python

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Python example demonstrating how to define aliases for reusable tensor references and apply actions using these aliases.

```python
Rule(
    match=".*self_attn$",
    aliases={"proj": ["q_proj", "k_proj", "v_proj", "o_proj"]},
    actions=[
        record_stats(targets="@proj"),
        inspect_outliers(targets="@proj"),
    ],
)
```
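One way to read the `@proj` reference is as a name that expands to the listed tensors before an action runs. The sketch below illustrates that expansion in plain Python. `expand_targets` is a hypothetical helper, not part of GPTQModel, and the assumption that an alias expands to a flat list of names is ours.

```python
def expand_targets(targets, aliases):
    """Expand "@name" references into the tensor names listed under that alias.

    Hypothetical helper for illustration only; not a GPTQModel API.
    """
    if isinstance(targets, str):
        targets = [targets]
    expanded = []
    for target in targets:
        if target.startswith("@"):
            # An unknown alias raises KeyError instead of being silently ignored.
            expanded.extend(aliases[target[1:]])
        else:
            expanded.append(target)
    return expanded


aliases = {"proj": ["q_proj", "k_proj", "v_proj", "o_proj"]}
print(expand_targets("@proj", aliases))
```

Defining the alias once keeps the two actions above pointed at the same tensor list.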

--------------------------------

### Common Export Format Examples

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Examples of common export formats and their variants, including GPTQ, AWQ, FP8, FP4, and GGUF.

```json
{"format": "gptq"}
```

```json
{"format": "awq", "variant": "gemm"}
```

```json
{"format": "awq", "variant": "gemv"}
```

```json
{"format": "fp8", "variant": "e4m3fn", "impl": "transformer_engine"}
```

```json
{"format": "fp4", "variant": "nvfp4", "impl": "modelopt"}
```

```json
{"format": "gguf", "variant": "q4_k_m"}
```

--------------------------------

### Weight Target Section in Python

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Python example showing the structure of the 'weight' target section, including optional prepare, quantize, and export configurations.

```python
weight={
    "prepare": [...],     # optional
    "quantize": ...,      # optional
    "export": ...,        # optional
}
```

--------------------------------

### Install GPTQModel via PIP/UV

Source: https://github.com/modelcloud/gptqmodel/blob/main/README.md

Install the GPTQModel package using pip or uv. Optional modules like autoround, ipex, vllm, sglang, and bitblas can be included.

```bash
# You can install optional modules like autoround, ipex, vllm, sglang, bitblas.
# Example: pip install -v "gptqmodel[vllm,sglang,bitblas]"
pip install -v gptqmodel
# or, using uv:
uv pip install -v gptqmodel
```

--------------------------------

### Separate Quantize and Export in Python

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Example in Python showing distinct 'quantize' and 'export' configurations, where RTN is used for quantization and GPTQ for export.

```python
weight={
    "quantize": rtn(bits=4, sym=True),
    "export": {"format": "gptq", "impl": "default"},
}
```
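For readers unfamiliar with RTN (round-to-nearest), the following sketch shows symmetric 4-bit rounding on a plain list of floats. It is a generic illustration of the technique with a single scale for the whole list. The function names are hypothetical, and it does not reproduce GPTQModel's `rtn` implementation or the packing performed by the GPTQ export.

```python
def rtn_quantize(values, bits=4):
    """Symmetric round-to-nearest quantization with one scale for the whole list."""
    qmax = 2 ** (bits - 1) - 1   # 7 for 4 bits
    qmin = -(2 ** (bits - 1))    # -8 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each value to the nearest integer code and clamp to the signed range.
    codes = [min(qmax, max(qmin, round(v / scale))) for v in values]
    return codes, scale


def rtn_dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


weights = [-7.0, 1.2, 3.4, 7.0]
codes, scale = rtn_quantize(weights, bits=4)
print(codes, scale)
print(rtn_dequantize(codes, scale))
```

Unlike GPTQ, this rounding needs no calibration data, which is why it is often paired with a separate choice of export format.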

--------------------------------

### Compose Quantization Rules: Default and Skip (YAML)

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

YAML example demonstrating rule composition, with a broad default rule and a specific rule to skip quantization for 'layer0.qkv'.

```yaml
- match: "*"
  weight:
    prepare:
      - method: pad.columns
        multiple: 4
        semantic: true
    quantize:
      method: gptq
      bits: 4
      sym: true
      group_size: 128
    export:
      format: gptq
      impl: default
  input:
    quantize:
      method: mxfp4
      mode: dynamic
      block_size: 32
      scale_bits: 8
    export:
      format: fp4
      variant: mxfp4
      impl: modelopt

- match: "layer0.qkv"
  weight:
    quantize:
      method: skip
```

--------------------------------

### Separate Quantize and Export in YAML

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Example in YAML showing distinct 'quantize' and 'export' configurations, where RTN is used for quantization and GPTQ for export.

```yaml
weight:
  quantize:
    method: rtn
    bits: 4
    sym: true
  export:
    format: gptq
    impl: default
```

--------------------------------

### Dynamic Quantization Configuration Example

Source: https://github.com/modelcloud/gptqmodel/blob/main/README.md

Shows how to configure dynamic quantization overrides for specific modules within a model. It includes positive matches for overriding bits/group_size and negative matches for skipping modules.

```python
dynamic = { 
    # `.*\. ` matches the layers_node prefix 
    # layer index starts at 0 
    
    # positive match: layer 19, gate module 
    r"+:.*\.18\..*gate.*": {"bits": 4, "group_size": 32},  
    
    # positive match: layer 20, gate module (prefix defaults to positive if missing)
    r".*\.19\..*gate.*": {"bits": 8, "group_size": 64},  
    
    # negative match: skip layer 21, gate module
    r"-:.+\.20\..*gate.*": {}, 
    
    # negative match: skip all down modules for all layers
    r"-:.+\.down.*": {},  
 }
```
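To make the prefix convention concrete, here is a minimal sketch of how such a table could be resolved for a given module name. It is not GPTQModel's implementation: `resolve_dynamic`, the `SKIP` sentinel, the first-match-wins order, and the use of `re.match` are all assumptions made for illustration.

```python
import re

dynamic = {
    r"+:.*\.18\..*gate.*": {"bits": 4, "group_size": 32},
    r".*\.19\..*gate.*": {"bits": 8, "group_size": 64},
    r"-:.+\.20\..*gate.*": {},
    r"-:.+\.down.*": {},
}

SKIP = object()  # sentinel meaning "do not quantize this module"


def resolve_dynamic(module_name, dynamic, base):
    """Return merged settings for module_name, or SKIP on a negative match.

    Assumes the first matching pattern wins and that patterns are anchored
    at the start of the name (re.match); both are assumptions of this sketch.
    """
    for pattern, overrides in dynamic.items():
        negative = pattern.startswith("-:")
        if pattern.startswith(("+:", "-:")):
            pattern = pattern[2:]  # a missing prefix is treated as positive
        if re.match(pattern, module_name):
            return SKIP if negative else {**base, **overrides}
    return dict(base)


base = {"bits": 4, "group_size": 128}
print(resolve_dynamic("model.layers.18.mlp.gate_proj", dynamic, base))
print(resolve_dynamic("model.layers.20.mlp.gate_proj", dynamic, base) is SKIP)
```

Modules that match no pattern keep the base settings unchanged.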

--------------------------------

### GPTQ Quantization and Export in Python

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Example in Python specifying GPTQ for quantization and 'gptq' format for export, indicating GPTQ packing is part of export realization.

```python
weight={
    "quantize": gptq(bits=4, sym=True, group_size=128),
    "export": {"format": "gptq", "impl": "default"},
}
```

--------------------------------

### GPTQ Quantization and Export in YAML

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Example in YAML specifying GPTQ for quantization and 'gptq' format for export, indicating GPTQ packing is part of export realization.

```yaml
weight:
  quantize:
    method: gptq
    bits: 4
    sym: true
    group_size: 128
  export:
    format: gptq
    impl: default
```

--------------------------------

### Quantize Model using GGUF Format

Source: https://github.com/modelcloud/gptqmodel/blob/main/README.md

Example of loading a model and quantizing it using the GGUF format with Q4_K_M quantization settings. Calibration is set to None for weight-only quantization.

```python
from gptqmodel import BACKEND, GGUFConfig, GPTQModel

model_id = "meta-llama/Llama-3.2-1B-Instruct"
quant_path = "Llama-3.2-1B-Instruct-GGUF-Q4_K_M"

qcfg = GGUFConfig(
    bits=4,
    format="q_k_m",
)

model = GPTQModel.load(model_id, qcfg)
model.quantize(calibration=None, backend=BACKEND.GGUF_TORCH)
model.save(quant_path)
```

--------------------------------

### GPTQModel Inference Example

Source: https://github.com/modelcloud/gptqmodel/blob/main/README.md

Perform inference using GPTQModel with a three-line API. Load a model and generate text, then decode the tokens to a string.

```python
from gptqmodel import GPTQModel

model = GPTQModel.load("ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v2.5")
result = model.generate("Uncovering deep insights begins with")[0] # tokens
print(model.tokenizer.decode(result)) # string output
```

--------------------------------

### Patching Export Rules in YAML

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Example of applying specific export configurations to modules matching a pattern using YAML.

```yaml
- match: "*"
  weight:
    export:
      format: awq
      variant: gemm
      impl: llm_awq
      version: 2

- match: ".*small_proj$"
  weight:
    export:
      variant: gemv
```

--------------------------------

### Define Quantization Method

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Specifies the quantization method to be used. Examples include gptq, rtn, mxfp4, int8, and skip.

```text
gptq(bits=4, sym=True, group_size=128)
```

```text
rtn(bits=4, sym=True)
```

```text
mxfp4(mode="dynamic", block_size=32, scale_bits=8)
```

```text
int8(calibration=observer("max"))
```

```text
skip()
```

--------------------------------

### Compose Quantization Rules: Override Bits (Python)

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Python example showing how to override specific quantization parameters, like 'bits', for a subset of matched layers using a narrower rule.

```python
Rule(
    match="*",
    weight={
        "quantize": gptq(bits=4, sym=True, group_size=128),
        "export": {"format": "gptq", "impl": "default"},
    },
)

Rule(
    match=".*(q_proj|k_proj)$",
    weight={
        "quantize": {"bits": 8},
    },
)
```
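The effect of the narrower rule can be pictured as a recursive merge over the broad rule's settings. The sketch below uses plain dictionaries in place of the DSL objects. `deep_merge`, `matches`, and `resolve` are hypothetical helpers, and treating `"*"` as match-all and other patterns as regular expressions is an assumption.

```python
import copy
import re


def deep_merge(base, patch):
    """Recursively merge patch into a copy of base; patch values win."""
    merged = copy.deepcopy(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def matches(pattern, name):
    # Assumption: "*" matches every module; anything else is a regex.
    return pattern == "*" or re.search(pattern, name) is not None


def resolve(name, rules):
    """Apply matching rules in order so later rules patch earlier ones."""
    config = {}
    for rule in rules:
        if matches(rule["match"], name):
            config = deep_merge(config, rule["weight"])
    return config


rules = [
    {"match": "*", "weight": {
        "quantize": {"method": "gptq", "bits": 4, "sym": True, "group_size": 128},
        "export": {"format": "gptq", "impl": "default"},
    }},
    {"match": ".*(q_proj|k_proj)$", "weight": {"quantize": {"bits": 8}}},
]

print(resolve("model.layers.0.self_attn.q_proj", rules)["quantize"])
print(resolve("model.layers.0.mlp.up_proj", rules)["quantize"])
```

Only `bits` changes for the matched projections; the method, symmetry, group size, and export settings carry over from the broad rule.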

--------------------------------

### Quantize Model using Exllama V3 (EXL3) Format

Source: https://github.com/modelcloud/gptqmodel/blob/main/README.md

Example of quantizing a model using the Exllama V3 format with specified bits, head_bits, and codebook settings. Requires a calibration dataset.

```python
from datasets import load_dataset
from gptqmodel import BACKEND, EXL3Config, GPTQModel

model_id = "meta-llama/Llama-3.2-1B-Instruct"
quant_path = "Llama-3.2-1B-Instruct-EXL3"

calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train",
).select(range(1024))["text"]

qcfg = EXL3Config(
    bits=4.0,        # target average bits-per-weight
    head_bits=6.0,   # optional higher bitrate for attention heads / sensitive tensors
    codebook="mcg",  # one of: mcg, mul1, 3inst
)

model = GPTQModel.load(model_id, qcfg)
model.quantize(calibration_dataset, batch_size=1, backend=BACKEND.EXL3_EXLLAMA_V3)
model.save(quant_path)
```

--------------------------------

### AWQ Quantization with Fallback and Smoothing

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Configures AWQ quantization with a fallback strategy, threshold, and optional smoothing parameters. This YAML example shows detailed fallback settings.

```yaml
weight:
  quantize:
    method: awq
    bits: 4
    group_size: 128
    fallback:
      strategy: rtn
      threshold: 1.0%
      smooth:
        type: mad
        k: 2.75
```

--------------------------------

### Load and Infer with EoRA-Enhanced GPTQ Model

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/eora/README.md

Example Python code for loading a GPTQ model with an EoRA adapter for inference. Ensure the adapter path and rank are correctly specified.

```python
from gptqmodel import GPTQModel
from gptqmodel.adapter.adapter import Lora

eora = Lora(
    # for EoRA generation, path is the adapter save path; for loading, it is the path to load from
    path='docs/eora/Llama-3.2-3B-4bits-eora_rank64_c4',
    rank=64,
)

model = GPTQModel.load(
    model_id_or_path='sliuau/Llama-3.2-3B_4bits_128group_size',
    adapter=eora,
)

tokens = model.generate("Capital of France is")[0]
result = model.tokenizer.decode(tokens)
print(f"Result: {result}")
```

--------------------------------

### Compose Quantization Rules: Default and Skip (Python)

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Example of composing quantization rules in Python, defining a global default rule and a narrower rule to skip quantization for specific layers.

```python
Rule(
    match="*",
    weight={
        "prepare": [pad.columns(multiple=4, semantic=True)],
        "quantize": gptq(bits=4, sym=True, group_size=128),
        "export": {"format": "gptq", "impl": "default"},
    },
    input={
        "quantize": mxfp4(mode="dynamic", block_size=32, scale_bits=8),
        "export": {
            "format": "fp4",
            "variant": "mxfp4",
            "impl": "modelopt",
        },
    },
)

Rule(
    match="layer0.qkv",
    weight={
        "quantize": skip(),
    },
)
```

--------------------------------

### Enable Group Aware Reordering (GAR)

Source: https://github.com/modelcloud/gptqmodel/blob/main/README.md

Example of creating a QuantizeConfig to enable Group Aware Reordering (GAR) by setting `act_group_aware` to True and `desc_act` to False.

```python
from gptqmodel import QuantizeConfig

quant_config = QuantizeConfig(bits=4, group_size=128, act_group_aware=True, desc_act=False)
```

--------------------------------

### Run GSM8K Benchmark with GPTQModel via Evalution

Source: https://github.com/modelcloud/gptqmodel/blob/main/README.md

Example of running the gsm8k_platinum benchmark using Evalution's native GPTQModel engine with the 'marlin' backend on CUDA. It specifies a model and benchmark parameters.

```python
import evalution as eval

run = (
    eval.GPTQModel(
        backend="marlin",
        device="cuda:0",
    )
    .model(eval.Model(path="ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v1"))
    .run(eval.benchmarks.gsm8k_platinum(apply_chat_template=True, batch_size=16))
)

print(run.to_dict()["tests"][0]["metrics"])
```

--------------------------------

### Configure GPTQ and GGUF Quantization with Preprocessors

Source: https://github.com/modelcloud/gptqmodel/blob/main/README.md

Demonstrates how to configure GPTQConfig and GGUFConfig with various preprocessors like SmootherConfig, AutoModuleDecoderConfig, and TensorParallelPadderConfig.

```python
import torch
from gptqmodel import GGUFConfig, GPTQConfig
from gptqmodel.quantization import (
    AutoModuleDecoderConfig,
    SmoothMAD,
    SmootherConfig,
    TensorParallelPadderConfig,
)

gptq_cfg = GPTQConfig(
    bits=4,
    group_size=128,
    preprocessors=[
        SmootherConfig(smooth=SmoothMAD(k=2.0)),
        AutoModuleDecoderConfig(target_dtype=torch.bfloat16),
        TensorParallelPadderConfig(),
    ],
)

gguf_cfg = GGUFConfig(
    bits=4,
    format="q_k_m",
    preprocessors=[
        AutoModuleDecoderConfig(target_dtype=torch.bfloat16),
        TensorParallelPadderConfig(),
    ],
)
```

--------------------------------

### Configure Activation-Aware GPTQ (Python)

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Configure how GPTQ treats input activations during quantization. Use 'ignore' for classic weight-only GPTQ, or 'fake' to optimize against fake-quantized inputs.

```python
gptq(
    bits=4,
    sym=True,
    group_size=128,
    activation_mode="ignore",   # or "fake", later possibly "real"
)
```

--------------------------------

### Advanced Replace Mode in YAML

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Example of using 'mode: replace' in YAML for advanced control, overriding default patch merging behavior.

```yaml
match: "layer0.qkv"
weight:
  mode: replace
  prepare:
    - method: pad.columns
      multiple: 4
      semantic: true
  quantize:
    method: skip
```

--------------------------------

### Advanced Replace Mode in Python

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Example of using 'mode: replace' in Python for advanced control, overriding default patch merging behavior.

```python
Rule(
    match="layer0.qkv",
    weight={
        "mode": "replace",
        "prepare": [pad.columns(multiple=4, semantic=True)],
        "quantize": skip(),
    },
)
```
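The difference between the default patch behavior and `mode: replace` can be sketched with plain dictionaries. `apply_weight_patch` is a hypothetical helper, and the merge semantics shown (shallow key merge by default, full discard of inherited settings on replace) are assumptions about the protocol, not a copy of GPTQModel's code.

```python
def apply_weight_patch(inherited, patch):
    """Combine an inherited weight section with a narrower rule's patch.

    Assumption: without "mode", top-level keys are merged; with
    "mode": "replace", the inherited section is discarded entirely.
    """
    patch = dict(patch)
    mode = patch.pop("mode", "patch")  # "patch" is a hypothetical default name
    if mode == "replace":
        return patch
    return {**inherited, **patch}


inherited = {
    "quantize": {"method": "gptq", "bits": 4},
    "export": {"format": "gptq", "impl": "default"},
}
override = {
    "prepare": [{"method": "pad.columns", "multiple": 4, "semantic": True}],
    "quantize": {"method": "skip"},
}

merged = apply_weight_patch(inherited, override)
replaced = apply_weight_patch(inherited, {"mode": "replace", **override})
print(sorted(merged))    # the inherited export section survives a merge
print(sorted(replaced))  # the inherited export section is gone after a replace
```

Under these assumptions, replace is the way to guarantee that no inherited `export` or `prepare` entries leak into the narrower rule.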

--------------------------------

### Enable GPTAQ Quantization

Source: https://github.com/modelcloud/gptqmodel/blob/main/README.md

Enable GPTAQ quantization by setting `gptaq = GPTAQConfig(...)`. Note that GPTAQ is experimental, not MoE compatible, and requires significantly more VRAM.

```python
# Note GPTAQ is currently experimental, not MoE compatible, and requires 2-4x more VRAM to execute
# We have many reports of GPTAQ not working better or exceeding GPTQ so please use for testing only
# If OOM on 1 GPU, please set CUDA_VISIBLE_DEVICES=0,1 to 2 GPUs and gptqmodel will auto use second GPU
quant_config = QuantizeConfig(bits=4, group_size=128, gptaq=GPTAQConfig(alpha=0.25, device="auto"))
```

--------------------------------

### Define Quantization Stages with Multiple Rules

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

This configuration sets up a 'ptq' stage with multiple rules for quantization. It includes actions for smoothquant, GPTQ quantization with specific parameters, and MXFP4 input quantization, along with export configurations for both weight and input tensors. A rule to skip quantization for 'layer0.qkv' is also included.

```python
version = 2

stages = [
    Stage(
        name="ptq",
        rules=[
            Rule(
                match=".*self_attn$",
                actions=[smoothquant(alpha=0.5)],
            ),
            Rule(
                match="*",
                weight={
                    "prepare": [clip.mad(k=2.75)],
                    "quantize": gptq(
                        bits=4,
                        sym=True,
                        group_size=128,
                        activation_mode="fake",
                    ),
                    "export": {"format": "gptq", "impl": "default"},
                },
                input={
                    "quantize": mxfp4(
                        mode="dynamic",
                        block_size=32,
                        scale_bits=8,
                    ),
                    "export": {
                        "format": "fp4",
                        "variant": "mxfp4",
                        "impl": "modelopt",
                    },
                },
            ),
            Rule(
                match="layer0.qkv",
                weight={
                    "quantize": skip(),
                },
            ),
        ],
    ),
]

```

```yaml
version: 2
stages:
  - name: ptq
    rules:
      - match: ".*self_attn$"
        actions:
          - method: smoothquant
            alpha: 0.5
      - match: "*"
        weight:
          prepare:
            - method: clip.mad
              k: 2.75
          quantize:
            method: gptq
            bits: 4
            sym: true
            group_size: 128
            activation_mode: fake
          export:
            format: gptq
            impl: default
        input:
          quantize:
            method: mxfp4
            mode: dynamic
            block_size: 32
            scale_bits: 8
          export:
            format: fp4
            variant: mxfp4
            impl: modelopt
      - match: "layer0.qkv"
        weight:
          quantize:
            method: skip

```

--------------------------------

### Load and Generate with GPTQModel

Source: https://github.com/modelcloud/gptqmodel/blob/main/README.md

Loads a pre-quantized model and generates text. Ensure the quant_path points to your quantized model.

```python
# test post-quant inference
from gptqmodel import GPTQModel

# quant_path is the directory the quantized model was saved to
model = GPTQModel.load(quant_path)
result = model.generate("Uncovering deep insights begins with")[0] # tokens
print(model.tokenizer.decode(result)) # string output
```

--------------------------------

### Load and Generate with GGUF Model

Source: https://github.com/modelcloud/gptqmodel/blob/main/README.md

This script demonstrates how to load a GGUF model using GPTQModel and generate text. No external 'gguf' PyPI package is required. You can optionally specify a 'profile' for loading, such as 'low_memory'.

```python
from gptqmodel import GPTQModel

model = GPTQModel.load("prism-ml/Bonsai-1.7B-gguf")
# or: model = GPTQModel.load("prism-ml/Bonsai-1.7B-gguf", profile="low_memory")

tokens = model.generate(
    "Who wrote Romeo and Juliet?",
    max_new_tokens=128,
)[0]

print(model.tokenizer.decode(tokens, skip_special_tokens=True))
```

--------------------------------

### XPU vs CPU INT4 Packing Visualization

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/torch_fused_int4_transformations.md

Illustrates the difference in INT4 packing between XPU (row-major lane packing) and CPU (byte-tiling).

```text
XPU:  | int32 lane | = [w7][w6][w5][w4][w3][w2][w1][w0]
CPU:  | uint8 lane | = [w1][w0]
```
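The two lane layouts can be reproduced with a few bit shifts. The sketch below assumes the diagram lists nibbles from most significant (left) to least significant (right), so `w0` lands in the lowest 4 bits of the lane. It only models a single lane, not the row-major or tiled ordering across lanes, and `pack_nibbles` and `unpack_nibbles` are hypothetical names.

```python
def pack_nibbles(values, lane_bits):
    """Pack 4-bit values into one lane, first value in the lowest nibble."""
    count = lane_bits // 4
    if len(values) != count:
        raise ValueError(f"expected {count} values, got {len(values)}")
    lane = 0
    for index, value in enumerate(values):
        if not 0 <= value <= 0xF:
            raise ValueError("each value must fit in 4 bits")
        lane |= value << (4 * index)
    return lane


def unpack_nibbles(lane, lane_bits):
    """Recover the 4-bit values from a lane, lowest nibble first."""
    return [(lane >> (4 * i)) & 0xF for i in range(lane_bits // 4)]


xpu_lane = pack_nibbles([0, 1, 2, 3, 4, 5, 6, 7], lane_bits=32)  # w0..w7
cpu_lane = pack_nibbles([1, 2], lane_bits=8)                      # w0, w1
print(hex(xpu_lane), hex(cpu_lane))
```

An int32 lane therefore carries eight weights while a uint8 lane carries two, which is the difference the diagram highlights.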

--------------------------------

### Configure Activation-Aware GPTQ (YAML)

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

YAML configuration for activation-aware GPTQ, specifying bits, symmetry, group size, and activation mode. 'ignore' is for weight-only GPTQ.

```yaml
method: gptq
bits: 4
sym: true
group_size: 128
activation_mode: ignore
```

--------------------------------

### Machete GEMM API Usage

Source: https://github.com/modelcloud/gptqmodel/blob/main/gptqmodel_ext/machete/Readme.md

Demonstrates the typical workflow for using Machete's GEMM operation, including prepacking the weight matrix before calling the main GEMM function. Ensure weights are prepacked using `machete_prepack_B`.

```python
from vllm import _custom_ops as ops

...
W_q_packed = ops.machete_prepack_B(w_q, wtype)
output = ops.machete_gemm(
    a,
    b_q=W_q_packed,
    b_type=wtype,
    b_scales=w_s,
    b_group_size=group_size
)
```

--------------------------------

### Evaluate EoRA and GPTQ Model Performance

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/eora/README.md

Run this command to evaluate the performance of a GPTQ quantized model with its corresponding EoRA on ARC-C and MMLU benchmarks. Ensure the paths and rank match your generation settings.

```shell
python docs/eora/evaluation.py --quantized_model sliuau/Llama-3.2-3B_4bits_128group_size \
    --eora_save_path docs/eora/Llama-3.2-3B-4bits-eora_rank64_c4 \
    --eora_rank 64
```

--------------------------------

### Compose Quantization Rules: Override Bits (YAML)

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

YAML example for overriding quantization bits. A base rule sets GPTQ with 4 bits, and a subsequent rule targets specific projections to use 8 bits.

```yaml
- match: "*"
  weight:
    quantize:
      method: gptq
      bits: 4
      sym: true
      group_size: 128
    export:
      format: gptq
      impl: default

- match: ".*(q_proj|k_proj)$"
  weight:
    quantize:
      bits: 8
```

--------------------------------

### GPTQ Per-Module Quantization Overrides

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Configure base GPTQ quantization with per-module overrides for bits and group size. This example sets a default 4-bit GPTQ with group size 128, then overrides 'up_proj' and 'gate_proj' to 8-bit, and 'down_proj' to 4-bit with group size 32.

```python
Stage(
    name="ptq",
    rules=[
        Rule(
            match="*",
            weight={
                "quantize": {
                    "method": "gptq",
                    "bits": 4,
                    "group_size": 128,
                },
                "export": {
                    "format": "gptq",
                    "impl": "default",
                },
            },
        ),
        Rule(
            match=r".*\.up_proj.*",
            weight={
                "quantize": {"bits": 8},
            },
        ),
        Rule(
            match=r".*\.gate_proj.*",
            weight={
                "quantize": {"bits": 8},
            },
        ),
        Rule(
            match=r".*\.down_proj.*",
            weight={
                "quantize": {"bits": 4, "group_size": 32},
            },
        ),
    ],
)
```

```yaml
stages:
  - name: ptq
    rules:
      - match: "*"
        weight:
          quantize:
            method: gptq
            bits: 4
            group_size: 128
          export:
            format: gptq
            impl: default
      - match: '.*\.up_proj.*'
        weight:
          quantize:
            bits: 8
      - match: '.*\.gate_proj.*'
        weight:
          quantize:
            bits: 8
      - match: '.*\.down_proj.*'
        weight:
          quantize:
            bits: 4
            group_size: 32
```

--------------------------------

### Quantize with One Method, Export as Another Format

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Quantizes with one method (here `rtn`) and exports in a different format (here `gptq`), since the quantization method and the export format are configured separately. Note that the two examples below use different `match` patterns, so they are not equivalents of each other.

```python
Rule(
    match="primary_projection",
    weight={
        "quantize": rtn(bits=4, sym=True),
        "export": {"format": "gptq", "impl": "default"},
    },
)
```

```yaml
match: ".*down_proj$"
weight:
  quantize:
    method: rtn
    bits: 4
    sym: true
  export:
    format: gptq
    impl: default
```

--------------------------------

### Shorthand Export Formats

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Shorthand string representations for common export formats like 'gptq' and 'native'.

```text
"gptq" == {"format": "gptq"}
```

```text
"native" == {"format": "native"}
```
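A loader that accepts both spellings would typically normalize the shorthand to the dictionary form before use. The following is a minimal sketch of that step; `normalize_export` is a hypothetical helper, not a GPTQModel function.

```python
def normalize_export(spec):
    """Return the dict form of an export spec given as a string or a dict."""
    if isinstance(spec, str):
        # "gptq" is shorthand for {"format": "gptq"}
        return {"format": spec}
    if isinstance(spec, dict):
        return dict(spec)
    raise TypeError(f"unsupported export spec: {spec!r}")


print(normalize_export("gptq"))
print(normalize_export({"format": "awq", "variant": "gemm"}))
```

Dictionaries pass through unchanged, so partial patches such as `{"variant": "gemv"}` remain valid input.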

--------------------------------

### Configure RTN with Weight Smoothing and AWQ GEMM Export

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

This configuration defines a 'weight_only' stage using the RTN method for quantization with 4 bits and a group size of 128. It includes weight smoothing using SmoothMAD and targets AWQ GEMM for export.

```python
Stage(
    name="weight_only",
    rules=[
        Rule(
            match="*",
            weight={
                "prepare": [
                    {"method": "smooth.mad", "k": 1.5},
                ],
                "quantize": {
                    "method": "rtn",
                    "bits": 4,
                    "group_size": 128,
                },
                "export": {
                    "format": "awq",
                    "variant": "gemm",
                },
            },
        ),
    ],
)
```

```yaml
stages:
  - name: weight_only
    rules:
      - match: "*"
        weight:
          prepare:
            - method: smooth.mad
              k: 1.5
          quantize:
            method: rtn
            bits: 4
            group_size: 128
          export:
            format: awq
            variant: gemm

```

--------------------------------

### Serve Model via OpenAI API Compatible Endpoint

Source: https://github.com/modelcloud/gptqmodel/blob/main/README.md

Make a loaded GPTQModel available through an OpenAI API compatible endpoint. Specify the host and port for the server.

```python
# load model using above inference guide first
model.serve(host="0.0.0.0", port="12345")
```

--------------------------------

### Machete GEMM Operation

Source: https://github.com/modelcloud/gptqmodel/blob/main/gptqmodel_ext/machete/Readme.md

This snippet demonstrates the typical usage of the machete_gemm operation, including prepacking the weight matrix.

````APIDOC
## machete_gemm and prepacking

### Description
This operation performs a GEMM (General Matrix Multiply) with quantized weights. The weight matrix `b_q` must be prepacked using `machete_prepack_B` before calling `machete_gemm`.

### Usage
```python
from vllm import _custom_ops as ops

# Assuming w_q is the quantized weight matrix, wtype is its data type,
# w_s are the scales, and group_size is the group size for quantization.

W_q_packed = ops.machete_prepack_B(w_q, wtype)
output = ops.machete_gemm(
    a,  # Input matrix A
    b_q=W_q_packed,  # Prepacked quantized weight matrix B
    b_type=wtype,  # Data type of the quantized weight matrix
    b_scales=w_s,  # Scales for dequantization
    b_group_size=group_size  # Group size for quantization
)
```

### Parameters
- **a** (Tensor): Input matrix A.
- **b_q** (Tensor): Prepacked quantized weight matrix B. This should be the output of `machete_prepack_B`.
- **b_type** (DataType): The data type of the quantized weight matrix `b_q`.
- **b_scales** (Tensor): The scales used for dequantization.
- **b_group_size** (int): The group size used during quantization. If None, it implies no grouping.

### Notes
- The weight matrix must be prepacked before calling `machete_gemm`.
- The `machete_prepack_B` function is used for this prepacking step.
````

--------------------------------

### Quantize Model using FP8 Format

Source: https://github.com/modelcloud/gptqmodel/blob/main/README.md

Demonstrates quantizing a model using the FP8 format with float8_e4m3fn precision. Calibration is set to None for weight-only quantization.

```python
from gptqmodel import BACKEND, FP8Config, GPTQModel

model_id = "meta-llama/Llama-3.2-1B-Instruct"
quant_path = "Llama-3.2-1B-Instruct-FP8-E4M3"

qcfg = FP8Config(
    format="float8_e4m3fn",  # or "float8_e5m2"
    bits=8,
    weight_scale_method="row",
)

model = GPTQModel.load(model_id, qcfg)
model.quantize(calibration=None, backend=BACKEND.GPTQ_TORCH)
model.save(quant_path)
```

--------------------------------

### Explicit Replacement and Stop Rule (YAML)

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

YAML configuration for explicit replacement of inherited settings and preventing subsequent rule changes using 'stop: true' for 'layer0.qkv'.

```yaml
match: "layer0.qkv"
stop: true
weight:
  mode: replace
  prepare:
    - method: pad.columns
      multiple: 4
      semantic: true
  quantize:
    method: skip
```
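The interaction between rule order and `stop: true` can be sketched as an ordered walk over the rule list that ends early. This is an illustration under stated assumptions, not GPTQModel's resolver: `matches` and `resolve_weight` are hypothetical, `"*"` is treated as match-all, other patterns as regular expressions, and the trailing third rule is invented purely to show what `stop` prevents.

```python
import re


def matches(pattern, name):
    # Assumption: "*" matches everything; other patterns are regexes.
    return pattern == "*" or re.search(pattern, name) is not None


def resolve_weight(name, rules):
    """Walk rules in order; a matching rule with stop=True ends the walk."""
    weight = {}
    for rule in rules:
        if not matches(rule["match"], name):
            continue
        patch = dict(rule.get("weight", {}))
        if patch.pop("mode", None) == "replace":
            weight = patch                  # discard everything inherited
        else:
            weight = {**weight, **patch}    # shallow patch merge
        if rule.get("stop", False):
            break                           # later rules cannot change this module
    return weight


rules = [
    {"match": "*", "weight": {"quantize": {"method": "gptq", "bits": 4},
                              "export": {"format": "gptq"}}},
    {"match": "layer0.qkv", "stop": True, "weight": {
        "mode": "replace",
        "prepare": [{"method": "pad.columns", "multiple": 4, "semantic": True}],
        "quantize": {"method": "skip"},
    }},
    # Hypothetical later rule that would otherwise re-enable quantization everywhere.
    {"match": "*", "weight": {"quantize": {"method": "rtn", "bits": 8}}},
]

print(resolve_weight("layer0.qkv", rules))
print(resolve_weight("layer1.qkv", rules))
```

In this sketch `layer0.qkv` stays skipped even though a later broad rule matches it, while `layer1.qkv` picks up the later rule as usual.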

--------------------------------

### Define Quantization Stages in YAML

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

This YAML configuration defines stages for quantization, specifying balancing and post-training quantization rules.

```yaml
stages:
  - name: balance
    rules:
      - match: ".*self_attn$"
        actions:
          - method: smoothquant
            alpha: 0.5
  - name: ptq
    rules:
      - match: "*"
        weight:
          quantize:
            method: gptq
            bits: 4
            sym: true
            group_size: 128
          export:
            format: gptq
```

--------------------------------

### Recommended Rule Shape in Python

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Illustrates the recommended structure for a Rule object in Python, including match, aliases, actions, and tensor targets.

```python
Rule(
    match="*",
    aliases=None,
    actions=[],
    stop=False,
    weight={...},
    input={...},
    output={...},
    kv_cache={...},
)
```

--------------------------------

### GPTQ Quantization with Fallback for Low Evidence

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Illustrates GPTQ quantization where fallback to RTN is used if evidence is insufficient. The export format remains GPTQ.

```yaml
weight:
  quantize:
    method: gptq
    bits: 4
    fallback:
      strategy: rtn
      threshold: 0.5%
  export:
    format: gptq
```

--------------------------------

### EoRA Accuracy Recovery with GPTQModel

Source: https://github.com/modelcloud/gptqmodel/blob/main/README.md

Demonstrates how to use EoRA (Eigenspace Low-Rank Approximation) to recover accuracy lost to quantization. Requires an adapter path and a previously GPTQ-quantized model.

```python
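from gptqmodel import GPTQModel
from gptqmodel.adapter.adapter import Lora

# `model_id`, `quant_path`, and `calibration_dataset` are the values used when the model was quantized

# EoRA is currently only validated for GPTQ
# higher rank improves accuracy at the cost of VRAM usage
# suggestion: test rank 64 and 32 before 128 or 256, as the latter may overfit while increasing memory usage
eora = Lora(
  # for EoRA generation, path is the adapter save path; for loading, it is the load path
  path=f"{quant_path}/eora_rank32",
  rank=32,
)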

# provide a previously GPTQ-quantized model path
GPTQModel.adapter.generate(
  adapter=eora,
  model_id_or_path=model_id,
  quantized_model_id_or_path=quant_path,
  calibration_dataset=calibration_dataset,
  calibration_dataset_concat_size=0,
)

# post-eora inference
model = GPTQModel.load(
  model_id_or_path=quant_path,
  adapter=eora
)

tokens = model.generate("Capital of France is")[0]
result = model.tokenizer.decode(tokens)

print(f"Result: {result}")
# For more details on EoRA, please see docs/eora/
# Please use the benchmark tools in later part of this README to evaluate EoRA effectiveness
```
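The rank trade-off mentioned in the comments can be estimated with simple arithmetic: a rank-`r` low-rank adapter on a linear layer stores two matrices of shapes `r x in_features` and `out_features x r`. The sketch below uses a hypothetical 2048 x 2048 layer and assumes 2 bytes per parameter (fp16 or bf16 storage); real adapter sizes depend on which layers are adapted.

```python
def lora_param_count(in_features, out_features, rank):
    """Parameters in a rank-`rank` adapter: A (rank x in) plus B (out x rank)."""
    return rank * (in_features + out_features)


def adapter_mib(in_features, out_features, rank, bytes_per_param=2):
    """Adapter size in MiB, assuming `bytes_per_param` bytes per parameter."""
    params = lora_param_count(in_features, out_features, rank)
    return params * bytes_per_param / 2**20


# Hypothetical 2048 x 2048 projection layer.
for rank in (32, 64, 128, 256):
    size = adapter_mib(2048, 2048, rank)
    print(f"rank={rank:>3}: {size:.2f} MiB per layer")
```

Memory grows linearly with rank, so doubling the rank doubles the adapter's footprint for every adapted layer.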

--------------------------------

### Define Quantization Stages in Python

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Use this Python code to define stages for quantization, including balancing and post-training quantization rules.

```python
stages = [
    Stage(
        name="balance",
        rules=[
            Rule(
                match=".*self_attn$",
                actions=[smoothquant(alpha=0.5)],
            ),
        ],
    ),
    Stage(
        name="ptq",
        rules=[
            Rule(
                match="*",
                weight={
                    "quantize": gptq(bits=4, sym=True, group_size=128),
                    "export": {"format": "gptq"},
                },
            ),
        ],
    ),
]
```
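One way to read the helper calls above is as shorthand for the `method:` mappings used in the YAML form. The sketch below defines hypothetical stand-ins for `gptq` and `smoothquant` that lower to plain dictionaries; it only illustrates that correspondence and makes no claim about how GPTQModel implements these helpers.

```python
def _method(name, **params):
    """Build the mapping form: {"method": name, ...params}."""
    return {"method": name, **params}


def gptq(**params):
    """Hypothetical stand-in for the DSL helper of the same name."""
    return _method("gptq", **params)


def smoothquant(**params):
    """Hypothetical stand-in for the DSL helper of the same name."""
    return _method("smoothquant", **params)


ptq_weight = {
    "quantize": gptq(bits=4, sym=True, group_size=128),
    "export": {"format": "gptq"},
}
balance_actions = [smoothquant(alpha=0.5)]

print(ptq_weight["quantize"])
print(balance_actions)
```

The printed dictionaries carry the same keys and values as the `quantize:` and `actions:` entries in the YAML version of these stages.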

--------------------------------

### Define Export Configuration in Python

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Use this structure to define the export format, variant, implementation, and version for quantized models in Python.

```python
weight={
    "export": {
        "format": "awq",
        "variant": "gemm",
        "impl": "llm_awq",
        "version": 2,
    },
}
```

--------------------------------

### Define Activation Quantization Rule (YAML)

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

This YAML configuration defines a rule for activation quantization, mirroring the Python configuration with method, mode, and export details.

```yaml
match: "*"
input:
  quantize:
    method: mxfp4
    mode: dynamic
    block_size: 32
    scale_bits: 8
  export:
    format: fp4
    variant: mxfp4
    impl: modelopt
```
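To show what `block_size: 32` and `scale_bits: 8` could mean numerically, here is a pure-Python sketch of block-wise FP4 quantization with a shared power-of-two scale. It assumes the `mxfp4` method follows the usual microscaling layout (FP4 E2M1 elements and one 8-bit exponent-only scale per block); the function names are hypothetical and this is not GPTQModel's kernel.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (the sign is a separate bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_MAX_EXPONENT = 2  # 6.0 == 1.5 * 2**2


def quantize_block(block):
    """Quantize one block; return (shared_exponent, dequantized_values)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # Choose the scale so the largest value lands in the top E2M1 binade.
    shared_exponent = math.floor(math.log2(amax)) - E2M1_MAX_EXPONENT
    # Assumed range of an 8-bit exponent-only scale.
    shared_exponent = max(-127, min(127, shared_exponent))
    scale = 2.0 ** shared_exponent
    dequantized = []
    for x in block:
        # Round the scaled magnitude to the nearest representable FP4 value.
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(abs(x) / scale - m))
        dequantized.append(math.copysign(nearest, x) * scale)
    return shared_exponent, dequantized


def quantize(values, block_size=32):
    """Split a flat list into blocks and quantize each block independently."""
    return [quantize_block(values[i:i + block_size])
            for i in range(0, len(values), block_size)]


values = [0.0, 0.9, -3.0, 12.0] + [2.0] * 28
exponent, restored = quantize(values)[0]
print(exponent)
print(restored[:4])
```

In `dynamic` mode the scale would be derived from each block at run time, as `quantize_block` does here, rather than fixed ahead of time from calibration data.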

--------------------------------

### Quantize LLM Model with GPTQModel

Source: https://github.com/modelcloud/gptqmodel/blob/main/README.md

Quantize an LLM with GPTQModel using a calibration dataset. Raise `batch_size` as far as available VRAM allows to speed up quantization, then save the quantized model.

```python
from datasets import load_dataset
from gptqmodel import GPTQConfig, GPTQModel

model_id = "meta-llama/Llama-3.2-1B-Instruct"
quant_path = "Llama-3.2-1B-Instruct-gptqmodel-4bit"

calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train",
).select(range(1024))["text"]

quant_config = GPTQConfig(bits=4, group_size=128)

model = GPTQModel.load(model_id, quant_config)

# increase `batch_size` to match GPU/VRAM specs to speed up quantization
model.quantize(calibration_dataset, batch_size=1)

model.save(quant_path)
```
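As a rough guide to what `bits=4, group_size=128` costs in storage, the sketch below estimates effective bits per weight. It assumes each group of weights stores one 16-bit scale and one zero point packed at the same width as the weights, and it ignores any other metadata; the helper name is hypothetical and actual checkpoint sizes will differ.

```python
def effective_bits_per_weight(bits, group_size, scale_bits=16, zero_bits=None):
    """Weight bits plus per-group overhead spread across the group."""
    if zero_bits is None:
        zero_bits = bits  # assumption: zero points use the weight bit width
    return bits + (scale_bits + zero_bits) / group_size


for group_size in (32, 64, 128):
    estimate = effective_bits_per_weight(4, group_size)
    print(f"group_size={group_size:>3}: {estimate:.5f} bits per weight")
```

Smaller groups track the weight distribution more closely but raise the per-weight overhead, which is the trade-off behind the `group_size` setting.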

--------------------------------

### Define Activation Quantization Rule (Python)

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Use this Python code to define a rule for activation quantization, specifying the quantization method, mode, and export format.

```python
Rule(
    match="*",
    input={
        "quantize": mxfp4(mode="dynamic", block_size=32, scale_bits=8),
        "export": {
            "format": "fp4",
            "variant": "mxfp4",
            "impl": "modelopt",
        },
    },
)
```

--------------------------------

### Generate EoRA with GPTQ Quantization

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/eora/README.md

Use this command to generate an EoRA adapter in the same run as GPTQ quantization. Specify the calibration dataset with `--eora_dataset` and the adapter rank with `--eora_rank`. To target the MMLU task, set `--eora_dataset` to `mmlu`.

```shell
python docs/eora/eora_generation.py meta-llama/Llama-3.2-3B --bits 4 \
    --quant_save_path docs/eora/Llama-3.2-3B-4bits \
    --eora_dataset c4 \
    --eora_save_path docs/eora/Llama-3.2-3B-4bits-eora_rank64_c4 \
    --eora_rank 64
```

--------------------------------

### Protocol Root - Python DSL

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Defines the basic structure of the quantization protocol using Python, including version and stages.

```python
version = 2

stages = [
    Stage(
        name="ptq",
        rules=[
            Rule(
                match="*",
                aliases=None,
                actions=[],
                stop=False,
                weight=None,
                input=None,
                output=None,
                kv_cache=None,
            ),
        ],
    ),
]
```