### Install GPTQModel from Source

Source: https://github.com/modelcloud/gptqmodel/blob/main/README.md

Install GPTQModel directly from its source repository. Source builds require `python3-dev`. Optional modules can be installed as extras.

```bash
# clone repo
git clone https://github.com/ModelCloud/GPTQModel.git && cd GPTQModel

# python3-dev is required for some source installs
apt install python3-dev

# pip: install from source
# You can install optional modules like vllm, sglang, bitblas.
# Example: pip install -v .[vllm,sglang,bitblas]
pip install -v .
```

--------------------------------

### Authoring Surfaces - Python DSL Example

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

A concise Python DSL example that defines a quantization rule for weights.

```python
Rule(
    match="*",
    weight={
        "quantize": gptq(bits=4, sym=True, group_size=128),
        "export": {"format": "gptq"},
    },
)
```

--------------------------------

### Install Evalution for Benchmarking

Source: https://github.com/modelcloud/gptqmodel/blob/main/README.md

Installs the Evalution library, a benchmarking toolkit for LLMs that integrates with GPTQModel.

```bash
# install Evalution
pip install Evalution
```

--------------------------------

### Authoring Surfaces - YAML Example

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

The YAML configuration equivalent to the Python DSL example, defining the same quantization rule for weights.

```yaml
match: "*"
weight:
  quantize:
    method: gptq
    bits: 4
    sym: true
    group_size: 128
  export:
    format: gptq
```

--------------------------------

### Define Aliases and Actions in Python

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Python example that defines aliases for reusable tensor references and applies actions through those aliases.
```python
Rule(
    match=".*self_attn$",
    aliases={"proj": ["q_proj", "k_proj", "v_proj", "o_proj"]},
    actions=[
        record_stats(targets="@proj"),
        inspect_outliers(targets="@proj"),
    ],
)
```

--------------------------------

### Common Export Format Examples

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Examples of common export formats and their variants, including GPTQ, AWQ, FP8, FP4, and GGUF.

```json
{"format": "gptq"}
```

```json
{"format": "awq", "variant": "gemm"}
```

```json
{"format": "awq", "variant": "gemv"}
```

```json
{"format": "fp8", "variant": "e4m3fn", "impl": "transformer_engine"}
```

```json
{"format": "fp4", "variant": "nvfp4", "impl": "modelopt"}
```

```json
{"format": "gguf", "variant": "q4_k_m"}
```

--------------------------------

### Weight Target Section in Python

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Python example showing the structure of the `weight` target section, with its optional `prepare`, `quantize`, and `export` entries.

```python
weight={
    "prepare": [...],  # optional
    "quantize": ...,   # optional
    "export": ...,     # optional
}
```

--------------------------------

### Install GPTQModel via PIP/UV

Source: https://github.com/modelcloud/gptqmodel/blob/main/README.md

Install the GPTQModel package using pip or uv. Optional modules such as autoround, ipex, vllm, sglang, and bitblas can be included as extras.

```bash
# You can install optional modules like autoround, ipex, vllm, sglang, bitblas.
# Example: pip install -v gptqmodel[vllm,sglang,bitblas]
pip install -v gptqmodel
uv pip install -v gptqmodel
```

--------------------------------

### Separate Quantize and Export in Python

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Python example with distinct `quantize` and `export` configurations: RTN is used for quantization and GPTQ for the export format.
```python
weight={
    "quantize": rtn(bits=4, sym=True),
    "export": {"format": "gptq", "impl": "default"},
}
```

--------------------------------

### Compose Quantization Rules: Default and Skip (YAML)

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

YAML example of rule composition: a broad default rule, followed by a specific rule that skips quantization for `layer0.qkv`.

```yaml
- match: "*"
  weight:
    prepare:
      - method: pad.columns
        multiple: 4
        semantic: true
    quantize:
      method: gptq
      bits: 4
      sym: true
      group_size: 128
    export:
      format: gptq
      impl: default
  input:
    quantize:
      method: mxfp4
      mode: dynamic
      block_size: 32
      scale_bits: 8
    export:
      format: fp4
      variant: mxfp4
      impl: modelopt
- match: "layer0.qkv"
  weight:
    quantize:
      method: skip
```

--------------------------------

### Separate Quantize and Export in YAML

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

YAML example with distinct `quantize` and `export` configurations: RTN is used for quantization and GPTQ for the export format.

```yaml
weight:
  quantize:
    method: rtn
    bits: 4
    sym: true
  export:
    format: gptq
    impl: default
```

--------------------------------

### Dynamic Quantization Configuration Example

Source: https://github.com/modelcloud/gptqmodel/blob/main/README.md

Shows how to configure dynamic quantization overrides for specific modules within a model. Positive matches override `bits` and `group_size`; negative matches skip the module.

```python
dynamic = {
    # `.*\.` matches the layers_node prefix
    # layer index starts at 0
    # positive match: layer 19, gate module
    r"+:.*\.18\..*gate.*": {"bits": 4, "group_size": 32},
    # positive match: layer 20, gate module (prefix defaults to positive if missing)
    r".*\.19\..*gate.*": {"bits": 8, "group_size": 64},
    # negative match: skip layer 21, gate module
    r"-:.+\.20\..*gate.*": {},
    # negative match: skip all down modules for all layers
    r"-:.+\.down.*": {},
}
```

--------------------------------

### GPTQ Quantization and Export in Python

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Python example that specifies GPTQ for quantization and the `gptq` format for export. GPTQ packing is part of export realization.

```python
weight={
    "quantize": gptq(bits=4, sym=True, group_size=128),
    "export": {"format": "gptq", "impl": "default"},
}
```

--------------------------------

### GPTQ Quantization and Export in YAML

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

YAML example that specifies GPTQ for quantization and the `gptq` format for export. GPTQ packing is part of export realization.

```yaml
weight:
  quantize:
    method: gptq
    bits: 4
    sym: true
    group_size: 128
  export:
    format: gptq
    impl: default
```

--------------------------------

### Quantize Model using GGUF Format

Source: https://github.com/modelcloud/gptqmodel/blob/main/README.md

Example of loading a model and quantizing it to the GGUF format with Q4_K_M settings. Calibration is set to `None` because this is weight-only quantization.
```python
from gptqmodel import BACKEND, GGUFConfig, GPTQModel

model_id = "meta-llama/Llama-3.2-1B-Instruct"
quant_path = "Llama-3.2-1B-Instruct-GGUF-Q4_K_M"

qcfg = GGUFConfig(
    bits=4,
    format="q_k_m",
)

model = GPTQModel.load(model_id, qcfg)
model.quantize(calibration=None, backend=BACKEND.GGUF_TORCH)
model.save(quant_path)
```

--------------------------------

### GPTQModel Inference Example

Source: https://github.com/modelcloud/gptqmodel/blob/main/README.md

Perform inference using GPTQModel with a three-line API: load a model, generate tokens, then decode the tokens to a string.

```python
from gptqmodel import GPTQModel

model = GPTQModel.load("ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v2.5")
result = model.generate("Uncovering deep insights begins with")[0]  # tokens
print(model.tokenizer.decode(result))  # string output
```

--------------------------------

### Patching Export Rules in YAML

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

YAML example in which a later rule patches the export configuration of modules matching a narrower pattern.

```yaml
- match: "*"
  weight:
    export:
      format: awq
      variant: gemm
      impl: llm_awq
      version: 2
- match: ".*small_proj$"
  weight:
    export:
      variant: gemv
```

--------------------------------

### Define Quantization Method

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Specifies the quantization method to be used. Examples include gptq, rtn, mxfp4, int8, and skip.
```text
gptq(bits=4, sym=True, group_size=128)
```

```text
rtn(bits=4, sym=True)
```

```text
mxfp4(mode="dynamic", block_size=32, scale_bits=8)
```

```text
int8(calibration=observer("max"))
```

```text
skip()
```

--------------------------------

### Compose Quantization Rules: Override Bits (Python)

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Python example showing how a narrower rule overrides a single quantization parameter, here `bits`, for a subset of the matched layers.

```python
Rule(
    match="*",
    weight={
        "quantize": gptq(bits=4, sym=True, group_size=128),
        "export": {"format": "gptq", "impl": "default"},
    },
)

Rule(
    match=".*(q_proj|k_proj)$",
    weight={
        "quantize": {"bits": 8},
    },
)
```

--------------------------------

### Quantize Model using Exllama V3 (EXL3) Format

Source: https://github.com/modelcloud/gptqmodel/blob/main/README.md

Example of quantizing a model to the Exllama V3 format with specified `bits`, `head_bits`, and `codebook` settings. Requires a calibration dataset.

```python
from datasets import load_dataset
from gptqmodel import BACKEND, EXL3Config, GPTQModel

model_id = "meta-llama/Llama-3.2-1B-Instruct"
quant_path = "Llama-3.2-1B-Instruct-EXL3"

calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train",
).select(range(1024))["text"]

qcfg = EXL3Config(
    bits=4.0,        # target average bits-per-weight
    head_bits=6.0,   # optional higher bitrate for attention heads / sensitive tensors
    codebook="mcg",  # one of: mcg, mul1, 3inst
)

model = GPTQModel.load(model_id, qcfg)
model.quantize(calibration_dataset, batch_size=1, backend=BACKEND.EXL3_EXLLAMA_V3)
model.save(quant_path)
```

--------------------------------

### AWQ Quantization with Fallback and Smoothing

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Configures AWQ quantization with a fallback strategy, threshold, and optional smoothing parameters.
This YAML example shows detailed fallback settings.

```yaml
weight:
  quantize:
    method: awq
    bits: 4
    group_size: 128
    fallback:
      strategy: rtn
      threshold: 1.0%
      smooth:
        type: mad
        k: 2.75
```

--------------------------------

### Load and Infer with EoRA-Enhanced GPTQ Model

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/eora/README.md

Example Python code for loading a GPTQ model with an EoRA adapter for inference. Ensure the adapter path and rank match the values used when the adapter was generated.

```python
from gptqmodel import BACKEND, GPTQModel  # noqa: E402
from gptqmodel.adapter.adapter import Lora

eora = Lora(
    # for eora generation, path is adapter save path; for load, it is loading path
    path='docs/eora/Llama-3.2-3B-4bits-eora_rank64_c4',
    rank=64,
)

model = GPTQModel.load(
    model_id_or_path='sliuau/Llama-3.2-3B_4bits_128group_size',
    adapter=eora,
)

tokens = model.generate("Capital of France is")[0]
result = model.tokenizer.decode(tokens)
print(f"Result: {result}")
```

--------------------------------

### Compose Quantization Rules: Default and Skip (Python)

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Example of composing quantization rules in Python: a global default rule, followed by a narrower rule that skips quantization for specific layers.

```python
Rule(
    match="*",
    weight={
        "prepare": [pad.columns(multiple=4, semantic=True)],
        "quantize": gptq(bits=4, sym=True, group_size=128),
        "export": {"format": "gptq", "impl": "default"},
    },
    input={
        "quantize": mxfp4(mode="dynamic", block_size=32, scale_bits=8),
        "export": {
            "format": "fp4",
            "variant": "mxfp4",
            "impl": "modelopt",
        },
    },
)

Rule(
    match="layer0.qkv",
    weight={
        "quantize": skip(),
    },
)
```

--------------------------------

### Enable Group Aware Reordering (GAR)

Source: https://github.com/modelcloud/gptqmodel/blob/main/README.md

Example of creating a QuantizeConfig that enables Group Aware Reordering (GAR) by setting `act_group_aware` to True and `desc_act` to False.
```python
from gptqmodel import QuantizeConfig

# GAR: act_group_aware=True, with desc_act explicitly disabled
quant_config = QuantizeConfig(bits=4, group_size=128, act_group_aware=True, desc_act=False)
```

--------------------------------

### Run GSM8K Benchmark with GPTQModel via Evalution

Source: https://github.com/modelcloud/gptqmodel/blob/main/README.md

Example of running the gsm8k_platinum benchmark using Evalution's native GPTQModel engine with the `marlin` backend on CUDA. It specifies the model and the benchmark parameters.

```python
import evalution as eval

run = (
    eval.GPTQModel(
        backend="marlin",
        device="cuda:0",
    )
    .model(eval.Model(path="ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v1"))
    .run(eval.benchmarks.gsm8k_platinum(apply_chat_template=True, batch_size=16))
)

print(run.to_dict()["tests"][0]["metrics"])
```

--------------------------------

### Configure GPTQ and GGUF Quantization with Preprocessors

Source: https://github.com/modelcloud/gptqmodel/blob/main/README.md

Demonstrates how to configure GPTQConfig and GGUFConfig with preprocessors such as SmootherConfig, AutoModuleDecoderConfig, and TensorParallelPadderConfig.

```python
import torch
from gptqmodel import GGUFConfig, GPTQConfig
from gptqmodel.quantization import (
    AutoModuleDecoderConfig,
    SmoothMAD,
    SmootherConfig,
    TensorParallelPadderConfig,
)

gptq_cfg = GPTQConfig(
    bits=4,
    group_size=128,
    preprocessors=[
        SmootherConfig(smooth=SmoothMAD(k=2.0)),
        AutoModuleDecoderConfig(target_dtype=torch.bfloat16),
        TensorParallelPadderConfig(),
    ],
)

gguf_cfg = GGUFConfig(
    bits=4,
    format="q_k_m",
    preprocessors=[
        AutoModuleDecoderConfig(target_dtype=torch.bfloat16),
        TensorParallelPadderConfig(),
    ],
)
```

--------------------------------

### Configure Activation-Aware GPTQ (Python)

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Configure GPTQ quantization with awareness of input activation modes. Use `"ignore"` for classic weight-only GPTQ, or `"fake"` to optimize against fake-quantized inputs.
```python
gptq(
    bits=4,
    sym=True,
    group_size=128,
    activation_mode="ignore",  # or "fake", later possibly "real"
)
```

--------------------------------

### Advanced Replace Mode in YAML

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Example of using `mode: replace` in YAML for advanced control, overriding the default patch-merging behavior.

```yaml
match: "layer0.qkv"
weight:
  mode: replace
  prepare:
    - method: pad.columns
      multiple: 4
      semantic: true
  quantize:
    method: skip
```

--------------------------------

### Advanced Replace Mode in Python

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Example of using `"mode": "replace"` in Python for advanced control, overriding the default patch-merging behavior.

```python
Rule(
    match="layer0.qkv",
    weight={
        "mode": "replace",
        "prepare": [pad.columns(multiple=4, semantic=True)],
        "quantize": skip(),
    },
)
```

--------------------------------

### Enable GPTAQ Quantization

Source: https://github.com/modelcloud/gptqmodel/blob/main/README.md

Enable GPTAQ quantization by setting `gptaq = GPTAQConfig(...)`. GPTAQ is experimental, not MoE compatible, and requires significantly more VRAM.

```python
# Note GPTAQ is currently experimental, not MoE compatible, and requires 2-4x more VRAM to execute
# We have many reports of GPTAQ not working better or exceeding GPTQ so please use for testing only
# If OOM on 1 GPU, please set CUDA_VISIBLE_DEVICES=0,1 to 2 GPUs and gptqmodel will auto use second GPU
quant_config = QuantizeConfig(bits=4, group_size=128, gptaq=GPTAQConfig(alpha=0.25, device="auto"))
```

--------------------------------

### Define Quantization Stages with Multiple Rules

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

This configuration sets up a `ptq` stage with multiple rules for quantization.
It includes a smoothquant action, GPTQ weight quantization with specific parameters, and MXFP4 input quantization, along with export configurations for both weight and input tensors. A rule that skips quantization for `layer0.qkv` is also included.

```python
version = 2

stages = [
    Stage(
        name="ptq",
        rules=[
            Rule(
                match=".*self_attn$",
                actions=[smoothquant(alpha=0.5)],
            ),
            Rule(
                match="*",
                weight={
                    "prepare": [clip.mad(k=2.75)],
                    "quantize": gptq(
                        bits=4,
                        sym=True,
                        group_size=128,
                        activation_mode="fake",
                    ),
                    "export": {"format": "gptq", "impl": "default"},
                },
                input={
                    "quantize": mxfp4(
                        mode="dynamic",
                        block_size=32,
                        scale_bits=8,
                    ),
                    "export": {
                        "format": "fp4",
                        "variant": "mxfp4",
                        "impl": "modelopt",
                    },
                },
            ),
            Rule(
                match="layer0.qkv",
                weight={
                    "quantize": skip(),
                },
            ),
        ],
    ),
]
```

```yaml
version: 2
stages:
  - name: ptq
    rules:
      - match: ".*self_attn$"
        actions:
          - method: smoothquant
            alpha: 0.5
      - match: "*"
        weight:
          prepare:
            - method: clip.mad
              k: 2.75
          quantize:
            method: gptq
            bits: 4
            sym: true
            group_size: 128
            activation_mode: fake
          export:
            format: gptq
            impl: default
        input:
          quantize:
            method: mxfp4
            mode: dynamic
            block_size: 32
            scale_bits: 8
          export:
            format: fp4
            variant: mxfp4
            impl: modelopt
      - match: "layer0.qkv"
        weight:
          quantize:
            method: skip
```

--------------------------------

### Load and Generate with GPTQModel

Source: https://github.com/modelcloud/gptqmodel/blob/main/README.md

Loads a pre-quantized model and generates text. Ensure `quant_path` points to your quantized model.

```python
# test post-quant inference
model = GPTQModel.load(quant_path)
result = model.generate("Uncovering deep insights begins with")[0]  # tokens
print(model.tokenizer.decode(result))  # string output
```

--------------------------------

### Load and Generate with GGUF Model

Source: https://github.com/modelcloud/gptqmodel/blob/main/README.md

This script demonstrates how to load a GGUF model using GPTQModel and generate text. No external `gguf` PyPI package is required.
You can optionally specify a `profile` for loading, such as `low_memory`.

```python
from gptqmodel import GPTQModel

model = GPTQModel.load("prism-ml/Bonsai-1.7B-gguf")
# or: model = GPTQModel.load("prism-ml/Bonsai-1.7B-gguf", profile="low_memory")

tokens = model.generate(
    "Who wrote Romeo and Juliet?",
    max_new_tokens=128,
)[0]

print(model.tokenizer.decode(tokens, skip_special_tokens=True))
```

--------------------------------

### XPU vs CPU INT4 Packing Visualization

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/torch_fused_int4_transformations.md

Illustrates the difference in INT4 packing between XPU (row-major lane packing) and CPU (byte-tiling).

```text
XPU: | int32 lane | = [w7][w6][w5][w4][w3][w2][w1][w0]
CPU: | uint8 lane | = [w1][w0]
```

--------------------------------

### Configure Activation-Aware GPTQ (YAML)

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

YAML configuration for activation-aware GPTQ, specifying bits, symmetry, group size, and activation mode. `ignore` selects weight-only GPTQ.

```yaml
method: gptq
bits: 4
sym: true
group_size: 128
activation_mode: ignore
```

--------------------------------

### Machete GEMM API Usage

Source: https://github.com/modelcloud/gptqmodel/blob/main/gptqmodel_ext/machete/Readme.md

Demonstrates the typical workflow for Machete's GEMM operation: prepack the weight matrix with `machete_prepack_B`, then call the main GEMM function.

```python
from vllm import _custom_ops as ops

...

W_q_packed = ops.machete_prepack_B(w_q, wtype)

output = ops.machete_gemm(
    a,
    b_q=W_q_packed,
    b_type=wtype,
    b_scales=w_s,
    b_group_size=group_size
)
```

--------------------------------

### Evaluate EoRA and GPTQ Model Performance

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/eora/README.md

Run this command to evaluate a GPTQ-quantized model with its corresponding EoRA adapter on the ARC-C and MMLU benchmarks.
Ensure the paths and rank match your generation settings.

```shell
python docs/eora/evaluation.py --quantized_model sliuau/Llama-3.2-3B_4bits_128group_size \
    --eora_save_path docs/eora/Llama-3.2-3B-4bits-eora_rank64_c4 \
    --eora_rank 64
```

--------------------------------

### Compose Quantization Rules: Override Bits (YAML)

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

YAML example for overriding quantization bits. A base rule sets GPTQ with 4 bits, and a subsequent rule switches specific projections to 8 bits.

```yaml
- match: "*"
  weight:
    quantize:
      method: gptq
      bits: 4
      sym: true
      group_size: 128
    export:
      format: gptq
      impl: default
- match: ".*(q_proj|k_proj)$"
  weight:
    quantize:
      bits: 8
```

--------------------------------

### GPTQ Per-Module Quantization Overrides

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Configure base GPTQ quantization with per-module overrides for bits and group size. This example sets a default of 4-bit GPTQ with group size 128, then overrides `up_proj` and `gate_proj` to 8-bit, and `down_proj` to 4-bit with group size 32.
```python
Stage(
    name="ptq",
    rules=[
        Rule(
            match="*",
            weight={
                "quantize": {
                    "method": "gptq",
                    "bits": 4,
                    "group_size": 128,
                },
                "export": {
                    "format": "gptq",
                    "impl": "default",
                },
            },
        ),
        # raw strings keep `\.` as a regex escape rather than a Python string escape
        Rule(
            match=r".*\.up_proj.*",
            weight={
                "quantize": {"bits": 8},
            },
        ),
        Rule(
            match=r".*\.gate_proj.*",
            weight={
                "quantize": {"bits": 8},
            },
        ),
        Rule(
            match=r".*\.down_proj.*",
            weight={
                "quantize": {"bits": 4, "group_size": 32},
            },
        ),
    ],
)
```

```yaml
stages:
  - name: ptq
    rules:
      - match: "*"
        weight:
          quantize:
            method: gptq
            bits: 4
            group_size: 128
          export:
            format: gptq
            impl: default
      # single quotes: `\.` is not a valid escape inside a YAML double-quoted string
      - match: '.*\.up_proj.*'
        weight:
          quantize:
            bits: 8
      - match: '.*\.gate_proj.*'
        weight:
          quantize:
            bits: 8
      - match: '.*\.down_proj.*'
        weight:
          quantize:
            bits: 4
            group_size: 32
```

--------------------------------

### Quantize with One Method, Export as Another Format

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Configures quantization with one method (e.g., rtn) and a different export format (e.g., gptq), so the quantization algorithm and the serialized layout can be chosen independently.

```python
Rule(
    match="primary_projection",
    weight={
        "quantize": rtn(bits=4, sym=True),
        "export": {"format": "gptq", "impl": "default"},
    },
)
```

```yaml
match: ".*down_proj$"
weight:
  quantize:
    method: rtn
    bits: 4
    sym: true
  export:
    format: gptq
    impl: default
```

--------------------------------

### Shorthand Export Formats

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Shorthand string representations for common export formats such as `gptq` and `native`.

```text
"gptq" == {"format": "gptq"}
```

```text
"native" == {"format": "native"}
```

--------------------------------

### Configure RTN with Weight Smoothing and AWQ GEMM Export

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

This configuration defines a `weight_only` stage that quantizes with the RTN method at 4 bits and a group size of 128.
It includes weight smoothing using SmoothMAD and targets AWQ GEMM for export.

```python
Stage(
    name="weight_only",
    rules=[
        Rule(
            match="*",
            weight={
                "prepare": [
                    {"method": "smooth.mad", "k": 1.5},
                ],
                "quantize": {
                    "method": "rtn",
                    "bits": 4,
                    "group_size": 128,
                },
                "export": {
                    "format": "awq",
                    "variant": "gemm",
                },
            },
        ),
    ],
)
```

```yaml
stages:
  - name: weight_only
    rules:
      - match: "*"
        weight:
          prepare:
            - method: smooth.mad
              k: 1.5
          quantize:
            method: rtn
            bits: 4
            group_size: 128
          export:
            format: awq
            variant: gemm
```

--------------------------------

### Serve Model via OpenAI API Compatible Endpoint

Source: https://github.com/modelcloud/gptqmodel/blob/main/README.md

Make a loaded GPTQModel available through an OpenAI API compatible endpoint. Specify the host and port for the server.

```python
# load model using above inference guide first
model.serve(host="0.0.0.0", port="12345")
```

--------------------------------

### Machete GEMM Operation

Source: https://github.com/modelcloud/gptqmodel/blob/main/gptqmodel_ext/machete/Readme.md

This snippet demonstrates the typical usage of the `machete_gemm` operation, including prepacking the weight matrix.

````APIDOC
## machete_gemm and prepacking

### Description
This operation performs a GEMM (General Matrix Multiply) with quantized weights. The weight matrix `b_q` must be prepacked using `machete_prepack_B` before calling `machete_gemm`.

### Usage
```python
from vllm import _custom_ops as ops

# Assuming w_q is the quantized weight matrix, wtype is its data type,
# w_s are the scales, and group_size is the group size for quantization.

W_q_packed = ops.machete_prepack_B(w_q, wtype)

output = ops.machete_gemm(
    a,                        # Input matrix A
    b_q=W_q_packed,           # Prepacked quantized weight matrix B
    b_type=wtype,             # Data type of the quantized weight matrix
    b_scales=w_s,             # Scales for dequantization
    b_group_size=group_size   # Group size for quantization
)
```

### Parameters
- **a** (Tensor): Input matrix A.
- **b_q** (Tensor): Prepacked quantized weight matrix B. This should be the output of `machete_prepack_B`.
- **b_type** (DataType): The data type of the quantized weight matrix `b_q`.
- **b_scales** (Tensor): The scales used for dequantization.
- **b_group_size** (int): The group size used during quantization. If None, it implies no grouping.

### Notes
- The weight matrix must be prepacked before calling `machete_gemm`.
- The `machete_prepack_B` function is used for this prepacking step.
````

--------------------------------

### Quantize Model using FP8 Format

Source: https://github.com/modelcloud/gptqmodel/blob/main/README.md

Demonstrates quantizing a model to the FP8 format with float8_e4m3fn precision. Calibration is set to `None` because this is weight-only quantization.

```python
from gptqmodel import BACKEND, FP8Config, GPTQModel

model_id = "meta-llama/Llama-3.2-1B-Instruct"
quant_path = "Llama-3.2-1B-Instruct-FP8-E4M3"

qcfg = FP8Config(
    format="float8_e4m3fn",  # or "float8_e5m2"
    bits=8,
    weight_scale_method="row",
)

model = GPTQModel.load(model_id, qcfg)
model.quantize(calibration=None, backend=BACKEND.GPTQ_TORCH)
model.save(quant_path)
```

--------------------------------

### Explicit Replacement and Stop Rule (YAML)

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

YAML configuration for `layer0.qkv` that explicitly replaces inherited settings with `mode: replace` and uses `stop: true` to prevent later rules from changing them.

```yaml
match: "layer0.qkv"
stop: true
weight:
  mode: replace
  prepare:
    - method: pad.columns
      multiple: 4
      semantic: true
  quantize:
    method: skip
```

--------------------------------

### Define Quantization Stages in YAML

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

This YAML configuration defines two quantization stages: a `balance` stage and a post-training quantization (`ptq`) stage.
```yaml
stages:
  - name: balance
    rules:
      - match: ".*self_attn$"
        actions:
          - method: smoothquant
            alpha: 0.5
  - name: ptq
    rules:
      - match: "*"
        weight:
          quantize:
            method: gptq
            bits: 4
            sym: true
            group_size: 128
          export:
            format: gptq
```

--------------------------------

### Recommended Rule Shape in Python

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Illustrates the recommended structure for a Rule object in Python, including match, aliases, actions, stop, and the tensor targets.

```python
Rule(
    match="*",
    aliases=None,
    actions=[],
    stop=False,
    weight={...},
    input={...},
    output={...},
    kv_cache={...},
)
```

--------------------------------

### GPTQ Quantization with Fallback for Low Evidence

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Illustrates GPTQ quantization that falls back to RTN when evidence is insufficient. The export format remains GPTQ.

```yaml
weight:
  quantize:
    method: gptq
    bits: 4
    fallback:
      strategy: rtn
      threshold: 0.5%
  export:
    format: gptq
```

--------------------------------

### EoRA Accuracy Recovery with GPTQModel

Source: https://github.com/modelcloud/gptqmodel/blob/main/README.md

Demonstrates how to use EoRA (Eigenspace Low-Rank Approximation), a post-quantization error-recovery method that uses a LoRA-style adapter, to improve quantized model accuracy. Requires an adapter path and a previously GPTQ-quantized model.
```python
from gptqmodel import GPTQModel
from gptqmodel.adapter.adapter import Lora

# `model_id`, `quant_path`, and `calibration_dataset` are the values used in the GPTQ quantization example

# EoRA is currently only validated for GPTQ
# higher rank improves accuracy at the cost of VRAM usage
# suggestion: test rank 64 and 32 before 128 or 256 as latter may overfit while increasing memory usage
eora = Lora(
    # for eora generation, path is adapter save path; for load, it is loading path
    path=f"{quant_path}/eora_rank32",
    rank=32,
)

# provide a previously GPTQ-quantized model path
GPTQModel.adapter.generate(
    adapter=eora,
    model_id_or_path=model_id,
    quantized_model_id_or_path=quant_path,
    calibration_dataset=calibration_dataset,
    calibration_dataset_concat_size=0,
)

# post-eora inference
model = GPTQModel.load(
    model_id_or_path=quant_path,
    adapter=eora
)

tokens = model.generate("Capital of France is")[0]
result = model.tokenizer.decode(tokens)

print(f"Result: {result}")

# For more details on EoRA, please see docs/eora/
# Please use the benchmark tools in later part of this README to evaluate EoRA effectiveness
```

--------------------------------

### Define Quantization Stages in Python

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Use this Python code to define two quantization stages: a `balance` stage and a post-training quantization (`ptq`) stage.

```python
stages = [
    Stage(
        name="balance",
        rules=[
            Rule(
                match=".*self_attn$",
                actions=[smoothquant(alpha=0.5)],
            ),
        ],
    ),
    Stage(
        name="ptq",
        rules=[
            Rule(
                match="*",
                weight={
                    "quantize": gptq(bits=4, sym=True, group_size=128),
                    "export": {"format": "gptq"},
                },
            ),
        ],
    ),
]
```

--------------------------------

### Define Export Configuration in Python

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Use this structure to define the export format, variant, implementation, and version for quantized models in Python.
```python
weight={
    "export": {
        "format": "awq",
        "variant": "gemm",
        "impl": "llm_awq",
        "version": 2,
    },
}
```

--------------------------------

### Define Activation Quantization Rule (YAML)

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

This YAML configuration defines a rule for activation quantization, mirroring the Python configuration with method, mode, and export details.

```yaml
match: "*"
input:
  quantize:
    method: mxfp4
    mode: dynamic
    block_size: 32
    scale_bits: 8
  export:
    format: fp4
    variant: mxfp4
    impl: modelopt
```

--------------------------------

### Quantize LLM Model with GPTQModel

Source: https://github.com/modelcloud/gptqmodel/blob/main/README.md

Quantize an LLM with GPTQModel and a calibration dataset, then save the quantized model. Increase the batch size to fit the available VRAM for faster quantization.

```python
from datasets import load_dataset
from gptqmodel import GPTQConfig, GPTQModel

model_id = "meta-llama/Llama-3.2-1B-Instruct"
quant_path = "Llama-3.2-1B-Instruct-gptqmodel-4bit"

calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train"
).select(range(1024))["text"]

quant_config = GPTQConfig(bits=4, group_size=128)

model = GPTQModel.load(model_id, quant_config)

# increase `batch_size` to match GPU/VRAM specs to speed up quantization
model.quantize(calibration_dataset, batch_size=1)

model.save(quant_path)
```

--------------------------------

### Define Activation Quantization Rule (Python)

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Use this Python code to define a rule for activation quantization, specifying the quantization method, mode, and export format.
```python
Rule(
    match="*",
    input={
        "quantize": mxfp4(mode="dynamic", block_size=32, scale_bits=8),
        "export": {
            "format": "fp4",
            "variant": "mxfp4",
            "impl": "modelopt",
        },
    },
)
```

--------------------------------

### Generate EoRA with GPTQ Quantization

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/eora/README.md

Use this command to generate EoRA simultaneously with GPTQ quantization. Specify the calibration data and the desired rank. To target improvement on the MMLU task, set `eora_dataset` to `mmlu`.

```shell
python docs/eora/eora_generation.py meta-llama/Llama-3.2-3B --bits 4 \
    --quant_save_path docs/eora/Llama-3.2-3B-4bits \
    --eora_dataset c4 \
    --eora_save_path docs/eora/Llama-3.2-3B-4bits-eora_rank64_c4 \
    --eora_rank 64
```

--------------------------------

### Protocol Root - Python DSL

Source: https://github.com/modelcloud/gptqmodel/blob/main/docs/quantization_protocol.md

Defines the basic structure of the quantization protocol in Python: a `version` and a list of `stages`, each holding rules.

```python
version = 2

stages = [
    Stage(
        name="ptq",
        rules=[
            Rule(
                match="*",
                aliases=None,
                actions=[],
                stop=False,
                weight=None,
                input=None,
                output=None,
                kv_cache=None,
            ),
        ],
    ),
]
```
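The rule-composition examples in this section (a broad default, a narrower `bits` override, `mode: replace`, and `stop`) suggest an ordered patch-merge over the rules of a stage. The following is a minimal standalone sketch of that idea using plain dictionaries. It is not GPTQModel's implementation: the helpers `matches`, `deep_merge`, and `resolve` are hypothetical names, and the matching semantics (`"*"` matches everything, any other pattern is treated as a regular expression searched in the module name) are assumptions made for illustration.

```python
import re
from copy import deepcopy

TARGETS = ("weight", "input", "output", "kv_cache")


def matches(pattern: str, name: str) -> bool:
    # Assumption: "*" means "every module"; anything else is a regex.
    if pattern == "*":
        return True
    return re.search(pattern, name) is not None


def deep_merge(base: dict, patch: dict) -> dict:
    # Recursively overlay `patch` on a copy of `base`.
    out = deepcopy(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = deep_merge(out[key], value)
        else:
            out[key] = deepcopy(value)
    return out


def resolve(rules: list, module_name: str) -> dict:
    # Apply matching rules in order; later rules patch earlier ones.
    resolved = {}
    for rule in rules:
        if not matches(rule["match"], module_name):
            continue
        for target in TARGETS:
            patch = rule.get(target)
            if patch is None:
                continue
            patch = dict(patch)
            mode = patch.pop("mode", "patch")
            if mode == "replace":
                # Discard everything inherited for this target.
                resolved[target] = deepcopy(patch)
            else:
                resolved[target] = deep_merge(resolved.get(target, {}), patch)
        if rule.get("stop", False):
            break  # later rules no longer apply to this module
    return resolved


rules = [
    {
        "match": "*",
        "weight": {
            "quantize": {"method": "gptq", "bits": 4, "sym": True, "group_size": 128},
            "export": {"format": "gptq", "impl": "default"},
        },
    },
    {"match": ".*(q_proj|k_proj)$", "weight": {"quantize": {"bits": 8}}},
    {
        "match": "layer0.qkv",
        "stop": True,
        "weight": {"mode": "replace", "quantize": {"method": "skip"}},
    },
]

print(resolve(rules, "model.layers.3.self_attn.q_proj"))
print(resolve(rules, "layer0.qkv"))
```

Under these assumptions, `q_proj` keeps the inherited GPTQ settings with only `bits` changed to 8, while `layer0.qkv` ends up with just the `skip` entry because `replace` drops the inherited `export` block.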