### Install SliceGPT package

Source: https://github.com/microsoft/transformercompression/blob/main/README.md

Install the slicegpt package in editable mode with experiment dependencies.

```bash
pip install -e ".[experiment]"
```

--------------------------------

### Install recovery fine-tuning dependencies

Source: https://github.com/microsoft/transformercompression/blob/main/README.md

Install additional dependencies required for post-slicing recovery fine-tuning.

```bash
pip install -e ".[experiment,finetune]"
```

--------------------------------

### Prepare Test Dataloader

Source: https://context7.com/microsoft/transformercompression/llms.txt

Initializes a dataloader for perplexity evaluation using the provided dataset and tokenizer.

```python
from slicegpt import data_utils

# test_dataset and tokenizer are assumed to be loaded already
test_loader = data_utils.prepare_test_dataloader(
    dataset=test_dataset,
    tokenizer=tokenizer,
    seqlen=2048,
    batch_size=8
)
```

--------------------------------

### Load Standard Datasets for Calibration and Evaluation

Source: https://context7.com/microsoft/transformercompression/llms.txt

Loads and prepares standard datasets such as WikiText-2, PTB, C4, and Alpaca for model calibration and evaluation. Automatically splits Alpaca into train/test/validation sets.

```python
from slicegpt import data_utils

# Load WikiText-2 dataset
dataset = data_utils.get_dataset("wikitext2")
train_dataset = dataset["train"]
test_dataset = dataset["test"]

# Load Alpaca dataset (automatically split into train/test/validation)
dataset = data_utils.get_dataset("alpaca")
print(f"Train samples: {len(dataset['train'])}")
print(f"Test samples: {len(dataset['test'])}")
print(f"Validation samples: {len(dataset['validation'])}")

# Load C4 dataset
dataset = data_utils.get_dataset("c4")
```

--------------------------------

### Prepare DataLoader for Training or Calibration

Source: https://context7.com/microsoft/transformercompression/llms.txt

Creates a DataLoader for training or calibration with configurable sequence length, batch size, and sampling parameters. Supports concatenating samples to a fixed length.

```python
from slicegpt import data_utils

# Prepare calibration dataloader for slicing
# (train_dataset and tokenizer are assumed to be loaded already)
train_loader = data_utils.prepare_dataloader(
    dataset=train_dataset,
    tokenizer=tokenizer,
    max_seqlen=2048,
    batch_size=16,
    nsamples=128,
    varied_seqlen=False,  # Concatenate samples to fixed length
    seed=42
)
```
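
To illustrate what concatenating samples to a fixed length can look like when `varied_seqlen=False`, here is a minimal pure-Python sketch. `pack_fixed_length` is a hypothetical helper, not part of `slicegpt`, and the library's actual sampling logic may differ (for example, it may pick samples at random using `seed`).

```python
def pack_fixed_length(token_lists, seqlen, nsamples):
    """Concatenate tokenized samples and cut them into equal-length chunks.

    Hypothetical helper for illustration; not the slicegpt implementation.
    """
    # Flatten all samples into one long token stream.
    stream = [tok for tokens in token_lists for tok in tokens]
    # Cut the stream into non-overlapping windows of exactly `seqlen` tokens,
    # dropping the incomplete tail.
    chunks = [stream[i:i + seqlen] for i in range(0, len(stream) - seqlen + 1, seqlen)]
    # Keep at most `nsamples` chunks.
    return chunks[:nsamples]


samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = pack_fixed_length(samples, seqlen=4, nsamples=2)
print(packed)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Every chunk has the same length, so the batches can be stacked into a tensor without padding.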

--------------------------------

### Run SliceGPT compression

Source: https://github.com/microsoft/transformercompression/blob/main/README.md

Compress a model such as `microsoft/phi-2` by specifying the sparsity level and an output directory. Run the script from the repository's `experiments` folder.

```bash
python run_slicegpt.py \
       --model microsoft/phi-2 \
       --save-dir dir/to/save/sliced_model/in \
       --sparsity 0.25 \
       --device cuda:0 \
       --eval-baseline \
       --no-wandb
```

--------------------------------

### prepare_dataloader

Source: https://context7.com/microsoft/transformercompression/llms.txt

Creates a DataLoader for training or calibration with configurable sequence length and sampling.

```APIDOC
## prepare_dataloader

### Description
Creates a DataLoader for training or calibration with configurable sequence length and sampling.

### Parameters
#### Request Body
- **dataset** (Dataset) - Required - The dataset to load.
- **tokenizer** (Tokenizer) - Required - The tokenizer to use.
- **max_seqlen** (int) - Required - Maximum sequence length.
- **batch_size** (int) - Required - Batch size.
- **nsamples** (int) - Required - Number of samples.
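- **varied_seqlen** (bool) - Optional - Whether to use varied sequence lengths. When false, samples are concatenated to a fixed length.
- **seed** (int) - Optional - Random seed for sampling.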

### Request Example
{
  "max_seqlen": 2048,
  "batch_size": 16,
  "nsamples": 128
}
```

--------------------------------

### Configure Slicing Schedulers

Source: https://context7.com/microsoft/transformercompression/llms.txt

Demonstrates different strategies for configuring slicing dimensions, including constant, linear, and configuration-based scheduling.

```python
from slicegpt.slicing_scheduler import (
    ConstSlicingScheduler,
    ConfigSlicingScheduler,
    FunctionSlicingScheduler
)

# Constant sparsity across all layers
const_scheduler = ConstSlicingScheduler(
    dimension=1920,      # New embedding dimension
    do_slice_head=False  # Whether to slice the LM head
)

# Linear varying sparsity (lower at start, higher at end)
linear_scheduler = FunctionSlicingScheduler.create_linear(
    mlp_start=0.1,       # 10% sparsity at first layer
    mlp_end=0.4,         # 40% sparsity at last layer
    attn_start=0.1,
    attn_end=0.4,
    round_interval=8,
    do_slice_head=False
)

# Load from saved configuration
from slicegpt.model_adapter import SlicingConfig
import pathlib

config_json = pathlib.Path("./sliced_model/phi-2_0.25.json").read_text()
slicing_conf = SlicingConfig.from_json_string(config_json)
config_scheduler = ConfigSlicingScheduler(slicing_conf)
```
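
The idea behind a linear schedule can be sketched in plain Python: interpolate the sparsity per layer, convert it to a dimension, and round down to a multiple of `round_interval`. `linear_sliced_dims` is a hypothetical helper that uses a single ramp rather than separate MLP and attention ramps; it is not the `FunctionSlicingScheduler` implementation.

```python
def linear_sliced_dims(hidden_size, n_layers, start, end, round_interval=8):
    """Per-layer embedding dimensions for a sparsity ramp from `start` to `end`.

    Hypothetical helper for illustration; not the slicegpt implementation.
    """
    dims = []
    for i in range(n_layers):
        # Linearly interpolate the sparsity for layer i.
        frac = i / (n_layers - 1) if n_layers > 1 else 0.0
        sparsity = start + frac * (end - start)
        dim = int((1 - sparsity) * hidden_size)
        # Round down to a multiple of round_interval.
        dims.append(dim - dim % round_interval)
    return dims


dims = linear_sliced_dims(hidden_size=2560, n_layers=4, start=0.1, end=0.4)
print(dims)  # four dimensions, largest first, each a multiple of 8
```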

--------------------------------

### Import Built-in Model Adapters

Source: https://context7.com/microsoft/transformercompression/llms.txt

Access built-in adapters for supported model families like Llama, Phi, and OPT.

```python
# Llama-2 models (sequential attention/MLP blocks)
from slicegpt.adapters.llama_adapter import LlamaModelAdapter
# Supports: meta-llama/Llama-2-7b-hf, meta-llama/Llama-2-13b-hf, meta-llama/Llama-2-70b-hf

# Llama-3 models
# Supports: meta-llama/Meta-Llama-3-8B, meta-llama/Meta-Llama-3-8B-Instruct
# Supports: meta-llama/Meta-Llama-3-70B, meta-llama/Meta-Llama-3-70B-Instruct

# Phi-2 model (parallel attention/MLP blocks)
from slicegpt.adapters.phi2_adapter import Phi2ModelAdapter
# Supports: microsoft/phi-2

# Phi-3 model
from slicegpt.adapters.phi3_adapter import Phi3ModelAdapter
# Supports: microsoft/Phi-3-mini-4k-instruct

# OPT models (sequential blocks)
from slicegpt.adapters.opt_adapter import OPTModelAdapter
# Supports: facebook/opt-125m, facebook/opt-1.3b, facebook/opt-2.7b
# Supports: facebook/opt-6.7b, facebook/opt-13b, facebook/opt-30b, facebook/opt-66b
```

--------------------------------

### Execute Compression Pipeline

Source: https://context7.com/microsoft/transformercompression/llms.txt

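Runs the first four steps of the end-to-end compression pipeline: loading the model, preparing calibration and test data, evaluating baseline perplexity, and fusing layers. Steps 5 to 8 (calculating slicing dimensions, rotating and slicing, evaluating, and saving) appear as separate snippets.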

```python
import torch
import pathlib
from slicegpt import hf_utils, data_utils, gpu_utils, layernorm_fusion, rotate
from slicegpt.slicing_scheduler import ConstSlicingScheduler
from slicegpt.config import config

# Configuration
MODEL_NAME = "microsoft/phi-2"
SPARSITY = 0.25
ROUND_INTERVAL = 8
SAVE_DIR = pathlib.Path("./compressed_model")
config.device = torch.device("cuda:0")

# 1. Load model and tokenizer
model_adapter, tokenizer = hf_utils.get_model_and_tokenizer(
    MODEL_NAME, dtype=torch.float16
)

# 2. Prepare calibration data
dataset = data_utils.get_dataset("wikitext2")
train_loader = data_utils.prepare_dataloader(
    dataset["train"], tokenizer, max_seqlen=2048, batch_size=16, nsamples=128
)
test_loader = data_utils.prepare_test_dataloader(
    dataset["test"], tokenizer, batch_size=8
)

# 3. Evaluate baseline perplexity
model_adapter.model.to(config.device)
baseline_ppl = gpu_utils.evaluate_ppl(
    model_adapter.model, model_adapter.model.config.pad_token_id, test_loader
)
print(f"Baseline PPL: {baseline_ppl:.4f}")
model_adapter.model.cpu()

# 4. Prepare model for slicing
layernorm_fusion.replace_layers(model_adapter)
layernorm_fusion.fuse_modules(model_adapter)
```

--------------------------------

### Run Recovery Fine-tuning with LoRA

Source: https://context7.com/microsoft/transformercompression/llms.txt

Apply LoRA fine-tuning to a sliced model to recover accuracy, specifying sparsity and dataset parameters.

```bash
# Run recovery fine-tuning on a sliced model
python experiments/run_finetuning.py \
    --model microsoft/phi-2 \
    --sliced-model-path ./sliced_models/phi2 \
    --save-dir ./finetuned_models/phi2 \
    --sparsity 0.25 \
    --device cuda:0 \
    --ppl-eval-dataset alpaca \
    --finetune-dataset alpaca \
    --finetune-train-nsamples 8000 \
    --finetune-train-seqlen 1024 \
    --finetune-train-batch-size 3 \
    --lora-alpha 10 \
    --lora-r 32 \
    --lora-dropout 0.05 \
    --lora-target-option attn_head_and_mlp \
    --eval-steps 16 \
    --save-steps 16 \
    --no-wandb

# Example output (values are illustrative):
# PPL before finetuning: 12.3456
# trainable params: 8,388,608 || all params: 1,571,314,688 || trainable%: 0.534%
# PPL after finetuning: 10.8901
```

--------------------------------

### Run SliceGPT Compression for Phi-2 Model

Source: https://context7.com/microsoft/transformercompression/llms.txt

Compresses the 'microsoft/phi-2' model using SliceGPT with specified sparsity, device, calibration dataset, and evaluation settings. It outputs model statistics before and after slicing.

```bash
python experiments/run_slicegpt.py \
    --model microsoft/phi-2 \
    --save-dir ./sliced_models/phi2 \
    --sparsity 0.25 \
    --device cuda:0 \
    --cal-dataset wikitext2 \
    --cal-nsamples 128 \
    --cal-batch-size 16 \
    --eval-baseline \
    --no-wandb
```

--------------------------------

### Iterate Through Batches

Source: https://context7.com/microsoft/transformercompression/llms.txt

Demonstrates how to access input IDs and attention masks from a training loader.

```python
for batch in train_loader:
    input_ids = batch["input_ids"]      # Shape: [batch_size, seqlen]
    attention_mask = batch["attention_mask"]
    print(f"Batch shape: {input_ids.shape}")
    break
```

--------------------------------

### get_dataset

Source: https://context7.com/microsoft/transformercompression/llms.txt

Loads and prepares standard datasets for calibration and evaluation.

```APIDOC
## get_dataset

### Description
Loads and prepares standard datasets for calibration and evaluation. Supports WikiText-2, PTB, C4, and Alpaca datasets.

### Parameters
#### Request Body
- **dataset_name** (string) - Required - The name of the dataset (e.g., 'wikitext2', 'alpaca', 'c4').

### Request Example
{
  "dataset_name": "wikitext2"
}
```

--------------------------------

### Load Sliced Model for Inference and Fine-tuning

Source: https://context7.com/microsoft/transformercompression/llms.txt

Loads a previously sliced model for inference or with LoRA configuration for fine-tuning. Requires the model name, path to the sliced model, and sparsity level.

```python
from slicegpt import hf_utils

# Load a sliced model for inference
model_adapter, tokenizer = hf_utils.load_sliced_model(
    model_name="microsoft/phi-2",
    sliced_model_path="./sliced_models/phi2",
    sparsity=0.25,
    round_interval=8,
    token=None
)

# Load with LoRA configuration for fine-tuning
from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    r=32,
    lora_alpha=10,
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
    # Phi-2 module names; Llama-style models use o_proj, gate_proj, up_proj, down_proj instead
    target_modules=["q_proj", "k_proj", "v_proj", "dense", "fc1", "fc2"]
)

model_adapter, tokenizer = hf_utils.load_sliced_model(
    model_name="microsoft/phi-2",
    sliced_model_path="./sliced_models/phi2",
    sparsity=0.25,
    lora_config=lora_config
)
```
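
As a rough guide to what `r=32` costs, a LoRA adapter on a linear layer with `d_in` inputs and `d_out` outputs adds two low-rank matrices with `r * (d_in + d_out)` parameters in total. The helper below is a hypothetical illustration, and the square 1920-wide layer is an assumed example shape, not a measurement from a sliced model.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


# Assumed example: a square 1920 x 1920 projection with rank 32.
added = lora_param_count(d_in=1920, d_out=1920, r=32)
full = 1920 * 1920
print(added)                  # 122880
print(f"{added / full:.1%}")  # 3.3%
```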

--------------------------------

### Run recovery fine-tuning

Source: https://github.com/microsoft/transformercompression/blob/main/README.md

Run recovery fine-tuning on a sliced model with the given LoRA hyperparameters. Run the script from the repository's `experiments` folder.

```bash
python run_finetuning.py \
       --model microsoft/phi-2 \
       --sliced-model-path path/to/sliced \
       --save-dir dir/to/save/finetuned_model/in \
       --sparsity 0.25 \
       --device cuda:0 \
       --ppl-eval-dataset alpaca \
       --finetune-dataset alpaca \
       --finetune-train-nsamples 8000 \
       --finetune-train-seqlen 1024 \
       --finetune-train-batch-size 3 \
       --lora-alpha 10 \
       --lora-r 32 \
       --lora-dropout 0.05 \
       --lora-target-option attn_head_and_mlp \
       --eval-steps 16 \
       --save-steps 16 \
       --no-wandb
```

--------------------------------

### Load Hugging Face Model and Tokenizer

Source: https://context7.com/microsoft/transformercompression/llms.txt

Loads a pretrained model and its tokenizer from Hugging Face Hub or a local path. Supports automatic model adapter selection and can create uninitialized models for loading pre-sliced weights.

```python
from slicegpt import hf_utils
from slicegpt.config import config
import torch

# Load a pretrained model from Hugging Face Hub
model_adapter, tokenizer = hf_utils.get_model_and_tokenizer(
    model_name="microsoft/phi-2",
    model_path=None,           # None for HF models
    dtype=torch.float16,
    token=None                 # HF_TOKEN for gated models
)

# Access the underlying model
model = model_adapter.model
print(f"Model hidden size: {model_adapter.hidden_size}")
print(f"Sequence length: {model_adapter.seqlen}")
print(f"Parallel blocks: {model_adapter.parallel_blocks}")

# Load a local model
model_adapter, tokenizer = hf_utils.get_model_and_tokenizer(
    model_name="meta-llama/Llama-2-7b-hf",
    model_path="/path/to/local/llama",
    dtype=torch.float16
)
```

--------------------------------

### Evaluate Sliced Models

Source: https://context7.com/microsoft/transformercompression/llms.txt

Evaluate model performance on standard benchmarks using the LM Evaluation Harness.

```bash
# Evaluate on PIQA benchmark
python experiments/run_lm_eval.py \
    --model microsoft/phi-2 \
    --sliced-model-path ./sliced_models/phi2 \
    --sparsity 0.25 \
    --tasks piqa \
    --no-wandb

# Evaluate original model for comparison
python experiments/run_lm_eval.py \
    --model microsoft/phi-2 \
    --model-path microsoft/phi-2 \
    --tasks piqa,hellaswag,arc_easy \
    --no-wandb
```

--------------------------------

### Save Compressed Model

Source: https://context7.com/microsoft/transformercompression/llms.txt

Step 8 of the compression pipeline: persists the model state dictionary and slicing configuration to `SAVE_DIR`.

```python
# 8. Save compressed model
SAVE_DIR.mkdir(parents=True, exist_ok=True)
model_name = pathlib.Path(MODEL_NAME).name
torch.save(model_adapter.model.state_dict(), SAVE_DIR / f"{model_name}_{SPARSITY}.pt")
(SAVE_DIR / f"{model_name}_{SPARSITY}.json").write_text(
    model_adapter.slicing_conf.to_json_string()
)
print(f"Saved to {SAVE_DIR}")
```
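
The file naming used above can be isolated into a small helper, which makes it easier to find the weights and config again when loading. `sliced_model_paths` is a hypothetical name introduced for this sketch.

```python
import pathlib


def sliced_model_paths(save_dir, model_name, sparsity):
    """Return (weights_path, config_path) following the naming used above."""
    save_dir = pathlib.Path(save_dir)
    # "microsoft/phi-2" -> "phi-2": keep only the last path component.
    short_name = pathlib.Path(model_name).name
    stem = f"{short_name}_{sparsity}"
    return save_dir / f"{stem}.pt", save_dir / f"{stem}.json"


weights_path, config_path = sliced_model_paths("./compressed_model", "microsoft/phi-2", 0.25)
print(weights_path.name)  # phi-2_0.25.pt
print(config_path.name)   # phi-2_0.25.json
```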

--------------------------------

### Evaluate Model Perplexity

Source: https://context7.com/microsoft/transformercompression/llms.txt

Moves the model to the configured device and calculates perplexity on a test dataset.

```python
from slicegpt import gpu_utils
from slicegpt.config import config

# Move model to GPU
model_adapter.model.to(config.device)

# Evaluate perplexity
ppl = gpu_utils.evaluate_ppl(
    model=model_adapter.model,
    pad_token_id=model_adapter.model.config.pad_token_id,
    testloader=test_loader
)

print(f"Model perplexity: {ppl:.4f}")

# Example output:
# Evaluating perplexity...
# Model perplexity: 11.2847
```
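
For reference, perplexity is conventionally the exponential of the mean per-token negative log-likelihood. The sketch below shows that definition in plain Python; how `evaluate_ppl` batches the data and handles padding tokens is not shown here, and `perplexity` is a hypothetical helper.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


# A model that assigns probability 1/4 to every target token has perplexity 4.
nlls = [-math.log(0.25)] * 10
print(round(perplexity(nlls), 6))  # 4.0
```

Lower values mean the model assigns higher probability to the test text, which is why the sliced model's perplexity is compared against the baseline.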

--------------------------------

### Implement Custom ModelAdapter

Source: https://context7.com/microsoft/transformercompression/llms.txt

Create a custom `LayerAdapter` to support a new model architecture by defining accessors for its layernorms, attention, and MLP modules. Only the layer adapter is shown here; a matching `ModelAdapter` subclass is also needed.

```python
from slicegpt.model_adapter import ModelAdapter, LayerAdapter
from torch.nn import Module, Linear

class CustomLayerAdapter(LayerAdapter):
    def __init__(self, layer):
        super().__init__()
        self._layer = layer

    @property
    def layer(self) -> Module:
        return self._layer

    @property
    def hidden_states_args_position(self) -> int:
        return 0  # Position in forward() args

    @property
    def hidden_states_output_position(self) -> int:
        return 0  # Position in forward() output

    def get_first_layernorm(self) -> Module:
        return self._layer.input_layernorm

    def get_second_layernorm(self) -> Module:
        return self._layer.post_attention_layernorm

    def get_attention_inputs(self) -> list[Linear]:
        return [self._layer.self_attn.q_proj,
                self._layer.self_attn.k_proj,
                self._layer.self_attn.v_proj]

    def get_attention_output(self) -> Linear:
        return self._layer.self_attn.o_proj

    def get_mlp_inputs(self) -> list[Linear]:
        return [self._layer.mlp.gate_proj, self._layer.mlp.up_proj]

    def get_mlp_output(self) -> Linear:
        return self._layer.mlp.down_proj
```

--------------------------------

### Evaluate model using LM Eval Harness

Source: https://github.com/microsoft/transformercompression/blob/main/README.md

Run evaluation on a sliced model using the LM Eval Harness framework. Run the script from the repository's `experiments` folder.

```bash
python run_lm_eval.py \
       --model microsoft/phi-2 \
       --sliced-model-path path/to/sliced \
       --sparsity 0.25 \
       --tasks piqa \
       --no-wandb
```

--------------------------------

### get_model_and_tokenizer

Source: https://context7.com/microsoft/transformercompression/llms.txt

Loads a Hugging Face model and its tokenizer with automatic model adapter selection.

```APIDOC
## get_model_and_tokenizer

### Description
Loads a Hugging Face model and its tokenizer with automatic model adapter selection. Supports pretrained models from Hugging Face Hub or local paths.

### Parameters
#### Request Body
- **model_name** (string) - Required - The name of the model on Hugging Face Hub.
- **model_path** (string) - Optional - Local path to the model.
- **dtype** (torch.dtype) - Optional - Data type for the model weights.
- **token** (string) - Optional - Hugging Face token for gated models.

### Request Example
{
  "model_name": "microsoft/phi-2",
  "dtype": "torch.float16"
}
```

--------------------------------

### Benchmark Inference Performance

Source: https://context7.com/microsoft/transformercompression/llms.txt

Measures latency and throughput by running token-by-token generation on a sample batch.

```python
from slicegpt import gpu_utils, data_utils

# Prepare a sample input
sample_batch = next(iter(test_loader))

# Run benchmark
results = gpu_utils.benchmark(model_adapter, sample_batch)

print(f"Median time per token: {results['median_time']:.4f}s")
print(f"Latency: {results['latency']:.4f}s")
print(f"Throughput: {results['throughput']:.2f} tokens/s")
```
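
One plausible way to derive such metrics from raw timings is sketched below. It assumes each generation step emits one token per sequence in the batch, so throughput is the batch size divided by the median step time; `summarize_timings` is a hypothetical helper and may not match how `gpu_utils.benchmark` computes its results.

```python
import statistics


def summarize_timings(step_times, batch_size):
    """Summarize per-token generation step times (in seconds)."""
    median_time = statistics.median(step_times)
    return {
        "median_time": median_time,
        # Assumption: each step emits one token per sequence in the batch.
        "throughput": batch_size / median_time,
    }


stats = summarize_timings([0.02, 0.05, 0.02, 0.03, 0.02], batch_size=8)
print(stats["median_time"])           # 0.02
print(round(stats["throughput"], 6))  # 400.0
```

Using the median rather than the mean keeps a single slow step (such as the 0.05 s outlier) from skewing the result.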

--------------------------------

### Distribute Model Across GPUs

Source: https://context7.com/microsoft/transformercompression/llms.txt

Uses Hugging Face Accelerate to distribute large models across multiple GPUs for inference.

```python
from slicegpt import gpu_utils

# Distribute model across available GPUs
# Recommended for models with 30B+ parameters
gpu_utils.distribute_model(model_adapter)

# Model is now automatically distributed and ready for inference
output = model_adapter.model(input_ids=input_ids.to("cuda"))
```

--------------------------------

### Rotate and Slice Model

Source: https://context7.com/microsoft/transformercompression/llms.txt

Step 6 of the compression pipeline: applies the rotation and slicing transformation to the model adapter, using the `scheduler` from step 5 and the calibration `train_loader`.

```python
# 6. Rotate and slice
rotate.rotate_and_slice(model_adapter, train_loader, scheduler)
```

--------------------------------

### Rotate and Slice Model Weights

Source: https://context7.com/microsoft/transformercompression/llms.txt

Applies orthogonal transformations and slices model weights based on a specified sparsity, then saves the resulting model and configuration.

```python
from slicegpt import rotate, layernorm_fusion
from slicegpt.slicing_scheduler import ConstSlicingScheduler

# Prepare model for slicing
layernorm_fusion.replace_layers(model_adapter)
layernorm_fusion.fuse_modules(model_adapter)

# Calculate new embedding dimension (25% sparsity)
sparsity = 0.25
round_interval = 8
new_embedding_dimension = int((1 - sparsity) * model_adapter.hidden_size)
new_embedding_dimension -= new_embedding_dimension % round_interval

# Create slicing scheduler with constant dimension
scheduler = ConstSlicingScheduler(new_embedding_dimension)

# Rotate and slice the model
rotate.rotate_and_slice(
    model_adapter=model_adapter,
    dataloader=train_loader,
    slicing_scheduler=scheduler,
    apply_mask=True,
    final_orientation='random'  # 'random' or 'pca'
)

# Save the sliced model
import torch
import pathlib

save_dir = pathlib.Path("./sliced_model")
save_dir.mkdir(parents=True, exist_ok=True)
torch.save(model_adapter.model.state_dict(), save_dir / "phi-2_0.25.pt")

# Save slicing configuration
config_path = save_dir / "phi-2_0.25.json"
config_path.write_text(model_adapter.slicing_conf.to_json_string())
```
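
Rotation is safe before slicing because inserting an orthogonal matrix `Q` and its transpose between two linear maps leaves their product unchanged. The toy sketch below checks this on 2x2 matrices and then drops one hidden dimension. It uses an arbitrary rotation and ignores normalization layers, so it illustrates the principle only and is not how `rotate_and_slice` chooses `Q`.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]


# Two stacked linear maps: y = W_out @ (W_in @ x).
W_in = [[1.0, 2.0], [3.0, 4.0]]
W_out = [[0.5, -1.0], [2.0, 0.0]]

# Any orthogonal Q works for this check; here a 2D rotation by 30 degrees.
t = math.radians(30)
Q = [[math.cos(t), -math.sin(t)], [math.sin(t), math.cos(t)]]

# Rotate: W_in -> Q^T W_in and W_out -> W_out Q. Since Q Q^T = I, the product is unchanged.
W_in_rot = matmul(transpose(Q), W_in)
W_out_rot = matmul(W_out, Q)

original = matmul(W_out, W_in)
rotated = matmul(W_out_rot, W_in_rot)
max_diff = max(abs(o - r) for ro, rr in zip(original, rotated) for o, r in zip(ro, rr))
print(max_diff < 1e-12)  # True

# Slice: drop the last hidden dimension (last row of W_in_rot, last column of W_out_rot).
W_in_sliced = W_in_rot[:1]
W_out_sliced = [row[:1] for row in W_out_rot]
print(len(W_in_sliced), len(W_out_sliced[0]))  # 1 1
```

After slicing, the product is only an approximation of the original, which is why the choice of rotation and the calibration data matter.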

--------------------------------

### load_sliced_model

Source: https://context7.com/microsoft/transformercompression/llms.txt

Loads a previously sliced model along with its slicing configuration and optionally applies LoRA for fine-tuning.

```APIDOC
## load_sliced_model

### Description
Loads a previously sliced model along with its slicing configuration and optionally applies LoRA for fine-tuning.

### Parameters
#### Request Body
- **model_name** (string) - Required - The name of the model.
- **sliced_model_path** (string) - Required - Path to the sliced model weights.
- **sparsity** (float) - Required - The sparsity level applied.
- **round_interval** (int) - Optional - Rounding interval for slicing.
- **lora_config** (LoraConfig) - Optional - Configuration for LoRA fine-tuning.

### Request Example
{
  "model_name": "microsoft/phi-2",
  "sliced_model_path": "./sliced_models/phi2",
  "sparsity": 0.25
}
```

--------------------------------

### Evaluate Sliced Model

Source: https://context7.com/microsoft/transformercompression/llms.txt

Step 7 of the compression pipeline: computes the perplexity of the compressed model on the test dataset.

```python
# 7. Evaluate sliced model
model_adapter.model.to(config.device)
sliced_ppl = gpu_utils.evaluate_ppl(
    model_adapter.model, model_adapter.model.config.pad_token_id, test_loader
)
print(f"Sliced PPL: {sliced_ppl:.4f}")
```

--------------------------------

### Replace Layers and Fuse Modules

Source: https://context7.com/microsoft/transformercompression/llms.txt

Replaces transformer layers with compressed versions and fuses LayerNorm operations into adjacent linear layers to simplify normalization.

```python
from slicegpt import layernorm_fusion

# Step 1: Replace layers with compressed equivalents (adds shortcut operation)
layernorm_fusion.replace_layers(model_adapter)

# Step 2: Fuse LayerNorm weights into adjacent linear layers
# This is a mathematical transformation that preserves model output
layernorm_fusion.fuse_modules(model_adapter)

# After fusion, the model outputs remain identical but the
# normalization operations are simplified to RMSNorm without weights
```
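
The reason fusion preserves outputs can be checked numerically: a per-channel normalization scale applied before a linear layer is equivalent to scaling the columns of that layer's weight matrix. The sketch below shows only this scale-folding identity for an RMSNorm-style norm in plain Python. Fusing a full LayerNorm also involves mean subtraction and bias terms, which are omitted here, and none of these helpers are part of `slicegpt`.

```python
import math


def rms_normalize(x, eps=1e-6):
    """Weight-free RMSNorm: divide by the root mean square of x."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def linear(W, x):
    """Apply a bias-free linear layer given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


x = [0.5, -1.0, 2.0]
gamma = [1.5, 0.5, 2.0]  # learned per-channel norm scale
W = [[1.0, 0.0, -1.0], [2.0, 1.0, 0.5]]

# Before fusion: scale the normalized input by gamma, then apply W.
before = linear(W, [g * v for g, v in zip(gamma, rms_normalize(x))])

# After fusion: fold gamma into the columns of W and use the weight-free norm.
W_fused = [[w * g for w, g in zip(row, gamma)] for row in W]
after = linear(W_fused, rms_normalize(x))

print(all(abs(b - a) < 1e-12 for b, a in zip(before, after)))  # True
```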

--------------------------------

### Calculate Slicing Dimensions

Source: https://context7.com/microsoft/transformercompression/llms.txt

Step 5 of the compression pipeline: determines the new hidden size from the target sparsity, rounded down to a multiple of the rounding interval.

```python
# 5. Calculate slicing dimensions
new_dim = int((1 - SPARSITY) * model_adapter.hidden_size)
new_dim -= new_dim % ROUND_INTERVAL
scheduler = ConstSlicingScheduler(new_dim)
```
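
The same arithmetic as a standalone function, with two worked values. `sliced_dimension` is a hypothetical helper name; the 2560 hidden size is that of Phi-2, and 4096 is just an example size chosen to show the rounding.

```python
def sliced_dimension(hidden_size, sparsity, round_interval=8):
    """Target embedding dimension, rounded down to a multiple of round_interval."""
    new_dim = int((1 - sparsity) * hidden_size)
    return new_dim - new_dim % round_interval


# Hidden size 2560 at 25% sparsity: 0.75 * 2560 = 1920, already a multiple of 8.
print(sliced_dimension(2560, 0.25))  # 1920
# Hidden size 4096 at 30% sparsity: int(2867.2) = 2867, rounded down to 2864.
print(sliced_dimension(4096, 0.3))   # 2864
```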
