### Install SliceGPT package

Source: https://github.com/microsoft/transformercompression/blob/main/README.md

Install the slicegpt package in editable mode with experiment dependencies.

```bash
pip install -e .[experiment]
```

--------------------------------

### Install recovery fine-tuning dependencies

Source: https://github.com/microsoft/transformercompression/blob/main/README.md

Install additional dependencies required for post-slicing recovery fine-tuning.

```bash
pip install -e .[experiment,finetune]
```

--------------------------------

### Prepare Test Dataloader

Source: https://context7.com/microsoft/transformercompression/llms.txt

Initializes a dataloader for perplexity evaluation using the provided dataset and tokenizer.

```python
test_loader = data_utils.prepare_test_dataloader(
    dataset=test_dataset,
    tokenizer=tokenizer,
    seqlen=2048,
    batch_size=8
)
```

--------------------------------

### Load Standard Datasets for Calibration and Evaluation

Source: https://context7.com/microsoft/transformercompression/llms.txt

Loads and prepares standard datasets such as WikiText-2, PTB, C4, and Alpaca for model calibration and evaluation. Automatically splits Alpaca into train/test/validation sets.

```python
from slicegpt import data_utils

# Load WikiText-2 dataset
dataset = data_utils.get_dataset("wikitext2")
train_dataset = dataset["train"]
test_dataset = dataset["test"]

# Load Alpaca dataset (automatically split into train/test/validation)
dataset = data_utils.get_dataset("alpaca")
print(f"Train samples: {len(dataset['train'])}")
print(f"Test samples: {len(dataset['test'])}")
print(f"Validation samples: {len(dataset['validation'])}")

# Load C4 dataset
dataset = data_utils.get_dataset("c4")
```

--------------------------------

### Prepare DataLoader for Training or Calibration

Source: https://context7.com/microsoft/transformercompression/llms.txt

Creates a DataLoader for training or calibration with configurable sequence length, batch size, and sampling parameters.
Supports concatenating samples to a fixed length.

```python
from slicegpt import data_utils
import torch

# Prepare calibration dataloader for slicing
train_loader = data_utils.prepare_dataloader(
    dataset=train_dataset,
    tokenizer=tokenizer,
    max_seqlen=2048,
    batch_size=16,
    nsamples=128,
    varied_seqlen=False,  # Concatenate samples to fixed length
    seed=42
)
```

--------------------------------

### Run SliceGPT compression

Source: https://github.com/microsoft/transformercompression/blob/main/README.md

Compress a model like microsoft/phi-2 by specifying the sparsity level and output directory.

```bash
python run_slicegpt.py \
    --model microsoft/phi-2 \
    --save-dir dir/to/save/sliced_model/in \
    --sparsity 0.25 \
    --device cuda:0 \
    --eval-baseline \
    --no-wandb
```

--------------------------------

### prepare_dataloader

Source: https://context7.com/microsoft/transformercompression/llms.txt

Creates a DataLoader for training or calibration with configurable sequence length and sampling.

```APIDOC
## prepare_dataloader

### Description
Creates a DataLoader for training or calibration with configurable sequence length and sampling.

### Parameters
#### Request Body
- **dataset** (Dataset) - Required - The dataset to load.
- **tokenizer** (Tokenizer) - Required - The tokenizer to use.
- **max_seqlen** (int) - Required - Maximum sequence length.
- **batch_size** (int) - Required - Batch size.
- **nsamples** (int) - Required - Number of samples.
- **varied_seqlen** (bool) - Optional - Whether to use varied sequence lengths.

### Request Example
{
  "max_seqlen": 2048,
  "batch_size": 16,
  "nsamples": 128
}
```

--------------------------------

### Configure Slicing Schedulers

Source: https://context7.com/microsoft/transformercompression/llms.txt

Demonstrates different strategies for configuring slicing dimensions, including constant, linear, and configuration-based scheduling.
```python
from slicegpt.slicing_scheduler import (
    ConstSlicingScheduler,
    ConfigSlicingScheduler,
    FunctionSlicingScheduler
)

# Constant sparsity across all layers
const_scheduler = ConstSlicingScheduler(
    dimension=1920,  # New embedding dimension
    do_slice_head=False  # Whether to slice the LM head
)

# Linear varying sparsity (lower at start, higher at end)
linear_scheduler = FunctionSlicingScheduler.create_linear(
    mlp_start=0.1,  # 10% sparsity at first layer
    mlp_end=0.4,  # 40% sparsity at last layer
    attn_start=0.1,
    attn_end=0.4,
    round_interval=8,
    do_slice_head=False
)

# Load from saved configuration
from slicegpt.model_adapter import SlicingConfig
import pathlib

config_json = pathlib.Path("./sliced_model/phi-2_0.25.json").read_text()
slicing_conf = SlicingConfig.from_json_string(config_json)
config_scheduler = ConfigSlicingScheduler(slicing_conf)
```

--------------------------------

### Import Built-in Model Adapters

Source: https://context7.com/microsoft/transformercompression/llms.txt

Access built-in adapters for supported model families like Llama, Phi, and OPT.
```python
# Llama-2 models (sequential attention/MLP blocks)
from slicegpt.adapters.llama_adapter import LlamaModelAdapter
# Supports: meta-llama/Llama-2-7b-hf, meta-llama/Llama-2-13b-hf, meta-llama/Llama-2-70b-hf

# Llama-3 models
# Supports: meta-llama/Meta-Llama-3-8B, meta-llama/Meta-Llama-3-8B-Instruct
# Supports: meta-llama/Meta-Llama-3-70B, meta-llama/Meta-Llama-3-70B-Instruct

# Phi-2 model (parallel attention/MLP blocks)
from slicegpt.adapters.phi2_adapter import Phi2ModelAdapter
# Supports: microsoft/phi-2

# Phi-3 model
from slicegpt.adapters.phi3_adapter import Phi3ModelAdapter
# Supports: microsoft/Phi-3-mini-4k-instruct

# OPT models (sequential blocks)
from slicegpt.adapters.opt_adapter import OPTModelAdapter
# Supports: facebook/opt-125m, facebook/opt-1.3b, facebook/opt-2.7b
# Supports: facebook/opt-6.7b, facebook/opt-13b, facebook/opt-30b, facebook/opt-66b
```

--------------------------------

### Execute Compression Pipeline

Source: https://context7.com/microsoft/transformercompression/llms.txt

Perform end-to-end model compression including loading, calibration, baseline evaluation, and layer fusion.

```python
import torch
import pathlib
from slicegpt import hf_utils, data_utils, gpu_utils, layernorm_fusion, rotate
from slicegpt.slicing_scheduler import ConstSlicingScheduler
from slicegpt.config import config

# Configuration
MODEL_NAME = "microsoft/phi-2"
SPARSITY = 0.25
ROUND_INTERVAL = 8
SAVE_DIR = pathlib.Path("./compressed_model")
config.device = torch.device("cuda:0")

# 1. Load model and tokenizer
model_adapter, tokenizer = hf_utils.get_model_and_tokenizer(
    MODEL_NAME, dtype=torch.float16
)

# 2. Prepare calibration data
dataset = data_utils.get_dataset("wikitext2")
train_loader = data_utils.prepare_dataloader(
    dataset["train"], tokenizer, max_seqlen=2048, batch_size=16, nsamples=128
)
test_loader = data_utils.prepare_test_dataloader(
    dataset["test"], tokenizer, batch_size=8
)

# 3. Evaluate baseline perplexity
model_adapter.model.to(config.device)
baseline_ppl = gpu_utils.evaluate_ppl(
    model_adapter.model, model_adapter.model.config.pad_token_id, test_loader
)
print(f"Baseline PPL: {baseline_ppl:.4f}")
model_adapter.model.cpu()

# 4. Prepare model for slicing
layernorm_fusion.replace_layers(model_adapter)
layernorm_fusion.fuse_modules(model_adapter)
```

--------------------------------

### Run Recovery Fine-tuning with LoRA

Source: https://context7.com/microsoft/transformercompression/llms.txt

Apply LoRA fine-tuning to a sliced model to recover accuracy, specifying sparsity and dataset parameters.

```bash
# Run recovery fine-tuning on a sliced model
python experiments/run_finetuning.py \
    --model microsoft/phi-2 \
    --sliced-model-path ./sliced_models/phi2 \
    --save-dir ./finetuned_models/phi2 \
    --sparsity 0.25 \
    --device cuda:0 \
    --ppl-eval-dataset alpaca \
    --finetune-dataset alpaca \
    --finetune-train-nsamples 8000 \
    --finetune-train-seqlen 1024 \
    --finetune-train-batch-size 3 \
    --lora-alpha 10 \
    --lora-r 32 \
    --lora-dropout 0.05 \
    --lora-target-option attn_head_and_mlp \
    --eval-steps 16 \
    --save-steps 16 \
    --no-wandb

# Output:
# PPL before finetuning: 12.3456
# trainable params: 8,388,608 || all params: 1,571,314,688 || trainable%: 0.534%
# PPL after finetuning: 10.8901
```

--------------------------------

### Run SliceGPT Compression for Phi-2 Model

Source: https://context7.com/microsoft/transformercompression/llms.txt

Compresses the 'microsoft/phi-2' model using SliceGPT with specified sparsity, device, calibration dataset, and evaluation settings. It outputs model statistics before and after slicing.
```bash
python experiments/run_slicegpt.py \
    --model microsoft/phi-2 \
    --save-dir ./sliced_models/phi2 \
    --sparsity 0.25 \
    --device cuda:0 \
    --cal-dataset wikitext2 \
    --cal-nsamples 128 \
    --cal-batch-size 16 \
    --eval-baseline \
    --no-wandb
```

--------------------------------

### Iterate Through Batches

Source: https://context7.com/microsoft/transformercompression/llms.txt

Demonstrates how to access input IDs and attention masks from a training loader.

```python
for batch in train_loader:
    input_ids = batch["input_ids"]  # Shape: [batch_size, seqlen]
    attention_mask = batch["attention_mask"]
    print(f"Batch shape: {input_ids.shape}")
    break
```

--------------------------------

### get_dataset

Source: https://context7.com/microsoft/transformercompression/llms.txt

Loads and prepares standard datasets for calibration and evaluation.

```APIDOC
## get_dataset

### Description
Loads and prepares standard datasets for calibration and evaluation. Supports WikiText-2, PTB, C4, and Alpaca datasets.

### Parameters
#### Request Body
- **dataset_name** (string) - Required - The name of the dataset (e.g., 'wikitext2', 'alpaca', 'c4').

### Request Example
{
  "dataset_name": "wikitext2"
}
```

--------------------------------

### Load Sliced Model for Inference and Fine-tuning

Source: https://context7.com/microsoft/transformercompression/llms.txt

Loads a previously sliced model for inference or with LoRA configuration for fine-tuning. Requires the model name, path to the sliced model, and sparsity level.
```python
from slicegpt import hf_utils

# Load a sliced model for inference
model_adapter, tokenizer = hf_utils.load_sliced_model(
    model_name="microsoft/phi-2",
    sliced_model_path="./sliced_models/phi2",
    sparsity=0.25,
    round_interval=8,
    token=None
)

# Load with LoRA configuration for fine-tuning
from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    r=32,
    lora_alpha=10,
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
)

model_adapter, tokenizer = hf_utils.load_sliced_model(
    model_name="microsoft/phi-2",
    sliced_model_path="./sliced_models/phi2",
    sparsity=0.25,
    lora_config=lora_config
)
```

--------------------------------

### Run recovery fine-tuning

Source: https://github.com/microsoft/transformercompression/blob/main/README.md

Perform recovery fine-tuning on a sliced model using LoRA hyperparameters.

```bash
python run_finetuning.py \
    --model microsoft/phi-2 \
    --sliced-model-path path/to/sliced \
    --save-dir dir/to/save/finetuned_model/in \
    --sparsity 0.25 \
    --device cuda:0 \
    --ppl-eval-dataset alpaca \
    --finetune-dataset alpaca \
    --finetune-train-nsamples 8000 \
    --finetune-train-seqlen 1024 \
    --finetune-train-batch-size 3 \
    --lora-alpha 10 \
    --lora-r 32 \
    --lora-dropout 0.05 \
    --lora-target-option attn_head_and_mlp \
    --eval-steps 16 \
    --save-steps 16 \
    --no-wandb
```

--------------------------------

### Load Hugging Face Model and Tokenizer

Source: https://context7.com/microsoft/transformercompression/llms.txt

Loads a pretrained model and its tokenizer from Hugging Face Hub or a local path. Supports automatic model adapter selection and can create uninitialized models for loading pre-sliced weights.
```python
from slicegpt import hf_utils
from slicegpt.config import config
import torch

# Load a pretrained model from Hugging Face Hub
model_adapter, tokenizer = hf_utils.get_model_and_tokenizer(
    model_name="microsoft/phi-2",
    model_path=None,  # None for HF models
    dtype=torch.float16,
    token=None  # HF_TOKEN for gated models
)

# Access the underlying model
model = model_adapter.model
print(f"Model hidden size: {model_adapter.hidden_size}")
print(f"Sequence length: {model_adapter.seqlen}")
print(f"Parallel blocks: {model_adapter.parallel_blocks}")

# Load a local model
model_adapter, tokenizer = hf_utils.get_model_and_tokenizer(
    model_name="meta-llama/Llama-2-7b-hf",
    model_path="/path/to/local/llama",
    dtype=torch.float16
)
```

--------------------------------

### Evaluate Sliced Models

Source: https://context7.com/microsoft/transformercompression/llms.txt

Evaluate model performance on standard benchmarks using the LM Evaluation Harness.

```bash
# Evaluate on PIQA benchmark
python experiments/run_lm_eval.py \
    --model microsoft/phi-2 \
    --sliced-model-path ./sliced_models/phi2 \
    --sparsity 0.25 \
    --tasks piqa \
    --no-wandb

# Evaluate original model for comparison
python experiments/run_lm_eval.py \
    --model microsoft/phi-2 \
    --model-path microsoft/phi-2 \
    --tasks piqa,hellaswag,arc_easy \
    --no-wandb
```

--------------------------------

### Save Compressed Model

Source: https://context7.com/microsoft/transformercompression/llms.txt

Persists the model state dictionary and configuration to the specified directory.

```python
# 8. Save compressed model
SAVE_DIR.mkdir(parents=True, exist_ok=True)
model_name = pathlib.Path(MODEL_NAME).name
torch.save(model_adapter.model.state_dict(), SAVE_DIR / f"{model_name}_{SPARSITY}.pt")
(SAVE_DIR / f"{model_name}_{SPARSITY}.json").write_text(
    model_adapter.slicing_conf.to_json_string()
)
print(f"Saved to {SAVE_DIR}")
```

--------------------------------

### Evaluate Model Perplexity

Source: https://context7.com/microsoft/transformercompression/llms.txt

Moves the model to the configured device and calculates perplexity on a test dataset.

```python
from slicegpt import gpu_utils
from slicegpt.config import config

# Move model to GPU
model_adapter.model.to(config.device)

# Evaluate perplexity
ppl = gpu_utils.evaluate_ppl(
    model=model_adapter.model,
    pad_token_id=model_adapter.model.config.pad_token_id,
    testloader=test_loader
)
print(f"Model perplexity: {ppl:.4f}")

# Example output:
# Evaluating perplexity...
# Model perplexity: 11.2847
```

--------------------------------

### Implement Custom ModelAdapter

Source: https://context7.com/microsoft/transformercompression/llms.txt

Create a custom LayerAdapter to support new model architectures by defining layer accessors for layernorms, attention, and MLP modules.
```python
from slicegpt.model_adapter import ModelAdapter, LayerAdapter
from torch.nn import Module, Linear

class CustomLayerAdapter(LayerAdapter):
    def __init__(self, layer):
        self._layer = layer

    @property
    def layer(self) -> Module:
        return self._layer

    @property
    def hidden_states_args_position(self) -> int:
        return 0  # Position in forward() args

    @property
    def hidden_states_output_position(self) -> int:
        return 0  # Position in forward() output

    def get_first_layernorm(self) -> Module:
        return self._layer.input_layernorm

    def get_second_layernorm(self) -> Module:
        return self._layer.post_attention_layernorm

    def get_attention_inputs(self) -> list[Linear]:
        return [self._layer.self_attn.q_proj, self._layer.self_attn.k_proj, self._layer.self_attn.v_proj]

    def get_attention_output(self) -> Linear:
        return self._layer.self_attn.o_proj

    def get_mlp_inputs(self) -> list[Linear]:
        return [self._layer.mlp.gate_proj, self._layer.mlp.up_proj]

    def get_mlp_output(self) -> Linear:
        return self._layer.mlp.down_proj
```

--------------------------------

### Evaluate model using LM Eval Harness

Source: https://github.com/microsoft/transformercompression/blob/main/README.md

Run evaluation on a sliced model using the LM Eval Harness framework.

```bash
python run_lm_eval.py \
    --model microsoft/phi-2 \
    --sliced-model-path path/to/sliced \
    --sparsity 0.25 \
    --tasks piqa \
    --no-wandb
```

--------------------------------

### get_model_and_tokenizer

Source: https://context7.com/microsoft/transformercompression/llms.txt

Loads a Hugging Face model and its tokenizer with automatic model adapter selection.

```APIDOC
## get_model_and_tokenizer

### Description
Loads a Hugging Face model and its tokenizer with automatic model adapter selection. Supports pretrained models from Hugging Face Hub or local paths.

### Parameters
#### Request Body
- **model_name** (string) - Required - The name of the model on Hugging Face Hub.
- **model_path** (string) - Optional - Local path to the model.
- **dtype** (torch.dtype) - Optional - Data type for the model weights.
- **token** (string) - Optional - Hugging Face token for gated models.

### Request Example
{
  "model_name": "microsoft/phi-2",
  "dtype": "torch.float16"
}
```

--------------------------------

### Benchmark Inference Performance

Source: https://context7.com/microsoft/transformercompression/llms.txt

Measures latency and throughput by running token-by-token generation on a sample batch.

```python
from slicegpt import gpu_utils, data_utils

# Prepare a sample input
sample_batch = next(iter(test_loader))

# Run benchmark
results = gpu_utils.benchmark(model_adapter, sample_batch)
print(f"Median time per token: {results['median_time']:.4f}s")
print(f"Latency: {results['latency']:.4f}s")
print(f"Throughput: {results['throughput']:.2f} tokens/s")
```

--------------------------------

### Distribute Model Across GPUs

Source: https://context7.com/microsoft/transformercompression/llms.txt

Uses Hugging Face Accelerate to distribute large models across multiple GPUs for inference.

```python
from slicegpt import gpu_utils

# Distribute model across available GPUs
# Recommended for models with 30B+ parameters
gpu_utils.distribute_model(model_adapter)

# Model is now automatically distributed and ready for inference
output = model_adapter.model(input_ids=input_ids.to("cuda"))
```

--------------------------------

### Rotate and Slice Model

Source: https://context7.com/microsoft/transformercompression/llms.txt

Applies the rotation and slicing transformation to the model adapter using the provided scheduler and calibration data.

```python
# 6. Rotate and slice
rotate.rotate_and_slice(model_adapter, train_loader, scheduler)
```

--------------------------------

### Rotate and Slice Model Weights

Source: https://context7.com/microsoft/transformercompression/llms.txt

Applies orthogonal transformations and slices model weights based on a specified sparsity, then saves the resulting model and configuration.
```python
from slicegpt import rotate, layernorm_fusion
from slicegpt.slicing_scheduler import ConstSlicingScheduler

# Prepare model for slicing
layernorm_fusion.replace_layers(model_adapter)
layernorm_fusion.fuse_modules(model_adapter)

# Calculate new embedding dimension (25% sparsity)
sparsity = 0.25
round_interval = 8
new_embedding_dimension = int((1 - sparsity) * model_adapter.hidden_size)
new_embedding_dimension -= new_embedding_dimension % round_interval

# Create slicing scheduler with constant dimension
scheduler = ConstSlicingScheduler(new_embedding_dimension)

# Rotate and slice the model
rotate.rotate_and_slice(
    model_adapter=model_adapter,
    dataloader=train_loader,
    slicing_scheduler=scheduler,
    apply_mask=True,
    final_orientation='random'  # 'random' or 'pca'
)

# Save the sliced model
import torch
import pathlib

save_dir = pathlib.Path("./sliced_model")
save_dir.mkdir(parents=True, exist_ok=True)
torch.save(model_adapter.model.state_dict(), save_dir / "phi-2_0.25.pt")

# Save slicing configuration
config_path = save_dir / "phi-2_0.25.json"
config_path.write_text(model_adapter.slicing_conf.to_json_string())
```

--------------------------------

### load_sliced_model

Source: https://context7.com/microsoft/transformercompression/llms.txt

Loads a previously sliced model along with its slicing configuration and optionally applies LoRA for fine-tuning.

```APIDOC
## load_sliced_model

### Description
Loads a previously sliced model along with its slicing configuration and optionally applies LoRA for fine-tuning.

### Parameters
#### Request Body
- **model_name** (string) - Required - The name of the model.
- **sliced_model_path** (string) - Required - Path to the sliced model weights.
- **sparsity** (float) - Required - The sparsity level applied.
- **round_interval** (int) - Optional - Rounding interval for slicing.
- **lora_config** (LoraConfig) - Optional - Configuration for LoRA fine-tuning.

### Request Example
{
  "model_name": "microsoft/phi-2",
  "sliced_model_path": "./sliced_models/phi2",
  "sparsity": 0.25
}
```

--------------------------------

### Evaluate Sliced Model

Source: https://context7.com/microsoft/transformercompression/llms.txt

Computes the perplexity of the compressed model on the test dataset.

```python
# 7. Evaluate sliced model
model_adapter.model.to(config.device)
sliced_ppl = gpu_utils.evaluate_ppl(
    model_adapter.model, model_adapter.model.config.pad_token_id, test_loader
)
print(f"Sliced PPL: {sliced_ppl:.4f}")
```

--------------------------------

### Replace Layers and Fuse Modules

Source: https://context7.com/microsoft/transformercompression/llms.txt

Replaces transformer layers with compressed versions and fuses LayerNorm operations into adjacent linear layers to simplify normalization.

```python
from slicegpt import layernorm_fusion

# Step 1: Replace layers with compressed equivalents (adds shortcut operation)
layernorm_fusion.replace_layers(model_adapter)

# Step 2: Fuse LayerNorm weights into adjacent linear layers
# This is a mathematical transformation that preserves model output
layernorm_fusion.fuse_modules(model_adapter)

# After fusion, the model outputs remain identical but the
# normalization operations are simplified to RMSNorm without weights
```

--------------------------------

### Calculate Slicing Dimensions

Source: https://context7.com/microsoft/transformercompression/llms.txt

Determines the new hidden size based on target sparsity and rounding intervals.

```python
# 5. Calculate slicing dimensions
new_dim = int((1 - SPARSITY) * model_adapter.hidden_size)
new_dim -= new_dim % ROUND_INTERVAL
scheduler = ConstSlicingScheduler(new_dim)
```
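
The dimension arithmetic from the "Calculate Slicing Dimensions" snippet can be tried without loading a model. The following is a minimal sketch in plain Python: `sliced_dimension` is a hypothetical helper name, not part of slicegpt, and the hidden sizes passed in are illustrative inputs (2560 is assumed here to match the `dimension=1920` used with phi-2 at 25% sparsity).

```python
def sliced_dimension(hidden_size: int, sparsity: float, round_interval: int = 8) -> int:
    """Return the embedding dimension kept after slicing.

    Hypothetical helper mirroring the two-line calculation in the snippet:
    scale the hidden size by (1 - sparsity), then round down to a multiple
    of round_interval.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    if round_interval < 1:
        raise ValueError("round_interval must be a positive integer")
    new_dim = int((1 - sparsity) * hidden_size)
    new_dim -= new_dim % round_interval  # round down to a multiple of round_interval
    return new_dim


if __name__ == "__main__":
    # Illustrative (hidden_size, sparsity) pairs; these are assumptions, not library values.
    for hidden_size, sparsity in [(2560, 0.25), (4096, 0.3)]:
        print(hidden_size, sparsity, sliced_dimension(hidden_size, sparsity))
```

The result of such a helper is what would be passed to `ConstSlicingScheduler` as the new dimension. Rounding down rather than to the nearest multiple means the achieved sparsity is never lower than the requested one.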