### Install Dependencies and Run Medusa Example

Source: https://github.com/linkedin/liger-kernel/blob/main/examples/medusa/README.md

Follow these commands to clone the repository, install dependencies, and execute the Medusa example script with Llama3.

```bash
git clone git@github.com:linkedin/Liger-Kernel.git
cd {PATH_TO_Liger-Kernel}/Liger-Kernel/
pip install -e .
cd {PATH_TO_Liger-Kernel}/Liger-Kernel/examples/medusa
pip install -r requirements.txt
sh scripts/llama3_8b_medusa.sh
```

--------------------------------

### Install Dependencies and Run Locally

Source: https://github.com/linkedin/liger-kernel/blob/main/examples/huggingface/README.md

Use these commands to install the necessary dependencies and run the example locally on a GPU machine. Ensure you have the `requirements.txt` file and the appropriate shell script.

```bash
pip install -r requirements.txt
sh run_{MODEL}.sh
```

--------------------------------

### Setup Function for Benchmarking

Source: https://github.com/linkedin/liger-kernel/blob/main/benchmark/BENCHMARK_GUIDELINES.md

Define a single setup function to build inputs and the layer for both speed and memory benchmarks. This function should accept `SingleBenchmarkRunInput` and return the input tensor and the layer or function to be benchmarked.

```python
def _setup_geglu(input: SingleBenchmarkRunInput):
    cfg = input.extra_benchmark_config
    # Build model config, create x tensor, instantiate layer by provider
    return x, layer
```

--------------------------------

### Run Remotely on Modal

Source: https://github.com/linkedin/liger-kernel/blob/main/examples/huggingface/README.md

These commands allow you to run the example remotely on Modal, a serverless platform for GPU computation. This is useful if you do not have local GPU access. You need to install the Modal client and authenticate.

```bash
pip install modal
modal setup  # authenticate with Modal
modal run launch_on_modal.py --script "run_qwen2_vl.sh"
```

--------------------------------

### Install Dependencies and Editable Package

Source: https://github.com/linkedin/liger-kernel/blob/main/docs/contributing.md

Install the package in editable mode together with its development dependencies. If the first command fails (for example, zsh reports `no matches found: .[dev]`), use the quoted form instead.

```sh
pip install -e .[dev]
```

```sh
pip install -e .'[dev]'
```

--------------------------------

### Install Dependencies

Source: https://github.com/linkedin/liger-kernel/blob/main/examples/lightning/README.md

Install the required Python packages using the provided requirements file.

```bash
pip install -r requirements.txt
```

--------------------------------

### Visualizing Benchmark Results (Model-Config Sweep)

Source: https://github.com/linkedin/liger-kernel/blob/main/benchmark/README.md

Command-line example for visualizing benchmark results, focusing on speed metrics for a model-config sweep.

```bash
python benchmarks_visualizer.py \
    --kernel-name geglu \
    --metric-name speed \
    --sweep-mode model_config
```

--------------------------------

### Visualizing Benchmark Results (Token-Length Sweep)

Source: https://github.com/linkedin/liger-kernel/blob/main/benchmark/README.md

Command-line example for visualizing benchmark results with the visualization script. This example uses the token-length sweep and plots speed metrics for the specified operation modes (forward, backward).

```bash
python benchmarks_visualizer.py \
    --kernel-name kto_loss \
    --metric-name speed \
    --kernel-operation-mode forward backward
```

--------------------------------

### Install and Run Pre-commit Hooks

Source: https://github.com/linkedin/liger-kernel/blob/main/docs/contributing.md

Install the pre-commit hooks using prek, a Rust-based alternative to pre-commit. To run the checks on all files without committing, use the `-a` flag.

```sh
prek install
```

```sh
prek run -a
```

--------------------------------

### Install Liger Kernel from Source

Source: https://github.com/linkedin/liger-kernel/blob/main/docs/index.md

Install Liger Kernel from its source repository, including default or development dependencies.

```bash
git clone https://github.com/linkedin/Liger-Kernel.git
cd Liger-Kernel

# Install Default Dependencies
# Setup.py will detect whether you are using AMD or NVIDIA
pip install -e .

# Setup Development Dependencies
pip install -e ".[dev]"
```

--------------------------------

### Install Liger Kernel (Nightly)

Source: https://github.com/linkedin/liger-kernel/blob/main/docs/index.md

Install the nightly build of Liger Kernel using pip.

```bash
pip install liger-kernel-nightly
```

--------------------------------

### ORPO Training with LigerORPOTrainer

Source: https://github.com/linkedin/liger-kernel/blob/main/docs/Examples.md

Example of setting up and running ORPO training locally on a GPU machine with FSDP. Imports necessary libraries from transformers and trl.

```python
import torch

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig  # noqa: F401

from liger_kernel.transformers.trainer import LigerORPOTrainer  # noqa: F401

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    max_length=512,
    padding="max_length",
)
tokenizer.pad_token = tokenizer.eos_token

train_dataset = load_dataset("trl-lib/tldr-preference", split="train")

training_args = ORPOConfig(
    output_dir="Llama3.2_1B_Instruct",
    beta=0.1,
    max_length=128,
    per_device_train_batch_size=32,
    max_steps=100,
    save_strategy="no",
)

trainer = LigerORPOTrainer(
    model=model, args=training_args, tokenizer=tokenizer, train_dataset=train_dataset
)

trainer.train()
```

--------------------------------

### Compute Default Tiling Strategy Example

Source: https://github.com/linkedin/liger-kernel/blob/main/src/liger_kernel/ops/backends/_ascend/ascend-ub-manager-design.md

Example of how to compute the default tiling strategy for given shapes and tiling dimensions. Ensure shapes and tiling_dims have matching lengths.

```python
from liger_kernel.ops.backends._ascend.ub_manager import compute_default_tiling_strategy

shapes = ((32, 128), (32, 128))  # (n_q_head, hd), (n_kv_head, hd)
tile_shapes = compute_default_tiling_strategy(
    safety_margin=0.90,
    dtype_size=4,  # float32
    memory_multiplier=3.0,
    shapes=shapes,
    tiling_dims=(0, 0)  # First dimension of each shape can be tiled
)

if tile_shapes is not None and len(tile_shapes) == len(shapes):
    q_tile_shape, k_tile_shape = tile_shapes
    BLOCK_Q, _ = q_tile_shape  # Tiled dimension
    BLOCK_K, _ = k_tile_shape  # Tiled dimension
    # Call kernel with BLOCK_Q and BLOCK_K
```

--------------------------------

### Visualizing Benchmark Results (Token-Length Sweep, All Modes)

Source: https://github.com/linkedin/liger-kernel/blob/main/benchmark/README.md

Command-line example for visualizing benchmark results. With a token-length sweep and no operation mode specified, speed metrics are plotted for all available operation modes.

```bash
python benchmarks_visualizer.py \
    --kernel-name kto_loss \
    --metric-name speed
```

--------------------------------

### Setup Function for Benchmarking

Source: https://github.com/linkedin/liger-kernel/blob/main/benchmark/README.md

Defines how to construct inputs and modules for a single forward pass in the benchmark. It takes a `SingleBenchmarkRunInput` and returns a tuple of tensors or modules.

```python
from typing import Any, Tuple


def _setup_fn(input: SingleBenchmarkRunInput) -> Tuple[Any, ...]:
    x = ...      # placeholder: build the input tensor here
    layer = ...  # placeholder: build the layer or function under test here
    return x, layer
```

--------------------------------

### Install Liger Kernel (Stable)

Source: https://github.com/linkedin/liger-kernel/blob/main/docs/index.md

Install the stable version of Liger Kernel using pip.

```bash
pip install liger-kernel
```

--------------------------------

### Visualizing Benchmark Results (Memory)

Source: https://github.com/linkedin/liger-kernel/blob/main/benchmark/README.md

Command-line example for visualizing memory benchmark results. For memory metrics, only the 'full' plot is generated.

```bash
python benchmarks_visualizer.py \
    --kernel-name kto_loss \
    --metric-name memory
```

--------------------------------

### Running Benchmark Scripts

Source: https://github.com/linkedin/liger-kernel/blob/main/benchmark/README.md

Command-line examples for running benchmark scripts. These scripts can be used to benchmark specific kernels like 'kto_loss' with different sweep modes (model_config or token_length).

```bash
cd benchmark
python scripts/benchmark_kto_loss.py --sweep-mode model_config [--model llama_3_8b]
```

```bash
python scripts/benchmark_kto_loss.py [--sweep-mode token_length] [--bt 2048]
```

--------------------------------

### Install Liger Kernel from Source

Source: https://github.com/linkedin/liger-kernel/blob/main/README.md

Clone the repository and install Liger Kernel from source, including default or development dependencies. For AMD users, specific PyTorch nightly builds are recommended.

```bash
git clone https://github.com/linkedin/Liger-Kernel.git
cd Liger-Kernel

# Install Default Dependencies
# Setup.py will detect whether you are using AMD or NVIDIA
pip install -e .

# Setup Development Dependencies
pip install -e ".[dev]"

# NOTE -> For AMD users only
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.3/
```

--------------------------------

### Speed Benchmark Function

Source: https://github.com/linkedin/liger-kernel/blob/main/benchmark/BENCHMARK_GUIDELINES.md

Implement a speed benchmark function that utilizes the setup function and `run_speed_benchmark` utility. It takes `SingleBenchmarkRunInput` and returns `SingleBenchmarkRunOutput`.

```python
def bench_speed_geglu(input: SingleBenchmarkRunInput) -> SingleBenchmarkRunOutput:
    x, layer = _setup_geglu(input)
    return run_speed_benchmark(lambda: layer(x), input.kernel_operation_mode, [x])
```

--------------------------------

### Compute Default Tiling Strategy for GEGLU

Source: https://github.com/linkedin/liger-kernel/blob/main/src/liger_kernel/ops/backends/_ascend/ascend-ub-manager-design.md

Example demonstrating how to use `compute_default_tiling_strategy` for a GEGLU forward pass. It specifies shapes, tiling dimensions, and memory parameters to obtain optimal block sizes.

```python
from liger_kernel.ops.backends._ascend.ub_manager import compute_default_tiling_strategy

# GEGLU forward
shapes = ((4096,),)
tile_shapes = compute_default_tiling_strategy(
    safety_margin=0.80,
    dtype_size=2,  # float16
    memory_multiplier=7.0,
    shapes=shapes,
    tiling_dims=(0,)  # First dimension can be tiled
)

if tile_shapes is not None and len(tile_shapes) > 0:
    block_size = tile_shapes[0][0]
    # Call kernel with block_size
```

--------------------------------

### Run Training on Multiple GPUs

Source: https://github.com/linkedin/liger-kernel/blob/main/examples/lightning/README.md

Execute the training script for a setup with 8xA100 40GB GPUs, using the deepspeed strategy.

```bash
python training.py --model meta-llama/Meta-Llama-3-8B --strategy deepspeed
```

--------------------------------

### Memory Benchmark Function

Source: https://github.com/linkedin/liger-kernel/blob/main/benchmark/BENCHMARK_GUIDELINES.md

Implement a memory benchmark function using the setup function and `run_memory_benchmark` utility. It takes `SingleBenchmarkRunInput` and returns `SingleBenchmarkRunOutput`.

```python
def bench_memory_geglu(input: SingleBenchmarkRunInput) -> SingleBenchmarkRunOutput:
    x, layer = _setup_geglu(input)
    return run_memory_benchmark(lambda: layer(x), input.kernel_operation_mode)
```

--------------------------------

### Add New Kernel Tiling Support with compute_default_tiling_strategy

Source: https://github.com/linkedin/liger-kernel/blob/main/src/liger_kernel/ops/backends/_ascend/ascend-ub-manager-design.md

Provides a Python example demonstrating how to add tiling support for a new kernel using the `compute_default_tiling_strategy` function. It covers parameter preparation, strategy computation, and kernel invocation.

```python
import triton

from liger_kernel.ops.backends._ascend.ub_manager import compute_default_tiling_strategy


def my_kernel_forward(input):
    # Prepare parameters
    n_cols = input.shape[-1]
    dtype_size = input.element_size()

    # Compute strategy
    # Example 1: Simple case (all dimensions can be tiled)
    shapes = ((n_cols,),)
    tile_shapes = compute_default_tiling_strategy(
        safety_margin=0.80,
        dtype_size=dtype_size,
        memory_multiplier=7.0,  # Based on your memory analysis
        shapes=shapes,
        tiling_dims=(0,)  # First dimension can be tiled
    )

    if tile_shapes is not None and len(tile_shapes) > 0:
        block_size = tile_shapes[0][0]
    else:
        block_size = triton.next_power_of_2(n_cols)  # Fallback

    # Example 2: Multiple shapes with fixed dimensions
    # shapes = ((M, K), (K, N))
    # tiling_dims = (0, 1)  # First shape: dim 0 can be tiled, dim 1 is fixed
    #                       # Second shape: dim 0 is fixed, dim 1 can be tiled
    # Returns: ((block_M, K), (K, block_N))

    # Call kernel (`kernel` and `grid_size` are placeholders for your own kernel and launch grid)
    kernel[(grid_size,)](
        input,
        BLOCK_SIZE=block_size,
    )
```

--------------------------------

### Install PyTorch for ROCm 6.3

Source: https://github.com/linkedin/liger-kernel/blob/main/README.md

Install the nightly build of PyTorch with ROCm 6.3 support. This is a prerequisite for using Liger Kernel with AMD GPUs.

```bash
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.3/
```

--------------------------------

### Multimodal Finetuning with Torchrun

Source: https://github.com/linkedin/liger-kernel/blob/main/docs/Examples.md

Use this script to run multimodal finetuning locally on a GPU machine. Ensure you have 4xA100 80GB GPUs for the default configuration.

```bash
#!/bin/bash

torchrun --nnodes=1 --nproc-per-node=4 training_multimodal.py \
    --model_name "Qwen/Qwen2-VL-7B-Instruct" \
    --bf16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --eval_strategy "no" \
    --save_strategy "no" \
    --learning_rate 6e-6 \
    --weight_decay 0.05 \
    --warmup_ratio 0.1 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --include_num_input_tokens_seen \
    --report_to none \
    --fsdp "full_shard auto_wrap" \
    --fsdp_config config/fsdp_config.json \
    --seed 42 \
    --use_liger True \
    --output_dir multimodal_finetuning
```

--------------------------------

### Install Liger Kernel with Development Dependencies

Source: https://github.com/linkedin/liger-kernel/blob/main/README.md

Install the Liger Kernel package in editable mode, including development dependencies. This command is typically run from the root of the project directory.

```bash
pip install -e .[dev]
```

--------------------------------

### Run Training on Single GPU

Source: https://github.com/linkedin/liger-kernel/blob/main/examples/lightning/README.md

Execute the training script for a single L40 48GB GPU, specifying the model and number of GPUs.

```bash
python training.py --model Qwen/Qwen2-0.5B-Instruct --num_gpu 1 --max_length 1024
```

--------------------------------

### Directory Structure for Vendor Backends

Source: https://github.com/linkedin/liger-kernel/blob/main/src/liger_kernel/ops/backends/README.md

Illustrates the expected file and directory layout for a new vendor backend implementation within the Liger-Kernel structure.

```bash
mkdir -p backends/_<vendor>/ops
touch backends/_<vendor>/__init__.py
touch backends/_<vendor>/ops/__init__.py
```

--------------------------------

### Vendor-Specific Operator Implementation

Source: https://github.com/linkedin/liger-kernel/blob/main/src/liger_kernel/ops/backends/README.md

Example of implementing a vendor-specific PyTorch autograd Function for an operator. This includes forward and backward passes, with placeholders for vendor-specific logic.

```python
import torch


class LigerGELUMulFunction(torch.autograd.Function):
    """
    Vendor-specific LigerGELUMulFunction implementation.
    """

    @staticmethod
    def forward(ctx, a, b):
        # Your vendor-specific forward implementation
        ...

    @staticmethod
    def backward(ctx, dc):
        # Your vendor-specific backward implementation
        ...


# Optional: vendor-specific kernel functions
def geglu_forward_vendor(a, b):
    ...


def geglu_backward_vendor(a, b, dc):
    ...
```

--------------------------------

### Device Detection Logic

Source: https://github.com/linkedin/liger-kernel/blob/main/src/liger_kernel/ops/backends/README.md

Example of how to extend the device inference function to detect a new custom device type. This function should be updated to include checks for your specific device. Replace `<vendor>` with your device name.

```python
def infer_device():
    if torch.cuda.is_available():
        return "cuda"
    if is_npu_available():
        return "npu"
    # Add your device detection here
    if is_<vendor>_available():
        return "<vendor>"
    return "cpu"
```
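
The priority-ordered fallback used by the device detection logic can be sketched without any hardware dependencies. The block below is a minimal illustration, not Liger Kernel's implementation: `infer_device_from_probes`, `DeviceProbe`, and the `my_vendor` device name are hypothetical, and the lambdas are stand-ins for real availability checks such as `torch.cuda.is_available()`.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical alias: a device name paired with a zero-argument availability probe.
DeviceProbe = Tuple[str, Callable[[], bool]]


def infer_device_from_probes(probes: Sequence[DeviceProbe], default: str = "cpu") -> str:
    """Return the name of the first device whose probe reports True."""
    for name, is_available in probes:
        if is_available():
            return name
    # No probe succeeded, so fall back to the default device.
    return default


# Stand-in probes for illustration only; real probes would wrap calls such as
# torch.cuda.is_available() or a vendor-specific availability check.
example_probes = [
    ("cuda", lambda: False),
    ("npu", lambda: False),
    ("my_vendor", lambda: True),  # hypothetical vendor device
]

detected = infer_device_from_probes(example_probes)
print(detected)
```

Keeping the probes in an ordered list makes the detection priority explicit: a new vendor check is added by inserting one entry at the desired position, and the `"cpu"` fallback stays in a single place.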