### Install Dependencies and Run Medusa Example

Source: https://github.com/linkedin/liger-kernel/blob/main/examples/medusa/README.md

Follow these commands to clone the repository, install dependencies, and run the Medusa example script for Llama 3 8B. Replace `{PATH_TO_Liger-Kernel}` with the directory where you cloned the repository.

```bash
git clone git@github.com:linkedin/Liger-Kernel.git
cd {PATH_TO_Liger-Kernel}/Liger-Kernel/
pip install -e .
cd {PATH_TO_Liger-Kernel}/Liger-Kernel/examples/medusa
pip install -r requirements.txt
sh scripts/llama3_8b_medusa.sh
```

--------------------------------

### Install Dependencies and Run Locally

Source: https://github.com/linkedin/liger-kernel/blob/main/examples/huggingface/README.md

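Use these commands to install the necessary dependencies and run the example locally on a GPU machine. Run them from the example directory, which contains `requirements.txt` and the shell scripts, and replace `{MODEL}` with the model whose script you want to run (for example, `run_qwen2_vl.sh`).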

```bash
pip install -r requirements.txt
sh run_{MODEL}.sh
```

--------------------------------

### Setup Function for Benchmarking

Source: https://github.com/linkedin/liger-kernel/blob/main/benchmark/BENCHMARK_GUIDELINES.md

Define a single setup function to build inputs and the layer for both speed and memory benchmarks. This function should accept `SingleBenchmarkRunInput` and return the input tensor and the layer or function to be benchmarked.

```python
def _setup_geglu(input: SingleBenchmarkRunInput):
    cfg = input.extra_benchmark_config
    # Build model config, create x tensor, instantiate layer by provider
    x = ...  # input tensor built from cfg
    layer = ...  # layer instantiated for the selected provider
    return x, layer
```

--------------------------------

### Run Remotely on Modal

Source: https://github.com/linkedin/liger-kernel/blob/main/examples/huggingface/README.md

These commands allow you to run the example remotely on Modal, a serverless platform for GPU computation. This is useful if you do not have local GPU access. You need to install the Modal client and authenticate.

```bash
pip install modal
modal setup  # authenticate with Modal
modal run launch_on_modal.py --script "run_qwen2_vl.sh"
```

--------------------------------

### Install Dependencies and Editable Package

Source: https://github.com/linkedin/liger-kernel/blob/main/docs/contributing.md

Install the package in editable mode together with its development dependencies. If your shell reports `no matches found: .[dev]` (common in zsh), use the second command, which quotes the extras.

```sh
pip install -e .[dev]
```

```sh
pip install -e .'[dev]'
```

--------------------------------

### Install Dependencies

Source: https://github.com/linkedin/liger-kernel/blob/main/examples/lightning/README.md

Install the required Python packages using the provided requirements file.

```bash
pip install -r requirements.txt
```

--------------------------------

### Visualizing Benchmark Results (Model-Config Sweep)

Source: https://github.com/linkedin/liger-kernel/blob/main/benchmark/README.md

Command-line example for visualizing benchmark results, focusing on speed metrics for a model-config sweep.

```bash
python benchmarks_visualizer.py \
    --kernel-name geglu \
    --metric-name speed \
    --sweep-mode model_config
```

--------------------------------

### Visualizing Benchmark Results (Token-Length Sweep)

Source: https://github.com/linkedin/liger-kernel/blob/main/benchmark/README.md

Command-line example for visualizing benchmark results with the visualization script. It plots speed metrics for a token-length sweep, restricted to the forward and backward operation modes.

```bash
python benchmarks_visualizer.py \
    --kernel-name kto_loss \
    --metric-name speed \
    --kernel-operation-mode forward backward
```

--------------------------------

### Install and Run Pre-commit Hooks

Source: https://github.com/linkedin/liger-kernel/blob/main/docs/contributing.md

Install the pre-commit hooks with prek, a Rust-based alternative to pre-commit. To run the checks on all files without committing, use the `-a` flag.

```sh
prek install
```

```sh
prek run -a
```

--------------------------------

### Install Liger Kernel from Source

Source: https://github.com/linkedin/liger-kernel/blob/main/docs/index.md

Install Liger Kernel from its source repository, including default or development dependencies.

```bash
git clone https://github.com/linkedin/Liger-Kernel.git
cd Liger-Kernel

# Install Default Dependencies
# Setup.py will detect whether you are using AMD or NVIDIA
pip install -e .

# Setup Development Dependencies
pip install -e ".[dev]"
```

--------------------------------

### Install Liger Kernel (Nightly)

Source: https://github.com/linkedin/liger-kernel/blob/main/docs/index.md

Install the nightly build of Liger Kernel using pip.

```bash
pip install liger-kernel-nightly
```

--------------------------------

### ORPO Training with LigerORPOTrainer

Source: https://github.com/linkedin/liger-kernel/blob/main/docs/Examples.md

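Example of setting up and running ORPO training locally on a GPU machine with FSDP. The script loads `meta-llama/Llama-3.2-1B-Instruct` in bfloat16, loads the `trl-lib/tldr-preference` dataset, and trains for 100 steps with `LigerORPOTrainer`.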

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig

from liger_kernel.transformers.trainer import LigerORPOTrainer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    max_length=512,
    padding="max_length",
)
tokenizer.pad_token = tokenizer.eos_token

train_dataset = load_dataset("trl-lib/tldr-preference", split="train")

training_args = ORPOConfig(
    output_dir="Llama3.2_1B_Instruct",
    beta=0.1,
    max_length=128,
    per_device_train_batch_size=32,
    max_steps=100,
    save_strategy="no",
)

trainer = LigerORPOTrainer(
    model=model, args=training_args, tokenizer=tokenizer, train_dataset=train_dataset
)

trainer.train()
```

--------------------------------

### Compute Default Tiling Strategy Example

Source: https://github.com/linkedin/liger-kernel/blob/main/src/liger_kernel/ops/backends/_ascend/ascend-ub-manager-design.md

Example of how to compute the default tiling strategy for given shapes and tiling dimensions. Ensure shapes and tiling_dims have matching lengths.

```python
from liger_kernel.ops.backends._ascend.ub_manager import compute_default_tiling_strategy

shapes = ((32, 128), (32, 128))  # (n_q_head, hd), (n_kv_head, hd)
tile_shapes = compute_default_tiling_strategy(
    safety_margin=0.90,
    dtype_size=4,  # float32
    memory_multiplier=3.0,
    shapes=shapes,
    tiling_dims=(0, 0)  # First dimension of each shape can be tiled
)
if tile_shapes is not None and len(tile_shapes) == len(shapes):
    q_tile_shape, k_tile_shape = tile_shapes
    BLOCK_Q, _ = q_tile_shape  # Tiled dimension
    BLOCK_K, _ = k_tile_shape  # Tiled dimension
    # Call kernel with BLOCK_Q and BLOCK_K
```

--------------------------------

### Visualizing Benchmark Results (Token-Length Sweep, All Modes)

Source: https://github.com/linkedin/liger-kernel/blob/main/benchmark/README.md

Command-line example for visualizing benchmark results, specifically plotting speed metrics for all available operation modes when using a token-length sweep.

```bash
python benchmarks_visualizer.py \
    --kernel-name kto_loss \
    --metric-name speed
```

--------------------------------

### Setup Function for Benchmarking

Source: https://github.com/linkedin/liger-kernel/blob/main/benchmark/README.md

Defines how to construct inputs and modules for a single forward pass in the benchmark. It takes a `SingleBenchmarkRunInput` and returns a tuple of tensors or modules.

```python
from typing import Any, Tuple


def _setup_fn(input: SingleBenchmarkRunInput) -> Tuple[Any, ...]:
    x = ...
    layer = ...
    return x, layer
```

--------------------------------

### Install Liger Kernel (Stable)

Source: https://github.com/linkedin/liger-kernel/blob/main/docs/index.md

Install the stable version of Liger Kernel using pip.

```bash
pip install liger-kernel
```

--------------------------------

### Visualizing Benchmark Results (Memory)

Source: https://github.com/linkedin/liger-kernel/blob/main/benchmark/README.md

Command-line example for visualizing memory benchmark results. For memory metrics, only the 'full' plot is generated.

```bash
python benchmarks_visualizer.py \
    --kernel-name kto_loss \
    --metric-name memory
```

--------------------------------

### Running Benchmark Scripts

Source: https://github.com/linkedin/liger-kernel/blob/main/benchmark/README.md

Command-line examples for running benchmark scripts, here benchmarking the `kto_loss` kernel in each sweep mode (`model_config` or `token_length`). Arguments in square brackets are optional; drop the brackets when you pass them.

```bash
cd benchmark
python scripts/benchmark_kto_loss.py --sweep-mode model_config [--model llama_3_8b]
```

```bash
python scripts/benchmark_kto_loss.py [--sweep-mode token_length] [--bt 2048]
```

--------------------------------

### Install Liger Kernel from Source

Source: https://github.com/linkedin/liger-kernel/blob/main/README.md

Clone the repository and install Liger Kernel from source, including default or development dependencies. For AMD users, specific PyTorch nightly builds are recommended.

```bash
git clone https://github.com/linkedin/Liger-Kernel.git
cd Liger-Kernel

# Install Default Dependencies
# Setup.py will detect whether you are using AMD or NVIDIA
pip install -e .

# Setup Development Dependencies
pip install -e ".[dev]"

# NOTE -> For AMD users only
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.3/
```

--------------------------------

### Speed Benchmark Function

Source: https://github.com/linkedin/liger-kernel/blob/main/benchmark/BENCHMARK_GUIDELINES.md

Implement a speed benchmark function that utilizes the setup function and `run_speed_benchmark` utility. It takes `SingleBenchmarkRunInput` and returns `SingleBenchmarkRunOutput`.

```python
def bench_speed_geglu(input: SingleBenchmarkRunInput) -> SingleBenchmarkRunOutput:
    x, layer = _setup_geglu(input)
    return run_speed_benchmark(lambda: layer(x), input.kernel_operation_mode, [x])
```

--------------------------------

### Compute Default Tiling Strategy for GEGLU

Source: https://github.com/linkedin/liger-kernel/blob/main/src/liger_kernel/ops/backends/_ascend/ascend-ub-manager-design.md

Example demonstrating how to use `compute_default_tiling_strategy` for a GEGLU forward pass. It specifies shapes, tiling dimensions, and memory parameters to obtain optimal block sizes.

```python
from liger_kernel.ops.backends._ascend.ub_manager import compute_default_tiling_strategy

# GEGLU forward
shapes = ((4096,),)
tile_shapes = compute_default_tiling_strategy(
    safety_margin=0.80,
    dtype_size=2,  # float16
    memory_multiplier=7.0,
    shapes=shapes,
    tiling_dims=(0,)  # First dimension can be tiled
)
if tile_shapes is not None and len(tile_shapes) > 0:
    block_size = tile_shapes[0][0]
    # Call kernel with block_size
```

--------------------------------

### Run Training on Multiple GPUs

Source: https://github.com/linkedin/liger-kernel/blob/main/examples/lightning/README.md

Execute the training script for a setup with 8xA100 40GB GPUs, using the deepspeed strategy.

```bash
python training.py --model meta-llama/Meta-Llama-3-8B --strategy deepspeed
```

--------------------------------

### Memory Benchmark Function

Source: https://github.com/linkedin/liger-kernel/blob/main/benchmark/BENCHMARK_GUIDELINES.md

Implement a memory benchmark function using the setup function and `run_memory_benchmark` utility. It takes `SingleBenchmarkRunInput` and returns `SingleBenchmarkRunOutput`.

```python
def bench_memory_geglu(input: SingleBenchmarkRunInput) -> SingleBenchmarkRunOutput:
    x, layer = _setup_geglu(input)
    return run_memory_benchmark(lambda: layer(x), input.kernel_operation_mode)
```

--------------------------------

### Add New Kernel Tiling Support with compute_default_tiling_strategy

Source: https://github.com/linkedin/liger-kernel/blob/main/src/liger_kernel/ops/backends/_ascend/ascend-ub-manager-design.md

Provides a Python example demonstrating how to add tiling support for a new kernel using the `compute_default_tiling_strategy` function. It covers parameter preparation, strategy computation, and kernel invocation.

```python
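import triton

from liger_kernel.ops.backends._ascend.ub_manager import compute_default_tiling_strategy


def my_kernel_forward(input):
    # Prepare parameters
    n_cols = input.shape[-1]
    dtype_size = input.element_size()
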
    # Compute strategy
    # Example 1: Simple case (all dimensions can be tiled)
    shapes = ((n_cols,),)
    tile_shapes = compute_default_tiling_strategy(
        safety_margin=0.80,
        dtype_size=dtype_size,
        memory_multiplier=7.0,  # Based on your memory analysis
        shapes=shapes,
        tiling_dims=(0,)  # First dimension can be tiled
    )
    
    if tile_shapes is not None and len(tile_shapes) > 0:
        block_size = tile_shapes[0][0]
    else:
        block_size = triton.next_power_of_2(n_cols)  # Fallback
    
    # Example 2: Multiple shapes with fixed dimensions
    # shapes = ((M, K), (K, N))
    # tiling_dims = (0, 1)  # First shape: dim 0 can be tiled, dim 1 is fixed
    #                      # Second shape: dim 0 is fixed, dim 1 can be tiled
    # Returns: ((block_M, K), (K, block_N))
    
    # Call kernel (`kernel` and `grid_size` are placeholders for your Triton kernel and launch grid)
    kernel[(grid_size,)](
        input,
        BLOCK_SIZE=block_size,
    )
```
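The fallback branch above can be isolated into a small helper. The following sketch is pure Python so it runs without Triton: `next_power_of_2` is a stand-in for the rounding that `triton.next_power_of_2` performs (smallest power of two greater than or equal to `n`), and `pick_block_size` is a hypothetical helper name, not part of Liger Kernel.

```python
from typing import Optional, Sequence, Tuple


def next_power_of_2(n: int) -> int:
    """Smallest power of two that is >= n, for n >= 1."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return 1 << (n - 1).bit_length()


def pick_block_size(
    tile_shapes: Optional[Sequence[Tuple[int, ...]]],
    n_cols: int,
) -> int:
    """Use the computed tile if there is one, otherwise fall back to rounding n_cols up."""
    if tile_shapes is not None and len(tile_shapes) > 0:
        return tile_shapes[0][0]
    return next_power_of_2(n_cols)


print(pick_block_size(((2048,),), 5000))  # strategy succeeded
print(pick_block_size(None, 5000))        # strategy returned None, so fall back
```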

--------------------------------

### Install PyTorch for ROCm 6.3

Source: https://github.com/linkedin/liger-kernel/blob/main/README.md

Install the nightly build of PyTorch with ROCm 6.3 support. This is a prerequisite for using Liger Kernel with AMD GPUs.

```bash
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.3/
```

--------------------------------

### Multimodal Finetuning with Torchrun

Source: https://github.com/linkedin/liger-kernel/blob/main/docs/Examples.md

Use this script to run multimodal finetuning locally on a GPU machine. Ensure you have 4xA100 80GB GPUs for the default configuration.

```bash
#!/bin/bash

torchrun --nnodes=1 --nproc-per-node=4 training_multimodal.py \
    --model_name "Qwen/Qwen2-VL-7B-Instruct" \
    --bf16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --eval_strategy "no" \
    --save_strategy "no" \
    --learning_rate 6e-6 \
    --weight_decay 0.05 \
    --warmup_ratio 0.1 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --include_num_input_tokens_seen \
    --report_to none \
    --fsdp "full_shard auto_wrap" \
    --fsdp_config config/fsdp_config.json \
    --seed 42 \
    --use_liger True \
    --output_dir multimodal_finetuning
```

--------------------------------

### Install Liger Kernel with Development Dependencies

Source: https://github.com/linkedin/liger-kernel/blob/main/README.md

Install the Liger Kernel package in editable mode, including development dependencies. This command is typically run from the root of the project directory.

```bash
pip install -e .[dev]
```

--------------------------------

### Run Training on Single GPU

Source: https://github.com/linkedin/liger-kernel/blob/main/examples/lightning/README.md

Execute the training script for a single L40 48GB GPU, specifying the model, the number of GPUs, and the maximum sequence length.

```bash
python training.py --model Qwen/Qwen2-0.5B-Instruct --num_gpu 1 --max_length 1024
```

--------------------------------

### Directory Structure for Vendor Backends

Source: https://github.com/linkedin/liger-kernel/blob/main/src/liger_kernel/ops/backends/README.md

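Creates the expected file and directory layout for a new vendor backend within the Liger-Kernel structure. Replace `<vendor>` with your vendor name before running; the angle brackets mark a placeholder and are not shell syntax.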

```bash
mkdir -p backends/_<vendor>/ops
touch backends/_<vendor>/__init__.py
touch backends/_<vendor>/ops/__init__.py
```
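The same layout can be scaffolded from Python, which avoids the shell placeholder entirely. This is a minimal sketch: `scaffold_backend` is a hypothetical helper and `myvendor` is a made-up vendor name, and the example writes into a temporary directory rather than a real checkout.

```python
import tempfile
from pathlib import Path


def scaffold_backend(root: Path, vendor: str) -> Path:
    """Create backends/_<vendor>/ops with an __init__.py at each level."""
    ops_dir = root / "backends" / f"_{vendor}" / "ops"
    ops_dir.mkdir(parents=True, exist_ok=True)
    (ops_dir.parent / "__init__.py").touch()
    (ops_dir / "__init__.py").touch()
    return ops_dir.parent


with tempfile.TemporaryDirectory() as tmp:
    backend = scaffold_backend(Path(tmp), "myvendor")
    # List everything created, relative to the temporary root.
    created = sorted(p.relative_to(tmp).as_posix() for p in backend.rglob("*"))
    print(created)
```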

--------------------------------

### Vendor-Specific Operator Implementation

Source: https://github.com/linkedin/liger-kernel/blob/main/src/liger_kernel/ops/backends/README.md

Example of implementing a vendor-specific PyTorch autograd Function for an operator. This includes forward and backward passes, with placeholders for vendor-specific logic.

```python
import torch

class LigerGELUMulFunction(torch.autograd.Function):
    """
    Vendor-specific LigerGELUMulFunction implementation.
    """
    @staticmethod
    def forward(ctx, a, b):
        # Your vendor-specific forward implementation
        ...

    @staticmethod
    def backward(ctx, dc):
        # Your vendor-specific backward implementation
        ...

# Optional: vendor-specific kernel functions
def geglu_forward_vendor(a, b):
    ...

def geglu_backward_vendor(a, b, dc):
    ...
```
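When filling in the forward and backward placeholders, it can help to have a scalar reference to test against. The sketch below assumes the operator computes `gelu(a) * b` with the tanh approximation of GELU; that formula, and the function names ending in `_ref`, are assumptions for illustration and should be checked against the default implementation you are replacing.

```python
import math

_K = math.sqrt(2.0 / math.pi)  # constant in the tanh approximation of GELU
_C = 0.044715


def gelu_tanh(a: float) -> float:
    """Tanh approximation of GELU."""
    return 0.5 * a * (1.0 + math.tanh(_K * (a + _C * a**3)))


def gelu_tanh_grad(a: float) -> float:
    """Derivative of gelu_tanh with respect to a."""
    t = math.tanh(_K * (a + _C * a**3))
    return 0.5 * (1.0 + t) + 0.5 * a * (1.0 - t * t) * _K * (1.0 + 3.0 * _C * a * a)


def geglu_forward_ref(a: float, b: float) -> float:
    """c = gelu(a) * b"""
    return gelu_tanh(a) * b


def geglu_backward_ref(a: float, b: float, dc: float) -> tuple:
    """Gradients of c with respect to a and b, scaled by the upstream gradient dc."""
    da = dc * b * gelu_tanh_grad(a)
    db = dc * gelu_tanh(a)
    return da, db


# Compare the analytic gradient with a central finite difference.
a, b, eps = 0.7, -1.3, 1e-6
da, db = geglu_backward_ref(a, b, 1.0)
da_numeric = (geglu_forward_ref(a + eps, b) - geglu_forward_ref(a - eps, b)) / (2 * eps)
print(abs(da - da_numeric) < 1e-6)
```

A vendor kernel can be validated the same way: run it on a few inputs and compare elementwise against a reference within a tolerance suited to the dtype.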

--------------------------------

### Device Detection Logic

Source: https://github.com/linkedin/liger-kernel/blob/main/src/liger_kernel/ops/backends/README.md

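Example of how to extend the device inference function to detect a new custom device type. Add the check for your device before the final `"cpu"` fallback; `is_<device>_available` and `"<device>"` are placeholders to replace with your own availability check and device name.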

```python
def infer_device():
    if torch.cuda.is_available():
        return "cuda"
    if is_npu_available():
        return "npu"
    # Add your device detection here
    if is_<device>_available():
        return "<device>"
    return "cpu"
```