### Install Optimi Package

Source: https://optimi.benjaminwarner.dev/

Install the optimi library using pip. This command fetches and installs the latest version from PyPI.

```bash
pip install torch-optimi

```
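To confirm the install from Python, a small standard-library check can look up the distribution by the name used in the pip command above. This is only a sketch: the helper `installed_version` is hypothetical and not part of optimi, and it reports whatever version metadata pip recorded.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# "torch-optimi" is the distribution name from the pip command above
print(installed_version("torch-optimi"))
```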

--------------------------------

### Using Kahan Summation with optimi AdamW Optimizer

Source: https://optimi.benjaminwarner.dev/kahan_summation?q=

This example demonstrates initializing an optimi AdamW optimizer with a model cast to BFloat16. Kahan summation is automatically enabled for low-precision layers. The optimizer step will then use Kahan summation for updating low-precision weights.

```python
import torch
from torch import nn
from optimi import AdamW

# create or cast some model layers in low precision (bfloat16)
model = nn.Linear(20, 1, dtype=torch.bfloat16)

# initialize any optimi optimizer with low precision parameters
# Kahan summation is enabled since some model layers are bfloat16
opt = AdamW(model.parameters(), lr=1e-3)

# forward and backward, casting input to bfloat16 if needed
loss = model(torch.randn(20, dtype=torch.bfloat16))
loss.backward()

# optimizer step automatically uses Kahan summation for low precision layers
opt.step()
opt.zero_grad()

```

--------------------------------

### PyTorch Optimizer Accumulation Example

Source: https://optimi.benjaminwarner.dev/optimizer_accumulation

This snippet shows how to use optimizer accumulation with a PyTorch dataloader. Set `opt.optimizer_accumulation = True` to accumulate gradients into the optimizer states, or `False` to perform a full gradient release step that updates the parameters. The optimizer step and zero_grad are handled automatically.

```python
for idx, batch in enumerate(dataloader):
    # `optimizer_accumulation=True` accumulates gradients into
    # optimizer states. set `optimizer_accumulation=False` to
    # update parameters by performing a full gradient release step
    opt.optimizer_accumulation = (idx+1) % accumulation_steps != 0

    # calling backward on the model will perform the optimizer step
    # either accumulating gradients or updating model parameters
    loss = model(batch)
    loss.backward()

    # optimizer step and zero_grad are no longer needed, and will
    # harmlessly no-op if called by an existing training framework
    # opt.step()
    # opt.zero_grad()

    # step the learning rate scheduler after accumulating gradients
    if not opt.optimizer_accumulation:
        scheduler.step()

# optionally remove gradient release hooks when done training
remove_gradient_release(model)
```
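As a rough illustration of the schedule in the loop above, the following plain-Python sketch evaluates the same `(idx+1) % accumulation_steps != 0` expression without a model or optimizer. The helper name `accumulation_flags` is hypothetical and not part of optimi.

```python
from typing import List


def accumulation_flags(num_batches: int, accumulation_steps: int) -> List[bool]:
    """Value `opt.optimizer_accumulation` would take for each batch index.

    True  -> gradients are accumulated into the optimizer states
    False -> a full step runs and the parameters are updated
    """
    return [(idx + 1) % accumulation_steps != 0 for idx in range(num_batches)]


flags = accumulation_flags(num_batches=8, accumulation_steps=4)

# batch indices on which the parameters (and the scheduler) would step
update_batches = [idx for idx, accumulate in enumerate(flags) if not accumulate]
print(flags)
print(update_batches)
```

With `accumulation_steps = 4`, only every fourth batch sets the flag to `False`, which is also when the loop above steps the learning rate scheduler.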

--------------------------------

### Initialize Optimizer with Gradient Release and Prepare Model

Source: https://optimi.benjaminwarner.dev/optimizer_accumulation

Initialize your model and optimizer with gradient release enabled. This involves setting `gradient_release=True` for the optimizer and calling `prepare_for_gradient_release` on both the model and optimizer. Use bfloat16 for the model's data type.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import CosineAnnealingLR
from optimi import AdamW, prepare_for_gradient_release

# create or cast model in low precision (bfloat16)
model = nn.Linear(20, 1, dtype=torch.bfloat16)

# initialize any optimi optimizer with `gradient_release=True`
# and call `prepare_for_gradient_release` on model and optimizer
opt = AdamW(model.parameters(), lr=1e-3, gradient_release=True)
prepare_for_gradient_release(model, opt)

# update model parameters every four steps after accumulating
# gradients directly into the optimizer states
accumulation_steps = 4

# setup a learning rate scheduler for gradient accumulation
scheduler = CosineAnnealingLR(opt, ...)
```

--------------------------------

### Override Float32 Modules in Low Precision Casting

Source: https://optimi.benjaminwarner.dev/utils?q=

Pass `fp32_modules` to `to_low_precision` to override which module types are kept in float32 when the rest of the model is cast to low precision. The example below keeps only `nn.LayerNorm` modules in float32.

```python
to_low_precision(model, dtype=torch.bfloat16, fp32_modules=(nn.LayerNorm,))
```

--------------------------------

### Standard Weight Decay with AdamW (Float32/Mixed Precision)

Source: https://optimi.benjaminwarner.dev/

Use AdamW with standard PyTorch-style weight decay for float32 or mixed-precision training. This is a typical setup when not utilizing low-precision features.

```python
# create model
model = nn.Linear(20, 1)

# initialize any optimi optimizer with parameters
opt = AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

```

--------------------------------

### Initialize and Use Gradient Release

Source: https://optimi.benjaminwarner.dev/gradient_release

Initialize an optimi optimizer with `gradient_release=True` and call `prepare_for_gradient_release` on the model and optimizer. The optimizer step and zero_grad are handled automatically during the backward pass.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import CosineAnnealingLR
from optimi import AdamW, prepare_for_gradient_release, remove_gradient_release

# create or cast model in low precision (bfloat16)
model = nn.Linear(20, 1, dtype=torch.bfloat16)

# initialize any optimi optimizer with `gradient_release=True`
# and call `prepare_for_gradient_release` on model and optimizer
opt = AdamW(model.parameters(), lr=1e-3, gradient_release=True)
prepare_for_gradient_release(model, opt)

# setup a learning rate scheduler like normal
scheduler = CosineAnnealingLR(opt, ...)

# calling backward on the model will perform the optimizer step
loss = model(torch.randn(20, dtype=torch.bfloat16))
loss.backward()

# optimizer step and zero_grad are no longer needed, and will
# harmlessly no-op if called by an existing training framework
# opt.step()
# opt.zero_grad()

# step the learning rate scheduler like normal
scheduler.step()

# optionally remove gradient release hooks when done training
remove_gradient_release(model)

```

--------------------------------

### Initialize and Use AdamW with Triton Backend

Source: https://optimi.benjaminwarner.dev/triton?q=

Demonstrates how to initialize an AdamW optimizer, showing both the default Triton usage on supported GPUs and explicit enabling. This snippet is for models on a supported GPU.

```python
import torch
from torch import nn
from optimi import AdamW

# create model
model = nn.Linear(20, 1, device="cuda")

# models on a supported GPU will default to `triton=True`
opt = AdamW(model.parameters(), lr=1e-3)

# or initialize any optimi optimizer with `triton=True`
opt = AdamW(model.parameters(), lr=1e-3, triton=True)

# forward and backward
loss = model(torch.randn(20, device="cuda"))
loss.backward()

# optimizer step is the Triton implementation
opt.step()
opt.zero_grad()

```

--------------------------------

### Initialize AdamW Optimizer with Foreach

Source: https://optimi.benjaminwarner.dev/foreach

Demonstrates how to initialize the AdamW optimizer with the `foreach=True` argument to enable the foreach implementation. This requires PyTorch 2.1+ and is typically used on CUDA devices.

```python
import torch
from torch import nn
from optimi import AdamW

# create model
model = nn.Linear(20, 1, device="cuda")

# initialize any optimi optimizer with `foreach=True`
opt = AdamW(model.parameters(), lr=1e-3, foreach=True)

# forward and backward
loss = model(torch.randn(20, device="cuda"))
loss.backward()

# optimizer step is the foreach implementation
opt.step()
opt.zero_grad()

```

--------------------------------

### Lion Optimizer

Source: https://optimi.benjaminwarner.dev/optimizers/lion

Initializes the Lion optimizer with specified parameters.

```APIDOC
## Lion Optimizer

Lion optimizer. Evolved Sign Momentum.

### Parameters

- **params** (`Iterable[Tensor]` | `Iterable[dict]`) - Required - Iterable of parameters to optimize or dicts defining parameter groups
- **lr** (`float`) - Required - Learning rate
- **betas** (`tuple[float, float]`) - Optional - Coefficients for update moving average and gradient moving average (default: (0.9, 0.99))
- **weight_decay** (`float`) - Optional - Weight decay coefficient. If `decouple_lr` is False, applies decoupled weight decay (default: 0)
- **decouple_lr** (`bool`) - Optional - Apply fully decoupled weight decay instead of decoupled weight decay (default: False)
- **max_lr** (`float | None`) - Optional - Maximum scheduled learning rate. Set if `lr` is not the maximum scheduled learning rate and `decouple_lr` is True (default: None)
- **kahan_sum** (`bool | None`) - Optional - Enables Kahan summation for more accurate parameter updates when training in low precision (float16 or bfloat16). If unspecified, automatically applies for low precision parameters (default: None)
- **foreach** (`bool | None`) - Optional - Enables the foreach implementation. If unspecified, tries to use foreach over for-loop implementation since it can be significantly faster (default: None)
- **triton** (`bool | None`) - Optional - Enables Triton implementation. If unspecified, tries to use Triton as it is significantly faster than both for-loop and foreach implementations (default: None)
- **gradient_release** (`bool`) - Optional - Fuses optimizer step and zero_grad as part of the parameter's backward pass. Requires model hooks created with `prepare_for_gradient_release`. Incompatible with closure (default: False)
```

--------------------------------

### Prepare Model for Gradient Release

Source: https://optimi.benjaminwarner.dev/utils?q=

Registers post_accumulate_grad_hooks on model parameters for gradient release optimization. Ensure the optimizer is initialized with `gradient_release=True`.

```python
prepare_for_gradient_release(
    model, optimizer, ignore_existing_hooks=False
)

```

--------------------------------

### StableAdamW Algorithm Formulation

Source: https://optimi.benjaminwarner.dev/optimizers/stableadamw?q=

Mathematical steps for the StableAdamW optimizer, including initialization, gradient calculation, moment updates, RMS normalization, and parameter updates with decoupled weight decay.

```latex
\begin{aligned}
    &\rule{100mm}{0.4pt}\\
    &\hspace{2mm} \textbf{\textcolor{#9a3fe4}{Stable}AdamW} \\
    &\hspace{5mm} \text{inputs} : \bm{\theta}_0 \: \text{(params)}; \: f(\bm{\theta}) \text{(objective)}; \: \gamma_t \:\text{(learning rate at } t \text{)}; \\
    &\hspace{17.25mm} \beta_1, \beta_2 \: \text{(betas)}; \: \lambda \: \text{(weight decay)}; \: \epsilon \: \text{(epsilon)}\\
    &\hspace{5mm} \text{initialize} : \bm{m}_{0} \leftarrow \bm{0}; \: \bm{v}_{0} \leftarrow \bm{0}\\[-0.5em]
    &\rule{100mm}{0.4pt}\\
    &\hspace{5mm} \textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do}:\\
    &\hspace{10mm} \bm{g}_t \leftarrow \nabla_{\theta} f_t(\bm{\theta}_{t-1})\\[0.5em]
    &\hspace{10mm} \bm{m}_t \leftarrow \beta_1 \bm{m}_{t-1} + (1 - \beta_1) \bm{g}_t\\
    &\hspace{10mm} \bm{v}_t \leftarrow \beta_2 \bm{v}_{t-1} + (1 - \beta_2) \bm{g}^2_t\\[0.5em]
    &\hspace{10mm} \hat{\bm{m}}_t \leftarrow \bm{m}_t/(1 - \beta_1^t)\\
    &\hspace{10mm} \hat{\bm{v}}_t \leftarrow \bm{v}_t/(1 - \beta_2^t)\\[0.5em]
    &\hspace{10mm} \textcolor{#9a3fe4}{\textbf{RMS}_t \leftarrow \sqrt{\mathbb{E}[\bm{g}^2_t/\text{max}(\bm{v}_t, \epsilon^2)]}}\\
    &\hspace{10mm} \textcolor{#9a3fe4}{\bm{\eta}_t \leftarrow \gamma_t/\text{max}(1,\textbf{RMS}_t)}\\[0.5em]
    &\hspace{10mm} \bm{\theta}_t \leftarrow \bm{\theta}_{t-1} - \textcolor{#9a3fe4}{\bm{\eta}_t} \bigl( \hat{\bm{m}}_t / (\sqrt{\hat{\bm{v}}_t} + \epsilon) + \lambda\bm{\theta}_{t-1} \bigr)\\[-0.5em]
    &\rule{100mm}{0.4pt}\\
\end{aligned}
```
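The formulation above can be traced with a minimal pure-Python sketch operating on lists of floats. It follows the listed steps literally and is not optimi's implementation; the function name `stable_adamw_step` and its argument layout are assumptions made for illustration.

```python
import math


def stable_adamw_step(theta, grad, m, v, t, lr, beta1=0.9, beta2=0.99,
                      weight_decay=0.01, eps=1e-6):
    """One StableAdamW-style update on lists of floats; returns (theta, m, v)."""
    # moving averages of the gradient and the squared gradient
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]

    # bias correction
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]

    # RMS_t = sqrt(mean(g^2 / max(v, eps^2))), then clip the learning rate
    rms = math.sqrt(
        sum(g * g / max(vi, eps ** 2) for g, vi in zip(grad, v)) / len(grad)
    )
    eta = lr / max(1.0, rms)

    # parameter update with decoupled weight decay, scaled by the clipped rate
    theta = [
        p - eta * (mh / (math.sqrt(vh) + eps) + weight_decay * p)
        for p, mh, vh in zip(theta, m_hat, v_hat)
    ]
    return theta, m, v


# first step from zero-initialized moments, weight decay disabled for clarity
theta, m, v = stable_adamw_step(
    theta=[1.0, 1.0], grad=[1.0, -2.0], m=[0.0, 0.0], v=[0.0, 0.0],
    t=1, lr=0.1, weight_decay=0.0,
)
print(theta)
```

In this toy call the RMS term is about 10, so the effective learning rate is clipped to roughly a tenth of `lr`, and each parameter moves by about 0.01 rather than 0.1.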

--------------------------------

### Adan Optimizer

Source: https://optimi.benjaminwarner.dev/optimizers/adan

Initializes the Adan optimizer with specified parameters and hyperparameters.

```APIDOC
## Adan Optimizer

Adan Optimizer: Adaptive Nesterov Momentum Algorithm.

### Parameters

- **params** (Iterable[Tensor] | Iterable[dict]) - Required - Iterable of parameters to optimize or dicts defining parameter groups
- **lr** (float) - Required - Learning rate
- **betas** (tuple[float, float, float]) - Optional - Coefficients for gradient, gradient difference, and squared gradient moving averages (default: (0.98, 0.92, 0.99))
- **weight_decay** (float) - Optional - Weight decay coefficient. If `decouple_lr` is False, applies decoupled weight decay (default: 0.02)
- **eps** (float) - Optional - Added to denominator to improve numerical stability (default: 1e-6)
- **decouple_lr** (bool) - Optional - Apply fully decoupled weight decay instead of decoupled weight decay (default: False)
- **max_lr** (float | None) - Optional - Maximum scheduled learning rate. Set if `lr` is not the maximum scheduled learning rate and `decouple_lr` is True (default: None)
- **adam_wd** (bool) - Optional - Apply weight decay before parameter update (Adam-style), instead of after the update per Adan algorithm (default: False)
- **kahan_sum** (bool | None) - Optional - Enables Kahan summation for more accurate parameter updates when training in low precision (float16 or bfloat16). If unspecified, automatically applies for low precision parameters (default: False)
- **foreach** (bool | None) - Optional - Enables the foreach implementation. If unspecified, tries to use foreach over for-loop implementation since it can be significantly faster (default: None)
- **triton** (bool | None) - Optional - Enables Triton implementation. If unspecified, tries to use Triton as it is significantly faster than both for-loop and foreach implementations (default: None)
- **gradient_release** (bool) - Optional - Fuses optimizer step and zero_grad as part of the parameter's backward pass. Requires model hooks created with `prepare_for_gradient_release`. Incompatible with closure (default: False)
```

--------------------------------

### SGD with Kahan Summation Algorithm

Source: https://optimi.benjaminwarner.dev/kahan_summation?q=

This snippet illustrates the algorithmic steps for Stochastic Gradient Descent (SGD) incorporating Kahan summation for improved numerical stability. It outlines the inputs, initialization, and the iterative update process for parameters and the compensation buffer.

```latex
\begin{aligned} &\rule{90mm}{0.4pt}\\ &\hspace{2mm} \textcolor{#009ddb}{\textbf{SGD}} \: \textcolor{#9a3fe4}{\text{with Kahan summation}}\\ &\hspace{5mm} \text{inputs} : \bm{\theta}_0 \: \text{(params)}; \: f(\bm{\theta}) \text{(objective)};\\ &\hspace{17.25mm} \gamma_t \:\text{(learning rate at } t \text{)}; \: \lambda \: \text{(weight decay)}\\ &\hspace{5mm} \text{initialize} : \textcolor{#9a3fe4}{\bm{k}_{0} \leftarrow \bm{0}}\\[-0.5em] &\rule{90mm}{0.4pt}\\ &\hspace{5mm} \textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do}:\\ &\hspace{10mm} \bm{g}_t \leftarrow \nabla_{\theta} f_t(\bm{\theta}_{t-1}) - \lambda\bm{\theta}_{t-1}\\[0.5em] &\hspace{10mm} \textcolor{#009ddb}{\bm{\theta}_t \leftarrow \bm{\theta}_{t-1} - \gamma_t\bm{g}_t}\\[0.3em] &\hspace{10mm} \textcolor{#9a3fe4}{\bm{u}_t \leftarrow \bm{k}_{t-1} - \gamma_t\bm{g}_t}\\ &\hspace{10mm} \textcolor{#9a3fe4}{\bm{\theta}_t \leftarrow \bm{\theta}_{t-1} + \bm{u}_t}\\ &\hspace{10mm} \textcolor{#9a3fe4}{\bm{k}_t \leftarrow \bm{u}_t + (\bm{\theta}_{t-1} - \bm{\theta}_t)}\\[-0.5em] &\rule{90mm}{0.4pt}\\ \end{aligned}
```
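To see why the compensation buffer matters, the sketch below replays the Kahan steps above on a single scalar while rounding every intermediate result to float16 with the standard-library `struct` module. It is an illustration under simplifying assumptions, not optimi's code: the product of learning rate and gradient is folded into one constant `update`, weight decay is left out, and the helper names are hypothetical.

```python
import struct


def to_half(x: float) -> float:
    """Round a Python float to the nearest float16 (IEEE 754 binary16) value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def sgd_plain(theta: float, update: float, steps: int) -> float:
    """theta <- theta - update, with every result rounded to float16."""
    theta = to_half(theta)
    for _ in range(steps):
        theta = to_half(theta - update)
    return theta


def sgd_kahan(theta: float, update: float, steps: int) -> float:
    """Same updates, with a compensation buffer k carrying the rounding error."""
    theta, k = to_half(theta), 0.0
    for _ in range(steps):
        u = to_half(k - update)                      # u_t <- k_{t-1} - gamma_t * g_t
        new_theta = to_half(theta + u)               # theta_t <- theta_{t-1} + u_t
        k = to_half(u + to_half(theta - new_theta))  # k_t <- u_t + (theta_{t-1} - theta_t)
        theta = new_theta
    return theta


# 1000 updates of 1e-4 should move the parameter from 1.0 to about 0.9
print(sgd_plain(1.0, 1e-4, 1000))
print(sgd_kahan(1.0, 1e-4, 1000))
```

Each update of 1e-4 is smaller than half the float16 spacing just below 1.0, so the plain loop rounds it away every time and the parameter never moves, while the compensated loop carries the lost amount forward in `k` and ends close to 0.9.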

--------------------------------

### RAdam Optimizer

Source: https://optimi.benjaminwarner.dev/optimizers/radam

Initializes the RAdam optimizer. This optimizer is suitable for various training scenarios and offers options for decoupled weight decay and numerical stability enhancements.

```APIDOC
## RAdam

Rectified Adam optimizer. Optionally with decoupled weight decay.

### Parameters

- **params** (Iterable[Tensor] | Iterable[dict]) - Required - Iterable of parameters to optimize or dicts defining parameter groups
- **lr** (float) - Required - Learning rate
- **betas** (tuple[float, float]) - Optional - Coefficients for gradient and squared gradient moving averages (default: (0.9, 0.99))
- **weight_decay** (float) - Optional - Weight decay coefficient. If `decouple_wd` and `decouple_lr` are False, applies L2 penalty (default: 0)
- **eps** (float) - Optional - Added to denominator to improve numerical stability (default: 1e-6)
- **decouple_wd** (bool) - Optional - Apply decoupled weight decay instead of L2 penalty (default: False)
- **decouple_lr** (bool) - Optional - Apply fully decoupled weight decay instead of L2 penalty (default: False)
- **max_lr** (float | None) - Optional - Maximum scheduled learning rate. Set if `lr` is not the maximum scheduled learning rate and `decouple_lr` is True (default: None)
- **kahan_sum** (bool | None) - Optional - Enables Kahan summation for more accurate parameter updates when training in low precision (float16 or bfloat16). If unspecified, automatically applies for low precision parameters (default: None)
- **foreach** (bool | None) - Optional - Enables the foreach implementation. If unspecified, tries to use foreach over for-loop implementation since it can be significantly faster (default: None)
- **triton** (bool | None) - Optional - Enables Triton implementation. If unspecified, tries to use Triton as it is significantly faster than both for-loop and foreach implementations (default: None)
- **gradient_release** (bool) - Optional - Fuses optimizer step and zero_grad as part of the parameter's backward pass. Requires model hooks created with `prepare_for_gradient_release`. Incompatible with closure (default: False)
```

--------------------------------

### AdamW and Fully Decoupled Weight Decay Algorithm

Source: https://optimi.benjaminwarner.dev/fully_decoupled_weight_decay

This algorithm outlines the steps for AdamW and Adam with fully decoupled weight decay. It shows the update rule for parameters, emphasizing the separate application of weight decay.

```latex
\begin{aligned}
    &\rule{105mm}{0.4pt}\\
    &\hspace{2mm} \textcolor{#009ddb}{\text{PyTorch’s AdamW}} \: \text{\&} \: \textcolor{#9a3fe4}{\text{Adam with fully decoupled weight decay}}\\
    &\hspace{5mm} \text{inputs} : \bm{\theta}_0 \: \text{(params)}; \: f(\bm{\theta}) \text{(objective)}; \: \gamma_t \:\text{(learning rate at } t \text{)}; \\
    &\hspace{17.25mm} \beta_1, \beta_2 \: \text{(betas)}; \: \lambda \: \text{(weight decay)}; \: \epsilon \text{ (epsilon)};\\
    &\hspace{17.25mm} \gamma_\text{max} \: \text{(maximum learning rate)}\\
    &\hspace{5mm} \text{initialize} : \bm{m}_{0} \leftarrow \bm{0}; \: \bm{v}_{0} \leftarrow \bm{0}\\[-0.5em]
    &\rule{105mm}{0.4pt}\\
    &\hspace{5mm} \textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do}:\\
    &\hspace{10mm} \bm{g}_t \leftarrow \nabla_{\theta} f_t(\bm{\theta}_{t-1})\\[0.5em]
    &\hspace{10mm} \bm{m}_t \leftarrow \beta_1 \bm{m}_{t-1} + (1 - \beta_1) \bm{g}_t\\
    &\hspace{10mm} \bm{v}_t \leftarrow \beta_2 \bm{v}_{t-1} + (1 - \beta_2) \bm{g}^2_t\\[0.5em]
    &\hspace{10mm} \hat{\bm{m}}_t \leftarrow \bm{m}_t/(1 - \beta_1^t)\\
    &\hspace{10mm} \hat{\bm{v}}_t \leftarrow \bm{v}_t/(1 - \beta_2^t)\\[0.5em]
    &\hspace{10mm} \bm{\theta}_t \leftarrow \bm{\theta}_{t-1} - \gamma_t \bigl( \hat{\bm{m}}_t / (\sqrt{\hat{\bm{v}}_t} + \epsilon) \textcolor{#009ddb}{+ \lambda\bm{\theta}_{t-1}} \bigr)\textcolor{#9a3fe4}{- (\gamma_t/\gamma_\text{max})\lambda\bm{\theta}_{t-1}}\\[-0.5em]
    &\rule{105mm}{0.4pt}\\
\end{aligned}
```
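The two weight decay terms in the update rule above differ only in their scaling, which a short arithmetic sketch can make concrete. The function names and the numeric values are illustrative assumptions, not optimi API or recommended settings.

```python
def adamw_decay(theta: float, lr: float, weight_decay: float) -> float:
    """PyTorch-style AdamW decay term: gamma_t * lambda * theta."""
    return lr * weight_decay * theta


def fully_decoupled_decay(theta: float, lr: float, max_lr: float,
                          weight_decay: float) -> float:
    """Fully decoupled decay term: (gamma_t / gamma_max) * lambda * theta."""
    return (lr / max_lr) * weight_decay * theta


# at the peak learning rate 1e-3, lambda = 1e-2 (AdamW) and lambda = 1e-5
# (fully decoupled) shrink a unit weight by the same amount
same_a = adamw_decay(1.0, lr=1e-3, weight_decay=1e-2)
same_b = fully_decoupled_decay(1.0, lr=1e-3, max_lr=1e-3, weight_decay=1e-5)

# raising the peak learning rate tenfold changes only the AdamW term
scaled_a = adamw_decay(1.0, lr=1e-2, weight_decay=1e-2)
scaled_b = fully_decoupled_decay(1.0, lr=1e-2, max_lr=1e-2, weight_decay=1e-5)
print(same_a, same_b, scaled_a, scaled_b)
```

Because the fully decoupled term depends on the ratio of the current to the maximum learning rate, it still follows the schedule's shape while staying independent of the learning rate's absolute magnitude.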

--------------------------------

### StableAdamW Optimizer Initialization

Source: https://optimi.benjaminwarner.dev/optimizers/stableadamw

Initializes the StableAdamW optimizer. This optimizer is a drop-in replacement for AdamW and supports its hyperparameters, with the addition of update clipping for enhanced training stability. It also offers options for fully decoupled weight decay and Kahan summation for low-precision training.

```APIDOC
## StableAdamW

StableAdamW optimizer. An AdamW-Adafactor hybrid with learning rate update clipping.

### Parameters

- **params** (Iterable[Tensor] | Iterable[dict]): Iterable of parameters to optimize or dicts defining parameter groups. Required.
- **lr** (float): Learning rate. Required.
- **betas** (tuple[float, float]): Coefficients for gradient and squared gradient moving averages. Default: (0.9, 0.99).
- **weight_decay** (float): Weight decay coefficient. If `decouple_lr` is False, applies decoupled weight decay. Default: 0.01.
- **eps** (float): Added to denominator to improve numerical stability. Default: 1e-6.
- **decouple_lr** (bool): Apply fully decoupled weight decay instead of decoupled weight decay. Default: False.
- **max_lr** (float | None): Maximum scheduled learning rate. Set if `lr` is not the maximum scheduled learning rate and `decouple_lr` is True. Default: None.
- **kahan_sum** (bool | None): Enables Kahan summation for more accurate parameter updates when training in low precision (float16 or bfloat16). If unspecified, automatically applies for low precision parameters. Default: None.
- **foreach** (bool | None): Enables the foreach implementation. If unspecified, tries to use foreach over for-loop implementation since it can be significantly faster. Default: None.
- **triton** (bool | None): Enables Triton implementation. If unspecified, tries to use Triton as it is significantly faster than both for-loop and foreach implementations. Default: None.
- **gradient_release** (bool): Fuses optimizer step and zero_grad as part of the parameter's backward pass. Requires model hooks created with `prepare_for_gradient_release`. Incompatible with closure. Default: False.
```

--------------------------------

### PyTorch AdamW & Optimi Adam with Fully Decoupled Weight Decay Algorithm

Source: https://optimi.benjaminwarner.dev/fully_decoupled_weight_decay?q=

This algorithm outlines the steps for PyTorch's AdamW and Optimi's Adam when using fully decoupled weight decay. It details the inputs, initialization, and the iterative update process for parameters.

```latex
\begin{aligned}
    &\rule{105mm}{0.4pt}\\
    &\hspace{2mm} \textcolor{#009ddb}{\text{PyTorch’s AdamW}} \: \text{\&} \: \textcolor{#9a3fe4}{\text{Adam with fully decoupled weight decay}}\\
    &\hspace{5mm} \text{inputs} : \bm{\theta}_0 \: \text{(params)}; \: f(\bm{\theta}) \text{(objective)}; \: \gamma_t \:\text{(learning rate at } t \text{)}; \\
    &\hspace{17.25mm} \beta_1, \beta_2 \: \text{(betas)}; \: \lambda \: \text{(weight decay)}; \: \epsilon \text{ (epsilon)};\\
    &\hspace{17.25mm} \gamma_\text{max} \: \text{(maximum learning rate)}\\
    &\hspace{5mm} \text{initialize} : \bm{m}_{0} \leftarrow \bm{0}; \: \bm{v}_{0} \leftarrow \bm{0}\\[-0.5em]
    &\rule{105mm}{0.4pt}\\
    &\hspace{5mm} \textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do}:\\
    &\hspace{10mm} \bm{g}_t \leftarrow \nabla_{\theta} f_t(\bm{\theta}_{t-1})\\[0.5em]
    &\hspace{10mm} \bm{m}_t \leftarrow \beta_1 \bm{m}_{t-1} + (1 - \beta_1) \bm{g}_t\\
    &\hspace{10mm} \bm{v}_t \leftarrow \beta_2 \bm{v}_{t-1} + (1 - \beta_2) \bm{g}^2_t\\[0.5em]
    &\hspace{10mm} \hat{\bm{m}}_t \leftarrow \bm{m}_t/(1 - \beta_1^t)\\
    &\hspace{10mm} \hat{\bm{v}}_t \leftarrow \bm{v}_t/(1 - \beta_2^t)\\[0.5em]
    &\hspace{10mm} \bm{\theta}_t \leftarrow \bm{\theta}_{t-1} - \gamma_t \bigl( \hat{\bm{m}}_t / (\sqrt{\hat{\bm{v}}_t} + \epsilon) \textcolor{#009ddb}{+ \lambda\bm{\theta}_{t-1}} \bigr)\textcolor{#9a3fe4}{- (\gamma_t/\gamma_\text{max})\lambda\bm{\theta}_{t-1}}\\[-0.5em]
    &\rule{105mm}{0.4pt}\\
\end{aligned}
```

--------------------------------

### Gradient Release with AdamW

Source: https://optimi.benjaminwarner.dev/

Implement gradient release by initializing the optimizer with `gradient_release=True` and calling `prepare_for_gradient_release`. This performs the optimizer step during the backward pass, freeing gradient memory immediately. Standard `opt.step()` and `opt.zero_grad()` calls become no-ops.

```python
# initialize any optimi optimizer with `gradient_release=True`
# and call `prepare_for_gradient_release` on model and optimizer
opt = AdamW(model.parameters(), lr=1e-3, gradient_release=True)
prepare_for_gradient_release(model, opt)

# setup a learning rate scheduler like normal
scheduler = CosineAnnealingLR(opt, ...)

# calling backward on the model will perform the optimizer step
loss = model(torch.randn(20, dtype=torch.bfloat16))
loss.backward()

# optimizer step and zero_grad are no longer needed, and will
# harmlessly no-op if called by an existing training framework
# opt.step()
# opt.zero_grad()

# step the learning rate scheduler like normal
scheduler.step()

# optionally remove gradient release hooks when done training
remove_gradient_release(model)

```

--------------------------------

### Adan Optimizer

Source: https://optimi.benjaminwarner.dev/optimizers/adan?q=

Initializes the Adan optimizer, an adaptive Nesterov momentum algorithm.

```APIDOC
## Adan Optimizer

### Description
Initializes the Adan optimizer, an adaptive Nesterov momentum algorithm that estimates both first- and second-order gradient movements for faster deep model optimization.

### Parameters

- **params** (`Iterable[Tensor]` or `Iterable[dict]`) - Required - Iterable of parameters to optimize or dicts defining parameter groups.
- **lr** (`float`) - Required - Learning rate.
- **betas** (`tuple[float, float, float]`) - Optional - Coefficients for gradient, gradient difference, and squared gradient moving averages. Defaults to `(0.98, 0.92, 0.99)`.
- **weight_decay** (`float`) - Optional - Weight decay coefficient. If `decouple_lr` is False, applies decoupled weight decay. Defaults to `0.02`.
- **eps** (`float`) - Optional - Added to denominator to improve numerical stability. Defaults to `1e-06`.
- **decouple_lr** (`bool`) - Optional - Apply fully decoupled weight decay instead of decoupled weight decay. Defaults to `False`.
- **max_lr** (`float | None`) - Optional - Maximum scheduled learning rate. Set if `lr` is not the maximum scheduled learning rate and `decouple_lr` is True. Defaults to `None`.
- **adam_wd** (`bool`) - Optional - Apply weight decay before parameter update (Adam-style), instead of after the update per Adan algorithm. Defaults to `False`.
- **kahan_sum** (`bool | None`) - Optional - Enables Kahan summation for more accurate parameter updates when training in low precision. If unspecified, automatically applies for low precision parameters. Defaults to `False`.
- **foreach** (`bool | None`) - Optional - Enables the foreach implementation. If unspecified, tries to use foreach over for-loop implementation since it can be significantly faster. Defaults to `None`.
- **triton** (`bool | None`) - Optional - Enables Triton implementation. If unspecified, tries to use Triton as it is significantly faster than both for-loop and foreach implementations. Defaults to `None`.
- **gradient_release** (`bool`) - Optional - Fuses optimizer step and zero_grad as part of the parameter's backward pass. Requires model hooks created with `prepare_for_gradient_release`. Incompatible with closure. Defaults to `False`.
```

--------------------------------

### Lion Optimizer Algorithm

Source: https://optimi.benjaminwarner.dev/optimizers/lion

The mathematical formulation of the Lion optimizer, detailing the steps from input parameters to parameter updates.

```latex
\begin{aligned}
    &\rule{100mm}{0.4pt}\\
    &\hspace{2mm} \textbf{Lion} \\
    &\hspace{5mm} \text{inputs} : \bm{\theta}_0 \: \text{(params)}; \: f(\bm{\theta}) \text{(objective)}; \: \gamma_t \: \text{(learning rate at } t \text{)}; \\
    &\hspace{17.25mm} \beta_1, \beta_2 \: \text{(betas)}; \: \lambda \: \text{(weight decay)}\\
    &\hspace{5mm} \text{initialize} : \bm{m}_{0} \leftarrow \bm{0}\\[-0.5em]
    &\rule{100mm}{0.4pt}\\
    &\hspace{5mm} \textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do}:\\
    &\hspace{10mm} \bm{g}_t \leftarrow \nabla_{\theta} f_t(\bm{\theta}_{t-1})\\[0.5em]
    &\hspace{10mm} \bm{u} \leftarrow \beta_1 \bm{m}_{t-1} + (1 - \beta_1) \bm{g}_t\\
    &\hspace{10mm} \bm{m}_t \leftarrow \beta_2 \bm{m}_{t-1} + (1 - \beta_2) \bm{g}_t\\[0.5em]
    &\hspace{10mm} \bm{\theta}_t \leftarrow \bm{\theta}_{t-1} - \gamma_t \bigl(\text{sign}(\bm{u}) + \lambda\bm{\theta}_{t-1} \bigr)\\[-0.5em]
    &\rule{100mm}{0.4pt}\\
\end{aligned}
```
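A minimal pure-Python sketch of one Lion step, following the formulation above on lists of floats. The function names are hypothetical and this is not optimi's implementation; it only shows that the update direction is the sign of `u`, so every parameter with a nonzero `u` moves by exactly `lr` before weight decay.

```python
def sign(x: float) -> float:
    """Return -1.0, 0.0, or 1.0 depending on the sign of x."""
    return float((x > 0) - (x < 0))


def lion_step(theta, grad, m, lr, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion-style update on lists of floats; returns (theta, m)."""
    # update direction interpolates momentum and gradient with beta1
    u = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    # the stored momentum is refreshed separately with beta2
    m = [beta2 * mi + (1 - beta2) * g for mi, g in zip(m, grad)]
    # signed step plus decoupled weight decay
    theta = [p - lr * (sign(ui) + weight_decay * p) for p, ui in zip(theta, u)]
    return theta, m


theta, m = lion_step(
    theta=[1.0, -1.0, 0.5], grad=[0.3, -5.0, 0.0], m=[0.0, 0.0, 0.0], lr=0.1
)
print(theta, m)
```

The gradients 0.3 and -5.0 differ in magnitude by more than an order of magnitude, yet both parameters move by the same distance; only the momentum buffer retains the magnitude information.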

--------------------------------

### SGD with Momentum and Dampening Algorithm

Source: https://optimi.benjaminwarner.dev/optimizers/sgd?q=

This snippet outlines the mathematical steps for the SGD optimizer with momentum and dampening. It specifies the inputs, initialization, and the iterative update process for gradients and parameters.

```latex
\begin{aligned}
    &\rule{100mm}{0.4pt}\\
    &\hspace{2mm} \textcolor{#dc3918}{\textbf{SGD}} \: \textcolor{#009ddb}{\text{with momentum}} \: \textcolor{#9a3fe4}{\text{and dampening}}\\
    &\hspace{5mm} \text{inputs} : \bm{\theta}_0 \: \text{(params)}; \: f(\bm{\theta}) \text{(objective)}; \: \gamma_t \:\text{(learning rate at } t \text{)}; \\
    &\hspace{17.25mm} \beta \: \text{(momentum)}; \: \lambda \: \text{(weight decay)}\\
    &\hspace{5mm} \text{initialize} : \textcolor{#009ddb}{\bm{m}_{0} \leftarrow \bm{0}}\\[-0.5em]
    &\rule{100mm}{0.4pt}\\
    &\hspace{5mm} \textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do}:\\
    &\hspace{10mm} \bm{g}_t \leftarrow \nabla_{\theta} f_t(\bm{\theta}_{t-1}) - \lambda\bm{\theta}_{t-1}\\
    &\hspace{10mm} \textcolor{#009ddb}{\bm{m}_t \leftarrow \beta \bm{m}_{t-1} +} \textcolor{#9a3fe4}{(1 - \beta)} \textcolor{#009ddb}{\bm{g}_t}\\
    &\hspace{10mm} \bm{\theta}_t \leftarrow \bm{\theta}_{t-1} \textcolor{#dc3918}{- \gamma_t\bm{g}_t} \textcolor{#009ddb}{- \gamma_t\bm{m}_t}\\[-0.5em]
    &\rule{100mm}{0.4pt}\\
\end{aligned}
```
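The effect of the dampening factor `(1 - β)` on the momentum buffer can be checked with a small pure-Python sketch. It covers only the momentum path of the formulation above and omits weight decay; the function name and the boolean `dampening` argument are assumptions made for this illustration.

```python
def sgd_momentum_step(theta, grad, m, lr, momentum=0.9, dampening=False):
    """One SGD-with-momentum update on lists of floats; returns (theta, m)."""
    # with dampening the incoming gradient is scaled by (1 - momentum)
    scale = (1 - momentum) if dampening else 1.0
    m = [momentum * mi + scale * g for mi, g in zip(m, grad)]
    # parameters move along the momentum buffer
    theta = [p - lr * mi for p, mi in zip(theta, m)]
    return theta, m


# feed a constant gradient of 1.0 and watch where the buffer settles
theta_damp, m_damp = [0.0], [0.0]
theta_plain, m_plain = [0.0], [0.0]
for _ in range(200):
    theta_damp, m_damp = sgd_momentum_step(
        theta_damp, [1.0], m_damp, lr=0.01, dampening=True
    )
    theta_plain, m_plain = sgd_momentum_step(theta_plain, [1.0], m_plain, lr=0.01)

print(m_damp, m_plain)
```

Under a constant gradient the dampened buffer settles near the gradient itself, while the undampened buffer grows toward `1 / (1 - momentum)` times the gradient, so the two variants need different learning rates for comparable step sizes.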

--------------------------------

### StableAdamW Optimizer Initialization

Source: https://optimi.benjaminwarner.dev/optimizers/stableadamw?q=

Initializes the StableAdamW optimizer, which is a drop-in replacement for AdamW. It optimizes a given set of parameters with specified learning rates and other hyperparameters. Note that gradient clipping is not required.

```APIDOC
## StableAdamW

StableAdamW optimizer. An AdamW-Adafactor hybrid with learning rate update clipping.

### Parameters

- **params** (Iterable[Tensor] | Iterable[dict]) - Required - Iterable of parameters to optimize or dicts defining parameter groups
- **lr** (float) - Required - Learning rate
- **betas** (tuple[float, float]) - Optional - Coefficients for gradient and squared gradient moving averages (default: (0.9, 0.99))
- **weight_decay** (float) - Optional - Weight decay coefficient. If `decouple_lr` is False, applies decoupled weight decay (default: 1e-2)
- **eps** (float) - Optional - Added to denominator to improve numerical stability (default: 1e-6)
- **decouple_lr** (bool) - Optional - Apply fully decoupled weight decay instead of decoupled weight decay (default: False)
- **max_lr** (float | None) - Optional - Maximum scheduled learning rate. Set if `lr` is not the maximum scheduled learning rate and `decouple_lr` is True (default: None)
- **kahan_sum** (bool | None) - Optional - Enables Kahan summation for more accurate parameter updates when training in low precision (float16 or bfloat16). If unspecified, automatically applies for low precision parameters (default: None)
- **foreach** (bool | None) - Optional - Enables the foreach implementation. If unspecified, tries to use foreach over for-loop implementation since it can be significantly faster (default: None)
- **triton** (bool | None) - Optional - Enables Triton implementation. If unspecified, tries to use Triton as it is significantly faster than both for-loop and foreach implementations (default: None)
- **gradient_release** (bool) - Optional - Fuses optimizer step and zero_grad as part of the parameter's backward pass. Requires model hooks created with `prepare_for_gradient_release`. Incompatible with closure (default: False)
```

--------------------------------

### Optimizer Accumulation with Gradient Release

Source: https://optimi.benjaminwarner.dev/

Approximate gradient accumulation using gradient release by accumulating gradients into optimizer states. Initialize the optimizer with `gradient_release=True` and call `prepare_for_gradient_release`. Model parameters are updated every N steps after gradients are accumulated.

```python
from torch.optim.lr_scheduler import CosineAnnealingLR
from optimi import AdamW, prepare_for_gradient_release

# `model` is assumed to be an existing `torch.nn.Module`

# initialize any optimi optimizer with `gradient_release=True`
# and call `prepare_for_gradient_release` on model and optimizer
opt = AdamW(model.parameters(), lr=1e-3, gradient_release=True)
prepare_for_gradient_release(model, opt)

# update model parameters every four steps after accumulating
# gradients directly into the optimizer states
accumulation_steps = 4

# setup a learning rate scheduler for gradient accumulation
scheduler = CosineAnnealingLR(opt, ...)
```
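
As a rough illustration of "updated every N steps", the sketch below evaluates the flag expression `(idx+1) % accumulation_steps != 0` that this document assigns to `opt.optimizer_accumulation` in its dataloader loop. It uses only plain Python and no optimi calls; `num_batches`, `schedule`, and `update_steps` are names made up for this example.

```python
accumulation_steps = 4
num_batches = 8  # hypothetical number of batches, for illustration only

# True  -> accumulate gradients into the optimizer states
# False -> perform the parameter update on this batch
schedule = [(idx + 1) % accumulation_steps != 0 for idx in range(num_batches)]

# zero-based batch indices on which parameters are actually updated
update_steps = [idx for idx, accumulate in enumerate(schedule) if not accumulate]
print(update_steps)  # [3, 7]
```

With `accumulation_steps = 4`, three batches accumulate and the fourth triggers the update, so the parameters change on every fourth batch.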

--------------------------------

### Lion Optimizer Pseudocode

Source: https://optimi.benjaminwarner.dev/optimizers/lion?q=

This pseudocode outlines the core steps of the Lion optimizer algorithm, including gradient calculation, momentum updates, and parameter updates with weight decay.

```pseudocode
Lion
inputs: θ_0 (params); f(θ) (objective); γ_t (learning rate at t);
        β_1, β_2 (betas); λ (weight decay)
initialize: m_0 ← 0
for t = 1 to … do:
    g_t ← ∇_θ f_t(θ_{t−1})
    u ← β_1 m_{t−1} + (1 − β_1) g_t
    m_t ← β_2 m_{t−1} + (1 − β_2) g_t
    θ_t ← θ_{t−1} − γ_t (sign(u) + λ θ_{t−1})
```
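
The pseudocode above can be sketched in plain Python to make the order of operations concrete. This is a minimal illustration on lists of floats, not optimi's implementation: the function name `lion_step`, the helper `sign`, and the default values for `beta1`, `beta2`, and `weight_decay` are assumptions made for this example.

```python
def sign(x: float) -> float:
    """Return -1.0, 0.0, or 1.0 depending on the sign of x."""
    return float((x > 0) - (x < 0))


def lion_step(theta, m, grad, lr, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion update on lists of floats (illustrative sketch only).

    theta: parameters, m: momentum state, grad: gradients at theta.
    Returns the updated (theta, m).
    """
    new_theta, new_m = [], []
    for p, m_prev, g in zip(theta, m, grad):
        # interpolate momentum and gradient with beta1 to get the update direction
        u = beta1 * m_prev + (1 - beta1) * g
        # the stored momentum uses beta2 and the previous momentum, not u
        new_m.append(beta2 * m_prev + (1 - beta2) * g)
        # step by the sign of u plus the weight decay term
        new_theta.append(p - lr * (sign(u) + weight_decay * p))
    return new_theta, new_m


theta, m = lion_step([1.0, -2.0], [0.0, 0.0], grad=[0.5, -0.25], lr=0.1)
```

Because only the sign of `u` enters the update, every parameter moves by exactly `lr` per step when weight decay is zero, regardless of gradient magnitude.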

--------------------------------

### StableAdamW Algorithm

Source: https://optimi.benjaminwarner.dev/optimizers/stableadamw

The mathematical formulation of the StableAdamW algorithm, including initialization, gradient calculation, moment updates, RMS scaling, and parameter updates with decoupled weight decay.

```mathematica
\begin{aligned}
 &\rule{100mm}{0.4pt}\\
 &\hspace{2mm} \textbf{\textcolor{#9a3fe4}{Stable}AdamW} \\
 &\hspace{5mm} \text{inputs} : \bm{\theta}_0 \: \text{(params)}; \: f(\bm{\theta}) \text{(objective)}; \: \gamma_t \:\text{(learning rate at } t \text{)}; \\
 &\hspace{17.25mm} \beta_1, \beta_2 \: \text{(betas)}; \: \lambda \: \text{(weight decay)}; \: \epsilon \: \text{(epsilon)}\\
 &\hspace{5mm} \text{initialize} : \bm{m}_{0} \leftarrow \bm{0}; \: \bm{v}_{0} \leftarrow \bm{0}\\
 &\rule{100mm}{0.4pt}\\
 &\hspace{5mm} \textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do}\text{:}\\
 &\hspace{10mm} \bm{g}_t \leftarrow \nabla_{\theta} f_t(\bm{\theta}_{t-1})\\[0.5em]
 &\hspace{10mm} \bm{m}_t \leftarrow \beta_1 \bm{m}_{t-1} + (1 - \beta_1) \bm{g}_t\\
 &\hspace{10mm} \bm{v}_t \leftarrow \beta_2 \bm{v}_{t-1} + (1 - \beta_2) \bm{g}^2_t\\[0.5em]
 &\hspace{10mm} \hat{\bm{m}}_t \leftarrow \bm{m}_t/(1 - \beta_1^t)\\
 &\hspace{10mm} \hat{\bm{v}}_t \leftarrow \bm{v}_t/(1 - \beta_2^t)\\[0.5em]
 &\hspace{10mm} \textcolor{#9a3fe4}{\textbf{RMS}_t \leftarrow \sqrt{\mathbb{E}[\bm{g}^2_t/\text{max}(\bm{v}_t, \epsilon^2)]}}\\
 &\hspace{10mm} \textcolor{#9a3fe4}{\bm{\eta}_t \leftarrow \gamma_t/\text{max}(1,\textbf{RMS}_t)}\\[0.5em]
 &\hspace{10mm} \bm{\theta}_t \leftarrow \bm{\theta}_{t-1} - \textcolor{#9a3fe4}{\bm{\eta}_t} \bigl( \hat{\bm{m}}_t / (\sqrt{\hat{\bm{v}}_t} + \epsilon) + \lambda\bm{\theta}_{t-1} \bigr)\\
 &\rule{100mm}{0.4pt}\\
\end{aligned}
```
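
To make the steps above concrete, here is a minimal plain-Python sketch of a single StableAdamW update for one parameter tensor, represented as a list of floats. It is not optimi's implementation. The function name `stable_adamw_step` is hypothetical, the default hyperparameters are copied from the parameter list in this document, and the expectation in the RMS term is read as a mean over the tensor's elements, which is an assumption.

```python
import math


def stable_adamw_step(theta, m, v, grad, step, lr,
                      beta1=0.9, beta2=0.99, eps=1e-6, weight_decay=1e-2):
    """One StableAdamW update on lists of floats (illustrative sketch only).

    `step` is the 1-based iteration count t. Returns updated (theta, m, v).
    """
    # first and second moment moving averages
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]

    # bias correction
    m_hat = [mi / (1 - beta1 ** step) for mi in m]
    v_hat = [vi / (1 - beta2 ** step) for vi in v]

    # RMS of g^2 / max(v, eps^2), with E[...] taken as the mean over elements (assumption)
    rms = math.sqrt(sum(g * g / max(vi, eps ** 2) for g, vi in zip(grad, v)) / len(grad))

    # clip the learning rate for this tensor: never larger than lr
    eta = lr / max(1.0, rms)

    # parameter update with decoupled weight decay
    theta = [p - eta * (mh / (math.sqrt(vh) + eps) + weight_decay * p)
             for p, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v


theta, m, v = stable_adamw_step(
    theta=[1.0, 1.0], m=[0.0, 0.0], v=[0.0, 0.0],
    grad=[1.0, -2.0], step=1, lr=0.1, weight_decay=0.0,
)
```

In this first step the uncorrected `v` is small relative to the squared gradient, so the RMS term is about 10 and the effective learning rate drops from 0.1 to roughly 0.01.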

--------------------------------

### Standard SGD Parameter Update

Source: https://optimi.benjaminwarner.dev/kahan_summation

This expression is the standard parameter update step in stochastic gradient descent (SGD). It serves as the baseline for understanding the modifications introduced by Kahan summation.

```pseudocode
θ_t ← θ_{t−1} − γ_t g_t
```
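
The following sketch illustrates why this baseline update can be a problem in low precision. It emulates float16 storage with Python's `struct` module (bfloat16 is not available in the standard library, so float16 stands in here as an assumption) and applies the plain SGD update repeatedly. The helper names `to_float16` and `sgd_step` are made up for this example.

```python
import struct


def to_float16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 float16 value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def sgd_step(theta: float, grad: float, lr: float) -> float:
    """Standard SGD update, with the result stored in float16."""
    return to_float16(theta - lr * grad)


theta = to_float16(1.0)  # parameter stored in low precision
exact = 1.0              # same update tracked in float64 for comparison

for _ in range(100):
    theta = sgd_step(theta, grad=1.0, lr=1e-4)
    exact = exact - 1e-4 * 1.0

print(theta)  # 1.0, every update was rounded away
print(exact)  # approximately 0.99
```

Each update of `1e-4` is smaller than half the gap between 1.0 and the next float16 value below it, so the stored parameter rounds back to 1.0 every step and never moves.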