### Install Optimi Package

Source: https://optimi.benjaminwarner.dev/

Install the optimi library using pip. This command fetches and installs the latest version from PyPI.

```bash
pip install torch-optimi
```

--------------------------------

### Using Kahan Summation with optimi AdamW Optimizer

Source: https://optimi.benjaminwarner.dev/kahan_summation?q=

This example initializes an optimi AdamW optimizer with a model created in bfloat16. Kahan summation is enabled automatically for low-precision layers, so the optimizer step uses it when updating low-precision weights.

```python
import torch
from torch import nn
from optimi import AdamW

# create or cast some model layers in low precision (bfloat16)
model = nn.Linear(20, 1, dtype=torch.bfloat16)

# initialize any optimi optimizer with low precision parameters
# Kahan summation is enabled since some model layers are bfloat16
opt = AdamW(model.parameters(), lr=1e-3)

# forward and backward, casting input to bfloat16 if needed
loss = model(torch.randn(20, dtype=torch.bfloat16))
loss.backward()

# optimizer step automatically uses Kahan summation for low precision layers
opt.step()
opt.zero_grad()
```

--------------------------------

### PyTorch Optimizer Accumulation Example

Source: https://optimi.benjaminwarner.dev/optimizer_accumulation

This snippet shows how to use optimizer accumulation with a PyTorch dataloader. Set `opt.optimizer_accumulation` to `True` to accumulate gradients into the optimizer states, or to `False` to update the model parameters. The optimizer step and zero_grad are handled automatically during the backward pass.

```python
for idx, batch in enumerate(dataloader):
    # `optimizer_accumulation=True` accumulates gradients into
    # optimizer states.
    # set `optimizer_accumulation=False` to update parameters
    # by performing a full gradient release step
    opt.optimizer_accumulation = (idx + 1) % accumulation_steps != 0

    # calling backward on the model will perform the optimizer step,
    # either accumulating gradients or updating model parameters
    loss = model(batch)
    loss.backward()

    # optimizer step and zero_grad are no longer needed, and will
    # harmlessly no-op if called by an existing training framework
    # opt.step()
    # opt.zero_grad()

    # step the learning rate scheduler after accumulating gradients
    if not opt.optimizer_accumulation:
        scheduler.step()

# optionally remove gradient release hooks when done training
remove_gradient_release(model)
```

--------------------------------

### Initialize Optimizer with Gradient Release and Prepare Model

Source: https://optimi.benjaminwarner.dev/optimizer_accumulation

Initialize the model and optimizer with gradient release enabled: pass `gradient_release=True` to the optimizer, then call `prepare_for_gradient_release` with both the model and the optimizer. The example creates the model in bfloat16.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import CosineAnnealingLR
from optimi import AdamW, prepare_for_gradient_release

# create or cast model in low precision (bfloat16)
model = nn.Linear(20, 1, dtype=torch.bfloat16)

# initialize any optimi optimizer with `gradient_release=True`
# and call `prepare_for_gradient_release` on model and optimizer
opt = AdamW(model.parameters(), lr=1e-3, gradient_release=True)
prepare_for_gradient_release(model, opt)

# update model parameters every four steps after accumulating
# gradients directly into the optimizer states
accumulation_steps = 4

# setup a learning rate scheduler for gradient accumulation
scheduler = CosineAnnealingLR(opt, ...)
```

--------------------------------

### Override Float32 Modules in Low Precision Casting

Source: https://optimi.benjaminwarner.dev/utils?q=

Customizes which modules are kept in float32 during low-precision casting.
For example, to keep only `nn.LayerNorm` modules in float32:

```python
to_low_precision(model, dtype=torch.bfloat16, fp32_modules=(nn.LayerNorm,))
```

--------------------------------

### Standard Weight Decay with AdamW (Float32/Mixed Precision)

Source: https://optimi.benjaminwarner.dev/

Use AdamW with standard PyTorch-style weight decay for float32 or mixed-precision training. This is the typical setup when the low-precision features are not in use.

```python
from torch import nn
from optimi import AdamW

# create model
model = nn.Linear(20, 1)

# initialize any optimi optimizer with parameters
opt = AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```

--------------------------------

### Initialize and Use Gradient Release

Source: https://optimi.benjaminwarner.dev/gradient_release

Initialize an optimi optimizer with `gradient_release=True` and call `prepare_for_gradient_release` on the model and optimizer. The optimizer step and zero_grad are then fused into the backward pass.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import CosineAnnealingLR
from optimi import AdamW, prepare_for_gradient_release, remove_gradient_release

# create or cast model in low precision (bfloat16)
model = nn.Linear(20, 1, dtype=torch.bfloat16)

# initialize any optimi optimizer with `gradient_release=True`
# and call `prepare_for_gradient_release` on model and optimizer
opt = AdamW(model.parameters(), lr=1e-3, gradient_release=True)
prepare_for_gradient_release(model, opt)

# setup a learning rate scheduler like normal
scheduler = CosineAnnealingLR(opt, ...)
# calling backward on the model will perform the optimizer step
loss = model(torch.randn(20, dtype=torch.bfloat16))
loss.backward()

# optimizer step and zero_grad are no longer needed, and will
# harmlessly no-op if called by an existing training framework
# opt.step()
# opt.zero_grad()

# step the learning rate scheduler like normal
scheduler.step()

# optionally remove gradient release hooks when done training
remove_gradient_release(model)
```

--------------------------------

### Initialize and Use AdamW with Triton Backend

Source: https://optimi.benjaminwarner.dev/triton?q=

Demonstrates how to initialize an AdamW optimizer, showing both the default Triton usage on supported GPUs and explicit enabling.
This snippet assumes the model is on a supported GPU.

```python
import torch
from torch import nn
from optimi import AdamW

# create model
model = nn.Linear(20, 1, device="cuda")

# models on a supported GPU will default to `triton=True`
opt = AdamW(model.parameters(), lr=1e-3)

# or initialize any optimi optimizer with `triton=True`
opt = AdamW(model.parameters(), lr=1e-3, triton=True)

# forward and backward, creating the input on the same device as the model
loss = model(torch.randn(20, device="cuda"))
loss.backward()

# optimizer step is the Triton implementation
opt.step()
opt.zero_grad()
```

--------------------------------

### Initialize AdamW Optimizer with Foreach

Source: https://optimi.benjaminwarner.dev/foreach

Demonstrates how to initialize the AdamW optimizer with `foreach=True` to enable the foreach implementation. This requires PyTorch 2.1+ and is typically used on CUDA devices.

```python
import torch
from torch import nn
from optimi import AdamW

# create model
model = nn.Linear(20, 1, device="cuda")

# initialize any optimi optimizer with `foreach=True`
opt = AdamW(model.parameters(), lr=1e-3, foreach=True)

# forward and backward, creating the input on the same device as the model
loss = model(torch.randn(20, device="cuda"))
loss.backward()

# optimizer step is the foreach implementation
opt.step()
opt.zero_grad()
```

--------------------------------

### Lion Optimizer

Source: https://optimi.benjaminwarner.dev/optimizers/lion

Initializes the Lion optimizer with the specified parameters.

```APIDOC
## Lion Optimizer

Lion optimizer. Evolved Sign Momentum.

### Parameters
- **params** (`Iterable[Tensor]` | `Iterable[dict]`) - Required - Iterable of parameters to optimize or dicts defining parameter groups
- **lr** (`float`) - Required - Learning rate
- **betas** (`tuple[float, float]`) - Optional - Coefficients for update moving average and gradient moving average (default: (0.9, 0.99))
- **weight_decay** (`float`) - Optional - Weight decay coefficient.
  If `decouple_lr` is False, applies decoupled weight decay (default: 0)
- **decouple_lr** (`bool`) - Optional - Apply fully decoupled weight decay instead of decoupled weight decay (default: False)
- **max_lr** (`float | None`) - Optional - Maximum scheduled learning rate. Set if `lr` is not the maximum scheduled learning rate and `decouple_lr` is True (default: None)
- **kahan_sum** (`bool | None`) - Optional - Enables Kahan summation for more accurate parameter updates when training in low precision (float16 or bfloat16). If unspecified, automatically applies for low precision parameters (default: None)
- **foreach** (`bool | None`) - Optional - Enables the foreach implementation. If unspecified, tries to use foreach over for-loop implementation since it can be significantly faster (default: None)
- **triton** (`bool | None`) - Optional - Enables Triton implementation. If unspecified, tries to use Triton as it is significantly faster than both for-loop and foreach implementations (default: None)
- **gradient_release** (`bool`) - Optional - Fuses optimizer step and zero_grad as part of the parameter's backward pass. Requires model hooks created with `prepare_for_gradient_release`. Incompatible with closure (default: False)
```

--------------------------------

### Prepare Model for Gradient Release

Source: https://optimi.benjaminwarner.dev/utils?q=

Registers post_accumulate_grad_hooks on model parameters for gradient release. The optimizer must be initialized with `gradient_release=True`.

```python
prepare_for_gradient_release(
    model,
    optimizer,
    ignore_existing_hooks=False
)
```

--------------------------------

### StableAdamW Algorithm Formulation

Source: https://optimi.benjaminwarner.dev/optimizers/stableadamw?q=

Mathematical steps for the StableAdamW optimizer: initialization, gradient calculation, moment updates, RMS normalization of the learning rate, and the parameter update with decoupled weight decay.
```latex
\begin{aligned}
    &\rule{100mm}{0.4pt}\\
    &\hspace{2mm} \textbf{\textcolor{#9a3fe4}{Stable}AdamW} \\
    &\hspace{5mm} \text{inputs} : \bm{\theta}_0 \: \text{(params)}; \: f(\bm{\theta}) \text{(objective)}; \: \gamma_t \:\text{(learning rate at } t \text{)}; \\
    &\hspace{17.25mm} \beta_1, \beta_2 \: \text{(betas)}; \: \lambda \: \text{(weight decay)}; \: \epsilon \: \text{(epsilon)}\\
    &\hspace{5mm} \text{initialize} : \bm{m}_{0} \leftarrow \bm{0}; \: \bm{v}_{0} \leftarrow \bm{0}\\[-0.5em]
    &\rule{100mm}{0.4pt}\\
    &\hspace{5mm} \textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do}:\\
    &\hspace{10mm} \bm{g}_t \leftarrow \nabla_{\theta} f_t(\bm{\theta}_{t-1})\\[0.5em]
    &\hspace{10mm} \bm{m}_t \leftarrow \beta_1 \bm{m}_{t-1} + (1 - \beta_1) \bm{g}_t\\
    &\hspace{10mm} \bm{v}_t \leftarrow \beta_2 \bm{v}_{t-1} + (1 - \beta_2) \bm{g}^2_t\\[0.5em]
    &\hspace{10mm} \hat{\bm{m}}_t \leftarrow \bm{m}_t/(1 - \beta_1^t)\\
    &\hspace{10mm} \hat{\bm{v}}_t \leftarrow \bm{v}_t/(1 - \beta_2^t)\\[0.5em]
    &\hspace{10mm} \textcolor{#9a3fe4}{\textbf{RMS}_t \leftarrow \sqrt{\mathbb{E}[\bm{g}^2_t/\text{max}(\bm{v}_t, \epsilon^2)]}}\\
    &\hspace{10mm} \textcolor{#9a3fe4}{\bm{\eta}_t \leftarrow \gamma_t/\text{max}(1,\textbf{RMS}_t)}\\[0.5em]
    &\hspace{10mm} \bm{\theta}_t \leftarrow \bm{\theta}_{t-1} - \textcolor{#9a3fe4}{\bm{\eta}_t} \bigl( \hat{\bm{m}}_t / (\sqrt{\hat{\bm{v}}_t} + \epsilon) + \lambda\bm{\theta}_{t-1} \bigr)\\[-0.5em]
    &\rule{100mm}{0.4pt}\\
\end{aligned}
```

--------------------------------

### Adan Optimizer

Source: https://optimi.benjaminwarner.dev/optimizers/adan

Initializes the Adan optimizer, an adaptive Nesterov momentum algorithm, with the specified parameters and hyperparameters.

```APIDOC
## Adan Optimizer

Adan Optimizer: Adaptive Nesterov Momentum Algorithm.
### Parameters
- **params** (Iterable[Tensor] | Iterable[dict]) - Required - Iterable of parameters to optimize or dicts defining parameter groups
- **lr** (float) - Required - Learning rate
- **betas** (tuple[float, float, float]) - Optional - Coefficients for gradient, gradient difference, and squared gradient moving averages (default: (0.98, 0.92, 0.99))
- **weight_decay** (float) - Optional - Weight decay coefficient. If `decouple_lr` is False, applies decoupled weight decay (default: 0.02)
- **eps** (float) - Optional - Added to denominator to improve numerical stability (default: 1e-6)
- **decouple_lr** (bool) - Optional - Apply fully decoupled weight decay instead of decoupled weight decay (default: False)
- **max_lr** (float | None) - Optional - Maximum scheduled learning rate. Set if `lr` is not the maximum scheduled learning rate and `decouple_lr` is True (default: None)
- **adam_wd** (bool) - Optional - Apply weight decay before parameter update (Adam-style), instead of after the update per Adan algorithm (default: False)
- **kahan_sum** (bool | None) - Optional - Enables Kahan summation for more accurate parameter updates when training in low precision (float16 or bfloat16). If unspecified, automatically applies for low precision parameters (default: False)
- **foreach** (bool | None) - Optional - Enables the foreach implementation. If unspecified, tries to use foreach over for-loop implementation since it can be significantly faster (default: None)
- **triton** (bool | None) - Optional - Enables Triton implementation. If unspecified, tries to use Triton as it is significantly faster than both for-loop and foreach implementations (default: None)
- **gradient_release** (bool) - Optional - Fuses optimizer step and zero_grad as part of the parameter's backward pass. Requires model hooks created with `prepare_for_gradient_release`.
  Incompatible with closure (default: False)
```

--------------------------------

### SGD with Kahan Summation Algorithm

Source: https://optimi.benjaminwarner.dev/kahan_summation?q=

Algorithmic steps for Stochastic Gradient Descent (SGD) with Kahan summation, which reduces rounding error when updating low-precision parameters. The listing gives the inputs, the initialization, and the iterative update of the parameters and the compensation buffer. The blue line is the plain SGD update; the purple lines replace it when Kahan summation is enabled.

```latex
\begin{aligned}
    &\rule{90mm}{0.4pt}\\
    &\hspace{2mm} \textcolor{#009ddb}{\textbf{SGD}} \: \textcolor{#9a3fe4}{\text{with Kahan summation}}\\
    &\hspace{5mm} \text{inputs} : \bm{\theta}_0 \: \text{(params)}; \: f(\bm{\theta}) \text{(objective)};\\
    &\hspace{17.25mm} \gamma_t \:\text{(learning rate at } t \text{)}; \: \lambda \: \text{(weight decay)}\\
    &\hspace{5mm} \text{initialize} : \textcolor{#9a3fe4}{\bm{k}_{0} \leftarrow \bm{0}}\\[-0.5em]
    &\rule{90mm}{0.4pt}\\
    &\hspace{5mm} \textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do}:\\
    &\hspace{10mm} \bm{g}_t \leftarrow \nabla_{\theta} f_t(\bm{\theta}_{t-1}) + \lambda\bm{\theta}_{t-1}\\[0.5em]
    &\hspace{10mm} \textcolor{#009ddb}{\bm{\theta}_t \leftarrow \bm{\theta}_{t-1} - \gamma_t\bm{g}_t}\\[0.3em]
    &\hspace{10mm} \textcolor{#9a3fe4}{\bm{u}_t \leftarrow \bm{k}_{t-1} - \gamma_t\bm{g}_t}\\
    &\hspace{10mm} \textcolor{#9a3fe4}{\bm{\theta}_t \leftarrow \bm{\theta}_{t-1} + \bm{u}_t}\\
    &\hspace{10mm} \textcolor{#9a3fe4}{\bm{k}_t \leftarrow \bm{u}_t + (\bm{\theta}_{t-1} - \bm{\theta}_t)}\\[-0.5em]
    &\rule{90mm}{0.4pt}\\
\end{aligned}
```

--------------------------------

### RAdam Optimizer

Source: https://optimi.benjaminwarner.dev/optimizers/radam

Initializes the RAdam optimizer, with options for decoupled weight decay and for Kahan summation in low-precision training.

```APIDOC
## RAdam

Rectified Adam optimizer. Optionally with decoupled weight decay.
### Parameters
- **params** (Iterable[Tensor] | Iterable[dict]) - Required - Iterable of parameters to optimize or dicts defining parameter groups
- **lr** (float) - Required - Learning rate
- **betas** (tuple[float, float]) - Optional - Coefficients for gradient and squared gradient moving averages (default: (0.9, 0.99))
- **weight_decay** (float) - Optional - Weight decay coefficient. If `decouple_wd` and `decouple_lr` are False, applies L2 penalty (default: 0)
- **eps** (float) - Optional - Added to denominator to improve numerical stability (default: 1e-6)
- **decouple_wd** (bool) - Optional - Apply decoupled weight decay instead of L2 penalty (default: False)
- **decouple_lr** (bool) - Optional - Apply fully decoupled weight decay instead of L2 penalty (default: False)
- **max_lr** (float | None) - Optional - Maximum scheduled learning rate. Set if `lr` is not the maximum scheduled learning rate and `decouple_lr` is True (default: None)
- **kahan_sum** (bool | None) - Optional - Enables Kahan summation for more accurate parameter updates when training in low precision (float16 or bfloat16). If unspecified, automatically applies for low precision parameters (default: None)
- **foreach** (bool | None) - Optional - Enables the foreach implementation. If unspecified, tries to use foreach over for-loop implementation since it can be significantly faster (default: None)
- **triton** (bool | None) - Optional - Enables Triton implementation. If unspecified, tries to use Triton as it is significantly faster than both for-loop and foreach implementations (default: None)
- **gradient_release** (bool) - Optional - Fuses optimizer step and zero_grad as part of the parameter's backward pass. Requires model hooks created with `prepare_for_gradient_release`.
  Incompatible with closure (default: False)
```

--------------------------------

### StableAdamW Optimizer Initialization

Source: https://optimi.benjaminwarner.dev/optimizers/stableadamw

Initializes the StableAdamW optimizer. This optimizer is a drop-in replacement for AdamW and supports its hyperparameters, adding update clipping for improved training stability, so gradient clipping is not required. It also offers fully decoupled weight decay and Kahan summation for low-precision training.

```APIDOC
## StableAdamW

StableAdamW optimizer. An AdamW-Adafactor hybrid with learning rate update clipping.

### Parameters
- **params** (Iterable[Tensor] | Iterable[dict]): Iterable of parameters to optimize or dicts defining parameter groups. Required.
- **lr** (float): Learning rate. Required.
- **betas** (tuple[float, float]): Coefficients for gradient and squared gradient moving averages. Default: (0.9, 0.99).
- **weight_decay** (float): Weight decay coefficient. If `decouple_lr` is False, applies decoupled weight decay. Default: 0.01.
- **eps** (float): Added to denominator to improve numerical stability. Default: 1e-6.
- **decouple_lr** (bool): Apply fully decoupled weight decay instead of decoupled weight decay. Default: False.
- **max_lr** (float | None): Maximum scheduled learning rate. Set if `lr` is not the maximum scheduled learning rate and `decouple_lr` is True. Default: None.
- **kahan_sum** (bool | None): Enables Kahan summation for more accurate parameter updates when training in low precision (float16 or bfloat16). If unspecified, automatically applies for low precision parameters. Default: None.
- **foreach** (bool | None): Enables the foreach implementation. If unspecified, tries to use foreach over for-loop implementation since it can be significantly faster. Default: None.
- **triton** (bool | None): Enables Triton implementation. If unspecified, tries to use Triton as it is significantly faster than both for-loop and foreach implementations. Default: None.
- **gradient_release** (bool): Fuses optimizer step and zero_grad as part of the parameter's backward pass.
  Requires model hooks created with `prepare_for_gradient_release`. Incompatible with closure. Default: False.
```

--------------------------------

### PyTorch AdamW & Optimi Adam with Fully Decoupled Weight Decay Algorithm

Source: https://optimi.benjaminwarner.dev/fully_decoupled_weight_decay?q=

This algorithm compares PyTorch's AdamW with optimi's Adam using fully decoupled weight decay. It lists the inputs, the initialization, and the iterative parameter update. The blue term is PyTorch-style weight decay, which is multiplied by the learning rate; the purple term is fully decoupled weight decay, which is scaled by the ratio of the current learning rate to the maximum learning rate.

```latex
\begin{aligned}
    &\rule{105mm}{0.4pt}\\
    &\hspace{2mm} \textcolor{#009ddb}{\text{PyTorch’s AdamW}} \: \text{\&} \: \textcolor{#9a3fe4}{\text{Adam with fully decoupled weight decay}}\\
    &\hspace{5mm} \text{inputs} : \bm{\theta}_0 \: \text{(params)}; \: f(\bm{\theta}) \text{(objective)}; \: \gamma_t \:\text{(learning rate at } t \text{)}; \\
    &\hspace{17.25mm} \beta_1, \beta_2 \: \text{(betas)}; \: \lambda \: \text{(weight decay)}; \: \epsilon \text{ (epsilon)};\\
    &\hspace{17.25mm} \gamma_\text{max} \: \text{(maximum learning rate)}\\
    &\hspace{5mm} \text{initialize} : \bm{m}_{0} \leftarrow \bm{0}; \: \bm{v}_{0} \leftarrow \bm{0}\\[-0.5em]
    &\rule{105mm}{0.4pt}\\
    &\hspace{5mm} \textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do}\text{:}\\
    &\hspace{10mm} \bm{g}_t \leftarrow \nabla_{\theta} f_t(\bm{\theta}_{t-1})\\[0.5em]
    &\hspace{10mm} \bm{m}_t \leftarrow \beta_1 \bm{m}_{t-1} + (1 - \beta_1) \bm{g}_t\\
    &\hspace{10mm} \bm{v}_t \leftarrow \beta_2 \bm{v}_{t-1} + (1 - \beta_2) \bm{g}^2_t\\[0.5em]
    &\hspace{10mm} \hat{\bm{m}}_t \leftarrow \bm{m}_t/(1 - \beta_1^t)\\
    &\hspace{10mm} \hat{\bm{v}}_t \leftarrow \bm{v}_t/(1 - \beta_2^t)\\[0.5em]
    &\hspace{10mm} \bm{\theta}_t \leftarrow \bm{\theta}_{t-1} - \gamma_t \bigl( \hat{\bm{m}}_t / (\sqrt{\hat{\bm{v}}_t} + \epsilon) \textcolor{#009ddb}{+ \lambda\bm{\theta}_{t-1}} \bigr)\textcolor{#9a3fe4}{- (\gamma_t/\gamma_\text{max})\lambda\bm{\theta}_{t-1}}\\[-0.5em]
    &\rule{105mm}{0.4pt}\\
\end{aligned}
```
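The two weight decay variants in the update rule above can be compared with a small scalar sketch. This is not optimi's implementation: `adam_step` is a hypothetical helper written only for illustration, and the hyperparameter values are assumptions. It shows that, following the formula, the fully decoupled term `(γ_t/γ_max)·λ·θ` matches the PyTorch-style term `γ_t·λ·θ` when the fully decoupled `λ` is chosen as `γ_max` times the PyTorch-style `λ`.

```python
import math


def adam_step(theta, grad, m, v, t, lr, *, betas=(0.9, 0.99), eps=1e-6,
              weight_decay=1e-2, decouple_lr=False, max_lr=None):
    """One scalar Adam step (hypothetical helper, not part of optimi)."""
    beta1, beta2 = betas
    # moment updates and bias correction, as in the algorithm above
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    update = m_hat / (math.sqrt(v_hat) + eps)
    if decouple_lr:
        # fully decoupled: weight decay is scaled by lr / max_lr
        peak = lr if max_lr is None else max_lr
        theta = theta - lr * update - (lr / peak) * weight_decay * theta
    else:
        # PyTorch-style AdamW: weight decay is multiplied by lr
        theta = theta - lr * (update + weight_decay * theta)
    return theta, m, v


# illustrative values (assumptions): peak lr of 1e-3, current lr at half of peak
max_lr, lr = 1e-3, 5e-4
coupled = adam_step(1.0, 0.5, 0.0, 0.0, 1, lr, weight_decay=1e-2)
decoupled = adam_step(1.0, 0.5, 0.0, 0.0, 1, lr, weight_decay=1e-2 * max_lr,
                      decouple_lr=True, max_lr=max_lr)
print(coupled[0], decoupled[0])
```

Under these assumptions the two printed parameters agree up to floating-point rounding, which is why a fully decoupled weight decay value is typically much smaller than a PyTorch-style one.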
--------------------------------

### Gradient Release with AdamW

Source: https://optimi.benjaminwarner.dev/

Implement gradient release by initializing the optimizer with `gradient_release=True` and calling `prepare_for_gradient_release`. The optimizer step then runs during the backward pass, freeing gradient memory immediately. Standard `opt.step()` and `opt.zero_grad()` calls become no-ops.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import CosineAnnealingLR
from optimi import AdamW, prepare_for_gradient_release, remove_gradient_release

# create or cast model in low precision (bfloat16)
model = nn.Linear(20, 1, dtype=torch.bfloat16)

# initialize any optimi optimizer with `gradient_release=True`
# and call `prepare_for_gradient_release` on model and optimizer
opt = AdamW(model.parameters(), lr=1e-3, gradient_release=True)
prepare_for_gradient_release(model, opt)

# setup a learning rate scheduler like normal
scheduler = CosineAnnealingLR(opt, ...)

# calling backward on the model will perform the optimizer step
loss = model(torch.randn(20, dtype=torch.bfloat16))
loss.backward()

# optimizer step and zero_grad are no longer needed, and will
# harmlessly no-op if called by an existing training framework
# opt.step()
# opt.zero_grad()

# step the learning rate scheduler like normal
scheduler.step()

# optionally remove gradient release hooks when done training
remove_gradient_release(model)
```

--------------------------------

### Lion Optimizer Algorithm

Source: https://optimi.benjaminwarner.dev/optimizers/lion

The mathematical formulation of the Lion optimizer, from its inputs to the parameter update.
```latex
\begin{aligned}
    &\rule{100mm}{0.4pt}\\
    &\hspace{2mm} \textbf{Lion} \\
    &\hspace{5mm} \text{inputs} : \bm{\theta}_0 \: \text{(params)}; \: f(\bm{\theta}) \text{(objective)}; \: \gamma_t \: \text{(learning rate at } t \text{)}; \\
    &\hspace{17.25mm} \beta_1, \beta_2 \: \text{(betas)}; \: \lambda \: \text{(weight decay)}\\
    &\hspace{5mm} \text{initialize} : \bm{m}_{0} \leftarrow \bm{0}\\[-0.5em]
    &\rule{100mm}{0.4pt}\\
    &\hspace{5mm} \textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do}:\\
    &\hspace{10mm} \bm{g}_t \leftarrow \nabla_{\theta} f_t(\bm{\theta}_{t-1})\\[0.5em]
    &\hspace{10mm} \bm{u} \leftarrow \beta_1 \bm{m}_{t-1} + (1 - \beta_1) \bm{g}_t\\
    &\hspace{10mm} \bm{m}_t \leftarrow \beta_2 \bm{m}_{t-1} + (1 - \beta_2) \bm{g}_t\\[0.5em]
    &\hspace{10mm} \bm{\theta}_t \leftarrow \bm{\theta}_{t-1} - \gamma_t \bigl(\text{sign}(\bm{u}) + \lambda\bm{\theta}_{t-1} \bigr)\\[-0.5em]
    &\rule{100mm}{0.4pt}\\
\end{aligned}
```

--------------------------------

### SGD with Momentum and Dampening Algorithm

Source: https://optimi.benjaminwarner.dev/optimizers/sgd?q=

Mathematical steps for the SGD optimizer with momentum and dampening: the inputs, the initialization, and the iterative update of the gradient, momentum buffer, and parameters.
```latex
\begin{aligned}
    &\rule{100mm}{0.4pt}\\
    &\hspace{2mm} \textcolor{#dc3918}{\textbf{SGD}} \: \textcolor{#009ddb}{\text{with momentum}} \: \textcolor{#9a3fe4}{\text{and dampening}}\\
    &\hspace{5mm} \text{inputs} : \bm{\theta}_0 \: \text{(params)}; \: f(\bm{\theta}) \text{(objective)}; \: \gamma_t \:\text{(learning rate at } t \text{)}; \\
    &\hspace{17.25mm} \beta \: \text{(momentum)}; \: \lambda \: \text{(weight decay)}\\
    &\hspace{5mm} \text{initialize} : \textcolor{#009ddb}{\bm{m}_{0} \leftarrow \bm{0}}\\[-0.5em]
    &\rule{100mm}{0.4pt}\\
    &\hspace{5mm} \textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do}\text{:}\\
    &\hspace{10mm} \bm{g}_t \leftarrow \nabla_{\theta} f_t(\bm{\theta}_{t-1}) + \lambda\bm{\theta}_{t-1}\\
    &\hspace{10mm} \textcolor{#009ddb}{\bm{m}_t \leftarrow \beta \bm{m}_{t-1} +} \textcolor{#9a3fe4}{(1 - \beta)} \textcolor{#009ddb}{\bm{g}_t}\\
    &\hspace{10mm} \bm{\theta}_t \leftarrow \bm{\theta}_{t-1} \textcolor{#dc3918}{- \gamma_t\bm{g}_t} \textcolor{#009ddb}{- \gamma_t\bm{m}_t}\\[-0.5em]
    &\rule{100mm}{0.4pt}\\
\end{aligned}
```

--------------------------------

### Optimizer Accumulation with Gradient Release

Source: https://optimi.benjaminwarner.dev/

Approximate gradient accumulation under gradient release by accumulating gradients into the optimizer states. Initialize the optimizer with `gradient_release=True` and call `prepare_for_gradient_release`. Model parameters are then updated once every `accumulation_steps` steps.
```python
import torch
from torch import nn
from torch.optim.lr_scheduler import CosineAnnealingLR
from optimi import AdamW, prepare_for_gradient_release

# create or cast model in low precision (bfloat16)
model = nn.Linear(20, 1, dtype=torch.bfloat16)

# initialize any optimi optimizer with `gradient_release=True`
# and call `prepare_for_gradient_release` on model and optimizer
opt = AdamW(model.parameters(), lr=1e-3, gradient_release=True)
prepare_for_gradient_release(model, opt)

# update model parameters every four steps after accumulating
# gradients directly into the optimizer states
accumulation_steps = 4

# setup a learning rate scheduler for gradient accumulation
scheduler = CosineAnnealingLR(opt, ...)
```

--------------------------------

### Lion Optimizer Pseudocode

Source: https://optimi.benjaminwarner.dev/optimizers/lion?q=

This pseudocode outlines the core steps of the Lion optimizer: the gradient calculation, the interpolated update direction `u`, the momentum update, and the sign-based parameter update with weight decay.

```pseudocode
Lion
inputs: θ_0 (params); f(θ) (objective); γ_t (learning rate at t);
        β_1, β_2 (betas); λ (weight decay)
initialize: m_0 ← 0
for t = 1 to … do:
    g_t ← ∇_θ f_t(θ_{t−1})
    u   ← β_1 m_{t−1} + (1 − β_1) g_t
    m_t ← β_2 m_{t−1} + (1 − β_2) g_t
    θ_t ← θ_{t−1} − γ_t (sign(u) + λ θ_{t−1})
```

--------------------------------

### StableAdamW Algorithm

Source: https://optimi.benjaminwarner.dev/optimizers/stableadamw

The mathematical formulation of the StableAdamW algorithm, including initialization, gradient calculation, moment updates with bias correction, RMS scaling of the learning rate, and parameter updates with decoupled weight decay.
```latex
\begin{aligned}
    &\rule{100mm}{0.4pt}\\
    &\hspace{2mm} \textbf{\textcolor{#9a3fe4}{Stable}AdamW} \\
    &\hspace{5mm} \text{inputs} : \bm{\theta}_0 \: \text{(params)}; \: f(\bm{\theta}) \text{(objective)}; \: \gamma_t \:\text{(learning rate at } t \text{)}; \\
    &\hspace{17.25mm} \beta_1, \beta_2 \: \text{(betas)}; \: \lambda \: \text{(weight decay)}; \: \epsilon \: \text{(epsilon)}\\
    &\hspace{5mm} \text{initialize} : \bm{m}_{0} \leftarrow \bm{0}; \: \bm{v}_{0} \leftarrow \bm{0}\\
    &\rule{100mm}{0.4pt}\\
    &\hspace{5mm} \textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do}\text{:}\\
    &\hspace{10mm} \bm{g}_t \leftarrow \nabla_{\theta} f_t(\bm{\theta}_{t-1})\\[0.5em]
    &\hspace{10mm} \bm{m}_t \leftarrow \beta_1 \bm{m}_{t-1} + (1 - \beta_1) \bm{g}_t\\
    &\hspace{10mm} \bm{v}_t \leftarrow \beta_2 \bm{v}_{t-1} + (1 - \beta_2) \bm{g}^2_t\\[0.5em]
    &\hspace{10mm} \hat{\bm{m}}_t \leftarrow \bm{m}_t/(1 - \beta_1^t)\\
    &\hspace{10mm} \hat{\bm{v}}_t \leftarrow \bm{v}_t/(1 - \beta_2^t)\\[0.5em]
    &\hspace{10mm} \textcolor{#9a3fe4}{\textbf{RMS}_t \leftarrow \sqrt{\mathbb{E}[\bm{g}^2_t/\text{max}(\bm{v}_t, \epsilon^2)]}}\\
    &\hspace{10mm} \textcolor{#9a3fe4}{\bm{\eta}_t \leftarrow \gamma_t/\text{max}(1,\textbf{RMS}_t)}\\[0.5em]
    &\hspace{10mm} \bm{\theta}_t \leftarrow \bm{\theta}_{t-1} - \textcolor{#9a3fe4}{\bm{\eta}_t} \bigl( \hat{\bm{m}}_t / (\sqrt{\hat{\bm{v}}_t} + \epsilon) + \lambda\bm{\theta}_{t-1} \bigr)\\
    &\rule{100mm}{0.4pt}\\
\end{aligned}
```

--------------------------------

### Standard SGD Parameter Update

Source: https://optimi.benjaminwarner.dev/kahan_summation

This code illustrates the standard parameter update step in Stochastic Gradient Descent (SGD). It is used as a baseline to understand the modifications introduced by Kahan summation.

```pseudocode
θ_t ← θ_{t−1} − γ_t g_t
```
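To see why this plain update can stall in low precision, the minimal sketch below compares it with a Kahan-compensated variant. It is an illustration only and not optimi's implementation: it assumes Python's `struct` half-precision format (float16) as a stand-in for low precision tensors, it uses one common formulation of the compensated update, and the helper names `to_half`, `plain_sgd`, and `kahan_sgd` are hypothetical.

```python
import struct


def to_half(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def plain_sgd(theta: float, lr: float, grad: float, steps: int) -> float:
    """Standard update theta_t <- theta_{t-1} - lr * g_t, rounded every step."""
    for _ in range(steps):
        theta = to_half(theta - to_half(lr * grad))
    return theta


def kahan_sgd(theta: float, lr: float, grad: float, steps: int) -> float:
    """Same update with a compensation buffer (assumed formulation)."""
    comp = 0.0  # carries the part of each update that rounding would drop
    for _ in range(steps):
        # add the new update to whatever was lost on earlier steps
        comp = to_half(comp - to_half(lr * grad))
        new_theta = to_half(theta + comp)
        # keep the portion of `comp` that did not make it into new_theta
        comp = to_half(comp + to_half(theta - new_theta))
        theta = new_theta
    return theta


# Each update (about 1e-4) is smaller than half the spacing between
# half-precision values just below 1.0, so the plain update never moves.
print(plain_sgd(1.0, 1e-3, 0.1, 100))  # stays at 1.0
print(kahan_sgd(1.0, 1e-3, 0.1, 100))  # close to the exact result of 0.99
```

The compensation buffer costs one extra value per parameter, which is the trade-off for recovering updates that would otherwise be rounded away.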