# MLForecast

MLForecast is a scalable machine learning framework for time series forecasting. It provides efficient feature engineering to train any scikit-learn compatible model on millions of time series, with out-of-the-box compatibility with pandas, polars, Spark, Dask, and Ray. The library offers some of the fastest feature-engineering implementations for time series in Python, enabling both local and distributed training at scale.

The framework follows a familiar sklearn-style API with `.fit` and `.predict` methods, making it easy to integrate into existing ML workflows. Key features include probabilistic forecasting with conformal prediction, support for exogenous variables and static covariates, automatic lag feature generation, rolling window statistics, and target transformations for handling trends and seasonality.
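Before the component-by-component tour, here is a minimal end-to-end sketch of that workflow on synthetic data (`generate_daily_series` is covered under Data Generation Utilities below); the model and lag choices are illustrative:

```python
import lightgbm as lgb

from mlforecast import MLForecast
from mlforecast.utils import generate_daily_series

# Minimal workflow: build lag features, train one model, forecast 7 steps
series = generate_daily_series(n_series=20, min_length=100, seed=1)
fcst = MLForecast(models=[lgb.LGBMRegressor(verbosity=-1)], freq='D', lags=[1, 7])
fcst.fit(series)  # default column names: unique_id / ds / y
preds = fcst.predict(h=7)
print(preds.head())
```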
## MLForecast Class Initialization

The main `MLForecast` class encapsulates the entire forecasting pipeline: feature engineering, model training, and prediction generation. It accepts any scikit-learn compatible regressor and automatically handles the creation of lag features, date features, and target transformations.

```python
import lightgbm as lgb
from sklearn.linear_model import LinearRegression

from mlforecast import MLForecast
from mlforecast.lag_transforms import RollingMean, ExpandingMean, ExponentiallyWeightedMean
from mlforecast.target_transforms import Differences

# Initialize with multiple models and feature configuration
fcst = MLForecast(
    models=[
        lgb.LGBMRegressor(random_state=0, verbosity=-1),
        LinearRegression(),
    ],
    freq='D',  # Daily frequency (use 'W' for weekly, 'M' for monthly, or an int for integer-indexed series)
    lags=[1, 7, 14, 28],  # Create lag features for these periods
    lag_transforms={
        1: [ExpandingMean()],  # Expanding mean on lag 1
        7: [RollingMean(window_size=7), RollingMean(window_size=28)],  # Rolling means on lag 7
        14: [ExponentiallyWeightedMean(alpha=0.5)],  # EWM on lag 14
    },
    date_features=['dayofweek', 'month', 'year'],  # Extract date components
    target_transforms=[Differences([1])],  # First difference to remove trend
    num_threads=4,  # Parallel feature computation
)
```

## MLForecast.fit Method

The `fit` method computes all features and trains the configured models on the provided time series data. It expects data in long format with columns for the series identifier, the timestamp, and the target value.

```python
import pandas as pd

from mlforecast.utils import generate_daily_series, PredictionIntervals

# Generate sample data in long format
series = generate_daily_series(
    n_series=100,
    min_length=200,
    max_length=500,
    n_static_features=2,
    with_trend=True,
    seed=42,
)

# Fit the models with prediction intervals for uncertainty quantification
fcst.fit(
    df=series,
    id_col='unique_id',  # Column identifying each series
    time_col='ds',  # Timestamp column
    target_col='y',  # Target variable column
    static_features=['static_0', 'static_1'],  # Features constant across time
    dropna=True,  # Drop rows with NaNs produced by lag features
    keep_last_n=100,  # Keep only the last 100 points per series (memory optimization)
    prediction_intervals=PredictionIntervals(n_windows=3, h=7),  # For conformal intervals
    fitted=True,  # Store in-sample predictions
)

# Access fitted values
fitted_df = fcst.forecast_fitted_values()
print(fitted_df.head())
```

## MLForecast.predict Method

The `predict` method generates forecasts for the specified horizon using the trained models. It automatically handles recursive prediction, updating lag features at each step.

```python
# Generate predictions for the next 14 days
predictions = fcst.predict(h=14)
print(predictions.head())
# Output columns: unique_id, ds, LGBMRegressor, LinearRegression

# Predict with prediction intervals
predictions_with_intervals = fcst.predict(h=14, level=[80, 95])
print(predictions_with_intervals.columns.tolist())
# Output: ['unique_id', 'ds', 'LGBMRegressor', 'LinearRegression',
#          'LGBMRegressor-lo-95', 'LGBMRegressor-lo-80',
#          'LGBMRegressor-hi-80', 'LGBMRegressor-hi-95', ...]

# Predict for specific series only
subset_predictions = fcst.predict(h=14, ids=['id_00', 'id_01', 'id_02'])

# Predict with future exogenous features
# (assumes the model was trained with a dynamic 'price' column)
future_exog = pd.DataFrame({
    'unique_id': ['id_00'] * 14 + ['id_01'] * 14,
    'ds': pd.date_range('2000-04-10', periods=14).tolist() * 2,
    'price': [1.5] * 28,  # Future values of the exogenous variable
})
# Restrict the prediction to the series covered by X_df
predictions_with_exog = fcst.predict(h=14, X_df=future_exog, ids=['id_00', 'id_01'])
```

## MLForecast.cross_validation Method

Cross-validation evaluates model performance using time series splits, training on historical data and testing on the following periods across multiple windows.

```python
# Perform time series cross-validation
cv_results = fcst.cross_validation(
    df=series,
    n_windows=3,  # Number of evaluation windows
    h=7,  # Forecast horizon per window
    step_size=7,  # Step between windows (defaults to h)
    id_col='unique_id',
    time_col='ds',
    target_col='y',
    refit=True,  # Retrain the models for each window
    prediction_intervals=PredictionIntervals(n_windows=2, h=7),  # Required when requesting levels
    level=[90],  # Prediction interval levels
    fitted=True,  # Store fitted values for each fold
)
print(cv_results.columns.tolist())
# Output: ['unique_id', 'ds', 'cutoff', 'y', 'LGBMRegressor', 'LinearRegression', ...]

# Calculate metrics per model
from sklearn.metrics import mean_absolute_error

for model in ['LGBMRegressor', 'LinearRegression']:
    mae = mean_absolute_error(cv_results['y'], cv_results[model])
    print(f'{model} MAE: {mae:.2f}')

# Access cross-validation fitted values
cv_fitted = fcst.cross_validation_fitted_values()
```

## Rolling Lag Transforms

Rolling transforms compute statistics over a sliding window of past values. These are essential for capturing local patterns and trends in time series data.

```python
from mlforecast.lag_transforms import (
    RollingMean, RollingStd, RollingMin, RollingMax, RollingQuantile,
    SeasonalRollingMean, SeasonalRollingStd,
)

fcst = MLForecast(
    models=[lgb.LGBMRegressor(verbosity=-1)],
    freq='D',
    lags=[1, 7],
    lag_transforms={
        # Rolling statistics on lag 1
        1: [
            RollingMean(window_size=7),  # 7-day moving average
            RollingStd(window_size=7),  # 7-day rolling std
            RollingMin(window_size=14),  # 14-day rolling min
            RollingMax(window_size=14),  # 14-day rolling max
            RollingQuantile(p=0.5, window_size=7),  # 7-day rolling median
        ],
        # Seasonal rolling (e.g., same-day-of-week patterns)
        7: [
            SeasonalRollingMean(season_length=7, window_size=4),  # Mean of the last 4 same weekdays
            SeasonalRollingStd(season_length=7, window_size=4),
        ],
    },
)

# Rolling transforms with a minimum-samples requirement
fcst_with_min = MLForecast(
    models=[lgb.LGBMRegressor(verbosity=-1)],
    freq='D',
    lags=[1],
    lag_transforms={
        1: [RollingMean(window_size=7, min_samples=3)],  # Emit a value once 3+ samples are available
    },
)
```
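Lag transforms can also be composed: recent mlforecast versions provide a `Combine` lag transform that merges the outputs of two transforms with a binary operator. A short sketch (assuming `Combine` is available in your installed version), building the ratio of a short to a long rolling mean:

```python
import operator

import lightgbm as lgb

from mlforecast import MLForecast
from mlforecast.lag_transforms import Combine, RollingMean

# Ratio of the 7-day to the 28-day rolling mean on lag 1, as a single feature
fcst_combined = MLForecast(
    models=[lgb.LGBMRegressor(verbosity=-1)],
    freq='D',
    lags=[1],
    lag_transforms={
        1: [Combine(RollingMean(window_size=7), RollingMean(window_size=28), operator.truediv)],
    },
)
```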
## Expanding and Exponential Transforms

Expanding transforms use all available history, while exponentially weighted transforms give more weight to recent observations.

```python
from mlforecast.lag_transforms import (
    ExpandingMean, ExpandingStd, ExpandingMin, ExpandingMax,
    ExpandingQuantile, ExponentiallyWeightedMean, RollingMean,
)

fcst = MLForecast(
    models=[lgb.LGBMRegressor(verbosity=-1)],
    freq='D',
    lags=[1, 7],
    lag_transforms={
        1: [
            ExpandingMean(),  # Mean of all history
            ExpandingStd(),  # Std of all history
            ExpandingMin(),  # Min of all history
            ExpandingMax(),  # Max of all history
            ExpandingQuantile(p=0.9),  # 90th percentile of all history
            ExponentiallyWeightedMean(alpha=0.3),  # EWM with alpha=0.3
            ExponentiallyWeightedMean(alpha=0.9),  # EWM with a higher weight on recent values
        ],
    },
)

# Global transforms (aggregate across all series by timestamp)
fcst_global = MLForecast(
    models=[lgb.LGBMRegressor(verbosity=-1)],
    freq='D',
    lags=[1],
    lag_transforms={
        1: [
            ExpandingMean(global_=True),  # Global expanding mean across all series
            RollingMean(window_size=7, global_=True),  # Global rolling mean
        ],
    },
)
```

## Target Transforms

Target transforms modify the target variable before feature computation and are automatically inverted during prediction. They handle trends, seasonality, and scale normalization.

```python
from sklearn.preprocessing import PowerTransformer

from mlforecast.target_transforms import (
    Differences, AutoDifferences, AutoSeasonalDifferences,
    LocalStandardScaler, LocalMinMaxScaler, LocalRobustScaler,
    LocalBoxCox, GlobalSklearnTransformer,
)

# Simple differencing to remove trend
fcst_diff = MLForecast(
    models=[lgb.LGBMRegressor(verbosity=-1)],
    freq='D',
    lags=[1, 7],
    target_transforms=[Differences([1])],  # First difference
)

# Seasonal differencing for weekly patterns
fcst_seasonal = MLForecast(
    models=[lgb.LGBMRegressor(verbosity=-1)],
    freq='D',
    lags=[1, 7],
    target_transforms=[Differences([1, 7])],  # First + seasonal (weekly) difference
)

# Automatic differencing (finds the optimal number of differences per series)
fcst_auto = MLForecast(
    models=[lgb.LGBMRegressor(verbosity=-1)],
    freq='D',
    lags=[1, 7],
    target_transforms=[AutoDifferences(max_diffs=2)],
)

# Combine scaling with differencing
fcst_scaled = MLForecast(
    models=[lgb.LGBMRegressor(verbosity=-1)],
    freq='D',
    lags=[1, 7],
    target_transforms=[
        Differences([1]),
        LocalStandardScaler(),  # Z-score normalization per series
    ],
)

# Alternative scalers
fcst_minmax = MLForecast(
    models=[lgb.LGBMRegressor(verbosity=-1)],
    freq='D',
    lags=[1],
    target_transforms=[LocalMinMaxScaler()],  # Scale to [0, 1] per series
)

fcst_boxcox = MLForecast(
    models=[lgb.LGBMRegressor(verbosity=-1)],
    freq='D',
    lags=[1],
    target_transforms=[LocalBoxCox()],  # Box-Cox transformation per series
)

# Global sklearn transformer (fitted on all series together)
fcst_global_tfm = MLForecast(
    models=[lgb.LGBMRegressor(verbosity=-1)],
    freq='D',
    lags=[1],
    target_transforms=[GlobalSklearnTransformer(PowerTransformer())],
)
```

## MLForecast.preprocess Method

The `preprocess` method generates features without training, which is useful for custom training pipelines or feature inspection (see the `fit_models` sketch after this section).

```python
# Generate features without training
prep_df = fcst.preprocess(
    df=series,
    id_col='unique_id',
    time_col='ds',
    target_col='y',
    static_features=['static_0'],
    dropna=True,
    return_X_y=False,  # Return a DataFrame with all columns
)
print(prep_df.columns.tolist())
# Output: ['unique_id', 'ds', 'y', 'static_0', 'lag1', 'lag7',
#          'rolling_mean_lag1_window_size7', ...]

# Get features and target separately for custom training
X, y = fcst.preprocess(
    df=series,
    id_col='unique_id',
    time_col='ds',
    target_col='y',
    return_X_y=True,
    as_numpy=True,  # Return numpy arrays
)
print(f'Features shape: {X.shape}, Target shape: {y.shape}')
```
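For custom training pipelines, the `X, y` pair produced by `preprocess` can be handed back to the forecaster via `fit_models`, which trains the configured models on precomputed features; predictions then work as usual. A short sketch continuing from the example above:

```python
# Train the configured models on the features built by preprocess,
# then predict as usual (the series state was stored during preprocess)
fcst.fit_models(X, y)
custom_predictions = fcst.predict(h=7)
print(custom_predictions.head())
```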
## Direct Multi-Step Forecasting

Train a separate model for each forecast horizon (direct forecasting) instead of recursive prediction.

```python
# Direct forecasting with max_horizon
fcst_direct = MLForecast(
    models=[lgb.LGBMRegressor(verbosity=-1)],
    freq='D',
    lags=[7, 14, 21],
)
fcst_direct.fit(
    df=series,
    id_col='unique_id',
    time_col='ds',
    target_col='y',
    max_horizon=7,  # Train 7 separate models (one per horizon)
)

# Predict using the direct models
predictions = fcst_direct.predict(h=7)

# Sparse horizons (train models only for specific horizons)
fcst_sparse = MLForecast(
    models=[lgb.LGBMRegressor(verbosity=-1)],
    freq='D',
    lags=[7, 14],
)
fcst_sparse.fit(
    df=series,
    id_col='unique_id',
    time_col='ds',
    target_col='y',
    horizons=[1, 7, 14],  # Only train models for horizons 1, 7, and 14
)
```

## AutoMLForecast for Hyperparameter Optimization

`AutoMLForecast` automates hyperparameter tuning with Optuna, searching over model parameters, lags, and transformations.

```python
from mlforecast.auto import (
    AutoMLForecast, AutoLightGBM, AutoXGBoost, AutoRidge, AutoRandomForest,
)

# Basic automatic model selection and tuning
auto_fcst = AutoMLForecast(
    models=[AutoLightGBM(), AutoRidge()],
    freq='D',
    season_length=7,  # Weekly seasonality
)
auto_fcst.fit(
    df=series,
    n_windows=2,  # Cross-validation windows for evaluation
    h=7,  # Forecast horizon
    num_samples=20,  # Number of Optuna trials per model
    step_size=7,
    id_col='unique_id',
    time_col='ds',
    target_col='y',
    fitted=True,
    prediction_intervals=PredictionIntervals(n_windows=2, h=7),  # Needed to request levels at predict time
)

# Get predictions from the best models
auto_predictions = auto_fcst.predict(h=7, level=[90])

# Access the optimization results
for name, study in auto_fcst.results_.items():
    print(f'{name} best params: {study.best_params}')
    print(f'{name} best value: {study.best_value:.4f}')

# Custom hyperparameter search spaces
def custom_lgb_config(trial):
    return {
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
        'num_leaves': trial.suggest_int('num_leaves', 16, 256),
    }

auto_custom = AutoMLForecast(
    models={'custom_lgb': AutoLightGBM(config=custom_lgb_config)},
    freq='D',
    season_length=7,
)
```
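Beyond per-model search spaces, `AutoMLForecast` also accepts `init_config` and `fit_config` callables to tune the feature setup itself (lags, date features, fit arguments). A hedged sketch, assuming these arguments behave as described in the auto module's documentation; the search choices are illustrative:

```python
# Tune the feature configuration alongside the model hyperparameters
def my_init_config(trial):
    # Returns keyword arguments for the MLForecast constructor (assumed contract)
    lag_set = trial.suggest_categorical('lag_set', ['short', 'long'])
    return {
        'lags': [1, 7] if lag_set == 'short' else [1, 7, 14, 28],
        'date_features': ['dayofweek'],
    }

auto_features = AutoMLForecast(
    models=[AutoLightGBM()],
    freq='D',
    season_length=7,
    init_config=my_init_config,  # Evaluated once per Optuna trial
)
```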
## DistributedMLForecast for Large-Scale Training

`DistributedMLForecast` enables training on massive datasets using the Dask, Spark, or Ray backends. The example below uses Dask; a sketch of the equivalent Spark flow follows it.

```python
from mlforecast.distributed import DistributedMLForecast
from mlforecast.distributed.models.dask.lgb import DaskLGBMForecast
# from mlforecast.distributed.models.spark.lgb import SparkLGBMForecast  # For Spark
# from mlforecast.distributed.models.ray.lgb import RayLGBMForecast  # For Ray

# Using Dask for distributed training
from dask.distributed import Client
import dask.dataframe as dd

client = Client()  # Start a local Dask cluster

# Convert pandas to a Dask DataFrame
dask_series = dd.from_pandas(series, npartitions=4)

# Initialize the distributed forecaster
dist_fcst = DistributedMLForecast(
    models=[DaskLGBMForecast()],
    freq='D',
    lags=[1, 7, 14],
    lag_transforms={
        1: [ExpandingMean()],
        7: [RollingMean(window_size=7)],
    },
    engine=client,
    num_partitions=4,
)

# Fit on the distributed data
dist_fcst.fit(
    df=dask_series,
    id_col='unique_id',
    time_col='ds',
    target_col='y',
)

# Predict (returns a Dask DataFrame)
dist_predictions = dist_fcst.predict(h=7)
local_predictions = dist_predictions.compute()  # Collect to pandas

# Cross-validation on distributed data
dist_cv = dist_fcst.cross_validation(
    df=dask_series,
    n_windows=2,
    h=7,
    id_col='unique_id',
    time_col='ds',
    target_col='y',
)

# Convert the distributed model to a local one for smaller-scale predictions
local_fcst = dist_fcst.to_local()
```
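The Spark flow mirrors the Dask one, using the `SparkLGBMForecast` import noted above. A hedged sketch, assuming a running `SparkSession`; the partitioning setup may need adjusting for your cluster:

```python
from pyspark.sql import SparkSession

from mlforecast.distributed import DistributedMLForecast
from mlforecast.distributed.models.spark.lgb import SparkLGBMForecast

spark = SparkSession.builder.getOrCreate()

# Partition by series id so each series lands in a single partition
spark_series = spark.createDataFrame(series).repartition(4, 'unique_id')

spark_fcst = DistributedMLForecast(
    models=[SparkLGBMForecast()],
    freq='D',
    lags=[1, 7],
)
spark_fcst.fit(df=spark_series)
spark_predictions = spark_fcst.predict(h=7).toPandas()  # Collect to pandas
```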
## Model Persistence and Transfer Learning

Save trained models for later use and apply pre-trained models to new series (transfer learning).

```python
# Save the fitted model
fcst.save('/path/to/model')

# Load the model
loaded_fcst = MLForecast.load('/path/to/model')
predictions = loaded_fcst.predict(h=7)

# Transfer learning: predict on new series using the fitted models
new_series = generate_daily_series(n_series=10, min_length=100, seed=123)

# Predict on new data (the models apply patterns learned from the original data)
transfer_predictions = fcst.predict(
    h=7,
    new_df=new_series,  # New series data
)

# Update the stored state with new observations
new_observations = pd.DataFrame({
    'unique_id': ['id_00', 'id_01'],
    'ds': pd.to_datetime(['2000-04-11', '2000-04-11']),
    'y': [150.5, 200.3],
})
fcst.update(new_observations)

# Predict with the updated state
updated_predictions = fcst.predict(h=7)
```

## Prediction Callbacks

Callbacks allow custom transformations of the features before each prediction step and of the predictions before they are used to update the target.

```python
import numpy as np

def clip_predictions(predictions):
    """Ensure predictions are non-negative."""
    return np.clip(predictions, 0, None)

def add_noise_to_features(df):
    """Add small noise to the features for robustness."""
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        df[col] = df[col] + np.random.normal(0, 0.01, len(df))
    return df

# Use callbacks during prediction
predictions = fcst.predict(
    h=14,
    before_predict_callback=add_noise_to_features,  # Transform the features
    after_predict_callback=clip_predictions,  # Transform the predictions
)
```

## Sample Weights for Training

Use sample weights to give more importance to certain observations during model training.

```python
# Add a weight column to the data
series_weighted = series.copy()

# Give more weight to recent observations (weights grow linearly up to 1)
series_weighted['weight'] = series_weighted.groupby('unique_id').cumcount() + 1
series_weighted['weight'] = (
    series_weighted['weight']
    / series_weighted.groupby('unique_id')['weight'].transform('max')
)

# Fit with sample weights
fcst_weighted = MLForecast(
    models=[lgb.LGBMRegressor(verbosity=-1)],
    freq='D',
    lags=[1, 7],
)
fcst_weighted.fit(
    df=series_weighted,
    id_col='unique_id',
    time_col='ds',
    target_col='y',
    weight_col='weight',  # Column containing the sample weights
)
```

## Data Generation Utilities

Helper functions for generating synthetic time series data for testing and experimentation.

```python
from mlforecast.utils import generate_daily_series, generate_prices_for_series

# Generate synthetic panel data
synthetic_data = generate_daily_series(
    n_series=50,
    min_length=100,
    max_length=365,
    n_static_features=3,
    equal_ends=True,  # All series end on the same date
    static_as_categorical=True,
    with_trend=True,  # Add a trend component
    seed=42,
    engine='pandas',  # or 'polars'
)

# Generate price data for exogenous features
prices = generate_prices_for_series(
    series=synthetic_data,
    horizon=14,  # Generate prices covering the forecast horizon
    seed=42,
)

print(synthetic_data.head())
print(prices.head())
```

MLForecast excels in production environments where speed and scalability are critical. Typical use cases include demand forecasting for retail and supply chains, energy load prediction, financial time series analysis, and any scenario requiring forecasts for thousands to millions of individual series. The library integrates seamlessly with existing ML pipelines through its sklearn-compatible API.

For best results, start with a simple lag configuration and add complexity only when cross-validation performance justifies it, as in the sketch below. Use `AutoMLForecast` to search for a good model and feature configuration automatically when the best one is unknown. For datasets that exceed a single machine's memory, use `DistributedMLForecast` with the Dask, Spark, or Ray backends to distribute both feature computation and model training across a cluster.
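As a concrete example of that incremental workflow, the sketch below (configuration values are illustrative) compares a lags-only baseline against a richer candidate on the same cross-validation windows:

```python
import lightgbm as lgb

from mlforecast import MLForecast

# Compare feature configurations on identical CV windows; keep the winner
configs = {
    'baseline': dict(lags=[1, 7]),
    'candidate': dict(lags=[1, 7, 14], date_features=['dayofweek']),
}
for name, cfg in configs.items():
    fcst_cfg = MLForecast(models=[lgb.LGBMRegressor(verbosity=-1)], freq='D', **cfg)
    cv = fcst_cfg.cross_validation(df=series, n_windows=3, h=7)
    mae = (cv['y'] - cv['LGBMRegressor']).abs().mean()
    print(f'{name} MAE: {mae:.2f}')
```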