# MLForecast

MLForecast is a scalable machine learning framework for time series forecasting. It provides efficient feature engineering to train any scikit-learn compatible model on millions of time series, with out-of-the-box compatibility with pandas, polars, Spark, Dask, and Ray. The library offers some of the fastest feature-engineering implementations for time series in Python, enabling both local and distributed training at scale.

The framework follows a familiar sklearn-style API with `.fit` and `.predict` methods, making it easy to integrate into existing ML workflows. Key features include probabilistic forecasting with conformal prediction, support for exogenous variables and static covariates, automatic lag feature generation, rolling window statistics, and target transformations for handling trends and seasonality.
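Before the component-by-component tour, here is a minimal end-to-end sketch of that workflow on synthetic data (`generate_daily_series` is covered under Data Generation Utilities below); the model and lag choices are illustrative:

```python
import lightgbm as lgb

from mlforecast import MLForecast
from mlforecast.utils import generate_daily_series

# Minimal workflow: build lag features, train one model, forecast 7 steps
series = generate_daily_series(n_series=20, min_length=100, seed=1)
fcst = MLForecast(models=[lgb.LGBMRegressor(verbosity=-1)], freq='D', lags=[1, 7])
fcst.fit(series)  # default column names: unique_id / ds / y
preds = fcst.predict(h=7)
print(preds.head())
```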
## MLForecast Class Initialization

The main `MLForecast` class encapsulates the entire forecasting pipeline: feature engineering, model training, and prediction generation. It accepts any scikit-learn compatible regressor and automatically handles the creation of lag features, date features, and target transformations.

```python
import lightgbm as lgb
from sklearn.linear_model import LinearRegression

from mlforecast import MLForecast
from mlforecast.lag_transforms import RollingMean, ExpandingMean, ExponentiallyWeightedMean
from mlforecast.target_transforms import Differences

# Initialize with multiple models and feature configuration
fcst = MLForecast(
    models=[
        lgb.LGBMRegressor(random_state=0, verbosity=-1),
        LinearRegression(),
    ],
    freq='D',  # Daily frequency (use 'W' for weekly, 'M' for monthly, or an int for integer-indexed series)
    lags=[1, 7, 14, 28],  # Create lag features for these periods
    lag_transforms={
        1: [ExpandingMean()],  # Expanding mean on lag 1
        7: [RollingMean(window_size=7), RollingMean(window_size=28)],  # Rolling means on lag 7
        14: [ExponentiallyWeightedMean(alpha=0.5)],  # EWM on lag 14
    },
    date_features=['dayofweek', 'month', 'year'],  # Extract date components
    target_transforms=[Differences([1])],  # First difference to remove trend
    num_threads=4,  # Parallel feature computation
)
```

## MLForecast.fit Method

The `fit` method computes all features and trains the configured models on the provided time series data. It expects data in long format with columns for the series identifier, the timestamp, and the target value.

```python
import pandas as pd

from mlforecast.utils import generate_daily_series, PredictionIntervals

# Generate sample data in long format
series = generate_daily_series(
    n_series=100,
    min_length=200,
    max_length=500,
    n_static_features=2,
    with_trend=True,
    seed=42,
)

# Fit the models with prediction intervals for uncertainty quantification
fcst.fit(
    df=series,
    id_col='unique_id',  # Column identifying each series
    time_col='ds',  # Timestamp column
    target_col='y',  # Target variable column
    static_features=['static_0', 'static_1'],  # Features constant across time
    dropna=True,  # Drop rows with NaNs produced by lag features
    keep_last_n=100,  # Keep only the last 100 points per series (memory optimization)
    prediction_intervals=PredictionIntervals(n_windows=3, h=7),  # For conformal intervals
    fitted=True,  # Store in-sample predictions
)

# Access fitted values
fitted_df = fcst.forecast_fitted_values()
print(fitted_df.head())
```

## MLForecast.predict Method

The `predict` method generates forecasts for the specified horizon using the trained models. It automatically handles recursive prediction, updating lag features at each step.

```python
# Generate predictions for the next 14 days
predictions = fcst.predict(h=14)
print(predictions.head())
# Output columns: unique_id, ds, LGBMRegressor, LinearRegression

# Predict with prediction intervals
predictions_with_intervals = fcst.predict(h=14, level=[80, 95])
print(predictions_with_intervals.columns.tolist())
# Output: ['unique_id', 'ds', 'LGBMRegressor', 'LinearRegression',
#          'LGBMRegressor-lo-95', 'LGBMRegressor-lo-80',
#          'LGBMRegressor-hi-80', 'LGBMRegressor-hi-95', ...]

# Predict for specific series only
subset_predictions = fcst.predict(h=14, ids=['id_00', 'id_01', 'id_02'])

# Predict with future exogenous features
# (assumes the model was trained with a dynamic 'price' column)
future_exog = pd.DataFrame({
    'unique_id': ['id_00'] * 14 + ['id_01'] * 14,
    'ds': pd.date_range('2000-04-10', periods=14).tolist() * 2,
    'price': [1.5] * 28,  # Future values of the exogenous variable
})
# Restrict the prediction to the series covered by X_df
predictions_with_exog = fcst.predict(h=14, X_df=future_exog, ids=['id_00', 'id_01'])
```

## MLForecast.cross_validation Method

Cross-validation evaluates model performance using time series splits, training on historical data and testing on the following periods across multiple windows.

```python
# Perform time series cross-validation
cv_results = fcst.cross_validation(
    df=series,
    n_windows=3,  # Number of evaluation windows
    h=7,  # Forecast horizon per window
    step_size=7,  # Step between windows (defaults to h)
    id_col='unique_id',
    time_col='ds',
    target_col='y',
    refit=True,  # Retrain the models for each window
    prediction_intervals=PredictionIntervals(n_windows=2, h=7),  # Required when requesting levels
    level=[90],  # Prediction interval levels
    fitted=True,  # Store fitted values for each fold
)
print(cv_results.columns.tolist())
# Output: ['unique_id', 'ds', 'cutoff', 'y', 'LGBMRegressor', 'LinearRegression', ...]

# Calculate metrics per model
from sklearn.metrics import mean_absolute_error

for model in ['LGBMRegressor', 'LinearRegression']:
    mae = mean_absolute_error(cv_results['y'], cv_results[model])
    print(f'{model} MAE: {mae:.2f}')

# Access cross-validation fitted values
cv_fitted = fcst.cross_validation_fitted_values()
```

## Rolling Lag Transforms

Rolling transforms compute statistics over a sliding window of past values. These are essential for capturing local patterns and trends in time series data.

```python
from mlforecast.lag_transforms import (
    RollingMean, RollingStd, RollingMin, RollingMax, RollingQuantile,
    SeasonalRollingMean, SeasonalRollingStd,
)

fcst = MLForecast(
    models=[lgb.LGBMRegressor(verbosity=-1)],
    freq='D',
    lags=[1, 7],
    lag_transforms={
        # Rolling statistics on lag 1
        1: [
            RollingMean(window_size=7),  # 7-day moving average
            RollingStd(window_size=7),  # 7-day rolling std
            RollingMin(window_size=14),  # 14-day rolling min
            RollingMax(window_size=14),  # 14-day rolling max
            RollingQuantile(p=0.5, window_size=7),  # 7-day rolling median
        ],
        # Seasonal rolling (e.g., same-day-of-week patterns)
        7: [
            SeasonalRollingMean(season_length=7, window_size=4),  # Mean of the last 4 same weekdays
            SeasonalRollingStd(season_length=7, window_size=4),
        ],
    },
)

# Rolling transforms with a minimum-samples requirement
fcst_with_min = MLForecast(
    models=[lgb.LGBMRegressor(verbosity=-1)],
    freq='D',
    lags=[1],
    lag_transforms={
        1: [RollingMean(window_size=7, min_samples=3)],  # Emit a value once 3+ samples are available
    },
)
```
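Lag transforms can also be composed: recent mlforecast versions provide a `Combine` lag transform that merges the outputs of two transforms with a binary operator. A short sketch (assuming `Combine` is available in your installed version), building the ratio of a short to a long rolling mean:

```python
import operator

import lightgbm as lgb

from mlforecast import MLForecast
from mlforecast.lag_transforms import Combine, RollingMean

# Ratio of the 7-day to the 28-day rolling mean on lag 1, as a single feature
fcst_combined = MLForecast(
    models=[lgb.LGBMRegressor(verbosity=-1)],
    freq='D',
    lags=[1],
    lag_transforms={
        1: [Combine(RollingMean(window_size=7), RollingMean(window_size=28), operator.truediv)],
    },
)
```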
## Expanding and Exponential Transforms

Expanding transforms use all available history, while exponentially weighted transforms give more weight to recent observations.

```python
from mlforecast.lag_transforms import (
    ExpandingMean, ExpandingStd, ExpandingMin, ExpandingMax,
    ExpandingQuantile, ExponentiallyWeightedMean, RollingMean,
)

fcst = MLForecast(
    models=[lgb.LGBMRegressor(verbosity=-1)],
    freq='D',
    lags=[1, 7],
    lag_transforms={
        1: [
            ExpandingMean(),  # Mean of all history
            ExpandingStd(),  # Std of all history
            ExpandingMin(),  # Min of all history
            ExpandingMax(),  # Max of all history
            ExpandingQuantile(p=0.9),  # 90th percentile of all history
            ExponentiallyWeightedMean(alpha=0.3),  # EWM with alpha=0.3
            ExponentiallyWeightedMean(alpha=0.9),  # EWM with a higher weight on recent values
        ],
    },
)

# Global transforms (aggregate across all series by timestamp)
fcst_global = MLForecast(
    models=[lgb.LGBMRegressor(verbosity=-1)],
    freq='D',
    lags=[1],
    lag_transforms={
        1: [
            ExpandingMean(global_=True),  # Global expanding mean across all series
            RollingMean(window_size=7, global_=True),  # Global rolling mean
        ],
    },
)
```

## Target Transforms

Target transforms modify the target variable before feature computation and are automatically inverted during prediction. They handle trends, seasonality, and scale normalization.

```python
from sklearn.preprocessing import PowerTransformer

from mlforecast.target_transforms import (
    Differences, AutoDifferences, AutoSeasonalDifferences,
    LocalStandardScaler, LocalMinMaxScaler, LocalRobustScaler,
    LocalBoxCox, GlobalSklearnTransformer,
)

# Simple differencing to remove trend
fcst_diff = MLForecast(
    models=[lgb.LGBMRegressor(verbosity=-1)],
    freq='D',
    lags=[1, 7],
    target_transforms=[Differences([1])],  # First difference
)

# Seasonal differencing for weekly patterns
fcst_seasonal = MLForecast(
    models=[lgb.LGBMRegressor(verbosity=-1)],
    freq='D',
    lags=[1, 7],
    target_transforms=[Differences([1, 7])],  # First + seasonal (weekly) difference
)

# Automatic differencing (finds the optimal number of differences per series)
fcst_auto = MLForecast(
    models=[lgb.LGBMRegressor(verbosity=-1)],
    freq='D',
    lags=[1, 7],
    target_transforms=[AutoDifferences(max_diffs=2)],
)

# Combine scaling with differencing
fcst_scaled = MLForecast(
    models=[lgb.LGBMRegressor(verbosity=-1)],
    freq='D',
    lags=[1, 7],
    target_transforms=[
        Differences([1]),
        LocalStandardScaler(),  # Z-score normalization per series
    ],
)

# Alternative scalers
fcst_minmax = MLForecast(
    models=[lgb.LGBMRegressor(verbosity=-1)],
    freq='D',
    lags=[1],
    target_transforms=[LocalMinMaxScaler()],  # Scale to [0, 1] per series
)

fcst_boxcox = MLForecast(
    models=[lgb.LGBMRegressor(verbosity=-1)],
    freq='D',
    lags=[1],
    target_transforms=[LocalBoxCox()],  # Box-Cox transformation per series
)

# Global sklearn transformer (fitted on all series together)
fcst_global_tfm = MLForecast(
    models=[lgb.LGBMRegressor(verbosity=-1)],
    freq='D',
    lags=[1],
    target_transforms=[GlobalSklearnTransformer(PowerTransformer())],
)
```

## MLForecast.preprocess Method

The `preprocess` method generates features without training, which is useful for custom training pipelines or feature inspection (see the `fit_models` sketch after this section).

```python
# Generate features without training
prep_df = fcst.preprocess(
    df=series,
    id_col='unique_id',
    time_col='ds',
    target_col='y',
    static_features=['static_0'],
    dropna=True,
    return_X_y=False,  # Return a DataFrame with all columns
)
print(prep_df.columns.tolist())
# Output: ['unique_id', 'ds', 'y', 'static_0', 'lag1', 'lag7',
#          'rolling_mean_lag1_window_size7', ...]

# Get features and target separately for custom training
X, y = fcst.preprocess(
    df=series,
    id_col='unique_id',
    time_col='ds',
    target_col='y',
    return_X_y=True,
    as_numpy=True,  # Return numpy arrays
)
print(f'Features shape: {X.shape}, Target shape: {y.shape}')
```
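For custom training pipelines, the `X, y` pair produced by `preprocess` can be handed back to the forecaster via `fit_models`, which trains the configured models on precomputed features; predictions then work as usual. A short sketch continuing from the example above:

```python
# Train the configured models on the features built by preprocess,
# then predict as usual (the series state was stored during preprocess)
fcst.fit_models(X, y)
custom_predictions = fcst.predict(h=7)
print(custom_predictions.head())
```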
## Direct Multi-Step Forecasting

Train a separate model for each forecast horizon (direct forecasting) instead of recursive prediction.

```python
# Direct forecasting with max_horizon
fcst_direct = MLForecast(
    models=[lgb.LGBMRegressor(verbosity=-1)],
    freq='D',
    lags=[7, 14, 21],
)
fcst_direct.fit(
    df=series,
    id_col='unique_id',
    time_col='ds',
    target_col='y',
    max_horizon=7,  # Train 7 separate models (one per horizon)
)

# Predict using the direct models
predictions = fcst_direct.predict(h=7)

# Sparse horizons (train models only for specific horizons)
fcst_sparse = MLForecast(
    models=[lgb.LGBMRegressor(verbosity=-1)],
    freq='D',
    lags=[7, 14],
)
fcst_sparse.fit(
    df=series,
    id_col='unique_id',
    time_col='ds',
    target_col='y',
    horizons=[1, 7, 14],  # Only train models for horizons 1, 7, and 14
)
```

## AutoMLForecast for Hyperparameter Optimization

`AutoMLForecast` automates hyperparameter tuning with Optuna, searching over model parameters, lags, and transformations.

```python
from mlforecast.auto import (
    AutoMLForecast, AutoLightGBM, AutoXGBoost, AutoRidge, AutoRandomForest,
)

# Basic automatic model selection and tuning
auto_fcst = AutoMLForecast(
    models=[AutoLightGBM(), AutoRidge()],
    freq='D',
    season_length=7,  # Weekly seasonality
)
auto_fcst.fit(
    df=series,
    n_windows=2,  # Cross-validation windows for evaluation
    h=7,  # Forecast horizon
    num_samples=20,  # Number of Optuna trials per model
    step_size=7,
    id_col='unique_id',
    time_col='ds',
    target_col='y',
    fitted=True,
    prediction_intervals=PredictionIntervals(n_windows=2, h=7),  # Needed to request levels at predict time
)

# Get predictions from the best models
auto_predictions = auto_fcst.predict(h=7, level=[90])

# Access the optimization results
for name, study in auto_fcst.results_.items():
    print(f'{name} best params: {study.best_params}')
    print(f'{name} best value: {study.best_value:.4f}')

# Custom hyperparameter search spaces
def custom_lgb_config(trial):
    return {
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
        'num_leaves': trial.suggest_int('num_leaves', 16, 256),
    }

auto_custom = AutoMLForecast(
    models={'custom_lgb': AutoLightGBM(config=custom_lgb_config)},
    freq='D',
    season_length=7,
)
```
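Beyond per-model search spaces, `AutoMLForecast` also accepts `init_config` and `fit_config` callables to tune the feature setup itself (lags, date features, fit arguments). A hedged sketch, assuming these arguments behave as described in the auto module's documentation; the search choices are illustrative:

```python
# Tune the feature configuration alongside the model hyperparameters
def my_init_config(trial):
    # Returns keyword arguments for the MLForecast constructor (assumed contract)
    lag_set = trial.suggest_categorical('lag_set', ['short', 'long'])
    return {
        'lags': [1, 7] if lag_set == 'short' else [1, 7, 14, 28],
        'date_features': ['dayofweek'],
    }

auto_features = AutoMLForecast(
    models=[AutoLightGBM()],
    freq='D',
    season_length=7,
    init_config=my_init_config,  # Evaluated once per Optuna trial
)
```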
## DistributedMLForecast for Large-Scale Training

`DistributedMLForecast` enables training on massive datasets using the Dask, Spark, or Ray backends. The example below uses Dask; a sketch of the equivalent Spark flow follows it.

```python
from mlforecast.distributed import DistributedMLForecast
from mlforecast.distributed.models.dask.lgb import DaskLGBMForecast
# from mlforecast.distributed.models.spark.lgb import SparkLGBMForecast  # For Spark
# from mlforecast.distributed.models.ray.lgb import RayLGBMForecast  # For Ray

# Using Dask for distributed training
from dask.distributed import Client
import dask.dataframe as dd

client = Client()  # Start a local Dask cluster

# Convert pandas to a Dask DataFrame
dask_series = dd.from_pandas(series, npartitions=4)

# Initialize the distributed forecaster
dist_fcst = DistributedMLForecast(
    models=[DaskLGBMForecast()],
    freq='D',
    lags=[1, 7, 14],
    lag_transforms={
        1: [ExpandingMean()],
        7: [RollingMean(window_size=7)],
    },
    engine=client,
    num_partitions=4,
)

# Fit on the distributed data
dist_fcst.fit(
    df=dask_series,
    id_col='unique_id',
    time_col='ds',
    target_col='y',
)

# Predict (returns a Dask DataFrame)
dist_predictions = dist_fcst.predict(h=7)
local_predictions = dist_predictions.compute()  # Collect to pandas

# Cross-validation on distributed data
dist_cv = dist_fcst.cross_validation(
    df=dask_series,
    n_windows=2,
    h=7,
    id_col='unique_id',
    time_col='ds',
    target_col='y',
)

# Convert the distributed model to a local one for smaller-scale predictions
local_fcst = dist_fcst.to_local()
```
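The Spark flow mirrors the Dask one, using the `SparkLGBMForecast` import noted above. A hedged sketch, assuming a running `SparkSession`; the partitioning setup may need adjusting for your cluster:

```python
from pyspark.sql import SparkSession

from mlforecast.distributed import DistributedMLForecast
from mlforecast.distributed.models.spark.lgb import SparkLGBMForecast

spark = SparkSession.builder.getOrCreate()

# Partition by series id so each series lands in a single partition
spark_series = spark.createDataFrame(series).repartition(4, 'unique_id')

spark_fcst = DistributedMLForecast(
    models=[SparkLGBMForecast()],
    freq='D',
    lags=[1, 7],
)
spark_fcst.fit(df=spark_series)
spark_predictions = spark_fcst.predict(h=7).toPandas()  # Collect to pandas
```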
## Model Persistence and Transfer Learning

Save trained models for later use and apply pre-trained models to new series (transfer learning).

```python
# Save the fitted model
fcst.save('/path/to/model')

# Load the model
loaded_fcst = MLForecast.load('/path/to/model')
predictions = loaded_fcst.predict(h=7)

# Transfer learning: predict on new series using the fitted models
new_series = generate_daily_series(n_series=10, min_length=100, seed=123)

# Predict on new data (the models apply patterns learned from the original data)
transfer_predictions = fcst.predict(
    h=7,
    new_df=new_series,  # New series data
)

# Update the stored state with new observations
new_observations = pd.DataFrame({
    'unique_id': ['id_00', 'id_01'],
    'ds': pd.to_datetime(['2000-04-11', '2000-04-11']),
    'y': [150.5, 200.3],
})
fcst.update(new_observations)

# Predict with the updated state
updated_predictions = fcst.predict(h=7)
```

## Prediction Callbacks

Callbacks allow custom transformations of the features before each prediction step and of the predictions before they are used to update the target.

```python
import numpy as np

def clip_predictions(predictions):
    """Ensure predictions are non-negative."""
    return np.clip(predictions, 0, None)

def add_noise_to_features(df):
    """Add small noise to the features for robustness."""
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        df[col] = df[col] + np.random.normal(0, 0.01, len(df))
    return df

# Use callbacks during prediction
predictions = fcst.predict(
    h=14,
    before_predict_callback=add_noise_to_features,  # Transform the features
    after_predict_callback=clip_predictions,  # Transform the predictions
)
```

## Sample Weights for Training

Use sample weights to give more importance to certain observations during model training.

```python
# Add a weight column to the data
series_weighted = series.copy()

# Give more weight to recent observations (weights grow linearly up to 1)
series_weighted['weight'] = series_weighted.groupby('unique_id').cumcount() + 1
series_weighted['weight'] = (
    series_weighted['weight']
    / series_weighted.groupby('unique_id')['weight'].transform('max')
)

# Fit with sample weights
fcst_weighted = MLForecast(
    models=[lgb.LGBMRegressor(verbosity=-1)],
    freq='D',
    lags=[1, 7],
)
fcst_weighted.fit(
    df=series_weighted,
    id_col='unique_id',
    time_col='ds',
    target_col='y',
    weight_col='weight',  # Column containing the sample weights
)
```

## Data Generation Utilities

Helper functions for generating synthetic time series data for testing and experimentation.

```python
from mlforecast.utils import generate_daily_series, generate_prices_for_series

# Generate synthetic panel data
synthetic_data = generate_daily_series(
    n_series=50,
    min_length=100,
    max_length=365,
    n_static_features=3,
    equal_ends=True,  # All series end on the same date
    static_as_categorical=True,
    with_trend=True,  # Add a trend component
    seed=42,
    engine='pandas',  # or 'polars'
)

# Generate price data for exogenous features
prices = generate_prices_for_series(
    series=synthetic_data,
    horizon=14,  # Generate prices covering the forecast horizon
    seed=42,
)

print(synthetic_data.head())
print(prices.head())
```

MLForecast excels in production environments where speed and scalability are critical. Typical use cases include demand forecasting for retail and supply chains, energy load prediction, financial time series analysis, and any scenario requiring forecasts for thousands to millions of individual series. The library integrates seamlessly with existing ML pipelines through its sklearn-compatible API.

For best results, start with a simple lag configuration and add complexity only when cross-validation performance justifies it, as in the sketch below. Use `AutoMLForecast` to search for a good model and feature configuration automatically when the best one is unknown. For datasets that exceed a single machine's memory, use `DistributedMLForecast` with the Dask, Spark, or Ray backends to distribute both feature computation and model training across a cluster.
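As a concrete example of that incremental workflow, the sketch below (configuration values are illustrative) compares a lags-only baseline against a richer candidate on the same cross-validation windows:

```python
import lightgbm as lgb

from mlforecast import MLForecast

# Compare feature configurations on identical CV windows; keep the winner
configs = {
    'baseline': dict(lags=[1, 7]),
    'candidate': dict(lags=[1, 7, 14], date_features=['dayofweek']),
}
for name, cfg in configs.items():
    fcst_cfg = MLForecast(models=[lgb.LGBMRegressor(verbosity=-1)], freq='D', **cfg)
    cv = fcst_cfg.cross_validation(df=series, n_windows=3, h=7)
    mae = (cv['y'] - cv['LGBMRegressor']).abs().mean()
    print(f'{name} MAE: {mae:.2f}')
```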