Time series forecasting is one of the most practically valuable and most frequently botched areas of applied machine learning. The demand forecast that saves a retailer millions in inventory costs, the energy load prediction that prevents grid failures, the sales forecast that guides hiring decisions -- all of these are time series problems. And all of them have a fundamental property that standard ML workflows ignore at their peril: observations are not independent. What happened yesterday predicts what happens today.
This dependency violates the independence assumption behind most standard ML workflows (that examples are independent and identically distributed) and calls for different approaches to data preparation, model selection, and evaluation.
Understanding Time Series Structure
A time series is a sequence of observations indexed by time. Unlike tabular data where rows are independent examples, time series observations are inherently ordered and temporally dependent.
Time series typically exhibit three components:
Trend: A long-term directional movement. Revenue grows 15% year-over-year. User base expands. Climate temperatures rise. Trend is often captured by fitting a line or polynomial to the data over time.
Seasonality: Regular, periodic patterns that repeat. Retail sales spike in November-December. Air conditioning load peaks in summer. Website traffic drops on weekends. Seasonality has a fixed period (daily, weekly, annual) and repeatable shape.
Residual (or noise): What is left after removing trend and seasonality. Ideally, the residual is random noise. In practice, the residual often contains additional structure (autocorrelation -- today's residual correlates with yesterday's residual).
Understanding these components guides model selection and feature engineering.
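As a rough illustration of the three components, the sketch below performs a simple additive decomposition in plain Python. The function name `decompose_additive`, the synthetic series, and the weekly period of 7 are assumptions made for this example; in practice a library routine would normally be used instead.

```python
import math
import random

def decompose_additive(series, period):
    """Split a series into trend, seasonal, and residual parts (additive model)."""
    n = len(series)
    half = period // 2
    # Trend: centered moving average over one full period (None where the window does not fit).
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half : t + half + 1]
        if period % 2 == 0:
            # Even period: half-weight the two end points so the window stays centered.
            total = sum(window) - 0.5 * (window[0] + window[-1])
            trend[t] = total / period
        else:
            trend[t] = sum(window) / period
    # Seasonal: average detrended value at each position in the cycle.
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(series[t] - trend[t])
    seasonal_means = [sum(b) / len(b) for b in buckets]
    # Center the seasonal pattern so it sums to zero over one period.
    offset = sum(seasonal_means) / period
    seasonal = [seasonal_means[t % period] - offset for t in range(n)]
    # Residual: whatever trend and seasonality do not explain.
    resid = [None if trend[t] is None else series[t] - trend[t] - seasonal[t]
             for t in range(n)]
    return trend, seasonal, resid

# Synthetic daily series: linear trend + weekly cycle + Gaussian noise (illustrative only).
rng = random.Random(0)
period = 7
series = [100 + 0.5 * t + 10 * math.sin(2 * math.pi * t / period) + rng.gauss(0, 1)
          for t in range(70)]
trend, seasonal, resid = decompose_additive(series, period)
print(round(trend[10], 1), [round(s, 1) for s in seasonal[:period]])
```

If the residual still shows visible structure after this step, that is a hint that lagged values carry information a model could use.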
The Critical Mistake: Random Train-Test Splits
The most dangerous mistake in time series ML: splitting data randomly into training and test sets.
In standard tabular ML, random splitting is correct because examples are independent. In time series, random splitting is catastrophically wrong because it creates data leakage: information from the future leaks into the training set, making your model appear more accurate than it actually is.
If you train on a random 80% of your time series data and test on the remaining 20%, your training set will contain observations from after the test observations. The model will have implicitly seen future information and will appear to forecast well -- but this performance will not generalize to actual future prediction.
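The leak can be made concrete by counting how many training rows fall after the earliest test row. The sketch below uses plain integer positions as stand-ins for timestamps; the helper `count_future_rows` and the seed are assumptions made for illustration.

```python
import random

n = 100                      # observations indexed 0..99 in time order
indices = list(range(n))

# Random 80/20 split, comparable to a shuffled train/test split.
rng = random.Random(42)
shuffled = indices[:]
rng.shuffle(shuffled)
random_train, random_test = shuffled[:80], shuffled[80:]

# Time-based 80/20 split: everything before the cutoff is training data.
cutoff = int(n * 0.8)
time_train, time_test = indices[:cutoff], indices[cutoff:]

def count_future_rows(train_idx, test_idx):
    """Number of training rows that occur after the earliest test row."""
    first_test = min(test_idx)
    return sum(1 for i in train_idx if i > first_test)

print("random split:", count_future_rows(random_train, random_test))
print("time split:  ", count_future_rows(time_train, time_test))
```

With the time-based split the count is zero by construction; with the random split many training rows sit in the test period's future.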
Always use time-based splits:
# WRONG: random split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# CORRECT: time-based split (assumes df is already sorted chronologically)
cutoff = int(len(df) * 0.8)
train_df = df.iloc[:cutoff]
test_df = df.iloc[cutoff:]
For cross-validation, use time series cross-validation (walk-forward validation): train on periods 1 through N and test on period N+1, then train on periods 1 through N+1 and test on N+2, and so on. Scikit-learn provides TimeSeriesSplit for this.
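To show what walk-forward validation produces, here is a small hand-rolled index generator with an expanding training window. The function `walk_forward_splits` and its parameters are hypothetical names for this sketch; it only illustrates the idea and is not a replacement for a library splitter.

```python
def walk_forward_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices) with an expanding training window."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train_idx = list(range(0, test_start))                      # all history so far
        test_idx = list(range(test_start, test_start + test_size))  # the next block
        yield train_idx, test_idx

for train_idx, test_idx in walk_forward_splits(n_samples=12, n_splits=3, test_size=2):
    print(f"train 0..{train_idx[-1]}  test {test_idx[0]}..{test_idx[-1]}")
```

Every fold tests only on observations that come after everything the model was trained on, which mirrors how the model would be used in production.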
ARIMA: The Classical Statistical Approach
ARIMA (AutoRegressive Integrated Moving Average) is the classical approach to time series forecasting. It models the time series as a function of its own past values (autoregressive component), past forecast errors (moving average component), and differences to handle non-stationarity (integrated component).
ARIMA is parameterized by (p, d, q):
- p: number of lagged observations in the autoregressive component
- d: number of differencing operations to make the series stationary
- q: number of lagged forecast errors in the moving average component
from statsmodels.tsa.arima.model import ARIMA
model = ARIMA(train_series, order=(2, 1, 2)) # AR(2), I(1), MA(2)
fitted = model.fit()
forecast = fitted.forecast(steps=30) # Forecast 30 periods ahead
ARIMA handles trend and autocorrelation well. SARIMA extends it with seasonal components. These models are interpretable, have well-understood statistical properties, and work well for short to medium forecast horizons on univariate time series.
When ARIMA works well:
- Univariate forecasting (one variable predicted from its own history)
- Clear trend and/or seasonality (seasonal patterns call for SARIMA)
- Short to medium forecast horizons (days to weeks)
- When confidence intervals and statistical inference matter
ARIMA limitations:
- Struggles with multivariate inputs (many external variables affecting the forecast)
- Cannot capture complex non-linear patterns
- Requires stationarity (constant statistical properties over time) -- often requires differencing
- Sensitive to outliers and structural breaks
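The differencing behind the d parameter can be sketched in a few lines of plain Python. The helper names `difference` and `undifference` are assumptions for this example; ARIMA implementations handle this step internally.

```python
def difference(series, d=1):
    """Apply first-order differencing d times."""
    out = list(series)
    for _ in range(d):
        out = [out[i] - out[i - 1] for i in range(1, len(out))]
    return out

def undifference(diffs, first_value):
    """Invert one round of differencing by cumulative summation."""
    out = [first_value]
    for delta in diffs:
        out.append(out[-1] + delta)
    return out

trend_series = [10 + 3 * t for t in range(8)]   # linear trend, so the mean is not constant
diffed = difference(trend_series, d=1)          # increments of a linear trend are constant
restored = undifference(diffed, trend_series[0])
print(diffed)
print(restored == trend_series)
```

One round of differencing removes a linear trend; a quadratic trend needs d=2. Forecasts made on the differenced scale are mapped back to the original scale by the cumulative sum.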
LightGBM with Lag Features: The Workhorse for Complex Forecasting
For most production forecasting problems -- especially when you have multiple relevant features, complex non-linear patterns, or need to forecast many series simultaneously -- gradient boosting with lag features is a strong default and a common production choice.
The key transformation: convert the time series forecasting problem into a standard supervised ML problem by creating lag features.
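import pandas as pd
import lightgbm as lgb

def create_lag_features(df, target_col, lags, windows):
    # Assumes df holds a single series sorted by date;
    # for many series, apply this per group (e.g. per product).
    df = df.copy()
    for lag in lags:
        df[f'lag_{lag}'] = df[target_col].shift(lag)
    for window in windows:
        # shift(1) first, so each window ends at the previous row and never sees the current target
        df[f'rolling_mean_{window}'] = df[target_col].shift(1).rolling(window).mean()
        df[f'rolling_std_{window}'] = df[target_col].shift(1).rolling(window).std()
    return df

df = create_lag_features(df, 'sales', lags=[1, 7, 14, 28], windows=[7, 14, 28])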
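# Add date features
df['day_of_week'] = df['date'].dt.dayofweek
df['month'] = df['date'].dt.month
df['week_of_year'] = df['date'].dt.isocalendar().week.astype(int)
# Time-based split and train (dropna removes early rows whose lags are undefined)
train = df[df['date'] < '2024-01-01'].dropna()
test = df[df['date'] >= '2024-01-01'].dropna()
# Use every column except the date and the target as a feature
# (assumes df contains no other raw columns)
feature_cols = [c for c in df.columns if c not in ('date', 'sales')]
model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05)
model.fit(train[feature_cols], train['sales'])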
The lag features encode temporal dependencies explicitly: "how much did we sell 7 days ago, 14 days ago, 28 days ago?" Rolling statistics encode trend and local volatility. Date features encode seasonality (day of week, month of year).
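To see exactly which values each feature can "see", the following dependency-free sketch reproduces a lag and a past-only rolling mean on a tiny list. The helper names `lag` and `rolling_mean_past` and the sample numbers are made up for illustration; the rolling helper is meant to mirror the shift-by-one-then-roll pattern used above.

```python
def lag(values, k):
    """Value observed k steps earlier; None where no such history exists."""
    return [values[t - k] if t - k >= 0 else None for t in range(len(values))]

def rolling_mean_past(values, window):
    """Mean of the `window` values strictly before each position (current value excluded)."""
    out = []
    for t in range(len(values)):
        past = values[max(0, t - window):t]
        out.append(sum(past) / window if len(past) == window else None)
    return out

sales = [10, 12, 11, 15, 14, 18]
print(lag(sales, 1))
print(rolling_mean_past(sales, 3))
```

The important property is that the feature at position t is computed only from positions before t. A rolling mean that included the current value would leak the target into its own features.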
Why LightGBM beats ARIMA for complex forecasting:
- Easily incorporates external features (promotions, holidays, price changes, weather)
- Captures complex non-linear interactions between features
- Scales to forecasting many series simultaneously (same model for all products/locations)
- Feature importance gives interpretability into which lags and features matter
- Competitive or superior performance on most practical forecasting benchmarks
The Kaggle M5 competition (forecasting Walmart sales across 42,840 time series) was dominated by LightGBM and related gradient boosting approaches, validating their practical effectiveness.
LSTM: Deep Learning for Sequential Patterns
LSTMs (Long Short-Term Memory networks) are recurrent neural networks designed for sequences. They maintain a "cell state" that can carry information across many time steps, which mitigates the vanishing gradient problem that plagued earlier RNNs.
import torch
import torch.nn as nn
class LSTMForecaster(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, forecast_horizon):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, forecast_horizon)

    def forward(self, x):
        lstm_out, _ = self.lstm(x)
        # Use the output at the last time step to predict the whole horizon
        return self.linear(lstm_out[:, -1, :])
LSTMs can learn to weight information from many time steps ago when it is relevant, which is useful for long-range dependencies that lag features miss.
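A sequence model like this needs its training data cut into fixed-length input windows paired with the values that follow. The sketch below shows one way to build those pairs in plain Python; `make_windows`, `seq_len`, and `horizon` are names chosen for this example, and it assumes a univariate series with one feature per time step.

```python
def make_windows(series, seq_len, horizon):
    """Pair each input window of length seq_len with the next `horizon` values."""
    inputs, targets = [], []
    for start in range(len(series) - seq_len - horizon + 1):
        end = start + seq_len
        inputs.append([[v] for v in series[start:end]])   # shape (seq_len, 1)
        targets.append(series[end:end + horizon])          # shape (horizon,)
    return inputs, targets

series = list(range(10))
X, y = make_windows(series, seq_len=4, horizon=2)
print(len(X), X[0], y[0])
```

Converted to float tensors, these nested lists would have shapes (batch, seq_len, 1) and (batch, horizon), matching a recurrent layer configured with batch_first=True. Windows should be built separately within the training and test periods so that no training target falls inside the test range.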
When LSTMs are appropriate:
- Very long-range dependencies (information from months or years ago is relevant)
- Complex multivariate sequence inputs (multiple sensor readings over time)
- When the sequential structure is genuinely important (not just lagged features)
LSTM caveats:
- Slower to train than LightGBM
- Requires more data to realize the advantage over simpler methods
- Hyperparameter tuning is more complex
- In practice, LightGBM with good lag features often matches or beats LSTM on tabular time series
Modern alternatives to LSTMs for time series: Temporal Convolutional Networks (TCNs), N-BEATS, and Temporal Fusion Transformer. For most practitioners, these are advanced options to explore after establishing a solid LightGBM baseline.
Facebook Prophet: Time Series for Non-Specialists
Prophet, developed by Facebook, is designed for business forecasting by non-specialists. It models trend, seasonality, and holidays explicitly and handles missing data, outliers, and trend changes gracefully.
from prophet import Prophet
model = Prophet(seasonality_mode='multiplicative', yearly_seasonality=True)
model.add_country_holidays(country_name='US')
model.fit(train_df[['ds', 'y']]) # ds: datetime, y: target
future = model.make_future_dataframe(periods=90)
forecast = model.predict(future)
Prophet is particularly good for business metrics with strong yearly seasonality and holiday effects (website traffic, retail sales). It is less suitable for fine-grained forecasting (sub-hourly, highly irregular series) or when external features need to be incorporated in complex ways.
Choosing the Right Approach
- Univariate, strong trend/seasonality, statistical inference needed: ARIMA/SARIMA
- Multivariate, complex features, production forecasting at scale: LightGBM with lag features
- Long-range sequential dependencies, complex multivariate inputs: LSTM or Temporal Fusion Transformer
- Business metrics forecasting for non-specialists, holiday effects matter: Prophet
- Exploratory / sanity check baseline: Naive or seasonal naive forecast
Always start with a simple baseline (naive forecast: tomorrow = today, or seasonal naive: this week = same week last year). If your ML model cannot beat the naive baseline substantially, something is wrong with your data, features, or evaluation approach.
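Both baselines take only a few lines, so there is little excuse for skipping them. The sketch below is a minimal version in plain Python; the function names, the weekly period, and the toy data are assumptions for illustration.

```python
def naive_forecast(history, horizon):
    """Repeat the last observed value for every future step."""
    return [history[-1]] * horizon

def seasonal_naive_forecast(history, horizon, period):
    """Repeat the values from the most recent full season."""
    return [history[len(history) - period + (h % period)] for h in range(horizon)]

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

weekly_pattern = [5, 6, 7, 8, 9, 3, 2]   # toy data with a perfectly repeating week
history = weekly_pattern * 4
actual_next_week = weekly_pattern

print(mae(actual_next_week, naive_forecast(history, 7)))
print(mae(actual_next_week, seasonal_naive_forecast(history, 7, period=7)))
```

On strongly seasonal data the seasonal naive forecast is often hard to beat, which is exactly why it makes a useful reference point for any ML model's error.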
Forecasting is hard. The uncertainty in forecasts grows rapidly with the forecast horizon. Be honest about forecast confidence intervals, communicate them to stakeholders, and build systems that are robust to forecast error rather than assuming point estimates are correct.
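One simple way to see how uncertainty grows with the horizon is the interval for a naive forecast. Under the assumption of uncorrelated, roughly normal one-step errors with standard deviation sigma, the h-step interval is the last value plus or minus z * sigma * sqrt(h). The function name and the numbers below are illustrative only.

```python
import math

def naive_interval(last_value, residual_std, h, z=1.96):
    """Approximate 95% interval for a naive forecast h steps ahead.

    Assumes uncorrelated, roughly normal one-step errors.
    """
    half_width = z * residual_std * math.sqrt(h)
    return last_value - half_width, last_value + half_width

for h in (1, 4, 16):
    low, high = naive_interval(last_value=100.0, residual_std=5.0, h=h)
    print(h, round(low, 1), round(high, 1))
```

Under these assumptions the interval width doubles each time the horizon quadruples. Real series with trend shifts or structural breaks can be less predictable than this suggests.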