In most domains of machine learning, bigger models win. But time series forecasting datasets are small — not “small” in the deep learning sense of a few million samples, but small as in a few thousand observations per channel. The ETTh1 benchmark has 17,420 time steps across 7 channels. Electricity has 26,304 hourly readings. Compare that to ImageNet or a GPT training corpus. In this regime, overfitting is the dominant enemy, and bringing a 50-million-parameter Transformer to a 17,000-step dataset is like bringing a bazooka to a knife fight — except the bazooka might shoot you in the foot.
CMoS (Si et al., ICML 2025) takes the “less is more” trend in time series forecasting to its logical extreme. On ETTh1, it achieves state-of-the-art MSE with roughly 750 parameters — not 750K, not 750M. Just 750. This post breaks down the three ideas that make it work: chunk-wise correlation, noise robustness via weight averaging, and a periodicity injection scheme that mirrors Wold’s classical decomposition theorem.
Why the Lightweight Trend Matters
The story starts with DLinear (Zeng et al., 2023), which showed that embarrassingly simple linear layers could match or beat complex Transformer architectures on standard benchmarks. Then came FITS, SparseTSF, and CycleNet, each pushing the parameter count lower while maintaining competitive accuracy.
| Model | Year | Approx. parameters (ETTh1) |
|---|---|---|
| PatchTST | 2023 | ~50 M |
| iTransformer | 2024 | ~5 M |
| DLinear | 2023 | ~75 K |
| SparseTSF | 2024 | ~10 K |
| CMoS | 2025 | ~750 |
The philosophical question behind this progression is pointed: if a dataset of 17,000 time steps can be forecast well by 750 parameters, what does that tell us about the intrinsic dimensionality of the forecasting problem? CMoS’s answer is that it is very low, and that the right inductive bias can exploit it.
The Core Idea: Chunks, Not Points
Most linear forecasting models operate at the point level: each future time step is a linear combination of all past time steps. If the lookback window has $L$ points and the forecast horizon has $H$ points, the weight matrix $\theta$ has $L \times H$ entries.
CMoS divides both the lookback window and the forecast horizon into chunks of size $S$, then models chunk-to-chunk dependencies instead:
\[\hat{x}_{t+i} = \sum_{j=0}^{L/S - 1} \theta_{ij} \, x_{t-j} + b_i\]

where each $x_{t+i}$ and $x_{t-j}$ is now a chunk (a vector of size $S$), not an individual point. The weight matrix shrinks from $L \times H$ to $\frac{L}{S} \times \frac{H}{S}$ — a reduction by a factor of $S^2$.
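A minimal sketch makes the parameter arithmetic concrete. The sizes below ($L = 720$, $H = 96$, $S = 48$) are illustrative choices of mine, not the paper's configuration, and the weights are random rather than trained:

```python
import numpy as np

# Sketch of a chunk-wise linear forecaster, assuming illustrative sizes
# (L=720 lookback, H=96 horizon, S=48 chunk size are our choices, not the paper's).
L, H, S = 720, 96, 48

# Point-wise linear model: every future point attends to every past point.
pointwise_params = L * H                      # 720 * 96 = 69,120 weights

# Chunk-wise model: reshape into L/S past chunks and H/S future chunks,
# then learn only chunk-to-chunk coefficients.
chunkwise_params = (L // S) * (H // S)        # 15 * 2 = 30 weights

rng = np.random.default_rng(0)
theta = rng.normal(size=(H // S, L // S)) / (L // S)   # chunk-to-chunk weights
b = np.zeros((H // S, S))                              # one bias per future chunk

def forecast(history):
    """Map an (L,) history to an (H,) forecast via chunk-to-chunk weights."""
    past_chunks = history.reshape(L // S, S)           # (L/S, S)
    future_chunks = theta @ past_chunks + b            # (H/S, S)
    return future_chunks.reshape(H)

y = forecast(rng.normal(size=L))
print(pointwise_params, chunkwise_params, y.shape)
```

With these sizes the reduction factor is $S^2 = 2{,}304$: 69,120 point-wise weights collapse to 30 chunk-wise ones.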
The parameter reduction is significant, but it is not even the most important benefit. The real win is noise robustness, and this is where the paper delivers a clean theorem.
Noise Robustness: A Theorem for Regression People
If you have taken econometrics or any statistical learning course, you know the fundamental tension: adding parameters lets you fit the signal better, but also lets you fit the noise. This is the bias-variance tradeoff.
CMoS’s Theorem 3.2 formalizes exactly why chunking helps from a noise perspective.
Setup
Consider a linear regression $f(x; \theta) = \theta^\top x$. Suppose the input is corrupted by Gaussian noise: $x' = x + \delta$ where $\delta \sim \mathcal{N}(\mu, \sigma^2 I)$.
Definition 3.1. The noise sensitivity of the model is the variance of the output change caused by the noise:
\[\text{Noise Sensitivity} = \text{Var}(\theta^\top \delta) = \sigma^2 \|\theta\|_2^2\]

This is intuitive: the model’s sensitivity to input noise is proportional to the squared norm of the weights. Large weights amplify noise. Small weights dampen it. It is exactly why Ridge regression adds an $\ell_2$ penalty $\lambda \|\theta\|_2^2$.
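This identity is easy to verify empirically. The sketch below uses arbitrary illustrative weights and a Monte Carlo estimate of the output variance; note that the noise mean $\mu$ drops out, since variance is unaffected by a constant shift:

```python
import numpy as np

# Monte Carlo check of Definition 3.1: Var(theta^T delta) = sigma^2 * ||theta||^2.
# The weights and noise scale here are arbitrary illustrative values.
rng = np.random.default_rng(0)
theta = np.array([0.5, -1.2, 0.3, 0.8])
sigma = 0.1

delta = rng.normal(loc=0.7, scale=sigma, size=(1_000_000, theta.size))  # nonzero mean mu
outputs = delta @ theta
empirical = outputs.var()
analytic = sigma**2 * np.sum(theta**2)   # the mean mu drops out of the variance

print(empirical, analytic)  # the two agree to several decimal places
```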
The Chunking Theorem
Given point-wise weights $\{\theta_1, \ldots, \theta_n\}$ within a chunk, define the chunk weight as their weighted average:
\[\theta^* = \frac{\sum_{i=1}^n \alpha_i \theta_i}{\sum_{i=1}^n \alpha_i}, \quad \alpha_i \geq 0\]

Theorem 3.2. The noise sensitivity of the chunk-wise model is never worse than that of the point-wise model:
\[\sigma^2 \, (\theta^*)^2 \leq \sigma^2 \sum_{i=1}^n \theta_i^2\]

The Proof (Two Lines of Cauchy-Schwarz)
Since the $\alpha_i$ are nonnegative, $(\sum \alpha_i)^2 = \sum \alpha_i^2 + 2\sum_{i<j} \alpha_i \alpha_j \geq \sum \alpha_i^2$, so:
\[(\theta^*)^2 = \left(\frac{\sum \alpha_i \theta_i}{\sum \alpha_i}\right)^2 \leq \frac{(\sum \alpha_i \theta_i)^2}{\sum \alpha_i^2}\]

By Cauchy-Schwarz, $(\sum \alpha_i \theta_i)^2 \leq (\sum \alpha_i^2)(\sum \theta_i^2)$, giving:
\[(\theta^*)^2 \leq \sum \theta_i^2 \quad \blacksquare\]

Equality holds only when at most one $\alpha_i$ is non-zero. In every other case, chunking strictly reduces noise sensitivity. If you know Ridge regression, you already understand the mechanism: Ridge adds a soft penalty $\lambda \|\theta\|_2^2$ to shrink weights and reduce overfitting. Chunking achieves the same effect structurally, as a hard constraint. For noisy time series, which is most real-world data, this is a winning trade.
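The inequality can also be stress-tested numerically. The sketch below samples random point-wise weights and nonnegative averaging coefficients (both arbitrary) and checks that the averaged weight never exceeds the point-wise bound:

```python
import numpy as np

# Numeric sanity check of Theorem 3.2 with arbitrary random weights:
# the averaged chunk weight is never more noise-sensitive than the points.
rng = np.random.default_rng(1)
for _ in range(1000):
    n = int(rng.integers(2, 8))
    theta = rng.normal(size=n)                # point-wise weights in one chunk
    alpha = rng.uniform(0.0, 1.0, size=n)     # nonnegative averaging coefficients
    alpha[0] += 1e-6                          # avoid an all-zero denominator
    theta_star = (alpha * theta).sum() / alpha.sum()
    assert theta_star**2 <= (theta**2).sum() + 1e-12
print("Theorem 3.2 holds on all sampled cases")
```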
Correlation Mixing: PCA for Temporal Structures
For multivariate forecasting with $N$ channels, each channel may have a different temporal structure. Industrial electricity load has long-term dependencies; residential demand reacts to short-term weather. Modeling this diversity without $N$ separate models (overfitting) or one shared model (underfitting) is a genuine challenge.
CMoS learns $K$ shared basis correlation matrices $\theta^0, \theta^1, \ldots, \theta^{K-1}$ (typically $K = 4$), and for each channel $n$ computes a weighted combination:
\[\hat{x}^n_{t+i} = \frac{1}{\sum_k e^{\gamma^n_k}} \sum_{k=0}^{K-1} e^{\gamma^n_k} \left( \sum_{j=0}^{L/S - 1} \theta^k_{ij} \, x^n_{t-j} + b^k_i \right)\]

The channel-specific weights $\Gamma^n = \{\gamma^n_0, \ldots, \gamma^n_{K-1}\}$ are computed in two stages: a per-channel Conv1D for smoothing, followed by a shared linear layer that maps the smoothed representation to softmax weights over the $K$ matrices.
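The forward pass of this mixing step can be sketched in a few lines. The toy sizes below are mine, and the mixing logits `gamma` are taken as given rather than produced by the Conv1D + linear stage the paper describes:

```python
import numpy as np

def softmax(g):
    e = np.exp(g - g.max())
    return e / e.sum()

# Sketch of Correlation Mixing, assuming toy sizes (N=3 channels, K=4 bases,
# L=96, H=24, S=12). In CMoS the logits gamma come from a per-channel Conv1D
# followed by a shared linear layer; here they are random placeholders.
N, K, L, H, S = 3, 4, 96, 24, 12
rng = np.random.default_rng(0)

bases = rng.normal(size=(K, H // S, L // S)) / (L // S)  # K shared chunk matrices
biases = np.zeros((K, H // S, S))
gamma = rng.normal(size=(N, K))                          # per-channel mixing logits

def forecast_channel(history, g):
    past = history.reshape(L // S, S)
    per_basis = np.stack([bases[k] @ past + biases[k] for k in range(K)])  # (K, H/S, S)
    w = softmax(g)                                       # softmax weights over K bases
    return np.tensordot(w, per_basis, axes=1).reshape(H) # convex combination

x = rng.normal(size=(N, L))
preds = np.stack([forecast_channel(x[n], gamma[n]) for n in range(N)])
print(preds.shape)
```

Note that all $N$ channels share the same $K$ basis matrices; only the $K$-dimensional logits differ per channel, which is where the parameter savings come from.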
If you know Principal Component Analysis, this structure should feel familiar. PCA decomposes a high-dimensional covariance matrix as a weighted sum of $K$ rank-1 components. The insight is that covariance matrices are approximately low-rank: most of the structure lives in a small subspace. CMoS makes the analogous claim for the space of temporal correlation structures. Each channel’s unique temporal behavior is a point in the $K$-dimensional space spanned by the basis matrices — and the basis is learned end-to-end through backpropagation rather than via eigendecomposition, so it optimizes forecasting error directly.
The paper provides an information-theoretic motivation. If channels share dependencies (mutual information $I(X_i, X_j) > 0$), then their joint entropy is less than the sum of marginal entropies: $H(X_1, \ldots, X_N) < \sum H(X_i)$. A low-rank basis exploits exactly this redundancy.
Periodicity Injection and the Wold Decomposition
Many real-world time series exhibit strong periodicity. CMoS exploits this with an initialization scheme that mirrors a classical result from time series theory.
Wold’s Decomposition Theorem (1938) states that any covariance-stationary process $\{X_t\}$ decomposes uniquely into two uncorrelated components:
\[X_t = D_t + S_t\]

where $D_t$ is the deterministic component (perfectly predictable from its own infinite past, including periodic signals and trends) and $S_t$ is the purely stochastic component, an $\text{MA}(\infty)$ process $S_t = \sum_{j=0}^{\infty} \psi_j \epsilon_{t-j}$ driven by white noise innovations.
CMoS’s architecture mirrors this decomposition directly. Among the $K$ basis matrices:
- Matrix $\theta^0$ (initialized) plays the role of $D_t$. It is pre-filled with periodic peaks: for a period $p$ and chunk size $S$, the entry $\theta^0_{ij}$ is set to $p/L$ whenever the chunk distance between future chunk $i$ and past chunk $j$ is a multiple of $p/S$, and to 0 otherwise. This matrix already encodes that the best predictor for chunk $i$ is the corresponding chunk from previous periods.
- Matrices $\theta^1, \ldots, \theta^{K-1}$ (learned from scratch) play the role of $S_t$. They capture whatever residual temporal structure remains after periodicity is accounted for — short-term momentum, slow trends, cross-day dependencies.
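The initialization of $\theta^0$ can be sketched directly. One detail is my assumption: I read "chunk distance" as the offset between future chunk $i$ and past chunk $j$, counted back from the forecast boundary, so that an offset of a whole number of periods gets a peak. The hourly sizes below are also illustrative:

```python
import numpy as np

# Sketch of periodicity injection, assuming hourly data with daily period p=24
# and illustrative sizes L=168, H=24, S=4. The distance convention (dist = i+1+j,
# with j=0 the most recent past chunk) is our reading, not stated by the paper.
L, H, S, p = 168, 24, 4, 24
n_past, n_future = L // S, H // S    # 42 past chunks, 6 future chunks

theta0 = np.zeros((n_future, n_past))
period_chunks = p // S               # the period expressed in chunks (6)
for i in range(n_future):
    for j in range(n_past):
        dist = i + 1 + j             # chunk distance from future chunk i to past chunk j
        if dist % period_chunks == 0:
            theta0[i, j] = p / L     # peak weight at every whole period back
print(theta0.sum(axis=1))            # each row puts its mass only on periodic lags
```

A pleasant side effect of the $p/L$ peak value: with $L/p$ full periods in the window, each row of $\theta^0$ sums to 1, so the initialization starts out as an average over the corresponding chunks of previous periods.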
Wold’s theorem guarantees that this decomposition is always valid for covariance-stationary processes. By encoding it into the initialization, CMoS gives the optimizer a head start that the paper shows roughly halves convergence time.
There is also a deeper connection worth noting. In the Wold representation, the stochastic component is $S_t = \sum_{j=0}^{\infty} \psi_j \epsilon_{t-j}$. The coefficients $\psi_j$ describe how past innovations influence the present. The learned correlation weights $\theta_{ij}$ serve an analogous role: they encode the temporal impulse response of the system, chunked and truncated to a finite window. This is why the learned matrices are directly interpretable.
Results: 750 Parameters vs. the Field
CMoS was tested on 7 standard datasets against both heavyweight Transformers and lightweight linear models, across horizons $H \in \{96, 192, 336, 720\}$.
| Model | Electricity | Traffic | Weather | ETTh1 | ETTh2 | Params |
|---|---|---|---|---|---|---|
| PatchTST | 0.171 | 0.397 | 0.224 | 0.429 | 0.351 | ~50 M |
| iTransformer | 0.163 | 0.397 | 0.232 | 0.439 | 0.370 | ~5 M |
| DLinear | 0.167 | 0.428 | 0.242 | 0.430 | 0.470 | ~75 K |
| SparseTSF | 0.165 | 0.412 | 0.240 | 0.406 | 0.344 | ~10 K |
| CMoS | 0.158 | 0.396 | 0.220 | 0.403 | 0.331 | ~750 |
MSE averaged over H ∈ {96, 192, 336, 720}. Bold = best in column. Parameter counts approximate for ETTh1.
Three observations stand out. First, CMoS dominates high-channel datasets. On Electricity (321 channels), Traffic (862 channels), and Weather (21 channels), it achieves the best MSE across the board — exactly the setting where Correlation Mixing matters most. Second, the parameter gap is striking: CMoS uses ~750 parameters while PatchTST uses ~50 million, a factor of roughly 66,000×, and CMoS still wins on ETTh1 MSE (0.403 vs. 0.429). Third, inference efficiency follows directly. On Electricity, CMoS uses 2.96G FLOPs and 252MB GPU memory; PatchTST uses 1,196G FLOPs and 22GB — roughly a 400× compute reduction and an 87× memory reduction.
Interpretability: Reading the Model’s Mind
This is the most underrated aspect of CMoS. Because the correlation matrices directly encode “how much does past chunk $j$ influence future chunk $i$”, you can look at the learned weights and understand what the model learned.
The authors visualize the four basis matrices learned on the Weather dataset (chunk size = 4, sampled every 10 minutes, so 36 chunks = 1 day). The four matrices each specialize: one captures diffuse residual corrections across all lags, one shows a sharp stripe at lag 36 (daily periodicity), one concentrates weight exclusively on the most recent chunks (short-term momentum), and one reflects multi-day periodic dependencies. When channels are then assigned mixing weights, a slow-drifting channel with no clear periodicity loads almost entirely on the short-term matrix; a strongly periodic channel splits its weight across the daily and multi-day matrices.
This level of interpretability is rare in deep learning. You do not just get a prediction — you get a decomposition of why the model is making that prediction. For practitioners deploying forecasting models in energy, finance, or operations, the ability to inspect and sanity-check these matrices before trusting them in production is genuinely valuable.
Takeaway
CMoS is a reminder that the best inductive bias is the one that matches the structure of your problem. Time series data in finance, energy, and operations is often governed by a few simple temporal patterns: periodicity, short-term momentum, long-term trends. CMoS encodes exactly this structure. Chunking provides noise robustness through weight averaging (Theorem 3.2). Correlation Mixing provides a low-rank basis for temporal structures, in the same way PCA provides a low-rank basis for covariance matrices. Periodicity Injection provides a head start on the deterministic component, mirroring Wold’s decomposition.
The result is state-of-the-art forecasting with 750 parameters. The next time you reach for a Transformer to forecast a time series, it is worth asking whether you genuinely need 50 million parameters to predict electricity demand, or whether the signal lives in a structure so simple that a few hundred weights can capture it entirely.
Frequently Asked Questions
What is CMoS and why does it matter? CMoS is a time series forecasting model (ICML 2025) that achieves state-of-the-art accuracy on standard benchmarks with roughly 750 parameters, compared to tens of millions for Transformer-based models. It matters because real-world time series datasets are small, making overfitting the primary challenge, and CMoS’s inductive biases are matched to that regime.
How does chunk-wise modeling reduce overfitting? By grouping time steps into chunks and modeling chunk-to-chunk dependencies, CMoS reduces the weight matrix size by a factor of $S^2$. Theorem 3.2 proves that the resulting weight averaging also strictly reduces noise sensitivity, providing a formal analogue of Ridge regularization.
What is Correlation Mixing? Correlation Mixing learns $K$ shared basis correlation matrices (typically $K = 4$) and assigns each channel a learned softmax weighting over these matrices. This is analogous to PCA: rather than modeling every channel independently ($O(N)$ parameter sets) or all channels identically (one shared structure), it discovers a low-dimensional basis for the space of temporal dependencies.
Why is Periodicity Injection useful? Initializing one basis matrix with periodic peaks gives the optimizer a head start on the deterministic component of the signal, in the sense of Wold’s decomposition. In practice, the paper reports that this initialization roughly halves convergence time.
Can CMoS run on limited hardware? Yes. On the Electricity dataset, CMoS uses 2.96G FLOPs and 252MB of GPU memory at inference time, compared to over 1,000G FLOPs and 22GB for PatchTST. It can realistically run on edge hardware.
Paper: CMoS: Rethinking Time Series Prediction Through the Lens of Chunk-wise Spatial Correlations — Haotian Si, Changhua Pei, Jianhui Li, Dan Pei, Gaogang Xie. ICML 2025.