
by Harsha Halgamuwe Hewage, PhD student
Lead supervisor: Prof. Bahman Rostami-Tabar
Co-supervisors: Prof. Aris Syntetos & Dr. Federico Liberatore
2026/06/30


Outline
Motivation: why this problem matters
The planning reality
Why TSFMs look attractive
Forecast quality beyond accuracy
Experimental design and results
Where this work goes next
121 million unintended pregnancies occur each year worldwide.
Over 60% of unintended pregnancies end in abortion,
and over 45% of abortions are unsafe.

What the planning reality looks like
“Forecasting tools exist, but the last-mile planning reality is still Excel, paper forms, manual adjustments, and fragmented systems.”
Ethiopia field visit and collaborator feedback
“A tool can be ‘in use’, but only in a limited scope, during a project period, or in central sites, without scaling to the whole country.”
DHIS2/HISP collaborator feedback

Why time series foundation models look attractive
Do time series foundation models provide better forecast quality than statistical baselines in public health supply chains?
| Country | Domains | Period | Geographic granularity | Product granularity | Series |
|---|---|---|---|---|---|
| Côte d’Ivoire | Family planning | Apr 2016 to Sep 2019 | Facility / site | Contraceptive product | 1,360 |
| Lao PDR | Family planning | Jan 2019 to Dec 2023 | Service delivery unit / org unit | Contraceptive product | 1,446 |
| Pakistan | Family planning | Jan 2019 to Jun 2025 | Service delivery unit / org unit | Contraceptive product | 1,313 |
| Lao PDR | General/acute care; Maternal & newborn health; Malaria; Immunisation/EPI; Non-communicable diseases | Jan 2019 to Dec 2023 | Service delivery unit / org unit | Health commodity product | 20,736 |
Note: Experiments were run locally on an Intel Core Ultra 9 185H processor with 32 GB RAM between January and March 2026. Runtime bottlenecks and execution failures should be interpreted in this testing context.
Results shown: family planning data, 3-month forecast horizon.
Figure 1: Time series characteristics across the family planning dataset. Panel (a) illustrates trend versus seasonal strength, while Panel (b) displays demand variability against intermittency. The color scale encodes spectral entropy, where lighter colors indicate more predictable, structured dynamics.
Figure 2: Multiple Comparisons with the Best (MCB) test results for RMSSE.
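For readers unfamiliar with the metric ranked in Figure 2, the sketch below shows one common way to compute RMSSE for a single series: the mean squared forecast error is divided by the mean squared one-step naive error on the training data, and the square root is taken. The function name `rmsse` and the toy numbers are illustrative assumptions, and the exact variant used in the talk (for example, how leading zeros or zero-variance series are handled) is not stated here.

```python
import math


def rmsse(train, actual, forecast):
    """Root mean squared scaled error for one series (illustrative sketch)."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    if len(train) < 2:
        raise ValueError("need at least two training observations")
    # Scale: mean squared error of the one-step naive forecast in-sample.
    naive_sq_errors = [(b - a) ** 2 for a, b in zip(train, train[1:])]
    scale = sum(naive_sq_errors) / len(naive_sq_errors)
    if scale == 0:
        # Constant training series: the metric is undefined under this definition.
        return float("nan")
    mse = sum((y - f) ** 2 for y, f in zip(actual, forecast)) / len(actual)
    return math.sqrt(mse / scale)


# Hypothetical monthly demand for one product at one facility.
train = [10, 12, 11, 13, 12]
actual = [13, 12, 14]      # 3-month hold-out
forecast = [12, 12, 12]    # flat point forecast

score = rmsse(train, actual, forecast)
print(round(score, 4))
```

A value below 1 means the forecast beat the in-sample naive benchmark on squared error; a value above 1 means it did worse.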
Figure 3: Multiple Comparisons with the Best (MCB) test results for sPIN.
Figure 4: Scaled quantile loss and empirical versus nominal coverage at forecast horizon 3 for the highlighted models.
Figure 5: Scaled quantile loss and empirical versus nominal coverage at forecast horizon 3 for the top-ranked models.
Figure 6: Scaled quantile loss and empirical versus nominal coverage at forecast horizon 3 for the bottom-ranked models.
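The two quantities plotted in Figures 4 to 6 can be illustrated with a minimal sketch: the pinball (quantile) loss at a given level, and the empirical coverage of a prediction interval, which is then compared with its nominal level. The function names and toy data are assumptions for illustration only. The loss is left unscaled here because the scaling used in the talk is not specified in these slides.

```python
def pinball_loss(actual, quantile_forecast, q):
    """Average pinball (quantile) loss at level q, unscaled."""
    if not 0 < q < 1:
        raise ValueError("q must be strictly between 0 and 1")
    total = 0.0
    for y, f in zip(actual, quantile_forecast):
        diff = y - f
        # Under-forecasts are weighted by q, over-forecasts by (1 - q).
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(actual)


def empirical_coverage(actual, lower, upper):
    """Share of observations that fall inside [lower, upper]."""
    hits = sum(1 for y, lo, hi in zip(actual, lower, upper) if lo <= y <= hi)
    return hits / len(actual)


# Hypothetical observations and a 10%-90% interval (nominal coverage 0.8).
actual = [13, 12, 14, 9]
lower = [10, 10, 10, 10]   # assumed 0.1-quantile forecasts
upper = [13, 13, 13, 13]   # assumed 0.9-quantile forecasts

loss_q90 = pinball_loss(actual, upper, 0.9)
coverage = empirical_coverage(actual, lower, upper)
print(round(loss_q90, 4), coverage)
```

Empirical coverage below the nominal level suggests intervals that are too narrow, while coverage above it suggests intervals that are wider than they need to be.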
Figure 7: Models with total cross-validation runtime of up to 10 minutes are highlighted.
Figure 8: Models with total cross-validation runtime of up to one hour are highlighted.
Figure 9: Average point-forecast rank versus total cross-validation runtime across all evaluated models.

Maintainability and debuggability

The operational question is not “TSFM or baseline?”
It is:
Which model gives enough forecast quality to justify its complexity?

Where this work goes next
Thank you to Breno Horsth at DHIS2, Mariana Menchero at Nixtla, and Laila Akhlaghi at JSI.
If you found this talk useful, I would be grateful if you considered voting for me for the Best Student Presentation in the Whova app.
| Model | Reason for exclusion / screening flag |
|---|---|
| Chronos-2-synthetic | Unreliable forecast outputs, with many forecasts taking negative values. |
| Chronos-2-LoRA | Included in Stage 1 only; excluded from Stage 2 due to dependency issues. |
| Chronos-t5-large | Excluded during screening due to high computational runtime. |
| Chronos-t5-base | Flagged/excluded from later screening due to high computational runtime. |
| Flan-T5 | Excluded during screening due to high computational runtime. |
| Lag-Llama | Excluded during screening due to model path errors. |
| Moirai-1.0-R-large | Excluded during screening due to high computational runtime. |
| Sundial | Excluded during screening due to high computational runtime. |
| TimeGPT-1-FT-Full | Excluded because global fine-tuning requires uniform time series lengths, incompatible with heterogeneous operational supply chain data. |
| TimesFM-2-500m | Excluded during screening due to high computational runtime. |
| Toto | Excluded during screening due to high computational runtime. |
Figure 10: Country-level Multiple Comparisons with the Best (MCB) test results for RMSSE at forecast horizon 3.
Figure 11: Country-level Multiple Comparisons with the Best (MCB) test results for sPIN at forecast horizon 3.
Figure 12: Average probabilistic-forecast rank versus total cross-validation runtime at forecast horizon 3.