Evaluating and Benchmarking Time Series Foundational Models for Public Healthcare Demand Forecasting

by Harsha Halgamuwe Hewage, PhD student

Lead supervisor: Prof. Bahman Rostami-Tabar

Co-supervisors: Prof. Aris Syntetos & Dr. Federico Liberatore

2026/06/30

Outline

Motivation: why this problem matters

The planning reality

Why TSFMs look attractive

Forecast quality beyond accuracy

Experimental design and results

Where this work goes next

121 million unintended pregnancies occur each year worldwide.

Over 60% of unintended pregnancies end in abortion,

and over 45% of abortions are unsafe.

What the planning reality looks like

“Forecasting tools exist, but the last-mile planning reality is still Excel, paper forms, manual adjustments, and fragmented systems.”

Ethiopia field visit and collaborator feedback

“A tool can be ‘in use’, but only in a limited scope, during a project period, or in central sites, without scaling to the whole country.”

DHIS2/HISP collaborator feedback

Why time series foundation models look attractive

Public health systems have many related but weak individual time series.

TSFMs may transfer learned temporal patterns from large-scale pretraining.

They can reduce the need for long local histories, manual feature engineering, and repeated local model building.

This creates the potential to democratise forecasting support for facilities where planning is often done by nurses, store managers, or non-specialist staff.

But accuracy is (not) all we need

Do time series foundation models provide better forecast quality than statistical baselines in public health supply chains?

Data used in this study

Country	Domains	Period	Geographic granularity	Product granularity	Series
Côte d’Ivoire	Family planning	Apr 2016 to Sep 2019	Facility / site	Contraceptive product	1,360
Lao PDR	Family planning	Jan 2019 to Dec 2023	Service delivery unit / org unit	Contraceptive product	1,446
Pakistan	Family planning	Jan 2019 to Jun 2025	Service delivery unit / org unit	Contraceptive product	1,313
Lao PDR	General/acute care; Maternal & newborn health; Malaria; Immunisation/EPI; Non-communicable diseases	Jan 2019 to Dec 2023	Service delivery unit / org unit	Health commodity product	20,736

Experimental design

Note: Experiments were run locally on an Intel Core Ultra 9 185H processor with 32 GB RAM between Jan–Mar 2026. Runtime bottlenecks and execution failures should be interpreted in this testing context.

Results shown: family planning data, 3-month forecast horizon.
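
The runtime figures later in the deck refer to cross-validation, and the slides mention rolling evaluations, but the exact split scheme is not stated. The sketch below assumes a rolling-origin design with an expanding training window and a 3-step test window; the function name, the number of windows, and the step size are illustrative assumptions, not the study's configuration.

```python
def rolling_origin_splits(n_obs, horizon, n_windows, step=1):
    """Return (train_range, test_range) index pairs, earliest origin first.

    Assumed scheme: expanding training window, fixed-length test window.
    """
    splits = []
    for w in range(n_windows):
        test_end = n_obs - w * step
        test_start = test_end - horizon
        if test_start <= 0:
            break  # not enough history left to train on
        splits.append((range(0, test_start), range(test_start, test_end)))
    return list(reversed(splits))


# Hypothetical series of 12 monthly observations, 3-month horizon, 3 windows.
splits = rolling_origin_splits(n_obs=12, horizon=3, n_windows=3)
for train_idx, test_idx in splits:
    print(len(train_idx), list(test_idx))
```

Each model is refit or re-queried once per window, which is why total cross-validation runtime grows with both the number of series and the number of windows.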

A quick look at the data

Figure 1: Time series characteristics across the family planning dataset. Panel (a) illustrates trend versus seasonal strength, while Panel (b) displays demand variability against intermittency. The color scale encodes spectral entropy, where lighter colors indicate more predictable, structured dynamics.
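
The caption does not say how intermittency and demand variability were measured. A common choice in intermittent-demand work is the average inter-demand interval (ADI) and the squared coefficient of variation of non-zero demand sizes (CV²). The sketch below assumes those two measures, with ADI approximated as periods per non-zero period; the function name and the example series are hypothetical.

```python
from statistics import mean, pvariance


def adi_cv2(demand):
    """Return (ADI, CV^2) for one demand series.

    ADI  = number of periods / number of non-zero periods (approximation).
    CV^2 = population variance of non-zero demand sizes / squared mean.
    """
    nonzero = [d for d in demand if d > 0]
    if not nonzero:
        raise ValueError("series has no non-zero demand")
    adi = len(demand) / len(nonzero)
    mu = mean(nonzero)
    cv2 = pvariance(nonzero) / mu ** 2
    return adi, cv2


# Hypothetical monthly consumption for one product at one facility.
series = [0, 4, 0, 0, 2, 0, 6, 0]
adi, cv2 = adi_cv2(series)
print(round(adi, 3), round(cv2, 3))
```

Higher ADI means demand occurs less often; higher CV² means the non-zero quantities are more erratic.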

TSFMs can improve point accuracy

Figure 2: Multiple Comparisons with the Best (MCB) test results for RMSSE.
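
For readers unfamiliar with RMSSE, the sketch below assumes the usual M5-style definition: the root of the out-of-sample mean squared error divided by the in-sample mean squared error of the one-step naive forecast. The slides do not spell out the scaling used, so treat the denominator as an assumption; the data are made up.

```python
from math import sqrt


def rmsse(train, actual, forecast):
    """Root mean squared scaled error for one series (assumed M5-style)."""
    # In-sample scale: MSE of the one-step naive forecast y[t-1] -> y[t].
    scale = sum(
        (train[t] - train[t - 1]) ** 2 for t in range(1, len(train))
    ) / (len(train) - 1)
    if scale == 0:
        raise ValueError("constant training series: scale is zero")
    mse = sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)
    return sqrt(mse / scale)


# Hypothetical series: 4 training points, 3-step-ahead forecast.
value = rmsse(train=[2, 4, 3, 5], actual=[4, 6, 5], forecast=[5, 5, 5])
print(round(value, 4))
```

A value below 1 means the forecast errors are smaller than the in-sample naive errors on that series.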

Probabilistic performance tells a similar story

Figure 3: Multiple Comparisons with the Best (MCB) test results for sPIN.
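
MCB plots are built on the mean rank of each model across series, computed from a per-series loss such as RMSSE or sPIN. The sketch below covers only that ranking step, with tied losses sharing the average rank; it does not compute the critical-distance band that the test draws around each mean rank. Model names and losses are hypothetical.

```python
def average_ranks(scores):
    """scores: dict of model -> list of per-series losses (lower is better).

    Returns dict of model -> mean rank across series.
    """
    models = list(scores)
    n_series = len(next(iter(scores.values())))
    totals = {m: 0.0 for m in models}
    for i in range(n_series):
        ordered = sorted(models, key=lambda m: scores[m][i])
        j = 0
        while j < len(ordered):
            # Extend k to the end of the run of models tied with ordered[j].
            k = j
            while (k + 1 < len(ordered)
                   and scores[ordered[k + 1]][i] == scores[ordered[j]][i]):
                k += 1
            tied_rank = (j + k) / 2 + 1  # mean of 1-based positions j+1..k+1
            for m in ordered[j:k + 1]:
                totals[m] += tied_rank
            j = k + 1
    return {m: totals[m] / n_series for m in models}


# Hypothetical losses for three models on three series.
ranks = average_ranks({
    "naive": [1.0, 0.9, 1.2],
    "ets": [0.8, 0.9, 1.0],
    "tsfm": [0.7, 0.6, 1.1],
})
print(ranks)
```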

Aggregate loss can hide miscalibration

Figure 4: Scaled quantile loss and empirical versus nominal coverage at forecast horizon 3 for the highlighted models.
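
To make the distinction concrete, the sketch below computes an unscaled quantile (pinball) loss and the empirical coverage of a single quantile forecast. It assumes coverage means the share of actuals at or below the forecast quantile, and it omits the scaling step because the slides do not define it. The numbers are invented.

```python
def pinball_loss(actuals, quantile_forecasts, q):
    """Mean pinball loss of forecasts for quantile level q (0 < q < 1)."""
    total = 0.0
    for y, f in zip(actuals, quantile_forecasts):
        # Under-forecasts are weighted by q, over-forecasts by (1 - q).
        total += q * (y - f) if y >= f else (1 - q) * (f - y)
    return total / len(actuals)


def empirical_coverage(actuals, quantile_forecasts):
    """Share of actuals at or below the forecast quantile."""
    hits = sum(1 for y, f in zip(actuals, quantile_forecasts) if y <= f)
    return hits / len(actuals)


# Hypothetical 90% quantile forecasts for four periods.
actuals = [10, 12, 8, 15]
q90 = [14, 13, 9, 12]
loss = pinball_loss(actuals, q90, q=0.9)
coverage = empirical_coverage(actuals, q90)
print(round(loss, 3), coverage)
```

Here the nominal level is 0.90 but only 75% of actuals fall at or below the forecast, so the quantile is too low even though the average loss alone would not reveal the direction of the error.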

Figure 5: Scaled quantile loss and empirical versus nominal coverage at forecast horizon 3 for the top-ranked models.

Figure 6: Scaled quantile loss and empirical versus nominal coverage at forecast horizon 3 for the bottom-ranked models.

Useful gains are available within minutes

Figure 7: Models with total cross-validation runtime of up to 10 minutes are highlighted.

The operational shortlist expands within one hour

Figure 8: Models with total cross-validation runtime of up to one hour are highlighted.

Accuracy gains come with computational costs

Figure 9: Average point-forecast rank versus total cross-validation runtime across all evaluated models.
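
The runtime-budget views in Figures 7 to 9 can be reproduced from a table of average rank and total runtime per model. The sketch below keeps models within a runtime budget and orders them by average rank; the model names, ranks, and runtimes are placeholders, not results from this study.

```python
def shortlist(models, budget_minutes):
    """Models whose total runtime fits the budget, best average rank first."""
    within = [m for m in models if m["runtime_min"] <= budget_minutes]
    return sorted(within, key=lambda m: m["avg_rank"])


# Placeholder values for illustration only.
models = [
    {"name": "model_a", "avg_rank": 6.0, "runtime_min": 2.0},
    {"name": "model_b", "avg_rank": 3.5, "runtime_min": 8.0},
    {"name": "model_c", "avg_rank": 2.0, "runtime_min": 45.0},
    {"name": "model_d", "avg_rank": 1.5, "runtime_min": 300.0},
]

ten_minutes = [m["name"] for m in shortlist(models, 10)]
one_hour = [m["name"] for m in shortlist(models, 60)]
print(ten_minutes, one_hour)
```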

Maintainability and debuggability

High runtime
Too slow for large rolling evaluations.
e.g., Chronos-t5-large, TimesFM-2-500m

Unstable outputs
Negative or extreme forecast values.
e.g., Chronos-2-synthetic

Configuration failure
Model paths, dependencies, API issues.
e.g., Lag-Llama, Chronos-2-LoRA

Fine-tuning risk
Local fine-tuning can worsen performance.
e.g., TimeGPT, Chronos-2
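
Unstable outputs of the kind listed above can be caught with a simple sanity check before forecasts reach a planner. The sketch below flags negative values and values far above anything seen in the history; the function name and the 10x threshold are assumptions chosen for illustration, not the screening rule used in this study.

```python
def flag_unstable(history, forecast, max_multiple=10.0):
    """Return (step, reason) pairs for suspicious forecast values.

    'negative': demand cannot be below zero.
    'extreme' : value exceeds max_multiple times the historical maximum.
    """
    ceiling = max_multiple * max(history)
    issues = []
    for step, value in enumerate(forecast, start=1):
        if value < 0:
            issues.append((step, "negative"))
        elif ceiling > 0 and value > ceiling:
            issues.append((step, "extreme"))
    return issues


# Hypothetical history and 3-step forecast for one series.
issues = flag_unstable(history=[3, 0, 5, 2], forecast=[4.0, -1.2, 80.0])
print(issues)
```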

The operational question is not “TSFM or baseline?”

It is:

Which model gives enough forecast quality to justify its complexity?

Accuracy shows whether it improves forecasts.

Calibration shows whether uncertainty can be trusted.

Runtime shows whether it can scale.

Maintainability and debuggability show whether it can survive deployment.

Where this work goes next

From forecast accuracy to decision value
Evaluate whether better forecasts improve ordering, allocation, stock availability, and budget use.

From benchmark results to model-selection guidance
Develop a practical framework for deciding when simple baselines are enough and when TSFMs are justified.

From cloud inference to data governance
Account for country-level data residency, privacy, and infrastructure constraints.

From offline evaluation to embedded systems
Test how these methods work inside LMIS, DHIS2, CHAP, or other routine planning workflows.

Acknowledgements

Thank you to Breno Horsth at DHIS2, Mariana Menchero at Nixtla, and Laila Akhlaghi at JSI.

Thank you…

Open for questions and discussion.

If you found this talk useful, I would be grateful if you considered voting for me for the Best Student Presentation in the Whova app.

Appendix

Experimental design

TSFM pre-screening exclusions

Model	Reason for exclusion / screening flag
Chronos-2-synthetic	Unreliable forecast outputs, with many forecasts taking negative values.
Chronos-2-LoRA	Included in Stage 1 only; excluded from Stage 2 due to dependency issues.
Chronos-t5-large	Excluded during screening due to high computational runtime.
Chronos-t5-base	Flagged/excluded from later screening due to high computational runtime.
Flan-T5	Excluded during screening due to high computational runtime.
Lag-Llama	Excluded during screening due to model path errors.
Moirai-1.0-R-large	Excluded during screening due to high computational runtime.
Sundial	Excluded during screening due to high computational runtime.
TimeGPT-1-FT-Full	Excluded because global fine-tuning requires uniform time series lengths, incompatible with heterogeneous operational supply chain data.
TimesFM-2-500m	Excluded during screening due to high computational runtime.
Toto	Excluded during screening due to high computational runtime.

Country-level model ranking: RMSSE

Figure 10: Country-level Multiple Comparisons with the Best (MCB) test results for RMSSE at forecast horizon 3.

Country-level model ranking: sPIN

Figure 11: Country-level Multiple Comparisons with the Best (MCB) test results for sPIN at forecast horizon 3.

Accuracy gains come with computational costs

Figure 12: Average probabilistic-forecast rank versus total cross-validation runtime at forecast horizon 3.