Cardiff University Logo Data Lab Logo WGSSS Logo UKRI Logo

Evaluating and Benchmarking Time Series Foundational Models for Public Healthcare Demand Forecasting

by Harsha Halgamuwe Hewage, PhD student

Lead supervisor: Prof. Bahman Rostami-Tabar

Co-supervisors: Prof. Aris Syntetos & Dr. Federico Liberatore

2026/06/30



Outline

Motivation: why this problem matters

The planning reality

Why TSFMs look attractive

Forecast quality beyond accuracy

Experimental design and results

Where this work goes next





121 million unintended pregnancies occur each year worldwide.


Over 60% of unintended pregnancies end in abortion, and over 45% of abortions are unsafe.

What the planning reality looks like

“Forecasting tools exist, but the last-mile planning reality is still Excel, paper forms, manual adjustments, and fragmented systems.”

Ethiopia field visit and collaborator feedback

“A tool can be ‘in use’, but only in a limited scope, during a project period, or in central sites, without scaling to the whole country.”

DHIS2/HISP collaborator feedback

Why time series foundation models look attractive


  • Public health systems have many related but weak individual time series.
  • TSFMs may transfer learned temporal patterns from large-scale pretraining.
  • They can reduce the need for long local histories, manual feature engineering, and repeated local model building.
  • This creates the potential to democratise forecasting support for facilities where planning is often done by nurses, store managers, or non-specialist staff.

But accuracy is (not) all we need

Do time series foundation models provide better forecast quality than statistical baselines in public health supply chains?

Data used in this study

| Country | Domains | Period | Geographic granularity | Product granularity | Series |
|---|---|---|---|---|---|
| Côte d’Ivoire | Family planning | Apr 2016 to Sep 2019 | Facility / site | Contraceptive product | 1,360 |
| Lao PDR | Family planning | Jan 2019 to Dec 2023 | Service delivery unit / org unit | Contraceptive product | 1,446 |
| Pakistan | Family planning | Jan 2019 to Jun 2025 | Service delivery unit / org unit | Contraceptive product | 1,313 |
| Lao PDR | General / acute care; Maternal & newborn health; Malaria; Immunisation / EPI; Non-communicable diseases | Jan 2019 to Dec 2023 | Service delivery unit / org unit | Health commodity product | 20,736 |

Experimental design




Note: Experiments were run locally on an Intel Core Ultra 9 185H processor with 32 GB RAM between January and March 2026. Runtime bottlenecks and execution failures should be interpreted in this testing context.

Results shown: family planning data, 3-month forecast horizon.
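The runtime figures in this talk refer to total cross-validation runtime over rolling evaluations at a 3-month horizon. The slides do not state the number of windows, the step between origins, or the minimum training length, so the sketch below treats all of those as assumptions. The function name `rolling_origin_splits` and its parameters are hypothetical and only illustrate how expanding-window splits could be generated.

```python
def rolling_origin_splits(n_obs, horizon=3, n_windows=4, step=1, min_train=12):
    """Return (train_end, test_end) index pairs for expanding-window evaluation.

    Each split trains on observations [0, train_end) and tests on
    [train_end, test_end), with test_end - train_end == horizon.
    n_windows, step and min_train are illustrative assumptions.
    """
    splits = []
    for w in range(n_windows):
        # The final window ends at the last observation; each earlier
        # window ends `step` periods before the one that follows it.
        test_end = n_obs - (n_windows - 1 - w) * step
        train_end = test_end - horizon
        # Skip windows that would leave too little history to fit a model.
        if train_end >= min_train:
            splits.append((train_end, test_end))
    return splits


# Example: a 42-month series evaluated with four 3-month windows.
splits = rolling_origin_splits(n_obs=42, horizon=3, n_windows=4)
short_splits = rolling_origin_splits(n_obs=16, horizon=3, n_windows=4)
```

Short series lose their earliest windows under this scheme, which is one reason total runtime and the number of evaluated forecasts can differ across datasets with heterogeneous history lengths.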

A quick look at the data

Figure 1: Time series characteristics across the family planning dataset. Panel (a) illustrates trend versus seasonal strength, while Panel (b) displays demand variability against intermittency. The colour scale encodes spectral entropy, where lighter colours indicate more predictable, structured dynamics.
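The caption does not say how demand variability and intermittency were measured. One common choice in the intermittent-demand literature is the average inter-demand interval (ADI) paired with the squared coefficient of variation of non-zero demand sizes (CV²), classified with the widely used cut-offs of 1.32 and 0.49. The sketch below assumes that choice; the function names are hypothetical and the demand values are made up for illustration.

```python
def adi_cv2(demand):
    """Return (ADI, CV^2) for a demand history.

    ADI is approximated as the number of periods divided by the number of
    periods with non-zero demand. CV^2 is the squared coefficient of
    variation of the non-zero demand sizes (population variance).
    """
    nonzero = [d for d in demand if d > 0]
    if not nonzero:
        return float("inf"), float("nan")
    adi = len(demand) / len(nonzero)
    mean = sum(nonzero) / len(nonzero)
    var = sum((d - mean) ** 2 for d in nonzero) / len(nonzero)
    return adi, var / mean ** 2


def classify(adi, cv2, adi_cut=1.32, cv2_cut=0.49):
    """Label a series as smooth, erratic, intermittent or lumpy."""
    if adi < adi_cut:
        return "smooth" if cv2 < cv2_cut else "erratic"
    return "intermittent" if cv2 < cv2_cut else "lumpy"


# Illustrative monthly demand with many zero periods.
demand = [0, 4, 0, 0, 6, 0, 5, 0]
adi, cv2 = adi_cv2(demand)
label = classify(adi, cv2)
```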

TSFMs can improve point accuracy

Figure 2: Multiple Comparisons with the Best (MCB) test results for RMSSE.
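The slides do not define RMSSE. The sketch below assumes the usual definition popularised by the M5 competition: out-of-sample mean squared error divided by the in-sample mean squared error of the one-step naive forecast, then square-rooted. The function name and the example numbers are illustrative, not values from the study.

```python
import math


def rmsse(train, actual, forecast):
    """Root mean squared scaled error with naive one-step in-sample scaling."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    if len(train) < 2:
        raise ValueError("train needs at least two observations")
    # Scale: mean squared one-step difference of the training history.
    scale = sum(
        (train[t] - train[t - 1]) ** 2 for t in range(1, len(train))
    ) / (len(train) - 1)
    if scale == 0:
        return float("nan")  # constant history: the ratio is undefined
    mse = sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)
    return math.sqrt(mse / scale)


# Illustrative 3-month horizon.
value = rmsse(train=[10, 12, 11, 13], actual=[12, 14, 13], forecast=[13, 13, 13])
```

A value below 1 means the forecast errors are smaller than the typical month-to-month change in the training history, which makes the measure comparable across series of very different volume.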

Probabilistic performance tells a similar story

Figure 3: Multiple Comparisons with the Best (MCB) test results for sPIN.
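Figures 2 and 3 rank models with the MCB test. The slides do not describe the computation, so the sketch below assumes the usual rank-based construction: rank the models within each series, average the ranks, and draw an interval of half-width 0.5 · q · sqrt(K(K+1) / (12N)) around each mean rank, where K is the number of models, N the number of series, and q a critical value of the studentized range distribution supplied by the caller. The function names, model names, and error values are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values in ascending order (1 = best), averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = tied_rank
        i = j + 1
    return ranks


def mcb_intervals(errors, q):
    """Return {model: (lower, mean_rank, upper)} from per-series errors.

    errors maps each model name to a list of errors, one per series.
    q is the studentized-range critical value (assumed to come from a table).
    """
    models = list(errors)
    n_series = len(errors[models[0]])
    k = len(models)
    totals = [0.0] * k
    for s in range(n_series):
        for i, r in enumerate(average_ranks([errors[m][s] for m in models])):
            totals[i] += r
    half = 0.5 * q * math.sqrt(k * (k + 1) / (12 * n_series))
    return {
        m: (totals[i] / n_series - half, totals[i] / n_series, totals[i] / n_series + half)
        for i, m in enumerate(models)
    }


# Made-up errors for three hypothetical models on four series.
errors = {
    "model_a": [0.8, 0.9, 0.7, 1.0],
    "model_b": [1.0, 0.9, 0.9, 1.2],
    "model_c": [1.1, 1.3, 0.8, 1.1],
}
# 3.314 is the approximate 5% table value for three groups.
intervals = mcb_intervals(errors, q=3.314)
```

Models whose intervals overlap the interval of the best-ranked model are not distinguishable from it at the chosen level, which is why a simple baseline can sit in the same group as a much larger model.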

Aggregate loss can hide miscalibration

Figure 4: Scaled quantile loss and empirical versus nominal coverage at forecast horizon 3 for the highlighted models.
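The two quantities in this figure can be sketched as follows. The quantile (pinball) loss and interval coverage below use their standard definitions; the scaling step behind "scaled quantile loss" is not specified in the slides and is left out. Function names and data are illustrative assumptions.

```python
def pinball_loss(actual, quantile_forecast, tau):
    """Mean pinball loss of a forecast for quantile level tau (0 < tau < 1)."""
    total = 0.0
    for y, q in zip(actual, quantile_forecast):
        # Under-forecasts are weighted by tau, over-forecasts by 1 - tau.
        total += tau * (y - q) if y >= q else (1 - tau) * (q - y)
    return total / len(actual)


def empirical_coverage(actual, lower, upper):
    """Share of observations that fall inside [lower, upper]."""
    inside = sum(1 for y, lo, hi in zip(actual, lower, upper) if lo <= y <= hi)
    return inside / len(actual)


# Made-up observations with 10% and 90% quantile forecasts,
# i.e. a nominal 80% prediction interval.
actual = [10, 12, 8, 15, 11]
q10 = [7, 9, 7, 10, 9]
q90 = [13, 14, 12, 14, 13]

coverage = empirical_coverage(actual, q10, q90)
loss_q90 = pinball_loss(actual, q90, tau=0.9)
```

Comparing `coverage` with the nominal level per interval, rather than only averaging the loss over all quantiles, is what exposes a model whose intervals are systematically too narrow or too wide.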


Figure 5: Scaled quantile loss and empirical versus nominal coverage at forecast horizon 3 for the top-ranked models.


Figure 6: Scaled quantile loss and empirical versus nominal coverage at forecast horizon 3 for the bottom-ranked models.

Useful gains are available within minutes

Figure 7: Models with total cross-validation runtime of up to 10 minutes are highlighted.

The operational shortlist expands within one hour

Figure 8: Models with total cross-validation runtime of up to one hour are highlighted.

Accuracy gains come with computational costs

Figure 9: Average point-forecast rank versus total cross-validation runtime across all evaluated models.

Maintainability and debuggability


  • High runtime
    Too slow for large rolling evaluations.
    e.g., Chronos-t5-large, TimesFM-2-500m
  • Unstable outputs
    Negative or extreme forecast values.
    e.g., Chronos-2-synthetic
  • Configuration failure
    Model paths, dependencies, API issues.
    e.g., Lag-Llama, Chronos-2-LoRA
  • Fine-tuning risk
    Local fine-tuning can worsen performance.
    e.g., TimeGPT, Chronos-2
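The "unstable outputs" failure mode above can be caught with a simple screening check before any accuracy metric is computed. The sketch below is an assumption about how such a check might look, not the procedure used in the study: the function name `screen_forecasts` and the `extreme_factor` threshold are hypothetical, and the numbers are made up.

```python
def screen_forecasts(forecasts, history_max, extreme_factor=10.0):
    """Summarise implausible point forecasts for a demand series.

    A forecast counts as negative if it is below zero, and as extreme if it
    exceeds extreme_factor times the largest value seen in the history.
    The threshold is an illustrative assumption.
    """
    n = len(forecasts)
    negative = sum(1 for f in forecasts if f < 0)
    extreme = sum(1 for f in forecasts if f > extreme_factor * history_max)
    return {"share_negative": negative / n, "share_extreme": extreme / n}


# Four forecasts for a series whose historical maximum is 40 units.
report = screen_forecasts([12.0, -3.5, 8.0, 950.0], history_max=40)
```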

The operational question is not “TSFM or baseline?”


it is:

Which model gives enough forecast quality to justify its complexity?

  • Accuracy shows whether it improves forecasts.
  • Calibration shows whether uncertainty can be trusted.
  • Runtime shows whether it can scale.
  • Maintainability and debuggability show whether it can survive deployment.

Where this work goes next


  • From forecast accuracy to decision value
    Evaluate whether better forecasts improve ordering, allocation, stock availability, and budget use.
  • From benchmark results to model-selection guidance
    Develop a practical framework for deciding when simple baselines are enough and when TSFMs are justified.
  • From cloud inference to data governance
    Account for country-level data residency, privacy, and infrastructure constraints.
  • From offline evaluation to embedded systems
    Test how these methods work inside LMIS, DHIS2, CHAP, or other routine planning workflows.

Acknowledgements

Thank you to Breno Horsth at DHIS2, Mariana Menchero at Nixtla, and Laila Akhlaghi at JSI.

Thank you…

Open for questions and discussion.

If you found this talk useful, I would be grateful if you considered voting for me for the Best Student Presentation in the Whova app.

Appendix


TSFM pre-screening exclusions

| Model | Reason for exclusion / screening flag |
|---|---|
| Chronos-2-synthetic | Unreliable forecast outputs, with many forecasts taking negative values. |
| Chronos-2-LoRA | Included in Stage 1 only; excluded from Stage 2 due to dependency issues. |
| Chronos-t5-large | Excluded during screening due to high computational runtime. |
| Chronos-t5-base | Flagged/excluded from later screening due to high computational runtime. |
| Flan-T5 | Excluded during screening due to high computational runtime. |
| Lag-Llama | Excluded during screening due to model path errors. |
| Moirai-1.0-R-large | Excluded during screening due to high computational runtime. |
| Sundial | Excluded during screening due to high computational runtime. |
| TimeGPT-1-FT-Full | Excluded because global fine-tuning requires uniform time series lengths, incompatible with heterogeneous operational supply chain data. |
| TimesFM-2-500m | Excluded during screening due to high computational runtime. |
| Toto | Excluded during screening due to high computational runtime. |

Country-level model ranking: RMSSE

Figure 10: Country-level Multiple Comparisons with the Best (MCB) test results for RMSSE at forecast horizon 3.

Country-level model ranking: sPIN

Figure 11: Country-level Multiple Comparisons with the Best (MCB) test results for sPIN at forecast horizon 3.

Accuracy gains come with computational costs

Figure 12: Average probabilistic-forecast rank versus total cross-validation runtime at forecast horizon 3.
