Background: HealthStats NSW (http://www.healthstats.nsw.gov.au/) is an open data platform for large-scale reporting of summarised statistics from more than 10 routinely collected, large administrative data sources across a wide range of health topics. The data are regularly updated using reproducible code and semi-automated processes, which include producing small-area estimates and confidence intervals at Local Government Area (LGA) level using spatial smoothing methods. Applying such complex statistical methods to larger datasets is feasible only with more efficient data analysis workflows for ongoing large-scale reporting.
Aim: To improve processing efficiency for large-scale public reporting of small-area life expectancy estimates.
Methods: We explored parallel computation on multi-core processors, together with changes to data management and statistical procedures, as ways to reduce analysis time without compromising the validity of estimates.
Results: We estimated age- and sex-specific death rates by LGA and year using generalised additive models and produced estimates of life expectancy with bootstrapped 95% confidence intervals (CIs). Sequential computation of CIs using 10,000 bootstraps took days to complete; however, the independence of the model coefficients used for bootstrapping made this an ‘embarrassingly parallel’ workload, which completed in hours when run in parallel. The approach also involved determining an optimal number of bootstrap replications for valid inference and required careful handling of intermediate data to overcome memory limitations.
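The abstract does not include code, so the following is only a minimal Python sketch of the embarrassingly parallel pattern described in the Results, under stated assumptions. It does not reproduce the generalised additive models or spatial smoothing: as a stand-in, each replicate resamples death counts from a Poisson distribution and converts the resulting rates to life expectancy with a simple life table. All names (`life_expectancy`, `parallel_bootstrap_ci`, `executor_cls`) and the chunking scheme are hypothetical. Each worker returns only one number per replicate, which is one plausible way to keep intermediate data small.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

import numpy as np


def life_expectancy(rates, widths):
    """Life expectancy at birth from age-specific death rates.

    Simplified life table: deaths are assumed to fall mid-interval, the last
    age group is open-ended, and its death rate is assumed to be positive.
    """
    rates = np.asarray(rates, dtype=float)
    widths = np.asarray(widths, dtype=float)
    # Probability of dying within each age interval.
    q = np.minimum(widths * rates / (1.0 + 0.5 * widths * rates), 1.0)
    q[-1] = 1.0  # everyone dies in the open-ended final group
    # Proportion surviving to the start of each interval.
    surv = np.concatenate(([1.0], np.cumprod(1.0 - q)[:-1]))
    deaths = surv * q
    # Person-years lived in each interval.
    person_years = widths * (surv - 0.5 * deaths)
    person_years[-1] = surv[-1] / rates[-1]
    return float(person_years.sum())


def _bootstrap_chunk(task):
    """Run one chunk of replicates; return only the life expectancies."""
    deaths, population, widths, n_reps, seed_seq = task
    rng = np.random.default_rng(seed_seq)
    out = np.empty(n_reps)
    for i in range(n_reps):
        simulated = rng.poisson(deaths)  # assumption: Poisson death counts
        out[i] = life_expectancy(simulated / population, widths)
    return out


def parallel_bootstrap_ci(deaths, population, widths, n_reps=1000,
                          n_workers=4, seed=0,
                          executor_cls=ProcessPoolExecutor):
    """Point estimate and percentile 95% CI, with replicates split by worker."""
    deaths = np.asarray(deaths, dtype=float)
    population = np.asarray(population, dtype=float)
    # Split the replicates into near-equal chunks, one per worker.
    sizes = [len(c) for c in np.array_split(np.arange(n_reps), n_workers)]
    # Independent random streams so chunks do not share random numbers.
    seeds = np.random.SeedSequence(seed).spawn(n_workers)
    tasks = [(deaths, population, widths, size, ss)
             for size, ss in zip(sizes, seeds) if size > 0]
    with executor_cls(max_workers=n_workers) as pool:
        chunks = list(pool.map(_bootstrap_chunk, tasks))
    replicates = np.concatenate(chunks)
    lower, upper = np.percentile(replicates, [2.5, 97.5])
    point = life_expectancy(deaths / population, widths)
    return point, float(lower), float(upper)
```

With the default `ProcessPoolExecutor`, the call should sit under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Passing `executor_cls=ThreadPoolExecutor` runs the same code in threads, which is convenient for quick checks but will not give the same multi-core speed-up.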
Conclusion: The use of parallel computation with modern epidemiological datasets will soon be a necessity for researchers using complex statistical methods, particularly when applying them to very large datasets.