%T Exploring Socioeconomic Status as a Global Determinant of COVID-19 Prevalence Using Exploratory Data Analysis and Supervised Machine Learning: Algorithm Development and Validation Study
%A Winston,Luke
%A McCann,Michael
%A Onofrei,George
%+ Department of Computing, Atlantic Technological University, Port Road, Letterkenny, F92 FC93, Ireland, 353 862435617, L00162644@student.lyit.ie
%K COVID-19
%K machine learning
%K data analytics
%K epidemiology
%K human development index
%D 2022
%7 27.9.2022
%9 Original Paper
%J JMIR Form Res
%G English
%X Background: The COVID-19 pandemic is an unprecedented global challenge of recent times. As the international community attempts to manage the pandemic in the long term, it is essential to understand which factors drive prevalence rates and to predict the future trajectory of the virus. Objective: This study had 2 objectives. First, it tested the statistical relationship between socioeconomic status and COVID-19 prevalence. Second, it used machine learning techniques to predict cumulative COVID-19 cases in a multicountry sample of 182 countries. Taken together, these objectives shed light on socioeconomic status as a global risk factor of the COVID-19 pandemic. Methods: This research used exploratory data analysis and supervised machine learning methods. The exploratory analysis covered variable distributions, variable correlations, and outlier detection. Next, 3 supervised regression techniques were applied: linear regression, random forest, and adaptive boosting (AdaBoost). Results were evaluated using k-fold cross-validation and then compared to assess algorithmic suitability. The analysis involved 2 models. First, the algorithms were trained to predict 2021 COVID-19 prevalence using only 2020 reported case data. Then, socioeconomic indicators were added as features and the algorithms were trained again. The Human Development Index (HDI) metrics of life expectancy, mean years of schooling, expected years of schooling, and gross national income were used to approximate socioeconomic status. Results: All variables correlated positively with the 2021 COVID-19 prevalence, with R² values ranging from 0.55 to 0.85. Using socioeconomic indicators, COVID-19 prevalence was predicted with a reasonable degree of accuracy. With 2020 reported case rates as the only predictor of 2021 prevalence rates, the average predictive accuracy of the algorithms was low (R²=0.543). When socioeconomic indicators were added alongside 2020 prevalence rates as features, the average predictive performance improved considerably (R²=0.721) and all error statistics decreased. Thus, adding socioeconomic indicators alongside 2020 reported case data considerably improved the prediction of COVID-19 prevalence.
Linear regression was the strongest learner, with R²=0.693 on the first model and R²=0.763 on the second model, followed by random forest (0.481 and 0.722) and AdaBoost (0.454 and 0.679). The second model was then retrained using a selection of additional COVID-19 risk factors (population density, median age, and vaccination uptake) instead of the HDI metrics. However, average accuracy dropped to 0.649, which highlights the value of socioeconomic status as a predictor of COVID-19 cases in the chosen sample. Conclusions: The results show that socioeconomic status is an important variable to consider in future epidemiological modeling, and they highlight the reality of the COVID-19 pandemic as both a social phenomenon and a health care phenomenon. This paper also puts forward new considerations about the application of statistical and machine learning techniques to understand and combat the COVID-19 pandemic.
%M 36001798
%R 10.2196/35114
%U https://formative.www.mybigtv.com/2022/9/e35114
%U https://doi.org/10.2196/35114
%U http://www.ncbi.nlm.nih.gov/pubmed/36001798
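
The evaluation procedure named in the abstract, k-fold cross-validation scored by R², can be illustrated with a minimal sketch. This is not the authors' code: it uses a single-feature least-squares line as a stand-in for the three learners, the data are synthetic, and every name (`fit_line`, `r_squared`, `kfold_r2`, `cases_2020`, `cases_2021`) is hypothetical.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x on a single feature."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    my = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - my) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def kfold_r2(xs, ys, k=5, seed=0):
    """Mean out-of-fold R^2 across k shuffled folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint index sets
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Fit on the training folds only, then score on the held-out fold.
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        preds = [a + b * xs[i] for i in held_out]
        scores.append(r_squared([ys[i] for i in held_out], preds))
    return mean(scores)


# Synthetic stand-in data (NOT the study's data): 182 "countries".
rng = random.Random(42)
cases_2020 = [rng.uniform(0, 100) for _ in range(182)]
cases_2021 = [3 * x + rng.gauss(0, 20) for x in cases_2020]
score = kfold_r2(cases_2020, cases_2021, k=5)
print(f"mean out-of-fold R^2: {score:.3f}")
```

Comparing two feature sets, as the abstract describes, would amount to running the same fold loop once per feature set and per learner and comparing the averaged scores.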