@Article{info:doi/10.2196/26616, author="Yang, Yuan-Chi and Al-Garadi, Mohammed Ali and Bremer, Whitney and Zhu, Jane M and Grande, David and Sarker, Abeed", title="Developing an Automatic System for Classifying Chatter About Health Services on Twitter: Case Study for Medicaid", journal="J Med Internet Res", year="2021", month="5", day="3", volume="23", number="5", pages="e26616", keywords="natural language processing; machine learning; Twitter; infodemiology; infoveillance; social media; Medicaid", abstract="Background: The wide adoption of social media in daily life makes it a rich and effective resource for near real-time assessment of consumers' perceptions of health services. However, using it for such assessments can be challenging because of the vast volume and diversity of social media chatter. Objective: This study aims to develop and evaluate an automatic system involving natural language processing and machine learning to automatically characterize user-posted Twitter data about health services, using Medicaid in the United States as an example. Methods: We collected data from Twitter in two ways: via the public streaming application programming interface using Medicaid-related keywords (Corpus 1) and by using the website's search option for tweets mentioning agency-specific handles (Corpus 2). We manually labeled a sample of tweets in 5 predetermined categories or other and artificially increased the number of training posts from specific low-frequency categories. Using the manually labeled data, we trained and evaluated several supervised learning algorithms, including support vector machine, random forest (RF), na{\"i}ve Bayes, shallow neural network (NN), k-nearest neighbor, bidirectional long short-term memory, and bidirectional encoder representations from transformers (BERT). We then applied the best-performing classifier to the collected tweets for postclassification analyses to assess the utility of our methods. Results: We manually annotated 11,379 tweets (Corpus 1: 9179; Corpus 2: 2200) and used 7930 (69.7{\%}) for training, 1449 (12.7{\%}) for validation, and 2000 (17.6{\%}) for testing. A classifier based on BERT obtained the highest accuracies (81.7{\%}, Corpus 1; 80.7{\%}, Corpus 2) and F1 scores on consumer feedback (0.58, Corpus 1; 0.90, Corpus 2), outperforming the second-best classifiers in terms of accuracy (74.6{\%}, RF on Corpus 1; 69.4{\%}, RF on Corpus 2) and F1 score on consumer feedback (0.44, NN on Corpus 1; 0.82, RF on Corpus 2).
Postclassification analyses revealed differing intercorpora distributions of tweet categories, with political (400778/628411, 63.78{\%}) and consumer feedback (15073/27337, 55.14{\%}) tweets being the most frequent for Corpus 1 and Corpus 2, respectively. Conclusions: The broad and variable content of Medicaid-related tweets necessitates automatic categorization to identify topic-relevant posts. Our proposed system presents a feasible solution for automatic categorization and can be deployed and generalized for health service programs other than Medicaid. Annotated data and methods are available for future studies. ", issn="1438-8871", doi="10.2196/26616", url="//www.mybigtv.com/2021/5/e26616", url="https://doi.org/10.2196/26616", url="http://www.ncbi.nlm.nih.gov/pubmed/33938807" }