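“When ‘Bad’ Is ‘Good’”: Identifying Personal Communication and Sentiment in Drug-Related Tweets
%A Daniulaityte,Raminta
%A Chen,Lu
%A Lamy,Francois R
%A Carlson,Robert G
%A Thirunarayan,Krishnaprasad
%A Sheth,Amit
%+ Center for Interventions, Treatment, and Addictions Research, Department of Population and Public Health Sciences, Boonshoft School of Medicine, Wright State University, Kettering, OH, 45420, United States, 1 937 775 1411, raminta.daniulaityte@wright.edu
social media; Twitter; cannabis; synthetic cannabinoids; machine learning; sentiment analysis; eDrugTrends
Background: To harness the full potential of social media for epidemiological surveillance of drug abuse trends, the field needs a greater level of automation in processing and analyzing social media content.
Objective: The objective of this study is to describe the development of supervised machine-learning techniques for the eDrugTrends platform that automatically classify tweets by type/source of communication (personal, official/media, retail) and by the sentiment (positive, negative, neutral) expressed in cannabis- and synthetic cannabinoid–related tweets.
Methods: Tweets were collected using the Twitter streaming application programming interface and filtered through the eDrugTrends platform using keywords related to cannabis, cannabis edibles, cannabis concentrates, and synthetic cannabinoids. After coding rules were created and intercoder reliability was assessed, a manually labeled data set (N=4000) was developed by coding several batches of randomly selected subsets of the 15,623,869 tweets collected by eDrugTrends (May-November 2015). Of the 4000 tweets, 25% (1000/4000) were used to build source classifiers and 75% (3000/4000) were used for sentiment classifiers. Logistic regression (LR), naive Bayes (NB), and support vector machines (SVM) were used to train the classifiers. Source classification (n=1000) tested Approach 1, which used short URLs, and Approach 2, in which URLs were expanded and included in the bag-of-words analysis. For sentiment classification, Approach 1 used all tweets regardless of their source/type (n=3000), whereas Approach 2 applied sentiment classification to personal communication tweets only (2633/3000, 88%). Multiclass and binary classification tasks were examined, and machine-learning sentiment classifier performance was compared with Valence Aware Dictionary for sEntiment Reasoning (VADER), a lexicon and rule-based method. The performance of each classifier was assessed using 5-fold cross-validation, with F-scores averaged across folds. A one-tailed t test was used to determine whether differences in F-scores were statistically significant.
Results: In multiclass source classification, the use of expanded URLs did not significantly improve classifier performance (0.7972 vs 0.8102 for SVM, P=.19). In binary classification, the identification of all source categories improved significantly when unshortened URLs were used, with personal communication tweets benefiting the most (0.8736 vs 0.8200, P<.001). In multiclass sentiment classification under Approach 1, SVM (0.6723) performed similarly to NB (0.6683) and LR (0.6703). Under Approach 2, SVM (0.7062) did not differ from NB (0.6980, P=.13) or LR (F=0.6931, P=.05), but it was over 40% more accurate than VADER (F=0.5030, P<.001).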
In the multiclass task, improvements in sentiment classification (Approach 2 vs Approach 1) did not reach statistical significance (eg, SVM: 0.7062 vs 0.6723, P=.052). In binary sentiment classification (positive vs negative), Approach 2 (personal communication tweets only) improved classification results compared with Approach 1 for LR (0.8752 vs 0.8516, P=.04) and SVM (0.8800 vs 0.8557, P=.045).
Conclusions: The study provides an example of the use of supervised machine-learning methods to categorize cannabis- and synthetic cannabinoid–related tweets with fairly high accuracy. Used together with the geographic identification capabilities developed by the eDrugTrends platform, these content analysis tools will provide powerful methods for tracking changes in user opinions about cannabis and synthetic cannabinoid use over time and across regions.
%M 27777215
%R 10.2196/publichealth.6327
%U http://publichealth.www.mybigtv.com/2016/2/e162/
%U https://doi.org/10.2196/publichealth.6327
%U http://www.ncbi.nlm.nih.gov/pubmed/27777215
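The evaluation described in the abstract (per-fold F-scores from 5-fold cross-validation, compared with a one-tailed t test) can be illustrated with a minimal sketch. The abstract does not say whether the t test was paired, so a paired test over folds is assumed here. The per-fold scores below are made-up illustrative numbers, not the study's data, and the function names are hypothetical. Only the SVM and VADER averages (0.7062 and 0.5030) are taken from the abstract.

```python
from math import sqrt
from statistics import mean, stdev


def f_score(tp, fp, fn):
    """F-score (harmonic mean of precision and recall) from raw counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def paired_t_statistic(scores_a, scores_b):
    """t statistic for the mean per-fold difference (a - b).

    Assumption: folds are paired, i.e. both approaches were scored
    on the same 5 cross-validation splits.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Hypothetical per-fold F-scores (NOT the study's data).
approach_2 = [0.88, 0.87, 0.89, 0.86, 0.90]
approach_1 = [0.85, 0.86, 0.86, 0.84, 0.87]

# Critical value of Student's t: one-tailed, alpha = .05, 4 degrees of freedom.
T_CRITICAL_DF4 = 2.132

t = paired_t_statistic(approach_2, approach_1)
significant = t > T_CRITICAL_DF4

# Relative gain of SVM over VADER, using the averages reported in the abstract.
relative_gain = (0.7062 - 0.5030) / 0.5030

print(f"t = {t:.2f}, significant at .05 (one-tailed): {significant}")
print(f"SVM vs VADER relative gain: {relative_gain:.1%}")
```

With five folds the test has only 4 degrees of freedom, so small differences in average F-score can fall on either side of the .05 threshold, which is consistent with the borderline P values reported in the abstract.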