@Article{info:doi/10.2196/18387, author="Kweon, Solbi and Lee, Jeong Hoon and Lee, Younghee and Park, Yu Rang", title="Personal Health Information Inference Using Machine Learning on RNA Expression Data from Patients With Cancer: Algorithm Validation Study", journal="J Med Internet Res", year="2020", month="8", day="10", volume="22", number="8", pages="e18387", keywords="cancer; privacy issues; personal information; prediction; RNA sequence", abstract="Background: As the need for sharing genomic data grows, privacy issues and concerns, such as the ethics surrounding data sharing and the disclosure of personal information, have been raised. Objective: The main purpose of this study was to verify whether genomic data are sufficient to predict a patient's personal information. Methods: RNA expression data and matched patient personal information were collected from 9538 patients in The Cancer Genome Atlas program. Five personal information variables (age, gender, race, cancer type, and cancer stage) were recorded for each patient. Four different machine learning algorithms (support vector machine, decision tree, random forest, and artificial neural network) were used to determine whether a patient's personal information could be accurately predicted from RNA expression data. Performance measurement of the prediction models was based on the accuracy and area under the receiver operating characteristic curve. We selected five cancer types (breast carcinoma, kidney renal clear cell carcinoma, head and neck squamous cell carcinoma, low-grade glioma, and lung adenocarcinoma) with large sample sizes to verify whether predictive accuracy would differ between them. We also validated the efficacy of our four machine learning models in analyzing normal samples from 593 cancer patients. Results: In most samples, personal information with high genetic relevance, such as gender and cancer type, could be predicted from RNA expression data alone. The prediction accuracies for gender and cancer type, which were the best models, were 0.93-0.99 and 0.78-0.94, respectively. Other aspects of personal information, such as age, race, and cancer stage, were difficult to predict from RNA expression data, with accuracies ranging from 0.0026-0.29, 0.76-0.96, and 0.45-0.79, respectively. Among the tested machine learning methods, the highest predictive accuracy was obtained using the support vector machine algorithm (mean accuracy 0.77), while the lowest accuracy was obtained using the random forest method (mean accuracy 0.65). Gender and race were predicted more accurately than other variables in the samples. On average, the accuracy of cancer stage prediction ranged between 0.71-0.67, while the age prediction accuracy ranged between 0.18-0.23 for the five cancer types.
Conclusions: We attempted to predict patient information using RNA expression data. We found that some identifiers could be predicted, but most others could not. This study showed that personal information available from RNA expression data is limited and this information cannot be used to identify specific patients. ", issn="1438-8871", doi="10.2196/18387", url="//www.mybigtv.com/2020/8/e18387", url="https://doi.org/10.2196/18387", url="http://www.ncbi.nlm.nih.gov/pubmed/32773372" }