@Article{info:doi/10.2196/29667, author="Yum, Yunjin and Lee, Jeong Moon and Jang, Moon jung and Kim, Yoojoong and Kim, Jong-Ho and Kim, Seongtae and Shin, Unsub and Song, Sanghoun and Joo, Hyung Joon", title="Word Pair Dataset for Semantic Similarity and Relatedness in Korean Medical Vocabulary: Reference Development and Validation", journal="JMIR Med Inform", year="2021", month="Jun", day="24", volume="9", number="6", pages="e29667", keywords="medical word pair; similarity; relatedness; word embedding; fastText", abstract="Background: Medical terms require special expertise and are becoming increasingly complex, which makes it difficult to use natural language processing techniques in medical informatics. Several human-validated reference standards for medical terms have been developed to evaluate word embedding models using the semantic similarity and relatedness of medical word pairs. However, there are very few reference standards in non-English languages. In addition, because the existing reference standards were developed a long time ago, an updated standard is needed to reflect recent findings in medical sciences. Objective: We propose a new Korean word pair reference set to verify embedding models. Methods: From January 2010 to December 2020, 518 medical textbooks, 72,844 health information news, and 15,698 medical research articles were collected, and the top 10,000 medical terms were selected to develop medical word pairs. Attending physicians (n=16) participated in the verification of the developed set with 607 word pairs. Results: The proportion of word pairs answered by all participants was 90.8{\%} (551/607) for the similarity task and 86.5{\%} (525/605) for the relatedness task. The similarity and relatedness of the word pair showed a high correlation ($\rho$=0.70, P<.001). The intraclass correlation coefficients to assess the interrater agreements of the word pair sets were 0.47 on the similarity task and 0.53 on the relatedness task. The final reference standard was 604 word pairs for the similarity task and 599 word pairs for relatedness, excluding word pairs with answers corresponding to outliers and word pairs that were answered by less than 50{\%} of all the respondents. When FastText models were applied to the final reference standard word pair sets, the embedding models learning medical documents had a higher correlation between the calculated cosine similarity scores compared to human-judged similarity and relatedness scores (namu, $\rho$=0.12 vs with medical text for the similarity task, $\rho$=0.47; namu, $\rho$=0.02 vs with medical text for the relatedness task, $\rho$=0.30).
Conclusions: Korean medical word pair reference standard sets for semantic similarity and relatedness were developed based on medical documents from the past 10 years. It is expected that our word pair reference sets will be actively utilized in the development of medical and multilingual natural language processing technology in the future. ", issn="2291-9694", doi="10.2196/29667", url="https://medinform.www.mybigtv.com/2021/6/e29667/", url="https://doi.org/10.2196/29667", url="http://www.ncbi.nlm.nih.gov/pubmed/34185005" }