Automatic Linking of Terms from Scientific Texts with Knowledge Base Entities
https://doi.org/10.25205/1818-7900-2021-19-2-65-75
Abstract
As the number of scientific publications grows, tasks related to processing scientific articles are becoming increasingly relevant. Such texts have a distinctive structure and lexical and semantic content that must be taken into account during processing. Using information from knowledge bases can significantly improve the quality of text processing systems. This paper addresses the entity linking task for scientific articles in Russian, where scientific terms are treated as entities. We annotated a corpus of scientific texts in which each term is linked to an entity in a knowledge base. We also implemented an entity linking algorithm and evaluated it on this corpus. The algorithm consists of two stages: generating candidates for an input term, then ranking these candidates to choose the best match. Candidates are generated by string matching between the input term and the entities in the knowledge base. To rank the candidates and select the most relevant entity for a term, we use the number of links the entity has to other entities within the knowledge base and to other sites. We analyze the results and propose possible ways to improve the algorithm, for example by using contextual information and the structure of the knowledge base. The annotated corpus is publicly available and may be useful to other researchers.
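The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Entity` schema, the toy knowledge base with made-up identifiers, substring comparison as the form of string matching, and the sum of the two link counts as the ranking score are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Entity:
    """Toy knowledge-base entry (hypothetical schema, not taken from the paper)."""
    entity_id: str
    label: str
    internal_links: int  # links to other entities within the knowledge base
    external_links: int  # links to other sites


def normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def generate_candidates(term: str, kb: List[Entity]) -> List[Entity]:
    """Stage 1: keep entities whose label contains the term.

    Substring matching is an assumption; the abstract only says "string matching".
    """
    query = normalize(term)
    return [entity for entity in kb if query in normalize(entity.label)]


def rank_candidates(candidates: List[Entity]) -> List[Entity]:
    """Stage 2: order candidates by link count, highest first.

    Summing the two counts is an assumption; the abstract does not give a formula.
    """
    return sorted(
        candidates,
        key=lambda entity: entity.internal_links + entity.external_links,
        reverse=True,
    )


def link_term(term: str, kb: List[Entity]) -> Optional[Entity]:
    """Return the top-ranked entity for the term, or None if nothing matches."""
    ranked = rank_candidates(generate_candidates(term, kb))
    return ranked[0] if ranked else None


# Made-up entries and link counts, used only to exercise the functions above.
TOY_KB = [
    Entity("E1", "Neural network", internal_links=40, external_links=5),
    Entity("E2", "Artificial neural network", internal_links=120, external_links=30),
    Entity("E3", "Network topology", internal_links=60, external_links=10),
]

if __name__ == "__main__":
    best = link_term("neural network", TOY_KB)
    print(best.entity_id if best else "no match")
```

A score based only on link counts favours well-connected entities regardless of the surrounding text, which is consistent with the abstract's suggestion that contextual information could improve the results.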
About the Authors
A. A. Mezentseva
Russian Federation
Anastasia A. Mezentseva - Student, Novosibirsk State University.
Novosibirsk.
E. P. Bruches
Russian Federation
Elena P. Bruches - PhD Student, A. P. Ershov Institute of Informatics Systems SB RAS; Assistant, Novosibirsk State University.
Novosibirsk.
T. V. Batura
Russian Federation
Tatiana V. Batura - PhD in Physics and Mathematics, Senior Researcher, A. P. Ershov Institute of Informatics Systems SB RAS; Associate Professor, Novosibirsk State University.
Novosibirsk.
For citations:
Mezentseva A.A., Bruches E.P., Batura T.V. Automatic Linking of Terms from Scientific Texts with Knowledge Base Entities. Vestnik NSU. Series: Information Technologies. 2021;19(2):65-75. (In Russ.) https://doi.org/10.25205/1818-7900-2021-19-2-65-75