
Vestnik NSU. Series: Information Technologies


Method for Automatic Term Extraction from Scientific Articles Based on Weak Supervision

https://doi.org/10.25205/1818-7900-2021-19-2-5-16

Abstract

We propose a method for extracting scientific terms from Russian-language texts based on weakly supervised learning. The approach does not require a large amount of hand-labeled data. To implement it, we compiled a list of terms semi-automatically and used it to annotate the texts of scientific articles. We trained a model on these annotated texts and then used its predictions on another part of the text collection to extend the training set. A second model was trained on both text collections: the one annotated with the dictionary and the one annotated by the first model. The results show that additional data, even when annotated automatically, improves the quality of scientific term extraction.
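The two-stage procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the BIO tag names, the greedy longest-match lookup, the function names, and the toy `train_memorizer` stand-in for a real sequence-labeling model are all assumptions made for the example.

```python
from typing import Callable, List, Sequence, Set, Tuple

Tags = List[str]
Example = Tuple[List[str], Tags]
Predictor = Callable[[Sequence[str]], Tags]


def annotate_with_dictionary(tokens: Sequence[str], terms: Set[Tuple[str, ...]]) -> Tags:
    """Tag tokens with BIO labels via greedy longest-match dictionary lookup.

    `terms` holds lowercased token tuples (hypothetical format for this sketch).
    """
    max_len = max((len(t) for t in terms), default=0)
    lowered = [tok.lower() for tok in tokens]
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first so multiword terms stay whole.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in terms:
                matched = n
                break
        if matched:
            tags[i] = "B-TERM"
            for j in range(i + 1, i + matched):
                tags[j] = "I-TERM"
            i += matched
        else:
            i += 1
    return tags


def train_memorizer(examples: List[Example]) -> Predictor:
    """Toy stand-in for a real tagger: memorizes tokens seen inside terms."""
    vocab = {
        tok.lower()
        for toks, tags in examples
        for tok, tag in zip(toks, tags)
        if tag != "O"
    }

    def predict(tokens: Sequence[str]) -> Tags:
        tags: Tags = []
        for tok in tokens:
            if tok.lower() in vocab:
                # Continue a term if the previous token was also tagged.
                tags.append("I-TERM" if tags and tags[-1] != "O" else "B-TERM")
            else:
                tags.append("O")
        return tags

    return predict


def weakly_supervised_training(
    dict_texts: List[List[str]],
    extra_texts: List[List[str]],
    terms: Set[Tuple[str, ...]],
    train: Callable[[List[Example]], Predictor],
) -> Predictor:
    """Train on dictionary labels, pseudo-label extra texts, then retrain."""
    dict_labeled = [(toks, annotate_with_dictionary(toks, terms)) for toks in dict_texts]
    first_model = train(dict_labeled)
    model_labeled = [(toks, first_model(toks)) for toks in extra_texts]
    # The second model sees both the dictionary-labeled and model-labeled data.
    return train(dict_labeled + model_labeled)


terms = {("neural", "network"), ("corpus",)}
second_model = weakly_supervised_training(
    dict_texts=[["A", "neural", "network", "model"]],
    extra_texts=[["The", "network", "corpus"]],
    terms=terms,
    train=train_memorizer,
)
```

In practice the `train` argument would be replaced by a real sequence-labeling model; the sketch only shows how the dictionary-annotated and model-annotated collections are combined.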

About the Authors

E. P. Bruches
A.P. Ershov Institute of Informatics Systems SB RAS; Novosibirsk State University
Russian Federation

Elena P. Bruches - PhD student, A. P. Ershov Institute of Informatics Systems SB RAS; Assistant, Novosibirsk State University.

Novosibirsk.



T. V. Batura
A.P. Ershov Institute of Informatics Systems SB RAS; Novosibirsk State University
Russian Federation

Tatiana V. Batura - PhD in Physics and Mathematics, Senior Researcher, A. P. Ershov Institute of Informatics Systems SB RAS; Associate Professor, Novosibirsk State University.

Novosibirsk.





For citations:


Bruches E.P., Batura T.V. Method for Automatic Term Extraction from Scientific Articles Based on Weak Supervision. Vestnik NSU. Series: Information Technologies. 2021;19(2):5-16. (In Russ.) https://doi.org/10.25205/1818-7900-2021-19-2-5-16



This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 1818-7900 (Print)
ISSN 2410-0420 (Online)