<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" "JATS-journalpublishing1-3.dtd">
<article article-type="research-article" dtd-version="1.3" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:lang="ru"><front><journal-meta><journal-id journal-id-type="publisher-id">intechngu</journal-id><journal-title-group><journal-title xml:lang="ru">Вестник НГУ. Серия: Информационные технологии</journal-title><trans-title-group xml:lang="en"><trans-title>Vestnik NSU. Series: Information Technologies</trans-title></trans-title-group></journal-title-group><issn pub-type="ppub">1818-7900</issn><issn pub-type="epub">2410-0420</issn><publisher><publisher-name>НГУ</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.25205/1818-7900-2021-19-2-5-16</article-id><article-id custom-type="elpub" pub-id-type="custom">intechngu-159</article-id><article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject></subj-group><subj-group subj-group-type="section-heading" xml:lang="ru"><subject>Статьи</subject></subj-group></article-categories><title-group><article-title>Метод автоматического извлечения терминов из научных статей на основе слабо контролируемого обучения</article-title><trans-title-group xml:lang="en"><trans-title>Method for Automatic Term Extraction from Scientific Articles Based on Weak Supervision</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Бручес</surname><given-names>Е. П.</given-names></name><name name-style="western" xml:lang="en"><surname>Bruches</surname><given-names>E. P.</given-names></name></name-alternatives><bio xml:lang="ru"><p>Бручес Елена Павловна - аспирант, ИСИ СО РАН; ассистент, НГУ.</p><p>Новосибирск.</p></bio><bio xml:lang="en"><p>Elena P. Bruches - PhD student, A. P. 
Ershov Institute of Informatics Systems SB RAS; Assistant, Novosibirsk State University.</p><p>Novosibirsk.</p></bio><email xlink:type="simple">bruches@bk.ru</email><xref ref-type="aff" rid="aff-1"/></contrib><contrib contrib-type="author" corresp="yes"><contrib-id contrib-id-type="orcid">https://orcid.org/0000-0003-4333-7888</contrib-id><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Батура</surname><given-names>Т. В.</given-names></name><name name-style="western" xml:lang="en"><surname>Batura</surname><given-names>T. V.</given-names></name></name-alternatives><bio xml:lang="ru"><p>Батура Татьяна Викторовна - кандидат физико-математических наук, старший научный сотрудник, ИСИ СО РАН; доцент, НГУ.</p><p>Новосибирск.</p></bio><bio xml:lang="en"><p>Tatiana V. Batura - PhD in Physics and Mathematics, Senior Researcher, A. P. Ershov Institute of Informatics Systems SB RAS; Associate Professor, Novosibirsk State University.</p><p>Novosibirsk.</p></bio><email xlink:type="simple">tatiana.v.batura@gmail.com</email><xref ref-type="aff" rid="aff-1"/></contrib></contrib-group><aff-alternatives id="aff-1"><aff xml:lang="ru">Институт систем информатики им. А.П. Ершова СО РАН; Новосибирский государственный университет<country>Россия</country></aff><aff xml:lang="en">A.P. 
Ershov Institute of Informatics Systems SB RAS; Novosibirsk State University<country>Russian Federation</country></aff></aff-alternatives><pub-date pub-type="collection"><year>2021</year></pub-date><pub-date pub-type="epub"><day>20</day><month>07</month><year>2021</year></pub-date><volume>19</volume><issue>2</issue><fpage>5</fpage><lpage>16</lpage><permissions><copyright-statement>Copyright &#x00A9; Бручес Е.П., Батура Т.В., 2021</copyright-statement><copyright-year>2021</copyright-year><copyright-holder xml:lang="ru">Бручес Е.П., Батура Т.В.</copyright-holder><copyright-holder xml:lang="en">Bruches E.P., Batura T.V.</copyright-holder><license license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>This work is licensed under a Creative Commons Attribution 4.0 License.</license-p></license></permissions><self-uri xlink:href="https://intechngu.elpub.ru/jour/article/view/159">https://intechngu.elpub.ru/jour/article/view/159</self-uri><abstract><p>Описывается метод извлечения научных терминов из текстов на русском языке, основанный на слабо контролируемом обучении (weakly supervised learning). Особенность данного метода заключается в том, что для него не нужны размеченные вручную данные, что является очень актуальным. Для реализации метода мы собрали в полуавтоматическом режиме словарь терминов, затем автоматически разметили тексты научных статей этими терминами. Полученные тексты мы использовали для обучения модели. Затем этой моделью были автоматически размечены другие тексты. Вторая модель была обучена на объединении текстов, размеченных словарем и первой моделью. Результаты показали, что добавление данных, полученных даже автоматической разметкой, улучшает качество извлечения терминов из текстов.</p></abstract><trans-abstract xml:lang="en"><p>We propose a method for scientific term extraction from texts in Russian based on weakly supervised learning. 
This approach does not require a large amount of hand-labeled data. To implement the method, we collected a list of terms in a semi-automatic way and then annotated the texts of scientific articles with these terms. We used these texts to train a model. Then we used the predictions of this model on another part of the text collection to extend the training set. The second model was trained on both text collections: the one annotated with the dictionary and the one annotated by the first model. The results showed that adding data, even if annotated automatically, improves the quality of scientific term extraction.</p></trans-abstract><kwd-group xml:lang="ru"><kwd>извлечение терминов</kwd><kwd>нейросетевые модели языка</kwd><kwd>словарный метод</kwd><kwd>слабо контролируемое обучение</kwd><kwd>распознавание сущностей</kwd><kwd>обработка текстов</kwd></kwd-group><kwd-group xml:lang="en"><kwd>term extraction</kwd><kwd>neural network language models</kwd><kwd>dictionary approach</kwd><kwd>weakly supervised learning</kwd><kwd>entity recognition</kwd><kwd>text processing</kwd></kwd-group><funding-group xml:lang="ru"><funding-statement>Исследование выполнено при финансовой поддержке РФФИ в рамках научного проекта № 19-07-01134</funding-statement></funding-group><funding-group xml:lang="en"><funding-statement>The study was funded by RFBR according to research project no. 19-07-01134</funding-statement></funding-group></article-meta></front><back><ref-list><title>References</title><ref id="cit1"><label>1</label><citation-alternatives><mixed-citation xml:lang="ru">Head A., Lo K., Kang D., Fok R., Skjonsberg S., Weld D. S., Hearst M. A. Augmenting Scientific Papers with Just-in-Time, Position-Sensitive Definitions of Terms and Symbols. arXiv: 2009.14237. 2021.</mixed-citation><mixed-citation xml:lang="en">Head A., Lo K., Kang D., Fok R., Skjonsberg S., Weld D. S., Hearst M. A. Augmenting Scientific Papers with Just-in-Time, Position-Sensitive Definitions of Terms and Symbols. arXiv: 2009.14237. 
2021.</mixed-citation></citation-alternatives></ref><ref id="cit2"><label>2</label><citation-alternatives><mixed-citation xml:lang="ru">Лопатин В. В., Лопатина Л. Е. Русский толковый словарь: около 35 000 слов. М.: Русский язык, 1997. 832 с.</mixed-citation><mixed-citation xml:lang="en">Lopatin V. V., Lopatina L. E. Russian Explanatory Dictionary. Moscow, 1997, 832 p. (in Russ.)</mixed-citation></citation-alternatives></ref><ref id="cit3"><label>3</label><citation-alternatives><mixed-citation xml:lang="ru">Stankovic R., Krstev C., Obradović I., Lazić B., Trtovac A. Rule-based Automatic Multiword Term Extraction and Lemmatization. In: Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC'16). 2016, p. 507-514.</mixed-citation><mixed-citation xml:lang="en">Stankovic R., Krstev C., Obradović I., Lazić B., Trtovac A. Rule-based Automatic Multiword Term Extraction and Lemmatization. In: Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC'16). 2016, p. 507-514.</mixed-citation></citation-alternatives></ref><ref id="cit4"><label>4</label><citation-alternatives><mixed-citation xml:lang="ru">Yuan Y., Gao J., Zhang Y. Supervised Learning for Robust Term Extraction. In: Proceedings of 2017 International Conference on Asian Language Processing (IALP). 2017, p. 302-305. DOI 10.1109/IALP.2017.8300603</mixed-citation><mixed-citation xml:lang="en">Yuan Y., Gao J., Zhang Y. Supervised Learning for Robust Term Extraction. In: Proceedings of 2017 International Conference on Asian Language Processing (IALP). 2017, p. 302-305. DOI 10.1109/IALP.2017.8300603</mixed-citation></citation-alternatives></ref><ref id="cit5"><label>5</label><citation-alternatives><mixed-citation xml:lang="ru">Conrado M., Pardo T., Rezende S. O. A Machine Learning Approach to Automatic Term Extraction using a Rich Feature Set. In: Proceedings of the NAACL HLT 2013 Student Research Workshop. Atlanta, Georgia, 2013, p. 
16-23.</mixed-citation><mixed-citation xml:lang="en">Conrado M., Pardo T., Rezende S. O. A Machine Learning Approach to Automatic Term Extraction using a Rich Feature Set. In: Proceedings of the NAACL HLT 2013 Student Research Workshop. Atlanta, Georgia, 2013, p. 16-23.</mixed-citation></citation-alternatives></ref><ref id="cit6"><label>6</label><citation-alternatives><mixed-citation xml:lang="ru">Zhang Z., Gao J., Ciravegna F. SemRe-Rank: Improving Automatic Term Extraction by Incorporating Semantic Relatedness with Personalised PageRank. ACM Transactions on Knowledge Discovery from Data (TKDD), 2018, vol. 12, no. 5, p. 1-41.</mixed-citation><mixed-citation xml:lang="en">Zhang Z., Gao J., Ciravegna F. SemRe-Rank: Improving Automatic Term Extraction by Incorporating Semantic Relatedness with Personalised PageRank. ACM Transactions on Knowledge Discovery from Data (TKDD), 2018, vol. 12, no. 5, p. 1-41.</mixed-citation></citation-alternatives></ref><ref id="cit7"><label>7</label><citation-alternatives><mixed-citation xml:lang="ru">Bilu Y., Gretz Sh., Cohen E., Slonim N. What if we had no Wikipedia? Domain-independent Term Extraction from a Large News Corpus. arXiv: 2009.08240. 2020.</mixed-citation><mixed-citation xml:lang="en">Bilu Y., Gretz Sh., Cohen E., Slonim N. What if we had no Wikipedia? Domain-independent Term Extraction from a Large News Corpus. arXiv: 2009.08240. 2020.</mixed-citation></citation-alternatives></ref><ref id="cit8"><label>8</label><citation-alternatives><mixed-citation xml:lang="ru">Wang R., Liu W., McDonald C. Featureless Domain-Specific Term Extraction with Minimal Labelled Data. In: Proceedings of Australasian Language Technology Association Workshop, 2016, p. 103-112.</mixed-citation><mixed-citation xml:lang="en">Wang R., Liu W., McDonald C. Featureless Domain-Specific Term Extraction with Minimal Labelled Data. In: Proceedings of Australasian Language Technology Association Workshop, 2016, p. 
103-112.</mixed-citation></citation-alternatives></ref><ref id="cit9"><label>9</label><citation-alternatives><mixed-citation xml:lang="ru">Hossari M., Dev S., Kelleher J. D. TEST: A Terminology Extraction System for Technology Related Terms. In: Proceedings of the 2019 11th International Conference on Computer and Automation Engineering, 2019, p. 78-81. DOI 10.1145/3313991.3314006</mixed-citation><mixed-citation xml:lang="en">Hossari M., Dev S., Kelleher J. D. TEST: A Terminology Extraction System for Technology Related Terms. In: Proceedings of the 2019 11th International Conference on Computer and Automation Engineering, 2019, p. 78-81. DOI 10.1145/3313991.3314006</mixed-citation></citation-alternatives></ref><ref id="cit10"><label>10</label><citation-alternatives><mixed-citation xml:lang="ru">Kucza M., Niehues J., Zenkel T., Waibel A., Stüker S. Term Extraction via Neural Sequence Labeling: A Comparative Evaluation of Strategies Using Recurrent Neural Networks. In: Proceedings of Interspeech 2018, 2018, p. 2072-2076.</mixed-citation><mixed-citation xml:lang="en">Kucza M., Niehues J., Zenkel T., Waibel A., Stüker S. Term Extraction via Neural Sequence Labeling: A Comparative Evaluation of Strategies Using Recurrent Neural Networks. In: Proceedings of Interspeech 2018, 2018, p. 2072-2076.</mixed-citation></citation-alternatives></ref><ref id="cit11"><label>11</label><citation-alternatives><mixed-citation xml:lang="ru">Bolshakova E., Loukachevitch N., Nokel M. Topic Models Can Improve Domain Term Extraction. In: European Conference on Information Retrieval (ECIR 2013). Lecture Notes in Computer Science. Springer, Berlin, Heidelberg, 2013, vol. 7814, p. 684-687.</mixed-citation><mixed-citation xml:lang="en">Bolshakova E., Loukachevitch N., Nokel M. Topic Models Can Improve Domain Term Extraction. In: European Conference on Information Retrieval (ECIR 2013). Lecture Notes in Computer Science. Springer, Berlin, Heidelberg, 2013, vol. 7814, p. 
684-687.</mixed-citation></citation-alternatives></ref><ref id="cit12"><label>12</label><citation-alternatives><mixed-citation xml:lang="ru">Bruches E., Pauls A., Batura T., Isachenko V. Entity Recognition and Relation Extraction from Scientific and Technical Texts in Russian. In: Science and Artificial Intelligence conference, 2020, p. 41-45. DOI 10.1109/S.A.I.ence50533.2020.9303196</mixed-citation><mixed-citation xml:lang="en">Bruches E., Pauls A., Batura T., Isachenko V. Entity Recognition and Relation Extraction from Scientific and Technical Texts in Russian. In: Science and Artificial Intelligence conference, 2020, p. 41-45. DOI 10.1109/S.A.I.ence50533.2020.9303196</mixed-citation></citation-alternatives></ref><ref id="cit13"><label>13</label><citation-alternatives><mixed-citation xml:lang="ru">Бручес Е. П., Паульс А. Е., Батура Т. В., Исаченко В. В., Щербатов Д. Р. Семантический анализ научных текстов: опыт создания корпуса и построения языковых моделей // Программные продукты и системы, 2020, 18 с.</mixed-citation><mixed-citation xml:lang="en">Bruches E., Pauls A., Batura T., Isachenko V., Shcherbatov D. Semantic Analysis of Scientific Texts: Experience in Creating a Corpus and Building Language Models. Software &amp; Systems, 2020, 18 p. (in Russ.)</mixed-citation></citation-alternatives></ref><ref id="cit14"><label>14</label><citation-alternatives><mixed-citation xml:lang="ru">Rose S., Engel D., Cramer N., Cowley W. Automatic Keyword Extraction from Individual Documents. Text Mining: Theory and Applications, 2010, vol. 1, p. 1-20. DOI 10.1002/9780470689646.ch1</mixed-citation><mixed-citation xml:lang="en">Rose S., Engel D., Cramer N., Cowley W. Automatic Keyword Extraction from Individual Documents. Text Mining: Theory and Applications, 2010, vol. 1, p. 1-20. 
DOI 10.1002/9780470689646.ch1</mixed-citation></citation-alternatives></ref><ref id="cit15"><label>15</label><citation-alternatives><mixed-citation xml:lang="ru">Ivanin V., Artemova E., Batura T., Ivanov V., Sarkisyan V., Tutubalina E., Smurov I. RUREBUS-2020 Shared Task: Russian Relation Extraction for Business. In: Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference “Dialog”, 2020, p. 416-431. DOI 10.28995/2075-7182-2020-19-416-431</mixed-citation><mixed-citation xml:lang="en">Ivanin V., Artemova E., Batura T., Ivanov V., Sarkisyan V., Tutubalina E., Smurov I. RUREBUS-2020 Shared Task: Russian Relation Extraction for Business. In: Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference “Dialog”, 2020, p. 416-431. DOI 10.28995/2075-7182-2020-19-416-431</mixed-citation></citation-alternatives></ref></ref-list><fn-group><fn fn-type="conflict"><p>The authors declare that there are no conflicts of interest present.</p></fn></fn-group></back></article>
