<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" "JATS-journalpublishing1-3.dtd">
<article article-type="research-article" dtd-version="1.3" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:lang="ru"><front><journal-meta><journal-id journal-id-type="publisher-id">intechngu</journal-id><journal-title-group><journal-title xml:lang="ru">Вестник НГУ. Серия: Информационные технологии</journal-title><trans-title-group xml:lang="en"><trans-title>Vestnik NSU. Series: Information Technologies</trans-title></trans-title-group></journal-title-group><issn pub-type="ppub">1818-7900</issn><issn pub-type="epub">2410-0420</issn><publisher><publisher-name>НГУ</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.25205/1818-7900-2021-19-2-5-16</article-id><article-id custom-type="elpub" pub-id-type="custom">intechngu-159</article-id><article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject></subj-group><subj-group subj-group-type="section-heading" xml:lang="ru"><subject>Статьи</subject></subj-group></article-categories><title-group><article-title>Метод автоматического извлечения терминов из научных статей на основе слабо контролируемого обучения</article-title><trans-title-group xml:lang="en"><trans-title>Method for Automatic Term Extraction from Scientific Articles Based on Weak Supervision</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Бручес</surname><given-names>Е. П.</given-names></name><name name-style="western" xml:lang="en"><surname>Bruches</surname><given-names>E. P.</given-names></name></name-alternatives><bio xml:lang="ru"><p>Бручес Елена Павловна - аспирант, ИСИ СО РАН; ассистент, НГУ.</p><p>Новосибирск.</p></bio><bio xml:lang="en"><p>Elena P. Bruches - PhD student, A. P. 
Ershov Institute of Informatics Systems SB RAS; Assistant, Novosibirsk State University.</p><p>Novosibirsk.</p></bio><email xlink:type="simple">bruches@bk.ru</email><xref ref-type="aff" rid="aff-1"/></contrib><contrib contrib-type="author" corresp="yes"><contrib-id contrib-id-type="orcid">https://orcid.org/0000-0003-4333-7888</contrib-id><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Батура</surname><given-names>Т. В.</given-names></name><name name-style="western" xml:lang="en"><surname>Batura</surname><given-names>T. V.</given-names></name></name-alternatives><bio xml:lang="ru"><p>Батура Татьяна Викторовна - кандидат физико-математических наук, старший научный сотрудник, ИСИ СО РАН; доцент, НГУ.</p><p>Новосибирск.</p></bio><bio xml:lang="en"><p>Tatiana V. Batura - PhD in Physics and Mathematics, Senior Researcher, A. P. Ershov Institute of Informatics Systems SB RAS; Associate Professor, Novosibirsk State University.</p><p>Novosibirsk.</p></bio><email xlink:type="simple">tatiana.v.batura@gmail.com</email><xref ref-type="aff" rid="aff-1"/></contrib></contrib-group><aff-alternatives id="aff-1"><aff xml:lang="ru">Институт систем информатики им. А.П. Ершова СО РАН; Новосибирский государственный университет<country>Россия</country></aff><aff xml:lang="en">A.P. 
Ershov Institute of Informatics Systems SB RAS; Novosibirsk State University<country>Russian Federation</country></aff></aff-alternatives><pub-date pub-type="collection"><year>2021</year></pub-date><pub-date pub-type="epub"><day>20</day><month>07</month><year>2021</year></pub-date><volume>19</volume><issue>2</issue><fpage>5</fpage><lpage>16</lpage><permissions><copyright-statement>Copyright &#x00A9; Бручес Е.П., Батура Т.В., 2021</copyright-statement><copyright-year>2021</copyright-year><copyright-holder xml:lang="ru">Бручес Е.П., Батура Т.В.</copyright-holder><copyright-holder xml:lang="en">Bruches E.P., Batura T.V.</copyright-holder><license license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>This work is licensed under a Creative Commons Attribution 4.0 License.</license-p></license></permissions><self-uri xlink:href="https://intechngu.elpub.ru/jour/article/view/159">https://intechngu.elpub.ru/jour/article/view/159</self-uri><abstract><p>Описывается метод извлечения научных терминов из текстов на русском языке, основанный на слабо контролируемом обучении (weakly supervised learning). Особенность данного метода заключается в том, что для него не нужны размеченные вручную данные, что является очень актуальным. Для реализации метода мы собрали в полуавтоматическом режиме словарь терминов, затем автоматически разметили тексты научных статей этими терминами. Полученные тексты мы использовали для обучения модели. Затем этой моделью были автоматически размечены другие тексты. Вторая модель была обучена на объединении текстов, размеченных словарем и первой моделью. Результаты показали, что добавление данных, полученных даже автоматической разметкой, улучшает качество извлечения терминов из текстов.</p></abstract><trans-abstract xml:lang="en"><p>We propose a method for scientific term extraction from texts in Russian based on weakly supervised learning. 
This approach does not require a large amount of hand-labeled data. To implement the method, we collected a list of terms in a semi-automatic way and then annotated the texts of scientific articles with these terms. We used these texts to train a model. Then we used the predictions of this model on another part of the text collection to extend the training set. The second model was trained on both text collections: the one annotated with the dictionary and the one annotated by the first model. The results showed that adding data, even if annotated automatically, improves the quality of scientific term extraction.</p></trans-abstract><kwd-group xml:lang="ru"><kwd>извлечение терминов</kwd><kwd>нейросетевые модели языка</kwd><kwd>словарный метод</kwd><kwd>слабо контролируемое обучение</kwd><kwd>распознавание сущностей</kwd><kwd>обработка текстов</kwd></kwd-group><kwd-group xml:lang="en"><kwd>term extraction</kwd><kwd>neural network language models</kwd><kwd>dictionary approach</kwd><kwd>weakly supervised learning</kwd><kwd>entity recognition</kwd><kwd>text processing</kwd></kwd-group><funding-group xml:lang="ru"><funding-statement>Исследование выполнено при финансовой поддержке РФФИ в рамках научного проекта № 19-07-01134</funding-statement></funding-group><funding-group xml:lang="en"><funding-statement>The study was funded by RFBR according to research project no. 19-07-01134</funding-statement></funding-group></article-meta></front><back><ref-list><title>References</title><ref id="cit1"><label>1</label><citation-alternatives><mixed-citation xml:lang="ru">Head A., Lo K., Kang D., Fok R., Skjonsberg S., Weld D. S., Hearst M. A. Augmenting Scientific Papers with Just-in-Time, Position-Sensitive Definitions of Terms and Symbols. arXiv: 2009.14237. 2021.</mixed-citation><mixed-citation xml:lang="en">Head A., Lo K., Kang D., Fok R., Skjonsberg S., Weld D. S., Hearst M. A. Augmenting Scientific Papers with Just-in-Time, Position-Sensitive Definitions of Terms and Symbols. arXiv: 2009.14237. 
2021.</mixed-citation></citation-alternatives></ref><ref id="cit2"><label>2</label><citation-alternatives><mixed-citation xml:lang="ru">Лопатин В. В., Лопатина Л. Е. Русский толковый словарь: около 35 000 слов. М.: Русский язык, 1997. 832 с.</mixed-citation><mixed-citation xml:lang="en">Lopatin V. V., Lopatina L. E. Russian Explanatory Dictionary. Moscow, 1997, 832 p. (in Russ.)</mixed-citation></citation-alternatives></ref><ref id="cit3"><label>3</label><citation-alternatives><mixed-citation xml:lang="ru">Stankovic R., Krstev C., Obradović I., Lazić B., Trtovac A. Rule-based Automatic Multiword Term Extraction and Lemmatization. In: Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC'16). 2016, p. 507-514.</mixed-citation><mixed-citation xml:lang="en">Stankovic R., Krstev C., Obradović I., Lazić B., Trtovac A. Rule-based Automatic Multiword Term Extraction and Lemmatization. In: Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC'16). 2016, p. 507-514.</mixed-citation></citation-alternatives></ref><ref id="cit4"><label>4</label><citation-alternatives><mixed-citation xml:lang="ru">Yuan Y., Gao J., Zhang Y. Supervised Learning for Robust Term Extraction. In: Proceedings of 2017 International Conference on Asian Language Processing (IALP). 2017, p. 302-305. DOI 10.1109/IALP.2017.8300603</mixed-citation><mixed-citation xml:lang="en">Yuan Y., Gao J., Zhang Y. Supervised Learning for Robust Term Extraction. In: Proceedings of 2017 International Conference on Asian Language Processing (IALP). 2017, p. 302-305. DOI 10.1109/IALP.2017.8300603</mixed-citation></citation-alternatives></ref><ref id="cit5"><label>5</label><citation-alternatives><mixed-citation xml:lang="ru">Conrado M., Pardo T., Rezende S. O. A Machine Learning Approach to Automatic Term Extraction using a Rich Feature Set. In: Proceedings of the NAACL HLT 2013 Student Research Workshop. Atlanta, Georgia, 2013, p. 
16-23.</mixed-citation><mixed-citation xml:lang="en">Conrado M., Pardo T., Rezende S. O. A Machine Learning Approach to Automatic Term Extraction using a Rich Feature Set. In: Proceedings of the NAACL HLT 2013 Student Research Workshop. Atlanta, Georgia, 2013, p. 16-23.</mixed-citation></citation-alternatives></ref><ref id="cit6"><label>6</label><citation-alternatives><mixed-citation xml:lang="ru">Zhang Z., Gao J., Ciravegna F. SemRe-Rank: Improving Automatic Term Extraction by Incorporating Semantic Relatedness with Personalised PageRank. ACM Transactions on Knowledge Discovery from Data (TKDD), 2018, vol. 12, no. 5, p. 1-41.</mixed-citation><mixed-citation xml:lang="en">Zhang Z., Gao J., Ciravegna F. SemRe-Rank: Improving Automatic Term Extraction by Incorporating Semantic Relatedness with Personalised PageRank. ACM Transactions on Knowledge Discovery from Data (TKDD), 2018, vol. 12, no. 5, p. 1-41.</mixed-citation></citation-alternatives></ref><ref id="cit7"><label>7</label><citation-alternatives><mixed-citation xml:lang="ru">Bilu Y., Gretz Sh., Cohen E., Slonim N. What if we had no Wikipedia? Domain-independent Term Extraction from a Large News Corpus. arXiv: 2009.08240. 2020.</mixed-citation><mixed-citation xml:lang="en">Bilu Y., Gretz Sh., Cohen E., Slonim N. What if we had no Wikipedia? Domain-independent Term Extraction from a Large News Corpus. arXiv: 2009.08240. 2020.</mixed-citation></citation-alternatives></ref><ref id="cit8"><label>8</label><citation-alternatives><mixed-citation xml:lang="ru">Wang R., Liu W., McDonald C. Featureless Domain-Specific Term Extraction with Minimal Labelled Data. In: Proceedings of Australasian Language Technology Association Workshop, 2016, p. 103-112.</mixed-citation><mixed-citation xml:lang="en">Wang R., Liu W., McDonald C. Featureless Domain-Specific Term Extraction with Minimal Labelled Data. In: Proceedings of Australasian Language Technology Association Workshop, 2016, p. 
103-112.</mixed-citation></citation-alternatives></ref><ref id="cit9"><label>9</label><citation-alternatives><mixed-citation xml:lang="ru">Hossari M., Dev S., Kelleher J. D. TEST: A Terminology Extraction System for Technology Related Terms. In: Proceedings of the 2019 11th International Conference on Computer and Automation Engineering, 2019, p. 78-81. DOI 10.1145/3313991.3314006</mixed-citation><mixed-citation xml:lang="en">Hossari M., Dev S., Kelleher J. D. TEST: A Terminology Extraction System for Technology Related Terms. In: Proceedings of the 2019 11th International Conference on Computer and Automation Engineering, 2019, p. 78-81. DOI 10.1145/3313991.3314006</mixed-citation></citation-alternatives></ref><ref id="cit10"><label>10</label><citation-alternatives><mixed-citation xml:lang="ru">Kucza M., Niehues J., Zenkel T., Waibel A., Stüker S. Term Extraction via Neural Sequence Labeling: A Comparative Evaluation of Strategies Using Recurrent Neural Networks. In: Proceedings of Interspeech 2018, 2018, p. 2072-2076.</mixed-citation><mixed-citation xml:lang="en">Kucza M., Niehues J., Zenkel T., Waibel A., Stüker S. Term Extraction via Neural Sequence Labeling: A Comparative Evaluation of Strategies Using Recurrent Neural Networks. In: Proceedings of Interspeech 2018, 2018, p. 2072-2076.</mixed-citation></citation-alternatives></ref><ref id="cit11"><label>11</label><citation-alternatives><mixed-citation xml:lang="ru">Bolshakova E., Loukachevitch N., Nokel M. Topic Models Can Improve Domain Term Extraction. In: European Conference on Information Retrieval (ECIR 2013). Lecture Notes in Computer Science. Springer, Berlin, Heidelberg, 2013, vol. 7814, p. 684-687.</mixed-citation><mixed-citation xml:lang="en">Bolshakova E., Loukachevitch N., Nokel M. Topic Models Can Improve Domain Term Extraction. In: European Conference on Information Retrieval (ECIR 2013). Lecture Notes in Computer Science. Springer, Berlin, Heidelberg, 2013, vol. 7814, p. 
684-687.</mixed-citation></citation-alternatives></ref><ref id="cit12"><label>12</label><citation-alternatives><mixed-citation xml:lang="ru">Bruches E., Pauls A., Batura T., Isachenko V. Entity Recognition and Relation Extraction from Scientific and Technical Texts in Russian. In: Science and Artificial Intelligence conference, 2020, p. 41-45. DOI 10.1109/S.A.I.ence50533.2020.9303196</mixed-citation><mixed-citation xml:lang="en">Bruches E., Pauls A., Batura T., Isachenko V. Entity Recognition and Relation Extraction from Scientific and Technical Texts in Russian. In: Science and Artificial Intelligence conference, 2020, p. 41-45. DOI 10.1109/S.A.I.ence50533.2020.9303196</mixed-citation></citation-alternatives></ref><ref id="cit13"><label>13</label><citation-alternatives><mixed-citation xml:lang="ru">Бручес Е. П., Паульс А. Е., Батура Т. В., Исаченко В. В., Щербатов Д. Р. Семантический анализ научных текстов: опыт создания корпуса и построения языковых моделей // Программные продукты и системы, 2020, 18 с.</mixed-citation><mixed-citation xml:lang="en">Bruches E., Pauls A., Batura T., Isachenko V., Shcherbatov D. Semantic Analysis of Scientific Texts: Experience in Creating a Corpus and Building Language Models. Software &amp; Systems, 2020, 18 p. (in Russ.)</mixed-citation></citation-alternatives></ref><ref id="cit14"><label>14</label><citation-alternatives><mixed-citation xml:lang="ru">Rose S., Engel D., Cramer N., Cowley W. Automatic Keyword Extraction from Individual Documents. Text Mining: Theory and Applications, 2010, vol. 1, p. 1-20. DOI 10.1002/9780470689646.ch1</mixed-citation><mixed-citation xml:lang="en">Rose S., Engel D., Cramer N., Cowley W. Automatic Keyword Extraction from Individual Documents. Text Mining: Theory and Applications, 2010, vol. 1, p. 1-20. 
DOI 10.1002/9780470689646.ch1</mixed-citation></citation-alternatives></ref><ref id="cit15"><label>15</label><citation-alternatives><mixed-citation xml:lang="ru">Ivanin V., Artemova E., Batura T., Ivanov V., Sarkisyan V., Tutubalina E., Smurov I. RUREBUS-2020 Shared Task: Russian Relation Extraction for Business. In: Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference “Dialog”, 2020, p. 416-431. DOI 10.28995/2075-7182-2020-19-416-431</mixed-citation><mixed-citation xml:lang="en">Ivanin V., Artemova E., Batura T., Ivanov V., Sarkisyan V., Tutubalina E., Smurov I. RUREBUS-2020 Shared Task: Russian Relation Extraction for Business. In: Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference “Dialog”, 2020, p. 416-431. DOI 10.28995/2075-7182-2020-19-416-431</mixed-citation></citation-alternatives></ref></ref-list><fn-group><fn fn-type="conflict"><p>The authors declare that there are no conflicts of interest present.</p></fn></fn-group></back></article>
