
Vestnik NSU. Series: Information Technologies


Development of a Classification System for Texts by Scientific Specialties Using Machine Learning Methods

https://doi.org/10.25205/1818-7900-2021-19-1-39-47

Abstract

When preparing a dissertation, a researcher has to settle on a research topic and write the text so that it fits a particular scientific specialty. To make the classification of scientific texts by specialty more objective, a system based on machine learning algorithms was developed. Building the system involved four tasks: selecting development tools, collecting and processing the initial data, building machine learning models, and developing a web application. The initial dataset is a sample of texts from the group of Russian scientific specialties “Informatics and Computer Engineering”. To keep the study objective, the initial data were filtered by removing the least representative classes, and the texts were preprocessed for vectorization. The source texts were vectorized with the TF-IDF model, which made it possible to load the entire dataset despite limited computing resources. Multiclass logistic regression was chosen as the machine learning model for predicting the scientific specialty. For training, the initial data were split into training and test sets in an 80:20 ratio. Accuracy is used as the quality metric of the model; this choice is justified by the classes being sufficiently balanced. The model fitted on the training set reached an accuracy of 0.87 on the test data. To make the trained model available for classifying texts by scientific specialty, a web application was developed using Flask. The web application is currently hosted at http://predict-spec.herokuapp.com/. The most pressing tasks for further development are moving the web application to a more powerful server, refining the machine learning models, and displaying visual information about the analyzed work.
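The pipeline described in the abstract (TF-IDF vectorization, multiclass logistic regression, an 80:20 split, accuracy as the metric) can be illustrated with a minimal sketch. The abstract does not name the library used, so scikit-learn is an assumption here. The toy corpus, the labels `spec_A`, `spec_B`, `spec_C`, and all variable names are hypothetical placeholders, not the article's data or specialty codes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy vocabulary per class; the real system uses dissertation texts.
TOPIC_WORDS = {
    "spec_A": ["control", "system", "analysis", "optimization", "decision"],
    "spec_B": ["software", "compiler", "operating", "program", "network"],
    "spec_C": ["simulation", "numerical", "model", "equation", "algorithm"],
}

# Build 10 short "documents" per class from 3 of the 5 class words.
texts, labels = [], []
for label, words in TOPIC_WORDS.items():
    for i in range(10):
        texts.append(" ".join(words[(i + k) % len(words)] for k in range(3)))
        labels.append(label)

# 80:20 split, stratified so each class keeps the same share in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0
)

# TF-IDF produces sparse vectors; logistic regression handles the multiclass case.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Accuracy is the share of test texts whose predicted class matches the true one.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.2f}")

# Classify a new, unseen text.
predicted = model.predict(["numerical simulation of a model"])[0]
print("predicted class:", predicted)
```

Because the toy classes use disjoint vocabularies, this example is far easier than the real task and its accuracy says nothing about the 0.87 reported above. Keeping the vectorizer and classifier in one pipeline object means the same preprocessing is applied at training time and at prediction time.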

Keywords

NLP

About the Author

P. Yu. Gusev
Voronezh State Technical University
Russian Federation





For citations:


Gusev P.Yu. Development of a Classification System for Texts by Scientific Specialties Using Machine Learning Methods. Vestnik NSU. Series: Information Technologies. 2021;19(1):39-47. (In Russ.) https://doi.org/10.25205/1818-7900-2021-19-1-39-47



This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 1818-7900 (Print)
ISSN 2410-0420 (Online)