Vestnik NSU. Series: Information Technologies

Limitations of Applying the Data Compression Method to the Classification of Abstracts of Publications Indexed in Scopus

https://doi.org/10.25205/1818-7900-2020-18-3-57-68

Abstract

The paper describes the limitations of applying a compression-based method for classifying scientific texts to all categories of the ASJC classification used in the Scopus bibliographic database. It is shown that automatically generating training samples for each category is rather time-consuming, and in some cases impossible because of the limits Scopus imposes on data export and the absence of category names in the Scopus Search API. Another reason is that many subject areas contain no journals at all, and consequently no publications, that are assigned to only one category. The method cannot be applied to all 26 subject areas because of their breadth and because of the initial Scopus classification itself. Terminologically close categories often occur in different subject areas, which makes it difficult to assign a publication to its true area. These findings also indicate that the classification currently used in Scopus and SciVal may not be completely reliable. For example, according to SciVal, the category “Theoretical computer science” ranks second by number of publications among the categories of the subject area “Mathematics”. The study showed, however, that it is one of the smallest categories in terms of both journals and publications assigned to this category alone. Thus, many studies that rely on the ASJC classification of publications may contain inaccuracies.
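
The general idea of classification by compression can be sketched as follows. This is a minimal illustration and not the implementation used in the paper: the choice of zlib as the compressor, the function names, and the toy training samples are all assumptions made for the example. An abstract is assigned to the category whose training sample lets the compressor encode it in the fewest additional bytes.

```python
import zlib
from typing import Dict


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 encoding of `text`."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def classify(abstract: str, training: Dict[str, str]) -> str:
    """Return the category whose training sample best 'explains' the abstract.

    `training` maps a category name to one concatenated training text
    (hypothetical layout, chosen for this sketch).
    """
    def extra_bytes(category: str) -> int:
        corpus = training[category]
        # Bytes needed to encode the abstract after the compressor
        # has already seen the category's training sample.
        return compressed_size(corpus + " " + abstract) - compressed_size(corpus)

    return min(training, key=extra_bytes)


# Toy training samples (invented for illustration only).
training = {
    "math": "We prove a theorem about prime numbers and graph colorings "
            "using combinatorial arguments.",
    "biology": "We observe protein expression in mouse liver cells "
               "under oxidative stress conditions.",
}

label = classify("a theorem about prime numbers and graph colorings", training)
```

Note that zlib only looks back over a 32 KB window, so for realistically sized training samples a compressor with a larger context would be a more sensible choice; zlib is used here only because it ships with the standard library. The sketch also makes the abstract's concern concrete: it presupposes a clean training sample per category, which is exactly what is hard to obtain when few or no journals belong to a single category.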

About the Author

I. V. Selivanova
SPSTL SB RAS
Russian Federation


For citations:

Selivanova I.V. Limitations of Applying the Data Compression Method to the Classification of Abstracts of Publications Indexed in Scopus. Vestnik NSU. Series: Information Technologies. 2020;18(3):57-68. (In Russ.) https://doi.org/10.25205/1818-7900-2020-18-3-57-68


This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 1818-7900 (Print)
ISSN 2410-0420 (Online)