Transparent Reduction of Dimension with Genetic Algorithm
https://doi.org/10.25205/1818-7900-2023-21-1-46-61
Abstract
There are domains, such as medicine and finance, where every transformation of the data must be transparent and interpretable. Dimension reduction is an important part of a preprocessing pipeline, but current algorithms for it are not transparent. In this work, we propose a genetic algorithm for transparent dimension reduction of numerical data. The algorithm constructs features in the form of expression trees built from a subset of the numerical features of the source data and common arithmetic operations. It is designed to maximize quality in binary classification tasks and to generate features that a human can explain, which is achieved by using human-interpretable operations during feature construction. Data transformed by the algorithm can also be used in visual analysis. A multi-criteria dynamic fitness function is provided to build features with high diversity.
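The abstract's core idea, evolving human-readable expression trees over source features, can be illustrated with a rough, self-contained sketch. This is not the paper's algorithm: the names, the mutation-only evolution loop, and the single-criterion fitness (absolute correlation of the constructed feature with the binary label) are all simplifying assumptions standing in for the multi-criteria dynamic fitness described above.

```python
import random
import numpy as np

# Human-interpretable arithmetic operations used as internal tree nodes.
OPS = {"+": np.add, "-": np.subtract, "*": np.multiply}

def random_tree(n_features, depth=2):
    # A leaf is the index of a source feature; an internal node is (op, left, right).
    if depth == 0 or random.random() < 0.3:
        return random.randrange(n_features)
    op = random.choice(list(OPS))
    return (op, random_tree(n_features, depth - 1), random_tree(n_features, depth - 1))

def evaluate(tree, X):
    # Compute the constructed feature for every row of X.
    if isinstance(tree, int):
        return X[:, tree]
    op, left, right = tree
    return OPS[op](evaluate(left, X), evaluate(right, X))

def to_infix(tree, names):
    # Render the tree as a readable expression, e.g. "(x0 * x1)".
    if isinstance(tree, int):
        return names[tree]
    op, left, right = tree
    return f"({to_infix(left, names)} {op} {to_infix(right, names)})"

def fitness(tree, X, y):
    # Single-criterion stand-in: |correlation| between the feature and the label.
    f = evaluate(tree, X)
    if np.std(f) == 0:
        return 0.0
    return abs(np.corrcoef(f, y)[0, 1])

def evolve(X, y, pop_size=30, generations=20):
    pop = [random_tree(X.shape[1]) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda t: fitness(t, X, y), reverse=True)
        survivors = pop[: pop_size // 2]
        # Replenish with fresh random trees (mutation/crossover omitted for brevity).
        pop = survivors + [random_tree(X.shape[1]) for _ in survivors]
    return max(pop, key=lambda t: fitness(t, X, y))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = (X[:, 0] * X[:, 1] > 0).astype(float)  # label driven by a product feature
    best = evolve(X, y)
    print(to_infix(best, ["x0", "x1", "x2"]))  # a transparent constructed feature
```

Because every constructed feature is an explicit arithmetic expression over named source columns, the reduced representation stays inspectable, which is the property the abstract emphasizes.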
About the Author
N. A. Radeev, Russian Federation
Nikita A. Radeev, Master's student
Faculty of Information Technologies
General Informatics Department
Novosibirsk
References
1. M. H. ur Rehman, C. S. Liew, A. Abbas, P. P. Jayaraman, T. Y. Wah, S. U. Khan. Big Data Reduction Methods: A Survey. Data Sci. Eng., 2016, vol. 1, no. 4, p. 265–284, DOI: 10.1007/s41019-016-0022-0.
2. C. H. Yoon, R. Torrance, N. Scheinerman. Machine learning in medicine: should the pursuit of enhanced interpretability be abandoned? J. Med. Ethics, 2022, vol. 48, no. 9, p. 581–585, DOI: 10.1136/medethics-2020-107102.
3. P. Linardatos, V. Papastefanopoulos, S. Kotsiantis. Explainable AI: A Review of Machine Learning Interpretability Methods. Entropy, 2021, vol. 23, no. 1, Art. no. 1, DOI: 10.3390/e23010018.
4. A. J. Izenman. Introduction to manifold learning. Wiley Interdiscip. Rev. Comput. Stat., 2012, vol. 4, DOI: 10.1002/wics.1222.
5. H. Han, W. Li, J. Wang, G. Qin, X. Qin. Enhance Explainability of Manifold Learning. Neurocomputing, 2022, vol. 500, DOI: 10.1016/j.neucom.2022.05.119.
6. D. Elizondo, R. Birkenhead, M. Gámez, N. Rubio, E. Alfaro-Cortés. Linear separability and classification complexity. Expert Syst. Appl., 2012, vol. 39, p. 7796–7807, DOI: 10.1016/j.eswa.2012.01.090.
7. J. Koza, R. Poli. Genetic Programming. in Search Methodologies, 2005, p. 127–164. DOI: 10.1007/0-387-28356-0_5.
8. U.-M. O’Reilly, E. Hemberg. Genetic programming: a tutorial introduction. 2021, p. 453. DOI: 10.1145/3449726.3461394.
9. A. Vasuki. Genetic Programming. 2020, p. 61–76. DOI: 10.1201/9780429289071-5.
10. L. Kallel, B. Naudts, C. Reeves. Properties of Fitness Functions and Search Landscapes. 2000, DOI: 10.1007/978-3-662-04448-3_8.
11. M. Schmidt, H. Lipson. Distilling Free-Form Natural Laws from Experimental Data. Science, 2009, vol. 324, no. 5923, p. 81–85, DOI: 10.1126/science.1165893.
12. W. La Cava et al. Contemporary Symbolic Regression Methods and their Relative Performance. in Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 2021, vol. 1.
13. L. Sotto, P. Kaufmann, T. Atkinson, R. Kalkreuth, M. Basgalupp. Graph representations in genetic programming. Genet. Program. Evolvable Mach., 2021, vol. 22, DOI: 10.1007/s10710-021-09413-9.
14. P. Krtolica, P. Stanimirovic. Reverse Polish notation method. Int. J. Comput. Math., 2004, p. 273–284, DOI: 10.1080/00207160410001660826.
15. C. Ferreira. Gene Expression Programming: a New Adaptive Algorithm for Solving Problems. arXiv, 2001. DOI: 10.48550/arXiv.cs/0102027.
16. C. Ferreira. Gene Expression Programming in Problem Solving. In Soft Computing and Industry: Recent Applications. London: Springer, 2002, p. 635–653. DOI: 10.1007/978-1-4471-0123-9_54.
17. I. T. Jolliffe. Principal Component Analysis. Berlin: Springer, 1986, vol. 87, p. 41–64, DOI: 10.1007/b98835.
18. L. van der Maaten, G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 2008, vol. 9, p. 2579–2605.
19. B. Hosseini, B. Hammer. Interpretable Discriminative Dimensionality Reduction and Feature Selection on the Manifold. arXiv, arXiv:1909.09218, 2019. DOI: 10.48550/arXiv.1909.09218.
20. T. Uriot, M. Virgolin, T. Alderliesten, P. Bosman. On genetic programming representations and fitness functions for interpretable dimensionality reduction. arXiv, arXiv:2203.00528, 2022. DOI: 10.48550/arXiv.2203.00528.
21. A. Lensen, M. Zhang, B. Xue. Multi-Objective Genetic Programming for Manifold Learning: Balancing Quality and Dimensionality. Genet. Program. Evolvable Mach., 2020, vol. 21, no. 3, p. 399–431, DOI: 10.1007/s10710-020-09375-4.
22. M. Virgolin, T. Alderliesten, P. A. N. Bosman. On Explaining Machine Learning Models by Evolving Crucial and Compact Features. Swarm Evol. Comput., 2020, vol. 53, p. 100640, DOI: 10.1016/j.swevo.2019.100640.
23. A. Lensen, B. Xue, M. Zhang. Can Genetic Programming Do Manifold Learning Too? In Genetic Programming, Cham, 2019, p. 114–130. DOI: 10.1007/978-3-030-16670-0_8.
24. Q. Zhang, H. Li. MOEA/D: A Multiobjective Evolutionary Algorithm Based on Decomposition. IEEE Trans. Evol. Comput., 2007, vol. 11, no. 6, p. 712–731, DOI: 10.1109/TEVC.2007.892759.
25. S. Roweis, L. Saul. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science, 2000, vol. 290, p. 2323–2326, DOI: 10.1126/science.290.5500.2323.
26. B. K. Tripathy, S. Anveshrithaa, S. Ghela. Multidimensional Scaling (MDS). 2021, p. 41–51. DOI: 10.1201/9781003190554-6.
27. L. McInnes, J. Healy, J. Melville. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv, arXiv:1802.03426, 2020. DOI: 10.48550/arXiv.1802.03426.
28. C. Cortes, V. Vapnik. Support-vector networks. Mach. Learn., 1995, vol. 20, p. 273–297.
29. J. Vanschoren, J. N. van Rijn, B. Bischl, L. Torgo. OpenML: networked science in machine learning. ACM SIGKDD Explor. Newsl., 2014, vol. 15, no. 2, p. 49–60, DOI: 10.1145/2641190.2641198.
30. S. Moro, P. Cortez, R. Laureano. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. 2011.
31. I.-C. Yeh, K.-J. Yang, T.-M. Ting. Knowledge discovery on RFM model using Bernoulli sequence. Expert Syst. Appl., 2009, vol. 36, p. 5866–5871, DOI: 10.1016/j.eswa.2008.07.018.
32. K. P. Bennett, O. L. Mangasarian. Robust Linear Programming Discrimination of Two Linearly Inseparable Sets. Optim. Methods Softw., 1992, vol. 1, DOI: 10.1080/10556789208805504.
33. D. Dua, C. Graff. UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences, 2019. [Online]. Available: http://archive.ics.uci.edu/ml
34. V. Sigillito, S. Wing, L. Hutton, K. Baker. Classification of radar returns from the ionosphere using neural networks. Johns Hopkins APL Tech. Dig. Appl. Phys. Lab., 1989, vol. 10.
35. I. Guyon, S. Gunn, A. Ben-Hur, G. Dror. Result Analysis of the NIPS 2003 Feature Selection Challenge. In Advances in Neural Information Processing Systems, 2004, vol. 17.
36. R. P. Gorman, T. Sejnowski. Analysis of hidden units in a layered network trained to classify sonar targets. Neural Netw., 1988, vol. 1, p. 75–89, DOI: 10.1016/0893-6080(88)90023-8.
For citations:
Radeev N.A. Transparent Reduction of Dimension with Genetic Algorithm. Vestnik NSU. Series: Information Technologies. 2023;21(1):46-61. (In Russ.) https://doi.org/10.25205/1818-7900-2023-21-1-46-61