On Increasing the Quality of the Climate Observations Question-Answering System’s Output Data
https://doi.org/10.25205/1818-7900-2024-22-4-5-16
Abstract
The development of a question-answering (QA) information system for climate observations relies on heterogeneous climate data in various formats (text, numerical, graphic, video, audio, geographic, and monitoring data). Such a system must include a tool for processing and analyzing these data.
Searching and retrieving data is a central part of such a system, since the quality of the generated answer depends heavily on it. How the data are retrieved is critical both to the output of a QA system and to decision-making problems, because a large language model (LLM) can generate contextually appropriate but factually incorrect answers that do not match the input. Using appropriate metrics and algorithms for some data types and inappropriate ones for others can push the share of irrelevant data past the permissible threshold, which in turn lowers the quality of the answers. Retrieval-augmented generation (RAG) systems can also be used to optimize the input data for this task.
This work discusses various algorithms for data extraction and document ranking, as well as the possibility of using ensembles of LLM agents in the development of a QA system that works with climate data.
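As an illustration of the document-ranking step, the following minimal sketch merges ranked lists from several retrievers with reciprocal rank fusion (RRF). The abstract does not say which ranking algorithm the authors use, so RRF is an assumption chosen for illustration. The function name, the constant k = 60, and the document identifiers are likewise hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in
    (rank starts at 1); documents are returned by descending total score.
    The value k=60 is a commonly used default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from two retrievers working on different data types.
text_hits = ["station_report_12", "bulletin_03", "map_legend_7"]
numeric_hits = ["bulletin_03", "timeseries_44", "station_report_12"]

fused = reciprocal_rank_fusion([text_hits, numeric_hits])
print(fused)
```

Documents that rank well in more than one list rise to the top, which is one way to keep a single poorly matched retriever from flooding the LLM context with irrelevant data.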
About the Authors
O. Yu. Gavenko
Russian Federation
Olga Yu. Gavenko, Doctor of Sciences (Technical Sciences), Candidate of Sciences (Philology), Leading Researcher; Senior Lecturer of the Department of Mathematical Modeling
Novosibirsk
N. A. Shashok
Russian Federation
Natalia A. Shashok, PhD Student
Novosibirsk
Review
For citations:
Gavenko O.Yu., Shashok N.A. On Increasing the Quality of the Climate Observations Question-Answering System’s Output Data. Vestnik NSU. Series: Information Technologies. 2024;22(4):5-16. (In Russ.) https://doi.org/10.25205/1818-7900-2024-22-4-5-16