<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" "JATS-journalpublishing1-3.dtd">
<article article-type="research-article" dtd-version="1.3" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:lang="ru"><front><journal-meta><journal-id journal-id-type="publisher-id">intechngu</journal-id><journal-title-group><journal-title xml:lang="ru">Вестник НГУ. Серия: Информационные технологии</journal-title><trans-title-group xml:lang="en"><trans-title>Vestnik NSU. Series: Information Technologies</trans-title></trans-title-group></journal-title-group><issn pub-type="ppub">1818-7900</issn><issn pub-type="epub">2410-0420</issn><publisher><publisher-name>НГУ</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.25205/1818-7900-2025-23-4-44-61</article-id><article-id custom-type="elpub" pub-id-type="custom">intechngu-338</article-id><article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject></subj-group><subj-group subj-group-type="section-heading" xml:lang="ru"><subject>Статьи</subject></subj-group></article-categories><title-group><article-title>Исследование методов оптимизации скорости исполнения больших языковых моделей для задачи распознавания команд</article-title><trans-title-group xml:lang="en"><trans-title>Research of inference speed optimization methods of large language models for function calling task</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author" corresp="yes"><contrib-id contrib-id-type="orcid">https://orcid.org/0009-0000-5087-8506</contrib-id><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Гончаренко</surname><given-names>А. И.</given-names></name><name name-style="western" xml:lang="en"><surname>Goncharenko</surname><given-names>A. I.</given-names></name></name-alternatives><bio xml:lang="ru"><p>Гончаренко Александр Игоревич, старший преподаватель</p><p>Новосибирск</p></bio><bio xml:lang="en"><p>Alexander I. Goncharenko, Senior lecturer</p><p>Novosibirsk</p></bio><email xlink:type="simple">a.goncharenko@expasoft.tech</email><xref ref-type="aff" rid="aff-1"/></contrib><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Чупров</surname><given-names>М. И.</given-names></name><name name-style="western" xml:lang="en"><surname>Chuprov</surname><given-names>M. I.</given-names></name></name-alternatives><bio xml:lang="ru"><p>Чупров Максим Иванович, разработчик-исследователь систем искусственного интеллекта</p><p>Новосибирск</p></bio><bio xml:lang="en"><p>Maxim I. Chuprov, Artificial intelligence systems developer/researcher</p><p>Novosibirsk</p></bio><email xlink:type="simple">m.chuprov@expasoft.tech</email><xref ref-type="aff" rid="aff-2"/></contrib><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Нежевенко</surname><given-names>Е. С.</given-names></name><name name-style="western" xml:lang="en"><surname>Nejevenko</surname><given-names>E. S.</given-names></name></name-alternatives><bio xml:lang="ru"><p>Нежевенко Евгений Семенович, доктор технических наук, ведущий научный сотрудник тематической группы оптико-электронных специализированных процессоров</p><p>Новосибирск</p></bio><bio xml:lang="en"><p>Evgeniy S. 
Nejevenko, Doctor of Technical Sciences, Leading researcher of the subject group of optical-electronic specialized processors</p><p>Novosibirsk</p></bio><email xlink:type="simple">nedj@iae.nsk.su</email><xref ref-type="aff" rid="aff-3"/></contrib></contrib-group><aff-alternatives id="aff-1"><aff xml:lang="ru">Институт интеллектуальной робототехники НГУ<country>Россия</country></aff><aff xml:lang="en">Institute of Intelligent Robotics of Novosibirsk State University<country>Russian Federation</country></aff></aff-alternatives><aff-alternatives id="aff-2"><aff xml:lang="ru">ООО «Экспасофт»<country>Россия</country></aff><aff xml:lang="en">Expasoft LLC<country>Russian Federation</country></aff></aff-alternatives><aff-alternatives id="aff-3"><aff xml:lang="ru">Институт автоматики и электрометрии СО РАН<country>Россия</country></aff><aff xml:lang="en">Institute of Automation and Electrometry of the Siberian Branch of the Russian Academy of Sciences<country>Russian Federation</country></aff></aff-alternatives><pub-date pub-type="collection"><year>2025</year></pub-date><pub-date pub-type="epub"><day>12</day><month>02</month><year>2026</year></pub-date><volume>23</volume><issue>4</issue><fpage>44</fpage><lpage>61</lpage><permissions><copyright-statement>Copyright &#x00A9; Гончаренко А.И., Чупров М.И., Нежевенко Е.С., 2026</copyright-statement><copyright-year>2026</copyright-year><copyright-holder xml:lang="ru">Гончаренко А.И., Чупров М.И., Нежевенко Е.С.</copyright-holder><copyright-holder xml:lang="en">Goncharenko A.I., Chuprov M.I., Nejevenko E.S.</copyright-holder><license license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>This work is licensed under a Creative Commons Attribution 4.0 License.</license-p></license></permissions><self-uri xlink:href="https://intechngu.elpub.ru/jour/article/view/338">https://intechngu.elpub.ru/jour/article/view/338</self-uri><abstract><p>Целью данной работы являлось исследование и реализация методов оптимизации (особенно методов прунинга) больших языковых моделей для задачи function calling, а также сравнение точности и скорости работы полученных моделей.</p><p>В качестве базовой модели была выбрана модель Mistral-7B. Для эффективной тренировки модели использовался датасет glaive-function-calling-v2, предназначенный для задачи function calling. Для обучения базовой модели использовалось квантование до 4 бит в формате nf4 и двойное квантование в сочетании с методом QLoRA (Quantized Low-Rank Adaptation).</p><p>Оптимизация модели проводилась несколькими способами: (1) с использованием метода ShortGPT, (2) с помощью критерия Тейлора для послойного прунинга, (3) методом LLM-Pruner, который отбрасывает параметры модели поканально, оставляя при этом количество слоев модели неизменным, и (4) методом PowerInfer, который использует свойство контекстуальной разреженности в больших языковых моделях. Для всех перечисленных способов оптимизации были построены оптимизированные модели, и проведено сравнение точности и скорости работы полученных моделей.</p><p>Результаты экспериментов показали, что наибольшая точность была достигнута на модели, которая была оптимизирована с помощью метода послойного прунинга по критерию Тейлора важности слоя. Для данного метода был проведен ряд экспериментов, в которых исследовалась разная расстановка гейтов внутри слоя декодера, а также различные способы агрегирования важности слоя на гейтах. По итогам экспериментов можно сделать вывод, что расстановка гейтов после блоков Multi-Head Attention и использование агрегирования важности с помощью L2-нормы вектора градиентов дают наибольшую точность по сравнению с другими возможными вариантами.</p><p>Научная значимость работы состоит в сравнении передовых методов прунинга, исходя из соотношения качество/скорость модели, и получении ускоренной версии модели для задачи function calling.</p></abstract><trans-abstract xml:lang="en"><p>This work is devoted to the study and practical implementation of optimization methods (especially pruning) for large language models (LLMs) in the context of the function calling task, as well as to a comparison of the accuracy and speed of the obtained models.</p><p>The authors chose Mistral-7B as the base model and glaive-function-calling-v2 as the training dataset. 4-bit quantization in the nf4 format and double quantization were used in combination with the QLoRA (Quantized Low-Rank Adaptation) method.</p>
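<p>A minimal sketch of this training setup, assuming the Hugging Face transformers, peft and bitsandbytes libraries; the checkpoint identifier and the LoRA hyperparameters below are illustrative assumptions, not the exact values used in this work:</p><code language="python">
# Hypothetical illustration: load Mistral-7B with 4-bit nf4 weights,
# double quantization, and attach a QLoRA low-rank adapter.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit base weights
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data format
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matrix multiplications
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",            # illustrative checkpoint id
    quantization_config=bnb_config,
    device_map="auto",
)

# QLoRA: train small low-rank adapters on top of the frozen 4-bit weights.
# r, lora_alpha and target_modules here are illustrative defaults.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
</code>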
<p>Four different pruning methods were applied for model optimization. The first method, ShortGPT, reduces the model size by trimming its less significant parts. The second method is based on the Taylor criterion for layer-by-layer pruning. The third method, LLM-Pruner, removes parameters channel by channel while keeping the total number of layers unchanged. The fourth method, PowerInfer, exploits the contextual sparsity of large language models. Optimized models were implemented for all these methods; the accuracy and speed of the resulting models were compared.</p><p>Experimental results show that the highest accuracy was achieved with layer-by-layer pruning according to the Taylor criterion of layer importance. This method was tested with different placements of gates within the decoder layer and different ways of aggregating layer importance at the gates. The experiments show that the best results were achieved by placing the gates after the Multi-Head Attention blocks and using the L2 norm of the gradient vector to aggregate layer importance.</p>
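<p>A minimal sketch of the gate-based Taylor criterion, assuming PyTorch; the AttentionGate module and the aggregation helper are illustrative assumptions, not the exact implementation used in this work:</p><code language="python">
# Hypothetical illustration: identity-initialized gates are placed after the
# Multi-Head Attention block of each decoder layer; the L2 norm of the loss
# gradient at each gate gives a first-order Taylor estimate of how much
# removing that layer would change the loss.
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Per-channel gate inserted after the MHA output of one decoder layer."""
    def __init__(self, hidden_size):
        super().__init__()
        # Initialized to ones, so the gated model is identical to the original.
        self.gate = nn.Parameter(torch.ones(hidden_size))

    def forward(self, attn_output):
        return attn_output * self.gate

def layer_importance(gates, loss):
    """Aggregate per-layer importance as the L2 norm of the gate gradient."""
    grads = torch.autograd.grad(loss, [g.gate for g in gates])
    return [grad.norm(p=2).item() for grad in grads]

# In practice the scores would be accumulated over a calibration set;
# decoder layers with the smallest scores become pruning candidates.
</code>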
<p>The scholarly importance of the work lies in the comparison of advanced pruning methods in terms of the quality/speed ratio and in obtaining a sped-up version of the model for the function calling task.</p></trans-abstract><kwd-group xml:lang="ru"><kwd>прунинг</kwd><kwd>квантование</kwd><kwd>ряд Тейлора</kwd><kwd>большие языковые модели</kwd><kwd>механизм внимания</kwd><kwd>function calling</kwd><kwd>PowerInfer</kwd></kwd-group><kwd-group xml:lang="en"><kwd>pruning</kwd><kwd>quantization</kwd><kwd>Taylor series</kwd><kwd>large language models</kwd><kwd>attention mechanism</kwd><kwd>function calling</kwd><kwd>PowerInfer</kwd></kwd-group></article-meta></front><back><ref-list><title>References</title><ref id="cit1"><label>1</label><citation-alternatives><mixed-citation xml:lang="ru">Radford A. et al. Improving language understanding by generative pre-training. 2018.</mixed-citation><mixed-citation xml:lang="en">Radford A. et al. Improving language understanding by generative pre-training. 2018.</mixed-citation></citation-alternatives></ref><ref id="cit2"><label>2</label><citation-alternatives><mixed-citation xml:lang="ru">Devlin J. et al. BERT: Pre-training of deep bidirectional transformers for language understanding // arXiv preprint arXiv: 1810.04805. 2018. DOI: 10.18653/V1/N19-1423</mixed-citation><mixed-citation xml:lang="en">Devlin J. et al. BERT: Pre-training of deep bidirectional transformers for language understanding // arXiv preprint arXiv: 1810.04805. 2018. DOI: 10.18653/V1/N19-1423</mixed-citation></citation-alternatives></ref><ref id="cit3"><label>3</label><citation-alternatives><mixed-citation xml:lang="ru">Mikolov T. et al. Efficient estimation of word representations in vector space // arXiv preprint arXiv: 1301.3781. 2013. https://doi.org/10.48550/arXiv.1301.3781</mixed-citation><mixed-citation xml:lang="en">Mikolov T. et al. Efficient estimation of word representations in vector space // arXiv preprint arXiv: 1301.3781. 2013. https://doi.org/10.48550/arXiv.1301.3781</mixed-citation></citation-alternatives></ref><ref id="cit4"><label>4</label><citation-alternatives><mixed-citation xml:lang="ru">Vaswani A. et al. Attention is all you need // Advances in neural information processing systems. 2017. Т. 30. DOI: 10.5555/3295222.3295349</mixed-citation><mixed-citation xml:lang="en">Vaswani A. et al. Attention is all you need // Advances in neural information processing systems. 2017. Vol. 30. DOI: 10.5555/3295222.3295349</mixed-citation></citation-alternatives></ref><ref id="cit5"><label>5</label><citation-alternatives><mixed-citation xml:lang="ru">Ma X., Fang G., Wang X. LLM-Pruner: On the structural pruning of large language models // Advances in neural information processing systems. 2023. Т. 36. P. 21702–21720. https://doi.org/10.48550/arXiv.2305.11627</mixed-citation><mixed-citation xml:lang="en">Ma X., Fang G., Wang X. LLM-Pruner: On the structural pruning of large language models // Advances in neural information processing systems. 2023. Vol. 36. P. 21702–21720. https://doi.org/10.48550/arXiv.2305.11627</mixed-citation></citation-alternatives></ref><ref id="cit6"><label>6</label><citation-alternatives><mixed-citation xml:lang="ru">Men X. et al. ShortGPT: Layers in large language models are more redundant than you expect // arXiv preprint arXiv: 2403.03853. 2024. https://doi.org/10.48550/arXiv.2403.03853</mixed-citation><mixed-citation xml:lang="en">Men X. et al. ShortGPT: Layers in large language models are more redundant than you expect // arXiv preprint arXiv: 2403.03853. 2024. https://doi.org/10.48550/arXiv.2403.03853</mixed-citation></citation-alternatives></ref><ref id="cit7"><label>7</label><citation-alternatives><mixed-citation xml:lang="ru">Frantar E., Alistarh D. SparseGPT: Massive language models can be accurately pruned in one-shot // International Conference on Machine Learning. PMLR, 2023. P. 10323–10337. https://doi.org/10.48550/arXiv.2301.00774</mixed-citation><mixed-citation xml:lang="en">Frantar E., Alistarh D. SparseGPT: Massive language models can be accurately pruned in one-shot // International Conference on Machine Learning. PMLR, 2023. P. 10323–10337. https://doi.org/10.48550/arXiv.2301.00774</mixed-citation></citation-alternatives></ref><ref id="cit8"><label>8</label><citation-alternatives><mixed-citation xml:lang="ru">Liu Z. et al. Deja vu: Contextual sparsity for efficient LLMs at inference time // International Conference on Machine Learning. PMLR, 2023. P. 22137–22176. DOI: 10.5555/3618408.3619327</mixed-citation><mixed-citation xml:lang="en">Liu Z. et al. Deja vu: Contextual sparsity for efficient LLMs at inference time // International Conference on Machine Learning. PMLR, 2023. P. 22137–22176. DOI: 10.5555/3618408.3619327</mixed-citation></citation-alternatives></ref><ref id="cit9"><label>9</label><citation-alternatives><mixed-citation xml:lang="ru">Song Y. et al. PowerInfer: Fast large language model serving with a consumer-grade GPU // arXiv preprint arXiv: 2312.12456. 2023. 
https://doi.org/10.1145/3694715.3695964</mixed-citation><mixed-citation xml:lang="en">Song Y. et al. PowerInfer: Fast large language model serving with a consumer-grade GPU // arXiv preprint arXiv: 2312.12456. 2023. https://doi.org/10.1145/3694715.3695964</mixed-citation></citation-alternatives></ref><ref id="cit10"><label>10</label><citation-alternatives><mixed-citation xml:lang="ru">Jiang A. Q. et al. Mistral 7B // arXiv preprint arXiv: 2310.06825. 2023. https://doi.org/10.48550/arXiv.2310.06825</mixed-citation><mixed-citation xml:lang="en">Jiang A. Q. et al. Mistral 7B // arXiv preprint arXiv: 2310.06825. 2023. https://doi.org/10.48550/arXiv.2310.06825</mixed-citation></citation-alternatives></ref><ref id="cit11"><label>11</label><citation-alternatives><mixed-citation xml:lang="ru">Molchanov P. et al. Importance estimation for neural network pruning // Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019. P. 11264–11272. DOI: 10.1109/CVPR.2019.01152</mixed-citation><mixed-citation xml:lang="en">Molchanov P. et al. Importance estimation for neural network pruning // Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019. P. 11264–11272. DOI: 10.1109/CVPR.2019.01152</mixed-citation></citation-alternatives></ref><ref id="cit12"><label>12</label><citation-alternatives><mixed-citation xml:lang="ru">Gerganov G. llama.cpp: Port of Facebook’s LLaMA model in C/C++. GitHub, 2023. URL: https://github.com/ggerganov/llama.cpp</mixed-citation><mixed-citation xml:lang="en">Gerganov G. llama.cpp: Port of Facebook’s LLaMA model in C/C++. GitHub, 2023. URL: https://github.com/ggerganov/llama.cpp</mixed-citation></citation-alternatives></ref></ref-list><fn-group><fn fn-type="conflict"><p>The authors declare that there are no conflicts of interest.</p></fn></fn-group></back></article>
