Research of inference speed optimization methods of large language models for function calling task
https://doi.org/10.25205/1818-7900-2025-23-4-44-61
Abstract
This work is devoted to the study and practical implementation of optimization methods (especially pruning) for large language models (LLMs) in the context of the function calling task, as well as to a comparison of the accuracy and speed of the resulting models.
The authors chose Mistral-7B as the base model and glaive-function-calling-v2 as the training dataset. 4-bit quantization in the nf4 format and double quantization were used in combination with the QLoRA (Quantized Low-Rank Adaptation) method.
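As a minimal sketch, such a quantized fine-tuning setup is typically expressed with the Hugging Face transformers and peft libraries. The quantization flags below (nf4, double quantization) come from the abstract; the LoRA rank, alpha, dropout, and target modules are illustrative assumptions, not values reported by the authors.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit quantization in the nf4 format with double quantization,
# as described in the abstract
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # nf4 data format
    bnb_4bit_use_double_quant=True,      # double quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# QLoRA adapter configuration; rank/alpha/targets here are
# illustrative guesses, not the paper's hyperparameters
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```

In a typical QLoRA workflow, `bnb_config` would be passed as `quantization_config` when loading Mistral-7B with `AutoModelForCausalLM.from_pretrained`, and the model would then be wrapped with `get_peft_model(model, lora_config)` before training.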
Four different pruning methods were applied to optimize the model. The first, ShortGPT, reduces the model size by removing its least significant parts. The second performs layer-by-layer pruning based on Taylor's criterion. The third, LLM-Pruner, removes parameters channel by channel while keeping the total number of layers. The fourth, PowerInfer, exploits the contextual sparsity of large language models. Optimized models were implemented for all four methods, and their accuracy and speed were compared.
The experimental results show that the highest accuracy was achieved by layer-by-layer pruning based on Taylor's criterion of layer importance. This method was tested with different placements of gates within the decoder layer and different ways of aggregating layer importance on the gates. The best results were achieved by placing the gates after the Multi-Head Attention blocks and using the L2 norm of the gradient vector to aggregate layer importance.
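The gate-based importance estimate can be sketched in plain Python: each decoder layer carries a scalar gate (here, after its Multi-Head Attention block), gradients of the loss with respect to that gate are accumulated over a calibration set, and the layer's importance is the L2 norm of those gradients. The function names and data layout below are illustrative assumptions, not the authors' code.

```python
import math

def layer_importance(gate_grads):
    """Aggregate each layer's accumulated gate gradients with the L2 norm.

    gate_grads: one list of gradient samples per layer, e.g. gradients of
    the loss w.r.t. a scalar gate placed after the Multi-Head Attention
    block, collected over a calibration set.
    """
    return [math.sqrt(sum(g * g for g in grads)) for grads in gate_grads]

def layers_to_prune(gate_grads, n_prune):
    """Return the indices of the n_prune layers with the lowest importance,
    i.e. the layers whose removal is expected to hurt the loss least."""
    scores = layer_importance(gate_grads)
    return sorted(range(len(scores)), key=lambda i: scores[i])[:n_prune]
```

For example, with `gate_grads = [[0.1, -0.2], [1.0, 2.0], [0.0, 0.05]]`, the third layer has the smallest gradient norm, so `layers_to_prune(gate_grads, 1)` selects index 2 for removal.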
The scholarly contribution of this work is a comparison of state-of-the-art pruning methods in terms of the quality/speed trade-off and a sped-up version of the model for the function calling task.
About the Authors
A. I. Goncharenko
Russian Federation
Alexander I. Goncharenko, Senior lecturer
Novosibirsk
M. I. Chuprov
Russian Federation
Maxim I. Chuprov, Artificial intelligence systems developer/researcher
Novosibirsk
E. S. Nezhevenko
Russian Federation
Evgeniy S. Nezhevenko, PhD, Leading Researcher of the subject group of optical-electronic specialized processors
Novosibirsk
References
1. Radford A. et al. Improving language understanding by generative pre-training. 2018.
2. Devlin J. et al. BERT: Pre-training of deep bidirectional transformers for language understanding // arXiv preprint arXiv:1810.04805. 2018. https://doi.org/10.18653/v1/N19-1423
3. Mikolov T. et al. Efficient estimation of word representations in vector space // arXiv preprint arXiv:1301.3781. 2013. https://doi.org/10.48550/arXiv.1301.3781
4. Vaswani A. et al. Attention is all you need // Advances in Neural Information Processing Systems. 2017. Vol. 30. DOI: 10.5555/3295222.3295349
5. Ma X., Fang G., Wang X. LLM-Pruner: On the structural pruning of large language models // Advances in Neural Information Processing Systems. 2023. Vol. 36. P. 21702–21720. https://doi.org/10.48550/arXiv.2305.11627
6. Men X. et al. ShortGPT: Layers in large language models are more redundant than you expect // arXiv preprint arXiv:2403.03853. 2024. https://doi.org/10.48550/arXiv.2403.03853
7. Frantar E., Alistarh D. SparseGPT: Massive language models can be accurately pruned in one-shot // International Conference on Machine Learning. PMLR, 2023. P. 10323–10337. https://doi.org/10.48550/arXiv.2301.00774
8. Liu Z. et al. Deja Vu: Contextual sparsity for efficient LLMs at inference time // International Conference on Machine Learning. PMLR, 2023. P. 22137–22176. DOI: 10.5555/3618408.3619327
9. Song Y. et al. PowerInfer: Fast large language model serving with a consumer-grade GPU // arXiv preprint arXiv:2312.12456. 2023. https://doi.org/10.1145/3694715.3695964
10. Jiang A. Q. et al. Mistral 7B // arXiv preprint arXiv:2310.06825. 2023. https://doi.org/10.48550/arXiv.2310.06825
11. Molchanov P. et al. Importance estimation for neural network pruning // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019. P. 11264–11272. https://doi.org/10.1109/CVPR.2019.01152
12. Gerganov G. ggerganov/llama.cpp: Port of Facebook's LLaMA model in C/C++. GitHub, 2023.
For citations:
Goncharenko A.I., Chuprov M.I., Nezhevenko E.S. Research of inference speed optimization methods of large language models for function calling task. Vestnik NSU. Series: Information Technologies. 2025;23(4):44-61. (In Russ.) https://doi.org/10.25205/1818-7900-2025-23-4-44-61