Vestnik NSU. Series: Information Technologies

Research of inference speed optimization methods of large language models for function calling task

https://doi.org/10.25205/1818-7900-2025-23-4-44-61

Abstract

This work is devoted to the study and practical implementation of optimization methods (especially pruning) for large language models (LLMs) in the context of the function calling task, as well as a comparison of the accuracy and speed of the resulting models.

The authors chose Mistral-7B as the base model and glaive-function-calling-v2 as the training dataset. 4-bit quantization in the NF4 format and double quantization were used in combination with the QLoRA (Quantized Low-Rank Adaptation) method.
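The idea behind NF4 can be illustrated without any deep learning framework: weights in a block are scaled by their absolute maximum and snapped to 16 fixed levels derived from quantiles of a standard normal distribution. The sketch below is a minimal NumPy illustration of that scheme; the level table is the published NF4 codebook, but the helper functions are illustrative and not the actual bitsandbytes API.

```python
import numpy as np

# The 16 NF4 levels: quantiles of N(0, 1) normalized to [-1, 1],
# as published for the NF4 data type in the QLoRA paper.
NF4_LEVELS = np.array([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
])

def nf4_quantize(block: np.ndarray):
    """Quantize one 1-D weight block: scale to [-1, 1] by the block's
    absmax, then snap each value to the nearest NF4 level."""
    scale = np.abs(block).max()                 # one fp scale per block
    normed = block / scale
    idx = np.abs(normed[:, None] - NF4_LEVELS[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), scale          # 4-bit codes + scale

def nf4_dequantize(idx: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate weights from codes and the block scale."""
    return NF4_LEVELS[idx] * scale

rng = np.random.default_rng(0)
w = rng.normal(size=64).astype(np.float32)      # toy weight block
codes, scale = nf4_quantize(w)
w_hat = nf4_dequantize(codes, scale)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```

Double quantization, as used in QLoRA, additionally quantizes the per-block scales themselves to save memory; that second level is omitted here for brevity.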

Four different pruning methods were applied to optimize the model. The first, ShortGPT, reduces model size by removing the least significant layers. The second performs layer-by-layer pruning based on the Taylor criterion. The third, LLM-Pruner, removes parameters channel by channel while keeping the total number of layers. The fourth, PowerInfer, exploits the contextual sparsity of large language models. Optimized models were built with each of these methods, and their accuracy and speed were compared.
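ShortGPT ranks decoder layers by how much they actually transform their input: a layer whose output is nearly collinear with its input contributes little and is a pruning candidate. The sketch below illustrates that "block influence" idea (one minus the mean cosine similarity between a layer's input and output hidden states) with synthetic activations; real usage would collect hidden states from calibration data.

```python
import numpy as np

def block_influence(h_in: np.ndarray, h_out: np.ndarray) -> float:
    """ShortGPT-style score: 1 - mean cosine similarity between a layer's
    input and output hidden states (rows = tokens). A low score means the
    layer barely changes its input, marking it as redundant."""
    cos = np.sum(h_in * h_out, axis=1) / (
        np.linalg.norm(h_in, axis=1) * np.linalg.norm(h_out, axis=1)
    )
    return float(1.0 - cos.mean())

rng = np.random.default_rng(0)
h = rng.normal(size=(8, 16))                       # toy token hidden states
near_identity = h + 0.01 * rng.normal(size=h.shape)  # layer ~ identity
strong_change = rng.normal(size=(8, 16))             # layer transforms input

scores = {
    "near-identity layer": block_influence(h, near_identity),
    "transforming layer": block_influence(h, strong_change),
}
for name, s in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: {s:.4f}")
```

Layers with the smallest scores are removed first, which is why ShortGPT tends to trim whole decoder blocks rather than individual channels.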

The experiments show that the highest accuracy was achieved by layer-by-layer pruning based on the Taylor criterion of layer importance. This method was tested with different placements of gates within the decoder layer and different ways of aggregating layer importance at the gates. The best results were obtained by placing the gates after the multi-head attention blocks and using the L2 norm of the gradient vector to aggregate layer importance.
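The winning scheme attaches a scalar gate to each decoder layer and scores layers by the L2 norm of the gradients accumulated on the gates. The sketch below shows only that aggregation-and-ranking step with synthetic per-batch gate gradients; function names and the gradient data are illustrative, not taken from the paper's code.

```python
import numpy as np

def layer_importance(gate_grads: np.ndarray) -> np.ndarray:
    """Aggregate per-batch gradients on each layer's scalar gate with the
    L2 norm: importance_l = || dL/dg_l over batches ||_2.
    gate_grads has shape (num_batches, num_layers)."""
    return np.linalg.norm(gate_grads, axis=0)

def layers_to_prune(gate_grads: np.ndarray, n_prune: int) -> list[int]:
    """Indices of the n_prune least important layers (smallest norms)."""
    imp = layer_importance(gate_grads)
    return sorted(np.argsort(imp)[:n_prune].tolist())

# Synthetic gradients: layers 2 and 5 receive near-zero gradient signal,
# so they should be ranked least important.
rng = np.random.default_rng(0)
grads = rng.normal(size=(32, 8))   # 32 batches, 8 decoder layers
grads[:, 2] *= 0.01
grads[:, 5] *= 0.01

print(layers_to_prune(grads, 2))   # -> [2, 5]
```

Other aggregations (e.g. the mean absolute gradient) plug into the same ranking loop, which is presumably how the paper compared aggregation variants.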

The scholarly contribution of this work is a comparison of state-of-the-art pruning methods in terms of the quality/speed trade-off, and a sped-up version of the model for the function calling task.

About the Authors

A. I. Goncharenko
Institute of Intelligent Robotics of Novosibirsk State University
Russian Federation

Alexander I. Goncharenko, Senior lecturer

Novosibirsk



M. I. Chuprov
Expasoft LLC
Russian Federation

Maxim I. Chuprov, Artificial intelligence systems developer/researcher

Novosibirsk



E. S. Nejevenko
Institute of Automation and Electrometry of the Siberian Branch of the Russian Academy of Sciences
Russian Federation

Evgeniy S. Nezhevenko, PhD, Leading Researcher, subject group of optical-electronic specialized processors

Novosibirsk



References

1. Radford A. et al. Improving language understanding by generative pre-training. 2018.

2. Devlin J. et al. BERT: Pre-training of deep bidirectional transformers for language understanding // arXiv preprint arXiv:1810.04805. 2018. DOI: 10.18653/v1/N19-1423

3. Mikolov T. et al. Efficient estimation of word representations in vector space // arXiv preprint arXiv:1301.3781. 2013. https://doi.org/10.48550/arXiv.1301.3781

4. Vaswani A. et al. Attention is all you need // Advances in Neural Information Processing Systems. 2017. Vol. 30. DOI: 10.5555/3295222.3295349

5. Ma X., Fang G., Wang X. LLM-Pruner: On the structural pruning of large language models // Advances in Neural Information Processing Systems. 2023. Vol. 36. Pp. 21702–21720. https://doi.org/10.48550/arXiv.2305.11627

6. Men X. et al. ShortGPT: Layers in large language models are more redundant than you expect // arXiv preprint arXiv:2403.03853. 2024. https://doi.org/10.48550/arXiv.2403.03853

7. Frantar E., Alistarh D. SparseGPT: Massive language models can be accurately pruned in one-shot // International Conference on Machine Learning. PMLR, 2023. Pp. 10323–10337. https://doi.org/10.48550/arXiv.2301.00774

8. Liu Z. et al. Deja Vu: Contextual sparsity for efficient LLMs at inference time // International Conference on Machine Learning. PMLR, 2023. Pp. 22137–22176. DOI: 10.5555/3618408.3619327

9. Song Y. et al. PowerInfer: Fast large language model serving with a consumer-grade GPU // arXiv preprint arXiv:2312.12456. 2023. https://doi.org/10.1145/3694715.3695964

10. Jiang A. Q. et al. Mistral 7B // arXiv preprint arXiv:2310.06825. 2023. https://doi.org/10.48550/arXiv.2310.06825

11. Molchanov P. et al. Importance estimation for neural network pruning // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019. Pp. 11264–11272. DOI: 10.1109/CVPR.2019.01152

12. Gerganov G. ggerganov/llama.cpp: Port of Facebook's LLaMA model in C/C++ // GitHub. 2023.


For citations:


Goncharenko A.I., Chuprov M.I., Nejevenko E.S. Research of inference speed optimization methods of large language models for function calling task. Vestnik NSU. Series: Information Technologies. 2025;23(4):44-61. (In Russ.) https://doi.org/10.25205/1818-7900-2025-23-4-44-61

This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 1818-7900 (Print)
ISSN 2410-0420 (Online)