Research of inference speed optimization methods of large language models for function calling task
https://doi.org/10.25205/1818-7900-2025-23-4-44-61
Abstract
This work is devoted to the study and practical implementation of optimization methods (especially pruning) for large language models (LLMs) in the context of the function calling task, as well as to a comparison of the accuracy and speed of the resulting models.
The authors chose Mistral-7B as the base model and glaive-function-calling-v2 as the training dataset. 4-bit quantization in the nf4 format and double quantization were used in combination with the QLoRA (Quantized Low-Rank Adaptation) method.
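As a minimal sketch, such a quantized fine-tuning setup is typically expressed with the Hugging Face transformers and peft libraries. The quantization flags below (nf4, double quantization) come from the abstract; the LoRA rank, alpha, dropout, and target modules are illustrative assumptions, not values reported by the authors.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit quantization in the nf4 format with double quantization,
# as described in the abstract
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # nf4 data format
    bnb_4bit_use_double_quant=True,      # double quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# QLoRA adapter configuration; rank/alpha/targets here are
# illustrative guesses, not the paper's hyperparameters
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```

In a typical QLoRA workflow, `bnb_config` would be passed as `quantization_config` when loading Mistral-7B with `AutoModelForCausalLM.from_pretrained`, and the model would then be wrapped with `get_peft_model(model, lora_config)` before training.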
Four different pruning methods were applied to optimize the model. The first, ShortGPT, reduces the model size by removing its least significant parts. The second performs layer-by-layer pruning based on Taylor's criterion. The third, LLM-Pruner, removes parameters channel by channel while keeping the total number of layers. The fourth, PowerInfer, exploits the contextual sparsity of large language models. Optimized models were implemented for all four methods, and their accuracy and speed were compared.
The experimental results show that the highest accuracy was achieved by layer-by-layer pruning based on Taylor's criterion of layer importance. This method was tested with different placements of gates within the decoder layer and different ways of aggregating layer importance on the gates. The best results were achieved by placing the gates after the Multi-Head Attention blocks and using the L2 norm of the gradient vector to aggregate layer importance.
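The gate-based importance estimate can be sketched in plain Python: each decoder layer carries a scalar gate (here, after its Multi-Head Attention block), gradients of the loss with respect to that gate are accumulated over a calibration set, and the layer's importance is the L2 norm of those gradients. The function names and data layout below are illustrative assumptions, not the authors' code.

```python
import math

def layer_importance(gate_grads):
    """Aggregate each layer's accumulated gate gradients with the L2 norm.

    gate_grads: one list of gradient samples per layer, e.g. gradients of
    the loss w.r.t. a scalar gate placed after the Multi-Head Attention
    block, collected over a calibration set.
    """
    return [math.sqrt(sum(g * g for g in grads)) for grads in gate_grads]

def layers_to_prune(gate_grads, n_prune):
    """Return the indices of the n_prune layers with the lowest importance,
    i.e. the layers whose removal is expected to hurt the loss least."""
    scores = layer_importance(gate_grads)
    return sorted(range(len(scores)), key=lambda i: scores[i])[:n_prune]
```

For example, with `gate_grads = [[0.1, -0.2], [1.0, 2.0], [0.0, 0.05]]`, the third layer has the smallest gradient norm, so `layers_to_prune(gate_grads, 1)` selects index 2 for removal.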
The scholarly contribution of this work is a comparison of state-of-the-art pruning methods in terms of the quality/speed trade-off and a sped-up version of the model for the function calling task.
About the Authors
A. I. Goncharenko
Russian Federation
Alexander I. Goncharenko, Senior lecturer
Novosibirsk
M. I. Chuprov
Russian Federation
Maxim I. Chuprov, Artificial intelligence systems developer/researcher
Novosibirsk
E. S. Nezhevenko
Russian Federation
Evgeniy S. Nezhevenko, PhD, Leading Researcher of the subject group of optical-electronic specialized processors
Novosibirsk
References
1. Radford A. et al. Improving language understanding by generative pre-training. 2018.
2. Devlin J. et al. BERT: Pre-training of deep bidirectional transformers for language understanding // arXiv preprint arXiv:1810.04805. 2018. https://doi.org/10.18653/v1/N19-1423
3. Mikolov T. et al. Efficient estimation of word representations in vector space // arXiv preprint arXiv:1301.3781. 2013. https://doi.org/10.48550/arXiv.1301.3781
4. Vaswani A. et al. Attention is all you need // Advances in Neural Information Processing Systems. 2017. Vol. 30. DOI: 10.5555/3295222.3295349
5. Ma X., Fang G., Wang X. LLM-Pruner: On the structural pruning of large language models // Advances in Neural Information Processing Systems. 2023. Vol. 36. P. 21702–21720. https://doi.org/10.48550/arXiv.2305.11627
6. Men X. et al. ShortGPT: Layers in large language models are more redundant than you expect // arXiv preprint arXiv:2403.03853. 2024. https://doi.org/10.48550/arXiv.2403.03853
7. Frantar E., Alistarh D. SparseGPT: Massive language models can be accurately pruned in one-shot // International Conference on Machine Learning. PMLR, 2023. P. 10323–10337. https://doi.org/10.48550/arXiv.2301.00774
8. Liu Z. et al. Deja Vu: Contextual sparsity for efficient LLMs at inference time // International Conference on Machine Learning. PMLR, 2023. P. 22137–22176. DOI: 10.5555/3618408.3619327
9. Song Y. et al. PowerInfer: Fast large language model serving with a consumer-grade GPU // arXiv preprint arXiv:2312.12456. 2023. https://doi.org/10.1145/3694715.3695964
10. Jiang A. Q. et al. Mistral 7B // arXiv preprint arXiv:2310.06825. 2023. https://doi.org/10.48550/arXiv.2310.06825
11. Molchanov P. et al. Importance estimation for neural network pruning // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019. P. 11264–11272. https://doi.org/10.1109/CVPR.2019.01152
12. Gerganov G. ggerganov/llama.cpp: Port of Facebook's LLaMA model in C/C++. GitHub, 2023.
For citations:
Goncharenko A.I., Chuprov M.I., Nezhevenko E.S. Research of inference speed optimization methods of large language models for function calling task. Vestnik NSU. Series: Information Technologies. 2025;23(4):44-61. (In Russ.) https://doi.org/10.25205/1818-7900-2025-23-4-44-61