<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" "JATS-journalpublishing1-3.dtd">
<article article-type="research-article" dtd-version="1.3" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:lang="ru"><front><journal-meta><journal-id journal-id-type="publisher-id">intechngu</journal-id><journal-title-group><journal-title xml:lang="ru">Вестник НГУ. Серия: Информационные технологии</journal-title><trans-title-group xml:lang="en"><trans-title>Vestnik NSU. Series: Information Technologies</trans-title></trans-title-group></journal-title-group><issn pub-type="ppub">1818-7900</issn><issn pub-type="epub">2410-0420</issn><publisher><publisher-name>НГУ</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.25205/1818-7900-2025-23-4-44-61</article-id><article-id custom-type="elpub" pub-id-type="custom">intechngu-338</article-id><article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject></subj-group><subj-group subj-group-type="section-heading" xml:lang="ru"><subject>Статьи</subject></subj-group></article-categories><title-group><article-title>Исследование методов оптимизации скорости исполнения больших языковых моделей для задачи распознавания команд</article-title><trans-title-group xml:lang="en"><trans-title>Research of inference speed optimization methods of large language models for function calling task</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author" corresp="yes"><contrib-id contrib-id-type="orcid">https://orcid.org/0009-0000-5087-8506</contrib-id><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Гончаренко</surname><given-names>А. И.</given-names></name><name name-style="western" xml:lang="en"><surname>Goncharenko</surname><given-names>A. I.</given-names></name></name-alternatives><bio xml:lang="ru"><p>Гончаренко Александр Игоревич, старший преподаватель</p><p>Новосибирск</p></bio><bio xml:lang="en"><p>Alexander I. Goncharenko, Senior lecturer</p><p>Novosibirsk</p></bio><email xlink:type="simple">a.goncharenko@expasoft.tech</email><xref ref-type="aff" rid="aff-1"/></contrib><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Чупров</surname><given-names>М. И.</given-names></name><name name-style="western" xml:lang="en"><surname>Chuprov</surname><given-names>M. I.</given-names></name></name-alternatives><bio xml:lang="ru"><p>Чупров Максим Иванович, разработчик-исследователь систем искусственного интеллекта</p><p>Новосибирск</p></bio><bio xml:lang="en"><p>Maxim I. Chuprov, Artificial intelligence systems developer/researcher</p><p>Novosibirsk</p></bio><email xlink:type="simple">m.chuprov@expasoft.tech</email><xref ref-type="aff" rid="aff-2"/></contrib><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Нежевенко</surname><given-names>Е. С.</given-names></name><name name-style="western" xml:lang="en"><surname>Nejevenko</surname><given-names>E. S.</given-names></name></name-alternatives><bio xml:lang="ru"><p>Нежевенко Евгений Семенович, доктор технических наук, ведущий научный сотрудник тематической группы оптико-электронных специализированных процессоров</p><p>Новосибирск</p></bio><bio xml:lang="en"><p>Evgeniy S. 
Nejevenko, Doctor of Technical Sciences, Leading researcher of the subject group of optical-electronic specialized processors</p><p>Novosibirsk</p></bio><email xlink:type="simple">nedj@iae.nsk.su</email><xref ref-type="aff" rid="aff-3"/></contrib></contrib-group><aff-alternatives id="aff-1"><aff xml:lang="ru">Институт интеллектуальной робототехники НГУ<country>Россия</country></aff><aff xml:lang="en">Institute of Intelligent Robotics of Novosibirsk State University<country>Russian Federation</country></aff></aff-alternatives><aff-alternatives id="aff-2"><aff xml:lang="ru">ООО «Экспасофт»<country>Россия</country></aff><aff xml:lang="en">Expasoft LLC<country>Russian Federation</country></aff></aff-alternatives><aff-alternatives id="aff-3"><aff xml:lang="ru">Институт автоматики и электрометрии СО РАН<country>Россия</country></aff><aff xml:lang="en">Institute of Automation and Electrometry of the Siberian Branch of the Russian Academy of Sciences<country>Russian Federation</country></aff></aff-alternatives><pub-date pub-type="collection"><year>2025</year></pub-date><pub-date pub-type="epub"><day>12</day><month>02</month><year>2026</year></pub-date><volume>23</volume><issue>4</issue><fpage>44</fpage><lpage>61</lpage><permissions><copyright-statement>Copyright &#x00A9; Гончаренко А.И., Чупров М.И., Нежевенко Е.С., 2026</copyright-statement><copyright-year>2026</copyright-year><copyright-holder xml:lang="ru">Гончаренко А.И., Чупров М.И., Нежевенко Е.С.</copyright-holder><copyright-holder xml:lang="en">Goncharenko A.I., Chuprov M.I., Nejevenko E.S.</copyright-holder><license license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>This work is licensed under a Creative Commons Attribution 4.0 License.</license-p></license></permissions><self-uri xlink:href="https://intechngu.elpub.ru/jour/article/view/338">https://intechngu.elpub.ru/jour/article/view/338</self-uri><abstract><p>Целью данной работы являлось исследование и реализация методов оптимизации (особенно методов прунинга) больших языковых моделей для задачи function calling, а также сравнение точности и скорости работы полученных моделей.</p><p>В качестве базовой модели была выбрана модель Mistral-7B. Для эффективной тренировки модели использовался датасет glaive-function-calling-v2, предназначенный для задачи function calling. Для обучения базовой модели использовалось квантование до 4 бит в формате nf4 и двойное квантование в сочетании с методом QLoRA (Quantized Low-Rank Adaptation).</p><p>Оптимизация модели проводилась несколькими способами: (1) с использованием метода ShortGPT, (2) с помощью критерия Тейлора для послойного прунинга, (3) методом LLM-Pruner, который отбрасывает параметры модели поканально, оставляя при этом количество слоев модели неизменным, и (4) методом PowerInfer, который использует свойство контекстуальной разреженности в больших языковых моделях. Для всех перечисленных способов оптимизации были построены оптимизированные модели, и проведено сравнение точности и скорости работы полученных моделей.</p><p>Результаты экспериментов показали, что наибольшая точность была достигнута на модели, которая была оптимизирована с помощью метода послойного прунинга по критерию Тейлора важности слоя. Для данного метода был проведен ряд экспериментов, в которых исследовалась разная расстановка гейтов внутри слоя декодера, а также различные способы агрегирования важности слоя на гейтах. По итогам экспериментов можно сделать вывод, что расстановка гейтов после блоков Multi-Head Attention и использование агрегирования важности с помощью L2-нормы вектора градиентов дают наибольшую точность по сравнению с другими возможными вариантами.</p><p>Научная значимость работы состоит в сравнении передовых методов прунинга, исходя из соотношения качество/скорость модели, и получении ускоренной версии модели для задачи function calling.</p></abstract><trans-abstract xml:lang="en"><p>This work is devoted to the study and practical implementation of optimization methods (especially pruning) for large language models (LLMs) in the context of the function calling task, as well as to a comparison of the accuracy and speed of the obtained models.</p><p>The authors chose Mistral-7B as the base model and glaive-function-calling-v2 as the training dataset. 4-bit quantization in the nf4 format and double quantization were used in combination with the QLoRA (Quantized Low-Rank Adaptation) method.</p>
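<p>A minimal sketch of this training setup, assuming the Hugging Face transformers, peft and bitsandbytes libraries; the checkpoint identifier and the LoRA hyperparameters below are illustrative assumptions, not the exact values used in this work:</p><code language="python">
# Hypothetical illustration: load Mistral-7B with 4-bit nf4 weights,
# double quantization, and attach a QLoRA low-rank adapter.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit base weights
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data format
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matrix multiplications
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",            # illustrative checkpoint id
    quantization_config=bnb_config,
    device_map="auto",
)

# QLoRA: train small low-rank adapters on top of the frozen 4-bit weights.
# r, lora_alpha and target_modules here are illustrative defaults.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
</code>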
<p>Four different pruning methods were applied for model optimization. The first method, ShortGPT, reduces the model size by trimming its less significant parts. The second method is based on the Taylor criterion for layer-by-layer pruning. The third method, LLM-Pruner, removes parameters channel by channel while keeping the total number of layers unchanged. The fourth method, PowerInfer, exploits the contextual sparsity of large language models. Optimized models were implemented for all these methods; the accuracy and speed of the resulting models were compared.</p><p>Experimental results show that the highest accuracy was achieved with layer-by-layer pruning according to the Taylor criterion of layer importance. This method was tested with different placements of gates within the decoder layer and different ways of aggregating layer importance at the gates. The experiments show that the best results were achieved by placing the gates after the Multi-Head Attention blocks and using the L2 norm of the gradient vector to aggregate layer importance.</p>
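<p>A minimal sketch of the gate-based Taylor criterion, assuming PyTorch; the AttentionGate module and the aggregation helper are illustrative assumptions, not the exact implementation used in this work:</p><code language="python">
# Hypothetical illustration: identity-initialized gates are placed after the
# Multi-Head Attention block of each decoder layer; the L2 norm of the loss
# gradient at each gate gives a first-order Taylor estimate of how much
# removing that layer would change the loss.
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Per-channel gate inserted after the MHA output of one decoder layer."""
    def __init__(self, hidden_size):
        super().__init__()
        # Initialized to ones, so the gated model is identical to the original.
        self.gate = nn.Parameter(torch.ones(hidden_size))

    def forward(self, attn_output):
        return attn_output * self.gate

def layer_importance(gates, loss):
    """Aggregate per-layer importance as the L2 norm of the gate gradient."""
    grads = torch.autograd.grad(loss, [g.gate for g in gates])
    return [grad.norm(p=2).item() for grad in grads]

# In practice the scores would be accumulated over a calibration set;
# decoder layers with the smallest scores become pruning candidates.
</code>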
<p>The scholarly importance of the work lies in the comparison of advanced pruning methods in terms of the quality/speed ratio and in obtaining a sped-up version of the model for the function calling task.</p></trans-abstract><kwd-group xml:lang="ru"><kwd>прунинг</kwd><kwd>квантование</kwd><kwd>ряд Тейлора</kwd><kwd>большие языковые модели</kwd><kwd>механизм внимания</kwd><kwd>function calling</kwd><kwd>PowerInfer</kwd></kwd-group><kwd-group xml:lang="en"><kwd>pruning</kwd><kwd>quantization</kwd><kwd>Taylor series</kwd><kwd>large language models</kwd><kwd>attention mechanism</kwd><kwd>function calling</kwd><kwd>PowerInfer</kwd></kwd-group></article-meta></front><back><ref-list><title>References</title><ref id="cit1"><label>1</label><citation-alternatives><mixed-citation xml:lang="ru">Radford A. et al. Improving language understanding by generative pre-training. 2018.</mixed-citation><mixed-citation xml:lang="en">Radford A. et al. Improving language understanding by generative pre-training. 2018.</mixed-citation></citation-alternatives></ref><ref id="cit2"><label>2</label><citation-alternatives><mixed-citation xml:lang="ru">Devlin J. et al. BERT: Pre-training of deep bidirectional transformers for language understanding // arXiv preprint arXiv: 1810.04805. 2018. DOI: 10.18653/V1/N19-1423</mixed-citation><mixed-citation xml:lang="en">Devlin J. et al. BERT: Pre-training of deep bidirectional transformers for language understanding // arXiv preprint arXiv: 1810.04805. 2018. DOI: 10.18653/V1/N19-1423</mixed-citation></citation-alternatives></ref><ref id="cit3"><label>3</label><citation-alternatives><mixed-citation xml:lang="ru">Mikolov T. et al. Efficient estimation of word representations in vector space // arXiv preprint arXiv: 1301.3781. 2013. https://doi.org/10.48550/arXiv.1301.3781</mixed-citation><mixed-citation xml:lang="en">Mikolov T. et al. Efficient estimation of word representations in vector space // arXiv preprint arXiv: 1301.3781. 2013. https://doi.org/10.48550/arXiv.1301.3781</mixed-citation></citation-alternatives></ref><ref id="cit4"><label>4</label><citation-alternatives><mixed-citation xml:lang="ru">Vaswani A. et al. Attention is all you need // Advances in neural information processing systems. 2017. Т. 30. DOI: 10.5555/3295222.3295349</mixed-citation><mixed-citation xml:lang="en">Vaswani A. et al. Attention is all you need // Advances in neural information processing systems. 2017. Vol. 30. DOI: 10.5555/3295222.3295349</mixed-citation></citation-alternatives></ref><ref id="cit5"><label>5</label><citation-alternatives><mixed-citation xml:lang="ru">Ma X., Fang G., Wang X. LLM-Pruner: On the structural pruning of large language models // Advances in neural information processing systems. 2023. Т. 36. P. 21702–21720. https://doi.org/10.48550/arXiv.2305.11627</mixed-citation><mixed-citation xml:lang="en">Ma X., Fang G., Wang X. LLM-Pruner: On the structural pruning of large language models // Advances in neural information processing systems. 2023. Vol. 36. P. 21702–21720. https://doi.org/10.48550/arXiv.2305.11627</mixed-citation></citation-alternatives></ref><ref id="cit6"><label>6</label><citation-alternatives><mixed-citation xml:lang="ru">Men X. et al. ShortGPT: Layers in large language models are more redundant than you expect // arXiv preprint arXiv: 2403.03853. 2024. https://doi.org/10.48550/arXiv.2403.03853</mixed-citation><mixed-citation xml:lang="en">Men X. et al. ShortGPT: Layers in large language models are more redundant than you expect // arXiv preprint arXiv: 2403.03853. 2024. https://doi.org/10.48550/arXiv.2403.03853</mixed-citation></citation-alternatives></ref><ref id="cit7"><label>7</label><citation-alternatives><mixed-citation xml:lang="ru">Frantar E., Alistarh D. SparseGPT: Massive language models can be accurately pruned in one-shot // International Conference on Machine Learning. PMLR, 2023. P. 10323–10337. https://doi.org/10.48550/arXiv.2301.00774</mixed-citation><mixed-citation xml:lang="en">Frantar E., Alistarh D. SparseGPT: Massive language models can be accurately pruned in one-shot // International Conference on Machine Learning. PMLR, 2023. P. 10323–10337. https://doi.org/10.48550/arXiv.2301.00774</mixed-citation></citation-alternatives></ref><ref id="cit8"><label>8</label><citation-alternatives><mixed-citation xml:lang="ru">Liu Z. et al. Deja vu: Contextual sparsity for efficient LLMs at inference time // International Conference on Machine Learning. PMLR, 2023. P. 22137–22176. DOI: 10.5555/3618408.3619327</mixed-citation><mixed-citation xml:lang="en">Liu Z. et al. Deja vu: Contextual sparsity for efficient LLMs at inference time // International Conference on Machine Learning. PMLR, 2023. P. 22137–22176. DOI: 10.5555/3618408.3619327</mixed-citation></citation-alternatives></ref><ref id="cit9"><label>9</label><citation-alternatives><mixed-citation xml:lang="ru">Song Y. et al. PowerInfer: Fast large language model serving with a consumer-grade GPU // arXiv preprint arXiv: 2312.12456. 2023. 
https://doi.org/10.1145/3694715.3695964</mixed-citation><mixed-citation xml:lang="en">Song Y. et al. PowerInfer: Fast large language model serving with a consumer-grade GPU // arXiv preprint arXiv: 2312.12456. 2023. https://doi.org/10.1145/3694715.3695964</mixed-citation></citation-alternatives></ref><ref id="cit10"><label>10</label><citation-alternatives><mixed-citation xml:lang="ru">Jiang A. Q. et al. Mistral 7B // arXiv preprint arXiv: 2310.06825. 2023. https://doi.org/10.48550/arXiv.2310.06825</mixed-citation><mixed-citation xml:lang="en">Jiang A. Q. et al. Mistral 7B // arXiv preprint arXiv: 2310.06825. 2023. https://doi.org/10.48550/arXiv.2310.06825</mixed-citation></citation-alternatives></ref><ref id="cit11"><label>11</label><citation-alternatives><mixed-citation xml:lang="ru">Molchanov P. et al. Importance estimation for neural network pruning // Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019. P. 11264–11272. DOI: 10.1109/CVPR.2019.01152</mixed-citation><mixed-citation xml:lang="en">Molchanov P. et al. Importance estimation for neural network pruning // Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019. P. 11264–11272. DOI: 10.1109/CVPR.2019.01152</mixed-citation></citation-alternatives></ref><ref id="cit12"><label>12</label><citation-alternatives><mixed-citation xml:lang="ru">Gerganov G. llama.cpp: Port of Facebook’s LLaMA model in C/C++. GitHub, 2023. URL: https://github.com/ggerganov/llama.cpp</mixed-citation><mixed-citation xml:lang="en">Gerganov G. llama.cpp: Port of Facebook’s LLaMA model in C/C++. GitHub, 2023. URL: https://github.com/ggerganov/llama.cpp</mixed-citation></citation-alternatives></ref></ref-list><fn-group><fn fn-type="conflict"><p>The authors declare that there are no conflicts of interest.</p></fn></fn-group></back></article>
