Presentation of the results of the project “Infrastructure for Fine-Tuning Pre-trained Large Language Models”

On 29 May 2026, at 11:00 a.m., the presentation of the results of the research project “Infrastructure for Fine-Tuning Pre-trained Large Language Models” took place at the Shared Workspace, 16 Gen. Gurko St., 6th floor. The event was organised by the Department of Computational Linguistics at the Institute for Bulgarian Language of the Bulgarian Academy of Sciences.

The audience comprised researchers from scientific institutions, IT professionals, and other experts. The project leader, Prof. Svetla Koeva, outlined the goals and objectives set by the team: the development of an open-access infrastructure of algorithms and software for the selection and preprocessing of large-scale Bulgarian-language data, as well as company- or industry-specific data, and the fine-tuning of suitable open-source large language models for solving specific tasks. The team’s mission is to support the advancement of artificial intelligence by building upon the existing toolkit and improving the availability of resources for the Bulgarian language.

Dr. Ivelina Stoyanova briefly presented the large-scale language dataset IfGPT, created within the framework of the project. She highlighted the need not only to collect the largest possible volume of diverse and high-quality textual data, but also to enrich it through the development and application of content enhancement procedures, including filtering, anonymisation, and deduplication, as well as the introduction of a detailed metadata system for data description. This system enables information retrieval according to user-defined criteria. All of these aspects are of particular importance for the development and fine-tuning of large language models for Bulgarian, as the existing datasets are limited in size and diversity, as well as in terms of accessibility and information retrieval capabilities.

Dr. Yordan Kralev continued with a presentation of the developed infrastructure for building a chatbot based on large language models and retrieval-augmented instruction expansion. Such a system is needed because few Bulgarian-language solutions based on large language models exist, those available are limited in scope, and a functional architecture that can operate on affordable hardware is required. As a result of the project activities, an open-source Retrieval-Augmented Generation (RAG) system for Bulgarian has been developed. Its effectiveness in summarising documents using external context has been successfully demonstrated, and its applicability to a variety of tasks on accessible hardware has been confirmed. Dr. Kralev also outlined future directions for the system’s development, including question-answering systems, classification systems, and the fine-tuning of models for specific domains.

Dr. Valentina Stefanova presented two of the datasets developed for evaluating large language models for Bulgarian: MMLU-BG and Reasoning-BG. The first dataset is designed to assess the ability of large language models to “understand” and apply knowledge from different domains. It was created by experts through the translation and adaptation of the English-language Measuring Massive Multitask Language Understanding (MMLU) dataset. The data are organised into 56 subject areas and include a total of 15,000 questions, each with four answer options, one of which is correct. Its development involved comprehensive terminological and semantic adaptation, with attention to preserving the original level of difficulty and logical structure of the questions, ensuring the accuracy of scientific terminology, and maintaining semantic and grammatical correctness in Bulgarian. Reasoning-BG consists of 232 popular science texts, each accompanied by 10 questions, and is intended to evaluate language models’ capabilities in semantic analysis, information extraction, identifying logical relationships, and interpreting textual content. In addition to text selection and preprocessing procedures, its development included the evaluation and editing of questions and answers automatically generated by an open large language model, including verification of answer uniqueness and correctness, as well as assessment of semantic consistency between each text and its corresponding questions.

Each presenter emphasised that the project has established a number of directions for future work that will continue actively in the coming years. During the discussion that followed, Prof. Koeva answered questions from the audience regarding the challenges faced by the team and how they were successfully addressed.

The results of the project “Infrastructure for Fine-Tuning Pre-trained Large Language Models” are available on the project website.