Selection and pre-processing of large datasets for Bulgarian, as well as of tailored data for particular industries, and fine-tuning of suitable freely available large language models for specific purposes.
Specification of criteria for the evaluation, comparison and selection of large language models.
Developing a component of the Infrastructure for the collection, filtering, anonymisation and deduplication of large, diverse and high-quality text data for Bulgarian.
Developing a component of the Infrastructure for the fine-tuning of pre-trained large language models for Bulgarian.
Developing a component of the Infrastructure for evaluating the fine-tuning of large language models for Bulgarian.
Reaching Technology Readiness Level 7 of the Infrastructure for Fine-Tuning Pre-Trained Large Language Models.
Open access to the results of the project for industry, academia and the general public.
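The data-preparation component listed above (collection, filtering, anonymisation and deduplication) can be sketched in a few lines. This is a minimal illustration, not the project's actual implementation: the PII patterns, the placeholder tokens and the exact-match hashing scheme are all assumptions made for the example.

```python
import hashlib
import re

# Illustrative PII patterns (assumptions, not the project's actual rules).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")

def anonymise(text: str) -> str:
    """Mask e-mail addresses and phone numbers with placeholder tokens."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return PHONE_RE.sub("<PHONE>", text)

def deduplicate(docs):
    """Exact deduplication via content hashes; near-duplicate detection
    (e.g. MinHash) would be a natural extension for a real corpus."""
    seen, unique = set(), []
    for doc in docs:
        h = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(doc)
    return unique

corpus = [
    "Пишете ни на info@example.bg или на +359 2 123 4567.",
    "Пишете ни на info@example.bg или на +359 2 123 4567.",
    "Чист текст без лични данни.",
]
clean = [anonymise(d) for d in deduplicate(corpus)]
```

After this pass the duplicate document is dropped and the remaining texts carry `<EMAIL>`/`<PHONE>` placeholders instead of personal data, which is the order (dedup first, then masking) assumed here for simplicity.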
Results
Large Language Models
Description of existing LLMs with respect to their functionalities and their applicability to the processing of Bulgarian.
IfGPT dataset
Description of a dataset of clean, deduplicated data for the purposes of fine-tuning.
Documentation
Documentation of tools for using large language models, as well as for testing and evaluating fine-tuning.
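One common metric used when evaluating fine-tuned language models is perplexity, which can be computed from the per-token log-probabilities the model assigns to a reference text. The sketch below is a hedged illustration only; the log-probability values are made up for the example and do not come from any project model.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the negative mean log-probability
    assigned by the model to the reference tokens."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# Hypothetical per-token natural-log probabilities for a short sentence.
log_probs = [-2.1, -0.4, -1.3, -0.8]
print(perplexity(log_probs))
```

Lower perplexity on held-out Bulgarian text indicates that fine-tuning has made the model's predictions closer to the evaluation corpus, which is why it is a standard sanity check alongside task-specific benchmarks.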