Project: Infrastructure for Fine-tuning Pre-trained Large Language Models
Beneficiary:
Start date: 12.12.2024
End date: 30.05.2026
Duration: 17.5 months
Total budget: BGN 437,446.38
Amount of EU funding: BGN 437,446.38 (100%)

Main goal
To develop a freely accessible infrastructure for the selection and pre-processing of large datasets for Bulgarian, as well as data tailored to specific industries, and for fine-tuning suitable freely available large language models for specific purposes.



Specification of the criteria for the evaluation, comparison and selection of large language models.

Developing a component of the Infrastructure for the collection, filtering, anonymisation and deduplication of large, diverse and high-quality text data for Bulgarian (a minimal deduplication sketch follows this list).

Developing a component of the Infrastructure for the fine-tuning of pre-trained large language models for Bulgarian (a fine-tuning sketch also follows this list).

Developing a component of the Infrastructure for evaluating the fine-tuning of large language models for Bulgarian.

Reaching Technology Readiness Level 7 (TRL 7) for the Infrastructure for Fine-Tuning Pre-Trained Large Language Models.

Open access to the project results for industry, academia and the general public.
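The project's data pipeline is not published on this page; purely as an illustration of the deduplication step named above, the following Python sketch removes exact duplicates (after normalising case and whitespace) by hashing each document. The function names and the toy corpus are hypothetical, and web-scale pipelines typically add near-duplicate detection (e.g. MinHash) on top of exact matching.

```python
import hashlib

def normalise(text: str) -> str:
    # Lowercase and collapse whitespace so trivially different copies match.
    return " ".join(text.lower().split())

def deduplicate(docs: list[str]) -> list[str]:
    # Keep only the first occurrence of each normalised document.
    # (Illustrative sketch, not the project's actual component.)
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalise(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

if __name__ == "__main__":
    corpus = ["Здравей, свят!", "здравей,   свят!", "Съвсем друг документ."]
    print(deduplicate(corpus))  # the two near-identical greetings collapse into one
```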
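Likewise, the fine-tuning component itself is not specified here. As a minimal sketch of what fine-tuning a pre-trained causal language model can look like with the open-source Hugging Face stack (transformers, datasets, peft), the example below attaches LoRA adapters and trains on a plain-text corpus. The model name "gpt2" and the file "bulgarian_corpus.txt" are placeholders, not the project's actual choices.

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL = "gpt2"  # placeholder: any freely available causal LM could stand here

tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token
model = AutoModelForCausalLM.from_pretrained(MODEL)

# LoRA: train a small set of low-rank adapter weights instead of the full model.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

# One document per line; the file name is a placeholder.
dataset = load_dataset("text", data_files="bulgarian_corpus.txt")["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned", num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("finetuned")
```

LoRA is only one common choice; full-parameter fine-tuning works with the same Trainer loop by skipping the get_peft_model call.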

Results

Large Language Models

Description of existing LLMs with respect to their functionality and applicability to the processing of Bulgarian.

IfGPT dataset

Description of a dataset of clean, deduplicated data for the purposes of fine-tuning.

Documentation

Documentation of tools for using large language models, as well as for testing and evaluating fine-tuning (a minimal evaluation sketch follows).
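The evaluation protocol is not described on this page; one standard intrinsic measure for comparing a model before and after fine-tuning is perplexity on held-out text. The sketch below computes it with transformers and torch; the checkpoint path "finetuned" and the example sentence are placeholders, and a full (merged) checkpoint is assumed.

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "finetuned"  # placeholder path to a full (merged) checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def perplexity(text: str) -> float:
    # Lower perplexity on held-out text indicates a better language-model fit.
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())

print(perplexity("Примерен български текст за оценка."))  # placeholder held-out sentence
```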
