Selection and pre-processing of large datasets for Bulgarian, as well as of tailored data for particular industries, and fine-tuning of suitable freely available large language models for specific purposes.
Specification of criteria for the evaluation, comparison and selection of large language models.
Developing a component of the Infrastructure for the collection, filtering, anonymisation and deduplication of large, diverse and high-quality text data for Bulgarian.
Developing a component of the Infrastructure for the fine-tuning of pre-trained large language models for Bulgarian.
Developing a component of the Infrastructure for evaluating the fine-tuning of large language models for Bulgarian.
Reaching Technology Readiness Level 7 of the Infrastructure for Fine-Tuning Pre-Trained Large Language Models.
Open access to the results of the project for industry, academia and the general public.
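The data-preparation component listed above (collection, filtering, anonymisation and deduplication) can be sketched in a few lines. This is a minimal illustration, not the project's actual implementation: the PII patterns, the placeholder tokens and the exact-match hashing scheme are all assumptions made for the example.

```python
import hashlib
import re

# Illustrative PII patterns (assumptions, not the project's actual rules).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")

def anonymise(text: str) -> str:
    """Mask e-mail addresses and phone numbers with placeholder tokens."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return PHONE_RE.sub("<PHONE>", text)

def deduplicate(docs):
    """Exact deduplication via content hashes; near-duplicate detection
    (e.g. MinHash) would be a natural extension for a real corpus."""
    seen, unique = set(), []
    for doc in docs:
        h = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(doc)
    return unique

corpus = [
    "Пишете ни на info@example.bg или на +359 2 123 4567.",
    "Пишете ни на info@example.bg или на +359 2 123 4567.",
    "Чист текст без лични данни.",
]
clean = [anonymise(d) for d in deduplicate(corpus)]
```

After this pass the duplicate document is dropped and the remaining texts carry `<EMAIL>`/`<PHONE>` placeholders instead of personal data, which is the order (dedup first, then masking) assumed here for simplicity.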
Results
Large Language Models
Description of existing LLMs with respect to their functionalities and their applicability to the processing of Bulgarian.
IfGPT dataset
Description of a dataset of clean, deduplicated data for the purposes of fine-tuning.
Documentation
Documentation of tools for using large language models, as well as for testing and evaluating fine-tuning.
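One common metric used when evaluating fine-tuned language models is perplexity, which can be computed from the per-token log-probabilities the model assigns to a reference text. The sketch below is a hedged illustration only; the log-probability values are made up for the example and do not come from any project model.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the negative mean log-probability
    assigned by the model to the reference tokens."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# Hypothetical per-token natural-log probabilities for a short sentence.
log_probs = [-2.1, -0.4, -1.3, -0.8]
print(perplexity(log_probs))
```

Lower perplexity on held-out Bulgarian text indicates that fine-tuning has made the model's predictions closer to the evaluation corpus, which is why it is a standard sanity check alongside task-specific benchmarks.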