Objectives

The project aims to develop a freely accessible infrastructure for selecting and pre-processing large datasets for Bulgarian, including data tailored to particular industries, and for fine-tuning suitable freely available large language models for specific purposes.

To achieve this main goal, the project sets the following objectives:

To provide a detailed description of the characteristics of large language models and a specification of the criteria for their evaluation, comparison and selection.
To develop an infrastructure component for the collection, filtering, anonymisation and deduplication of large, diverse and high-quality text data for Bulgarian.
To develop an infrastructure component for the fine-tuning of pre-trained large language models for Bulgarian.
To develop an infrastructure component for evaluating the fine-tuning of large language models for Bulgarian.
To reach Technology Readiness Level 7 for the Infrastructure for Fine-Tuning Pre-Trained Large Language Models by integrating all components into a prototype that demonstrates its operation in a real-world environment.