Cerbero-7B: A Leap Forward in Language-Specific LLMs Through Enhanced Chat Corpus Generation and Evaluation
Federico A. Galatolo, Mario G.C.A. Cimino

TL;DR
This paper presents a novel method for generating and evaluating high-quality, language-specific chat corpora using self-chat mechanisms and a new MLM-based quality metric, leading to a state-of-the-art Italian LLM.
Contribution
It introduces a new approach combining generator and embedder LLMs with a novel MLM-based quality assessment for creating refined language-specific chat corpora.
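The role of the embedder in keeping generated samples diverse can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_embed` is a hypothetical bag-of-words stand-in for a real multilingual sentence embedder, and the `max_similarity` threshold and the greedy selection rule are assumptions chosen for the example.

```python
import math
from collections import Counter
from typing import Callable, Dict, List

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in for a sentence embedder: bag-of-words counts."""
    return dict(Counter(text.lower().split()))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_diverse(
    candidates: List[str],
    embed: Callable[[str], Vector] = toy_embed,
    max_similarity: float = 0.8,  # assumed threshold, not from the paper
) -> List[str]:
    """Greedily keep a candidate only if it is not too close to any kept one."""
    kept: List[str] = []
    kept_vectors: List[Vector] = []
    for text in candidates:
        vector = embed(text)
        if all(cosine(vector, other) <= max_similarity for other in kept_vectors):
            kept.append(text)
            kept_vectors.append(vector)
    return kept


if __name__ == "__main__":
    generated = [
        "come si cucina la pasta",
        "come si cucina la pasta oggi",
        "qual è la capitale d'Italia",
    ]
    print(select_diverse(generated))
```

In a real pipeline the `embed` argument would wrap an actual sentence-embedding model; the selection logic stays the same.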
Findings
The generated Italian chat corpus improves LLM performance on Italian tasks.
The refined corpora lead to significantly better language comprehension.
The fine-tuned model achieves state-of-the-art results among Italian LLMs.
Abstract
This study introduces a novel approach for generating high-quality, language-specific chat corpora using a self-chat mechanism. We combine a generator LLM for creating new samples and an embedder LLM to ensure diversity. A new Masked Language Modelling (MLM) model-based quality assessment metric is proposed for evaluating and filtering the corpora. Using llama2-70b as the generator and a multilingual sentence transformer as the embedder, we generate an Italian chat corpus and refine the Fauno corpus, which is based on translated English ChatGPT self-chat data. The refinement uses structural assertions and Natural Language Processing techniques. Both corpora undergo a comprehensive quality evaluation using the proposed MLM model-based quality metric. The Italian LLM fine-tuned with these corpora demonstrates significantly enhanced language comprehension and question-answering skills.…
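The abstract does not give the formula for the MLM-based quality metric, so the sketch below substitutes a common technique, pseudo-log-likelihood scoring: each token is masked in turn and the model's probability of the original token is averaged in log space. Everything here is an assumption for illustration: `toy_masked_prob` is a hypothetical stand-in for a real masked language model, and the threshold is arbitrary.

```python
import math
from typing import Callable, List, Sequence

# Signature: (tokens, index of the masked position) -> probability of the
# original token at that position. A real MLM would sit behind this.
MaskedProb = Callable[[Sequence[str], int], float]


def pseudo_log_likelihood(tokens: Sequence[str], masked_prob: MaskedProb) -> float:
    """Mean log-probability of each token when it alone is masked."""
    if not tokens:
        return float("-inf")
    total = 0.0
    for index in range(len(tokens)):
        probability = masked_prob(tokens, index)
        total += math.log(max(probability, 1e-12))  # guard against log(0)
    return total / len(tokens)


def filter_by_quality(
    samples: List[str], masked_prob: MaskedProb, threshold: float
) -> List[str]:
    """Keep samples whose score reaches the (assumed) threshold."""
    return [
        sample
        for sample in samples
        if pseudo_log_likelihood(sample.split(), masked_prob) >= threshold
    ]


def toy_masked_prob(tokens: Sequence[str], index: int) -> float:
    """Hypothetical MLM stand-in: known words are likely, unknown ones are not."""
    vocabulary = {"il", "gatto", "dorme", "sul", "divano"}
    return 0.5 if tokens[index].lower() in vocabulary else 0.01


if __name__ == "__main__":
    corpus = ["il gatto dorme sul divano", "il gatto xq zz divano"]
    print(filter_by_quality(corpus, toy_masked_prob, threshold=-1.0))
```

Normalizing by length keeps long and short samples comparable; whether the paper's metric does the same is not stated in the abstract.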
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling
