Cerbero-7B: A Leap Forward in Language-Specific LLMs Through Enhanced Chat Corpus Generation and Evaluation

Federico A. Galatolo; Mario G.C.A. Cimino

arXiv:2311.15698 · cs.CL · November 28, 2023 · 1 citation


TL;DR

This paper presents a novel method for generating and evaluating high-quality, language-specific chat corpora using self-chat mechanisms and a new MLM-based quality metric, leading to a state-of-the-art Italian LLM.

Contribution

It introduces a new approach combining generator and embedder LLMs with a novel MLM-based quality assessment for creating refined language-specific chat corpora.
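
The summary above does not say how the embedder enforces diversity, so the following is only a minimal sketch of one plausible scheme: greedily keep a generated sample when its cosine similarity to every previously kept sample stays below a threshold. The names `toy_embed`, `filter_diverse`, and `max_similarity` are hypothetical, and the bag-of-words embedder is a stand-in so the example runs without downloading a model; it is not the multilingual sentence transformer used in the paper.

```python
import math
from collections import Counter
from typing import Callable, List


def toy_embed(text: str) -> Counter:
    """Stand-in embedder (assumption): lowercase bag-of-words counts.

    A real pipeline would call a sentence-embedding model here instead.
    """
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_diverse(
    candidates: List[str],
    embed: Callable[[str], Counter] = toy_embed,
    max_similarity: float = 0.8,  # hypothetical threshold
) -> List[str]:
    """Greedily keep candidates that are not too similar to any kept one."""
    kept: List[str] = []
    kept_vectors: List[Counter] = []
    for text in candidates:
        vector = embed(text)
        if all(cosine(vector, other) <= max_similarity for other in kept_vectors):
            kept.append(text)
            kept_vectors.append(vector)
    return kept
```

In this sketch the generator would supply `candidates`, and swapping `toy_embed` for a real embedding function only requires that `cosine` be adapted to dense vectors.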

Findings

1. Generated Italian chat corpus improves LLM performance in Italian tasks.

2. Refined corpora lead to significant enhancements in language comprehension.

3. Achieved state-of-the-art results for Italian LLMs.

Abstract

This study introduces a novel approach for generating high-quality, language-specific chat corpora using a self-chat mechanism. We combine a generator LLM for creating new samples and an embedder LLM to ensure diversity. A new Masked Language Modelling (MLM) model-based quality assessment metric is proposed for evaluating and filtering the corpora. Utilizing the llama2-70b as the generator and a multilingual sentence transformer as embedder, we generate an Italian chat corpus and refine the Fauno corpus, which is based on translated English ChatGPT self-chat data. The refinement uses structural assertions and Natural Language Processing techniques. Both corpora undergo a comprehensive quality evaluation using the proposed MLM model-based quality metric. The Italian LLM fine-tuned with these corpora demonstrates significantly enhanced language comprehension and question-answering skills.…
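
The abstract does not define the MLM-based quality metric, so the sketch below illustrates only the general idea under stated assumptions: mask each token in turn, ask a masked language model for the probability of the original token, and average the log-probabilities (a pseudo-log-likelihood style score). The names `make_toy_mlm`, `mlm_quality`, `filter_by_quality`, and `min_score` are hypothetical, and the smoothed unigram model is a context-free stand-in for a real MLM so the example runs on its own. This is not the paper's metric.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence

# Signature of a masked-token scorer: (tokens, masked_index) -> probability
MaskedProb = Callable[[Sequence[str], int], float]


def make_toy_mlm(reference: List[str]) -> MaskedProb:
    """Build a stand-in scorer (assumption): add-one smoothed unigram model.

    A real MLM would condition on the unmasked context; this toy ignores it.
    """
    counts = Counter(tok for line in reference for tok in line.lower().split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot for unseen tokens

    def prob(tokens: Sequence[str], index: int) -> float:
        return (counts[tokens[index].lower()] + 1) / (total + vocab_size)

    return prob


def mlm_quality(text: str, masked_prob: MaskedProb) -> float:
    """Mean log-probability of each token when it is the masked position."""
    tokens = text.split()
    if not tokens:
        return float("-inf")
    log_probs = [math.log(masked_prob(tokens, i)) for i in range(len(tokens))]
    return sum(log_probs) / len(tokens)


def filter_by_quality(
    samples: List[str], masked_prob: MaskedProb, min_score: float
) -> List[str]:
    """Keep samples whose score reaches the (hypothetical) threshold."""
    return [s for s in samples if mlm_quality(s, masked_prob) >= min_score]
```

Higher scores mean the scorer finds the text more predictable; any threshold would have to be tuned on held-out data rather than taken from this example.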

Code & Models

Repositories

galatolofederico/cerbero-7b (PyTorch, official)

Models

🤗 galatolo/cerbero-7b (model · 3.6k downloads · 14 likes)

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling