Testimole-Conversational: A 30-Billion-Word Italian Discussion Board Corpus (1996-2024) for Language Modeling and Sociolinguistic Research

Matteo Rinaldi; Rossella Varvara; Viviana Patti

arXiv:2602.14819 · cs.CL · April 10, 2026

TL;DR

Testimole-Conversational is a large-scale Italian discussion board corpus spanning nearly three decades, designed for language modeling and sociolinguistic research, offering insights into informal online communication.

Contribution

It introduces a massive Italian discussion board corpus covering 1996-2024, which the authors state will be made freely available to the research community, suitable for NLP and for sociolinguistic studies of discourse and online social interaction.

Findings

01 Corpus contains over 30 billion words from 1996 to 2024.

02 Enables training of Italian large language models and sociolinguistic analysis.

03 Supports research on language variation and online social phenomena.

Abstract

We present "Testimole-conversational", a massive collection of discussion board messages in the Italian language. The large size of the corpus, more than 30B word-tokens (1996-2024), makes it an ideal dataset for pre-training native Italian Large Language Models. Furthermore, discussion board messages are a relevant resource for linguistic as well as sociological analysis. The corpus captures a rich variety of computer-mediated communication, offering insights into informal written Italian, discourse dynamics, and online social interaction over a wide time span. Beyond its relevance for NLP applications such as language modelling, domain adaptation, and conversational analysis, it also supports investigations of language variation and social phenomena in digital communication. The resource will be made freely available to the research community.
