Towards Transparency: Exploring LLM Trainings Datasets through Visual Topic Modeling and Semantic Frame
Charles de Dampierre, Andrei Mogoutov, Nicolas Baumard

TL;DR
This paper introduces Bunka, a tool that uses AI and Cognitive Science techniques like Topic Modeling and Frame Analysis to enhance transparency, quality, and bias detection in LLM training datasets.
Contribution
It presents Bunka, a novel software tool that applies visual topic modeling and semantic frame analysis to improve dataset curation and bias detection for LLMs.
Findings
Topic Modeling with Cartography increases dataset transparency.
Applying Topic Modeling to Preferences datasets accelerates fine-tuning and improves model performance on several benchmarks.
Frame Analysis reveals biases in training corpora.
Abstract
LLMs are now responsible for making many decisions on behalf of humans: from answering questions to classifying things, they have become an important part of everyday life. While computation and model architecture have been rapidly expanding in recent years, the efforts towards curating training datasets are still in their beginnings. This underappreciation of training datasets has led LLMs to create biased and low-quality content. In order to solve that issue, we present Bunka, a software that leverages AI and Cognitive Science to improve the refinement of textual datasets. We show how Topic Modeling coupled with 2-dimensional Cartography can increase the transparency of datasets. We then show how the same Topic Modeling techniques can be applied to Preferences datasets to accelerate the fine-tuning process and increase the capacities of the model on different benchmarks. Lastly, we…
Taxonomy
Topics: Computational and Text Analysis Methods · Biomedical Text Mining and Ontologies · Topic Modeling
