Model in Distress: Sentiment Analysis on French Synthetic Social Media
Pierre-Carl Langlais, Pavel Chizhov, Yannick Detrois, Carlos Rosas Hinostroza, Ivan P. Yamshchikov, Bastien Perroy

TL;DR
This paper presents a synthetic data generation pipeline for French social media sentiment analysis. Models trained on its output reach 77-79% accuracy on human-annotated data, while the pipeline addresses annotation cost, evaluation-set scarcity, and privacy constraints.
Contribution
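It uses backtranslation with fine-tuned models, complemented by synthetic reasoning traces, to generate large-scale training data. The pipeline is presented as generalizable and is applied to a case study on customer distress detection in French public transportation.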
Findings
Generated 1.7 million synthetic tweets from a small seed corpus.
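Trained 600M-parameter reasoners, with English and French reasoning, that achieve 77-79% accuracy on human-annotated evaluation data, matching or exceeding SOTA proprietary LLMs and specialized encoders.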
Reduced annotation costs and preserved user privacy.
Abstract
Automated analysis of customer feedback on social media is hindered by three challenges: the high cost of annotated training data, the scarcity of evaluation sets, especially in multilingual settings, and privacy concerns that prevent data sharing and reproducibility. We address these issues by developing a generalizable synthetic data generation pipeline applied to a case study on customer distress detection in French public transportation. Our approach utilizes backtranslation with fine-tuned models to generate 1.7 million synthetic tweets from a small seed corpus, complemented by synthetic reasoning traces. We train 600M-parameter reasoners with English and French reasoning that achieve 77-79% accuracy on human-annotated evaluation data, matching or exceeding SOTA proprietary LLMs and specialized encoders. Beyond reducing annotation costs, our pipeline preserves privacy by…
