HelloFresh: LLM Evaluations on Streams of Real-World Human Editorial Actions across X Community Notes and Wikipedia edits
Tim Franzmeyer, Aleksandar Shtedritski, Samuel Albanie, Philip Torr, João F. Henriques, Jakob N. Foerster

TL;DR
HelloFresh is a continuous, real-world benchmark for evaluating LLMs on tasks involving community-driven content moderation and editing, reducing test data contamination and overfitting.
Contribution
It introduces a novel, ongoing benchmark using live data from X community notes and Wikipedia edits, enabling more realistic and temporally consistent LLM evaluation.
Findings
LLMs with web search access perform better on HelloFresh tasks.
HelloFresh provides a more stable and realistic evaluation environment than static or periodically regenerated benchmarks.
The benchmark is publicly available with a leaderboard for ongoing assessment.
Abstract
Benchmarks have been essential for driving progress in machine learning. A better understanding of LLM capabilities on real-world tasks is vital for safe development. Designing adequate LLM benchmarks is challenging: data from real-world tasks is hard to collect, public availability of static evaluation data results in test data contamination and benchmark overfitting, and periodically generating new evaluation data is tedious and may result in temporally inconsistent results. We introduce HelloFresh, based on continuous streams of real-world data generated by intrinsically motivated human labelers. It covers recent events from X (formerly Twitter) community notes and edits of Wikipedia pages, mitigating the risk of test data contamination and benchmark overfitting. Any X user can propose an X note to add additional context to a misleading post (formerly tweet); if the community…
Taxonomy
Topics: Wikis in Education and Collaboration · Topic Modeling · Advanced Text Analysis Techniques
