Beyond Traditional Benchmarks: Analyzing Behaviors of Open LLMs on Data-to-Text Generation
Zdeněk Kasner, Ondřej Dušek

TL;DR
This paper investigates how open large language models perform on data-to-text generation tasks, highlighting their fluency but also significant semantic accuracy issues, and introduces a new data collection tool to avoid benchmark contamination.
Contribution
The study provides a comprehensive analysis of open LLMs' behaviors on data-to-text tasks and introduces Quintd, a novel tool for collecting uncontaminated structured data.
Findings
Open LLMs generate fluent, coherent texts in zero-shot settings.
More than 80% of outputs contain at least one semantic error, according to both human annotators and a reference-free metric based on GPT-4.
Semantic accuracy remains a major challenge for open LLMs.
Abstract
We analyze the behaviors of open large language models (LLMs) on the task of data-to-text (D2T) generation, i.e., generating coherent and relevant text from structured data. To avoid the issue of LLM training data contamination with standard benchmarks, we design Quintd - a tool for collecting novel structured data records from public APIs. We find that open LLMs (Llama 2, Mistral, and Zephyr) can generate fluent and coherent texts in zero-shot settings from data in common formats collected with Quintd. However, we show that the semantic accuracy of the outputs is a major issue: both according to human annotators and our reference-free metric based on GPT-4, more than 80% of the outputs of open LLMs contain at least one semantic error. We publicly release the code, data, and model outputs.
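The headline figure in the abstract is an output-level rate: an output counts as erroneous if it carries at least one semantic error annotation, however many it has in total. The following is a minimal sketch of that calculation, not the authors' evaluation code. The function name, the annotation layout (output id mapped to a list of error labels), and the label strings are all hypothetical placeholders chosen for illustration.

```python
from typing import Dict, List


def share_with_errors(annotations: Dict[str, List[str]]) -> float:
    """Return the fraction of outputs that have at least one error label.

    `annotations` maps an output id to the list of error labels assigned
    to that output (by a human or by a model-based metric). An empty list
    means no error was found. This layout is an assumption for the sketch.
    """
    if not annotations:
        raise ValueError("no outputs to score")
    # An output is flagged once, no matter how many errors it contains.
    flagged = sum(1 for errors in annotations.values() if errors)
    return flagged / len(annotations)


# Hypothetical toy annotations, not data from the paper.
toy = {
    "out-1": ["error_a"],
    "out-2": [],
    "out-3": ["error_b", "error_a"],
    "out-4": ["error_c"],
}
rate = share_with_errors(toy)
print(f"{rate:.0%} of outputs contain at least one error")
```

Counting per output rather than per error keeps the figure comparable between annotation sources that differ in how finely they split errors.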
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Multi-Head Attention · Attention Is All You Need · Label Smoothing · Absolute Position Encodings · Layer Normalization · Dropout · Linear Layer · Byte Pair Encoding · Softmax · Adam
