Beyond Traditional Benchmarks: Analyzing Behaviors of Open LLMs on Data-to-Text Generation

Zdeněk Kasner; Ondřej Dušek

arXiv:2401.10186 · cs.CL · June 7, 2024 · 1 citation

TL;DR

This paper investigates how open large language models perform on data-to-text generation tasks, highlighting their fluency but also significant semantic accuracy issues, and introduces a new data collection tool to avoid benchmark contamination.

Contribution

The study provides a comprehensive analysis of open LLMs' behaviors on data-to-text tasks and introduces Quintd, a novel tool for collecting uncontaminated structured data.

Findings

01 Open LLMs generate fluent, coherent texts in zero-shot settings.

02 Over 80% of outputs contain at least one semantic error.

03 Semantic accuracy remains a major challenge for open LLMs.

Abstract

We analyze the behaviors of open large language models (LLMs) on the task of data-to-text (D2T) generation, i.e., generating coherent and relevant text from structured data. To avoid the issue of LLM training data contamination with standard benchmarks, we design Quintd - a tool for collecting novel structured data records from public APIs. We find that open LLMs (Llama 2, Mistral, and Zephyr) can generate fluent and coherent texts in zero-shot settings from data in common formats collected with Quintd. However, we show that the semantic accuracy of the outputs is a major issue: both according to human annotators and our reference-free metric based on GPT-4, more than 80% of the outputs of open LLMs contain at least one semantic error. We publicly release the code, data, and model outputs.
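
The headline figure in the abstract is an output-level rate: an output counts as erroneous if it contains at least one annotated semantic error, however many errors it has. The sketch below illustrates that aggregation only. The `AnnotatedOutput` record, the span format, the category label, and the sample data are all hypothetical and are not taken from the released code or data.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedOutput:
    """One generated text plus its annotated error spans (hypothetical schema)."""
    model: str
    text: str
    # Each span is (start, end, category); the category names are placeholders.
    error_spans: list = field(default_factory=list)


def share_with_errors(outputs):
    """Fraction of outputs that contain at least one annotated error span."""
    if not outputs:
        return 0.0
    flagged = sum(1 for o in outputs if o.error_spans)
    return flagged / len(outputs)


def share_by_model(outputs):
    """Same rate as share_with_errors, computed separately for each model."""
    grouped = {}
    for o in outputs:
        grouped.setdefault(o.model, []).append(o)
    return {model: share_with_errors(items) for model, items in grouped.items()}


# Made-up sample data, for illustration only.
sample = [
    AnnotatedOutput("model-a", "The temperature will be 12 degrees.", [(24, 26, "incorrect")]),
    AnnotatedOutput("model-a", "Winds are light.", []),
    AnnotatedOutput("model-b", "Rain is expected on Friday.", [(20, 26, "incorrect")]),
    AnnotatedOutput("model-b", "Humidity stays high all week.", [(0, 8, "incorrect")]),
]

print(share_with_errors(sample))
print(share_by_model(sample))
```

Because a single error is enough to flag an output, this rate is stricter than a per-token or per-span error rate and can be high even when most of each text is accurate.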

Code & Models

Datasets

Shir123/quintd1_owid
dataset· 5 dl

Videos

Beyond Traditional Benchmarks: Analyzing Behaviors of Open LLMs on Data-to-Text Generation· underline

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: Multi-Head Attention · Attention Is All You Need · Label Smoothing · Absolute Position Encodings · Layer Normalization · Dropout · Linear Layer · Byte Pair Encoding · Softmax · Adam