TabGenie: A Toolkit for Table-to-Text Generation

Zdeněk Kasner; Ekaterina Garanina; Ondřej Plátek; Ondřej Dušek

arXiv:2302.14169·cs.CL·October 27, 2023

TL;DR

TabGenie is a comprehensive toolkit that standardizes, explores, and analyzes diverse table-to-text datasets, facilitating research and development in data-to-text generation systems.

Contribution

It introduces a unified framework and toolkit for exploring, preprocessing, and analyzing heterogeneous table-to-text datasets with interactive and command-line tools.

Findings

01. Enables exploration of various datasets via a web interface.

02. Supports debugging and comparison of generated outputs.

03. Provides easy dataset processing with Python bindings.

Abstract

Heterogeneity of data-to-text generation datasets limits the research on data-to-text generation systems. We present TabGenie, a toolkit that enables researchers to explore, preprocess, and analyze a variety of data-to-text generation datasets through the unified framework of table-to-text generation. In TabGenie, all the inputs are represented as tables with associated metadata. The tables can be explored through the web interface, which also provides an interactive mode for debugging table-to-text generation, facilitates side-by-side comparison of generated system outputs, and allows easy exports for manual analysis. Furthermore, TabGenie is equipped with command-line processing tools and Python bindings for unified dataset loading and processing. We release TabGenie as a PyPI package and provide its open-source code and a live demo at https://github.com/kasnerz/tabgenie.

Tables

Table 1: The list of datasets included in TabGenie. Glossary of data types: Key-value: key-value pairs, Graph: subject-predicate-object triples, Table: tabular data (w/hl: with highlighted cells), Chart: chart data, Logic / SQL: strings with logical expressions / SQL queries. The train, dev, and test columns give the number of examples in each split. The datasets marked with † were already present on Huggingface Datasets. We uploaded the rest of the datasets to our namespace: https://huggingface.co/kasnerz.

Dataset	Source	Data Type	Train	Dev	Test	License
CACAPO	van der Lee et al. (2020)	Key-value	15,290	1,831	3,028	CC BY
DART†	Nan et al. (2021)	Graph	62,659	2,768	5,097	MIT
E2E†	Dušek et al. (2019)	Key-value	33,525	1,484	1,847	CC BY-SA
EventNarrative	Colas et al. (2021)	Graph	179,544	22,442	22,442	CC BY
HiTab	Cheng et al. (2021)	Table w/hl	7,417	1,671	1,584	C-UDA
Chart-To-Text	Kantharaj et al. (2022)	Chart	24,368	5,221	5,222	GNU GPL
Logic2Text	Chen et al. (2020b)	Table w/hl + Logic	8,566	1,095	1,092	MIT
LogicNLG	Chen et al. (2020a)	Table	28,450	4,260	4,305	MIT
NumericNLG	Suadaa et al. (2021)	Table	1,084	136	135	CC BY-SA
SciGen	Moosavi et al. (2021)	Table	13,607	3,452	492	CC BY-NC-SA
SportSett:Basketball†	Thomson et al. (2020)	Table	3,690	1,230	1,230	MIT
ToTTo†	Parikh et al. (2020)	Table w/hl	121,153	7,700	7,700	CC BY-SA
WebNLG†	Ferreira et al. (2020)	Graph	35,425	1,666	1,778	CC BY-NC
WikiBio†	Lebret et al. (2016)	Key-value	582,659	72,831	72,831	CC BY-SA
WikiSQL†	Zhong et al. (2017)	Table + SQL	56,355	8,421	15,878	BSD
WikiTableText	Bao et al. (2018)	Key-value	10,000	1,318	2,000	CC BY

Full text

TabGenie: A Toolkit for Table-to-Text Generation

Zdeněk Kasner¹, Ekaterina Garanina¹,², Ondřej Plátek¹, Ondřej Dušek¹

¹Charles University, Czechia

²University of Groningen, The Netherlands

{kasner,oplatek,odusek}@ufal.mff.cuni.cz

[email protected]

Abstract

Heterogeneity of data-to-text generation datasets limits the research on data-to-text generation systems. We present TabGenie – a toolkit which enables researchers to explore, preprocess, and analyze a variety of data-to-text generation datasets through the unified framework of table-to-text generation. In TabGenie, all the inputs are represented as tables with associated metadata. The tables can be explored through the web interface, which also provides an interactive mode for debugging table-to-text generation, facilitates side-by-side comparison of generated system outputs, and allows easy exports for manual analysis. Furthermore, TabGenie is equipped with command-line processing tools and Python bindings for unified dataset loading and processing. We release TabGenie as a PyPI package (https://pypi.org/project/tabgenie/) and provide its open-source code and a live demo at https://github.com/kasnerz/tabgenie (video: https://youtu.be/iUC3NCGoFRg).

1 Introduction

Building and evaluating data-to-text (D2T) generation systems Gatt and Krahmer (2018); Sharma et al. (2022) requires understanding the data and observing system behavior. It is, however, not trivial to interact with the large volume of D2T generation datasets that have emerged in recent years (see Table 1). Although research on D2T generation benefits from platforms providing unified interfaces, such as HuggingFace Datasets Lhoest et al. (2021) or the GEM benchmark Gehrmann et al. (2021), these platforms still leave the majority of the data processing load on the user.

A key component missing from current D2T tools is the possibility to visualize the input data and generated outputs. Visualization plays an important role in examining and evaluating scientific data Kehrer and Hauser (2013) and can help D2T generation researchers to make more informed design choices. A suitable interface can also encourage researchers to step away from unreliable automatic metrics Gehrmann et al. (2022) and focus on manual error analysis van Miltenburg et al. (2021, 2023).

In addition, multi-task training of large language models (LLMs) has recently raised demands for a unified input data format (Sanh et al., 2021; Scao et al., 2022; Ouyang et al., 2022, inter alia). Some works have used simple data linearization techniques for converting structured data to a textual format, in order to align it with the format used for other tasks Xie et al. (2022); Tang et al. (2022). However, these linearizations rely on custom preprocessing code, leading to discrepancies between individual works.

In this paper, we present TabGenie – a multi-purpose toolkit for interacting with D2T generation datasets and systems designed to fill these gaps. On a high level, the toolkit consists of (a) an interactive web interface, (b) a set of command-line processing tools, and (c) a set of Python bindings (see Figure 1).

The cornerstone of TabGenie is a unified data representation. Each input is represented as a matrix of $m$ columns and $n$ rows consisting of individual cells accompanied by metadata (see §2). Building upon this representation, TabGenie then provides multiple features for unified workflows with table-to-text datasets, including:

1. visualizing individual dataset examples in the tabular format (§3.1),

2. interacting with table-to-text generation systems in real-time (§3.2),

3. comparing generated system outputs (§3.2),

4. loading and preprocessing data for downstream tasks (§4.1),

5. exporting examples and generating spreadsheets for manual error analysis (§4.2).

In §6, we present examples of practical use-cases of TabGenie in D2T generation research.

2 Data

We currently include 16 datasets listed in Table 1 in TabGenie, covering many subtasks of D2T generation. All the datasets are available under a permissive open-source license.

2.1 Data Format

The inputs in D2T generation datasets do not always consist of tables; they may also be, e.g., graphs or key-value pairs. However, we noticed that in many cases, converting these formats to tables requires only minimal changes to the data structure while allowing a unified data representation and visualization. This conversion narrows the task of D2T generation down to generating a description of tabular data, i.e. table-to-text generation Parikh et al. (2020); Liu et al. (2022); Gong et al. (2020).

In our definition, a table is a two-dimensional matrix with $m$ columns and $n$ rows, which together define a grid of $m\times n$ cells. Each cell contains a (possibly empty) text string. A continuous sequence of cells $\{c_{i},\ldots,c_{i+k}\}$ from the same row or column may be merged, in which case the values of $\{c_{i+1},\ldots,c_{i+k}\}$ are linked to the value of $c_{i}$. A cell may optionally be marked as a heading, which is represented as an additional property of the cell. (The headings are typically located in the first row or column, but may also span multiple rows or columns and may not be adjacent.) To better accommodate the format of datasets such as ToTTo Parikh et al. (2020) or HiTab Cheng et al. (2021), we also allow individual cells to be highlighted. Highlighted cells are assumed to be preselected for generating the output description.

The tables may be accompanied with an additional set of properties (see Figure 2) – an example of such a property is a “title” of the table in WikiBio Lebret et al. (2016) or a “category” in WebNLG Gardent et al. (2017). We represent properties as key-value pairs alongside the table. The properties may be used for generating the table description.
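The representation described above can be sketched as a small data structure. This is a minimal illustration, not the TabGenie API: the class names `Cell` and `Table`, their fields, and the example values are all hypothetical, and merged cells are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Cell:
    """One table cell (hypothetical class, not taken from TabGenie)."""
    value: str = ""               # a possibly empty text string
    is_heading: bool = False      # optional heading flag
    is_highlighted: bool = False  # preselected for generating the description


@dataclass
class Table:
    """A grid of cells plus key-value properties such as a title."""
    rows: List[List[Cell]] = field(default_factory=list)
    properties: Dict[str, str] = field(default_factory=dict)

    def highlighted_values(self) -> List[str]:
        # Collect the text of highlighted cells in row-major order.
        return [c.value for row in self.rows for c in row if c.is_highlighted]


# Illustrative example data (made up for this sketch).
table = Table(
    rows=[
        [Cell("Name", is_heading=True), Cell("Born", is_heading=True)],
        [Cell("Ada Lovelace", is_highlighted=True), Cell("1815", is_highlighted=True)],
    ],
    properties={"title": "Mathematicians"},
)
```

Keeping headings and highlights as per-cell flags, rather than as separate index lists, matches the description of both as properties of individual cells.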

2.2 Data Transformation

We aim to present the data as true to the original format as possible and only make some minor changes for datasets which do not immediately adhere to the tabular format:

• For graph-to-text datasets, we format each triple as a row, using three columns labeled subject, predicate, and object.

• For key-value datasets, we use two columns with keys in the first column as row headings.

• For SportSett:Basketball Thomson et al. (2020), we merge the box score and line score tables and add appropriate headings where necessary.
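The first two transformations above can be illustrated with a short sketch. The function names and the plain list-of-lists table format are assumptions made for this example; the toolkit's own loaders may differ in detail.

```python
from typing import Dict, List, Tuple

Row = List[str]


def triples_to_table(triples: List[Tuple[str, str, str]]) -> List[Row]:
    """Graph-to-text input: one triple per row under a three-column header."""
    header = ["subject", "predicate", "object"]
    return [header] + [list(triple) for triple in triples]


def key_values_to_table(pairs: Dict[str, str]) -> List[Row]:
    """Key-value input: two columns, with the key in the first column
    acting as the row heading."""
    return [[key, value] for key, value in pairs.items()]


# Illustrative inputs (made up for this sketch).
graph_table = triples_to_table([("Alan_Bean", "occupation", "Test_pilot")])
kv_table = key_values_to_table({"name": "Blue Spice", "food": "Chinese"})
```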

2.3 Data Loading

To ease the data distribution, we load all the datasets using the Huggingface datasets package Lhoest et al. (2021), which comes equipped with a data downloader. Out of 16 datasets we are using, 7 were already available in Huggingface datasets, either through the GEM benchmark Gehrmann et al. (2021) or other sources. We publicly added the 9 remaining datasets (see Table 1).

TabGenie also supports adding custom data loaders. Creating a data loader consists of simply subclassing the data loader class and overriding a single method for processing individual entries, allowing anyone to add their custom dataset.
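The subclass-and-override pattern described here might look roughly as follows. Everything in this sketch is hypothetical: `BaseLoader`, `prepare_table`, and the entry fields `data` and `text` are stand-ins invented for illustration, so the project repository should be consulted for the actual class and method names.

```python
from typing import Any, Dict, List


class BaseLoader:
    """Hypothetical stand-in for a toolkit's data loader base class."""

    def __init__(self, raw_entries: List[Dict[str, Any]]):
        self.raw_entries = raw_entries

    def prepare_table(self, entry: Dict[str, Any]) -> Dict[str, Any]:
        # The single method a custom loader is expected to override.
        raise NotImplementedError

    def load(self) -> List[Dict[str, Any]]:
        # Shared logic: convert every raw entry with the overridden method.
        return [self.prepare_table(entry) for entry in self.raw_entries]


class MyDatasetLoader(BaseLoader):
    """Custom loader for an imaginary key-value dataset."""

    def prepare_table(self, entry: Dict[str, Any]) -> Dict[str, Any]:
        # Keys go to the first column, values to the second.
        rows = [[key, str(value)] for key, value in entry["data"].items()]
        return {"rows": rows, "references": [entry["text"]]}


loader = MyDatasetLoader(
    [{"data": {"name": "Blue Spice", "rating": 5}, "text": "Blue Spice has a rating of 5."}]
)
tables = loader.load()
```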

3 Web Interface

TabGenie offers a user-friendly way to interact with table-to-text generation datasets through the web interface. The interface can be rendered using a local server (cf. §4.2) and can be viewed in any modern web browser. The interface features a simple, single-page layout, which contains a navigation bar and three panels containing user controls, input data, and system outputs (see Figure 2). Although the interface is primarily aimed at researchers, it can also be used by non-expert users.

3.1 Content Exploration

The input data in TabGenie is rendered as HTML tables, providing better visualizations than existing data viewers, especially in the case of large and hierarchical tables. (Compare, e.g., with the ToTTo dataset in Huggingface Datasets, for which the table is provided in a single field called “table”: https://huggingface.co/datasets/totto.) In the web interface, users can navigate through individual examples in the dataset sequentially, access an example using its index, or go to a random example. The users can add notes to examples and mark examples as favorites for accessing them later. The interface also shows the information about the dataset (such as its description, version, homepage, and license) and provides an option to export the individual examples (see §4.2).

3.2 Interactive Mode

TabGenie offers an interactive mode for generating an output for a particular example on-the-fly. The user can highlight different cells, edit cell contents, and edit parameters of the downstream processor. For example, the user can prompt an LLM for table-to-text generation and observe how it behaves while changing the prompt.

The contents of a table are processed by a processing pipeline. This pipeline takes table contents and properties as input, processes them with a sequence of modules, and outputs HTML code. The modules are custom Python programs which may be re-used across the pipelines.
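A pipeline of this shape can be sketched as a chain of functions that pass a content dictionary along and end with HTML. The module names `linearize` and `to_html`, the dictionary keys, and the `run_pipeline` helper are assumptions made for illustration and do not reproduce TabGenie's internal interfaces.

```python
from html import escape
from typing import Any, Callable, Dict, List

Content = Dict[str, Any]
Module = Callable[[Content], Content]


def linearize(content: Content) -> Content:
    # Flatten the table into one string, cell by cell in row-major order.
    cells = [cell for row in content["rows"] for cell in row]
    return {**content, "text": " | ".join(cells)}


def to_html(content: Content) -> Content:
    # Wrap the text in a paragraph, escaping characters special to HTML.
    return {**content, "html": f"<p>{escape(content['text'])}</p>"}


def run_pipeline(content: Content, modules: List[Module]) -> str:
    # Apply the modules in order; the last one must produce the "html" key.
    for module in modules:
        content = module(content)
    return content["html"]


html_out = run_pipeline(
    {"rows": [["name", "Blue Spice"], ["food", "fish & chips"]], "properties": {}},
    [linearize, to_html],
)
```

Because each module has the same signature, a module such as `linearize` can be reused in several pipelines, which is the kind of re-use the paragraph above mentions.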

TabGenie currently provides two basic pipelines: (1) calling a generative language model through an API with a custom prompt, and (2) generating graph visualizations of RDF triples. We describe a case study for the model API pipeline in §6.2. Users can easily add custom pipelines by following the instructions in the project repository.

3.3 Pre-generated Outputs

In addition to interactive generation, TabGenie allows visualizing static pre-generated outputs. These are loaded in the JSONL format (https://jsonlines.org) from the specified directory and displayed similarly to the outputs from the interactive mode. Multiple outputs can be displayed alongside a specific example, making it possible to compare outputs from multiple systems.
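As an illustration of how such outputs could be read and lined up per example, the sketch below loads every JSONL file in a directory and collects the outputs of all systems for one example index. The field name `out`, the one-file-per-system layout, and the file names are assumptions for this example; the format TabGenie expects is documented in the project repository.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def load_outputs(directory: Path) -> Dict[str, List[dict]]:
    """Read each *.jsonl file; the file stem serves as the system name."""
    outputs = {}
    for path in sorted(directory.glob("*.jsonl")):
        with path.open(encoding="utf-8") as f:
            # One JSON object per non-empty line.
            outputs[path.stem] = [json.loads(line) for line in f if line.strip()]
    return outputs


def side_by_side(outputs: Dict[str, List[dict]], index: int) -> Dict[str, str]:
    """Collect every system's output for the example at the given index."""
    return {system: records[index]["out"] for system, records in outputs.items()}


# Demo with two made-up system output files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "t5-small.jsonl").write_text(
        '{"out": "Ada was born in 1815."}\n', encoding="utf-8"
    )
    (tmp_dir / "t5-base.jsonl").write_text(
        '{"out": "Ada Lovelace was born in 1815."}\n', encoding="utf-8"
    )
    comparison = side_by_side(load_outputs(tmp_dir), 0)
```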

4 Developer Tools

TabGenie also provides a developer-friendly interface: Python bindings (§4.1) and a command-line interface (§4.2). Both of these interfaces aim to simplify dataset preprocessing in downstream tasks. The key benefit of using TabGenie is that it provides streamlined access to data in a consistent format, removing the need for dataset-specific code for extracting information such as table properties, references, or individual cell values.

4.1 Python Bindings

TabGenie can be integrated into other Python codebases to replace custom preprocessing code. With a single unified interface for all the datasets, the TabGenie wrapper class allows users to:

• load a dataset from the Huggingface Datasets or from a local folder,

• access individual table cells and their properties,

• linearize tables using pre-defined or custom functions,

• prepare the Huggingface Dataset objects for downstream processing.

TabGenie can be installed as a Python package, making the integration simple and intuitive. See §6.1 for an example usage of the TabGenie Python interface.

4.2 Command-line Tools

TabGenie supports several basic commands via the command line.

Run

The tabgenie run command launches the local web server, mimicking the behavior of flask run. Example usage:

tabgenie run --port=8890 --host="0.0.0.0"

Export

The tabgenie export command enables batch exporting of the dataset. The supported formats are xlsx, html, json, txt, and csv. Except for csv, table properties can be exported along with the table content. Example usage:

tabgenie export --dataset "webnlg" \
    --split "dev" \
    --out_dir "export/datasets/webnlg" \
    --export_format "xlsx"

Export can also be done in the web interface.

Spreadsheet

For error analysis, it is common to select $N$ random examples from the dataset along with the system outputs and manually annotate them with error categories (see §6.3). The tabgenie sheet command generates a suitable spreadsheet for this procedure. Example usage:

tabgenie sheet --dataset "webnlg" \
    --split "dev" \
    --in_file "out-t5-base.jsonl" \
    --out_file "analysis_webnlg.xlsx" \
    --count 50
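The sampling procedure behind such a spreadsheet can be sketched in a few lines. This is not the implementation of tabgenie sheet: the sketch writes CSV with the standard library instead of XLSX, and the function name, column names, and fixed random seed are assumptions made for illustration.

```python
import csv
import io
import random
from typing import List


def build_annotation_sheet(
    inputs: List[str], outputs: List[str], count: int, seed: int = 42
) -> str:
    """Sample `count` examples and lay them out for manual annotation."""
    rng = random.Random(seed)  # fixed seed so the same sample can be regenerated
    indices = sorted(rng.sample(range(len(inputs)), count))

    buffer = io.StringIO()
    writer = csv.writer(buffer)
    # Empty columns are left for the annotators to fill in.
    writer.writerow(["example_idx", "input", "output", "error_category", "notes"])
    for i in indices:
        writer.writerow([i, inputs[i], outputs[i], "", ""])
    return buffer.getvalue()


# Illustrative data (made up for this sketch).
inputs = [f"table {i}" for i in range(10)]
outputs = [f"generated text {i}" for i in range(10)]
sheet = build_annotation_sheet(inputs, outputs, count=3)
```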

5 Implementation

TabGenie runs with Python >=3.8 and requires only a few basic packages as dependencies. It can be installed as a stand-alone Python module from PyPI (pip install tabgenie) or from the project repository.

Backend

The web server is based on Flask (https://pypi.org/project/Flask/), a popular lightweight Python-based web framework. The server runs locally and can be configured with a YAML (https://yaml.org) configuration file. On startup, the server loads the data using the datasets (https://pypi.org/project/datasets/) package. To render web pages, the server uses the tinyhtml (https://pypi.org/project/tinyhtml/) package and the Jinja (https://jinja.palletsprojects.com/) templating language.

Frontend

The web frontend is built on HTML5, CSS, Bootstrap (https://getbootstrap.com/), JavaScript, and jQuery (https://jquery.com). We additionally use the D3.js (https://d3js.org) library for visualizing the structure of data in graph-to-text datasets. To keep the project simple, we do not use any other major external libraries.

6 Case Studies

In this section, we outline several recipes for using TabGenie in D2T generation research. The instructions and code samples for these tasks are available in the project repository.

6.1 Table-To-Text Generation

Application

Finetuning a sequence-to-sequence language model for table-to-text generation in PyTorch Paszke et al. (2019) using the Huggingface Transformers Wolf et al. (2020) framework.

Process

In a typical finetuning procedure using these frameworks, the user needs to prepare a Dataset object with tokenized input and output sequences. Using TabGenie, preprocessing a specific dataset is simplified to the following:

from transformers import AutoTokenizer
import tabgenie as tg

# instantiate a tokenizer
tokenizer = AutoTokenizer.from_pretrained(…)

# load the dataset
tg_dataset = tg.load_dataset(
    dataset_name="totto"
)

# preprocess the dataset
hf_dataset = tg_dataset.get_hf_dataset(
    split="train",
    tokenizer=tokenizer
)

The function get_hf_dataset() linearizes the tables (the users may optionally provide their custom linearization function) and tokenizes the inputs and references.
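To make the idea of a linearization function concrete, the following sketch turns a table and its properties into a single input string. The `[key] value` markup, the function name, and its signature are illustrative assumptions; the signature that get_hf_dataset() expects from a custom function should be checked in the project repository.

```python
from typing import Dict, List, Optional, Set, Tuple


def linearize_table(
    rows: List[List[str]],
    properties: Dict[str, str],
    highlighted: Optional[Set[Tuple[int, int]]] = None,
) -> str:
    """Flatten properties and cells into one string for a seq2seq model."""
    parts = [f"[{key}] {value}" for key, value in properties.items()]
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            # If highlights are given, keep only the highlighted cells.
            if highlighted is not None and (r, c) not in highlighted:
                continue
            parts.append(f"[cell] {value}")
    return " ".join(parts)


# Illustrative call with the cells in the second column highlighted.
linearized = linearize_table(
    [["name", "Ada"], ["born", "1815"]],
    {"title": "Bio"},
    highlighted={(0, 1), (1, 1)},
)
```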

For training a single model on multiple datasets in the multi-task learning setting Xie et al. (2022), the user may preprocess each dataset individually, prepending a dataset-specific task description to each example. The datasets may then be combined using the methods provided by the datasets package.
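The prefixing step can be sketched with plain Python lists standing in for dataset objects. The task descriptions, the `input` and `target` field names, and the example texts are made up for illustration; with real Huggingface Dataset objects, the final concatenation would typically be done with `datasets.concatenate_datasets` instead of list addition.

```python
from typing import Dict, List

Example = Dict[str, str]


def add_task_prefix(examples: List[Example], task_description: str) -> List[Example]:
    """Prepend a dataset-specific task description to every input."""
    return [
        {**example, "input": f"{task_description}: {example['input']}"}
        for example in examples
    ]


# Two tiny stand-in datasets (made-up examples).
e2e = [{"input": "name | Blue Spice", "target": "Blue Spice is a restaurant."}]
webnlg = [{"input": "Alan_Bean | occupation | Test_pilot",
           "target": "Alan Bean was a test pilot."}]

# Prefix each dataset separately, then combine into one training set.
combined = (
    add_task_prefix(e2e, "describe the restaurant")
    + add_task_prefix(webnlg, "describe the triples")
)
```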

Demonstration

For running the baselines, we provide an example script, which can be applied to any TabGenie dataset and pre-trained sequence-to-sequence model from the transformers library. For multi-task learning, we provide an example of joint training on several datasets with custom linearization functions. We run the example scripts for several datasets and display the resulting generations in the application demo. Details on the fine-tuned models can be found in Appendix A.

6.2 Interactive Prompting

Application

Observing the impact of various inputs on the outputs of an LLM prompted for table-to-text generation.

Process

The user customizes the provided model_api pipeline to communicate with an LLM through an API. The API can communicate either with an external model (using e.g. the OpenAI API, https://openai.com/api/) or with a model running locally (using libraries such as FastAPI, https://fastapi.tiangolo.com). The user then interacts with the model through the TabGenie web interface, modifying the prompts, highlighted cells, and table content (see §3.2).
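The part of this workflow that the user varies, namely the prompt, the highlighted cells, and the table content, can be sketched as a prompt-building function. The function name, the asterisk markup for highlighted cells, and the prompt layout are assumptions made for illustration; the call to the model API itself is omitted because it depends on the service being used.

```python
from typing import List, Set, Tuple


def build_prompt(
    instruction: str, rows: List[List[str]], highlighted: Set[Tuple[int, int]]
) -> str:
    """Combine an instruction with a textual rendering of the table."""
    lines = []
    for r, row in enumerate(rows):
        # Mark highlighted cells with asterisks so the model can see them.
        cells = [f"*{v}*" if (r, c) in highlighted else v for c, v in enumerate(row)]
        lines.append(" | ".join(cells))
    return f"{instruction}\n\n" + "\n".join(lines) + "\n\nDescription:"


prompt = build_prompt(
    "Describe the highlighted cells.",
    [["name", "Ada"], ["born", "1815"]],
    highlighted={(1, 1)},
)
```

Changing the instruction, the highlighted set, or a cell value and rebuilding the prompt corresponds to the edits a user makes in the interactive mode.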

Demonstration

We provide interactive access to the instruction-tuned Tk-Instruct def-pos-11b LLM Wang et al. (2022) in the project live demo. The user can use the full range of possibilities included in the interactive mode, including customizing the prompt and the input data. (Note that using the model for the task of table-to-text generation is experimental and may not produce optimal outputs. The model should also not be used outside of demonstration purposes due to our limited computational resources.) The interface is shown in Appendix B.

6.3 Error Analysis

Application

Annotating error categories in the outputs from a table-to-text generation model.

Process

The user generates the system outputs (see §6.1) and saves the outputs for a particular dataset split in the JSONL format. Through the command-line interface, the user then generates an XLSX file, which can be imported into any suitable office software and distributed to annotators for performing error analysis.

Demonstration

We provide instructions for generating the spreadsheet in the project documentation. See Appendix B for a preview of the spreadsheet format.

7 Related Work

7.1 Data Loading and Processing

As noted throughout the work, Huggingface Datasets Lhoest et al. (2021) is the primary competitor package for data loading and preprocessing. Our package serves as a wrapper on top of this framework, providing additional abstractions for D2T generation datasets.

DataLab Xiao et al. (2022) is another platform for working with NLP datasets. Similarly to Huggingface Datasets, this platform has much broader focus than our package. Besides data access, it offers fine-grained data analysis and data manipulation tools. However, it has limited capabilities of visualizing the input data or interactive generation and at present, it does not cover the majority of datasets available in TabGenie.

PromptSource Bach et al. (2022) is a framework for constructing prompts for generative language models using the Jinja templating language. It can be used both for developing new prompts and for using the prompts in downstream applications.

Several tools have been developed for comparing outputs of language generation systems (notably for machine translation) such as CompareMT Neubig et al. (2019) or Appraise Federmann (2018), but the tools do not visualize the structured data.

7.2 Interactive D2T Generation

Until now, platforms for interactive D2T generation have been primarily limited to commercial platforms, such as Arria (https://www.arria.com), Automated Insights (https://automatedinsights.com), or Tableau Software (https://www.tableau.com; formerly Narrative Science). These platforms focus on proprietary solutions for generating business insights and do not provide an interface for research datasets. Dou et al. (2018) present Data2Text Studio, a platform which provides a set of developer tools for building custom D2T generation systems. The platform currently does not seem to be publicly available.

7.3 Table-To-Text Generation

Although pre-trained sequence-to-sequence models have been found to be effective for D2T generation (Kale and Rastogi, 2020; Xie et al., 2022), they have difficulties with handling the input structure, generation diversity, and logical reasoning. Multiple works have tried to address these issues. For a comprehensive review of the field, we refer the interested reader to the recent survey of Sharma et al. (2022).

8 Conclusion

We presented TabGenie, a multifunctional software package for table-to-text generation. TabGenie bridges several gaps including visualizing input data, unified data access, and interactive table-to-text generation. As such, TabGenie provides a comprehensive set of tools poised to accelerate progress in the field of D2T generation.

Limitations

For some D2T generation inputs, the tabular structure may be inappropriate. This involves hierarchical tree-based structures, bag-of-words, or multimodal inputs Balakrishnan et al. (2019); Lin et al. (2019); Krishna et al. (2017). Due to deployment issues, TabGenie also does not include large synthetic datasets Agarwal et al. (2021); Jin et al. (2020). TabGenie is currently in early development stages, which is why it primarily targets the research community.

Ethical Impact

The table-to-text generation datasets may contain various biases or factually incorrect outputs, which may be further reproduced by the table-to-text generation models. Although our software package is designed to help to examine and eliminate the biases and errors, we cannot guarantee the correctness of the processed outputs.

As TabGenie is an open-source software package with a permissive license, we do not control its downstream applications. We advocate using it for responsible research with the aim of improving natural language generation systems.

Appendix A Fine-tuned models

For demo purposes, we have fine-tuned the following models using our example scripts:

• t5-small for Chart-To-Text, LogicNLG, ToTTo, WikiTableText;

• t5-base for DART, E2E, WebNLG;

• t5-base in a prefix-based multi-task setup on E2E and WebNLG, using custom linearization functions.

All models (individual and multi-task) were fine-tuned using the transformers library. The parameters are the following:

• Epochs: 30 for individual models and 15 for multi-task,

• Patience: 5 epochs,

• Batch size: 16,

• Optimizer: AdamW,

• Learning rate: 1e-4,

• Weight decay: 0,

• AdamW betas: 0.9, 0.999,

• Maximum input length: 512,

• Maximum output length: 512,

• Generation beam size: 3.
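For readers who want to reproduce a similar setup, the values above can be collected in a configuration dictionary. The key names below follow the argument names of `Seq2SeqTrainingArguments` in the transformers library, and mapping the listed parameters to them is an assumption of this sketch: the authors' example scripts may pass these values differently, and the batch size in particular may not be per device.

```python
# Values are copied from the parameter list above; key names are assumed.
training_config = {
    "num_train_epochs": 30,             # 15 in the multi-task setup
    "per_device_train_batch_size": 16,  # assumes the batch size is per device
    "learning_rate": 1e-4,
    "weight_decay": 0.0,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "generation_max_length": 512,       # maximum output length
    "generation_num_beams": 3,
}

# These two are usually set outside the training arguments: patience via an
# early-stopping callback, input length in the tokenizer call.
early_stopping_patience = 5
max_input_length = 512

# The multi-task variant differs only in the number of epochs.
multitask_config = {**training_config, "num_train_epochs": 15}
```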

Appendix B User Interface

Figure 3 shows the interactive mode in the TabGenie web interface. Figure 4 shows the spreadsheet for manual annotations generated using TabGenie.

Bibliography

Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, pages 3554–3565, Online.
Stephen Bach, Victor Sanh, Zheng Xin Yong, Albert Webson, Colin Raffel, Nihal V Nayak, Abheesht Sharma, Taewoon Kim, M Saiful Bari, Thibault Févry, et al. 2022. Promptsource: An integrated development environment and repository for natural language prompts. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 93–104.
Anusha Balakrishnan, Jinfeng Rao, Kartikeya Upasani, Michael White, and Rajen Subba. 2019. Constrained decoding for neural nlg from compositional representations in task-oriented dialogue. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 831–844.
Junwei Bao, Duyu Tang, Nan Duan, Zhao Yan, Yuanhua Lv, Ming Zhou, and Tiejun Zhao. 2018. Table-to-text: Describing table region with natural language. In AAAI.
Wenhu Chen, Jianshu Chen, Yu Su, Zhiyu Chen, and William Yang Wang. 2020a. Logical natural language generation from open-domain tables. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7929–7942.
Zhiyu Chen, Wenhu Chen, Hanwen Zha, Xiyou Zhou, Yunkai Zhang, Sairam Sundaresan, and William Yang Wang. 2020b. Logic2Text: High-Fidelity Natural Language Generation from Logical Forms. In Findings of the Association for Computational Linguistics: EMNLP 2020, volume EMNLP 2020 of Findings of ACL, pages 2096–2111, Online Event.
Zhoujun Cheng, Haoyu Dong, Zhiruo Wang, Ran Jia, Jiaqi Guo, Yan Gao, Shi Han, Jian-Guang Lou, and Dongmei Zhang. 2021. Hitab: A hierarchical table dataset for question answering and natural language generation. arXiv preprint arXiv:2108.06712.
Anthony Colas, Ali Sadeghian, Yue Wang, and Daisy Zhe Wang. 2021. Eventnarrative: A large-scale event-centric dataset for knowledge graph-to-text generation. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1).