# On Extracting Data from Tables that are Encoded using HTML

**Authors:** Juan C. Roldán, Patricia Jiménez, Rafael Corchuelo

arXiv: 1903.08305 · 2019-11-05

## TL;DR

This paper surveys and compares proposals, published between 2000 and 2018, for extracting data from HTML-encoded tables. It highlights unresolved challenges and the lack of agreed datasets and evaluation methods in the field.

## Contribution

The paper introduces a unified vocabulary for the field, uses it to summarize existing proposals, and compares them side by side to identify gaps and future research directions.

## Key findings

- No proposal offers a complete solution to HTML table data extraction.
- Lack of consensus on datasets and evaluation methods hampers comparison.
- Several challenges have no conclusive solution, and a few more have not been addressed sufficiently.
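
To make the task concrete, here is a minimal sketch of naive table extraction using only Python's standard-library `html.parser`. It is not one of the surveyed proposals; the class `SimpleTableParser`, the function `extract_rows`, and the sample table are all hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser


class SimpleTableParser(HTMLParser):
    """Collect the text of <td>/<th> cells, row by row (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row being built, or None outside <tr>
        self._cell = None  # text fragments of the cell being built, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html):
    """Return the table content as a list of rows of cell text."""
    parser = SimpleTableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample data, assumed to have a single header row.
SAMPLE = """
<table>
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Apple</td><td>1.20</td></tr>
</table>
"""

rows = extract_rows(SAMPLE)
header, body = rows[0], rows[1:]
records = [dict(zip(header, row)) for row in body]
```

This sketch assumes a flat table whose first row is the header. It ignores `colspan` and `rowspan` attributes, nested tables, and tables used purely for page layout, so it illustrates only the simplest case of the problem.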

## Abstract

Tables are a common means to display data in human-friendly formats. Many authors have worked on proposals to extract those data back since this has many interesting applications. In this article, we summarise and compare many of the proposals to extract data from tables that are encoded using HTML and have been published between 2000 and 2018. We first present a vocabulary that homogenises the terminology used in this field; next, we use it to summarise the proposals; finally, we compare them side by side. Our analysis highlights several challenges to which no proposal provides a conclusive solution and a few more that have not been addressed sufficiently; simply put, no proposal provides a complete solution to the problem, which seems to suggest that this research field shall keep active in the near future. We have also realised that there is no consensus regarding the datasets and the methods used to evaluate the proposals, which hampers comparing the experimental results.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1903.08305/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/1903.08305/full.md

## References

72 references — full list in the complete paper: https://tomesphere.com/paper/1903.08305/full.md

---
Source: https://tomesphere.com/paper/1903.08305