TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition

Junyuan Zhang; Bin Wang; Qintong Zhang; Fan Wu; Zichen Wen; Jialin Lu; Junjie Shan; Ziqi Zhao; Shuya Yang; Ziling Wang; Ziyang Miao; Huaping Zhong; Yuhang Zang; Xiaoyi Dong; Ka-Ho Chow; Conghui He

arXiv:2512.01248·cs.CV·March 25, 2026

Open Access · 1 model

TL;DR

TRivia introduces a self-supervised fine-tuning approach for vision-language models to perform table recognition without labeled data, significantly improving open-source models' performance on benchmark datasets.

Contribution

The paper presents TRivia, a novel self-supervised fine-tuning method for VLMs that eliminates the need for labeled data in table recognition tasks.

Findings

01

TRivia-3B surpasses existing models on benchmark datasets.

02

The self-supervised approach reduces reliance on costly labeled data.

03

The method achieves state-of-the-art performance with a compact model.

Abstract

Table recognition (TR) aims to transform table images into semi-structured representations such as HTML or Markdown. As a core component of document parsing, TR has long relied on supervised learning, with recent efforts dominated by fine-tuning vision-language models (VLMs) using labeled data. While VLMs have brought TR to the next level, pushing performance further demands large-scale labeled data that is costly to obtain. Consequently, although proprietary models have continuously pushed the performance boundary, open-source models, often trained with limited resources and, in practice, the only viable option for many due to privacy regulations, still lag far behind. To bridge this gap, we introduce TRivia, a self-supervised fine-tuning method that enables pretrained VLMs to learn TR directly from unlabeled table images in the wild. Built upon Group Relative Policy Optimization,…
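The abstract names Group Relative Policy Optimization (GRPO) as the foundation of the method. As a rough illustration of the group-relative idea only, the sketch below scores each sampled output against the mean and standard deviation of its own sampling group. The reward values are hypothetical placeholders: how TRivia derives rewards from unlabeled table images is not stated in this excerpt, and the use of the population standard deviation and the `eps` term are assumptions of this sketch, not details from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the statistics of its own group.

    Outputs that score above the group mean get a positive advantage,
    outputs below it get a negative one. `eps` (an assumption of this
    sketch) avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a sketch-level choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four candidate outputs sampled for one table image.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Because the baseline is computed from the group itself, this step needs no separate learned value model; it only requires some scalar reward per sampled output.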

Code & Models

Models

🤗
opendatalab/TRivia-3B
model· 483 dl· ♡ 8

Taxonomy

Topics: Handwritten Text Recognition Techniques · Text and Document Classification Technologies · Machine Learning and Data Classification