GPTR: Gestalt-Perception Transformer for Diagram Object Detection

Xin Hu; Lingling Zhang; Jun Liu; Jinfu Fan; Yang You; Yaqiang Wu

arXiv:2212.14232 · cs.CV · January 2, 2023 · 1 citation

Open Access

TL;DR

This paper introduces GPTR, a transformer-based model inspired by the gestalt laws of perception. It groups diagram patches into meaningful objects and outperforms existing methods on diagram object detection.

Contribution

The paper proposes a gestalt-perception transformer with a graph-based encoder that effectively groups diagram patches into objects, addressing the unique challenges of diagram visual features.

Findings

01. GPTR achieves state-of-the-art results in diagram object detection.

02. The model performs comparably to existing methods on natural image detection.

03. The gestalt-perception graph enhances object grouping in sparse diagram features.

Abstract

Diagram object detection is the key basis of practical applications such as textbook question answering. Because diagrams consist mainly of simple lines and color blocks, their visual features are sparser than those of natural images. In addition, diagrams usually express diverse knowledge and contain many low-frequency object categories. As a result, traditional data-driven detection models are not well suited to diagrams. In this work, we propose a gestalt-perception transformer model for diagram object detection, which is based on an encoder-decoder architecture. Gestalt perception comprises a series of laws that explain human perception: the human visual system tends to perceive patches in an image that are similar, close, or connected without abrupt directional changes as a single perceptual whole. Inspired by these ideas, we build a…
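The similarity and proximity laws named in the abstract can be illustrated with a toy sketch. This is not the GPTR architecture or the authors' graph encoder; it is a minimal illustration, assuming each patch is summarized by a grid position and a mean gray level. The function name `group_patches` and the thresholds `max_distance` and `max_color_gap` are hypothetical and chosen only for this example. Patches that are both close and similar are linked, and linked patches are merged into groups with a union-find structure.

```python
from math import dist


def group_patches(patches, max_distance=1.5, max_color_gap=0.2):
    """Group patches that are both close together and similar in color.

    patches: list of (x, y, gray) tuples, where (x, y) is the patch center
    on a grid and gray is a mean intensity in [0, 1] (an assumed encoding).
    Returns a list of groups, each a list of patch indices.
    """
    n = len(patches)
    parent = list(range(n))  # union-find forest: every patch starts alone

    def find(i):
        # Follow parent links to the group root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            xi, yi, gi = patches[i]
            xj, yj, gj = patches[j]
            close = dist((xi, yi), (xj, yj)) <= max_distance  # proximity
            similar = abs(gi - gj) <= max_color_gap           # similarity
            if close and similar:
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

For example, with `[(0, 0, 0.1), (1, 0, 0.15), (1, 1, 0.12), (5, 5, 0.9), (6, 5, 0.85), (1, 2, 0.9)]`, the first three patches form one group, the next two form a second, and the last patch stays alone: it is adjacent to the first group but too different in gray level to join it.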

Taxonomy

Topics: Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Image and Object Detection Techniques