GPTR: Gestalt-Perception Transformer for Diagram Object Detection
Xin Hu, Lingling Zhang, Jun Liu, Jinfu Fan, Yang You, Yaqiang Wu

TL;DR
This paper introduces GPTR, a transformer-based model inspired by gestalt perception laws. It groups diagram patches into meaningful objects and outperforms existing methods on diagram object detection.
Contribution
The paper proposes a gestalt-perception transformer with a graph-based encoder that groups diagram patches into objects, addressing the sparse visual features of diagrams.
Findings
GPTR achieves state-of-the-art results in diagram object detection.
The model performs comparably to existing methods on natural image detection.
The gestalt-perception graph enhances object grouping in sparse diagram features.
Abstract
Diagram object detection is a key basis for practical applications such as textbook question answering. Because a diagram consists mainly of simple lines and color blocks, its visual features are sparser than those of natural images. In addition, diagrams usually express diverse knowledge, so they contain many low-frequency object categories. As a result, traditional data-driven detection models are not well suited to diagrams. In this work, we propose a gestalt-perception transformer model for diagram object detection, based on an encoder-decoder architecture. Gestalt perception comprises a series of laws that explain human perception: the human visual system tends to perceive patches in an image that are similar, close, or connected without abrupt directional changes as a single perceptual whole. Inspired by these ideas, we build a…
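To make the grouping idea in the abstract concrete, here is a minimal sketch of how the proximity and similarity laws could merge patches into objects. It is an illustration only, not the paper's encoder: the function name `gestalt_groups`, the `(row, col, intensity)` patch format, and both thresholds are assumptions made for this example, and the connectivity law is not modeled.

```python
from itertools import combinations


def gestalt_groups(patches, max_dist=1, max_color_diff=0.1):
    """Group patches that are both close (proximity) and alike (similarity).

    patches: list of (row, col, intensity) tuples, one per image patch.
    Returns a sorted list of index lists, one per perceptual group.
    All names and thresholds here are hypothetical, chosen for illustration.
    """
    parent = list(range(len(patches)))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(patches)), 2):
        (r1, c1, v1), (r2, c2, v2) = patches[i], patches[j]
        close = abs(r1 - r2) + abs(c1 - c2) <= max_dist  # proximity law
        similar = abs(v1 - v2) <= max_color_diff         # similarity law
        if close and similar:
            parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i in range(len(patches)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical 2x3 grid: a dark object on the left, a bright one on the right.
patches = [
    (0, 0, 0.10), (0, 1, 0.12), (0, 2, 0.90),
    (1, 0, 0.11), (1, 1, 0.88), (1, 2, 0.92),
]
print(gestalt_groups(patches))  # [[0, 1, 3], [2, 4, 5]]
```

The sketch treats grouping as connected components over a patch graph with hard thresholds. A learned model would presumably use soft, trainable affinities instead, but the abstract as shown here is cut off before describing that part.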
Taxonomy
Topics: Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Image and Object Detection Techniques
