Bridging the Performance Gap between DETR and R-CNN for Graphical Object Detection in Document Images

Tahira Shehzadi, Khurram Azeem Hashmi, Didier Stricker, Marcus Liwicki, and Muhammad Zeshan Afzal

arXiv:2306.13526·cs.CV·June 26, 2023

Open Access

TL;DR

This paper adapts and enhances the DETR transformer-based object detection model for graphical object detection in document images, achieving state-of-the-art results and demonstrating its effectiveness compared to traditional methods.

Contribution

The paper introduces modifications to the DETR model, including different query strategies and noise addition, to improve graphical object detection in document images.
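To make the idea of noise addition on box-shaped queries concrete, here is a minimal sketch. It is not the paper's implementation: this page does not specify the noise scheme, so the sketch assumes the jitter pattern commonly used for denoising training in DETR variants, where ground-truth boxes are perturbed and fed to the decoder as extra queries. The function names `jitter_box` and `make_noised_queries`, the `noise_scale` parameter, and the normalized (cx, cy, w, h) box format are all illustrative assumptions.

```python
import random


def jitter_box(box, noise_scale=0.4, rng=None):
    """Return a noised copy of a normalized (cx, cy, w, h) box.

    Hypothetical helper: the noise scheme is an assumption, not taken
    from the paper.
    """
    rng = rng or random.Random()
    cx, cy, w, h = box
    # Shift the center by at most noise_scale * half the box size, so for
    # noise_scale < 1 the noised center stays inside the original box.
    cx += rng.uniform(-1, 1) * noise_scale * w / 2
    cy += rng.uniform(-1, 1) * noise_scale * h / 2
    # Rescale width and height by a factor in [1 - noise_scale, 1 + noise_scale].
    w *= 1 + rng.uniform(-1, 1) * noise_scale
    h *= 1 + rng.uniform(-1, 1) * noise_scale
    # Keep every coordinate in the normalized range [0, 1].
    return tuple(min(max(v, 0.0), 1.0) for v in (cx, cy, w, h))


def make_noised_queries(gt_boxes, copies=2, noise_scale=0.4, seed=0):
    """Build `copies` noised versions of every ground-truth box."""
    rng = random.Random(seed)
    return [
        jitter_box(box, noise_scale, rng)
        for _ in range(copies)
        for box in gt_boxes
    ]


if __name__ == "__main__":
    # Two made-up boxes, e.g. a table and a figure on a page.
    gt = [(0.5, 0.5, 0.2, 0.1), (0.3, 0.7, 0.4, 0.4)]
    for query in make_noised_queries(gt, copies=2):
        print(query)
```

In such a setup the model would be trained to recover the original box from each noised query, which is one plausible reason noise can make detection less sensitive to variation in object size and position.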

Findings

01. Achieved state-of-the-art mAP scores on multiple datasets.

02. Transformer-based methods outperform traditional CNN-based approaches.

03. Query modifications improve robustness to object size and position variations.

Abstract

This paper takes an important step in bridging the performance gap between DETR and R-CNN for graphical object detection. Existing graphical object detection approaches have benefited from recent advances in CNN-based object detection methods and have made remarkable progress. More recently, Transformer-based detectors have considerably improved generic object detection performance, using object queries to eliminate the need for hand-crafted features and post-processing steps such as Non-Maximum Suppression (NMS). However, the effectiveness of these transformer-based detection algorithms has yet to be verified for graphical object detection. Inspired by the latest advancements in DETR, we employ the existing detection transformer with a few modifications for graphical object detection. We modify object queries in different ways, using points, anchor boxes and…
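For readers unfamiliar with the post-processing step mentioned in the abstract, the following is a minimal sketch of greedy Non-Maximum Suppression as it is typically applied to the output of CNN-based detectors. It is a generic textbook version, not code from the paper; the function names, the (x1, y1, x2, y2) box format, and the 0.5 threshold are illustrative assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any
        # higher-scoring box that has already been kept.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    # Two near-duplicate detections of one object plus one separate object.
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores))
```

A DETR-style detector is trained with one-to-one matching between object queries and ground-truth objects, so each object is meant to be predicted once and a duplicate-removal pass like this is not needed.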

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Handwritten Text Recognition Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Absolute Position Encodings · Linear Layer · Position-Wise Feed-Forward Layer · Layer Normalization · Label Smoothing · Adam · Byte Pair Encoding · Residual Connection