Deep Interactive Region Segmentation and Captioning
Ali Sharifi Boroujerdi, Maryam Khanian, Michael Breuss

TL;DR
This paper introduces a hybrid deep learning system that allows users to specify regions in images for targeted segmentation and captioning, improving interpretability and accuracy over existing methods.
Contribution
It presents a novel interactive segmentation and captioning architecture combining a specialized FCN and dense captioning, enabling user-guided region processing.
Findings
Outperforms state-of-the-art interactive segmentation methods
Enhances understanding of dense captioning outputs
Improves object detection accuracy with segmentation-based region focus
Abstract
With recent innovations in dense image captioning, it is now possible to describe every object in a scene with a caption, with each object localized by a bounding box. However, interpreting such output is not trivial because many of the bounding boxes overlap. Furthermore, current captioning frameworks do not let the user apply personal preferences to exclude areas that are not of interest. In this paper, we propose a novel hybrid deep learning architecture for interactive region segmentation and captioning, in which the user can specify an arbitrary region of the image to be processed. To this end, a dedicated Fully Convolutional Network (FCN) named Lyncean FCN (LFCN) is trained on our special training data to isolate the User Intention Region (UIR) as the output of an efficient segmentation. In parallel, a dense image captioning model is…
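The abstract is cut off before it says how the UIR and the dense captioning output are combined, so the following is only an illustrative sketch of one plausible way a segmentation mask could narrow a set of captioned boxes. It is not the authors' method. The overlap rule, the `min_inside` threshold, the function names, and the example boxes and captions are all assumptions made up for illustration.

```python
import numpy as np

def fraction_inside_mask(box, mask):
    """Fraction of a box's pixels that lie inside a binary mask.

    box is (x0, y0, x1, y1) in pixel coordinates, with x1 and y1 exclusive.
    """
    x0, y0, x1, y1 = box
    region = mask[y0:y1, x0:x1]
    if region.size == 0:
        return 0.0
    # The mean of a boolean array is the share of True pixels.
    return float(region.mean())

def filter_captions(captioned_boxes, mask, min_inside=0.5):
    """Keep (box, caption) pairs whose box lies mostly inside the mask.

    min_inside is a hypothetical threshold, not a value from the paper.
    """
    return [
        (box, caption)
        for box, caption in captioned_boxes
        if fraction_inside_mask(box, mask) >= min_inside
    ]

# Stand-in for a segmentation result: a 40x40 region of interest.
mask = np.zeros((100, 100), dtype=bool)
mask[20:60, 20:60] = True

# Made-up dense captioning output: (box, caption) pairs.
captioned_boxes = [
    ((25, 25, 55, 55), "a brown dog"),   # fully inside the region
    ((70, 70, 95, 95), "a green bush"),  # fully outside the region
    ((10, 10, 50, 50), "grass"),         # partly inside the region
]

kept = filter_captions(captioned_boxes, mask)
kept_captions = [caption for _, caption in kept]
```

With these made-up inputs, the box that lies entirely outside the region is dropped and the other two are kept. Raising `min_inside` would make the filter stricter about partially overlapping boxes.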
Taxonomy
Methods: Max Pooling · Convolution · Fully Convolutional Network
