Targeted Image Data Augmentation Increases Basic Skills Captioning Robustness
Valentin Barriere, Felipe del Rio, Andres Carvallo De Ferari, Carlos Aspillaga, Eugenio Herrera-Berg, Cristian Buc Calderon

TL;DR
This paper introduces TIDA, a targeted data augmentation method that makes image captioning models more robust on specific skills such as gender recognition, color recognition, and counting. It modifies the skill-related words in a caption and then uses a text-to-image model to edit the image so that it matches the modified caption.
Contribution
TIDA is a novel targeted augmentation technique that enhances captioning models' ability to generalize to out-of-context examples by filling correlational gaps with generated images.
Findings
Improved captioning performance on gender, color, and counting tasks.
Enhanced robustness of models to out-of-context examples.
Different behaviors observed in visual encoding and textual decoding.
Abstract
Artificial neural networks typically struggle to generalize to out-of-context examples. One reason for this limitation is that datasets incorporate only partial information regarding the potential correlational structure of the world. In this work, we propose TIDA (Targeted Image-editing Data Augmentation), a targeted data augmentation method focused on improving models' human-like abilities (e.g., gender recognition) by filling the correlational structure gap using a text-to-image generative model. More specifically, TIDA identifies specific skills in captions describing images (e.g., the presence of a specific gender in the image), changes the caption (e.g., "woman" to "man"), and then uses a text-to-image model to edit the image in order to match the novel caption (e.g., changing only the woman to a man while keeping the context identical). Based on the…
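The caption-side step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the swap table `GENDER_SWAPS`, the helper names `swap_skill_word` and `augment`, and the `edit_image` callable (standing in for whatever text-to-image editing model is used) are all assumptions made for illustration.

```python
import re
from typing import Any, Callable, Dict, Optional, Tuple

# Hypothetical swap table for the "gender" skill; the paper's actual word
# lists are not given in this summary.
GENDER_SWAPS: Dict[str, str] = {
    "woman": "man", "man": "woman",
    "girl": "boy", "boy": "girl",
}


def swap_skill_word(caption: str, swaps: Dict[str, str]) -> Optional[str]:
    """Swap the first skill word found in the caption.

    Returns None when the caption contains no word from the swap table,
    so captions unrelated to the skill are skipped.
    """
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, swaps)) + r")\b", re.IGNORECASE
    )
    match = pattern.search(caption)
    if match is None:
        return None
    word = match.group(0)
    replacement = swaps[word.lower()]
    if word[0].isupper():  # keep sentence-initial capitalization
        replacement = replacement.capitalize()
    return caption[:match.start()] + replacement + caption[match.end():]


def augment(
    image: Any,
    caption: str,
    edit_image: Callable[[Any, str], Any],
    swaps: Dict[str, str] = GENDER_SWAPS,
) -> Optional[Tuple[Any, str]]:
    """Build one augmented (image, caption) pair, or None if not applicable.

    `edit_image` is an assumed interface: it takes the original image and
    the new caption and returns an edited image matching that caption.
    """
    new_caption = swap_skill_word(caption, swaps)
    if new_caption is None:
        return None
    return edit_image(image, new_caption), new_caption
```

In this sketch, captions that do not mention the targeted skill yield no augmented pair, which keeps the augmentation focused on the skill in question; how the original method selects and filters examples may differ.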
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
