Privacy Preserving Visual Question Answering
Cristian-Paul Bara, Qing Ping, Abhinav Mathur, Govind Thattai, Rohith MV, Gaurav S. Sukhatme

TL;DR
This paper presents a privacy-preserving approach for visual question answering that uses a compact, symbolic scene representation to protect image privacy while maintaining effective VQA performance.
Contribution
It introduces a hybrid method that combines a small vision model with a symbolic scene representation, substantially reducing model size and improving privacy in VQA tasks.
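To illustrate what a symbolic scene representation can look like, here is a minimal sketch: objects carry a class label and a set of attribute labels, and relations are (subject, predicate, object) triples. A toy rule-based answerer then works on these symbols alone, without any pixels. The names `SceneObject`, `SymbolicScene`, `answer_attribute_question` and `answer_relation_question`, the example labels, and the rule-based answering are hypothetical. They are not the authors' data format or their question-answering component.

```python
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    """One detected object: a class label plus a set of attribute labels (hypothetical format)."""
    obj_id: int
    label: str
    attributes: set[str] = field(default_factory=set)


@dataclass
class SymbolicScene:
    """Objects plus relations stored as (subject_id, predicate, object_id) triples."""
    objects: list[SceneObject]
    relations: list[tuple[int, str, int]]

    def find(self, label: str) -> list[SceneObject]:
        """All objects whose class label equals `label`."""
        return [o for o in self.objects if o.label == label]


def answer_attribute_question(scene, label, attribute_vocab):
    """Toy stand-in for a QA module: attributes of the first `label` object that
    fall in `attribute_vocab` (e.g. a colour vocabulary), or None if absent."""
    matches = scene.find(label)
    if not matches:
        return None
    return sorted(matches[0].attributes & attribute_vocab)


def answer_relation_question(scene, subj_label, predicate):
    """Labels of objects linked to any `subj_label` object by `predicate`."""
    by_id = {o.obj_id: o for o in scene.objects}
    subj_ids = {o.obj_id for o in scene.find(subj_label)}
    return sorted(
        by_id[obj].label
        for subj, pred, obj in scene.relations
        if subj in subj_ids and pred == predicate
    )


scene = SymbolicScene(
    objects=[
        SceneObject(0, "cup", {"red", "ceramic"}),
        SceneObject(1, "table", {"wooden", "brown"}),
    ],
    relations=[(0, "on", 1)],
)
COLORS = {"red", "brown", "blue"}

print(answer_attribute_question(scene, "cup", COLORS))  # "What color is the cup?" -> ['red']
print(answer_relation_question(scene, "cup", "on"))     # "What is the cup on?" -> ['table']
```

A real system would likely use a learned model over the symbols rather than hand-written rules. The sketch only shows that the questions can be answered from labels and triples alone.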
Findings
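Vision model is more than 25 times smaller than SOTA vision models and 100 times smaller than end-to-end SOTA VQA models
Non-differentiable symbolic representation cannot be used to recover the original image, keeping it private
Detailed error analysis and discussion of the trade-offs of a distilled vision model and a symbolic scene representation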
Abstract
We introduce a novel privacy-preserving methodology for performing Visual Question Answering on the edge. Our method constructs a symbolic representation of the visual scene, using a low-complexity computer vision model that jointly predicts classes, attributes and predicates. This symbolic representation is non-differentiable, which means it cannot be used to recover the original image, thereby keeping the original image private. Our proposed hybrid solution uses a vision model which is more than 25 times smaller than the current state-of-the-art (SOTA) vision models, and 100 times smaller than end-to-end SOTA VQA models. We report detailed error analysis and discuss the trade-offs of using a distilled vision model and a symbolic representation of the visual scene.
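The non-differentiability point can be made concrete with a minimal sketch. It assumes that the vision model emits per-object class scores, independent per-attribute probabilities, and per-pair predicate scores. The score values, vocabularies, thresholds and the function `to_symbolic` are all hypothetical, and the vision model itself is not shown. Hard argmax and thresholding map the scores to discrete labels. This mapping is piecewise constant, so small changes to the scores leave the output unchanged. The sketch illustrates the general idea only; it is not the paper's decoding procedure.

```python
def argmax(scores):
    """Index of the largest score (ties go to the first index)."""
    return max(range(len(scores)), key=scores.__getitem__)


def to_symbolic(class_scores, attr_scores, pred_scores,
                classes, attributes, predicates,
                attr_threshold=0.5, rel_threshold=0.5):
    """Discretize hypothetical vision-model outputs into a symbolic scene.

    class_scores[i]    : scores over `classes` for object i
    attr_scores[i]     : independent probabilities over `attributes` for object i
    pred_scores[(i, j)]: scores over `predicates` for the ordered pair (i, j)
    """
    objects = []
    for i, (cs, ats) in enumerate(zip(class_scores, attr_scores)):
        # Hard argmax: piecewise constant in the scores.
        label = classes[argmax(cs)]
        # Hard threshold on each attribute probability: also piecewise constant.
        attrs = sorted(a for a, p in zip(attributes, ats) if p >= attr_threshold)
        objects.append({"id": i, "class": label, "attributes": attrs})

    relations = []
    for (i, j), ps in sorted(pred_scores.items()):
        k = argmax(ps)
        # Keep only the most likely predicate, and only if it is confident enough.
        if ps[k] >= rel_threshold:
            relations.append((i, predicates[k], j))
    return {"objects": objects, "relations": relations}


CLASSES = ["cup", "table", "person"]
ATTRIBUTES = ["red", "wooden", "ceramic"]
PREDICATES = ["on", "next to"]

# Made-up scores for two detected objects, for illustration only.
class_scores = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
attr_scores = [[0.9, 0.1, 0.6], [0.2, 0.95, 0.1]]
pred_scores = {(0, 1): [0.85, 0.1], (1, 0): [0.2, 0.3]}

result = to_symbolic(class_scores, attr_scores, pred_scores,
                     CLASSES, ATTRIBUTES, PREDICATES)
print(result)

# Nudging a score leaves the symbols unchanged: the mapping is locally
# constant, so its gradient with respect to the scores is zero wherever defined.
perturbed = [[0.71, 0.2, 0.1], [0.1, 0.8, 0.1]]
print(to_symbolic(perturbed, attr_scores, pred_scores,
                  CLASSES, ATTRIBUTES, PREDICATES) == result)  # True
```

Here the output is a cup with attributes `ceramic` and `red`, a wooden table, and the single triple `(0, "on", 1)`. The pair `(1, 0)` is dropped because its best predicate score falls below the assumed threshold.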
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
