Segment This Thing: Foveated Tokenization for Efficient Point-Prompted Segmentation
Tanner Schmidt, Richard Newcombe

TL;DR
Segment This Thing introduces a foveated tokenization approach for point-prompted segmentation that cuts computational cost by concentrating resolution around the prompt point, enabling interactive frame rates on consumer hardware.
Contribution
The paper proposes a novel foveated patch tokenization method that improves the efficiency of point-prompted segmentation without reducing model size.
Findings
Achieves higher efficiency than prior segmentation models
Runs at interactive frame rates on consumer hardware
Maintains competitive performance on segmentation benchmarks
Abstract
This paper presents Segment This Thing (STT), a new efficient image segmentation model designed to produce a single segment given a single point prompt. Instead of following prior work and increasing efficiency by decreasing model size, we gain efficiency by foveating input images. Given an image and a point prompt, we extract a crop centered on the prompt and apply a novel variable-resolution patch tokenization in which patches are downsampled at a rate that increases with increased distance from the prompt. This approach yields far fewer image tokens than uniform patch tokenization. As a result we can drastically reduce the computational cost of segmentation without reducing model size. Furthermore, the foveation focuses the model on the region of interest, a potentially useful inductive bias. We show that our Segment This Thing model is more efficient than prior work while remaining…
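The tokenization described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the concentric-ring layout, the power-of-two downsampling factors, the average pooling, and every name and parameter here (`foveated_tokens`, `patch`, `grid`, `levels`) are assumptions chosen only to show how downsampling more aggressively away from the prompt reduces the token count.

```python
import numpy as np

def foveated_tokens(image, prompt_xy, patch=16, grid=4, levels=3):
    """Hypothetical foveated tokenizer (illustrative assumptions only).

    Level k covers a square of side grid * patch * 2**k centered on the
    prompt, split into grid x grid cells. Each cell is average-pooled by
    2**k down to patch x patch pixels. For k > 0 the central cells are
    skipped because the finer level k-1 already covers them.
    """
    assert grid % 4 == 0, "grid must be divisible by 4 so the levels nest"
    c = image.shape[2]
    half = grid * patch * 2 ** (levels - 1) // 2  # half-side of the outermost crop
    # Zero-pad so a crop centered on any in-image prompt stays in bounds.
    padded = np.pad(image.astype(np.float64), ((half, half), (half, half), (0, 0)))
    cx, cy = prompt_xy[0] + half, prompt_xy[1] + half

    lo, hi = grid // 4, 3 * grid // 4  # cell indices covered by the finer level
    tokens = []
    for k in range(levels):
        scale = 2 ** k
        cell = patch * scale           # source size of one cell at this level
        side = grid * cell
        x0, y0 = cx - side // 2, cy - side // 2
        for gy in range(grid):
            for gx in range(grid):
                if k > 0 and lo <= gx < hi and lo <= gy < hi:
                    continue           # already tokenized at higher resolution
                block = padded[y0 + gy * cell : y0 + (gy + 1) * cell,
                               x0 + gx * cell : x0 + (gx + 1) * cell]
                # Average-pool the cell by `scale` down to patch x patch.
                pooled = block.reshape(patch, scale, patch, scale, c).mean(axis=(1, 3))
                tokens.append(pooled.reshape(-1))
    return np.stack(tokens)

image = np.ones((480, 640, 3))
tokens = foveated_tokens(image, prompt_xy=(320, 240))
crop_side = 4 * 16 * 2 ** 2                 # outermost crop: 256 pixels
uniform_count = (crop_side // 16) ** 2      # uniform 16 x 16 patches over the same crop
print(tokens.shape[0], "foveated tokens vs", uniform_count, "uniform tokens")
```

With these assumed settings the 256 x 256 crop yields 40 tokens instead of the 256 that uniform 16 x 16 patches would produce. Since self-attention cost grows with the square of the token count, a reduction of this kind is where the efficiency gain would come from, independent of model size.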
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
