Looking Locally: Object-Centric Vision Transformers as Foundation Models for Efficient Segmentation
Manuel Traub, Martin V. Butz

TL;DR
FLIP is a biologically-inspired, object-centric vision transformer that efficiently segments objects by focusing high-resolution processing on object centers, outperforming larger models in accuracy and speed across multiple benchmarks.
Contribution
Introduces FLIP, a scale-invariant, resource-efficient vision model that uses top-down attention to improve object segmentation, especially for small objects, with significantly fewer parameters.
Findings
FLIP outperforms SAM models in IoU across benchmarks.
FLIP achieves high accuracy with over 1000x fewer parameters.
FLIP effectively segments very small objects in diverse scenes.
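The findings above are stated in terms of IoU (intersection over union) between a predicted and a ground-truth mask. As a minimal sketch of that metric, assuming binary masks stored as nested lists, the function below counts overlapping and combined foreground pixels. The name `iou` and the convention of returning 1.0 for two empty masks are illustrative choices, not taken from the paper.

```python
def iou(pred, target):
    """Intersection over union of two equally sized binary masks (lists of rows)."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1  # pixel is foreground in both masks
            if p or t:
                union += 1  # pixel is foreground in at least one mask
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


pred = [[0, 1, 1],
        [0, 1, 0]]
target = [[0, 1, 0],
          [0, 1, 1]]
score = iou(pred, target)  # 2 shared pixels out of 4 covered pixels
```

A mean IoU, as reported for the benchmarks, would average this score over all evaluated objects.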
Abstract
Current state-of-the-art segmentation models encode entire images before focusing on specific objects. As a result, they waste computational resources, particularly when small objects are to be segmented in high-resolution scenes. We introduce FLIP (Fovea-Like Input Patching), a parameter-efficient vision model that realizes object segmentation through biologically-inspired top-down attention. FLIP selectively samples multi-resolution patches centered on objects of interest from the input. As a result, it allocates high-resolution processing to object centers while maintaining coarser peripheral context. This off-grid, scale-invariant design enables FLIP to outperform Meta's Segment Anything models (SAM) by large margins: with more than 1000x fewer parameters, FLIP-Tiny (0.51M parameters) reaches a mean IoU of 78.24% while SAM-H (641.1M parameters) reaches 75.41% IoU. FLIP-Large even…
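The abstract describes sampling multi-resolution patches around an object center, with fine resolution near the center and coarser context further out. The sketch below illustrates that general idea only and is not the authors' implementation: the function name `sample_fovea_patches`, the Gaussian placement of patch centers, the doubling of patch size and spread per level, and all default values are assumptions made for illustration.

```python
import random


def sample_fovea_patches(center, image_size, base_patch=4, levels=3,
                         patches_per_level=8, base_sigma=6.0, seed=0):
    """Return (x, y, size) patch specs clustered around an object center.

    Hypothetical scheme: each level doubles both the patch size and the
    spread of patch centers, so small patches sit near the object center
    and large patches cover the periphery.
    """
    rng = random.Random(seed)
    cx, cy = center
    width, height = image_size
    patches = []
    for level in range(levels):
        size = base_patch * 2 ** level
        sigma = base_sigma * 2 ** level
        half = size / 2
        for _ in range(patches_per_level):
            # Off-grid placement: centers are continuous coordinates.
            x = rng.gauss(cx, sigma)
            y = rng.gauss(cy, sigma)
            # Clamp so the whole patch stays inside the image.
            x = min(max(x, half), width - half)
            y = min(max(y, half), height - half)
            patches.append((x, y, size))
    return patches


patches = sample_fovea_patches(center=(200.0, 120.0), image_size=(1024, 768))
num_fovea_tokens = len(patches)
# For comparison: a regular 16x16 grid over the same image.
full_grid_tokens = (1024 // 16) * (768 // 16)
```

With these assumed settings the sketch yields 24 patches, compared with 3072 patches for a full 16x16 grid over a 1024x768 image, which illustrates why encoding only object-centered patches can save computation on high-resolution scenes.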
Peer Reviews
Decision: Submitted to ICLR 2026
- This paper is presented neatly with a clear motivation. It starts with a motivation from the biological fovea structure and then introduces a learnable 2D Gaussian distribution as possible focal regions.
- The proposed fovea patching does not require encoding the full image, which can significantly accelerate inference and training. This is verified on different benchmarks, as it matches SAM's performance while using 1000x fewer parameters and being 6x faster during inferen
- Novelty issues. The idea of explicitly sampling patches at multiple resolutions and then embedding them with resolution-specific modules has already been proposed in the literature for different tasks. The most relevant one to this study is STT [1], which proposes foveating the input as a way to tokenize images efficiently for point-prompted segmentation. However, that study is neither cited nor referred to in the manuscript.
- The structural modifications as described in Sec 3.2 and 3.3 are mostly in
- The fovea-like patching mechanism is interesting and addresses the inefficiencies of full-image encoding.
- The scale-invariant design is robustly validated on the proposed ObjaScale dataset, where FLIP outperforms SAM variants by large margins.
- The hierarchical inference scheme further enhances practicality for real-time applications.
- The reliance on a 2D Gaussian prior derived from ground-truth masks during training raises concerns about its generalizability to real-world scenarios, where object shapes are often irregular or annotations are imperfect. This dependency warrants further discussion.
- While FLIP demonstrates excellence in handling small objects, its performance on large-scale or complex occluded objects remains underexplored. The comparison is limited to SAMv1, omitting SAMv2 and recent object-centric models s
1. The paper is well written, and the motivation is clear and convincing.
2. FLIP provides an effective way to reduce the patch count while preserving segmentation accuracy, and it includes an efficient low-level implementation of fovea patching.
3. The experimental results demonstrate reasonable and consistent performance gains across datasets.
1. The fovea-inspired input patching appears incremental. STT [1] presents a similar idea to speed up SAM, and the task setting is very close: both use a prompt to focus on the object and build multi-level foveated patches. FLIP relies on less structured and partly random sampling, whereas STT uses a more structured tokenization. The paper does not cite STT, which appeared at CVPR 2025 and is within scope for ICLR 2026. This omission weakens the novelty claim.
2. The comparison may be unfair. FLIP uses a Gaussian de
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Advanced Vision and Imaging · Image Retrieval and Classification Techniques
Methods: FLIP · Activation Patching · Focus · Segment Anything Model
