OtterHD: A High-Resolution Multi-modality Model
Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, Ziwei Liu

TL;DR
OtterHD-8B is a high-resolution multimodal model that outperforms existing models in detailed visual understanding tasks, emphasizing the importance of flexible input handling and high-resolution processing.
Contribution
The paper introduces OtterHD-8B, a versatile high-resolution multimodal model with flexible input dimensions and a new evaluation framework, MagnifierBench, for detailed visual analysis.
Findings
OtterHD-8B outperforms current models on MagnifierBench.
High-resolution inputs significantly improve visual detail recognition.
Flexibility in input size enhances model versatility.
Abstract
In this paper, we present OtterHD-8B, an innovative multimodal model evolved from Fuyu-8B, specifically engineered to interpret high-resolution visual inputs with granular precision. Unlike conventional models that are constrained by fixed-size vision encoders, OtterHD-8B boasts the ability to handle flexible input dimensions, ensuring its versatility across various inference requirements. Alongside this model, we introduce MagnifierBench, an evaluation framework designed to scrutinize models' ability to discern minute details and spatial relationships of small objects. Our comparative analysis reveals that while current leading models falter on this benchmark, OtterHD-8B, particularly when directly processing high-resolution inputs, outperforms its counterparts by a substantial margin. The findings illuminate the structural variances in visual information processing among different…
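To make the contrast between fixed-size vision encoders and flexible input dimensions concrete, here is a minimal sketch of how the number of image patches could scale with resolution when an image is split into patches at its native size. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and the patch size of 30 and the 224-pixel, 14-pixel-patch encoder are assumed values that this summary does not state.

```python
def native_patch_grid(height, width, patch_size=30):
    """Patch grid for an image processed at its own resolution.

    Assumption: the image is padded up to a multiple of patch_size,
    so the grid uses ceiling division. patch_size=30 is an assumed
    value for illustration only.
    """
    if height <= 0 or width <= 0 or patch_size <= 0:
        raise ValueError("height, width and patch_size must be positive")
    rows = -(-height // patch_size)  # integer ceiling division
    cols = -(-width // patch_size)
    return rows, cols, rows * cols


def fixed_encoder_grid(height, width, encoder_size=224, patch_size=14):
    """Patch grid for a hypothetical fixed-size encoder.

    The input is resized to encoder_size x encoder_size first, so the
    original height and width do not affect the result.
    """
    side = encoder_size // patch_size
    return side, side, side * side


if __name__ == "__main__":
    for h, w in [(512, 512), (1024, 1024)]:
        print((h, w), native_patch_grid(h, w), fixed_encoder_grid(h, w))
```

Under these assumptions the native grid grows from 324 patches at 512×512 to 1225 patches at 1024×1024, while the fixed-size encoder always yields 256. This is one way to see why resizing to a fixed input can discard the small details a high-resolution input would preserve.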
Taxonomy
Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · CCD and CMOS Imaging Sensors
Methods: High-resolution input
