Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models