UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling