GRILL: Grounded Vision-language Pre-training via Aligning Text and Image Regions