CLUE: Crossmodal disambiguation via Language-vision Understanding with attEntion