Loading paper
Optimizing Vision-Language Consistency via Cross-Layer Regional Attention Alignment | Tomesphere