Efficient Multi-modal Large Language Models via Visual Token Grouping