GRR-CoCa: Leveraging LLM Mechanisms in Multimodal Model Architectures

Jake R. Patock; Nicole Catherine Lewis; Kevin McCoy; Christina Gomez; Canling Chen; Lorenzo Luzi

arXiv:2507.18009·cs.CV·July 25, 2025

TL;DR

GRR-CoCa brings three LLM architectural mechanisms (Gaussian error gated linear units, root mean squared normalization, and rotary positional embedding) into the CoCa multimodal model, including its ViT encoder, and reports significant improvements on contrastive and generative vision-language tasks.

Contribution

This work applies LLM-inspired architectural modifications to the textual decoders and ViT encoder of the CoCa multimodal model, benchmarks the result against a Baseline CoCa that keeps the original ViT encoder, and reports improvements across multiple datasets.

Findings

01. 27.25% reduction in perplexity during pretraining

02. Significant improvements in contrastive loss and CoCa loss metrics

03. Enhanced generalization across diverse vision-language tasks
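
The metrics named in these findings can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes perplexity is the exponential of the mean per-token negative log-likelihood, that the CoCa loss is a weighted sum of the contrastive and captioning losses, and that the reported percentages are relative reductions against Baseline CoCa. The function names and the weights `w_con` and `w_cap` are placeholders and do not come from this page.

```python
import math


def perplexity(token_nlls):
    """Perplexity as exp of the mean per-token negative log-likelihood (in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def coca_loss(contrastive, captioning, w_con=1.0, w_cap=1.0):
    """Weighted sum of the two training objectives.

    w_con and w_cap are placeholder weights, not values taken from the paper.
    """
    return w_con * contrastive + w_cap * captioning


def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline
```

Under these assumptions, a model that lowers a metric from 4.0 to 3.0 has `relative_reduction(4.0, 3.0)` equal to 25.0, which is the sense in which a figure such as 27.25% would be read.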

Abstract

State-of-the-art (SOTA) image and text generation models are multimodal models that have many similarities to large language models (LLMs). Despite achieving strong performance, leading foundational multimodal model architectures frequently lag behind the architectural sophistication of contemporary LLMs. We propose GRR-CoCa, an improved SOTA Contrastive Captioner (CoCa) model that incorporates Gaussian error gated linear units, root mean squared normalization, and rotary positional embedding into the textual decoders and the vision transformer (ViT) encoder. Each architectural modification has been shown to improve model performance in LLMs, but has yet to be adopted in CoCa. We benchmarked GRR-CoCa against Baseline CoCa, a model with the same modified textual decoders but with CoCa's original ViT encoder. We used standard pretraining and fine-tuning workflows to benchmark the models…
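
The three mechanisms named in the abstract can be sketched in a few lines of dependency-free Python. This is an illustrative sketch of the standard formulations, not the GRR-CoCa implementation: the function names, the `eps` and `base` defaults, and the pairing of adjacent dimensions in `rope` are assumptions, and the abstract does not say how positions are assigned to image patches in the ViT encoder.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: divide x by its root mean square, then apply a learned gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]


def gelu(v):
    """Exact GELU: v * Phi(v), where Phi is the standard normal CDF."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def geglu(x, W_gate, W_up):
    """GEGLU: GELU(W_gate x) multiplied elementwise by (W_up x)."""
    gate = matvec(W_gate, x)
    up = matvec(W_up, x)
    return [gelu(g) * u for g, u in zip(gate, up)]


def rope(x, position, base=10000.0):
    """RoPE: rotate each pair (x[i], x[i+1]) by an angle that grows with position.

    Assumes len(x) is even; pair i uses frequency base ** (-i / len(x)).
    """
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out
```

Because `rope` applies a pure rotation, the dot product between a rotated query and a rotated key depends only on the difference between their positions. That relative-position property is the usual motivation for applying it to attention queries and keys.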

Taxonomy

Topics: Semantic Web and Ontologies · Natural Language Processing Techniques · Multi-Agent Systems and Negotiation