Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Gemini Team Google: Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, Soroosh Mariooryad, Yifan Ding, Xinyang Geng, Fred Alcober, Roy Frostig, Mark Omernick, Lexi Walker, Cosmin Paduraru

TL;DR
Gemini 1.5 models are compute-efficient multimodal models that recall and reason over millions of tokens of context, covering long-document QA as well as video and audio analysis, with near-perfect retrieval recall.
Contribution
Introduction of Gemini 1.5 models, including a high-performance Pro version and a lightweight Flash variant, pushing the limits of long-context multimodal reasoning and retrieval.
Findings
Achieves near-perfect retrieval (>99%) up to 10 million tokens
Sets new state-of-the-art in long-document and long-video QA
Enables practical applications like professional task assistance and language translation
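As a rough illustration of what a retrieval figure such as ">99%" measures, the sketch below scores a needle-in-a-haystack style test: a known string is hidden at several depths in contexts of several lengths, and recall is the fraction of trials in which it is returned exactly. This is a toy under stated assumptions, not the report's evaluation harness. The function names (`build_haystack`, `toy_retriever`, `recall`), the filler vocabulary, and the needle format are all hypothetical, and `toy_retriever` is a plain scan standing in for a model call.

```python
import random


def build_haystack(num_tokens, needle, depth, seed=0):
    """Return `num_tokens` filler tokens with `needle` inserted at fractional `depth`."""
    rng = random.Random(seed)
    filler = [f"w{rng.randrange(1000)}" for _ in range(num_tokens)]
    position = int(depth * num_tokens)
    return filler[:position] + [needle] + filler[position:]


def toy_retriever(context, key_prefix):
    """Hypothetical stand-in for a model: scan the context for the keyed token."""
    for token in context:
        if token.startswith(key_prefix):
            return token
    return None


def recall(context_lengths, depths, retriever):
    """Fraction of (length, depth) trials in which the needle is returned exactly."""
    hits = 0
    trials = 0
    for n in context_lengths:
        for d in depths:
            needle = f"secret={n}-{d}"
            context = build_haystack(n, needle, d)
            hits += retriever(context, "secret=") == needle
            trials += 1
    return hits / trials


if __name__ == "__main__":
    score = recall([1_000, 10_000], [0.0, 0.25, 0.5, 0.75, 1.0], toy_retriever)
    print(f"recall = {score:.2%}")
```

Swapping `toy_retriever` for a function that prompts a real model with the joined context would turn the same loop into an actual measurement; the grid of lengths and depths is what produces the familiar recall heatmaps.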
Abstract
In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's…
Code & Models
- 🤗 PJMixers-Images/Florence-2-base-Castollux-v0.5 · model · 505 dl · ♡ 5
- 🤗 PJMixers-Images/Florence-2-base-Castollux-v0.1 · model · 2 dl
- 🤗 PJMixers-Images/Florence-2-base-Castollux-v0.2 · model · 7 dl · ♡ 2
- 🤗 PJMixers-Images/Florence-2-base-Castollux-v0.4 · model · 6 dl · ♡ 1
Videos
AI CEO: ‘Stock Crash Could Stop AI Progress’, Llama 4 Anti-climax + ‘Superintelligence in 2027’ ...· youtube
Taxonomy
Topics: Semantic Web and Ontologies
Methods: Attention Is All You Need · Sparse Evolutionary Training · Linear Layer · Dropout · Multi-Head Attention · Position-Wise Feed-Forward Layer · Layer Normalization · Absolute Position Encodings · Softmax · Dense Connections
