Granite Embedding R2 Models
Parul Awasthy, Aashka Trivedi, Yulong Li, Meet Doshi, Riyaz Bhat, Vignesh P, Vishwajeet Kumar, Yushu Yang, Bhavani Iyer, Abraham Daniels, Rudra Murthy, Ken Barker, Martin Franz, Madison Lee, Todd Ward, Salim Roukos, David Cox, Luis Lastras, Jaydeep Sen, Radu Florian

TL;DR
The Granite Embedding R2 models are a family of high-performance, enterprise-ready English embedding models that significantly improve retrieval accuracy, speed, and context length across diverse domains, with open-source availability.
Contribution
Introduction of the Granite Embedding R2 models, featuring expanded context length, improved performance, and enterprise-grade governance, for dense retrieval applications.
Findings
16x expanded context length (8,192 tokens)
State-of-the-art retrieval performance across multiple domains
19-44% speed improvements over competitors
| | granite-encoder-small-english | granite-encoder-english |
|---|---|---|
| Embedding size | 384 | 768 |
| Layers | 12 | 22 |
| Intermediate size | | |
| Global RoPE Theta | | |
| Vocabulary Size | | |

| Data | Range (chars) | Avg. (chars) | Range (tokens) | Avg. (tokens) |
|---|---|---|---|---|
| IBM documentation | [10, 475001] | 6393 | [2, 329116] | 1873 |

| Model | Parameters (M) | Seq. Length | BEIR Avg. | MLDR (en) | Miracl (en) |
|---|---|---|---|---|---|
| Retriever: granite-embedding-small-english-r2 | 47 | 8192 | 50.9 | 40.1 | 42.4 |
| ms-marco-MiniLM-L12-v2 | 33 | 512 | 52.0 | 34.8 | 54.5 |
| bge-reranker-base | 278 | 512 | 51.6 | 36.7 | 40.7 |
| bge-reranker-large | 560 | 512 | 53.0 | 37.9 | 42.2 |
| gte-reranker-modernbert-base | 149 | 8192 | 54.8 | 51.2 | 54.3 |
| granite-embedding-reranker-english-r2 | 149 | 8192 | 54.4 | 44.9 | 53.7 |
| Retriever: granite-embedding-english-r2 | 149 | 8192 | 53.1 | 41.6 | 43.6 |
| ms-marco-MiniLM-L12-v2 | 33 | 512 | 53.2 | 34.5 | 55.4 |
| bge-reranker-base | 278 | 512 | 53.0 | 36.6 | 40.9 |
| bge-reranker-large | 560 | 512 | 54.3 | 38.0 | 42.3 |
| gte-reranker-modernbert-base | 149 | 8192 | 56.1 | 50.4 | 54.8 |
| granite-embedding-reranker-english-r2 | 149 | 8192 | 55.4 | 44.4 | 54.5 |

| Global RoPE Theta | MTEB-v1 Retrieval (15) | CoIR (10) | MLDR (En) |
|---|---|---|---|
| 20k | 50.9 | 53.4 | 38.6 |
| 40k | 50.9 | 53.4 | 39.6 |
| 80k | 50.9 | 53.4 | 40.1 |
| 160k | 50.9 | 53.6 | 39.3 |
| 80k with 160k inference | 50.9 | 53.8 | 39.4 |
Granite Team
IBM Research AI. See Section A for the full author list. For questions, comments, or compliments, contact [email protected] or [email protected]. For feedback or comments on this work, please open an issue at https://github.com/ibm-granite/granite-embedding-models.
Abstract
We introduce the Granite Embedding R2 models, a comprehensive family of high-performance English encoder-based embedding models engineered for enterprise-scale dense retrieval applications. Building upon our first-generation release, these models deliver substantial improvements, including 16x expanded context length (8,192 tokens), state-of-the-art performance across diverse retrieval domains - text, code, long-document search, multi-turn conversational, and tabular data - and measurable speed advantages of 19-44% over leading competitors while maintaining superior accuracy. Our release encompasses both bi-encoder and cross-encoder architectures, featuring a highly effective 22-layer retriever model and its efficient 12-layer counterpart, alongside a high-quality reranker model, all trained exclusively on enterprise-appropriate data with comprehensive governance oversight. The models demonstrate exceptional versatility across standard benchmarks, IBM-developed evaluation suites, and real-world enterprise use cases, establishing new performance standards for open-source embedding models. In an era where retrieval speed and accuracy are paramount for competitive advantage, the Granite R2 models deliver a compelling combination of cutting-edge performance, enterprise-ready licensing, and transparent data provenance that organizations require for mission-critical deployments. All models are publicly available under the Apache 2.0 license at https://huggingface.co/collections/ibm-granite, enabling unrestricted research and commercial use.
1 Introduction
Bi-encoder text embedding models convert text into a fixed-dimension vector, such that semantically close texts are close in the vector space, while dissimilar texts have a low similarity. These embeddings can then be used in a variety of tasks, most commonly in retrieval applications, where the relevance of a document to a given query can be determined by the similarity of their embeddings (Dunn et al., 2017; Xiong et al., 2020; Neelakantan et al., 2022; Zamani et al., 2018; Zhao et al., 2020), but also in document clustering (Angelov, 2020) and text classification (Sun et al., 2019).
Encoder-based embedding models (Wang et al., 2022; Xiao et al., 2023; Chen et al., 2024; Merrick et al., 2024; Zhang et al., 2024; Nussbaum et al., 2024) are widely used for these retrieval tasks because of their lower inference latency and smaller memory footprint compared to decoder-based embedding models (Lee et al., 2024; Wang et al., 2023). However, many encoder embedding models are trained on data with non-commercial licenses, which restricts commercial deployment, and have shorter context lengths, which limits their usability for long-context applications.
While bi-encoders independently encode each text from a pair into fixed-length vectors that can be compared using distance measures like cosine similarity, cross-encoders (or rerankers) produce a single similarity score after jointly processing pairs of text. Cross-encoders often outperform bi-encoders because both texts can attend to each other. However, searching the entire corpus with them is computationally prohibitive, as it requires an inference pass over every query-document pair in the corpus (Reimers & Gurevych, 2019). Thus, they are often used to rerank the top documents retrieved by a bi-encoder, in a retrieve-and-rerank framework that improves search quality without a severe speed overhead.
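The retrieve-and-rerank flow described above can be sketched in a few lines. This is a minimal illustration with toy data, not the Granite pipeline: the embeddings, the `pair_scorer` callback standing in for a cross-encoder, and the `top_k` value are all hypothetical.

```python
import numpy as np

def cosine_scores(query_vec, doc_matrix):
    """Cosine similarity between one query vector and each row of doc_matrix."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    return d @ q

def retrieve_and_rerank(query_vec, doc_matrix, pair_scorer, top_k=20):
    """Stage 1: rank all documents by bi-encoder cosine similarity.
    Stage 2: rescore only the top_k candidates with the expensive pair scorer."""
    sims = cosine_scores(query_vec, doc_matrix)
    candidates = np.argsort(-sims)[:top_k]
    reranked = sorted(candidates, key=lambda i: -pair_scorer(int(i)))
    return [int(i) for i in reranked]

# Toy corpus: 4 documents with 3-dimensional embeddings (hypothetical values).
docs = np.array([[1.0, 0.0, 0.0],
                 [0.9, 0.1, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])
query = np.array([1.0, 0.05, 0.0])

# Hypothetical cross-encoder scores, indexed by document id.
toy_pair_scores = {0: 0.2, 1: 0.9, 2: 0.1, 3: 0.0}
ranking = retrieve_and_rerank(query, docs, lambda i: toy_pair_scores[i], top_k=2)
print(ranking)
```

The pair scorer is called only `top_k` times per query, which is why the second stage adds little latency compared with scoring every document jointly.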
This report introduces the Granite Embedding R2 models, purpose-built for information retrieval tasks and comprising both bi-encoder and cross-encoder models. These models provide many improvements over our R1 Granite Embedding models (Awasthy et al., 2025), including an increased context length, improved inference optimizations, and an updated encoder model based on the ModernBERT architecture (Warner et al., 2024), trained on 2T tokens from a high-quality web-based corpus (Gohari et al., 2025) and code data (Mishra et al., 2024). The Granite Embedding models have been trained on high-quality, curated data, with data quality checks and screening to remove personal information and profane language. We release these models under the Apache 2.0 license, which permits both commercial and research use. We release three English models of varying sizes for a variety of inference budgets, spanning bi-encoder retriever models and cross-encoder rerankers:
- granite-embedding-english-r2 (149M parameters; ibm-granite/granite-embedding-english-r2): with an output embedding size of 768, replacing granite-embedding-125m-english.
- granite-embedding-small-english-r2 (47M parameters; ibm-granite/granite-embedding-small-english-r2): a first-of-its-kind reduced-size model, with fewer layers and a smaller output embedding size (384), replacing granite-embedding-30m-english.
- granite-embedding-reranker-english-r2 (149M parameters): a full-sized reranking model optimized for relevance ordering, built on granite-embedding-english-r2.
Both bi-encoders are designed to replace the older Granite embedding models, delivering state-of-the-art performance across standard and IBM-built information retrieval benchmarks (BEIR, ClapNQ), code retrieval (COIR), long-document search (MLDR, LongEmbed), multi-turn conversational retrieval (MT-RAG), Table-IR (OpenWikiTables, NQTables, OTT-QA, MultiHierTT, AIT-QA), and many enterprise use cases, while supporting an extended 8192-token context length in line with industry standards. Crucially, the increase in parameter count for these models does not reduce inference speed, owing to optimizations such as Flash Attention (Dao, 2023).
Similarly, the accompanying reranker model demonstrates strong performance across diverse benchmarks, enabling sophisticated retrieve-and-rerank pipelines that maximize both recall and precision. Together, these models form a complete retrieval ecosystem that addresses the full spectrum of enterprise information retrieval challenges.
The paper’s structure is as follows: Section 2 describes the training of the improved encoder models, which are based on the ModernBERT architecture. Section 3 describes in detail the training recipes for the bi-encoder Granite retriever models, while Section 4 describes the training of the cross-encoder Granite Reranker model. Finally, Section 5 presents a comprehensive evaluation of the Granite Embedding models, comparing their performance with that of other open-source encoder embedding models.
2 Granite Encoder Models
The Granite Embedding R2 models have been trained on top of updated encoder models, featuring longer context lengths, a richer training corpus, and modern architectural improvements. We discuss the architecture of our base and small Granite Encoder models, which form the backbone of both the bi-encoder and cross-encoder models, as well as the training recipe based on Warner et al. (2024) and details of the high-quality corpus used to optimize the model for code and long-context retrieval.
2.1 Encoder Model Architecture
The Granite Encoder models have been trained following the ModernBERT (Warner et al., 2024) training recipe, including modern model optimizations such as an alternating attention mechanism, rotary positional embeddings for flexible context length, and streamlined parameters. The models also support Flash Attention (Dao, 2023) for improved efficiency, leading to no slowdown compared to the R1 models, even with a slightly larger parameter count, as shown in Section 5.2. The Granite Encoder models use the ModernBERT tokenizer, a modified version of the OLMo tokenizer (Groeneveld et al., 2024) that shows better performance on code-related tasks. These encoder models have been trained on English and code data.
We train two models, granite-encoder-english and granite-encoder-small-english, which are our base- and small-sized models and the backbones of granite-embedding-english-r2 and granite-embedding-small-english-r2, respectively. The architecture of granite-encoder-english follows that of ModernBERT-base, with 22 layers, 149M parameters, and a vector size of 768. We select this architecture without any further ablations, referring to the findings of Warner et al. (2024). On the other hand, granite-encoder-small-english is a first-of-its-kind small ModernBERT-style model, with 12 layers, 47M parameters, and a vector size of 384, which still supports the increased context length of 8192 tokens. This architecture follows that of popular small embedding models (Li et al., 2023; Xiao et al., 2023; Wang et al., 2022), and was selected based on its performance on natural language understanding tasks (Wang et al., 2018) and downstream retrieval tasks (Kwiatkowski et al., 2019; Husain et al., 2019; Chen et al., 2024). For ablations on the architectural choices of granite-encoder-small-english, please refer to Appendix B. Similar to the ModernBERT architecture, our models alternate between local and global attention, with global attention in every third layer. Detailed specifications of the architecture of each model are shown in Table 1.
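As a small illustration of "global attention in every third layer", the sketch below lists which layers would be global for the two depths mentioned above. It assumes the indexing convention that layer `i` is global when `i % 3 == 0` (layers counted from 0); the report does not state the exact offset, so treat that as an assumption.

```python
def attention_pattern(num_layers, global_every=3):
    """Label each layer as 'global' or 'local'.

    Assumption: layers are indexed from 0 and layer i is global
    when i % global_every == 0; all other layers use local attention.
    """
    return ["global" if i % global_every == 0 else "local"
            for i in range(num_layers)]

base_pattern = attention_pattern(22)   # base model depth from the text
small_pattern = attention_pattern(12)  # small model depth from the text

print(base_pattern.count("global"), small_pattern.count("global"))
```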
2.2 Training Data
We curate a diverse, high-quality corpus of text and code data to train our encoder models. The largest dataset we use in terms of tokens is the GneissWeb dataset (Gohari et al., 2025), a collection of web data filtered to create a high-quality corpus for language model training. We also include Wikipedia, BookCorpus, StackExchange, and PubMed articles in our training mix for diversity. For improved performance on code-related tasks, we include a subset of code data from the training corpora of Granite Code Models (Mishra et al., 2024), also referred to here as CodePile. To improve performance on IBM benchmarks, we use internal IBM documents targeting specific technical domains. Furthermore, we include multi-turn conversational data (Section 3.1) to improve performance on conversational IR tasks.
Through thorough ablations, we find that sampling from these datasets according to their relative sizes, with most of the weight being assigned to GneissWeb and CodePile, yields better performance than a data curriculum in most cases. For the base model, we conduct the first two stages described in Section 2.3 on only data from GneissWeb, and conduct the last stage of training on a mixture of datasets. For the small model, we find mixing all datasets in all stages to be more beneficial, with the first two stages heavily sampling from GneissWeb.
2.3 Training Recipe
Following the ModernBERT training setting, we train our models on the Masked Language Modeling objective over three distinct stages:
1. Large Scale Pretraining: First, we train on 2 trillion tokens of text data, with a maximum context length of 1024. We use a Warmup-Stable-Decay learning rate schedule (Hu et al., 2024), with a peak learning rate of 8e-4 after an initial warmup on 3 billion tokens, and a RoPE theta of 10,000.
2. Context Extension: We then scale up the context length to 8192 and the RoPE theta to 160,000 and train for 250 billion tokens at a constant learning rate of 3e-4.
3. Learning Rate Decay: Finally, we train on 50 billion tokens, with the same context length and RoPE theta as above, but with a 1-sqrt learning rate decay from the peak learning rate of 3e-4.
We also use the StableAdamW optimizer (Wortsman et al., 2023), and employ efficient training mechanisms such as sequence packing, unpadding, and flash attention, as described in Warner et al. (2024). We train both our base and small models from scratch, employing no special initialization techniques for either of the models.
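The three stages above imply a piecewise learning rate as a function of tokens seen. The sketch below is one way to write it down; the token budgets and learning rates come from the text, while the linear warmup shape and the decay reaching exactly zero at the end are assumptions.

```python
import math

STAGE1_TOKENS = 2_000_000_000_000  # large-scale pretraining (2T tokens)
WARMUP_TOKENS = 3_000_000_000      # warmup inside stage 1 (3B tokens)
STAGE2_TOKENS = 250_000_000_000    # context extension (250B tokens)
STAGE3_TOKENS = 50_000_000_000     # learning rate decay (50B tokens)

def learning_rate(tokens_seen):
    """Learning rate after `tokens_seen` training tokens across all stages."""
    if tokens_seen < WARMUP_TOKENS:
        # Assumption: linear warmup from 0 to the stage-1 peak.
        return 8e-4 * tokens_seen / WARMUP_TOKENS
    if tokens_seen < STAGE1_TOKENS:
        return 8e-4  # stable phase of stage 1
    if tokens_seen < STAGE1_TOKENS + STAGE2_TOKENS:
        return 3e-4  # constant rate during context extension
    # 1-sqrt decay from 3e-4 over the final stage (assumed to end at 0).
    progress = (tokens_seen - STAGE1_TOKENS - STAGE2_TOKENS) / STAGE3_TOKENS
    progress = min(progress, 1.0)
    return 3e-4 * (1.0 - math.sqrt(progress))

print(learning_rate(1_000_000_000_000), learning_rate(2_100_000_000_000))
```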
3 Granite Embedding R2
The Granite Embedding R2 models are high-quality English embedding models, purpose-built for retrieval tasks and trained on carefully curated, enterprise-ready data. We discuss the training data and methodology for these models here, focusing on techniques such as retrieval-oriented pretraining, contrastive finetuning, and distillation.
3.1 Training Data
As detailed in Section 3.2, the embedding models undergo a round of retrieval-oriented and tabular pretraining on general text data, followed by contrastive finetuning on paired data. The data for each of these steps, including publicly available and synthetically generated datasets, are discussed below.
Retrieval Oriented Pretraining:
For RetroMAE-style pretraining (Xiao et al., 2022), we train the model with about 200,000 sentences from Wikipedia, BookCorpus and StackExchange, with a context length of 8192 tokens. We experimented with other data mixtures, including those used for training the Granite Encoder models (Section 2.2); however, we find this data mix to give the best performance on downstream retrieval tasks.
Tabular Pretraining:
Given the scarcity of paired retrieval training data and the need for tabular models to handle diverse structures and domains, we curate a corpus of approximately 8M tables from sources such as WikiTables (Kweon et al., 2023), Arxiv Tables (staghado, 2025), PubTables (Smock et al., 2022), Git Tables (Hulsebos et al., 2023), FinTabNet (Zheng et al., 2021), and NQ (Herzig et al., 2021). This collection provides a large and varied dataset for tabular pretraining, consisting of multiple formats, including CSV, Markdown, HTML, and table-marker representations to encourage format-agnostic learning.
A substantial portion of these tables contain numerical data, which poses challenges for common pretraining objectives such as RetroMAE (Xiao et al., 2022), where masking often targets numbers, and the Inverse Cloze Task (ICT) (Lee et al., 2019), which relies on informative contextual text that is often unavailable or insufficient for tables. To address this, we generate synthetic summaries and metadata for the tables using Mistral-7B-Instruct (Jiang et al., 2023a), providing richer context for pretraining. In the table pretraining stage, we use both the original tables and the tables with summaries, and we also include a data replay from the RetroMAE pretraining stage.
Retriever Training:
Granite Embedding Models are trained on two types of paired data:
1. Weakly paired data mined from the web with in-batch negatives
2. High-quality annotated data with hard negatives for finetuning, from three key sources:
   - (a) Publicly available paired data
   - (b) IBM-internal paired data targeting specific technical domains
   - (c) IBM-generated synthetic data.
For governance, our data undergoes a data clearance process subject to technical, business, and governance review. This comprehensive process captures critical information about the data, including, but not limited to, its content description, intended use, data classification, licensing information, usage restrictions, as well as an assessment of sensitive information (i.e., personal information).
The Granite Embedding R2 Models reuse training data from English Granite Embedding R1 Models (Awasthy et al., 2025), and add data from the Code, Tabular, and Multi-Turn Conversation domains:
- Code Data: We create code retrieval pairs from various sources to maintain diversity, mining hard negatives in most cases. We extract problem and solution pairs from Project CodeNet (https://github.com/IBM/Project_CodeNet), consisting of pairs in Python, Java, C++ and C. The original problems, in mixed English and Japanese, were translated to English and further summarized to obtain a short query using Mixtral-8x22B. We also create pairs of code and natural language from the CoNaLa train set (Yin et al., 2018), as well as code and docstring pairs extracted from CodePile (Mishra et al., 2024). For CoNaLa and CodePile pairs, we mine hard negatives using granite-embedding-125m-english, while we select random negatives for Project CodeNet.
- Table-IR Data: We use several publicly available training datasets, including Open-WikiTables, NQTables, OTT-QA (Chen et al., 2021a), FinQA (Chen et al., 2021b), and MultiHierTT (Zhao et al., 2022) for training granite-embedding-english-r2 for Table-IR tasks. For datasets that require query rewriting or de-contextualization, we implement a structured query-rewrite and filter pipeline using Mistral-7B-Instruct. To further enhance retrieval robustness, we mine hard negatives using the granite-embedding-125m-english model and incorporate these during distillation from the teacher model, as described in Section 3.2.
- Multi-Turn Conversational IR Data: For multi-turn conversational data, we use the train split of MultiDoc2Dial (Feng et al., 2021). We also synthetically generate about 2000 multi-turn conversations using Mixtral-8x22B for the ClapNQ and IBM Cloud corpora of the MT-RAG dataset (Katsis et al., 2025b). The generation involves grouping passages from the same document, generating turns (user query and assistant response) for each passage within the group, and connecting these turns by de-contextualizing the queries.
3.2 Training Recipe
Embedding models are typically trained with a contrastive learning objective (Gao et al., 2021), which brings the embeddings of a query closer to those of relevant passages and pushes them further away from non-relevant ones. Recent work (Zhang et al., 2024; Chen et al., 2024; Li et al., 2023; Xiao et al., 2023; Wang et al., 2022) employs a two-stage contrastive finetuning approach, first finetuning on a large corpus of semi-supervised pairs, then finetuning on a higher quality set of triples. The Granite Embedding models have been trained with additional techniques to improve performance, resulting in a training pipeline with the following stages:
1. Retrieval Oriented Pre-training: Starting with the Granite Encoder models, we conduct some steps of retrieval-oriented pre-training, such as RetroMAE (Xiao et al., 2022), as a means to train the [CLS] vector to produce richer representations without explicitly training on the contrastive objective. This step is done with a context length of 8192 tokens.
2. Tabular Pretraining: To effectively leverage large-scale tabular datasets for representation learning in retrieval applications, traditional bag-of-words techniques fall short due to cell redundancy and the prevalence of numerical content, which offers limited utility for learning correlations between tables and text (Yin et al., 2020). We therefore extend the RetroMAE framework to better learn tabular representations for retrieval.

Formally, let a table be represented as the token sequence $T = \bigl[\texttt{[CLS]},\, h_{1}, \dots, h_{J},\, \texttt{[SEP]},\, c_{1}, \dots, c_{M},\, \texttt{[SEP]}\bigr]$, where $h_j$ ($j = 1, \dots, J$) are header tokens and $c_m$ ($m = 1, \dots, M$) are cell tokens, with $M = R \times C$ for an $R \times C$ table. Each table is paired with a natural language summary, represented as $S = \bigl[\texttt{[CLS]},\, s_{1}, \dots, s_{K}\bigr]$, where $s_k$ ($k = 1, \dots, K$) are summary tokens. We apply an M1 attention mask (Mouravieff et al., 2025a) on the masked table sequence $\widetilde{T}$, and feed it into the encoder $\mathbf{E}$ to generate a contextual table embedding via $z_{T} = \mathbf{E}\bigl(\widetilde{T}\bigr)_{\texttt{[CLS]}}$. This representation is then provided to a shallow decoder $\mathbf{D}$, together with the input embeddings of the masked summary sequence $\widetilde{S}$. Unlike the original RetroMAE objective, where the decoder reconstructs the masked input sequence, our modified objective requires the decoder to predict masked tokens over the summary and table metadata, rather than the table tokens themselves:

$$\mathcal{L}_{\text{tab}} = -\sum_{k \in \mathcal{M}} \log P_{\mathbf{D}}\bigl(s_{k} \mid z_{T}, \widetilde{S}\bigr)$$

where $\mathcal{M}$ indexes the masked summary positions. We use a masking ratio of 20% for the encoder and 60% for the decoder tokens. This formulation forces the encoder to align table structure and content with textual summaries, enabling more effective representation learning for downstream table-text retrieval tasks.
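3. Contrastive Finetuning: The models are then finetuned for the contrastive learning objective on a large corpus of semi-supervised paired data, using the improved contrastive loss proposed in Li et al. (2023). Specifically, for a batch of $n$ triples $\bigl(q_i, (p_{ij})_j\bigr)$ consisting of a query and a set of passages (without loss of generality, we can assume that $p_{i0}$ is a positive passage for query $q_i$, while $p_{ij}$, $j > 0$, are negative passages), we define the contrastive loss as:

$$\mathcal{L}_{C} = -\frac{1}{n}\sum_{i=1}^{n} \log \frac{e^{s(q_i, p_{i0})}}{Z_i}$$

$$Z_i = \sum_{k,j} e^{s(q_i, p_{kj})} + \sum_{k \neq i} e^{s(q_i, q_k)} + \sum_{k} e^{s(q_k, p_{i0})} + \sum_{k \neq i} e^{s(p_{k0}, p_{i0})}$$

where $s(q, p)$ is the temperature-scaled cosine similarity between the [CLS] embeddings of $q$ and $p$:

$$s(q, p) = \frac{1}{\tau} \cdot \frac{\mathbf{E}(q) \cdot \mathbf{E}(p)}{\lVert \mathbf{E}(q) \rVert \, \lVert \mathbf{E}(p) \rVert}$$

Here, we can perform finetuning on a large set of semi-supervised pair data using a large batch size and in-batch negatives to better approximate the contrastive objective.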
4. Contrastive Distillation: Instead of further finetuning on high-quality triples, we distill the distribution of temperature-scaled cosine similarity scores from a Mistral-7B-Instruct model (Jiang et al., 2023a) trained on the contrastive loss objective. Specifically, the training objective minimizes the cross entropy between the teacher's distribution $P_t$ of similarity scores between pairs and the student's distribution, $P_s$. Following Hinton et al. (2014), we also scale the score distributions of both teacher and student by a temperature, $\tau_{KD}$:

$$\mathcal{L}_{KD} = -\sum_{i=1}^{n}\sum_{j} P_t(q_i, p_{ij}) \log P_s(q_i, p_{ij})$$

$$P_s(q_i, p_{ij}) = \frac{e^{s_s(q_i, p_{ij})/\tau_{KD}}}{\sum_{k} e^{s_s(q_i, p_{ik})/\tau_{KD}}}$$

$$P_t(q_i, p_{ij}) = \frac{e^{s_t(q_i, p_{ij})/\tau_{KD}}}{\sum_{k} e^{s_t(q_i, p_{ik})/\tau_{KD}}}$$

where $s_s$ and $s_t$ denote the similarity scores of the student and the teacher, respectively. We find this objective to yield a larger improvement in performance than finetuning with hard negatives. We use 3 mined hard negatives for more informative contrastive training, keeping the maximum sequence length at 1024 tokens.
5. Domain Adaptation for Multi-turn Conversational IR: To improve the quality of the embeddings for multi-turn conversational IR, the models undergo a final domain-adaptation stage wherein the distribution of similarity scores from a domain-adapted Mistral teacher is distilled into the model. We conduct this step for granite-embedding-english-r2, and omit it for the small model as we do not see significant performance improvement in the latter case.
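To make the contrastive and distillation stages more concrete, here is a simplified NumPy sketch. It is not the training code: the contrastive function keeps only the query-to-passage in-batch term (a simplification of the loss of Li et al. (2023)), and the temperature values, function names, and toy inputs are hypothetical.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def in_batch_contrastive_loss(q_emb, p_emb, temperature=0.05):
    """Simplified contrastive loss: q_emb[i] is paired with p_emb[i],
    and every other passage in the batch serves as a negative."""
    scores = l2_normalize(q_emb) @ l2_normalize(p_emb).T / temperature
    return float(-np.mean(np.diag(log_softmax(scores, axis=1))))

def distillation_loss(student_scores, teacher_scores, tau_kd=1.0):
    """Cross entropy between the teacher and student softmax distributions
    over each query's candidate passages (one row per query)."""
    p_teacher = np.exp(log_softmax(teacher_scores / tau_kd, axis=1))
    log_p_student = log_softmax(student_scores / tau_kd, axis=1)
    return float(-np.mean(np.sum(p_teacher * log_p_student, axis=1)))

# Toy check: perfectly aligned pairs give a near-zero contrastive loss.
aligned = in_batch_contrastive_loss(np.eye(3), np.eye(3))
shuffled = in_batch_contrastive_loss(np.eye(3), np.eye(3)[[1, 2, 0]])
print(aligned, shuffled)
```

In the distillation function the student is penalized whenever its ranking of a query's candidates diverges from the teacher's, which is how information about hard negatives can flow from the teacher without explicit labels.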
At each stage of our training, we scale the global RoPE base frequency parameter for better Positional Interpolation (Chen et al., 2023). For both our models, we find that a global RoPE theta of 80K gives better performance than the default of 160K on both short and long-context downstream retrieval tasks, with ablations presented in Appendix E. Detailed hyperparameters for each stage are provided in Appendix F.
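For readers unfamiliar with the RoPE theta parameter, the sketch below computes the standard rotary inverse frequencies and shows how the base changes the longest positional wavelength. The head dimension of 64 is a hypothetical value chosen for illustration and is not taken from the report.

```python
import numpy as np

def rope_inverse_frequencies(theta, head_dim=64):
    """Standard RoPE inverse frequencies: theta^(-2i/d) for i = 0..d/2-1.
    head_dim=64 is an illustrative assumption."""
    exponents = np.arange(0, head_dim, 2) / head_dim
    return theta ** (-exponents)

def longest_wavelength(theta, head_dim=64):
    """Wavelength (in token positions) of the slowest-rotating dimension pair."""
    return 2 * np.pi / rope_inverse_frequencies(theta, head_dim).min()

for theta in (10_000, 80_000, 160_000):
    print(theta, round(longest_wavelength(theta)))
```

A larger theta stretches the slowest rotations over more positions, which is the knob being tuned when the text compares 80K against 160K.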
3.3 Teacher Training
We fine-tune Mistral-7B-Instruct-v0.2 (Jiang et al., 2023b) using contrastive training to obtain our teacher model for distillation. Similar to Wang et al. (2024), for each query-document pair $(q, d)$, we add an instruction template to the original query to generate a new one:

$$q_{\text{inst}} = \text{Instruct: \{task\_definition\}} \;\backslash\text{n}\; \text{Query: } q$$
where {task_definition} is a placeholder for a one-sentence description of the embedding task. To calculate the embedding of a text, an [EOS] token is first appended to the end of the text before feeding it to the Mistral model, and the [EOS] vector of the last layer is treated as the text embedding.
We train two separate embedding models from Mistral-7B-Instruct-v0.2 and create the final teacher model by merging the two models together. The first model is trained using three million weakly paired examples with in-batch negatives. The second model is trained using one million annotated pairs, each with one hard negative.
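The query formatting and [EOS] pooling described in this section can be sketched as follows. The template string follows the format used by Wang et al. (2024) and is an assumption about the exact wording; the pooling helper operates on plain arrays with right padding rather than on a real Mistral model, and both function names are hypothetical.

```python
import numpy as np

def build_instructed_query(task_definition, query):
    """Prepend a one-sentence task description to the query.
    Assumption: template wording follows Wang et al. (2024)."""
    return f"Instruct: {task_definition}\nQuery: {query}"

def last_token_embedding(hidden_states, attention_mask):
    """Pick the final-layer vector of the last real token of each sequence
    (the appended [EOS] token), assuming right-padded batches.

    hidden_states: (batch, seq_len, dim); attention_mask: (batch, seq_len)."""
    last_index = attention_mask.sum(axis=1) - 1
    return hidden_states[np.arange(hidden_states.shape[0]), last_index]

prompt = build_instructed_query(
    "Given a web search query, retrieve relevant passages", "what is RoPE?")

hidden = np.arange(12, dtype=float).reshape(2, 3, 2)  # toy hidden states
mask = np.array([[1, 1, 0], [1, 1, 1]])               # first row is padded
pooled = last_token_embedding(hidden, mask)
print(prompt)
print(pooled)
```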
4 Granite Reranker
The Granite reranker model is a cross-encoder model built on top of granite-embedding-english-r2, using a list-wise ranking objective. The model also features a context length of 8192 tokens and is trained with curated high-quality data and carefully mined negatives.
4.1 Training Recipe
The granite-embedding-reranker-english-r2 model is a cross-encoder that jointly encodes the query and document as $\bigl[\texttt{[CLS]}\; q\; \texttt{[SEP]}\; d\bigr]$, and predicts the relevance score from the [CLS] representation:

$$s(q, d) = \mathbf{w}^{\top} \mathbf{h}_{\texttt{[CLS]}} + b$$
We randomly initialize the classifier parameters, and fine-tune the model with PListMLE (Lan et al., 2014) loss objective, an extension of ListMLE (Xia et al., 2008) which defines the probability of a permutation as the product of step-wise conditional probabilities. PListMLE biases the permutation probability distribution according to position-dependent weights, making it sensitive to rank positions.
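Given a ranked list $\pi = (d_{\pi(1)}, \dots, d_{\pi(n)})$ and relevance scores $s_{\pi(1)}, \dots, s_{\pi(n)}$, the loss is defined as:

$$\mathcal{L}_{\text{PListMLE}} = -\sum_{i=1}^{n} \alpha(i) \log \frac{\exp\bigl(s_{\pi(i)}\bigr)}{\sum_{k=i}^{n} \exp\bigl(s_{\pi(k)}\bigr)}$$

where $\alpha(\cdot)$ is a decreasing function. To align with NDCG, $\alpha(i)$ is set as the gain function $2^{n-i} - 1$, giving higher weight to more relevant documents.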
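A position-aware ListMLE loss of this kind can be sketched in NumPy as below. The position weight $2^{n-i} - 1$ follows the choice in Lan et al. (2014) and is an assumption about the exact gain used here; the function name and toy scores are hypothetical.

```python
import numpy as np

def plistmle_loss(scores_in_ranked_order):
    """Position-aware ListMLE for one query.

    scores_in_ranked_order[i] is the model score of the document that the
    ground-truth ranking places at position i (0 = most relevant).
    Assumption: position weight 2^(n-i) - 1 with 1-based i, i.e.
    2^(n-1-i) - 1 with the 0-based index used in this loop."""
    s = np.asarray(scores_in_ranked_order, dtype=float)
    n = len(s)
    loss = 0.0
    for i in range(n):
        tail = s[i:]
        # log-sum-exp over the documents not yet placed (stable form)
        log_norm = tail.max() + np.log(np.exp(tail - tail.max()).sum())
        weight = 2.0 ** (n - 1 - i) - 1.0
        loss += weight * (log_norm - s[i])
    return float(loss)

good = plistmle_loss([3.0, 2.0, 1.0])  # scores agree with the ideal order
bad = plistmle_loss([1.0, 2.0, 3.0])   # scores reverse the ideal order
print(good, bad)
```

Because the weights shrink with position, mistakes at the top of the list cost far more than mistakes near the bottom, which mirrors how NDCG discounts lower ranks.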
We fine-tuned the model for 15K steps with a learning rate of 2e-4, weight decay 0.05, warmup ratio 0.15, and gradient-norm clipping at 1.0. We used the same training data as the Granite Embedding models, differing mainly in hard-negative mining and relative-order refinement. Hard negatives were first mined using granite-embedding-english-r2 by selecting negative documents whose query similarity is close to that of the positive documents (margin 0.95). From the retriever’s top-20 candidates, an in-house reranker, built on granite-embedding-125m-english, was then used to refine the ranking and produce a better-ordered negative set, ensuring that the selected negatives were harder and more semantically relevant. During training, we used the top-8 hard negatives per query from this reranked list to optimize the ranking objective.
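One possible reading of the margin-based mining step is sketched below: from the retriever's top candidates, keep non-positive documents whose similarity is at most 0.95 times the positive's similarity, then take the top 8. This interpretation of "margin 0.95" is an assumption, the function name is hypothetical, and the in-house reranker that reorders the pool is omitted.

```python
import numpy as np

def mine_hard_negatives(sims, positive_ids, margin=0.95, top_k=8, pool_size=20):
    """Select hard negatives for one query.

    sims[i] is the retriever similarity between the query and document i.
    Assumption: a candidate qualifies if it is not a positive and its
    similarity is at most margin * (lowest positive similarity), which
    filters out likely false negatives while keeping close competitors."""
    sims = np.asarray(sims, dtype=float)
    positives = set(positive_ids)
    threshold = margin * min(sims[i] for i in positives)
    pool = [int(i) for i in np.argsort(-sims)[:pool_size]]
    hard = [i for i in pool if i not in positives and sims[i] <= threshold]
    return hard[:top_k]

toy_sims = [0.90, 0.89, 0.80, 0.50, 0.10]  # document 0 is the positive
negatives = mine_hard_negatives(toy_sims, positive_ids=[0], top_k=2)
print(negatives)
```

In the toy example, document 1 scores too close to the positive and is skipped as a likely false negative, so the two hardest remaining candidates are returned.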
5 Evaluation
We evaluate the performance of our models on a variety of tasks and domains. Granite Encoder models are evaluated on a variety of natural language understanding and retrieval tasks in Appendix C.
We evaluate Granite Embedding models on retrieval benchmarks across domains, such as text, code, table, conversational, and long-context retrieval, in Sections 5.1 and 5.2. The models show strong performance, achieving higher scores than other open-source models on average, while maintaining the highest inference speeds.
We evaluate the Granite Reranker model by reranking the top-20 documents retrieved by granite-embedding-english-r2 and granite-embedding-small-english-r2 on various retrieval datasets in Section 5.3, and show that it performs strongly compared with other open-source rerankers.
5.1 Retrieval Performance
We evaluate the Granite Embedding models on a variety of retrieval tasks, spanning multiple domains, document lengths and text objects (e.g., documents, tables, conversations):
- English Retrieval: We evaluate on general information retrieval benchmarks such as MTEB-v2 (Enevoldsen et al., 2025), comprising retrieval tasks on a variety of domains with a focus on zero-shot evaluations. We also include evaluation on the popular BEIR benchmark (Thakur et al., 2021) in Appendix D.
- Code Retrieval: We evaluate on code retrieval tasks of the COIR benchmark (Li et al., 2024), which consists of text-to-code, code-to-text, and hybrid code retrieval.
- Long Context Retrieval: To evaluate performance on retrieving long-context documents, we measure the performance on the MLDR task (Chen et al., 2024) and the LongEmbed benchmark (Zhu et al., 2024).
- Table Retrieval: Existing text embedding models frequently underperform when encoding structured data such as tables (Trabelsi et al., 2022; Mouravieff et al., 2025b). We evaluate on the tabular retrieval task across five datasets: OpenWikiTables, NQTables, OTT-QA, MultiHierTT, and AIT-QA (Katsis et al., 2022).
- Multi-Turn Conversation Retrieval: We evaluate the ability of our models to retrieve documents in a multi-turn conversation setting using the MT-RAG retrieval task (Katsis et al., 2025a).
We compare our bi-encoders with other state-of-the-art embedding models of similar parameter size. granite-embedding-english-r2 is compared to popular open-source base models, such as BGE Base (Xiao et al., 2023) and E5 Base (Wang et al., 2022), as well as recent models with larger sequence length, such as Arctic Embed (M) (Yu et al., 2024), GTE Base (Zhang et al., 2024; Li et al., 2023), GTE ModernBERT Base (Zhang et al., 2024; Li et al., 2023), and Nomic-AI ModernBERT Embed Base (Nussbaum et al., 2024). The small model is compared to BGE Small (Xiao et al., 2023) and E5 Small (Wang et al., 2022), and is one of the first small models with a long context. We also compare the R2 embedding models to the R1 Granite Embedding Models (Awasthy et al., 2025), to quantify the improvement over the previous release.
Unless specifically mentioned, all tasks are evaluated with a maximum sequence length of 8192. While we show only the average performance for each benchmark in Table 2, we give a detailed evaluation for our models in Appendix D, including the evaluation on the complete English MTEB-v2 benchmark in Appendix D.1.
As shown in Table 2, the Granite Embedding R2 models show strong performance across diverse tasks, even though all tasks are zero-shot except for NQ, Hotpot, and FEVER. Notably, on average, granite-embedding-english-r2 outperforms other models of similar size. The Granite Embedding R2 models achieve state-of-the-art performance on long-context retrieval benchmarks like LongEmbed, with granite-embedding-small-english-r2 reaching very high accuracy (even compared to some larger models) without increasing inference cost.
5.2 Embedding Speed
Text embedding models are fundamental to information retrieval systems and Retrieval-Augmented Generation (RAG) applications. Organizations typically process millions of documents, with frequent updates and new content requiring continuous ingestion. This makes encoding speed as important as accuracy—a slow model can become a significant bottleneck in large-scale deployments.
To evaluate the performance of embedding models in realistic scenarios, we construct a benchmark using 23,000 public IBM technical documents covering various products from ServeRAID controllers to IBM Storwize systems. Guiding principles for the experiment:
- Create a realistic scale and document complexity: a large number of documents of varying lengths, ranging from 10 to 475,001 characters (averaging 6,393 characters; see Table 3)
- Use standardized processing: 512-token chunks with 100-token overlaps across all models
- Perform consistent testing: identical corpus and hardware (single Nvidia H100 GPU) for all tests
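The standardized chunking step can be illustrated with a minimal sketch. It assumes documents are already tokenized into a list of token ids, and that a shorter final chunk is kept rather than padded or dropped; the tokenizer and the handling of the last chunk are not specified in the text, and the function name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=100):
    """Split a token sequence into fixed-size windows that overlap.

    Assumption: the final window may be shorter than chunk_size.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if not tokens:
        return []
    stride = chunk_size - overlap  # 412 new tokens per chunk with the defaults
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
        start += stride
    return chunks


# Hypothetical document of 1,000 token ids
doc = list(range(1000))
chunks = chunk_tokens(doc)
print([len(c) for c in chunks])  # [512, 512, 176]
```

With these settings each chunk after the first repeats the last 100 tokens of its predecessor, so a passage that straddles a chunk boundary still appears intact in at least one chunk as long as it is no longer than the overlap.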
As seen in Table 4, the Granite R2 embedding models perform well across several metrics:
- Speed performance: both models combine high throughput with strong retrieval accuracy
  - granite-embedding-small-english-r2 processes 199 documents per second, which is 38% faster than comparable ModernBERT models (144 docs/s)
  - granite-embedding-english-r2 processes 144 documents per second, outperforming smaller BERT-based alternatives like e5-small-v2 and bge-small-en-v1.5 (both at 138 docs/s)
  - The larger Granite models consistently outperform smaller competitors, indicating good architectural efficiency
- Model efficiency: The Granite R2 series maintains the speed advantages of its R1 predecessors while offering additional capabilities such as an increased context length. Despite architectural changes, both models run at the same speed as their predecessors.
- Comparison with alternatives: When compared against popular open-source alternatives like bge-base-en-v1.5, Granite models show competitive performance.
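The relative speed figures follow directly from the throughput numbers quoted above. The short sketch below reproduces the arithmetic; the helper name `relative_speedup` is introduced here for illustration only.

```python
def relative_speedup(docs_per_s, baseline_docs_per_s):
    """Percentage by which a throughput exceeds a baseline throughput."""
    return 100.0 * (docs_per_s / baseline_docs_per_s - 1.0)


# Small R2 model (199 docs/s) against the 144 docs/s ModernBERT figure
print(round(relative_speedup(199, 144)))  # 38

# Small R2 model (199 docs/s) against the 138 docs/s small BERT-based models
print(round(relative_speedup(199, 138)))  # 44
```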
5.3 Reranking Performance
We evaluate the Granite Reranker model on the BEIR, MLDR, and MIRACL benchmarks, which are commonly used for general information retrieval. We compare our reranker model with other state-of-the-art ranking models such as MiniLM-L12, BGE Base and Large (Xiao et al., 2023), and GTE Base (Zhang et al., 2024). All models are evaluated on the top-20 documents retrieved by granite-embedding-english-r2. Each reranking model is evaluated with its maximum supported sequence length, while queries are truncated to 64 tokens.
As shown in Table 5, the Granite Reranker model achieves strong performance on all benchmarks, improving over using the retriever alone, while outperforming all reranker models except that of Zhang et al. (2024). The gap with Zhang et al. (2024) is relatively small, with the main difference observed on MLDR. Notably, Zhang et al. (2024) incorporates MS MARCO and MLDR in training, whereas we do not.
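The retrieve-then-rerank setup described in this section can be sketched as follows. This is a toy illustration, not the evaluation code: `score_fn` is a hypothetical stand-in for a cross-encoder, and the token-overlap scorer in the usage example is chosen only so that the snippet runs without any model.

```python
def rerank(query_tokens, candidates, score_fn, top_k=20, max_query_tokens=64):
    """Re-score the top_k first-stage candidates with a pairwise scorer.

    candidates: list of (doc_id, retriever_score) pairs.
    score_fn:   hypothetical stand-in for a cross-encoder; maps
                (query_tokens, doc_id) to a relevance score.
    """
    query_tokens = query_tokens[:max_query_tokens]  # truncate the query
    # Keep only the top_k candidates according to the first-stage retriever.
    shortlist = sorted(candidates, key=lambda c: c[1], reverse=True)[:top_k]
    rescored = [(doc_id, score_fn(query_tokens, doc_id)) for doc_id, _ in shortlist]
    # Final order is decided by the reranker score alone.
    return sorted(rescored, key=lambda c: c[1], reverse=True)


# Toy corpus and scorer (assumptions for illustration only)
docs = {
    "d1": {"raid", "controller"},
    "d2": {"storage", "pool", "raid"},
    "d3": {"license"},
}


def overlap_score(query_tokens, doc_id):
    return len(set(query_tokens) & docs[doc_id])


first_stage = [("d1", 0.9), ("d2", 0.8), ("d3", 0.7)]
ranked = rerank(["raid", "storage", "pool"], first_stage, overlap_score, top_k=2)
print(ranked)  # [('d2', 3), ('d1', 1)]
```

Because the reranker only sees the shortlist, a relevant document that the retriever ranks below the cutoff cannot be recovered, which is why the reranker results depend on the first-stage retriever.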
6 Conclusion
In this work, we have presented the Granite Embedding R2 Models, a family of specialized retrieval and reranker models designed to address the computational and accuracy requirements of enterprise-scale information retrieval systems. Our experimental evaluation demonstrates that these models achieve substantial performance gains, with processing speeds that are 19% faster than leading base model baselines and 44% faster than competitive small model alternatives, while preserving state-of-the-art retrieval accuracy across diverse domains including text, code, conversational data, tabular content, and long-context scenarios.
The proposed models incorporate several key contributions: optimized bi-encoder and cross-encoder architectures that support extended 8192 token contexts, comprehensive data curation methodologies ensuring enterprise-grade quality standards, and transparent training procedures. We release these models under the Apache 2.0 license supporting both academic research and practical deployment scenarios. Our findings indicate that the Granite R2 model family provides a viable solution for organizations seeking to implement robust, scalable information retrieval systems. The combination of computational efficiency, retrieval performance, and enterprise-ready licensing positions these models as a significant contribution to both academic research and business-critical applications.
In an era where milliseconds matter and accuracy cannot be compromised, Granite R2 models don’t just meet the standard—they set it.
Appendix A Contributions
The Granite R2 embedding models were truly the outcome of a successful collaboration across geographies led by Radu Florian, with contributions from the IBM Watson Research Lab (WRL) and the India Research Lab (IRL). Parul Awasthy was the challenge lead on the project overall, based at WRL, with Jaydeep Sen coordinating the work from IRL. We are very grateful for the wonderful and successful collaboration across continents, and we look forward to even better models!
Encoder Model Training
Parul Awasthy, Aashka Trivedi
Retriever and Reranker Training
Parul Awasthy, Riyaz Bhat, Meet Doshi, Bhavani Iyer, Vishwajeet Kumar, Yulong Li, Vignesh P, Aashka Trivedi, Yushu Yang
Data and Evaluation
Parul Awasthy, Ken Barker, Meet Doshi, Radu Florian, Martin Franz, Bhavani Iyer, Vishwajeet Kumar, Yulong Li, Rudra Murthy, Vignesh P, Aashka Trivedi, Todd Ward
Product Management
Abraham Daniels, Madison Lee
Technical Leadership
Parul Awasthy, David Cox, Radu Florian, Luis Lastras, Salim Roukos, Jaydeep Sen
Appendix B granite-encoder-small-english Architecture Ablations
granite-encoder-small-english is one of the first open small ModernBERT-style encoder models, and we explore the following options for its architecture, loosely based on the ModernBERT-base architecture:
1. Modifications to the ModernBERT-base architecture: we experiment with halving the number of layers, the intermediate size, and the attention size of the ModernBERT-base architecture, keeping other dimensions the same. We also try using a fourth of the number of layers to further reduce inference latency.
2. Maintaining the ratio of dimensions of ModernBERT-base to ModernBERT-large: the ModernBERT models come in both base and large varieties. We create small architectures such that the ratio of small-to-base dimensions matches that of base-to-large. Note that the architecture that maintains this ratio in all dimensions has a hidden size of 576; we include the performance of this model for completeness, but we intend to choose an architecture that produces embeddings of size 384, as a replacement for granite-embedding-30m-english.
3. Attention head size of 64: An important aspect of the ModernBERT architecture is that the attention head size (i.e., the hidden size divided by the number of attention heads) is 64. We experiment with maintaining this ratio.
4. Constraining the number of layers so that the last layer is global: the ModernBERT architecture alternates between global and local attention, keeping global attention at every third layer. We fix this design in both our base and small encoder models, and thus ablate the effect of choosing a layer count for the small model such that the last layer always has global attention.
5. Standard “small” architecture: Finally, we use the standard architecture of many small embedding models, with 12 layers and a hidden size of 384.
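To make the layer-pattern and head-size options above concrete, the sketch below lists which layers would use global attention for a given depth. It assumes 0-based layer indices with global attention wherever the index is a multiple of three; the text only states that global attention occurs at every third layer, so the offset is an assumption, and both function names are hypothetical.

```python
def global_attention_layers(num_layers, every=3):
    """0-based indices of layers using global attention.

    Assumption: layer 0 is global, then every `every`-th layer after it.
    """
    return [i for i in range(num_layers) if i % every == 0]


def last_layer_is_global(num_layers, every=3):
    """True if the final layer falls on the global-attention pattern."""
    return (num_layers - 1) % every == 0


for n in (22, 12):
    print(n, global_attention_layers(n), last_layer_is_global(n))

# Number of attention heads implied by a head size of 64 at hidden size 384
print(384 // 64)  # 6
```

Under this assumed offset, a 22-layer stack ends on a global layer while a 12-layer stack does not, which illustrates the trade-off between options 4 and 5.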
We ablate the architecture for granite-encoder-small-english by training each candidate on 100B tokens for the first stage of training (Section 2.3). As a baseline, we choose the encoder used to train granite-embedding-30m-english, a RoBERTa-like small model, which we refer to as granite-encoder-30m-english. We then evaluate the performance of the encoder on the natural language understanding tasks of the GLUE benchmark (Wang et al., 2018). To estimate downstream retrieval performance, we finetune each encoder for 1 epoch on MS-MARCO triples with a contrastive learning objective, and evaluate the performance of the resulting embedding models on general retrieval (Natural Questions (Kwiatkowski et al., 2019)), code retrieval (CodeSearchNet (Husain et al., 2019)), and long-context retrieval (MLDR (Chen et al., 2024)).
As shown in Table 6, the standard small encoder gives the best performance on downstream retrieval tasks with a vector size of 384, without much loss in GLUE performance and with a modest increase in parameter size compared to other, lower-performing ablations. We thus select this architecture as the basis of granite-encoder-small-english.
Appendix C Encoder Performance
We evaluate our encoder models on natural language understanding and downstream retrieval performance:
- NLU Performance: for natural language understanding, we measure the performance on the GLUE benchmark (Wang et al., 2018), comprising 8 language understanding tasks. Following Warner et al. (2024), for the RTE, MRPC, and STS-B tasks, we start training from the MNLI checkpoint.
- Retrieval Performance: The Granite Encoder models are purpose-built for retrieval tasks, and we evaluate the dense retrieval performance of these models by first finetuning them for a single epoch on MS-MARCO triples (Bajaj et al., 2018), using the standard InfoNCE loss. We then evaluate on three retrieval tasks: Natural Questions (Kwiatkowski et al., 2019) for general-purpose IR, CodeSearchNet (Husain et al., 2019) for code retrieval, and MLDR (Chen et al., 2024) for long-context retrieval.
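For readers unfamiliar with the InfoNCE objective mentioned above, the following is a minimal sketch for a single (query, positive, negatives) example. It assumes cosine similarity and a temperature of 0.05; neither the similarity function nor the temperature is specified in this section, and the toy vectors are illustrative only.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(query, positive, negatives, temperature=0.05):
    """InfoNCE loss: negative log-softmax of the positive's similarity.

    Assumptions: cosine similarity and temperature=0.05 (illustrative values).
    """
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


# Toy 2-dimensional embeddings
query = [1.0, 0.0]
close_doc = [1.0, 0.0]
far_doc = [0.0, 1.0]

low_loss = info_nce(query, close_doc, [far_doc])   # positive is the closest document
high_loss = info_nce(query, far_doc, [close_doc])  # positive is the farthest document
print(low_loss < high_loss)  # True
```

The loss is small when the positive document is more similar to the query than every negative, and grows as a negative overtakes it, which is what pushes relevant pairs together in the embedding space.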
For all tasks, we compare the base Granite Encoder model with popular open-source base-sized encoders such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and ModernBERT (Warner et al., 2024). The small Granite Encoder is compared against the 12-layer MiniLM model (Wang et al., 2020), which matches its architecture exactly, and is a popular starting point for training smaller embedding models. We evaluate all the models with the same pipeline to ensure fair comparison.
As shown in Table 7, the Granite Encoder models show strong performance on both NLU and retrieval tasks, while being trained on high-quality data suitable for enterprise use.
Appendix D Detailed Retriever Performance Evaluation
D.1 English Text Embedding Performance
For general-purpose embedding performance beyond retrieval, we measure the performance of the Granite Embedding Models on the English MTEB-v2 benchmark, which consists of 7 tasks spanning 41 datasets, used to evaluate the quality of text embeddings on classification, clustering, pair classification, reranking, retrieval, semantic similarity and summarization.
We provide the average score per task in Table 8, using the main metric described in the paper: Accuracy for classification tasks, V-Measure for clustering tasks, Average Precision for pair classification tasks, MAP for re-ranking tasks, nDCG@10 for retrieval tasks and Spearman Correlation (based on cosine similarity) for STS and summarization tasks.
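As a concrete reference for the retrieval metric, nDCG@10 can be computed as in the sketch below. It uses the linear-gain form of DCG; benchmark toolkits may differ in details such as the gain function or tie handling, so this should be read as an illustration rather than the exact evaluation code.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg_at_k(ranked_doc_ids, relevance, k=10):
    """relevance maps doc_id -> graded relevance; missing ids count as 0."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical query with two relevant documents, "a" and "b"
relevance = {"a": 1, "b": 1}
print(ndcg_at_k(["a", "b", "x"], relevance))  # 1.0, both relevant documents ranked first
print(ndcg_at_k(["a", "x", "b"], relevance))  # below 1.0, "b" is pushed down one rank
```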
While the Granite Embedding models are purpose-built for retrieval tasks, they still maintain strong performance on the other tasks of the MTEB benchmark, notably without adding any training data from non-retrieval tasks. This indicates the high quality of the embeddings produced by these models.
D.2 English Retrieval Performance
We measure the performance of our models on two popular retrieval benchmarks:
- BEIR (Thakur et al., 2021): This information retrieval benchmark consists of 15 datasets spanning multiple domains, and is used to test the model’s ability to find the relevant document for a given query. The tasks of this benchmark also comprise the MTEB v1 (Muennighoff et al., 2022) retrieval tasks.
- MTEB v2 retrieval (Enevoldsen et al., 2025): This benchmark is an update to the MTEB v1 retrieval benchmark that excludes tasks such as MS-MARCO and NQ, which are often used in the finetuning of embedding models.
The experimental results for the two benchmarks are reported in Table 9 and Table 10. We report nDCG@10 scores on each dataset, showing strong performance relative to other open-source models of similar size. We also report the average performance on all tasks except MS-MARCO retrieval (Bajaj et al., 2018) for a fair zero-shot comparison, as other embedding models train on MS-MARCO but our models do not, due to its unfavorable license. Despite being trained on less data, and only on permissively licensed public datasets, our models show strong performance.
Note that, other than for the Granite Embedding R2 models, all model performances are taken from the MTEB leaderboards as of August 8, 2025.
D.3 Code Retrieval Performance
We evaluate the models’ code retrieval capability on the COIR benchmark (Li et al., 2024), which consists of 10 datasets across 7 domains; results are reported in Table 11. The Granite Embedding models show strong performance compared to models of the same size, even though, unlike other models, we do not include any COIR training data in the training of our models, making the evaluation purely zero-shot for our models.
D.4 Long-Context Retrieval Performance
For evaluating the performance of our models on long-context documents, we measure performance on the following benchmarks:
- LongEmbed: The LongEmbed benchmark contains two synthetic and four real-world tasks, designed to benchmark long-context retrieval.
- MLDR (English): The Multilingual Long-Document Retrieval dataset is built on Wikipedia, Wudao, and mC4, covering 13 languages, with questions generated using GPT-3.5. We limit our evaluations to the English subset of this dataset.
As shown in Table 12, the Granite Embedding R2 models show very strong performance on the long context benchmarks, with state-of-the-art performance on LongEmbed compared to other models trained with long context.
D.5 Table Retrieval Performance
To evaluate the performance of our Granite Embedding R2 models on tabular retrieval tasks, we assess them using the following datasets:
- OpenWikiTables: An open-domain QA dataset designed for the Table-IR task in an open-domain setting, derived from WikiSQL and WikiTableQuestions.
- NQTables: A large-scale table retrieval subset extracted from the Natural Questions dataset, focusing on open-domain retrieval and question answering over Wikipedia tables.
- OTT-QA: An open-domain table-and-text question answering dataset requiring retrieval and fusion of both tabular data and textual passages from large pools. We use the corpus and queries included in the TARGET Benchmark (Ji et al., 2024) for evaluation.
- MultiHierTT: A benchmark for numerical reasoning over multi-hierarchical tabular and textual data extracted from financial reports. The documents contain multiple hierarchical tables and lengthy narrative text, requiring complex multi-step reasoning. As many queries are context-dependent, we decontextualize them prior to evaluation in the retrieval task.
- AIT-QA: A domain-specific table QA dataset in the airline industry, comprising 515 human-annotated questions over 116 complex, hierarchical tables sourced from SEC filings. We use an extended version of AIT-QA that incorporates both the tables and their source 10-K forms from the SEC, enabling a hybrid retrieval setting that utilizes tabular and textual data. The final version of the evaluation set consists of 515 questions with ground-truth answers and pages, with a total search corpus of 1,939 pages with 1,682 tables.
As shown in Table 13, the Granite Embedding R2 models demonstrate consistently strong retrieval performance across all Table-IR datasets.
Appendix E RoPE Theta scaling Ablations
To investigate the impact of global RoPE theta, we conducted an ablation study on the granite-embedding-small-english-r2 model using values of 20k, 40k, 80k, and 160k. In the ModernBERT architecture, where global and local attention layers alternate, we adjusted RoPE theta only for the global attention layers, keeping local attention fixed at the default 10k. All experiments were conducted during the final distillation stage with the same setup as discussed in Section 3.2.
As shown in Table 14, benchmarks such as MTEB-v1 and CoIR were insensitive to changes in global RoPE theta, whereas MLDR achieved its highest performance at 80k. Increasing the value to 160k did not yield consistent improvements across benchmarks. For reference, inference results with a global RoPE theta of 160k are also reported for comparison with the best-performing 80k configuration.
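The role of the theta parameter can be illustrated with the standard RoPE frequency formula, in which rotation pair i of a head of dimension d rotates at frequency theta^(-2i/d). The sketch below assumes a head dimension of 64 (the attention head size discussed in Appendix B) and prints the longest rotation wavelength for the local default and each ablated global value; it is a generic illustration, not the training code.

```python
import math


def rope_inv_freq(head_dim, theta):
    """Per-pair rotation frequencies theta^(-2i/d) for i = 0 .. d/2 - 1."""
    return [theta ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(head_dim, theta):
    """Wavelength, in tokens, of the slowest-rotating pair."""
    return 2.0 * math.pi / rope_inv_freq(head_dim, theta)[-1]


# 10k is the local-attention default; the remaining values are the ablated global settings.
thetas = (10_000, 20_000, 40_000, 80_000, 160_000)
wavelengths = [longest_wavelength(64, t) for t in thetas]
for theta, wavelength in zip(thetas, wavelengths):
    print(theta, round(wavelength))
```

A larger theta slows the rotation of the low-frequency pairs, so positions that are far apart remain distinguishable over longer spans. This is the usual motivation for raising theta on global attention layers when extending context length.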
Appendix F Retriever Training Hyperparameters
Table 15 provides the detailed hyperparameters for each stage of retriever training for the Granite Embedding models.
- Angelov (2020) Dimo Angelov. Top2Vec: Distributed representations of topics. CoRR, abs/2008.09470, 2020. URL https://arxiv.org/abs/2008.09470.
- Awasthy et al. (2025) Parul Awasthy, Aashka Trivedi, Yulong Li, Mihaela Bornea, David Cox, Abraham Daniels, Martin Franz, Gabe Goodhart, Bhavani Iyer, Vishwajeet Kumar, Luis Lastras, Scott McCarley, Rudra Murthy, Vignesh P, Sara Rosenthal, Salim Roukos, Jaydeep Sen, Sukriti Sharma, Avirup Sil, Kate Soule, Arafat Sultan, and Radu Florian. Granite embedding models, 2025. URL https://arxiv.org/abs/2502.20204.
- Bajaj et al. (2018) Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, and Tong Wang. MS MARCO: A human generated machine reading comprehension dataset, 2018. URL https://arxiv.org/abs/1611.09268.
- Chen et al. (2024) Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2024.
- Chen et al. (2023) Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. CoRR, abs/2306.15595, 2023. doi: 10.48550/ARXIV.2306.15595. URL https://doi.org/10.48550/arXiv.2306.15595.
- Chen et al. (2021a) Wenhu Chen, Ming-Wei Chang, Eva Schlinger, William Yang Wang, and William W. Cohen. Open question answering over tables and text. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021a. URL https://openreview.net/forum?id=MmCRswl1UYl.
- Chen et al. (2021b) Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Kenneth Huang, Bryan R. Routledge, and William Yang Wang. FinQA: A dataset of numerical reasoning over financial data. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic,
- Dao (2023) Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning, 2023.
