NinjaLLM: Fast, Scalable and Cost-effective RAG using Amazon SageMaker and AWS Trainium and Inferentia2
Tengfei Xue, Xuefeng Li, Roman Smirnov, Tahir Azim, Arash Sadrieh, and Babak Pahlavan

TL;DR
NinjaLLM introduces a fast, scalable, and cost-effective retrieval-augmented generation system optimized for AWS Trainium and Inferentia2 chips, enhancing performance, safety, and tool integration for large language models.
Contribution
The paper presents novel enhancements to RAG techniques, specifically optimized for AWS AI chips, improving accuracy, safety, and tool usage in LLM deployment.
Findings
Achieved 62% accuracy on the Natural Questions dataset.
Achieved 59% accuracy on the HotPotQA dataset.
Outperformed other models such as DBRX and Mixtral Instruct on these benchmarks.
Abstract
Retrieval-augmented generation (RAG) techniques are widely used today to retrieve and present information in a conversational format. This paper presents a set of enhancements to traditional RAG techniques, focusing on large language models (LLMs) fine-tuned and hosted on AWS Trainium and Inferentia2 AI chips via SageMaker. These chips are characterized by their elasticity, affordability, and efficient performance for AI compute tasks. Besides enabling deployment on these chips, this work aims to improve tool usage, add citation capabilities, and mitigate the risks of hallucinations and unsafe responses due to context bias. We benchmark our RAG system's performance on the Natural Questions and HotPotQA datasets, achieving accuracies of 62% and 59%, respectively, and exceeding other models such as DBRX and Mixtral Instruct.
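To make the retrieve-then-cite idea in the abstract concrete, here is a minimal Python sketch of a generic RAG prompt builder with numbered citations. It is not the authors' pipeline: the lexical-overlap scorer, the function names (`tokenize`, `score`, `retrieve`, `build_prompt`), and the prompt wording are all assumptions made for illustration, and a real system would likely use a learned retriever and an LLM call in place of the toy pieces here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query, passage):
    """Toy relevance score (assumption): count of overlapping tokens."""
    q = Counter(tokenize(query))
    p = Counter(tokenize(passage))
    return sum(min(q[t], p[t]) for t in q)


def retrieve(query, passages, k=2):
    """Return the indices of the k highest-scoring passages."""
    ranked = sorted(
        range(len(passages)),
        key=lambda i: score(query, passages[i]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, passages, k=2):
    """Assemble a prompt that numbers each source so the model can cite it."""
    lines = ["Answer using only the numbered sources and cite them like [1]."]
    for n, i in enumerate(retrieve(query, passages, k), start=1):
        lines.append(f"[{n}] {passages[i]}")
    lines.append(f"Question: {query}")
    return "\n".join(lines)


if __name__ == "__main__":
    docs = [
        "Paris is the capital of France.",
        "The Nile is a river in Africa.",
        "France borders Spain and Italy.",
    ]
    print(build_prompt("What is the capital of France?", docs))
```

Numbering the retrieved passages in the prompt is one simple way to let a model attach citations to its answer; restricting the answer to those sources is a common, though not sufficient, guard against hallucination.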
