Cache-Augmented Generation (CAG): Revolutionizing AI Efficiency by Replacing RAG?

Jan 29, 2025

I recently came across an exciting preprint on arXiv in which researchers introduce a new technique called Cache-Augmented Generation (CAG). This approach has the potential to reduce reliance on RAG (Retrieval-Augmented Generation) and eliminate many of its drawbacks. Instead of retrieving information from an external knowledge store, CAG preloads all relevant knowledge directly into the extended context of a large language model (LLM) and uses it to answer queries during inference. What’s even more interesting is that with long-context LLMs, the results suggest CAG not only performs on par with RAG but in some cases outperforms it across various benchmarks. Let me take you through this fascinating technique, how it works, and how it compares to RAG.


Let’s begin! But First, What Is RAG?

There are several methods available to ensure that large language models (LLMs) can respond to user queries with up-to-date information. Techniques like LLM fine-tuning and Low-Rank Adaptation (LoRA) embed knowledge directly into the model's parameters, but they come with significant downsides: they are time-consuming, expensive, and need frequent updating, making them less practical in the long run. Enter RAG (Retrieval-Augmented Generation), a technique designed to address this issue. RAG retrieves relevant information at query time and injects it into the model's input, helping LLMs produce more accurate, up-to-date responses grounded in private datasets specific to a given use case. Here’s how the process breaks down:

  • Retrieval: Information is fetched from a knowledge base or private dataset.

  • Augmentation: The retrieved information is then added to the input context.

  • Generation: The LLM generates a response using both the original query and the augmented context.
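The three-step loop above can be sketched in a few lines of Python. This is a minimal illustration with a toy word-overlap retriever and a stub in place of a real LLM call; all names here (`KNOWLEDGE_BASE`, `retrieve`, `augment`, `generate`) are illustrative, not from any particular framework.

```python
# Toy retrieve-augment-generate loop: retrieval ranks documents by naive
# word overlap; a real system would use BM25 or dense embeddings, and a
# real LLM call instead of the generate() stub.
import re

KNOWLEDGE_BASE = [
    "CAG preloads documents into the model's context as a KV cache.",
    "RAG retrieves documents from an external store at query time.",
    "BM25 is a sparse retrieval method based on term statistics.",
]

def _words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, docs: list[str], top_k: int = 1) -> list[str]:
    """Rank documents by naive word overlap with the query."""
    ranked = sorted(docs, key=lambda d: len(_words(query) & _words(d)), reverse=True)
    return ranked[:top_k]

def augment(query: str, passages: list[str]) -> str:
    """Prepend the retrieved passages to the user query."""
    return "Context:\n" + "\n".join(passages) + f"\n\nQuestion: {query}"

def generate(prompt: str) -> str:
    """Stand-in for an LLM call; a real system would invoke a model here."""
    return f"[answer conditioned on a {len(prompt)}-character prompt]"

query = "How does RAG retrieve documents?"
answer = generate(augment(query, retrieve(query, KNOWLEDGE_BASE)))
```

Note that every query pays for the retrieval step; this per-query cost is exactly what CAG removes.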


Despite its potential, RAG isn't without its flaws. Here are some of the challenges it faces:

  • Retrieval latency: Fetching data from an external source during inference can introduce delays.

  • Retrieval errors: Irrelevant or incomplete documents may be retrieved, leading to inaccurate responses.

  • Knowledge fragmentation: Poor chunking or ranking of retrieved documents can result in disjointed or incoherent information.

  • Increased complexity: Building and maintaining a RAG pipeline is complex, requiring dedicated infrastructure and frequent updates.


Here Comes Cache-Augmented Generation (CAG)

Modern LLMs have vast context windows, allowing them to process significantly more information. Some popular models and their context lengths include:

  • GPT-3.5: 4k tokens

  • Mixtral and DBRX: 32k tokens

  • Llama 3.1: 128k tokens

  • GPT-4-turbo and GPT-4o: 128k tokens

  • Claude 2.1: 200k tokens

  • Gemini 1.5 Pro: 2 million tokens

Researchers are tapping into this capability to enable retrieval-free knowledge integration. The technique they’ve developed, known as Cache-Augmented Generation (CAG), works in three distinct steps:

Preloading External Knowledge

First, all the relevant documents are preprocessed and transformed into a precomputed key-value (KV) cache. This cache is then stored on disk or in memory, ready for future use. Doing so eliminates the need for repeated document processing and significantly reduces computational cost: the documents are processed only once, regardless of how many user queries follow.

Preloading works by building the KV cache, denoted ‘C(KV)’, from ‘D’, the set of all relevant documents. Because the model ingests the documents together, it gains a more comprehensive and cohesive understanding of them, ultimately leading to better-quality responses.


Inference

During inference, the precomputed KV cache is loaded alongside the user’s query, allowing the LLM to use this information to generate a response.

The response, ‘R’, is generated by the LLM, ‘M’, from the user query, ‘Q’, and the precomputed KV cache, ‘C(KV)’. This step removes retrieval latency and minimizes retrieval errors, since the preloaded knowledge and the query are already present together in the model’s context.


Cache Reset

The KV cache grows during inference, as new tokens are appended to the preloaded ones. To maintain performance across extended or repeated inference sessions, the system can be reset quickly by simply truncating those newly appended tokens.

Resetting the cache by truncating the new tokens, t(1) through t(k), lets the system reinitialize quickly, since the full cache does not need to be reloaded from disk. The diagram below compares the workflows of RAG (top) and CAG (bottom). In RAG, knowledge (K1, K2) is retrieved dynamically for each query (Q1, Q2) from a knowledge base, then combined with the query and used by the LLM to generate responses (A1, A2). In contrast, CAG (bottom) preloads all relevant knowledge into a Knowledge Cache. Queries (Q1, Q2) are added to this cache, and responses (A1, A2) are generated based on the preloaded knowledge.

Illustration of RAG (top) and CAG (bottom) workflows (Image obtained from the original research paper)
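The three CAG steps (preload, inference, reset) can be illustrated with a toy sketch. A real implementation would precompute transformer KV tensors; here the "cache" is just a token list, which is enough to show the bookkeeping. The class and method names are illustrative, not from the paper.

```python
# Toy illustration of the CAG lifecycle: encode documents once, answer
# queries from the combined context, then truncate back to the preloaded
# state instead of re-encoding.

class KnowledgeCache:
    def __init__(self, documents: list[str]):
        # Step 1: preload -- "encode" all documents once, up front.
        self.tokens = [tok for doc in documents for tok in doc.split()]
        self.preloaded_len = len(self.tokens)

    def infer(self, query: str) -> str:
        # Step 2: inference -- the query's tokens are appended to the cache
        # and the (stubbed) model answers from the combined context.
        self.tokens.extend(query.split())
        return f"[answer from a {len(self.tokens)}-token context]"

    def reset(self) -> None:
        # Step 3: reset -- truncate the tokens appended during inference,
        # restoring the cache without re-encoding the documents.
        del self.tokens[self.preloaded_len:]

cache = KnowledgeCache(["the first document", "the second document"])
cache.infer("what does the first document say?")
cache.reset()  # back to the preloaded state, no reload from disk needed
```

The key design point is that `reset` only deletes the query-time suffix, so the expensive document encoding is amortized across every query in the session.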


But How Good Is CAG?

To evaluate the performance of CAG, two question-answering benchmarks are considered: SQuAD, which tests answers grounded in a single passage, and HotPotQA, which requires multi-hop reasoning across documents.

For testing, three sets are created from each dataset, with varying lengths of reference text. As the reference text length increases, retrieval becomes more challenging.


Test sets were created from SQuAD and HotPotQA by varying the lengths of the reference text (image sourced from the original research paper).

Researchers used the Llama 3.1 8B Instruct model, with a context length of 128k tokens, to test both RAG and CAG. The goal for each method was to generate accurate, contextually relevant answers for the benchmark questions.

BERTScore was employed to assess the performance of these methods, based on how closely their answers matched the ground-truth answers.

RAG was implemented with LlamaIndex, utilizing two retrieval strategies:

  • Sparse Retrieval with BM25: This method ranks and returns documents based on a combination of Term Frequency-Inverse Document Frequency (TF-IDF) and document length normalization.

  • Dense Retrieval with OpenAI Indexes: Dense embeddings are created using OpenAI’s models, which represent both the query and documents in a shared semantic space, returning the most semantically aligned passages.
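The sparse strategy above can be made concrete with a short sketch of BM25 scoring: per-term IDF weighting with term-frequency saturation and document-length normalization. The defaults `k1=1.5` and `b=0.75` are common choices in the literature, not values taken from the paper, and this toy version tokenizes by whitespace only.

```python
# Minimal BM25: score each document against the query using IDF-weighted,
# saturated term frequency, normalized by document length relative to the
# corpus average.
import math

def bm25_scores(query: str, docs: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n_docs  # average document length
    scores = []
    for doc in tokenized:
        score = 0.0
        for term in query.lower().split():
            df = sum(term in d for d in tokenized)   # document frequency
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            tf = doc.count(term)                     # term frequency in this doc
            norm = k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf * (k1 + 1) / (tf + norm)
        scores.append(score)
    return scores

docs = ["cache augmented generation keeps knowledge in context",
        "retrieval augmented generation fetches documents at query time"]
scores = bm25_scores("retrieval documents", docs)  # second doc scores higher
```

Production systems (including LlamaIndex's BM25 retriever) add stemming, stop-word handling, and inverted indexes on top of this core formula.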


So, does CAG really stand out?

Surprisingly, the results show that CAG outperforms both sparse (BM25) and dense (OpenAI Indexes) RAG systems, achieving the highest BERTScore in most evaluations.


Additionally, CAG significantly cuts down generation time, especially as the reference text length grows. For the largest HotpotQA test dataset, CAG is approximately 40.5 times faster than RAG, providing an impressive speed boost!

Comparison of generation time between RAG vs. CAG (Image obtained from the original research paper)

CAG seems like a highly promising way to keep LLM responses grounded in up-to-date knowledge, either on its own or alongside RAG, especially as context lengths continue to grow. Do you think you’ll replace your RAG pipelines with it? Share your thoughts in the comments below!

Comparison with Retrieval-Augmented Generation (RAG)

The following table summarizes the differences between CAG and RAG:

  Aspect               RAG                                CAG
  Knowledge access     Retrieved at query time            Preloaded into the KV cache
  Latency              Retrieval adds delay per query     No retrieval delay
  Error sources        Irrelevant/incomplete retrieval    No retrieval errors
  Infrastructure       Retriever, index, and pipeline     A single long-context model
  Knowledge updates    Easy to refresh dynamically        Requires recomputing the cache


Advantages of CAG
  1. Reduced Latency: By preloading knowledge, CAG provides instant answers without waiting for data retrieval.

  2. Improved Accuracy: Preloaded documents minimize the risk of errors associated with retrieving irrelevant or incorrect information.

  3. Holistic Context Understanding: The model can process all relevant information in a unified context, enhancing answer consistency and accuracy.

  4. Cost Efficiency: With reduced infrastructure needs and maintenance overhead, organizations can save on operational costs while improving performance.


Practical AI Agents for the Legal Sector
Contract Analysis and Review
  • Use Case: Law firms or legal teams can benefit from CAG when analyzing large volumes of contracts, identifying key clauses, terms, and potential risks.

  • Why CAG is Better Than RAG: CAG stands out for its ability to pre-load relevant knowledge directly into the model’s context, offering faster and more accurate insights without the retrieval delays typical of RAG. This makes CAG especially useful for legal teams dealing with large datasets where time and precision are critical.

  • How CAG Helps: By understanding complex legal language, CAG efficiently processes contracts and extracts vital information like termination clauses, indemnification terms, and confidentiality agreements. The ability to load all relevant legal knowledge upfront means CAG can generate summaries and highlight areas that require attention without the latency of retrieving documents externally.

  • Example: Consider a legal team reviewing hundreds of NDAs (Non-Disclosure Agreements). With CAG, the system can quickly flag inconsistent clauses and ensure each document aligns with a standard template, all while minimizing time-consuming retrieval steps and providing a more seamless workflow.


Litigation Support and Case Summaries
  • Use Case: Lawyers and paralegals can leverage CAG to generate case summaries or historical legal precedents, significantly reducing time spent on case law research.

  • Why CAG is Better Than RAG: CAG’s ability to pre-load relevant case law into the model’s context removes the need for time-consuming document retrieval, enabling quicker and more precise access to critical information. This is particularly beneficial in litigation support, where efficiency and accuracy are paramount.

  • How CAG Helps: CAG can generate concise, accurate summaries of previous court rulings, focusing on key elements like verdicts, legal arguments, and judicial reasoning. Moreover, by analyzing a vast body of past legal decisions, CAG can identify patterns that may inform current litigation strategies, all without the latency or errors typically associated with retrieval-based methods.

  • Example: A lawyer preparing for a patent infringement case could use CAG to swiftly generate summaries of previous, relevant patent cases, significantly streamlining their research process and helping them focus on the most pertinent legal precedents for their case.


Legal Document Drafting
  • Use Case: Legal teams can utilize CAG to draft basic legal documents, like wills, powers of attorney, or legal notices, based on predefined templates and legal parameters.

  • Why CAG is Better Than RAG: CAG's ability to pre-load relevant legal templates and rules directly into the model’s context allows for the generation of compliant, template-driven documents without the need to retrieve external documents. This results in faster, more reliable outputs, especially for standardized legal work.

  • How CAG Helps: CAG can automate the drafting of documents by taking user inputs and applying them to predefined templates. It ensures that the generated content aligns with legal norms, providing consistency and reducing the potential for human error.

  • Example: A law firm could develop a chatbot powered by CAG to assist clients in generating simple legal documents like a last will and testament by simply filling out forms, without needing a full legal review for standard cases. This greatly streamlines the process while ensuring accuracy and compliance.


Compliance Monitoring and Reporting
  • Use Case: Compliance officers in law firms or businesses can use CAG to ensure that legal documents, contracts, and operational processes are in full compliance with regulations.

  • Why CAG is Better Than RAG: CAG offers the advantage of pre-loading relevant compliance frameworks, regulations, and legal checklists directly into the model's context. This eliminates the need for real-time document retrieval, providing a faster, more consistent approach to compliance checks while ensuring the model's outputs are always based on the most current regulations.

  • How CAG Helps: CAG can automatically cross-check legal documents, contracts, and operational processes against compliance standards. By analyzing documents in context with the preloaded regulations, it can identify compliance gaps and generate detailed reports, saving time and reducing the risk of errors.

  • Example: A financial institution could use CAG to automatically review loan agreements, ensuring they align with evolving regulations, and to generate reports that are ready for auditors or regulators, minimizing compliance risk and streamlining the audit process.


Mergers and Acquisitions (M&A) Due Diligence
  • Use Case: During mergers or acquisitions, CAG can be used to streamline the due diligence process by analyzing and summarizing thousands of documents, helping to identify risks or critical issues.

  • Why CAG is Better Than RAG: CAG allows the model to process and retain all relevant legal and financial documents in memory before analyzing them, which significantly speeds up the due diligence process. By preloading these documents into the context, CAG avoids the delays and potential errors associated with dynamic retrieval, ensuring faster and more accurate identification of critical risks and provisions.

  • How CAG Helps: CAG can analyze large volumes of contracts, financial statements, and other documents in the context of the deal, flagging potential liabilities, hidden risks, and non-standard clauses that might affect the merger or acquisition. It enables lawyers and business analysts to focus their attention on the most critical issues, saving time and enhancing the thoroughness of the review.

  • Example: A law firm representing a client in an M&A deal could use CAG to quickly sift through countless contracts and legal documents, providing a comprehensive summary of key legal risks, liabilities, and obligations, thus enabling the team to focus on high-priority areas for negotiation or risk mitigation.


Intellectual Property (IP) Management
  • Use Case: CAG can help legal teams track and manage intellectual property (IP) portfolios by generating reports on patents, trademarks, and copyrights, as well as monitoring renewal dates and potential infringements.

  • Why CAG is Better Than RAG: By preloading the relevant IP documents, such as patents and trademarks, into the model’s context, CAG can instantly access this critical data during analysis, eliminating the retrieval delays inherent in RAG. This enables faster and more accurate monitoring of IP status, helping legal teams stay ahead of deadlines and potential infringement issues without having to repeatedly query external knowledge sources.

  • How CAG Helps: CAG can efficiently summarize patent documents, track renewal deadlines, and flag potential conflicts or infringements. It can also generate comprehensive reports on patent landscapes and related IP, ensuring legal teams have a holistic view of their clients' IP portfolios. By pre-loading all pertinent information, it ensures the team has all the context they need, without the need for real-time information retrieval.

  • Example: An IP law firm could use CAG to monitor the status of patent filings, alerting them to any potential infringement issues or upcoming deadlines. The system could also assist in drafting new patent applications by referencing similar existing patents, streamlining the process and reducing the risk of overlooking important details.


E-Discovery and Document Review
  • Use Case: In large-scale e-discovery for litigation, CAG can assist in processing and categorizing thousands of legal documents to identify relevant evidence or privileged information.

  • Why CAG is Better Than RAG: CAG eliminates the delays associated with retrieving external data by preloading all relevant documents directly into the model’s context. This allows for real-time processing and categorization of large document sets, reducing the time spent on document retrieval and making the e-discovery process more efficient and accurate.

  • How CAG Helps: By pre-loading all pertinent legal documents into the model, CAG can quickly read, classify, and extract the context of key documents from massive datasets. This enables legal teams to identify potentially relevant files, such as evidence or privileged information, much faster, helping them focus their efforts on the most critical aspects of the case.

  • Example: A corporate lawyer involved in an antitrust investigation could use CAG to sift through thousands of emails and contracts, rapidly identifying evidence of collusion or price-fixing. This helps streamline the discovery process and ensures no important details are overlooked, ultimately saving valuable time and resources during litigation.


Legal Chatbots for Client Interaction
  • Use Case: Law firms or legal services can develop chatbots powered by CAG to interact with clients, answering routine legal questions or helping clients navigate the legal system.

  • Why CAG is Better Than RAG: CAG’s ability to preload all relevant legal knowledge directly into the model’s context ensures that chatbots can provide instant, accurate, and contextually relevant responses without the delays or potential errors associated with external retrieval. This makes CAG ideal for providing reliable legal assistance at scale.

  • How CAG Helps: By leveraging preloaded knowledge, CAG allows chatbots to generate highly accurate, context-aware responses to client inquiries. This ensures that users receive legally sound information based on the latest legal standards, making the chatbot an effective tool for assisting with everyday legal questions or guiding clients through legal processes.

  • Example: A legal service provider could deploy a chatbot powered by CAG that helps users understand basic legal concepts, such as contract terms or dispute resolution procedures, or assists them in filing claims, ensuring they receive information tailored to their specific needs based on the most up-to-date legal knowledge.


When to Choose RAG Over CAG: Considerations for Startups

While CAG certainly offers several advantages for building intelligent, context-aware agents, there are scenarios where a startup might still choose to use a Retrieval-Augmented Generation (RAG) approach over CAG. Here are a few reasons why.


Dynamic Access to External Knowledge

RAG is better suited for use cases where the knowledge base is constantly evolving or where access to real-time, external data is crucial. For instance, if a startup needs to pull information from live databases, news articles, or any frequently updated sources, RAG allows for real-time retrieval of information without having to pre-load everything into the model’s context.

Example: A legal tech startup that provides real-time legal updates or case law analysis might rely on RAG to fetch the latest legal precedents from external databases or news feeds.


Lower Initial Computational Cost

For a startup with limited resources or one that is just starting, RAG may seem like a more practical solution in the short term. Precomputing and storing large amounts of knowledge for use with CAG can be resource-intensive and may require substantial infrastructure. In contrast, RAG dynamically fetches relevant data, which can help reduce the upfront costs associated with large-scale data storage and processing.

Example: A startup focused on building a chatbot for general customer service might initially choose RAG to keep costs low, especially if they don’t yet have a vast, pre-existing knowledge base to preload into the system.


Scalability with Smaller Knowledge Bases

Startups with smaller knowledge bases or those just starting to build their data repositories may find it more feasible to use RAG initially. Preloading large amounts of knowledge into the model’s context may not be necessary at the outset, especially if the startup’s data is not yet extensive or doesn’t justify the computational investment required for CAG.

Example: A startup focused on a specific niche area of law or business might find it more effective to use RAG, as they can dynamically pull relevant documents as needed, even if their knowledge base is not yet extensive.


Easier Integration into Existing Systems

RAG might be more suitable for startups that need to integrate their AI solution into existing systems with pre-existing infrastructure, particularly when the external databases or knowledge stores are already in place. RAG allows the integration of retrieval mechanisms without having to overhaul the entire system or pre-load data into the model beforehand.

Example: A startup offering document search services or a knowledge management platform might find it easier to implement RAG alongside their current system, which already stores and manages the data.


Familiarity and Established Tools

RAG, while perhaps more complex, has been around longer than CAG and has a more established ecosystem of tools, frameworks, and expertise. Startups might opt for RAG because they are familiar with its mechanics, or they have access to existing software that makes implementing RAG easier. For example, using frameworks like LlamaIndex or OpenAI’s retrieval-based tools could be a faster path to deployment for some startups.

Example: A startup working on AI-driven customer support might have existing workflows and tools built around retrieval-based methods, making RAG a more natural fit for their needs.


Flexibility in Data Handling

RAG offers flexibility in handling diverse data sources. If a startup is working with highly heterogeneous data — such as unstructured text, images, and structured databases — RAG can retrieve the most relevant pieces of data across these sources, making it adaptable for more varied types of input.

Example: A startup building an AI assistant for managing various corporate documents, contracts, and reports may benefit from RAG’s ability to query and retrieve information across multiple formats and databases.


Faster Updates in Dynamic Environments

Startups in rapidly changing industries (like law, finance, or tech) may prefer RAG since it can instantly retrieve the latest information without the need to reload or retrain models. CAG requires that new knowledge be incorporated into the context in advance, which could be cumbersome in fast-moving environments.

Example: A startup offering a chatbot for legal advice might find RAG useful for fetching the latest regulatory changes or new case law updates, as it can dynamically pull in new data in real-time.

You can find the code implementation of CAG in this GitHub repo: https://github.com/hhhuang/CAG

References: Chan, B. J., Chen, C.-T., Cheng, J.-H., & Huang, H.-H. (2024). Don’t do RAG: When cache-augmented generation is all you need for knowledge tasks. arXiv. https://arxiv.org/abs/2412.15605v1



We’re ready to help your career grow!

We prioritize flexibility! Customize your workspace and salary package to fit your preferences. From choosing a higher gross wage to opting for a company laptop or using your own, we empower you to create the perfect package for your needs.


©Summ.link 2025. All Rights Reserved.
