Why we need to move on from RAG
RAG is great but not enough on its own. Here's what you actually need.
How we got here
Despite the ludicrous pace of research in Generative AI, the core training objective of the most popular family of models, Large Language Models (LLMs), has mostly remained the same. They are neural networks (specifically transformers, built from attention and multi-layer perceptron layers) trained to predict the next token in a sequence.
Around the time the transformer was introduced, it was completely unintuitive to everyone (researchers included) that scaling this training to trillions of tokens (sub-word units) would lead to AI assistants that can perform a seemingly infinite array of tasks, given the right instructions (or “prompts”), without any task-specific downstream training.
The problem with LLMs is that they lack any specific context. Unless you’re very famous, these models don’t know anything about you, your organization, or that “Bob from finance has the details of our latest funding round nested 13 folders deep in Confluence.” This makes their usefulness limited. Enter Retrieval Augmented Generation (RAG).
RAG attempts to fill these gaps in an LLM’s knowledge by dynamically incorporating local context into prompts, based on a search over data that you provide. This is an intentionally broad definition, because in practice RAG spans everything from a side-project hacked together in an hour, all the way to the backbone of products from companies with multi-billion dollar valuations, like Perplexity.
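In its simplest form, that loop is just “retrieve, then generate.” Here’s a minimal sketch of the idea in Python; `embed` and `complete` are stand-ins for whichever embedding model and LLM you plug in (the names are ours, not a particular library’s API):

```python
# Minimal retrieve-then-generate loop. `embed` and `complete` are placeholders
# for your embedding model and LLM of choice, not a specific vendor API.
from typing import Callable

def build_prompt(question: str, passages: list[str]) -> str:
    context = "\n\n".join(passages)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

def answer(question: str,
           documents: list[str],
           embed: Callable[[str], list[float]],
           complete: Callable[[str], str],
           top_k: int = 3) -> str:
    # Rank documents by similarity to the question, keep the closest few,
    # and hand them to the LLM inside the prompt.
    q_vec = embed(question)
    ranked = sorted(documents, key=lambda d: cosine(embed(d), q_vec), reverse=True)
    return complete(build_prompt(question, ranked[:top_k]))
```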
The Typical Life and Death of a RAG MVP
Nowadays, getting the basics of a RAG system in place is trivial. With just a few hours and your no-code tool of choice, you can have your own AI personal assistant that can answer questions unique to you or your company. At first, these simple implementations perform deceptively well (often because the creator asks biased questions, since they already know the data fairly well). It’s very easy to believe that you’re not far off from a final, production-grade solution…
Then you get the first “hallucination”. And it’s not the type of mistake a human would make, like a typo or being explicitly vague when you don’t know the exact answer to something. Instead, it is entirely incorrect information presented as convincing, authoritative fact. Uh oh.
Fine. The first iteration was never going to be perfect. You test some easier questions, and while the answers appear to be meandering in the right direction, it turns out sifting through rambling, AI-generated paragraphs that “delve” into the virtues of data security actually takes more time than simply finding that SOC2 report yourself.
The deeper you go, the more failure patterns and challenges emerge. Some common ones:
- A language model not understanding the broader context of a document (for example the relationship between subsidiaries in a company)
- Conflicting information (outdated versions of documents that haven’t been reconciled)
- Core information locked away in highly visual documents (process diagrams, flow charts, etc.)
- Vital information locked away in structured data - spreadsheets are the obvious example, but contracts also have structure (clauses and sub-clauses).
It’s worth noting the obvious here: a bad RAG system can be quite dangerous and in most cases is worse than nothing at all.
What’s Actually Needed to Get (Just) RAG into Production
Our agents have served hundreds of thousands of queries. We know that getting a trustworthy RAG deployment into production takes (at least) the following:
World-class search
The internet has conditioned us to associate natural language inputs with keyword-based search. Think about it: how often do you put a full, well-formed sentence into Google? Usually it’s just a combination of a few words you know will surface the result. This is often confusing for a language model, which depends on precise pieces of information to generate clear, concise answers. It usually becomes essential to combine semantic search, keyword search, and reranking.
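As a rough illustration of how those signals can be combined, here’s reciprocal rank fusion over a keyword ranking and a semantic ranking. The document IDs and rankings are invented for the example; in a real system they’d come from something like BM25 and a vector index, with a cross-encoder reranker applied on top:

```python
# Reciprocal rank fusion (RRF): merge several ranked lists by summing 1/(k + rank).
# The rankings below are made up for illustration.

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

keyword_ranking = ["soc2_report", "security_faq", "privacy_policy"]   # e.g. from BM25
semantic_ranking = ["security_faq", "soc2_report", "dpa_template"]    # e.g. from a vector index

candidates = reciprocal_rank_fusion([keyword_ranking, semantic_ranking])
print(candidates)  # a cross-encoder reranker would re-score these before generation
```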
Hallucination detection
Language models have a tendency to produce plausible but incorrect answers. They will also “read between the lines” and interpret your content in ways you didn’t anticipate. Simply put, 99% accuracy isn’t good enough: getting a single digit of a bank account number wrong could be catastrophic.
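One cheap guardrail (a sketch, not a complete solution) is to flag any specific value in the answer that never appears in the retrieved sources. Real deployments usually layer LLM- or NLI-based claim checking on top of simple checks like this:

```python
# Flag numbers in the generated answer that don't appear anywhere in the context.
# The account and sort code values below are invented for illustration.
import re

NUMBER = r"\d[\d,.\-]*\d|\d"  # digits, optionally joined by commas, dots, or hyphens

def unsupported_numbers(answer: str, context: str) -> list[str]:
    return sorted(set(re.findall(NUMBER, answer)) - set(re.findall(NUMBER, context)))

context = "Wire transfers go to account 1234-5678, sort code 40-11-22."
answer = "Please send the funds to account 1234-5679, sort code 40-11-22."

flagged = unsupported_numbers(answer, context)
if flagged:
    print(f"Answer contains values not found in the sources: {flagged}")
```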
Intelligent document processing
It’s table stakes for your RAG system to be able to interpret everything from PDFs to spreadsheets. Since text is the primary input for these models, you need inventive ways to represent the structure of different document types.
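As one small example (an illustration, not a prescription), a spreadsheet can be rendered so that every row repeats its column headers, which means a chunk that lands in the middle of the table still carries the structure it came from:

```python
# Render each CSV row as a self-contained "header: value" line so that
# tabular structure survives chunking. The contract data is invented.
import csv
import io

raw = """contract,counterparty,renewal_date,annual_value
MSA-014,Acme GmbH,2025-03-31,120000
MSA-019,Globex Ltd,2024-11-15,80000
"""

def rows_as_text(csv_text: str) -> list[str]:
    reader = csv.DictReader(io.StringIO(csv_text))
    return ["; ".join(f"{column}: {value}" for column, value in row.items())
            for row in reader]

for line in rows_as_text(raw):
    print(line)
# contract: MSA-014; counterparty: Acme GmbH; renewal_date: 2025-03-31; annual_value: 120000
```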
A systemic approach to evaluating change
The input and output space of a RAG tool is, of course, infinite. Traditional approaches to monitoring performance fall flat very quickly. It’s extremely difficult to know whether any change you’ve made has led to an overall improvement (this is the entire focus of many start-ups). But without investing here, you’ll have no way to meaningfully iterate beyond the earliest stages, or even to know whether your tool is dangerous.
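Even a small regression harness is far better than flying blind. The sketch below assumes a hypothetical `rag_answer` callable and a hand-built golden set of questions, each with facts the answer must contain; running it before and after every change at least tells you whether you moved forwards or backwards:

```python
# Tiny regression harness: score a RAG system against a golden question set.
# `rag_answer` is a hypothetical callable; the golden entries are illustrative.
from typing import Callable

GOLDEN_SET = [
    {"question": "When was our SOC2 report issued?", "must_contain": ["2024"]},
    {"question": "Who is our data protection officer?", "must_contain": ["Jane", "Doe"]},
]

def evaluate(rag_answer: Callable[[str], str]) -> float:
    passed = 0
    for case in GOLDEN_SET:
        answer = rag_answer(case["question"]).lower()
        if all(fact.lower() in answer for fact in case["must_contain"]):
            passed += 1
    return passed / len(GOLDEN_SET)

# score_before = evaluate(current_system)
# score_after = evaluate(candidate_system)   # only ship if it doesn't regress
```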
Moving on
At Flank we’re building expert AI colleagues that instantly resolve bottlenecks for commercial teams. And while the set of tools that a RAG formulation gives us is important for a lot of tasks, on its own it’s not sufficient to reach the level of intelligence and reliability we’re aiming for. A truly helpful colleague doesn’t simply look at the immediate data they’re presented with and come up with an answer; they understand the broader context, continuously learn, adapt to new information, leverage available software, and work collaboratively with their human counterparts to form proactive solutions.
More importantly, RAG is heavily dependent on well-crafted, up-to-date documentation. But in fast-growing teams, workflows are dynamic and change is constant. There simply isn’t time to rigorously document every process so that an AI tool can understand and, hopefully, help (at some distant point in the future). Working with AI that’s embedded in your team, learns from conversational interactions, provides transparent explanations, and is easy to steer is what unlocks its true potential, and it is the zero-to-one moment you might have been expecting from RAG.
So the irony is, the outcomes you expect from RAG aren’t possible with RAG alone.
If you've read this and are wondering 'What even is RAG, tokens, embedding, etc?!' - here's a super simple guide to LLMs for you.
For more information about Flank, you can speak to us here.