Boost AI accuracy for developers! Split leverages Retrieval Augmented Generation (RAG) to provide developers with highly accurate and relevant AI features by incorporating domain-specific information.
Retrieval Augmented Generation (RAG) is a technique used in Generative AI to be able to enhance a generative AI model with text retrieved from an additional source of information. This is typically used in situations where there is a domain specific information resource that is needed to augment the generative AI’s model, such as a company knowledge base or integrating data from proprietary APIs or data sets in order to enhance the accuracy and value provided by the AI.
The key components of this are the retrieval system and the generative model. They need to work together in order to successfully generate the augmented result from the query entered by the user.
At a very high level, you can think of RAG working as a three step process. The first step is when the user inputs their query to the Generative AI system using RAG. After that, the retrieval system then needs to process this information to fetch the relevant information from the knowledge base or data set that it is supposed to use as ‘the truth’ for use in the Generative model. Then, after that, the generative model takes the fetched information and builds it into an answer.
OpenAI offers the ability to tell the GPT model to use custom functions that allow it to retrieve data on the fly. These would be used for enriching requests when the additional data to integrate with the GPT comes from private API data, as that is not necessarily static over long periods of time.
What is currently popular for storing and retrieving text similarity is to use vector databases for this purpose. Chunks of textual information are vectorized and then stored in a vector database. The vectorization process for documents used in a RAG pipeline is often referred to as creating ‘embeddings’ for them. This is the style of RAG we will be demonstrating in this article
Integrating information from the knowledge base or other data set has benefits for generative models. It allows improved accuracy and contextual relevance that would not be possible for a generative model to build with its training data alone. This is a quite common use case in the enterprise when you need to reduce or eliminate AI hallucinations and base answers on an existing, limited corpus of factual information.
Here’s an example of how a RAG might work. In this example we have an application that allows users to ask questions about great books in history. In this specific example we’ll source from Mark Twain’s ‘The Adventures of Tom Sawyer’
Python:
Save this to a file called rag.py.
To run this, you will need to install the Python libraries required
And when you run this, you will get the generated Text below:
If you’re not familiar with “The Adventures of Tom Sawyer” this is an exact line from the story.
The boys forgot all their fears, all their miseries in an instant. Withgloating eyes they watched every movement. Luck!—the splendor of it wasbeyond all imagination! Six hundred dollars was money enough to makehalf a dozen boys rich!
So our RAG was able to successfully pull this data out of the story.
There are some great tutorials and blog posts on RAG in general. We won’t duplicate their effort. These include from OpenAI themselves and also FreeCodeCamp. Feel free to reference these if you get lost as we move through this tutorial at any point.
Let’s examine the code above section by section first to make sure we have a fairly straightforward understanding of what it does.
The first section here is where we download our textual corpus and split it into paragraphs. There are multiple ways to chunk text for retrieval. If you have many shorter documents it may be fine to just ingest them as they are. However in our case we have one really long document we are going to split it into paragraphs. If your source material is already segregated by topical sections or topic-based articles then you’re already ahead of the game. Splitting by sentences generally is too granular and will likely miss context.
Python:
In this next code section we are going to set up the OpenAI Python client as well as create a function to get the embeddings for each paragraph of text. For generating the embeddings we will be using OpenAI’s API and using the text-embedding-ada-002
model. Embeddings are retrieved in parallel to speed up processing.
Python:
The next code here will store the vectorizations (‘embeddings’) into a local vector database. We will be using Facebook’s Open Source faiss library. At enterprise scale you may be using cloud or hosted vector databases, but the logic is essentially the same.
Here, to avoid having to create the embeddings multiple times we will first check to see if the embeddings file first exists. If it does we will load it and move on to the next part of the process.
If it doesn’t exist, we will download “The Adventures of Tom Sawyer” from Project Gutenberg, split it into paragraphs, generate an embedding from each paragraph using the OpenAI Embeddings API, and then save it to the faiss local vector database. The index that is created is a ‘L2’ distance index, also referred to as the euclidean distance.
Python:
This next function uses OpenAI to create embeddings for the query text, and then search the vector database for the top ‘k’ paragraphs of text that are closest to the embeddings for the query using the euclidean distance index we created previously. You can think of it as the embedding of a paragraph being a point in multidimensional space, and the index is finding the top ‘k’ paragraphs closest to the multi-dimensional point represented by the embedding of the query text. The ‘k’ here defaults to 5 paragraphs.
Python:
Lastly, we use this code to generate the response for the query. We will gather the top paragraphs according to the embeddings and use them as the context for the question. Once we have built the prompt that includes the query and the context we have pulled from the vector database, we send that prompt to OpenAI’s GPT-4 model to return us a response.
Python:
For the example we use here - the prompt generated to the GPT is as follows:
Use the following Context to answer the question:The boys forgot all their fears, all their miseries in an instant. Withgloating eyes they watched every movement. Luck!—the splendor of it wasbeyond all imagination! Six hundred dollars was money enough to makehalf a dozen boys rich! Here was treasure-hunting under the happiestauspices—there would not be any bothersome uncertainty as to where todig. They nudged each other every moment—eloquent nudges and easilyunderstood, for they simply meant—“Oh, but ain’t you glad _now_ we’rehere!”
The Widow Douglas put Huck’s money out at six per cent., and JudgeThatcher did the same with Tom’s at Aunt Polly’s request. Each lad hadan income, now, that was simply prodigious—a dollar for every weekday inthe year and half of the Sundays. It was just what the minister got—no,it was what he was promised—he generally couldn’t collect it. A dollarand a quarter a week would board, lodge, and school a boy in those oldsimple days—and clothe him and wash him, too, for that matter.
But Tom’s energy did not last. He began to think of the fun he hadplanned for this day, and his sorrows multiplied. Soon the free boyswould come tripping along on all sorts of delicious expeditions, andthey would make a world of fun of him for having to work—the verythought of it burnt him like fire. He got out his worldly wealth andexamined it—bits of toys, marbles, and trash; enough to buy an exchangeof _work_, maybe, but not half enough to buy so much as half an hourof pure freedom. So he returned his straitened means to his pocket, andgave up the idea of trying to buy the boys. At this dark and hopelessmoment an inspiration burst upon him! Nothing less than a great,magnificent inspiration.
“I judged so; the boys in this town will take more trouble and fool awaymore time hunting up six bits’ worth of old iron to sell to the foundrythan they would to make twice the money at regular work. But that’shuman nature—hurry along, hurry along!”
The two men examined the handful of coins. They were gold. The boysabove were as excited as themselves, and as delighted.
Question: How much money to make half a dozen boys rich?
Answer:
You can see the paragraphs pulled out of the vector database, including the first, most relevant one that holds the answer to our question.
Now that we have a RAG pipeline that we understand, we may want to experiment a bit with some of the parameters to see if we get any appreciable changes to our app metrics. This is where Split comes in. With our powerful calculation engine and easy to use SDKs we can do a bit of experimentation here, changing parts of the code using feature flags.
Let’s first determine which parts of the code to make changes to.
A couple of places immediately come to mind.
First, we can adjust the number of paragraphs of relevant data to pull out of the vector database. Pulling more data adds more context but may muddle the actual responses by providing too much irrelevant information. Pulling too little data may make it not as useful
Second, let’s also adjust the embeddings model. Some embedding models are more expensive than others, and we may want to see if our users’ experiences are degraded by using a cheaper model. Similarly, we can also test with a less expensive GPT model to generate the resulting response.
Let’s start by creating the feature flags in Split. We will call them useMoreTextChunks, embeddingsModel
, and GPTModel
.
We will go through the example for the useMoreTextChunks
flag and then you can follow these steps for the other two flags.
First, log in to Split and select your workspace. Click on Create feature flag
.
Then you will see this modal pop up - enter the name of the flag and put in an option tag and restrictions. Make sure to select the proper traffic type. In this example we are going to use user
with the assumption that our Python code has access to the userID of the user who is interacting with the chatbot, and that this is a feature for logged in users only.
Click Create
to create the feature flag.
The last step we need to do is to initialize the flag in our test environment.
Click on Initiate Environment
and then save your changes to make the feature flag rules active.
Complete the above steps to also create flags named embeddingsModel
and GPTModel
Next, let’s go into the flags to create the targeting rules. We will use Split’s Dynamic Configuration in order to be able to experiment with these values without deploying new code. We will store the k_chunks value and model names for both the embeddings and the GPT in Split using this feature.
Let’s start with the useMoreTextChunks flag. Open up the flag’s treatments by clicking on where it says Treatments.
Now let’s add the dynamic configuration. Click on the Select Format box and select Key-value pairs. Then for each treatment (on, and off) click Add new pair and enter k_chunks with a value of 15for on and a value of 5 for off (our default).
It should look like this:
Now save your changes to the feature flag.
Next, let’s make similar changes to the other flags.
For the GPTModel
flag let off be the current gpt-4 model, and let’s have the new (on) treatment be gpt-4o - let’s test to see if 4o has better responses for our questions that make it worth being a bit more expensive.
Save your changes.
Now let’s update the embeddingsModel
feature flag to see if we get better embedding results from the text-embedding-3-large
model, or if our existing, less expensive model – ada-002 – doesn’t perform appreciably worse for our metrics.
Make sure to save the changes to the feature flag.
Now, we will adjust our code to include these flags, including the Split Python SDK.
Let’s first install the Split SDK in our project.
Now let’s go into the code and instantiate the Split SDK. Notice that in addition to importing the Split libraries we are also importing json
in order to allow us to read the dynamic configuration key-value pairs we just stored into the feature flags above. This also requires that we put our Split SDK key into an environment variable called SPLIT_SDK_KEY
.
Python:
Now that the SDK is ready, let’s adjust the code to make this work. First, let’s adjust the code for the ‘top k embeddings’ to use the value stored in the feature flag.
Let’s adjust the function where we call our get_top_k_paragraphs
function to include using the Split SDK and reading from the dynamic configuration to retrieve the ‘k’ value. As always is best practice, we should ensure we handle the case that the SDK cannot connect for some reason, and as such we will program defensively and support the case where config
is an empty object.
Python:
Now that we’re here in this function, this is also where we’d change the GPT model used to generate the response. Here we can further adjust the code in this function to use what we have configured in the feature flag.
Python:
Now let’s do something a bit more complicated in changing our embeddings model. So one thing to notice is that we saved the embeddings to a file. If we were to re-run the embeddings we would have different embeddings due to a potentially different embeddings model. To get around this we are going to prepend the model name to the embeddings file in order to make sure that we can still cache and avoid having to create the same embeddings over and over again.
Python:
Now we also need to update the get_embeddings function and the get_top_k_paragraphs functions to read in the embeddings model we just defined.
And for get_embeddings
Here’s what the final code looks like with the feature flags integrated into it.
Python:
Now we run it with our current setup, since all of our defaults are the values that existed before, we run it and we’ll get the same response.
Let’s play around with the feature flags to see if we get a different response. If we turn GPTModel
to on - let’s see if we get a different response from GPT-4o:
(make sure to save the flag change!)
Looks like here we get some more context back using the new model.
Cool - now let’s turn the embeddingsModel
flag on as well and see if that changes anything to use the more expensive embeddings model:
Seems like for this question the embeddings model didn’t have any effect.
Let’s change the question and see - maybe having a straightforward question doesn’t get strongly affected by these things.
If we change this line:
We get this answer:
Now let’s turn embeddingsModel
off and see if there is a difference here.
And we do get a different answer:
Keep playing around and change the prompt and the flags to see if you get any different responses - you’ll see how altering these configuration parameters can have a strong effect on the text produced by the GPT model without needing to change any of the code.
When hooking this up to a production workload with real user IDs - you may also want to explore percentage based rollouts in order to do the math to really measure the feature impact.
Using the flags we just enabled you can see the power of using Split in concert with its dynamic configuration to be able to control the RAG pipeline with just a click of a button and a few keystrokes in the Split Web Console. Go further from here and create metrics, set up percentage based rollouts and send events to Split. With our IFID Capability we can handle measurement for any number of concurrent experiments and pinpoint the winners and the losers, alerting you when they’ve hit significance.
We can’t wait to see what you’re capable of with Split and Generative AI.
The Split Feature Data Platform™ gives you the confidence to move fast without breaking things. Set up feature flags and safely deploy to production, controlling who sees which features and when. Connect every flag to contextual data, so you can know if your features are making things better or worse and act without hesitation. Effortlessly conduct feature experiments like A/B tests without slowing down. Whether you’re looking to increase your releases, to decrease your MTTR, or to ignite your dev team without burning them out–Split is both a feature management platform and partnership to revolutionize the way the work gets done. Switch on a free account today or Schedule a demo to learn more.
Split Arcade includes product explainer videos, clickable product tutorials, manipulatable code examples, and interactive challenges.