Ship Better RAG Systems to Production with MongoDB & Freeplay

Apr 4, 2024

At this point anyone building AI applications is likely familiar with the importance of Retrieval Augmented Generation (RAG). And they’re also likely familiar with the challenges of getting a reliable RAG system shipped to production.

It can be daunting! Especially in an enterprise setting where the stakes are high for both quality and compliance. 

That’s why we’re excited to announce our partnership with MongoDB, and how we’re working together to help production software teams test & tune RAG features that work well at scale. 

This post goes into detail on how to set up and use Freeplay and MongoDB together to find and fix issues in RAG systems. The video at the end shows how it all comes together in practice.

The Challenge

Getting RAG systems to perform well today often means a lot of experimentation, tuning, and optimization. This can be complicated by the fact that there are so many moving pieces that impact the ultimate customer experience for AI products, in particular:

  • Retrieval: Is your RAG pipeline finding the right content? If it’s not, what do you change? Chunk sizes, embedding strategy, ranking, etc. can all be adjusted, not to mention more advanced techniques from research.

  • Generation: Is the LLM doing what you want? If not, is it because of your prompt engineering? The model itself? Model parameters that you could adjust? Any tweak to one of those variables can have surprising impacts — positive or negative.

We hear from teams nearly every day who feel overwhelmed by all the variables involved, and more than a few who get so exhausted from all the experimentation that eventually they just decide to ship something — even if they know it’s not as good as they’d want.

And then: Even after you get a basic version working, you’re never “done.” Models change, underlying data changes, and results change for customers.

If you're building AI products, you need a system and process in place to help continuously uncover and fix issues. Everyone working in this space is looking for a better way.

What is Freeplay?

Freeplay gives product development teams the power to experiment, test, monitor & evaluate the use of LLMs in their products. Product teams use Freeplay to build better chatbots, agents, and to integrate LLMs into existing products. They choose Freeplay because it gives them a single tool to collaborate as a team across the product development lifecycle — and saves time & money.

Why we love MongoDB

MongoDB is a non-relational document database designed for large scale applications. It’s been around for years and it’s well-loved by developers. 

For serious production generative AI use cases, it stands out because it’s already optimized for scale and high throughput/low latency workloads, and it’s easy to adopt for existing MongoDB customers. MongoDB Atlas is a developer data platform that provides a seamless integration between existing operational databases and vector search, enabling developers to use a common vendor and familiar interface for both functions. 

When our team first tried out another open source vector database, it took a couple of hours just to get familiar with writing queries. With MongoDB, it took about two minutes. 🔥


Using Freeplay & MongoDB to Ship Great RAG Systems

In this post we’ll show how developers, product managers, domain experts and other collaborators work together to optimize a RAG system:

  • Quickly experiment with new prompts, models and RAG system changes

  • Run batch tests both in the Freeplay app and in code to test the parts of a RAG system

  • Monitor & evaluate RAG retrievals and LLM completions across environments in real time to detect quality changes

  • Review & curate datasets from observed data to continuously track & fix issues

This set of steps supports a continuous product optimization loop, and it’s core to how our customers use Freeplay.

Getting Set Up

For the purposes of this demo we are going to be building a RAG chat system over all our Freeplay blog posts and documentation. Here’s what the application architecture will look like.

Step 1: Load content into MongoDB and Create a New Embedding Index

The beauty of MongoDB is being able to store your operational data alongside your embedding index, as opposed to having to manage those two things across separate platforms. Their docs go into greater detail:

  • First: Create an Atlas Vector Search index on your collection (a sketch of the index definition follows the loading snippet below).

  • Then: Fetch, chunk, embed and load data into your Mongo database. A sample code snippet is below.

from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi
from llama_index.readers.web import WholeSiteReader
from llama_index.core.node_parser import SentenceSplitter
from dotenv import load_dotenv
import os
import random
import json
from helperFuncs import embed_text, hash_text

load_dotenv("../.env")
# configure mongo credentials
mongo_user = os.getenv("MONGO_USER")
mongo_pass = os.getenv("MONGO_PASSWORD")
uri = f'mongodb+srv://{mongo_user}:{mongo_pass}@freeplay-content.izuvhvj.mongodb.net/?retryWrites=true&w=majority'
mongo_db_name = "freeplay-chat"
mongo_collection_name = "doc-chunks"

# create a new mongo client
mongoClient = MongoClient(uri, server_api=ServerApi('1'))

# Send a ping to confirm a successful connection
try:
    mongoClient.admin.command('ping')
    print("Pinged your deployment. You successfully connected to MongoDB!")
except Exception as e:
    print(e)


### load url content and chunk it ###
urls = [
    "https://docs.freeplay.ai/docs", # all the docs content
    "https://freeplay.ai/blog" # all the blog content
]

all_docs = []

for url in urls:
    loader = WholeSiteReader(prefix=url, max_depth=10)
    docs = loader.load_data(url)
    # loop over each doc and add to docs dict with hash of key as doc id added to metadata
    for doc in docs:
        doc_id = hash_text(doc.text)
        doc.extra_info.update({"doc_id": doc_id})
        all_docs.append(doc)

# view 1 random doc
print(random.choice(all_docs))
print("\n\n")

# add each chunk to the db
collection = mongoClient[mongo_db_name][mongo_collection_name]
for doc in all_docs:
    # add a tag for blogs or docs
    if "blog" in doc.extra_info['URL']:
        tag = "blog"
    else:
        tag = "docs"
    base_payload = {
        'doc_id': doc.extra_info['doc_id'],
        'link': doc.extra_info['URL'],
        'tag': tag,
        'number_times_relevant': 0
    }
    # split the text at paragraph level
    splitter = SentenceSplitter(chunk_size=1000)
    chunks = splitter.split_text(doc.text)
    for chunk in chunks:
        payload = base_payload.copy()
        payload['text'] = chunk
        payload['embedding'] = embed_text(chunk)
        print(json.dumps(payload, indent=2))
        collection.insert_one(payload)
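
The retrieval code later in this post assumes an Atlas Vector Search index named vector_index over the embedding field, using cosine similarity and a filter on the tag field. As a reference point, here's a rough sketch of what that index definition could look like. The 1536-dimension count is an assumption based on OpenAI's embedding models, and creating the index programmatically requires a recent PyMongo version (you can also paste the same definition into the Atlas UI).

from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi
from pymongo.operations import SearchIndexModel
from dotenv import load_dotenv
import os

load_dotenv("../.env")
mongo_user = os.getenv("MONGO_USER")
mongo_pass = os.getenv("MONGO_PASSWORD")
uri = f'mongodb+srv://{mongo_user}:{mongo_pass}@freeplay-content.izuvhvj.mongodb.net/?retryWrites=true&w=majority'

mongoClient = MongoClient(uri, server_api=ServerApi('1'))
collection = mongoClient["freeplay-chat"]["doc-chunks"]

# index definition: a vector field for the chunk embeddings plus a filter field on the tag
index_definition = {
    "fields": [
        {
            "type": "vector",
            "path": "embedding",
            "numDimensions": 1536,   # assumption: matches the OpenAI embedding size used above
            "similarity": "cosine"
        },
        {
            "type": "filter",
            "path": "tag"            # enables the "blog" vs "docs" pre-filter used later
        }
    ]
}

collection.create_search_index(
    SearchIndexModel(definition=index_definition, name="vector_index", type="vectorSearch")
)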

Step 2: Configure your Prompt and Evaluation Criteria in Freeplay

See our guide here for configuring a project in Freeplay, from Prompt Management to Evaluations.

Screenshot of prompt versions & evaluations configured in Freeplay.
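
Prompts (and their evaluation criteria) live in the Freeplay app rather than in code. For context, the "rag-qa" template referenced throughout this post might look something like the sketch below; the mustache-style placeholders are assumptions that line up with the prompt variables passed from code in the next step.

# Hypothetical content for the "rag-qa" prompt template, managed in the Freeplay app rather than in code.
# The placeholders correspond to the prompt variables ("question", "supporting_information") built later.
rag_qa_messages = [
    {
        "role": "system",
        "content": (
            "You are a helpful assistant that answers questions about Freeplay.\n"
            "Answer using ONLY the supporting information provided below. "
            "If the answer isn't in the supporting information, say you don't know.\n\n"
            "Supporting information:\n{{supporting_information}}"
        )
    },
    {
        "role": "user",
        "content": "{{question}}"
    }
]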

Step 3: Integrate your Application Code with Freeplay and MongoDB

Freeplay and MongoDB both offer SDKs in a number of languages. In this case we'll be using Python. Freeplay also has full SDK support for TypeScript and the JVM, as well as an API for customers using other languages.

MongoDB Integration

MongoDB’s SDK for vector search follows the same patterns as their traditional document store operations, making it very easy to pick up.

There are a number of knobs to tune on your Retrieval pipeline. They fall into two major buckets:

Chunking and Embedding Strategies – How you chunk and embed content

  • In this example, we’ve opted for a chunking strategy that roughly breaks documents into paragraphs.

  • We’re using OpenAI’s embedding model to generate embeddings.

Retrieval Parameters – Parameters that directly impact vector search results

  • Top K controls how many documents are returned during vector search. 

  • Cosine Similarity Threshold controls how similar a document embedding has to be to the query embedding to be included in the results.

We’ve started with some reasonable default values for each of these, but this is a big lever to pull when tuning your RAG system. Chunking strategies, different embedding models, and search parameters can yield markedly different results. It’s important to track these things with your experiments, which can be done by logging Custom Metadata in Freeplay. An example is below.

from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi
from dotenv import load_dotenv
import os
import json
from helperFuncs import embed_text
from pydantic import BaseModel

load_dotenv("../.env")
# configure mongo credentials
mongo_user = os.getenv("MONGO_USER")
mongo_pass = os.getenv("MONGO_PASSWORD")
uri = f'mongodb+srv://{mongo_user}:{mongo_pass}@freeplay-content.izuvhvj.mongodb.net/?retryWrites=true&w=majority'
mongo_db_name = "freeplay-chat"
mongo_collection_name = "doc-chunks"

# create a new mongo client
mongoClient = MongoClient(uri, server_api=ServerApi('1'))
collection = mongoClient[mongo_db_name][mongo_collection_name]

class SearchChunk(BaseModel):
    title: str
    link: str
    content: str


def vector_search(text, top_k=5, cosine_threshold=0.75, tag=["docs", "blog"], title=None):
    # vectorsearch config
    vector_search_config = {
                'index': 'vector_index',
                'path': 'embedding',
                'queryVector': embed_text(text),
                'numCandidates': top_k * 10, # consider 10 candidates for each returned result, as suggested in the MongoDB docs
                'limit': top_k
    }
    if tag:
        pre_filter = {'tag': {'$in': tag}}
        vector_search_config['filter'] = pre_filter
    elif title:
        pre_filter = {'title': {'$in': [title]}}
        vector_search_config['filter'] = pre_filter
    else:
        pre_filter = None

    print("Running semantic search with the following config: ")
    print("numCandidates: ", vector_search_config['numCandidates'])
    print("limit: ", vector_search_config['limit'])
    print("filter: ", json.dumps(pre_filter, indent=2))
    print("\n\n")
    # create the search query
    pipeline = [
        {
            '$vectorSearch': vector_search_config
        },
        {
            '$project': {
                '_id': 1,
                'doc_id': 1,
                'title': 1,
                'link': 1,
                'description': 1,
                'text': 1,
                'tag': 1,
                'embedding': 1,
                'number_times_relevant': 1,
                'sim_score': {
                    '$meta': 'vectorSearchScore'
                }
            }
        }
    ]
    # run the query
    result = collection.aggregate(pipeline)
    # build the return set
    return_set = []
    for record in result:
        if record['sim_score'] < cosine_threshold:
            continue
        else:
            # fall back to the link for the title, since the ingested chunks may not store one
            return_set.append(SearchChunk(title=record.get('title', record['link']),
                                          link=record['link'],
                                          content=record['text']))
            # increment the relevance counter
            collection.update_one({'_id': record['_id']}, {'$inc': {'number_times_relevant': 1}})
    return return_set, pre_filter

Freeplay Integration

Freeplay offers a flexible, lightweight SDK that is optimized for developer control — keeping core application code fully in your hands without wrappers or proxies. We’ve focused on making it easy to integrate with existing projects, without having to significantly re-work code. 

Simply fetch and format your prompt/model configuration from Freeplay, make your LLM call directly with your API key, and record the results back to Freeplay. Here's another example.

from freeplay.thin import Freeplay, RecordPayload, CallInfo, ResponseInfo, SessionInfo
import openai
from vectorSearch import vector_search
from dotenv import load_dotenv
import os
import time
import json

load_dotenv("./.env")
freeplay_key = os.getenv("FREEPLAY_KEY")
freeplay_url = os.getenv("FREEPLAY_URL")
freeplay_project_id = os.getenv("FREEPLAY_PROJECT_ID")
openai_key = os.getenv("OPENAI_API_KEY")
openai.api_key = openai_key

# configure top k
top_k = 5
cosine_threshold = 0.75

# create a new freeplay client
fpClient = Freeplay(
    freeplay_api_key=freeplay_key,
    api_base=freeplay_url
)


def getCompletion(message, session, tag=["docs", "blog"], title=None):
    # start timer for logging latency of the full chain
    start = time.time()
    # run semantic search
    search_res, filter = vector_search(message, top_k=top_k,
                               cosine_threshold=cosine_threshold,
                               tag=tag, title=title)
    
    # get the formatted prompt
    prompt_vars = {
        "question": message,
        "supporting_information": str(search_res)
    }

    # get prompt template
    prompt_template = fpClient.prompts.get(
        project_id=freeplay_project_id,
        template_name="rag-qa",
        environment="prod"
    )
    print(prompt_template.messages)
    formatted_prompt = fpClient.prompts.get_formatted(
        project_id=freeplay_project_id,
        template_name="rag-qa",
        environment="prod",
        variables=prompt_vars
    )
    print(json.dumps(prompt_vars, indent=2))

    # run the completion
    chat_completion = openai.chat.completions.create(
        model=formatted_prompt.prompt_info.model,
        messages = formatted_prompt.messages,
        **formatted_prompt.prompt_info.model_parameters
    )
    # log latency
    end = time.time()

    # update messages
    messages = formatted_prompt.all_messages(
        {'role': chat_completion.choices[0].message.role,
         'content': chat_completion.choices[0].message.content}
    )

    # build the record payload
    record_payload = RecordPayload(
        all_messages=messages,
        inputs=prompt_vars,
        session_info=session,
        prompt_info=formatted_prompt.prompt_info,
        call_info=CallInfo.from_prompt_info(formatted_prompt.prompt_info, start_time=start, end_time=end),
        response_info=ResponseInfo(
            is_complete=chat_completion.choices[0].finish_reason == "stop"
        )
    )
    # record the call
    completion_log = fpClient.recordings.create(record_payload)
    completion_id = completion_log.completion_id

    # record the filter used
    fpClient.customer_feedback.update(
        completion_id=completion_id,
        feedback={"search_filter": json.dumps(filter)}
    )

    return chat_completion.choices[0].message.content, completion_id

Evaluate, Monitor, and Optimize 

Once integrated, all the LLM interactions and RAG retrievals in your product will be recorded to Freeplay. You can record single completions, chains of completions, multi-turn chat, etc. as "Sessions" in Freeplay.
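
For example, using the getCompletion helper from the integration code above, a multi-turn chat can share a single Session so every turn shows up grouped together in Freeplay. A minimal sketch (the questions are just illustrative):

# Sketch: record multiple chat turns under one Freeplay Session.
# getCompletion and fpClient come from the integration example earlier in this post.
session = fpClient.sessions.create()

answer, completion_id = getCompletion(
    "How do I manage prompt templates in Freeplay?", session, tag=["docs"])
follow_up, follow_up_id = getCompletion(
    "Can I run a batch test against a saved dataset?", session, tag=["docs"])

# Both completions are recorded against the same Session and can be reviewed together.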

As Sessions get recorded, you can label them with human or model-graded evals (both using the same criteria), curate them into datasets for batch testing, and pull them into the playground to iterate on prompts with real data. 

For a RAG system, it’s particularly helpful to create RAG-specific model-graded evals (like Context Relevance or Answer Faithfulness) that help identify issues with retrieval or hallucination. Those evals can then run on a sample of live production data as well, so you can have a pulse on how your LLM systems are performing in production — not just on static test sets.

Snapshot of Freeplay monitoring dashboard.

Together this system gives you the ability to quickly spot issues and fix them. Two examples for our RAG chatbot are below.

1. Uncovering & Fixing Poor Document Retrieval

As mentioned above, there are a lot of knobs to tune in a retrieval pipeline. Let’s say we’re noticing a lot of ‘Context Relevance’ scores below 3, meaning there’s irrelevant content in our RAG results. We dig into those flagged Sessions and find a number of extraneous documents in our retrievals. We might address this by increasing our cosine similarity threshold, effectively raising the bar for relevance.
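
For instance, the candidate change might be as small as raising the threshold passed to the vector_search helper defined earlier (the query here is just an illustrative example):

# Candidate retrieval change: raise the similarity bar, then benchmark it properly.
search_res, search_filter = vector_search(
    "How do I create a test run in Freeplay?",  # illustrative query
    top_k=5,
    cosine_threshold=0.90,  # raised from the 0.75 default to drop marginally related chunks
    tag=["docs"]
)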

If we make a change to our pipeline like adjusting the cosine similarity threshold, we’d want to test that change against a benchmark dataset and compare the results side by side. Freeplay’s batch testing feature allows you to test against your datasets to benchmark and compare the performance of your pipeline. These tests can be initiated either from the UI or from the SDK; running them from code is essential when the change you’re testing lives in your RAG pipeline rather than in the prompt alone.

Screenshot of a completed test result in Freeplay showing Claude 3 Sonnet's improved performance.

Below is a code sample that shows what it looks like to create a Test Run using the Freeplay SDK.

from freeplay.thin import Freeplay, RecordPayload, CallInfo, ResponseInfo, SessionInfo, TestRunInfo
from openai import OpenAI
from vectorSearch import vector_search
from dotenv import load_dotenv
import os
import time
import json

load_dotenv("./.env")
freeplay_key = os.getenv("FREEPLAY_KEY")
freeplay_url = os.getenv("FREEPLAY_URL")
freeplay_project_id = os.getenv("FREEPLAY_PROJECT_ID")
openai_key = os.getenv("OPENAI_API_KEY")

fpClient = Freeplay(
    freeplay_api_key=freeplay_key,
    api_base=freeplay_url,
)

# create a new test run
test_run = fpClient.test_runs.create(project_id=freeplay_project_id, testlist="rag-tests")

# get the prompt associated with the test run
template_prompt = fpClient.prompts.get(project_id=freeplay_project_id,
                                       template_name="rag-qa",
                                       environment="prod"
                                       )

# run each test case in the test list
for test_case in test_run.test_cases:
    # run rag on the question
    question = test_case.variables['question']
    search_res, filter = vector_search(question, top_k=5, cosine_threshold=0.90)
    # format the prompt with the test case variables
    update_vars = {'question': question, 'supporting_information': str(search_res)}
    formatted_prompt = template_prompt.bind(update_vars).format()

    # make your llm call
    s = time.time()
    openaiClient = OpenAI(api_key=openai_key)
    chat_response = openaiClient.chat.completions.create(
        model=formatted_prompt.prompt_info.model,
        messages=formatted_prompt.messages,
        **formatted_prompt.prompt_info.model_parameters
    )
    e = time.time()

    # append the results to the messages
    all_messages = formatted_prompt.all_messages(
        {'role': chat_response.choices[0].message.role,
         'content': chat_response.choices[0].message.content}
    )

    # create a session which will create a UID
    session = fpClient.sessions.create()

    # build the record payload
    payload = RecordPayload(
        all_messages=all_messages,
        inputs=test_case.variables, # the variables from the test case are the inputs
        session_info=session, # use the session id created above
        test_run_info=TestRunInfo(test_run_id=test_run.test_run_id,
                                  test_case_id=test_case.id), # link the record call to the test run and test case
        prompt_info=formatted_prompt.prompt_info, # log the prompt information 
        call_info=CallInfo.from_prompt_info(formatted_prompt.prompt_info, start_time=s, end_time=e), # log call information
        response_info=ResponseInfo(
            is_complete=chat_response.choices[0].finish_reason == 'stop'
        )
    )
    # record the results to freeplay
    fpClient.recordings.create(payload)

2. Uncovering & Fixing Hallucinations in Production

You can easily use the Session dashboard to filter for sessions where Answer Faithfulness is “No”. This represents a hallucination: in effect, the model is ignoring the provided information and making up an answer.

Here I have an example where it looks like Context Relevance evaluated positively but the other criteria like Answer Faithfulness evaluated negatively. This indicates there was an issue with my generation steps rather than my retrieval step. Looking at the supporting information confirms this — we returned the right portion of documentation, but the model hallucinated an incorrect answer. It was not “faithful” to the provided context. 

Screenshot of an auto-evaluated completion in Freeplay

But uncovering issues isn’t all that helpful if it’s not actionable! We can easily pull this example directly into the playground and experiment with ways to resolve the issue. 

For example, we were using GPT-3.5 in this case, but in the Freeplay playground we can quickly see what happens if we upgrade the model to GPT-4. Additionally, we could pull in other Test Cases from saved datasets to ensure we’re not over-indexing on a single example.

In this case it looks like the hallucination was resolved by upgrading the model, and that trend held against other examples in our “Failure Cases” dataset. From here we could do further batch testing against our benchmark dataset or deploy the changes directly to a lower environment.

These are just two examples of ways Freeplay helps detect and address issues in RAG pipelines. With in-app testing, batch testing in code, and live production monitoring in place, you have a powerful set of tools to uncover, diagnose, and address issues with your LLM application, giving you the freedom to iterate as a team and ship with confidence.

With Freeplay and MongoDB working in tandem, you get a powerful toolset to bring your enterprise data to bear on your LLM applications.

Here's a video showing the end to end workflow.


Want to go deeper? Check out our docs here, or reach out here if you’d like to talk with our team. 
