To build customer support bots, internal knowledge graphs, or Q&A systems, customers often use Retrieval Augmented Generation (RAG) applications, which leverage pre-trained models together with their proprietary data. However, without guardrails for secure credential management and abuse prevention, customers cannot safely democratize access to and development of these applications. We recently announced the MLflow AI Gateway, a highly scalable, enterprise-grade API gateway that enables organizations to manage their LLMs and make them available for experimentation and production. Today we are excited to announce that we are extending the AI Gateway to better support RAG applications. Organizations can now centralize the governance of privately hosted model APIs (via Databricks Model Serving), proprietary SaaS model APIs (OpenAI, Co:here, Anthropic), and open model APIs (via MosaicML) to develop and deploy RAG applications with confidence.
In this blog post, we’ll walk through how to build and deploy a RAG application on the Databricks Lakehouse AI platform using the Llama2-70B-Chat model for text generation and the Instructor-XL model for text embeddings, both hosted and optimized through MosaicML’s Starter Tier Inference APIs. Using hosted models lets us get started quickly and provides a cost-effective way to experiment at low throughput.
The RAG application we’re building in this blog answers gardening questions and gives plant care recommendations.
What is RAG?
RAG is a popular architecture that allows customers to improve model performance by leveraging their own data. This is done by retrieving relevant data/documents and providing them as context for the LLM. RAG has shown success in chatbots and Q&A systems that need to maintain up-to-date information or access domain-specific knowledge.
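Conceptually, the flow is just "retrieve, then generate." Below is a minimal sketch of that flow in plain Python; the retriever and llm objects are hypothetical placeholders for whatever retrieval and generation components you plug in (we use LangChain and the AI Gateway later in this post):

# Minimal "retrieve, then generate" sketch; `retriever` and `llm` are hypothetical placeholders
def answer_with_rag(question, retriever, llm, k=4):
    # 1. Retrieve the k most relevant document chunks for the question
    docs = retriever.get_relevant_documents(question)[:k]
    # 2. Provide them as context in the prompt
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = f"Use only this context to answer:\n{context}\n\nQuestion: {question}"
    # 3. Generate the answer with the LLM
    return llm(prompt)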
Use the AI Gateway to put guardrails in place for calling model APIs
The recently announced MLflow AI Gateway allows organizations to centralize governance, credential management, and rate limits for their model APIs, including SaaS LLMs, via an object called a Route. Distributing Routes allows organizations to democratize access to LLMs while ensuring that individual users can’t abuse or take down the system. The AI Gateway also provides a standard interface for querying LLMs, making it easy to upgrade the models behind Routes as new state-of-the-art models are released.
We typically see organizations create a Route per use case and many Routes may point to the same model API endpoint to make sure it is getting fully utilized.
For this RAG application, we want to create two AI Gateway Routes: one for our embedding model and another for our text generation model. We are using open models for both because we want a supported path for fine-tuning or privately hosting them in the future, avoiding vendor lock-in. To do this, we will use MosaicML’s Inference APIs, which provide fast and easy access to state-of-the-art open source models for rapid experimentation, with token-based pricing. MosaicML supports MPT and Llama2 models for text completion and Instructor models for text embeddings. In this example, we will use Llama2-70B-Chat, which was trained on 2 trillion tokens and fine-tuned by Meta for dialogue, safety, and helpfulness, and Instructor-XL, a 1.2B-parameter instruction fine-tuned embedding model from HKUNLP.
It’s easy to create a route for Llama2-70B-Chat using the new support for MosaicML Inference APIs on the AI Gateway:
from mlflow.gateway import create_route

mosaicml_api_key = "your key"

create_route(
    "completion",
    "llm/v1/completions",
    {
        "name": "llama2-70b-chat",
        "provider": "mosaicml",
        "mosaicml_config": {
            "mosaicml_api_key": mosaicml_api_key,
        },
    },
)
Similarly to the text completion Route configured above, we can create another Route for Instructor-XL, also available through the MosaicML Inference API:
create_route(
    "embeddings",
    "llm/v1/embeddings",
    {
        "name": "instructor-xl",
        "provider": "mosaicml",
        "mosaicml_config": {
            "mosaicml_api_key": mosaicml_api_key,
        },
    },
)
To get an API key for MosaicML hosted models, sign up here.
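With both Routes created, a quick sanity check from a notebook confirms they are working. Here is a minimal sketch using MLflow’s gateway query API; the payload keys follow the AI Gateway completions and embeddings schemas, and the prompts are just examples:

from mlflow.gateway import query, set_gateway_uri

set_gateway_uri("databricks")

# Text generation through the completion Route
print(query(route="completion", data={"prompt": "What soil pH do blueberries prefer?"}))

# Text embeddings through the embeddings Route
print(query(route="embeddings", data={"text": ["How often should I water a fiddle leaf fig?"]}))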
Use LangChain to piece together retriever and text generation
Now we need to build our vector index from our document embeddings so that we can do document similarity lookups in real time. We can use LangChain and point it to our AI Gateway Route for our embedding model.
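The snippet below expects docs, a list of chunked LangChain documents containing our gardening content. One illustrative way to produce it (the file path and chunk sizes here are placeholders, not part of the original application):

from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Illustrative only: load your own gardening content and split it into chunks
loader = TextLoader("/tmp/gardening_notes.txt")  # placeholder path
docs = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(loader.load())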
# Create the vector index
from langchain.embeddings import MlflowAIGatewayEmbeddings
from langchain.vectorstores import Chroma

# Point LangChain at the AI Gateway Route for embeddings
mosaicml_embedding_route = MlflowAIGatewayEmbeddings(
    gateway_uri="databricks",
    route="embeddings",
)

# Embed the documents and load them into Chroma
db = Chroma.from_documents(docs, embedding=mosaicml_embedding_route, persist_directory="/tmp/gardening_db")
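Before wiring the retriever into a chain, we can sanity-check the index with a direct similarity search (the query here is just an example):

# Quick check: retrieve the most similar chunks for a sample question
for doc in db.similarity_search("How much light does a fiddle leaf fig need?", k=2):
    print(doc.page_content[:200])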
We then need to stitch together our prompt template and text generation model:
from langchain.llms import MlflowAIGateway
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

# Create a prompt structure for Llama2 Chat (note that if using MPT the prompt structure would differ)
template = """[INST] <<SYS>>
You are an AI assistant, helping gardeners by providing expert gardening answers and advice.
Use only information provided in the following paragraphs to answer the question at the end.
Explain your answer with reference to these paragraphs.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.
If you don't know the answer to a question, please don't share false information.
<</SYS>>
{context}
{question} [/INST]
"""
prompt = PromptTemplate(input_variables=["context", "question"], template=template)

# Retrieve the AI Gateway Route for text generation
mosaic_completion_route = MlflowAIGateway(
    gateway_uri="databricks",
    route="completion",
    params={"temperature": 0.1},
)

# Wrap the prompt, retriever, and Gateway Route into a chain
retrieval_qa_chain = RetrievalQA.from_chain_type(
    llm=mosaic_completion_route,
    chain_type="stuff",
    retriever=db.as_retriever(),
    chain_type_kwargs={"prompt": prompt},
)
The RetrievalQA chain ties the two components together so that the documents retrieved from the vector database seed the context for the text generation model:
query = "Why is my Fiddle Fig tree dropping its leaves?"
retrieval_qa_chain.run(query)
You can now log the chain using the MLflow LangChain flavor (a sketch is shown below) and deploy it on a Databricks CPU Model Serving endpoint. Logging with MLflow automatically provides model versioning, adding more rigor to your production process.
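Here is a minimal sketch of what logging the chain might look like, assuming an MLflow version whose LangChain flavor supports retriever-based chains via loader_fn and persist_dir; the registered model name is illustrative:

import mlflow

# Hypothetical loader that rebuilds the retriever from the persisted Chroma directory at load time
def load_retriever(persist_directory):
    embeddings = MlflowAIGatewayEmbeddings(gateway_uri="databricks", route="embeddings")
    return Chroma(persist_directory=persist_directory, embedding_function=embeddings).as_retriever()

with mlflow.start_run():
    logged_chain = mlflow.langchain.log_model(
        retrieval_qa_chain,
        artifact_path="gardening_qa_chain",
        loader_fn=load_retriever,
        persist_dir="/tmp/gardening_db",
        registered_model_name="gardening_qa_chain",  # illustrative name
    )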
After completing a proof of concept, experiment to improve quality
Depending on your requirements, there are many experiments you can run to find the right optimizations to take your application to production. Using the MLflow tracking and evaluation APIs, you can log every parameter, base model, performance metric, and model output for comparison. The new Evaluation UI in MLflow makes it easy to compare model outputs side by side, and all MLflow tracking and evaluation data is stored in queryable formats for further analysis. Some experiments we commonly see are listed below, followed by a sketch of how you might track one:
- Latency – Try smaller models to reduce latency and cost.
- Quality – Try fine-tuning an open source model with your own data. This can help with domain-specific knowledge and adhering to a desired response format.
- Privacy – Try privately hosting the model on Databricks LLM-Optimized GPU Model Serving and using the AI Gateway to fully utilize the endpoint across use cases.
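A minimal sketch of tracking one such experiment with MLflow; the parameter and metric names here are illustrative, not a prescribed schema:

import mlflow

with mlflow.start_run(run_name="llama2-70b-chat-temp-0.1"):
    # Log the knobs being experimented with
    mlflow.log_params({"completion_route": "completion", "temperature": 0.1, "chain_type": "stuff"})
    # Run the chain on a sample question and record the output for side-by-side comparison
    response = retrieval_qa_chain.run("Why is my Fiddle Fig tree dropping its leaves?")
    mlflow.log_text(response, "sample_response.txt")
    mlflow.log_metric("response_length_chars", len(response))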
Get started developing RAG applications today on Lakehouse AI with MosaicML
The Databricks Lakehouse AI platform enables developers to rapidly build and deploy Generative AI applications with confidence. To replicate the above chat application in your organization, you will need:
- MosaicML API keys for quick and easy access to text embedding models and llama2-70b-chat. Sign up for access here.
- Join the MLflow AI Gateway Preview to govern access to your model APIs
Further explore and enhance your RAG applications: