Building RAG Apps With Apache Cassandra, Python, and Ollama

Retrieval-augmented generation (RAG) is the most popular approach for retrieving real-time or recently updated data from a data source based on a user's text input, empowering our search applications with state-of-the-art neural search.

In RAG search systems, each user request is converted into a vector representation by an embedding model, and this vector is compared, using measures such as cosine similarity, dot product, or Euclidean distance, against the existing vector representations stored in our vector-capable database.

The vectors stored in the vector database are themselves generated or updated asynchronously by a separate background process.

This diagram provides a conceptual overview of vector comparison.

To use RAG, we need at least an embedding model and a vector storage database for the application to use. Contributions from the community and open-source projects provide us with an amazing set of tools that help us build effective and efficient RAG applications.

In this article, we will use a vector database and an embedding-generation model from a Python application. Whether this is your first encounter with these concepts or your nth, you only need the tools themselves; no subscription is required for any of them. Simply download the tools and get started.

Our tech stack consists of the following open-source and free-to-use tools:

  • Operating system – Ubuntu Linux
  • Vector database – Apache Cassandra
  • Embedding model – nomic-embed-text
  • Programming language – Python

Key Benefits of This Stack

  • Open-source
  • Isolated data to meet data compliance standards

This diagram provides a high-level dependency architecture of the system.

Implementation Walkthrough

You can follow along and implement this yourself if the prerequisites are fulfilled; otherwise, read to the end to understand the concepts.

Prerequisites

Ollama Model Setup

Ollama is an open-source middleware server that acts as an abstraction layer between generative AI models and applications: it installs all the necessary tooling to make generative AI models available for consumption as a CLI and an API on a machine. It offers most of the openly available models, such as llama, phi, mistral, and snowflake-arctic-embed, is cross-platform, and can be easily configured on any OS.

In Ollama, we will pull the nomic-embed-text model to generate embeddings.

Run the following in the command line:

Plain Text
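ollama pull nomic-embed-text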

 

This model generates embedding vectors of size 768.

Apache Cassandra Setup and Scripts

Cassandra is an open-source NoSQL database designed for heavy workloads that need to scale to industry demands. Version 5.0 recently added support for vector search, which facilitates our RAG use case.

Note: Cassandra requires Linux to run; it can also be installed as a Docker image.

Installation

Download Apache Cassandra from https://cassandra.apache.org/_/download.html.

Add Cassandra to your PATH.

Start the server by running the following command in the command line:

Plain Text
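# -f keeps the server in the foreground so the startup logs stay visible
cassandra -f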

 

Table

Open a new Linux terminal and type cqlsh; this opens the shell for the Cassandra Query Language. Now, execute the scripts below to create the embeddings keyspace, the document_vectors table, and the edv_ann_index index necessary to perform a vector search.

SQL
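-- A minimal sketch of the scripts: the keyspace, table, and index names come from
-- the article; the non-vector column names (id, content) are illustrative assumptions.
CREATE KEYSPACE IF NOT EXISTS embeddings
  WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};

USE embeddings;

-- content_vector stores the 768-dimension embeddings produced by nomic-embed-text
CREATE TABLE IF NOT EXISTS document_vectors (
    id UUID PRIMARY KEY,
    content TEXT,
    content_vector VECTOR <FLOAT, 768>
);

-- A storage-attached index (SAI) on the vector column enables ANN queries
CREATE INDEX IF NOT EXISTS edv_ann_index
    ON document_vectors (content_vector) USING 'sai';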

 

Note: content_vector VECTOR <FLOAT, 768> is the column responsible for storing vectors of length 768, as generated by the model.

Milestone 1: We are ready with the database setup to store vectors.

Python Code

This programming language certainly needs no introduction; it is easy to use, loved by the industry, and backed by strong community support.

Virtual Environment

Set up virtual environment:

Plain Text
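python3 -m venv venv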

 

Activate virtual environment:

Plain Text
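source venv/bin/activate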

 

Packages

Download the DataStax Cassandra driver package:

Plain Text
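pip install cassandra-driver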

 

Download the requests package:

Plain Text
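pip install requests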

 

File

Create a file named app.py.

Now, write the code below to insert sample documents into Cassandra. Inserting data into the database is always the first step; it can be done asynchronously by a separate process. For demo purposes, I have written a method that inserts the documents first. Later, we can comment this method out once the insertion of documents succeeds.

Python
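# A minimal sketch of app.py: the table schema follows the illustrative CQL above,
# and the endpoint/model are Ollama's local defaults. Every sample document except
# the Bangkok sentence quoted later in this article is a placeholder.
import uuid

import requests
from cassandra.cluster import Cluster

OLLAMA_URL = "http://localhost:11434/api/embeddings"


def get_embedding(text):
    """Ask the local Ollama server to embed text with nomic-embed-text (768 floats)."""
    response = requests.post(OLLAMA_URL, json={"model": "nomic-embed-text", "prompt": text})
    response.raise_for_status()
    return response.json()["embedding"]


def insert_documents(session):
    """Embed each sample document and store it alongside its vector."""
    documents = [
        "The street food stalls in Bangkok served fiery pad Thai that left "
        "Varun with a tangy memory of the city's vibrant energy.",
        # ...add nine more sample documents of your own here...
    ]
    for doc in documents:
        session.execute(
            "INSERT INTO embeddings.document_vectors (id, content, content_vector) "
            "VALUES (%s, %s, %s)",
            (uuid.uuid4(), doc, get_embedding(doc)),
        )


cluster = Cluster(["127.0.0.1"])          # local Cassandra node
session = cluster.connect("embeddings")   # keyspace created earlier
insert_documents(session)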

 

Now, run this file from the command line inside the virtual environment:

Plain Text
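python app.py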

 

Once the file has executed and the documents are inserted, verify them by querying the Cassandra database from the cqlsh console. Open cqlsh and execute:

SQL
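-- Column names follow the illustrative schema above
SELECT id, content FROM embeddings.document_vectors;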

 

This will return 10 documents inserted in the database, as seen in the screenshot below.

10 documents inserted in the database

Milestone 2: We are done with data setup in our vector database.

Now, we will write code to query documents based on cosine similarity. Cosine similarity is the dot product of two vectors divided by the product of their magnitudes: A·B / (|A||B|). It is 1.0 for vectors pointing in the same direction and 0 for orthogonal ones. Cosine similarity is supported natively by Apache Cassandra, which lets us compute everything inside the database and handle large datasets efficiently.

The code below is self-explanatory; it fetches the top three results by cosine similarity using ORDER BY <column_name> ANN OF <query_vector> and also returns the cosine similarity values. To execute this code, we need to ensure that an index is applied to the vector column.

Python
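# A sketch of the retrieval step under the same assumptions as the insertion code;
# it reuses get_embedding() and session from above. similarity_cosine and
# ORDER BY ... ANN OF are Cassandra 5.0 vector-search features; the query text
# passed at the bottom is a hypothetical example.
def search_documents(session, query_text):
    """Print the top three documents closest to the query by cosine similarity."""
    query_vector = get_embedding(query_text)
    rows = session.execute(
        "SELECT content, similarity_cosine(content_vector, %s) AS score "
        "FROM embeddings.document_vectors "
        "ORDER BY content_vector ANN OF %s LIMIT 3",
        (query_vector, query_vector),
    )
    for row in rows:
        print(row.score, row.content)


search_documents(session, "tangy street food")   # hypothetical user query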

 

Remember to comment out the insertion code:

Python
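# insert_documents(session)  # documents are already stored; keep commented to avoid duplicates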

 

Now, execute the Python code by using python app.py.

We will get output similar to the below:

Plain Text
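Only the top row is reproduced from the original run; the remaining rows and scores will vary with your documents and query:

0.8205469250679016 | The street food stalls in Bangkok served fiery pad Thai that left Varun with a tangy memory of the city’s vibrant energy.
...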

 

You can see that the cosine similarity of “The street food stalls in Bangkok served fiery pad Thai that left Varun with a tangy memory of the city’s vibrant energy.” is 0.8205469250679016, making it the closest match.

Final Milestone: We have implemented the RAG search. 

Enterprise Applications

Apache Cassandra

For enterprises, we can use Apache Cassandra 5.0 from popular cloud vendors such as Microsoft Azure, AWS, and GCP.

Ollama

This middleware requires a VM with an Nvidia GPU to run high-performance models, but we don't need high-end VMs for the models used to generate vectors. Depending on traffic requirements, multiple VMs can be used, or a generative AI service such as OpenAI or Anthropic can be adopted, whichever offers the lower total cost of ownership for the scaling or data-governance needs at hand.

Linux VM

Apache Cassandra and Ollama can be combined and hosted on a single Linux VM if the use case doesn't require heavy usage, lowering the total cost of ownership and addressing data-governance needs.

Conclusion

We can easily build well-performing RAG applications using Linux, Apache Cassandra, embedding models (nomic-embed-text) served via Ollama, and Python, without needing any additional cloud subscription or services, in the comfort of our own machines/servers.

However, for an enterprise application, hosting on a VM (or opting for a cloud subscription) is recommended for scaling in line with scalable-architecture practices. In this stack, Apache Cassandra is the key component, doing the heavy lifting of vector storage and vector comparison, with the Ollama server generating the vector embeddings.

That’s it! Thanks for reading ’til the end.

Source:
https://dzone.com/articles/build-rag-apps-apache-cassandra-python-ollama