LlamaIndex: A closer look into storage customization, persisting and loading data

In this post, we're going to go over some important storage concepts and discuss the lower-level interfaces LlamaIndex provides for customizing how data is persisted and retrieved.

Introduction

As we've seen in the previous article, we queried a PDF file after ingesting it. In a real-world application, we'll need to be able to store the data somewhere so that we don't re-ingest it with every query. Not only will this save us money (if we're using OpenAI's API), but it will also improve the application's speed since we'll be loading previously processed data.

LlamaIndex, by default, uses a high-level interface designed to streamline the process of ingesting, indexing, and querying data.

Storage Types

LlamaIndex supports swappable storage components that allow you to customize Document, Index, Vector, and Graph stores. We're going to define what each one of those store types is and then explore how we can customize them so that they fit our use case.

Here's a breakdown of the store types:

1. Document Stores

Document stores hold the chunks of the ingested documents, represented as Node objects. By default, LlamaIndex keeps these Node objects in memory. However, this can be customized so that Node objects are persisted to and loaded from disk, or swapped entirely for other backends such as MongoDB, Redis, and others.
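As a rough illustration, here's a minimal sketch of swapping the default document store for MongoDB. It assumes a MongoDB instance reachable at localhost:27017, a placeholder database name, and a documents list produced by an earlier ingestion step:

from llama_index import StorageContext, VectorStoreIndex
from llama_index.storage.docstore import MongoDocumentStore

# Placeholder URI and database name for a locally running MongoDB instance
docstore = MongoDocumentStore.from_uri(
    uri="mongodb://localhost:27017",
    db_name="llama_index_db",
)

# Build the index on top of a storage context that uses the MongoDB-backed docstore
storage_context = StorageContext.from_defaults(docstore=docstore)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)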

2. Index Stores

Index stores contain metadata related to our indices, information that is essential for efficient querying and retrieval of data. By default, LlamaIndex uses a simple in-memory key-value store. As with document stores, the default behavior can be customized, with support for databases such as MongoDB, Redis, and others.
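The pattern is the same as with document stores: build a custom store and hand it to the storage context. Here's a hedged sketch using Redis for the index store, assuming a Redis server on the default local port and a documents list from ingestion:

from llama_index import StorageContext, VectorStoreIndex
from llama_index.storage.index_store import RedisIndexStore

# Placeholder connection details for a locally running Redis server
index_store = RedisIndexStore.from_host_and_port(
    host="127.0.0.1",
    port=6379,
    namespace="llama_index",
)

# Index metadata is now written to Redis instead of the in-memory key-value store
storage_context = StorageContext.from_defaults(index_store=index_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)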

3. Vector Stores

Vector stores contain the embedding vectors of ingested document chunks. Embeddings are a data representation, generated during ingestion, that captures semantic information. Storing and querying this type of data efficiently requires a specialized database, which is exactly what a vector store is. By default, LlamaIndex uses a simple in-memory vector store with the capability of persisting and loading data from disk. The default store can easily be swapped with other stores such as Pinecone, Apache Cassandra, and others.
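To make the idea concrete with a store you can run locally, here's a minimal sketch using Chroma as one of those "other" stores (the in-memory client, collection name, and documents list are placeholders):

import chromadb

from llama_index import StorageContext, VectorStoreIndex
from llama_index.vector_stores import ChromaVectorStore

# In-memory Chroma client and a placeholder collection to hold our embeddings
chroma_client = chromadb.Client()
chroma_collection = chroma_client.create_collection("quickstart")

# Wrap the collection and pass it to the storage context
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Embeddings generated during ingestion now land in Chroma instead of the default store
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)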

4. Graph Stores

LlamaIndex also caters to graph stores. If you're working with graphs, you'll find support right out of the box for Neo4j, Nebula, Kuzu, and FalkorDB.
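We won't touch graph stores in this post, but just to give a flavor, a hedged sketch of pointing a knowledge graph index at Neo4j might look roughly like this (the connection details are placeholders for a local Neo4j instance, and documents comes from ingestion):

from llama_index import KnowledgeGraphIndex, StorageContext
from llama_index.graph_stores import Neo4jGraphStore

# Placeholder credentials for a locally running Neo4j instance
graph_store = Neo4jGraphStore(
    username="neo4j",
    password="password",
    url="bolt://localhost:7687",
    database="neo4j",
)

# Extracted triplets are written to Neo4j instead of the default in-memory graph store
storage_context = StorageContext.from_defaults(graph_store=graph_store)
index = KnowledgeGraphIndex.from_documents(documents, storage_context=storage_context)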

Storage Location

Great, we've seen the different kinds of stores that integrate with LlamaIndex out of the box. By default, if we don't customize any store, LlamaIndex will persist everything locally, as we're going to see in the example below. We can also swap our local disk for remote storage such as AWS S3.

If you're using AWS within your stack, LlamaIndex integrates easily with AWS S3. This enables you to use your existing cloud storage for your data management needs. We're not going to go into the details of overriding the default settings in this post, but it can easily be done by passing in an fsspec.AbstractFileSystem object, as sketched briefly below.
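Just to give a flavor, a minimal, untested sketch with s3fs might look like this (the bucket name is a placeholder, credentials are assumed to come from your AWS environment, and index is the index we build in the next section):

import s3fs

from llama_index import StorageContext, load_index_from_storage

# s3fs implements fsspec.AbstractFileSystem; credentials come from the usual AWS config
s3 = s3fs.S3FileSystem()

# Persist the index to a bucket instead of the local disk
index.storage_context.persist(persist_dir="my-bucket/storage", fs=s3)

# Later, rebuild the storage context and reload the index from the same bucket
storage_context = StorageContext.from_defaults(persist_dir="my-bucket/storage", fs=s3)
index = load_index_from_storage(storage_context)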

💡
Storing on your local disk is great since it's the fastest option and costs nothing. You can keep the default settings for quick prototyping and then easily move to S3 or another remote location if needed.

Extending our Application

Now let's go back to the simple Python application that we built in the previous post. We're going to add a few lines of code to persist the data on the local disk so that we don't have to re-ingest the PDF file with each subsequent query.

Persisting our Data

As mentioned, LlamaIndex uses in-memory storage by default, but we're going to persist the data to our local filesystem using the code below.

On line 12 of our original script, we instantiated the index variable. To persist the data to the local disk, we just call the storage_context.persist() method. It's that easy!

index = VectorStoreIndex.from_documents(documents)

index.storage_context.persist() # <-- Save to disk

What this does is create a storage folder containing four files: docstore.json, graph_store.json, index_store.json, and vector_store.json.

These files contain all the information required to load the index from the local disk whenever needed. Keep in mind that the default storage folder can easily be changed to any other directory (e.g. ./data) by passing the persist_dir parameter, as shown below:

index.storage_context.persist(persist_dir="./data")

Loading our Data

Great, now our data is persisted on the local disk. Let's look at how we can load the index from the default storage folder by rebuilding the storage_context and reloading the index:

# Create storage context from persisted data
storage_context = StorageContext.from_defaults(persist_dir="./storage")

# Load index from storage context
index = load_index_from_storage(storage_context)

After we load our index, we can just query the data again as we previously did using our query_engine:

query_engine = index.as_query_engine()
response = query_engine.query("What is this document about?")

That's it: we basically just added three lines of code to persist and load the data from our local disk. Obviously, this can get much more complicated in a real-world application, but to keep things simple we're going to stick with the default settings.

I am going to cover more complex scenarios in future posts. Make sure you subscribe for free to get the latest updates as soon as they are published.

Wrapping Up

Let's put everything together. Here is the final code:

if os.path.exists("storage"):
    storage_context = StorageContext.from_defaults(persist_dir="./storage")
    index = load_index_from_storage(storage_context) 
else:
    PDFReader = download_loader("PDFReader")
    
    loader = PDFReader()
    
    documents = loader.load_data(file=Path('dominos.pdf'))
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist()

query_engine = index.as_query_engine()
response = query_engine.query("What is this document about?")

print(response)

The code above checks if a folder called storage exists. If it does, it loads the data from storage. If not, it ingests the PDF and then stores that information in the storage folder for future use.

✌️
The source code is available for download on this GitHub repo.

Conclusion

In this post, we covered the basic store types used by LlamaIndex. Using the default settings, we saved the ingested data to our local disk and then modified our code to look for existing data and load it from storage instead of ingesting the PDF every time we run our Python app.

If you're getting started with LlamaIndex, give this a try, and let me know in the comments below if you have any questions.

Thanks for reading!