LlamaIndex: A closer look into storage customization, persisting and loading data
In this post, we're going to go over some important concepts as well as discuss the low-level interface to customize persisting and retrieving data using LlamaIndex.
Introduction
As we've seen in the previous article, we queried a PDF file after ingesting it. In a real-world application, we'll need to store the data somewhere so that we don't re-ingest it with every query. Not only will this save us money (if we're using OpenAI's API), but it will also improve the application's speed, since we'll be loading previously processed data.
LlamaIndex, by default, uses a high-level interface designed to streamline the process of ingesting, indexing, and querying data.
Storage Types
LlamaIndex supports swappable storage components that allow you to customize Document, Index, Vector, and Graph stores. We're going to define what each one of those store types is and then explore how we can customize them so that they fit our use case.
Here's a breakdown of the store types:
1. Document Stores
Document stores hold chunks of the ingested documents, represented as Node objects. By default, LlamaIndex stores these Node objects in memory. However, this can be customized so that Node objects are stored on and loaded from disk, or swapped out entirely for other stores such as MongoDB, Redis, and others.
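To make this concrete, here's a rough sketch of what working with the document store directly might look like, using the simple in-memory implementation and persisting it to disk. The import paths, the SimpleNodeParser chunking step, and the file path are assumptions based on the library's layout at the time of writing, so double-check them against the docs:
from llama_index import SimpleDirectoryReader
from llama_index.node_parser import SimpleNodeParser
from llama_index.storage.docstore import SimpleDocumentStore

# Chunk the ingested documents into Node objects
documents = SimpleDirectoryReader("./data").load_data()
nodes = SimpleNodeParser.from_defaults().get_nodes_from_documents(documents)

# Add the nodes to an in-memory document store, then persist it to disk
docstore = SimpleDocumentStore()
docstore.add_documents(nodes)
docstore.persist(persist_path="./storage/docstore.json")

# Later on, reload the same nodes from disk
docstore = SimpleDocumentStore.from_persist_path("./storage/docstore.json")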
2. Index Stores
Index stores contain metadata related to our indices. This information is essential for efficient querying and retrieval of data. By default, LlamaIndex uses a simple in-memory key-value store. Similar to the document store, the default behavior can be customized and supports databases such as MongoDB, Redis, and others.
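As a rough illustration, this is what explicitly wiring up the default stores through a StorageContext might look like; any of these simple stores could then be swapped for a MongoDB- or Redis-backed implementation. Treat the import paths and constructor calls as a sketch rather than gospel:
from llama_index import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.storage.docstore import SimpleDocumentStore
from llama_index.storage.index_store import SimpleIndexStore
from llama_index.vector_stores import SimpleVectorStore

# Spell out the default simple stores explicitly; each one is swappable
storage_context = StorageContext.from_defaults(
    docstore=SimpleDocumentStore(),
    index_store=SimpleIndexStore(),
    vector_store=SimpleVectorStore(),
)

# Build the index on top of the customized storage context
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)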
3. Vector Stores
Vector stores contain the embedding vectors of ingested document chunks. Embeddings are a data representation (generated during data ingestion) that holds semantic information. This type of data calls for specialized stores, known as vector stores, which are databases optimized for storing and querying embeddings efficiently. By default, LlamaIndex uses a simple in-memory vector store with the ability to persist data to and load it from disk. The default store can easily be swapped with other stores such as Pinecone, Apache Cassandra, and others.
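For example, swapping the default vector store for Pinecone roughly looks like the sketch below. The API key, environment, index name, and the PineconeVectorStore arguments are placeholders based on the integration's documented usage, so verify them against the Pinecone integration docs before relying on this:
import pinecone
from llama_index import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.vector_stores import PineconeVectorStore

# Connect to an existing Pinecone index (API key, environment, and name are placeholders)
pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")
pinecone_index = pinecone.Index("quickstart")

# Use Pinecone instead of the default in-memory simple vector store
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)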
4. Graph Stores
LlamaIndex also caters to graph stores. If you're working with graphs, you'll find support right out of the box for Neo4j, Nebula, Kuzu, and FalkorDB.
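To give a feel for it, hooking up Neo4j might look something like the sketch below; the connection details are placeholders and the constructor arguments are based on the Neo4jGraphStore integration, so check its documentation for the exact signature:
from llama_index import StorageContext
from llama_index.graph_stores import Neo4jGraphStore

# Point the storage context at a running Neo4j instance (credentials are placeholders)
graph_store = Neo4jGraphStore(
    username="neo4j",
    password="password",
    url="bolt://localhost:7687",
    database="neo4j",
)
storage_context = StorageContext.from_defaults(graph_store=graph_store)
# The storage context can then be passed to a knowledge graph index, which we won't cover here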
Storage Location
Great, we've seen the different kinds of stores that integrate with LlamaIndex out of the box. By default, if we don't customize any store, LlamaIndex will persist everything locally, as we're going to see in the example below. We can also swap our local disk for a remote one, such as AWS S3.
If you're using AWS within your stack, LlamaIndex integrates easily with AWS S3. This enables you to use your existing cloud storage for your data management needs. We're not going to go into the details of overriding the default settings in this post, but it can easily be done by passing in a fsspec.AbstractFileSystem object.
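That said, to give you a rough idea, persisting to S3 with s3fs might look like the sketch below. The bucket name is a placeholder, and passing the filesystem via an fs parameter is my assumption of how the fsspec support is exposed, so treat it as a starting point rather than a recipe:
import s3fs
from llama_index import StorageContext, load_index_from_storage

# An fsspec-compatible filesystem pointing at S3 (bucket name is a placeholder)
s3 = s3fs.S3FileSystem(anon=False)

# Persist the index we built earlier to the bucket instead of the local disk
index.storage_context.persist(persist_dir="my-bucket/storage", fs=s3)

# Later, rebuild the storage context and reload the index straight from S3
storage_context = StorageContext.from_defaults(persist_dir="my-bucket/storage", fs=s3)
index = load_index_from_storage(storage_context)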
Extending our Application
Now let's go back to the simple Python application that we built in the previous post. We're going to add a few lines of code to persist the data on the local disk so that we don't have to re-ingest the PDF file with each subsequent query.
Persisting our Data
As mentioned, by default, LlamaIndex uses in-memory storage, but we're going to persist the data to our local filesystem using the code below:
On line 12, we previously instantiated the index variable. To persist the data to the local disk, we just call the storage_context.persist() method. It's that easy!
index = VectorStoreIndex.from_documents(documents)
index.storage_context.persist() # <-- Save to disk
What this does is create a storage folder containing four files: docstore.json, graph_store.json, index_store.json, and vector_store.json.
These files contain all the information required to load the index from the local disk whenever needed. Keep in mind that the default storage folder can easily be changed to any other directory (e.g., ./data) by passing the persist_dir parameter, as shown below:
index.storage_context.persist(persist_dir="./data")
Loading our Data
Great, now our data is persisted on the local disk. We're going to look at how we can load the index from the default storage folder by rebuilding the storage_context and reloading the index:
from llama_index import StorageContext, load_index_from_storage

# Create storage context from persisted data
storage_context = StorageContext.from_defaults(persist_dir="./storage")
# Load index from storage context
index = load_index_from_storage(storage_context)
After we load our index, we can query the data again as we previously did, using our query_engine:
query_engine = index.as_query_engine()
response = query_engine.query("What is this document about?")
That's it: we basically just added three lines of code to persist and load the data from our local disk. Obviously, this can get much more complicated in a real-world application, but to keep things simple we're going to stick with the default settings.
I am going to cover more complex scenarios in future posts. Make sure you subscribe for free to get the latest updates as soon as they are published.
Wrapping Up
Let's put everything together. Here is the final code:
import os
from pathlib import Path
from llama_index import StorageContext, VectorStoreIndex, download_loader, load_index_from_storage

# Reuse the persisted index if it exists; otherwise ingest the PDF and persist it
if os.path.exists("storage"):
    storage_context = StorageContext.from_defaults(persist_dir="./storage")
    index = load_index_from_storage(storage_context)
else:
    PDFReader = download_loader("PDFReader")
    loader = PDFReader()
    documents = loader.load_data(file=Path('dominos.pdf'))
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist()

query_engine = index.as_query_engine()
response = query_engine.query("What is this document about?")
print(response)
The code above checks if a folder called storage exists. If it does, it loads the data from storage. If not, it ingests the PDF and then stores that information in the storage folder for future use.
Conclusion
In this post, we covered the basic store types that LlamaIndex relies on. Using the default settings, we saved the ingested data to our local disk, and then modified our code to look for available data and load it from storage instead of ingesting the PDF every time we ran our Python app.
If you're getting started with LlamaIndex, give this a try, and let me know in the comments below if you have any questions.
Thanks for reading!