LlamaIndex: Using data connectors to build a custom ChatGPT for private documents
In this post, we're going to see how we can use LlamaIndex's PDF Loader Data Connector to ingest data from the Domino's Pizza Nutritional Information PDF, then query that data, and print the LLM's response.
Introduction
Have you ever wanted to quickly get information from your files without reading a lot of pages? Well, with the advancements in LLMs and the tools around them, you can now literally chat with your documents (a PDF, for example). We're going to be doing exactly that using LlamaIndex and Data Connectors.
LlamaIndex will help you build LLM applications by providing a framework that can easily ingest data from multiple sources and then use that data as context with a Large Language Model (LLM) such as GPT-4.
In this post, we're going to ingest data from a PDF file using a LlamaIndex Data Connector.
What are Data Connectors?
Data Connectors in LlamaIndex are essentially plugins that allow us to take in data from a source (such as PDF files) and then use the loaded data in our LLM application. For this example, we're going to ingest a PDF document, so we'll be using the PDF Loader Data Connector.
After ingesting the data, an index can be constructed and used to answer specific questions about the data using a Query Engine, or to have a chat-style conversation using a Chat Engine.
LlamaIndex Engines
We're going to quickly define what the Query and Chat Engines are and briefly explain their function.
Query Engine
A query engine is a generic interface that allows you to ask questions about the data ingested from one or more sources using Data Connectors. A query engine takes in a natural language input and returns a response.
A query engine can be initialized by using the as_query_engine() method as shown below:
query_engine = index.as_query_engine()
response = query_engine.query("What are Data Connectors?")
Chat Engine
Similarly, we can think of a chat engine as an extension of a query engine that supports having a conversation (back-and-forth messages) with your data. It achieves this by keeping track of the message history and retaining context for future queries. If you're building a bot for your custom data, or any conversation-type interface, you'll probably use the chat engine.
A chat engine can be initialized by using the as_chat_engine() method as shown below:
chat_engine = index.as_chat_engine()
response = chat_engine.chat("What are Data Connectors?")
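Because the chat engine keeps the message history, a follow-up question can lean on earlier turns. For example, continuing the snippet above (the follow-up question here is just illustrative):
# The engine remembers the previous exchange, so "them" resolves to Data Connectors
follow_up = chat_engine.chat("How would I use them with a PDF file?")
print(follow_up)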
Setting Up
To set up our first Data Connector for this example, we'll need an OpenAI API Key and a PDF file that you'd like to process.
Installing LlamaIndex
Let's get started by installing LlamaIndex using pip. In your terminal window, type the following:
pip install llama-index
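A quick heads-up: this post uses the pre-0.10 llama-index import layout (download_loader and top-level imports from llama_index). If pip pulls a newer release and the imports below fail, pinning an older version is one workaround (the exact version bound shown is an assumption):
pip install "llama-index<0.10"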
Creating an Empty Directory
Next, let's create an empty directory for our project:
mkdir data-connectors
Then, let's cd into our new directory:
cd data-connectors
We can finally create our app.py Python file:
touch app.py
Querying PDF Example
Next, we're going to do the following:
- Set the OpenAI API Key
- Import required packages
- Load the LlamaIndex Data Connector: PDF Reader
- Ingest a sample PDF file
- Use the Query Engine to query OpenAI's LLM
Set up OpenAI API Key
import os
os.environ["OPENAI_API_KEY"] = 'YOUR-API-KEY-HERE'
Import Required Packages
from pathlib import Path
from llama_index import VectorStoreIndex, download_loader
Here we'll load Path from pathlib, which makes it easier to interact with files and directories. We'll also import VectorStoreIndex and download_loader from llama_index.
VectorStoreIndex represents a vector index, a type of index used to store and manage multidimensional data called vectors. Vectors are produced by AI models called embedding models: these models take something, like an article, picture, or video, and turn it into a set of numbers, or a vector, that represents it.
download_loader will help us load one of the many LlamaIndex Data Connectors; in our case, we'll be using the PDF Loader connector (as shown below).
Load PDF Reader
Using download_loader, we'll now load the PDF Loader Data Connector:
PDFReader = download_loader("PDFReader")
loader = PDFReader()
Ingest Sample PDF File
Next week some friends are coming over and we're having Domino's Pizza for dinner. I genuinely want to query their nutritional information and get more details about my choices, so I decided to use the Canadian Domino's Pizza Nutritional Guide as my sample PDF, but obviously you can swap it with any other PDF based on your use case.
documents = loader.load_data(file=Path('dominos.pdf'))
index = VectorStoreIndex.from_documents(documents)
Using load_data, which takes in the PDF's path, we can convert the PDF's content to a VectorStoreIndex, as shown above.
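One practical note: building the index embeds the PDF's content, which calls the OpenAI API on every run. If you'll be querying the same document repeatedly, you can persist the index to disk and reload it later; here's a rough sketch (the ./storage directory name is just an example):
from llama_index import StorageContext, load_index_from_storage

# Save the index (including embeddings) to disk
index.storage_context.persist(persist_dir="./storage")

# ...on a later run, load it back instead of re-embedding the PDF
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)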
Query the PDF
The final step here is to query the PDF:
query_engine = index.as_query_engine()
response = query_engine.query("What is this document about?")
print(response)
The printed response in my case is: This document is about the nutrition guide for Domino's Pizza.
Here's another interesting query:
response = query_engine.query("How many Pizza types are there?")
To which the LLM responded: There are 6 pizza types mentioned in the context information.
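If you want to sanity-check an answer like this one, the response object also carries the chunks of the PDF it was grounded in. A quick way to peek at them (assuming the default response type):
# Print a summary of the source chunks the answer was based on
print(response.get_formatted_sources())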
Recap and Next Steps
In this post, we've seen how we can use LlamaIndex's PDF Loader Data Connector to ingest data from the Domino's Pizza Nutritional Information PDF, query that data, and receive a response from OpenAI's model. LlamaIndex supports other LLMs as well, and for your specific use case you could use a different model that doesn't require internet access, to keep your private data private.
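As a rough sketch of that idea, the legacy ServiceContext API lets you swap in a locally-hosted model, for example via Ollama, along with a local embedding model (the model name and the "local" embed_model choice are assumptions, and both require extra local dependencies to be installed):
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.llms import Ollama

# A locally-hosted LLM plus local embeddings, so no document data leaves your machine
service_context = ServiceContext.from_defaults(
    llm=Ollama(model="llama2"),
    embed_model="local",
)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)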
You can download the full code from this repo.
Feel free to experiment with your own documents, and stay tuned for future posts if you like the content.