How to convert natural language text to SQL using LangChain

In this post, we're going to look at how you can use LangChain and OpenAI's GPT model to convert natural language queries to SQL, execute them, and get an answer from your database in plain English.

👋
New to LangChain? Start with this introductory post first. It'll give you a great overview of everything you need to know before diving in. Come back when you're done!

Introduction

LangChain is a powerful framework for building LLM powered applications. In this post, we're going to go over and cover:

  • The Basics of LangChain
  • Installing LangChain using Python
  • Creating Chains using SQL data
  • Prompting Large Language Models (GPT-3.5)
  • And more!
⚠️
I recently published a post looking at LlamaIndex and LangChain. LlamaIndex is a tool that can be used with (or without) LangChain depending on your specific use case. Check it out!

Use cases

In my opinion the most relevant use cases for generating and executing SQL queries using Natural Language with LangChain are:

  • Creating chatbots that can answer questions based on database data
  • Generating complex queries based on natural language
  • Expanding database access to non-technical people and stakeholders

SQL Query building

SQL query building simply refers to the process of generating a SQL query from a question or an input. It involves writing the necessary statements that specify the desired data, conditions, and operations to be performed on the database. It starts with a prompt, then that prompt is converted into a SQL statement. This makes it really easy to generate queries based on questions or requirements instead of manually writing complex INNER and OUTER joins.

Source: https://python.langchain.com/docs/use_cases/sql/

As you can see from the diagram above, it starts with a natural language prompt (common English) which could be a question, such as: How many users with a Gmail email address? which is then converted to a SQL statement. This gets executed by the SQL Engine, the response is then sent back to the LLM and a natural language answer is generated.

Awesome, non-technical stakeholders asking for specific metrics can directly ask the database as they don't need to have a deep understanding of SQL. More importantly, using natural language will improve efficiency due to the speed at which questions can be turned into answers vs. having someone on the dev team write the query, clean the data, convert to Excel and send to stakeholders for example.

Overall, this makes SQL databases easy to work with and data retrieval and analysis more accessible to a wider range of users.

Querying a SQL Database with Chains

Let's suppose you have a SQL Database hosted on RDS. It stores expense transaction information for your users. If you were to retrieve the total spend for a specific user, you'd need to write a SQL query similar to this one: SELECT SUM(amount) from Expenses WHERE UserId = 1.

But instead using LangChain and LLMs you can run a chain that takes in natural language and then creates the SQL for you. So to create that same query you'll instead use: What's the total spend for user 1?

Similarly your users would be able to chat with their data instead of you having to build complex and unnecessary endpoints and interfaces.

💡
LangChain supports separating the query building process for the query execution process. So you can just generate SQL statements or you can generate & execute.

Text-to-SQL Query

In this code example, we'll see how we can create our SQL statement from text without execution:

from langchain.chat_models import ChatOpenAI
from langchain.chains import create_sql_query_chain

chain = create_sql_query_chain(ChatOpenAI(temperature=0), db)
response = chain.invoke({"question":"How many employees are there"

print(response)

Code sample of text-to-SQL query without execution

Text-to-SQL Query and Execution

To create and execute a query that will retrieve the number of employees from our SQL table using chains, and then execute it:

from langchain.llms import OpenAI
from langchain_experimental.sql import SQLDatabaseChain

llm = OpenAI(temperature=0, verbose=True)

db_chain = SQLDatabaseChain.from_llm(llm, db, verbose=True)
db_chain.run("How many employees are there?")

Code sample of text-to-SQL query with execution

This code block will create a chain that builds the SQL query based on the user question and then execute it.

⚠️
Be careful while using this approach as it is susceptible to SQL Injection attacks. Queries must be validated before execution.

Installation & setup

💡
While the official documentation is really well written and easy to understand. I will simplify things in this post so that we can get the basics running.

To install LangChain, you need to follow these steps:

  1. Open your command prompt or terminal.
  2. Type the command: pip install langchain langchain-experimental openai and press Enter
  3. Wait for the installation process to complete.

Create project folder

We'll then need to create our Python file and add our OpenAI API Key. To do so you'll need to follow the steps below:

  1. cd to your project folder.
  2. Type the command: touch app.py
  3. If you're using VSCode you can then type code . to load the contents of the folder.

Load OpenAI API key

💡
Note that LangChain supports other LLMs and I'm planning on covering this topic (customizing LLM) in a future post, but for this example we'll be working with OpenAI.
  1. In your terminal type touch .env
  2. Open the .env file and type OPENAI_API_KEY="..." then save.
  3. Install python-dotenv from terminal using pip by running: pip install python-dotenv
  4. Open app.py and add the following:
from dotenv import load_dotenv

# Load environment variables from the .env file
load_dotenv()
💡
This makes sure we're loading our OpenAI API Key from the environment variables instead of hard-coding it into app.py. It is generally best practice to do it like this.

That's it! To recap:

  • We installed required packages
  • Created a project folder and app.py file
  • Added our OpenAI API Key to our environment variables

Connecting to RDS instance

In our example, we're working with a Microsoft SQL Database, hosted on RDS. To set it up with LangChain, we'll need to use the SQLAlchemy wrapper exposed as SQLDatabase. Let's import it:

from langchain import SQLDatabase

Here's the database set up:

# Connection parameters
server = '***.rds.amazonaws.com'
database = '***'
username = '***'
password = '***'
port = '1234' # If you're using a custom port (Remove for default)
  
connection_string = f'mssql+pyodbc://{username}:{password}@{server}:{port}/{database}?driver=ODBC+Driver+18+for+SQL+Server&TrustServerCertificate=yes'

db = SQLDatabase.from_uri(connection_string)
⚠️
I added TrustServerCertificate=yes just because I've had some issues while connecting using SSL. It might not be needed in your case.

Preparing our LLM & Chain

from langchain.llms import OpenAI
from langchain_experimental.sql import SQLDatabaseChain

llm = OpenAI(model_name="gpt-3.5-turbo-16k-0613", temperature=0, verbose=True)
db_chain = SQLDatabaseChain.from_llm(llm, db, verbose=True)

Ok, so as we've seen earlier, we're setting up our llm model by specifying the model_name, temperature and verbose parameters. Using SQLDatabaseChain tells LangChain to generate and execute the SQL simultaneously when we run this.

The temperature refers to the randomness of the model's output, 0 is the lowest value. Verbose on the other hand just prints out what LangChain is doing, instead of just generating the final answer to our query. Setting it to True will show the SQL query.

⚠️
Also note that i'm using gpt-3.5-turbo-16k-0613 since it has a higher token capacity. I was getting token limitations errors with other models. This also might not be the case for you.

Finally, we execute:

db_chain.run("How many total users?")

Finally, in your terminal you can execute the code by typing python app.py. And would you look at that, it works like magic! 🪄

(Here's the overly used emoji in any AI-powered app nowadays: 🪄)

LangChain and LLM Converting Natural Language to SQL Statement

Great. I tried this with much more complex queries that included some nested SELECT statements and many joins. And most of the time, it works!

Source code for the geeks

👇
Instantly access the source code below by subscribing for zero dollars.

Final thoughts

In this tutorial, I covered the basics of setting up LangChain and creating chains with SQL and OpenAI's LLM.

LangChain is a powerful tool that opens up a world of possibilities in building large language model powered apps. I believe we're going to see much more LLM apps in the near future, all sorts of startups and SaaS offerings.

💡
Important: Get familiar with Prompt and SQL Injection. This post is a must read if you're building a LLM app.

Given that it's an easy-to-use framework with lots of capabilities, you can create autonomous agents, personal assistants, chatbots, and much more. I recommend you start playing around with LangChain today and recommend you also look into LlamaIndex. I've written a nice series of posts about the topic here:

  1. Introduction to Augmenting LLMs with Private Data using LlamaIndex
  2. Using Data Connectors to Build a Custom ChatGPT for Private Documents
  3. A Closer Look into Storage Customization, Persisting and Loading Data

Thank you for reading this tutorial. ❤️ As.you know, writing content takes a lot of time, and I do this so that we can both benefit. Therefore, if you enjoyed this post, go ahead and subscribe, it's completely free, and it takes less than a minute. You'll get early access to all my posts as well as my free newsletter and zero spam. Guaranteed.

I hope that the material I covered comes in handy in your journey of building amazing LLM-powered products! Let's connect on X (Formerly Twitter) and let me know if you have any comments below.

Happy coding!


Further readings

More from Getting Started with AI

More from the Web