How to use LangChain output parsers to structure large language model responses

If you're wondering how you can convert the text returned by an LLM to a Pydantic (JSON) model in your Python app, this post is for you.

👋
New to LangChain? Start with this introductory post first. It'll give you a great overview of everything you need to know before diving in. Come back when you're done!
⚠️
Nov. 2023 Update: OpenAI recently announced JSON mode that ensures output is always in JSON format. You can read more on OpenAI's official site.

Large Language Models (LLMs) generate text, but when you're building an application you'll sometimes need to work with structured data instead of strings. LangChain provides Output Parsers, which can help us do just that.

We will go over the Pydantic (JSON) Parser provided by LangChain.

There are more parsers available, but I'll leave those out of this post. If you'd like to know more about any of them make sure to let me know by dropping a comment at the end of this post!

⚠️
This is an introductory post, so I won't dive into complex scenarios or customizations. Instead, we'll look at the most basic implementation of the Pydantic (JSON) Parser.

Why Parse Data?

This one is obvious, but we'll answer it anyway. Parsing converts raw text into typed, structured values that your code can work with directly, instead of strings that you'd have to pick apart by hand.

Suppose you wanted to do a simple arithmetic operation on two integers that arrive as text: you'd first need to convert each string into an integer. There are other benefits to clean, structured data as well. The most obvious is how easily it fits into your existing models and databases.
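As a small illustration of that point, here is a sketch with made-up values. Numbers that arrive as text have to be converted before arithmetic behaves as expected, and parsing a JSON string gives you typed values in one step:

```python
import json

# Values as they might appear in raw model text: everything is a string.
party_size_text = "2"
extra_guests_text = "3"

# Without conversion, "+" concatenates the two strings.
concatenated = party_size_text + extra_guests_text

# After conversion, "+" performs integer addition.
total_guests = int(party_size_text) + int(extra_guests_text)

# Parsing a JSON string yields typed values directly.
parsed = json.loads('{"party_size": 2, "cuisine": "any"}')

print(concatenated)                # 23
print(total_guests)                # 5
print(type(parsed["party_size"]))  # <class 'int'>
```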

Making a Reservation

Back in May of 2023, I published a post about interacting with computers using natural language where I asked ChatGPT to make me a fictional reservation at a restaurant and to respond with a JSON object instead of plain old text.

Long story short, here's the output from that post:

{
  "intent": "book_reservation",
  "parameters": {
    "date": "2023-05-05",
    "time": "18:00:00",
    "party_size": 2,
    "cuisine": "any"
  }
}

While we can ask the LLM to return JSON and give it the format explicitly (just like we did in the previous post), it's important to recognize that this may not always work: the model might hallucinate, for example by adding extra text around the JSON or by returning fields that don't match the requested format.
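To make that failure mode concrete, here is a minimal sketch using only the standard library. The `raw_reply` string is a made-up example of a chatty model response, and `extract_json` is a hypothetical helper written for this illustration, not a LangChain function.

```python
import json
import re

# Made-up example of a reply that wraps the JSON in extra prose.
raw_reply = """Sure! Here is your reservation:
{"intent": "book_reservation", "parameters": {"party_size": 2, "cuisine": "any"}}
Let me know if you need anything else."""


def extract_json(text: str) -> dict:
    """Hypothetical helper: parse the span from the first '{' to the last '}'."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in the reply")
    return json.loads(match.group(0))


# Parsing the raw reply directly fails because of the surrounding prose.
try:
    json.loads(raw_reply)
    direct_parse_ok = True
except json.JSONDecodeError:
    direct_parse_ok = False

data = extract_json(raw_reply)
print(direct_parse_ok)                   # False
print(data["parameters"]["party_size"])  # 2
```

Hand-written extraction like this gets brittle quickly, which is part of the appeal of letting an output parser handle it.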

Preparing our Query Template

Okay, let's use the same query from the previous post, but instead of requesting that the response be in JSON format, we'll add a {format_instructions} placeholder as shown below:

reservation_template = '''
  Book us a nice table for two this Friday at 6:00 PM. 
  Choose any cuisine, it doesn't matter. Send the confirmation by email.

  Our location is: {query}

  Format instructions:
  {format_instructions}
'''

Great, we have our beautiful query. Below, we're going to see how LangChain will automatically take care of populating the {format_instructions} placeholder.

The Pydantic (JSON) Parser

In order to tell LangChain that we'll need to convert the text to a Pydantic object, we'll need to define the Reservation object first. So, let's jump right in:

from pydantic import BaseModel, Field

class Reservation(BaseModel):
    date: str = Field(description="reservation date")
    time: str = Field(description="reservation time")
    party_size: int = Field(description="number of people")
    cuisine: str = Field(description="preferred cuisine")

Pydantic Representation of the Model

Great, we now have the Reservation object and its parameters. Let's tell LangChain that we'll need a parser that'll convert a given input into that Reservation object:

from langchain.output_parsers import PydanticOutputParser
parser = PydanticOutputParser(pydantic_object=Reservation)

Setting up the Prompt Template

We're now going to set up our prompt template as such:

from langchain.prompts import PromptTemplate

prompt = PromptTemplate(
    template=reservation_template,
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

Notice the partial_variables={"format_instructions": parser.get_format_instructions()} line? This tells LangChain to replace the {format_instructions} placeholder in our template above with formatting instructions that describe the structure (JSON schema) of the Pydantic Reservation object that we created.
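If it helps to picture what a partial variable does, the sketch below mimics the idea with plain Python string formatting. This is only an analogy under my own assumptions, not LangChain's implementation, and the template and instruction strings are shortened stand-ins.

```python
from functools import partial

# Shortened stand-in for the reservation template.
template = "Our location is: {query}\n\nFormat instructions:\n{format_instructions}"

# Stand-in for the text returned by parser.get_format_instructions().
format_instructions = "Return a JSON object with the keys date, time, party_size, cuisine."

# "Partial" step: bind format_instructions now, leave query for later.
fill_prompt = partial(template.format, format_instructions=format_instructions)

# Later, only the remaining variable needs to be supplied.
prompt_text = fill_prompt(query="San Francisco, CA")
print(prompt_text)
```

The point of binding one variable early is that the caller only has to think about the part that changes from request to request.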

Let's add our location and see what LangChain does to our original query behind the scenes.

_input = prompt.format_prompt(query="San Francisco, CA")

Let's see what our query looks like now:

print(_input.to_string())
>> Book us a nice table for two this Friday at 6:00 PM. 
>> Choose any cuisine, it doesn't matter. Send the confirmation by email.
>> 
>> Our location is: San Francisco, CA
>> 
>> Format instructions:
>> The output should be formatted as a JSON instance that conforms to the JSON schema below.
>> 
>> As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
>> the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.
>> 
>> Here is the output schema:
>> ```
>> {"properties": {"date": {"title": "Date", "description": "reservation date", "type": "string"}, "time": {"title": "Time", "description": "reservation time", "type": "string"}, "party_size": {"title": "Party Size", "description": "number of people", "type": "integer"}, "cuisine": {"title": "Cuisine", "description": "preferred cuisine", "type": "string"}}, "required": ["date", "time", "party_size", "cuisine"]}
>> ```

Great, as you can see LangChain did a lot of work for us. It automatically converted the Pydantic object that we created into a string that is used to define the structure of the response for the LLM.
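For intuition about that conversion, here is a rough standard-library sketch that builds a similar schema from a dataclass. It is not how Pydantic or LangChain generate schemas internally; `ReservationSketch`, `JSON_TYPES`, and `to_json_schema` are hypothetical names used only for this example.

```python
import json
from dataclasses import dataclass, field, fields

# Hypothetical mapping from Python types to JSON Schema type names.
JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


@dataclass
class ReservationSketch:
    date: str = field(metadata={"description": "reservation date"})
    time: str = field(metadata={"description": "reservation time"})
    party_size: int = field(metadata={"description": "number of people"})
    cuisine: str = field(metadata={"description": "preferred cuisine"})


def to_json_schema(cls) -> dict:
    """Build a minimal JSON-Schema-like dict from a dataclass definition."""
    properties = {
        f.name: {
            "description": f.metadata["description"],
            "type": JSON_TYPES[f.type],
        }
        for f in fields(cls)
    }
    return {"properties": properties, "required": [f.name for f in fields(cls)]}


schema = to_json_schema(ReservationSketch)
print(json.dumps(schema))
```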

But that's not all. After we query the model, we can use LangChain's parser to automatically convert the text response that we got from the model into the Reservation object.

Here's how we do this:

# We query the model first
# (`model` is assumed to be a LangChain LLM instance created earlier)
output = model(_input.to_string())

# We parse the output
reservation = parser.parse(output)

Awesome, let's print the reservation fields (and data types) by iterating over each element:

# Iterating over a Pydantic model yields (field name, value) pairs
for name, value in reservation:
    print(f"{name}: {value},  {type(value)}")

Here's the output:

>> date: Friday,  <class 'str'>
>> time: 6:00 PM,  <class 'str'>
>> party_size: 2,  <class 'int'>
>> cuisine: Any,  <class 'str'>

Notice that party_size is of type int now. We can also access the party_size attribute directly: reservation.party_size.
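Conceptually, the parsing step boils down to decoding the JSON text and coercing each value to the declared field type. The sketch below shows that idea with the standard library only; `ParsedReservation` and `parse_reservation` are hypothetical names, and the real parser relies on Pydantic validation rather than this hand-written loop.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class ParsedReservation:
    date: str
    time: str
    party_size: int
    cuisine: str


def parse_reservation(text: str) -> ParsedReservation:
    """Decode JSON text and coerce each value to the declared field type."""
    data = json.loads(text)
    values = {}
    for f in fields(ParsedReservation):
        if f.name not in data:
            raise ValueError(f"missing field: {f.name}")
        values[f.name] = f.type(data[f.name])  # e.g. int("2") -> 2
    return ParsedReservation(**values)


# Made-up model output in which party_size arrives as a string.
model_output = '{"date": "Friday", "time": "6:00 PM", "party_size": "2", "cuisine": "Any"}'
parsed_reservation = parse_reservation(model_output)
print(parsed_reservation.party_size, type(parsed_reservation.party_size))
```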

Other Output Parsers

As I mentioned earlier in the post, LangChain provides many more output parsers that you can use depending on your specific use case. The same logic of what is happening behind the scenes applies to most of them.

A couple of interesting parsers are the Retry and Auto-fixing parsers. The retry parser re-queries the model for an answer that fits the parser's requirements, and the auto-fixing parser kicks in when another output parser fails, attempting to fix the malformed output. If you'd like me to cover these parsers in a future post, please let me know in the comments below!
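In the meantime, the retry idea can be sketched in a few lines. This is a simplified illustration under my own assumptions, not LangChain's implementation: `fake_model`, `parse_with_retry`, and `max_attempts` are hypothetical names, and the fake model simply returns canned replies so the example runs without an API key.

```python
import json
from typing import Callable

# Canned replies: the first is not valid JSON, the second is.
_replies = iter(["Sure, table booked for two!", '{"party_size": 2, "cuisine": "any"}'])


def fake_model(prompt: str) -> str:
    """Stand-in for an LLM call; returns the next canned reply."""
    return next(_replies)


def parse_with_retry(model: Callable[[str], str], prompt: str, max_attempts: int = 3) -> dict:
    """Re-query the model until its reply parses as JSON or attempts run out."""
    last_error = None
    for _ in range(max_attempts):
        reply = model(prompt)
        try:
            return json.loads(reply)
        except json.JSONDecodeError as error:
            last_error = error
            # Feed the failure back so the next attempt can correct it.
            prompt = f"{prompt}\n\nYour previous reply was not valid JSON: {reply}"
    raise ValueError(f"no valid JSON after {max_attempts} attempts") from last_error


result = parse_with_retry(fake_model, "Book a table for two. Reply in JSON.")
print(result["party_size"])  # 2
```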

Final Thoughts

In a nutshell, integrating LangChain's Pydantic Output Parser into your Python application makes working programmatically with the text returned from a Large Language Model easy.

It also helps you structure the data in a way that can easily be integrated with existing models and databases.

Thanks for reading!

