Extract Data from Documents with ChatGPT

Guide on how to extract data from documents like PDFs using Large Language Models (LLMs)

Waveline
4 min read · Jul 19, 2023

Transform the PDF into plain text

There are two main approaches to getting text out of a PDF:

  1. Optical Character Recognition (OCR)
  2. Parsing

With OCR, the PDF is scanned at the pixel level to identify the individual characters and words, usually with a machine learning model trained to recognize them. With parsing, the written words are extracted directly from the internal structure and metadata of the PDF; conventional algorithms are enough for a simple baseline.
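
If you want to do the parsing step in code, here is a minimal sketch in Python. It assumes the pypdf library and a local file called invoice.pdf, neither of which is part of the original example:

from pypdf import PdfReader

# Read the PDF and concatenate the text of every page into one plain-text string
reader = PdfReader("invoice.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(text)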

If you don’t want to write any code, a free online converter also does the job for this guide. Here are some examples: pdf2go.com, smallpdf.com, pdftotext.com

Text: WAVELINE Bahnhofstrasse 456 Zurich, Zurich 8031 Switzerland WAVELINE Billed To Ben Timond Market Street 234 San Francisco, California 94772 United States Date Issued Invoice Number May 25, 2023 INV-75537 Amount Due $330.75 Due Date Jun 24, 2023 DESCRIPTION Custom Avocado chair RATE QTY AMOUNT $250.00 1 $250.00 1 $100.00 1 $0.00 +Tax Mistery Box $100.00 +Tax Lifetime supply of Orange juice $0.00 +Tax Subtotal $350.00 Discount 10.00% -$35.00 Tax 5.00% +$15.75 Total $330.75 Balance Due $330.75

Use an LLM to extract the desired information

Next, we design a prompt that tells the LLM which data we want to extract. We additionally make sure that it outputs the result as JSON.

You extract data from the text provided below into a JSON object 
of the shape provided below.

Shape:
{
total: number // total amount due,
invoice_number: string // invoice number,
billed_to: string // name of the person that needs to pay the invoice
}

Text: WAVELINE Bahnhofstrasse 456 Zurich, Zurich 8031 Switzerland WAVELINE Billed To Ben Timond Market Street 234 San Francisco, California 94772 United States Date Issued Invoice Number May 25, 2023 INV-75537 Amount Due $330.75 Due Date Jun 24, 2023 DESCRIPTION Custom Avocado chair RATE QTY AMOUNT $250.00 1 $250.00 1 $100.00 1 $0.00 +Tax Mistery Box $100.00 +Tax Lifetime supply of Orange juice $0.00 +Tax Subtotal $350.00 Discount 10.00% -$35.00 Tax 5.00% +$15.75 Total $330.75 Balance Due $330.75
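
Sending this prompt to the model takes only a few lines. The following sketch uses the OpenAI Python client (v1 interface); the model name, the temperature, and the API-key environment variable are our own choices, not part of the original example:

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

prompt = "..."  # the full prompt shown above, including the Shape and the Text

response = client.chat.completions.create(
    model="gpt-4",
    temperature=0,  # deterministic output is helpful for extraction tasks
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)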

Hurray 🎉 Querying GPT-4, we get the following JSON back:

{ 
"total": 330.75,
"invoice_number": "INV-75537",
"billed_to": "Ben Timond"
}

Things to watch out for

Data extraction might not always work as intended. Here are some things to watch out for.

Bad OCR/Parsing
Depending on your input, an OCR approach can be better than a parsing approach and vice versa. The quality of this first step propagates to your end result: if the conversion drops text or writes down the wrong characters, the LLM never sees the correct information. Mature OCR engines such as Tesseract can be leveraged to increase quality.
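
For the OCR route, a minimal sketch looks like this. It assumes the pdf2image and pytesseract packages (which in turn require the poppler and Tesseract binaries to be installed); this tooling choice is ours, not the original guide’s:

from pdf2image import convert_from_path
import pytesseract

# Render each PDF page as an image, then OCR each image
pages = convert_from_path("invoice.pdf")
text = "\n".join(pytesseract.image_to_string(page) for page in pages)
print(text)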

Hallucinations
If the information can’t be found in the provided text (for example because the PDF-to-text conversion missed parts of the document at the parsing/OCR step, or because the information was never in the PDF in the first place), LLMs tend to invent or guess it. We need to make sure that this does not happen.

For instance, although the gender was never specified in the document, the model hallucinated one. A common approach is to give the LLM an easy way out, for example by adding to the prompt:

If the provided information is not explicitly written, write UNSURE
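
Downstream, this escape value is easy to check for. A small sketch, assuming the response object from the API call above and the simple flat shape we used:

import json

data = json.loads(response.choices[0].message.content)

# Flag any field the model marked as UNSURE instead of passing it on silently
missing = [key for key, value in data.items() if value == "UNSURE"]
if missing:
    print("Fields not found in the document:", missing)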

Context window size
Every LLM can only process a certain number of tokens at once. The number of input tokens plus the number of output tokens needs to fit into this context window. For GPT-4, it is 8k tokens, which is more than enough for our example. Otherwise, we would have to split the extraction into multiple LLM calls. Be careful not to split a table into two parts, for example, where the second part no longer has the header row and the LLM has no clue what each column represents.
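
To know whether a split is needed at all, you can count tokens before calling the model. A rough sketch using OpenAI’s tiktoken tokenizer; the limit below is a conservative assumption that leaves room for the prompt template and the output:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode(text)  # `text` is the plain text from the parsing/OCR step

MAX_INPUT_TOKENS = 6000
if len(tokens) > MAX_INPUT_TOKENS:
    # Naive split; in practice, split on page or section boundaries so that
    # table headers stay together with their rows
    chunks = [enc.decode(tokens[i:i + MAX_INPUT_TOKENS])
              for i in range(0, len(tokens), MAX_INPUT_TOKENS)]
else:
    chunks = [text]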

Output Structure Consistency
The LLM might not always output the desired JSON. Sometimes it adds filler text such as “Absolutely, here is the provided…”, or the returned JSON is wrongly formatted, especially if your shape is more complicated.

In one of our tests with a more complex shape, the transactions were returned as objects within objects instead of an array of objects. We should therefore double-check the structure of the LLM output.
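
A small sketch of the kind of double-check we mean, using the flat invoice shape from above (for nested shapes such as a list of transactions, a schema library like pydantic does this more thoroughly):

import json

def parse_llm_json(raw: str) -> dict:
    # Strip filler text around the JSON by locating the outermost braces
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("No JSON object found in the LLM output")
    data = json.loads(raw[start:end + 1])
    # Verify the shape we asked for before using the result
    if not isinstance(data.get("total"), (int, float)):
        raise ValueError("`total` is missing or not a number")
    if not isinstance(data.get("invoice_number"), str):
        raise ValueError("`invoice_number` is missing or not a string")
    return data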

Conclusion

Language models now have amazing capabilities that allow you to extract specific information from documents, and you can quickly test whether this is useful for your use case. There are some pitfalls around reliability and quality; these can be accounted for, but they require engineering effort.

If you don’t want to deal with the hassle and want a service that just works, give us a shot at waveline.ai!

Happy Extracting :)
