How to Extract JSON from Documents with Waveline

Waveline
Jul 24, 2023

A step-by-step guide to using Waveline’s API to extract structured data from PDFs. For a full code example, check out the accompanying Colab notebook.

Get an API key

  1. Navigate to waveline.ai
  2. Create an account: press “Getting Started” and sign up with your Google account. Add a billing method (don’t worry, you are still on the Free tier and can cancel at any time), then save the API key.
  3. You can manage your API keys on the Dashboard.

api_key = "replace-with-your-api-key"

Submit a Job

You can find the complete Documentation of the API on our website. In this case, we want to extract the first_name, last_name, and total of the following invoice:

https://vwxzjwxlflvltwsntpsb.supabase.co/storage/v1/object/public/testing/invoice1.pdf

Create the necessary Shape

Next, we create the Shape that defines what we want to extract and in what structure. We guarantee that the returned JSON will follow your specified structure and types. Since we use Large Language Models (LLMs) in the background, it helps to add a description to each field. Write it as if you were explaining to a human what you would like to extract.

[
  {
    "name": "first_name",
    "type": "string",
    "description": "The first name of the one who receives the bill.",
    "isArray": false
  },
  {
    "name": "last_name",
    "type": "string",
    "description": "The last name of the one who receives the bill.",
    "isArray": false
  },
  {
    "name": "total",
    "type": "number",
    "description": "Total amount to pay",
    "isArray": false
  }
]
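In Python, the same Shape can be written as a list of dictionaries and passed as the shape value of the request. A minimal sketch: the field helper below is a hypothetical convenience function for this guide, not part of the Waveline API.

```python
import json


def field(name: str, type_: str, description: str, is_array: bool = False) -> dict:
    """Build one Shape entry (hypothetical helper, not a Waveline function)."""
    return {
        "name": name,
        "type": type_,
        "description": description,
        "isArray": is_array,
    }


# The same three fields as in the JSON above.
shape = [
    field("first_name", "string", "The first name of the one who receives the bill."),
    field("last_name", "string", "The last name of the one who receives the bill."),
    field("total", "number", "Total amount to pay"),
]

# Printing the Shape as JSON makes it easy to compare with the version above.
print(json.dumps(shape, indent=2))
```

Keeping the descriptions next to the field names makes it easier to adjust them later if a field is not extracted the way you expect.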

Write the request

With this Shape, we can now submit the Job. There are three main ways to pass the PDF to Waveline:

  1. ContentUrl
  2. Base64 Encoded (<4.5MB)
  3. Plain Text (<4.5MB)
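For the second option, the file first has to be Base64-encoded and checked against the size limit. Here is a small sketch using only the Python standard library. It assumes that the 4.5MB limit applies to the raw file and means 4.5 million bytes. Both are assumptions, so check the API documentation. The function name encode_pdf is hypothetical.

```python
import base64

# Assumption: "<4.5MB" is read as fewer than 4.5 * 10^6 bytes of raw file data.
MAX_INLINE_BYTES = 4_500_000


def encode_pdf(data: bytes, limit: int = MAX_INLINE_BYTES) -> str:
    """Return the Base64 text for a PDF, or raise if it is too large to inline."""
    if len(data) >= limit:
        raise ValueError(
            f"file is {len(data)} bytes; inline uploads must be below {limit} bytes"
        )
    # b64encode returns bytes; decode to str so it can be placed in a JSON body.
    return base64.b64encode(data).decode("ascii")
```

The name of the request field that carries the encoded string is not shown in this guide, so look it up in the API documentation. Larger files can be passed by URL instead.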
# Submit the Job, passing the PDF by URL (option 1).
# "shape" is the Shape from the previous step as a Python list of dicts.
import requests

post_response = requests.post(
    url="https://waveline.ai/api/v1/extract-document",
    headers={
        "Authorization": "Bearer " + api_key
    },
    json={
        "fileName": "invoice1.pdf",
        "contentType": "application/pdf",
        "contentUrl": "https://vwxzjwxlflvltwsntpsb.supabase.co/storage/v1/object/public/testing/invoice1.pdf",
        "shape": shape,
    }
)

We get something back like this:

{
  "id": "4c81e644-6e43-4c51-a5b4-12469a155da9",
  "createdAt": "2023-07-21T12:44:16.395Z",
  "status": "CREATED",
  "type": "extract",
  "pages": 2,
  "fileName": "invoice1.pdf",
  "result": null,
  "urls": {
    "get": "https://waveline.ai/api/v1/jobs/4c81e644-6e43-4c51-a5b4-12469a155da9"
  }
}

Get the result

We see that the status of the job is “CREATED”. Waveline’s queuing system then picks up the task, sets it to “RUNNING” while your request is being processed, and finally to “COMPLETED” once it is finished. To get the status or the result, send a GET request to “https://waveline.ai/api/v1/jobs/<your_job_id>”. To make things easier, the response to your POST request already contains this URL in the “get” field of “urls”.

Note that this request also has to be authenticated with your API key.

# Fetch the current state of the Job using the URL returned by the POST request.
get_response = requests.get(
    post_response.json()["urls"]["get"],
    headers={"Authorization": "Bearer " + api_key}
)
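Because the Job moves from “CREATED” through “RUNNING” to “COMPLETED”, a single GET request may arrive too early. Here is a minimal polling sketch. The function wait_for_job, its poll interval, and its timeout are illustrative choices, not values recommended by Waveline. The function only treats “COMPLETED” as final, because no other end states are described in this guide.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_job: Callable[[], dict],
    poll_interval: float = 2.0,
    timeout: float = 120.0,
) -> dict:
    """Call fetch_job() repeatedly until the Job reports COMPLETED.

    fetch_job is any zero-argument callable that returns the parsed Job JSON,
    which keeps this function independent of the HTTP client being used.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_job()
        if job["status"] == "COMPLETED":
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"job {job.get('id')} is still {job['status']} after {timeout} seconds"
            )
        # Wait before asking again so the API is not flooded with requests.
        time.sleep(poll_interval)
```

With the requests library, fetch_job can be a lambda that wraps the GET request from this section and returns its .json() value.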

We then get back the result:

{
  "id": "4c81e644-6e43-4c51-a5b4-12469a155da9",
  "createdAt": "2023-07-21T12:44:16.395Z",
  "status": "COMPLETED",
  "type": "extract",
  "pages": 2,
  "fileName": "invoice1.pdf",
  "result": {
    "total": 330.75,
    "last_name": "Timond",
    "first_name": "Ben"
  },
  "urls": {
    "get": "https://waveline.ai/api/v1/jobs/4c81e644-6e43-4c51-a5b4-12469a155da9"
  }
}
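If you want to double-check on the client side that the result matches the Shape you submitted, a simple type check is enough. This sketch assumes that the Shape types “string” and “number” map to Python str and int/float. Other type names are skipped because this guide does not show them. The function check_result is hypothetical and not part of any Waveline client.

```python
def check_result(result: dict, shape: list) -> list:
    """Return a list of problems; an empty list means the result fits the Shape."""
    # Assumed mapping from Shape type names to Python types.
    python_types = {"string": str, "number": (int, float)}
    problems = []
    for spec in shape:
        name = spec["name"]
        if name not in result:
            problems.append(f"missing field: {name}")
            continue
        expected = python_types.get(spec["type"])
        if expected is None:
            continue  # type name not covered by this sketch
        value = result[name]
        if spec.get("isArray"):
            if not isinstance(value, list):
                problems.append(f"{name}: expected a list")
                continue
            values = value
        else:
            values = [value]
        for item in values:
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(item, bool) or not isinstance(item, expected):
                problems.append(f"{name}: unexpected value {item!r}")
    return problems


shape = [
    {"name": "first_name", "type": "string", "isArray": False},
    {"name": "last_name", "type": "string", "isArray": False},
    {"name": "total", "type": "number", "isArray": False},
]
result = {"total": 330.75, "last_name": "Timond", "first_name": "Ben"}
print(check_result(result, shape))  # prints []
```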

Wrapping up

We went through how to use the Waveline API to extract specific data from an unstructured PDF. For further information, you can always shoot us an email at team@waveline.ai or check out our website: waveline.ai
