Parse & Extract

Overview

This tutorial walks you through parsing a document with the API and then extracting a subset of fields from the parsed output. Each endpoint has its own script, so you can skip the extraction steps if you don’t need them.

Scenario and Materials

Parse this PDF: Wire Transfer Form
Extract these fields: Bank Name and Bank Account Number
JSON extraction schema: Schema for Wire Transfer

1. Parse and Save Content as a Markdown File

First, run the script below to parse the document and save the response to a Markdown file (similar to Markdown for Wire Transfer).

import requests

headers = {
    'Authorization': 'Bearer YOUR_API_KEY'
}

url = 'https://api.va.landing.ai/v1/ade/parse'

# Upload the document; the with-block closes the file even if the request fails
with open('wire-transfer.pdf', 'rb') as document:
    files = {'document': document}
    data = {'model': 'dpt-2-latest'}
    response = requests.post(url, files=files, data=data, headers=headers)

# Stop on HTTP errors (for example, an invalid API key) before reading the body
response.raise_for_status()
response_data = response.json()

# Print the full response
print(response_data)

# Extract and save the markdown content
if 'markdown' in response_data:
    markdown_content = response_data['markdown']

    # Save markdown content to file
    with open('markdown-wire-transfer.md', 'w', encoding='utf-8') as f:
        f.write(markdown_content)

    print("\nMarkdown content saved to markdown-wire-transfer.md.")
else:
    print("No 'markdown' field found in the response")
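
Both scripts in this tutorial hard-code the API key in the headers dictionary. As a minimal sketch of one alternative, the helper below reads the key from an environment variable instead. The function name build_headers and the variable name ADE_API_KEY are hypothetical and not part of the API; only the Bearer header format comes from the script above.

```python
import os


def build_headers(env_var: str = "ADE_API_KEY") -> dict:
    """Build the Authorization header from an environment variable.

    ADE_API_KEY is a hypothetical name chosen for this sketch.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        # Fail early with a clear message instead of sending an empty key
        raise RuntimeError(f"Set the {env_var} environment variable to your API key.")
    return {"Authorization": f"Bearer {api_key}"}
```

With this in place, you could replace the headers dictionary in either script with headers = build_headers().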

2. Create a JSON Extraction Schema

As a reminder, we want to extract these fields from the Wire Transfer form: Bank Name and Bank Account Number. To do this, create a JSON extraction schema that defines these fields and save it as schema-wire-transfer.json. The extraction script in the next step reads that file. You can use the schema below or download it here: Schema for Wire Transfer.

{
  "type": "object",
  "properties": {
    "bankName": {
      "title": "Bank Name",
      "description": "The name of the beneficiary bank as listed in the wire transfer form.",
      "type": "string"
    },
    "bankAccountNumber": {
      "title": "Bank Account Number",
      "description": "The account number of the beneficiary bank as listed in the wire transfer form.",
      "type": "number"
    }
  },
  "required": [
    "bankName",
    "bankAccountNumber"
  ]
}
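
Before sending a schema to the API, it can help to confirm locally that the file is valid JSON and that every field listed under "required" is also defined under "properties". The sketch below is one way to do that. The function name check_schema is hypothetical, and these checks only mirror the structure of the schema shown above; they are not the API's own validation rules.

```python
import json


def check_schema(schema_text: str) -> list:
    """Return a list of problems found in a JSON extraction schema (empty if none)."""
    try:
        schema = json.loads(schema_text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]

    if not isinstance(schema, dict):
        return ["schema should be a JSON object"]

    problems = []
    properties = schema.get("properties", {})
    # Every required field should have a matching definition in "properties"
    for name in schema.get("required", []):
        if name not in properties:
            problems.append(f'required field "{name}" is missing from "properties"')
    return problems
```

To use it, read schema-wire-transfer.json into a string and pass that string to check_schema; an empty list means neither check found a problem.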

3. Use the Extraction Schema to Extract Data from the Markdown File

Now that we have the parsed output in a Markdown file and a JSON extraction schema, we’re ready to extract these fields: Bank Name and Bank Account Number. To do this, run the script below.

import requests

headers = {
    'Authorization': 'Bearer YOUR_API_KEY'
}

url = 'https://api.va.landing.ai/v1/ade/extract'

# Read the schema file as a string
with open('schema-wire-transfer.json', 'r', encoding='utf-8') as f:
    schema_content = f.read()

# Send the Markdown file and the schema; the with-block closes the file afterwards
with open('markdown-wire-transfer.md', 'rb') as markdown_file:
    files = {'markdown': markdown_file}
    data = {'schema': schema_content, 'model': 'extract-latest'}

    # Run extraction
    response = requests.post(url, files=files, data=data, headers=headers)

# Print the results
print(response.json())

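The extracted fields and other metadata are included in the API response. The output below is the Python dictionary printed by the script, which is why it uses single quotes and None rather than JSON syntax: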

{
   'extraction':{
      'bankName':'JPMorgan Chase Bank, N.A.',
      'bankAccountNumber':4578923456789012
   },
   'extraction_metadata':{
      'bankName':{
         'value':'JPMorgan Chase Bank, N.A.',
         'references':[
            '7c56b114-cc66-4fe4-99cb-9425a5210747'
         ]
      },
      'bankAccountNumber':{
         'value':4578923456789012,
         'references':[
            '7c56b114-cc66-4fe4-99cb-9425a5210747'
         ]
      }
   },
   'metadata':{
      'filename':'markdown-wire-transfer.md',
      'org_id':None,
      'duration_ms':1018,
      'credit_usage':0.6,
      'version':'latest'
   }
}
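
The response above pairs each extracted value with a list of reference IDs under extraction_metadata. As a rough sketch of how you might read those fields in your own code, the helper below collects each field's value and references into one row. The function name summarize_extraction is hypothetical, and the sample dictionary is a trimmed copy of the response shown above rather than a live API result.

```python
def summarize_extraction(response_data: dict) -> list:
    """Return (field, value, references) tuples from an extract response."""
    extraction = response_data.get("extraction", {})
    metadata = response_data.get("extraction_metadata", {})
    rows = []
    for field, value in extraction.items():
        # Fall back to an empty list if a field has no metadata entry
        references = metadata.get(field, {}).get("references", [])
        rows.append((field, value, references))
    return rows


# Trimmed copy of the sample response from this tutorial
sample = {
    "extraction": {
        "bankName": "JPMorgan Chase Bank, N.A.",
        "bankAccountNumber": 4578923456789012,
    },
    "extraction_metadata": {
        "bankName": {
            "value": "JPMorgan Chase Bank, N.A.",
            "references": ["7c56b114-cc66-4fe4-99cb-9425a5210747"],
        },
        "bankAccountNumber": {
            "value": 4578923456789012,
            "references": ["7c56b114-cc66-4fe4-99cb-9425a5210747"],
        },
    },
}

for field, value, references in summarize_extraction(sample):
    print(f"{field}: {value} (references: {', '.join(references)})")
```

In the extraction script, you would pass response.json() to summarize_extraction instead of the sample dictionary.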
