You can extract data when you call the API directly.
To do this, first create a JSON schema of the fields you want to extract. You can use the Playground to help you create the schema.
After creating the schema, you can either pass the JSON file in the API call or define the JSON schema directly in your script.
You can also classify documents before extracting data, and then extract data based on the document type.
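As a rough illustration of the schema-creation step described above, the sketch below builds a small schema as a Python dictionary and saves it to a JSON file that could later be passed to the API. The field names (employee_name, wages) and the file name example-schema.json are hypothetical placeholders, not names the API requires.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical schema: these field names are placeholders for illustration only
schema = {
    "type": "object",
    "properties": {
        "employee_name": {"type": "string"},
        "wages": {"type": "number"},
    },
    "required": ["employee_name", "wages"],
}

# Write the schema to a JSON file (a temporary folder is used so the sketch runs anywhere)
schema_path = Path(tempfile.mkdtemp()) / "example-schema.json"
schema_path.write_text(json.dumps(schema, indent=2))

# Reload the file to confirm that it contains valid JSON
loaded_schema = json.loads(schema_path.read_text())
print(loaded_schema["required"])
```

In practice you would save the file next to your documents and point your script at that path instead of a temporary folder.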
Supported Data Types for Fields
When extracting data with the API or library, you can specify the following data types for fields:
bool
bytes
float
int
nested objects
str
list: This data type from the typing library is supported if the types within the list are valid.
union: This data type from the typing library is supported if the types within the union are valid.
These data types are not supported: set, tuple, dict, optional, none.
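The rules above can be sketched as a small local check that walks a type annotation and reports whether it uses only the supported types. The helper is_supported and the Address and Person classes are hypothetical names made up for this illustration; they are not part of the API or the library, and the library's own validation may behave differently.

```python
from dataclasses import dataclass
from typing import List, Optional, Union, get_args, get_origin, get_type_hints

# Scalar types listed as supported in this section
SUPPORTED_SCALARS = (bool, bytes, float, int, str)


def is_supported(tp) -> bool:
    """Return True if a type annotation uses only the supported data types."""
    if tp in SUPPORTED_SCALARS:
        return True
    origin = get_origin(tp)
    if origin is list:
        # A list is valid only if it declares inner types and all of them are valid
        args = get_args(tp)
        return bool(args) and all(is_supported(arg) for arg in args)
    if origin is Union:
        # Optional[X] is Union[X, None], so it is rejected because of the None member
        return all(is_supported(arg) for arg in get_args(tp))
    if isinstance(tp, type) and getattr(tp, "__annotations__", None):
        # Nested object: every annotated field must itself be supported
        return all(is_supported(hint) for hint in get_type_hints(tp).values())
    return False


@dataclass
class Address:  # Hypothetical nested object
    street: str
    zip_code: int


@dataclass
class Person:  # Hypothetical top-level object
    name: str
    emails: List[str]
    address: Address


print(is_supported(Person))
print(is_supported(Union[int, str]))
print(is_supported(Optional[int]))
print(is_supported(dict))
```

This sketch handles only the typing.List and typing.Union spellings named in this section.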
Sample Script: Pass JSON Schema to API
The script below shows how to pass a JSON file with the extraction schema to the API when parsing a document.
import json
import requests
VA_API_KEY = "<YOUR_VA_API_KEY>"  # Replace with your API key
headers = {"Authorization": f"Basic {VA_API_KEY}"}
url = "https://api.va.landing.ai/v1/tools/agentic-document-analysis"
base_pdf_path = "your_pdf_path"  # Replace with the path to the folder that contains the file
pdf_name = "filename.pdf"  # Replace with the file name
pdf_path = f"{base_pdf_path}/{pdf_name}"
schema_name = "w2-schema.json"  # Replace with the JSON schema file name
schema_path = f"{base_pdf_path}/{schema_name}"
# Load the extraction schema from the JSON file
with open(schema_path, "r") as file:
    schema = json.load(file)
# Send the document and the schema; the with block closes the file after the request
with open(pdf_path, "rb") as pdf_file:
    files = [("pdf", (pdf_name, pdf_file, "application/pdf"))]
    payload = {"fields_schema": json.dumps(schema)}
    response = requests.post(url, headers=headers, files=files, data=payload)
output_data = response.json()["data"]
extracted_info = output_data["extracted_schema"]
print(extracted_info)
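The script above reads response.json()["data"]["extracted_schema"] directly, which raises a KeyError if the request failed and the body has a different shape. The sketch below shows one way to guard that lookup. The helper name get_extracted_schema and the sample body are made up for this illustration; only the data and extracted_schema keys come from the script above.

```python
def get_extracted_schema(body: dict) -> dict:
    """Return body["data"]["extracted_schema"], or raise a descriptive error."""
    data = body.get("data")
    if not isinstance(data, dict) or "extracted_schema" not in data:
        raise ValueError(f"Response has no extracted schema: {body}")
    return data["extracted_schema"]


# Stand-in response body with made-up values (not real API output)
sample_body = {"data": {"extracted_schema": {"employee_name": "Jane Doe"}}}
print(get_extracted_schema(sample_body))
```

With the requests library, you could call response.raise_for_status() before response.json() so that HTTP errors surface first, and then pass the parsed body to a helper like this one.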
Sample Script: Include JSON Schema in Script
The script below shows how to define the extraction schema in the call when parsing a document.
import json
import requests
VA_API_KEY = "<YOUR_VA_API_KEY>"  # Replace with your API key
headers = {"Authorization": f"Basic {VA_API_KEY}"}
url = "https://api.va.landing.ai/v1/tools/agentic-document-analysis"
base_pdf_path = "your_pdf_path"  # Replace with the path to the folder that contains the file
pdf_name = "filename.pdf"  # Replace with the file name
pdf_path = f"{base_pdf_path}/{pdf_name}"
# Define your schema
schema = {
    "type": "object",
    "properties": {
        "table": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "Id": {"type": "number"},
                    "Name": {"type": "string"},
                    "Email": {"type": "string"},
                    "Investments": {"type": "string"},
                },
                "required": [
                    "Id",
                    "Name",
                    "Email",
                    "Investments",
                ],
            },
            "description": "A table containing user information with columns for Id, Name, Email, and Investments.",
        }
    },
    "required": ["table"],
}
# Send the document and the schema; the with block closes the file after the request
with open(pdf_path, "rb") as pdf_file:
    files = [("pdf", (pdf_name, pdf_file, "application/pdf"))]
    payload = {"fields_schema": json.dumps(schema)}
    response = requests.post(url, headers=headers, files=files, data=payload)
output_data = response.json()["data"]
extracted_info = output_data["extracted_schema"]
print(extracted_info)
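Once the script above returns, you may want to check the extracted rows before using them. The sketch below assumes that extracted_info mirrors the shape of the schema (a table key holding a list of row objects); that shape is an assumption, and the sample rows are made up for illustration rather than taken from real API output. The helper name rows_missing_columns is also hypothetical.

```python
REQUIRED_COLUMNS = ["Id", "Name", "Email", "Investments"]


def rows_missing_columns(extracted_info: dict) -> list:
    """Return the indexes of table rows that lack at least one required column."""
    rows = extracted_info.get("table", [])
    return [
        index
        for index, row in enumerate(rows)
        if any(column not in row for column in REQUIRED_COLUMNS)
    ]


# Made-up sample in the assumed shape; the second row has no "Investments" value
sample_info = {
    "table": [
        {"Id": 1, "Name": "Ada", "Email": "ada@example.com", "Investments": "Bonds"},
        {"Id": 2, "Name": "Lin", "Email": "lin@example.com"},
    ]
}
print(rows_missing_columns(sample_info))
```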
As part of the extraction process, you can classify each document and then extract data based on its type. For example, you can extract different sets of data from Invoices, Receipts, and Waybills.
To classify documents, use the enum keyword to define the document types in your script. Here is an example:
# Define document types
class_schema = {
    "type": "object",
    "properties": {
        "document_type": {"type": "string", "enum": ["Document Type 1", "Document Type 2", "Document Type 3"]}
    },
    "required": ["document_type"],
}
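If you maintain several sets of document types, the classification schema above can be generated from a list instead of written by hand. This is a minimal sketch; the helper name build_class_schema is hypothetical, and the resulting dictionary has the same structure as the example above.

```python
import json


def build_class_schema(document_types):
    """Build a classification schema whose enum lists the given document types."""
    if not document_types:
        raise ValueError("Provide at least one document type")
    return {
        "type": "object",
        "properties": {
            "document_type": {"type": "string", "enum": list(document_types)}
        },
        "required": ["document_type"],
    }


class_schema = build_class_schema(["Passport", "Invoice", "Other"])
# Serialize the schema the same way the sample scripts do before sending it
payload = {"fields_schema": json.dumps(class_schema)}
print(payload["fields_schema"])
```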
Sample Script: Classify Documents
Let’s say you need to parse three types of documents: Passports, Invoices, and Other. The data you need to extract depends on the document type.
The following script shows how to define your document types and the fields to extract for each type, and then how to parse the document, classify it, and extract the data.
import json
import sys
import requests
VA_API_KEY = "<YOUR_VA_API_KEY>"  # Replace with your API key
headers = {"Authorization": f"Basic {VA_API_KEY}"}
url = "https://api.va.landing.ai/v1/tools/agentic-document-analysis"
base_pdf_path = "your_pdf_path"  # Replace with the path to the folder that contains the file
pdf_name = "filename.pdf"  # Replace with the file name
pdf_path = f"{base_pdf_path}/{pdf_name}"
# Define document types
class_schema = {
    "type": "object",
    "properties": {
        "document_type": {"type": "string", "enum": ["Passport", "Invoice", "Other"]}
    },
    "required": ["document_type"],
}
# First request: classification
with open(pdf_path, "rb") as f:
    files = [("pdf", (pdf_name, f, "application/pdf"))]
    payload = {"fields_schema": json.dumps(class_schema)}
    classification_response = requests.post(
        url, headers=headers, files=files, data=payload
    )
classification = classification_response.json()["data"]["extracted_schema"][
    "document_type"
]
# Define schema based on classification
if classification == "Passport":
    schema = {
        "type": "object",
        "properties": {
            "Given Names": {"type": "string"},
            "Date of birth": {"type": "string"},
            "ID_number": {"type": "number"},
            "Passport Number": {"type": "string"},
        },
        "required": ["Given Names", "Date of birth", "ID_number", "Passport Number"],
    }
elif classification == "Invoice":
    schema = {
        "type": "object",
        "properties": {
            "Bill to": {"type": "string"},
            "Invoice Number": {"type": "string"},
            "Invoice Date": {"type": "string"},
            "Due Date": {"type": "string"},
            "Total": {"type": "string"},
        },
        "required": ["Bill to", "Invoice Number", "Invoice Date", "Due Date", "Total"],
    }
else:
    print("Document type is 'Other'. No extraction schema defined.")
    sys.exit()
# Second request: extraction
with open(pdf_path, "rb") as f:
    files = [("pdf", (pdf_name, f, "application/pdf"))]
    payload = {"fields_schema": json.dumps(schema)}
    response = requests.post(url, headers=headers, files=files, data=payload)
output_data = response.json()["data"]
extracted_info = output_data["extracted_schema"]
print(extracted_info)
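As the number of document types grows, the if/elif chain in the script above can be replaced by a lookup table that maps each classification to its extraction schema. The sketch below is one possible arrangement, not the documented approach; the names PASSPORT_SCHEMA, INVOICE_SCHEMA, SCHEMAS_BY_TYPE, and select_schema are hypothetical, and the schemas are copied from the script above.

```python
PASSPORT_SCHEMA = {
    "type": "object",
    "properties": {
        "Given Names": {"type": "string"},
        "Date of birth": {"type": "string"},
        "ID_number": {"type": "number"},
        "Passport Number": {"type": "string"},
    },
    "required": ["Given Names", "Date of birth", "ID_number", "Passport Number"],
}

INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "Bill to": {"type": "string"},
        "Invoice Number": {"type": "string"},
        "Invoice Date": {"type": "string"},
        "Due Date": {"type": "string"},
        "Total": {"type": "string"},
    },
    "required": ["Bill to", "Invoice Number", "Invoice Date", "Due Date", "Total"],
}

# Map each document type to its extraction schema; "Other" is left out on purpose
SCHEMAS_BY_TYPE = {"Passport": PASSPORT_SCHEMA, "Invoice": INVOICE_SCHEMA}


def select_schema(classification):
    """Return the extraction schema for a document type, or None if none is defined."""
    return SCHEMAS_BY_TYPE.get(classification)


for document_type in ["Passport", "Invoice", "Other"]:
    schema = select_schema(document_type)
    if schema is None:
        print(f"Document type is '{document_type}'. No extraction schema defined.")
    else:
        print(document_type, "->", schema["required"])
```

Adding a new document type then means adding one schema and one entry in the lookup table, plus the matching value in the enum list of the classification schema.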