Asynchronous Extraction (Extract Jobs)

The ADE Extract Jobs API enables you to extract structured data from large Markdown documents asynchronously. Instead of waiting for a single request to finish, you create a job, receive a job_id immediately, and retrieve the results when the job is complete. Use Extract Jobs for long-running extractions, such as when extracting from long documents or when using large, complex schemas.

Run Parse First

Extract Jobs runs on the Markdown output created by Parse. Parsing is the required first step in all ADE workflows. For large documents, use Parse Jobs to generate the Markdown, then pass that Markdown to Extract Jobs.

Monitor Extract Jobs

You can monitor extract jobs with these APIs:

ADE Get Extract Jobs: Get the status for a specific extract job.
ADE List Extract Jobs: List all extract jobs associated with your API key.

Rate Limits for ADE Extract Jobs

Extract Jobs have their own per-hour rate limit, separate from the other ADE APIs. Each extract job counts as a single submission (one page equivalent) toward this limit, regardless of the size of the Markdown document. To see the rate limits for all APIs, go to Rate Limits.
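
A client that submits many jobs may reach the per-hour limit. The sketch below retries a submission when the server signals rate limiting. It assumes the API responds with HTTP 429 and may send a Retry-After header in seconds. Neither behavior is confirmed on this page, so check Rate Limits before relying on it. The name submit_with_retry and the send callable are hypothetical.

```python
import time

def submit_with_retry(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-429 response or attempts run out.

    send is a zero-argument callable returning an object with .status_code
    and .headers (for example, a requests.Response).
    Assumption: rate limiting is signaled with HTTP 429.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        response = send()
        # Return on any non-429 response, or when this was the last attempt.
        if response.status_code != 429 or attempt == max_attempts - 1:
            return response
        fallback = base_delay * (2 ** attempt)  # exponential backoff
        retry_after = response.headers.get("Retry-After")
        try:
            delay = float(retry_after) if retry_after is not None else fallback
        except ValueError:
            # Retry-After can also be an HTTP date; use the backoff instead.
            delay = fallback
        sleep(delay)
```

To use it with requests, wrap the POST in a function that rebuilds the request on each call, so that any uploaded file is reopened and read from the start.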

API Reference

To learn more, go to the reference pages for the Extract Jobs APIs:

ADE Extract Jobs
ADE Get Extract Jobs
ADE List Extract Jobs

For information about pricing and credits, go to Pricing & Billing.

Workflow Overview

1. Create an extraction schema. For schema requirements, see JSON Schema for Extraction.
2. Submit the Markdown and schema to the ADE Extract Jobs API.
3. Get the job_id from the API response.
4. Poll the ADE Get Extract Jobs API with the job_id until status is completed.
5. Read the extracted fields from the completed job response. For the field structure, see JSON Response for Extraction.

Job Statuses

The ADE Get Extract Jobs API returns the current status of a job:

Status	Description
`pending`	The job is queued and has not started.
`processing`	The job is running. The `progress` field stays `0.0` until the job is `completed`.
`completed`	The job finished. The extracted fields are in the `data` field, or in the `output_url` field for large results.
`failed`	The job did not finish. See the `failure_reason` field for details.
`cancelled`	The job was cancelled.
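
Of these statuses, completed, failed, and cancelled are final, so a polling loop can stop on any of them. A minimal sketch of such a loop with a timeout follows. The helper name wait_for_job and its parameters are hypothetical, and fetch_job stands in for whatever call returns the parsed job response.

```python
import time

TERMINAL_STATUSES = {"completed", "failed", "cancelled"}

def wait_for_job(fetch_job, timeout_s=600.0, interval_s=5.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_job() until the job reaches a final status.

    fetch_job is a zero-argument callable that returns the job response as
    a dict (for example, the parsed JSON from ADE Get Extract Jobs).
    Raises TimeoutError if no final status is seen within timeout_s.
    """
    deadline = clock() + timeout_s
    while True:
        job = fetch_job()
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if clock() >= deadline:
            raise TimeoutError(
                f"job still {job.get('status')!r} after {timeout_s} seconds"
            )
        sleep(interval_s)
```

Passing sleep and clock as parameters keeps the loop easy to test without real waiting.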

Extract Job Response

When the ADE Get Extract Jobs API returns a job, the response wraps the extraction results with job-level fields:

Field	Description
`job_id`	The unique identifier for the job.
`status`	The current state of the job. See Job Statuses.
`received_at`	Unix timestamp (in seconds) for when the job was received.
`created_at`	Unix timestamp (in seconds) for when the job was created.
`progress`	Either `0.0` (not yet complete) or `1.0` (complete).
`org_id`	The organization ID associated with the job.
`version`	The model snapshot used for the extraction.
`data`	The extraction results, returned when the job is complete, the result is 1 MB or smaller, and you did not set an `output_save_url`. This object follows the same structure as the standard API response, including the extraction `metadata` (with `schema_violation_error`, `warnings`, and `fallback_model_version`). See JSON Response for Extraction.
`output_url`	A URL to download the extraction results, returned when the result is larger than 1 MB or you set an `output_save_url`. When `output_url` is present, `data` is `null`.
`metadata`	Job-level metadata summarizing the job, such as `filename`, `duration_ms`, `credit_usage`, and `version`.
`failure_reason`	If the job failed, a message describing what went wrong. Otherwise, `null`.

This example shows the structure of a completed job response, with the extraction abbreviated:

{
  "job_id": "cmf8x2k9p0001abcd1234efgh",
  "status": "completed",
  "received_at": 1781819747,
  "created_at": 1781819747,
  "progress": 1.0,
  "org_id": "a1b2c3d4e5f6",
  "version": "extract-20260314",
  "data": {
    "extraction": { "exam_date": "2010-05-20", "procedure": "MRI OF THE LUMBAR SPINE WITH AND WITHOUT CONTRAST" },
    "extraction_metadata": { "exam_date": { "references": ["93996806-781d-4404-bfa4-f6e49323a227"], "value": "2010-05-20" } },
    "metadata": { "filename": "markdown-mri-report.md", "duration_ms": 6807, "credit_usage": 1.4, "version": "extract-20260314", "schema_violation_error": null, "warnings": [] }
  },
  "output_url": null,
  "metadata": { "filename": "markdown-mri-report.md", "duration_ms": 8100, "credit_usage": 1.4, "version": "extract-20260314" },
  "failure_reason": null
}
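
Because a completed job carries its results either inline or behind a download URL, client code usually has to branch on which field is set. The sketch below shows one way to do that with the field names from the table and example above. The function name read_job_result is hypothetical.

```python
def read_job_result(job):
    """Return ("inline", fields) or ("url", download_url) for a completed job.

    job is the parsed JSON from ADE Get Extract Jobs. Raises RuntimeError
    if the job is not completed or has neither data nor output_url.
    """
    status = job.get("status")
    if status != "completed":
        raise RuntimeError(f"job is {status!r}: {job.get('failure_reason')}")
    data = job.get("data")
    if data is not None:
        # Inline results: the extracted fields sit under data["extraction"].
        return "inline", data["extraction"]
    if job.get("output_url"):
        # Large results, or jobs with output_save_url, return a URL instead.
        return "url", job["output_url"]
    raise RuntimeError("completed job has neither data nor output_url")
```

For the example response above, this returns "inline" together with the exam_date and procedure fields.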

ZDR Requirements

When zero data retention (ZDR) is enabled, you must configure the following parameters so that ADE does not store your content:

Pass your Markdown in the markdown_url parameter. You cannot upload a local file with the markdown parameter when ZDR is enabled.
Include the output_save_url parameter. This saves the extracted content to your specified URL instead of returning it in the API response.
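
As a rough illustration, the request fields for a ZDR job could be assembled as shown below. The parameter names markdown_url and output_save_url come from the requirements above, while schema and model follow the tutorial script on this page. The helper name build_zdr_form is hypothetical, and sending these values as form fields is an assumption to verify against the API reference.

```python
import json

def build_zdr_form(markdown_url, output_save_url, schema, model="extract-latest"):
    """Build the request fields for an extract job when ZDR is enabled.

    The Markdown is referenced by URL instead of uploaded, and the results
    are written to output_save_url instead of returned in the response.
    """
    if not markdown_url or not output_save_url:
        raise ValueError("ZDR jobs need both markdown_url and output_save_url")
    return {
        "markdown_url": markdown_url,
        "output_save_url": output_save_url,
        "schema": json.dumps(schema),  # schema is sent as a JSON string
        "model": model,
    }
```

The returned dictionary deliberately has no markdown entry, since local uploads are not allowed with ZDR.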

End-to-End Workflow: Parse a Document and Extract Fields

This tutorial walks you through how to parse a document into Markdown, create an extract job from that Markdown, and retrieve the extracted fields. The script runs all three steps in sequence and polls for the results, so you never copy the job_id by hand. For simplicity, this example uses a short, 2-page PDF and the synchronous Parse API. For large documents, use Parse Jobs to generate the Markdown.

1. Download the Document

Download the sample MRI Report and save it to a local directory.

2. Create the Script

Copy the script below and save it as extract-job.py in the same directory as the PDF.

import requests
import json
import time

headers = {
    # Replace YOUR_API_KEY with your API key
    'Authorization': 'Bearer YOUR_API_KEY'
}

# 1. Parse the document into Markdown
# Replace mri-report.pdf with the path to your document
parse_url = 'https://api.va.landing.ai/v1/ade/parse'
parse_data = {'model': 'dpt-2-latest'}

# Open the PDF in a with block so the file handle is closed after the upload
with open('mri-report.pdf', 'rb') as pdf_file:
    parse_response = requests.post(parse_url, files={'document': pdf_file}, data=parse_data, headers=headers)
parse_response.raise_for_status()
markdown = parse_response.json()['markdown']

# Save the Markdown so it can be uploaded to the extract job
with open('markdown-mri-report.md', 'w', encoding='utf-8') as f:
    f.write(markdown)

# Define the extraction schema
schema = json.dumps({
    "type": "object",
    "properties": {
        "exam_date": {
            "description": "The date on which the medical examination or procedure was performed.",
            "format": "YYYY-MM-DD",
            "type": "string"
        },
        "procedure": {
            "description": "The specific medical procedure or examination that was conducted, such as an MRI or X-ray.",
            "type": "string"
        }
    }
})

# 2. Create the extract job
extract_url = 'https://api.va.landing.ai/v1/ade/extract/jobs'
extract_data = {'schema': schema, 'model': 'extract-latest'}

# Open the Markdown in a with block so the file handle is closed after the upload
with open('markdown-mri-report.md', 'rb') as markdown_file:
    create_response = requests.post(extract_url, files={'markdown': markdown_file}, data=extract_data, headers=headers)
create_response.raise_for_status()
job_id = create_response.json()['job_id']
print(f"Created job: {job_id}")

# 3. Poll the job until it finishes, then save the results
# This loop polls until the job reaches a final status. For production use,
# consider adding a timeout and exponential backoff instead of polling forever.
while True:
    job = requests.get(f'{extract_url}/{job_id}', headers=headers)
    job.raise_for_status()
    result = job.json()
    status = result.get('status')
    print(f"Job status: {status}")

    if status == 'completed':
        if result.get('data'):
            # The extracted fields are returned inline in the data field
            with open('extract_output.json', 'w') as f:
                json.dump(result['data'], f, indent=2)
            print("Results saved to extract_output.json.")
        elif result.get('output_url'):
            # Large results or ZDR jobs return a download URL instead of inline data
            print(f"Download the results from: {result['output_url']}")
        break

    if status in ('failed', 'cancelled'):
        print(f"Job {status}: {result.get('failure_reason')}")
        break

    time.sleep(5)  # wait before checking again
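
The comment in the polling loop suggests exponential backoff for production use. One possible way to produce the wait times is a small generator like the sketch below. The name backoff_delays and its default values are illustrative choices, not settings recommended by the API.

```python
import itertools

def backoff_delays(initial=1.0, factor=2.0, maximum=30.0):
    """Yield wait times that grow geometrically and level off at maximum."""
    delay = initial
    while True:
        yield min(delay, maximum)
        delay = min(delay * factor, maximum)

# First seven waits with the defaults
print(list(itertools.islice(backoff_delays(), 7)))
# prints [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

To use it in the script, create delays = backoff_delays() before the loop and replace time.sleep(5) with time.sleep(next(delays)).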

3. Run the Script

Run the script from the same directory:

python extract-job.py

4. View the Results

When the job status is completed, the script saves the extracted fields to extract_output.json if they are returned inline, or prints the download URL if they are not. The completed job returns the results in one of two ways:

Field	When It’s Returned	Contents
`data`	The result is 1 MB or smaller and you did not set an `output_save_url`.	The extracted fields, returned inline. This field follows the same structure as the standard API response. See JSON Response for Extraction.
`output_url`	The result is larger than 1 MB, or you set an `output_save_url`.	A temporary URL to download the results. The `data` field is `null`, and the URL expires one hour after you request the job.

Library Support

Extract Jobs is not available in the Python or TypeScript libraries. Call the Extract Jobs APIs directly.
