After parsing a document, you can use the API to extract specific fields from it. This is helpful when you need the same data from many documents. For example, if you work for a financial institution and need to extract the Total Income field from tens of thousands of loan applications, you can use the API to do that.

Get Started: Extraction Workflow

We recommend using the schema extraction wizard directly in our Playground to build and validate an extraction schema. The Playground generates scripts that you can then copy and use in your own code:
  1. Use the schema extraction wizard in our Playground to build a schema tailored to your documents.
  2. Copy the script for the method you plan to use: the library or the API.
  3. Paste the script into your code.
You can also extract data in the Playground. We recommend doing this only for testing purposes, since the Playground isn’t designed to handle bulk document processing.

Use ADE Extract to Extract Fields from Markdown

Use the ADE Extract API to extract data from the Markdown output created when a document is parsed. See the full API reference here.

Specify Documents to Run Extraction On

The API offers two parameters for specifying the document you want to extract from:
  • markdown: Specify the actual Markdown file you want to run extraction on.
  • markdown_url: Include the URL to the Markdown file you want to run extraction on.
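A minimal sketch of choosing between the two parameters. It assumes that the request is sent as form fields, that the Markdown content can be passed as text, and that exactly one of the two parameters is supplied per request. The helper name build_extract_fields is hypothetical, and no endpoint URL or authentication is shown because neither appears in this section; see the API reference for those.

```python
from pathlib import Path
from typing import Optional


def build_extract_fields(
    schema_json: str,
    markdown_path: Optional[str] = None,
    markdown_url: Optional[str] = None,
) -> dict:
    """Assemble the fields for an extraction request (hypothetical helper)."""
    # Assumption: supply exactly one document source per request.
    if (markdown_path is None) == (markdown_url is None):
        raise ValueError("Provide exactly one of markdown_path or markdown_url")

    fields = {"schema": schema_json}
    if markdown_path is not None:
        # `markdown` carries the Markdown content itself (assumed to be sent as text).
        fields["markdown"] = Path(markdown_path).read_text(encoding="utf-8")
    else:
        # `markdown_url` points at a Markdown file hosted elsewhere.
        fields["markdown_url"] = markdown_url
    return fields
```

Validating the inputs before sending keeps a malformed request from reaching the API, which matters when the same script runs over thousands of documents.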

Set the Extraction Schema

Set the extraction schema in the schema parameter. The schema must meet specific format and property requirements. For detailed guidance, see JSON Schema for Extraction.
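As an illustration, the Total Income example from the introduction could be expressed as a small JSON Schema. This is a sketch only: the field name total_income and its description are made up for the example, passing the schema as a JSON string is an assumption, and the supported keywords are defined in JSON Schema for Extraction.

```python
import json

# Hypothetical schema for the loan-application example: one numeric field.
schema = {
    "type": "object",
    "properties": {
        "total_income": {
            "type": "number",
            "description": "Total income stated on the loan application",
        }
    },
    "required": ["total_income"],
}

# Serialize the schema to a JSON string for the schema parameter
# (assumption: the parameter accepts a JSON-encoded string).
schema_json = json.dumps(schema)
```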

Extracted Output

For details about the extraction response structure and fields, see JSON Response for Extraction.

Run Extract with Our Libraries

Click one of the tiles below to learn how to run the API with our libraries.
The legacy agentic-doc library does not support the API.

Migrate from the Legacy Extract Endpoint

If you’ve been using the legacy API endpoint (v1/tools/agentic-document-analysis), note that the current API returns output in a different format. When migrating, you may need to update scripts that process the extraction output. Key differences in the output:
  • The output doesn’t include confidence scores. Confidence scores are only available with the legacy endpoint.
  • The output doesn’t contain bounding box coordinates for each chunk. Instead, it contains a unique ID (id) for the chunk that an extracted key-value pair comes from. To locate the source of a key-value pair, write a script that matches the id to the bounding box coordinates in the parse output. See the grounding workflow example below.
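One way to sketch the ID-to-bounding-box lookup is shown here. The data shapes are assumptions made for illustration, not the documented response format: the extraction metadata is taken to map each field to a list of chunk IDs under references, and the parse output is taken to be a list of chunks that each carry an id and a grounding entry with a page and a box. Check JSON Response for Extraction and your own parse output for the actual field names.

```python
def locate_fields(extraction_metadata: dict, chunks: list) -> dict:
    """Map each extracted field to the grounding of its source chunks."""
    # Index the parse chunks by ID so each lookup is a dictionary access.
    grounding_by_id = {chunk["id"]: chunk.get("grounding") for chunk in chunks}

    located = {}
    for field, meta in extraction_metadata.items():
        # Skip IDs that have no matching chunk rather than failing.
        located[field] = [
            grounding_by_id[chunk_id]
            for chunk_id in meta.get("references", [])
            if chunk_id in grounding_by_id
        ]
    return located


# Made-up sample data in the assumed shapes.
chunks = [
    {"id": "c1", "grounding": {"page": 0, "box": {"left": 0.1, "top": 0.2, "right": 0.5, "bottom": 0.3}}},
    {"id": "c2", "grounding": {"page": 1, "box": {"left": 0.2, "top": 0.4, "right": 0.6, "bottom": 0.5}}},
]
extraction_metadata = {"total_income": {"references": ["c2"]}}

print(locate_fields(extraction_metadata, chunks))
```

Building the index once and reusing it for every field avoids rescanning the chunk list, which helps when a document has many chunks and the schema has many fields.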