If you need to parse documents stored in places like Google Drive, Amazon S3, URLs, or local folders, you can use the connectors module to access and authenticate to those locations.

A connector is a Python class, along with configuration settings, that enables the parse function to access and retrieve documents from a specific source, such as a cloud storage bucket or local directory.

You can pass a connector to the parse function to fetch and parse all documents from that source, without manually listing each file.

For example, a single call can cover every document in an Amazon S3 bucket or in Google Drive. Likewise, instead of specifying every file path in a local folder, you can use a connector to parse the entire directory in one call.
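To make the idea concrete, here is a minimal sketch of what a connector does conceptually: enumerate the documents at a source, then retrieve each one so it can be parsed. `InMemoryConnector`, `list_files`, `fetch`, and `fetch_all` are hypothetical names invented for this illustration. They are not part of the agentic-doc API and do not describe its internals.

```python
from typing import Dict, List


class InMemoryConnector:
    """Toy connector over a dict of {file name: file bytes} (illustration only)."""

    def __init__(self, files: Dict[str, bytes]) -> None:
        self._files = files

    def list_files(self) -> List[str]:
        # Enumerate every document available at this source.
        return sorted(self._files)

    def fetch(self, name: str) -> bytes:
        # Retrieve the raw bytes of a single document.
        return self._files[name]


def fetch_all(connector: InMemoryConnector) -> Dict[str, bytes]:
    # Enumerate, then retrieve: the caller never lists individual files.
    return {name: connector.fetch(name) for name in connector.list_files()}


source = InMemoryConnector({"b.png": b"\x89PNG", "a.pdf": b"%PDF-1"})
documents = fetch_all(source)
```

The point of the design is that the caller names a source once, and the connector takes care of finding and fetching the individual files.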

The connectors module is available in the agentic-doc library v0.2.3 and later.

Parse Documents from Google Drive

Before parsing documents from Google Drive, set up your Google credentials. We recommend following this tutorial: Google Drive API Python Quickstart.

The tutorial guides you through:

  1. Creating a Google Cloud project
  2. Enabling the Google Drive API
  3. Setting up OAuth 2.0 credentials
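After finishing these steps, it can help to sanity-check the downloaded credentials file before handing it to a connector. The sketch below assumes the file has the usual Google OAuth client layout: a top-level `"installed"` (desktop app) or `"web"` object containing `client_id` and `client_secret`. `check_client_secret_file` is a hypothetical helper, not part of agentic-doc or the Google client libraries.

```python
import json
from pathlib import Path


def check_client_secret_file(path: str) -> str:
    """Return the client type ("installed" or "web"), or raise ValueError."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        for client_type in ("installed", "web"):
            section = data.get(client_type)
            # Assumed layout: the section holds at least these two fields.
            if isinstance(section, dict) and {"client_id", "client_secret"} <= section.keys():
                return client_type
    raise ValueError(f"{path} does not look like an OAuth client credentials file")
```

If the check raises `ValueError`, you may have downloaded a different kind of key (for example, a service account key) instead of OAuth client credentials.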

Sample Script: Google Drive

After completing the tutorial, run the following script to parse documents from Google Drive.

from agentic_doc.parse import parse
from agentic_doc.connectors import GoogleDriveConnectorConfig

# Using OAuth credentials file (from quickstart tutorial)
config = GoogleDriveConnectorConfig(
    client_secret_file="path/to/credentials.json",
    folder_id="your-google-drive-folder-id"  # Optional
)

# Parse all documents in the folder
results = parse(config)

# Parse with filtering
results = parse(config, connector_pattern="*.pdf")
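
To see what a pattern such as `*.pdf` selects, the sketch below applies shell-style glob matching from the Python standard library to a few made-up file names. That `connector_pattern` follows these exact semantics is an assumption. This page does not say whether matching is case-sensitive or whether it applies to full paths, so verify against your connector and library version.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, for illustration only.
names = ["invoice.pdf", "scan.PDF", "photo.png", "reports/q1.pdf"]

# Case-sensitive glob match: "*" matches any run of characters.
pdf_only = [name for name in names if fnmatchcase(name, "*.pdf")]

# Case-insensitive variant: lowercase the name before matching.
pdf_any_case = [name for name in names if fnmatchcase(name.lower(), "*.pdf")]
```

Here `pdf_only` is `["invoice.pdf", "reports/q1.pdf"]`, and `pdf_any_case` also includes `"scan.PDF"`.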

Parse Documents from Amazon S3

Run the following script to parse documents from an Amazon S3 bucket.

from agentic_doc.parse import parse
from agentic_doc.connectors import S3ConnectorConfig

config = S3ConnectorConfig(
    bucket_name="your-bucket-name",
    aws_access_key_id="your-access-key",  # Optional if using IAM roles
    aws_secret_access_key="your-secret-key",  # Optional if using IAM roles
    region_name="us-east-1"
)

# Parse all documents in the bucket
results = parse(config)

# Parse documents in a specific prefix/folder
results = parse(config, connector_path="documents/")
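
Rather than hard-coding credentials in a script, you can read them from the standard AWS environment variables. The sketch below builds the keyword arguments for `S3ConnectorConfig` and leaves the two key fields out when they are not set, on the assumption that omitting them is valid when IAM roles are used (as the comments in the sample suggest). `s3_settings_from_env` is a hypothetical helper, and the `us-east-1` fallback simply mirrors the sample above.

```python
import os
from typing import Dict, Mapping


def s3_settings_from_env(
    bucket_name: str, env: Mapping[str, str] = os.environ
) -> Dict[str, str]:
    """Build S3ConnectorConfig keyword arguments from AWS environment variables."""
    settings = {
        "bucket_name": bucket_name,
        "region_name": env.get("AWS_DEFAULT_REGION", "us-east-1"),
    }
    access_key = env.get("AWS_ACCESS_KEY_ID")
    secret_key = env.get("AWS_SECRET_ACCESS_KEY")
    # Pass explicit keys only when both are present; otherwise rely on IAM roles.
    if access_key and secret_key:
        settings["aws_access_key_id"] = access_key
        settings["aws_secret_access_key"] = secret_key
    return settings
```

You could then create the connector with `config = S3ConnectorConfig(**s3_settings_from_env("your-bucket-name"))`.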

Parse Documents from a Local Directory with a Connector

Run the following script to parse documents in a local directory. The parse function parses only the documents directly in that directory; it does not parse documents in nested directories.

from agentic_doc.parse import parse
from agentic_doc.connectors import LocalConnectorConfig

config = LocalConnectorConfig()

# Parse all supported documents in a directory
results = parse(config, connector_path="/path/to/documents")

# Parse with pattern filtering
results = parse(config, connector_path="/path/to/documents", connector_pattern="*.pdf")
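
Because nested directories are not parsed, one possible workaround is to find every directory that contains files and call parse once per directory. The sketch below covers only the directory discovery, using the standard library. `directories_with_files` is a hypothetical helper, and the commented loop at the end uses only the `parse(config, connector_path=...)` call shown above.

```python
from pathlib import Path
from typing import List


def directories_with_files(root: str) -> List[str]:
    """Return root and every nested directory that directly contains a file."""
    root_path = Path(root)
    candidates = [root_path] + [p for p in root_path.rglob("*") if p.is_dir()]
    return sorted(
        str(directory)
        for directory in candidates
        if any(child.is_file() for child in directory.iterdir())
    )


# Possible usage with the local connector (not executed here):
# config = LocalConnectorConfig()
# results_by_directory = {
#     directory: parse(config, connector_path=directory)
#     for directory in directories_with_files("/path/to/documents")
# }
```

Keeping the results keyed by directory avoids assuming anything about the type that parse returns.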

Parse Documents from a URL with a Connector

Run the following script to parse documents at a specified URL.

from agentic_doc.parse import parse
from agentic_doc.connectors import URLConnectorConfig

config = URLConnectorConfig(
    headers={"Authorization": "Bearer your-token"},  # Optional
    timeout=60  # Optional
)

# Parse document from URL
results = parse(config, connector_path="https://example.com/document.pdf")
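
The sample above handles one URL per call. If you have several URLs, a small loop can call parse once per URL and collect failures instead of stopping at the first one. In this sketch, `parse_urls` is a hypothetical helper and `parse_fn` stands in for `agentic_doc.parse.parse`. Whether parse raises an exception for an unreachable URL or reports the failure some other way is an assumption to check against your library version.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def parse_urls(
    parse_fn: Callable[..., Any],
    config: Any,
    urls: Iterable[str],
) -> Tuple[Dict[str, Any], Dict[str, Exception]]:
    """Call parse_fn once per URL; return (results by URL, errors by URL)."""
    results: Dict[str, Any] = {}
    errors: Dict[str, Exception] = {}
    for url in urls:
        try:
            results[url] = parse_fn(config, connector_path=url)
        except Exception as exc:
            # Record the failure and continue so one bad URL does not stop the batch.
            errors[url] = exc
    return results, errors
```

With the config from the sample, a call might look like `results, errors = parse_urls(parse, config, ["https://example.com/document.pdf"])`.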