# CLI Batch Processing

Process multiple documents efficiently with Doclayer CLI batch ingestion workflows.

## Batch Processing

Process hundreds or thousands of documents with parallel workers, progress tracking, and resume capability.
## Key Features

- **Resume Support**: pick up where you left off after interruptions
- **Parallel Workers**: process multiple files simultaneously
- **Progress Tracking**: rich terminal output with status tables
- **State Files**: JSON state files track every upload
## Single File Upload

Upload and process a single document:
```bash
# Basic upload
doclayer ingest file invoice.pdf --project proj_finance

# Upload and wait for completion
doclayer ingest file invoice.pdf --project proj_finance --wait

# Upload with custom metadata
doclayer ingest file invoice.pdf \
  --project proj_finance \
  --metadata '{"category":"Q3-2025","department":"accounting"}' \
  --wait

# Upload with vector verification
doclayer ingest file invoice.pdf \
  --project proj_finance \
  --verify-vectors \
  --pg-dsn "$DOCLAYER_PG_DSN"
```

## Directory Batch Upload
Process all documents in a directory with parallel workers:
```bash
# Basic batch upload
doclayer ingest batch ./invoices --project proj_finance

# Parallel processing with 4 workers
doclayer ingest batch ./invoices --project proj_finance --workers 4

# Recursive directory scan with pattern filter
doclayer ingest batch ./documents \
  --project proj_finance \
  --pattern "*.pdf" \
  --recursive \
  --workers 4

# With throttling to avoid rate limits
doclayer ingest batch ./documents \
  --project proj_finance \
  --workers 4 \
  --throttle 0.2  # 200ms pause between uploads per worker

# With a state file for resume capability
doclayer ingest batch ./documents \
  --project proj_finance \
  --workers 4 \
  --state-file ./upload-state.json

# Resume an interrupted upload
doclayer ingest batch ./documents \
  --project proj_finance \
  --state-file ./upload-state.json \
  --resume
```

| Option | Description |
|---|---|
| `--workers <n>` | Number of parallel upload workers (default: 1) |
| `--pattern <glob>` | File pattern filter (e.g., `"*.pdf"`) |
| `--recursive` | Recursively scan subdirectories |
| `--throttle <seconds>` | Pause between uploads per worker |
| `--state-file <path>` | JSON file that tracks upload progress |
| `--resume` | Skip files already marked as successful in the state file |
| `--verify-vectors` | Verify documents appear in pgvector after upload |
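Putting these options together, a typical large-batch run combines parallel workers, throttling, and a state file. The directory, project ID, and file paths below are illustrative:

```bash
# Illustrative combined invocation: recursive PDF scan, 4 workers,
# a 200ms throttle per worker, and a state file for resume support.
doclayer ingest batch ./documents \
  --project proj_finance \
  --pattern "*.pdf" \
  --recursive \
  --workers 4 \
  --throttle 0.2 \
  --state-file ./upload-state.json
```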
## Manifest-Based Upload

For complex batch operations, use a manifest file to specify per-file settings.

### Manifest File Format
```yaml
# manifests/finance-batch.yaml
- path: ./invoices/invoice-001.pdf
  project: proj_finance
  agent: finance.invoice-summary
  metadata:
    category: invoice
    quarter: Q3-2025

- path: ./invoices/invoice-002.pdf
  project: proj_finance
  agent: finance.invoice-summary
  metadata:
    category: invoice
    quarter: Q3-2025

- path: ./contracts/sow-acme.pdf
  project: proj_legal
  agent: contracts.sow-analysis
  metadata:
    client: Acme Corp
    type: statement_of_work
```
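For repetitive entries like the invoices above, you can generate the manifest with a short script instead of writing it by hand. A minimal sketch, assuming the manifest schema shown above; the directory, project, agent, and metadata values are placeholders:

```bash
#!/usr/bin/env bash
# generate-manifest.sh -- emit one manifest entry per PDF in ./invoices.
# Sketch only: project, agent, and metadata values are placeholders.
{
  for pdf in ./invoices/*.pdf; do
    cat <<EOF
- path: $pdf
  project: proj_finance
  agent: finance.invoice-summary
  metadata:
    category: invoice
    quarter: Q3-2025
EOF
  done
} > manifests/finance-batch.yaml
```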
### Running Manifest Upload

```bash
# Basic manifest upload
doclayer ingest manifest manifests/finance-batch.yaml

# With a default project (applied to entries that do not specify one)
doclayer ingest manifest manifests/finance-batch.yaml \
  --default-project proj_finance

# Parallel with resume
doclayer ingest manifest manifests/finance-batch.yaml \
  --workers 3 \
  --state-file manifest-state.json \
  --resume
```

## State File & Resume
The state file is a JSON file that tracks every upload attempt. Use it to resume interrupted batches without re-uploading completed files.
### State File Structure
```json
{
  "entries": [
    {
      "file_path": "./invoices/invoice-001.pdf",
      "status": "success",
      "job_id": "job_abc123",
      "document_id": "doc_xyz789",
      "duration_ms": 2340,
      "timestamp": "2025-01-15T10:23:45.123Z"
    },
    {
      "file_path": "./invoices/invoice-002.pdf",
      "status": "error",
      "error": "Network timeout",
      "timestamp": "2025-01-15T10:24:01.456Z"
    },
    {
      "file_path": "./invoices/invoice-003.pdf",
      "status": "pending"
    }
  ],
  "summary": {
    "total": 100,
    "success": 45,
    "error": 2,
    "pending": 53
  }
}
```

💡 **Pro Tip**
When using `--resume`, only files with `"status": "pending"` or `"status": "error"` are retried; files marked `"status": "success"` are skipped.
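Because the state file is plain JSON, you can inspect it directly between runs. A quick sketch with `jq`, assuming the schema above and a state file named `upload-state.json`:

```bash
# List the files --resume would retry (anything not yet successful)
jq -r '.entries[] | select(.status != "success") | .file_path' upload-state.json

# Recompute per-status counts from the entries themselves
jq '.entries | group_by(.status) | map({key: .[0].status, value: length}) | from_entries' upload-state.json
```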
## Monitoring Progress
Monitor your batch ingestion jobs in real-time:
```bash
# List recent jobs for your project
doclayer status list --project proj_finance --limit 20

# Watch jobs in real time (refreshes every 3 seconds)
doclayer status watch --project proj_finance --interval 3

# Inspect a specific job
doclayer status inspect job_abc123

# Inspect with vector verification
doclayer status inspect job_abc123 --vectors

# Check grounding coverage for LangExtract
doclayer status grounding --tenant $DOCLAYER_TENANT --job-id job_abc123
```

### Sample Output
```text
┏━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Job ID     ┃ Status    ┃ Progress ┃ Documents ┃ Duration ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━┩
│ job_abc123 │ completed │ 100%     │ 45/45     │ 3m 24s   │
│ job_def456 │ running   │ 67%      │ 30/45     │ 2m 12s   │
│ job_ghi789 │ pending   │ 0%       │ 0/100     │ -        │
└────────────┴───────────┴──────────┴───────────┴──────────┘
```

## Best Practices
### 1. Always Use State Files for Large Batches

For batches over 50 files, always use `--state-file` so you can resume after a failure; network issues happen. One robust pattern is to re-run with `--resume` until nothing is left unfinished, as sketched below.
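A minimal sketch of that retry loop, assuming `jq` is installed and using the state-file schema documented above; the directory, project ID, and paths are illustrative:

```bash
#!/usr/bin/env bash
# Re-run the batch with --resume until every entry in the state file
# is marked "success". Sketch only: assumes --resume is safe on the
# first run (no state file yet) and that jq is available.
run_batch() {
  doclayer ingest batch ./documents \
    --project proj_finance \
    --workers 4 \
    --state-file ./upload-state.json \
    --resume
}

run_batch
while [ "$(jq '[.entries[] | select(.status != "success")] | length' ./upload-state.json)" -gt 0 ]; do
  sleep 5
  run_batch
done
```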
### 2. Start with Fewer Workers

Start with 2-4 workers and increase gradually; too many workers can hit rate limits or saturate your network.

### 3. Use Throttling for Production

Add `--throttle 0.2` (a 200ms pause per worker) when targeting production APIs to avoid rate limiting.

### 4. Verify Vectors for Critical Data

Use `--verify-vectors` on critical batches to confirm every document is properly indexed in pgvector.