This system is running live
Don't take
our word for it.
Try it.
We rebuilt a miniature version of this engagement and it's running right now. Click below — you are not looking at a screenshot, you are looking at the actual software.
▶ Open the live demo →

Legal services · Client work · 7 weeks
Document Intelligence Pipeline
Structured data extraction from unstructured PDFs
- accuracy: >95%
- contracts/week: ~40
- time per contract: 6h → 25min
The problem
A mid-sized legal services firm reviewed roughly 40 contracts per week. Each one took a paralegal ~6 hours to read, extract structured fields, and enter into the firm's case management system. The data entry was the bottleneck, and the paralegals hated it.
What we built
A four-stage pipeline:
1. PDF ingestion + OCR fallback for scanned pages
2. LLM-driven section detection (parties, recitals, terms, signatures)
3. Structured extraction with a strict JSON schema and validation
4. Human-in-the-loop review UI for the 5% of fields the model was unsure about
The key design decision was the explicit confidence threshold. Below 90% confidence, the field gets surfaced to a paralegal for one-click correction. Above, it goes straight into the system. Errors get logged and used to improve the prompt.
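The routing rule itself is small; a sketch using the 90% cutoff described above (field names are hypothetical):

```python
REVIEW_THRESHOLD = 0.90  # below this, a paralegal sees the field


def split_for_review(fields: dict[str, tuple[str, float]]) -> tuple[dict, dict]:
    """Partition extracted fields into auto-commit vs. human review.

    `fields` maps field name -> (extracted value, model confidence).
    Returns (auto, review) dicts of name -> value.
    """
    auto: dict[str, str] = {}
    review: dict[str, str] = {}
    for name, (value, confidence) in fields.items():
        (auto if confidence >= REVIEW_THRESHOLD else review)[name] = value
    return auto, review
```

Everything in `auto` is written straight to the system; everything in `review` feeds the one-click correction UI.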
What we deliberately did not do
- We did not fine-tune a model. Off-the-shelf Gemini Flash with a careful prompt and structured output got us to 95% accuracy in the first week.
- We did not build a full contract analysis platform. The client already used one for risk analysis. We just fixed the data entry pain.
Outcome
- Average review time per contract dropped from ~6 hours to ~25 minutes
- Paralegals reallocated to higher-value work
- Error rate measurably lower than the baseline manual process
Tech stack
Python, FastAPI, Gemini Flash structured output, pgvector for similar-clause search, a small React review UI.