This system is running live
Don't take
our word for it.
Try it.
We rebuilt a miniature version of this engagement and it's running right now. Click below — you are not looking at a screenshot, you are looking at the actual software.
▶ Open the live demo →

Legal services · Client work · 7 weeks
Document Intelligence Pipeline
Structured data extraction from unstructured PDFs
- accuracy: >95%
- contracts/week: ~40
- time per contract: 6h → 25min
The problem
A mid-sized legal services firm reviewed roughly 40 contracts per week. Each one took a paralegal ~6 hours to read, extract structured fields, and enter into the firm's case management system. The data entry was the bottleneck, and the paralegals hated it.
What we built
A four-stage pipeline:
1. PDF ingestion + OCR fallback for scanned pages
2. LLM-driven section detection (parties, recitals, terms, signatures)
3. Structured extraction with a strict JSON schema and validation
4. Human-in-the-loop review UI for the 5% of fields the model was unsure about
The key design decision was the explicit confidence threshold. Below 90% confidence, the field gets surfaced to a paralegal for one-click correction. Above, it goes straight into the system. Errors get logged and used to improve the prompt.
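The routing rule itself is small; a sketch using the 90% cutoff described above (field names are hypothetical):

```python
REVIEW_THRESHOLD = 0.90  # below this, a paralegal sees the field


def split_for_review(fields: dict[str, tuple[str, float]]) -> tuple[dict, dict]:
    """Partition extracted fields into auto-commit vs. human review.

    `fields` maps field name -> (extracted value, model confidence).
    Returns (auto, review) dicts of name -> value.
    """
    auto: dict[str, str] = {}
    review: dict[str, str] = {}
    for name, (value, confidence) in fields.items():
        (auto if confidence >= REVIEW_THRESHOLD else review)[name] = value
    return auto, review
```

Everything in `auto` is written straight to the system; everything in `review` feeds the one-click correction UI.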
What we deliberately did not do
- We did not fine-tune a model. Off-the-shelf Gemini Flash with a careful prompt and structured output got us to 95% accuracy in the first week.
- We did not build a full contract analysis platform. The client already used one for risk analysis. We just fixed the data entry pain.
Outcome
- Average review time per contract dropped from ~6 hours to ~25 minutes
- Paralegals reallocated to higher-value work
- Error rate measurably lower than the baseline manual process
Tech stack
Python, FastAPI, Gemini Flash structured output, pgvector for similar-clause search, a small React review UI.