Engineering for Scale: How Our Pipeline Runs 20M+ Pages a Day on Commodity Hardware
A technical overview of the architecture, engineering choices, and operational controls that make a nationally deployed medical evidence system reliable, auditable, and cost‑predictable
This article documents the concrete architecture and operational patterns that let our production system process 20 million+ pages of medical evidence per day for a large, unnamed government agency in the medical and healthcare domain. It preserves proprietary details while explaining the design decisions, trade-offs, and controls that matter when you move from pilot to national deployment.
Core design principle — PostgreSQL as the pipeline backbone
At the heart of our architecture is a task pipeline implemented on PostgreSQL. We selected PostgreSQL over messaging systems such as Kafka because it gives us maximum flexibility in how and when tasks are processed and lets the database act as the orchestrator of work.
Key capabilities enabled by PostgreSQL
- Workload grouping and prioritization — tasks are grouped into workloads and ranked against each other, so interactive work takes precedence over background processing.
- Duplicate suppression — once a task exists in the queue, duplicate submissions are suppressed, eliminating large amounts of redundant processing.
- Advanced dependency graph — tasks declare dependencies; some task types run in parallel while others wait. Because tasks are added dynamically, execution order can change after insertion and PostgreSQL lets us manage that reliably.
- Retry semantics — built‑in retry logic handles network and external system failures.
This approach makes the database more than storage: it becomes the source of truth for orchestration, enabling deterministic behavior, strong consistency, and rich operational controls.
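The queue semantics above — priority ordering, duplicate suppression, and dependency gating — can be sketched in a few lines. This is an illustrative in-memory model, not the production implementation (which expresses the same rules in PostgreSQL, e.g. via unique constraints and row-locking claims); all names are hypothetical.

```python
import heapq

class TaskQueue:
    """In-memory sketch of the queue semantics described above:
    priority ordering, duplicate suppression, and dependency gating."""

    def __init__(self):
        self._heap = []      # (priority, seq, task_id); lower priority value wins
        self._seen = set()   # duplicate suppression: at most one entry per task_id
        self._deps = {}      # task_id -> set of unfinished dependencies
        self._done = set()
        self._seq = 0        # tie-breaker to keep FIFO order within a priority

    def enqueue(self, task_id, priority=10, depends_on=()):
        if task_id in self._seen:          # already queued: suppress the duplicate
            return False
        self._seen.add(task_id)
        self._deps[task_id] = {d for d in depends_on if d not in self._done}
        heapq.heappush(self._heap, (priority, self._seq, task_id))
        self._seq += 1
        return True

    def claim(self):
        """Return the highest-priority task whose dependencies are all done."""
        deferred, task = [], None
        while self._heap:
            entry = heapq.heappop(self._heap)
            if not self._deps[entry[2]]:   # all dependencies satisfied
                task = entry[2]
                break
            deferred.append(entry)         # still blocked: put back afterwards
        for entry in deferred:
            heapq.heappush(self._heap, entry)
        return task

    def complete(self, task_id):
        self._done.add(task_id)
        for deps in self._deps.values():
            deps.discard(task_id)
```

Because tasks can be enqueued at any time, a later high-priority task jumps ahead of earlier background work, and a blocked task is skipped until its dependency completes — the same behavior the database enforces for us transactionally.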
Self‑registering servers and true dynamic scaling
Servers self‑register with the pipeline. Install the software, run it, and the server appears in the system with no additional configuration required. Because servers can be added or removed at any time without manual reconfiguration, the platform supports true dynamic scaling.
- Plug‑and‑play scaling — new workers join automatically; retired workers drain and leave without manual intervention.
- Central configuration — all system parameters are stored in the database so operational changes can be made centrally and take effect immediately.
- Horizontal and vertical scaling — breaking processing into smaller steps enables horizontal distribution; self‑registration and central configuration enable rapid capacity changes.
This model simplifies operations and makes capacity changes predictable, auditable, and low‑risk.
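A minimal sketch of the self-registration idea: a worker announces itself once at startup and is treated as live while its heartbeat stays fresh. In the real system the registry is a PostgreSQL table; the class and timeout below are illustrative assumptions.

```python
import time

class WorkerRegistry:
    """Sketch of self-registration: a worker announces itself on startup
    and is considered live while its heartbeat is fresh. (In the real
    system this is a table in PostgreSQL; names here are illustrative.)"""

    def __init__(self, heartbeat_timeout=30.0):
        self.timeout = heartbeat_timeout
        self._workers = {}                 # hostname -> last heartbeat time

    def register(self, hostname, now=None):
        """Called once at startup -- no other configuration is needed."""
        self._workers[hostname] = now if now is not None else time.time()

    def heartbeat(self, hostname, now=None):
        self._workers[hostname] = now if now is not None else time.time()

    def live_workers(self, now=None):
        """Workers whose heartbeat lapsed simply drop out of the live set."""
        now = now if now is not None else time.time()
        return sorted(h for h, t in self._workers.items()
                      if now - t <= self.timeout)
```

A retired or crashed worker needs no deregistration step: once its heartbeat goes stale, the pipeline stops assigning it work.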
Document model — process as a unit, serve as pages
We process documents as single logical units because clinical constructs such as encounters often span many pages and must be interpreted in context. For user experience and browser performance we:
- Split documents into individual pages and cache them in a custom distributed file store.
- Serve pages on demand so examiners load only a few pages at a time, keeping the UI responsive even for documents that are thousands of pages long.
This hybrid model preserves clinical context for NLP while delivering a responsive, practical UI.
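The split-and-serve model can be sketched as follows — the document is processed whole, then split into individually cached pages that are fetched in small windows. The delimiter and store are stand-ins, not the production file store.

```python
class PageStore:
    """Sketch of the split-and-serve model: the document is processed as one
    unit, then split into pages that are cached and fetched individually."""

    def __init__(self):
        self._pages = {}   # (doc_id, page_no) -> page text

    def ingest(self, doc_id, full_text, page_delimiter="\f"):
        """Split the processed document into pages (form feed stands in
        for a real page boundary) and cache each page separately."""
        pages = full_text.split(page_delimiter)
        for page_no, text in enumerate(pages, start=1):
            self._pages[(doc_id, page_no)] = text
        return len(pages)

    def fetch(self, doc_id, first, last):
        """Serve only the requested page window, so the browser never
        loads a thousand-page document at once."""
        return [self._pages[(doc_id, n)] for n in range(first, last + 1)]
```

NLP runs against the full text before the split, so cross-page constructs such as encounters keep their context; the UI only ever asks for the window the examiner is looking at.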
PDF extraction, QA, and OCR fallback
Text extraction and quality assurance are foundational to reliable downstream extraction.
- We use PDFBox APIs to extract text, and our own code runs automated quality checks on the result — for example, detecting pages that contain images with embedded text but no underlying text layer, and detecting garbage characters. PDFBox provides the primitives; our code implements the domain‑specific checks and remediation logic.
- If a page fails QA, we OCR it and reprocess. This targeted fallback reduces cost and latency while ensuring high‑quality text for downstream NLP.
Automated QA and OCR ensure we do not silently accept low‑quality text that would degrade extraction quality.
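A simplified version of the QA gate might look like this. The two checks mirror the failure modes above — a missing text layer and garbage characters — but the thresholds and the exact definition of "garbage" are illustrative, not the production values.

```python
def needs_ocr(page_text, min_chars=20, max_garbage_ratio=0.2):
    """Sketch of the QA gate: a page is routed to OCR when its extracted
    text layer is missing or too short, or when too many characters are
    'garbage' (replacement or non-printable characters). Thresholds are
    illustrative, not the production values."""
    if len(page_text.strip()) < min_chars:
        return True                      # image-only page: no usable text layer
    garbage = sum(1 for c in page_text
                  if c == "\ufffd" or (not c.isprintable() and c not in "\n\t"))
    return garbage / len(page_text) > max_garbage_ratio
```

Only pages that fail this gate are sent to OCR, which is what keeps the fallback cheap: the expensive path runs on the small fraction of pages that actually need it.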
Clinical NLP: heavily optimized cTAKES pipeline
Our clinical NLP pipeline is built on a proven clinical NLP foundation and heavily modified for the domain and performance.
- Early runs took ~40 minutes to process an average 40‑page document through the clinical NLP engine. Through iterative optimization, that same document now processes in ~5.5 seconds, and only about half of that time is spent inside the NLP engine itself.
- We’ll cover specific optimization techniques in a future post; here the important point is that careful engineering and profiling turned a research‑scale pipeline into an industrial one.
- The pipeline produces the annotations and structured outputs that feed storage, search, and downstream models.
- Start offsets for regions and encounters are most often located using rules that reliably detect anchors and section boundaries.
- End offsets are often determined by classification models that decide where an annotation should terminate. Classification models are also employed to infer attributes such as specialty, drug and alcohol addiction, and activities of daily living.
- We blend rules and statistical models so that rules provide high‑precision anchors and models handle nuance and generalization. This hybrid strategy balances precision, recall, and explainability—critical in regulated, high‑stakes domains.
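The division of labor between rules and models can be sketched as below: a regex anchor finds start offsets deterministically, and a classifier (here a stand-in callable) decides where the span ends. The header patterns are invented for illustration and are not the production rule set.

```python
import re

# Rule layer: high-precision anchors marking where a section starts.
# These header patterns are illustrative, not the production rules.
SECTION_ANCHOR = re.compile(r"^(HISTORY OF PRESENT ILLNESS|ASSESSMENT|PLAN):",
                            re.MULTILINE)

def find_start_offsets(text):
    """Rules locate start offsets deterministically and with high precision."""
    return [m.start() for m in SECTION_ANCHOR.finditer(text)]

def find_end_offset(text, start, is_boundary):
    """A model decides where the annotation ends: `is_boundary` stands in
    for a trained classifier scoring each candidate break point."""
    for m in re.finditer(r"\n", text[start:]):
        if is_boundary(text[start:start + m.start()]):
            return start + m.start()
    return len(text)
```

The asymmetry is the point: starts are anchored by predictable headers where rules excel, while ends depend on content and layout nuance where a trained model generalizes better.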
Storage and retrieval — MongoDB plus SOLR with selective caching
We use a two‑tier storage strategy to balance durability, query flexibility, and retrieval speed.
- MongoDB is the primary store for the majority of structured and semi‑structured outputs. It holds authoritative records, supports flexible queries, and enables schema evolution.
- SOLR is used for low‑latency retrieval of extracted terms, passages, and full‑text snippets. To avoid an unmanageably large index, SOLR data is cached on demand and cleared a few days after last use. Indexing remains fast because we index selectively.
- Custom SOLR management — rather than relying on Zookeeper, we built a lightweight SOLR coordination layer that uses PostgreSQL for metadata and coordination, keeping the cluster lean and manageable at a national scale.
This combination keeps search fast and storage predictable.
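The selective-indexing policy — index on first demand, evict a fixed interval after last use — can be sketched like this. Times are passed explicitly to keep the sketch testable; the class and TTL are illustrative, not the production SOLR management code.

```python
class OnDemandIndex:
    """Sketch of the selective-indexing policy: documents are indexed on
    first use and evicted a fixed time after their *last* use, keeping the
    search index small. Times are in days and passed explicitly."""

    def __init__(self, ttl_days=3.0):
        self.ttl = ttl_days
        self._last_used = {}   # doc_id -> last access time, in days

    def search(self, doc_id, now, index_fn):
        if doc_id not in self._last_used:
            index_fn(doc_id)           # index lazily, on first demand
        self._last_used[doc_id] = now  # any hit refreshes the TTL
        return True

    def evict_stale(self, now):
        """Clear entries untouched for longer than the TTL."""
        stale = [d for d, t in self._last_used.items() if now - t > self.ttl]
        for d in stale:
            del self._last_used[d]
        return stale
```

Because most documents are touched in a short burst around an examiner's review, refreshing the TTL on every hit keeps active material hot while the long tail ages out of the index automatically.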
Operational controls — quality, observability, and rapid remediation
Engineering at national scale is as much about operations as it is about architecture. Our operational controls are designed to detect anomalies quickly and keep the system running at peak quality.
Analysis Viewer and Gold Set
- A custom analysis viewer lets human reviewers grade system annotations, add and correct annotations, and inspect every processing step.
- We maintain a gold set of ~70k documents in a ground‑zero environment. Every run against this set triggers a delta tool that compares every annotation to previous runs and surfaces grade changes immediately.
- These controls let developers move quickly while maintaining confidence in output quality and system reliability.
Comprehensive telemetry and logs
- We store all server logs in PostgreSQL and index them by task or web service. This enables fast triage by correlating logs with tasks and external calls and provides complete forensic trails for audits.
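Task-correlated logging is the key idea, and it fits in a tiny sketch: every log line carries the task (or web-service call) it belongs to, so triage starts from the task rather than from a grep across servers. The class below is an in-memory stand-in for the PostgreSQL-backed log tables.

```python
from collections import defaultdict

class TaskLogStore:
    """Sketch of task-correlated logging: every log line is stored keyed
    by the task it belongs to, so a complete forensic trail is a single
    lookup. (The real store is PostgreSQL; this is an in-memory model.)"""

    def __init__(self):
        self._by_task = defaultdict(list)

    def log(self, task_id, server, message):
        self._by_task[task_id].append((server, message))

    def trace(self, task_id):
        """Return the complete trail for one task, across all servers."""
        return self._by_task[task_id]
```

Because a task may hop between servers during its lifetime, indexing by task rather than by host is what makes the trail complete for audits.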
Dependency awareness and outage handling
- A custom monitoring layer detects dependent system outages and pauses tasks that rely on those systems until they recover. Administrators and architects receive immediate notifications so remediation begins before end-users notice.
Nightly anomaly and performance report
- Every error, whether fatal or recoverable, is included in a nightly report emailed to our architect. The report also contains average times across all tasks and web services, task and page counts, the longest running tasks of each type, the longest web service calls, and call counts per web service. If it is an important KPI, it is in this report.
- The report is designed to surface anomalies quickly so they can be fixed before they affect production quality.
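The nightly rollup is conceptually a per-type aggregation over the day's task records. A minimal sketch, assuming records of (task type, duration, failed-flag) — the field names and shape are illustrative, not the production report schema:

```python
from statistics import mean

def nightly_report(task_records):
    """Sketch of the nightly KPI rollup: given (task_type, seconds, failed)
    records, produce the per-type averages, counts, error totals, and
    slowest runs the report would surface. Names are illustrative."""
    by_type = {}
    for task_type, seconds, failed in task_records:
        row = by_type.setdefault(task_type,
                                 {"times": [], "count": 0, "errors": 0})
        row["times"].append(seconds)
        row["count"] += 1
        row["errors"] += int(failed)
    return {t: {"count": r["count"],
                "errors": r["errors"],
                "avg_s": round(mean(r["times"]), 3),
                "slowest_s": max(r["times"])}
            for t, r in by_type.items()}
```

Surfacing the slowest runs alongside the averages is deliberate: a stable average can hide a new tail-latency regression that the max exposes immediately.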
Error profile
- It is rare to have more than a few tasks fail on any given day; many days we experience no failures at all across millions of tasks. When failures do occur, they are overwhelmingly related to external network or system issues rather than internal regressions.
Practical takeaways and tradeoffs
- PostgreSQL as orchestrator gives flexibility (prioritization, deduplication, dependency graphs) at the cost of more complex database logic—an acceptable tradeoff for deterministic behavior, operational control, and peak performance.
- Self‑registering servers enable true dynamic scaling: add or remove capacity without manual reconfiguration.
- Process documents as units, serve pages on demand to preserve clinical context while keeping the UI responsive.
- Selective SOLR indexing with MongoDB primary store balances retrieval speed and storage cost.
- Hybrid NLP (rules for start offsets; classification models for end offsets and attributes) delivers precision and coverage.
- Rigorous QA and observability (gold set, delta testing, nightly reports, centralized logs) are non‑negotiable for national deployments.
The human and programmatic payoff
Engineering discipline and operational rigor translate directly into mission outcomes:
- Faster, more consistent decisions for applicants and beneficiaries.
- Lower operational cost through commodity hardware and predictable infrastructure.
- High confidence and auditability for program managers and auditors.
- Rapid detection and remediation of anomalies before they affect outcomes.
These are not abstract gains; they translate into real improvements in people’s lives and in the ability of agencies to meet mission goals reliably.