SOLR and Selective Retrieval at Scale: Full‑Document Indexing for User Search and LLM Summaries
In the previous article, we explored Decision Support — the stage where structured data, modeling, and LLM‑based summarization converge. This article focuses on a critical enabler of that workflow: SOLR and selective retrieval.
Our system indexes the entire document, not just structured slices. This supports two major use cases:
- User search within a single case — examiners can search for any phrase, abbreviation, or concept across all documents in that case.
- Summarization and Decision Support — SOLR retrieves contextual passages when structured data alone isn’t enough.
And it does this at a scale that supports over 20,000 users working across millions of documents.
Full‑document indexing: the foundation of flexibility
Medical evidence is long, heterogeneous, and unpredictable. Even with high‑quality structured extraction, some information is best retrieved directly from text.
By indexing the entire document, we ensure:
- Examiners can search for anything they need
- Summarization can retrieve context around structured hits
- Rare or unusual phrasing remains discoverable
- No clinically relevant text is “lost”
Full indexing gives us completeness and flexibility without sacrificing performance.
User search: UMLS‑powered synonym expansion
Examiners search within a single case, but they need comprehensive results. To support this, we integrate UMLS‑based synonym expansion directly into SOLR queries.
Examples:
- Searching PTSD also returns post‑traumatic stress disorder
- Searching ESRD also returns end stage renal disease
This ensures examiners get complete results without needing to know every variant of a clinical term.
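The expansion step can be sketched as a query builder that ORs a term with its known variants. This is a minimal illustration, not the production integration: the `SYNONYMS` dict stands in for a real UMLS lookup, and the field name `body` is hypothetical.

```python
# Sketch: expand a user query with UMLS-style synonyms before it reaches SOLR.
# SYNONYMS is a stand-in for a real UMLS lookup; the "body" field is illustrative.

SYNONYMS = {
    "ptsd": ["post-traumatic stress disorder"],
    "esrd": ["end stage renal disease"],
}

def expand_query(term: str, field: str = "body") -> str:
    """Build a SOLR OR-query covering the term and all known synonyms."""
    variants = [term] + SYNONYMS.get(term.lower(), [])
    # Quote each variant so multi-word synonyms match as phrases.
    clauses = [f'{field}:"{v}"' for v in variants]
    return " OR ".join(clauses)

print(expand_query("PTSD"))
# body:"PTSD" OR body:"post-traumatic stress disorder"
```

A term with no known synonyms simply passes through unexpanded, so the behavior degrades gracefully for non-clinical searches.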
Index lifecycle: purge after 3 days of non‑use
Full‑document indexing at national scale creates massive data volumes. To keep indexes lean and performant, we:
- Track document access
- Purge documents after 3 days of non‑use
- Re‑index on demand if needed
This keeps index size stable and predictable while ensuring active cases remain fast and responsive.
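The purge rule itself is simple to state in code. The sketch below assumes an in-memory map of last-access timestamps; in a real deployment this bookkeeping would live wherever document access is already tracked.

```python
from datetime import datetime, timedelta

# Sketch of the lifecycle rule: documents untouched for 3 days are dropped
# from the index and re-indexed on demand if a user returns to the case.
# last_access maps document id -> last access time (illustrative structure).

PURGE_AFTER = timedelta(days=3)

def docs_to_purge(last_access: dict, now: datetime) -> list:
    """Return ids of documents whose last access is older than the purge window."""
    return [doc_id for doc_id, ts in last_access.items() if now - ts > PURGE_AFTER]
```

Because purged documents can always be re-indexed, the rule trades a one-time re-index cost on cold cases for a permanently bounded index size.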
Architecture: 10 independent SOLR servers, round‑robin assigned
We deploy 10 independent SOLR servers and assign cases round‑robin using a PostgreSQL‑backed routing table.
This design is driven by SOLR’s own guidance:
ZooKeeper‑managed clusters are recommended either for 5 or fewer nodes or for clusters of thousands of nodes.
Anything in between introduces unnecessary operational overhead without meaningful benefit.
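The routing itself is straightforward: each new case is pinned to the next server in rotation, and the mapping is remembered so every later query for that case hits the same index. A minimal sketch, with an in-memory counter and dict standing in for the PostgreSQL routing table:

```python
# Sketch: round-robin case assignment over 10 independent SOLR servers.
# In production the counter and mapping live in a PostgreSQL routing table;
# server URLs are illustrative.

SOLR_SERVERS = [f"https://solr-{i:02d}.internal:8983/solr" for i in range(10)]

class RoundRobinRouter:
    """Assign each new case to the next server and remember the mapping."""

    def __init__(self, servers):
        self.servers = servers
        self.next_idx = 0   # in production: a sequence in PostgreSQL
        self.routing = {}   # case_id -> server (the routing table)

    def assign(self, case_id: str) -> str:
        if case_id not in self.routing:
            self.routing[case_id] = self.servers[self.next_idx]
            self.next_idx = (self.next_idx + 1) % len(self.servers)
        return self.routing[case_id]
```

Pinning a whole case to one server is what gives the isolation described below: a single case's queries never fan out across nodes.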
Our architecture provides:
- High throughput
- Isolation between cases
- Predictable latency
- No cross‑node contention
- Simple operational behavior
- Easy horizontal scaling
- Reliable performance for 20,000+ users
It’s a pragmatic design optimized for national‑scale workloads.
Structured fields inside the index: enabling advanced filtering
Our SOLR indexes contain structured fields such as:
- Region
- Encounter type
- Encounter date
- Section boundaries
- cTAKES‑aligned offsets
- Domain‑scored features
- Document metadata
These fields enable:
- Advanced user filtering (e.g., “show only imaging reports from 2018–2020”)
- Targeted summarization (e.g., “retrieve passages from the ADL section”)
- Efficient RAG (e.g., “pull context around this specific hit”)
This hybrid of structured and unstructured indexing is what makes retrieval both precise and flexible.
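A filter like "only imaging reports from 2018–2020" translates into SOLR filter-query (`fq`) clauses over those structured fields. The field names below mirror the list above but are illustrative, not the exact schema:

```python
# Sketch: translating user filters into SOLR filter-query (fq) clauses.
# Field names (encounter_type, encounter_date, region) are illustrative.

def build_filters(encounter_type=None, date_from=None, date_to=None, region=None):
    """Return a list of fq clauses to attach to a SOLR query."""
    fq = []
    if encounter_type:
        fq.append(f'encounter_type:"{encounter_type}"')
    if date_from or date_to:
        # SOLR range syntax; "*" means an open-ended bound.
        fq.append(f"encounter_date:[{date_from or '*'} TO {date_to or '*'}]")
    if region:
        fq.append(f'region:"{region}"')
    return fq

print(build_filters(encounter_type="imaging",
                    date_from="2018-01-01", date_to="2020-12-31"))
```

Because `fq` clauses are cached and applied before scoring, this kind of structured narrowing is also cheap at query time.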
Selective retrieval for summarization: structure first, SOLR second
Even though we index the entire document, summarization does not retrieve the entire document.
The workflow is:
- Structured data identifies the relevant entities, regions, and encounters
- Domain scoring determines which features matter
- We retrieve the exact passages tied to those features
- SOLR supplements with nearby context or rare phrasing
- We assemble a curated evidence set
- The LLM receives only this curated evidence
This keeps token counts low while maintaining high recall.
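The assembly step above can be sketched as a small function: structured, domain-scored features contribute their passages first, SOLR-supplied context fills in around them, and a passage budget caps the total. All names and thresholds here are illustrative assumptions, not the production pipeline.

```python
# Sketch of "structure first, SOLR second" evidence assembly.
# features: list of {"passage": str, "score": float} from the structured layer.
# solr_context: passages SOLR retrieved as nearby context or rare phrasing.
# score_threshold and max_passages are illustrative knobs.

def assemble_evidence(features, solr_context, score_threshold=0.5, max_passages=20):
    """Curate the evidence set the LLM will see: scored passages first,
    then deduplicated SOLR context, capped at a passage budget."""
    evidence = [f["passage"] for f in features if f["score"] >= score_threshold]
    for passage in solr_context:
        if passage not in evidence:
            evidence.append(passage)
    return evidence[:max_passages]
```

Ordering matters: structured hits lead, so even when the budget truncates the list, the highest-confidence evidence always survives.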
Token efficiency: the hidden constraint
At 20M+ pages/day:
- Every unnecessary token costs money
- Every unnecessary token increases latency
- Every unnecessary token risks hitting provider limits
Selective retrieval ensures that the LLM sees:
- Only the relevant passages
- Only the necessary context
- Only the evidence tied to structured features
This is how we maintain both accuracy and affordability.
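A back-of-envelope calculation shows why this matters at volume. Every number below is an illustrative assumption (trimmed tokens per page, provider pricing), not the system's actual figures:

```python
# Back-of-envelope: the cost of unnecessary tokens at 20M pages/day.
# All constants are illustrative assumptions, not real system figures.

PAGES_PER_DAY = 20_000_000
TOKENS_TRIMMED_PER_PAGE = 100       # assumption: what selective retrieval saves
COST_PER_MILLION_TOKENS = 1.00      # assumption: provider input price, USD

daily_savings = (PAGES_PER_DAY * TOKENS_TRIMMED_PER_PAGE
                 / 1_000_000 * COST_PER_MILLION_TOKENS)
print(f"${daily_savings:,.0f}/day")
# $2,000/day
```

Even at modest per-token prices, trimming a hundred tokens per page compounds into thousands of dollars a day — before counting the latency and rate-limit headroom it buys.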
Monitoring: retrieval must be predictable
We monitor:
- Query latency
- Index size (which stays flat due to regular purging)
- Query patterns
When we see users repeatedly searching for the same concepts, it becomes a signal to:
- Promote those concepts into structured data
- Add new classification models
- Improve region detection
- Enhance domain scoring
User behavior directly informs system evolution.
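Detecting those signals can be as simple as counting repeated queries. A minimal sketch, assuming a flat log of query strings and an illustrative frequency threshold:

```python
from collections import Counter

# Sketch: surface concepts users search repeatedly — candidates for promotion
# into structured data. The threshold is an illustrative assumption.

def promotion_candidates(query_log, min_count=50):
    """Return (concept, count) pairs searched at least min_count times,
    most frequent first."""
    counts = Counter(q.lower() for q in query_log)
    return [(concept, n) for concept, n in counts.most_common() if n >= min_count]
```

In practice this feedback loop is what turns a free-text workaround ("everyone keeps searching for the same phrase") into a first-class structured feature.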
Why SOLR still matters in an LLM world
LLMs are powerful, but they cannot replace retrieval. In fact, they make retrieval more important.
LLMs need:
- Grounding
- Structure
- Relevance
- Context
- Token efficiency
SOLR ensures that the LLM receives the right evidence, in the right order, at the right scale.