Lessons from a $4B Company Bringing AI to Healthcare and the Path to AI Doctors — Unsupervised Learning

Oscar Health is a $4B public health insurance company serving nearly 1 million members. CTO and co-founder Mario Schlosser sits at a unique intersection: Oscar is both an insurer and a provider (it runs a medical group of ~150 clinicians), which makes it one of the most aggressive real-world adopters of LLMs in healthcare. This conversation covers how Oscar is deploying AI today, the structural reasons healthcare is uniquely suited to LLM adoption, the hard technical and business-model barriers to “AI doctors,” and practical lessons from building production AI systems in a heavily regulated industry.

Why healthcare is uniquely suited to LLMs

Healthcare sits at an extreme intersection of formal and informal language, which is exactly where LLMs excel.
- On the formal side: ICD-10 diagnosis codes, CPT procedure codes, utilization management guidelines, claims adjudication rules — all highly structured and regulated.
- On the informal side: doctor-patient conversations, clinical notes, customer service calls — messy, human, contextual.
- Historically, healthcare had surprisingly little algorithmic coverage compared to other industries because traditional ML (logistic regressions, etc.) couldn’t bridge the formal-informal gap. LLMs are the first models that can fluidly translate between the two.
- This is why Mario believes the biggest near-term impact is administrative (claims, authorizations, member communication), with clinical replacement of caregivers being the longer-term goal.

How Oscar uses AI today: three financial levers

Oscar’s financial outcomes depend on three buckets, and AI is applied to each:
1. Growth and retention — keeping members enrolled and attracting new ones.
  - AI-driven personalized outbound campaigns remind members of value Oscar delivered (e.g., cancer screenings, doctor visits).
  - Persona-based messaging matters: chronically ill members respond to convenience messaging (cutting redundant steps); healthy members respond to empathy-driven messaging.
  - LLMs extract persona signals from customer service conversations and fill in missing demographic data (e.g., inferring ethnicity from names and conversation language to improve doctor matching).
2. Operational efficiency (admin) — automating internal processes.
  - Call summarization: LLMs now fully replace manual note-taking by care guides on customer service calls.
  - Lab test summarization and secure messaging medical record generation in Oscar’s medical group.
  - Claims explainers: translating complex internal claim adjudication traces into language care guides can understand.
  - These save on the order of a few cents PMPM (per member per month) each, but at Oscar’s ~$600 PMPM revenue, they add up.
3. Clinical cost reduction — the biggest lever, since ~85% of premium revenue goes to medical costs.
  - The flagship clinical use case is letting doctors (and patients, and care agents) “talk to the medical records” — querying clinical data in natural language.
  - Insurance company margins are only 2–5%, so even small percentage savings on the 85% medical cost base can double margins.
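To make the arithmetic behind these levers concrete, here is a small worked sketch. The 3% starting margin, the 3.5% medical-cost saving, and the 3 cents PMPM are illustrative assumptions picked from within the ranges quoted above, and the member count is rounded to 1 million; none of them are Oscar's actual figures.

```python
def margin_after_savings(margin: float, medical_cost_ratio: float, savings_rate: float) -> float:
    """Margin (as a fraction of premium) after cutting medical costs by savings_rate."""
    return margin + medical_cost_ratio * savings_rate


def annual_admin_savings(cents_pmpm: float, members: int) -> float:
    """Dollars saved per year from a per-member-per-month saving given in cents."""
    return cents_pmpm / 100 * members * 12


# Assumed inputs (illustrative only, not Oscar's reported numbers).
base_margin = 0.03
new_margin = margin_after_savings(base_margin, medical_cost_ratio=0.85, savings_rate=0.035)
admin_savings = annual_admin_savings(cents_pmpm=3, members=1_000_000)

print(f"margin: {base_margin:.2%} -> {new_margin:.2%}")
print(f"admin savings per year: ${admin_savings:,.0f}")
```

Under these assumptions, a 3.5% reduction in medical spend roughly doubles the margin, while a single 3-cent admin saving is worth about $360,000 a year.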

The clean vs. dirty input problem: admin vs. clinical

Administrative use cases have an inherent advantage: inputs and outputs are already well-structured.
- Claims come in standard EDI formats; the decision is binary (pay/deny/adjust); clearing houses validate data.
- LLMs operate on a level playing field with humans here.
Clinical use cases are fundamentally harder because of hidden context.
- Example: when summarizing a virtual primary care visit, a human provider might reference a prior conversation with the patient that isn’t in the medical record. The LLM has no access to this “ether” knowledge.
- Example: care teams organized by geography know contextual things (weather, transit options) that affect whether a patient can realistically get to an appointment — knowledge an LLM lacks unless explicitly provided.
- Improving LLM performance in clinical settings requires not just better models but expanding the “horizon of knowledge” fed into them.

Building around healthcare regulatory requirements

HIPAA is the primary constraint: patient-specific data cannot be shared without a Business Associate Agreement (BAA).
- OpenAI initially didn’t sign BAAs; Oscar was the first organization to sign one directly with OpenAI.
- Anthropic and others now sign BAAs; open-source models run on your own infrastructure don’t require one.
- The practical friction: when a new model launches (e.g., Google’s Gemini Ultra), it isn’t automatically covered under existing BAAs. Oscar uses synthetic/anonymized test data for 3–4 months until the model is formally included.
Selling AI to hospitals requires security reviews, policy checklists, and certifications like HITRUST, but the real barrier is trust.
- Health systems and insurers are slow at rapid prototyping and even slower at following through.
- Mario’s key insight: the best products in healthcare do not win — the best enterprise sales processes do. Founders should spend more time in conversation with hospital stakeholders than tweaking models.
- Oscar co-authored a consortium document on AI principles in healthcare, emphasizing that LLMs can democratize analytics and that organizational leaders have a duty to get these tools into more people’s hands.

Fundamental LLM limitations Oscar has encountered

Counting and classification at scale: GPT-4 fails at simple-sounding tasks like “here are 100 phone calls, categorize each and count how many fall into each bucket.” This isn’t a knowledge problem — it’s an architectural one.
- Transformers process information layer by layer. A task that requires both categorizing AND counting in one pass can exhaust the available layers.
- Solution: decompose the task. Chain-of-thought prompting works not because the LLM “thinks harder” but because chaining multiple calls effectively expands the layer space.
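The decomposition can be sketched as follows: one model call per transcript for the labeling, with the counting done in ordinary code. `classify_call` is a hypothetical stand-in for a single LLM call (keyword rules are used here only so the example runs), and the category names are invented for illustration.

```python
from collections import Counter
from typing import Callable, Iterable


def classify_call(transcript: str) -> str:
    """Hypothetical stand-in for one LLM call that labels a single transcript."""
    text = transcript.lower()
    if "claim" in text:
        return "claims"
    if "doctor" in text:
        return "find_care"
    return "other"


def count_categories(
    transcripts: Iterable[str],
    classify: Callable[[str], str] = classify_call,
) -> Counter:
    """Label each transcript independently, then tally the labels outside the model."""
    labels = [classify(t) for t in transcripts]
    return Counter(labels)


calls = [
    "Why was my claim denied?",
    "I need a doctor near me.",
    "Where is my claim payment?",
    "How do I reset my password?",
]
counts = count_categories(calls)
print(counts)
```

The model only ever sees one transcript at a time, and the tally is exact because it never passes through the model.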
Semantic drift in specialized contexts: The question “does this member have a post-traumatic injury?” produces high false-positive rates.
- “Post-traumatic injury” has a precise, narrow definition in utilization management, but GPT-4’s training data contains the term in many broader, colloquial contexts. The model’s “superposition of personalities” causes it to flip between interpretations.
- Solution: prompt the LLM to generate 30 different ways the concept might appear in medical records (self-consistency), then evaluate each independently in parallel to avoid cross-pollination of tokens.
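A rough sketch of that pattern is below. Both `generate_phrasings` and `record_mentions` are hypothetical stand-ins for LLM calls (a canned list and a substring match, so the example runs offline). The `min_hits` aggregation rule is an assumption, since the conversation does not say how the parallel results are combined.

```python
from concurrent.futures import ThreadPoolExecutor


def generate_phrasings(concept: str, n: int) -> list[str]:
    """Hypothetical stand-in: a real version would ask the model for n phrasings."""
    canned = [
        "post-traumatic injury",
        "injury following trauma",
        "fracture after fall",
    ]
    return canned[:n]


def record_mentions(record: str, phrasing: str) -> bool:
    """Hypothetical stand-in for one isolated model call checking one phrasing."""
    return phrasing.lower() in record.lower()


def concept_present(record: str, concept: str, n: int = 30, min_hits: int = 1) -> bool:
    """Check each phrasing in its own call so no prompt sees another's tokens."""
    phrasings = generate_phrasings(concept, n)
    with ThreadPoolExecutor(max_workers=8) as pool:
        hits = list(pool.map(lambda p: record_mentions(record, p), phrasings))
    # Assumed aggregation rule: flag the record once min_hits phrasings match.
    return sum(hits) >= min_hits


print(concept_present("Patient seen for fracture after fall at home.", "post-traumatic injury"))
print(concept_present("Routine annual physical, no complaints.", "post-traumatic injury"))
```

Raising `min_hits` trades recall for fewer false positives, which is the failure mode described above.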
Claims explanation complexity: Oscar’s proprietary claims engine (“Layer Cake”) produces ~1,000 lines of rule traces per claim, with nested function calls and domain-specific language.
- GPT-4 couldn’t process full traces even with 32k context windows, and performance degraded on longer decision trees.
- Solution: hierarchical prompting — give GPT-4 the high-level function calls but hide sub-procedure details, then let the model “double-click” into specific functions where it needs more detail. This works well for denials (find the one failed call among 99) but is harder for payment amount explanations.
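The "double-click" idea for denials can be sketched like this. The `TraceNode` structure, the function names in the example trace, and `choose_call_to_expand` (a stand-in for the model reading the collapsed view and choosing what to open) are all hypothetical; the internals of Oscar's claims engine are not described in the conversation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TraceNode:
    name: str
    passed: bool
    children: list["TraceNode"] = field(default_factory=list)


def summarize(node: TraceNode) -> list[str]:
    """Collapsed view: one line per direct sub-call, with nested detail hidden."""
    return [f"{c.name}: {'ok' if c.passed else 'FAILED'}" for c in node.children]


def choose_call_to_expand(view: list[str]) -> Optional[str]:
    """Hypothetical stand-in for the model picking which call to open next."""
    for line in view:
        if line.endswith("FAILED"):
            return line.split(":")[0]
    return None


def find_denial_path(root: TraceNode) -> list[str]:
    """Drill down one level at a time, expanding only the call that was chosen."""
    path = [root.name]
    node = root
    while node.children:
        choice = choose_call_to_expand(summarize(node))
        if choice is None:
            break
        node = next(c for c in node.children if c.name == choice)
        path.append(node.name)
    return path


trace = TraceNode("adjudicate_claim", False, [
    TraceNode("check_member_active", True),
    TraceNode("check_eligibility", False, [
        TraceNode("check_plan_type", True),
        TraceNode("check_coverage_dates", False),
    ]),
    TraceNode("price_claim", True),
])
print(find_denial_path(trace))
```

At each step the prompt holds only one level of the trace, which keeps the context small regardless of how deep the rule tree goes.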

Prompting strategy: mostly empirical, shared openly

Oscar’s prompting knowledge is ~90% empirical (“trying stuff out”) and ~10% from literature.
- Mario maintains a public site (hi-oscar.com) with paper notes, focusing on interpretability research and failure cases.
- The most valuable literature is on systems design — how to chain model calls together — rather than individual prompt tricks.
- Oscar’s engineers have independently rediscovered techniques from the literature (e.g., self-consistency), suggesting that much practical AI knowledge is being reinvented inside companies and should be shared more.

General-purpose vs. healthcare-specific models

Oscar consistently finds that specialized/healthcare-specific models lose alignment — they stop following instructions reliably.
- Example: asking Med-PaLM to output JSON works perhaps half the time; GPT-4 has a dedicated JSON mode that works reliably.
- Mario’s view: use the largest, most capable general-purpose model available for reasoning, and use RAG (retrieval-augmented generation) to inject domain knowledge.
- A recent paper in agriculture showed that RAG and fine-tuning provide independent, additive improvements — do both.
- Long-term, the ideal architecture would separate reasoning/planning from content knowledge, at which point a specialized clinical model makes sense. We’re not there yet.
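As a minimal illustration of injecting domain knowledge through retrieval, the sketch below ranks a few documents against a question and places the best matches into the prompt for a general-purpose model. The word-overlap scoring is a deliberate simplification (production systems typically use embeddings), and the documents, function names, and prompt wording are invented placeholders.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase word set; punctuation is dropped."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by how many words they share with the question."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, documents: list[str], k: int = 2) -> str:
    """Assemble a prompt that carries the retrieved context ahead of the question."""
    context = "\n".join(f"- {d}" for d in retrieve(question, documents, k))
    return (
        "Use only the context below to answer.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


# Placeholder documents, invented for this example.
docs = [
    "Physical therapy after knee surgery requires prior authorization.",
    "Annual flu shots are covered at no cost.",
    "Knee surgery claims use specific procedure codes.",
]
question = "Is physical therapy covered after knee surgery?"
prompt = build_prompt(question, docs, k=2)
print(prompt)
```

The reasoning stays with the general-purpose model, while the domain facts arrive through the retrieved context.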

How Oscar structures its AI team

The model emerged from a company-wide hackathon (the most participatory in Oscar’s history) and has three components:
- A centralized “AI Pod” (~7 people: product managers, data scientists, engineers) that owns 3 prioritized projects at any time and holds weekly office hours for anyone in the company.
- Weekly Monday-night hacking sessions open to all employees, where anyone can demo work-in-progress (including failures — Mario deliberately shares his own embarrassing demos to lower the barrier).
- A tracking system for all AI projects across the company.
This hybrid centralized/decentralized model balances shared learning with product-team ownership.

What Oscar wishes existed: the safety layer gap

The biggest unmet tooling need is a safety/verification layer that sits between LLM output and the end user.
- Currently, Oscar uses humans in the loop for anything risky: lab test summaries go to doctors, claims explanations go to care guides before reaching members.
- For clinical chatbots that talk directly to consumers, this won’t scale — an automated verification layer is needed.
Faster inference times are also a major need.
- Mario’s thought experiment: if you could run GPT-4 a thousand times in parallel and pick the best output, you might not need GPT-5. GPT-3.5 is already cheap enough for this approach, though quality is still far below GPT-4.
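The thought experiment amounts to best-of-N sampling, sketched below. `generate` and `score` are hypothetical stand-ins: a real version would sample from a model and would need a trustworthy way to rank outputs, which is itself an open problem closely related to the verification layer discussed above.

```python
import random
from typing import Callable


def best_of_n(
    generate: Callable[[str, random.Random], str],
    score: Callable[[str], float],
    prompt: str,
    n: int,
    seed: int = 0,
) -> str:
    """Sample n candidates and return the one the scorer rates highest."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=score)


def fake_generate(prompt: str, rng: random.Random) -> str:
    """Stand-in for one sampled completion; the trailing number mimics its quality."""
    return f"{prompt} draft {rng.randint(0, 999)}"


def fake_score(candidate: str) -> float:
    """Stand-in scorer that reads the quality number back out of the draft."""
    return float(candidate.rsplit(" ", 1)[1])


single = best_of_n(fake_generate, fake_score, "Summarize the lab result", n=1)
best = best_of_n(fake_generate, fake_score, "Summarize the lab result", n=1000)
print(fake_score(single), fake_score(best))
```

The loop is written sequentially for clarity; since the samples are independent, they could be requested in parallel, which is why inference speed and cost dominate this approach.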

Best commercial opportunities in healthcare AI

Regulatory filings and compliance documentation: Generating SOC reports, state regulatory filings, NCQA quality reports — all natural language, all currently manual, all ripe for LLM automation.
Fraud, waste, and abuse detection: Still dominated by old-school, expensive incumbents. No clear reason why specialized vendors need to be overpaid for this.
Prior authorization: A crowded startup space, but Mario is skeptical — prior authorization is a core competency of insurers, and third-party solutions risk having a low ceiling unless deeply integrated. Startups targeting this space are “catching insurers where it hurts” without offering enough platform value.

Will there be AI doctors this decade?

Mario sees no fundamental reason why LLMs can’t eventually replicate clinical reasoning — medicine is highly algorithmic, inference-based, and knowledge-driven.
Three practical barriers:
1. Safety: LLMs can’t yet talk directly to end users without risk of harmful output. This is solvable with better safety layers.
2. Physical interaction: ~35% of medical visits require in-person examination (lab tests, foot exams for diabetics, etc.). Until virtual care can handle these, there will be “leakage” that pulls patients back to the traditional system. Additionally, patient loyalty to specific PCPs is surprisingly low (~28% of members see another PCP in the same year), suggesting the system is already fragmented.
3. Business model misalignment: Health systems have no incentive to adopt lower-cost virtual care because insurers and government will respond by reducing reimbursement rates, forcing capacity cuts. Insurers (who would benefit from lower-cost care) lack the member engagement to deploy it directly. This structural conundrum is perhaps the biggest barrier.

Over-hyped and under-hyped

Over-hyped: Clinical chatbots — the safety, physical exam, and business model barriers are more daunting than most people appreciate.
Under-hyped: Voice output for non-clinical healthcare applications — there’s a lot of low-hanging fruit in voice-based member communication, appointment reminders, and care navigation that doesn’t face the same safety constraints.

Where to follow Mario and Oscar’s AI work

hi-oscar.com — Oscar’s public collection of papers, articles, and AI insights, including detailed write-ups of use cases as they’re solved.
Twitter: @MarioTS — Mario’s personal explorations, Oscar AI updates, and side projects.
Gaming side projects: Mario is exploring two LLM-gaming ideas: (1) an RPG generator that lets you role-play as any stakeholder (CEO, regulator, etc.) using a company’s internal documents, and (2) an LLM that writes and balances game mechanics in real-time as you play (a modern Oregon Trail). He’s looking for collaborators.
