Expert Exam Scores Don't Predict Medical LLM Reliability Under Pressure

MedMisBench shows LLM accuracy drops from 71.1% to 38.0% when misleading context is injected, exposing a structural gap in medical AI evaluation.

Expert-level scores on medical licensing exams have become the default credibility signal for clinical AI. The assumption embedded in that signal is wrong. A model that correctly answers a question in isolation may abandon that correct answer the moment plausible-but-false context surrounds it, and no standard benchmark measures whether that flip happens.

The capability at stake now has a name: epistemic resilience, the ability to maintain correct judgment under adversarial context. MedMisBench was built to measure it. The benchmark contains 10,932 medical question items paired with 48,889 misleading context-option combinations, spanning medical reasoning, agentic capability, and patient-journey evaluation, and it was run against 11 model configurations. The core design principle is simple and damaging: take questions models already answer correctly, inject fabricated context designed to look credible, and measure how often the correct answer survives.
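The design principle above can be sketched in a few lines of Python. This is a minimal illustration, not MedMisBench's actual harness: the `BenchmarkItem` fields, the `answer_fn` interface, and the choice to prepend the misleading context to the question are all assumptions made for the example, and the benchmark's own definition of attack success may differ in detail.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class BenchmarkItem:
    """Hypothetical item layout; the real benchmark schema may differ."""
    question: str
    options: tuple
    answer: str                 # gold option label, e.g. "B"
    misleading_contexts: tuple  # fabricated passages targeting this item


# Hypothetical model interface: (prompt, options) -> chosen option label.
AnswerFn = Callable[[str, Sequence[str]], str]


def measure_resilience(items: Sequence[BenchmarkItem], answer_fn: AnswerFn) -> dict:
    """Attack only the items the model answers correctly without context."""
    clean_correct = 0
    attacks = 0
    flips = 0
    for item in items:
        # Step 1: keep only questions the model already gets right.
        if answer_fn(item.question, item.options) != item.answer:
            continue
        clean_correct += 1
        # Step 2: inject each fabricated context and re-ask.
        for context in item.misleading_contexts:
            attacks += 1
            prompt = f"{context}\n\n{item.question}"  # assumed injection format
            # Step 3: count the attack as successful if the answer is lost.
            if answer_fn(prompt, item.options) != item.answer:
                flips += 1
    return {
        "clean_accuracy": clean_correct / len(items) if items else 0.0,
        "attacks": attacks,
        "attack_success": flips / attacks if attacks else 0.0,
    }
```

Filtering to initially correct items is the important design choice in this sketch: it separates "the model never knew the answer" from "the model knew and was talked out of it," which is the distinction the benchmark is built around.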

The injections are not random noise. The most effective ones are structural: formal, rule-like fabrications that mimic how clinical guidance actually reads. Authority-framed falsehoods, the kind that open with "According to current guidelines..." or similar constructions, reach 69.5% attack success. Exception-poisoning claims, fabrications that carve out a plausible-sounding special case, reach 64.1%. Both formats exploit the same vulnerability: LLMs appear to weight the surface features of authoritative language, not just its factual content. Think of it as a trust heuristic misfiring. A clinician trained to recognize source quality can distinguish a fabricated guideline from a real one; these models cannot, at least not reliably under pressure.

Mean accuracy across all 11 model configurations falls from 71.1% on original questions to 38.0% under focused misleading context. Overall attack success, the rate at which misleading context displaces a correct answer, is 51.5%. A 14-member clinical panel from 7 countries reviewed a subset of cases and identified serious potential harm in 38.2% of them. For teams deploying LLMs in clinical decision support, patient-facing chat, or any setting where users supply their own context, the takeaway is direct: benchmark accuracy on clean questions is not a proxy for reliability when the input environment is noisy, adversarial, or simply wrong.
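To put the headline numbers in proportion, the short calculation below converts the two reported mean accuracies into an absolute and a relative drop. Only the 71.1% and 38.0% figures come from the source; the function name is illustrative.

```python
def accuracy_drop(clean_pct: float, attacked_pct: float) -> tuple:
    """Return (absolute drop in percentage points, relative drop as a fraction)."""
    absolute = clean_pct - attacked_pct
    relative = absolute / clean_pct
    return absolute, relative


# Reported mean accuracies across the 11 model configurations.
absolute, relative = accuracy_drop(71.1, 38.0)
print(f"{absolute:.1f} points lost, {relative:.1%} of clean accuracy")
# prints "33.1 points lost, 46.6% of clean accuracy"
```

In other words, on average nearly half of the accuracy measured on clean questions disappears once misleading context is present.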

We're thinking: The authority-framing result is the most operationally significant piece here. The 69.5% attack success for rule-like fabrications suggests these models have internalized a form of epistemic deference: formal-sounding context overrides encoded knowledge. That is not a retrieval failure or a reasoning gap. It is a calibration failure at the level of source trust. The implication for deployment is that prompt engineering and retrieval guardrails, which teams commonly use to improve factual grounding, may actually widen the attack surface if misleading context arrives through those same channels. MedMisBench gives teams a concrete measurement tool, but the harder problem it surfaces is that safety evaluations for medical AI need an adversarial context layer, not just a held-out accuracy layer.

Key takeaways:

Epistemic resilience (accuracy under misleading context, as opposed to clean-question accuracy) is a distinct capability that current medical benchmarks do not measure; MedMisBench separates the two and finds they diverge sharply.
Accuracy drops from 71.1% to 38.0% under focused misleading context across 11 model configurations, with authority-framed fabrications driving 69.5% attack success; the benchmark covers 10,932 questions but clinical panel review covers only a subset, so harm rate estimates carry sampling caveats.
Teams building patient-facing or clinical decision-support applications should add adversarial context injection to their evaluation suite before deployment, treating misleading-context attack success as a first-class safety metric alongside standard accuracy.

Source: Measuring Epistemic Resilience of LLMs Under Misleading Medical Context