Medical AI models are typically benchmarked within narrow clinical settings. How much does performance degrade when context shifts — across specialties, demographics, and data availability? Which shifts cause the largest drops, and is the gap closing as new models are released?
Benchmark publicly available medical LLMs across controlled context shifts using open medical QA datasets and clinical benchmarks. Measure degradation along three axes: specialty (cardiology vs. dermatology vs. psychiatry), demographic (age, sex, comorbidities), and data availability (full workup vs. partial information). Each new model becomes a data point in a longitudinal record.
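The degradation measurement described above can be sketched as follows. This is a minimal illustration, not the actual pipeline: the accuracy numbers, context labels, and the `degradation` helper are hypothetical placeholders standing in for real per-slice evaluation results.

```python
# Hypothetical per-context accuracy for one model on one benchmark run.
# In a real run these would come from scoring the model on each dataset slice.
scores = {
    "specialty": {"cardiology": 0.81, "dermatology": 0.74, "psychiatry": 0.68},
    "demographic": {"age<40": 0.79, "age>=65": 0.71},
    "data_availability": {"full_workup": 0.83, "partial_info": 0.70},
}

def degradation(context_scores: dict[str, float]) -> dict[str, float]:
    """Accuracy drop per context, relative to the best context on that axis."""
    baseline = max(context_scores.values())
    return {ctx: baseline - acc for ctx, acc in context_scores.items()}

for axis, ctx_scores in scores.items():
    drops = degradation(ctx_scores)
    worst = max(drops, key=drops.get)
    print(f"{axis}: largest drop {drops[worst]:.2f} in {worst}")
```

Repeating this for each new model release yields one row per model in the longitudinal record, making it possible to see whether per-axis gaps shrink over time.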
Results will appear here as experiments run. This section updates when the CI pipeline produces new outputs.