Medical AI models are typically benchmarked within narrow clinical settings. How much does performance degrade when context shifts — across specialties, demographics, and data availability? Which shifts cause the largest drops, and is the gap closing as new models are released?
Benchmark publicly available medical LLMs across controlled context shifts using open medical QA datasets and clinical benchmarks. Measure degradation along three axes: specialty (cardiology vs. dermatology vs. psychiatry), demographic (age, sex, comorbidities), and data availability (full workup vs. partial information). Each new model becomes a data point in a longitudinal record.
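The degradation measurement described above can be sketched as follows. This is a minimal illustration, not the actual pipeline: the accuracy numbers, context labels, and the `degradation` helper are hypothetical placeholders standing in for real per-slice evaluation results.

```python
# Hypothetical per-context accuracy for one model on one benchmark run.
# In a real run these would come from scoring the model on each dataset slice.
scores = {
    "specialty": {"cardiology": 0.81, "dermatology": 0.74, "psychiatry": 0.68},
    "demographic": {"age<40": 0.79, "age>=65": 0.71},
    "data_availability": {"full_workup": 0.83, "partial_info": 0.70},
}

def degradation(context_scores: dict[str, float]) -> dict[str, float]:
    """Accuracy drop per context, relative to the best context on that axis."""
    baseline = max(context_scores.values())
    return {ctx: baseline - acc for ctx, acc in context_scores.items()}

for axis, ctx_scores in scores.items():
    drops = degradation(ctx_scores)
    worst = max(drops, key=drops.get)
    print(f"{axis}: largest drop {drops[worst]:.2f} in {worst}")
```

Repeating this for each new model release yields one row per model in the longitudinal record, making it possible to see whether per-axis gaps shrink over time.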
Results will appear here as experiments run. This section updates when the CI pipeline produces new outputs.