A benchmark for evaluating diagnostic questioning efficiency of LLMs in patient conversations

Published at: Scientific Reports

Abstract:
Physician–patient diagnostic interactions often rely on incomplete, uncertain, and noisy symptom descriptions expressed in non-specialist language. To reach an accurate diagnosis efficiently, physicians employ adaptive sequences of focused questions in which each inquiry depends on prior patient responses. Similar requirements apply to AI assistants engaged in clinical dialogue, which must strategically select questions to elicit diagnostic information efficiently. While several datasets and benchmarks address clinical reasoning, few focus on evaluating the strategic inquiry process itself. To address this gap, we introduce Q4Dx (Question-Driven Diagnosis), a benchmark for assessing goal-directed diagnostic questioning. Q4Dx consists of synthetically generated patient cases derived from curated symptom–disease knowledge. Each case is instantiated at multiple information levels (100%, 80%, and 50% symptom exposure) to evaluate performance under varying informational constraints. We simulate patient–physician interactions using GPT-4.1 and GPT-4o-mini agents to generate clinician questions, patient responses, and intermediate diagnostic hypotheses. Performance is evaluated using Zero-shot Diagnostic Accuracy (ZDA), Mean Questions to Correct Diagnosis (MQD), and Inquire Sequence Efficiency (ISE), which measures convergence toward the correct diagnosis. Q4Dx provides a reusable framework for benchmarking large language models in strategic clinical inquiry and supports future research on AI-assisted diagnostic training. The dataset and benchmark are publicly available at: https://github.com/MaiWert/MedQDx.
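As a rough illustration of how two of the metrics named above could be computed from simulated dialogues, the following is a minimal sketch and not the authors' implementation. The record layout (`DialogueResult`), the case-insensitive string match, and the choice to average MQD only over cases that were eventually solved are all assumptions made for illustration. ISE is omitted because the abstract does not define it beyond "convergence toward the correct diagnosis".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DialogueResult:
    """Outcome of one simulated case (hypothetical record layout)."""
    true_diagnosis: str
    # hypotheses[0] is the guess before any question is asked;
    # hypotheses[k] is the guess after the k-th question.
    hypotheses: List[str]


def questions_to_correct(result: DialogueResult) -> Optional[int]:
    """Return how many questions preceded the first correct guess, or None."""
    target = result.true_diagnosis.strip().lower()
    for k, guess in enumerate(result.hypotheses):
        if guess.strip().lower() == target:  # assumed matching rule
            return k
    return None


def zero_shot_accuracy(results: List[DialogueResult]) -> float:
    """ZDA sketch: share of cases diagnosed correctly with zero questions."""
    hits = sum(1 for r in results if questions_to_correct(r) == 0)
    return hits / len(results)


def mean_questions_to_diagnosis(results: List[DialogueResult]) -> Optional[float]:
    """MQD sketch: mean question count over solved cases (an assumption)."""
    counts = [questions_to_correct(r) for r in results]
    solved = [k for k in counts if k is not None]
    return sum(solved) / len(solved) if solved else None


# Toy data, invented purely to exercise the functions.
cases = [
    DialogueResult("influenza", ["influenza"]),
    DialogueResult("asthma", ["common cold", "bronchitis", "asthma"]),
    DialogueResult("gout", ["arthritis", "sprain"]),
]
zda = zero_shot_accuracy(cases)
mqd = mean_questions_to_diagnosis(cases)
```

Running the same functions separately on the 100%, 80%, and 50% symptom-exposure variants of each case would give one ZDA and MQD value per information level. How unsolved cases should count toward MQD is a design choice that the abstract leaves open.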

By Yehudit Aperstein, 1/24/2026
