AI has the potential to make expert medical reasoning more accessible, but current evaluations often fall short because they rely on simplified, static scenarios. Real clinical practice is far more dynamic: doctors adjust their diagnostic approach step by step, asking targeted questions and interpreting new information as it arrives. This iterative process helps them refine hypotheses, weigh the costs and benefits of tests, and avoid jumping to conclusions. Although language models have shown strong performance on structured exams, these tests do not reflect real-world complexity, where premature decisions and over-testing remain serious concerns that static assessments often miss.
Medical problem-solving has been explored for decades, with early AI systems using Bayesian frameworks to guide sequential diagnoses in specialties such as pathology and trauma care. However, these approaches faced challenges because they required extensive expert input. Recent studies have shifted toward using language models for clinical reasoning, often evaluated on static multiple-choice benchmarks that are now largely saturated. Projects like AMIE and NEJM-CPC introduced more complex case material but still relied on fixed vignettes. While some newer approaches assess conversational quality or basic information gathering, few capture the full complexity of real-time, cost-sensitive diagnostic decision-making.
To better reflect real-world clinical reasoning, Microsoft AI researchers developed SDBench, a benchmark based on 304 real diagnostic cases from the New England Journal of Medicine, in which doctors or AI systems must interactively ask questions and order tests before making a final diagnosis. A language model acts as a gatekeeper, revealing information only when it is specifically requested. To improve performance, they introduced MAI-DxO, an orchestrator system co-designed with physicians that simulates a virtual medical panel to choose high-value, cost-effective tests. When paired with models such as OpenAI's o3, it achieved up to 85.5% accuracy while significantly reducing diagnostic costs.
The Sequential Diagnosis Benchmark (SDBench) was built from 304 NEJM Case Challenge cases (2017-2025), covering a wide range of clinical conditions. Each case was transformed into an interactive simulation in which diagnostic agents could ask questions, request tests, or commit to a final diagnosis. A gatekeeper, powered by a language model and guided by clinical rules, responded to these actions with realistic case details or with synthetic but consistent findings. Diagnoses were evaluated by a judge model using a physician-authored rubric focused on clinical relevance. Costs were estimated using CPT codes and pricing data to reflect real-world diagnostic constraints and decision-making.
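The interaction loop described above can be sketched in a few lines of Python. This is only an illustrative toy, not the SDBench implementation: the `Case` and `Gatekeeper` classes, the price table, the default price, and the example findings are all hypothetical, a dictionary lookup stands in for the language-model gatekeeper, and an exact string match stands in for the rubric-based judge model.

```python
from dataclasses import dataclass, field

# Hypothetical prices for illustration only; the benchmark derives costs
# from CPT codes and pricing data, which are not reproduced here.
TEST_PRICES = {"chest x-ray": 100.0, "cbc": 30.0}
DEFAULT_TEST_PRICE = 150.0  # assumed fallback for tests missing from the table


@dataclass
class Case:
    vignette: str             # short opening summary shown to the agent
    findings: dict[str, str]  # hidden details, keyed by question or test name
    diagnosis: str            # ground-truth final diagnosis


@dataclass
class Gatekeeper:
    case: Case
    total_cost: float = 0.0
    log: list[str] = field(default_factory=list)

    def ask(self, topic: str) -> str:
        """Reveal a finding only if the agent asks for it explicitly."""
        self.log.append(f"ask: {topic}")
        return self.case.findings.get(topic.lower(), "No information available.")

    def order_test(self, name: str) -> str:
        """Charge for the test, then return its result (or a normal default)."""
        key = name.lower()
        self.total_cost += TEST_PRICES.get(key, DEFAULT_TEST_PRICE)
        self.log.append(f"test: {name}")
        return self.case.findings.get(key, "Result within normal limits.")

    def diagnose(self, guess: str) -> bool:
        """Stand-in for the judge: a case-insensitive exact match."""
        self.log.append(f"diagnose: {guess}")
        return guess.strip().lower() == self.case.diagnosis.lower()


# Example episode with made-up case content.
case = Case(
    vignette="Adult with a persistent cough.",
    findings={
        "cough duration": "Three weeks.",
        "chest x-ray": "Right upper lobe cavitary lesion.",
    },
    diagnosis="tuberculosis",
)
gatekeeper = Gatekeeper(case)
history = gatekeeper.ask("cough duration")
xray = gatekeeper.order_test("chest x-ray")
cbc = gatekeeper.order_test("cbc")
correct = gatekeeper.diagnose("Tuberculosis")
print(correct, gatekeeper.total_cost)
```

Separating the hidden findings from the opening vignette is what forces an agent to decide what to ask for, and accumulating cost inside the gatekeeper makes every test order visible in the final score.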
The researchers evaluated various AI diagnostic agents on SDBench and found that MAI-DxO consistently outperformed both off-the-shelf models and physicians. While standard models showed a trade-off between cost and accuracy, MAI-DxO, built on o3, delivered higher accuracy at lower cost through structured reasoning and decision-making. For example, it reached 81.9% accuracy at $4,735 per case, compared with 78.6% for off-the-shelf o3 at $7,850. It also proved robust across multiple models and on held-out test data, indicating strong generalizability. The system substantially improved weaker models and helped stronger ones use resources more efficiently, cutting unnecessary tests through smarter information gathering.
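To put the reported figures side by side, the short calculation below derives the accuracy gain in percentage points and the relative cost reduction from the two data points quoted above. The variable names are illustrative, and the derived quantities are simple arithmetic on the article's numbers rather than metrics reported by the researchers.

```python
# Figures quoted in the article: accuracy (%) and average cost per case (USD).
mai_dxo_accuracy, mai_dxo_cost = 81.9, 4735.0
o3_accuracy, o3_cost = 78.6, 7850.0

# Absolute accuracy gain, in percentage points.
accuracy_gain_pp = mai_dxo_accuracy - o3_accuracy

# Relative cost reduction, as a fraction of the o3 baseline cost.
cost_reduction = (o3_cost - mai_dxo_cost) / o3_cost

print(f"Accuracy gain: {accuracy_gain_pp:.1f} percentage points")
print(f"Cost reduction: {cost_reduction:.1%}")
```

On these two data points, the orchestrated system is about 3.3 percentage points more accurate while costing roughly 40% less per case.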
In conclusion, SDBench is a new diagnostic benchmark that turns NEJM CPC cases into realistic, interactive challenges, requiring AI systems or physicians to actively ask questions, order tests, and make diagnoses, each with an associated cost. Unlike static benchmarks, it mimics real clinical decision-making. The researchers also introduced MAI-DxO, a system that simulates diverse medical personas to achieve high diagnostic accuracy at lower cost. Although the current results are promising, especially on complex cases, limitations include the absence of everyday conditions and of real-world constraints. Future work aims to test the system in real clinics and low-resource settings, with potential for global health impact and for use in medical education.
Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, Hassan brings a fresh perspective to the intersection of AI and real-life solutions.
