In June 2025, Microsoft released its latest healthcare AI paper, Sequential Diagnosis with Language Models, and it shows immense promise. Microsoft bills it as "The Path to Medical Superintelligence." Are doctors about to be overtaken by AI? Is this really a revolutionary advancement in the field? The paper has only just been submitted for review and may still need additional experimentation, but this article will walk through its main points and discuss its limitations.
The headlines are eye-popping: a method that pushes AI diagnostic accuracy to 80% on Microsoft's new SDBench benchmark. Let's see how that happens.
To summarize briefly: the researchers created a new benchmark, SDBench, built from clinical cases. Unlike most medical benchmarks, performance is measured by both diagnostic accuracy and the total cost of reaching the diagnosis. The contribution is not a new AI model but an orchestration layer, the MAI Diagnostic Orchestrator (MAI-DxO), which we will discuss later on. The orchestration is model-agnostic, and many experimental variants were run to trace out a cost-accuracy Pareto frontier. The headline numbers put physicians at 20% accuracy and MAI-DxO at 80%, but those percentages don't tell the whole story.
What is Sequential Diagnosis?
To start, the paper is called Sequential Diagnosis with Language Models, so what exactly does that mean? When a patient sees a doctor, they first recount their history to provide context. Through iterative questioning and testing, the doctor narrows a working hypothesis down to a diagnosis. The paper highlights several considerations in sequential diagnosis that later shape the system's design: asking informative questions, balancing diagnostic yield against cost and patient burden, and knowing when to commit to a confident diagnosis [1].
SDBench
The Sequential Diagnosis Benchmark is a novel benchmark introduced by Microsoft Research. Before this paper, most medical benchmarks were multiple-choice question-and-answer sets. Google famously used MedQA, which consists of US Medical Licensing Exam (USMLE) style questions, in the development of its medical LLM Med-PaLM 2 (you may remember the headlines Med-PaLM originally made as the medical LLM that passed the USMLE [2]). This kind of Q+A benchmark seems appropriate, since physician licensing itself rests on USMLE multiple-choice exams. However, there is an argument that these questions test some level of memorization rather than deep understanding, and in an era when LLMs are known for memorization, that makes them a questionable benchmark.
To counter this, SDBench comprises 304 New England Journal of Medicine (NEJM) clinicopathological conference (CPC) cases published between 2017 and 2025 [1]. It is designed to mimic the iterative process a human physician goes through to diagnose a patient: an AI model (or human physician) starts from the patient's initial history and must iteratively make decisions to narrow in on a diagnosis. The decision-making model is called the diagnostic agent, and the model revealing information is called the gatekeeper agent; we will discuss both in the next sections.
Another novel part of SDBench is the consideration of cost. Any diagnosis could be made more accurate with unlimited money and unlimited tests, but that is unrealistic. Every question asked and every test ordered therefore incurs a simulated financial cost, mirroring real-world healthcare economics via Current Procedural Terminology (CPT) codes. This means AI performance is evaluated not only on diagnostic accuracy (comparing the final diagnosis to the NEJM's gold standard) but also on reaching that diagnosis in a cost-effective manner.
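To make the cost mechanism concrete, here is a minimal sketch of how a simulated cost tracker along these lines could work. The prices, the flat per-visit charge, and the class and method names are illustrative assumptions, not the paper's actual fee schedule.

```python
# Illustrative sketch of SDBench-style cost accounting.
# All prices below are made-up placeholders, not the paper's fee schedule.
VISIT_COST = 300.0        # assumed flat charge for a round of history questions
CPT_PRICES = {            # hypothetical CPT-code-to-price lookup
    "85025": 25.0,        # complete blood count
    "80053": 40.0,        # comprehensive metabolic panel
    "71046": 110.0,       # chest X-ray, 2 views
}

class CostTracker:
    """Accumulates simulated spend as the diagnostic agent acts."""

    def __init__(self):
        self.total = 0.0

    def ask_questions(self):
        # Questions are bundled into a single visit-style charge.
        self.total += VISIT_COST

    def order_test(self, cpt_code: str):
        # Each ordered test maps to a CPT code with an associated price.
        self.total += CPT_PRICES.get(cpt_code, 100.0)  # fallback if code is unknown

tracker = CostTracker()
tracker.ask_questions()
tracker.order_test("85025")
print(f"Simulated spend so far: ${tracker.total:.2f}")  # $325.00
```

The point of the design is simply that every action has a price attached, so accuracy alone can't win the benchmark.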
Judging the Diagnosis with SDBench
The natural question that arises is: how exactly are these diagnoses evaluated for correctness within the SDBench framework? This isn't straightforward, since diseases often have multiple names, which makes direct string matching unreliable. To address this, the Microsoft researchers created a judge agent.
The full diagram of everything that was just described for SDBench is shown in Figure 1.
Agents and AI
The most important thing to remember is that MAI-DxO is model-agnostic: it is an AI orchestrator. That may not be a well-known term, but Microsoft defines it for us: "In the context of generative AI, an orchestrator is like a digital conductor helping to coordinate multiple steps in achieving a complex task. In healthcare, the role of orchestration is crucial given the high stakes of each decision" [3]. Any model can therefore be slotted in as the agents, which is great because the system does not go out of date every time a new model comes out. A full diagram of MAI-DxO is shown in Figure 3.
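To illustrate what "model-agnostic" means here, this is a tiny sketch of an orchestrator that takes any chat-completion backend as a dependency. The `LLMClient` interface and the class and method names are my own placeholders, not Microsoft's API.

```python
from typing import Protocol

class LLMClient(Protocol):
    """Any chat-completion backend (GPT, Claude, Gemini, ...) can satisfy this."""
    def complete(self, system: str, user: str) -> str: ...

class DiagnosticOrchestrator:
    """Coordinates the multi-step diagnostic workflow on top of whichever
    foundation model it is handed; the workflow itself never changes."""

    def __init__(self, model: LLMClient):
        self.model = model  # swap in a newer model without touching the logic

    def step(self, case_state: str) -> str:
        # The orchestration logic (roles, budgeting, checks) lives around this
        # call, independent of the underlying model provider.
        return self.model.complete(
            system="You are part of a virtual diagnostic panel.",
            user=case_state,
        )
```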

Earlier, it was mentioned that there are three agents: diagnostic, gatekeeper, and judge. It's interesting to think of the diagnostic and gatekeeper agents as a loose analogue of a GAN (Generative Adversarial Network), where the diagnostic agent tries to improve while being limited by the information the gatekeeper releases. Let's look at each agent in turn.
Diagnostic Agent
For the diagnostic agent, the language model orchestrates five distinct roles at once. The paper doesn't specify exactly how each role is implemented, but each is likely a specialized component or fine-tuned LLM dedicated to that task. The five roles are as follows:
- Dr. Hypothesis: maintains the top 3 most likely diagnoses at each step of the sequential diagnosis
- Dr. Test Chooser: selects up to 3 diagnostic tests at each step to discriminate between the current hypotheses
- Dr. Challenger: acts as a devil's advocate, trying to undermine the current hypotheses
- Dr. Stewardship: focuses on cost, minimizing spend while maximizing diagnostic yield
- Dr. Checklist: performs quality control for the entire diagnostic agent, ensuring valid and consistent outputs
After each step of the sequential diagnosis, the five-member panel must agree on one of three actions: ask a question, order a diagnostic test, or commit to a final diagnosis. A final diagnosis is produced once the agent reaches a confidence threshold, and an optional budget tracker lets the agent take cost into account. A minimal sketch of this decision loop is shown below.
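This is a compressed sketch of how such a panel loop could be wired together. The role prompts are abstracted behind a `consult` callable, and the confidence threshold, data shapes, and tie-breaking logic are assumptions made for illustration; the paper does not publish this level of implementation detail.

```python
from dataclasses import dataclass, field

ROLES = ["Dr. Hypothesis", "Dr. Test Chooser", "Dr. Challenger",
         "Dr. Stewardship", "Dr. Checklist"]
CONFIDENCE_THRESHOLD = 0.80  # assumed value, not taken from the paper

@dataclass
class PanelState:
    hypotheses: list = field(default_factory=list)  # e.g. [{"name": ..., "prob": ...}]
    spend: float = 0.0                               # running simulated cost
    transcript: list = field(default_factory=list)   # questions asked, results received

def panel_step(state: PanelState, consult) -> dict:
    """One round of the virtual panel: gather each role's opinion (as a dict),
    then pick one of the three allowed actions (question, test, or diagnosis)."""
    opinions = {role: consult(role, state) for role in ROLES}

    # Stop and diagnose once the leading hypothesis is confident enough.
    top = max(state.hypotheses, key=lambda h: h["prob"], default=None)
    if top is not None and top["prob"] >= CONFIDENCE_THRESHOLD:
        return {"action": "diagnose", "diagnosis": top["name"]}

    # Otherwise prefer the tests proposed by Dr. Test Chooser (capped at 3),
    # falling back to asking another history question.
    tests = opinions["Dr. Test Chooser"].get("tests", [])
    if tests:
        return {"action": "order_tests", "tests": tests[:3]}
    return {"action": "ask_question", "question": opinions["Dr. Hypothesis"].get("question")}
```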
Gatekeeper Agent
The gatekeeper agent was implemented with OpenAI's o4-mini model, with its guidelines and responses shaped by physicians so that it would not reveal more information than an actual patient or diagnostic test would. The paper notes that educational case write-ups can carry hindsight bias, and the gatekeeper is meant to keep that bias from leaking into the interaction.
One initial flaw in the system was that the diagnostic agent would request information not present in the case. When the gatekeeper replied "Not Responding," the diagnostic agent would pick up a hint that the request was off-track, loosely analogous to a GAN whose generator learns to exploit its adversary. To correct this, the gatekeeper instead returns synthetic findings reflecting what the query or test would most plausibly yield.
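As a rough sketch of that gatekeeper behavior, consider the function below. The prompt text, the case-findings dictionary, and the function names are hypothetical; only the overall behavior (reveal real findings, otherwise synthesize a plausible, non-leaking result) comes from the paper.

```python
def gatekeeper_respond(request: str, case_findings: dict, llm_complete) -> str:
    """Reveal only what the NEJM case file actually contains; for anything else,
    synthesize a plausible result instead of leaking a tell-tale refusal."""
    if request in case_findings:
        return case_findings[request]  # verbatim finding from the case

    # The requested item isn't in the case. Returning "Not Available" would hint
    # that the request is off-track, so generate a realistic, unremarkable result
    # that stays consistent with the known findings and hides the diagnosis.
    prompt = (
        "You are simulating a patient's clinical data. "
        f"The requested item '{request}' is not documented in the case file. "
        "Return a plausible, unremarkable result consistent with the documented "
        "findings, without revealing or hinting at the final diagnosis."
    )
    return llm_complete(prompt)
```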
Judge Agent
The judge agent is built on OpenAI's o3 model and follows the rubric shown in Figure 2, which was based on HealthBench, introduced by OpenAI in May 2025 [4]. For a diagnosis to count as correct, the judge must assign it a score of at least 4.

To check whether the judge agent was doing its job, the diagnoses produced by the diagnostic agent were also graded by physicians. The two disagreed in only a handful of cases, and in most of those the judge turned out to be the stricter grader.
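Putting this together, a minimal LLM-as-judge scoring step might look like the sketch below. The rubric wording is paraphrased, the 5-point scale is assumed, and the helper names are hypothetical; the score-of-at-least-4 cutoff is the one reported above.

```python
def judge_diagnosis(candidate: str, gold_standard: str, llm_complete) -> bool:
    """Grade a candidate diagnosis against the NEJM reference; only scores of
    4 or higher count as correct, matching the paper's threshold."""
    prompt = (
        "You are grading a medical diagnosis against a reference answer.\n"
        f"Reference diagnosis: {gold_standard}\n"
        f"Candidate diagnosis: {candidate}\n"
        "Rate clinical equivalence on a 1-5 scale "
        "(5 = clinically identical, 1 = unrelated). Reply with only the number."
    )
    score = int(llm_complete(prompt).strip())
    return score >= 4
```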
Experimentation
Of the 304 cases, the 56 most recent were held out for testing, and the rest were used for development. Across the different agents, Microsoft tested a wide range of foundation models: GPT-3.5-turbo, GPT-4o, GPT-4.1, GPT-4.1-mini, GPT-4.1-nano, o3, o4-mini, Claude 4 Sonnet, Claude 4 Opus, Gemini 2.5 Pro, Gemini 2.5 Flash, Grok-3, Grok-3-mini, Llama 4 Maverick, and DeepSeek-R1.
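For clarity, that hold-out amounts to a simple chronological split. Here is a sketch of the idea, assuming each case record carries a publication date field (the field name is hypothetical).

```python
def chronological_split(cases, n_test=56):
    """Hold out the most recently published cases as the hidden test set.
    `cases` is assumed to be a list of dicts with a 'publication_date' field."""
    ordered = sorted(cases, key=lambda c: c["publication_date"])
    return ordered[:-n_test], ordered[-n_test:]  # (development cases, test cases)
```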
As an aside, the models were prompted with XML-style formatting, which, along with JSON, has recently become a popular way to structure LLM prompts; XML tags are especially favored for Claude models.
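For readers unfamiliar with the idea, XML-style prompting just means wrapping the different parts of the prompt in tags so the model can tell them apart. A toy example is below; the tag names and the case stub are invented for illustration and are not taken from the paper.

```python
case_stub = "A middle-aged patient presents with two weeks of fever and joint pain."

prompt = f"""
<role>You are Dr. Hypothesis on a virtual diagnostic panel.</role>
<case>
  {case_stub}
</case>
<task>
  List the three most likely diagnoses with an estimated probability for each.
</task>
<format>One diagnosis per line, as: name | probability</format>
"""
```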
In testing the accuracy-cost trade-off on SDBench, five main variants were evaluated:
- Instant Answer: a diagnosis must be produced from the initial patient presentation alone (no follow-up questions or tests allowed)
- Question Only: the diagnostic agent can ask questions but cannot order tests
- Budgeted: a budgeting system is enforced, and tests can be canceled once their cost is revealed
- No Budget: exactly what it sounds like; cost is not considered
- Ensemble: similar to model ensembling, with multiple diagnostic panels run in parallel
The performance of each variant is shown in the Results section, and the pattern resembles what you would expect in traditional machine learning when you vary data constraints and apply model ensembling (a rough sketch of the ensemble idea follows below).
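As an example of the last variant, here is a rough sketch of running several independent panels on the same case and aggregating their answers. The majority-vote aggregation is my own simplification for illustration; the paper does not necessarily aggregate this way.

```python
from collections import Counter

def run_ensemble(case, run_panel, n_panels: int = 5) -> str:
    """Run several independent diagnostic panels on the same case and keep the
    most common final diagnosis (simple majority vote, used here only as a
    stand-in for whatever aggregation the paper actually performs)."""
    finals = [run_panel(case) for _ in range(n_panels)]  # could run in parallel
    winner, _count = Counter(finals).most_common(1)[0]
    return winner
```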
Results
Now that we have covered the basics of the paper and its agentic setup, we can look at the results. In its final form, MAI-DxO achieves the best diagnostic accuracy when ensembling, and the best accuracy at any given budget, as shown in Figure 3. The individual LLM baselines come from simply feeding the case to the LLM and asking for a diagnosis.

From this figure, the results look remarkable. The Pareto frontier is defined entirely by MAI-DxO configurations, which handily beat both the other models and the physicians on diagnostic accuracy and cost alike. This is where the headlines about doctors no longer being necessary come from: at a comparable budget, MAI-DxO is four times more accurate than the sampled physicians.
The paper contains a few more result figures, but for simplicity this is the main one. Other results include MAI-DxO boosting the performance of off-the-shelf models and Pareto frontier curves suggesting the system isn't simply memorizing the cases.
How Good are these Results?
You might be wondering whether these results are really as good as they sound. Despite the impressive numbers, the researchers do a good job of adding nuance and explaining the system's drawbacks. Let's go over some of the caveats raised in the paper.
To start, a real patient rarely presents as a concise two-to-three-sentence summary. Patients may never state their main complaint directly, the stated complaint may not be the actual issue, and the initial history can take many minutes to gather. If MAI-DxO were used in practice, it would need to handle all of these scenarios; the patient doesn't always know what is wrong or how to express it.
In addition, the paper notes that the NEJM cases used are among the most challenging in existence; many of the world's top doctors couldn't solve them. MAI-DxO performed impressively on these, but how does it perform on the routine day-to-day cases that make up the majority of most doctors' work? AI agents do not think like we do, and being able to solve hard cases doesn't guarantee they can solve easier ones. There are also other factors, such as wait times for tests and patient comfort, that feed into real diagnoses. More results are needed to demonstrate this.
The 20% accuracy for physicians is also a bit misleading, and the paper discusses this well in its limitations section. The physicians were not allowed to use the internet while working through the cases. How many times have we heard in school that in real life we will always be able to look things up? Doctors need to look up information too, and with search access they would likely score far higher on these cases.
Earlier, we discussed how the gatekeeper agent generates synthetic data to prevent the diagnostic agent from picking up hints. The quality of that synthetic data needs further examination: hints could still leak from these synthesized results, and we don't actually know what the real findings for these requests would have been. The flip side is that the system may not generalize well if the diagnostic agent gets bogged down by confusing synthetic results from an off-target test it ordered.
What’s the Takeaway?
In the world of healthcare AI, Microsoft's MAI-DxO is extremely promising. Just a few years ago, the idea of practical AI agents seemed far-fetched; now a system can perform sequential medical reasoning and solve NEJM cases while balancing cost and accuracy.
However, this isn't without its limitations. We need a true gold standard to compare healthcare AI agents against: if every paper benchmarks physician accuracy in a different way, it will be hard to tell how good AI really is. We also need to decide what matters most in diagnostics; are cost and accuracy the only two factors, or should there be more? SDBench seems like a step in the right direction, replacing memorization-style testing with something closer to conceptual understanding, but there is more to consider.
The headlines all over the news shouldn't scare you. We are still a long way from medical superintelligence, and even if a great system were built, years of validation and regulatory approval would follow. We are still in the early stages, but AI does hold the power to revolutionize medicine.
References
[1] Nori, Harsha, et al. "Sequential Diagnosis with Language Models." arXiv:2506.22405v1 (June 2025).
[2] Singhal, Karan, et al. "Toward expert-level medical question answering with large language models." Nature Medicine (January 2025).
[3] Microsoft AI, "The Path to Medical Superintelligence." https://microsoft.ai/new/the-path-to-medical-superintelligence/
[4] Arora, Rahul, et al. "HealthBench: Evaluating Large Language Models Towards Improved Human Health." arXiv:2505.08775v1 (May 2025).