Evaluating large language model workflows in clinical decision support for triage, referral, and diagnosis
Data preprocessing
Our goal was to develop a model capable of predicting the specialty, triage level, and diagnosis for patients in an emergency department (ED) setting or those experiencing symptoms at home. Since we aimed to evaluate the difference in model performance depending on whether the information was entered by the patient themselves or by a clinician, we designed our dataset accordingly. For the general user, we required two main inputs: a description of the patient’s symptoms and some basic patient information. For the clinical user, we added the initial vital signs, such as temperature, heart rate, respiratory rate, oxygen saturation, and blood pressure, which can be measured upon arrival at the ED.
We created our curated dataset using the MIMIC-IV-ED dataset22,25 in conjunction with the MIMIC-IV-Note dataset22,26, both modules of MIMIC-IV20,21,22, to support clinical decision-making in an emergency department setting. The MIMIC-IV-ED dataset contains extensive information from patients admitted to the emergency department, while the Note module provides valuable unstructured clinical notes for these patients.
The data processing pipeline is presented in Fig. 7. First, we merged the necessary data tables from each source. Triage information was obtained from the MIMIC-IV-ED “triage” file, while patient demographics such as race and gender were extracted from the “edstays” file. Age, specifically, was extracted from the MIMIC-IV “patients” file. The initial vital signs were extracted from the MIMIC-IV-ED “triage” file, and the unstructured clinical notes were extracted from the MIMIC-IV-Note “discharge” file.

This figure illustrates our data preprocessing pipeline. From PhysioNet, we used MIMIC-IV-ED 2.2 and MIMIC-IV-Note. The necessary data tables from each data source were merged, and the merged data were then processed and cleaned. Finally, we processed the clinical notes to extract the relevant information: the history of present illness and the primary diagnoses.
Initially, we extracted the relevant discharge notes from the MIMIC-IV-Note dataset and linked them with the patient records from the MIMIC-IV-ED 2.2 “edstays” file using the stay_id. We then merged the triage information and the patient demographics (gender, race, and age) from the respective files, and integrated the initial vital signs. During this merging process, we dropped duplicate subject entries, removed cases with missing triage information, and filtered the records to retain only those with sequence number (seq_num) equal to 1, ensuring the uniqueness of the ED stays. We also excluded patients who had died.
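As an illustration of this merging and filtering step, the following Python sketch uses pandas and assumes the standard MIMIC-IV / MIMIC-IV-ED CSV layouts; the file paths and the exact filter order are assumptions, not the code used in the study.

import pandas as pd

# Minimal sketch of the merging step described above; paths and columns
# follow the public MIMIC-IV / MIMIC-IV-ED table layouts (assumed here).
edstays = pd.read_csv("ed/edstays.csv")       # stay_id, subject_id, gender, race, ...
triage = pd.read_csv("ed/triage.csv")         # stay_id, acuity, chiefcomplaint, vital signs
patients = pd.read_csv("hosp/patients.csv")   # subject_id, anchor_age, dod
diagnosis = pd.read_csv("ed/diagnosis.csv")   # stay_id, seq_num, icd_code, icd_title

# Merge triage information and demographics onto the ED stays.
df = (edstays
      .merge(triage, on=["subject_id", "stay_id"], how="inner")
      .merge(patients[["subject_id", "anchor_age", "dod"]], on="subject_id", how="left"))

# Keep only records whose diagnosis sequence number equals 1.
first_dx = diagnosis.loc[diagnosis["seq_num"] == 1, ["stay_id"]].drop_duplicates()
df = df.merge(first_dx, on="stay_id", how="inner")

# Drop duplicate subjects, remove missing triage acuity, exclude deceased patients.
df = (df.drop_duplicates(subset="subject_id")
        .dropna(subset=["acuity"])
        .loc[lambda d: d["dod"].isna()])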
A separate preprocessing step was applied to the unstructured clinical notes. Specifically, we selected only the patients whose notes contained a history of present illness (HPI). We extracted the HPI paragraph from the discharge notes, discarding all other information included in the notes. We further kept only cases whose HPI had a string length between 50 and 2000 characters, to avoid overly short or long HPIs. We additionally removed any entries that mentioned “ED” or “ER”, as these references did not include any necessary information regarding the patient’s symptoms or how the patient was feeling.
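A minimal sketch of the HPI extraction and filtering is shown below, assuming the section headers used in typical MIMIC-IV-Note discharge notes; the regular expressions are illustrative rather than the exact rules applied in the study.

import re

def extract_hpi(note_text: str):
    # Pull the "History of Present Illness" paragraph out of a discharge note.
    match = re.search(
        r"History of Present Illness:\s*(.*?)(?:\n\s*\n|Past Medical History:)",
        note_text, flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        return None
    hpi = match.group(1).strip()
    # Keep only HPIs between 50 and 2000 characters.
    if not 50 <= len(hpi) <= 2000:
        return None
    # Drop entries that reference the ED/ER, as described above.
    if re.search(r"\b(ED|ER)\b", hpi):
        return None
    return hpi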
In addition to extracting the HPI, we extracted the list of diagnoses for each patient from the clinical notes. These lists were typically divided into primary and secondary diagnoses. For our evaluation, we used only the primary diagnoses and discarded cases with more than 15 primary diagnoses, as most cases had up to 3 diagnoses. This approach ensures that the dataset accurately reflects patient information and vital signs at the time of emergency department triage, offering a comprehensive view of early-stage clinical decision-making.
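The diagnosis extraction could look roughly as follows; the “Discharge Diagnosis” header and the primary/secondary split are assumptions based on common discharge-note formatting, and the 15-diagnosis cutoff mirrors the filter described above.

import re

def extract_primary_diagnoses(note_text: str):
    # Locate the discharge diagnosis section of the note.
    match = re.search(r"Discharge Diagnosis:\s*(.*?)\n\s*\n", note_text,
                      flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        return None
    section = match.group(1)
    # Keep only the part before any "Secondary" subsection.
    primary_part = re.split(r"Secondary", section, flags=re.IGNORECASE)[0]
    diagnoses = [line.strip("-*# \t") for line in primary_part.splitlines()
                 if line.strip() and "primary" not in line.lower()]
    # Discard cases with more than 15 primary diagnoses.
    return diagnoses if len(diagnoses) <= 15 else None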
Prompts
We created a series of prompts to guide the LLM in performing specific clinical tasks. These included predicting the triage level and predicting the specialty and diagnosis together, as the latter two are closely related and complement each other. Additionally, we used prompts for creating the ground truth referral specialty and for using the LLM as a judge to compare predicted diagnoses with the true diagnoses. To decide on these prompts, we experimented with several variations on a subset of data that was not included in our evaluation and refined the prompts for our tasks. To ensure consistent and reliable outputs, we set the temperature parameter to zero during these experiments. We observed that the results were identical across runs, with no variations. Based on this observation, and given the cost constraints of running the LLM multiple times, we decided to run the predictions only once for the final evaluation. Additionally, our goal is to evaluate the LLM’s performance in a scenario that mimics a clinical environment, where a clinician would typically rely on the LLM’s first output rather than running it multiple times. By focusing on the first output, we aimed to test the reliability and practical usability of the LLM in such a setting.
Each prompt begins by setting the system’s role, such as “You are an experienced healthcare professional with expertise in medical and clinical domains”, followed by clear task instructions. We also provided the data necessary for each task and specified how the LLM should format its responses, ensuring concise answers within predefined tags. The different prompts can be seen in Supplementary Figs. 1-6.
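For illustration, a prompt for the triage task could be structured as sketched below; the wording, placeholders, and tag names are assumptions and not the exact prompts shown in Supplementary Figs. 1-6.

# Hypothetical prompt skeleton following the structure described above.
SYSTEM_PROMPT = (
    "You are an experienced healthcare professional with expertise "
    "in medical and clinical domains."
)

TRIAGE_PROMPT_TEMPLATE = """Task: Assign an Emergency Severity Index (ESI) triage level from 1 to 5
for the patient described below.

Patient information:
{patient_info}

Symptom description:
{symptoms}

Respond concisely, placing only the level inside the tags:
<triage_level>...</triage_level>
"""

prompt = TRIAGE_PROMPT_TEMPLATE.format(
    patient_info="58-year-old female, history of hypertension",  # illustrative values
    symptoms="Sudden chest pain radiating to the left arm, shortness of breath",
)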
Model selection
To comply with privacy regulations restricting the use of the MIMIC-IV dataset with external APIs such as OpenAI’s GPT-4o and the Claude family of models, we used AWS PrivateLink to securely connect to the Claude models hosted on AWS. This kind of evaluation reduces the likelihood that the data has been previously seen by the LLMs, which cannot be guaranteed when using publicly available datasets.
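A minimal sketch of such a call is shown below, assuming the Claude models are accessed through Amazon Bedrock over a VPC interface endpoint (AWS PrivateLink); the endpoint URL, region, and model identifier are placeholders, and the request body follows the Bedrock Anthropic Messages format.

import json
import boto3

# Placeholder VPC endpoint URL and region; real values depend on the deployment.
client = boto3.client(
    "bedrock-runtime",
    region_name="us-east-1",
    endpoint_url="https://vpce-0123456789abcdef0-example.bedrock-runtime.us-east-1.vpce.amazonaws.com",
)

body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 512,
    "temperature": 0,  # deterministic outputs, as in our experiments
    "system": "You are an experienced healthcare professional with expertise in medical and clinical domains.",
    "messages": [{"role": "user", "content": "<task instructions and patient data>"}],
}

response = client.invoke_model(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # example Claude 3.5 Sonnet ID
    body=json.dumps(body),
)
completion = json.loads(response["body"].read())["content"][0]["text"]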
Claude 3.5 Sonnet, Claude 3 Sonnet, and Claude 3 Haiku are advanced LLMs developed to enhance natural language understanding, with improvements in performance and efficiency across multiple benchmarks compared with their predecessors as well as other models such as GPT-4o, GPT-4T, Gemini 1.5 Pro, and Llama 3 400B23. They excel in contextual understanding, efficiency, and their ability to handle specialized queries. This makes them well-suited for applications in clinical decision-making, where precision and adaptability are essential.
Claude 3 Haiku is the fastest and most compact model in Anthropic’s Claude 3 family. It excels in tasks that require quick analysis and fast response times24, making it suitable for the clinical decision-making process.
Claude 3 Sonnet offers a balanced combination of speed and intelligence, with significant improvements in reasoning and accuracy. This model is versatile, handling complex text generation, analysis, and reasoning24.
Claude 3.5 Sonnet builds on the foundations of Claude 3 Sonnet with further enhancements in speed and intelligence. It excels in diverse tasks such as reasoning and question answering, while being faster and more cost-efficient than the previous models. It has shown competitive or superior performance in a variety of language-based tasks23.
RAG-assisted LLM
A RAG-assisted LLM approach involves two components: a retrieval mechanism that gathers the relevant information corresponding to the query from a specific external knowledge base, and a language model that integrates the retrieved information with the query to produce a response that is both grounded in the external knowledge base and tailored to the specifics of the given query. This method has shown improvements in both accuracy and reliability, significantly reducing false or misleading information, referred to as hallucination, and producing more factual, context-aware outputs28,29,30,31. In this study, the framework is implemented using Claude 3.5 Sonnet as the LLM component and incorporates a multi-step process in which the LLM plays a key role in refining and enhancing query processing and answer generation. The workflow is represented in Fig. 8.

The workflow starts with query decomposition, breaking down patient queries into smaller chunks. These chunks are embedded and undergo a semantic similarity comparison with the embeddings of 30 million PubMed abstracts to extract the most relevant information. The retrieved information is then combined with the query, and the LLM generates responses supported by the source references. An iterative critique-refinement loop further enhances the outputs by identifying gaps, refining responses, and ensuring alignment with the query.
The workflow starts with query decomposition, breaking the patient’s query down into smaller sub-queries. This allows the RAG system to identify the key components of the input and retrieve the most relevant information for each of them. The idea is meant to mimic the natural way humans approach understanding, breaking complex information down into smaller parts so that each element can be considered individually.
The knowledge base supporting this workflow consists of 30 million PubMed abstracts, which have been converted into embedding vectors and stored in a high-dimensional vector database. The system measures semantic similarity by comparing the embeddings of the sub-queries to those in the vector database. By identifying the closest matches, it retrieves the most relevant information for the given query.
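Conceptually, this retrieval step reduces to a nearest-neighbour search over the abstract embeddings, as in the sketch below; in practice the 30 million vectors are served from a vector database rather than an in-memory array, and the embedding model is not specified here.

import numpy as np

def retrieve_top_k(query_vec: np.ndarray, abstract_vecs: np.ndarray, k: int = 5):
    # Normalise so that the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    a = abstract_vecs / np.linalg.norm(abstract_vecs, axis=1, keepdims=True)
    scores = a @ q
    # Indices of the k abstracts most similar to the (sub-)query.
    top = np.argsort(scores)[::-1][:k]
    return top, scores[top]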
The LLM uses the retrieved information alongside the query to generate a response that is supported by the retrieved data, while also providing the source PubMed references for further review. An additional layer of the workflow aims to enhance performance through iterative loops of critique, refinement, and retrieval. In these loops, the LLM evaluates the generated responses, identifies gaps or inaccuracies, and refines the output as needed. We used an LLM to evaluate the output and determine whether it was sufficient for the given query. This iterative process aims to achieve closer alignment with the query and to produce more precise and reliable outputs.
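The critique-refinement loop can be summarised as in the sketch below; the retrieve, generate, and critique callables stand in for the retrieval step and the two LLM calls, and the structure of the critique output (a "sufficient" flag plus missing topics) is an assumption made for illustration.

from typing import Callable, Dict, List

def answer_with_refinement(
    query: str,
    sub_queries: List[str],
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    critique: Callable[[str, str, List[str]], Dict],
    max_iters: int = 3,
) -> str:
    # Gather evidence for each sub-query produced by the decomposition step.
    context: List[str] = []
    for q in sub_queries:
        context.extend(retrieve(q))
    answer = generate(query, context)            # initial grounded answer
    for _ in range(max_iters):
        verdict = critique(query, answer, context)
        if verdict.get("sufficient"):            # critic judges the answer adequate
            break
        for gap in verdict.get("missing_topics", []):
            context.extend(retrieve(gap))        # retrieve evidence for identified gaps
        answer = generate(query, context)        # refine with the expanded context
    return answer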
Triage level evaluation framework
The triage level is based on the Emergency Severity Index (ESI)19, which consists of five levels, as outlined in Supplementary Table 1. We evaluate the model’s triage level predictions using two different assessment frameworks. The first is a straightforward comparison between the predicted triage level and the ground truth, with accuracy as the metric. The second evaluation framework uses a triage range approach, accounting for the variability in clinical judgment when assigning triage levels. The ESI is typically determined by a clinician assigning a score based on their assessment of a patient’s condition. Although the ESI system defines levels ranging from 1 to 5, the assignment of these levels can vary with the clinician’s intuition and experience. In some cases, clinicians may err on the side of caution, assigning a more severe level to avoid the risk of patient deterioration or the possibility of misclassifying a patient as less critical than they actually are. To account for this variability, our evaluation allows some flexibility in the model predictions. If the real triage level is 1, the model must predict 1, as immediate life-saving intervention is required. For a real value of 2, the model can predict either 1 or 2, since classifying such patients as more acute does not put them at risk. Similarly, if the real value is 3, the model can predict 2 or 3, and so on, up to a real value of 5, where the model can predict either 4 or 5.
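The two scoring rules can be written compactly as below; this is a sketch of the accuracy computation described above, assuming triage levels are integers from 1 to 5.

def triage_range_correct(true_level: int, predicted_level: int) -> bool:
    # Level 1 must be predicted exactly; otherwise a prediction one level
    # more acute than the ground truth also counts as correct.
    if true_level == 1:
        return predicted_level == 1
    return predicted_level in (true_level - 1, true_level)

def triage_accuracies(truths, predictions):
    n = len(truths)
    exact = sum(t == p for t, p in zip(truths, predictions)) / n
    ranged = sum(triage_range_correct(t, p) for t, p in zip(truths, predictions)) / n
    return exact, ranged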
Specialty evaluation framework
To assess the performance of LLMs in recommending appropriate medical specialties, we developed two distinct evaluation scenarios: one tailored for general users and another for clinical users. In each scenario, the models generated the top three specialty recommendations based on the available patient information. For the general user case, this input consisted of a description of the symptoms and basic patient information, while for the clinical user case, the input was augmented with the patient’s initial vital signs. For the general user setting, we implemented this evaluation with the German healthcare system in mind, where patients can choose any specialist without prior consultation with primary care or emergency care specialists. For the clinical user, we designed the evaluation to assist primary care doctors in referring patients to a specialist or seeking consultation as needed.
Since MIMIC-IV-ED and MIMIC-IV-Note do not include information on whether a consultation is necessary, and we could not compensate for this missing detail, we focused our evaluation on the question “which specialist would be most helpful given the symptoms at hand”. As the datasets lack exact information on the medical specialist each patient visited, we used Claude 3.5 Sonnet to predict the most likely specialist for each diagnosis in each case (patients often present with multiple diagnoses rather than just one), thereby establishing the ground truth for this study.
Predicting a single specialist would be insufficient and unfair to the model when comparing its performance to a ground truth consisting of several specialties. In fact, it is not uncommon for a patient to suffer from several medical conditions simultaneously, each requiring attention. To address this complexity, we chose to predict the top three specialists for each case. An example is provided in Table 6.
This approach provides a more realistic comparison and offers clinicians and patients multiple possibilities to consider, reducing the risk of bias toward a single diagnosis. Ultimately, the LLM serves as a support tool, providing valuable insights, while the clinician makes the final, informed decision based on both the LLM’s recommendations and their own expertise.
Diagnosis evaluation framework
As mentioned previously in the specialty evaluation, patients often present with more than one diagnosis. To reflect this, we predicted a list of the top three diagnoses for each case and compared each of these predictions to the actual diagnoses. Notably, we examined the time from admission to release and confirmed that all cases used in our evaluation had a stay duration of less than one day. This minimizes the possibility of including diagnoses that arose later during hospitalization.
To make the comparison more accurate, we used an LLM judge to decide whether a predicted diagnosis matched the ground truth or fit into a broader category of one of the actual diagnoses. In this way, we accounted for differences in wording while still ensuring a fair evaluation. Additionally, on a subset of the dataset, we involved four clinicians who compared the predicted diagnoses to the ground truth diagnoses. More details about this process can be found in the subsection “Reader study”.
We employed two evaluation methods for assessing the model’s performance in predicting the correct specialty. The first method evaluated whether each predicted specialty appeared in the ground truth list. For each patient, we counted how many specialties were correctly predicted and then divided that number by the length of the shorter list, either the ground truth or the prediction list.
For example, if the ground truth for a patient included only one entry, a cardiologist, and the model predicted three specialists (a cardiologist, a general medicine physician, and an electrophysiologist), only the cardiologist would be considered correct. Although general medicine and electrophysiology could also be relevant in some cases, our evaluation was specifically set to match the ground truth. This ties into a point discussed in the paper, where we explore how a single diagnosis might be managed by multiple specialists, a factor we plan to address in future work.
In this example, since only the cardiologist was correctly predicted, the patient would receive one point, which is then divided by the length of the shorter list (in this case, one, as the ground truth had only one entry). So, the score for this patient would be 1. If the ground truth had included two specialties, and the model only correctly predicted one out of three, the score would be 0.5. The total points across all patients were then summed and divided by the total number of patients to calculate the overall accuracy.
The second evaluation framework was simpler, focusing on whether at least one of the predicted specialties appeared in the ground truth list. If any one of the model’s predicted specialties matched one of the ground truth specialties, the prediction for that patient was considered successful.
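Both specialty metrics can be expressed as short functions; the sketch below assumes specialty names have already been normalised to a common vocabulary, and it reuses the cardiologist example from the text.

def specialty_overlap_score(truth: list, predicted: list) -> float:
    # Correctly predicted specialties divided by the length of the shorter list.
    hits = sum(p in truth for p in predicted)
    return hits / min(len(truth), len(predicted))

def specialty_any_match(truth: list, predicted: list) -> bool:
    # Successful if at least one predicted specialty appears in the ground truth.
    return any(p in truth for p in predicted)

truth = ["Cardiology"]
predicted = ["Cardiology", "General Medicine", "Electrophysiology"]
print(specialty_overlap_score(truth, predicted))  # 1 hit / min(1, 3) = 1.0
print(specialty_any_match(truth, predicted))      # True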
LLM judge
For our study, we used LLMs to evaluate and compare the accuracy of the predicted diagnoses for a given set of patient cases. This evaluation aimed to assess the model’s diagnostic capabilities by comparing the predicted diagnoses with those listed in the patient’s medical records. The prompt for this evaluation can be found in the “Prompts” subsection of the Methods.
The model was given the true list of diagnoses for each patient, along with the three predicted diagnoses. It was then asked to determine whether each predicted diagnosis matched any of the primary diagnoses, focusing on semantic equivalence and meaning, or whether it fell under a broader category related to a real diagnosis. Since LLMs may use different phrasing for the same concept, which string-matching algorithms could miss, this semantic comparison ensured that only diagnoses with the same or related meanings were marked as matches. If a predicted diagnosis matched or was related to a real one, the model returned “True”; otherwise, it returned “False”.
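A sketch of this judging step is given below; the prompt wording is illustrative rather than the exact prompt in the supplementary material, and the llm argument stands in for the model call.

import re
from typing import Callable, List

JUDGE_TEMPLATE = """You are an experienced healthcare professional.
True diagnoses: {true_diagnoses}
Predicted diagnosis: {predicted}

Does the predicted diagnosis match any of the true diagnoses in meaning,
or fall under a broader category related to one of them?
Answer with exactly one word inside the tags: <answer>True</answer> or <answer>False</answer>.
"""

def judge_diagnosis(predicted: str, true_diagnoses: List[str],
                    llm: Callable[[str], str]) -> bool:
    prompt = JUDGE_TEMPLATE.format(
        true_diagnoses="; ".join(true_diagnoses), predicted=predicted)
    reply = llm(prompt)
    # Parse the True/False verdict from the tagged answer.
    match = re.search(r"<answer>\s*(True|False)\s*</answer>", reply, re.IGNORECASE)
    return bool(match) and match.group(1).lower() == "true"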
Similar methodologies have been explored successfully in recent research, showing that LLMs can effectively perform human-like evaluations in various tasks, including text summarization, quality assessments, and chat assistant evaluations, with results aligning closely to human judgments43,44,45. These findings support the use of LLMs as reliable tools for tasks like our diagnostic comparison evaluation.
Moreover, insights from the paper “Self-Recognition in Language Models”46 further suggest that LLMs do not specifically prefer their own responses over those of other models. When asked to choose among several answers, the study showed that weaker models tend to select the one they consider best rather than their own, demonstrating that LLMs prioritize answer quality over origin. As a result, high-quality models are more likely to recognize their own outputs as good, not out of bias, but because of their focus on quality. This reinforces the idea that LLMs can perform evaluations without self-preference. Importantly, we did not use the LLM to compare outputs across models, which could risk introducing bias. Instead, the LLM evaluator compared each of the top three predicted diagnoses directly to the ground truth, determining whether they aligned in meaning or category. By focusing only on direct comparisons between predictions and the ground truth, we aimed to minimize self-bias and ensure an objective evaluation process.
While promising, the reliability and interpretability of LLMs as evaluation tools in real-world clinical environments still need further validation and refinement to ensure their safe and effective use. To address this, a subset of the dataset was validated by four clinicians, which is described in the subsection “Reader Study” in the “Methods” section. The results of this validation are detailed in the subsection “Inter-Rater Agreement on Diagnosis Evaluation” in the “Results” section.
Reader study
In this study, we asked four clinicians from different institutions to review the performance of an LLM in generating and predicting clinical specialties and diagnoses. The clinicians, each with several years of experience, come from diverse medical backgrounds, ensuring a broad perspective in the evaluation process. One clinician is affiliated with Policlinico Gemelli in Rome, Italy, another with the Radiology Department at Klinikum rechts der Isar in Munich, Germany, and two are based at the University of Chicago in the United States.
The review aimed to assess how well the LLM performed the following two tasks: first, creating a ground truth specialty based on the given diagnosis, and second, predicting diagnoses for each patient. We selected a subset of 400 of the 2000 cases in the dataset. Each clinician was assigned 200 cases, with Clinicians 1 and 2 reviewing the same subset and Clinicians 3 and 4 reviewing a different subset. This setup allowed for independent evaluations of the same cases by each pair, improving objectivity as much as possible.
For the first task, the clinicians evaluated the LLM-generated ground truth specialty for each diagnosis. They assessed the accuracy of these predictions by categorizing them into four levels: Correct, where the prediction matched the specialty a clinician would select for the diagnosis; Partially Correct, where the prediction was relevant but not ideal, such as suggesting a generalist or related specialty; Reasonable but Suboptimal, where the prediction was valid but less optimal, demonstrating a plausible but less precise understanding of the diagnosis; and Incorrect, where the prediction had no logical connection to the diagnosis.
For the second task, we used a subset of the outputs from the clinical user setting of Claude 3.5 Sonnet and the RAG-assisted LLM. For each model, the clinicians compared the LLM-predicted diagnoses with the ground truth diagnoses and categorized them as follows: Exact Match, where the prediction matched the ground truth diagnosis exactly; Clinically Equivalent, where the prediction conveyed the same clinical condition as the ground truth but used slightly different terminology or scope; Clinically Related, where the prediction referred to a related condition relevant to clinical reasoning but diverged from the ground truth; and Incorrect, where the prediction was clinically unrelated to the ground truth.
The goal of this evaluation is to demonstrate that the LLM performs well in predicting both the specialty and the diagnosis, with a high level of acceptance among clinicians. In addition to predicting diagnoses, the LLM was also used to compare and evaluate these predicted diagnoses against the ground truth.
This dual role highlights the LLM’s ability not only to generate outputs but also to assess its own performance. These findings show the potential of LLMs to assist in clinical decision-making and evaluation processes. By providing a cost-effective and time-efficient solution, LLMs could serve as a valuable tool to support clinicians and offer a reliable second opinion in medical practice.
Intra-model agreement
We evaluated the agreement between models by comparing the predictions of the eight model variants, consisting of the RAG-assisted model and the three Claude language models, each with a general user and a clinical user setting. Agreement was calculated separately for triage level predictions and specialty predictions and is symmetrical. The results for both prediction tasks are therefore shown in Supplementary Table 2, where the upper triangular matrix shows the intra-model agreement for triage and the lower triangular matrix that for specialty, excluding self-comparisons (i.e., perfect agreement of a model with itself).
We highlighted the two highest agreement values between model pairs for each prediction task (specialty and triage) and for each of the three user-setting subgroups (general user to general user, general user to clinical user, and clinical user to clinical user).
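The agreement values can be computed pairwise as sketched below; how two per-case predictions are compared (the same callable) is an assumption here, for example exact match for triage levels or list overlap for the top-three specialties.

from itertools import combinations
from typing import Callable, Dict, List, Tuple

def pairwise_agreement(
    predictions: Dict[str, List],
    same: Callable = lambda a, b: a == b,
) -> Dict[Tuple[str, str], float]:
    # Fraction of cases on which each pair of model variants agrees.
    agreement = {}
    for a, b in combinations(predictions, 2):
        pa, pb = predictions[a], predictions[b]
        agreement[(a, b)] = sum(same(x, y) for x, y in zip(pa, pb)) / len(pa)
    return agreement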
