Abstract
Objective
This study aimed to develop and validate MEDIVAL (Medical Documentation Validation), a progressive chain-of-thought (CoT) evaluation framework for automated assessment of large language model (LLM)-generated emergency department documentation, designed to align with expert clinical judgment in acute care settings.
Methods
We designed a three-tier evaluation framework incorporating persona-based, error-enhanced, and insight-integrated strategies. The framework was tested across four LLMs (GPT-4o, GPT-4.1, Claude-3.5, Claude-3.7) on 33 emergency department records reviewed by four expert emergency physicians. Each model applied the three CoT strategies across five criteria: appropriateness, accuracy, structure/format, conciseness, and clinical validity. Model outputs were compared with expert ratings using Spearman correlation coefficients. Differences were analyzed with the Friedman test and Wilcoxon signed rank test with Bonferroni correction. Reproducibility was assessed through intraclass correlation coefficient (ICC) analysis.
Results
All models demonstrated stronger alignment with expert ratings as CoT complexity increased, with Claude-3.7 (r=0.712, P<0.001) and GPT-4o (r=0.702, P<0.001) showing the highest correlations under the insight-integrated strategy. GPT-4.1 showed the greatest relative improvement (43.3% increase, r=0.457 to r=0.655, P<0.001). Significant overall differences were observed across strategies (χ2(2)=48.39, P<0.001), although the difference between the error-enhanced and insight-integrated approaches, while statistically significant (P=0.002), was modest in magnitude. High reproducibility was confirmed (ICC >0.919), with Claude-3.5 achieving the most consistent results (ICC, 0.997–0.998).
Conclusion
MEDIVAL demonstrates that progressive CoT strategies systematically improve automated evaluation of emergency department documentation while maintaining excellent reproducibility. This framework offers a viable prescreening tool to reduce expert workload and support reliable artificial intelligence integration into emergency medicine workflows.
Keywords: Artificial intelligence; Medical documentation; Emergency department; Large language models; Clinical evaluation
Capsule Summary
What is already known
Evaluating large language model (LLM)-generated emergency department documentation requires substantial expert time and resources, creating a major bottleneck for clinical adoption of automated systems in acute care.
What is new in the current study
A progressive chain-of-thought evaluation framework provides reliable alignment with expert assessment while maintaining high reproducibility across diverse LLM architectures, offering a practical prescreening solution for emergency medicine documentation workflows.
INTRODUCTION
Medical documentation is a cornerstone of healthcare delivery [1], enabling continuity of care and facilitating multidisciplinary communication [2–4]. However, it imposes a significant burden on physicians and nurses, often requiring additional hours to complete [5]. This documentation burden is now recognized as a major contributor to burnout [6–11], raising significant concerns about care quality and patient safety [12]. Accordingly, efforts to mitigate this burden, including optimization of electronic health records (EHRs) and workflow streamlining, have intensified [13–15]. Within this context, large language models (LLMs) have emerged as potential game-changers in healthcare through their ability to generate medical documentation [16–19]. While they have shown promise in generation, summarization, and standardization tasks, concerns remain regarding their accuracy and reliability [20].
However, evaluating LLM-generated output presents particular challenges. It requires extensive expert involvement and time-consuming processes, which can delay implementation in clinical settings [21,22]. Recent advances, however, suggest a shift in this paradigm. Increasing evidence shows that LLMs can serve not only as content generators but also as evaluators, effectively functioning as “LLM-as-a-judge” systems that may help reduce reliance on expert review [23,24]. Despite these developments, the effectiveness of different model architectures and prompting strategies for medical document evaluation remains underexplored.
Our previous study examined the feasibility of LLM-based medical documentation generation and proposed an evaluation framework combining qualitative and quantitative elements [21]. Through this dual approach, we identified recurring themes in expert feedback, particularly their emphasis on clinical accuracy, logical flow, and adherence to documentation standards. Experts consistently prioritized these elements when reviewing medical records, offering a basis for structuring LLM-based evaluation methods.
Building upon these insights, we developed an LLM evaluation framework with three progressive strategies—persona-based, error-enhanced, and insight-integrated—each reflecting distinct expert evaluation patterns. The present study applies this framework across several state-of-the-art LLMs (GPT-4o, GPT-4.1, Claude-3.5, and Claude-3.7) to examine their effectiveness in evaluating medical documentation. By benchmarking model performance against expert assessments, we evaluate their potential to reduce workload without compromising quality, addressing a critical gap in efficient evaluation approaches for clinical documentation.
This comparative analysis served two main purposes. First, it sought to validate the effectiveness of progressive chain-of-thought (CoT) prompting across multiple LLM architectures. Second, it examined the feasibility of LLM-based systems as prescreening tools for clinical documentation. We hypothesize that although baseline performance varies across models, applying structured reasoning via progressive CoT prompting consistently improves evaluation quality. A preliminary version of this work was presented as a short paper at MEDINFO 2025 [25]. This article extends that work through additional cross-model validation, comparative analyses, and expanded discussion.
METHODS
Ethics statement
This study used synthetic emergency department records and did not involve human subjects. The original data collection was approved by the Institutional Review Board of Samsung Medical Center (No. ****).
Dataset and expert evaluation score baseline
We used the full set of 33 LLM-generated emergency department (ED) records and their corresponding expert evaluation scores from our previous study, which originated from a medical prompt challenge [21]. To generate these ED records, participants were given detailed virtual patient scenarios developed and validated by an ED professor and a registered nurse. Based on these scenarios, participants created prompts that were processed by the HyperCLOVA X (HCX-002; Naver Cloud) [26] LLM to produce medical documentation.
Each ED record was required to follow a standardized format, as outlined in Table 1. To construct effective prompts, participants were provided with patient triage records, initial consultation transcripts, and physical examination data, all in Korean. The goal was to ensure that generated records adhered closely to the expected structure and contained comprehensive clinical details.
In the prior study, four medical experts, each with more than 10 years of clinical experience, evaluated every document using five criteria shown in Table 2 [21]. Scores were assigned on a 5-point Likert scale (1 [poor] to 5 [excellent]). Reliability of this evaluation process was thoroughly validated. Test-retest analysis showed high correlations (r=0.776, P<0.001), and inter-rater reliability demonstrated strong consistency (intraclass correlation coefficient [ICC], 0.653–0.887; P<0.001). Clinical validity achieved the highest agreement (ICC, 0.887), underscoring the consistency of expert judgment on this critical dimension. These five criteria now serve as the foundation for our current comparative analysis of different LLM models and CoT strategies in medical document evaluation.
MEDIVAL: three-module CoT framework
We propose a CoT evaluation framework called MEDIVAL (Medical Documentation Validation) that systematically evaluates medical records through three distinct approaches (Fig. 1). CoT prompting is a technique that guides LLMs to decompose complex tasks into intermediate steps, simulating human reasoning processes [27]. In this framework, “modules” refer to prompting components with specific evaluation functions, whereas “strategies” represent progressive combinations of these modules into increasingly comprehensive evaluation approaches. Each strategy builds upon the prior one by incorporating additional modules, thereby producing a progressively more sophisticated evaluation process.
Framework design and implementation
Our evaluation framework implemented three progressive evaluation strategies using multiple state-of-the-art LLMs: GPT-4o (OpenAI) [28], GPT-4.1 (OpenAI) [29], Claude-3.5 (Anthropic) [30], and Claude-3.7 (Anthropic) [31]. Each strategy added prompt components while preserving the core evaluation structure, as summarized in Table 3. Initial prompt engineering was carried out with GPT-4o as the development model. We employed iterative refinements in which preliminary CoT prompts were compared against expert scores, and observed discrepancies were used to improve alignment. Once coherence with expert evaluation patterns was achieved, the same three strategies were applied to other models to assess cross-model robustness.
During cross-model adaptation, we maintained an identical logical structure and evaluation approach across all tested models, adjusting only for model-specific formatting requirements. This approach ensured that performance differences between models reflected their inherent capabilities rather than prompt design variations. The detailed implementation of each module was based on the following prompt components:
(1) Module 1 (expert persona) established a medical professor persona with over 20 years of clinical practice and research experience, recognized for precision in evaluating medical documentation. This module emphasized expertise in emergency medicine and diagnostic accuracy, instructing the persona to adopt a rigorous evaluative stance and to carefully review the document and patient information using the five-criteria framework.
(2) Module 2 (error analysis) introduced systematic quantitative error identification before qualitative review. The evaluation covered six error categories: invalid generation error (unwarranted content), non-generation error (missing necessary information), information error (inaccurate details such as values or terminology), prompt echoing error (direct repetition of prompt text), content misplacement error (information in the wrong section), and typo error (grammatical mistakes). Structural malformation errors were prioritized, as our prior research showed they most strongly affect clinical evaluations.
(3) Module 3 (clinical insight integration) incorporated expert-derived insights from our previous findings. It emphasized adherence to structured clinical documentation and prioritized errors in clinical categories. In particular, errors in differential diagnosis, diagnostic planning, and treatment planning were weighted more heavily than errors in other categories, reflecting their strongest correlation with clinical validity. This module aligned evaluations more closely with expert assessment patterns by emphasizing clinically critical content.
Three progressive CoT strategies
Based on the prompt modules described above, we implemented three progressively sophisticated evaluation strategies, as shown in Table 4. All strategies concluded with a standardized instruction requesting JSON-formatted output. Each model produced scores across the five criteria (appropriateness, accuracy, structure/format, conciseness, and clinical validity) using a 5-point Likert scale, accompanied by detailed comments. The prompt consistently ended with “Let’s think step by step” to encourage systematic reasoning.
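To make the modular composition concrete, the assembly of the three strategies might be sketched as follows. This is a hypothetical illustration only: the module wording, the `MODULES` and `STRATEGIES` names, and the `build_prompt` helper are assumptions, not the study's actual prompt text.

```python
# Hypothetical sketch of progressive prompt assembly. Each strategy appends
# one module to the previous strategy; all prompts end with the CoT trigger.
CRITERIA = ["appropriateness", "accuracy", "structure/format",
            "conciseness", "clinical validity"]

MODULES = {
    "persona": ("You are a medical professor with over 20 years of clinical "
                "experience, known for rigorous review of emergency "
                "department documentation."),
    "error_analysis": ("Before scoring, list any errors by category: invalid "
                       "generation, non-generation, information, prompt "
                       "echoing, content misplacement, and typo errors."),
    "insight": ("Weight errors in differential diagnosis, diagnostic "
                "planning, and treatment planning more heavily than errors "
                "in other categories."),
}

STRATEGIES = {
    "persona-based": ["persona"],
    "error-enhanced": ["persona", "error_analysis"],
    "insight-integrated": ["persona", "error_analysis", "insight"],
}

def build_prompt(strategy, document):
    """Compose the prompt for one strategy and one ED record."""
    parts = [MODULES[m] for m in STRATEGIES[strategy]]
    parts.append("Document to evaluate:\n" + document)
    parts.append("Return a JSON object with a 1-5 score and a comment for "
                 "each criterion: " + ", ".join(CRITERIA) + ".")
    parts.append("Let's think step by step.")
    return "\n\n".join(parts)
```

Because each strategy is a superset of the previous one, performance differences between strategies can be attributed to the added module rather than to unrelated prompt changes.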
Persona-based strategy
This baseline strategy applied only module 1, focusing on qualitative, expert-like evaluation. The LLM adopted a medical expert persona to assess documents across the five criteria, producing a structured but primarily subjective assessment comparable to traditional expert review.
Error-enhanced strategy
This enhanced strategy combined modules 1 and 2, integrating systematic error identification prior to qualitative review. Our previous findings showed that structural malformation errors had the strongest negative effect on clinical evaluations (r=0.654, P<0.001), making this quantitative pre-analysis especially important for robust assessment.
Insight-integrated strategy
The most advanced strategy incorporated all three modules, adding expert-derived insights from prior research. This approach emphasized differential diagnosis, diagnostic planning, and treatment planning—elements shown to have the strongest correlation with overall clinical validity (r=0.698, P<0.001). By embedding these clinical priorities, the LLM’s evaluation more closely aligned with expert assessment patterns.
Cross-model implementation
We implemented these three CoT strategies across four state-of-the-art LLM architectures:
(1) GPT-4o [28]: OpenAI’s multimodal model, featuring integrated vision-language capabilities and optimized latency.
(2) GPT-4.1 [29]: An advanced text-only variant in the GPT-4 series, optimized for long-context reasoning and instruction-following performance.
(3) Claude-3.5 [30]: Anthropic’s conversational AI developed with Constitutional AI principles focusing on harmlessness and helpfulness.
(4) Claude-3.7 [31]: Anthropic’s advanced model with significantly enhanced reasoning capabilities for complex evaluation tasks.
Other contemporary LLMs were not included in this study due to constraints related to accessibility, computational resources, and cost.
All models used identical parameter settings (temperature = 0, top_p = 1.0), with prompts adapted only for formatting requirements while preserving logical structure. This approach ensured that any observed performance differences reflected model capabilities rather than prompt variation.
Evaluation process implementation
The evaluation process followed a systematic protocol:
(1) Document preparation: All 33 ED records were standardized in format and anonymized to prevent bias.
(2) Prompt construction: For each strategy, we constructed a single general prompt for comparison across models.
(3) Evaluation execution: Each model evaluated all 33 documents under all three strategies, yielding 396 evaluations (33×4×3). Each evaluation was repeated four times to assess reproducibility, resulting in 1,584 assessment runs. The repetition mirrored human evaluation conditions, in which four clinicians independently reviewed each record, enabling direct comparison of inter-rater reliability (ICC3k) between human and model outputs.
(4) Output standardization: All model outputs were processed to extract standardized JSON data containing the five criteria scores and supporting rationales. This standardization enabled direct comparison across models and strategies.
(5) Quality control: We validated JSON outputs to ensure structural consistency and completeness across all evaluations. Invalid or malformed outputs (less than 0.5% of total) were rerun until valid results were obtained.
All evaluations were performed via application programming interface (API) calls: OpenAI API for GPT-4o and GPT-4.1, and Anthropic Claude API for Claude-3.5 and Claude-3.7. Each model-strategy combination generated standardized outputs with five 5-point Likert scores (maximum total, 25) and supporting justifications.
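The output standardization and quality-control steps might be sketched as below. The per-criterion schema assumed here ({"score": int, "comment": str}) is an illustration of one plausible format, not the study's exact output specification.

```python
# Hedged sketch of output standardization: parse one model response and
# validate that every criterion carries a 1-5 integer score (assumed schema).
import json

CRITERIA = ["appropriateness", "accuracy", "structure/format",
            "conciseness", "clinical validity"]

def parse_evaluation(raw):
    """Return a dict of criterion scores plus total, or None if malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    scores = {}
    for criterion in CRITERIA:
        entry = data.get(criterion)
        if not isinstance(entry, dict):
            return None
        score = entry.get("score")
        # Enforce the 5-point Likert range; anything else is rejected.
        if not isinstance(score, int) or not 1 <= score <= 5:
            return None
        scores[criterion] = score
    scores["total"] = sum(scores[c] for c in CRITERIA)  # maximum total: 25
    return scores
```

A caller would rerun the model whenever `parse_evaluation` returns None, mirroring the quality-control step in which malformed outputs (less than 0.5% of total) were rerun until valid.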
Statistical analysis
To evaluate the performance of our framework, we conducted a comprehensive statistical analysis using Python ver. 3.11 (Python Software Foundation), with pandas ver. 2.2.0, numpy ver. 1.26.3, scipy ver. 1.12.0, statsmodels ver. 0.14.1, and pingouin ver. 0.5.5. Visualizations were created with matplotlib ver. 3.8.2 and seaborn ver. 0.13.1. All statistical tests were two-sided, with P<0.05 after correction considered significant.
Correlation analysis
We compared mean scores between expert and LLM evaluations using Spearman rank correlation coefficient, selected for its nonparametric properties and robustness to outliers. This makes it particularly appropriate for ordinal data such as Likert scale scores [32,33]. Separate correlation coefficients were calculated for each of the five criteria (appropriateness, accuracy, structure/format, conciseness, and clinical validity) to determine which aspects of medical documentation improved most under progressive strategies. Analyses were conducted independently for each model and prompt strategy to capture nuanced differences in human-model alignment.
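The correlation step above reduces to comparing two per-document score vectors. A minimal sketch, with made-up score values standing in for the study's data:

```python
# Illustrative sketch: per-document mean expert scores versus per-document
# mean model scores, compared with Spearman's rank correlation.
from scipy.stats import spearmanr

expert_means = [4.2, 3.8, 4.5, 2.9, 3.5, 4.0, 3.2, 4.4]  # mean of 4 experts
model_means  = [4.5, 3.5, 4.8, 3.4, 4.0, 4.3, 3.6, 4.6]  # mean of 4 LLM runs

rho, p_value = spearmanr(expert_means, model_means)
# rho near 1 means the model ranks documents the way the experts do,
# even if its absolute scores sit higher on the scale.
```

Because Spearman operates on ranks, a model that systematically scores higher than experts (as observed in the Results) can still achieve a high correlation so long as it orders documents consistently with expert judgment.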
Comparative analysis
To examine differences between human and model evaluations, we analyzed score discrepancies across strategies. The Friedman test [34,35] was used as a nonparametric alternative to repeated-measures analysis of variance, appropriate when normality assumptions are not met. Because each prompt strategy was applied to the same set of documents, the Friedman test also accounted for the paired data structure.
Pairwise comparisons were then performed with the Wilcoxon signed rank test [36], with the Bonferroni correction [37] to adjust for multiple comparisons. This approach was chosen to robustly control the family-wise error rate while respecting the repeated-measures design. Effect sizes were expressed as rank-biserial correlations, which range from −1 to 1 and provide an interpretable measure of both strength and direction of paired differences. Given the modest sample size and potential deviations from normality, nonparametric methods were consistently applied. Boxplots and bar charts were generated to visualize trends across models and strategies.
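The comparative pipeline can be sketched end to end with simulated data. The per-document gap distributions below are illustrative assumptions, not study data, and the `rank_biserial` helper is a hand-rolled implementation of the matched-pairs effect size described above:

```python
# Sketch of the comparative analysis: per-document absolute score gaps
# (|model total - expert total|) under the three strategies (simulated).
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon, rankdata

rng = np.random.default_rng(0)
n_docs = 33
persona = rng.normal(9.0, 1.5, n_docs)   # largest gaps (baseline strategy)
error   = rng.normal(6.5, 1.2, n_docs)
insight = rng.normal(5.5, 1.2, n_docs)   # smallest gaps

# Omnibus Friedman test across the three paired strategies (same documents).
chi2, p_friedman = friedmanchisquare(persona, error, insight)

# Pairwise Wilcoxon signed-rank tests with Bonferroni correction.
pairs = [(persona, error), (persona, insight), (error, insight)]
p_adj = [min(wilcoxon(a, b).pvalue * len(pairs), 1.0) for a, b in pairs]

def rank_biserial(a, b):
    """Matched-pairs rank-biserial correlation: (T+ - T-) / (T+ + T-)."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    d = d[d != 0]                    # drop zero differences, as Wilcoxon does
    ranks = rankdata(np.abs(d))
    t_plus = ranks[d > 0].sum()
    t_minus = ranks[d < 0].sum()
    return (t_plus - t_minus) / (t_plus + t_minus)
```

A rank-biserial value near 1 indicates that nearly every document showed a smaller gap under the second strategy, which is the directional interpretation used when comparing strategies.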
Reproducibility analysis
Reproducibility was evaluated using the ICC, mean absolute error (MAE), and standard deviation (SD) across four separate evaluation rounds for each document–model combination. We employed ICC(3,k)—a two-way mixed effects model (consistency, average measures)—because it is most suitable when the same evaluator (LLM) performs repeated assessments on fixed items [38]. This analysis was performed separately for each LLM to compare evaluation stability across models.
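For illustration, ICC(3,k) can be computed from first principles via a two-way ANOVA decomposition of an n-documents × k-rounds score matrix. This hand-rolled version is a sketch of the quantity that libraries such as pingouin report as ICC3k, not the study's actual analysis code:

```python
# ICC(3,k): two-way mixed effects, consistency, average measures
# (Shrout & Fleiss). Input: (n_subjects, k_raters) score matrix.
import numpy as np

def icc3k(scores):
    """Return ICC(3,k) for repeated evaluations of fixed items."""
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)                         # per-document means
    col_means = x.mean(axis=0)                         # per-round means
    ss_total = ((x - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()     # between-document
    ss_cols = n * ((col_means - grand) ** 2).sum()     # between-round
    ss_err = ss_total - ss_rows - ss_cols              # residual
    ms_rows = ss_rows / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / ms_rows
```

Because the consistency form removes systematic round-to-round offsets (the column effect), a model that drifts by a constant amount between rounds while preserving document ordering still attains an ICC of 1.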
RESULTS
Correlation analysis
Our analysis showed consistent improvements in correlation with expert evaluations across LLMs as strategies progressed from persona-based to insight-integrated approaches (Fig. 2). Claude-3.7 achieved the strongest overall correlation using the insight-integrated strategy (r=0.712, P<0.001), followed closely by GPT-4o (r=0.702, P<0.001). GPT-4.1 exhibited the largest relative improvement, with a 43.3% increase in correlation from the persona-based (r=0.457, P=0.007) to the insight-integrated approach (r=0.655, P<0.001).
This consistent improvement across architecturally different models suggests that our progressive CoT strategies provide generalizable benefits regardless of the underlying model architecture. However, models with stronger baselines (Claude-3.7 and Claude-3.5) showed smaller relative gains, whereas models with lower starting correlations (GPT-4.1) demonstrated greater improvements.
Comparative analysis
Our analysis of absolute score differences between LLMs and human experts revealed consistent alignment trends across model-strategy combinations (Fig. 3). All LLMs tended to assign higher scores than human experts, with mean differences ranging from 5.37 to 9.00 points on a 25-point scale. Importantly, the insight-integrated strategy consistently reduced these gaps. GPT-4.1 under the insight-integrated strategy showed the smallest mean difference (5.37 points), achieving the closest alignment with human evaluations despite a moderate correlation coefficient.
Boxplot analysis indicated that the insight-integrated strategy reduced both median score differences and variance in many cases, suggesting more consistent alignment with human expert judgment.
The Friedman test confirmed significant differences in model-human score discrepancies across strategies (χ2(2)=48.39, P<0.001). Pairwise Wilcoxon signed rank tests with Bonferroni correction showed that both error-enhanced and insight-integrated strategies significantly reduced score differences compared to the persona-based strategy (both P<0.001).
Although the error-enhanced and insight-integrated strategies also differed significantly (P=0.002), the effect size was modest, indicating only incremental benefit of insight integration beyond error analysis. Descriptive statistics and pairwise comparison results are presented in Table 5.
Detailed analysis of evaluation criteria
Because GPT-4.1 showed the greatest relative improvement across strategies (Fig. 4), we examined its performance across evaluation criteria to illustrate the impact of progressive CoT prompting.
Appropriateness showed the largest improvement, increasing from r=0.289 (persona-based) to r=0.610 (insight-integrated), a 111% increase. This suggests that structured reasoning markedly strengthens GPT-4.1’s ability to maintain a professional tone and adhere to documentation standards.
Clinical validity also demonstrated substantial improvement from r=0.329 to r=0.603 (83.3% increase), indicating that the insight-integrated strategy significantly enhances the model’s ability to evaluate sound clinical reasoning and diagnostic accuracy in medical documentation.
Structure/format also improved, increasing from r=0.439 to r=0.584 (33% increase), suggesting that GPT-4.1 particularly benefits from guidance in evaluating document organization and logical flow.
Accuracy began with a relatively strong baseline (r=0.552) and improved modestly with insight integration (r=0.608, 10.1% increase). However, the error-enhanced strategy yielded greater improvement (r=0.612, 10.9% increase), implying that systematic error detection is particularly effective for factual correctness in medical documentation.
The clear stepwise improvement across nearly all criteria demonstrates that each added module contributes meaningfully to evaluation quality. GPT-4.1 displayed the most distinct progression among all tested models, suggesting it is especially responsive to structured reasoning prompts.
Reproducibility analysis
To test reliability, we calculated ICC values across four evaluation rounds for each model-strategy combination (Table 6). All combinations showed exceptionally high reproducibility, with ICC values ranging from 0.919 to 0.998. Claude-3.5 demonstrated the highest reproducibility (ICC, 0.997–0.998), while GPT-4o achieved peak reproducibility under the insight-integrated strategy (ICC, 0.996).
MAE varied by model, with Claude-3.5 showing the lowest errors (range, 0.04–0.12) and Claude-3.7 the highest (range, 1.25–1.42). Interestingly, the insight-integrated strategy occasionally produced slightly higher MAE values, suggesting its more complex reasoning could introduce minor variability. Nevertheless, the extremely high ICC values indicate this variability remained minimal.
For comparison, human expert evaluations showed good reliability, with ICC values ranging from 0.653 to 0.887 across criteria, and the highest consistency for clinical validity (ICC, 0.887). While human reliability was strong, all LLM model-strategy combinations in this study demonstrated higher reproducibility, underscoring the potential advantage of LLM-based evaluation frameworks in ensuring consistent assessments.
DISCUSSION
Enhancing LLM evaluation through progressive CoT strategies
Our three-module CoT evaluation framework demonstrated the value of progressively integrating reasoning strategies into LLM-based assessment systems for medical documentation. The consistent improvement patterns observed across different architectures provide strong evidence that structured reasoning can enhance alignment with expert evaluations regardless of baseline capability. The systematic increase in correlation coefficients from persona-based to insight-integrated strategies across all models highlights the effectiveness of this progressive approach. GPT-4.1 was particularly notable, showing the most distinct stepwise improvement (43.3% overall increase). This finding indicates that models with moderate baseline performance may gain the most from structured CoT techniques. In this way, our framework can potentially “level the playing field” across LLMs by enabling even smaller or less specialized models to approach the evaluation quality of larger models when guided by structured strategies.
Performance also varied across evaluation criteria. Accuracy benefited substantially from structured reasoning, suggesting that stepwise analysis enhances factual correctness assessment. Likewise, clinical validity improved significantly under the insight-integrated strategy, indicating that this approach effectively captures the complex reasoning patterns used by human experts in documentation review.
These improvements carry direct clinical implications. Closer alignment with expert judgment strengthens the reliability of LLMs as support tools for documentation review, enabling earlier error detection while safeguarding patient safety. Importantly, gains in clinical validity imply that progressive CoT strategies can help LLMs approximate expert-level reasoning, yielding not only statistical improvements but also clinically meaningful improvements.
Interpretive challenges in human-model evaluation alignment
Our analysis showed that the insight-integrated strategy consistently narrowed the gap between LLM and human evaluations across all models. However, for certain criteria, such as accuracy, evaluation performance peaked with the error-enhanced strategy rather than the insight-integrated strategy. The Wilcoxon signed rank analysis confirmed that the difference between these two strategies, though statistically significant, was modest in magnitude.
This suggests that objectively measurable features, such as information density and structural completeness, are best captured through error-enhanced strategies, while more subjective elements, such as clinical reasoning quality, may be better reflected by insight integration. Future frameworks may need to balance objective and subjective evaluation components depending on their intended use.
Notably, all LLMs scored medical documents higher than human experts, with mean differences ranging from 5.37 to 9.00 points on a 25-point scale. This finding aligns with our earlier study, which showed that human experts tend to focus more critically on categories such as diagnoses and treatment plans, as well as structural errors. These areas often drove harsher scoring from experts. In contrast, LLMs are typically tuned to align with general human preferences [39], potentially resulting in outputs that are more superficially fluent but less rigorously scrutinized for critical clinical content. This tendency may explain why LLM-generated outputs appear relatively more favorable under surface-level evaluation but might lack the depth required for thorough clinical documentation assessment.
Future applications and system integration
Our findings indicate that frameworks like MEDIVAL could play an important role in clinical documentation workflows. One immediate application is as a prescreening system that flags records requiring expert review, reducing routine workload. In hybrid workflows, models could also direct clinician attention to high-risk sections, such as differential diagnoses or treatment plans, where errors are most consequential. Additionally, criterion-specific feedback may serve an educational role in clinical training.
LLMs exhibited even higher inter-rater reliability than was observed for human experts (ICC, 0.653–0.887 across evaluation criteria), highlighting the potential advantages of adopting LLMs as evaluators. With the exception of Claude-3.7, all models achieved MAE and SD values below 1 point, suggesting that progressive CoT strategies not only improve relative evaluation but also calibrate absolute scoring more closely to human judgment. Such calibration is essential in clinical settings, where understanding the degree of deviation from acceptable standards is critical for patient safety [40].
Although this study focused on text-based documentation, our framework could extend to emerging modalities such as voice-based electronic medical records and broader ambient AI systems. With the increasing adoption of voice recognition in healthcare [41], future research should explore how structured evaluation frameworks can adapt to challenges in these modalities, including recognition errors, contextual ambiguity, and automated coding accuracy.
Ethical and operational considerations
Despite the demonstrated benefits and promising potential, careful consideration is required when deploying LLM-based evaluators in practice [42,43]. The consistent tendency of LLMs to score documents more favorably than human experts underscores the need for calibration and oversight. While our progressive CoT strategies narrow this scoring gap, they do not eliminate it. Human oversight therefore remains essential, particularly in high-stakes clinical contexts where documentation directly informs patient care.
Transparency regarding model limitations, data provenance, and evaluation methodologies is critical for building user trust [44,45]. Although this study used only synthetic data without patient identifiers, future applications involving real-world documentation must adhere strictly to institutional governance policies and privacy regulations. Moreover, healthcare systems must evaluate fairness and robustness across diverse clinical contexts to prevent propagation of biases in LLM assessments.
Beyond these ethical and governance issues, practical implementation requires addressing computational efficiency, interoperability with existing EHR systems, and clinician acceptance. These operational factors will be decisive in determining the feasibility of integrating LLM-based evaluators into routine workflows and should be prioritized in translational research efforts.
Collectively, our findings support the gradual integration of LLM-based evaluators into clinical workflows, beginning with expert-augmented decision support and progressively advancing toward greater automation as reliability and trust are established.
Limitations
Several limitations must be considered when interpreting our results. First, prompts were optimized specifically for GPT-4o, reflecting relative rather than absolute model capabilities. Tailored optimization for other models could further improve their performance beyond what was observed here.
Second, our analysis was based on the complete dataset of 33 documents, precluding a formal sample size calculation. Instead, each document underwent two independent reviews by four senior clinicians across five criteria, generating 1,320 structured expert judgments. This dense annotation enabled robust reliability analyses (ICC, test-retest) that would not have been feasible with a larger but less thoroughly annotated dataset. Nonetheless, broader validation will require large-scale testing across diverse clinical settings and documentation types.
Finally, medical documentation standards vary substantially across clinical departments and stages of patient care. The present study did not define a universally applicable “gold standard” medical record, nor did it address documentation requirements across different care contexts and timepoints. Future work should therefore develop adaptable methodologies and evaluation tools that accommodate department-specific standards and temporal variation.
Conclusions
This study demonstrates that progressive CoT strategies substantially enhance the ability of LLMs to evaluate medical documentation in alignment with human expert judgment, thereby reducing the burden of manual evaluation. Future research should validate this framework in varied documentation contexts and pursue practical implementation steps, including pilot deployments, calibration protocols, and clinician training, to ensure safe and effective integration of LLM-based evaluation into healthcare workflows.
NOTES
-
Author contributions
Conceptualization: DC, TK; Data curation: JS, WCC; Formal analysis: DC, MK; Investigation: SH, HC; Methodology: DC, TK; Project administration: TK; Software: DC; Supervision: TK; Writing–original draft: DC, JS; Writing–review & editing: all authors. All authors read and approved the final manuscript.
-
Conflicts of interest
Won Chul Cha and Taerim Kim are editorial board members of this journal, but were not involved in the peer reviewer selection, evaluation, or decision process of this article. The authors have no other conflicts of interest to declare.
-
Funding
The authors received no financial support for this study.
-
Acknowledgments
The authors would like to thank all participants who contributed to the original data collection for this study.
-
Data availability
The data analyzed in this study are available from the corresponding author upon reasonable request.
Fig. 1. Overview of the three-module chain-of-thought (CoT) evaluation framework for large language model (LLM)-based assessment. The framework integrates expert persona, error analysis, and clinical insight modules to progressively enhance evaluation quality.
Fig. 2. Correlation coefficients across models and evaluation strategies. All models demonstrated progressive improvement from persona-based to insight-integrated approaches. Claude-3.7 achieved the strongest correlation with expert ratings (r=0.712), followed closely by GPT-4o (r=0.702). All correlations were statistically significant (P<0.001 for most comparisons; GPT-4.1 persona-based P=0.007).
Fig. 3. Score differences between large language model and human expert evaluations on a 25-point scale. Lower values represent closer alignment with expert assessments. Insight-integrated strategies consistently produced the smallest differences, with GPT-4.1 showing the closest overall alignment (mean difference, 5.37 points).
Fig. 4. GPT-4.1 performance across evaluation criteria and strategies. The insight-integrated approach produced the most dramatic improvement in appropriateness (r=0.610) and substantial gains in clinical validity (r=0.603). All reported correlations were statistically significant (P<0.001).
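The correlation coefficients reported in the figures above are Spearman rank correlations between model-assigned and expert-assigned scores. As a minimal illustrative sketch (the toy scores below are hypothetical, not the study's data), the coefficient is the Pearson correlation of the rank-transformed values, with ties receiving average ranks:

```python
import math

def spearman_r(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks.

    Tied values receive average ranks, matching the conventional definition.
    """
    def avg_ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(v):
            j = i
            while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
                j += 1
            for t in range(i, j + 1):
                r[order[t]] = (i + j) / 2 + 1.0  # 1-based average rank
            i = j + 1
        return r

    rx, ry = avg_ranks(xs), avg_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Toy example: model scores that preserve the expert ranking give r = 1.0.
expert = [18, 22, 15, 24, 20]
model = [17.5, 21.0, 14.0, 23.5, 19.0]
print(round(spearman_r(expert, model), 3))  # 1.0
```

Because the statistic depends only on ranks, it tolerates differences in how strictly a model and a human expert use the raw scoring scale, which is why it suits model-versus-expert agreement better than Pearson's r here.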
Table 1. Standardized structure of emergency department records

| Patient information | Clinical assessment | Medical decision-making |
|---|---|---|
| Chief complaint | Physical examination findings | Problem list |
| Vital signs | Review of systems | Differential diagnosis |
| History of present illness, past medical history | Personal and social history | Diagnosis plan, treatment plan |
Table 2. Expert evaluation criteria for LLM-generated medical documentsa)

| Criterion | Description | Focus area |
|---|---|---|
| Appropriateness | Adherence to medical documentation standards | Content relevance, proper medical terminology, professional tone |
| Accuracy | Correctness of medical information | Factual correctness, absence of hallucinations, completeness |
| Structure/format | Logical organization and formatting | Section organization, information flow, hierarchical presentation |
| Conciseness | Information density and brevity | Elimination of redundancy while maintaining completeness |
| Clinical validity | Practical usability in clinical context | Sound clinical reasoning, appropriate differential diagnosis, actionable treatment plans |
Table 3. Overview of the three-module CoT framework (MEDIVAL)

| Module | Key component |
|---|---|
| Expert persona | Medical professor persona with high experience |
| | Strict evaluation tendency |
| | Five criteria assessment framework |
| Error analysis | Six error categories: invalid generation, non-generation, information, prompt echoing, content misplacement error, typo |
| | Quantitative evaluation before qualitative assessment |
| Clinical insight | Integration of expert evaluation patterns |
| | Emphasis on document structure |
| | Priority on differential diagnosis, diagnosis, and treatment plan |
Table 4. Relationship between modules and strategies in the chain-of-thought framework

| Strategy | Incorporated module |
|---|---|
| Persona-based strategy | Module 1 |
| Error-enhanced strategy | Modules 1 and 2 |
| Insight-integrated strategy | Modules 1, 2, and 3 |
Table 5. Pairwise comparison of score differences across evaluation strategies

| Strategy comparison | Difference (mean±SD)a) | Effect size (r)b) | P-valuec) |
|---|---|---|---|
| Error-enhanced vs. insight-integrated | 0.084±0.253 | 0.412 | 0.002 |
| Error-enhanced vs. persona-based | –0.278±0.305 | –1.000 | <0.001 |
| Insight-integrated vs. persona-based | –0.363±0.417 | –0.987 | <0.001 |
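The P-values above come from Wilcoxon signed-rank tests with Bonferroni correction over the three pairwise strategy comparisons (adjusted threshold 0.05/3 ≈ 0.017). A minimal sketch of that procedure using the normal approximation, omitting the tie correction to the variance for brevity; the paired scores are synthetic, not the study's data:

```python
import math

def wilcoxon_p(x, y):
    """Two-sided Wilcoxon signed-rank test, normal approximation.

    Zero differences are dropped and tied |differences| share average
    ranks; assumes at least one nonzero difference.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1.0  # 1-based average rank
        i = j + 1
    w_pos = sum(r for d, r in zip(diffs, ranks) if d > 0)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_pos - mean) / sd
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

# Three pairwise comparisons -> Bonferroni-adjusted significance threshold.
alpha = 0.05 / 3

# Synthetic paired scores: strategy B consistently scores 1 point higher.
a = [14, 16, 15, 18, 17, 19, 16, 15, 18, 20, 17, 16, 19, 18, 15]
b = [v + 1 for v in a]
p = wilcoxon_p(a, b)
print(f"p={p:.4f}, significant at Bonferroni-adjusted alpha: {p < alpha}")
```

For the real analysis an exact-distribution implementation with tie correction (as in standard statistical packages) would be preferable; this sketch only shows the shape of the computation.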
Table 6. Reproducibility analysis across models and strategies (33 documents, 4 repeated evaluations)

| Model and strategy | ICC(3,k) | F-test (df1, df2) | P-value | 95% CI | MAE | SD |
|---|---|---|---|---|---|---|
| GPT-4o (OpenAI) | | | | | | |
| Persona-based | 0.986 | 72.3 (32, 96) | <0.001 | 0.98–0.99 | 0.20 | 0.26 |
| Error-enhanced | 0.993 | 134.4 (32, 96) | <0.001 | 0.99–1.00 | 0.26 | 0.33 |
| Insight-integrated | 0.990 | 95.6 (32, 96) | <0.001 | 0.98–0.99 | 0.36 | 0.46 |
| GPT-4.1 (OpenAI) | | | | | | |
| Persona-based | 0.990 | 102.7 (32, 96) | <0.001 | 0.98–0.99 | 0.22 | 0.29 |
| Error-enhanced | 0.991 | 113.3 (32, 96) | <0.001 | 0.98–1.00 | 0.44 | 0.55 |
| Insight-integrated | 0.990 | 103.3 (32, 96) | <0.001 | 0.98–0.99 | 0.49 | 0.64 |
| Claude-3.5 (Anthropic) | | | | | | |
| Persona-based | 0.997 | 343.6 (32, 96) | <0.001 | 1.00–1.00 | 0.04 | 0.06 |
| Error-enhanced | 0.998 | 512.8 (32, 96) | <0.001 | 1.00–1.00 | 0.09 | 0.11 |
| Insight-integrated | 0.997 | 388.1 (32, 96) | <0.001 | 1.00–1.00 | 0.12 | 0.17 |
| Claude-3.7 (Anthropic) | | | | | | |
| Persona-based | 0.919 | 12.4 (32, 96) | <0.001 | 0.86–0.96 | 1.25 | 1.66 |
| Error-enhanced | 0.949 | 19.6 (32, 96) | <0.001 | 0.91–0.97 | 1.42 | 1.89 |
| Insight-integrated | 0.969 | 32.3 (32, 96) | <0.001 | 0.95–0.98 | 1.32 | 1.76 |

ICC, intraclass correlation coefficient; CI, confidence interval; MAE, mean absolute error; SD, standard deviation.
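The ICC(3,k) values in Table 6 correspond to a two-way mixed-effects, average-measures, consistency model over 33 documents and 4 repeated evaluations. A minimal sketch of that computation from the two-way ANOVA mean squares, on a toy score matrix (illustrative data, not the study's):

```python
import numpy as np

def icc_3k(scores: np.ndarray) -> float:
    """ICC(3,k): two-way mixed effects, consistency, average of k measurements.

    scores: n_subjects x k_raters matrix of ratings.
    ICC(3,k) = (MS_subjects - MS_error) / MS_subjects, where the mean
    squares come from the two-way ANOVA decomposition without interaction.
    """
    n, k = scores.shape
    grand = scores.mean()
    subj_means = scores.mean(axis=1)   # per-document means
    rater_means = scores.mean(axis=0)  # per-evaluation-run means

    ss_total = ((scores - grand) ** 2).sum()
    ss_subj = k * ((subj_means - grand) ** 2).sum()
    ss_rater = n * ((rater_means - grand) ** 2).sum()
    ss_error = ss_total - ss_subj - ss_rater

    ms_subj = ss_subj / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_subj - ms_error) / ms_subj

# Toy example: 5 "documents" scored identically across 4 repeated runs
# gives perfect reproducibility (ICC = 1.0).
perfect = np.tile(np.array([[10.0], [14.0], [17.0], [21.0], [23.0]]), (1, 4))
print(round(icc_3k(perfect), 3))  # 1.0
```

Run-to-run noise in the repeated evaluations lowers MS_error relative to the between-document variance only slightly when scoring is stable, which is why values such as 0.997–0.998 for Claude-3.5 indicate near-deterministic scoring.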
REFERENCES
- 1. Kuhn T, Basch P, Barr M, Yackel T. Clinical documentation in the 21st century: executive summary of a policy position paper from the American College of Physicians. Ann Intern Med 2015;162:301-3.
- 2. Sanderson AL, Burns JP. Clinical documentation for intensivists: the impact of diagnosis documentation. Crit Care Med 2020;48:579-87.
- 3. Lorenzetti DL, Quan H, Lucyk K, et al. Strategies for improving physician documentation in the emergency department: a systematic review. BMC Emerg Med 2018;18:36.
- 4. Shala DR, Jones A, Fairbrother G, Thuy Tran D. Completion of electronic nursing documentation of inpatient admission assessment: insights from Australian metropolitan hospitals. Int J Med Inform 2021;156:104603.
- 5. Ahn M, Choi M, Kim Y. Factors associated with the timeliness of electronic nursing documentation. Healthc Inform Res 2016;22:270-6.
- 6. Gesner E, Gazarian P, Dykes P. The burden and burnout in documenting patient care: an integrative literature review. Stud Health Technol Inform 2019;264:1194-8.
- 7. Moy AJ, Schwartz JM, Chen R, et al. Measurement of clinical documentation burden among physicians and nurses using electronic health records: a scoping review. J Am Med Inform Assoc 2021;28:998-1008.
- 8. Gesner E, Dykes PC, Zhang L, Gazarian P. Documentation burden in nursing and its role in clinician burnout syndrome. Appl Clin Inform 2022;13:983-90.
- 9. Murad MH, Vaa Stelling BE, West CP, et al. Measuring documentation burden in healthcare. J Gen Intern Med 2024;39:2837-48.
- 10. Levy DR, Withall JB, Mishuris RG, et al. Defining documentation burden (DocBurden) and excessive DocBurden for all health professionals: a scoping review. Appl Clin Inform 2024;15:898-913.
- 11. Preiksaitis C, Wright KN, Alvarez A, et al. Measuring burnout and professional fulfillment among emergency medicine residency program leaders in the United States: a cross-sectional survey study. Clin Exp Emerg Med 2025;12:76-85.
- 12. Li LZ, Yang P, Singer SJ, Pfeffer J, Mathur MB, Shanafelt T. Nurse burnout and patient safety, satisfaction, and quality of care: a systematic review and meta-analysis. JAMA Netw Open 2024;7:e2443059.
- 13. Sloss EA, Abdul S, Aboagyewah MA, et al. Toward alleviating clinician documentation burden: a scoping review of burden reduction efforts. Appl Clin Inform 2024;15:446-55.
- 14. Holmgren AJ, Hendrix N, Maisel N, et al. Electronic health record usability, satisfaction, and burnout for family physicians. JAMA Netw Open 2024;7:e2426956.
- 15. Park S, Marquard J, Austin RR, Pieczkiewicz D, Jantraporn R, Delaney CW. A systematic review of nurses' perceptions of electronic health record usability based on the human factor goals of satisfaction, performance, and safety. Comput Inform Nurs 2024;42:168-75.
- 16. Ding H, Simmich J, Vaezipour A, Andrews N, Russell T. Evaluation framework for conversational agents with artificial intelligence in health interventions: a systematic scoping review. J Am Med Inform Assoc 2024;31:746-61.
- 17. Wei Q, Yao Z, Cui Y, Wei B, Jin Z, Xu X. Evaluation of ChatGPT-generated medical responses: a systematic review and meta-analysis. J Biomed Inform 2024;151:104620.
- 18. Satheakeerthy S, Jesudason D, Pietris J, Bacchi S, Chan WO. LLM-assisted medical documentation: efficacy, errors, and ethical considerations in ophthalmology. Eye (Lond) 2025;39:1440-2.
- 19. Lee YC. Rethinking artificial intelligence in medicine: from tools to agents. Clin Exp Emerg Med 2025;12:101-3.
- 20. Goodman KE, Yi PH, Morgan DJ. AI-generated clinical summaries require more than accuracy. JAMA 2024;331:637-8.
- 21. Seo J, Choi D, Kim T, et al. Evaluation framework of large language models in medical documentation: development and usability study. J Med Internet Res 2024;26:e58329.
- 22. Hartman V, Zhang X, Poddar R, et al. Developing and evaluating large language model-generated emergency medicine handoff notes. JAMA Netw Open 2024;7:e2448723.
- 23. Gu J, Jiang X, Shi Z, et al. A survey on LLM-as-a-judge. [posted 2024 Nov 23]. arXiv:2411.15594 [Preprint]. https://doi.org/10.48550/arXiv.2411.15594.
- 24. Li D, Jiang B, Huang L, et al. From generation to judgment: opportunities and challenges of LLM-as-a-judge. [posted 2024 Nov 25]. arXiv:2411.16594 [Preprint]. https://doi.org/10.48550/arXiv.2411.16594.
- 25. Seo J, Choi D, Cha W, Kim T. LLM-based medical document evaluation: integrating human expert insights. Stud Health Technol Inform 2025;329:1029-33.
- 26. Naver Cloud. HyperCLOVA X [Internet]. Naver Cloud; [cited 2025 May 1]. Available from: https://www.ncloud.com/solution/featured/hyperclovax
- 27. Wei J, Wang X, Schuurmans D, et al. Chain-of-thought prompting elicits reasoning in large language models. [posted 2022 Jan 28]. arXiv:2201.11903 [Preprint]. https://doi.org/10.48550/arXiv.2201.11903.
- 28. OpenAI, Hurst A, Lerer A, et al. GPT-4o system card. [posted 2024 Oct 25]. arXiv:2410.21276 [Preprint]. https://doi.org/10.48550/arXiv.2410.21276.
- 29. OpenAI. Introducing GPT-4.1 in the API [Internet]. OpenAI; 2025 [cited 2025 May 1]. Available from: https://openai.com/index/gpt-4-1/
- 30. Anthropic [Internet]. Anthropic; [cited 2025 May 1]. Available from: https://www.anthropic.com
- 31. Anthropic. Claude 3.7 Sonnet system card. Anthropic; 2025.
- 32. Hauke J, Kossowski T. Comparison of values of Pearson's and Spearman's correlation coefficients on the same sets of data. Quaest Geogr 2011;30:87-93.
- 33. Siegel S. Nonparametric statistics. Am Stat 1957;11:13-9.
- 34. Friedman M. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc 1937;32:675-701.
- 35. Friedman M. A comparison of alternative tests of significance for the problem of m rankings. Ann Math Statist 1940;11:86-92.
- 36. Wilcoxon F. Individual comparisons by ranking methods. In: Kotz S, Johnson NL, editors. Breakthroughs in statistics. Springer; 1992. p. 196-202.
- 37. Bland JM, Altman DG. Multiple significance tests: the Bonferroni method. BMJ 1995;310:170.
- 38. Ionan AC, Polley MY, McShane LM, Dobbin KK. Comparison of confidence interval methods for an intra-class correlation coefficient (ICC). BMC Med Res Methodol 2014;14:121.
- 39. Liu W, Wang X, Wu M, et al. Aligning large language models with human preferences through representation engineering. [posted 2023 Dec 26]. arXiv:2312.15997 [Preprint]. https://doi.org/10.48550/arXiv.2312.15997.
- 40. Fragiadakis G, Diou C, Kousiouris G, Nikolaidou M. Evaluating human-AI collaboration: a review and methodological framework. [posted 2024 Jul 9]. arXiv:2407.19098 [Preprint]. https://doi.org/10.48550/arXiv.2407.19098.
- 41. Owens LM, Wilda JJ, Hahn PY, Koehler T, Fletcher JJ. The association between use of ambient voice technology documentation during primary care patient encounters, documentation burden, and provider burnout. Fam Pract 2024;41:86-91.
- 42. Li F, Ruijs N, Lu Y. Ethics & AI: a systematic review on ethical concerns and related strategies for designing with AI in healthcare. AI 2023;4:28-53.
- 43. Rajpurkar P, Chen E, Banerjee O, Topol EJ. AI in health and medicine. Nat Med 2022;28:31-8.
- 44. Uygun Ilikhan S, Özer M, Tanberkan H, Bozkurt V. How to mitigate the risks of deployment of artificial intelligence in medicine? Turk J Med Sci 2024;54:483-92.
- 45. Okada Y, Ning Y, Ong ME. Explainable artificial intelligence in emergency medicine: an overview. Clin Exp Emerg Med 2023;10:354-62.