Abstract
Aim: Patients undergoing orthognathic surgery frequently seek online resources to better understand the procedure, its risks, and its outcomes. As generative artificial intelligence (AI) models are increasingly integrated into healthcare communication, it is essential to evaluate their ability to deliver accurate, comprehensive, and readable patient information.
Methods: This study comparatively assessed two large language models (LLMs), ChatGPT-4.5 and DeepSeek-V3-R1, in answering frequently asked orthognathic surgery patient questions, analyzing accuracy, completeness, readability, and quality across English (EN) and Turkish (TR). Twenty-five patient-centered questions, categorized into five clinical domains, yielded 200 AI-generated responses that were independently evaluated by two oral and maxillofacial surgeons (OMFSs) using a multidimensional framework. Statistical analyses included non-parametric tests and inter-rater reliability assessments (intraclass correlation coefficient [ICC] and Cohen's kappa).
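For illustration, a minimal Python sketch of the inter-rater reliability analysis is given below. It assumes two observers scoring each response on ordinal scales; the table layout, column names (response_id, observer, accuracy), and toy ratings are hypothetical, not the study's data.

```python
import pandas as pd
import pingouin as pg
from sklearn.metrics import cohen_kappa_score

# Hypothetical long-format table: one row per (response, observer) pair.
scores = pd.DataFrame({
    "response_id": [1, 1, 2, 2, 3, 3],
    "observer":    ["A", "B", "A", "B", "A", "B"],
    "accuracy":    [5, 5, 4, 3, 5, 4],  # illustrative 1-5 Likert ratings
})

# ICC (two-way random effects, absolute agreement) across the two observers.
icc = pg.intraclass_corr(data=scores, targets="response_id",
                         raters="observer", ratings="accuracy")
print(icc.loc[icc["Type"] == "ICC2"])

# Cohen's kappa on the paired ratings, weighted so that near-misses on the
# ordinal scale count less against agreement than distant disagreements.
obs_a = scores.loc[scores["observer"] == "A", "accuracy"].to_numpy()
obs_b = scores.loc[scores["observer"] == "B", "accuracy"].to_numpy()
print(cohen_kappa_score(obs_a, obs_b, weights="quadratic"))
```

ICC2 (two-way random effects, absolute agreement) is a common choice when the same raters score every response; whether this study used that variant is an assumption here.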
Results: Difficulty and accuracy scores differed significantly across clinical categories (p < 0.05). Questions in the “Postoperative Complications & Rehabilitation” category were rated least difficult, while those in the “Diagnosis & Indication” category were rated most difficult yet achieved the highest accuracy and quality ratings. EN responses significantly outperformed TR responses in readability, word count, and accuracy (p < 0.05), whereas completeness and quality did not differ significantly by language. No significant performance differences were found between the two chatbots. Inter-observer agreement was generally high, except for completeness (p = 0.001), for which Observer-I assigned higher scores.
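To make the readability comparison concrete, the sketch below assumes the study scored EN responses with the Flesch Reading Ease formula and TR responses with the Ateşman formula (the standard measures for those languages) and compared groups with a Mann-Whitney U test of the kind reported above; all counts and scores are illustrative.

```python
from scipy.stats import mannwhitneyu

def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease for English text (higher = easier to read)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def atesman_readability(words, sentences, syllables):
    """Atesman readability formula for Turkish text (higher = easier)."""
    return 198.825 - 40.175 * (syllables / words) - 2.610 * (words / sentences)

# Toy counts for four EN and four TR responses: (words, sentences, syllables).
en_counts = [(120, 8, 180), (95, 6, 140), (150, 10, 230), (110, 7, 160)]
tr_counts = [(100, 7, 260), (85, 5, 220), (130, 9, 340), (90, 6, 235)]

en_scores = [flesch_reading_ease(w, s, y) for w, s, y in en_counts]
tr_scores = [atesman_readability(w, s, y) for w, s, y in tr_counts]

# Note: the two formulas are not on a shared scale; in practice each language
# is interpreted against its own difficulty bands before any comparison.
stat, p = mannwhitneyu(en_scores, tr_scores, alternative="two-sided")
print(f"Mann-Whitney U = {stat:.1f}, p = {p:.4f}")
```

A Kruskal-Wallis test (scipy.stats.kruskal) would play the analogous role for the five-category difficulty and accuracy comparisons.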
Conclusion: Both LLMs effectively generated clinically relevant responses, demonstrating substantial potential as supplemental tools for patient education, although the superior performance of EN responses emphasizes the need for further multilingual optimization.
Keywords: ChatGPT, DeepSeek, large language models, orthognathic surgery, patient education
Copyright and license
Copyright © 2025 The Author(s). This is an open-access article published by Bolu İzzet Baysal Training and Research Hospital under the terms of the Creative Commons Attribution License (CC BY), which permits unrestricted use, distribution, and reproduction in any medium or format, provided the original work is properly cited.



