Abstract

Aim: Patients undergoing orthognathic surgery frequently seek online resources to better understand the procedure, risks, and outcomes. As generative artificial intelligence (AI) models are increasingly integrated into healthcare communication, it is essential to evaluate their ability to deliver accurate, comprehensive, and readable patient information.

Methods: This study comparatively assessed two large language models (LLMs), ChatGPT-4.5 and DeepSeek-V3-R1, in answering frequently asked patient questions about orthognathic surgery, analyzing accuracy, completeness, readability, and quality in English (EN) and Turkish (TR). Twenty-five patient-centered questions, categorized into five clinical domains, yielded 200 AI-generated responses that were independently evaluated by two oral and maxillofacial surgeons (OMFSs) using a multidimensional framework. Statistical analyses included non-parametric tests and inter-rater reliability assessments (Intraclass Correlation Coefficient [ICC] and Cohen’s Kappa).
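For readers unfamiliar with the agreement statistics named above, the following minimal Python sketch illustrates how an ICC and a weighted Cohen’s kappa can be computed for two observers’ scores. It is not the authors’ analysis code; the data layout, example values, and the choice of the pingouin and scikit-learn libraries are illustrative assumptions.

import pandas as pd
import pingouin as pg  # intraclass correlation
from sklearn.metrics import cohen_kappa_score

# Hypothetical long-format ratings: one row per (response, observer) pair.
ratings = pd.DataFrame({
    "response_id": [1, 1, 2, 2, 3, 3],
    "observer": ["obs1", "obs2"] * 3,
    "accuracy": [5, 5, 4, 3, 5, 4],  # example Likert-type accuracy scores
})

# Two-way ICC estimates (pingouin reports the ICC variants with 95% CIs).
icc = pg.intraclass_corr(data=ratings, targets="response_id",
                         raters="observer", ratings="accuracy")
print(icc[["Type", "ICC", "CI95%"]])

# Quadratically weighted Cohen's kappa on the paired ordinal ratings.
wide = ratings.pivot(index="response_id", columns="observer", values="accuracy")
print(cohen_kappa_score(wide["obs1"], wide["obs2"], weights="quadratic"))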

Results: Significant differences emerged across clinical categories in difficulty and accuracy scores (p < 0.05). Questions in the “Postoperative Complications & Rehabilitation” category were rated least difficult, while those in the “Diagnosis & Indication” category were rated most difficult yet achieved the highest accuracy and quality ratings. English (EN) responses significantly outperformed Turkish (TR) responses in readability, word count, and accuracy (p < 0.05), whereas completeness and quality did not differ significantly by language. No significant performance differences were found between the two chatbots. Inter-observer agreement was generally high, except for completeness (p = 0.001), for which Observer-I assigned higher scores.
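For context on the readability comparison, the indices most widely used for the two languages (and plausibly, though not confirmed by the abstract alone, the ones applied here) are the Flesch Reading Ease formula for English and the Ateşman formula for Turkish:

FRE = 206.835 − 1.015 × (total words / total sentences) − 84.6 × (total syllables / total words)

Ateşman score = 198.825 − 40.175 × (total syllables / total words) − 2.610 × (total words / total sentences)

Higher values indicate easier text on both scales; because the coefficients are calibrated per language, each language is scored with its own formula rather than by applying the English formula to Turkish text.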

Conclusion: Both LLMs effectively generated clinically relevant responses, demonstrating substantial potential as supplemental tools for patient education, although the superior performance of EN responses emphasizes the need for further multilingual optimization.

Keywords: ChatGPT, DeepSeek, large language models, orthognathic surgery, patient education

How to cite

Güldiken İN, Dilaver E. Evaluation of ChatGPT-4.5 and DeepSeek-V3-R1 in answering patient-centered questions about orthognathic surgery: a comparative study across two languages. Northwestern Med J. 2025;5(4):209-21. https://doi.org/10.54307/2025.NWMJ.220
