Evaluation of AI Citation Accuracy in Anterior Segment Research


Aims: To conduct a pilot evaluation of the citation accuracy of four contemporary artificial intelligence (AI) models – ChatGPT (OpenAI GPT-5.1), Copilot (Microsoft Copilot 4.2), DeepSeek (DeepSeek-R1), and Gemini (Google Gemini Ultra 2.5) – in generating PubMed-style references for corneal, conjunctival, and eyelid disease research, and to identify common error patterns.

Material and Methods: Thirty-five standardized clinical paragraphs were selected from The Review of Ophthalmology (4th edition). Each AI model was prompted to generate AMA 11-style references relevant to the provided text, simulating a literature retrieval task. Generated citations were assessed for accuracy, DOI matching, and clinical relevance. In a second validation phase, citations were independently reviewed by two ophthalmology experts and classified as fully cited, partially cited, or not cited. Statistical comparisons of accuracy proportions among models were performed using chi-squared tests.

Results: DeepSeek demonstrated the highest citation accuracy (62.9%, 22/35), followed by ChatGPT and Copilot (each 51.4%, 18/35). Gemini showed the lowest accuracy (14.3%, 5/35). Differences in accuracy rates across models were statistically significant (χ² = 19.0, df = 3, p < 0.001). Expert validation confirmed DeepSeek’s relative advantage, with 42.9% (15/35) of its references classified as fully cited, compared with 20.0% (7/35) for Copilot and 11.4% (4/35) each for ChatGPT and Gemini. The most frequent error types were DOI mismatches and the generation of irrelevant or unverifiable references.
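
The omnibus comparison can be checked from the reported counts alone. The following is a minimal sketch, not the authors' analysis code: it assumes the test was a Pearson chi-squared test on a 4 × 2 table of accurate versus not-accurate citations, with the not-accurate count taken as 35 minus the accurate count for each model. The names `ACCURATE`, `N_PER_MODEL`, `chi_squared_independence`, and `chi2_sf_df3` are illustrative and do not come from the article.

```python
import math

# Accurate-citation counts out of 35 paragraphs per model, as reported in the Results.
ACCURATE = {"DeepSeek": 22, "ChatGPT": 18, "Copilot": 18, "Gemini": 5}
N_PER_MODEL = 35  # assumption: every model was scored on all 35 paragraphs


def chi_squared_independence(table):
    """Return the Pearson chi-squared statistic and degrees of freedom for an r x c count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of model and outcome.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-squared variable with exactly 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# One row per model: [accurate, not accurate].
table = [[k, N_PER_MODEL - k] for k in ACCURATE.values()]
statistic, dof = chi_squared_independence(table)
p_value = chi2_sf_df3(statistic)  # closed form valid only because dof == 3 here
print(f"chi2 = {statistic:.1f}, df = {dof}, p = {p_value:.5f}")
```

Under these assumptions the counts reproduce the reported statistic (χ² = 19.0, df = 3) with p below 0.001. The closed-form tail probability is specific to three degrees of freedom; a table of a different shape would need a general chi-squared distribution function from a statistics library.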

Conclusion: This pilot study indicates that contemporary AI models, particularly those like DeepSeek, show potential in assisting with citation generation. However, the observed error rates, including instances of hallucination, remain substantial. These findings underscore that rigorous human verification is indispensable when using AI for academic referencing in specialized medical fields, and highlight the need for continuous, version-specific benchmarking as these tools evolve.

Keywords: artificial intelligence; citation accuracy; corneal disease; conjunctival disorders; eyelid diseases; large language models


Authors: Mustafa Civelekler; Mehmet Çıtırık
Author affiliations: University of Health Sciences, Ankara Etlik City Hospital, Department of Ophthalmology, Ankara, Türkiye
Published in: Čes. a slov. Oftal. (Česká a slovenská oftalmologie), 82, 2026, No. Ahead of Print, p. 1-5
Category: Original article
doi: https://doi.org/10.31348/2026/21

