Evaluation of AI Citation Accuracy in Anterior Segment Research
Aims: To conduct a pilot evaluation of the citation accuracy of four contemporary artificial intelligence (AI) models – ChatGPT (OpenAI GPT-5.1), Copilot (Microsoft Copilot 4.2), DeepSeek (DeepSeek-R1), and Gemini (Google Gemini Ultra 2.5) – in generating PubMed-style references for corneal, conjunctival, and eyelid disease research, and to identify common error patterns.
Material and Methods: Thirty-five standardized clinical paragraphs were selected from The Review of Ophthalmology (4th edition). Each AI model was prompted to generate AMA 11-style references relevant to the provided text, simulating a literature retrieval task. Generated citations were assessed for accuracy, DOI matching, and clinical relevance. In a second validation phase, citations were independently reviewed by two ophthalmology experts and classified as fully cited, partially cited, or not cited. Statistical comparisons of accuracy proportions among models were performed using chi-squared tests.
Results: DeepSeek demonstrated the highest citation accuracy (62.9%, 22/35), followed by ChatGPT and Copilot (each 51.4%, 18/35). Gemini showed the lowest accuracy (14.3%, 5/35). Differences in accuracy rates across models were statistically significant (χ² = 19.0, df = 3, p < 0.001). Expert validation confirmed DeepSeek’s relative advantage, with 42.9% (15/35) of its references classified as fully cited, compared with Copilot (20.0%, 7/35), ChatGPT (11.4%, 4/35), and Gemini (11.4%, 4/35). The most frequent error types were DOI mismatches and the generation of irrelevant or unverifiable references.
Conclusion: This pilot study indicates that contemporary AI models, particularly DeepSeek, show potential in assisting with citation generation. However, the observed error rates, including instances of hallucination, remain substantial. These findings underscore that rigorous human verification is indispensable when using AI for academic referencing in specialized medical fields, and highlight the need for continuous, version-specific benchmarking as these tools evolve.
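The chi-squared comparison described in the Methods can be re-derived from the correct-citation counts given in the Results. The following is a minimal sketch, not the authors' analysis code: the statistical software they used is not stated, and the function and variable names here are hypothetical. It assumes a 4 x 2 table (correct versus incorrect, 35 paragraphs per model) and uses a closed-form tail probability that holds only for 3 degrees of freedom.

```python
import math

# Correct-citation counts out of 35 paragraphs per model, taken from the Results.
N_PARAGRAPHS = 35
correct = {"DeepSeek": 22, "ChatGPT": 18, "Copilot": 18, "Gemini": 5}


def accuracy_percent(hits, total):
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100.0 * hits / total, 1)


def chi_squared_rx2(hits_by_group, total_per_group):
    """Pearson chi-squared statistic for an r x 2 table (correct vs. incorrect).

    Assumes every group has the same number of trials (hypothetical helper).
    """
    groups = len(hits_by_group)
    pooled = sum(hits_by_group) / (groups * total_per_group)
    expected_hit = pooled * total_per_group
    expected_miss = (1.0 - pooled) * total_per_group
    stat = 0.0
    for hits in hits_by_group:
        misses = total_per_group - hits
        stat += (hits - expected_hit) ** 2 / expected_hit
        stat += (misses - expected_miss) ** 2 / expected_miss
    return stat


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-squared variable with exactly 3 df."""
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))


statistic = chi_squared_rx2(list(correct.values()), N_PARAGRAPHS)
p_value = chi2_sf_df3(statistic)

for model, hits in correct.items():
    print(f"{model}: {hits}/{N_PARAGRAPHS} = {accuracy_percent(hits, N_PARAGRAPHS)}%")
print(f"chi-squared = {statistic:.1f}, df = {len(correct) - 1}, p = {p_value:.5f}")
```

Run on the reported counts, the statistic rounds to 19.0 with a p-value below 0.001, in line with the values stated in the Results. The calculation is written out by hand so that the sketch needs only the standard library; a statistics package would normally be used instead.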
Keywords:
artificial intelligence; citation accuracy; corneal disease; conjunctival disorders; eyelid diseases; large language models
Authors:
Mustafa Civelekler; Mehmet Çıtırık
Author affiliations:
University of Health Sciences, Ankara Etlik City Hospital, Department of Ophthalmology, Ankara, Türkiye
Published in:
Čes. a slov. Oftal., 82, 2026, No. Ahead of Print, p. 1-5
Category:
Original article
doi:
https://doi.org/10.31348/2026/21