Medical AI Projects

This page summarizes ongoing and past projects in medical AI, with a particular focus on ophthalmology, retinal imaging, and trustworthy clinical decision support.

Key Themes

• Domain-specialized LLMs for ophthalmology (e.g., Ophtimus-V2-Tx)
• Noise-robust medical image analysis and quantification
• Reliable mapping from model outputs to clinical coding systems
• Evaluation frameworks for safety, robustness, and explainability

Selected Projects

• Ophtimus-V2-Tx: An 8B-parameter ophthalmology LLM trained on case reports and evaluated with CliBench-based coding.
• ERM Quantification: Low-cost, fast SD-OCT-based epiretinal membrane detection and thickness quantification.
• Trustworthy Clinical Support: Methods to validate and monitor LLM predictions before deployment in real clinical workflows.

Ophtimus: Ophthalmology-specific LLM

Python · PyTorch · Transformers · LangChain · Streamlit · FastAPI

🤗 Models and Datasets | 📕 AAAI 2025 Workshop Paper

Introduction

Ophtimus is an open-source large language model (LLM) specialized in ophthalmology. The main model has 8 billion parameters and is based on the Llama architecture; smaller 1B and 3B variants are listed under Model Details. It is trained on curated ophthalmology-specific data, including medical papers, textbooks, and research reports. Filtering, summarization, and preprocessing were applied so that only relevant, high-quality material was retained.

Designed to be both lightweight and high-performing, Ophtimus is suitable for real-world applications such as clinical decision support, medical education, and patient communication. The model and its training pipeline are fully open-sourced, providing a practical reference for developing similar domain-specific LLMs in other areas of medicine.

Related GitHub Repositories
• Ophtimus-Ophthalmology-LLM
• SD-OCT-ERM-Quantification

Dataset Details

All datasets used for Ophtimus were either newly constructed or adapted for this project. Pre-training datasets were curated from open-source ophthalmology materials, while instruction-tuning and evaluation datasets were obtained by extracting only ophthalmology-relevant samples from broader medical corpora. All data underwent preprocessing steps, including deduplication, English-only filtering, and removal of any personally identifiable information (PII).
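The three preprocessing steps named above can be illustrated with a minimal sketch. The source does not describe how each step was implemented, so everything here is an assumption for illustration: the regular expressions, the ASCII-ratio language heuristic, the exact-match hashing, and the function names `scrub_pii`, `looks_english`, `dedup_key`, and `preprocess` are all hypothetical.

```python
import hashlib
import re

# Hypothetical patterns for illustration only. A production pipeline would
# need a far more thorough PII detector (names, IDs, dates, addresses).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def scrub_pii(text: str) -> str:
    """Replace e-mail addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def looks_english(text: str, min_ascii_ratio: float = 0.9) -> bool:
    """Crude language heuristic: share of ASCII letters among all letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    ascii_letters = sum(1 for ch in letters if ch.isascii())
    return ascii_letters / len(letters) >= min_ascii_ratio


def dedup_key(text: str) -> str:
    """Hash of the lowercased, whitespace-normalized text (exact dedup only)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def preprocess(docs):
    """Keep English documents, scrub PII, and drop exact duplicates."""
    seen = set()
    kept = []
    for doc in docs:
        if not looks_english(doc):
            continue
        cleaned = scrub_pii(doc)
        key = dedup_key(cleaned)
        if key in seen:
            continue
        seen.add(key)
        kept.append(cleaned)
    return kept
```

Hashing catches only exact duplicates after normalization; near-duplicate detection (for example MinHash) and a trained language identifier would be natural replacements for the two heuristics.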

Dataset name	Source	Size	Purpose	Key Features
Ophthalmology-pubmed-corpus	Ophthalmology papers	18.4M Tokens	Pre-Training	• Map-reduce style summaries • Broad ophthalmic keywords
Ophthalmology-textbook-corpus	Ophthalmology textbooks	4M Tokens	Pre-Training	• Trusted medical sources • Rich in diagnostic cases
Ophthalmology MCQA Inst Dataset	Ophthalmology documents	51.7k QAs	Instruction-Tuning	• Diverse multiple-choice formats • Reasoning included • Various ophthalmic topics
Ophthalmology EQA Inst Dataset	Ophthalmology documents	49.3k QAs	Instruction-Tuning	• Essay / explanation-style QA • Variety of ophthalmic topics
Ophtimus-Eval-Dataset	Medical platform data	2,153 QAs	Evaluation	• Expert-verified data • Multi-choice QA dataset
PubMedQA-ophthal-Dataset	PubMedQA	297 QAs	Evaluation	• Ophthalmology domain filtered • True/False MCQA dataset
MedMCQA-Ophthal-Dataset	MedMCQA	6,932 QAs	Evaluation	• Ophthalmology domain filtered • Multi-choice QA dataset
EQAEval-Dataset	MedQuAD, others	1,389 QAs	Evaluation	• Diverse open-source datasets • Ophthalmology domain filtered • Essay-style QA
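The "map-reduce style summaries" listed for the PubMed corpus can be sketched as follows. This is a generic illustration of the pattern, not the project's pipeline: the chunk size, the function names, and the pluggable `summarize` callable (which would typically wrap an LLM call) are assumptions.

```python
from typing import Callable, List


def chunk_words(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def map_reduce_summarize(
    text: str,
    summarize: Callable[[str], str],
    max_words: int = 200,
) -> str:
    """Summarize each chunk (map), then summarize the joined summaries (reduce)."""
    chunks = chunk_words(text, max_words)
    if not chunks:
        return ""
    partial = [summarize(chunk) for chunk in chunks]  # map step
    if len(partial) == 1:
        return partial[0]  # a single chunk needs no reduce step
    return summarize(" ".join(partial))  # reduce step
```

The reduce step here runs once. For very long documents the joined partial summaries may themselves exceed the chunk size, in which case the same function could be applied again to the joined text.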

Model Details

The pre-training and instruction-tuning columns below refer to the training conducted in this project. The base models had already undergone their own pre-training and/or fine-tuning, and Ophtimus was built using transfer learning on top of these models.

Model name	Base model	Parameters	Pre-training	Instruction-tuning
Ophtimus-Base	Llama-3.1-8B	8B	✅	❌
Ophtimus-Llama-1B	Llama-3.2-1B-Instruct	1B	❌	✅
Ophtimus-Llama-3B	Llama-3.2-3B-Instruct	3B	❌	✅
Ophtimus-Llama-8B	Llama-3.1-8B-Instruct	8B	❌	✅
Ophtimus-Instruct-8B	Ophtimus-Base	8B	✅	✅

Performance

Multi-Choice QA: Ophtimus-Eval, plus the ophthalmology subsets of MedMCQA and PubMedQA
Essay QA: ophthalmology-filtered samples from MedQuAD, Medical Flashcards, and Medical Wikidoc

Ophtimus-Eval is a proprietary dataset collected from a medical platform. The other datasets are established medical benchmarks, from which only ophthalmology-related QA pairs were extracted for evaluation.

Model	Multi-choice: Ophtimus-Eval	Multi-choice: MedMCQA (Ophth)	Multi-choice: PubMedQA (Ophth)	Essay: ROUGE-L	Essay: BLEU	Essay: METEOR	Essay: SemScore
OpenAI GPT-4o	71.95%	81.95%	89.90%	0.193	0.082	0.341	0.761
Llama-3-8B-Instruct	48.60%	74.02%	63.97%	0.193	0.064	0.244	0.684
Llama-3.1-8B-Instruct	39.78%	57.96%	83.84%	0.177	0.054	0.215	0.641
Eye-Llama	32.56%	59.43%	66.11%	0.183	0.062	0.211	0.686
PMC-Llama-13B	48.28%	63.45%	72.48%	0.223	0.082	0.288	0.714
Ophtimus-Llama-1B	41.45%	45.74%	61.95%	0.219	0.076	0.217	0.711
Ophtimus-Llama-3B	52.70%	62.10%	69.36%	0.224	0.077	0.225	0.726
Ophtimus-Llama-8B	60.78%	68.25%	69.70%	0.226	0.083	0.230	0.733
Ophtimus-Instruct-8B	63.85%	71.51%	72.73%	0.222	0.079	0.224	0.735
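To make two of the reported quantities concrete, here is a minimal sketch of multi-choice accuracy (presumably what the percentage columns report) and of ROUGE-L as an F1 score over the longest common subsequence. The source does not say which implementation produced the table, so the lowercased whitespace tokenization and the function names below are assumptions; established packages apply their own tokenization and may give slightly different values.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 on lowercased, whitespace-split tokens (a simplification)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def accuracy(gold, predicted) -> float:
    """Fraction of questions where the predicted option equals the gold option."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)
```

For example, a reference and a candidate of five tokens each that share a four-token subsequence score 0.8 on both precision and recall, and therefore 0.8 on F1.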

SD-OCT-based Epiretinal Membrane Diagnostic Assistant System

Python · PyTorch · OpenCV · YOLO · Pillow

Figure: Overall pipeline architecture for ERM detection and quantification

Introduction

This project presents a low-cost, efficient method for detecting and quantifying epiretinal membranes (ERM) using spectral-domain OCT (SD-OCT). Using deep learning, in particular YOLO object detection, we generate en face ERM projection images from B-scan data, enabling intuitive visualization and accurate measurement of ERM lesions.
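One plausible way to turn per-B-scan detections into an en face image is sketched below. The source does not spell out the construction, so this is an assumption: each B-scan contributes one row, and the horizontal extent of every detected box is marked in that row. The function names and the input layout (a dict from B-scan index to pixel spans) are hypothetical.

```python
def build_projection(detections, num_bscans, width):
    """Build a binary en face map with one row per B-scan.

    detections maps a B-scan index to a list of (x_min, x_max) pixel spans,
    taken here to be the horizontal extents of detected ERM boxes.
    """
    image = [[0] * width for _ in range(num_bscans)]
    for idx, spans in detections.items():
        if not 0 <= idx < num_bscans:
            raise IndexError(f"B-scan index {idx} is outside the volume")
        for x_min, x_max in spans:
            lo = max(0, int(x_min))
            hi = min(width, int(x_max))
            for x in range(lo, hi):  # x_max is treated as exclusive
                image[idx][x] = 1
    return image


def covered_fraction(image):
    """Share of en face pixels marked as ERM."""
    total = sum(len(row) for row in image)
    return sum(map(sum, image)) / total if total else 0.0
```

Multiplying the marked pixel count by the scanner's physical pixel spacing would convert the coverage into an area, provided the spacing along and across B-scans is known.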

The proposed approach also quantifies the association between ERM severity and retinal thickness, contributing toward enhanced clinical decision-making. This system aims to reduce the diagnostic gap between SD-OCT and Swept-Source OCT (SS-OCT) while maintaining accessibility and diagnostic performance.
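The source does not state which statistic is used to quantify the association between ERM severity and retinal thickness. As one common choice, a Pearson correlation between a per-eye ERM measure and a thickness measure could be computed as in this sketch; the function name and the choice of inputs are assumptions.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

A rank-based measure such as Spearman correlation may be preferable when severity is graded on an ordinal scale.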

GitHub repository: github.com/jinkimh/SD-OCT-ERM-Quantification

YOLO Model Evaluation

We evaluated YOLOv5, YOLOv8, and YOLOv11 models for ERM detection. Each model was trained on two training-set sizes (Full: 2,200 images; Half: 1,100 images) and tested on 650 expert-labeled OCT B-scans.

Model	Size	Params (M)	Precision	Recall	mAP@50	mAP@50:95	Dataset
YOLOv5	S	7.02	0.752	0.703	0.722	0.423	Full
YOLOv5	S	7.02	0.694	0.642	0.664	0.376	Half
YOLOv5	M	20.87	0.783	0.734	0.752	0.444	Full
YOLOv5	M	20.87	0.723	0.685	0.701	0.396	Half
YOLOv5	L	46.14	0.813	0.762	0.784	0.463	Full
YOLOv5	L	46.14	0.745	0.704	0.726	0.414	Half
YOLOv5	X	86.22	0.836	0.784	0.802	0.485	Full
YOLOv5	X	86.22	0.763	0.725	0.743	0.437	Half
YOLOv8	S	11.14	0.781	0.736	0.764	0.447	Full
YOLOv8	S	11.14	0.723	0.676	0.701	0.393	Half
YOLOv8	M	25.86	0.813	0.762	0.791	0.466	Full
YOLOv8	M	25.86	0.748	0.705	0.724	0.412	Half
YOLOv8	L	43.63	0.844	0.792	0.823	0.482	Full
YOLOv8	L	43.63	0.774	0.731	0.754	0.436	Half
YOLOv8	X	68.15	0.867	0.814	0.842	0.504	Full
YOLOv8	X	68.15	0.793	0.752	0.772	0.454	Half
YOLOv11	S	9.43	0.804	0.752	0.783	0.468	Full
YOLOv11	S	9.43	0.746	0.692	0.714	0.417	Half
YOLOv11	M	20.05	0.846	0.794	0.821	0.493	Full
YOLOv11	M	20.05	0.774	0.736	0.757	0.443	Half
YOLOv11	L	25.31	0.873	0.823	0.854	0.524	Full
YOLOv11	L	25.31	0.807	0.773	0.793	0.476	Half
YOLOv11	X	56.87	0.902	0.857	0.882	0.556	Full
YOLOv11	X	56.87	0.836	0.803	0.826	0.507	Half
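For readers less familiar with the table's columns, the sketch below shows how precision and recall follow from box matching at an IoU threshold of 0.5. It is a simplified single-image, single-class illustration with hypothetical function names, not the evaluation code behind the reported numbers. By the usual convention, mAP@50 is the area under the precision-recall curve at IoU 0.5, and mAP@50:95 averages that quantity over IoU thresholds from 0.50 to 0.95 in steps of 0.05.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(predictions, ground_truth, iou_threshold=0.5):
    """Greedy matching of predictions to ground-truth boxes.

    predictions: list of (confidence, box); ground_truth: list of boxes.
    Each ground-truth box can be matched by at most one prediction.
    """
    matched = set()
    tp = 0
    # Higher-confidence predictions get the first pick of ground-truth boxes.
    for _, box in sorted(predictions, key=lambda p: p[0], reverse=True):
        best_iou, best_idx = 0.0, None
        for i, gt in enumerate(ground_truth):
            if i in matched:
                continue
            value = iou(box, gt)
            if value > best_iou:
                best_iou, best_idx = value, i
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            tp += 1
    precision = tp / len(predictions) if predictions else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```

With two ground-truth boxes and three predictions, of which two overlap a ground-truth box with IoU above 0.5, this yields a precision of 2/3 and a recall of 1.0.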