Ophtimus: Ophthalmology-specific LLM
🤗 Models and Datasets | 📕 AAAI 2025 workshop Paper
Introduction
Ophtimus is an open-source large language model (LLM) specialized in ophthalmology, built with 8 billion parameters based on the LLaMA architecture. It was trained on carefully curated ophthalmology-specific data, including medical papers, textbooks, and research reports. Through filtering, summarization, and preprocessing, only the most relevant and high-quality information was retained.
Designed to be both lightweight and high-performing, Ophtimus is suitable for real-world applications such as clinical decision support, medical education, and patient communication. The model and its training pipeline are fully open-sourced, providing a practical reference for developing similar domain-specific LLMs in other areas of medicine.
GitHub Repository: github.com/jinkimh/Ophtimus-Ophthalmology-LLM
Dataset Details
Note: All datasets were either newly constructed or adapted for this project. Pre-training datasets were curated from open-source ophthalmology materials, while instruction-tuning and evaluation datasets were built by extracting only ophthalmology-relevant samples from broader medical corpora. All data underwent preprocessing steps including deduplication, language filtering (English only), and removal of any personally identifiable information (PII).
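The preprocessing steps above (deduplication, English-only filtering, PII removal) can be sketched roughly as below. This is an illustrative, simplified pipeline, not the project's actual code: the function names are hypothetical, the language filter is a crude ASCII heuristic, and only e-mail addresses are redacted as a PII example.

```python
import hashlib
import re

def deduplicate(docs):
    """Drop exact duplicates by hashing whitespace-normalized, lowercased text."""
    seen, unique = set(), []
    for doc in docs:
        key = hashlib.sha256(" ".join(doc.split()).lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

def looks_english(text, threshold=0.9):
    """Crude language filter: fraction of ASCII characters (real pipelines use a language-ID model)."""
    if not text:
        return False
    return sum(1 for c in text if ord(c) < 128) / len(text) >= threshold

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def strip_pii(text):
    """Redact e-mail addresses as one simple PII example."""
    return EMAIL_RE.sub("[REDACTED]", text)

docs = [
    "Glaucoma damages the optic nerve.",
    "Glaucoma damages the  optic nerve.",  # duplicate after normalization
    "Contact: author@example.com for the retina dataset.",
]
cleaned = [strip_pii(d) for d in deduplicate(docs) if looks_english(d)]
print(cleaned)
```

A production pipeline would typically add near-duplicate detection (e.g. MinHash) and a trained language-ID model, but the structure is the same.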
| Dataset name | Source | Size | Purpose | Key Features |
|---|---|---|---|---|
| Ophthalmology-pubmed-corpus [Link] | Ophthalmology papers | 18.4M tokens | Pre-training | • Map-reduce summarization • Broad ophthalmic keywords |
| Ophthalmology-textbook-corpus [Link] | Ophthalmology textbooks | 4M tokens | Pre-training | • Trusted medical sources • Rich in diagnostic cases |
| Ophthalmology MCQA Inst dataset [Link] | Ophthalmology docs | 51.7k QAs | Instruction-tuning | • Diverse multiple-choice formats • Reasoning included • Variety of ophthalmic topics |
| Ophthalmology EQA Inst dataset [Link] | Ophthalmology docs | 49.3k QAs | Instruction-tuning | • Variety of ophthalmic topics |
| Ophtimus-Eval-Dataset [Link] | Medical platform data | 2,153 QAs | Evaluation | • Expert-verified data • MCQA dataset |
| PubMedQA-ophthal-Dataset [Link] | PubMedQA | 297 QAs | Evaluation | • Ophthalmology domain filtered • True/False MCQA dataset |
| MedMCQA-Ophthal-Dataset [Link] | MedMCQA | 6,932 QAs | Evaluation | • Ophthalmology domain filtered • MCQA dataset |
| EQAEval-Dataset [Link] | MedQuAD, others | 1,389 QAs | Evaluation | • Diverse open-source datasets • Ophthalmology domain filtered • Essay QA |
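To make the "reasoning included" feature of the MCQA instruction data concrete, here is a hedged sketch of what one sample might look like. The field names and content below are hypothetical illustrations, not the released dataset's actual schema:

```python
# Hypothetical shape of one MCQA instruction-tuning sample;
# the actual field names in the released dataset may differ.
sample = {
    "question": "Which condition is characterized by increased intraocular pressure?",
    "options": {"A": "Cataract", "B": "Glaucoma", "C": "Uveitis", "D": "Keratoconus"},
    "answer": "B",
    "reasoning": (
        "Glaucoma is typically associated with elevated intraocular pressure "
        "leading to optic nerve damage."
    ),
}
print(sample["answer"])
```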
Model Details
Note: The "Pre-training" and "Instruction-tuning" columns in the table refer to the training performed in this project. The base models had already undergone pre-training and/or fine-tuning before this project; we applied transfer learning on top of them.
| Model name | Base model | Parameters | Pre-training | Instruction-tuning |
|---|---|---|---|---|
| Ophtimus-Base [Link] | Llama-3.1-8B | 8B | ✅ | ❌ |
| Ophtimus-Llama-1B [Link] | Llama-3.2-1B-Instruct | 1B | ❌ | ✅ |
| Ophtimus-Llama-3B [Link] | Llama-3.2-3B-Instruct | 3B | ❌ | ✅ |
| Ophtimus-Llama-8B [Link] | Llama-3.1-8B-Instruct | 8B | ❌ | ✅ |
| Ophtimus-Instruct-8B [Link] | Ophtimus-Base | 8B | ✅ | ✅ |
Performance
Note: Multi-Choice QA: Ophtimus-Eval, MedMCQA, PubMedQA | Essay QA: MedQuAD, Medical Flashcards, Medical Wikidoc
Ophtimus-Eval is a proprietary dataset collected from a medical platform. The others are established medical benchmark datasets, from which only ophthalmology-related QA pairs were extracted for evaluation.
| Model | Ophtimus Eval | MedMCQA (Ophth) | PubmedQA (Ophth) | RougeL | BLEU | METEOR | SemScore |
|---|---|---|---|---|---|---|---|
| OpenAI GPT-4o | 71.95% | 81.95% | 89.90% | 0.193 | 0.082 | 0.341 | 0.761 |
| Llama-3-8B-Instruct | 48.60% | 74.02% | 63.97% | 0.193 | 0.064 | 0.244 | 0.684 |
| Llama-3.1-8B-Instruct | 39.78% | 57.96% | 83.84% | 0.177 | 0.054 | 0.215 | 0.641 |
| Eye-Llama | 32.56% | 59.43% | 66.11% | 0.183 | 0.062 | 0.211 | 0.686 |
| PMC-Llama-13B | 48.28% | 63.45% | 72.48% | 0.223 | 0.082 | 0.288 | 0.714 |
| Ophtimus-Llama-1B | 41.45% | 45.74% | 61.95% | 0.219 | 0.076 | 0.217 | 0.711 |
| Ophtimus-Llama-3B | 52.70% | 62.10% | 69.36% | 0.224 | 0.077 | 0.225 | 0.726 |
| Ophtimus-Llama-8B | 60.78% | 68.25% | 69.70% | 0.226 | 0.083 | 0.230 | 0.733 |
| Ophtimus-Instruct-8B | 63.85% | 71.51% | 72.73% | 0.222 | 0.079 | 0.224 | 0.735 |
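For readers unfamiliar with the essay-question metrics, ROUGE-L scores the longest common subsequence of tokens shared between a model answer and the reference. A simplified, stdlib-only sketch (whitespace tokenization, no stemming, unlike the library implementations typically used for reported numbers):

```python
def lcs_length(a, b):
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 over whitespace tokens (simplified illustration)."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

score = rouge_l_f1(
    "epiretinal membrane causes blurred and distorted central vision",
    "epiretinal membrane causes distorted central vision",
)
print(round(score, 3))  # → 0.857
```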
Quickstart
Install Dependencies
```shell
git clone https://github.com/jinkimh/Ophtimus-Ophthalmology-LLM.git
cd Ophtimus-Ophthalmology-LLM
pip install -r requirements.txt
```
Ophtimus Inference
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# model_name examples: BaekSeungJu/Ophtimus-Instruct-8B, Ophtimus-Llama-1B,
# Ophtimus-Llama-3B, or Ophtimus-Llama-8B
model_name = "BaekSeungJu/Ophtimus-Instruct-8B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda")

# Left padding so that batched generation continues directly from each prompt
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token

system_instruction = (
    "You are an expert ophthalmologist. Please provide accurate and "
    "medically sound answers to the user's ophthalmology-related question."
)

# Enter your questions in the list
questions = [
    "Please describe the symptoms and treatment of epiretinal membrane.",
    "What's good for eyes?",
]

# Build chat-formatted prompts for each question
prompts = []
for question in questions:
    messages = [
        {"role": "system", "content": system_instruction},
        {"role": "user", "content": question},
    ]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=False
    )
    prompts.append(prompt)

# Tokenize the batch; keep the attention mask so padded positions are ignored
inputs = tokenizer(prompts, padding=True, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=1024,
        do_sample=False,
    )

decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
for i, text in enumerate(decoded):
    print(f"------------------------\nAnswer for question {i+1}:\n{text}")
```
For more details, visit the GitHub repository.