Ophtimus: Ophthalmology-specific LLM

Python PyTorch Transformers LangChain Streamlit FastAPI

🤗 Models and Datasets  |  📕 AAAI 2025 Workshop Paper

Introduction

Ophtimus is an open-source large language model (LLM) specialized in ophthalmology, built on the LLaMA architecture with a flagship model of 8 billion parameters. It was trained on carefully curated ophthalmology-specific data, including medical papers, textbooks, and research reports. Through filtering, summarization, and preprocessing, only the most relevant and high-quality information was retained.

Designed to be both lightweight and high-performing, Ophtimus is suitable for real-world applications such as clinical decision support, medical education, and patient communication. The model and its training pipeline are fully open-sourced, providing a practical reference for developing similar domain-specific LLMs in other areas of medicine.

GitHub Repository: github.com/jinkimh/Ophtimus-Ophthalmology-LLM

Ophtimus Overall Architecture

Dataset Details

Note: All datasets were either newly constructed or adapted for this project. Pre-training datasets were curated from open-source ophthalmology materials, while instruction-tuning and evaluation datasets were built by extracting only ophthalmology-relevant samples from broader medical corpora. All data underwent preprocessing steps including deduplication, language filtering (English only), and removal of any personally identifiable information (PII).
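The filtering and deduplication steps described above are not shown in this README; a minimal sketch of what they might look like is below. The keyword list, the helper names, and the ASCII-ratio language check are all assumptions for illustration (a real pipeline would use a proper language detector and also handle PII removal), not the project's actual code.

```python
import hashlib
import re

# Hypothetical keyword list; the actual curation used a broader ophthalmic vocabulary
OPHTHALMIC_KEYWORDS = {"retina", "cornea", "glaucoma", "cataract", "macula", "ophthalmology"}

def is_relevant(text: str) -> bool:
    """Keep documents that mention at least one ophthalmic keyword."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return bool(words & OPHTHALMIC_KEYWORDS)

def is_english(text: str) -> bool:
    """Crude English filter: high proportion of ASCII letters."""
    letters = sum(c.isascii() and c.isalpha() for c in text)
    return letters / max(len(text), 1) > 0.6

def deduplicate(docs):
    """Exact deduplication by content hash of whitespace-normalized text."""
    seen, unique = set(), []
    for doc in docs:
        key = hashlib.sha256(" ".join(doc.split()).lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

def preprocess(docs):
    """Apply relevance and language filters, then deduplicate."""
    return deduplicate(d for d in docs if is_relevant(d) and is_english(d))
```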

| Dataset name | Source | Size | Purpose | Key Features |
|---|---|---|---|---|
| Ophthalmology-pubmed-corpus [Link] | Ophthalmology papers | 18.4M tokens | Pre-training | • Map-reduce method summary<br>• Broad ophthalmic keywords |
| Ophthalmology-textbook-corpus [Link] | Ophthalmology textbooks | 4M tokens | Pre-training | • Trusted medical sources<br>• Rich in diagnostic cases |
| Ophthalmology MCQA Inst dataset [Link] | Ophthalmology docs | 51.7k QAs | Instruction-tuning | • Diverse multiple-choice formats<br>• Reasoning included<br>• Variety of ophthalmic topics |
| Ophthalmology EQA Inst dataset [Link] | Ophthalmology docs | 49.3k QAs | Instruction-tuning | • Variety of ophthalmic topics |
| Ophtimus-Eval-Dataset [Link] | Medical platform data | 2,153 QAs | Evaluation | • Expert-verified data<br>• MCQA dataset |
| PubMedQA-Ophthal-Dataset [Link] | PubMedQA | 297 QAs | Evaluation | • Ophthalmology domain filtered<br>• True/False MCQA dataset |
| MedMCQA-Ophthal-Dataset [Link] | MedMCQA | 6,932 QAs | Evaluation | • Ophthalmology domain filtered<br>• MCQA dataset |
| EQAEval-Dataset [Link] | MedQuAD, others | 1,389 QAs | Evaluation | • Diverse open-source datasets<br>• Ophthalmology domain filtered<br>• Essay QA |
Model Details

Note: The "Pre-training" and "Instruction-tuning" columns in the table refer to the training performed in this project. The base models had already undergone pre-training and/or fine-tuning prior to this project, and we applied transfer learning starting from those models.

| Model name | Base model | Parameters | Pre-training | Instruction-tuning |
|---|---|---|---|---|
| Ophtimus-Base [Link] | Llama-3.1-8B | 8B | | |
| Ophtimus-Llama-1B [Link] | Llama-3.2-1B-Instruct | 1B | | |
| Ophtimus-Llama-3B [Link] | Llama-3.2-3B-Instruct | 3B | | |
| Ophtimus-Llama-8B [Link] | Llama-3.1-8B-Instruct | 8B | | |
| Ophtimus-Instruct-8B [Link] | Ophtimus-Base | 8B | | |
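For instruction tuning, each MCQA record must be rendered into chat form. A sketch of how one record might be converted into chat messages is shown below; the field names (question/options/answer/reasoning) are an assumed schema for illustration, and the released datasets may use different keys.

```python
def mcqa_to_messages(record: dict) -> list[dict]:
    """Convert one MCQA record into chat messages for instruction tuning.
    The record schema here is hypothetical, not the released dataset format."""
    options = "\n".join(
        f"{label}. {text}" for label, text in zip("ABCD", record["options"])
    )
    user = f"{record['question']}\n{options}\nAnswer with the correct option."
    # Put the reasoning before the final answer, since the MCQA set includes reasoning
    assistant = f"{record['reasoning']}\nAnswer: {record['answer']}"
    return [
        {"role": "system", "content": "You are an expert ophthalmologist."},
        {"role": "user", "content": user},
        {"role": "assistant", "content": assistant},
    ]

sample = {
    "question": "Which condition is characterized by increased intraocular pressure?",
    "options": ["Cataract", "Glaucoma", "Uveitis", "Keratoconus"],
    "answer": "B",
    "reasoning": "Elevated intraocular pressure damaging the optic nerve defines glaucoma.",
}
messages = mcqa_to_messages(sample)
```

In training, such message lists would typically be rendered to token sequences with the tokenizer's `apply_chat_template`, as in the Quickstart inference example.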

Performance

Note: Multi-Choice QA: Ophtimus-Eval, MedMCQA, PubMedQA | Essay QA: MedQuAD, Medical Flashcards, Medical Wikidoc
Ophtimus-Eval is a proprietary dataset collected from a medical platform. The others are established medical benchmark datasets, from which only ophthalmology-related QA pairs were extracted for evaluation.

Multi-choice questions are scored by accuracy (first three columns); essay questions by ROUGE-L, BLEU, METEOR, and SemScore (last four columns).

| Model | Ophtimus-Eval | MedMCQA (Ophth) | PubMedQA (Ophth) | ROUGE-L | BLEU | METEOR | SemScore |
|---|---|---|---|---|---|---|---|
| OpenAI GPT-4o | 71.95% | 81.95% | 89.90% | 0.193 | 0.082 | 0.341 | 0.761 |
| Llama-3-8B-Instruct | 48.60% | 74.02% | 63.97% | 0.193 | 0.064 | 0.244 | 0.684 |
| Llama-3.1-8B-Instruct | 39.78% | 57.96% | 83.84% | 0.177 | 0.054 | 0.215 | 0.641 |
| Eye-Llama | 32.56% | 59.43% | 66.11% | 0.183 | 0.062 | 0.211 | 0.686 |
| PMC-Llama-13B | 48.28% | 63.45% | 72.48% | 0.223 | 0.082 | 0.288 | 0.714 |
| Ophtimus-Llama-1B | 41.45% | 45.74% | 61.95% | 0.219 | 0.076 | 0.217 | 0.711 |
| Ophtimus-Llama-3B | 52.70% | 62.10% | 69.36% | 0.224 | 0.077 | 0.225 | 0.726 |
| Ophtimus-Llama-8B | 60.78% | 68.25% | 69.70% | 0.226 | 0.083 | 0.230 | 0.733 |
| Ophtimus-Instruct-8B | 63.85% | 71.51% | 72.73% | 0.222 | 0.079 | 0.224 | 0.735 |
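ROUGE-L, used above for essay questions, scores a candidate answer by its longest common subsequence overlap with the reference. A minimal sketch of the F-measure over whitespace tokens is shown below; real evaluations would normally use an established library (e.g. `rouge-score`) with its tokenization and stemming.

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = (
                dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
            )
    return dp[-1][-1]

def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F-measure (beta = 1) over whitespace tokens."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_len(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```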

Quickstart

Install Dependencies

```bash
git clone https://github.com/jinkimh/Ophtimus-Ophthalmology-LLM.git
cd Ophtimus-Ophthalmology-LLM
pip install -r requirements.txt
```

Ophtimus Inference

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# model_name examples: BaekSeungJu/Ophtimus-Instruct-8B, Ophtimus-Llama-1B,
# Ophtimus-Llama-3B, or Ophtimus-Llama-8B
model_name = "BaekSeungJu/Ophtimus-Instruct-8B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda")

# Left padding so each sequence in the batch ends right where generation begins
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token

system_instruction = (
    "You are an expert ophthalmologist. Please provide accurate and "
    "medically sound answers to the user's ophthalmology-related question."
)

# Enter your questions in the list
questions = [
    "Please describe the symptoms and treatment of epiretinal membrane.",
    "What's good for eyes?",
]

# Render each question with the model's chat template
prompts = []
for question in questions:
    messages = [
        {"role": "system", "content": system_instruction},
        {"role": "user", "content": question},
    ]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=False
    )
    prompts.append(prompt)

# Tokenize the batch; keep the attention mask so padding tokens are ignored
inputs = tokenizer(prompts, padding=True, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens (strip the echoed prompt)
generated = outputs[:, inputs["input_ids"].shape[1]:]
decoded = tokenizer.batch_decode(generated, skip_special_tokens=True)
for i, text in enumerate(decoded):
    print(f"------------------------\nAnswer for question {i+1}:\n{text}")
```

For more details, visit the GitHub repository.