Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA

1 University of California, Santa Cruz
2 Carnegie Mellon University

Introduction

Large Multimodal Models (LMMs) have shown remarkable progress in medical Visual Question Answering (Med-VQA), achieving high accuracy on existing benchmarks. However, their reliability under robust evaluation is questionable. This study reveals that when subjected to simple probing evaluation, state-of-the-art models perform worse than random guessing on medical diagnosis questions.

To address this critical evaluation problem, we introduce the Probing Evaluation for Medical Diagnosis (ProbMed) dataset to rigorously assess LMM performance in medical imaging through probing evaluation and procedural diagnosis. Specifically, probing evaluation pairs each original question with a negation question containing hallucinated attributes, while procedural diagnosis requires reasoning across multiple diagnostic dimensions for each image, including modality recognition, organ identification, clinical findings, abnormalities, and positional grounding.

Our evaluation reveals that top-performing models such as GPT-4o, GPT-4V, and Gemini Pro perform worse than random guessing on specialized diagnostic questions, indicating significant limitations in handling fine-grained medical inquiries. Moreover, models like LLaVA-Med struggle even with more general questions, while results from CheXagent demonstrate that expertise can transfer across different modalities of the same organ, showing that specialized domain knowledge remains crucial for improving performance. This study underscores the urgent need for more robust evaluation to ensure the reliability of LMMs in critical fields like medical diagnosis, where current LMMs remain far from ready for deployment.

ProbMed Benchmark

Overview

ProbMed draws from two comprehensive biomedical datasets, MedICaT and ChestX-ray14, to compile a diverse set of 6,303 images. These images span three modalities (X-ray, MRI, and CT scan) and four organs (abdomen, brain, chest, and spine). After preprocessing, we generated a diverse set of high-quality questions for each image covering various diagnostic dimensions, resulting in a total of 57,132 question-answer pairs, an average of about 9 pairs per image.

  • Is the current evaluation of LMMs for Med-VQA reliable?

    One of the main motivations behind ProbMed is to assess models' ability to accurately distinguish between relevant and irrelevant features. ProbMed pairs each original question with a negation question containing hallucinated attributes. This challenges model robustness by requiring models to identify true conditions while disregarding false, hallucinated ones. For instance, a question about a specific finding is paired with a negated question featuring a different, non-existent finding, testing whether the model can exclusively identify the true finding (see the sketch after this list).

  • How reliable are LMMs on medical diagnosis, ranging from general questions to specialized diagnostic questions?

    To ensure a comprehensive evaluation, ProbMed includes questions that require reasoning across multiple diagnostic dimensions for each image. These dimensions include modality recognition, organ identification, clinical findings, abnormalities, and positional reasoning. This multifaceted approach assesses a model's diagnostic capabilities beyond simple question-answer pairs, requiring it to integrate various pieces of information to form a coherent diagnostic picture.
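
To make the pairing idea concrete, below is a minimal sketch of how an adversarial (hallucinated-negation) pair could be constructed for an image. The candidate-finding pool, question templates, and field names are illustrative assumptions, not the exact ProbMed generation pipeline.

```python
import random

# Illustrative pool of candidate findings; the real ProbMed vocabulary is larger.
CANDIDATE_FINDINGS = ["pleural effusion", "cardiomegaly", "pneumothorax", "atelectasis"]

def build_probing_pair(image_id: str, true_finding: str, rng: random.Random):
    """Pair an original question (answer: yes) with a hallucinated negation (answer: no)."""
    # Sample a finding that is NOT present in the image to serve as the hallucinated attribute.
    distractors = [f for f in CANDIDATE_FINDINGS if f != true_finding]
    hallucinated = rng.choice(distractors)

    original = {
        "image_id": image_id,
        "question": f"Is there evidence of {true_finding} in this image?",
        "answer": "yes",
    }
    adversarial = {
        "image_id": image_id,
        "question": f"Is there evidence of {hallucinated} in this image?",
        "answer": "no",  # the attribute is hallucinated, so the correct answer is negated
    }
    return original, adversarial

if __name__ == "__main__":
    rng = random.Random(0)
    orig, adv = build_probing_pair("chest_xray_0001", "pleural effusion", rng)
    print(orig["question"], "->", orig["answer"])
    print(adv["question"], "->", adv["answer"])
```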

The dataset is available for download on Hugging Face Datasets.

Experimental Analysis

Is Current Evaluation of LMMs for Med-VQA Reliable?

  • Probing Evaluation with Adversarial Pairs in VQA-RAD

    We construct adversarial pairs for the 118 test instances whose answer is "yes" among the 272 closed-ended question-answer pairs in the test set of an existing benchmark (VQA-RAD). Each adversarial pair was manually created such that, given the limited information in the original question-answer pair, the answer to the adversarial question had to be negated. This process yielded 236 question-answer pairs in total. The adversarial questions in this subset are less challenging than those in ProbMed, as they often involve a simple semantic negation of the original question due to the limited information available.

    The results reveal the significant impact of adversarial pairs on model performance. Although original accuracy appears very high, even for some otherwise underperforming models, accuracy drops drastically once the subset is balanced with adversarial pairs: by 19.49% for GPT-4o, 6.78% for GPT-4V, and 16.95% for Gemini Pro, and by an average of 35.84% across the tested models.

  • Probing Evaluation with Adversarial Pairs in ProbMed

    A similarly significant impact of adversarial pairs is observed on ProbMed. The accuracy of more capable models is generally less affected by the introduction of challenging adversarial pairs; however, even the most robust models lose at least 11.19% accuracy on ProbMed's challenging questions, with an average decrease of 37.09% across the tested models. This highlights the critical role of probing evaluation in comprehensively assessing Med-VQA performance; a minimal scoring sketch follows below.
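
As a rough illustration of how the accuracy drop can be measured, the sketch below compares accuracy on the original yes-questions with accuracy on the balanced set that adds each adversarial counterpart. The prediction format and the `model_answers` mapping are assumptions made for the example, not the paper's exact evaluation harness.

```python
from typing import Dict, List, Tuple

def accuracy(qa_pairs: List[dict], model_answers: Dict[str, str]) -> float:
    """Fraction of questions whose predicted yes/no answer matches the ground truth."""
    correct = sum(
        1 for qa in qa_pairs
        if model_answers.get(qa["question_id"], "").strip().lower() == qa["answer"]
    )
    return correct / len(qa_pairs)

def accuracy_drop(
    original: List[dict],
    adversarial: List[dict],
    model_answers: Dict[str, str],
) -> Tuple[float, float, float]:
    """Return (original accuracy, balanced accuracy, absolute drop in percentage points)."""
    acc_original = accuracy(original, model_answers)
    acc_balanced = accuracy(original + adversarial, model_answers)  # balanced set
    return acc_original, acc_balanced, (acc_original - acc_balanced) * 100

if __name__ == "__main__":
    # Toy example: two original yes-questions and their adversarial (answer "no") twins.
    original = [
        {"question_id": "q1", "answer": "yes"},
        {"question_id": "q2", "answer": "yes"},
    ]
    adversarial = [
        {"question_id": "q1_adv", "answer": "no"},
        {"question_id": "q2_adv", "answer": "no"},
    ]
    # A model that always answers "yes" looks perfect originally but drops to 50% when balanced.
    model_answers = {qid: "yes" for qid in ["q1", "q2", "q1_adv", "q2_adv"]}
    print(accuracy_drop(original, adversarial, model_answers))  # (1.0, 0.5, 50.0)
```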

How Reliable Are LMMs in Medical Diagnosis?

  • Performance across Diagnostic Questions

    After correcting model accuracy by introducing adversarial pairs, we turned to the second research question and conducted diagnostic probing on the ProbMed dataset, ranging from general to specialized diagnostic questions.

    While GPT-4o, GPT-4V, and Gemini Pro outperform other models and excel in general tasks such as recognizing image modality and organs, their low performance in specialized tasks like determining the existence of abnormalities and answering fine-grained questions about condition/finding and position highlights a significant gap in their ability to aid in real-life diagnosis.

Categorical and overall accuracy (%) of different models aggregated across all image types in ProbMed. The best result in each question category is in bold, and the second best is underlined.

#   Model             Overall  Modality  Organ  Abnormality  Condition/Finding  Position
1   GPT-4o            55.60    97.42     69.46  61.97        29.30              24.06
2   GPT-4V            55.28    92.51     71.73  53.30        35.19              22.40
3   Gemini 1.5 Pro    55.08    96.47     75.69  62.59        27.93              17.54
4   Med-Flamingo      35.66    44.15     61.39  50.00        26.33              5.65
5   CheXagent         30.61    37.25     33.95  73.31        28.52              7.48
6   BiomedGPT         33.34    60.25     46.81  50.31        14.13              6.11
7   LLaVA-Med         17.90    5.49      32.98  38.76        20.39              5.37
8   MiniGPT-v2        27.67    3.25      76.29  50.09        15.23              8.05
9   LLaVA-v1.6 (7B)   24.96    6.77      80.70  46.18        3.57               1.07
10  LLaVA-v1 (7B)     19.30    25.28     40.53  50.00        0.34               0.10
*   Random Chance     32.13    25.00     25.00  50.00        35.67              36.48

🚨 To submit your results to the leaderboard, please send your result JSON files to this email.

🚨 For more submission details, please refer to this link.

  • Error Analysis in Procedural Diagnosis

    We further conduct an error analysis of GPT-4V and Gemini Pro across the specialized question types: Abnormality, Condition/Finding, and Position. Each accuracy measurement is conditional on the model having successfully answered the preceding diagnostic questions, reflecting a procedural diagnosis approach. This analysis reveals both models' vulnerability to hallucination errors, particularly as they progress through the diagnostic procedure, with Gemini Pro being more prone to accepting false conditions and positions (a minimal sketch of the conditional scoring follows the figure caption below).


Error Analysis of GPT-4V and Gemini Pro on ProbMed. The table shows the accuracy and types of errors for three specialized question types. Errors are categorized into wrong answers, rejection to answer, denying ground truth, and accepting hallucinations, providing a detailed breakdown of model performance and failure modes.
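
To illustrate the conditional measurement, here is a minimal sketch of procedural scoring: a specialized question for an image only counts toward accuracy if the model answered all preceding diagnostic questions (e.g., modality and organ) for that image correctly. The question-type ordering and record fields are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List

# Assumed ordering of the diagnostic procedure, from general to specialized.
PROCEDURE = ["modality", "organ", "abnormality", "condition_finding", "position"]

def conditional_accuracy(records: List[dict]) -> Dict[str, float]:
    """Accuracy per question type, conditioned on all earlier types being answered correctly.

    Each record is assumed to look like:
        {"image_id": ..., "qtype": ..., "correct": bool}
    """
    by_image = defaultdict(dict)
    for r in records:
        by_image[r["image_id"]][r["qtype"]] = r["correct"]

    hits, totals = defaultdict(int), defaultdict(int)
    for answers in by_image.values():
        for i, qtype in enumerate(PROCEDURE):
            if qtype not in answers:
                break
            # Only evaluate this step if every preceding step was answered correctly.
            if all(answers.get(prev, False) for prev in PROCEDURE[:i]):
                totals[qtype] += 1
                hits[qtype] += int(answers[qtype])
    return {q: hits[q] / totals[q] for q in totals}

if __name__ == "__main__":
    demo = [
        {"image_id": "img1", "qtype": "modality", "correct": True},
        {"image_id": "img1", "qtype": "organ", "correct": True},
        {"image_id": "img1", "qtype": "abnormality", "correct": False},
        {"image_id": "img2", "qtype": "modality", "correct": False},
        {"image_id": "img2", "qtype": "organ", "correct": True},  # not counted: modality was wrong
    ]
    print(conditional_accuracy(demo))
```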

  • Transferability of Domain Expertise

    CheXagent, a model trained exclusively on chest X-ray images, performs best among all evaluated models in detecting abnormalities and identifying conditions/findings when tested on chest X-ray images. We conducted a finer-grained analysis to explore whether the model's expertise in identifying features of a particular organ transfers to other imaging modalities.

    CheXagent achieves significantly higher accuracy in identifying chest-related features than features of other organs, and it also shows higher accuracy in identifying conditions and findings in CT scans and MRIs of the chest than in those of other organs within the same unseen modalities. This indicates that specialized knowledge gained from chest X-rays can transfer to other imaging modalities of the same organ in a zero-shot manner, highlighting the potential for cross-modality expertise transfer in real-life medical imaging diagnostics (a minimal grouping sketch follows the figure caption below).


Accuracy comparison of CheXagent in identifying organs and conditions/findings across different modalities. The model demonstrates significantly higher accuracy in identifying organs on chest images compared to images of other organs for both MRI and CT scans. Additionally, CheXagent shows improved accuracy in identifying conditions/findings on chest images, indicating the transferability of its specialized knowledge from chest X-ray training to other imaging modalities.
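
The breakdown behind this comparison can be computed with a simple organ-by-modality aggregation, sketched below. The record fields and the use of pandas are assumptions for illustration, not the paper's analysis code.

```python
import pandas as pd

# Hypothetical per-question results for CheXagent; fields and values are illustrative only.
results = pd.DataFrame([
    {"modality": "CT",  "organ": "chest",   "qtype": "condition_finding", "correct": 1},
    {"modality": "CT",  "organ": "abdomen", "qtype": "condition_finding", "correct": 0},
    {"modality": "MRI", "organ": "chest",   "qtype": "organ",             "correct": 1},
    {"modality": "MRI", "organ": "brain",   "qtype": "organ",             "correct": 0},
])

# Accuracy per (modality, organ, question type): consistently higher chest rows across
# unseen modalities would indicate zero-shot transfer of chest-specific expertise.
breakdown = (
    results.groupby(["modality", "organ", "qtype"])["correct"]
    .mean()
    .mul(100)
    .rename("accuracy_percent")
    .reset_index()
)
print(breakdown)
```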

BibTeX

@misc{yan2024worse,
      title={Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA}, 
      author={Qianqi Yan and Xuehai He and Xiang Yue and Xin Eric Wang},
      year={2024},
      eprint={2405.20421},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}