Per-category and overall accuracy (%) of different models, aggregated across all image types in ProbMed. The best result in each question category is in bold, and the second best is underlined.
| # | Model | Source | Overall | Modality | Organ | Abnormality | Condition/Finding | Position |
|---|---|---|---|---|---|---|---|---|
| 1 | GPT-4o | Link | 55.60 | 97.42 | 69.46 | 61.97 | 29.30 | 24.06 |
| 2 | GPT-4V | Link | 55.28 | 92.51 | 71.73 | 53.30 | 35.19 | 22.40 |
| 3 | Gemini 1.5 Pro | Link | 55.08 | 96.47 | 75.69 | 62.59 | 27.93 | 17.54 |
| 4 | Med-Flamingo | Link | 35.66 | 44.15 | 61.39 | 50.00 | 26.33 | 5.65 |
| 5 | CheXagent | Link | 30.61 | 37.25 | 33.95 | 73.31 | 28.52 | 7.48 |
| 6 | BiomedGPT | Link | 33.34 | 60.25 | 46.81 | 50.31 | 14.13 | 6.11 |
| 7 | LLaVA-Med | Link | 17.90 | 5.49 | 32.98 | 38.76 | 20.39 | 5.37 |
| 8 | MiniGPT-v2 | Link | 27.67 | 3.25 | 76.29 | 50.09 | 15.23 | 8.05 |
| 9 | LLaVA-v1.6 (7B) | Link | 24.96 | 6.77 | 80.70 | 46.18 | 3.57 | 1.07 |
| 10 | LLaVA-v1 (7B) | Link | 19.30 | 25.28 | 40.53 | 50.00 | 0.34 | 0.10 |
| * | Random Chance | - | 32.13 | 25.00 | 25.00 | 50.00 | 35.67 | 36.48 |
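
The table is easier to read against the Random Chance row when the comparison is made explicit. The following is a minimal sketch, not part of the ProbMed tooling: the numbers are copied from the table above, and the names `SCORES`, `RANDOM_CHANCE`, `rank_by`, and `below_chance` are hypothetical helpers introduced only for illustration.

```python
# Accuracy (%) copied from the leaderboard table above.
# Column order: Overall, Modality, Organ, Abnormality, Condition/Finding, Position.
CATEGORIES = ["Overall", "Modality", "Organ", "Abnormality",
              "Condition/Finding", "Position"]

SCORES = {
    "GPT-4o":          [55.60, 97.42, 69.46, 61.97, 29.30, 24.06],
    "GPT-4V":          [55.28, 92.51, 71.73, 53.30, 35.19, 22.40],
    "Gemini 1.5 Pro":  [55.08, 96.47, 75.69, 62.59, 27.93, 17.54],
    "Med-Flamingo":    [35.66, 44.15, 61.39, 50.00, 26.33, 5.65],
    "CheXagent":       [30.61, 37.25, 33.95, 73.31, 28.52, 7.48],
    "BiomedGPT":       [33.34, 60.25, 46.81, 50.31, 14.13, 6.11],
    "LLaVA-Med":       [17.90, 5.49, 32.98, 38.76, 20.39, 5.37],
    "MiniGPT-v2":      [27.67, 3.25, 76.29, 50.09, 15.23, 8.05],
    "LLaVA-v1.6 (7B)": [24.96, 6.77, 80.70, 46.18, 3.57, 1.07],
    "LLaVA-v1 (7B)":   [19.30, 25.28, 40.53, 50.00, 0.34, 0.10],
}

RANDOM_CHANCE = [32.13, 25.00, 25.00, 50.00, 35.67, 36.48]


def rank_by(category, scores=SCORES):
    """Return model names sorted by accuracy in one category, best first."""
    idx = CATEGORIES.index(category)
    return sorted(scores, key=lambda model: scores[model][idx], reverse=True)


def below_chance(scores=SCORES, chance=RANDOM_CHANCE):
    """Map each model to the categories where it scores strictly below chance."""
    return {
        model: [cat for cat, acc, base in zip(CATEGORIES, values, chance)
                if acc < base]
        for model, values in scores.items()
    }


if __name__ == "__main__":
    print("Top 3 overall:", rank_by("Overall")[:3])
    for model, cats in below_chance().items():
        print(f"{model}: below chance in {cats}")
```

With these numbers, every listed model falls below the Random Chance row in both Condition/Finding and Position, which the raw table does not make obvious at a glance.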
🚨 To submit your results to the leaderboard, please send your result JSON files to this email address.
🚨 For more submission details, please refer to this link.