Accuracy of six MLLMs under the two evaluation settings. Proprietary models show both higher performance and larger gains in the MCQ setting: MCQ-style prompts boost GPT-4o's accuracy by ~15 points, while open-source models gain little, highlighting fundamental reasoning gaps.
| # | Model | Source | Web (Open-ended) | Office (Open-ended) | Poster (Open-ended) | Overall (Open-ended) | Web (Multiple-choice) | Office (Multiple-choice) | Poster (Multiple-choice) | Overall (Multiple-choice) |
|---|---|---|---:|---:|---:|---:|---:|---:|---:|---:|
| | **Proprietary Models** | | | | | | | | | |
| 1 | o1 (1217) | Link | 47.91 | 59.19 | 38.73 | 51.40 | 47.91 | 58.52 | 46.47 | 52.15 |
| 2 | GPT-4o (1120) | Link | 25.00 | 42.60 | 30.98 | 33.14 | 37.29 | 58.96 | 47.88 | 47.75 |
| | **Open-sourced Models** | | | | | | | | | |
| 3 | Qwen2.5-VL-7B | Link | 8.54 | 29.14 | 11.97 | 17.60 | 14.37 | 33.18 | 16.90 | 22.56 |
| 4 | LLaVA-NeXT-7B | Link | 10.20 | 21.97 | 7.04 | 14.70 | 11.45 | 25.33 | 5.63 | 16.47 |
| 5 | InternVL2.5-8B | Link | 7.70 | 24.21 | 4.92 | 14.23 | 9.37 | 23.54 | 11.97 | 15.63 |
| 6 | Phi-3.5-Vision-4B | Link | 6.87 | 24.43 | 7.04 | 14.23 | 1.66 | 8.52 | 0.00 | 4.30 |
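For readers who want to reproduce the two settings' accuracies from raw model outputs, here is a minimal sketch of how scoring could work. The answer-matching rules and field names (`prediction`, `gold_ids`, `gold_option`) are assumptions for illustration, not the benchmark's official evaluation script.

```python
# Minimal sketch of scoring under the two evaluation settings.
# Field names and matching rules are illustrative assumptions,
# not the benchmark's official evaluation script.
import re

def open_ended_correct(prediction: str, gold_ids: set[str]) -> bool:
    """Open-ended: assume the response must name the inconsistent element ID(s),
    and count it correct when every ID it mentions is in the gold set."""
    predicted_ids = set(re.findall(r"\d+", prediction))
    return bool(predicted_ids) and predicted_ids <= gold_ids

def mcq_correct(prediction: str, gold_option: str) -> bool:
    """Multiple-choice: assume the first standalone letter A-D is the chosen option."""
    match = re.search(r"\b([A-D])\b", prediction.upper())
    return match is not None and match.group(1) == gold_option.upper()

def accuracy(samples: list[dict], mode: str) -> float:
    """Fraction of samples answered correctly under the given setting."""
    correct = 0
    for s in samples:
        if mode == "open-ended":
            correct += open_ended_correct(s["prediction"], set(s["gold_ids"]))
        else:  # "multiple-choice"
            correct += mcq_correct(s["prediction"], s["gold_option"])
    return correct / len(samples)

# Tiny self-contained example with hypothetical records:
demo = [
    {"prediction": "Element 12 contradicts the page title.", "gold_ids": ["12"]},
    {"prediction": "Everything looks consistent to me.", "gold_ids": ["7"]},
]
print(f"Open-ended accuracy: {accuracy(demo, 'open-ended'):.2%}")  # 50.00%
```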
🚨 To submit your results to the leaderboard, please send your result JSON files to this email.
🚨 For more submission details, please refer to this link.
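For illustration only, the snippet below writes a result file in one plausible shape; the field names (`id`, `prediction`) are hypothetical, and the authoritative format is described in the submission guidelines linked above.

```python
# Hypothetical shape of a result JSON file; the field names here are
# assumptions for illustration. Follow the linked submission guidelines
# for the authoritative format.
import json

results = [
    {"id": "sample_0001", "prediction": "The price in the banner contradicts the product listing."},
    {"id": "sample_0002", "prediction": "B"},  # e.g. a multiple-choice run
]

with open("my_model_results.json", "w") as f:
    json.dump(results, f, indent=2)
```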
Fine-grained Error Analysis
- Performance Gap: Proprietary models excel at detecting factual contradictions and identity mismatches, but even top models like GPT-4o show limitations in resolving temporal/spatial incoherence.
- Modality Matters: Models handle text-text inconsistencies best but falter with image-image comparisons, exposing weaknesses in visual reasoning.
- Layout Complexity: Performance drops sharply as artifacts become visually dense—models lose up to 40% accuracy on cluttered layouts compared to simple ones.

Fine-grained analysis of model performance across Inconsistency Categories and Modalities.

Model performance on layout complexity.
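The breakdowns above can be derived from per-sample records. Below is a minimal sketch, assuming hypothetical annotation fields (`category`, `modalities`, `density`) and a boolean `correct` flag per sample; the actual annotation schema may differ.

```python
# Sketch of a fine-grained breakdown from per-sample records.
# Column names are illustrative assumptions about the annotation fields.
import pandas as pd

records = pd.DataFrame([
    # inconsistency category, modality pair, layout density, whether the model was correct
    {"category": "factual contradiction", "modalities": "text-text",   "density": "sparse", "correct": True},
    {"category": "identity mismatch",     "modalities": "image-image", "density": "dense",  "correct": False},
    {"category": "temporal incoherence",  "modalities": "text-image",  "density": "dense",  "correct": False},
])

# Accuracy per inconsistency category, per modality pair, and per layout density.
for dimension in ["category", "modalities", "density"]:
    print(records.groupby(dimension)["correct"].mean().rename("accuracy"), "\n")
```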
Prompting Strategies Analysis
To enhance multimodal inconsistency reasoning, we tested three prompting approaches and made the following observations:
- Chain-of-Thought (CoT): Explicit textual reasoning steps provided minimal benefits, sometimes reducing accuracy, especially for open-source models.
- Set-of-Mark (SoM): Visual bounding boxes improved GPT-4o’s performance (+5%) but confused other models, often degrading results.
- Multimodal Interleaved CoT (MM-CoT): Our novel two-stage method combined textual reasoning with iterative visual refinement.
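
As a rough illustration of the two-stage idea, the sketch below interleaves an initial textual reasoning pass with rounds of visual refinement. `query_model` and `draw_bounding_boxes` are hypothetical placeholders for an MLLM API call and an image-annotation helper; the prompts and the number of rounds are assumptions, not the exact MM-CoT recipe.

```python
# Schematic of a two-stage multimodal interleaved CoT loop.
# `query_model` and `draw_bounding_boxes` are hypothetical placeholders;
# prompts and refinement rounds are assumptions, not the exact MM-CoT recipe.
from typing import Callable

def mm_cot(image, layout_elements: dict[int, tuple[int, int, int, int]],
           query_model: Callable, draw_bounding_boxes: Callable,
           rounds: int = 2) -> str:
    # Stage 1: textual reasoning over the raw screenshot to nominate
    # candidate elements that might be inconsistent.
    reasoning = query_model(
        image=image,
        prompt="Think step by step: which element IDs look inconsistent "
               "with the rest of the page? List candidate IDs.",
    )
    # Rough substring match is enough for this sketch.
    candidates = [eid for eid in layout_elements if str(eid) in reasoning]

    # Stage 2: iterative visual refinement. Overlay boxes on the candidate
    # regions and ask the model to re-examine them with its earlier
    # reasoning as context, narrowing the candidate set each round.
    answer = reasoning
    for _ in range(rounds):
        annotated = draw_bounding_boxes(image, [layout_elements[c] for c in candidates])
        answer = query_model(
            image=annotated,
            prompt=f"Earlier reasoning: {answer}\n"
                   "The highlighted regions are candidates. Which element is "
                   "actually inconsistent? Answer with its ID.",
        )
        candidates = [c for c in candidates if str(c) in answer] or candidates
    return answer
```

In this sketch, the boxes drawn in stage 2 cover only elements nominated by the stage-1 reasoning, which mirrors the observation below that visual annotations only helped when guided by initial textual reasoning.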

Probing results of different prompting methods. Performance of each prompting method is directly compared with the vanilla setting. Gains are in blue and drops are in red.
MM-CoT outperformed all other methods, boosting GPT-4o's accuracy by 4.4% and showing modest gains for open-source models. Proprietary models benefited most from iterative cross-modal integration, while isolated prompts (CoT/SoM) proved ineffective. Visual annotations only helped when guided by initial textual reasoning, highlighting the need for tightly coupled multimodal interaction.