Multimodal Inconsistency Reasoning (MMIR)

Qianqi Yan1, Yue Fan1, Hongquan Li, Shan Jiang2, Yang Zhao2, Xinze Guan2, Ching-Chen Kuo2, Xin Eric Wang1
1University of California, Santa Cruz
2eBay

Introduction

We introduce MMIR, the first benchmark for evaluating Multimodal Large Language Models (MLLMs) on detecting and reasoning about inconsistencies in layout-rich multimodal content. MMIR features 534 challenging samples across five reasoning-heavy inconsistency categories:

  • Factual Contradiction: Direct conflict between two elements (text-text, text-image, or image-image) within the modified content.
  • Identity Misattribution: Mislabeling of entities (objects, locations, brands, etc.) that conflicts with other elements.
  • Contextual Mismatch: Tonal, thematic, or situational incompatibility between elements.
  • Quantitative Discrepancy: Numerical or statistical inconsistencies between elements.
  • Temporal/Spatial Incoherence: Implied timelines, dates, or spatial relationships that are impossible or conflicting.

We evaluate six state-of-the-art MLLMs, showing that models with dedicated multimodal reasoning capabilities, such as o1, substantially outperform their counterparts, while open-source models remain particularly vulnerable to inconsistency errors.

Detailed error analyses further show that models excel at detecting inconsistencies confined to a single modality, particularly text, but struggle with cross-modal conflicts and complex layouts. Probing experiments show that single-modality prompting, including Chain-of-Thought (CoT) and Set-of-Mark (SoM) methods, yields only marginal gains, pointing to a key bottleneck in cross-modal reasoning. Our findings highlight the need for advanced multimodal reasoning and point to future research on multimodal inconsistency.

MMIR Benchmark

Overview

The MMIR benchmark was meticulously constructed through a four-stage curation pipeline to ensure high-quality, diverse, and challenging test cases. We began by collecting 521 real-world artifacts — including webpages, presentations, and posters — from trusted sources like VisualWebArena and Zenodo. These artifacts were parsed to extract structured metadata, including element types, content, and spatial layouts.

To simulate realistic errors, we used advanced multimodal models to propose 2,534 synthetic inconsistencies across five predefined categories. These proposals underwent automated validation to ensure technical feasibility and alignment with error definitions. Approved edits were then programmatically applied to artifacts using tools like Chrome DevTools (for webpages) and Python libraries (for presentations).
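
As a concrete illustration of the programmatic-edit step for presentations, the sketch below swaps the text of one slide element with the python-pptx library. The edit-record fields (slide_index, element_id, new_text) are hypothetical placeholders, not the pipeline's actual schema.

    # Minimal sketch of the programmatic-edit step for presentations, using the
    # python-pptx library. The edit-record fields (slide_index, element_id,
    # new_text) are hypothetical placeholders, not the pipeline's actual schema.
    from pptx import Presentation

    def apply_text_edit(pptx_path, edit, out_path):
        """Replace the text of one shape on one slide and save a modified copy."""
        prs = Presentation(pptx_path)
        slide = prs.slides[edit["slide_index"]]
        for shape in slide.shapes:
            # shape_id is python-pptx's per-slide identifier for a shape
            if shape.shape_id == edit["element_id"] and shape.has_text_frame:
                shape.text_frame.text = edit["new_text"]  # inject the inconsistency
                break
        prs.save(out_path)

    if __name__ == "__main__":
        edit = {"slide_index": 2, "element_id": 7, "new_text": "Founded in 1999"}
        apply_text_edit("artifact.pptx", edit, "artifact_modified.pptx")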

Finally, human experts rigorously reviewed the modified samples, filtering out unrealistic cases and retaining 534 validated entries that balance complexity and real-world relevance. The resulting dataset spans diverse artifact types and error categories, with carefully designed evaluation prompts for both open-ended and multiple-choice settings.

Key Features

  • 534 carefully validated samples
  • Real-world artifacts: Webpages, Slides, Posters
  • Synthetic inconsistency injection
  • Multi-stage verification pipeline

Evaluation Settings

  • Open-ended: Models receive the artifact with a fixed prompt Q_open-ended and generate a free-form response that identifies the semantic mismatch.
  • Multiple-choice: Models receive the artifact with a combined prompt Q_MCQ = (Q_open-ended, C_i). Each candidate in C_i is a textual description of an element, and the model must select, from these options, the element(s) corresponding to the introduced inconsistency (a construction sketch follows below).
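
For concreteness, here is a minimal sketch of how the two prompt variants could be assembled; the wording and the lettered candidate format are illustrative assumptions, not the benchmark's released templates.

    # Illustrative construction of the two evaluation prompts. The wording and
    # the lettered candidate format are assumptions, not the released templates.
    Q_OPEN_ENDED = (
        "The artifact above contains one introduced semantic inconsistency. "
        "Identify the element(s) involved and explain the mismatch."
    )

    def build_mcq_prompt(candidates):
        """Q_MCQ = (Q_open-ended, C_i): append lettered element descriptions."""
        options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(candidates))
        return (
            f"{Q_OPEN_ENDED}\n\n"
            "Select the inconsistent element(s) from the candidates below:\n"
            f"{options}"
        )

    print(build_mcq_prompt([
        "Title text: 'Grand Opening - March 2021'",
        "Caption under hero image: 'Celebrating 10 years since 2015'",
        "Footer: 'Contact us for details'",
    ]))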

You can download the dataset from Hugging Face.
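
If you use the Hugging Face `datasets` library, loading should look roughly like the snippet below; "ORG/MMIR" is a placeholder repository ID, so substitute the actual ID from the dataset page, and the split name may differ.

    # Rough loading sketch with the Hugging Face `datasets` library.
    # "ORG/MMIR" is a placeholder repository ID; use the ID from the dataset page.
    from datasets import load_dataset

    dataset = load_dataset("ORG/MMIR")
    print(dataset)                      # available splits and their sizes
    split = list(dataset.keys())[0]     # split name may vary
    print(dataset[split][0].keys())     # fields of one sample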

Experiments and Analysis

Our comprehensive evaluation of six state-of-the-art multimodal models (including proprietary systems like o1 and GPT-4o, and open-source models like Qwen2.5-VL, LLaVA-NeXT, InternVL2.5, and Phi-3.5-Vision) reveals critical insights into multimodal inconsistency reasoning. Proprietary models significantly outperform open-source models, with o1 achieving over 50% overall accuracy and surpassing the best open-source model by more than 30 percentage points. While all models struggle with complex inconsistencies, proprietary systems show stronger alignment between visual and textual reasoning, particularly when provided with contextual cues.

Accuracy (%) of the six MLLMs under the two evaluation settings (OE = open-ended, MCQ = multiple-choice). Proprietary models demonstrate higher performance as well as a larger performance gain in the MCQ setting. While MCQ-style prompts boost GPT-4o's accuracy by roughly 15 percentage points, open-source models gain minimal benefit, highlighting fundamental reasoning gaps.

| # | Model | Source | OE Web | OE Office | OE Poster | OE Overall | MCQ Web | MCQ Office | MCQ Poster | MCQ Overall |
|---|-------|--------|--------|-----------|-----------|------------|---------|------------|------------|-------------|
| *Proprietary Models* | | | | | | | | | | |
| 1 | o1 (1217) | Link | 47.91 | 59.19 | 38.73 | 51.40 | 47.91 | 58.52 | 46.47 | 52.15 |
| 2 | GPT-4o (1120) | Link | 25.00 | 42.60 | 30.98 | 33.14 | 37.29 | 58.96 | 47.88 | 47.75 |
| *Open-sourced Models* | | | | | | | | | | |
| 3 | Qwen2.5-VL-7B | Link | 8.54 | 29.14 | 11.97 | 17.60 | 14.37 | 33.18 | 16.90 | 22.56 |
| 4 | LLaVA-NeXT-7B | Link | 10.20 | 21.97 | 7.04 | 14.70 | 11.45 | 25.33 | 5.63 | 16.47 |
| 5 | InternVL2.5-8B | Link | 7.70 | 24.21 | 4.92 | 14.23 | 9.37 | 23.54 | 11.97 | 15.63 |
| 6 | Phi-3.5-Vision-4B | Link | 6.87 | 24.43 | 7.04 | 14.23 | 1.66 | 8.52 | 0.00 | 4.30 |

🚨 To submit your results to the leaderboard, please send your result JSON files to this email.

🚨 For more submission details, please refer to this link.
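
As a rough illustration only, a result file could be produced along the lines of the snippet below; the field names here are hypothetical, and the authoritative schema is the one described at the submission link.

    # Hypothetical example of dumping predictions to a result JSON file.
    # The authoritative schema is the one described at the submission link;
    # the field names below are illustrative only.
    import json

    predictions = [
        {"sample_id": "web_0001", "setting": "open_ended",
         "prediction": "The banner year conflicts with the date in the footer."},
        {"sample_id": "web_0001", "setting": "mcq", "prediction": "B"},
    ]

    with open("results_my_model.json", "w", encoding="utf-8") as f:
        json.dump(predictions, f, indent=2, ensure_ascii=False)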

Fine-grained error analysis

  • Performance Gap: Proprietary models excel at detecting factual contradictions and identity mismatches, but even top models like GPT-4o show limitations in resolving temporal/spatial incoherence.
  • Modality Matters: Models handle text-text inconsistencies best but falter with image-image comparisons, exposing weaknesses in visual reasoning.
  • Layout Complexity: Performance drops sharply as artifacts become visually dense—models lose up to 40% accuracy on cluttered layouts compared to simple ones.

Prompting Strategies Analysis

To enhance multimodal inconsistency reasoning, we tested three prompting approaches and made the following observations:
  • Chain-of-Thought (CoT): Explicit textual reasoning steps provided minimal benefits, sometimes reducing accuracy, especially for open-source models.
  • Set-of-Mark (SoM): Visual bounding boxes improved GPT-4o’s performance (+5%) but confused other models, often degrading results.
  • Multimodal Interleaved CoT (MM-CoT): Our novel two-stage method combined textual reasoning with iterative visual refinement.

MM-CoT outperformed all other methods, boosting GPT-4o's accuracy by 4.4% and showing modest gains for open-source models. Proprietary models benefited most from iterative cross-modal integration, while isolated prompts (CoT/SoM) proved ineffective. Visual annotations only helped when guided by initial textual reasoning, highlighting the need for tightly coupled multimodal interaction.
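
The sketch below outlines a two-stage loop in the spirit of MM-CoT: a textual reasoning pass followed by a visually guided refinement pass over an annotated copy of the artifact. The helpers `query_mllm` and `draw_boxes`, and the prompts themselves, are hypothetical; the paper's exact procedure may differ.

    # Schematic two-stage loop in the spirit of MM-CoT: textual reasoning first,
    # then visually guided refinement on an annotated copy of the artifact.
    # `query_mllm` (model call) and `draw_boxes` (renders bounding boxes onto the
    # screenshot) are hypothetical helpers; the prompts are not the paper's.

    def mm_cot(image, element_descriptions, query_mllm, draw_boxes):
        # Stage 1: textual chain-of-thought over the unannotated artifact.
        listed = "\n".join(element_descriptions)
        reasoning = query_mllm(
            image,
            "Think step by step about which of these elements might be "
            f"mutually inconsistent:\n{listed}",
        )
        # Stage 2: mark the elements mentioned in the reasoning on the image and
        # ask the model to verify or revise its hypothesis. The substring check
        # is a crude way to pick suspects for this sketch.
        suspects = [d for d in element_descriptions if d in reasoning]
        annotated = draw_boxes(image, suspects)
        return query_mllm(
            annotated,
            f"Earlier reasoning:\n{reasoning}\n\n"
            "Re-examine the highlighted regions and state the final "
            "inconsistent element(s).",
        )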

BibTeX


      @misc{yan2025multimodalinconsistencyreasoningmmir,
        title={Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models},
        author={Qianqi Yan and Yue Fan and Hongquan Li and Shan Jiang and Yang Zhao and Xinze Guan and Ching-Chen Kuo and Xin Eric Wang},
        year={2025},
        eprint={2502.16033},
        archivePrefix={arXiv},
        primaryClass={cs.AI},
        url={https://arxiv.org/abs/2502.16033},
      }