">
Even when an instruction appears valid, it may silently conflict with the visual context. Implicit reasoning requires models to detect what is missing, ambiguous, contradictory, or infeasible, without being told.
Multimodal Large Language Models (MLLMs) are increasingly deployed in open-ended, real-world environments. Real instructions often involve missing objects, ambiguous references, contradictory facts, or infeasible tasks—situations that require implicit reasoning beyond simple execution. Existing benchmarks mostly assume that the visual input and instruction are perfectly aligned, overlooking cases where flaws must be inferred from context. This paper provides a systematic analysis of how current MLLMs handle implicit reasoning scenarios where the problem is “hidden in plain sight.”
We organize our analysis around three key research questions (RQs). We curate a diagnostic suite covering four real-world failure modes and evaluate six leading MLLMs, including o3 and GPT-4o, on 643 diverse samples.
Main Findings: Implicit reasoning scenarios fall into four categories, each posing distinct challenges.
iReason Statistics. Breakdown of the testbed by category. Please see the paper's Appendix for details of the data curation.
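For concreteness, here is a minimal sketch of how a sample in such a testbed might be represented. The field names and the `Category` values below are assumptions derived from the four failure modes described in the abstract, not the paper's actual schema.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical category labels mirroring the four failure modes
# described above (names are assumptions, not the paper's schema).
class Category(Enum):
    MISSING_OBJECT = "missing_object"
    AMBIGUOUS_REFERENCE = "ambiguous_reference"
    CONTRADICTORY_FACT = "contradictory_fact"
    INFEASIBLE_TASK = "infeasible_task"

@dataclass
class Sample:
    image_path: str      # visual context the instruction refers to
    instruction: str     # user instruction with the hidden flaw
    category: Category   # which failure mode the sample probes
    reference: str       # expected behavior, e.g. flagging the flaw
```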
The accuracy (%) of six MLLMs across the four categories. Proprietary models achieve higher performance. The best result in each question category is in bold, and the second best is underlined.
Model accuracy on explicit prompts (%). The best result in each question category is in bold, and the second best is underlined.
Answer-Reason accuracy gaps (%). Negative values (red) indicate that the model reasoned correctly but failed to carry this through to its final answer.
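As a rough illustration of how such a gap could be computed, assuming per-sample binary correctness labels for both the final answer and the reasoning trace (the helper below is hypothetical, not the paper's code):

```python
def answer_reason_gap(answer_correct: list[bool],
                      reason_correct: list[bool]) -> float:
    """Answer accuracy minus reasoning accuracy, in percentage points.

    A negative gap means the model's reasoning identified the hidden
    issue more often than its final answer reflected it.
    """
    assert len(answer_correct) == len(reason_correct)
    n = len(answer_correct)
    answer_acc = 100 * sum(answer_correct) / n
    reason_acc = 100 * sum(reason_correct) / n
    return answer_acc - reason_acc
```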
We report model behavior under two settings: IC-Free, where the model decides whether to ask a clarification question or give a direct answer, and IC-Force, where it must always ask a question. %Question indicates how often the model chooses to ask a question, and its accuracy reflects how often the question is relevant to the hidden issue. %Answer denotes the rate of answering directly without asking, with its accuracy measuring the correctness of those answers. The rightmost columns show the accuracy gain relative to each model's baseline performance on the implicit reasoning task.
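A minimal sketch of how the IC-Free metrics could be tallied from per-sample model behavior (the record fields are assumptions for illustration; the paper's exact bookkeeping may differ):

```python
def ic_free_metrics(records: list[dict]) -> dict:
    """Tally %Question / %Answer rates and their accuracies.

    Each record is assumed to carry:
      asked:   True if the model asked a clarification question
      correct: True if the question was relevant to the hidden issue
               (when asked) or the direct answer was correct (when not)
    """
    n = len(records)
    asked = [r for r in records if r["asked"]]
    answered = [r for r in records if not r["asked"]]
    pct = lambda xs: 100 * len(xs) / n
    acc = lambda xs: (100 * sum(r["correct"] for r in xs) / len(xs)) if xs else 0.0
    return {
        "%Question": pct(asked),
        "Question accuracy": acc(asked),
        "%Answer": pct(answered),
        "Answer accuracy": acc(answered),
    }
```

Under IC-Force, every record would have asked=True by construction, so %Question is trivially 100 and only the question accuracy is informative.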
@misc{yan2025hiddenplainsightprobing,
  title={Hidden in Plain Sight: Probing Implicit Reasoning in Multimodal Language Models},
  author={Qianqi Yan and Hongquan Li and Shan Jiang and Yang Zhao and Xinze Guan and Ching-Chen Kuo and Xin Eric Wang},
  year={2025},
  eprint={2506.00258},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2506.00258},
}