- Visual QA: Mantis-eval
- Visual summarization: MMDialog, CliConSummation
- Audio QA: MMAU
- Audio summarization: MISP
- Video QA: Video-MME
OmniTrace traces each generated token to candidate source tokens across text, image, audio, and video, then aggregates them into span-level source explanations.
Modern multimodal large language models generate fluent responses from interleaved text, image, audio, and video inputs, but it remains difficult to determine which specific inputs support each generated statement. Existing attribution methods are largely built for classification settings, fixed prediction targets, or single-modality architectures, and do not naturally extend to autoregressive, decoder-only multimodal generation.
OmniTrace addresses this by formalizing attribution as a generation-time tracing problem over the causal decoding process. Instead of producing explanations only after the final response is complete, OmniTrace follows generation token by token, converts arbitrary token-level signals such as attention or gradients into source assignments, and aggregates them into stable, semantically coherent span-level explanations.
The framework is designed to be lightweight, plug-and-play, and model-agnostic. It does not require retraining or supervision, and it supports heterogeneous source units across modalities, including text spans, image regions, audio intervals, and video intervals.
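A heterogeneous source unit like those described above can be sketched as a small record type; the class name and fields below are illustrative assumptions, not OmniTrace's actual data model:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class SourceUnit:
    """One attributable input unit (illustrative sketch, not OmniTrace's API)."""
    modality: str  # "text" | "image" | "audio" | "video"
    text_span: Optional[Tuple[int, int]] = None               # character offsets
    image_region: Optional[Tuple[int, int, int, int]] = None  # x, y, w, h in pixels
    time_interval: Optional[Tuple[float, float]] = None       # seconds, for audio/video

# One candidate unit per modality
units = [
    SourceUnit("text", text_span=(0, 42)),
    SourceUnit("image", image_region=(10, 10, 64, 64)),
    SourceUnit("audio", time_interval=(0.0, 2.0)),
    SourceUnit("video", time_interval=(3.5, 7.0)),
]
```

Representing every modality with one record type is what lets a single tracing loop assign attribution mass uniformly across text spans, image regions, and temporal intervals.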
Formulation
OmniTrace reframes attribution for open-ended multimodal generation as an online tracing problem over decoder-only architectures, rather than a fixed-target explanation problem.
Framework
The method accepts arbitrary token-level attribution scores, including attention-based and gradient-based signals, and converts them into concise span-level source explanations.
Evaluation
OmniTrace is evaluated on visual, audio, and video tasks across reasoning and summarization, showing more stable and interpretable grounding than post-hoc baselines.
Generation-time source tracing. During decoding, OmniTrace obtains a token-level attribution signal for each generated token and projects that signal onto candidate source units across modalities.
Span-level aggregation. Generated tokens are grouped into semantically coherent chunks such as phrases or sentences, reducing noisy token-level fluctuations and improving interpretability.
Confidence-aware source curation. OmniTrace uses POS-aware weighting, confidence shaping, run-level coherence, and minimum-mass filtering to select concise supporting sources for each generated span.
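The tracing-and-aggregation pipeline above can be sketched in a few lines; the function name, the source-unit labels, and the 0.1 minimum-mass threshold are illustrative assumptions, not OmniTrace's actual implementation:

```python
# Sketch of span-level aggregation: per-token attribution scores (e.g. attention
# mass over candidate source units) are pooled within each generated span,
# normalized, and pruned by a minimum-mass filter. All names are illustrative.

def aggregate_span_attribution(token_scores, span_slices, min_mass=0.1):
    """token_scores: one {source_id: score} map per generated token.
    span_slices: (start, end) token-index ranges, one per generated span.
    Returns, per span, the sources whose normalized mass is >= min_mass."""
    span_sources = []
    for start, end in span_slices:
        mass = {}
        for scores in token_scores[start:end]:
            for src, s in scores.items():
                mass[src] = mass.get(src, 0.0) + s
        total = sum(mass.values()) or 1.0
        kept = {src: m / total for src, m in mass.items() if m / total >= min_mass}
        span_sources.append(dict(sorted(kept.items(), key=lambda kv: -kv[1])))
    return span_sources

# Three generated tokens grouped into two spans, traced over three source units.
token_scores = [
    {"img:0": 0.6, "txt:q": 0.3, "aud:0-2s": 0.1},
    {"img:0": 0.7, "txt:q": 0.2, "aud:0-2s": 0.1},
    {"aud:0-2s": 0.8, "txt:q": 0.2},
]
spans = [(0, 2), (2, 3)]
print(aggregate_span_attribution(token_scores, spans))
```

Pooling over whole spans before filtering is what suppresses the noisy token-level fluctuations mentioned above: a source that fires strongly on a single function word but carries little mass over the span as a whole falls below the threshold and is dropped.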
Models:
- Qwen2.5-Omni-7B
- MiniCPM-o-4.5-9B
Attribution quality depends heavily on meaningful segmentation and multimodal context. High-quality ASR segmentation dramatically improves audio attribution, and combining visual and audio signals yields the best video attribution.
OmniTrace reveals interpretable attribution behavior: an early-position grounding bias and regime-dependent calibration effects in cross-modal attribution mass.
Removing any filtering component degrades performance.
Attribution quality and generation quality are related but not strictly monotonic, indicating that attribution reflects how models ground their outputs, not just whether the final answer is correct.
```shell
# Install from PyPI
pip install omnitrace

# Or install from source
git clone https://github.com/Jackie-2000/OmniTrace.git
cd OmniTrace
pip install -e .
```
```python
from omnitrace import OmniTracer

tracer = OmniTracer(model_name="Qwen/Qwen2.5-Omni-7B")

# Example: audio input
sample = {
    "prompt": "Answer the question based on the audio provided. Explain your reasoning step by step.",
    "question": [
        {"text": "What was the last sound in the sequence?\nA. footsteps\nB. dog_barking\nC. camera_shutter_clicking\nD. tapping_on_glass"},
        {"audio": "examples/media/b7701ab1-c37e-49f2-8ad9-7177fe0465e9.wav"},
    ],
}
result = tracer.trace(sample)
```
TBD