- Visual QA: Mantis-eval
- Visual summarization: MMDialog, CliConSummation
- Audio QA: MMAU
- Audio summarization: MISP
- Video QA: Video-MME
OmniTrace traces each generated token to candidate source tokens across text, image, audio, and video, then aggregates them into span-level source explanations.
Modern multimodal large language models generate fluent responses from interleaved text, image, audio, and video inputs, but it remains difficult to determine which specific inputs support each generated statement. Existing attribution methods are largely built for classification settings, fixed prediction targets, or single-modality architectures, and do not naturally extend to autoregressive, decoder-only multimodal generation.
OmniTrace addresses this by formalizing attribution as a generation-time tracing problem over the causal decoding process. Instead of producing explanations only after the final response is complete, OmniTrace follows generation token by token, converts arbitrary token-level signals such as attention or gradients into source assignments, and aggregates them into stable, semantically coherent span-level explanations.
The framework is designed to be lightweight, plug-and-play, and model-agnostic. It does not require retraining or supervision, and it supports heterogeneous source units across modalities, including text spans, image regions, audio intervals, and video intervals.
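A heterogeneous source unit like those described above can be sketched as a small record type; the class name and fields below are illustrative assumptions, not OmniTrace's actual data model:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class SourceUnit:
    """One attributable input unit (illustrative sketch, not OmniTrace's API)."""
    modality: str  # "text" | "image" | "audio" | "video"
    text_span: Optional[Tuple[int, int]] = None               # character offsets
    image_region: Optional[Tuple[int, int, int, int]] = None  # x, y, w, h in pixels
    time_interval: Optional[Tuple[float, float]] = None       # seconds, for audio/video

# One candidate unit per modality
units = [
    SourceUnit("text", text_span=(0, 42)),
    SourceUnit("image", image_region=(10, 10, 64, 64)),
    SourceUnit("audio", time_interval=(0.0, 2.0)),
    SourceUnit("video", time_interval=(3.5, 7.0)),
]
```

Representing every modality with one record type is what lets a single tracing loop assign attribution mass uniformly across text spans, image regions, and temporal intervals.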
Formulation
OmniTrace reframes attribution for open-ended multimodal generation as an online tracing problem over decoder-only architectures, rather than a fixed-target explanation problem.
Framework
The method accepts arbitrary token-level attribution scores, including attention-based and gradient-based signals, and converts them into concise span-level source explanations.
Evaluation
OmniTrace is evaluated on visual, audio, and video tasks across reasoning and summarization, showing more stable and interpretable grounding than post-hoc baselines.
Generation-time source tracing. During decoding, OmniTrace obtains a token-level attribution signal for each generated token and projects that signal onto candidate source units across modalities.
Span-level aggregation. Generated tokens are grouped into semantically coherent chunks such as phrases or sentences, reducing noisy token-level fluctuations and improving interpretability.
Confidence-aware source curation. OmniTrace uses POS-aware weighting, confidence shaping, run-level coherence, and minimum-mass filtering to select concise supporting sources for each generated span.
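The tracing-and-aggregation pipeline above can be sketched in a few lines; the function name, the source-unit labels, and the 0.1 minimum-mass threshold are illustrative assumptions, not OmniTrace's actual implementation:

```python
# Sketch of span-level aggregation: per-token attribution scores (e.g. attention
# mass over candidate source units) are pooled within each generated span,
# normalized, and pruned by a minimum-mass filter. All names are illustrative.

def aggregate_span_attribution(token_scores, span_slices, min_mass=0.1):
    """token_scores: one {source_id: score} map per generated token.
    span_slices: (start, end) token-index ranges, one per generated span.
    Returns, per span, the sources whose normalized mass is >= min_mass."""
    span_sources = []
    for start, end in span_slices:
        mass = {}
        for scores in token_scores[start:end]:
            for src, s in scores.items():
                mass[src] = mass.get(src, 0.0) + s
        total = sum(mass.values()) or 1.0
        kept = {src: m / total for src, m in mass.items() if m / total >= min_mass}
        span_sources.append(dict(sorted(kept.items(), key=lambda kv: -kv[1])))
    return span_sources

# Three generated tokens grouped into two spans, traced over three source units.
token_scores = [
    {"img:0": 0.6, "txt:q": 0.3, "aud:0-2s": 0.1},
    {"img:0": 0.7, "txt:q": 0.2, "aud:0-2s": 0.1},
    {"aud:0-2s": 0.8, "txt:q": 0.2},
]
spans = [(0, 2), (2, 3)]
print(aggregate_span_attribution(token_scores, spans))
```

Pooling over whole spans before filtering is what suppresses the noisy token-level fluctuations mentioned above: a source that fires strongly on a single function word but carries little mass over the span as a whole falls below the threshold and is dropped.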
Models:
- Qwen2.5-Omni-7B
- MiniCPM-o-4.5-9B
Attribution quality depends heavily on meaningful segmentation and multimodal context. High-quality ASR segmentation dramatically improves audio attribution, and combining visual and audio signals yields the best video attribution.
OmniTrace reveals interpretable attribution behavior: an early-position grounding bias and regime-dependent calibration effects in cross-modal attribution mass.
Removing any filtering component degrades performance.
Attribution quality and generation quality are related but not strictly monotonic, indicating that attribution reflects how models ground their outputs, not just whether the final answer is correct.
```shell
# Install from PyPI
pip install omnitrace

# Or install from source
git clone https://github.com/Jackie-2000/OmniTrace.git
cd OmniTrace
pip install -e .
```
```python
from omnitrace import OmniTracer

tracer = OmniTracer(model_name="Qwen/Qwen2.5-Omni-7B")

# Example: audio input
sample = {
    "prompt": "Answer the question based on the audio provided. Explain your reasoning step by step.",
    "question": [
        {"text": "What was the last sound in the sequence?\nA. footsteps\nB. dog_barking\nC. camera_shutter_clicking\nD. tapping_on_glass"},
        {"audio": "examples/media/b7701ab1-c37e-49f2-8ad9-7177fe0465e9.wav"},
    ],
}
result = tracer.trace(sample)
```
TBD