OmniTrace

A Unified Framework for Generation-Time Attribution in Omni-Modal LLMs

Qianqi Yan1, Yichen Guo1, Ching-Chen Kuo2, Shan Jiang2, Hang Yin2, Yang Zhao2, Xin Eric Wang1
1University of California, Santa Barbara
2eBay
Generation-time tracing · Omni-modal · Model-agnostic
OmniTrace teaser figure

OmniTrace traces each generated token to candidate source tokens across text, image, audio, and video, then aggregates them into span-level source explanations.

Overview

Modern multimodal large language models generate fluent responses from interleaved text, image, audio, and video inputs, but it remains difficult to determine which specific inputs support each generated statement. Existing attribution methods are largely built for classification settings, fixed prediction targets, or single-modality architectures, and do not naturally extend to autoregressive, decoder-only multimodal generation.

OmniTrace addresses this by formalizing attribution as a generation-time tracing problem over the causal decoding process. Instead of producing explanations only after the final response is complete, OmniTrace follows generation token by token, converts arbitrary token-level signals such as attention or gradients into source assignments, and aggregates them into stable, semantically coherent span-level explanations.

The framework is designed to be lightweight, plug-and-play, and model-agnostic. It does not require retraining or supervision, and it supports heterogeneous source units across modalities, including text spans, image regions, audio intervals, and video intervals.
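One way to picture how text spans, image regions, and audio/video intervals can all serve as attribution targets in a single framework is a uniform record that maps each unit back to a contiguous range in the flattened input token stream. The sketch below uses invented field names for illustration; it is not OmniTrace's actual data model.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class SourceUnit:
    """One candidate attribution target; field names are illustrative only."""
    modality: str                 # "text", "image", "audio", or "video"
    token_range: Tuple[int, int]  # half-open range in the flattened input token stream
    locator: str                  # human-readable handle: text span, region id, or time interval

units = [
    SourceUnit("text",  (0, 15),   "question + options"),
    SourceUnit("audio", (15, 415), "0.0s-4.0s"),
]

# Because every unit owns a token range, any token-level signal
# (attention, gradients) can be pooled per unit regardless of modality.
for u in units:
    print(u.modality, u.token_range)
```

This is what makes the framework signal-agnostic: the pooling step only needs token ranges, not modality-specific machinery.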

Key Contributions

Formulation

Attribution as generation-time tracing

OmniTrace reframes attribution for open-ended multimodal generation as an online tracing problem over decoder-only architectures, rather than a fixed-target explanation problem.

Framework

Signal-agnostic and span-level

The method accepts arbitrary token-level attribution scores, including attention-based and gradient-based signals, and converts them into concise span-level source explanations.

Evaluation

Works across text, image, audio, and video

OmniTrace is evaluated on visual, audio, and video tasks across reasoning and summarization, showing more stable and interpretable grounding than post-hoc baselines.

How OmniTrace Works


Pipeline diagram: input sources → model generation.

1. Generation-time source tracing. During decoding, OmniTrace obtains a token-level attribution signal for each generated token and projects that signal onto candidate source units across modalities.

2. Span-level aggregation. Generated tokens are grouped into semantically coherent chunks such as phrases or sentences, reducing noisy token-level fluctuations and improving interpretability.

3. Confidence-aware source curation. OmniTrace uses POS-aware weighting, confidence shaping, run-level coherence, and minimum-mass filtering to select concise supporting sources for each generated span.

Design goals: generation-aware, omni-modal, span-level, and model-agnostic.
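The three steps can be sketched numerically. The toy below assumes attention weights as the token-level signal and uses an illustrative threshold; the real curation pipeline (POS weighting, confidence shaping, run-level coherence) is richer than this minimum-mass filter.

```python
def project(attn_row, unit_slices):
    """Step 1: pool one generated token's attention over input tokens
    into a single score per candidate source unit."""
    return [sum(attn_row[lo:hi]) for lo, hi in unit_slices]

def aggregate(per_token_scores):
    """Step 2: average token-level source scores across a generated span
    (e.g. a sentence) to smooth token-level noise."""
    n = len(per_token_scores)
    return [sum(col) / n for col in zip(*per_token_scores)]

def curate(span_scores, min_mass=0.25):
    """Step 3 (simplified): keep only sources whose normalized mass
    clears a minimum threshold."""
    total = sum(span_scores)
    return [i for i, s in enumerate(span_scores) if s / total >= min_mass]

# Toy example: 3 generated tokens attending over 6 input tokens,
# grouped into 3 source units (e.g. text tokens, image patches, audio frames).
unit_slices = [(0, 2), (2, 4), (4, 6)]
attn = [
    [0.05, 0.05, 0.40, 0.30, 0.10, 0.10],
    [0.10, 0.10, 0.35, 0.25, 0.10, 0.10],
    [0.05, 0.05, 0.45, 0.25, 0.10, 0.10],
]
per_token = [project(row, unit_slices) for row in attn]
span_scores = aggregate(per_token)
print(curate(span_scores))  # → [1]: the middle unit carries most of the mass
```

Averaging before filtering is what makes the output span-level: a single noisy token cannot by itself promote or demote a source.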

Evaluation Setup

  • 759 total examples
  • 6 datasets
  • 4 modalities
  • 2 omni-modal base models

Tasks and datasets

  • Visual QA: Mantis-eval
  • Visual summarization: MMDialog, CliConSummation
  • Audio QA: MMAU
  • Audio summarization: MISP
  • Video QA: Video-MME

Models and metrics

  • Base models: Qwen2.5-Omni-7B and MiniCPM-o-4.5-9B
  • Visual-text tasks: span-level F1
  • Audio/video tasks: Time-F1 over 1-second bins
  • Human validation: 26.6% of examples manually annotated, with 88.17% agreement with the automatic labels
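One plausible reading of Time-F1 over 1-second bins is the F1 between the sets of bins covered by predicted and gold time intervals; the paper's exact definition may differ in edge-case handling.

```python
def to_bins(intervals, bin_size=1.0):
    """Discretize (start, end) second intervals into the bin indices they cover."""
    bins = set()
    for start, end in intervals:
        b = int(start // bin_size)
        while b * bin_size < end:
            bins.add(b)
            b += 1
    return bins

def time_f1(pred_intervals, gold_intervals):
    """F1 between predicted and gold 1-second bins."""
    pred, gold = to_bins(pred_intervals), to_bins(gold_intervals)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Predicted 2s-5s vs gold 3s-6s: bins {2,3,4} vs {3,4,5}
print(time_f1([(2.0, 5.0)], [(3.0, 6.0)]))  # → 0.666...
```

Binning makes the metric tolerant to sub-second boundary disagreements while still penalizing attributions that point to the wrong region of the audio or video.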

Main Results

Main results

Qwen2.5-Omni-7B

  • OTAttMean reaches 75.66 text F1 and 76.59 image F1 on visual summarization.
  • On audio tasks, OTAttMean reaches 83.12 Time-F1 for summarization and 49.90 for audio QA.
  • On video QA, OTAttMean reaches 40.16 Time-F1.
  • Across tasks, OmniTrace variants substantially outperform self-attribution and embedding-based baselines.

MiniCPM-o-4.5-9B

  • OTRawAtt gives the strongest visual and audio results for MiniCPM in the paper.
  • It reaches 37.32 text F1 and 76.46 image F1 on visual summarization.
  • It reaches 49.21 Time-F1 on audio summarization and 41.06 on audio QA.
  • These gains show that the core benefit comes from generation-aware tracing and span-level aggregation, not from one specific scoring function.

Takeaway. Across both models, generation-time span-level attribution consistently produces more stable and semantically coherent explanations than naive self-attribution and embedding-based heuristics.

ASR and modality ablations

Attribution quality depends heavily on meaningful segmentation and multimodal context. High-quality ASR segmentation dramatically improves audio attribution, and combining visual and audio signals yields the best video attribution.

Positional and modality attribution bias

OmniTrace reveals interpretable attribution behavior: an early-position grounding bias and regime-dependent calibration effects in cross-modal attribution mass.

Ablations and Analysis

Source curation matters

  • Removing any filtering component degrades performance.
  • Image attribution is especially sensitive: without the curation pipeline, image F1 drops from 76.59 to around 20.
  • POS-aware weighting, confidence shaping, run-level coherence, and minimum-mass filtering are all important for stable cross-modal grounding.

Attribution is not just output quality

  • Attribution quality is only partially correlated with generation quality.
  • In some QA settings, incorrect answers can still attend to relevant evidence.
  • This suggests attribution measures grounding behavior rather than simply mirroring final-answer correctness.

Quick Start

Install from PyPI

pip install omnitrace

Install from source

git clone https://github.com/Jackie-2000/OmniTrace.git
cd OmniTrace
pip install -e .

Core usage pattern

from omnitrace import OmniTracer

tracer = OmniTracer(model_name="Qwen/Qwen2.5-Omni-7B")

# Example: audio input
sample = {
  "prompt": "Answer the question based on the audio provided. Explain your reasoning step by step.",
  "question": [
    {"text": "What was the last sound in the sequence?\nA. footsteps\nB. dog_barking\nC. camera_shutter_clicking\nD. tapping_on_glass"},
    {"audio": "examples/media/b7701ab1-c37e-49f2-8ad9-7177fe0465e9.wav"}
  ],
}

result = tracer.trace(sample)

BibTeX

TBD