Question-Aware Gaussian Experts for Audio-Visual Question Answering

Sungkyunkwan University · Purdue University

Comparison with Other Temporal Sampling Methods in AVQA

This video presents the core idea of Question-Aware Temporal Integration Gaussian Experts for Reasoning (QA-TIGER), our novel framework for Audio-Visual Question Answering (AVQA). It highlights key differences between QA-TIGER and traditional temporal sampling methods in AVQA. Unlike existing approaches that rely on discrete frame selection methods—such as uniform sampling or Top-K selection—QA-TIGER explicitly integrates question information and models continuous temporal dynamics using a Gaussian-based weighting mechanism. By leveraging a Mixture of Experts (MoE), it adaptively captures question-relevant audio-visual cues and improves temporal alignment. The comparison demonstrates how QA-TIGER outperforms traditional uniform sampling and Top-K frame selection, leading to more accurate and context-aware predictions.
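To make the contrast concrete, the sketch below implements all three strategies in a few lines of PyTorch. The segment count, feature dimension, and relevance scores are illustrative placeholders, and the Gaussian parameters stand in for values a model would predict from the question; none of this is QA-TIGER's released code.

```python
# Minimal sketch contrasting the three temporal strategies discussed above.
import torch

T, D = 60, 512                        # video segments, feature dim (assumed)
feats = torch.randn(T, D)             # per-segment (audio or visual) features
relevance = torch.rand(T)             # hypothetical question-relevance scores

# (1) Uniform sampling: keep every k-th segment, blind to the question.
k = 6
uniform = feats[::k]                              # (T // k, D)

# (2) Top-K selection: keep the K most relevant segments; the hard,
# discrete cut discards fine-grained temporal context between picks.
topk_idx = relevance.topk(10).indices.sort().values
topk = feats[topk_idx]                            # (10, D)

# (3) Gaussian weighting (QA-TIGER's idea): a continuous, differentiable
# weight over ALL segments, centered where the question points.
t = torch.linspace(0, 1, T)
mu, sigma = 0.4, 0.1                              # would be question-predicted
w = torch.exp(-0.5 * ((t - mu) / sigma) ** 2)
w = w / w.sum()                                   # normalize to a distribution
gaussian = (w.unsqueeze(-1) * feats).sum(0)       # (D,) weighted summary
```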

Abstract

Audio-Visual Question Answering (AVQA) requires not only question-based multimodal reasoning but also precise temporal grounding to capture subtle dynamics for accurate prediction. However, existing methods mainly use question information implicitly, limiting focus on question-specific details. Furthermore, most studies rely on uniform frame sampling, which can miss key question-relevant frames. Although recent Top-K frame selection methods aim to address this, their discrete nature still overlooks fine-grained temporal details. This paper proposes QA-TIGER, a novel framework that explicitly incorporates question information and models continuous temporal dynamics. Our key idea is to use Gaussian-based modeling to adaptively focus on both consecutive and non-consecutive frames based on the question, while explicitly injecting question information and applying progressive refinement. We leverage a Mixture of Experts (MoE) to flexibly implement multiple Gaussian models, activating temporal experts specifically tailored to the question. Extensive experiments on multiple AVQA benchmarks show that QA-TIGER consistently achieves state-of-the-art performance.

Overview


Overview of Question-Aware Temporal Integration Gaussian Experts for Reasoning

QA-TIGER first extracts visual features from video segments with a pretrained CLIP image encoder and audio features with a VGGish model, while the input question is encoded with the CLIP text encoder. It then embeds the question context into both modalities through a question-aware fusion module that applies self- and cross-attention, generating enriched, query-aligned representations. Next, a temporal integration module leverages a Mixture-of-Experts (MoE) framework in which multiple Gaussian distributions are generated to capture both consecutive and non-consecutive temporal dependencies; adaptive weights, computed via a routing mechanism, let the model emphasize the segments most relevant to the query. Finally, the fused audio-visual features are combined by a question-guided reasoning module and passed through a linear classifier with softmax to produce the final answer prediction.
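As a simplified illustration of the temporal integration step, the PyTorch sketch below routes a pooled question embedding to several Gaussian experts and mixes their temporal curves. The expert count, layer sizes, and the sigmoid/softplus parameterization of each expert's center and width are assumptions made for the sketch, not the released implementation.

```python
# Illustrative approximation of question-routed Gaussian temporal experts.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianExpertMoE(nn.Module):
    def __init__(self, dim=512, n_experts=8):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)        # mixture weights
        self.mu_head = nn.Linear(dim, n_experts)       # Gaussian centers
        self.sigma_head = nn.Linear(dim, n_experts)    # Gaussian widths

    def forward(self, feats, q):
        # feats: (B, T, D) question-aware segment features
        # q:     (B, D)    pooled question embedding
        B, T, _ = feats.shape
        t = torch.linspace(0, 1, T, device=feats.device)     # (T,)
        mu = torch.sigmoid(self.mu_head(q))                  # (B, E) in [0, 1]
        sigma = F.softplus(self.sigma_head(q)) + 1e-3        # (B, E) > 0
        gate = F.softmax(self.router(q), dim=-1)             # (B, E)

        # Each expert produces one Gaussian curve over the T segments.
        curves = torch.exp(-0.5 * ((t.view(1, 1, T) - mu.unsqueeze(-1))
                                   / sigma.unsqueeze(-1)) ** 2)  # (B, E, T)
        curves = curves / curves.sum(-1, keepdim=True)

        # Router-weighted mixture -> one continuous temporal weighting.
        w = (gate.unsqueeze(-1) * curves).sum(1)             # (B, T)
        return (w.unsqueeze(-1) * feats).sum(1)              # (B, D)

moe = GaussianExpertMoE()
pooled = moe(torch.randn(2, 60, 512), torch.randn(2, 512))   # (2, 512)
```

Because each expert contributes one smooth curve, the mixture can peak on several disjoint spans at once, covering non-consecutive question-relevant segments without a hard, discrete selection.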

QA-TIGER Quantitative Results

To evaluate our model’s effectiveness, we compared QA-TIGER with existing AVQA methods on Audio (A-QA), Visual (V-QA), and Audio-Visual (AV-QA) questions, considering both per-type accuracy and overall averages. QA-TIGER was trained on the MUSIC-AVQA training set, validated on its validation set, and tested on both MUSIC-AVQA and MUSIC-AVQA-R. For MUSIC-AVQA-v2.0, QA-TIGER was trained on the balanced and biased training sets and tested on the corresponding test sets.


Method  A-Count  A-Comp  A-Avg  V-Count  V-Local  V-Avg  AV-Exist  AV-Count  AV-Local  AV-Comp  AV-Temp  AV-Avg  Overall
FCNLSTM 70.45 66.22 68.88 63.89 46.74 55.21 82.01 59.34 46.28 62.15 47.33 60.06 60.34
BiLSTM 70.35 47.92 62.05 64.64 64.33 64.48 78.39 56.91 45.85 53.09 49.76 57.10 59.92
HCAttn 70.25 54.91 64.57 64.05 66.37 65.22 79.10 59.97 49.51 55.25 56.43 60.19 62.30
MCAN 77.50 55.24 69.25 71.56 70.93 71.24 80.40 64.91 54.48 57.22 47.57 61.58 65.49
PSAC 75.64 66.06 72.09 68.64 69.79 69.22 77.59 63.42 55.02 61.17 59.47 63.52 66.54
HME 74.76 63.56 70.61 67.97 69.46 68.76 80.30 63.19 53.18 62.69 59.83 64.05 66.45
HCRN 68.59 50.92 62.05 64.39 61.81 63.08 54.47 53.38 41.53 52.11 47.69 50.26 55.73
AVSD 72.41 61.90 68.52 67.39 74.19 70.83 81.61 63.89 58.79 61.52 61.41 65.49 67.44
Pano-AVQA 74.36 64.56 70.73 69.39 75.65 72.56 81.21 64.91 59.33 64.22 63.23 66.64 68.93
ST-AVQA 78.18 67.05 74.06 71.56 76.38 74.00 81.81 70.80 64.51 66.01 63.23 69.54 71.52
COCA 79.35 67.68 75.42 75.10 75.43 75.23 83.50 66.63 69.72 64.12 65.57 69.96 72.33
PSTP-Net 73.97 65.59 70.91 77.15 77.36 77.26 76.18 72.23 71.80 71.79 69.00 72.57 73.52
LAVISH 82.09 65.56 75.97 78.98 81.43 80.22 81.71 75.51 66.13 63.77 67.96 71.26 74.46
APL 82.40 70.71 78.09 76.52 82.74 79.69 82.99 73.29 66.68 64.76 65.95 70.96 74.53
TSPM 84.07 64.65 76.91 82.29 84.90 83.61 82.19 76.21 71.85 65.76 71.17 73.51 76.79
QA-TIGER 84.86 67.85 78.58 83.96 86.29 85.14 83.10 78.58 72.50 63.94 69.59 73.74 77.62

MUSIC-AVQA

QA-TIGER achieves the highest performance among all models, with an overall accuracy of 77.62%, surpassing the previous best method, TSPM (76.79%). Notably, our model is strong on complex reasoning tasks, reaching 78.58% on AV-Counting and 72.50% on AV-Localization, above all prior methods in these categories.



Method  A-Count (H / T)  A-Comp (H / T)  V-Count (H / T)  V-Local (H / T)  AV-Exist (H / T)  AV-Count (H / T)  AV-Local (H / T)  AV-Comp (H / T)  AV-Temp (H / T)  Avg
H and T denote the head (frequent) and tail (rare) question splits of MUSIC-AVQA-R.
FCNLSTM 66.23 36.48 64.78 51.24 61.75 5.31 54.86 51.06 64.76 78.52 62.69 7.23 46.66 57.30 43.13 71.67 37.02 30.78 54.12
BiLSTM 73.68 46.32 21.51 77.58 64.30 0.00 53.92 42.01 87.51 21.14 62.85 2.18 35.16 43.75 27.61 74.38 17.58 31.32 48.84
HCAttn 61.67 41.63 59.09 47.14 56.52 9.20 67.01 53.16 66.57 61.13 59.53 12.48 37.05 42.48 48.81 60.12 33.82 39.26 51.90
MCAN 75.02 60.16 58.89 50.09 64.58 26.69 66.48 62.25 51.29 67.29 64.76 25.28 46.11 61.61 50.57 52.40 34.64 58.05 57.27
PSAC 53.01 56.68 57.41 48.12 49.55 26.43 72.96 60.69 50.56 55.54 56.70 19.58 41.98 52.30 38.13 58.92 26.68 46.24 50.45
HME 62.60 53.95 54.97 58.29 50.95 16.46 73.25 58.60 65.74 66.49 63.18 17.18 33.79 46.03 53.20 69.57 33.95 41.57 53.66
HCRN 55.53 53.31 47.17 32.44 41.87 23.55 39.40 51.27 41.81 65.45 54.58 19.57 36.62 42.72 33.33 36.87 40.47 44.13 43.92
AVSD 54.00 47.84 60.61 47.79 60.34 10.07 74.78 61.43 66.28 61.98 46.21 8.06 33.00 40.35 51.98 66.00 40.14 41.52 52.33
Pano-AVQA 50.57 43.45 50.78 44.93 47.28 15.50 67.19 65.51 52.37 22.04 52.21 21.52 44.35 61.69 45.61 40.49 35.00 49.33 47.40
ST-AVQA 56.40 41.48 62.28 57.59 59.86 12.94 63.31 54.00 73.35 77.26 48.31 8.41 35.35 40.49 53.30 62.44 40.25 38.15 52.80
LAVISH 61.73 43.99 65.06 60.38 65.53 11.13 70.21 64.73 77.83 79.46 49.88 14.87 41.76 41.20 59.26 65.10 41.84 46.26 57.63
TSPM 81.65 71.80 67.66 49.56 78.29 47.56 80.58 73.18 69.15 82.79 77.09 38.64 42.24 57.37 52.07 68.86 39.23 49.36 66.30
QA-TIGER 82.67 75.82 71.75 43.11 81.30 54.59 84.76 75.59 72.84 78.56 76.70 33.55 48.22 64.65 37.55 80.47 36.85 62.96 67.99

MUSIC-AVQA-R

Our method achieves 67.99% overall accuracy on the MUSIC-AVQA-R dataset across diverse question types, without the need for explicit bias handling techniques. This balanced performance highlights QA-TIGER’s strong temporal modeling and question-aware feature extraction capabilities.



Test Training Method A-QA V-QA AV-QA Avg
(a) Bias Bias ST-AVQA 76.86 77.70 69.59 73.07
LAVISH 76.73 80.96 70.80 74.59
QA-TIGER 79.13 84.83 72.37 76.93
Balance ST-AVQA 76.18 77.20 67.96 71.92
LAVISH 75.56 80.83 69.27 73.51
LAST 77.10 82.99 70.86 75.24
LAST-Att 77.29 83.47 71.05 75.45
QA-TIGER 77.07 85.93 71.20 76.57
(b) Balance Bias ST-AVQA 73.34 76.82 64.51 69.40
LAVISH 73.14 79.70 65.01 70.39
QA-TIGER 77.57 84.84 67.43 73.91
Balance ST-AVQA 75.50 77.67 66.32 71.02
LAVISH 76.15 81.32 68.28 73.18
LAST 78.08 83.29 69.72 74.85
LAST-Att 78.56 84.07 70.30 75.44
QA-TIGER 79.90 86.95 70.22 76.43

MUSIC-AVQA-v2.0

QA-TIGER consistently outperforms existing models such as ST-AVQA, LAVISH, and LAST(-Att) on the biased test set, regardless of the training set type (Table a). On the balanced test set (Table b), QA-TIGER trained on the balanced dataset achieves superior accuracy in both A-QA (79.90%) and V-QA (86.95%), surpassing LAST-Att in overall accuracy (76.43% vs. 75.44%). Notably, LAST-Att underperforms in A-QA despite incorporating the Audio Spectrogram Transformer as an additional audio encoder. This highlights QA-TIGER’s robustness across diverse evaluation settings.



QA-TIGER Qualitative Results

To evaluate whether QA-TIGER accurately identifies question-relevant temporal segments, we qualitatively analyzed its behavior on the MUSIC-AVQA benchmark. Notably, the audio Gaussian aligns more closely with A-QA, while the visual Gaussian is more prominent for V-QA; for AV-QA, both modalities exhibit similar Gaussian distributions. This demonstrates the model’s adaptive focus based on question type. A detailed comparison against uniform sampling and Top-K selection is provided in the supplementary material.


(a) Audio Question

When comparing the tuba and the clarinet, QA-TIGER assigns high audio Gaussian weights across the entire duration, capturing the complete temporal spans necessary for an accurate comparison, while the visual Gaussian focuses on frames where the instruments are actively played.


(b) Visual Question

The method effectively integrates both modalities to count the saxophones accurately. The visual Gaussian assigns higher weights to frames where all five saxophonists are clearly visible from the front, while frames with less distinct views receive lower weights. Meanwhile, the audio Gaussian emphasizes moments when the sounds of all five saxophones overlap, ensuring that the model concentrates on the most critical visual and auditory cues.


(c) Audio-Visual Question

The visual Gaussian highlights frames in the early part of the video where individual instruments appear as well as those where all three instruments are shown together, and the audio Gaussian emphasizes the prominent sounds of the instruments. This complementary alignment enables QA-TIGER to effectively integrate visual and auditory cues, resulting in the accurate identification of all instruments.

Question-Aware Attention Visualization

To validate the effectiveness of our question-aware fusion module, we performed word-level visualizations. Additional supporting examples are provided in the supplementary material.


Audio-Visual Temporal Question

The question-aware fusion module focuses on “instrument,” “right,” “left,” and “louder” to identify spatial locations in the visual modality, while it emphasizes “louder” to analyze sound intensity in the audio modality. This complementary attention enables the model to tackle the question effectively.
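As a rough illustration of how such a word-level map can be rendered, the sketch below draws per-token attention as a heatmap; the weights are random placeholders standing in for the fusion module’s scores, and the question is paraphrased from the example above.

```python
# Sketch of a word-level attention heatmap (placeholder weights, not outputs).
import matplotlib.pyplot as plt
import torch

tokens = ["Is", "the", "instrument", "on", "the", "right", "louder",
          "than", "the", "instrument", "on", "the", "left", "?"]
attn = torch.rand(2, len(tokens)).softmax(-1)     # rows: visual, audio

fig, ax = plt.subplots(figsize=(8, 1.8))
ax.imshow(attn.numpy(), aspect="auto", cmap="viridis")
ax.set_yticks([0, 1], labels=["visual", "audio"])
ax.set_xticks(range(len(tokens)), labels=tokens, rotation=45, ha="right")
fig.tight_layout()
plt.savefig("qa_attention.png")
```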

BibTeX

Will be updated