Audio-Visual Question Answering (AVQA) requires not only question-based multimodal reasoning but also precise temporal grounding to capture subtle dynamics for accurate prediction. However, existing methods mainly use question information implicitly, limiting focus on question-specific details. Furthermore, most studies rely on uniform frame sampling, which can miss key question-relevant frames. Although recent Top-K frame selection methods aim to address this, their discrete nature still overlooks fine-grained temporal details. This paper proposes QA-TIGER, a novel framework that explicitly incorporates question information and models continuous temporal dynamics. Our key idea is to use Gaussian-based modeling to adaptively focus on both consecutive and non-consecutive frames based on the question, while explicitly injecting question information and applying progressive refinement. We leverage a Mixture of Experts (MoE) to flexibly implement multiple Gaussian models, activating temporal experts specifically tailored to the question. Extensive experiments on multiple AVQA benchmarks show that QA-TIGER consistently achieves state-of-the-art performance.
QA-TIGER first extracts visual features from video segments using a pretrained CLIP image encoder and audio features via a VGGish model, while the input question is encoded with the CLIP text encoder. It then embeds the question context into both modalities through a question-aware fusion module that applies self- and cross-attention, generating enriched, query-aligned representations. Next, a temporal integration module leverages a Mixture-of-Experts (MoE) framework in which multiple Gaussian distributions are generated to capture both consecutive and non-consecutive temporal dependencies; adaptive weights, computed via a routing mechanism, allow the model to emphasize the segments most relevant to the query. Finally, the fused audio-visual features are combined using a question-guided reasoning module and passed through a linear classifier with softmax to produce the final answer prediction.
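As a rough, minimal sketch of how such a question-conditioned Gaussian mixture over temporal segments could be realized, the PyTorch snippet below has a router derive expert mixing weights and per-expert Gaussian parameters (center and width) from the question embedding, then re-weights the segment features accordingly; the class name `GaussianTemporalMoE`, the number of experts, and the router layout are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch of Gaussian-expert temporal integration (assumed design,
# not the official QA-TIGER code): a router maps the question embedding to
# expert mixing weights and per-expert Gaussian (mu, sigma) over normalized time.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianTemporalMoE(nn.Module):
    def __init__(self, dim: int, n_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)           # question -> expert mixing weights
        self.gauss_params = nn.Linear(dim, 2 * n_experts)  # question -> (mu, sigma) per expert

    def forward(self, feats: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        """feats: (B, T, dim) per-segment audio or visual features; q: (B, dim) question embedding."""
        B, T, _ = feats.shape
        mix = F.softmax(self.router(q), dim=-1)                   # (B, E) expert weights
        mu, raw_sigma = self.gauss_params(q).chunk(2, dim=-1)     # (B, E) each
        mu = torch.sigmoid(mu)                                    # centers in [0, 1]
        sigma = F.softplus(raw_sigma) + 1e-4                      # positive widths
        t = torch.linspace(0, 1, T, device=feats.device)          # normalized segment times
        # Per-expert Gaussian weights over the T segments: (B, E, T)
        w = torch.exp(-0.5 * ((t[None, None, :] - mu[..., None]) / sigma[..., None]) ** 2)
        w = w / (w.sum(dim=-1, keepdim=True) + 1e-6)              # normalize over time
        w = (mix[..., None] * w).sum(dim=1)                       # mixture -> (B, T)
        # Temporally re-weight and aggregate the segment features.
        return (w[..., None] * feats).sum(dim=1)                  # (B, dim)
```

Under this reading, each expert contributes one soft, continuous window over time, so the mixture can cover a single contiguous span or several disjoint ones, depending on the question.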
Column prefixes A-, V-, and AV- denote the Audio, Visual, and Audio-Visual QA question groups.

Method | A-Count | A-Comp | A-Avg | V-Count | V-Local | V-Avg | AV-Exist | AV-Count | AV-Local | AV-Comp | AV-Temp | AV-Avg | Avg
---|---|---|---|---|---|---|---|---|---|---|---|---|---
FCNLSTM | 70.45 | 66.22 | 68.88 | 63.89 | 46.74 | 55.21 | 82.01 | 59.34 | 46.28 | 62.15 | 47.33 | 60.06 | 60.34 |
BiLSTM | 70.35 | 47.92 | 62.05 | 64.64 | 64.33 | 64.48 | 78.39 | 56.91 | 45.85 | 53.09 | 49.76 | 57.10 | 59.92 |
HCAttn | 70.25 | 54.91 | 64.57 | 64.05 | 66.37 | 65.22 | 79.10 | 59.97 | 49.51 | 55.25 | 56.43 | 60.19 | 62.30 |
MCAN | 77.50 | 55.24 | 69.25 | 71.56 | 70.93 | 71.24 | 80.40 | 64.91 | 54.48 | 57.22 | 47.57 | 61.58 | 65.49 |
PSAC | 75.64 | 66.06 | 72.09 | 68.64 | 69.79 | 69.22 | 77.59 | 63.42 | 55.02 | 61.17 | 59.47 | 63.52 | 66.54 |
HME | 74.76 | 63.56 | 70.61 | 67.97 | 69.46 | 68.76 | 80.30 | 63.19 | 53.18 | 62.69 | 59.83 | 64.05 | 66.45 |
HCRN | 68.59 | 50.92 | 62.05 | 64.39 | 61.81 | 63.08 | 54.47 | 53.38 | 41.53 | 52.11 | 47.69 | 50.26 | 55.73 |
AVSD | 72.41 | 61.90 | 68.52 | 67.39 | 74.19 | 70.83 | 81.61 | 63.89 | 58.79 | 61.52 | 61.41 | 65.49 | 67.44 |
Pano-AVQA | 74.36 | 64.56 | 70.73 | 69.39 | 75.65 | 72.56 | 81.21 | 64.91 | 59.33 | 64.22 | 63.23 | 66.64 | 68.93 |
ST-AVQA | 78.18 | 67.05 | 74.06 | 71.56 | 76.38 | 74.00 | 81.81 | 70.80 | 64.51 | 66.01 | 63.23 | 69.54 | 71.52 |
COCA | 79.35 | 67.68 | 75.42 | 75.10 | 75.43 | 75.23 | 83.50 | 66.63 | 69.72 | 64.12 | 65.57 | 69.96 | 72.33 |
PSTP-Net | 73.97 | 65.59 | 70.91 | 77.15 | 77.36 | 77.26 | 76.18 | 72.23 | 71.80 | 71.79 | 69.00 | 72.57 | 73.52 |
LAVISH | 82.09 | 65.56 | 75.97 | 78.98 | 81.43 | 80.22 | 81.71 | 75.51 | 66.13 | 63.77 | 67.96 | 71.26 | 74.46 |
APL | 82.40 | 70.71 | 78.09 | 76.52 | 82.74 | 79.69 | 82.99 | 73.29 | 66.68 | 64.76 | 65.95 | 70.96 | 74.53
TSPM | 84.07 | 64.65 | 76.91 | 82.29 | 84.90 | 83.61 | 82.19 | 76.21 | 71.85 | 65.76 | 71.17 | 73.51 | 76.79 |
QA-TIGER | 84.86 | 67.85 | 78.58 | 83.96 | 86.29 | 85.14 | 83.10 | 78.58 | 72.50 | 63.94 | 69.59 | 73.74 | 77.62 |
QA-TIGER achieves the highest performance among all models, with an overall accuracy of 77.62%, surpassing the previous best method, TSPM (76.79%). Notably, our model performs strongly on complex reasoning tasks, reaching 78.58% on AV-Count and 72.50% on AV-Local, surpassing all previous methods in these categories.
H and T denote the head and tail answer-distribution splits of MUSIC-AVQA-R; A-, V-, and AV- denote the Audio, Visual, and Audio-Visual QA question groups.

Method | A-Count (H) | A-Count (T) | A-Comp (H) | A-Comp (T) | V-Count (H) | V-Count (T) | V-Local (H) | V-Local (T) | AV-Exist (H) | AV-Exist (T) | AV-Count (H) | AV-Count (T) | AV-Local (H) | AV-Local (T) | AV-Comp (H) | AV-Comp (T) | AV-Temp (H) | AV-Temp (T) | Avg
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
FCNLSTM | 66.23 | 36.48 | 64.78 | 51.24 | 61.75 | 5.31 | 54.86 | 51.06 | 64.76 | 78.52 | 62.69 | 7.23 | 46.66 | 57.30 | 43.13 | 71.67 | 37.02 | 30.78 | 54.12 |
BiLSTM | 73.68 | 46.32 | 21.51 | 77.58 | 64.30 | 0.00 | 53.92 | 42.01 | 87.51 | 21.14 | 62.85 | 2.18 | 35.16 | 43.75 | 27.61 | 74.38 | 17.58 | 31.32 | 48.84 |
HCAttn | 61.67 | 41.63 | 59.09 | 47.14 | 56.52 | 9.20 | 67.01 | 53.16 | 66.57 | 61.13 | 59.53 | 12.48 | 37.05 | 42.48 | 48.81 | 60.12 | 33.82 | 39.26 | 51.90 |
MCAN | 75.02 | 60.16 | 58.89 | 50.09 | 64.58 | 26.69 | 66.48 | 62.25 | 51.29 | 67.29 | 64.76 | 25.28 | 46.11 | 61.61 | 50.57 | 52.40 | 34.64 | 58.05 | 57.27 |
PSAC | 53.01 | 56.68 | 57.41 | 48.12 | 49.55 | 26.43 | 72.96 | 60.69 | 50.56 | 55.54 | 56.70 | 19.58 | 41.98 | 52.30 | 38.13 | 58.92 | 26.68 | 46.24 | 50.45 |
HME | 62.60 | 53.95 | 54.97 | 58.29 | 50.95 | 16.46 | 73.25 | 58.60 | 65.74 | 66.49 | 63.18 | 17.18 | 33.79 | 46.03 | 53.20 | 69.57 | 33.95 | 41.57 | 53.66 |
HCRN | 55.53 | 53.31 | 47.17 | 32.44 | 41.87 | 23.55 | 39.40 | 51.27 | 41.81 | 65.45 | 54.58 | 19.57 | 36.62 | 42.72 | 33.33 | 36.87 | 40.47 | 44.13 | 43.92 |
AVSD | 54.00 | 47.84 | 60.61 | 47.79 | 60.34 | 10.07 | 74.78 | 61.43 | 66.28 | 61.98 | 46.21 | 8.06 | 33.00 | 40.35 | 51.98 | 66.00 | 40.14 | 41.52 | 52.33 |
Pano-AVQA | 50.57 | 43.45 | 50.78 | 44.93 | 47.28 | 15.50 | 67.19 | 65.51 | 52.37 | 22.04 | 52.21 | 21.52 | 44.35 | 61.69 | 45.61 | 40.49 | 35.00 | 49.33 | 47.40 |
ST-AVQA | 56.40 | 41.48 | 62.28 | 57.59 | 59.86 | 12.94 | 63.31 | 54.00 | 73.35 | 77.26 | 48.31 | 8.41 | 35.35 | 40.49 | 53.30 | 62.44 | 40.25 | 38.15 | 52.80 |
LAVISH | 61.73 | 43.99 | 65.06 | 60.38 | 65.53 | 11.13 | 70.21 | 64.73 | 77.83 | 79.46 | 49.88 | 14.87 | 41.76 | 41.20 | 59.26 | 65.10 | 41.84 | 46.26 | 57.63 |
TSPM | 81.65 | 71.80 | 67.66 | 49.56 | 78.29 | 47.56 | 80.58 | 73.18 | 69.15 | 82.79 | 77.09 | 38.64 | 42.24 | 57.37 | 52.07 | 68.86 | 39.23 | 49.36 | 66.30 |
QA-TIGER | 82.67 | 75.82 | 71.75 | 43.11 | 81.30 | 54.59 | 84.76 | 75.59 | 72.84 | 78.56 | 76.70 | 33.55 | 48.22 | 64.65 | 37.55 | 80.47 | 36.85 | 62.96 | 67.99 |
Our method achieves 67.99% overall accuracy on the MUSIC-AVQA-R dataset across diverse question types, without the need for explicit bias handling techniques. This balanced performance highlights QA-TIGER’s strong temporal modeling and question-aware feature extraction capabilities.
Test | Training | Method | A-QA | V-QA | AV-QA | Avg |
---|---|---|---|---|---|---|
(a) Bias | Bias | ST-AVQA | 76.86 | 77.70 | 69.59 | 73.07
(a) Bias | Bias | LAVISH | 76.73 | 80.96 | 70.80 | 74.59
(a) Bias | Bias | QA-TIGER | 79.13 | 84.83 | 72.37 | 76.93
(a) Bias | Balance | ST-AVQA | 76.18 | 77.20 | 67.96 | 71.92
(a) Bias | Balance | LAVISH | 75.56 | 80.83 | 69.27 | 73.51
(a) Bias | Balance | LAST | 77.10 | 82.99 | 70.86 | 75.24
(a) Bias | Balance | LAST-Att | 77.29 | 83.47 | 71.05 | 75.45
(a) Bias | Balance | QA-TIGER | 77.07 | 85.93 | 71.20 | 76.57
(b) Balance | Bias | ST-AVQA | 73.34 | 76.82 | 64.51 | 69.40
(b) Balance | Bias | LAVISH | 73.14 | 79.70 | 65.01 | 70.39
(b) Balance | Bias | QA-TIGER | 77.57 | 84.84 | 67.43 | 73.91
(b) Balance | Balance | ST-AVQA | 75.50 | 77.67 | 66.32 | 71.02
(b) Balance | Balance | LAVISH | 76.15 | 81.32 | 68.28 | 73.18
(b) Balance | Balance | LAST | 78.08 | 83.29 | 69.72 | 74.85
(b) Balance | Balance | LAST-Att | 78.56 | 84.07 | 70.30 | 75.44
(b) Balance | Balance | QA-TIGER | 79.90 | 86.95 | 70.22 | 76.43
QA-TIGER consistently outperforms existing models such as ST-AVQA, LAVISH, and LAST(-Att) on the biased test set, regardless of the training set type (Table a). On the balanced test set (Table b), QA-TIGER trained on the balanced dataset achieves the best accuracy in both A-QA (79.90%) and V-QA (86.95%) and surpasses LAST-Att in overall accuracy (76.43% vs. 75.44%). Notably, LAST-Att underperforms in A-QA despite incorporating an additional audio encoder, the Audio Spectrogram Transformer. This highlights QA-TIGER’s robustness across diverse evaluation settings.
To evaluate whether QA-TIGER accurately identifies question-relevant temporal segments, we qualitatively analyzed its behavior on the MUSIC-AVQA benchmark. Notably, the audio Gaussian aligns more closely with A-QA questions, while the visual Gaussian is more prominent for V-QA; for AV-QA, both modalities exhibit similar Gaussian distributions, demonstrating the model’s adaptive focus based on question type. A detailed comparison with uniform sampling and Top-K selection is provided in the supplementary material.
QA-TIGER demonstrates its temporal reasoning capability when comparing the tuba and the clarinet: the audio Gaussian assigns high weights across the entire duration, capturing the complete temporal spans needed for an accurate comparison, while the visual Gaussian focuses on frames where the instruments are actively played.
The method likewise integrates both modalities to accurately count the saxophones: the visual Gaussian assigns higher weights to frames where all five saxophonists are clearly visible from the front, whereas frames with less distinct views receive lower weights, and the audio Gaussian emphasizes moments when the sounds of all five saxophones overlap, ensuring the model concentrates on the most critical visual and auditory cues.
In another example, the visual Gaussian highlights frames in the early part of the video where individual instruments appear as well as those where all three instruments are shown together, and the audio Gaussian emphasizes the prominent sounds of the instruments. This complementary alignment enables QA-TIGER to effectively integrate visual and auditory cues, resulting in the accurate identification of all instruments.
To validate the effectiveness of our question-aware fusion module, we performed word-level visualizations; additional supporting examples are provided in the supplementary material. The question-aware fusion module focuses on “instrument,” “right,” “left,” and “louder” to identify spatial locations in the visual modality, while it emphasizes “louder” to analyze sound intensity in the audio modality. This complementary behavior enables the model to tackle the question effectively.
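As a loose illustration of how such word-level scores might be read off, the snippet below averages per-word cross-attention weights over the temporal segments; the assumption that the fusion module exposes an attention map of shape (segments × question tokens), along with the function name, is hypothetical rather than taken from the paper.

```python
# Hypothetical helper for the word-level visualization described above: it
# assumes the question-aware fusion module returns a cross-attention map from
# the T audio/visual segments to the L question tokens.
import torch

def word_importance(attn: torch.Tensor, tokens: list[str]) -> dict[str, float]:
    """attn: (T, L) cross-attention weights; tokens: the L question words."""
    scores = attn.mean(dim=0)              # average each word's attention over time
    scores = scores / (scores.sum() + 1e-6)  # normalize to a distribution over words
    return {tok: round(score.item(), 3) for tok, score in zip(tokens, scores)}

# e.g. word_importance(visual_attn, ["is", "the", "instrument", "on", "the", "right", "louder"])
# would be expected to surface high scores on "instrument", "right", and "louder"
# for the visual stream, mirroring the qualitative observation above.
```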