Audio-Visual Question Answering (AVQA) requires not only question-based multimodal reasoning but also precise temporal grounding to capture subtle dynamics for accurate prediction. However, existing methods mainly use question information implicitly, limiting focus on question-specific details. Furthermore, most studies rely on uniform frame sampling, which can miss key question-relevant frames. Although recent Top-K frame selection methods aim to address this, their discrete nature still overlooks fine-grained temporal details. This paper proposes QA-TIGER, a novel framework that explicitly incorporates question information and models continuous temporal dynamics. Our key idea is to use Gaussian-based modeling to adaptively focus on both consecutive and non-consecutive frames based on the question, while explicitly injecting question information and applying progressive refinement. We leverage a Mixture of Experts (MoE) to flexibly implement multiple Gaussian models, activating temporal experts specifically tailored to the question. Extensive experiments on multiple AVQA benchmarks show that QA-TIGER consistently achieves state-of-the-art performance.
QA-TIGER first extracts visual features from video segments using a pretrained CLIP image encoder and audio features via a VGGish model, while the input question is encoded with the CLIP text encoder. It then embeds the question context into both modalities through a question-aware fusion module that applies self- and cross-attention, generating enriched, query-aligned representations. Next, a temporal integration module leverages a Mixture-of-Experts (MoE) framework in which multiple Gaussian distributions are generated to capture both consecutive and non-consecutive temporal dependencies; adaptive weights, computed via a routing mechanism, allow the model to emphasize the segments most relevant to the query. Finally, the fused audio-visual features are combined using a question-guided reasoning module and passed through a linear classifier with softmax to produce the final answer prediction.
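As a rough, minimal sketch of how such a question-conditioned Gaussian mixture over temporal segments could be realized, the PyTorch snippet below has a router derive expert mixing weights and per-expert Gaussian parameters (center and width) from the question embedding, then re-weights the segment features accordingly; the class name `GaussianTemporalMoE`, the number of experts, and the router layout are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch of Gaussian-expert temporal integration (assumed design,
# not the official QA-TIGER code): a router maps the question embedding to
# expert mixing weights and per-expert Gaussian (mu, sigma) over normalized time.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianTemporalMoE(nn.Module):
    def __init__(self, dim: int, n_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)           # question -> expert mixing weights
        self.gauss_params = nn.Linear(dim, 2 * n_experts)  # question -> (mu, sigma) per expert

    def forward(self, feats: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        """feats: (B, T, dim) per-segment audio or visual features; q: (B, dim) question embedding."""
        B, T, _ = feats.shape
        mix = F.softmax(self.router(q), dim=-1)                   # (B, E) expert weights
        mu, raw_sigma = self.gauss_params(q).chunk(2, dim=-1)     # (B, E) each
        mu = torch.sigmoid(mu)                                    # centers in [0, 1]
        sigma = F.softplus(raw_sigma) + 1e-4                      # positive widths
        t = torch.linspace(0, 1, T, device=feats.device)          # normalized segment times
        # Per-expert Gaussian weights over the T segments: (B, E, T)
        w = torch.exp(-0.5 * ((t[None, None, :] - mu[..., None]) / sigma[..., None]) ** 2)
        w = w / (w.sum(dim=-1, keepdim=True) + 1e-6)              # normalize over time
        w = (mix[..., None] * w).sum(dim=1)                       # mixture -> (B, T)
        # Temporally re-weight and aggregate the segment features.
        return (w[..., None] * feats).sum(dim=1)                  # (B, dim)
```

Under this reading, each expert contributes one soft, continuous window over time, so the mixture can cover a single contiguous span or several disjoint ones, depending on the question.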
Column prefixes A-, V-, and AV- denote the Audio, Visual, and Audio-Visual QA question groups.

Method | A-Count | A-Comp | A-Avg | V-Count | V-Local | V-Avg | AV-Exist | AV-Count | AV-Local | AV-Comp | AV-Temp | AV-Avg | Avg
---|---|---|---|---|---|---|---|---|---|---|---|---|---
FCNLSTM | 70.45 | 66.22 | 68.88 | 63.89 | 46.74 | 55.21 | 82.01 | 59.34 | 46.28 | 62.15 | 47.33 | 60.06 | 60.34 |
BiLSTM | 70.35 | 47.92 | 62.05 | 64.64 | 64.33 | 64.48 | 78.39 | 56.91 | 45.85 | 53.09 | 49.76 | 57.10 | 59.92 |
HCAttn | 70.25 | 54.91 | 64.57 | 64.05 | 66.37 | 65.22 | 79.10 | 59.97 | 49.51 | 55.25 | 56.43 | 60.19 | 62.30 |
MCAN | 77.50 | 55.24 | 69.25 | 71.56 | 70.93 | 71.24 | 80.40 | 64.91 | 54.48 | 57.22 | 47.57 | 61.58 | 65.49 |
PSAC | 75.64 | 66.06 | 72.09 | 68.64 | 69.79 | 69.22 | 77.59 | 63.42 | 55.02 | 61.17 | 59.47 | 63.52 | 66.54 |
HME | 74.76 | 63.56 | 70.61 | 67.97 | 69.46 | 68.76 | 80.30 | 63.19 | 53.18 | 62.69 | 59.83 | 64.05 | 66.45 |
HCRN | 68.59 | 50.92 | 62.05 | 64.39 | 61.81 | 63.08 | 54.47 | 53.38 | 41.53 | 52.11 | 47.69 | 50.26 | 55.73 |
AVSD | 72.41 | 61.90 | 68.52 | 67.39 | 74.19 | 70.83 | 81.61 | 63.89 | 58.79 | 61.52 | 61.41 | 65.49 | 67.44 |
Pano-AVQA | 74.36 | 64.56 | 70.73 | 69.39 | 75.65 | 72.56 | 81.21 | 64.91 | 59.33 | 64.22 | 63.23 | 66.64 | 68.93 |
ST-AVQA | 78.18 | 67.05 | 74.06 | 71.56 | 76.38 | 74.00 | 81.81 | 70.80 | 64.51 | 66.01 | 63.23 | 69.54 | 71.52 |
COCA | 79.35 | 67.68 | 75.42 | 75.10 | 75.43 | 75.23 | 83.50 | 66.63 | 69.72 | 64.12 | 65.57 | 69.96 | 72.33 |
PSTP-Net | 73.97 | 65.59 | 70.91 | 77.15 | 77.36 | 77.26 | 76.18 | 72.23 | 71.80 | 71.79 | 69.00 | 72.57 | 73.52 |
LAVISH | 82.09 | 65.56 | 75.97 | 78.98 | 81.43 | 80.22 | 81.71 | 75.51 | 66.13 | 63.77 | 67.96 | 71.26 | 74.46 |
APL | 82.40 | 70.71 | 78.09 | 76.52 | 82.74 | 79.69 | 82.99 | 73.29 | 66.68 | 64.76 | 65.95 | 70.96 | 74.53
TSPM | 84.07 | 64.65 | 76.91 | 82.29 | 84.90 | 83.61 | 82.19 | 76.21 | 71.85 | 65.76 | 71.17 | 73.51 | 76.79 |
QA-TIGER | 84.86 | 67.85 | 78.58 | 83.96 | 86.29 | 85.14 | 83.10 | 78.58 | 72.50 | 63.94 | 69.59 | 73.74 | 77.62 |
QA-TIGER achieves the highest performance among all models, with an overall accuracy of 77.62%, surpassing the previous best method, TSPM (76.79%). Notably, our model performs strongly on complex reasoning tasks, reaching 78.58% on AV-Count and 72.50% on AV-Local, surpassing all previous methods in these categories.
H and T denote the head and tail answer-distribution splits of MUSIC-AVQA-R; A-, V-, and AV- denote the Audio, Visual, and Audio-Visual QA question groups.

Method | A-Count (H) | A-Count (T) | A-Comp (H) | A-Comp (T) | V-Count (H) | V-Count (T) | V-Local (H) | V-Local (T) | AV-Exist (H) | AV-Exist (T) | AV-Count (H) | AV-Count (T) | AV-Local (H) | AV-Local (T) | AV-Comp (H) | AV-Comp (T) | AV-Temp (H) | AV-Temp (T) | Avg
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
FCNLSTM | 66.23 | 36.48 | 64.78 | 51.24 | 61.75 | 5.31 | 54.86 | 51.06 | 64.76 | 78.52 | 62.69 | 7.23 | 46.66 | 57.30 | 43.13 | 71.67 | 37.02 | 30.78 | 54.12 |
BiLSTM | 73.68 | 46.32 | 21.51 | 77.58 | 64.30 | 0.00 | 53.92 | 42.01 | 87.51 | 21.14 | 62.85 | 2.18 | 35.16 | 43.75 | 27.61 | 74.38 | 17.58 | 31.32 | 48.84 |
HCAttn | 61.67 | 41.63 | 59.09 | 47.14 | 56.52 | 9.20 | 67.01 | 53.16 | 66.57 | 61.13 | 59.53 | 12.48 | 37.05 | 42.48 | 48.81 | 60.12 | 33.82 | 39.26 | 51.90 |
MCAN | 75.02 | 60.16 | 58.89 | 50.09 | 64.58 | 26.69 | 66.48 | 62.25 | 51.29 | 67.29 | 64.76 | 25.28 | 46.11 | 61.61 | 50.57 | 52.40 | 34.64 | 58.05 | 57.27 |
PSAC | 53.01 | 56.68 | 57.41 | 48.12 | 49.55 | 26.43 | 72.96 | 60.69 | 50.56 | 55.54 | 56.70 | 19.58 | 41.98 | 52.30 | 38.13 | 58.92 | 26.68 | 46.24 | 50.45 |
HME | 62.60 | 53.95 | 54.97 | 58.29 | 50.95 | 16.46 | 73.25 | 58.60 | 65.74 | 66.49 | 63.18 | 17.18 | 33.79 | 46.03 | 53.20 | 69.57 | 33.95 | 41.57 | 53.66 |
HCRN | 55.53 | 53.31 | 47.17 | 32.44 | 41.87 | 23.55 | 39.40 | 51.27 | 41.81 | 65.45 | 54.58 | 19.57 | 36.62 | 42.72 | 33.33 | 36.87 | 40.47 | 44.13 | 43.92 |
AVSD | 54.00 | 47.84 | 60.61 | 47.79 | 60.34 | 10.07 | 74.78 | 61.43 | 66.28 | 61.98 | 46.21 | 8.06 | 33.00 | 40.35 | 51.98 | 66.00 | 40.14 | 41.52 | 52.33 |
Pano-AVQA | 50.57 | 43.45 | 50.78 | 44.93 | 47.28 | 15.50 | 67.19 | 65.51 | 52.37 | 22.04 | 52.21 | 21.52 | 44.35 | 61.69 | 45.61 | 40.49 | 35.00 | 49.33 | 47.40 |
ST-AVQA | 56.40 | 41.48 | 62.28 | 57.59 | 59.86 | 12.94 | 63.31 | 54.00 | 73.35 | 77.26 | 48.31 | 8.41 | 35.35 | 40.49 | 53.30 | 62.44 | 40.25 | 38.15 | 52.80 |
LAVISH | 61.73 | 43.99 | 65.06 | 60.38 | 65.53 | 11.13 | 70.21 | 64.73 | 77.83 | 79.46 | 49.88 | 14.87 | 41.76 | 41.20 | 59.26 | 65.10 | 41.84 | 46.26 | 57.63 |
TSPM | 81.65 | 71.80 | 67.66 | 49.56 | 78.29 | 47.56 | 80.58 | 73.18 | 69.15 | 82.79 | 77.09 | 38.64 | 42.24 | 57.37 | 52.07 | 68.86 | 39.23 | 49.36 | 66.30 |
QA-TIGER | 82.67 | 75.82 | 71.75 | 43.11 | 81.30 | 54.59 | 84.76 | 75.59 | 72.84 | 78.56 | 76.70 | 33.55 | 48.22 | 64.65 | 37.55 | 80.47 | 36.85 | 62.96 | 67.99 |
Our method achieves 67.99% overall accuracy on the MUSIC-AVQA-R dataset across diverse question types, without the need for explicit bias handling techniques. This balanced performance highlights QA-TIGER’s strong temporal modeling and question-aware feature extraction capabilities.
Test | Training | Method | A-QA | V-QA | AV-QA | Avg |
---|---|---|---|---|---|---|
(a) Bias | Bias | ST-AVQA | 76.86 | 77.70 | 69.59 | 73.07
(a) Bias | Bias | LAVISH | 76.73 | 80.96 | 70.80 | 74.59
(a) Bias | Bias | QA-TIGER | 79.13 | 84.83 | 72.37 | 76.93
(a) Bias | Balance | ST-AVQA | 76.18 | 77.20 | 67.96 | 71.92
(a) Bias | Balance | LAVISH | 75.56 | 80.83 | 69.27 | 73.51
(a) Bias | Balance | LAST | 77.10 | 82.99 | 70.86 | 75.24
(a) Bias | Balance | LAST-Att | 77.29 | 83.47 | 71.05 | 75.45
(a) Bias | Balance | QA-TIGER | 77.07 | 85.93 | 71.20 | 76.57
(b) Balance | Bias | ST-AVQA | 73.34 | 76.82 | 64.51 | 69.40
(b) Balance | Bias | LAVISH | 73.14 | 79.70 | 65.01 | 70.39
(b) Balance | Bias | QA-TIGER | 77.57 | 84.84 | 67.43 | 73.91
(b) Balance | Balance | ST-AVQA | 75.50 | 77.67 | 66.32 | 71.02
(b) Balance | Balance | LAVISH | 76.15 | 81.32 | 68.28 | 73.18
(b) Balance | Balance | LAST | 78.08 | 83.29 | 69.72 | 74.85
(b) Balance | Balance | LAST-Att | 78.56 | 84.07 | 70.30 | 75.44
(b) Balance | Balance | QA-TIGER | 79.90 | 86.95 | 70.22 | 76.43
QA-TIGER consistently outperforms existing models such as ST-AVQA, LAVISH, and LAST(-Att) on the biased test set, regardless of the training set type (Table a). On the balanced test set (Table b), QA-TIGER trained on the balanced dataset achieves the best accuracy in both A-QA (79.90%) and V-QA (86.95%) and surpasses LAST-Att in overall accuracy (76.43% vs. 75.44%). Notably, LAST-Att underperforms in A-QA despite incorporating an additional audio encoder, the Audio Spectrogram Transformer. This highlights QA-TIGER’s robustness across diverse evaluation settings.
To evaluate whether QA-TIGER accurately identifies question-relevant temporal segments, we qualitatively analyzed its behavior on the MUSIC-AVQA benchmark. Notably, the audio Gaussian aligns more closely with A-QA questions, while the visual Gaussian is more prominent for V-QA; for AV-QA, both modalities exhibit similar Gaussian distributions, demonstrating the model’s adaptive focus based on question type. A detailed comparison with uniform sampling and Top-K selection is provided in the supplementary material.
QA-TIGER demonstrates its temporal reasoning capability when comparing the tuba and the clarinet: the audio Gaussian assigns high weights across the entire duration, capturing the complete temporal spans needed for an accurate comparison, while the visual Gaussian focuses on frames where the instruments are actively played.
The method likewise integrates both modalities to accurately count the saxophones: the visual Gaussian assigns higher weights to frames where all five saxophonists are clearly visible from the front, whereas frames with less distinct views receive lower weights, and the audio Gaussian emphasizes moments when the sounds of all five saxophones overlap, ensuring the model concentrates on the most critical visual and auditory cues.
In another example, the visual Gaussian highlights frames in the early part of the video where individual instruments appear as well as those where all three instruments are shown together, and the audio Gaussian emphasizes the prominent sounds of the instruments. This complementary alignment enables QA-TIGER to effectively integrate visual and auditory cues, resulting in the accurate identification of all instruments.
To validate the effectiveness of our question-aware fusion module, we performed word-level visualizations; additional supporting examples are provided in the supplementary material. The question-aware fusion module focuses on “instrument,” “right,” “left,” and “louder” to identify spatial locations in the visual modality, while it emphasizes “louder” to analyze sound intensity in the audio modality. This complementary behavior enables the model to tackle the question effectively.
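As a loose illustration of how such word-level scores might be read off, the snippet below averages per-word cross-attention weights over the temporal segments; the assumption that the fusion module exposes an attention map of shape (segments × question tokens), along with the function name, is hypothetical rather than taken from the paper.

```python
# Hypothetical helper for the word-level visualization described above: it
# assumes the question-aware fusion module returns a cross-attention map from
# the T audio/visual segments to the L question tokens.
import torch

def word_importance(attn: torch.Tensor, tokens: list[str]) -> dict[str, float]:
    """attn: (T, L) cross-attention weights; tokens: the L question words."""
    scores = attn.mean(dim=0)              # average each word's attention over time
    scores = scores / (scores.sum() + 1e-6)  # normalize to a distribution over words
    return {tok: round(score.item(), 3) for tok, score in zip(tokens, scores)}

# e.g. word_importance(visual_attn, ["is", "the", "instrument", "on", "the", "right", "louder"])
# would be expected to surface high scores on "instrument", "right", and "louder"
# for the visual stream, mirroring the qualitative observation above.
```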