ZOO-Prune: Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models


1 Amazon, 2 Sungkyunkwan University, 3 Inha University
* Equal contribution
CVPR 2026 Poster
Overview of ZOO-Prune. Token sensitivity is estimated at the projection layer and combined with diversity-aware selection for training-free visual token pruning.

Abstract

Large Vision-Language Models (VLMs) enable strong multimodal inference but suffer from heavy computation due to redundant visual tokens. Token pruning offers a promising solution, yet existing methods often rely on attention scores or diversity heuristics, which do not directly reflect how tokens affect the model output.

We propose ZOO-Prune, a training-free framework that estimates token importance through zeroth-order sensitivity at the lightweight projection layer. By measuring output changes under small perturbations and combining sensitivity with diversity, ZOO-Prune retains informative and non-redundant tokens without requiring backpropagation. Extensive experiments across multiple VLMs and benchmarks show that ZOO-Prune consistently outperforms prior approaches, pruning up to 94.4% of tokens without sacrificing accuracy and achieving up to 2.30× faster end-to-end inference.

Introduction

Visual token pruning has become an effective way to improve the inference efficiency of large vision-language models by removing redundant visual tokens. Existing training-free methods mainly rely on attention scores or diversity heuristics. However, attention can be unstable across layers and heads, while diversity alone does not directly reflect how individual tokens affect the final model output.

ZOO-Prune is motivated by a simple question: which visual tokens most strongly influence the final prediction? To answer this, we estimate token sensitivity through zeroth-order perturbations at the lightweight projection layer, which provides an efficient output-aware importance signal without requiring backpropagation. ZOO-Prune then combines these sensitivity scores with diversity-aware selection to preserve informative and non-redundant tokens under aggressive pruning.
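
To make the idea of a zeroth-order sensitivity score concrete, one common form of such an estimator is a central finite difference averaged over random directions. The notation below is an assumption made for illustration and is not taken from the paper: x_i is the i-th visual token entering the projection layer, g is the projection layer, h is a small step size, m is the number of random directions, and u_k is a random unit-norm direction.

```latex
% Illustrative sketch only: symbols g, x_i, h, m, u_k are assumed notation.
\[
  s_i \;\approx\; \frac{1}{m} \sum_{k=1}^{m}
  \frac{\bigl\lVert g(x_i + h\,u_k) - g(x_i - h\,u_k) \bigr\rVert_2}{2h},
  \qquad \lVert u_k \rVert_2 = 1 .
\]
```

Each term needs only two forward evaluations of g, so the score can be computed without backpropagation; a larger s_i means the projected output changes more when token i is perturbed.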

Method

Method overview. ZOO-Prune estimates token sensitivity at the projection layer and combines it with diversity-aware selection.

ZOO-Prune performs training-free visual token pruning in two stages. It first estimates the sensitivity of each visual token through zeroth-order perturbations at the lightweight projection layer, and then selects a compact subset of tokens by combining sensitivity with diversity.

  • Zeroth-order sensitivity estimation. Token importance is measured by the output change caused by small random perturbations, without requiring backpropagation.
  • Projection-layer proxy. Sensitivity is computed at the projection layer, which provides an efficient and effective proxy for visual token importance.
  • Sensitivity-aware diversity selection. The final token subset preserves both informative and complementary visual content under aggressive pruning.
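
The two stages above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `zo_sensitivity` and `select_tokens`, the parameters `num_dirs` and `h`, the toy `tanh` projector, and the greedy rule that discounts a token's sensitivity by its cosine similarity to already-kept tokens are all hypothetical choices made for this example. It also assumes the projector acts on each token independently and that no feature vector is all zeros.

```python
import numpy as np


def zo_sensitivity(tokens, project, num_dirs=8, h=1e-2, seed=0):
    """Central-difference zeroth-order sensitivity per token.

    tokens:  (n, d) array of visual tokens.
    project: callable mapping (n, d) -> (n, p), applied token-wise (assumed).
    """
    rng = np.random.default_rng(seed)
    scores = np.zeros(tokens.shape[0])
    for _ in range(num_dirs):
        # One random unit-norm perturbation direction per token.
        u = rng.standard_normal(tokens.shape)
        u /= np.linalg.norm(u, axis=1, keepdims=True)
        # Two forward passes only; no gradients are computed.
        delta = project(tokens + h * u) - project(tokens - h * u)
        scores += np.linalg.norm(delta, axis=1) / (2.0 * h)
    return scores / num_dirs


def select_tokens(features, scores, keep):
    """Greedy selection: prefer sensitive tokens that differ from kept ones.

    The multiplicative discount below is an illustrative assumption.
    """
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    selected = [int(np.argmax(scores))]
    max_sim = unit @ unit[selected[0]]  # similarity to the closest kept token
    while len(selected) < keep:
        gain = scores * (1.0 - np.clip(max_sim, 0.0, 1.0))
        gain[selected] = -np.inf  # never pick the same token twice
        nxt = int(np.argmax(gain))
        selected.append(nxt)
        max_sim = np.maximum(max_sim, unit @ unit[nxt])
    return sorted(selected)


# Toy usage with a stand-in projector (not a real VLM projection layer).
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 8))
tokens = rng.standard_normal((64, 16))
project = lambda x: np.tanh(x @ W)
scores = zo_sensitivity(tokens, project)
kept = select_tokens(project(tokens), scores, keep=8)
```

Here `kept` holds the indices of the 8 retained tokens out of 64. Working at the projector keeps each perturbation cheap, since only that layer is re-evaluated rather than the full language model.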

Experimental Results

ZOO-Prune consistently achieves a strong accuracy-efficiency trade-off across multiple vision-language models and benchmarks. It preserves high task performance even under aggressive pruning ratios and outperforms prior training-free pruning methods.

Qualitative Results. We further compare the selected visual tokens on the GQA benchmark to better understand how different pruning criteria behave under compression.

Text-visual attention often suffers from positional bias, while visual-visual attention tends to preserve redundant token clusters. Diversity-based pruning improves spatial spread but may miss semantically important regions. ZOO-based sensitivity captures output-related tokens more effectively, and ZOO-Prune further balances sensitivity and diversity for more robust token selection across compression ratios.

We also provide direct mask-level comparisons among representative training-free baselines. Compared with VisionZip and DivPrune, ZOO-Prune preserves more task-relevant regions while avoiding overly redundant or overly dispersed token selections.

BibTeX

If you find our work useful, please consider citing:

@article{kim2025training,
  title={Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models},
  author={Kim, Youngeun and Zhang, Youjia and Liu, Huiling and Jung, Aecheon and Lee, Sunwoo and Hong, Sungeun},
  journal={arXiv preprint arXiv:2509.24837},
  year={2025}
}