ZOO-Prune: Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models

Youngeun Kim*¹, Youjia Zhang*², Huiling Liu²,
Aecheon Jung², Sunwoo Lee³, Sungeun Hong²

¹Amazon, ²Sungkyunkwan University, ³Inha University

* Equal contribution

CVPR 2026 Poster

Overview of ZOO-Prune. Token sensitivity is estimated at the projection layer and combined with diversity-aware selection for training-free visual token pruning.

Abstract

Large Vision-Language Models (VLMs) enable strong multimodal inference but suffer from heavy computation due to redundant visual tokens. Token pruning offers a promising solution, yet existing methods often rely on attention scores or diversity heuristics, which do not directly reflect how tokens affect the model output.

We propose ZOO-Prune, a training-free framework that estimates token importance through zeroth-order sensitivity at the lightweight projection layer. By measuring output changes under small perturbations and combining sensitivity with diversity, ZOO-Prune retains informative and non-redundant tokens without requiring backpropagation. Extensive experiments across multiple VLMs and benchmarks show that our method consistently outperforms prior approaches while pruning up to 94.4% of tokens without sacrificing accuracy, and achieves up to 2.30× faster end-to-end inference.

Introduction

Visual token pruning has become an effective way to improve the inference efficiency of large vision-language models by removing redundant visual tokens. Existing training-free methods mainly rely on attention scores or diversity heuristics. However, attention can be unstable across layers and heads, while diversity alone does not directly reflect how individual tokens affect the final model output.

ZOO-Prune is motivated by a simple question: which visual tokens most strongly influence the final prediction? To answer this, we estimate token sensitivity through zeroth-order perturbations at the lightweight projection layer, which provides an efficient, output-aware importance signal without requiring backpropagation. ZOO-Prune then combines these sensitivity scores with diversity-aware selection to preserve informative and non-redundant tokens under aggressive pruning.
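
The perturbation-based estimate described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `zo_sensitivity`, the parameters `num_dirs` and `h`, the central-difference form, and the toy `tanh` projector are all assumptions made for the example.

```python
import numpy as np

def zo_sensitivity(tokens, project, num_dirs=8, h=1e-2, seed=0):
    """Estimate per-token sensitivity from forward passes only.

    tokens:  (N, D) array of visual token features fed to the projector.
    project: callable mapping (N, D) -> (N, D_out), applied token-wise
             (a stand-in for the projection layer; hypothetical here).
    Returns an (N,) array of non-negative scores.
    """
    rng = np.random.default_rng(seed)
    scores = np.zeros(tokens.shape[0])
    for _ in range(num_dirs):
        # Draw one random unit direction per token.
        u = rng.standard_normal(tokens.shape)
        u /= np.linalg.norm(u, axis=1, keepdims=True)
        # Central finite difference: two forward passes, no gradients.
        diff = project(tokens + h * u) - project(tokens - h * u)
        scores += np.linalg.norm(diff, axis=1) / (2.0 * h)
    # Average the output change over all sampled directions.
    return scores / num_dirs

# Toy usage with a small random two-layer projector (assumed, for illustration).
rng = np.random.default_rng(0)
W1 = rng.standard_normal((16, 32)) * 0.5
W2 = rng.standard_normal((32, 8)) * 0.5
toy_projector = lambda x: np.tanh(x @ W1) @ W2

tokens = rng.standard_normal((6, 16))
scores = zo_sensitivity(tokens, toy_projector)
```

Each direction costs two forward passes through the projector, so the total cost in this sketch grows linearly with `num_dirs` and never touches the language model.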

Method

Method overview. ZOO-Prune estimates token sensitivity at the projection layer and combines it with diversity-aware selection.

ZOO-Prune performs training-free visual token pruning in two stages. It first estimates the sensitivity of each visual token through zeroth-order perturbations at the lightweight projection layer, and then selects a compact subset of tokens by combining sensitivity with diversity.

Zeroth-order sensitivity estimation. Token importance is measured by the output change caused by small random perturbations, without requiring backpropagation.
Projection-layer proxy. Sensitivity is computed at the projection layer, which provides an efficient and effective proxy for visual token importance.
Sensitivity-aware diversity selection. The final token subset preserves both informative and complementary visual content under aggressive pruning.
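
One way to picture how sensitivity and diversity might be combined is a greedy selection loop. The sketch below is an assumption-laden illustration rather than the selection rule used by ZOO-Prune: the function name `select_tokens`, the cosine-similarity redundancy measure, and the multiplicative score are hypothetical choices.

```python
import numpy as np

def select_tokens(features, sensitivity, k):
    """Greedily pick k token indices that are sensitive and mutually dissimilar."""
    features = np.asarray(features, dtype=float)
    sensitivity = np.asarray(sensitivity, dtype=float)
    k = min(k, features.shape[0])
    # Unit-normalize rows so dot products are cosine similarities.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)
    # Rescale sensitivity to [0, 1].
    s = sensitivity - sensitivity.min()
    s = s / max(s.max(), 1e-12)
    # Start from the most sensitive token.
    selected = [int(np.argmax(s))]
    # Highest similarity of every token to the selected set so far.
    max_sim = unit @ unit[selected[0]]
    while len(selected) < k:
        # Hypothetical score: sensitivity weighted by distance to the selected set.
        gain = (s + 1e-6) * (1.0 - max_sim)
        gain[selected] = -np.inf  # never pick the same token twice
        nxt = int(np.argmax(gain))
        selected.append(nxt)
        max_sim = np.maximum(max_sim, unit @ unit[nxt])
    return sorted(selected)

# Tokens 0 and 1 are near-duplicates with high sensitivity; token 2 is distinct.
feats = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [-1.0, 0.0]])
sens = np.array([1.0, 0.9, 0.5, 0.1])
print(select_tokens(feats, sens, k=2))  # → [0, 2]
```

In this toy case a pure top-k on sensitivity would keep the two near-duplicate tokens 0 and 1, while the combined score keeps token 0 and the less sensitive but complementary token 2.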

Experimental Results

ZOO-Prune consistently achieves a strong accuracy-efficiency trade-off across multiple vision-language models and benchmarks. It preserves high task performance even under aggressive pruning ratios, outperforming prior training-free pruning methods.

LLaVA-1.5-7B

📌 On LLaVA-1.5-7B, ZOO-Prune achieves the best average performance across all pruning ratios, retaining 98.6%, 97.8%, and 95.5% performance with only 192, 128, and 64 tokens, respectively.
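
To relate these token budgets to pruning ratios, the short calculation below may help. It assumes a baseline of 576 visual tokens per image for LLaVA-1.5, which is not stated on this page; `pruning_ratio` and `TOTAL_TOKENS` are names introduced only for this example.

```python
def pruning_ratio(kept, total):
    """Fraction of visual tokens removed when `kept` of `total` are retained."""
    return 1.0 - kept / total

TOTAL_TOKENS = 576  # assumption: default LLaVA-1.5 visual token count

for kept in (192, 128, 64):
    print(f"{kept:>3} kept -> {pruning_ratio(kept, TOTAL_TOKENS):.1%} pruned")
# 192 kept -> 66.7% pruned
# 128 kept -> 77.8% pruned
#  64 kept -> 88.9% pruned
```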

LLaVA-1.5-13B

✨ On LLaVA-1.5-13B, ZOO-Prune remains the top-performing method, preserving 98.6%, 97.8%, and 96.5% performance under substantial token reduction, which shows strong generalization to a larger VLM.

LLaVA-NeXT-7B

🚀 On LLaVA-NeXT-7B, ZOO-Prune maintains 98.3% performance with 640 tokens, 97.1% with 320 tokens, and still reaches 95.4% with only 160 tokens, demonstrating strong scalability under high-resolution token pruning.

Qualitative Results. We further compare the selected visual tokens on the GQA benchmark to better understand how different pruning criteria behave under compression.

Text-visual attention often suffers from positional bias, while visual-visual attention tends to preserve redundant token clusters. Diversity-based pruning improves spatial spread but may miss semantically important regions. ZOO-based sensitivity captures output-related tokens more effectively, and ZOO-Prune further balances sensitivity and diversity for more robust token selection across compression ratios.

Attention vs. Diversity vs. Sensitivity

🔍 This example shows how different token selection strategies behave on GQA, highlighting the contrast between attention-based, diversity-based, sensitivity-based, and our final ZOO-Prune selection.

📍 Text-visual attention can be biased toward locally related regions, while visual-visual attention often keeps overlapping token clusters instead of the most informative ones.

🧩 Diversity-based pruning spreads selected tokens more broadly, but it may overlook regions that are more critical to the final prediction.

⚖️ ZOO-Prune jointly optimizes sensitivity and diversity, leading to a more balanced selection of informative and complementary visual tokens across pruning ratios.

We also provide direct mask-level comparisons among representative training-free baselines. Compared with VisionZip and DivPrune, ZOO-Prune preserves more task-relevant regions while avoiding overly redundant or overly dispersed token selections.

Comparison Across Methods

🖼️ This comparison highlights the qualitative differences between VisionZip, DivPrune, and ZOO-Prune on GQA under the same pruning setting.

🌟 ZOO-Prune better preserves visually meaningful regions by selecting tokens that are both output-related and less redundant than those chosen by prior methods.

✅ Across examples, ZOO-Prune produces more balanced token masks, avoiding the redundancy of attention-driven pruning and the semantic looseness of diversity-only selection.

BibTeX

If you find our work useful, please consider citing:

@article{kim2025training,
  title={Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models},
  author={Kim, Youngeun and Zhang, Youjia and Liu, Huiling and Jung, Aecheon and Lee, Sunwoo and Hong, Sungeun},
  journal={arXiv preprint arXiv:2509.24837},
  year={2025}
}