ZOO-Prune: Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models


1Amazon,    2Sungkyunkwan University,    3Inha University
* Equal contribution
CVPR 2026 Poster
Overview of ZOO-Prune

Abstract

Large Vision-Language Models (VLMs) deliver strong multimodal reasoning but incur heavy computational cost, much of which is spent on redundant visual tokens. Token pruning offers a promising remedy, yet existing methods often rely on attention scores or diversity heuristics, which do not directly reflect how individual tokens affect the model output.

We propose ZOO-Prune, a training-free framework that estimates token importance through zeroth-order sensitivity at the lightweight projection layer. By measuring output changes under small perturbations and combining sensitivity with diversity, ZOO-Prune retains informative and non-redundant tokens without requiring backpropagation. Extensive experiments across multiple VLMs and benchmarks show that our method consistently outperforms prior approaches while pruning up to 94.4% of tokens without sacrificing accuracy, and achieves up to 2.30× faster end-to-end inference.

Introduction

Existing visual token pruning methods often rely on attention scores or diversity heuristics. However, attention can be unstable across layers and heads, while diversity alone does not directly reflect task relevance. As a result, prior methods may fail to capture how individual tokens truly affect the model output.

ZOO-Prune is motivated by a simple question: which visual tokens most strongly influence the final prediction? To answer this, we estimate token sensitivity through zeroth-order perturbations, yielding an output-aware and training-free importance signal without requiring backpropagation.

Motivation

The core idea of ZOO-Prune is to measure token importance through token sensitivity: how much the model output changes when a token is perturbed.

To make this practical for large vision-language models, we estimate sensitivity with zeroth-order perturbations at the lightweight projection layer. This avoids backpropagation and provides an efficient output-aware importance signal.

We then combine this sensitivity signal with token diversity, so that the selected subset remains both informative and non-redundant under aggressive pruning.

Method

ZOO-Prune consists of two main components.

ZOO-based Sensitivity Estimation. We estimate the importance of each visual token by measuring output changes under small random perturbations. This forward-only procedure provides a practical approximation of token influence without requiring gradients.
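The forward-only procedure above can be sketched as a finite-difference probe: perturb each projected token with small random directions and average the resulting output change. This is an illustrative reconstruction, not the paper's exact implementation; `forward_fn`, `num_probes`, and `eps` are assumed names and settings chosen for clarity.

```python
import numpy as np

def zoo_token_sensitivity(tokens, forward_fn, num_probes=4, eps=1e-3, seed=0):
    """Estimate per-token sensitivity via zeroth-order perturbation.

    tokens:     (N, D) array of projected visual token embeddings.
    forward_fn: maps a (N, D) token array to an output vector
                (a stand-in for the model's downstream response).
    Returns a length-N array of sensitivity scores (higher = more influential).
    """
    rng = np.random.default_rng(seed)
    base = forward_fn(tokens)          # unperturbed output, computed once
    scores = np.zeros(len(tokens))
    for i in range(len(tokens)):
        for _ in range(num_probes):
            u = rng.standard_normal(tokens.shape[1])
            u /= np.linalg.norm(u)     # unit random direction
            perturbed = tokens.copy()
            perturbed[i] += eps * u    # nudge only token i
            # Finite-difference estimate of the output change due to token i.
            scores[i] += np.linalg.norm(forward_fn(perturbed) - base) / eps
    return scores / num_probes
```

Because every probe is a forward pass, this gives an output-aware signal without computing or storing any gradients.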

Sensitivity-Aware Diversity Selection. Based on the estimated sensitivity scores, we select a subset of tokens that balances importance and diversity. This helps retain task-relevant information while avoiding redundant selections.
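A minimal sketch of such a selection rule is a greedy, MMR-style trade-off: repeatedly pick the token with the best combination of sensitivity and dissimilarity to the tokens already chosen. The weighting parameter `lam` and the cosine-similarity penalty are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def select_tokens(tokens, sensitivity, k, lam=0.5):
    """Greedily keep k tokens, balancing importance and diversity.

    tokens:      (N, D) token embeddings.
    sensitivity: (N,) importance scores (e.g. from ZOO estimation).
    k:           number of tokens to retain.
    lam:         importance-vs-redundancy trade-off (assumed parameter).
    """
    norm = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    sim = norm @ norm.T                       # pairwise cosine similarity
    selected = [int(np.argmax(sensitivity))]  # start from the most sensitive token
    while len(selected) < k:
        remaining = [i for i in range(len(tokens)) if i not in selected]
        # Gain = weighted sensitivity minus penalty for similarity
        # to any already-selected token.
        gains = [lam * sensitivity[i]
                 - (1 - lam) * max(sim[i, j] for j in selected)
                 for i in remaining]
        selected.append(remaining[int(np.argmax(gains))])
    return sorted(selected)
```

With a small `lam`, a near-duplicate of an already-selected token is skipped even if its sensitivity is high, which is what keeps the retained subset non-redundant under aggressive pruning.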

Since the sensitivity estimation is performed at the lightweight projection layer, ZOO-Prune can be applied efficiently to existing VLMs as a plug-in pruning strategy.

BibTeX

If you find our work useful, please consider citing:

@article{kim2025training,
  title={Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models},
  author={Kim, Youngeun and Zhang, Youjia and Liu, Huiling and Jung, Aecheon and Lee, Sunwoo and Hong, Sungeun},
  journal={arXiv preprint arXiv:2509.24837},
  year={2025}
}