RA-Touch

Retrieval-Augmented Touch Understanding with Enriched Visual Data

Sungkyunkwan University
* Denotes equal contribution

TL;DR: Tactile Perception using Vision-Language Data with ImageNet-T

Abstract

Visuo-tactile perception aims to understand an object's tactile properties, such as texture, softness, and rigidity. However, the field remains underexplored because collecting tactile data is costly and labor-intensive. We observe that visually distinct objects can exhibit similar surface textures or material properties. For example, a leather sofa and a leather jacket have different appearances but share similar tactile properties. This implies that tactile understanding can be guided by material cues in visual data, even without direct tactile supervision. In this paper, we introduce RA-Touch, a retrieval-augmented framework that improves visuo-tactile perception by leveraging visual data enriched with tactile semantics. We carefully recaption a large-scale visual dataset with tactile-focused descriptions, enabling the model to access tactile semantics typically absent from conventional visual datasets. A key challenge remains in effectively utilizing these tactile-aware external descriptions. RA-Touch addresses this by retrieving visual-textual representations aligned with tactile inputs and integrating them to focus on relevant textural and material properties. By outperforming prior methods on the TVL benchmark, our method demonstrates the potential of retrieval-based visual reuse for tactile understanding.

ImageNet-T

ImageNet-T is a dataset constructed from ImageNet, based on the intuition that objects made of similar materials may share similar tactile characteristics. It consists of image and tactile-centric caption pairs. We use GPT-4o mini for labeling; while it is an efficient and powerful labeler, its object-centric nature means it may not always produce accurate tactile descriptions from images alone. To address this, we guide the model to infer tactile properties by incorporating both class information and visual descriptions. The final dataset contains up to 150,000 image-caption pairs with open-vocabulary language labels, and includes subsets of 10k, 50k, and 100k samples.
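A rough sketch of this recaptioning step is shown below, assuming the OpenAI Python SDK. The prompt wording, function name, and input fields are illustrative rather than the exact pipeline; only the general recipe (image + class label + visual description fed to GPT-4o mini to produce a tactile-centric caption) follows the description above.

```python
# Illustrative sketch of tactile-focused recaptioning with GPT-4o mini.
# The prompt text and helper names are assumptions, not the released pipeline.
import base64
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "You are given an ImageNet image of class '{label}' and a short visual "
    "description: '{visual_desc}'. Infer how the main object's surface would "
    "feel to the touch and write one concise tactile-centric caption "
    "(texture, softness, rigidity, etc.)."
)

def tactile_caption(image_path: str, label: str, visual_desc: str) -> str:
    # Encode the image so it can be sent alongside the text prompt.
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": PROMPT.format(label=label, visual_desc=visual_desc)},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```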

Model Architecture and Training

RA-Touch is a tactile-centric retrieval-augmented framework designed to enhance visuo-tactile understanding by integrating external vision-language knowledge into the tactile representation learning process. Instead of relying solely on expensive tactile data collection, RA-Touch augments internal tactile features by retrieving and incorporating relevant texture-aware information from external sources. This is achieved through a modular pipeline composed of two key components: the Tactile-Guided Retriever and the Texture-Aware Integrator.

The Tactile-Guided Retriever fuses RGB and tactile inputs to form a joint query that retrieves semantically aligned external examples, prioritizing tactile relevance over visual similarity. This allows the model to access external samples that share similar texture, material, or compliance properties, even if they differ visually. The Texture-Aware Integrator refines the retrieved features by applying cross-attention, where the tactile input serves as the query to extract texture-specific cues from visual and textual embeddings. These filtered features are integrated into the visuo-tactile prompt, enriching the tactile representation with context that supports fine-grained texture reasoning. Together, these modules enable RA-Touch to generalize tactile understanding without large-scale tactile supervision, making it a scalable and adaptable framework for downstream tasks in robotics and material perception.
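A minimal PyTorch sketch of these two modules is given below. The fusion operator, embedding dimensions, number of retrieved neighbors, and the use of standard multi-head cross-attention are assumptions made for illustration; they are not the exact RA-Touch implementation.

```python
# Illustrative sketch only: the real RA-Touch fusion and attention design may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TactileGuidedRetriever(nn.Module):
    """Fuses RGB and tactile embeddings into a joint query and retrieves
    the top-k most similar entries from an external embedding bank."""
    def __init__(self, dim: int):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)  # assumed fusion: concat + project

    def forward(self, rgb_emb, tactile_emb, bank_emb, k: int = 5):
        # rgb_emb, tactile_emb: (B, dim); bank_emb: (N, dim)
        query = self.fuse(torch.cat([rgb_emb, tactile_emb], dim=-1))
        sims = F.normalize(query, dim=-1) @ F.normalize(bank_emb, dim=-1).T
        topk = sims.topk(k, dim=-1).indices          # (B, k)
        return bank_emb[topk]                        # (B, k, dim)

class TextureAwareIntegrator(nn.Module):
    """Cross-attention with the tactile embedding as the query, extracting
    texture-relevant cues from the retrieved visual/textual features."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tactile_emb, retrieved):       # (B, dim), (B, k, dim)
        q = tactile_emb.unsqueeze(1)                 # (B, 1, dim)
        fused, _ = self.attn(q, retrieved, retrieved)
        return tactile_emb + fused.squeeze(1)        # enriched tactile feature
```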

RA-Touch Quantitative Results

We measure the performance of our model on a tactile-semantic understanding task.

Scores are on a 1–10 scale for the SSVTP, HCT, and TVL subsets.

Encoder                           SSVTP          HCT            TVL            p-value
---------------------------------------------------------------------------------------
LLaVA-1.5 7B                      3.64           3.55           3.56           1.21 × 10⁻⁹
LLaVA-1.5 13B                     3.55           3.63           3.62           1.49 × 10⁻⁸
ViP-LLaVA 7B                      2.72           3.44           3.36           8.77 × 10⁻¹⁴
ViP-LLaVA 13B                     4.10           3.76           3.83           1.72 × 10⁻⁶
LLaMA-Adapter                     2.56           3.08           3.02           2.68 × 10⁻¹⁷
BLIP-2 Opt-6.7B                   2.02           2.72           2.64           1.92 × 10⁻³¹
InstructBLIP 7B                   1.40           1.71           1.44           1.07 × 10⁻⁸⁴
InstructBLIP 13B                  1.44           1.21           1.24           4.64 × 10⁻⁸⁸
GPT-4V                            5.02           4.42           4.49           –
GPT-4-Turbo                       4.91           5.00           4.99           1.25 × 10⁻⁵
GPT-4o                            4.44           4.59           4.57           0.4532
GPT-4o mini                       4.02           4.72           4.64           0.2101
TVL-LLaMA (ViT-Tiny)              6.09           4.79           4.94           4.24 × 10⁻⁵
 + RA-Touch (ImageNet-T 10k)      6.21 (+0.12)   5.09 (+0.30)   5.22 (+0.28)   1.13 × 10⁻¹³
 + RA-Touch (ImageNet-T 150k)     6.27 (+0.18)   5.11 (+0.32)   5.24 (+0.30)   1.08 × 10⁻¹³
TVL-LLaMA (ViT-Small)             5.81           4.77           4.89           6.02 × 10⁻⁴
 + RA-Touch (ImageNet-T 10k)      6.13 (+0.32)   5.07 (+0.30)   5.19 (+0.30)   7.52 × 10⁻¹²
 + RA-Touch (ImageNet-T 150k)     6.21 (+0.40)   5.13 (+0.36)   5.26 (+0.37)   2.89 × 10⁻¹³
TVL-LLaMA (ViT-Base)              6.16           4.89           5.03           3.46 × 10⁻⁶
 + RA-Touch (ImageNet-T 10k)      6.73 (+0.57)   5.13 (+0.24)   5.32 (+0.29)   2.31 × 10⁻¹⁴
 + RA-Touch (ImageNet-T 150k)     6.83 (+0.67)   5.17 (+0.28)   5.36 (+0.33)   7.15 × 10⁻¹⁶

RA-Touch consistently outperforms the baseline TVL-LLaMA across all backbone configurations (ViT-Tiny, Small, and Base). Notably, the ViT-Base variant achieves the highest overall performance, with the TVL score improving from 5.03 to 5.36 (+0.33) on the tactile-semantic understanding task. Similar improvements are observed on the SSVTP (+0.67) and HCT (+0.28) subsets, indicating that RA-Touch learns more generalized tactile representations across diverse settings. All observed performance gains are statistically significant at the α = 0.05 level, supporting the conclusion that RA-Touch's retrieval-augmented strategy captures tactile semantics more effectively than traditional vision-tactile alignment alone. Furthermore, RA-Touch consistently surpasses existing open-source vision-language and multimodal models, highlighting its effectiveness in tactile understanding without requiring additional tactile supervision.
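This page does not state which statistical test produced the reported p-values. As a purely illustrative example of how significance at α = 0.05 could be checked, a paired t-test over per-example scores might look like the sketch below; the choice of test and the function name are assumptions.

```python
# Hypothetical significance check: a paired t-test over per-example scores
# from the same benchmark items. Not necessarily the test used by the authors.
from scipy.stats import ttest_rel

def paired_significance(scores_ra_touch, scores_baseline, alpha=0.05):
    """scores_* are per-example scores (1-10) on identical test items."""
    t_stat, p_value = ttest_rel(scores_ra_touch, scores_baseline)
    return p_value, p_value < alpha
```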

RA-Touch Qualitative Results

To evaluate whether RA-Touch retrieves samples with similar material properties, we conducted a qualitative analysis using the TVL dataset.

Top-1 to Top-5 Retrieval Samples. RA-Touch generally retrieves samples that share similar textures with the query input, and we also observe high material consistency among the retrieved samples themselves. To qualitatively assess this behavior, we visualize Top-1 to Top-5 retrieval results using the TVL dataset. These examples demonstrate RA-Touch’s ability to retrieve texture-aligned samples in a visually diverse but materially coherent manner.

Comparison with Baseline Retrieval Methods. We further compare RA-Touch with various retrieval baselines using all possible combinations of query and retrieval modalities, including image-to-image, image-to-text, tactile-to-image, and tactile-to-text. All encoders are based on OpenCLIP; for tactile inputs, we use the TVL encoder aligned with the vision encoder. As illustrated in the figure, image-based retrieval tends to focus on background regions when they dominate the input. Tactile-only queries also struggle to consistently retrieve samples with similar textures. In contrast, the Tactile-Guided Retriever in RA-Touch modulates vision embeddings using tactile inputs, resulting in more semantically relevant and materially aligned retrievals. These findings highlight the importance of tactile-guided retrieval for effective texture understanding.
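The baseline query/retrieval combinations can be sketched as follows, assuming the open_clip package for the vision and text encoders. The checkpoint choice, embedding banks, and the `tvl_tactile_encoder` placeholder are illustrative, since the aligned TVL tactile encoder is not included here.

```python
# Sketch of the query/retrieval modality combinations compared above.
# open_clip usage is standard; tactile-side components are placeholders.
import torch
import torch.nn.functional as F
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")  # checkpoint is illustrative
tokenizer = open_clip.get_tokenizer("ViT-B-32")

@torch.no_grad()
def embed_image(img):                 # PIL image -> (1, d), L2-normalized
    return F.normalize(model.encode_image(preprocess(img).unsqueeze(0)), dim=-1)

@torch.no_grad()
def embed_text(caption: str):         # caption -> (1, d), L2-normalized
    return F.normalize(model.encode_text(tokenizer([caption])), dim=-1)

def retrieve(query_emb, bank_emb, k=5):
    """Cosine-similarity top-k retrieval; bank_emb is (N, d), pre-normalized."""
    sims = query_emb @ bank_emb.T
    return sims.topk(k, dim=-1).indices

# Baselines: image-to-image and image-to-text use embed_image as the query
# against an image or caption embedding bank; tactile-to-image and
# tactile-to-text replace the query with a (hypothetical) TVL tactile encoder
# aligned to the same embedding space, e.g. q = tvl_tactile_encoder(tactile).
```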

Citation

If you use this work or find it helpful, please consider citing our work.

To be updated ...
                

Credit: The design of this project page references the project pages of TVL, QA-TIGER, NeRF, CrossMAE, DeepMotionEditing, and LERF.