Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models

1University of Wisconsin-Madison,
2Salesforce AI Research, 3Microsoft Research
NeurIPS 2024
SpatialEval Overview

SpatialEval is a benchmark for evaluating the spatial intelligence of LLMs and VLMs across four key dimensions: spatial relationships, positional understanding, object counting, and navigation. The benchmark comprises four distinct tasks: Spatial-Map for comprehending spatial relationships between objects in map-based scenarios; Maze-Nav for testing navigation through complex environments; Spatial-Grid for evaluating spatial reasoning within structured environments; and Spatial-Real for real-world spatial understanding. Each task incorporates three input modalities: Text-only (TQA), Vision-only (VQA), and Vision-Text (VTQA).
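
As a quick orientation, here is a minimal sketch (plain Python; illustrative only, not the authors' released evaluation code) of the task-by-modality grid described above:

# Minimal sketch of the SpatialEval task x input-modality grid.
# Task and modality names follow the description above; everything else
# (the dictionary layout, the printing loop) is illustrative.
from itertools import product

# Each task and the spatial dimension it probes.
TASKS = {
    "Spatial-Map": "spatial relationships between objects in map-based scenarios",
    "Maze-Nav": "navigation through complex environments",
    "Spatial-Grid": "spatial reasoning within structured environments",
    "Spatial-Real": "real-world spatial understanding",
}
MODALITIES = ["TQA", "VQA", "VTQA"]  # Text-only, Vision-only, Vision-Text

# 4 tasks x 3 modalities = 12 evaluation settings per model.
for (task, focus), modality in product(TASKS.items(), MODALITIES):
    print(f"{task:<12} [{modality:<4}] -> {focus}")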

Introduction

Large language models (LLMs) and vision-language models (VLMs) have demonstrated remarkable performance across a wide range of tasks and domains. Despite this promise, spatial understanding and reasoning — a fundamental component of human cognition — remains under-explored. We propose SpatialEval, a novel benchmark that covers diverse aspects of spatial reasoning such as relationship understanding, navigation, and counting. We conduct a comprehensive evaluation of competitive language and vision-language models. Our findings reveal several counter-intuitive insights that have been overlooked in the literature: (1) Spatial reasoning poses significant challenges where competitive models can fall behind random guessing; (2) Despite additional visual input, VLMs often under-perform compared to their LLM counterparts; (3) When both textual and visual information is available, multi-modal language models become less reliant on visual information if sufficient textual clues are provided. Additionally, we demonstrate that leveraging redundancy between vision and text can significantly enhance model performance. We hope our study will inform the development of multimodal models to improve spatial intelligence and further close the gap with human intelligence.

Main Results

Model | Input Modality | Term | Description
LLM | Text-only | TQA (LLM) | Text-only input that includes all necessary information to answer questions without visual context.
VLM | Text-only | TQA (VLM) | Text-only input as in TQA (LLM), but applied to VLMs (e.g., the LLaVA family).
VLM | Vision-only | VQA | Input includes only an image, without a corresponding textual description.
VLM | Vision-text | VTQA | Input includes both an image and its textual description.

Summary of Terminology in SpatialEval.
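
The terminology above maps directly onto how inputs are assembled when scripting an evaluation. The sketch below is a hypothetical illustration: the record fields (text_context, image, question) and the make_setting helper are assumptions, not taken from the authors' codebase.

# Illustrative mapping from each SpatialEval setting to the model family it
# targets and the inputs it receives. Field names are hypothetical.
def make_setting(term, record):
    """Return the model family and inputs for one evaluation setting."""
    text, image = record["text_context"], record["image"]
    settings = {
        "TQA (LLM)": {"model": "LLM", "text": text, "image": None},
        "TQA (VLM)": {"model": "VLM", "text": text, "image": None},
        "VQA":       {"model": "VLM", "text": None, "image": image},
        "VTQA":      {"model": "VLM", "text": text, "image": image},
    }
    setting = settings[term]
    setting["question"] = record["question"]  # the question is always provided
    return setting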

💡 Insight 1: Only a few models outperform random guessing on spatial reasoning tasks

[Figure: SpatialEval main results]

💡 Insight 2: Visual input does not necessarily help: TQA (LLM) > VQA (VLM)

[Figure: TQA vs. VQA]

💡 Insight 3: Proprietary models show trends similar to open-source models

[Figure: results on proprietary models]

💡 Insight 4: Leveraging redundancy in multimodal inputs can improve VLM performance

[Figure: VQA vs. VTQA]
Comparison | Summary of Findings
TQA (LLM) vs. VQA | VQA rarely enhances performance compared to TQA (LLM).
VTQA vs. TQA (VLM) | VLMs exhibit improved performance on spatial reasoning tasks when the image input is absent.
VQA vs. VTQA | Given the same image input, an additional textual description enhances VLM performance.
TQA (VLM) vs. TQA (LLM) | Multimodal fine-tuning enhances an LLM's spatial reasoning ability.
TQA (LLM) vs. VTQA | No definitive winner.

Summary of findings across different input modalities.

Ablation Studies

To better understand how VLMs process visual information, we conduct a series of controlled experiments in the VTQA (vision-text input) setting. For each sample, we replace the original image input (which matches the textual description) with one of the following: (1) No Image: keep only the textual input and drop the image; (2) Noise Image: a Gaussian noise image irrelevant to the task; or (3) Random Image: a random image from the dataset that does not match the textual description.
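
A minimal sketch of these three conditions is given below, assuming NumPy and Pillow; the noise parameters and the way the mismatched image is sampled are illustrative assumptions, since the text above does not specify those details.

# Sketch of the three VTQA image ablations; parameters are illustrative.
import random
import numpy as np
from PIL import Image

def ablate_image(original, mode, pool):
    """Return the image (or None) to pair with the textual description."""
    if mode == "no_image":
        return None  # (1) keep only the textual input
    if mode == "noise_image":
        # (2) task-irrelevant Gaussian noise, same size as the original image
        w, h = original.size
        noise = np.clip(np.random.normal(127, 40, (h, w, 3)), 0, 255)
        return Image.fromarray(noise.astype(np.uint8))
    if mode == "random_image":
        # (3) an image from the dataset that does not match the description
        return random.choice([img for img in pool if img is not original])
    raise ValueError(f"Unknown ablation mode: {mode}")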

[Figure: image ablations]

💡 Insight 5: VLMs exhibit improved performance when visual input is absent; LLMs benefit from multimodal training

[Figure: original vs. no image]

💡 Insight 6: A noise image can help VQA

[Figure: original vs. noise image]

💡 Insight 7: Mismatched multimodal information does not necessarily hurt

[Figure: original vs. random image]

Detailed Examples

Poster

BibTeX

@inproceedings{wang2024spatial,
  title={Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models},
  author={Wang, Jiayu and Ming, Yifei and Shi, Zhenmei and Vineet, Vibhav and Wang, Xin and Li, Yixuan and Joshi, Neel},
  booktitle={The Thirty-Eighth Annual Conference on Neural Information Processing Systems},
  year={2024}
}