Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models

1University of Wisconsin-Madison,
2Salesforce AI Research, 3Microsoft Research
NeurIPS 2024
SpatialEval Overview

SpatialEval is a benchmark for evaluating the spatial intelligence of LLMs and VLMs across four key dimensions: spatial relationships, positional understanding, object counting, and navigation. The benchmark comprises four distinct tasks: Spatial-Map for comprehending spatial relationships between objects in map-based scenarios; Maze-Nav for testing navigation through complex environments; Spatial-Grid for evaluating spatial reasoning within structured environments; and Spatial-Real for real-world spatial understanding. Each task incorporates three input modalities: Text-only (TQA), Vision-only (VQA), and Vision-Text (VTQA).
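
As a quick orientation, here is a minimal sketch (plain Python; illustrative only, not the authors' released evaluation code) of the task-by-modality grid described above:

# Minimal sketch of the SpatialEval task x input-modality grid.
# Task and modality names follow the description above; everything else
# (the dictionary layout, the printing loop) is illustrative.
from itertools import product

# Each task and the spatial dimension it probes.
TASKS = {
    "Spatial-Map": "spatial relationships between objects in map-based scenarios",
    "Maze-Nav": "navigation through complex environments",
    "Spatial-Grid": "spatial reasoning within structured environments",
    "Spatial-Real": "real-world spatial understanding",
}
MODALITIES = ["TQA", "VQA", "VTQA"]  # Text-only, Vision-only, Vision-Text

# 4 tasks x 3 modalities = 12 evaluation settings per model.
for (task, focus), modality in product(TASKS.items(), MODALITIES):
    print(f"{task:<12} [{modality:<4}] -> {focus}")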

Introduction

Large language models (LLMs) and vision-language models (VLMs) have demonstrated remarkable performance across a wide range of tasks and domains. Despite this promise, spatial understanding and reasoning — a fundamental component of human cognition — remains under-explored. We propose SpatialEval, a novel benchmark that covers diverse aspects of spatial reasoning such as relationship understanding, navigation, and counting. We conduct a comprehensive evaluation of competitive language and vision-language models. Our findings reveal several counter-intuitive insights that have been overlooked in the literature: (1) Spatial reasoning poses significant challenges where competitive models can fall behind random guessing; (2) Despite additional visual input, VLMs often under-perform compared to their LLM counterparts; (3) When both textual and visual information is available, multi-modal language models become less reliant on visual information if sufficient textual clues are provided. Additionally, we demonstrate that leveraging redundancy between vision and text can significantly enhance model performance. We hope our study will inform the development of multimodal models to improve spatial intelligence and further close the gap with human intelligence.

Main Results

Model | Input Modality | Term | Description
LLM | Text-only | TQA (LLM) | Text-only input that includes all necessary information to answer questions without visual context.
VLM | Text-only | TQA (VLM) | Text-only input as in TQA (LLM), but applied to VLMs (e.g., the LLaVA family).
VLM | Vision-only | VQA | Input includes only an image, without a corresponding textual description.
VLM | Vision-text | VTQA | Input includes both an image and its textual description.

Summary of Terminology in SpatialEval.
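
The terminology above maps directly onto how inputs are assembled when scripting an evaluation. The sketch below is a hypothetical illustration: the record fields (text_context, image, question) and the make_setting helper are assumptions, not taken from the authors' codebase.

# Illustrative mapping from each SpatialEval setting to the model family it
# targets and the inputs it receives. Field names are hypothetical.
def make_setting(term, record):
    """Return the model family and inputs for one evaluation setting."""
    text, image = record["text_context"], record["image"]
    settings = {
        "TQA (LLM)": {"model": "LLM", "text": text, "image": None},
        "TQA (VLM)": {"model": "VLM", "text": text, "image": None},
        "VQA":       {"model": "VLM", "text": None, "image": image},
        "VTQA":      {"model": "VLM", "text": text, "image": image},
    }
    setting = settings[term]
    setting["question"] = record["question"]  # the question is always provided
    return setting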

💡 Insight 1: Only a few models outperform random guessing on spatial reasoning tasks

[Figure: SpatialEval main results]

💡 Insight 2: Visual input does not necessarily help: TQA (LLM) > VQA (VLM)

[Figure: TQA vs. VQA]

💡 Insight 3: Proprietary models show trends similar to open-source models

[Figure: results on proprietary models]

💡 Insight 4: Leveraging redundancy in multimodal inputs can improve VLM performance

[Figure: VQA vs. VTQA]
Comparison | Summary of Findings
TQA (LLM) vs. VQA | VQA rarely enhances performance compared to TQA (LLM).
VTQA vs. TQA (VLM) | VLMs exhibit improved performance on spatial reasoning tasks when the image input is absent.
VQA vs. VTQA | Given the same image input, an additional textual description enhances VLM performance.
TQA (VLM) vs. TQA (LLM) | Multimodal fine-tuning enhances an LLM's spatial reasoning ability.
TQA (LLM) vs. VTQA | No definitive winner.

Summary of findings across different input modalities.

Ablation Studies

To better understand how VLMs process visual information, we conduct a series of controlled experiments in the VTQA (vision-text input) setting. For each sample, we replace the original image input (which matches the textual description) with one of the following: (1) No Image: keep only the textual input and drop the image; (2) Noise Image: a Gaussian noise image irrelevant to the task; or (3) Random Image: a random image from the dataset that does not match the textual description.
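
A minimal sketch of these three conditions is given below, assuming NumPy and Pillow; the noise parameters and the way the mismatched image is sampled are illustrative assumptions, since the text above does not specify those details.

# Sketch of the three VTQA image ablations; parameters are illustrative.
import random
import numpy as np
from PIL import Image

def ablate_image(original, mode, pool):
    """Return the image (or None) to pair with the textual description."""
    if mode == "no_image":
        return None  # (1) keep only the textual input
    if mode == "noise_image":
        # (2) task-irrelevant Gaussian noise, same size as the original image
        w, h = original.size
        noise = np.clip(np.random.normal(127, 40, (h, w, 3)), 0, 255)
        return Image.fromarray(noise.astype(np.uint8))
    if mode == "random_image":
        # (3) an image from the dataset that does not match the description
        return random.choice([img for img in pool if img is not original])
    raise ValueError(f"Unknown ablation mode: {mode}")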

[Figure: image ablations]

💡 Insight 5: VLMs exhibit improved performance when visual input is absent; LLMs benefit from multimodal training

[Figure: original vs. no image]

💡 Insight 6: A noise image can help VQA

[Figure: original vs. noise image]

💡 Insight 7: Mismatched multimodal information does not necessarily hurt

[Figure: original vs. random image]

Detailed Examples

Poster

BibTeX

@inproceedings{wang2024spatial,
  title={Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models},
  author={Wang, Jiayu and Ming, Yifei and Shi, Zhenmei and Vineet, Vibhav and Wang, Xin and Li, Yixuan and Joshi, Neel},
  booktitle={The Thirty-Eighth Annual Conference on Neural Information Processing Systems},
  year={2024}
}