SD-VLM: Spatial Measuring and Understanding with Depth-Encoded Vision-Language Models

NeurIPS 2025
Pingyi Chen1,2,3,    Yujing Lou3,4,    Shen Cao3,    Jinhui Guo3,    Lubin Fan3,   
Yue Wu3,    Lin Yang2,    Lizhuang Ma4,    Jieping Ye3
1Zhejiang University     2Westlake University     3Alibaba Cloud     4Shanghai Jiao Tong University    
Figure: Teaser.

SD-VLM: Our work (1) proposes the Massive Spatial Measuring and Understanding (MSMU) dataset with precise spatial annotations, and (2) introduces a plug-and-play depth positional encoding method that strengthens VLMs' spatial awareness.



Abstract

While vision language models (VLMs) excel at 2D semantic visual understanding, their ability to quantitatively reason about 3D spatial relationships remains under-explored, largely because 2D images alone convey limited spatial information. In this paper, we analyze the obstacles to spatial understanding in VLMs and propose SD-VLM, a novel framework that significantly enhances their fundamental spatial perception abilities through two key contributions: (1) the Massive Spatial Measuring and Understanding (MSMU) dataset with precise spatial annotations, and (2) a simple depth positional encoding method that strengthens VLMs' spatial awareness. The MSMU dataset covers a wide range of quantitative spatial tasks with 700K QA pairs, 2.5M physical numerical annotations, and 10K chain-of-thought augmented samples. With it, we train SD-VLM, a strong generalist VLM with superior quantitative spatial measuring and understanding capability. SD-VLM not only achieves state-of-the-art performance on our proposed MSMU-Bench, but also generalizes to other spatial understanding benchmarks, including Q-Spatial and SpatialRGPT-Bench. Extensive experiments demonstrate that SD-VLM outperforms GPT-4o and Intern-VL3-78B by 26.91% and 25.56% respectively on MSMU-Bench. We will release the MSMU dataset and SD-VLM to facilitate future research in quantitative spatial measuring and understanding.



Dataset

Figure: Pipeline.

Starting from 3D scene point clouds, we first collect the spatial information (e.g., locations, sizes, relative distances) of objects in the scene to construct a scene graph. Next, we rasterize 3D instances into 2D images and establish a 3D-to-2D mapping, which enables transferring spatial annotations to images. We also perform filtering on both images and objects to ensure the quality of the QA pairs. Finally, we design human-verified QA templates and employ LLM collaboration to generate a rich set of QA pairs.

(step 1) Building Scene Graph. Given a 3D point cloud of a scene, we first construct a scene graph (stored as a JSON file) to systematically organize all annotations and metadata. This graph includes object categories and the corresponding 3D spatial localization data, which provides a bounding box for each object defined by centroid coordinates and dimensional parameters.
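For illustration, a single object entry in such a scene graph might look like the sketch below; the field names and units are our own assumptions rather than the exact MSMU schema.

```python
# A minimal, hypothetical scene-graph entry for one object; the exact field
# names and units in MSMU may differ (lengths here are assumed to be meters).
scene_graph_entry = {
    "scene_id": "scene0000_00",
    "objects": [
        {
            "object_id": 12,
            "category": "table",
            "centroid": [1.42, 0.85, 0.38],    # (x, y, z) in the scene frame
            "dimensions": [1.20, 0.70, 0.75],  # length x width x height
            "bbox_rotation_z": 0.0,            # yaw of the oriented bounding box
        }
    ],
}
```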

(step 2) Rasterize 3D Instances to 2D Images. We rasterize 3D instances onto images as masks using official tools. This process bridges each object in the 3D scene with the 2D image plane, making it feasible to transfer spatial annotations from the 3D scene graph to each image.
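Conceptually, the 3D-to-2D mapping amounts to a standard pinhole projection. The following sketch, which assumes known camera intrinsics and extrinsics rather than the official ScanNet tooling, projects an object centroid from scene coordinates into pixel coordinates.

```python
import numpy as np

def project_point(p_world, world_to_cam, K):
    """Project a 3D world point into pixel coordinates with a pinhole model.

    p_world:      (3,) point in world/scene coordinates
    world_to_cam: (4, 4) extrinsic matrix (world -> camera)
    K:            (3, 3) camera intrinsic matrix
    Returns (u, v, depth) or None if the point lies behind the camera.
    """
    p_cam = world_to_cam @ np.append(p_world, 1.0)   # homogeneous transform
    if p_cam[2] <= 0:                                # behind the image plane
        return None
    uvw = K @ p_cam[:3]
    return uvw[0] / uvw[2], uvw[1] / uvw[2], p_cam[2]
```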

(step 3) Image Filtering and Object Selection. We first sparsely sample the RGB images to reduce redundancy. We then carefully select objects in each image, guided by three principal criteria: (1) Prevalence and functionality. We focus on objects with clear functional purposes that are commonly encountered in indoor environments; architectural components (e.g., walls, ceilings) are excluded due to their limited interactive potential. (2) Instance visibility. Objects that are partially occluded (e.g., a chair mostly hidden behind a table), truncated by image borders (e.g., only a corner of a table is visible), or too small to annotate reliably (e.g., distant objects occupying fewer than 50 pixels) are excluded from our dataset. (3) Semantic disambiguation. Addressing linguistic ambiguity is important before generating annotations. For example, multiple tables in one image may vary in color or texture yet all be labeled "table", which introduces noisy correspondences and ultimately misleads VLMs. To mitigate this, we use Qwen2.5-VL to re-label these objects with more detailed descriptions, such as "the white table" or "the wooden table". Finally, we filter out non-informative images that contain no valid objects.
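A minimal sketch of this object-level filtering is shown below; the excluded-category list and the border-truncation test are illustrative assumptions, while the 50-pixel threshold follows the criterion above.

```python
import numpy as np

# Hypothetical filtering sketch following the three criteria above; only the
# 50-pixel threshold comes from the text, the rest is an assumed realization.
EXCLUDED_CATEGORIES = {"wall", "ceiling", "floor"}  # assumed architectural classes
MIN_PIXELS = 50

def keep_object(category: str, mask: np.ndarray) -> bool:
    """mask: boolean (H, W) array holding the object's rasterized 2D mask."""
    if category in EXCLUDED_CATEGORIES:
        return False                      # criterion 1: limited interactive potential
    if mask.sum() < MIN_PIXELS:
        return False                      # criterion 2: too small to annotate reliably
    touches_border = (mask[0].any() or mask[-1].any()
                      or mask[:, 0].any() or mask[:, -1].any())
    return not touches_border             # criterion 2: truncated by the image border
```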

(step 4) Template-based Generation. We carefully design a set of templates based on the task definitions, each containing placeholders. For instance, one template for measuring the size of a single target object is structured as follows: "Q: What is the size of [object A]? A: The size of [object A] is [Length]x[Width]x[Height]." For each image, we enumerate the selected objects and replace the placeholders with the corresponding object labels or spatial annotations. In tasks involving two or more target objects, we also meticulously craft instructions that incorporate all relevant object labels and spatial information.
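A minimal sketch of template filling for the single-object size task is given below; the dictionary layout and metric units are assumptions, and only the template wording follows the example above.

```python
# Sketch of template-based QA generation for the single-object size task.
SIZE_TEMPLATE = {
    "question": "What is the size of {label}?",
    "answer": "The size of {label} is {length:.2f}x{width:.2f}x{height:.2f} meters.",
}

def make_size_qa(obj):
    """obj: dict with a textual 'label' and 'dimensions' = [length, width, height]."""
    length, width, height = obj["dimensions"]
    return {
        "question": SIZE_TEMPLATE["question"].format(label=obj["label"]),
        "answer": SIZE_TEMPLATE["answer"].format(
            label=obj["label"], length=length, width=width, height=height),
    }

# Example: make_size_qa({"label": "the white table", "dimensions": [1.20, 0.70, 0.75]})
```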

(step 5) Eliciting Reasoning Paths via LLM Collaboration. To improve the quantitative spatial ability of VLMs, we augment the QA pairs with chain-of-thought (CoT) reasoning rationales that use reference objects, generated via LLM collaboration. Specifically, we randomly select one object as the reference and feed its spatial annotations together with the image to an advanced VLM, Qwen2.5-VL. The VLM is prompted to construct a reasoning path that leverages the reference object to infer the spatial properties of another object in the image. Subsequently, we use a large language model, DeepSeek-V3, to assess and filter the CoT pairs by evaluating their factual consistency and logical coherence.
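The sketch below outlines this augmentation loop under assumed interfaces: vlm_generate and llm_judge are placeholders standing in for calls to Qwen2.5-VL and DeepSeek-V3, and the prompts are illustrative rather than the ones used to build MSMU-CoT.

```python
import random

def build_cot_sample(image, objects, vlm_generate, llm_judge):
    """Hypothetical sketch of one CoT augmentation step.

    vlm_generate(image, prompt) -> str and llm_judge(prompt) -> bool are
    placeholder interfaces; the prompt wording is illustrative only.
    """
    reference, target = random.sample(objects, 2)
    prompt = (
        f"The {reference['label']} measures {reference['dimensions']} (LxWxH, meters). "
        f"Using it as a reference, reason step by step to estimate the size of "
        f"the {target['label']} in the image."
    )
    rationale = vlm_generate(image, prompt)
    keep = llm_judge(
        f"Ground-truth size of the {target['label']}: {target['dimensions']}.\n"
        f"Rationale: {rationale}\n"
        "Is this rationale factually consistent and logically coherent? Answer yes or no."
    )
    return {"question": prompt, "rationale": rationale} if keep else None
```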

We apply this data-generation pipeline to construct VQA pairs from ScanNet and ScanNet++. The resulting MSMU dataset contains 2K scenes, 25K images, 75K objects, 700K QA pairs, and 2.5M numerical values, covering a wide range of quantitative spatial tasks. In addition, the CoT-augmented subset, named MSMU-CoT, consists of 10K quantitative spatial reasoning QA pairs.



Model

Figure: Framework.

We introduce depth positional encoding (DPE), which encodes depth maps into depth positional embeddings that can be combined with the vision tokens by simple addition. We use sine and cosine functions of varying frequencies to generate the depth positional embeddings.
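A minimal sketch of such a sinusoidal depth positional encoding is shown below; the embedding dimension, frequency base, and patch-level depth pooling are our assumptions, not the exact configuration of SD-VLM.

```python
import torch

def depth_positional_encoding(depth: torch.Tensor, dim: int, base: float = 10000.0):
    """Encode per-patch depth values with sine/cosine functions of varying frequencies.

    depth: (num_patches,) metric depth per visual patch (e.g., the depth map
           averaged over each patch) -- this pooling choice is an assumption.
    dim:   embedding dimension of the vision tokens (must be even).
    Returns a (num_patches, dim) tensor that can be added to the vision tokens.
    """
    half = dim // 2
    # Geometric progression of frequencies: base**(-i/half) for i = 0..half-1.
    freqs = torch.exp(-torch.arange(half, dtype=torch.float32)
                      * (torch.log(torch.tensor(base)) / half))       # (half,)
    angles = depth[:, None] * freqs[None, :]                          # (N, half)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # (N, dim)

# vision_tokens: (num_patches, dim) features from the vision encoder
# fused = vision_tokens + depth_positional_encoding(patch_depth, vision_tokens.shape[-1])
```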

The model consists of a vision encoder that extracts image features, a depth encoding module that incorporates depth information, and a large language model that processes the token sequence. When depth maps are not available at inference time, we employ an external depth estimation model to generate them, which allows our model to adapt effectively to various datasets and scenarios.
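Putting the pieces together, the depth-aware encoding path can be sketched as follows, reusing depth_positional_encoding from the sketch above; the encoder and depth-estimator interfaces are illustrative placeholders rather than the actual SD-VLM implementation.

```python
import torch

def encode_with_depth(image, depth_map, vision_encoder, depth_estimator=None):
    """Sketch of the depth-aware visual encoding path (illustrative interfaces).

    vision_encoder(image)  -> (num_patches, dim) patch features.
    depth_estimator(image) -> (H, W) predicted depth, used only when no
    ground-truth depth map is available.
    """
    if depth_map is None:
        depth_map = depth_estimator(image)          # fall back to estimated depth
    tokens = vision_encoder(image)                  # (num_patches, dim)
    n, dim = tokens.shape
    # Pool the depth map to one value per patch (a square patch grid is assumed).
    side = int(n ** 0.5)
    patch_depth = torch.nn.functional.adaptive_avg_pool2d(
        depth_map[None, None], (side, side)).flatten()
    return tokens + depth_positional_encoding(patch_depth, dim)
```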



Results

Table: Results on MSMU-Bench (%).

| Model | Existence | Object Counting | Scale Est. | Grounding | Relative Position | Absolute Distance | Scale Comparison | Ref. Object Est. | Average |
|---|---|---|---|---|---|---|---|---|---|
| Large Language Models (LLMs): Text only | | | | | | | | | |
| GPT-4-Turbo | 12.76 | 5.21 | 13.51 | 12.64 | 24.84 | 7.50 | 36.79 | 12.04 | 15.66 |
| Qwen2.5 | 4.25 | 0.00 | 0.78 | 13.79 | 0.62 | 0.00 | 16.04 | 1.57 | 4.63 |
| DeepSeek-V3 | 0.00 | 5.24 | 1.54 | 6.90 | 10.56 | 0.00 | 25.47 | 5.24 | 7.39 |
| Vision-Language Models (VLMs): Image + Text | | | | | | | | | |
| GPT-4o | 44.68 | 41.67 | 3.86 | 27.59 | 67.08 | 20.00 | 54.72 | 2.09 | 32.28 |
| Gemini-2 | 38.30 | 43.75 | 23.94 | 19.54 | 54.66 | 12.50 | 69.81 | 18.85 | 35.17 |
| Qwen2.5-VL-72B | 59.57 | 35.42 | 1.54 | 13.79 | 57.76 | 2.50 | 66.04 | 9.95 | 30.82 |
| Qwen2.5-VL-32B | 29.79 | 41.67 | 10.81 | 18.39 | 60.25 | 2.50 | 46.23 | 10.99 | 27.59 |
| Qwen2.5-VL-7B | 12.76 | 4.17 | 0.00 | 1.15 | 1.24 | 0.00 | 5.66 | 0.52 | 3.19 |
| Intern-VL3-78B | 47.62 | 42.71 | 6.47 | 26.32 | 56.94 | 13.33 | 64.10 | 16.46 | 33.63 |
| Intern-VL3-8B | 36.17 | 41.67 | 4.63 | 18.39 | 60.25 | 2.50 | 49.06 | 8.38 | 28.54 |
| LLaVA-1.5-7B | 1.54 | 36.46 | 5.02 | 20.69 | 42.86 | 5.00 | 38.68 | 0.52 | 19.45 |
| Depth-encoded VLMs: Image + Depth + Text | | | | | | | | | |
| SpatialBot | 10.64 | 46.88 | 15.83 | 28.74 | 66.46 | 5.00 | 50.94 | 8.90 | 29.17 |
| SpatialRGPT | 10.64 | 36.46 | 20.08 | 17.24 | 60.25 | 15.00 | 62.26 | 9.95 | 28.98 |
| Ours | 87.23 | 47.92 | 51.35 | 42.53 | 75.16 | 40.00 | 55.66 | 46.07 | 56.31 |
| Ours w/ MSMU-CoT | 87.23 | 42.71 | 51.74 | 49.43 | 73.29 | 50.00 | 69.81 | 49.32 | 59.19 |


Visualization

Figure: Visualization of responses from different models on MSMU-Bench.

In this work, we identified a critical gap in the ability of vision-language models (VLMs) to perform quantitative spatial reasoning. To address it, we developed MSMU, a large-scale dataset comprising 700K QA pairs and 2.5M numerical physical annotations derived from real 3D scenes, designed to provide precise metric supervision for enhancing VLMs' spatial reasoning capabilities. We also introduced a simple but effective depth positional encoding module that integrates depth (third-dimension) information into the VLM framework, effectively upgrading the model's spatial awareness from 2D to 3D. This was shown to significantly enhance spatial reasoning, outperforming both RGB-only VLMs and prior depth-encoded VLMs. We anticipate that our contributions will pave the way for further advances in VLMs' spatial reasoning capabilities, enabling more effective operation in real-world environments.



BibTeX

@InProceedings{chen2025sdvlm,
    author = {Pingyi Chen and Yujing Lou and Shen Cao and Jinhui Guo and Lubin Fan and Yue Wu and Lin Yang and Lizhuang Ma and Jieping Ye},
    title = {SD-VLM: Spatial Measuring and Understanding with Depth-Encoded Vision-Language Models},
    booktitle = {NeurIPS},
    year = {2025}
}