The advancement of 3D vision-language (3D VL) learning is hindered by several limitations of existing 3D VL datasets: they rarely require reasoning beyond a close range of objects within a single viewpoint, and their annotations often link instructions to single objects, missing richer contextual alignments between multiple objects. This significantly curtails the development of models capable of deep, multi-view 3D scene understanding over distant objects. To address these challenges, we introduce MV-ScanQA, a novel 3D question answering dataset in which 68% of questions explicitly require integrating information from multiple views (compared to less than 7% in existing datasets), thereby rigorously testing multi-view compositional reasoning. To facilitate the training of models for such demanding scenarios, we present TripAlign, a large-scale, low-cost 2D-3D-language pre-training corpus of 1M <2D view, set of 3D objects, text> triplets that explicitly aligns groups of contextually related objects with text, providing richer, view-grounded, multi-object multimodal alignment signals than previous single-object annotations. We further develop LEGO, a baseline method for the multi-view reasoning challenge in MV-ScanQA that leverages the strengths of pre-trained 2D large vision-language models (LVLMs) together with TripAlign. LEGO achieves state-of-the-art performance on established benchmarks, and its superior results on MV-ScanQA expose the limitations of prior models in complex multi-view scenarios, demonstrating the importance of MV-ScanQA and TripAlign in fostering robust 3D vision-language understanding.
MV-ScanQA introduces multi-view reasoning challenges by composing questions that require information integration across different viewpoints. Below we present key statistics about the dataset composition and characteristics.
Distribution showing how many views are needed to answer each question. Higher values indicate more complex multi-view reasoning requirements.
Breakdown of question types showing the diversity of reasoning tasks, including counting, spatial relations, and attribute queries.
68%
Multi-view Questions
10K
Total Questions
2.3
Average Views Required
Explore how MV-ScanQA combines multiple single-view questions to create challenging multi-view reasoning tasks. Each example shows how information from different viewpoints must be integrated to answer the composite question.
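Conceptually, a composed question counts as multi-view only when no single view shows every object it references. The following is a minimal Python sketch of that check; the visibility map and field names are illustrative assumptions, not the dataset's actual construction pipeline.

# Sketch (assumed data layout, not the official pipeline): a question is
# multi-view if no single view's visible-object set covers all objects
# the question mentions.
def is_multi_view(mentioned: set, visible_per_view: dict) -> bool:
    return not any(mentioned <= visible for visible in visible_per_view.values())

# Hypothetical example: the sofa and the TV are never co-visible.
views = {
    "view_a": {"sofa", "coffee_table"},
    "view_b": {"tv", "tv_stand"},
}
assert is_multi_view({"sofa", "tv"}, views)                # needs both views
assert not is_multi_view({"sofa", "coffee_table"}, views)  # one view suffices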
Each example presents its component single-view questions and answers, the composed multi-view question and answer, and two statistics: the minimum views required and the view overlap score.
An example set of views required to see all objects mentioned in the multi-view question.
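Finding a smallest set of views that covers all mentioned objects is an instance of set cover, for which a greedy approximation is the standard approach. The sketch below computes both statistics, assuming the view overlap score is the mean pairwise Jaccard overlap of the chosen views' visible-object sets; both the data layout and that definition are assumptions for illustration, as the page does not spell them out.

from itertools import combinations

def min_views_greedy(mentioned, visible_per_view):
    # Greedy set cover: repeatedly pick the view that sees the most
    # still-uncovered mentioned objects.
    uncovered, chosen = set(mentioned), []
    while uncovered:
        best = max(visible_per_view,
                   key=lambda v: len(visible_per_view[v] & uncovered))
        if not visible_per_view[best] & uncovered:
            raise ValueError("some mentioned objects appear in no view")
        chosen.append(best)
        uncovered -= visible_per_view[best]
    return chosen

def view_overlap_score(visible_per_view, views):
    # Assumed definition: mean pairwise Jaccard overlap of the views'
    # visible-object sets (0 = disjoint views, 1 = identical views).
    pairs = list(combinations(views, 2))
    return sum(len(visible_per_view[a] & visible_per_view[b]) /
               len(visible_per_view[a] | visible_per_view[b])
               for a, b in pairs) / max(len(pairs), 1)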
TripAlign provides rich multi-object contextual alignments between 2D views, a set of visible 3D objects, and natural language descriptions. Each triplet captures semantically related objects with their spatial relationships.
Figure: TripAlign examples showing the alignment between 2D views, 3D object groups, and textual descriptions. Each triplet provides rich contextual information about multiple related objects.
Multi-Object
Groups of contextually related objects
View-Grounded
Explicit 2D-3D correspondence
Low-Cost
Automated generation pipeline
TripAlign contains 1M triplets of <2D view, 3D objects, text> that provide rich multimodal alignment signals. Each triplet captures contextually related objects from multiple viewpoints with detailed textual descriptions.
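In code, one natural container for such records looks like the following; the schema (field names, id types, JSON layout) is an assumption based on the description above, not the released file format.

import json
from dataclasses import dataclass

@dataclass
class TripAlignTriplet:
    # <2D view, set of 3D objects, text>; schema assumed for illustration.
    scene_id: str        # e.g. a ScanNet scene identifier
    view_image: str      # path to the rendered 2D view
    object_ids: list     # instance ids of the contextually related 3D objects
    text: str            # view-grounded description of the object group

def load_triplets(path):
    with open(path) as f:
        return [TripAlignTriplet(**record) for record in json.load(f)]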
"In this cozy living room scene, there is a comfortable brown leather sofa positioned against the wall, with a wooden coffee table placed in front of it. The coffee table has some magazines and a small plant on top. To the right, there's a modern TV stand holding a flat-screen television. The spatial arrangement creates a typical entertainment area where the sofa faces the TV, with the coffee table serving as a functional surface between them."
1M
Total Triplets
1513
Unique Scenes
3.2
Avg. Well-Visible Objects per Triplet
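"Well visible" is not formally defined on this page; a common proxy is requiring an object's 2D projection to cover a minimum number of pixels in the view. A sketch under that assumption (the threshold and mask format are illustrative):

import numpy as np

def count_well_visible(instance_mask, min_pixels=500):
    # instance_mask: HxW array of per-pixel instance ids (0 = background).
    # An object counts as well visible if its projection covers at least
    # min_pixels pixels; the 500-pixel threshold is an assumed value.
    ids, counts = np.unique(instance_mask, return_counts=True)
    return sum(1 for i, c in zip(ids, counts) if i != 0 and c >= min_pixels)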
Note: Use the dropdown menus to explore different scenes and triplets. The 3D viewer supports mouse interaction for rotation and zoom.
@inproceedings{mo2025mvscanqa,
  title={Advancing 3D Scene Understanding with MV-ScanQA Multi-View Reasoning Evaluation and TripAlign Pre-training Dataset},
  author={Mo, Wentao and Chen, QingChao and Peng, Yuxin and Huang, Siyuan and Liu, Yang},
  year={2025},
}