The advancement of 3D vision-language (3D VL) learning is hindered by several limitations of existing 3D VL datasets: they rarely require reasoning beyond a close range of objects within a single viewpoint, and their annotations often link instructions to single objects, missing richer contextual alignments between multiple objects. This significantly curtails the development of models capable of deep, multi-view 3D scene understanding over distant objects. To address these challenges, we introduce MV-ScanQA, a novel 3D question answering dataset in which 68% of questions explicitly require integrating information from multiple views (compared to less than 7% in existing datasets), thereby rigorously testing multi-view compositional reasoning. To facilitate the training of models for such demanding scenarios, we present TripAlign, a large-scale, low-cost 2D-3D-language pre-training corpus of 1M <2D view, set of 3D objects, text> triplets that explicitly aligns groups of contextually related objects with text, providing richer, view-grounded, multi-object multimodal alignment signals than previous single-object annotations. We further develop LEGO, a baseline method for the multi-view reasoning challenge in MV-ScanQA that leverages the strengths of pre-trained 2D large vision-language models (LVLMs) together with TripAlign. LEGO achieves state-of-the-art performance on established benchmarks, and its superior results on MV-ScanQA expose the limitations of prior models in complex multi-view scenarios, demonstrating the importance of MV-ScanQA and TripAlign in fostering robust 3D vision-language understanding.
MV-ScanQA introduces multi-view reasoning challenges by composing questions that require information integration across different viewpoints. Below we present key statistics about the dataset composition and characteristics.
Distribution showing how many views are needed to answer each question. Higher values indicate more complex multi-view reasoning requirements.
Breakdown of question types showing the diversity of reasoning tasks, including counting, spatial relations, and attribute queries.
68%
Multi-view Questions
10K
Total Questions
2.3
Average Views Required
Explore how MV-ScanQA combines multiple single-view questions to create challenging multi-view reasoning tasks. Each example shows how information from different viewpoints must be integrated to answer the composite question.
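Conceptually, a composed question counts as multi-view only when no single view shows every object it references. The following is a minimal Python sketch of that check; the visibility map and field names are illustrative assumptions, not the dataset's actual construction pipeline.

# Sketch (assumed data layout, not the official pipeline): a question is
# multi-view if no single view's visible-object set covers all objects
# the question mentions.
def is_multi_view(mentioned: set, visible_per_view: dict) -> bool:
    return not any(mentioned <= visible for visible in visible_per_view.values())

# Hypothetical example: the sofa and the TV are never co-visible.
views = {
    "view_a": {"sofa", "coffee_table"},
    "view_b": {"tv", "tv_stand"},
}
assert is_multi_view({"sofa", "tv"}, views)                # needs both views
assert not is_multi_view({"sofa", "coffee_table"}, views)  # one view suffices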
Each example presents its component single-view questions and answers, the composed multi-view question and answer, and two statistics: the minimum views required and the view overlap score.
An example set of views required to see all objects mentioned in the multi-view question.
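Finding a smallest set of views that covers all mentioned objects is an instance of set cover, for which a greedy approximation is the standard approach. The sketch below computes both statistics, assuming the view overlap score is the mean pairwise Jaccard overlap of the chosen views' visible-object sets; both the data layout and that definition are assumptions for illustration, as the page does not spell them out.

from itertools import combinations

def min_views_greedy(mentioned, visible_per_view):
    # Greedy set cover: repeatedly pick the view that sees the most
    # still-uncovered mentioned objects.
    uncovered, chosen = set(mentioned), []
    while uncovered:
        best = max(visible_per_view,
                   key=lambda v: len(visible_per_view[v] & uncovered))
        if not visible_per_view[best] & uncovered:
            raise ValueError("some mentioned objects appear in no view")
        chosen.append(best)
        uncovered -= visible_per_view[best]
    return chosen

def view_overlap_score(visible_per_view, views):
    # Assumed definition: mean pairwise Jaccard overlap of the views'
    # visible-object sets (0 = disjoint views, 1 = identical views).
    pairs = list(combinations(views, 2))
    return sum(len(visible_per_view[a] & visible_per_view[b]) /
               len(visible_per_view[a] | visible_per_view[b])
               for a, b in pairs) / max(len(pairs), 1)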
TripAlign provides rich multi-object contextual alignments between 2D views, a set of visible 3D objects, and natural language descriptions. Each triplet captures semantically related objects with their spatial relationships.
Figure: TripAlign examples showing the alignment between 2D views, 3D object groups, and textual descriptions. Each triplet provides rich contextual information about multiple related objects.
Multi-Object
Groups of contextually related objects
View-Grounded
Explicit 2D-3D correspondence
Low-Cost
Automated generation pipeline
TripAlign contains 1M triplets of <2D view, 3D objects, text> that provide rich multimodal alignment signals. Each triplet captures contextually related objects from multiple viewpoints with detailed textual descriptions.
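In code, one natural container for such records looks like the following; the schema (field names, id types, JSON layout) is an assumption based on the description above, not the released file format.

import json
from dataclasses import dataclass

@dataclass
class TripAlignTriplet:
    # <2D view, set of 3D objects, text>; schema assumed for illustration.
    scene_id: str        # e.g. a ScanNet scene identifier
    view_image: str      # path to the rendered 2D view
    object_ids: list     # instance ids of the contextually related 3D objects
    text: str            # view-grounded description of the object group

def load_triplets(path):
    with open(path) as f:
        return [TripAlignTriplet(**record) for record in json.load(f)]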
"In this cozy living room scene, there is a comfortable brown leather sofa positioned against the wall, with a wooden coffee table placed in front of it. The coffee table has some magazines and a small plant on top. To the right, there's a modern TV stand holding a flat-screen television. The spatial arrangement creates a typical entertainment area where the sofa faces the TV, with the coffee table serving as a functional surface between them."
1M
Total Triplets
1513
Unique Scenes
3.2
Avg. Well-Visible Objects per Triplet
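"Well visible" is not formally defined on this page; a common proxy is requiring an object's 2D projection to cover a minimum number of pixels in the view. A sketch under that assumption (the threshold and mask format are illustrative):

import numpy as np

def count_well_visible(instance_mask, min_pixels=500):
    # instance_mask: HxW array of per-pixel instance ids (0 = background).
    # An object counts as well visible if its projection covers at least
    # min_pixels pixels; the 500-pixel threshold is an assumed value.
    ids, counts = np.unique(instance_mask, return_counts=True)
    return sum(1 for i, c in zip(ids, counts) if i != 0 and c >= min_pixels)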
Note: Use the dropdown menus to explore different scenes and triplets. The 3D viewer supports mouse interaction for rotation and zoom.
@inproceedings{mo2025mvscanqa,
  title={Advancing 3D Scene Understanding with MV-ScanQA Multi-View Reasoning Evaluation and TripAlign Pre-training Dataset},
  author={Mo, Wentao and Chen, QingChao and Peng, Yuxin and Huang, Siyuan and Liu, Yang},
  year={2025},
}