Distilling Neuro-Symbolic Programs into 3D Multi-modal LLMs

Wentao Mo¹, Yang Liu¹

¹Wangxuan Institute of Computer Technology, Peking University

ICML 2026

Motivation

Interpretable programs meet flexible 3D MLLMs.

Figure: A three-panel teaser comparing 3D MLLMs, neuro-symbolic methods, and APEIRIA. 3D multi-modal LLMs support complex instructions and open-set concepts but reason as black boxes. Neuro-symbolic methods offer interpretable reasoning and program generalizability but limited support for complex instructions and open-set concepts. APEIRIA combines neuro-symbolic interpretability with 3D MLLM flexibility through transparent 3D chain-of-thought.

Abstract

From rigorous neuro-symbolic reasoning to open-world 3D understanding.

Current 3D spatial reasoning methods face a trade-off: neuro-symbolic 3D learners provide interpretable compositional programs but remain tied to closed-set vocabularies, while end-to-end 3D MLLMs can handle free-form language yet often reason as opaque pattern matchers.

APEIRIA (from the Greek απειρια, "unlimited") bridges these paradigms by distilling verified symbolic execution traces into natural-language chain-of-thought (CoT). A three-stage curriculum first aligns 3D object features with the LLM so that it learns to see in 3D, then injects systematic program-style reasoning via CoT-SFT, and finally extends the learned patterns to open-set concepts and complex nested instructions through CoT-RL.

Method

Progressive curriculum-based reasoning distillation.

APEIRIA learns to see, think, and adapt through a simple-to-complex progression that preserves the clarity of symbolic programs.

Figure: Method overview showing the three-stage curriculum and program-to-chain-of-thought translation.

Stage 1

Perception Alignment

Teach the 3D MLLM to recognize basic 3D scenes and objects by grounding categories, attributes, locations, and captions before program distillation.

Stage 2

Symbolic Reasoning Injection

Serialize neuro-symbolic programs into CoT traces that expose plans, object IDs, locations, and step outputs.
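
As a rough illustration of what serializing a program into a CoT trace could look like, the sketch below renders a list of already-executed program steps as numbered text. The `Step` class, the operator names, and the output wording are hypothetical and chosen for illustration only; the page does not specify the actual trace template.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One executed program step (hypothetical structure, not from the paper)."""
    op: str        # operator name, e.g. "filter" or "relate"
    args: tuple    # operator arguments, e.g. ("vase",)
    output: list   # object IDs returned by the symbolic executor


def serialize_program(steps):
    """Render executed program steps as a numbered chain-of-thought trace."""
    lines = []
    for i, step in enumerate(steps, start=1):
        arg_text = ", ".join(str(a) for a in step.args)
        ids = ", ".join(str(o) for o in step.output) or "none"
        lines.append(f"Step {i}: {step.op}({arg_text}) -> objects [{ids}]")
    # The output of the last step is treated as the final answer.
    final = steps[-1].output if steps else []
    lines.append(f"Answer: {final}")
    return "\n".join(lines)


trace = serialize_program([
    Step("filter", ("vase",), [0, 1]),
    Step("filter", ("computer",), [4, 11]),
    Step("relate", ("left", "vase", "computer"), [0]),
])
print(trace)
```

Because every intermediate output is written into the text, a reader (or a reward function) can check each step against the executor's result rather than only the final answer.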

Stage 3

Open-Set Generalization

Use GRPO (Group Relative Policy Optimization) with format and soft grounding rewards to adapt reasoning to real-world language and deeper nesting.
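
The following is a minimal sketch of how a format reward, a soft grounding reward, and GRPO-style group-relative advantages might fit together. The page does not define the rewards, so everything here is an assumption: the `Plan: ... Answer:` format check, the use of set-level F1 over object IDs as the "soft" grounding signal, and the population standard deviation in the normalization.

```python
import re
from statistics import mean, pstdev

# Assumed trace format: a "Plan:" section followed later by an "Answer:" section.
TRACE_PATTERN = re.compile(r"Plan:.*Answer:", re.DOTALL)


def format_reward(response):
    """1.0 if the response follows the assumed plan-then-answer layout, else 0.0."""
    return 1.0 if TRACE_PATTERN.search(response) else 0.0


def soft_grounding_reward(pred_ids, gold_ids):
    """Partial credit: F1 overlap between predicted and gold object IDs (assumed)."""
    pred, gold = set(pred_ids), set(gold_ids)
    true_pos = len(pred & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and spread of its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two sampled responses for the same instruction; gold target is object 0.
responses = ["Plan: find vases\nAnswer: 0", "The vase is object 1."]
predictions = [[0], [1]]
rewards = [
    format_reward(text) + soft_grounding_reward(pred, [0])
    for text, pred in zip(responses, predictions)
]
advantages = group_relative_advantages(rewards)
```

The group-relative step needs no learned value function: a response is rewarded only for being better than the other samples drawn for the same prompt.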

Results

Strong spatial reasoning with transparent traces.

APEIRIA improves over prior neuro-symbolic methods and matches or exceeds competitive 3D MLLMs on ScanRefer and Multi3DRefer. With modular enhancement, it achieves the best score among the listed methods on all four reported metrics.

ScanRefer Acc@0.25: 60.5 (APEIRIA with modular enhancement)
ScanRefer Acc@0.5: 53.2 (best reported in the paper table)
Multi3DRefer F1@0.25: 60.9 (full modular enhancement)
Multi3DRefer F1@0.5: 55.2 (full modular enhancement)

Method                          Output  ScanRefer Acc@0.25  ScanRefer Acc@0.5  Multi3DRefer F1@0.25  Multi3DRefer F1@0.5
Video-3D LLM                    Head    58.1                51.7               58.0                  52.7
Inst3D-LMM                      Text    57.8                51.6               58.3                  53.5
APEIRIA                         Text    58.4                51.2               59.2                  53.8
APEIRIA + modular enhancement   Text    60.5                53.2               60.9                  55.2

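
For readers unfamiliar with the Acc@0.25 and Acc@0.5 columns, the sketch below shows one common way such a metric is computed: a prediction counts as correct when its 3D IoU with the ground-truth box reaches the threshold. It assumes axis-aligned boxes in a `(cx, cy, cz, dx, dy, dz)` layout and one prediction per query; the exact box format and comparison used by the benchmark's evaluation code may differ, and the F1 metrics of Multi3DRefer additionally require matching multiple predictions to multiple targets, which is not covered here.

```python
def box_iou_3d(a, b):
    """IoU of two axis-aligned boxes given as (cx, cy, cz, dx, dy, dz)."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis] - a[axis + 3] / 2, b[axis] - b[axis + 3] / 2)
        hi = min(a[axis] + a[axis + 3] / 2, b[axis] + b[axis + 3] / 2)
        if hi <= lo:
            return 0.0  # no overlap along this axis means no overlap at all
        inter *= hi - lo
    vol_a = a[3] * a[4] * a[5]
    vol_b = b[3] * b[4] * b[5]
    return inter / (vol_a + vol_b - inter)


def accuracy_at_iou(preds, golds, threshold):
    """Fraction of queries whose predicted box reaches the IoU threshold."""
    hits = sum(box_iou_3d(p, g) >= threshold for p, g in zip(preds, golds))
    return hits / len(golds)


# Toy example: the first prediction overlaps its target, the second misses.
preds = [(0, 0, 0, 2, 2, 2), (0, 0, 0, 2, 2, 2)]
golds = [(1, 0, 0, 2, 2, 2), (10, 0, 0, 2, 2, 2)]
acc_025 = accuracy_at_iou(preds, golds, 0.25)
acc_05 = accuracy_at_iou(preds, golds, 0.5)
```

The stricter 0.5 threshold explains why the Acc@0.5 and F1@0.5 columns are consistently lower than their 0.25 counterparts.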
Reasoning Behavior

The model exposes its spatial verification path.

CoT-RL encourages APEIRIA to preserve a planning-then-execution structure while extending beyond fixed program templates. The model can filter open-vocabulary descriptors, compose multi-condition relations, and summarize grounded object IDs.

Trace example (object IDs, 3D positions, relation checks)

Instruction: Find the vase left to the computer. Provide its ID, position, and dimensions.

01 Plan
    Examine all objects; find vases; find computers; check which vase is left to a computer.

02 Scene
    Object 0(vase), 1(vase), 2(bottle), 3(rug), ...

03 Filter
    Vase 0 (1.15, 6.09, 1.33)
    Vase 1 (6.66, 5.41, 0.29)
    Computer 4 (2.34, 3.50, 0.75)
    Computer 11 (1.44, 4.04, 3.10)

04 Relate
    • Vase 0 is left to Computer 4.
    • Vase 0 is above and near Computer 11.
    • Vase 1 is right to and near Computer 4.
    • Vase 1 is right to, in front of and far from Computer 11.

05 Answer
    Object 0: position (1.15, 6.09, 1.33), size 0.86 x 0.99 x 1.79.
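
Because the Answer step is plain text with a regular shape, it can be turned back into a box for evaluation. The sketch below parses an answer line like the one in the trace and converts it to min and max corners. The regular expression is fitted to this single example, and treating the reported position as the box center is an assumption; the page does not state the convention.

```python
import re

# Pattern fitted to the example answer line; other traces may be worded differently.
NUMBER = r"(\d+(?:\.\d+)?)"
ANSWER_RE = re.compile(
    rf"Object (\d+): position \(([^)]*)\), size {NUMBER} x {NUMBER} x {NUMBER}"
)


def parse_answer(line):
    """Extract object ID, position, and size from an answer line, or None."""
    match = ANSWER_RE.search(line)
    if match is None:
        return None
    center = tuple(float(v) for v in match.group(2).split(","))
    size = tuple(float(match.group(i)) for i in (3, 4, 5))
    return {"id": int(match.group(1)), "center": center, "size": size}


def to_corners(center, size):
    """Convert a center/size box to (min_corner, max_corner), assuming center-based positions."""
    lo = tuple(c - s / 2 for c, s in zip(center, size))
    hi = tuple(c + s / 2 for c, s in zip(center, size))
    return lo, hi


answer = parse_answer(
    "Object 0: position (1.15, 6.09, 1.33), size 0.86 x 0.99 x 1.79."
)
corners = to_corners(answer["center"], answer["size"])
```

Returning `None` on a non-matching line lets a caller distinguish a malformed trace from a wrong but well-formed answer.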

Modularity

External perception and plans can lift reasoning performance.

Because APEIRIA keeps planning and execution decoupled, the same reasoning scaffold can accept a stronger planner or a stronger scene parser at inference time. Plans from Claude 4.5 Opus help slightly, while better perception from SegDINO3D drives the larger gains, suggesting that the current bottleneck is mostly visual rather than planning.

Modularity analysis: inference-time module replacement

Source                      ScanRefer  Multi3DRefer
Self plan + perception      58.4       59.2
Claude 4.5 Opus plan        58.6       59.5
SegDINO3D perception        60.4       60.6
Full modular enhancement    60.5       60.9
Oracle upper bound          61.3       61.3

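
To make the comparison between planner and perception gains concrete, the short sketch below computes each configuration's improvement over the self plan + perception baseline, using only the numbers from the modularity table. The function name `gains_over` is illustrative.

```python
# (ScanRefer, Multi3DRefer) scores copied from the modularity table.
RESULTS = {
    "Self plan + perception": (58.4, 59.2),
    "Claude 4.5 Opus plan": (58.6, 59.5),
    "SegDINO3D perception": (60.4, 60.6),
    "Full modular enhancement": (60.5, 60.9),
    "Oracle upper bound": (61.3, 61.3),
}


def gains_over(baseline, results):
    """Per-benchmark improvement of every other row over the baseline row."""
    base = results[baseline]
    return {
        name: tuple(round(value - ref, 1) for value, ref in zip(scores, base))
        for name, scores in results.items()
        if name != baseline
    }


gains = gains_over("Self plan + perception", RESULTS)
for name, (scanrefer, multi3drefer) in gains.items():
    print(f"{name}: +{scanrefer} ScanRefer, +{multi3drefer} Multi3DRefer")
```

Swapping the planner adds 0.2 and 0.3 points, while swapping perception adds 2.0 and 1.4, which is the arithmetic behind the claim that perception is the larger bottleneck.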
Citation

BibTeX

Please consider citing APEIRIA if you find this work helpful to your research.

@inproceedings{mo2026,
  title={Distilling Neuro-Symbolic Programs into 3D Multi-modal LLMs},
  author={Mo, Wentao and Liu, Yang},
  booktitle={International Conference on Machine Learning},
  year={2026}
}