OpenScan: A Benchmark for Generalized Open-Vocabulary 3D Scene Understanding

Youjun Zhao1, Jiaying Lin1,2, Shuquan Ye1,3, Qianshi Pang4, Rynson W. H. Lau1

1City University of Hong Kong, 2The Hong Kong University of Science and Technology
3The Chinese University of Hong Kong, 4South China University of Technology

AAAI 2026

TL;DR:

We propose GOV-3D, a task where 3D scenes are queried by attributes (e.g., affordance and material) instead of only object classes.

OpenScan is a benchmark with attribute annotations covering eight linguistic aspects.
OpenScan teaser illustration.
GOV-3D extends classic OV-3D from class queries to attribute queries.
Task & Benchmark

From OV-3D to GOV-3D

GOV-3D Task

Problem

Classical open-vocabulary 3D scene understanding (OV-3D) localizes objects from object class names. GOV-3D extends it to attribute-level queries, such as affordance and material: the model must both locate the correct object and understand the semantics behind the attribute.
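
As a minimal sketch of what answering a GOV-3D query involves, assuming a hypothetical model with a joint 3D-text embedding space (the model interface below is illustrative, not any specific method's API):

import numpy as np

def answer_gov3d_query(model, point_cloud, query):
    """Rank candidate instances against an attribute-level query.

    point_cloud: (N, 6) array of XYZ + RGB points for one scene.
    query: an attribute phrase, e.g. "something to sit on" (affordance)
           rather than the class name "chair".
    Returns the point mask of the best-matching instance.
    """
    # Hypothetical interface: the model embeds the text query and each
    # candidate instance into a shared feature space.
    text_feat = model.encode_text(query)             # (D,)
    masks, inst_feats = model.segment(point_cloud)   # (K, N), (K, D)
    # Cosine similarity between each instance and the query.
    sims = inst_feats @ text_feat / (
        np.linalg.norm(inst_feats, axis=1) * np.linalg.norm(text_feat) + 1e-8
    )
    return masks[np.argmax(sims)]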

Why a New Benchmark?

Motivation

Existing benchmarks like ScanNet, ScanNet200, and ScanNet++ mainly label object classes and cannot directly assess how well models understand attributes such as affordance and material. OpenScan fills this gap by providing attribute annotations on top of ScanNet200, allowing us to measure 3D understanding from multiple linguistic aspects.

Overview of OpenScan GOV-3D benchmark.
The OpenScan benchmark provides attribute annotations for each object, expanding the single object-class label into eight linguistic aspects of object-related attributes.

Eight Linguistic Aspects of Object Attributes

Attribute Categories

OpenScan expands each object from a single class label into a rich set of attributes across eight linguistic aspects (an illustrative annotation record follows the list):

  • Affordance – indicates the object's function or usage (e.g., "sit" for a chair).
  • Property – indicates an object characteristic (e.g., "soft" for a pillow).
  • Type – indicates the object's category or group (e.g., "a communication device" for a telephone).
  • Manner – indicates the object's behavior (e.g., "worn on a head" for a hat).
  • Synonym – indicates a term with a similar meaning (e.g., "image" for a picture).
  • Requirement – indicates an essential condition that an object must possess to fulfill a specific need (e.g., "balance to ride" for a bicycle).
  • Element – indicates an individual component or part that constitutes the object (e.g., "two wheels" for a bicycle).
  • Material – indicates the type of material of the object (e.g., "plastic" for a bottle).
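
To make the annotation structure concrete, here is an illustrative record for one object. The field names are our own sketch, not the benchmark's actual file format; the requirement and element entries reuse the bicycle examples above, while the remaining aspects are plausible fill-ins.

# Illustrative OpenScan-style record (field names are assumptions).
bicycle_annotation = {
    "scene_id": "scene0000_00",
    "object_id": 17,
    "class": "bicycle",
    "attributes": {
        "affordance":  ["ride"],
        "property":    ["two-wheeled"],
        "type":        ["a means of transport"],
        "manner":      ["ridden by pedaling"],
        "synonym":     ["bike"],
        "requirement": ["balance to ride"],
        "element":     ["two wheels"],
        "material":    ["metal"],
    },
}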
Dataset

OpenScan at a Glance

Benchmark Statistics

Overview

OpenScan is constructed on top of ScanNet200 and inherits its large-scale 3D indoor scenes while adding attribute annotations. Each object can be queried through multiple attributes, enabling attribute-centric evaluation in both semantic and instance settings.
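
As a small sketch of how such records support attribute-centric evaluation, queries can be grouped by linguistic aspect so that each aspect is scored separately (record fields follow the illustrative schema above):

from collections import defaultdict

def build_queries(annotations):
    """Group (scene_id, object_id, attribute term) triplets by aspect."""
    queries = defaultdict(list)
    for obj in annotations:
        for aspect, terms in obj["attributes"].items():
            for term in terms:
                queries[aspect].append(
                    (obj["scene_id"], obj["object_id"], term)
                )
    return queries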

Overall statistics of the OpenScan dataset.
OpenScan benchmark statistics of object-related attributes for the eight linguistic aspects.
Experiments

How Well Do Existing OV-3D Methods Generalize?

3D Instance Segmentation on OpenScan

Quantitative

We benchmark multiple state-of-the-art OV-3D models under the attribute-centric GOV-3D evaluation. While these methods perform strongly on class-level queries, their performance drops significantly on attribute queries, revealing a large gap between class recognition and attribute understanding.
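
For aggregation across the eight aspects, per-aspect results can be bucketed and averaged. The sketch below uses mean IoU of matched predictions purely for illustration; the benchmark's reported numbers are standard instance-segmentation metrics, not this simplified score.

from collections import defaultdict
import numpy as np

def per_aspect_scores(matched):
    """matched: iterable of (aspect, iou) pairs, one per matched prediction.
    Returns a mean score per linguistic aspect plus the overall mean.
    Illustrative aggregation only, not the official metric.
    """
    buckets = defaultdict(list)
    for aspect, iou in matched:
        buckets[aspect].append(iou)
    scores = {a: float(np.mean(v)) for a, v in buckets.items()}
    scores["overall"] = float(np.mean(list(scores.values())))
    return scores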

3D instance segmentation results on OpenScan.
3D instance segmentation performance of several OV-3D methods on the OpenScan benchmark.
Radar chart of quantitative GOV-3D performance across attribute aspects.
Radar chart of quantitative GOV-3D performance across eight linguistic aspects on our OpenScan benchmark.

Qualitative Findings

Qualitative

Visual results show that models often succeed on class queries but fail on GOV-3D attribute queries.

Qualitative results on OpenScan.
Qualitative results of Open3DIS on our OpenScan benchmark. The GT objects and outputs are highlighted in color.
Failure cases on GOV-3D.
Typical failure patterns on GOV-3D: complex commonsense knowledge, noisy 3D geometry, and tiny target objects.
Data Engine

Attribute Annotation via Knowledge Graph & Human Feedback

Knowledge-Driven Attribute Mining

Automatic

We first associate ScanNet200 objects with attributes using the ConceptNet knowledge graph. Graph edges connect object nodes and attribute nodes, allowing us to retrieve candidate attribute terms for each object and each linguistic aspect.
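
A hedged sketch of this mining step against ConceptNet's public web API follows; the aspect-to-relation mapping is one plausible choice and not necessarily the exact set used to build the benchmark.

import requests

# One plausible mapping from linguistic aspects to ConceptNet relations
# (an assumption; manner has no obvious single relation and is omitted).
ASPECT_RELATIONS = {
    "affordance":  "/r/UsedFor",
    "property":    "/r/HasProperty",
    "type":        "/r/IsA",
    "synonym":     "/r/Synonym",
    "requirement": "/r/HasPrerequisite",
    "element":     "/r/HasA",
    "material":    "/r/MadeOf",
}

def mine_attributes(obj_name, aspect, limit=20):
    """Retrieve candidate attribute terms for an object from ConceptNet."""
    params = {
        "start": "/c/en/" + obj_name.replace(" ", "_"),
        "rel": ASPECT_RELATIONS[aspect],
        "limit": limit,
    }
    resp = requests.get("https://api.conceptnet.io/query", params=params)
    edges = resp.json().get("edges", [])
    # Keep the surface label of each edge's tail node, e.g. "sitting on".
    return [e["end"]["label"] for e in edges]

# e.g. mine_attributes("chair", "affordance") -> candidate affordance terms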

OpenScan knowledge graph based data engine.
Illustration of the data generation process for our OpenScan benchmark.

Manual Annotation

Manual

For visual attributes that cannot be inferred without human perception, we manually annotate each 3D object through a web interface that lets annotators select each object's visual attributes.
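
The authors' tool is a web app; as a rough desktop stand-in for the same click-to-select interaction, Open3D's point-picking visualizer gives the flavor (this is not the actual annotation interface):

import open3d as o3d

def pick_object_points(ply_path):
    """Let an annotator shift+click points on the target object."""
    pcd = o3d.io.read_point_cloud(ply_path)
    vis = o3d.visualization.VisualizerWithEditing()
    vis.create_window(window_name="annotation sketch")
    vis.add_geometry(pcd)
    vis.run()  # rotate/zoom freely, shift+click to pick, close to finish
    vis.destroy_window()
    return vis.get_picked_points()  # indices of picked points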

Annotation interface used to label 3D attributes.
Web interface for manual annotation that allows users to view the 3D scene from multiple viewpoints and select the target object by clicking.
Visualization

Visualizations of OpenScan Benchmark Formats and Samples

Below are visualizations of OpenScan benchmark formats, together with samples from the eight linguistic aspects.

Examples of objects and corresponding attributes.
Examples of objects and corresponding attributes in our OpenScan benchmark.
Samples of affordance, property, type, and manner aspects.
Samples of synonym, requirement, element, and material aspects.
Citation

BibTeX

@article{zhao2024openscan,
  title   = {OpenScan: A Benchmark for Generalized Open-Vocabulary 3D Scene Understanding},
  author  = {Zhao, Youjun and Lin, Jiaying and Ye, Shuquan and Pang, Qianshi and Lau, Rynson W. H.},
  journal = {arXiv preprint arXiv:2408.11030},
  year    = {2024}
}