We construct instruction tuning datasets for the 2D and 3D modalities using the GPT API.

2D Instruction Tuning Datasets

The generated instruction-following dialogues are organized into the following meta files. The table below maps each meta file to its data collection:

Meta file name	Meta file size	Data file name	Data file size
daily_dialogue_49k.json	112M	coco_images.zip	7.8G
detailed_description_49k.json	83.2M	bamboo_images.zip	5.4G
vision_task_dialogue_46k.json	64.8M	coco_images.zip, bamboo_images.zip, locount_images.zip, textvqa_images.zip	9.2G
LAMM_instruct_186k.json	325M	/	/

Note that LAMM_instruct_186k.json merges the datasets across all tasks, so you can use this single file for training.
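Before training, it can help to check what a downloaded meta file contains. The following is a minimal sketch, not part of the released code. It assumes each meta file is a top-level JSON list with one entry per sample; the schema is not documented here, so the helper names `load_meta` and `summarize` are hypothetical and the code only reports the keys it finds.

```python
import json
from pathlib import Path


def load_meta(path):
    """Load a meta file and return its records.

    Assumption: the file holds a top-level JSON list with one entry per
    instruction-following sample. Inspect the result before relying on
    any particular field.
    """
    with Path(path).open("r", encoding="utf-8") as f:
        records = json.load(f)
    if not isinstance(records, list):
        raise ValueError(
            f"expected a JSON list in {path}, got {type(records).__name__}"
        )
    return records


def summarize(records):
    """Return the record count and the sorted keys seen across dict records."""
    keys = set()
    for record in records:
        if isinstance(record, dict):
            keys.update(record.keys())
    return len(records), sorted(keys)
```

For example, `summarize(load_meta("LAMM_instruct_186k.json"))` would return the sample count and the field names, provided the file matches the assumed layout.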

Additional Detection Instruction

The lack of sufficient detection instructions results in poor performance on the downstream PASCAL VOC evaluation. To address this, we use the entire set of COCO detection annotations to generate instructions and add them to the datasets above as a supplement.

Meta file name	Meta file size	Data file name	Data file size
coco_detection_117k.json	116M	coco_images.zip	7.8G
octavius_2d_train_293k.json	339M	/	/
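If you want to combine several meta files yourself, for instance to add detection instructions to a custom mix, plain concatenation is one option. The sketch below assumes each meta file is a top-level JSON list. The function name `merge_meta_files` is hypothetical, and this is not necessarily how the released merged files were built; they may have been filtered or sampled differently.

```python
import json
from pathlib import Path


def merge_meta_files(input_paths, output_path):
    """Concatenate several meta files into one and return the record count.

    Assumption: every input file is a top-level JSON list. Records are
    appended in input order, with no deduplication or sampling.
    """
    merged = []
    for path in input_paths:
        with Path(path).open("r", encoding="utf-8") as f:
            records = json.load(f)
        if not isinstance(records, list):
            raise ValueError(f"expected a JSON list in {path}")
        merged.extend(records)
    with Path(output_path).open("w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False)
    return len(merged)
```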


3D Instruction Tuning Datasets

We provide two 3D instruction tuning datasets: “Scan2Inst” and “LAMM3D-Dataset”.

Scan2Inst

Scan2Inst is built on ScanNet. Specifically, we first use FCAF3D from mmdetection3d to extract 3D objects from a scene-level point cloud. A ULIP-like encoder then extracts language-aligned, object-level 3D features. Finally, to speed up data loading, we store the dataset in a pickle file. For convenience, we provide a processed pickle file (scan2inst_train.pickle) here; you can train our model by simply loading this file.

If you want to use your own dataset, we also provide our ULIP pretraining code. You can train your own ULIP model by following the instructions in src/tools/Octavius/ULIP/scripts/pretrain_pointbert.sh. You can also use our pretrained model, available here, to extract features from your own dataset.

Meta file name	Meta file size	Data file name	Data file size
scan2inst_train.json	62.3M	scan2inst_train.pickle	209M
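The internal layout of scan2inst_train.pickle is not described here, so a cautious first step is to load it and look at its top-level structure. The sketch below uses only the standard `pickle` module. The helper names `load_pickle` and `describe` are hypothetical, and loading may also require the libraries whose objects were pickled (for example an array library) to be installed.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Load a pickle file. Only unpickle files from a source you trust."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


def describe(obj, max_keys=5):
    """Return a short, structure-only description of the loaded object."""
    if isinstance(obj, dict):
        sample = list(obj.keys())[:max_keys]
        return f"dict with {len(obj)} keys, first keys: {sample}"
    if isinstance(obj, (list, tuple)):
        return f"{type(obj).__name__} with {len(obj)} items"
    return type(obj).__name__
```

Calling `describe(load_pickle("scan2inst_train.pickle"))` prints nothing by itself; pass the result to `print` to see whether the file holds a dict, a list, or another object.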

LAMM3D-Dataset

LAMM3D-Dataset is built on 3RScan and ShapeNet. You can download them from here.

Meta file name	Meta file size	Data file name	Data file size
LAMM_3dinstruct_10k.json	19.6M	3rscan_pcls.zip, shapenet_pcls.zip	929M

Directory Structure

data
├── LAMM
│   ├── 2D_Instruct
│   │   ├── coco_images.zip
│   │   ├── bamboo_images.zip
│   │   ├── textvqa_images.zip
│   │   ├── locount_images.zip
│   │   └── meta_file
│   │       ├── daily_dialogue_49k.json
│   │       ├── detailed_description_49k.json
│   │       ├── factual_knowledge_dialogue_42k.json
│   │       └── vision_task_dialogue_46k.json
│   ├── 3D_Instruct
│   │   ├── 3rscan_pcls.zip
│   │   ├── shapenet_pcls.zip
│   │   └── meta_file
│   │       └── LAMM_3dinstruct_10k.json
│   └── …
│
├── Octavius
│   ├── 2D_Instruct
│   │   ├── coco_images.zip
│   │   ├── bamboo_images.zip
│   │   ├── textvqa_images.zip
│   │   ├── locount_images.zip
│   │   └── meta_file
│   │       └── octavius_2d_train_293k.json
│   ├── 3D_Instruct
│   │   ├── scan2inst_train.pickle
│   │   └── meta_file
│   │       └── scan2inst_train.json
│   └── …
│
└── …
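After downloading, a quick check that the files sit where the tree above places them can catch path mistakes before training starts. The sketch below is a hypothetical helper, not part of the repository. The paths in `EXPECTED` are copied from the tree, and entries shown as "…" are left out because their contents are not specified.

```python
from pathlib import Path

# Paths relative to the data root, copied from the directory tree.
# Entries marked "…" in the tree are omitted.
EXPECTED = [
    "LAMM/2D_Instruct/coco_images.zip",
    "LAMM/2D_Instruct/bamboo_images.zip",
    "LAMM/2D_Instruct/textvqa_images.zip",
    "LAMM/2D_Instruct/locount_images.zip",
    "LAMM/2D_Instruct/meta_file/daily_dialogue_49k.json",
    "LAMM/2D_Instruct/meta_file/detailed_description_49k.json",
    "LAMM/2D_Instruct/meta_file/factual_knowledge_dialogue_42k.json",
    "LAMM/2D_Instruct/meta_file/vision_task_dialogue_46k.json",
    "LAMM/3D_Instruct/3rscan_pcls.zip",
    "LAMM/3D_Instruct/shapenet_pcls.zip",
    "LAMM/3D_Instruct/meta_file/LAMM_3dinstruct_10k.json",
    "Octavius/2D_Instruct/coco_images.zip",
    "Octavius/2D_Instruct/bamboo_images.zip",
    "Octavius/2D_Instruct/textvqa_images.zip",
    "Octavius/2D_Instruct/locount_images.zip",
    "Octavius/2D_Instruct/meta_file/octavius_2d_train_293k.json",
    "Octavius/3D_Instruct/scan2inst_train.pickle",
    "Octavius/3D_Instruct/meta_file/scan2inst_train.json",
]


def missing_files(data_root, expected=EXPECTED):
    """Return the expected relative paths that are not files under data_root."""
    root = Path(data_root)
    return [rel for rel in expected if not (root / rel).is_file()]
```

An empty list from `missing_files("data")` means every listed file was found; if you only train one of the two configurations, pass a shorter `expected` list.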