Skip to main content

Instruction Tuning

We construct instruction tuning dataset for 2D/3D modality instruction tuning via GPT API.

2D Instruction Tuning Datasets

2D instruction tuning datasets are build on MS-COCO, Bamboo, Locount and TextVQA dataset. You can download them from here.

The generated instruction-following dialogues are organized into the following meta files. We provide a table to illustrate the correspondence between each meta file and data collection:

Meta file nameSizeData file nameSize
daily_dialogue_49k.json112Mcoco_images.zip7.8G
detailed_description_49k.json65.5Mcoco_images.zip7.8G
factual_knowledge_dialogue_42k.json83.2Mbamboo_images.zip5.4G
vision_task_dialogue_46k.json64.8Mcoco_images.zip, bamboo_images.zip, locount_images.zip, textvqa_images.zip9.2G
LAMM_instruct_186k.json325M--

Note that we provide a LAMM_instruct_186k.json meta file to merge all the dataset across different tasks. You can just use this file for training.

Additional Detection Instruction

The lack of sufficient detection instructions results in poor performance on downstream PASCAL VOC evaluation. To overcome this problem, we leverage entire COCO detection annotations to generate instructions, and add them into the aforementioned datasets as supplementation.

Meta file nameSizeData file nameSize
coco_detection_117k.json116Mcoco_images.zip7.8G
octavius_2d_train_293k.json399M--

Note that we provide a octavius_2d_train_293k.json meta file to merge all the dataset across different tasks. You can just use this file for training.

3D Instruction Tuning Datasets

We provide two 3D instruction tuning datasets, "Scan2Inst" and "LAMM3D-Dataset", for 3D instruction tuning.

Scan2Inst

Note: Compared with LAMM3D-Dataset, Scan2Inst provide more tasks and instruction-following dialogues, therefore we highly recommend you use Scan2Inst rather than LAMM3D-Dataset for 3D instruction tuning.

Scan2Inst is build on ScanNet. Specifically, we first use FCAF3D from mmdetection3d to extract 3d object given a scene level point cloud. Then a ULIP-like encoder is used to extract linguistic-aligned object level 3d feature. In the end, to speed up the data loading process, we store the dataset to a pickle file. For convincely, we provide a processed pickle file (scan2inst_train.pickle) here, you can train our model by just loading this file.

Besides, if you want to utilize your own dataset, we also provide our ULIP model pretraining code, you can train your own ULIP model by following the instructions from src/tools/Octavius/ULIP/scripts/pretrain_pointbert.sh. You can also use our pretrained model Here to extract your own dataset.

Corresponding meta file is here:

Meta file nameSizeData file nameSize
scan2inst_train.json62.3Mscan2inst_train.pickle209M

LAMM3D-Dataset

LAMM3D-Dataset is build on 3RScan and ShapeNet. You can download them from here.

Corresponding meta file is here:

Meta file nameSizeData file nameSize
LAMM_3dinstruct_10k.json19.6M3rscan_pcls.zip, shapenet_pcls.zip929M

Directory Structure

data
├── LAMM
│ ├── 2D_Instruct
│ │ ├── coco_images.zip
│ │ ├── bamboo_images.zip
│ │ ├── textvqa_images.zip
│ │ ├── locount_images.zip
│ │ └── meta_file
│ │ ├── daily_dialogue_49k.json
│ │ ├── detailed_description_49k.json
│ │ ├── factual_knowledge_dialogue_42k.json
│ │ └── vision_task_dialogue_46k.json
│ ├── 3D_Instruct
│ │ ├── 3rscan_pcls.zip
│ │ ├── shapenet_pcls.zip
│ │ └── meta_file
│ │ └── LAMM_3dinstruct_10k.json
│ └── ...

├── Octavius
│ ├── 2D_Instruct
│ │ ├── coco_images.zip
│ │ ├── bamboo_images.zip
│ │ ├── textvqa_images.zip
│ │ ├── locount_images.zip
│ │ └── meta_file
│ │ └── octavius_2d_train_293k.json
│ ├── 3D_Instruct
│ │ ├── scan2inst_train.pickle
│ │ └── meta_file
│ │ └── scan2inst_train.json
│ └── ...

└── ...

Meta File Format

2D instruction tuning data

[
{
"id": "000000019028", # image id
"image": "coco_images/000000019028.jpg", # image path
"conversations": [
{
"from": "human", # instruction
"value": "How is the kitchen in the image furnished?"
},
{
"from": "gpt", # response
"value": "The kitchen in the image is furnished with white cabinets and white appliances. There is a dishwasher, a stove, and a sink. On the stove, a blue towel hangs on the handle. A cutting board is placed on the dishwasher. There are also additional elements like a bowl of apples on the counter and a beige rug on the floor."
}
],
"task_type": "conversation", # task type
"src_image": "coco2017" # original dataset
},
{
...
}
]

3D instruction instruction data

[
{
"pcl": "shapenet_pcls/04256520_cb71cb7b36dbcb6f826fc8d57346a2e4_4096.npy",
"conversations": [
{
"from": "human",
"value": "What scenario does this point cloud belong to according to the model\u2019s prediction?"
},
{
"from": "gpt",
"value": "Through meticulous analysis, it becomes evident that the point cloud aligns with the characteristics of sofa,couch,lounge s cenario."
}
],
"task_type": "classification3d",
"src_dataset": "ShapeNet",
"src_id": "04256520_cb71cb7b36dbcb6f826fc8d57346a2e4"
},
{
...
}
]

Sign up to get email updates on the LAMM or email us at openlamm@gmail.com.
© 2024. LAMM