Instruction Tuning
We construct instruction tuning dataset for 2D/3D modality instruction tuning via GPT API.
2D Instruction Tuning Datasets
2D instruction tuning datasets are build on MS-COCO, Bamboo, Locount and TextVQA dataset. You can download them from here.
The generated instruction-following dialogues are organized into the following meta files. We provide a table to illustrate the correspondence between each meta file and data collection:
Meta file name | Size | Data file name | Size |
---|---|---|---|
daily_dialogue_49k.json | 112M | coco_images.zip | 7.8G |
detailed_description_49k.json | 65.5M | coco_images.zip | 7.8G |
factual_knowledge_dialogue_42k.json | 83.2M | bamboo_images.zip | 5.4G |
vision_task_dialogue_46k.json | 64.8M | coco_images.zip, bamboo_images.zip, locount_images.zip, textvqa_images.zip | 9.2G |
LAMM_instruct_186k.json | 325M | - | - |
Note that we provide a LAMM_instruct_186k.json
meta file to merge all the dataset across different tasks. You can just use this file for training.
Additional Detection Instruction
The lack of sufficient detection instructions results in poor performance on downstream PASCAL VOC evaluation. To overcome this problem, we leverage entire COCO detection annotations to generate instructions, and add them into the aforementioned datasets as supplementation.
Meta file name | Size | Data file name | Size |
---|---|---|---|
coco_detection_117k.json | 116M | coco_images.zip | 7.8G |
octavius_2d_train_293k.json | 399M | - | - |
Note that we provide a octavius_2d_train_293k.json
meta file to merge all the dataset across different tasks. You can just use this file for training.
3D Instruction Tuning Datasets
We provide two 3D instruction tuning datasets, "Scan2Inst" and "LAMM3D-Dataset", for 3D instruction tuning.
Scan2Inst
Note: Compared with LAMM3D-Dataset, Scan2Inst provide more tasks and instruction-following dialogues, therefore we highly recommend you use Scan2Inst rather than LAMM3D-Dataset for 3D instruction tuning.
Scan2Inst is build on ScanNet. Specifically, we first use FCAF3D from mmdetection3d to extract 3d object given a scene level point cloud. Then a ULIP-like encoder is used to extract linguistic-aligned object level 3d feature. In the end, to speed up the data loading process, we store the dataset to a pickle file. For convincely, we provide a processed pickle file (scan2inst_train.pickle) here, you can train our model by just loading this file.
Besides, if you want to utilize your own dataset, we also provide our ULIP model pretraining code, you can train your own ULIP model by following the instructions from src/tools/Octavius/ULIP/scripts/pretrain_pointbert.sh
. You can also use our pretrained model Here to extract your own dataset.
Corresponding meta file is here:
Meta file name | Size | Data file name | Size |
---|---|---|---|
scan2inst_train.json | 62.3M | scan2inst_train.pickle | 209M |
LAMM3D-Dataset
LAMM3D-Dataset is build on 3RScan and ShapeNet. You can download them from here.
Corresponding meta file is here:
Meta file name | Size | Data file name | Size |
---|---|---|---|
LAMM_3dinstruct_10k.json | 19.6M | 3rscan_pcls.zip, shapenet_pcls.zip | 929M |
Directory Structure
data
├── LAMM
│ ├── 2D_Instruct
│ │ ├── coco_images.zip
│ │ ├── bamboo_images.zip
│ │ ├── textvqa_images.zip
│ │ ├── locount_images.zip
│ │ └── meta_file
│ │ ├── daily_dialogue_49k.json
│ │ ├── detailed_description_49k.json
│ │ ├── factual_knowledge_dialogue_42k.json
│ │ └── vision_task_dialogue_46k.json
│ ├── 3D_Instruct
│ │ ├── 3rscan_pcls.zip
│ │ ├── shapenet_pcls.zip
│ │ └── meta_file
│ │ └── LAMM_3dinstruct_10k.json
│ └── ...
│
├── Octavius
│ ├── 2D_Instruct
│ │ ├── coco_images.zip
│ │ ├── bamboo_images.zip
│ │ ├── textvqa_images.zip
│ │ ├── locount_images.zip
│ │ └── meta_file
│ │ └── octavius_2d_train_293k.json
│ ├── 3D_Instruct
│ │ ├── scan2inst_train.pickle
│ │ └── meta_file
│ │ └── scan2inst_train.json
│ └── ...
│
└── ...
Meta File Format
2D instruction tuning data
[
{
"id": "000000019028", # image id
"image": "coco_images/000000019028.jpg", # image path
"conversations": [
{
"from": "human", # instruction
"value": "How is the kitchen in the image furnished?"
},
{
"from": "gpt", # response
"value": "The kitchen in the image is furnished with white cabinets and white appliances. There is a dishwasher, a stove, and a sink. On the stove, a blue towel hangs on the handle. A cutting board is placed on the dishwasher. There are also additional elements like a bowl of apples on the counter and a beige rug on the floor."
}
],
"task_type": "conversation", # task type
"src_image": "coco2017" # original dataset
},
{
...
}
]
3D instruction instruction data
[
{
"pcl": "shapenet_pcls/04256520_cb71cb7b36dbcb6f826fc8d57346a2e4_4096.npy",
"conversations": [
{
"from": "human",
"value": "What scenario does this point cloud belong to according to the model\u2019s prediction?"
},
{
"from": "gpt",
"value": "Through meticulous analysis, it becomes evident that the point cloud aligns with the characteristics of sofa,couch,lounge s cenario."
}
],
"task_type": "classification3d",
"src_dataset": "ShapeNet",
"src_id": "04256520_cb71cb7b36dbcb6f826fc8d57346a2e4"
},
{
...
}
]