Benchmarking
We provide 2D/3D/ChEF benchmarking datasets for downstream evaluation.
2D Benchmarking Datasets
2D benchmarking datasets are built on the Flickr30k, CIFAR-10, FSC147, CelebA, UCMerced, LSP, PASCAL VOC, SVT, AI2D, and ScienceQA datasets. You can download them from here.
The corresponding meta files are listed below:
Meta file name | Meta file size | Data file name | Data file size |
---|---|---|---|
Caption_flickr30k.json | 598K | flickr30k_images.zip | 559M |
Classification_CIFAR10.json | 2.6M | cifar10_images.zip | 8.9M |
Counting_FSC147.json | 7.3M | fsc147_images.zip | 44M |
Detection_VOC2012.json | 6.4M | voc2012_images.zip | 196M |
Facial_Classification_CelebA(Hair).json | 2.4M | celeba_images.zip | 566M |
Facial_Classification_CelebA(Smile).json | 3.7M | celeba_images.zip | 566M |
Fine-grained_Classification_UCMerced.json | 676K | ucmerced_images.zip | 317M |
Keypoints_Dectection_LSP.json | 3.9M | lsp_images.zip | 9.9M |
Locating_FSC147.json | 7.5M | fsc147_images.zip | 44M |
Locating_LSP.json | 3.9M | lsp_images.zip | 9.9M |
Locating_VOC2012.json | 6.0M | voc2012_images.zip | 196M |
OCR_SVT.json | 68K | svt_images.zip | 82M |
VQA_AI2D.json | 2.1M | ai2d_images.zip | 559M |
VQA_SQAimage.json | 3.6M | sqaimage_images.zip | 127M |
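To sanity-check a download, here is a minimal loading sketch. It assumes each meta file is a JSON array of per-sample records (the exact schema varies by task, so inspect it rather than hard-coding field names); the path is illustrative.

```python
import json

# Load one of the 2D benchmark meta files (path is illustrative;
# point it at wherever you placed the meta files).
with open("Caption_flickr30k.json") as f:
    meta = json.load(f)

print(f"{len(meta)} samples")
# The per-task schema is not documented here, so peek at the
# first entry instead of assuming field names.
first = meta[0]
print(first.keys() if isinstance(first, dict) else first)
```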
3D Benchmarking Datasets
We provide two 3D benchmarking datasets, "Scan2Inst-benchmark" and "LAMM3D-Dataset-benchmark".
Scan2Inst-benchmark
If your MLLM is trained on "Scan2Inst", use "Scan2Inst-benchmark" for evaluation.
We provide NR3D and ShapeNet for zero-shot evaluation, and ScanNet for fine-tuning evaluation. You can download the processed pickle files from here.
The corresponding meta files are listed below:
Meta file name | Meta file size | Data file name | Data file size |
---|---|---|---|
Caption_nr3d.json | 2.28M | Caption_nr3d.pickle | 25.41M |
Caption_scannet.json | 239.43K | Caption_scannet.pickle | 7.29M |
Classification_scannet.json | 249.80K | Classification_scannet.pickle | 7.38M |
Classification_shapenet.json | 1.09M | Classification_shapenet.pickle | 21.45M |
VQA_scannet.json | 231.64K | VQA_scannet.pickle | 4.82M |
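The data files here are Python pickles. Below is a minimal inspection sketch; the filename comes from the table above, and the internal layout is not documented here, so the code prints the structure rather than relying on it.

```python
import pickle

# Load a Scan2Inst benchmark pickle (adjust the path to your download).
with open("Caption_scannet.pickle", "rb") as f:
    data = pickle.load(f)

# Inspect the layout before writing any evaluation code against it.
print(type(data))
if isinstance(data, dict):
    print(list(data)[:10])              # first few keys
elif isinstance(data, (list, tuple)):
    print(len(data), type(data[0]))     # length and element type
```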
LAMM3D-Dataset-benchmark
If your MLLM is trained on "LAMM3D-Dataset", use "LAMM3D-Dataset-benchmark" for evaluation.
LAMM3D-Dataset-benchmark is built on ScanNet. You can download the data from here.
The corresponding meta files are listed below:
Meta file name | Meta file size | Data file name | Data file size |
---|---|---|---|
Detection_ScanNet.json | 1.7M | scannet_pcls.zip | 246M |
VG_ScanRefer.json | 3.7M | scannet_pcls.zip | 246M |
VQA_ScanQA_multiplechoice.json | 859K | scannet_pcls.zip | 246M |
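All three meta files share the same scannet_pcls.zip archive. A quick way to peek at its contents before extracting everything (a sketch; the path is illustrative):

```python
import zipfile

# List the point-cloud archive's contents without extracting it.
with zipfile.ZipFile("scannet_pcls.zip") as zf:
    names = zf.namelist()
print(f"{len(names)} files, e.g. {names[:5]}")
```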
ChEF Benchmarking Dataset
Omnibenchmark
Download Omnibenchmark for the fine-grained classification dataset and the Bamboo Label System for hierarchical category labels.
We meticulously sampled and labeled Omnibenchmark with a hierarchical chain of categories drawn from the Bamboo label system. Run the processing script:
python ChEF/data_process/Omnibenchmark.py
You can also directly download the labeled Omnibenchmark dataset from OpenXLab.
MMBench, MME and SEEDBench
Refer to MMBench, MME, and SEEDBench for the datasets and more details.
POPE
POPE is a specially labeled COCO dataset for hallucination evaluation, built on the COCO 2014 validation set. Download COCO and POPE.
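Once downloaded, the split files can be tallied with a short sketch. This assumes the upstream POPE JSON-lines format, one object with a "label" field ("yes"/"no") per line; adjust if your copy differs.

```python
import json
from collections import Counter

# Tally ground-truth answers in one POPE split (assumes JSON-lines
# with a "label" field, as in the upstream POPE release).
labels = Counter()
with open("coco_pope_random.json") as f:
    for line in f:
        labels[json.loads(line)["label"]] += 1
print(labels)
```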
MMBench_C and ScienceQA_C
MMBench_C and ScienceQA_C are datasets with image and text corruptions for robustness evaluation. You can download the MMBench_C and ScienceQA_C datasets directly from OpenXLab.
Directory Structure
data
├── ChEF
│   ├── Omnibenchmark_Bamboo
│   │   ├── meta_file
│   │   └── omnibenchmark_images
│   ├── MMBench_C
│   │   ├── images
│   │   ├── Image_Corruptions_info.json
│   │   ├── Text_Corruptions_info.json
│   │   └── MMBench_C.json
│   └── ScienceQA_C
│       ├── sqaimage_images
│       ├── Image_Corruptions_info.json
│       ├── Text_Corruptions_info.json
│       └── VQA_ScienceQA_C.json
├── Bamboo
│   └── sensexo_visual_add_academic_add_state_V4.visual.json
├── MMBench
│   ├── mmbench_dev_20230712.tsv
│   └── mmbench_test_20230712.tsv
├── MME_Benchmark_release_version
├── SEED-Bench
├── coco_pope
│   ├── val2014
│   ├── coco_pope_adversarial.json
│   ├── coco_pope_popular.json
│   └── coco_pope_random.json
└── ...
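To verify everything landed in the right place, here is a small sanity-check sketch. The data root and the spot-checked paths are taken from the tree above; adjust them to your setup.

```python
from pathlib import Path

# Root of the layout shown above (an assumption; change as needed).
ROOT = Path("data")

# A few representative paths from the directory tree.
EXPECTED = [
    "ChEF/Omnibenchmark_Bamboo/meta_file",
    "ChEF/MMBench_C/MMBench_C.json",
    "ChEF/ScienceQA_C/VQA_ScienceQA_C.json",
    "Bamboo/sensexo_visual_add_academic_add_state_V4.visual.json",
    "MMBench/mmbench_dev_20230712.tsv",
    "coco_pope/coco_pope_random.json",
]

for rel in EXPECTED:
    status = "ok" if (ROOT / rel).exists() else "MISSING"
    print(f"{status:8s}{rel}")
```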