Tutorial
Our code is here.
Once you have successfully configured the environment, the entire directory structure should be as follows:
LAMM
├── ckpt
│   ├── lamm_2d                  # saved checkpoints in training
│   └── …
├── data                         # dataset folder, see Dataset Preparation section for detail
│   ├── LAMM                     # LAMM dataset
│   ├── Octavius                 # Octavius dataset
│   ├── ChEF                     # ChEF dataset
│   └── …                        # your custom dataset
├── docs                         # document
├── images                       # readme assets
├── model_zoo                    # see Model Preparation for Training for detail
│   ├── vicuna_ckpt              # Vicuna-7B/13B
│   ├── epcl_vit-L_256tokens     # EPCL pretraining checkpoints (Optional)
│   └── …
├── requirements                 # python environment requirements
├── src
└── …
Training
Prepare Required Checkpoints
We provide pretrained weights of the visual encoder and the LLM. You can download them from the table below and place them in the model_zoo directory.
Model Name | Link |
---|---|
Vicuna | Link |
clip-vit-large-patch14-336 | Link |
epcl_vit-L_256tokens | Link |
Organize the pretrained weights as below:
model_zoo
├── vicuna_ckpt
│   ├── 13b_v0
│   └── 7b_v0
└── epcl_vit-L_256tokens
LAMM
- 2D Models Training
cd src
sh tools/LAMM/train_lamm2d.sh lamm_2d
# or
sh tools/LAMM/train_lamm2d_slurm.sh <YOUR_PARTITION> lamm_2d
- 3D Models Training
cd src
sh tools/LAMM/train_lamm3d.sh lamm_3d
# or
sh tools/LAMM/train_lamm3d_slurm.sh <YOUR_PARTITION> lamm_3d
For your reference, GPU memory consumption for different models is shown as follows:
Model Size | Sample Num/GPU | GPU Memory |
---|---|---|
Vicuna_v0_7B | 1 | ~30GB |
Vicuna_v0_7B | 2 | ~46GB |
Vicuna_v0_13B | 1 | ~53GB |
Vicuna_v0_13B | 2 | ~70GB |
Octavius
- Image modality only
cd src
sh tools/Octavius/train_octavius_slurm.sh <YOUR_PARTITION> <NUM_GPU> \
    config/Octavius/octavius_2d_e4_bs64.yaml octavius_2d_e4_bs64
- Point cloud modality only
cd src
sh tools/Octavius/train_octavius_slurm.sh <YOUR_PARTITION> <NUM_GPU> \
    config/Octavius/octavius_3d_e3_bs64.yaml octavius_3d_e3_bs64
- Image & point cloud modality joint
cd src
sh tools/Octavius/train_octavius_slurm.sh <YOUR_PARTITION> <NUM_GPU> \
    config/Octavius/octavius_2d+3d_e6_bs64.yaml octavius_2d+3d_e6_bs64
Model Zoo
We provide several pretrained LAMM/Octavius checkpoints here:
LAMM Model Zoo
# Training Samples | Vision Encoder | LLM | Training Data | Lora Rank | Link |
---|---|---|---|---|---|
98K | CLIP-ViT-L | Vicuna_v0_7B | LAMM-2D daily dialogue & description | 32 | Checkpoints |
186K | CLIP-ViT-L | Vicuna_v0_7B | LAMM-2D Instruction Data | 32 | Checkpoints |
186K | CLIP-ViT-L | LLaMA2_chat_7B | LAMM-2D Instruction Data | 32 | Checkpoints |
98K | CLIP-ViT-L | Vicuna_v0_13B | LAMM-2D daily dialogue & description | 32 | Checkpoints |
186K | CLIP-ViT-L | Vicuna_v0_13B | LAMM-2D Instruction Data | 32 | Checkpoints |
10K | EPCL-ViT-L | Vicuna_v0_13B | LAMM-3D Instruction Data | 32 | Checkpoints |
Octavius Model Zoo
#Samples | Vision Encoder | LLM | Training Data | Link |
---|---|---|---|---|
286K | CLIP-ViT-L | Vicuna_v0_13B | LAMM-2D Instruction Data & COCO-Detection | ckpt |
90K | Obj-As-Scene | Vicuna_v0_13B | Scan2Inst | ckpt |
376K | CLIP-ViT-L & Obj-As-Scene | LLaMA2_chat_13B | LAMM-2D Instruction Data & COCO-Detection & Scan2Inst | ckpt |
You can download them and put them into the ckpt
directory for fast evaluation.
Instruction Tuning
We construct instruction tuning datasets for the 2D and 3D modalities via the GPT API.
2D Instruction Tuning Datasets
The 2D instruction tuning datasets are built on the MS-COCO, Bamboo, Locount, and TextVQA datasets. You can download them from here.
The generated instruction-following dialogues are organized into the following meta files. We provide a table to illustrate the correspondence between each meta file and data collection:
Meta file name | Size | Data file name | Size |
---|---|---|---|
daily_dialogue_49k.json | 112M | coco_images.zip | 7.8G |
detailed_description_49k.json | 83.2M | bamboo_images.zip | 5.4G |
vision_task_dialogue_46k.json | 64.8M | coco_images.zip, bamboo_images.zip, locount_images.zip, textvqa_images.zip | 9.2G |
LAMM_instruct_186k.json | 325M | / | / |
Note that we provide the LAMM_instruct_186k.json meta file, which merges all the datasets across different tasks. You can simply use this single file for training.
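If you want to build your own merged meta file from a subset of tasks, the per-task meta files can simply be concatenated, since each one is a JSON list of samples (see the Meta File Format section below). A minimal sketch, assuming the meta files are in the current directory and with an illustrative output name:

import json

# Per-task meta files from the table above (2D_Instruct/meta_file directory).
meta_files = [
    "daily_dialogue_49k.json",
    "detailed_description_49k.json",
    "vision_task_dialogue_46k.json",
]

merged = []
for path in meta_files:
    with open(path, "r") as f:
        merged.extend(json.load(f))  # each meta file is a JSON list of samples

# Write the combined list to a single meta file (output name is illustrative).
with open("merged_instruct.json", "w") as f:
    json.dump(merged, f)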
Additional Detection Instruction
The lack of sufficient detection instructions results in poor performance on the downstream PASCAL VOC evaluation. To overcome this problem, we leverage the full set of COCO detection annotations to generate instructions and add them to the aforementioned datasets as a supplement.
Meta file name | Size | Data file name | Size |
---|---|---|---|
coco_detection_117k.json | 116M | coco_images.zip | 7.8G |
octavius_2d_train_293k.json | 339M | / | / |
Note that we provide the octavius_2d_train_293k.json meta file, which merges all the datasets across different tasks, including the additional detection instructions. You can simply use this single file for training.
3D Instruction Tuning Datasets
We provide two 3D instruction tuning datasets: "Scan2Inst" and "LAMM3D-Dataset".
Scan2Inst
Note: Compared with LAMM3D-Dataset, Scan2Inst provides more tasks and instruction-following dialogues, so we highly recommend using Scan2Inst rather than LAMM3D-Dataset for 3D instruction tuning.
Scan2Inst is built on ScanNet. Specifically, we first use FCAF3D from mmdetection3d to extract 3D objects from each scene-level point cloud. A ULIP-like encoder is then used to extract language-aligned, object-level 3D features. Finally, to speed up data loading, we store the dataset as a pickle file. For convenience, we provide the processed pickle file (scan2inst_train.pickle) here; you can train our model by simply loading this file.
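If you want to inspect the processed data yourself, the pickle file can be opened as below. This is a minimal sketch: the path assumes the directory structure shown in the Dataset Preparation section, and the internal layout of the stored object depends on the preprocessing pipeline described above.

import pickle

# Load the processed Scan2Inst training data.
with open("data/Octavius/3D_Instruct/scan2inst_train.pickle", "rb") as f:
    scan2inst_data = pickle.load(f)

# Inspect the top-level type and size before training.
print(type(scan2inst_data))
print(len(scan2inst_data) if hasattr(scan2inst_data, "__len__") else "n/a")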
Besides, if you want to use your own dataset, we also provide our ULIP pretraining code; you can train your own ULIP model by following the instructions in src/tools/Octavius/ULIP/scripts/pretrain_pointbert.sh. You can also use our pretrained model here to extract features for your own dataset.
The corresponding meta file is listed below:
Meta file name | Size | Data file name | Size |
---|---|---|---|
scan2inst_train.json | 62.3M | scan2inst_train.pickle | 209M |
LAMM3D-Dataset
LAMM3D-Dataset is built on 3RScan and ShapeNet. You can download the data from here.
The corresponding meta file is listed below:
Meta file name | Size | Data file name | Size |
---|---|---|---|
LAMM_3dinstruct_10k.json | 19.6M | 3rscan_pcls.zip, shapenet_pcls.zip | 929M |
Directory Structure
data
├── LAMM
│   ├── 2D_Instruct
│   │   ├── coco_images.zip
│   │   ├── bamboo_images.zip
│   │   ├── textvqa_images.zip
│   │   ├── locount_images.zip
│   │   └── meta_file
│   │       ├── daily_dialogue_49k.json
│   │       ├── detailed_description_49k.json
│   │       ├── factual_knowledge_dialogue_42k.json
│   │       └── vision_task_dialogue_46k.json
│   ├── 3D_Instruct
│   │   ├── 3rscan_pcls.zip
│   │   ├── shapenet_pcls.zip
│   │   └── meta_file
│   │       └── LAMM_3dinstruct_10k.json
│   └── …
├── Octavius
│   ├── 2D_Instruct
│   │   ├── coco_images.zip
│   │   ├── bamboo_images.zip
│   │   ├── textvqa_images.zip
│   │   ├── locount_images.zip
│   │   └── meta_file
│   │       └── octavius_2d_train_293k.json
│   ├── 3D_Instruct
│   │   ├── scan2inst_train.pickle
│   │   └── meta_file
│   │       └── scan2inst_train.json
│   └── …
└── …
Meta File Format
2D instruction tuning data
[
{
"id": "000000019028", # image id
"image": "coco_images/000000019028.jpg", # image path
"conversations": [
{
"from": "human", # instruction
"value": "How is the kitchen in the image furnished?"
},
{
"from": "gpt", # response
"value": "The kitchen in the image is furnished with white cabinets and white appliances. There is a dishwasher, a stove,
and a sink. On the stove, a blue towel hangs on the handle. A cutting board is placed on the dishwasher. There are also
additional elements like a bowl of apples on the counter and a beige rug on the floor."
}
],
"task_type": "conversation", # task type
"src_image": "coco2017" # original dataset
},
{
...
}
]
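As an illustration of the format above, a 2D meta file can be loaded and iterated as follows. This is a minimal sketch: the path follows the directory structure above, and note that the inline comments in the sample above are for documentation only (the actual files are plain JSON).

import json

# Load a 2D instruction tuning meta file.
meta_path = "data/LAMM/2D_Instruct/meta_file/daily_dialogue_49k.json"
with open(meta_path, "r") as f:
    samples = json.load(f)

# Each sample pairs an image with a human/gpt conversation.
sample = samples[0]
print(sample["image"], sample["task_type"], sample["src_image"])
for turn in sample["conversations"]:
    print(f'{turn["from"]}: {turn["value"]}')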
3D instruction tuning data
[
{
"pcl": "shapenet_pcls/04256520_cb71cb7b36dbcb6f826fc8d57346a2e4_4096.npy",
"conversations": [
{
"from": "human",
"value": "What scenario does this point cloud belong to according to the model\u2019s prediction?"
},
{
"from": "gpt",
"value": "Through meticulous analysis, it becomes evident that the point cloud aligns with the characteristics of sofa,
couch, lounge scenario."
}
],
"task_type": "classification3d",
"src_dataset": "ShapeNet",
"src_id": "04256520_cb71cb7b36dbcb6f826fc8d57346a2e4"
},
{
...
}
]
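Similarly, a 3D sample's point cloud can be loaded from the .npy file referenced by its "pcl" field. A sketch, assuming the pcls archives have been extracted under data/LAMM/3D_Instruct/; the array layout (number of points and channels) depends on the preprocessing:

import json
import numpy as np

# Load the 3D instruction tuning meta file.
with open("data/LAMM/3D_Instruct/meta_file/LAMM_3dinstruct_10k.json", "r") as f:
    samples = json.load(f)

sample = samples[0]
# The "pcl" field points to a .npy point cloud inside the extracted pcls archives.
points = np.load(f'data/LAMM/3D_Instruct/{sample["pcl"]}')
print(points.shape, sample["task_type"], sample["src_dataset"])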
Benchmark
We provide 2D/3D/ChEF benchmarking datasets for downstream evaluation.
2D Benchmarking Datasets
The 2D benchmarking datasets are built on the Flickr30k, CIFAR-10, FSC147, CelebA, UCMerced, LSP, PASCAL VOC, SVT, AI2D, and ScienceQA datasets. You can download them from here.
The corresponding meta files are listed below:
Meta file name | Size | Data file name | Size |
---|---|---|---|
Caption_flickr30k.json | 598K | flickr30k_images.zip | 559M |
Classification_CIFAR10.json | 2.6M | cifar10_images.zip | 8.9M |
Counting_FSC147.json | 7.3M | fsc147_images.zip | 44M |
Detection_VOC2012.json | 6.4M | voc2012_images.zip | 196M |
Facial_Classification_CelebA(Hair).json | 2.4M | celeba_images.zip | 566M |
Facial_Classification_CelebA(Smile).json | 3.7M | celeba_images.zip | 566M |
Fine-grained_Classification_UCMerced.json | 676K | ucmerced_images.zip | 317M |
Keypoints_Dectection_LSP.json | 3.9M | lsp_images.zip | 44M |
Locating_FSC147.json | 7.5M | fsc147_images.zip | 44M |
Locating_LSP.json | 3.9M | lsp_images.zip | 9.9M |
Locating_VOC2012.json | 6.0M | voc2012_images.zip | 196M |
OCR_SVT.json | 68K | svt_images.zip | 82M |
VQA_AI2D.json | 2.1M | ai2d_images.zip | 559M |
VQA_SQAimage.json | 3.6M | sqaimage_images.zip | 127M |
3D Benchmarking Datasets
We provide two 3D benchmarking datasets: "Scan2Inst-benchmark" and "LAMM3D-Dataset-benchmark".
Scan2Inst-benchmark
If your MLLMs are trained with "Scan2Inst", you should use "Scan2Inst-benchmark" for evaluation.
We provide NR3D and ShapeNet for zero-shot evaluation, and ScanNet for finetuning evaluation. You can download the processed pickle files from here.
The corresponding meta files are listed below:
Meta file name | Size | Data file name | Size |
---|---|---|---|
Caption_nr3d.json | 2.28M | Caption_nr3d.pickle | 25.41M |
Caption_scannet.json | 239.43K | Caption_scannet.pickle | 7.29M |
Classification_scannet.json | 249.80K | Classification_scannet.pickle | 7.38M |
Classification_shapenet.json | 1.09M | Classification_shapenet.pickle | 21.45M |
VQA_scannet.json | 231.64K | VQA_scannet.pickle | 4.82M |
LAMM3D-Dataset-benchmark
If your MLLMs are trained with "LAMM3D-Dataset", you should use "LAMM3D-Dataset-benchmark" for evaluation.
LAMM3D-Dataset-benchmark is built on ScanNet. You can download it from here.
The corresponding meta files are listed below:
Meta file name | Size | Data file name | Size |
---|---|---|---|
Detection_ScanNet.json | 1.7M | scannet_pcls.zip | 246M |
VG_ScanRefer.json | 3.7M | scannet_pcls.zip | 246M |
VQA_ScanQA_multiplechoice.json | 859K | scannet_pcls.zip | 246M |
ChEF Benchmarking Dataset
Omnibenchmark
Download Omnibenchmark for the fine-grained classification dataset and the Bamboo Label System for hierarchical category labels.
We meticulously sampled and labeled Omnibenchmark using a hierarchical chain of categories, facilitated by the Bamboo label system.
python ChEF/data_process/Omnibenchmark.py
You can also directly download the labeled Omnibenchmark dataset from OpenXLab.
MMBench, MME and SEEDBench
Refer to MMBench, MME and SEEDBench for dataset and more details.
POPE
POPE is a specially labeled COCO dataset for hallucination evaluation, based on the validation set of COCO 2014. Download COCO and POPE.
MMBench_C and ScienceQA_C
MMBench_C and ScienceQA_C are datasets with image and text corruptions for robustness evaluation. You can directly download the MMBench_C and ScienceQA_C datasets from OpenXLab.
Directory Structure
data
├── ChEF
│   ├── Omnibenchmark_Bamboo
│   │   ├── meta_file
│   │   └── omnibenchmark_images
│   ├── MMBench_C
│   │   ├── images
│   │   ├── Image_Corruptions_info.json
│   │   ├── Text_Corruptions_info.json
│   │   └── MMBench_C.json
│   └── ScienceQA_C
│       ├── sqaimage_images
│       ├── Image_Corruptions_info.json
│       ├── Text_Corruptions_info.json
│       └── VQA_ScienceQA_C.json
├── Bamboo
│   └── sensexo_visual_add_academic_add_state_V4.visual.json
├── MMBench
│   ├── mmbench_dev_20230712.tsv
│   └── mmbench_test_20230712.tsv
├── MME_Benchmark_release_version
├── SEED-Bench
├── coco_pope
│   ├── val2014
│   ├── coco_pope_adversarial.json
│   ├── coco_pope_popular.json
│   └── coco_pope_random.json
└── …
Installation
Training
Prerequisite Packages:
gcc <= 7.5.0; nvcc >= 11.1
Python & Pytorch Environment
conda create -n lamm python=3.10 -y
conda activate lamm
# Choose the torch version according to your CUDA version
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
Install Required Dependencies
conda install timm==0.6.7 deepspeed==0.9.3 transformers==4.31.0 -c conda-forge
pip install peft==0.3.0 --no-dependencies
pip install -r requirements/default.txt
Install Faiss
# if cuda is available
conda install -c conda-forge faiss-gpu
# otherwise
conda install -c conda-forge faiss-cpu
Download required NLTK data
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
Optional Dependencies
LAMM-3D Environments
cd src/model/EPCL/third_party/pointnet2/
python setup.py install
cd ../../utils/
pip install cython
python cython_compile.py build_ext --inplace
Reducing Memory in Training
· flash attention (v2)
Install flash attention (v2) if you are tight on GPU memory. Please refer to flash attention's installation instructions.
FlashAttention-2 currently supports Ampere, Ada, or Hopper GPUs (e.g., A100, RTX 3090, RTX 4090, H100).
· xformers
Install xformers if you are tight on GPU memory and cannot use flash attention (e.g., on NVIDIA V100 GPUs). Please refer to xformers' installation instructions.
Benchmarking
We use ChEF for benchmarking.
conda create -n ChEF python=3.10
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements/default.txt
pip install -r requirements/ChEF.txt
Optional Dependencies
Efficient Inference
· lightllm
Install lightllm to speed up inference and reduce GPU memory usage, enabling larger batch sizes.
git clone -b multimodal https://github.com/ModelTC/lightllm.git
cd lightllm
python setup.py install
Default Benchmark
LAMM-Benchmark
Note: LAMM-Benchmark has now been fully implemented using ChEF, and we highly recommend using the latest ChEF evaluation pipeline for benchmarking in your work. ChEF supports evaluation of LAMM's common 2D and 3D tasks as well as its locating tasks. Please note that the GPT-rank metric in LAMM is no longer supported.
To evaluate LAMM/Octavius on the LAMM-Benchmark 2D common tasks, use the pre-defined model configs (src/config/ChEF/models/lamm.yaml or src/config/ChEF/models/octavius_2d+3d.yaml) and the pre-defined recipe configs (src/config/ChEF/scenario_recipes/LAMM/).
python eval.py --model_cfg config/ChEF/models/lamm.yaml --recipe_cfg config/ChEF/scenario_recipes/LAMM/ScienceQA.yaml
If you want to run all the evaluations sequentially and automatically, you can run:
sh tools/LAMM/eval_lamm2d.sh
sh tools/LAMM/eval_lamm3d.sh
To evaluate Octavius on ScanNet Detection, run:
sh tools/Octavius/octavius_ChEF.sh
ChEF
Download Evaluated MLLMs
LLM | Vision Encoder | Language Model | Link |
---|---|---|---|
InstructBLIP | EVA-G | Vicuna 7B | instruct_blip_vicuna7b_trimmed |
Kosmos2 | CLIP ViT-L/14 | Decoder 1.3B | kosmos-2.pt |
LAMM | CLIP ViT-L/14 | Vicuna 13B | lamm_13b_lora32_186k |
LLaMA-Adapter-v2 | CLIP ViT-L/14 | LLaMA 7B | LORA-BIAS-7B |
LLaVA | CLIP ViT-L/14 | MPT 7B | LLaVA-Lightning-MPT-7B |
MiniGPT-4 | EVA-G | Vicuna 7B | MiniGPT-4 |
mPLUG-Owl | CLIP ViT-L/14 | LLaMA 7B | mplug-owl-llama-7b |
Otter | CLIP ViT-L/14 | LLaMA 7B | OTTER-9B-LA-InContext |
Shikra | CLIP ViT-L/14 | LLaMA 7B | shikra-7b |
Organize them as below:
ckpt
├── epcl_vit-L_256tokens
├── lamm_2d                  # saved checkpoints in training
├── …
└── …
Custom Benchmark
Evaluator
In ChEF, all evaluation pipelines are managed by the Evaluator
(src/ChEF/evaluator.py
) class. This class serves as the control center for evaluation tasks and incorporates various components, including a scenario, an instruction, an inferencer, and a metric. These components are defined through recipe configurations.
Key Components:
- Scenario: The scenario represents the evaluation dataset and task-specific details.
- Instruction: Responsible for processing samples and generating queries.
- Inferencer: Performs model inference on the dataset.
- Metric: Evaluates model performance using defined metrics.
Evaluation Workflow
The evaluation process in ChEF follows a structured workflow:
- Model and Data Loading: First, the model and the evaluation dataset (scenario) are loaded.
- Evaluator.evaluate Method: The evaluation is initiated by calling the evaluate method of the Evaluator class.
- Inference with inferencer.inference: The inferencer is used to perform model inference. During dataset traversal, the InstructionHandler processes each sample, generating queries that serve as inputs to the model.
- Results Saving: The output of the inference is saved in the specified results_path.
- Metric Evaluation: Finally, the metric evaluates the results file, calculating various performance metrics specific to the evaluation task.
- Output Evaluation Results: The final evaluation results are provided as output, allowing you to assess the model's performance. (A conceptual sketch of this workflow is shown below.)
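To make the workflow above concrete, here is a purely illustrative sketch of how these components fit together. It is not the actual src/ChEF/evaluator.py implementation: constructor arguments and the metric call are placeholders; only Evaluator.evaluate and inferencer.inference are named in the description above.

# Illustrative sketch only; not the real src/ChEF/evaluator.py.
class Evaluator:
    def __init__(self, scenario, instruction_handler, inferencer, metric, results_path):
        self.scenario = scenario                        # evaluation dataset (task-specific)
        self.instruction_handler = instruction_handler  # builds queries per sample
        self.inferencer = inferencer                    # runs model inference
        self.metric = metric                            # scores the saved results
        self.results_path = results_path                # where inference outputs are written

    def evaluate(self, model):
        # Inference: iterate the scenario, build queries, save outputs to results_path.
        self.inferencer.inference(model, self.scenario)
        # Metric evaluation: score the saved results file (method name is a placeholder).
        return self.metric.score(self.results_path)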
Employ Your Model
In ChEF, you can employ your own custom models by following these steps:
Step 1: Prepare Your Model Files
- Navigate to the src/ChEF/models/ folder in ChEF.
- Paste all the necessary files for your custom model into this folder.
Step 2: Write the Test Model
- Create a new Python file in the models folder and name it something like test_your_model.py.
- In this file, inherit from the TestBase class defined in src/ChEF/models/test_base.py. The TestBase class provides a set of interfaces that you should implement for testing your model (see the sketch below).
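For orientation, a skeleton for test_your_model.py might look like the following. This is a sketch only: the interfaces you actually need to implement are defined in src/ChEF/models/test_base.py, and the generate method and load_model helper shown here are assumptions, not the confirmed signatures.

# src/ChEF/models/test_your_model.py -- skeleton sketch (method names are assumptions;
# consult src/ChEF/models/test_base.py for the actual interfaces to implement).
from .test_base import TestBase


class TestYourModel(TestBase):
    def __init__(self, model_path, **kwargs):
        super().__init__(**kwargs)
        # Load your model weights and any preprocessors here.
        self.model = self.load_model(model_path)  # hypothetical helper

    def generate(self, prompts):
        # Run your model on the queries produced by the InstructionHandler
        # and return one text answer per prompt.
        return [self.model.answer(p) for p in prompts]  # hypothetical call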
Step 3: Test Your Model
- Add your model in src/ChEF/models/__init__.py.
- Prepare your model configuration in src/config/ChEF/models/. For example, the config for KOSMOS-2 (src/config/ChEF/models/kosmos2.yaml):
model_name: Kosmos2
model_path: ../model_zoo/kosmos/kosmos-2.pt
if_grounding: False # set True for detection and grounding evaluation
The config for KOSMOS-2 for detection task evaluation:
model_name: Kosmos2
model_path: ../model_zoo/kosmos/kosmos-2.pt
if_grounding: True
Use the provided recipes for evaluation:
python tools/eval.py --model_cfg configs/ChEF/models/your_model.yaml --recipe_cfg recipe_cfg
Instruction
In ChEF, the InstructionHandler class (src/ChEF/instruction/__init__.py) plays a central role in managing instructions for generating queries when iterating through the dataset in the inferencer. These queries are then used as inputs to the model for various tasks.
ChEF supports three main query types: standard query, query pool, and multiturn query. For each query type, various query statements are defined based on the dataset's task type.
- Standard Query: Uses the first query defined in the query pool.
- Query Pool: Specifies queries in the pool by assigned ids defined in the configuration.
- Multiturn Query: Can get different queries depending on the turn id, which are also defined in the query pool.
For more details, refer to the src/ChEF/instruction/query.py
.
InstructionHandler
also supports generating in-context examples for queries using ice_retriever
(src/ChEF/instruction/ice_retriever/
). ChEF supports four types of ice_retrievers
: random
, fixed
, topk_text
, and topk_img
. The generate_ices
function in the InstructionHandler
class outputs several in-context examples for the input query.
Employ Your Instruction: You can add special queries in the Query Pool
, and define the assigned ids in the recipe configuration to use the new queries. You can also define a new type of query by defining the query in src/ChEF/instruction/query.py
and adding a new function in InstructionHandler
.
Inferencer
In ChEF, the Inferencer
component is a crucial part of the system, responsible for model inference. ChEF offers a variety of pre-defined inferencers to cater to different needs. You can easily choose the appropriate inferencer by specifying the inferencer category and necessary settings in the recipe configuration. Additionally, users have the flexibility to define their custom inferencers.
Pre-Defined Inferencers:
ChEF provides eight different inferencers that cover a range of use cases. You can effortlessly use the desired inferencer by specifying its category and required settings in the recipe configuration.
Custom Inferencers:
For advanced users and specific requirements, ChEF offers the option to create custom inferencers. The basic structure of an inferencer is defined in the src/ChEF/inferencer/Direct.py
file (Direct_inferencer
). You can extend this structure to implement your custom inferencer logic.
from torch.utils.data import DataLoader
from tqdm import tqdm

# Direct_inferencer is defined in src/ChEF/inferencer/Direct.py
from ChEF.inferencer.Direct import Direct_inferencer


class Your_inferencer(Direct_inferencer):
    def __init__(self, **kwargs) -> None:
        super().__init__(**kwargs)

    def inference(self, model, dataset):
        predictions = []
        # Step 1: build the dataloader; the collate_fn turns a list of sample
        # dicts into a dict of lists, keyed by the sample fields
        dataloader = DataLoader(
            dataset,
            batch_size=self.batch_size,
            collate_fn=lambda batch: {key: [sample[key] for sample in batch] for key in batch[0]})
        for batch in tqdm(dataloader, desc="Running inference"):
            # Step 2: get the input queries from the instruction handler
            prompts = self.instruction_handler.generate(batch)
            # Step 3: run the model on the queries
            outputs = model.generate(prompts)
            # Step 4: accumulate results
            predictions = predictions + outputs
        # Step 5: write the results file
        self._after_inference_step(predictions)
Metric
In ChEF, the Metric
component plays a crucial role in evaluating and measuring the performance of models across various scenarios and protocols. ChEF offers a wide range of pre-defined metrics, each tailored to different evaluation needs. Detailed information about these metrics can be found in the src/ChEF/metric/__init__.py file.
Custom Metrics:
ChEF also allows users to define their custom metrics. The basic structure of a metric is defined in the src/ChEF/metric/utils.py
file (Base_Metric
). You can extend this structure to implement your custom metric logic.
# Base_Metric is defined in src/ChEF/metric/utils.py
from ChEF.metric.utils import Base_Metric


class Your_metric(Base_Metric):
    def __init__(self, **kwargs) -> None:
        super().__init__(**kwargs)

    def metric_func(self, answers):
        '''
        answers: List[sample], each sample is a dict
        sample: {
            'answer' : str,
            'gt_answers' : str,
        }
        '''
        # Evaluation logic for your metric goes here.
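As a usage example, a simple exact-match accuracy metric could fill in metric_func as below. This is a sketch under the sample format shown above; the class name and the returned dictionary convention are assumptions, not the confirmed ChEF interface.

class ExactMatchAccuracy(Base_Metric):
    def __init__(self, **kwargs) -> None:
        super().__init__(**kwargs)

    def metric_func(self, answers):
        # Count answers that exactly match the ground truth (case-insensitive).
        correct = sum(
            1 for sample in answers
            if sample['answer'].strip().lower() == sample['gt_answers'].strip().lower()
        )
        return {'accuracy': correct / max(len(answers), 1)}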
LAMM
@article{yin2023lamm,
title={LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark},
author={Yin, Zhenfei and Wang, Jiong and Cao, Jianjian and Shi, Zhelun and Liu, Dingning and
Li, Mukai and Sheng, Lu and Bai, Lei and Huang, Xiaoshui and Wang, Zhiyong and others},
journal={arXiv preprint arXiv:2306.06687},
year={2023}
}
ChEF
@misc{shi2023chef,
title={ChEF: A Comprehensive Evaluation Framework for Standardized Assessment of Multimodal
Large Language Models},
author={Zhelun Shi and Zhipin Wang and Hongxing Fan and Zhenfei Yin and Lu Sheng and Yu Qiao
and Jing Shao},
year={2023},
eprint={2311.02692},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
Octavius
@misc{chen2023octavius,
title={Octavius: Mitigating Task Interference in MLLMs via MoE},
author={Zeren Chen and Ziqin Wang and Zhen Wang and Huayang Liu and Zhenfei Yin and Si Liu
and Lu Sheng and Wanli Ouyang and Yu Qiao and Jing Shao},
year={2023},
eprint={2311.02684},
archivePrefix={arXiv},
primaryClass={cs.CV}
}