LAMM

Zhenfei Yin*,1,3  Jiong Wang*,1,4  Jianjian Cao*,1,4  Zhelun Shi*,1,2  Dingning Liu1,5  Mukai Li1
Xiaoshui Huang1  Zhiyong Wang3  Lu Sheng2  Lei Bai†,1  Jing Shao†,1  Wanli Ouyang1
1Shanghai Artificial Intelligence Laboratory  2Beihang University  3The University of Sydney 
4Fudan University  5Dalian University of Technology
* Equal Contribution  † Corresponding Authors

📄 Paper · 𝕏 Demo · ▶️ YouTube · 📺 Bilibili · 📦 LAMM Models

Overview

Large language models have emerged as a potential pathway toward artificial general intelligence, and recent work on multi-modal large language models (MLLMs) has demonstrated their effectiveness in handling visual modalities. In this work, we extend MLLM research to point clouds and present the LAMM-Dataset and LAMM-Benchmark for 2D image and 3D point cloud understanding. We also establish an extensible framework to facilitate extending MLLMs to additional modalities. Our main contribution is threefold: 1) We present the LAMM-Dataset and LAMM-Benchmark, which cover almost all high-level vision tasks for 2D and 3D vision; extensive experiments validate their effectiveness. 2) We detail the methodology for constructing instruction-tuning datasets and benchmarks for MLLMs, enabling future research on MLLMs to scale up and extend to other domains, tasks, and modalities faster. 3) We provide an initial yet extensible MLLM training framework optimized for adding new modalities, along with baseline models, comprehensive experimental observations, and analysis to accelerate future research.

Demos

Online Demo

For 2D images, we provide an online demo deployed on Hugging Face Spaces.

Due to hardware limitations, the online version only supports a 7B-parameter LLM, and loading the pretrained model takes a few minutes.

CLI Demo

We also provide a CLI demo for local testing. Point cloud data must be in .npy format; we suggest using data from LAMM-Benchmark-3D.

    cd ./src
    # Set --vision_type to pcl or image, and --encoder_pretrain to epcl (3D) or clip (2D).
    # Pass '' as --encoder_ckpt_path when using the CLIP encoder.
    python cli_demo.py \
        --model lamm_peft \
        --vision_type pcl \
        --encoder_pretrain epcl \
        --encoder_ckpt_path $EPCL_CKPT_PATH \
        --llm_ckpt_path $LLM_CKPT_PATH \
        --delta_ckpt_path $LAMM_CKPT_PATH

LAMM-Dataset

LAMM-Dataset is a comprehensive multi-modal instruction-tuning dataset containing 186K language-image instruction-response pairs and 10K language-3D instruction-response pairs, gathered from 8 image datasets and 4 point cloud datasets. We design four types of multi-modal instruction-response pairs (a sample record is sketched after the list):

  • C1: n-round daily dialogue focuses on multi-modal daily conversations.
  • C2: n-round factual knowledge dialogue aims at factual knowledge reasoning.
  • C3: 1-round detailed description aims to elaborate images and 3D scenes in texts.
  • C4: 1-round visual task dialogue recasts various vision tasks as instruction-response pairs, aiming to enhance generalizability towards domain tasks in other modalities.
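
For reference, the sketch below shows what a single record might look like. This is a minimal illustration assuming a common LLaVA-style conversation schema; the exact field names in the released files may differ.

    import json

    # Illustrative instruction-response record (field names are assumptions,
    # not verified against the released LAMM files).
    sample = {
        "id": "000000",
        "image": "coco_images/000000.jpg",  # or a .npy path for 3D data
        "conversations": [
            {"from": "human", "value": "Describe the image in detail."},
            {"from": "gpt", "value": "The image shows ..."},
        ],
    }
    print(json.dumps(sample, indent=2))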

You can download the instruction and benchmark datasets and put them into the data/LAMM directory; an illustrative layout follows.
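
A plausible layout after extraction is shown below; the subdirectory names are assumptions, so match them to the released archives.

    data/
    └── LAMM/
        ├── 2D_Instruct/
        ├── 2D_Benchmark/
        ├── 3D_Instruct/
        └── 3D_Benchmark/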

LAMM-Framework

  1. You can install the environment following the instructions here.

  2. Prepare the required pretrained weights of the LLMs and visual encoders here.

  3. Train your LAMM model following the guide here. We also provide pretrained models here; a conceptual sketch of the framework's modality interface follows this list.
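
To illustrate why the framework extends readily to new modalities, the Python sketch below shows the frozen-encoder-plus-trainable-projector pattern described in the overview. ModalityAdapter and its signatures are illustrative, not the repository's actual interfaces.

    import torch
    import torch.nn as nn

    class ModalityAdapter(nn.Module):
        """Conceptual sketch: a frozen modality encoder (e.g. CLIP for images,
        EPCL for point clouds) followed by a trainable projector that maps
        features into the LLM token-embedding space."""

        def __init__(self, encoder: nn.Module, feat_dim: int, llm_dim: int):
            super().__init__()
            self.encoder = encoder.eval()
            for p in self.encoder.parameters():  # keep the encoder frozen
                p.requires_grad = False
            self.projector = nn.Linear(feat_dim, llm_dim)  # trainable

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            with torch.no_grad():
                feats = self.encoder(x)   # (B, N, feat_dim) patch/point features
            return self.projector(feats)  # (B, N, llm_dim) pseudo-tokens for the LLM

Under this pattern, supporting a new modality amounts to registering another encoder/projector pair; the LLM and the instruction format stay unchanged.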

LAMM-Benchmark

Note: We highly recommend using ChEF to evaluate LAMM models; see here for details.

The default LAMM-Benchmark evaluates 9 common image tasks across 11 datasets with over 62,439 samples, and 3 common point cloud tasks across 3 datasets with over 12,788 samples. In contrast, existing works only provide quantitative results from fine-tuning and evaluating on specific datasets such as ScienceQA, and most only conduct demonstrations or user studies.

  • We make the first attempt to establish a benchmark for MLLMs. Our comprehensive benchmark quantifies the zero-shot and fine-tuning performance of existing multi-modal language models on various computer vision tasks and compares them against state-of-the-art task-specific methods, covering classification, object detection, pose estimation, visual question answering, facial classification, optical character recognition, and object counting.

  • We also propose two novel evaluation strategies designed explicitly for MLLMs. For text generation, we establish a scoring logic based on the GPT API. For tasks involving points and regions in images, such as object detection and pose estimation, we propose an object-locating evaluation method (sketched below).
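
As a concrete illustration of the object-locating idea, the sketch below counts a prediction as located when its center falls inside the ground-truth box; this criterion is an assumption, and the benchmark's actual metric may use different thresholds and answer parsing.

    # Minimal sketch of an object-locating check (assumed criterion:
    # the predicted box center must fall inside the ground-truth box).
    def center_inside(pred_box, gt_box) -> bool:
        """Boxes are (x1, y1, x2, y2) in pixel coordinates."""
        cx = (pred_box[0] + pred_box[2]) / 2
        cy = (pred_box[1] + pred_box[3]) / 2
        return gt_box[0] <= cx <= gt_box[2] and gt_box[1] <= cy <= gt_box[3]

    assert center_inside((10, 10, 30, 30), (0, 0, 40, 40))      # located
    assert not center_inside((50, 50, 70, 70), (0, 0, 40, 40))  # missed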

Citation

@article{yin2023lamm,
    title={LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark},
    author={Yin, Zhenfei and Wang, Jiong and Cao, Jianjian and Shi, Zhelun and Liu, Dingning and Li, Mukai and Sheng, Lu and Bai, Lei and Huang, Xiaoshui and Wang, Zhiyong and others},
    journal={arXiv preprint arXiv:2306.06687},
    year={2023}
}

License

The project is released under CC BY-NC 4.0 (non-commercial use only), and models trained on the dataset must not be used outside of research purposes. The checkpoints are likewise released under CC BY-NC 4.0.

Acknowledgement

We thank Hongxing Fan, Zeren Chen, and Zhen Wang for their support of the LAMM project.

We also thank the great works this project builds on, including CLIP, EPCL, LLaMA, Vicuna, FlashAttention, xformers, and lightllm.