Tutorial

Our code is available in the LAMM repository.

Once you have successfully configured the environment, the entire directory structure should be as follows:

```
LAMM
β”œβ”€β”€ ckpt
β”‚   β”œβ”€β”€ lamm_2d                  # saved checkpoints in training
β”‚   └── …
β”œβ”€β”€ data                         # dataset folder, see Dataset Preparation section for detail
β”‚   β”œβ”€β”€ LAMM                     # LAMM dataset
β”‚   β”œβ”€β”€ Octavius                 # Octavius dataset
β”‚   β”œβ”€β”€ ChEF                     # ChEF dataset
β”‚   └── …                        # your custom dataset
β”œβ”€β”€ docs                         # document
β”œβ”€β”€ images                       # readme assets
β”œβ”€β”€ model_zoo                    # see Model Preparation for Training for detail
β”‚   β”œβ”€β”€ vicuna_ckpt              # Vicuna-7B/13B
β”‚   β”œβ”€β”€ epcl_vit-L_256tokens     # EPCL pretraining checkpoints (Optional)
β”‚   └── …
β”œβ”€β”€ requirements                 # python environment requirements
β”œβ”€β”€ src
└── …
```
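
As a quick sanity check after setup, the short Python sketch below reports whether the expected top-level folders are present. The directory names are taken from the tree above; the `check_layout` helper itself is only illustrative and is not part of the LAMM codebase.

```python
import os

# Top-level directories expected under the LAMM root, per the tree above.
# Adjust this list if your local layout differs.
EXPECTED_DIRS = [
    "ckpt",
    "data",
    "docs",
    "images",
    "model_zoo",
    "requirements",
    "src",
]


def check_layout(root: str = "LAMM") -> None:
    """Print which expected directories exist (or are missing) under `root`."""
    for name in EXPECTED_DIRS:
        path = os.path.join(root, name)
        status = "ok" if os.path.isdir(path) else "missing"
        print(f"{status:>7}  {path}")


if __name__ == "__main__":
    check_layout()
```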

LAMM is one of the first open-source frameworks for training and evaluating Multi-modal Large Language Models (MLLMs). Unlike frameworks such as LLaVA, LAMM focuses specifically on training and evaluating MLLMs for embodied agents.

Specifically, we have integrated data and code from a series of research projects, including:

- Training and evaluation datasets for MLLMs supporting both 2D and 3D temporal data.
- Training and evaluation code for MLLMs.
- Implementation frameworks for embodied AI downstream tasks in both simulated environments (such as Minecraft) and real-world robotics manipulation scenarios.

For detailed information, please refer to our tutorials or to the individual research projects and their documentation in the research section.

The framework provides a comprehensive solution for researchers and developers working at the intersection of multi-modal learning and embodied intelligence. Our goal is to facilitate the development and assessment of MLLMs that can effectively interact with and understand physical environments.