LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge

School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen
†Corresponding author
TL;DR: In this work, we enhance MLLMs by integrating fine-grained spatial-aware visual knowledge and high-level semantic visual evidence, boosting performance on both image-level and region-level tasks and alleviating hallucinations.

Abstract
Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals. However, most existing MLLMs adopt vision encoders pretrained on coarsely aligned image-text pairs, leading to insufficient extraction and reasoning of visual knowledge. To address this issue, we devise a dual-Level vIsual knOwledge eNhanced Multimodal Large Language Model (LION), which empowers the MLLM by injecting visual knowledge at two levels. 1) Progressive incorporation of fine-grained spatial-aware visual knowledge. We design a vision aggregator that cooperates with region-level vision-language (VL) tasks to incorporate fine-grained spatial-aware visual knowledge into the MLLM. To alleviate the conflict between image-level and region-level VL tasks during incorporation, we devise a dedicated stage-wise instruction-tuning strategy with mixture-of-adapters. This progressive incorporation scheme allows these two kinds of VL tasks to promote each other. 2) Soft prompting of high-level semantic visual evidence. We equip the MLLM with high-level semantic visual evidence by leveraging diverse image tags. To mitigate the potential influence of imperfect predicted tags, we propose a soft prompting method that embeds a learnable token into the tailored text instruction. Comprehensive experiments on several multi-modal benchmarks demonstrate the superiority of our model (e.g., improvements of 5% accuracy on VSR and 3% CIDEr on TextCaps over InstructBLIP, and 5% accuracy on RefCOCOg over Kosmos-2).
Video

Method
Model Architecture

As shown in the figure below, the LION model 1) extracts holistic visual features with a Q-Former and combines them with fine-grained spatial-aware visual features from the vision aggregator; 2) uses the frozen Recognize Anything Model (RAM) to produce image tags, which are combined with a soft prompt to provide complementary high-level semantic visual evidence; and 3) employs a Mixture-of-Adapters with a router inside the frozen LLM to dynamically fuse the visual knowledge learned from the different visual branches and LLM adapters according to the task type (image-level or region-level).
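
To make the data flow concrete, here is a minimal PyTorch-style sketch of the forward pass described above. The module names, dimensions, and the linear stand-ins for the Q-Former, vision aggregator, and adapters are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn

class LIONForwardSketch(nn.Module):
    """Illustrative sketch of LION's dual-level visual knowledge injection.
    Module names and shapes are assumptions for exposition only."""

    def __init__(self, d_vit=1408, d_det=1024, d_llm=4096, n_adapters=2):
        super().__init__()
        self.q_former_proj = nn.Linear(d_vit, d_llm)          # holistic branch (frozen Q-Former output)
        self.vision_aggregator = nn.Linear(d_det, d_llm)      # fine-grained spatial-aware branch
        self.soft_prompt = nn.Parameter(torch.zeros(1, 1, d_llm))  # learnable token for tag evidence
        self.router = nn.Linear(d_llm, n_adapters)            # picks between image-/region-level adapters
        self.adapters = nn.ModuleList([nn.Linear(d_llm, d_llm) for _ in range(n_adapters)])

    def forward(self, holistic_feat, spatial_feat, tag_embeds, text_embeds):
        # 1) Combine holistic Q-Former features with fine-grained spatial-aware features.
        visual = torch.cat([self.q_former_proj(holistic_feat),
                            self.vision_aggregator(spatial_feat)], dim=1)

        # 2) Prepend the learnable soft-prompt token to the embedded RAM tags, so that
        #    imperfect predicted tags act as soft evidence rather than hard constraints.
        B = tag_embeds.size(0)
        tag_evidence = torch.cat([self.soft_prompt.expand(B, -1, -1), tag_embeds], dim=1)

        # 3) Mixture-of-Adapters: a router decides how much each task-specific adapter
        #    contributes, depending on whether the instruction is image- or region-level.
        llm_input = torch.cat([visual, tag_evidence, text_embeds], dim=1)
        weights = torch.softmax(self.router(llm_input.mean(dim=1)), dim=-1)   # (B, n_adapters)
        adapted = sum(w.view(B, 1, 1) * adapter(llm_input)
                      for w, adapter in zip(weights.unbind(dim=-1), self.adapters))
        return adapted   # in the full model this sequence is fed to the frozen LLM

In the actual model the adapters sit inside the frozen LLM layers; here they are pulled out into a single module purely to show the routing logic.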



Stage-wise Training Strategy

To address the conflicts between region-level and image-level tasks, we adopt a stage-wise instruction-tuning strategy.
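
As a rough illustration of how such a schedule can be organized, the sketch below runs three consecutive tuning stages, each unfreezing only a subset of modules. The stage contents, module names, data loaders, and hyperparameters are assumptions for illustration, not the exact training recipe.

import torch

def run_stage(model, loader, trainable_prefixes, lr=1e-4, epochs=1):
    """Freeze everything, unfreeze only this stage's modules, then instruction-tune."""
    for p in model.parameters():
        p.requires_grad = False
    for name, p in model.named_parameters():
        if any(name.startswith(prefix) for prefix in trainable_prefixes):
            p.requires_grad = True
    optim = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=lr)
    for _ in range(epochs):
        for batch in loader:
            loss = model(**batch).loss   # standard language-modeling loss on instruction targets
            loss.backward()
            optim.step()
            optim.zero_grad()

def stagewise_instruction_tuning(model, image_loader, region_loader, mixed_loader):
    # Stage 1 (assumed): image-level instructions tune the holistic branch and its adapter.
    run_stage(model, image_loader, ["q_former_proj", "image_adapter"])
    # Stage 2 (assumed): region-level instructions tune the vision aggregator and its adapter.
    run_stage(model, region_loader, ["vision_aggregator", "region_adapter"])
    # Stage 3 (assumed): a mixture of both task types tunes the router so the
    # mixture-of-adapters can arbitrate between image-level and region-level skills.
    run_stage(model, mixed_loader, ["router", "image_adapter", "region_adapter"])

Separating the stages in this way lets each adapter specialize on one task type before the router learns to combine them, which is the intent behind resolving the image-level versus region-level conflict.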

Results
Image-level and Region-level Tasks Results

For image-level tasks, we focus on image captioning and Visual Question Answering (VQA). For region-level tasks, we evaluate LION on three Referring Expression Comprehension (REC) datasets: RefCOCO, RefCOCO+, and RefCOCOg. The results, detailed in Tables 1 and 2, highlight LION's superior performance compared to baseline models.



Benchmark Results

We further evaluate LION on an object hallucination benchmark (POPE) and the widely used MLLM benchmark MMBench. The results in Tables 3 and 4 show that LION performs strongly across a variety of skills and demonstrates strong resistance to hallucinations, particularly in the popular and adversarial settings of POPE.



Qualitative Comparison


More Examples
Bibtex
@inproceedings{chen2024lion,
    title={LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge}, 
    author={Chen, Gongwei and Shen, Leyang and Shao, Rui and Deng, Xiang and Nie, Liqiang},
    booktitle={IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
    year={2024}
}
Acknowledgement

We referred to the project page of AvatarCLIP when creating this project page.