X-SAM: Unified Image Segmentation Multimodal Large Model Achieves State-of-the-Art Performance on 20+ Image Segmentation Datasets

Deep News
Aug 19

This research was jointly conducted by Sun Yat-sen University, Peng Cheng Laboratory, and Meituan. The first author, Wang Hao, is a doctoral student at Sun Yat-sen University, with research interests in image and video segmentation, open-scene visual perception, and multimodal large models. The corresponding authors are Professor Liang Xiaodan and Associate Researcher Lan Xiangyuan.

**Background and Motivation**

While the Segment Anything Model (SAM) excels as a foundational segmentation model for dense segmentation mask generation, its reliance on single-input visual prompts limits its applicability across broader image segmentation tasks. Although Multimodal Large Language Models (MLLMs) perform exceptionally well in tasks like image description and visual question answering, their output is constrained to text generation and cannot directly handle pixel-level visual tasks, which fundamentally limits the development of generalized models.

Sun Yat-sen University, Peng Cheng Laboratory, and Meituan jointly propose X-SAM—a unified image segmentation multimodal large model that extends the segmentation paradigm from "segment anything" to "any segmentation." X-SAM introduces a unified framework that equips MLLMs with advanced pixel-level perception and understanding capabilities.

The research team also proposes a new task, Visual Grounded (VGD) segmentation, which segments all instance objects referred to by interactive visual prompts, endowing MLLMs with visually grounded, pixel-level understanding. To support effective training across diverse data sources, X-SAM employs a unified training strategy that enables joint training over multiple datasets.

Experimental results demonstrate that X-SAM achieves state-of-the-art performance across extensive image segmentation benchmarks, fully showcasing its superiority in multimodal pixel-level visual understanding.

**Method Design**

X-SAM defines a universal input format and a unified output representation:

1) **Text Query Input**
2) **Vision Query Input**
3) **Unified Output Representation**

X-SAM adopts an end-to-end unified segmentation MLLM architecture comprising the following core components (a schematic sketch follows this list):

1) **Dual Encoders Design**
2) **Dual Projectors Architecture**: To enhance the LLM's image understanding capabilities, X-SAM employs a feature fusion strategy.
3) **Segmentation Connector**: To meet the need for fine-grained, multi-scale features in image segmentation, a segmentation connector is designed to provide rich multi-scale information to the segmentation decoder.
4) **Unified Segmentation Decoder**: Replacing SAM's original decoder, it adopts the Mask2Former decoder architecture.
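
To make the data flow concrete, here is a minimal PyTorch-style sketch of how these four components could be wired together. Everything below is illustrative: the toy encoders, the concatenation-based fusion, the channel sizes, and the simple dot-product mask head are stand-ins for the real image/segmentation encoders, LLM, and Mask2Former-style decoder, not X-SAM's actual implementation.

```python
# Toy, illustrative wiring of the four components above (assumptions only).
import torch
from torch import nn


class XSAMSketch(nn.Module):
    def __init__(self, llm_dim=512, seg_dim=256):
        super().__init__()
        # 1) Dual encoders: one branch for global image understanding,
        #    one SAM-style branch for segmentation (toy convs as stand-ins).
        self.image_encoder = nn.Conv2d(3, 1024, kernel_size=16, stride=16)
        self.seg_encoder = nn.Conv2d(3, seg_dim, kernel_size=16, stride=16)

        # 2) Dual projectors: map both feature streams into the LLM token
        #    space, where they are fused (here by simple concatenation).
        self.image_projector = nn.Linear(1024, llm_dim)
        self.seg_projector = nn.Linear(seg_dim, llm_dim)

        # 3) Segmentation connector: expand the segmentation features into a
        #    multi-scale pyramid for the mask decoder.
        self.connector = nn.ModuleList(
            [nn.Conv2d(seg_dim, seg_dim, 3, stride=s, padding=1) for s in (1, 2, 4)]
        )

        # Stand-ins for the LLM and the Mask2Former-style unified decoder.
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True),
            num_layers=1,
        )
        self.query_head = nn.Linear(llm_dim, seg_dim)

    def forward(self, image, text_embeds):
        img_feat = self.image_encoder(image)   # (B, 1024, H/16, W/16)
        seg_feat = self.seg_encoder(image)     # (B, 256,  H/16, W/16)

        # Project both feature maps to LLM tokens and prepend the text tokens.
        img_tokens = self.image_projector(img_feat.flatten(2).transpose(1, 2))
        seg_tokens = self.seg_projector(seg_feat.flatten(2).transpose(1, 2))
        hidden = self.llm(torch.cat([text_embeds, img_tokens, seg_tokens], dim=1))

        # Multi-scale features from the connector feed the mask decoder.
        pyramid = [conv(seg_feat) for conv in self.connector]

        # Use the LLM states at the text positions as segmentation queries and
        # decode masks via dot product with the finest feature map.
        queries = self.query_head(hidden[:, : text_embeds.size(1)])
        masks = torch.einsum("bqc,bchw->bqhw", queries, pyramid[0])
        return masks


if __name__ == "__main__":
    model = XSAMSketch()
    out = model(torch.randn(1, 3, 224, 224), torch.randn(1, 8, 512))
    print(out.shape)  # torch.Size([1, 8, 14, 14])
```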

X-SAM employs a three-stage progressive training strategy to optimize performance across diverse image segmentation tasks:

1) **Stage 1: Segmentor Fine-tuning**
2) **Stage 2: Alignment Pre-training**
3) **Stage 3: Mixed Fine-tuning**
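
Progressive strategies like this are typically implemented by freezing and unfreezing parameter groups between stages. The helper below sketches that mechanism against the toy model from the previous code block; the per-stage prefix lists are hypothetical placeholders, not X-SAM's published freezing schedule.

```python
from torch import nn

def configure_stage(model: nn.Module, trainable_prefixes):
    """Freeze every parameter, then unfreeze those whose names start with one
    of the given sub-module prefixes for the current training stage."""
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(p) for p in trainable_prefixes)

# Usage with the XSAMSketch toy model defined above; these prefix lists are
# hypothetical, for illustration of the mechanism only.
model = XSAMSketch()
configure_stage(model, ["seg_encoder", "connector", "query_head"])   # Stage 1
configure_stage(model, ["image_projector", "seg_projector"])         # Stage 2
configure_stage(model, ["image_projector", "seg_projector",
                        "query_head", "llm"])                        # Stage 3
```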

To handle the large differences in training-dataset scale (from 0.2K to 665K samples), X-SAM adopts a dataset-balanced resampling strategy:
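
A commonly used form of such repeat-factor resampling, consistent with the variable definitions below (an assumed reconstruction, not quoted from the paper), is

$$
r_d = \max\left(1, \sqrt{t / f_d}\right)
$$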

where t is a hyperparameter controlling the degree of oversampling and f_d is the frequency of dataset d among all training samples. During mixed training, dataset d is resampled according to r_d, which improves performance on the smaller, few-shot datasets.
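
In code, this amounts to computing one repeat factor per dataset before building the mixed training list. The sketch below follows the assumed formula above; the dataset names and sizes are illustrative only.

```python
import math

def repeat_factors(dataset_sizes: dict, t: float = 0.1) -> dict:
    """Per-dataset repeat factor r_d = max(1, sqrt(t / f_d)), where f_d is
    dataset d's share of all training samples (assumed form of the rule)."""
    total = sum(dataset_sizes.values())
    return {
        name: max(1.0, math.sqrt(t / (size / total)))
        for name, size in dataset_sizes.items()
    }

# Illustrative sizes echoing the 0.2K to 665K range quoted above.
sizes = {"tiny_dataset": 200, "large_dataset": 665_000}
print(repeat_factors(sizes))
# {'tiny_dataset': ~18.2, 'large_dataset': 1.0} -> the tiny dataset is oversampled
```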

**Experimental Results**

**Comprehensive Evaluation** X-SAM was evaluated on more than 20 segmentation datasets covering 7 different image segmentation tasks and achieves state-of-the-art performance on all of them.

**Key Tasks Evaluated:**

- Referring segmentation
- Conversational generation segmentation
- Visual grounded segmentation
- Vision-language understanding

**Visualization Results** [Results demonstrations are included in the original research]

**Summary and Outlook**

As the first truly unified segmentation multimodal large language model, X-SAM achieves the leap from "segment anything" to "any segmentation." Through its novel VGD segmentation task, unified architectural design, and progressive training strategy, X-SAM remains competitive on individual tasks while covering a far broader range of tasks, opening new directions for image segmentation research and laying a foundation for general visual understanding systems.

Future research can extend X-SAM to the video domain. First, integrating it with SAM2 would enable unified segmentation of images and videos, further expanding its application scope. Second, extending VGD segmentation to video by incorporating temporal information would define a new video segmentation task and open new possibilities for video understanding.

