On November 3, FG-CLIP2, a vision-language alignment model quietly open-sourced by 360 Group, sparked widespread discussion in the global tech community. The model outperformed offerings from tech giants Google (SigLIP 2) and Meta (MetaCLIP 2) across 29 public benchmarks, including long- and short-text image retrieval and object detection, marking another breakthrough for China in AI foundation models.
FG-CLIP2 tackles the long-standing "fine-grained recognition" weakness of CLIP-style models, achieving 96% confidence in detail recognition even in complex multi-object scenes. The model introduces three core innovations. First, a hierarchical alignment architecture lets it grasp both the macro scene and micro details, closing the gap between "seeing" and "seeing clearly." Second, a dynamic attention mechanism focuses computation on key image regions, capturing fine detail at minimal extra cost. Third, a bilingual co-optimization strategy addresses the Chinese-English imbalance at the foundation, delivering genuinely native bilingual support.
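The article does not detail FG-CLIP2's internals, but a minimal PyTorch sketch can illustrate what two-level ("macro scene" plus "micro detail") contrastive alignment generally looks like. Everything here is an assumption for illustration: the `HierarchicalAlignment` class, the embedding dimensions, and the `region_weight` parameter are hypothetical and are not FG-CLIP2's actual API or training code.

```python
# Illustrative sketch of hierarchical image-text alignment, NOT FG-CLIP2's
# actual implementation: a global image embedding is aligned with the full
# caption, while region embeddings are aligned with short phrase embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalAlignment(nn.Module):
    """Toy two-level contrastive head (all names/dims are assumptions)."""

    def __init__(self):
        super().__init__()
        # Learnable temperature, as in CLIP-style contrastive training.
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def contrastive_loss(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # Symmetric InfoNCE over a batch of paired embeddings.
        a = F.normalize(a, dim=-1)
        b = F.normalize(b, dim=-1)
        logits = self.logit_scale.exp() * a @ b.t()
        targets = torch.arange(a.size(0), device=a.device)
        return (F.cross_entropy(logits, targets)
                + F.cross_entropy(logits.t(), targets)) / 2

    def forward(self, img_global, txt_global, img_regions, txt_phrases,
                region_weight: float = 0.5):
        # Global level: whole image vs. full caption ("macro scene").
        loss_global = self.contrastive_loss(img_global, txt_global)
        # Fine-grained level: flatten (batch, regions, dim) so each
        # region/phrase pair becomes its own contrastive example.
        loss_region = self.contrastive_loss(
            img_regions.flatten(0, 1), txt_phrases.flatten(0, 1))
        return loss_global + region_weight * loss_region


# Usage with random stand-in embeddings: batch of 8, 4 regions per image.
model = HierarchicalAlignment()
loss = model(torch.randn(8, 512), torch.randn(8, 512),
             torch.randn(8, 4, 512), torch.randn(8, 4, 512))
```

In this kind of setup, the global loss teaches the model to match whole scenes while the region loss forces it to distinguish fine details within them; the bilingual support the article describes would, under the same sketch, amount to feeding both Chinese and English captions through a shared text encoder during training.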