At the AGI-Next Frontier Summit, co-hosted by a Beijing key laboratory of Tsinghua University and Zhipu AI, Professor Tang Jie of Tsinghua University, the founder of Zhipu, pointed out that since 2025, large AI models have begun to show rapid improvement on Humanity's Last Exam (HLE), a highly challenging benchmark for evaluating intelligence.
Tang Jie noted that back in 2020, large AI models could only handle basic benchmark tasks such as MMLU and question answering (QA). In 2021-2022, through post-training, they began to acquire basic arithmetic reasoning (addition, subtraction, multiplication, division), closing a fundamental reasoning gap. From 2023 to 2024, large models evolved from knowledge memorization to complex reasoning, starting to tackle graduate-level problems and real-world programming tasks such as SWE-bench, mirroring the human progression from elementary school to the professional workplace. In 2025, model performance on Humanity's Last Exam is advancing rapidly; the test includes questions so obscure that they cannot be retrieved via a Google search, so solving them demands strong generalization ability.
"There has always been a desire for machines (AI) to have generalization capabilities, where teaching them a little enables them to infer much more," Tang Jie stated. Although the generalization ability of AI today still needs significant improvement, Zhipu, and indeed the entire industry, is actively working to enhance it through a series of methods.
Around 2020, building on the Transformer architecture, the industry strengthened models' long-term knowledge retention by scaling up data volume and computing power, enabling direct recall of basic knowledge (such as answering "What is the capital of China?"). Around 2022, the focus shifted to alignment and reasoning optimization to improve complex reasoning and intent understanding; the core methods were the continual scaling of supervised fine-tuning (SFT) and reinforcement learning from human feedback, which rely on vast amounts of human-labeled data to improve model accuracy. By 2025, the field has begun experimenting with verifiable environments, in which machines explore autonomously and acquire feedback data for self-improvement, strengthening generalization and addressing the high noise and limited scenario coverage inherent in traditional human feedback data.
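To make that last idea concrete, here is a minimal sketch of what a verifiable environment can look like, assuming a simple math-answering task: the reward comes from a programmatic checker rather than a human rater, so scored trajectories can be collected at scale. The `generate` callable and the harness below are hypothetical illustrations, not any specific Zhipu system.

```python
import re
from dataclasses import dataclass

def verify_math_answer(output: str, ground_truth: float, tol: float = 1e-6) -> float:
    """Programmatic verifier: reward 1.0 if the last number in the model's
    output matches the ground truth, else 0.0. No human judgment involved."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", output)
    if not numbers:
        return 0.0  # no parseable answer, no reward
    return 1.0 if abs(float(numbers[-1]) - ground_truth) < tol else 0.0

@dataclass
class Trajectory:
    prompt: str
    output: str
    reward: float

def collect_trajectories(generate, problems):
    """Let a model explore verifiable problems; the scored trajectories can
    feed a later RL or fine-tuning step."""
    trajectories = []
    for prompt, truth in problems:
        output = generate(prompt)                    # model rollout (exploration)
        reward = verify_math_answer(output, truth)   # programmatic feedback
        trajectories.append(Trajectory(prompt, output, reward))
    return trajectories

# Toy usage with a stand-in "model" (a plain function here):
if __name__ == "__main__":
    problems = [("What is 17 * 24?", 408.0), ("Compute 3.5 + 2.25.", 5.75)]
    dummy_generate = lambda p: "The answer is 408." if "17" in p else "It is 5.75."
    for t in collect_trajectories(dummy_generate, problems):
        print(t.reward, "<-", t.output)
```

Because the verifier is deterministic, the feedback it produces is essentially noise-free and scales with compute rather than with human labeling, which is the property that makes this approach attractive compared with human-feedback pipelines.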