In the College of Chemistry and Molecular Engineering at Peking University, organic chemistry exams are a challenging yet rewarding experience for many students. However, an unexpected announcement before the midterm exam created an unusual atmosphere: "Please note that this exam will cover topics beyond organic chemistry."
What was even more surprising than the expanded scope was the arrival of a group of "special examinees" in the exam hall. These participants required no seats, paper, or pens. They were GPT, Gemini, DeepSeek—some of the world's most advanced AI models—competing remotely against 174 sophomore students from the college.
This was a carefully designed "Turing test," built by a Peking University research team as a touchstone for large language models. Recently, the College of Chemistry and Molecular Engineering, in collaboration with teams from the university's Computing Center, School of Computer Science, and Yuanpei College, unveiled their latest achievement: SUPERChem. Using a set of "Peking University exam questions" as a benchmark, they meticulously measured the true boundaries of AI's scientific reasoning capabilities.
Opening the SUPERChem question bank immediately conveys a sense of intensity. The 500 questions—covering fine-grained analysis of crystal structures, in-depth deduction of reaction mechanisms, and quantitative calculations of physicochemical properties—are not sourced from readily available public databases. Instead, they are deeply adapted from high-difficulty exam questions and cutting-edge professional literature.
Why go through the trouble of creating new questions? "Because large models are too good at 'memorization'," explained a team member. Most test questions accessible online have already been thoroughly studied by knowledge-hungry AIs during their training phases. Chemistry, however, is a discipline that cannot be mastered by rote learning alone—it requires rigorous logical deduction and spatial imagination of the microscopic world. "We were very curious whether the one-dimensional 'next token prediction' of large language models could solve complex reasoning problems in two-dimensional or even three-dimensional spaces."
Designing a set of questions that AI has "never seen before" and can solve only through genuine reasoning is extremely challenging. Yet this is where Peking University's College of Chemistry holds a unique advantage. Nearly 100 faculty and students, including many gold medalists from chemistry Olympiads, joined forces to create a high-threshold, reasoning-heavy, and cheat-proof exam.
Their goal was to test whether AI truly "understands" chemistry.
While designing test questions is often tedious, these young Peking University students turned the process into a "game." To build this high-quality assessment set, the team established a dedicated collaborative platform. There, tasks like question creation, review, and revision were transformed into a step-by-step "level-clearing" process. Members collaborated on the platform, reviewing each other's work and challenging one another, blending rigorous scientific discussion with dynamic intellectual exchange.
The team also introduced a point-based incentive system, making the question-writing process feel like leveling up in a video game. Each question had to go through initial drafting, solution writing, and then strict preliminary and final reviews, with different students overseeing each step and earning corresponding points. Some questions that passed final review had undergone up to 15 iterations.
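In software terms, this workflow is a small state machine: a question advances one stage per approval and falls back for revision otherwise, while contributors accrue points along the way. The sketch below is purely illustrative; the article does not describe the platform's internals, so the stage names, point values, and the `Question` class are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical stages matching the pipeline described above:
# drafting -> solution writing -> preliminary review -> final review.
STAGES = ["draft", "solution", "preliminary_review", "final_review", "accepted"]

# Hypothetical point awards per stage; the team's real values are not public.
POINTS = {"draft": 2, "solution": 3, "preliminary_review": 1, "final_review": 2}

@dataclass
class Question:
    qid: str
    stage: str = "draft"
    iterations: int = 0  # revision rounds (the article mentions up to 15)
    scores: dict = field(default_factory=dict)  # contributor -> earned points

    def advance(self, contributor: str, approved: bool) -> None:
        """Credit the contributor, then move forward on approval or send back for revision."""
        if self.stage == "accepted":
            return
        self.scores[contributor] = self.scores.get(contributor, 0) + POINTS[self.stage]
        if approved:
            self.stage = STAGES[STAGES.index(self.stage) + 1]
        else:
            self.iterations += 1
            self.stage = "draft"

q = Question("superchem-001")
q.advance("drafter", approved=True)      # draft -> solution
q.advance("solver", approved=True)       # solution -> preliminary_review
q.advance("reviewer_1", approved=False)  # rejected: back to draft, one more iteration
```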
When the exam results were revealed, the humans demonstrated the value of hard-won scientific intuition. As a baseline, the participating undergraduate chemistry students achieved an average accuracy of 40.3%, a figure that itself speaks to the difficulty of the questions.
How did the AI perform? Even the top models tested achieved scores only comparable to the average level of lower-year undergraduates.
What surprised the team was the confusion introduced by visual information. The language of chemistry is graphical; molecular structures and reaction mechanism diagrams carry critical information. Yet for some models, adding image information actually caused accuracy to drop rather than rise. This indicates that current AI still faces significant perceptual bottlenecks when converting visual information into chemical semantics.
Moreover, even when AI selected the correct answer, its problem-solving steps often could not withstand scrutiny. Therefore, the team annotated each question with detailed scoring rules. Under the "microscope" of SUPERChem, it became clear whether AI genuinely understood the material or was merely pretending.
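The article does not reproduce the rubric format, but process-level grading of this kind is straightforward to express: each question carries a list of checkable reasoning steps, and a response earns credit only for the steps it actually demonstrates. Below is a minimal, hypothetical sketch in Python, where a crude keyword check stands in for the expert judgment applied to each step.

```python
from dataclasses import dataclass

@dataclass
class RubricStep:
    description: str     # e.g. "identifies the reactive intermediate"
    points: float
    evidence: list[str]  # hypothetical keyword check standing in for expert grading

def score_reasoning(response: str, rubric: list[RubricStep]) -> float:
    """Return the fraction of rubric points whose evidence appears in the response."""
    text = response.lower()
    earned = sum(
        step.points
        for step in rubric
        if all(kw.lower() in text for kw in step.evidence)
    )
    return earned / sum(step.points for step in rubric)

# A lucky final answer with broken intermediate steps scores poorly here,
# which is the failure mode per-question scoring rules are meant to expose.
rubric = [
    RubricStep("names the electrophile", 1.0, ["electrophile"]),
    RubricStep("identifies the intermediate", 2.0, ["carbocation"]),
]
print(score_reasoning("The electrophile adds first, giving a carbocation.", rubric))  # 1.0
```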
The team found that AI's reasoning chains often broke down in high-level tasks such as product structure prediction, reaction mechanism identification, and structure-activity relationship analysis. Despite possessing vast knowledge reserves, today's leading models still struggle when confronted with hardcore chemical problems requiring rigorous logic and deep comprehension.
The creation of SUPERChem fills a gap in multimodal, deep-reasoning evaluation within the field of chemistry. The team released these results not to highlight AI's shortcomings, but to push it further. SUPERChem serves as a signpost, reminding us that there is still a long journey from general-purpose chatbots to professional scientific assistants capable of understanding structure-activity relationships and deducing reaction mechanisms—a leap from "memorizing knowledge" to "understanding the physical world."
Currently, the SUPERChem project has been fully open-sourced. The team hopes that this "exam" originating from Peking University will become a public asset for the global scientific and artificial intelligence communities, catalyzing the next technological breakthrough. Perhaps in the near future, when we open this test again, AI will be able to submit a perfect score—a delightful surprise for both chemistry and artificial intelligence.
To let you experience this exam for yourself, we have selected one "simple" question that did not make it into the SUPERChem question bank.
To commemorate the 150th anniversary of Mendeleev's discovery of the periodic law, the International Union of Pure and Applied Chemistry designated 2019 as the "International Year of the Periodic Table of Chemical Elements." Mendeleev predicted several then-unknown elements, one of which is M.
M is a silvery-white metal, soft in texture, and soluble in concentrated sulfuric acid, nitric acid, hydrochloric acid, and dilute alkaline solutions. When heated to 250°C in oxygen, M forms a pale yellow solid A. Treating A with SOCl₂ yields a bright yellow solid B, which can also be obtained by heating M directly in a yellow-green gas C. If B is heated to 200°C with an elemental gas D, it transforms into a red solid E. Dissolving M directly in dilute hydrochloric acid also yields a solution of E. However, if a magnesium plate plated with M is dissolved in dilute hydrochloric acid, a small amount of a binary compound F can be prepared. F is a liquid at room temperature, unstable, and its aqueous solution is acidic. F reacts with potassium metal to form a light gray solid G, releasing the elemental gas D.
Based on the above information, select the correct statement from the following options:
A: The parities (odd/even) of the atomic number and the group number of M are different.
B: In the reaction of the M-plated magnesium plate with dilute hydrochloric acid, the oxidation state of Mg in the product is the same as that of M in compound A.
C: G has an anti-fluorite structure.
D: Owing to oxidation by air, a solution of E will transform into a solution containing B after prolonged storage.