The Dawn of the Inference Era: Four Major Computing Trends Behind NVIDIA's "Mystery Chip"

Deep News

NVIDIA's integration of LPU (Language Processing Unit) technology and OpenAI's diversified investments in inference chips are shifting the primary battleground of AI computing competition from training to inference. According to Shenwan Hongyuan Research, the core theme of the computing industry in 2026 will be inference, with both total token consumption and technological paradigms undergoing deep restructuring around this focus.

On February 28, The Wall Street Journal reported that NVIDIA plans to unveil a new inference chip at next month's GTC developer conference, integrating Groq's LPU technology. NVIDIA CEO Jensen Huang described it as a "completely new system the world has never seen." OpenAI has agreed to become one of the largest customers for this processor and will purchase substantial "dedicated inference capacity" from NVIDIA.

Simultaneously, OpenAI last month entered into a multi-billion-dollar computing partnership with startup Cerebras, which claims its inference chips have surpassed NVIDIA GPUs in speed. These developments indicate that AI giants are transitioning from an arms race in training compute to a multi-pronged strategy for inference compute.

Shenwan Hongyuan's report highlights four key trends in inference computing in the token-economy era. First, pure-CPU deployments are appearing in more scenarios, as demand for low-cost inference accelerates the decentralization of computing. Second, specialized architectures such as LPUs are on the rise, challenging GPU dominance in inference. Third, domestic computing chips are achieving accelerated breakthroughs, with a clear trend toward supply chain diversification. Fourth, the structure of inference demand is shifting from "single training sessions" to "massive token consumption," making cost-effectiveness a core competitive factor.

The report states that manufacturers capable of providing sufficient, cost-effective inference chips will benefit most. Breakthroughs in CPUs, LPUs, and domestic chips collectively form the core narrative reshaping the current computing landscape.

Inference demand is experiencing comprehensive growth, with token consumption reaching record highs. Shenwan Hongyuan Research attributes this expansion to two structural drivers: accelerated monetization of large models, with models such as Claude adding industry plugins for application integration, and the faster rollout of AI agents, with products such as OpenClaw and Qianwen Agent moving into real work and production environments. Every model call and every agent task execution requires substantial inference compute.

Data cited by Shenwan Hongyuan Research shows significant growth in inference volume for leading domestic large models during the Spring Festival: Doubao processed 63.3 billion tokens on New Year's Eve, Yuanbao reached 114 million monthly active users, and over 120 million people participated in Qianwen's "Spring Festival Free Trial" event.

Global AI model API aggregation platform OpenRouter further illustrates the scale of this trend. During the week of February 9-15, Chinese models logged 4.12 trillion tokens in calls, surpassing US models' 2.94 trillion for the first time. The following week (February 16-22), Chinese model calls surged to 5.16 trillion tokens, a 127% increase over the preceding three weeks, and Chinese models held four of the top five spots in global call volume.

LPUs are emerging as new contenders, signaling a divergence between training and inference chips. NVIDIA invested $20 billion to license Groq's core technology and brought on key personnel, including founder Jonathan Ross, through an acqui-hire-style transaction. Shenwan Hongyuan Research views the deal as formal recognition by top players of the importance of pure inference chips.

Architectural differences between LPUs and traditional GPUs explain their efficiency advantages in inference scenarios. AI inference consists of a prefill stage and a decode stage; decode generates one token at a time, and each step must stream the model's weights (and growing KV cache) from memory, so large-model decoding is bound by latency and memory bandwidth rather than raw compute. LPUs are optimized specifically for these two bottlenecks. Previous reports suggest NVIDIA's upcoming product may use the next-generation Feynman architecture, potentially adopting broader SRAM integration or even deep LPU integration through 3D stacking.
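A rough roofline-style calculation makes the bandwidth argument concrete. The sketch below uses illustrative figures only, not published specifications, assuming single-stream decode of a hypothetical 70B-parameter model stored at one byte per weight:

```python
# Back-of-the-envelope: why decode speed tracks memory bandwidth.
# All numbers are illustrative assumptions, not vendor specifications.

def decode_tokens_per_second(params_billions: float, bytes_per_param: float,
                             mem_bandwidth_tb_s: float) -> float:
    """Upper bound on single-stream decode: each generated token requires
    streaming all model weights from memory once."""
    bytes_per_token = params_billions * 1e9 * bytes_per_param
    return mem_bandwidth_tb_s * 1e12 / bytes_per_token

# Hypothetical 70B-parameter model served at 1 byte per weight (FP8).
for memory, bandwidth in [("HBM-class (~5 TB/s)", 5.0),
                          ("on-chip SRAM (~80 TB/s)", 80.0)]:
    ceiling = decode_tokens_per_second(70, 1.0, bandwidth)
    print(f"{memory}: ~{ceiling:,.0f} tokens/s ceiling")
```

On these assumed figures, the SRAM-backed design has roughly a 16x higher decode ceiling, which is the intuition behind LPU-style architectures.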

Shenwan Hongyuan Research concludes that future AI chips will develop a clear technical division of labor: training will continue to rely on GPU-plus-HBM combinations, while inference will evolve toward ASIC and LPU designs pairing SRAM with SSD storage. As computing demand shifts from training to inference, manufacturers specializing in inference chips will see significant development opportunities.

System-wide innovation in inference, with simultaneous increases in CPU and network demands, represents another critical dimension of current inference computing upgrades. Shenwan Hongyuan Research notes that as applications evolve from chatbots to agents, computing systems require simultaneous improvements in latency, throughput, and reasoning depth, driving architecture toward a three-layer network.

The first layer is the fast-response layer, providing ultra-low-latency feedback through pure inference chips equipped with SRAM. The second layer is the slow-thinking layer, using high-throughput computing clusters for complex logical reasoning, where demand for multi-core, multi-threaded CPUs will rise significantly. The third layer is the memory layer, corresponding to NVIDIA's ContextMemory System, which manages agents' long-term memory and KV cache through SSD storage managed by BlueField-4 DPUs.
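As a minimal sketch of how a request might flow across such a three-layer system (all class names, fields, and routing rules below are hypothetical illustrations, not NVIDIA APIs):

```python
# Hypothetical three-layer dispatch for an agent request; names and logic
# are illustrative only, not an actual NVIDIA or vendor interface.
from dataclasses import dataclass
from enum import Enum

class Layer(Enum):
    FAST_RESPONSE = "sram_inference_chip"  # ultra-low-latency token streaming
    SLOW_THINKING = "cpu_cluster"          # deep multi-step reasoning
    MEMORY = "dpu_managed_ssd"             # long-term memory and KV cache

@dataclass
class AgentRequest:
    prompt_tokens: int
    needs_deep_reasoning: bool
    session_id: str | None = None  # set when resuming a stored conversation

def route(request: AgentRequest) -> list[Layer]:
    """Decide which layers a request touches, in order."""
    path = []
    if request.session_id is not None:
        path.append(Layer.MEMORY)         # rehydrate KV cache from SSD
    if request.needs_deep_reasoning:
        path.append(Layer.SLOW_THINKING)  # plan on high-throughput CPUs
    path.append(Layer.FAST_RESPONSE)      # stream the reply with low latency
    return path

print(route(AgentRequest(prompt_tokens=2048, needs_deep_reasoning=True,
                         session_id="session-42")))
```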

NVIDIA is adjusting its hardware strategy accordingly. Its previous standard practice of bundling the Vera CPU with the Rubin GPU proved too costly for certain AI agent workloads. This month, NVIDIA announced an expanded collaboration with Meta Platforms, completing its first large-scale pure-CPU deployment to support Meta's ad-targeting AI agent, a sign that the company is moving beyond a pure GPU sales model.

Domestic computing power is achieving accelerated breakthroughs. Shenwan Hongyuan Research believes the technological upgrades in domestic inference chips deserve close attention, as a gap remains between current progress and market expectations.

Technologically, the new generation of domestic inference chips has achieved several fundamental improvements: added support for low-precision data formats such as FP8/MXFP8/MXFP4, delivering on the order of 1 PFLOPS and 2 PFLOPS at the respective precisions; significantly stronger vector computing through new homogeneous designs supporting both SIMD and SIMT programming models; and interconnect bandwidth raised 2.5x over the previous generation, reaching 2 TB/s.
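For intuition on why these formats matter for inference cost, a quick calculation (assuming a hypothetical 32B-parameter model and the OCP MX convention of one shared 8-bit scale per 32-element block) shows how much weight memory each format needs:

```python
# Illustrative arithmetic: weight-memory footprint of a hypothetical
# 32B-parameter model under different precisions. MX block formats add
# one 8-bit shared scale per 32-element block (per the OCP MX spec).
PARAMS = 32e9
BYTES_PER_WEIGHT = {
    "FP16":  2.0,
    "FP8":   1.0,
    "MXFP8": (8 + 8 / 32) / 8,  # 1.03125 bytes per weight
    "MXFP4": (4 + 8 / 32) / 8,  # 0.53125 bytes per weight
}
for fmt, bpw in BYTES_PER_WEIGHT.items():
    print(f"{fmt:>6}: {PARAMS * bpw / 1e9:6.1f} GB of weights")
```

Halving the bytes per weight roughly halves both the memory footprint and the bandwidth needed per decoded token, which is why low-precision support bears directly on inference cost-effectiveness.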

Particularly noteworthy is the implementation of PD (prefill-decode) separation at the chip level: self-developed HBM in different specifications yields a PR version for prefill and recommendation scenarios and a DT version for decode and training scenarios. The PR version uses lower-cost HBM, substantially reducing investment in the inference prefill stage, and is expected to launch in Q1 2026.
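A minimal sketch of what PD separation looks like at the serving level (the pool split and hand-off step reflect the general technique; function names and the cache format are hypothetical illustrations, not any vendor's product):

```python
# Toy prefill/decode (PD) disaggregation: the prompt is processed on one
# pool and the resulting KV cache is handed to a second pool for decoding.

def prefill(prompt_tokens: list[int]) -> dict:
    """Runs once per request on the prefill pool, which is compute-heavy
    and can use cheaper, high-capacity memory: builds the prompt's KV cache."""
    return {"kv_cache": f"<cache covering {len(prompt_tokens)} tokens>"}

def decode(state: dict, max_new_tokens: int) -> list[str]:
    """Runs once per output token on the decode pool, which is
    bandwidth-bound and benefits from fast memory."""
    return [f"tok{i}" for i in range(max_new_tokens)]  # stand-in for sampling

state = prefill(list(range(1024)))        # PR-style hardware handles the prompt
answer = decode(state, max_new_tokens=8)  # DT-style hardware streams the reply
print(answer)
```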

Supply chain progress provides supporting evidence. According to a leading packaging and testing company's first-round inquiry response, its 2.5D packaging revenue, drawn primarily from high-performance computing chip packaging services, grew rapidly from 50 million yuan in 2022 to 1.82 billion yuan in 2024, indirectly confirming the continuous improvement in domestic computing chip supply capacity and the accelerating localization of the supply chain.

