For years, tech giant CEOs have promoted the vision of AI agents - intelligent systems that autonomously use software to complete tasks on a person's behalf. Yet when today's consumer-facing agents, whether OpenAI's ChatGPT Agent or Perplexity's Comet, are put to real use, the technology's limitations remain plain. Making agents more robust may require a set of techniques the industry has yet to fully explore.
One such technology involves carefully simulating "workspaces" where agents receive training on multi-step tasks - these "workspaces" are called reinforcement learning (RL) environments. Just as labeled datasets drove the previous wave of AI development, reinforcement learning environments are gradually becoming key elements in the agent development process.
According to AI researchers, entrepreneurs, and investors, demand for reinforcement learning environments from top AI labs has surged, and a crowd of startups is eager to supply them.
"All major AI labs are building reinforcement learning environments internally," says Jennifer Li, general partner at Andreessen Horowitz. "But as you can imagine, creating these types of datasets is extremely complex, so AI labs are also looking for third-party suppliers who can build high-quality environments and evaluation systems. The entire industry is focusing on this area."
The demand for reinforcement learning environments has spawned a batch of well-funded startups, such as Mechanize Work and Prime Intellect, each vying for a leading position in the field. Meanwhile, major data annotation companies like Mercor and Surge say they, too, are investing more heavily in reinforcement learning environments as the industry shifts from static datasets to interactive simulations. The large labs are weighing massive investments of their own: according to The Information, Anthropic's leadership has discussed spending more than $1 billion on reinforcement learning environments over the next year.
Investors and entrepreneurs hope that one of these startups will become the "Scale AI of reinforcement learning environments" - a reference to Scale AI, the data annotation giant valued at $29 billion that provided crucial support for the chatbot era.
The core question currently is whether reinforcement learning environments can truly push AI technology beyond existing boundaries.
What are Reinforcement Learning (RL) Environments?
Essentially, reinforcement learning environments are "training grounds" that simulate scenarios where AI agents operate in real software applications. One entrepreneur described the construction process in a recent interview as "like making a very boring video game."
For example, an environment might simulate a Chrome browser and give an AI agent the task of "buying a pair of socks on Amazon." The system scores the agent's performance and sends a "reward signal" if the task is successful (i.e., the right socks are purchased).
Although such a task sounds simple, an AI agent can stumble at many points along the way: it might get "lost" in a site's dropdown menus or buy several pairs of socks by mistake. Because developers cannot predict exactly which mistakes an agent will make, the environment itself must be robust enough to capture any unexpected behavior while still providing useful feedback - which makes building environments far more complex than assembling a static dataset.
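The loop described above - an environment that presents a task, accepts an agent's actions, and emits a reward signal - can be sketched in the Gym style. Everything here (the shop, the action names, the reward rule) is hypothetical and exists only to illustrate the structure:

```python
class SockShopEnv:
    """Toy RL environment: the agent must add socks to the cart exactly once,
    then check out. The reward signal fires only on a correct purchase."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.cart = 0          # pairs of socks in the cart
        self.done = False
        return {"cart": self.cart}

    def step(self, action):
        # Valid actions: "add_socks", "checkout"
        if self.done:
            raise RuntimeError("episode finished; call reset()")
        reward = 0.0
        if action == "add_socks":
            self.cart += 1     # an agent that gets "lost" may do this twice
        elif action == "checkout":
            self.done = True
            # Reward signal: success only if exactly one pair was bought
            reward = 1.0 if self.cart == 1 else 0.0
        else:
            self.done = True   # any unexpected action ends the episode
        return {"cart": self.cart}, reward, self.done

env = SockShopEnv()
env.reset()
env.step("add_socks")
_, reward, done = env.step("checkout")   # correct behavior: reward is 1.0
```

A real environment would simulate a browser and score long action sequences; the point of the sketch is only the step/reward interface that training runs against.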
Some reinforcement learning environments are designed with great complexity, supporting AI agents in using tools, accessing the internet, or calling various software applications to complete specified tasks; others are more narrowly positioned, focusing on helping agents learn specific tasks in enterprise software applications.
Although reinforcement learning environments are now the hot technology in Silicon Valley, the approach has clear precedents. One of OpenAI's first projects, in 2016, was building "RL Gyms" - reinforcement learning gyms conceptually very similar to today's environments. That same year, Google DeepMind's AlphaGo defeated a world champion at Go, likewise using reinforcement learning in a simulated environment.
What makes today's reinforcement learning environments different is that researchers are trying to pair them with large Transformer models to build AI agents that can "use a computer." Unlike AlphaGo, a specialized system confined to a closed environment, today's agents aim for far more general capabilities. Researchers now start from a stronger technical foundation, but they are also chasing more ambitious goals - with more ways for things to go wrong.
A Highly Competitive Field
AI data annotation companies like Scale AI, Surge, and Mercor are actively adapting to trends, focusing on building reinforcement learning environments. These companies not only have more abundant resources than most startups in this field but have also established deep partnerships with AI labs.
Surge CEO Edwin Chen says he has recently seen "significant growth" in AI labs' demand for reinforcement learning environments. Surge, which reportedly generated $1.2 billion in revenue last year through work with labs including OpenAI, Google, Anthropic, and Meta, has recently spun up a dedicated internal team to build reinforcement learning environments.
Close behind Surge is Mercor, a startup valued at $10 billion that also works with OpenAI, Meta, and Anthropic. Marketing materials obtained by TechCrunch show Mercor pitching investors on a business built around reinforcement learning environments for domain-specific tasks in programming, healthcare, and law.
Mercor CEO Brendan Foody said in an interview: "Very few people truly realize how big the opportunities in the reinforcement learning environment field really are."
Scale AI once dominated data annotation, but its market share has slipped since Meta invested $14 billion in the company and hired away its CEO. Google and OpenAI have since dropped Scale AI as a data supplier, and the company faces competition for annotation work even inside Meta. Nevertheless, Scale AI is still adapting, investing in reinforcement learning environment construction of its own.
"This is the nature of the industry (Scale AI) operates in," says Chetan Rane, Scale AI's product lead for agents and reinforcement learning environments. "Scale has proven its ability to adapt quickly: we did this in the early stages of our first business segment - autonomous driving; after ChatGPT emerged, Scale AI also successfully adapted to new trends; now, we're again making adjustments in new frontier areas like agents and environments."
Some younger companies have focused on reinforcement learning environments from day one. One is Mechanize Work, a startup roughly six months old with the audacious goal of "automating all work." Co-founder Matthew Barnett told TechCrunch, however, that the company is starting with reinforcement learning environments for AI programming agents.
Barnett says Mechanize Work plans to supply AI labs with a small number of highly robust reinforcement learning environments rather than the large volumes of simpler ones the big data companies produce. To that end, the startup is offering software engineers $500,000 a year to build environments - far more than hourly contractors can earn at Scale AI or Surge.
Two informed sources reveal that Mechanize Work has begun collaborating with Anthropic to develop reinforcement learning environments. Both Mechanize Work and Anthropic declined to comment on cooperation details.
Other startups are betting that reinforcement learning environments will matter beyond the big AI labs. Prime Intellect, a startup backed by AI researcher Andrej Karpathy, Founders Fund, and Menlo Ventures, is aiming its reinforcement learning environments at small and medium-sized developers.
Last month, Prime Intellect launched a reinforcement learning environment hub, with the goal of becoming the "Hugging Face of reinforcement learning environments" (Hugging Face is a well-known open-source hub in the AI field). The platform aims to give open-source developers the same resources the major labs have, while selling them access to compute along the way.
Prime Intellect researcher Will Brown says that training agents with general capabilities in reinforcement learning environments may require higher computational costs than previous AI training techniques. Therefore, besides startups building reinforcement learning environments, GPU suppliers providing computational support for this process will also have opportunities.
"No single company can dominate the reinforcement learning environment field alone - it's too large," Brown said in an interview. "Part of what we're currently doing is trying to build good open-source infrastructure around this field. Our core service is providing computational resources, which is indeed a convenient entry point for using GPUs, but we're more focused on long-term development."
Can It Achieve Scalable Development?
One unresolved question hangs over reinforcement learning environments: can the technique scale the way previous AI training methods have?
Reinforcement learning has powered several of the past year's biggest AI breakthroughs, including OpenAI's o1 model and Anthropic's Claude Opus 4. Those breakthroughs matter because the methods previously used to improve AI models are now showing diminishing returns.
Reinforcement learning environments are part of the labs' larger bet on reinforcement learning itself - the belief that pouring more data and compute into the technique will keep driving AI progress. Some of the researchers behind OpenAI's o1 have said the company invested early in AI reasoning models, built through reinforcement learning and test-time computation, precisely because it believed the approach would scale well.
The best path to scaling reinforcement learning is not yet clear, but environments look like a promising candidate. Instead of simply rewarding a chatbot for its text responses, they let agents operate tools and computers to complete tasks in simulation - an approach that consumes far more resources but offers a potentially greater payoff.
Some people are skeptical about the development prospects of reinforcement learning environments. Ross Taylor, former Meta AI research head who now co-founded General Reasoning, says reinforcement learning environments are prone to "reward hacking" - where AI models "cheat" to obtain rewards without actually completing tasks.
"I think people underestimate the difficulty of scaling environments," Taylor says. "Even the best currently publicly available (reinforcement learning environments) usually cannot function properly without major modifications."
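The reward hacking Taylor describes typically arises when the reward checks a proxy signal rather than the real goal. The toy sketch below (all names invented for illustration) shows a scoring function that trusts an order-confirmation flag, and an agent that games it by setting the flag without ever completing the purchase:

```python
def proxy_reward(state):
    # Flawed reward: trusts a confirmation flag instead of verifying the purchase
    return 1.0 if state.get("order_confirmed") else 0.0

def honest_agent():
    # Actually buys the socks, which sets the flag as a side effect
    return {"socks_bought": 1, "order_confirmed": True}

def hacking_agent():
    # Never buys anything, but discovers it can set the flag directly
    return {"socks_bought": 0, "order_confirmed": True}

# Both agents earn full reward, even though only one completed the task:
assert proxy_reward(honest_agent()) == proxy_reward(hacking_agent()) == 1.0

def robust_reward(state):
    # Checks the actual outcome, closing this particular loophole
    return 1.0 if state.get("socks_bought") == 1 else 0.0
```

Closing one loophole rarely closes them all, which is why hardening environments against every such exploit is the expensive part of scaling them.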
Sherwin Wu, OpenAI's engineering lead for its API business, said on a recent podcast that he is bearish on startups in the reinforcement learning environment space. Wu noted that competition in the field is fierce and that AI research moves so quickly that serving the labs well is very difficult.
Karpathy, an investor in Prime Intellect who has called reinforcement learning environments a potential breakthrough, has also voiced caution about reinforcement learning as a whole. In a post on X, he questioned how much more progress the technique can actually extract from AI.
"I'm optimistic about environments and agentic interactions, but pessimistic about reinforcement learning specifically," Karpathy wrote.