This hesitancy has evaporated. While the ultimate realization of these advanced machines remains on the horizon, capital is now flowing at an unprecedented rate. In 2025 alone, companies and investors poured $6.1 billion into humanoid robot development, a staggering fourfold increase from the previous year. The catalyst for this surge is a fundamental revolution in how machines learn to perceive and interact with their environment.
Consider the task of teaching a pair of robotic arms to fold clothes. The traditional approach involved painstakingly crafting intricate rule sets: determine how much force the fabric can withstand before tearing, pinpoint the shirt collar, guide the grippers precisely, and dictate exact folding distances, all while accounting for rotations and twisted sleeves. This method, the bedrock of early robotics, demanded anticipating every conceivable scenario and encoding it into the system. The sheer complexity of such a rulebook quickly became unmanageable, yet it offered a path to predictable results.
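To make the contrast concrete, here is a minimal sketch of what such a hand-coded pipeline might look like. Every name, threshold, and rule below is hypothetical; real systems of that era were vastly larger, but the failure mode was the same: any situation the rules did not anticipate had no recovery path.

```python
# Hypothetical sketch of the classical rule-based approach: every perception
# heuristic, force limit, and fold distance is written out by hand in advance.
MAX_GRIP_FORCE_N = 4.0   # tuned by hand so the fabric never tears
FOLD_OFFSET_CM = 12.5    # the exact folding distance the rulebook dictates

def locate_collar(frame):
    # Stand-in for a hand-crafted vision heuristic (template matching, etc.).
    return frame.get("collar_xy")

def fold_shirt(frame):
    collar = locate_collar(frame)
    if collar is None:
        raise RuntimeError("unanticipated scenario: no collar detected")
    if frame.get("sleeve_twisted"):
        # A twist the rulebook never covered simply halts the system.
        raise RuntimeError("unanticipated scenario: twisted sleeve")
    # The scripted motion plan: grip at the collar, translate a fixed distance.
    return [("grip", collar, MAX_GRIP_FORCE_N),
            ("translate_cm", FOLD_OFFSET_CM),
            ("release",)]

print(fold_shirt({"collar_xy": (0.42, 0.17), "sleeve_twisted": False}))
```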

Around 2015, a paradigm shift began. Instead of explicit rule-making, researchers started employing digital simulations. They would construct virtual environments mirroring the robotic arms and the objects they were meant to manipulate. The program would then be rewarded for successful folds and penalized for failures. Through millions of iterative trials and errors, the AI learned to optimize its actions, mirroring the learning process that propelled AI to mastery in complex games.
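A toy sketch of that reward-and-penalty loop, with invented actions and rewards: a simulated environment scores each attempt, and over many episodes simple value estimates drift toward the action sequence that earns reward. Real systems use far richer simulators and learning algorithms.

```python
import random

ACTIONS = ["grip_collar", "grip_hem", "fold_left", "fold_right", "release"]
GOOD_SEQUENCE = ["grip_collar", "fold_left", "release"]  # what the sim rewards

q = {}                    # action-value estimates, keyed by (step, action)
alpha, epsilon = 0.1, 0.2

for episode in range(100_000):            # real systems run millions of trials
    for step, correct in enumerate(GOOD_SEQUENCE):
        if random.random() < epsilon:     # explore a random action
            action = random.choice(ACTIONS)
        else:                             # exploit the current best estimate
            action = max(ACTIONS, key=lambda a: q.get((step, a), 0.0))
        reward = 1.0 if action == correct else -1.0   # the simulator's verdict
        key = (step, action)
        q[key] = q.get(key, 0.0) + alpha * (reward - q.get(key, 0.0))

learned = [max(ACTIONS, key=lambda a: q.get((s, a), 0.0)) for s in range(3)]
print(learned)  # converges to GOOD_SEQUENCE
```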
The watershed moment, however, arrived with the public release of ChatGPT in 2022. This large language model, trained on an immense corpus of text, learned not through trial and error but by predicting the most probable next word in a sequence. This groundbreaking approach was soon adapted for robotics. Similar models, capable of processing not only text but also images and sensor data, began to predict the optimal next action for a robot, issuing dozens of motor commands per second. This conceptual leap, embracing AI models that ingest vast quantities of data, has proven effective across a spectrum of robotic applications, from conversational interfaces to intricate physical tasks. The shift was further bolstered by a willingness to deploy imperfect robots in real-world environments and let them learn from direct experience. Today, Silicon Valley’s roboticists are once again dreaming on a grand scale, fueled by this evolution in machine learning.
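The parallel to language models can be made concrete: discretize each motor command into a vocabulary of "action tokens" and have a model emit the next token many times per second. The stub below stands in for a learned model; the bin count, control rate, and seven-joint arm are all assumptions for illustration.

```python
import numpy as np

N_BINS = 256  # each continuous motor command is discretized into 256 "action tokens"

def detokenize(token, lo=-1.0, hi=1.0):
    """Map a discrete action token back to a continuous motor command."""
    return lo + (hi - lo) * token / (N_BINS - 1)

def policy_stub(image, joint_angles, instruction):
    """Stand-in for a learned model that predicts one action token per joint."""
    rng = np.random.default_rng(0)
    return rng.integers(0, N_BINS, size=len(joint_angles))

joints = np.zeros(7)                       # a hypothetical 7-joint arm
for tick in range(30):                     # roughly one second of control at 30 Hz
    tokens = policy_stub(image=None, joint_angles=joints,
                         instruction="fold the shirt")
    commands = np.array([detokenize(t) for t in tokens])
    joints += 0.01 * commands              # apply each command as a small step
print(joints)
```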
Jibo: The Early Social Robot
Introduced in 2014 by MIT robotics researcher Cynthia Breazeal, Jibo was a social robot designed for families. Lacking arms, legs, and a traditional face, it resembled a lamp and captured the public’s imagination, raising $3.7 million through a crowdfunding campaign with early preorders priced at $749. Jibo could introduce itself and entertain children with dances, but its capabilities were limited. The ultimate vision was for it to evolve into an embodied assistant managing schedules, emails, and storytelling. While it garnered a devoted user base, the company behind Jibo ceased operations in 2019.

In retrospect, Jibo’s primary limitation was its rudimentary language processing. It competed with established virtual assistants like Apple’s Siri and Amazon’s Alexa, which relied on heavily scripted interactions. These systems translated speech to text, analyzed user intent, and retrieved pre-approved responses. While these snippets could be charming, they were often repetitive and lacked genuine conversational fluidity, a significant drawback for a robot intended to be a social companion.
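In code, that scripted pipeline is essentially a lookup table, as in this invented miniature (the intents and replies are made up): every response is vetted ahead of time, which makes it safe but brittle and quickly repetitive.

```python
# Minimal sketch of a scripted voice-assistant pipeline: transcribed speech is
# matched against known intents, and the reply comes from a pre-approved list.
CANNED_RESPONSES = {
    "greeting": "Hi there! I'm happy to see you.",
    "weather":  "I'm sorry, I can't check the weather yet.",
}

def classify_intent(transcript: str) -> str:
    # Crude keyword matching stands in for the intent classifiers of the era.
    text = transcript.lower()
    if "weather" in text:
        return "weather"
    if any(w in text for w in ("hello", "hi", "hey")):
        return "greeting"
    return "fallback"

def respond(transcript: str) -> str:
    intent = classify_intent(transcript)
    return CANNED_RESPONSES.get(intent, "Sorry, I didn't catch that.")

print(respond("Hey Jibo, what's the weather like?"))
```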
The subsequent revolution in AI-driven language generation has dramatically improved conversational capabilities. Modern voice interfaces from leading AI providers are now remarkably engaging. However, this advancement introduces a new risk: while scripted interactions are inherently safe, AI-generated conversations can veer into unpredictable and even inappropriate territory, as evidenced by instances of AI toys discussing dangerous topics with children.
Dactyl: Simulating for Real-World Dexterity
By 2018, leading robotics labs were abandoning scripted rules in favor of trial-and-error learning. OpenAI’s Dactyl project exemplified this trend, focusing on training a robotic hand virtually. Using digital models of the hand and palm-sized cubes, Dactyl was tasked with manipulating the cubes, for example, to orient a specific letter or color upwards.

The challenge lay in bridging the gap between simulation and reality. A hand excelling in a virtual environment often faltered when transferred to the physical world due to subtle differences in color, friction, or the deformability of materials. To overcome this, the concept of "domain randomization" emerged. This involved creating millions of slightly varied simulated worlds, each with randomized parameters like friction, lighting, and color saturation. Exposure to this wide array of variations enabled the robot to better adapt to real-world conditions. This approach proved successful for Dactyl, and a year later, it was applied to the more complex task of solving Rubik’s Cubes, achieving a 60% success rate, though this dropped to 20% for particularly challenging scrambles.
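A minimal sketch of domain randomization, with invented parameter ranges: before each simulated episode, the physical and visual properties of the world are resampled, so a policy that succeeds must work across all of them rather than overfit to any single world.

```python
import random

def sample_sim_params(rng: random.Random) -> dict:
    """Draw one randomized variant of the simulated world."""
    return {
        "friction":         rng.uniform(0.5, 1.5),   # surface friction multiplier
        "cube_mass_kg":     rng.uniform(0.03, 0.09),
        "light_intensity":  rng.uniform(0.4, 1.6),
        "color_saturation": rng.uniform(0.6, 1.4),
        "motor_latency_ms": rng.uniform(0.0, 40.0),
    }

rng = random.Random(42)
for episode in range(1_000_000):     # millions of slightly different worlds
    params = sample_sim_params(rng)
    # run_episode(policy, params)  <- train the hand policy in this variant
    if episode < 3:
        print(params)
```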
Despite its success, the inherent limitations of simulation mean the technique plays a less dominant role today. OpenAI shuttered its robotics efforts in 2021 but has since re-established the division, reportedly with a focus on humanoid robots.
RT-2: Bridging Vision, Language, and Action
Around 2022, Google’s robotics team embarked on an ambitious data collection effort, spending 17 months filming people operating robot controllers to perform tasks ranging from picking up chips to opening jars, cataloging 700 distinct actions. The objective was to develop one of the first large-scale foundation models for robotics. Just as large language models break text into tokens, the approach involved converting input data into tokens an algorithm could process to generate an output. Google’s RT-1 model received the robot’s visual perception and joint positions, along with a text instruction, and translated them into motor commands. It successfully executed 97% of previously seen tasks and 76% of novel instructions.
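A rough sketch of that input format, under an assumed bin count and with a hash-based stand-in for a real text tokenizer (image tokens are omitted for brevity): joint angles and instruction words are flattened into a single token sequence for the policy to consume.

```python
import numpy as np

VOCAB_BINS = 256

def tokenize_state(joint_positions, instruction_words):
    # Discretize each joint angle (assumed to lie in [-pi, pi]) into 256 bins.
    joint_tokens = np.clip(
        ((joint_positions + np.pi) / (2 * np.pi) * VOCAB_BINS).astype(int),
        0, VOCAB_BINS - 1)
    # A real model uses a learned text tokenizer; hashing is a crude stand-in.
    text_tokens = np.array([hash(w) % VOCAB_BINS for w in instruction_words])
    return np.concatenate([joint_tokens, text_tokens])

tokens = tokenize_state(np.zeros(7), "pick up the chips".split())
print(tokens)  # one flat token sequence for the policy network to consume
```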

The subsequent iteration, RT-2, further advanced this concept by training on general internet images, mirroring the development of vision-language models. This allowed the robot to interpret object locations within a scene. Kanishka Rao, a roboticist at Google DeepMind who led both iterations, noted that this broadened training unlocked new capabilities, enabling commands like "Put the Coke can near the picture of Taylor Swift." In 2025, Google DeepMind further integrated large language models with robotics by releasing a Gemini Robotics model, enhancing its ability to comprehend natural language commands.
RFM-1: The Robotic Coworker
In 2017, a team of OpenAI engineers spun out a project that would become Covariant, focusing on pragmatic robotic arms for warehouse operations rather than sci-fi humanoids. Building a foundation model system akin to Google’s, Covariant deployed its platform in warehouses, treating it as a data collection pipeline. By 2024, Covariant released RFM-1, a robotics model that could interact with users like a coworker. For instance, after being shown multiple sleeves of tennis balls, it could be instructed to move each sleeve to a designated area. The robot could then respond, perhaps indicating difficulty gripping an item and seeking advice on the appropriate suction cups to use.
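The interaction pattern is easy to sketch: a single model can answer a prompt either with an action or with language, including flagging a failed grip and asking for advice. The stub below is invented for illustration and is not Covariant’s RFM-1 interface.

```python
def rfm_step(observation: dict, instruction: str) -> dict:
    """Stand-in for a multimodal model that can either act or talk back.
    A real model would condition on the instruction and camera images."""
    if observation.get("grip_failed"):
        return {"type": "speech",
                "text": "I'm having trouble gripping this sleeve. "
                        "Which suction cups should I use?"}
    return {"type": "action",
            "command": ("move", observation["target"], "staging_area")}

print(rfm_step({"grip_failed": True},
               "Move each sleeve of tennis balls to the staging area"))
print(rfm_step({"grip_failed": False, "target": "sleeve_3"},
               "Move each sleeve of tennis balls to the staging area"))
```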
While such interactions had been demonstrated experimentally, Covariant achieved this at a significant scale. The company integrated cameras and data collection systems across customer sites, continuously feeding more data into the model for training. Despite impressive progress, limitations persisted. In a 2024 demonstration involving kitchen items, the robot struggled when asked to "return the banana" to its original location, cycling through several incorrect items before finally succeeding. Cofounder Peter Chen acknowledged that the robot "doesn’t understand the new concept" of retracing steps, highlighting its dependence on robust training data. Chen and fellow founder Pieter Abbeel were subsequently hired by Amazon, which is now licensing Covariant’s robotics model, a move particularly significant given Amazon’s extensive network of warehouses.

Digit: Humanoids in Real-World Deployment
The surge in robotics investment is largely directed towards humanoid robots, designed to seamlessly integrate into existing human workplaces without requiring extensive retooling of infrastructure. However, the practical implementation of humanoids in real-world settings, such as warehouses, remains challenging, often confined to test zones and pilot programs.
Agility Robotics’ humanoid, Digit, stands out as an early example of a humanoid robot delivering tangible cost savings rather than novelty. Its functional design, characterized by exposed joints and a head that makes no attempt to look human, has been adopted by companies like Amazon, Toyota, and GXO for tasks such as moving and stacking shipping totes. While Digit represents a significant step forward, it is still far from the idealized humanlike helpers envisioned by Silicon Valley. Its lifting capacity is limited to 35 pounds, and adding strength would mean heavier batteries and more frequent recharging. Furthermore, regulatory bodies emphasize the need for stricter safety standards for mobile humanoids operating in proximity to people.
Digit’s development underscores that the current revolution in robot training is not monolithic. Agility Robotics utilizes simulation techniques similar to those employed by OpenAI and collaborates with Google’s Gemini models to enhance its robots’ environmental adaptability. This convergence of diverse learning methodologies, honed over more than a decade of experimentation, is now enabling the industry to build at an unprecedented scale.

