ShengShu bets on one model for robot minds

Advances in embodied AI, systems that can interact with the real world, could reshape manufacturing, logistics, home robotics and industrial automation.

May 18, 2026

Generated image illustrating the global race in physical AI, or ‘embodied AI’, to power robots, especially humanoid robots that mimic humans in real-world settings.

ShengShu Technology has unveiled Motubrain, a “World Action Model” it says can replace the patchwork of task-specific systems that still define much of robotics today. The Singapore-based company claims the model ranks among the best on two embodied AI benchmarks, WorldArena and RoboTwin 2.0.

The pitch is straightforward: instead of building robots from separate modules for perception, planning and control, ShengShu wants a single architecture that can learn from video, language and action together. That is a familiar ambition in artificial intelligence, but in robotics it remains technically hard, because machines must cope with changing environments, imperfect data and the physical limits of hardware.

The global AI race is now moving beyond chatbots and image generators toward embodied systems that can interact with the real world, a shift that could reshape manufacturing, logistics, home robotics and industrial automation. For companies and investors, the prize is not merely better models, but software that can generalise across robot types and task settings without being retrained from scratch.

“A true world model must be able to build a unified representation of the real world and predict how it evolves,” said Jun Zhu, ShengShu’s founder. “Video is a critical foundation of that intelligence because it naturally captures time, space, motion, causality, and physical dynamics at scale.”

ShengShu says Motubrain is built on a unified multimodal model and a three-stream mixture-of-transformers design that ties video, action and language into one loop. The company says the model can handle multi-step tasks involving up to 10 atomic actions, well above the two or three actions typical of many robotic systems. It also says the system learned from unlabelled video, task recordings and data from different robot embodiments, reducing dependence on manual annotation.

The company adds that Motubrain is already being used by robotics firms in training programmes spanning industrial, commercial and domestic settings. It has also partnered with Astribot, SimpleAI and Anyverse Dynamics to push the model toward broader deployment. ShengShu, backed by a $293 million Series B led by Alibaba Cloud, is trying to turn its video-model expertise into a foothold in physical AI.

Mobile Robotics Insider
