The ability of ML algorithms to capture, predict, and synthesize human behavior is extremely useful in many domains, including virtual reality, self-driving cars, game character animation, and human-robot interaction.
The main drawback of these learning algorithms is that they must be trained on large datasets, which can only be collected with expensive equipment such as MoCap systems, structured-light cameras, or 3D scanners.
SUMMON: scene synthesis from human motion
To address this problem, researchers from Stanford University and Toyota Research Institute propose SUMMON, a flexible framework that synthesizes diverse scenes from human motion trajectories alone. SUMMON's adaptability also makes it possible to create multiple plausible scenes from a single motion sequence.
How it works
Human motion reveals a great deal about the environment a person is in and how they interact with it. From the positions and movements of the body's mesh vertices, the algorithm infers which parts of the body are in contact with the environment, and uses this to generate suitable objects and place them in a 3D scene.
For example, a sitting pose suggests the presence of a chair, and the position of the legs suggests where that chair is located. Consequently, the system generates a chair or a sofa when someone sits down, as in the picture below.

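To make the core idea concrete, here is a minimal, hypothetical sketch (not the authors' code) of the kind of geometric reasoning this enables: given body mesh vertices over time and per-vertex contact probabilities, a contacted object can be anchored at the centroid of the contact points and oriented toward the body. The function name, array shapes, and placement heuristic below are illustrative assumptions; SUMMON itself uses a learned contact-prediction model and fits object meshes from a 3D asset dataset to the predicted contacts.

```python
# Illustrative sketch only -- not the SUMMON implementation.
# Given (T, V, 3) body vertices over T frames and (T, V) per-vertex contact
# probabilities, estimate where to place a contacted object and how to face it.
import numpy as np


def estimate_object_placement(vertices, contact_probs, threshold=0.5):
    """Return a rough (position, yaw) for an object the body is in contact with.

    vertices:      (T, V, 3) body mesh vertices across the motion sequence
    contact_probs: (T, V) contact probability per vertex (assumed given here;
                   SUMMON predicts these with a learned model)
    """
    mask = contact_probs > threshold          # (T, V) boolean contact mask
    contact_points = vertices[mask]           # (N, 3) all contacted vertices
    if contact_points.size == 0:
        raise ValueError("no contact detected; nothing to place")

    # Anchor the object at the centroid of the contact region.
    position = contact_points.mean(axis=0)

    # Face the object toward the body: yaw from the horizontal direction
    # between the contact centroid and the mean body position (z is "up").
    body_center = vertices.mean(axis=(0, 1))
    direction = body_center - position
    yaw = np.arctan2(direction[1], direction[0])
    return position, yaw


# Toy usage with random data standing in for a real motion sequence.
rng = np.random.default_rng(0)
verts = rng.normal(size=(120, 512, 3))    # 120 frames, 512 vertices (toy numbers)
contacts = rng.random(size=(120, 512))    # stand-in contact probabilities
pos, yaw = estimate_object_placement(verts, contacts)
print(f"place object at {pos.round(2)}, facing yaw {yaw:.2f} rad")
```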
Conclusion and future research
The researchers claim that their framework has the potential to “generate extensive human-scene interaction data for the community”.
As the scenes created by SUMMON are stationary, future developments may incorporate dynamic scenes with movement and rearrangement of furniture during human-scene interaction.
Learn more:
- Research paper: “Scene Synthesis from Human Motion” (on arXiv)
- Code (on GitHub)
- Project homepage