Motion-Agent: A Conversational Framework for Human Motion Generation with LLMs

Qi Wu^*, Yubo Zhao^*, Yifan Wang, Xinhang Liu, Yu-Wing Tai, Chi-Keung Tang


While previous approaches to 3D human motion generation have achieved notable success, they often rely on extensive training and are limited to specific tasks. To address these challenges, we introduce Motion-Agent, an efficient conversational framework designed for general human motion generation, editing, and understanding. Motion-Agent employs an open-source pre-trained language model to develop a generative agent, MotionLLM, that bridges the gap between motion and text. This is accomplished by encoding and quantizing motions into discrete tokens that align with the language model's vocabulary. With only 1-3% of the model's parameters fine-tuned using adapters, MotionLLM delivers performance on par with diffusion models and other transformer-based methods trained from scratch. By integrating MotionLLM with GPT-4 without additional training, Motion-Agent is able to generate highly complex motion sequences through multi-turn conversations, a capability that previous models have struggled to achieve. Motion-Agent supports a wide range of motion-language tasks, offering versatile capabilities for generating and customizing human motion through interactive conversational exchanges.
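To make the tokenization idea in the abstract concrete, here is a minimal sketch of nearest-neighbour vector quantization followed by a mapping into an extended token vocabulary. It is an illustration under stated assumptions, not the MotionLLM implementation: the toy 2-D codebook, the function names, the vocabulary size of 32000, and the choice to place motion tokens directly after the text vocabulary are all hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest_code(feature: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to `feature` (squared L2)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(feature, codebook[i]))


def motion_to_token_ids(
    features: Sequence[Vector], codebook: Sequence[Vector], text_vocab_size: int
) -> List[int]:
    """Quantize each encoded motion feature and shift its code index past the
    text vocabulary, so motion tokens and text tokens share one id space.
    (The offset scheme is an assumption made for this sketch.)"""
    return [text_vocab_size + nearest_code(f, codebook) for f in features]


def token_ids_to_codes(token_ids: Sequence[int], text_vocab_size: int) -> List[int]:
    """Recover codebook indices from a mixed token sequence, skipping text tokens."""
    return [t - text_vocab_size for t in token_ids if t >= text_vocab_size]


# Toy example: a 4-entry codebook and four already-encoded motion features.
TEXT_VOCAB_SIZE = 32000  # hypothetical size of the language model's text vocabulary
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
features = [(0.1, -0.1), (0.9, 0.2), (0.8, 0.9), (0.2, 0.7)]

ids = motion_to_token_ids(features, codebook, TEXT_VOCAB_SIZE)
print(ids)  # [32000, 32001, 32003, 32002]
```

Once a motion is a sequence of ids in the same space as text, a language model can read and emit it like any other token sequence, and a decoder paired with the codebook can map emitted ids back to motion features.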

Overview of Motion-Agent

Qualitative Results of Motion-Agent

Generating complex and long motions

"Generate a motion of a person performing a floor exercise in artistic gymnastics, and make it long."

"Generate another motion that a person is kicked down and then stands up to fight back by slapping and kicking."

Advanced reasoning ability

Comparison with other methods

Generate a motion where a golfer hits the ball, runs to the hole to check, and then celebrates by jumping and waving hands.

Motion-Agent

MotionGPT

MoMask

Smoothing Transitions

Direct Concatenation

Transitioned by Motion-Agent

More Conversation Examples

Qualitative results of MotionLLM

Text-to-motion

"A person performs a backflip."

"A person walks forward, turns around, and walks back the way he came."

"A person is doing rope skipping exercise in the park."

"A man is walking as if to be a zombie."

Comparison with SOTA

MotionLLM: "A man stands motionless and then takes one step backwards to the left."

MoMask: "A man stands motionless and then takes one step backwards to the left."

MotionLLM: "A person jumps and spins in the air 360 degrees counterclockwise."

MoMask: "A person jumps and spins in the air 360 degrees counterclockwise."

Motion-to-text

MotionLLM: "A person walks forward while holding arms out as if to be a zombie"

MotionLLM: "The person is walking on a balance beam"

MotionLLM: "A person walks backwards in zig-zag motion"

MotionLLM: "A person uses their left hand to open a bottle, drinks from it, then places the bottle back down"

* The captions above are generated by MotionLLM.

^* These authors contributed equally to this work.

This website is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Website source code based on the Nerfies project page. If you want to reuse their source code, please credit them appropriately.