MotionLLM: Multimodal Motion-Language Learning
with Large Language Models

GitHub arXiv

Abstract

Recent advancements in Multimodal Large Language Models (MM-LLMs) have demonstrated promising generalization and robustness across different modalities. While previous works have already achieved 3D human motion generation using various approaches, including language modeling, they mostly rely on specialized architectures and are restricted to single-human motion generation. Inspired by the success of MM-LLMs, we propose MotionLLM, a simple and general framework that achieves single-human motion generation, multi-human motion generation, and motion captioning by fine-tuning pre-trained LLMs. Specifically, we encode and quantize motions into discrete LLM-understandable tokens, which results in a unified vocabulary consisting of both motion and text tokens. With only 1-3% of the LLM parameters trained using adapters, our single-human motion generation achieves results comparable to those of diffusion models and other transformer-based models trained from scratch. Additionally, we show that our approach is scalable and flexible, allowing easy extension to multi-human motion generation through autoregressive generation of single-human motions.
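To make the idea of a unified motion-and-text vocabulary concrete, here is a minimal sketch rather than the actual MotionLLM implementation. It assumes each motion frame has already been mapped to a feature vector by some encoder, quantizes each vector to its nearest codebook entry, and offsets the resulting indices past the text vocabulary so that motion and text tokens share one ID space. The vocabulary size, the toy codebook, and all function names are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence

# Hypothetical size of the LLM's original text vocabulary (an assumption).
TEXT_VOCAB_SIZE = 32_000


def nearest_code(frame: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to `frame` (squared L2 distance)."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(frame, codebook[i]))


def motion_to_token_ids(
    frames: Sequence[Sequence[float]], codebook: Sequence[Sequence[float]]
) -> List[int]:
    """Quantize each frame feature and shift its code index past the text vocabulary."""
    return [TEXT_VOCAB_SIZE + nearest_code(f, codebook) for f in frames]


def token_ids_to_codes(token_ids: Sequence[int]) -> List[int]:
    """Recover codebook indices from a mixed sequence, skipping text tokens."""
    return [t - TEXT_VOCAB_SIZE for t in token_ids if t >= TEXT_VOCAB_SIZE]


# Toy codebook with three 2-D entries; a learned codebook would be far larger.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
# Stand-ins for encoder outputs, one vector per motion frame.
frames = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8]]

motion_ids = motion_to_token_ids(frames, codebook)
print(motion_ids)  # [32000, 32001, 32002]
```

Under this layout, any ID below `TEXT_VOCAB_SIZE` is a text token and any ID at or above it is a motion token, so a single autoregressive model can read and emit both kinds in one sequence.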

Single-Human Motion Generation

a person performs a backflip

a man is walking as if to be a zombie

a person is doing rope skipping exercise in the park

a person walks forward, turns around, and walks back the way he came

Single-Human Motion Generation Comparison

Ours: A man kneels down and proposes marriage

MoMask: A man kneels down and proposes marriage

MotionGPT: A man kneels down and proposes marriage

T2M-GPT: A man kneels down and proposes marriage

More Comparisons Between MotionLLM and MoMask

Ours: A man stands motionless and then takes one step backwards to the left

MoMask: A man stands motionless and then takes one step backwards to the left

Ours: A person jumps and spins in the air 360 degrees counterclockwise

MoMask: A person jumps and spins in the air 360 degrees counterclockwise

Multi-Human Motion Generation

two people execute kicks to each other while standing

one person prepares to strike while the other prepares to block the attack

one lifts the right arm to greet the other with a wave

both people engage in a fencing bout, exchanging swift sword blows.

Motion Captioning

The following captions were generated by MotionLLM.

a person walks forward while holding arms out as if to be a zombie

the person is walking on a balance beam

a person walks backwards in zig-zag motion

a person uses their left hand to open a bottle, drinks from it, then places the bottle back down

Multilingual Text-to-Motion

The following prompts are translated using DeepL.

English: A person first turns left, and then goes forward

German: Eine Person wendet sich zuerst nach links und geht dann vorwärts

Chinese: 一个人先向左转,然后向前走

French: Une personne tourne d'abord à gauche, puis avance