We introduce HOIGPT, a token-based generative method that unifies 3D hand-object interaction (HOI) perception and generation, offering the first comprehensive solution for captioning and generating high-quality 3D HOI sequences from a diverse range of conditional signals (e.g., text, objects, partial sequences).
At its core, HOIGPT utilizes a large language model to predict the bidirectional transformation between HOI sequences and natural language descriptions. Given text inputs, HOIGPT generates a sequence of hand and object meshes; given (partial) HOI sequences, HOIGPT generates text descriptions and completes the sequences.
Hand-object decomposed VQ-VAE for discretizing HOI sequences
Language model trained to process and generate both text and HOI tokens
Discretizes hand and object motion sequences into tokens, giving a compact yet expressive HOI representation.
Uses separate codebooks for the hand and object components, which improves disentanglement and representation quality.
Introduces physically grounded geometric losses to encourage realistic hand-object spatial relationships and plausible interactions.
Trains the language model in stages, progressing from shorter, simpler HOI sequences to longer, more complex ones for robust modeling.
Employs a large language model trained to process and generate both text and HOI tokens for unified, bidirectional generation.
Given text, HOIGPT synthesizes 3D HOI sequences; given (partial) HOI sequences, it generates text descriptions and completes the sequences.
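To make the hand-object decomposed tokenization concrete, the following is a minimal sketch of nearest-neighbor vector quantization with two separate codebooks. It is not the HOIGPT implementation: the per-frame feature vectors stand in for the output of a learned VQ-VAE encoder, the toy codebooks are hand-written rather than learned, and the token names (`<hand_i>`, `<obj_j>`) and the frame-by-frame interleaving are assumptions made for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest_code(feature: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to `feature` (squared L2)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(feature, codebook[i]))


def tokenize_hoi(
    hand_feats: Sequence[Vector],
    obj_feats: Sequence[Vector],
    hand_codebook: Sequence[Vector],
    obj_codebook: Sequence[Vector],
) -> List[str]:
    """Quantize hand and object features with separate codebooks.

    Assumption: tokens are interleaved per frame (hand, then object) so a
    language model can consume them as one sequence. The real ordering and
    token format used by HOIGPT may differ.
    """
    tokens: List[str] = []
    for hand, obj in zip(hand_feats, obj_feats):
        tokens.append(f"<hand_{nearest_code(hand, hand_codebook)}>")
        tokens.append(f"<obj_{nearest_code(obj, obj_codebook)}>")
    return tokens


# Toy 2-D codebooks; a trained VQ-VAE would learn these entries.
hand_codebook = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
obj_codebook = [(0.0, 1.0), (5.0, 5.0)]

# Hypothetical encoder outputs for three frames.
hand_feats = [(0.1, -0.1), (0.9, 1.2), (1.8, 0.1)]
obj_feats = [(0.2, 0.8), (4.0, 4.5), (0.0, 1.1)]

tokens = tokenize_hoi(hand_feats, obj_feats, hand_codebook, obj_codebook)
print(" ".join(tokens))
# <hand_0> <obj_0> <hand_1> <obj_1> <hand_2> <obj_0>
```

Because each stream has its own codebook, the hand indices and object indices never compete for the same entries, which is one plausible reading of why separate codebooks help disentanglement. Once motion is expressed as discrete tokens like these, a language model can treat them as additional vocabulary alongside text.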
HOIGPT sets a new state of the art in both text generation and HOI generation across multiple tasks and benchmarks, demonstrating the effectiveness of our unified approach.
Watch our demo video to see HOIGPT in action, showcasing bidirectional hand-object interaction generation and captioning capabilities.
@inproceedings{huang_etal_cvpr25,
author = {Mingzhen Huang and Fu-Jen Chu and Bugra Tekin and Kevin J Liang and
Haoyu Ma and Weiyao Wang and Xingyu Chen and Pierre Gleize and
Hongfei Xue and Siwei Lyu and Kris Kitani and Matt Feiszli and Hao Tang},
title = {HOIGPT: Learning Long Sequence Hand-Object Interaction with Language Models},
booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
address = {Nashville, USA},
year = {2025}
}