HOIGPT: Learning Long Sequence Hand-Object Interaction with Language Models

¹FAIR, Meta   ²SUNY at Buffalo

Abstract

We introduce HOIGPT, a token-based generative method that unifies 3D hand-object interaction (HOI) perception and generation, offering the first comprehensive solution for captioning and generating high-quality 3D HOI sequences from a diverse range of conditional signals (e.g., text, objects, partial sequences). At its core, HOIGPT uses a large language model to predict the bidirectional transformation between HOI sequences and natural language descriptions. Given text inputs, HOIGPT generates a sequence of hand and object meshes; given (partial) HOI sequences, HOIGPT generates text descriptions and completes the sequences. To facilitate HOI understanding with a large language model, this paper introduces two key innovations: (1) a novel physically grounded HOI tokenizer, the hand-object decomposed VQ-VAE, for discretizing HOI sequences, and (2) a motion-aware language model trained to process and generate both text and HOI tokens. Extensive experiments demonstrate that HOIGPT sets new state-of-the-art performance on both text generation (+2.01% R Precision) and HOI generation (−2.56 FID) across multiple tasks and benchmarks.

Method Overview

HOIGPT frames hand-object interaction modeling as a unified token-based generative task. The pipeline includes several key innovations:

  • Hand-Object Decomposed VQ-VAE Tokenizer: Discretizes hand and object motion sequences into tokens for a compact yet expressive HOI representation (see the tokenizer sketch after this list).
  • Dual Codebook Design: Uses separate codebooks for hand and object motion, improving disentanglement and representation quality.
  • Geometric Loss: Adds physically grounded geometric losses that encourage realistic hand-object spatial relationships and plausible contact (see the loss sketch below).
  • Incremental Learning Strategy: Trains the language model in stages, progressing from shorter, simpler HOI sequences to longer, more complex ones for robust long-sequence modeling.
  • Motion-Aware Language Model: Trains a large language model to process and generate both text and HOI tokens in a single vocabulary, enabling unified, bidirectional generation (see the token-interleaving sketch below).
  • Flexible Conditional Generation: Given text, HOIGPT synthesizes 3D HOI sequences; given (partial) HOI sequences, it generates text descriptions and completes the motion.
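
The following is a minimal sketch of a hand-object decomposed VQ-VAE tokenizer with dual codebooks, illustrating how hand and object motion streams can be encoded and quantized separately. All module names, feature dimensions, and codebook sizes are illustrative assumptions, not the released HOIGPT configuration.

```python
# Sketch only: decomposed HOI tokenizer with separate hand/object codebooks.
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Nearest-neighbour vector quantization with a straight-through estimator."""

    def __init__(self, num_codes: int, code_dim: int, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta

    def forward(self, z):                               # z: (B, T, D)
        flat = z.reshape(-1, z.shape[-1])                # (B*T, D)
        dist = torch.cdist(flat, self.codebook.weight)   # (B*T, K)
        idx = dist.argmin(dim=-1)                        # discrete token ids
        z_q = self.codebook(idx).view_as(z)
        # Commitment + codebook terms of the standard VQ-VAE objective.
        loss = self.beta * ((z_q.detach() - z) ** 2).mean() + ((z_q - z.detach()) ** 2).mean()
        z_q = z + (z_q - z).detach()                     # straight-through gradient
        return z_q, idx.view(z.shape[:-1]), loss


class HOITokenizer(nn.Module):
    """Encodes hand and object motion separately; each stream has its own codebook."""

    def __init__(self, hand_dim=99, obj_dim=10, latent_dim=256, num_codes=512):
        super().__init__()
        self.hand_enc = nn.Sequential(nn.Linear(hand_dim, latent_dim), nn.ReLU(),
                                      nn.Linear(latent_dim, latent_dim))
        self.obj_enc = nn.Sequential(nn.Linear(obj_dim, latent_dim), nn.ReLU(),
                                     nn.Linear(latent_dim, latent_dim))
        self.hand_vq = VectorQuantizer(num_codes, latent_dim)   # hand codebook
        self.obj_vq = VectorQuantizer(num_codes, latent_dim)    # object codebook
        self.hand_dec = nn.Linear(latent_dim, hand_dim)
        self.obj_dec = nn.Linear(latent_dim, obj_dim)

    def forward(self, hand_seq, obj_seq):    # (B, T, hand_dim), (B, T, obj_dim)
        zh, hand_tokens, loss_h = self.hand_vq(self.hand_enc(hand_seq))
        zo, obj_tokens, loss_o = self.obj_vq(self.obj_enc(obj_seq))
        recon = ((self.hand_dec(zh) - hand_seq) ** 2).mean() \
              + ((self.obj_dec(zo) - obj_seq) ** 2).mean()
        return hand_tokens, obj_tokens, recon + loss_h + loss_o
```

Keeping the two codebooks separate means the downstream language model receives distinct hand and object token streams, which supports the disentanglement benefit noted in the list above.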
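
Below is an illustrative sketch of physically grounded geometric losses that encourage contact and penalize interpenetration. The signed distance function `obj_sdf`, the contact labels, and the threshold are assumptions made for exposition, not the exact losses used in the paper.

```python
# Sketch only: contact and penetration terms for plausible hand-object geometry.
import torch


def geometric_losses(hand_verts, contact_mask, obj_sdf, contact_thresh=5e-3):
    """hand_verts: (B, T, V, 3) hand mesh vertices in the object frame.
    contact_mask: (B, T, V) soft labels marking vertices expected to touch.
    obj_sdf: callable mapping (..., 3) points to signed distances
             (negative inside the object) -- assumed to be available."""
    sdf = obj_sdf(hand_verts)                          # (B, T, V)

    # Penetration: any vertex inside the object (sdf < 0) is pushed back out.
    penetration_loss = torch.relu(-sdf).mean()

    # Contact: vertices labeled as in-contact should lie within a small
    # band around the object surface.
    contact_loss = (contact_mask * torch.relu(sdf - contact_thresh)).mean()

    return penetration_loss + contact_loss
```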
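
The token-interleaving sketch below shows one way HOI tokens can be folded into a language model's vocabulary so a single transformer handles both text and motion. The special token names, codebook sizes, and the `gpt2` placeholder backbone are assumptions for illustration, not the exact HOIGPT setup.

```python
# Sketch only: extending an LM vocabulary with hand/object motion tokens.
from transformers import AutoTokenizer, AutoModelForCausalLM

NUM_HAND_CODES = 512   # assumed hand codebook size
NUM_OBJ_CODES = 512    # assumed object codebook size

tokenizer = AutoTokenizer.from_pretrained("gpt2")        # placeholder backbone
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Register one new token per codebook entry, plus boundary markers.
hoi_tokens = ([f"<hand_{i}>" for i in range(NUM_HAND_CODES)]
              + [f"<obj_{i}>" for i in range(NUM_OBJ_CODES)]
              + ["<hoi_start>", "<hoi_end>"])
tokenizer.add_tokens(hoi_tokens)
model.resize_token_embeddings(len(tokenizer))


def build_training_text(caption, hand_ids, obj_ids):
    """Interleave per-frame hand/object code indices into a single string
    suitable for next-token-prediction training."""
    body = " ".join(f"<hand_{h}> <obj_{o}>" for h, o in zip(hand_ids, obj_ids))
    return f"{caption} <hoi_start> {body} <hoi_end>"


example = build_training_text("pick up the mug",
                              hand_ids=[3, 17, 17], obj_ids=[42, 42, 8])
input_ids = tokenizer(example, return_tensors="pt").input_ids
```

Because text and HOI tokens share one vocabulary, the same model can run in both directions: condition on a caption and decode motion tokens, or condition on (partial) motion tokens and decode a caption or a completion.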
Figure: Overview of HOIGPT. Given conditional signals (e.g., text, object, or partial sequences), HOIGPT uses a hand-object decomposed VQ-VAE tokenizer, dual codebooks, geometric loss, and a motion-aware language model for unified hand-object interaction generation and captioning.


BibTeX

@inproceedings{huang_etal_cvpr25,
  author = {Mingzhen Huang and Fu-Jen Chu and Bugra Tekin and Kevin J Liang and Haoyu Ma and Weiyao Wang and Xingyu Chen and Pierre Gleize and Hongfei Xue and Siwei Lyu and Kris Kitani and Matt Feiszli and Hao Tang},
  title = {HOIGPT: Learning Long Sequence Hand-Object Interaction with Language Models},
  booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  address = {Nashville, USA},
  year = {2025}
}