HOIGPT: Learning Long Sequence Hand-Object Interaction with Language Models

Mingzhen Huang2
Fu-Jen Chu1
Bugra Tekin1
Kevin J Liang1
Haoyu Ma1
Weiyao Wang1
Xingyu Chen1
Pierre Gleize1
Hongfei Xue2
Siwei Lyu2
Kris Kitani1
Matt Feiszli1
Hao Tang1
1FAIR, Meta     2SUNY at Buffalo
CVPR 2025

Abstract

We introduce HOIGPT, a token-based generative method that unifies 3D hand-object interaction (HOI) perception and generation, offering the first comprehensive solution for captioning and generating high-quality 3D HOI sequences from a diverse range of conditional signals (e.g., text, objects, partial sequences).

At its core, HOIGPT utilizes a large language model to predict the bidirectional transformation between HOI sequences and natural language descriptions. Given text inputs, HOIGPT generates a sequence of hand and object meshes; given (partial) HOI sequences, HOIGPT generates text descriptions and completes the sequences.

🚀 Key Innovations

Novel HOI Tokenizer

Hand-object decomposed VQ-VAE for discretizing HOI sequences

Motion-Aware LLM

Trained to process and generate both text and HOI tokens

Method Overview

HOIGPT Method Overview
Figure: Overview of HOIGPT. Given conditional signals (e.g., text, object, or partial sequences), HOIGPT uses a hand-object decomposed VQ-VAE tokenizer, dual codebooks, geometric loss, and a motion-aware language model for unified hand-object interaction generation and captioning.

Hand-Object Decomposed VQ-VAE

Discretizes hand and object motion sequences into tokens, yielding a compact, expressive HOI representation with improved disentanglement.

Dual Codebook Design

Uses separate codebooks for the hand and object components, improving disentanglement and representation quality.
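The tokenizer's core operation can be sketched as nearest-neighbor vector quantization applied to the two decomposed streams, each with its own codebook. This is a minimal NumPy sketch; the encoder architecture, codebook sizes, and training losses are assumptions, not HOIGPT's actual configuration.

```python
import numpy as np

def quantize(features, codebook):
    """Nearest-neighbor vector quantization: map each frame feature
    to the index of its closest codebook entry (illustrative sketch)."""
    # features: (T, D), codebook: (K, D)
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)  # (T,) discrete token ids

rng = np.random.default_rng(0)
T, D, K = 8, 16, 32                      # frames, feature dim, codebook size (assumed)

# Dual codebooks: separate tables for the decomposed hand and object streams.
hand_codebook   = rng.normal(size=(K, D))
object_codebook = rng.normal(size=(K, D))

hand_feats   = rng.normal(size=(T, D))   # stand-in for hand-branch encoder output
object_feats = rng.normal(size=(T, D))   # stand-in for object-branch encoder output

hand_tokens   = quantize(hand_feats, hand_codebook)
object_tokens = quantize(object_feats, object_codebook)
print(hand_tokens.shape, object_tokens.shape)  # (8,) (8,)
```

Because the two streams never share codebook entries, a hand token can never be confused with an object token downstream, which is the disentanglement benefit the dual-codebook design targets.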

Geometric Loss

Introduces physically grounded geometric losses to encourage realistic hand-object spatial relationships and plausible interactions.
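The paper's exact loss terms are not reproduced here; a common physically grounded geometric penalty of this kind is an interpenetration term, sketched below with the object crudely approximated as a sphere. All names, shapes, and the sphere approximation are illustrative assumptions.

```python
import numpy as np

def penetration_loss(hand_verts, obj_center, obj_radius):
    """Hypothetical geometric penalty: penalize hand vertices that fall
    inside the object, a stand-in for a signed-distance-based
    interpenetration loss."""
    d = np.linalg.norm(hand_verts - obj_center, axis=-1)  # (V,) distances to center
    penetration = np.maximum(obj_radius - d, 0.0)         # depth inside the surface
    return (penetration ** 2).mean()

hand_verts = np.array([[0.0, 0.0, 0.0],    # inside the object
                       [2.0, 0.0, 0.0]])   # outside the object
loss = penetration_loss(hand_verts, obj_center=np.zeros(3), obj_radius=1.0)
print(loss)  # 0.5 : only the inside vertex contributes (1.0**2) / 2
```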

Incremental Learning

Trains the language model in stages, progressing from shorter, simpler HOI sequences to longer, more complex ones for robust long-sequence modeling.
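A staged curriculum of this kind can be sketched as filtering the training set by a growing sequence-length cap. The caps and toy data below are hypothetical; the paper's actual staging schedule is not specified here.

```python
# Toy HOI token sequences of varying length (illustrative data).
sequences = [[1, 2], [3, 4, 5, 6, 7, 8], [9, 10, 11], [12]]

def stages(seqs, length_caps=(2, 4, None)):
    """Yield the training subset admitted at each curriculum stage;
    a cap of None admits everything (assumed schedule)."""
    for cap in length_caps:
        yield [s for s in seqs if cap is None or len(s) <= cap]

for i, subset in enumerate(stages(sequences)):
    print(f"stage {i}: {len(subset)} sequences")
```

Each stage's model checkpoint would seed the next, so the model sees long sequences only after it has learned short ones.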

Motion-Aware Language Model

Employs a large language model trained to process and generate both text and HOI tokens for unified, bidirectional generation.

Flexible Conditional Generation

Given text, HOIGPT synthesizes 3D HOI sequences; given (partial) HOI sequences, it generates text descriptions and completes the sequences.
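One common way to let a single language model handle both directions is to map text tokens and discrete HOI tokens into one shared vocabulary, with special markers delimiting the motion span. The sketch below is illustrative only: the marker names and id offsets are assumptions, not HOIGPT's actual vocabulary layout.

```python
# Hypothetical unified vocabulary: text ids and HOI codebook ids share one
# sequence so a single LM can do text-to-motion and motion-to-text.
SPECIAL = {"<soh>": 0, "<eoh>": 1}   # start/end-of-HOI markers (assumed names)
TEXT_OFFSET = 2                      # text ids mapped just above the specials
HOI_OFFSET = 1000                    # HOI codebook ids mapped above the text ids

def to_unified(text_ids, hoi_ids):
    """Concatenate a caption and its HOI token span into one LM sequence."""
    return ([TEXT_OFFSET + t for t in text_ids]
            + [SPECIAL["<soh>"]]
            + [HOI_OFFSET + h for h in hoi_ids]
            + [SPECIAL["<eoh>"]])

seq = to_unified(text_ids=[5, 9], hoi_ids=[3, 7, 3])
print(seq)  # [7, 11, 0, 1003, 1007, 1003, 1]
```

Swapping the order of the two spans turns the same vocabulary into the captioning direction, which is what makes the generation bidirectional.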

Results & Performance

+2.01% R-Precision improvement (text generation)
-2.56 FID reduction (HOI generation)
State-of-the-art on multiple benchmarks
Unified bidirectional generation & captioning

HOIGPT sets new state-of-the-art performance on both text generation and HOI generation across multiple tasks and benchmarks, demonstrating the effectiveness of our unified approach.

Demo Video

Watch our demo video to see HOIGPT in action, showcasing bidirectional hand-object interaction generation and captioning capabilities.

Citation

@inproceedings{huang_etal_cvpr25,
  author = {Mingzhen Huang and Fu-Jen Chu and Bugra Tekin and Kevin J Liang and 
            Haoyu Ma and Weiyao Wang and Xingyu Chen and Pierre Gleize and 
            Hongfei Xue and Siwei Lyu and Kris Kitani and Matt Feiszli and Hao Tang},
  title = {HOIGPT: Learning Long Sequence Hand-Object Interaction with Language Models},
  booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  address = {Nashville, USA},
  year = {2025}
}