Introducing GAIA: The Next Generation in Zero-Shot Talking Avatar Creation

Zero-shot talking avatar generation synthesizes a realistic talking video from just a single portrait image and the corresponding speech. Traditional techniques often rely on domain-specific heuristics such as warping-based motion representations and 3D Morphable Models, but these priors restrict the naturalness and diversity of the generated avatars.

We present GAIA (Generative AI for Avatar), an innovative system designed to eliminate the need for domain priors in the avatar generation process.

GAIA builds on a key observation: while speech drives the avatar’s motion, the avatar’s appearance and background remain constant throughout the video. We’ve distilled the generation process into two key stages (a code sketch follows the list):

  1. Disentangling each frame into separate motion and appearance representations.
  2. Generating motion sequences conditioned on the speech and a reference portrait image.
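
Here’s a minimal PyTorch sketch of that two-stage idea. The tiny linear encoders, the latent dimensions, and the simple GRU motion generator are illustrative stand-ins chosen for brevity, not GAIA’s published architecture:

```python
# Minimal sketch of the two-stage pipeline; all module choices and sizes
# are illustrative assumptions, not GAIA's actual architecture.
import torch
import torch.nn as nn

class AppearanceMotionAutoencoder(nn.Module):
    """Stage 1: disentangle each frame into appearance and motion codes."""
    def __init__(self, app_dim=256, motion_dim=64):
        super().__init__()
        self.app_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, app_dim))
        self.motion_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, motion_dim))
        self.decoder = nn.Linear(app_dim + motion_dim, 3 * 64 * 64)

    def encode(self, frame):
        # Split a frame into a static appearance code and a per-frame motion code.
        return self.app_encoder(frame), self.motion_encoder(frame)

    def decode(self, appearance, motion):
        # Recombine the fixed appearance with a motion code to render one frame.
        out = self.decoder(torch.cat([appearance, motion], dim=-1))
        return out.view(-1, 3, 64, 64)

class SpeechToMotion(nn.Module):
    """Stage 2: predict a motion sequence from speech features, conditioned
    on the reference frame's motion code (a GRU stands in for the generator)."""
    def __init__(self, speech_dim=80, motion_dim=64, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(speech_dim + motion_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, motion_dim)

    def forward(self, speech_feats, ref_motion):
        T = speech_feats.size(1)
        ref = ref_motion.unsqueeze(1).expand(-1, T, -1)   # broadcast over time
        h, _ = self.rnn(torch.cat([speech_feats, ref], dim=-1))
        return self.head(h)

# Inference: one portrait image + speech -> a sequence of talking frames.
vae, s2m = AppearanceMotionAutoencoder(), SpeechToMotion()
portrait = torch.randn(1, 3, 64, 64)       # single reference image
speech = torch.randn(1, 100, 80)           # e.g. 100 frames of mel features
appearance, ref_motion = vae.encode(portrait)
motion_seq = s2m(speech, ref_motion)       # (1, 100, motion_dim)
frames = [vae.decode(appearance, motion_seq[:, t]) for t in range(100)]
```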

To optimize GAIA, we trained models of varying scales (up to 2B parameters) on a large-scale, high-quality talking avatar dataset. The results? GAIA outperforms previous models in:

  • Naturalness,
  • Diversity,
  • Lip-sync quality, and
  • Visual quality.

Additionally, GAIA is scalable (larger models yield better results) and versatile, supporting applications ranging from controllable talking avatar generation to text-instructed avatar generation.
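
To make those application modes concrete, here is a hypothetical sketch of what such a conditioning interface might look like; the function name, signature, and mode strings are assumptions for illustration, not a released GAIA API:

```python
# Hypothetical dispatcher mapping conditioning signals to the generation
# modes listed above. Everything here is illustrative, not GAIA's real API.
from typing import Optional, Sequence

def select_mode(speech: Optional[bytes] = None,
                driving_video: Optional[bytes] = None,
                pose: Optional[Sequence[float]] = None,
                instruction: Optional[str] = None) -> str:
    """Return the generation mode implied by the supplied driving signals."""
    if driving_video is not None:
        return "video-driven"           # mimic motion from another video
    if instruction is not None:
        return "text-instructed"        # e.g. "Sad", "Open your mouth"
    if speech is not None and pose is not None:
        return "pose-controllable"      # speech plus an explicit head pose
    if speech is not None:
        return "speech-driven"          # audio alone drives the avatar
    raise ValueError("at least one driving signal is required")

print(select_mode(speech=b"<audio bytes>", pose=[0.0, 0.1, 0.0]))  # pose-controllable
```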

Wondering how GAIA works? The framework diagram below summarizes the pipeline:

[Figure: GAIA Framework]

Official Website

Demo videos:

  • Speech-driven Talking Avatar Generation-1
  • Speech-driven Talking Avatar Generation-2
  • Speech-driven Talking Avatar Generation-3
  • Video-driven Talking Avatar Generation-1
  • Video-driven Talking Avatar Generation-2
  • Pose-controllable Talking Avatar Generation
  • Fully Controllable Talking Avatar Generation
  • Textual Instruction: Sad
  • Textual Instruction: Open your mouth
  • Textual Instruction: Surprise
