We introduce VividTalk, a two-stage generic framework for generating high-quality, expressive talking head videos. VividTalk produces accurate lip-sync, natural head pose, and high visual fidelity, raising the bar for audio-driven talking head generation.

In the first stage, the framework maps audio to mesh by learning two kinds of motion: non-rigid expression motion and rigid head motion. For non-rigid expression motion, both blendshape coefficients and vertex offsets are adopted as intermediate representations, which expands the model's representation ability. For natural head motion, we use a novel learnable head pose codebook trained with a two-phase mechanism.
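
The sketch below illustrates, in simplified PyTorch, how such an audio-to-mesh stage could be structured. The module names, feature dimensions, audio encoder, and codebook sizes are illustrative assumptions, not VividTalk's actual implementation.

```python
# Minimal sketch of an audio-to-mesh stage with dual expression heads and a
# head pose codebook. All architecture details are assumptions for illustration.
import torch
import torch.nn as nn


class AudioToMesh(nn.Module):
    """Maps audio features to non-rigid expression motion via two heads:
    coarse blendshape coefficients and fine per-vertex offsets."""

    def __init__(self, audio_dim=768, hidden=256, n_blendshapes=52, n_vertices=5023):
        super().__init__()
        self.encoder = nn.GRU(audio_dim, hidden, batch_first=True)   # assumed audio encoder
        self.blendshape_head = nn.Linear(hidden, n_blendshapes)      # coarse global expression
        self.vertex_head = nn.Linear(hidden, n_vertices * 3)         # fine per-vertex detail

    def forward(self, audio_feats):                                  # (B, T, audio_dim)
        h, _ = self.encoder(audio_feats)                             # (B, T, hidden)
        blendshapes = self.blendshape_head(h)                        # (B, T, n_blendshapes)
        vertex_offsets = self.vertex_head(h).view(*h.shape[:2], -1, 3)
        return blendshapes, vertex_offsets


class HeadPoseCodebook(nn.Module):
    """Discrete codebook for rigid head motion (VQ-style). In a two-phase
    scheme, phase 1 would train the codebook to reconstruct real pose
    sequences and phase 2 would learn to map audio to code indices."""

    def __init__(self, n_codes=256, code_dim=64):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, code_dim)

    def quantize(self, pose_feats):                                  # (B, T, code_dim)
        flat = pose_feats.reshape(-1, pose_feats.size(-1))
        dists = torch.cdist(flat, self.codebook.weight)              # distance to every code
        indices = dists.argmin(dim=-1).view(pose_feats.shape[:-1])   # (B, T)
        return self.codebook(indices), indices
```

Separating coarse blendshape coefficients from fine vertex offsets lets one head capture global expression while the other recovers detail such as precise lip shapes; the discrete codebook constrains head poses to plausible, data-like motion.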

In the second stage, a dual-branch motion-VAE and a generator transform the driven meshes into dense motion and synthesize high-quality video frame by frame.
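
A minimal sketch of how a dual-branch motion-VAE followed by warping-based frame generation could look; the branch split, layer sizes, and warping step are assumptions for illustration, not the paper's exact design.

```python
# Minimal sketch of a mesh-to-video stage: a dual-branch motion-VAE predicts a
# dense 2D motion field, which warps a reference frame before refinement.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualBranchMotionVAE(nn.Module):
    """Encodes projected mesh motion into a latent and decodes a dense motion
    field; here one branch handles the full face and the other the lip region
    before fusion (an assumption about the split)."""

    def __init__(self, mesh_dim=2048, latent_dim=128, flow_hw=(64, 64)):
        super().__init__()
        self.flow_hw = flow_hw
        self.face_branch = nn.Sequential(nn.Linear(mesh_dim, 256), nn.ReLU())
        self.lip_branch = nn.Sequential(nn.Linear(mesh_dim, 256), nn.ReLU())
        self.to_stats = nn.Linear(512, latent_dim * 2)
        self.decoder = nn.Linear(latent_dim, flow_hw[0] * flow_hw[1] * 2)

    def forward(self, mesh_feats):                                   # (B, mesh_dim)
        h = torch.cat([self.face_branch(mesh_feats), self.lip_branch(mesh_feats)], dim=-1)
        mu, logvar = self.to_stats(h).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()         # reparameterization trick
        flow = self.decoder(z).view(-1, 2, *self.flow_hw)            # dense 2D motion field
        return flow, mu, logvar


def warp_frame(reference, flow):
    """Warp the reference image with the predicted dense motion; a generator
    network (omitted here) would refine the warped result into the final frame."""
    b, _, h, w = reference.shape
    flow = F.interpolate(flow, size=(h, w), mode="bilinear", align_corners=False)
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=reference.device),
        torch.linspace(-1, 1, w, device=reference.device),
        indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
    grid = grid + flow.permute(0, 2, 3, 1)                           # displace sampling grid
    return F.grid_sample(reference, grid, align_corners=False)
```

Predicting a dense motion field and warping a reference frame, rather than synthesizing pixels from scratch, is a common way to preserve identity and texture while the generator only has to fill in occluded or refined regions.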

Extensive experiments show that VividTalk generates visually superior talking head videos with accurate lip-sync and realistic appearance, outperforming previous state-of-the-art works in both objective and subjective comparisons.

The code will be publicly released upon publication. A demo video is available here.


VividTalk supports animating facial images in various styles, including human, realistic, and cartoon.
With VividTalk, talking head videos can be generated from a wide range of audio signals.
Comparison between VividTalk and state-of-the-art methods in terms of lip-sync, head pose naturalness, identity preservation, and video quality.
