Models

At sync, we’re building foundational models to understand and manipulate humans in video. Our suite of lipsyncing models allows you to edit the lip movements of any speaker in any video to match a target audio. Explore and compare the capabilities of the different models below.

| Feature | lipsync-1.7.1 | lipsync-1.8.0 | lipsync-1.9.0-beta | lipsync-2 |
| --- | --- | --- | --- | --- |
| Description | Fast, legacy model, best suited for simple low-res avatar videos. | Slow, legacy model, suited for budget-constrained tasks. Use lipsync-1.9 & later for best results. | Our fastest lipsyncing model. Standard, general-purpose, accurate lipsync. | Our most natural lipsyncing model yet. The first model that can preserve the unique speaking style of every speaker. Best across all kinds of video content. |
| Price / min @ 25 fps | $1 – $0.8 | $1 – $0.8 | $1.2 – $1.5 | $2.4 – $3 |
| Style | Standard generic lip movements | Standard generic lip movements | Standard generic lip movements | Lip movements in the unique style of the speaker |
| Face Resolution | 256×256 | 512×512 | 512×512 | 512×512 |
| Best for | Legacy model, might work for low-quality videos | Legacy model, use 1.9.0 and above | Simpler avatar-style use cases | Best across all kinds of videos, outperforms every other lipsync model across all key attributes. |

Each model is also rated on Accuracy, Speed, Identity Preservation, Teeth, Face Detection, Face Blending, Pose Robustness, and Beard handling.

All models are available in both Studio and the API.
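To pick a model through the API, you pass its name with the generation request. The sketch below is a minimal illustration only: the endpoint path, header name, and request fields are assumptions, so check the API reference for the exact schema.

```python
# Minimal sketch of selecting a model via the REST API.
# The endpoint path, header name, and request fields are assumptions
# for illustration only -- consult the API reference for the exact schema.
import os
import requests

API_KEY = os.environ["SYNC_API_KEY"]          # hypothetical env var for your key
BASE_URL = "https://api.sync.so/v2/generate"  # assumed endpoint

payload = {
    "model": "lipsync-2",  # any model name from the table above
    "input": [
        {"type": "video", "url": "https://example.com/input.mp4"},
        {"type": "audio", "url": "https://example.com/target.wav"},
    ],
}

response = requests.post(
    BASE_URL,
    headers={"x-api-key": API_KEY, "Content-Type": "application/json"},
    json=payload,
    timeout=30,
)
response.raise_for_status()
print(response.json())  # typically a job you can poll until the result is ready
```

Switching models is then just a matter of changing the "model" string to another name from the table.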

The character in the input video needs to look like they are talking. Our models learn to mimic the speaking style in the input video, so if the character is completely static, the generated lips may barely move either.

Our latest model (lipsync-2) is your best bet for lipsyncing to songs. For best results, isolate and upload the vocals track, since the instrumental sounds can sometimes interfere with lipsync quality.
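One way to isolate the vocals is a source-separation tool such as Demucs; this is an external tool rather than part of our API, and the paths below are just one possible setup.

```python
# Sketch: split a song into vocals + accompaniment with Demucs, then upload
# only the vocals stem as the target audio. Assumes the demucs CLI is
# installed (e.g. `pip install demucs`); file names are placeholders.
import subprocess
from pathlib import Path

song = Path("song.mp3")

# --two-stems=vocals writes vocals.wav and no_vocals.wav under ./separated/
subprocess.run(
    ["demucs", "--two-stems=vocals", "-o", "separated", str(song)],
    check=True,
)

vocals = next(Path("separated").rglob("vocals.wav"))
print(f"Use {vocals} as the target audio for the lipsync request.")
```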

You can lipsync human-like faces, but our models don’t currently support animals or non-humanoid characters.

If some segments of the output have poor lipsync quality, check whether they contain:

  • Multiple speakers in the frame
  • Faces that are too small or shown in profile
  • Stretches where the speaker in the input video is not speaking

If multiple people are in the frame, try masking or cropping out the extra faces with external tools, as in the sketch below. We're working to resolve these issues in future model releases.
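As one concrete option, ffmpeg's crop filter can trim the frame so only the intended speaker remains; the file names and coordinates below are placeholders for your own footage.

```python
# Sketch: crop a 1920x1080 video down to the left half so only one face
# remains in frame. ffmpeg's crop filter takes width:height:x:y; the values
# here are placeholders -- adjust them to your own footage.
import subprocess

subprocess.run(
    [
        "ffmpeg",
        "-i", "two_speakers.mp4",
        "-vf", "crop=960:1080:0:0",   # keep the left 960x1080 region
        "-c:a", "copy",               # leave the audio track untouched
        "single_speaker.mp4",
    ],
    check=True,
)
```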

Faces in extreme profile view can lead to sub-par results. Try our latest model (lipsync-2), which has improved pose robustness and is your best option for challenging angles.

Our latest model generates faces at 512×512 resolution, which is usually sufficient for most 1080p videos. If the face in your input video fills a large part of the frame, the generated face has to be upscaled past 512 pixels to fit back into the shot, so it may look slightly softer than the surrounding footage. For the highest possible quality, lipsync-2 offers the best resolution handling among our models.
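A quick back-of-the-envelope check can tell you whether the face region in your footage exceeds that 512-pixel generation size; the 60% face-height fraction below is a hypothetical example, not a measured value.

```python
# Back-of-the-envelope check: does the face region exceed the 512x512 size
# at which the face is generated? The face-height fraction is a hypothetical
# example -- measure it from your own footage.
GENERATED_FACE_SIZE = 512          # pixels, per the comparison table above
video_height = 1080                # e.g. a 1080p input video
face_height_fraction = 0.6         # hypothetical: face spans ~60% of frame height

face_pixels = video_height * face_height_fraction
print(f"Face region is roughly {face_pixels:.0f} px tall.")
if face_pixels > GENERATED_FACE_SIZE:
    print("The generated face will be upscaled, so it may look slightly softer.")
else:
    print("The generated face should match the source sharpness.")
```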