Models | Sync

At sync, we’re building foundational models to understand and manipulate humans in video. Our suite of lipsyncing models lets you edit the lip movements of any speaker in any video to match target audio. Explore and compare the capabilities of the different models below.

Feature	lipsync-1.9.0-beta	lipsync-2	lipsync-2-pro
Description	Fast legacy lipsync for simple videos	Our most natural lipsyncing model yet. The first model that can preserve the unique speaking style of every speaker.	Our highest quality lipsyncing model with diffusion-based super resolution. Enhanced detail preservation for beards, teeth, and facial features.
Pricing @ 25fps	$0.02 to $0.025/sec	$0.04 to $0.05/sec	$0.067 to $0.083/sec
Accuracy
Speed
Style	Standard generic lip movements	Lip movements in the unique style of the speaker	Lip movements in the unique style of the speaker with enhanced detail and fidelity
Identity Preservation
Teeth
Face Detection
Face Blending
Pose Robustness
Beard
Face Resolution	512×512	512×512	512×512 with enhanced detail preservation
Best for	Best lipsync for the majority of videos	Best lipsync for the majority of videos	Better than the best: lipsync-2 with premium quality, highly recommended for professional needs. Seamlessly generates facial details such as beards, wrinkles, and teeth.

All models are available in both Studio and API.

Plan Requirements: The lipsync-2-pro model requires a Scale plan or higher for API access. Studio users can access all models regardless of plan, with billing applied per usage.
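
As a rough illustration of the per-second prices in the table above, the Python sketch below estimates a cost range for a clip. It assumes a 25 fps video and that one end of the price range applies to the whole clip; the names PRICE_PER_SEC and estimate_cost are illustrative and not part of any sync SDK.

```python
# Per-second price ranges in USD at 25 fps, copied from the comparison table.
PRICE_PER_SEC = {
    "lipsync-1.9.0-beta": (0.02, 0.025),
    "lipsync-2": (0.04, 0.05),
    "lipsync-2-pro": (0.067, 0.083),
}


def estimate_cost(model: str, duration_sec: float) -> tuple[float, float]:
    """Return the (low, high) estimated cost in USD for a 25 fps clip."""
    if model not in PRICE_PER_SEC:
        raise ValueError(f"unknown model: {model}")
    low, high = PRICE_PER_SEC[model]
    return round(low * duration_sec, 2), round(high * duration_sec, 2)


# Example: a one-minute clip on each model.
for name in PRICE_PER_SEC:
    print(name, estimate_cost(name, 60))
```

Which end of the range applies to your account is not stated here, so treat the result as a bracket rather than a quote.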

Advanced Options

Obstruction Detection: For challenging video content where faces may be partially hidden by objects, hands, or other elements, you can enable obstruction detection (the occlusion_detection_enabled option in a generation request). This feature improves face detection accuracy in complex scenes at the cost of slower generation. Enable it when faces are frequently obscured or when standard generation produces suboptimal results due to obstructions.

Caveats

Still Frame Limitation: Our lipsync models require natural speaking motion in the input video to function properly. If your video contains segments with still frames (where the speaker is not actively moving or speaking), lipsync will not work during those portions, even if audio is present.

This occurs because our models use 2-second independent chunks for inference and need to detect natural speaking style to generate appropriate lip movements. Static or still video segments don’t provide the necessary visual cues for the model to create realistic lip synchronization.

Recommendation: For best results, ensure your input video shows the speaker actively talking throughout the duration you want to lipsync.
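
To see which parts of a video a still segment can affect, the sketch below maps a still interval onto 2-second chunks. The 2-second chunk length comes from the explanation above; that chunks are contiguous and aligned to the start of the video is an assumption made for illustration, and chunks_touching is a hypothetical helper name.

```python
import math

CHUNK_SEC = 2.0  # chunk length from the text; alignment at t=0 is an assumption


def chunks_touching(still_start: float, still_end: float,
                    duration: float) -> list[tuple[float, float]]:
    """Return (start, end) times of the 2-second chunks a still segment overlaps."""
    still_end = min(still_end, duration)
    first = int(still_start // CHUNK_SEC)
    last = math.ceil(still_end / CHUNK_SEC)  # exclusive chunk index
    return [
        (i * CHUNK_SEC, min((i + 1) * CHUNK_SEC, duration))
        for i in range(first, last)
    ]


# A still segment from 3.0 s to 5.5 s in a 10 s video.
print(chunks_touching(3.0, 5.5, 10.0))
```

Under these assumptions, any chunk in the returned list contains some still footage and is a candidate for weak or missing lip movement.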

FAQs

Why isn't my AI-generated character's lip movement working properly?

The character in the input video needs to look like they are talking. Our models learn to mimic the speaking style in the input video. If the character is completely static, the model might not generate lips that move either.

Solution: When creating your AI-generated video, add the text prompt “person is speaking naturally” to your generation. This will create characters with lips that are already moving, which will work much better with our platform.

Can I use sync to lipsync a song to a character?

Absolutely. Our latest model is your best bet for this. For best results, isolate and upload only the vocal track, as instrumental sounds can interfere with lipsync quality.

Does sync work with animal faces or non-human characters?

You can lipsync human-like faces, but our models don’t currently support animals or non-humanoid characters.

Why is the lipsync quality poor or non-existent in certain parts of my video?

Please check if the problematic segments have:

Multiple speakers in the frame
Faces that are too small or in profile view
Segments where the speaker in the input video is not speaking
Faces that are partially obstructed by objects, hands, or other elements

For multiple people, try masking or cropping out some faces using external tools. For obstruction issues, consider enabling the occlusion_detection_enabled option in your generation request, which provides better face detection in complex scenes (though it will slow down processing). We’re working to resolve these issues in future model releases.
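
A minimal sketch of a generation request body with this option turned on is shown below. Only the model names and the occlusion_detection_enabled option come from this page; the overall payload shape (the "model", "input", and "options" keys) and the build_request helper are assumptions for illustration, so check the API reference for the actual schema before using it.

```python
import json

# Model names listed on this page.
KNOWN_MODELS = {"lipsync-1.9.0-beta", "lipsync-2", "lipsync-2-pro"}


def build_request(model: str, video_url: str, audio_url: str,
                  occlusion_detection: bool = False) -> dict:
    """Build a request body as a dict. The payload layout is hypothetical."""
    if model not in KNOWN_MODELS:
        raise ValueError(f"unknown model: {model}")
    return {
        "model": model,
        "input": [  # assumed structure, not confirmed by this page
            {"type": "video", "url": video_url},
            {"type": "audio", "url": audio_url},
        ],
        "options": {
            # Option name taken from the FAQ answer above.
            "occlusion_detection_enabled": occlusion_detection,
        },
    }


body = build_request(
    "lipsync-2",
    "https://example.com/input.mp4",
    "https://example.com/voice.wav",
    occlusion_detection=True,
)
print(json.dumps(body, indent=2))
```

Leaving the flag off by default mirrors the guidance above: turn it on only for footage with obstructed faces, since it slows down processing.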

Why does my video have poor quality lipsync when faces are in profile view?

Extreme profile view faces can lead to sub-par results. Please try with our latest model (lipsync-2), which has improved pose robustness and is your best option for challenging angles.

Why does the generated face appear to be lower resolution than my input video?

Our models generate faces at 512×512 resolution, which is sufficient for most 1080p videos. If the face in your input video is larger than that, the generated region may look softer than the rest of the frame. For the highest possible quality, lipsync-2-pro offers the best resolution handling with enhanced detail preservation, especially for beards, teeth, and fine facial features.
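
As a quick check for this situation, the sketch below compares the size of a face in the input frame with the 512×512 generation resolution. It assumes you already have the face bounding box in pixels from your own face detector, and that the longer side of the box is what matters; face_upscale_factor is an illustrative name, not a sync function.

```python
GENERATED_FACE_PX = 512  # generation resolution stated above


def face_upscale_factor(face_w_px: int, face_h_px: int) -> float:
    """Return how far a 512 px face would be stretched to fill the face box.

    Values above 1.0 mean the face in the frame is larger than the generated
    face, so some softness is more likely.
    """
    return max(face_w_px, face_h_px) / GENERATED_FACE_PX


# A close-up where the face box is 768x600 pixels.
print(face_upscale_factor(768, 600))
# A medium shot where the face box is 400x300 pixels.
print(face_upscale_factor(400, 300))
```

A factor well above 1.0 suggests trying lipsync-2-pro or reframing the shot so the face occupies fewer pixels.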

What makes lipsync-2-pro different from other models?

lipsync-2-pro uses advanced diffusion-based super resolution technology instead of traditional GAN-based approaches. This results in:

Enhanced beard resolution: Better handling of facial hair without blurring
Improved teeth generation: More consistent and natural-looking teeth across frames
Superior detail preservation: Enhanced quality around the mouth region and facial features
Better face size handling: Can process larger face regions (up to 350×350 pixels) without quality degradation

The trade-off is slower processing (1.5 to 2x slower than lipsync-2) and higher cost, so it is best suited to applications where the highest fidelity is required.