Speaker selection

Speaker selection helps you target the right face when a clip contains multiple people. You can either let Sync auto-detect the active speaker, or pass a user-selected point from your UI via the active_speaker_detection DTO on /v2/generate.

When to use what

  • Auto-detect: fastest setup; best for clips with a single or obvious speaker. Set auto_detect: true and skip the manual fields (see the sketch after this list).
  • Manual selection: best when multiple people are on screen or you want deterministic control. Provide a reference frame and a point on the speaker’s face, or supply per-frame bounding boxes if you already ran detection over the clip.
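
A minimal sketch of the auto-detect path, reusing the SDK call and placeholder assets from the request examples below; only auto_detect is set:

import { SyncClient } from "@sync.so/sdk";

const sync = new SyncClient();

// Auto-detect: no frameNumber, coordinates, or boundingBoxes needed.
await sync.generations.create({
  input: [
    { type: "video", url: "https://assets.sync.so/docs/example-video.mp4" },
    { type: "audio", url: "https://assets.sync.so/docs/example-audio.wav" }
  ],
  model: "lipsync-2",
  options: {
    activeSpeakerDetection: {
      autoDetect: true
    }
  }
});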

Workflow: selecting a speaker in your UI

1. Capture a reference frame

Seek the video to a frame where the target speaker’s face is visible. Keep track of the frame index you show in the UI.
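
A minimal browser-side sketch for deriving that frame index from a <video> element. The frame rate here is an assumed value for illustration; substitute your clip’s real fps from your media pipeline:

// Hypothetical helper: convert the current playback position of a <video>
// element into a frame index. FPS is an assumed value for illustration.
const FPS = 25;

function currentFrameIndex(video: HTMLVideoElement, fps: number = FPS): number {
  return Math.round(video.currentTime * fps);
}

// After the user seeks to the target speaker:
// const frameNumber = currentFrameIndex(videoElement);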

2. Collect a point on the face

Record the [x, y] coordinates of the clicked point on the speaker’s face, in the pixel coordinate system of the source frame you extracted (not the scaled element you display). Keep the frame index and coordinates paired.
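
A sketch of mapping a click on the rendered <video> element back to source-pixel coordinates. It assumes the element is not letterboxed; adjust if you render with padding:

// Hypothetical helper: translate a click on the displayed element into
// pixel coordinates of the underlying video frame.
function clickToVideoCoordinates(
  video: HTMLVideoElement,
  event: MouseEvent
): [number, number] {
  const rect = video.getBoundingClientRect();
  const scaleX = video.videoWidth / rect.width;   // displayed px -> source px
  const scaleY = video.videoHeight / rect.height;
  return [
    Math.round((event.clientX - rect.left) * scaleX),
    Math.round((event.clientY - rect.top) * scaleY)
  ];
}

// Keep the selection paired:
// const selection = { frameNumber, coordinates: clickToVideoCoordinates(videoElement, event) };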

3. Optional: provide bounding boxes instead

If you already ran face detection over the video, send bounding_boxes as a per-frame array. Each entry is [x1, y1, x2, y2] (top-left to bottom-right) or null when no face is present. This replaces the need for frame_number + coordinates.
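
A sketch of assembling that array from detections you already have. Here detections is a hypothetical map of frame index to box; frames without a detected face become null:

// Hypothetical input: detections keyed by frame index, [x1, y1, x2, y2] each.
type Box = [number, number, number, number];

function toBoundingBoxes(
  detections: Map<number, Box>,
  totalFrames: number
): (Box | null)[] {
  return Array.from({ length: totalFrames }, (_, frame) => detections.get(frame) ?? null);
}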

4. Send generation request

Set options.active_speaker_detection with either frame_number + coordinates, or bounding_boxes for every frame when you already have detections (frame_number and coordinates are not needed in that case). Leave auto_detect set to false so your manual selection is honored. Full request examples follow below.

ActiveSpeaker DTO fields

See the full API reference for active_speaker_detection. An illustrative TypeScript shape of these fields follows the list below.

  • auto_detect (boolean, default false): let Sync pick the active speaker automatically.
  • frame_number (number): frame index that corresponds to the provided coordinates.
  • coordinates ([x, y]): reference point on the speaker’s face in frame_number.
  • bounding_boxes ((number[] | null)[], optional): per-frame array of bounding boxes across the video. Each entry is [x1, y1, x2, y2] for the detected face in that frame (x1, y1 = top-left; x2, y2 = bottom-right), or null when no face is present. Use this instead of frame_number + coordinates when you already run detection over the clip.
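
As a readability aid, the fields above can be summarized as the following illustrative TypeScript shape (snake_case as sent to the REST endpoint; the SDK examples below use the camelCase equivalents). The API reference remains the authoritative schema:

// Illustrative only; see the API reference for the authoritative schema.
interface ActiveSpeakerDetection {
  auto_detect?: boolean;                                         // default false
  frame_number?: number;                                         // paired with coordinates
  coordinates?: [number, number];                                // [x, y] on the speaker's face
  bounding_boxes?: ([number, number, number, number] | null)[];  // one entry per frame, or null
}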

Request examples

TypeScript SDK (manual selection with frame_number + coordinates):

import { SyncClient } from "@sync.so/sdk";

const sync = new SyncClient();

const response = await sync.generations.create({
  input: [
    { type: "video", url: "https://assets.sync.so/docs/example-video.mp4" },
    { type: "audio", url: "https://assets.sync.so/docs/example-audio.wav" }
  ],
  model: "lipsync-2",
  options: {
    activeSpeakerDetection: {
      autoDetect: false,
      frameNumber: 240,
      coordinates: [640, 360]
    }
  }
});

The same request over the REST API with curl:

curl -X POST https://api.sync.so/v2/generate \
  -H "x-api-key: $SYNC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "lipsync-2",
    "input": [
      { "type": "video", "url": "https://assets.sync.so/docs/example-video.mp4" },
      { "type": "audio", "url": "https://assets.sync.so/docs/example-audio.wav" }
    ],
    "options": {
      "active_speaker_detection": {
        "auto_detect": false,
        "frame_number": 240,
        "coordinates": [640, 360]
      }
    }
  }'

TypeScript SDK (bounding boxes for every frame; no frame_number or coordinates):

import { SyncClient } from "@sync.so/sdk";

const sync = new SyncClient();

await sync.generations.create({
  input: [
    { type: "video", url: "https://assets.sync.so/docs/example-video.mp4" },
    { type: "audio", url: "https://assets.sync.so/docs/example-audio.wav" }
  ],
  model: "lipsync-2",
  options: {
    activeSpeakerDetection: {
      autoDetect: false,
      // boundingBoxes aligned to video frames; null where no box is present.
      boundingBoxes: [
        null,                  // frame 0
        [520, 280, 760, 520],  // frame 1 (speaker A) -> [x1, y1, x2, y2]
        [120, 260, 320, 500],  // frame 2 (speaker B)
        null                   // frame 3
        // ...one entry per frame in the clip
      ]
    }
  }
});

If you prefer auto-detection, omit the manual fields and set auto_detect to true. For manual control, either provide frame_number + coordinates from your UI selection, or supply bounding_boxes for every frame if you already ran detection (frame_number and coordinates are not needed in that case).