Speaker selection

Speaker selection helps you target the right face when a clip contains multiple people. You can either let Sync auto-detect the active speaker, or pass a user-selected point from your UI via the active_speaker_detection DTO on /v2/generate.

When to use what

  • Auto-detect: fastest setup; best for clips with a single or obvious speaker. Set auto_detect: true and skip the manual fields (see the sketch after this list).
  • Manual selection: best when multiple people are on screen or you want deterministic control. Provide a reference frame and a point on the speaker’s face, or supply per-frame bounding boxes if you already ran detection over the clip.
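
A minimal sketch of the auto-detect path, reusing the SDK call and placeholder assets from the request examples below; only auto_detect is set:

import { SyncClient } from "@sync.so/sdk";

const sync = new SyncClient();

// Auto-detect: no frameNumber, coordinates, or boundingBoxes needed.
await sync.generations.create({
  input: [
    { type: "video", url: "https://assets.sync.so/docs/example-video.mp4" },
    { type: "audio", url: "https://assets.sync.so/docs/example-audio.wav" }
  ],
  model: "lipsync-2",
  options: {
    activeSpeakerDetection: {
      autoDetect: true
    }
  }
});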

Workflow: selecting a speaker in your UI

1. Capture a reference frame

Seek the video to a frame where the target speaker’s face is visible. Keep track of the frame index you show in the UI.
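
A minimal browser-side sketch for deriving that frame index from a <video> element. The frame rate here is an assumed value for illustration; substitute your clip’s real fps from your media pipeline:

// Hypothetical helper: convert the current playback position of a <video>
// element into a frame index. FPS is an assumed value for illustration.
const FPS = 25;

function currentFrameIndex(video: HTMLVideoElement, fps: number = FPS): number {
  return Math.round(video.currentTime * fps);
}

// After the user seeks to the target speaker:
// const frameNumber = currentFrameIndex(videoElement);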

2. Collect a point on the face

Record the [x, y] coordinates of the clicked point on the speaker’s face, in the pixel coordinate system of the source frame you extracted (not the scaled element you display). Keep the frame index and coordinates paired.
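
A sketch of mapping a click on the rendered <video> element back to source-pixel coordinates. It assumes the element is not letterboxed; adjust if you render with padding:

// Hypothetical helper: translate a click on the displayed element into
// pixel coordinates of the underlying video frame.
function clickToVideoCoordinates(
  video: HTMLVideoElement,
  event: MouseEvent
): [number, number] {
  const rect = video.getBoundingClientRect();
  const scaleX = video.videoWidth / rect.width;   // displayed px -> source px
  const scaleY = video.videoHeight / rect.height;
  return [
    Math.round((event.clientX - rect.left) * scaleX),
    Math.round((event.clientY - rect.top) * scaleY)
  ];
}

// Keep the selection paired:
// const selection = { frameNumber, coordinates: clickToVideoCoordinates(videoElement, event) };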

3. Optional: provide bounding boxes instead

If you already ran face detection over the video, send bounding_boxes as a per-frame array. Each entry is [x1, y1, x2, y2] (top-left to bottom-right) or null when no face is present. This replaces the need for frame_number + coordinates.
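
A sketch of assembling that array from detections you already have. Here detections is a hypothetical map of frame index to box; frames without a detected face become null:

// Hypothetical input: detections keyed by frame index, [x1, y1, x2, y2] each.
type Box = [number, number, number, number];

function toBoundingBoxes(
  detections: Map<number, Box>,
  totalFrames: number
): (Box | null)[] {
  return Array.from({ length: totalFrames }, (_, frame) => detections.get(frame) ?? null);
}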

4. Send generation request

Set options.active_speaker_detection with either frame_number + coordinates, or bounding_boxes for every frame when you already have detections (frame_number and coordinates are not needed in that case). Leave auto_detect set to false so your manual selection is honored. Full request examples follow below.

ActiveSpeaker DTO fields

See the full API reference for active_speaker_detection. An illustrative TypeScript shape of these fields follows the list below.

  • auto_detect (boolean, default false): let Sync pick the active speaker automatically.
  • frame_number (number): frame index that corresponds to the provided coordinates.
  • coordinates ([x, y]): reference point on the speaker’s face in frame_number.
  • bounding_boxes ((number[] | null)[], optional): per-frame array of bounding boxes across the video. Each entry is [x1, y1, x2, y2] for the detected face in that frame (x1, y1 = top-left; x2, y2 = bottom-right), or null when no face is present. Use this instead of frame_number + coordinates when you already run detection over the clip.
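
As a readability aid, the fields above can be summarized as the following illustrative TypeScript shape (snake_case as sent to the REST endpoint; the SDK examples below use the camelCase equivalents). The API reference remains the authoritative schema:

// Illustrative only; see the API reference for the authoritative schema.
interface ActiveSpeakerDetection {
  auto_detect?: boolean;                                         // default false
  frame_number?: number;                                         // paired with coordinates
  coordinates?: [number, number];                                // [x, y] on the speaker's face
  bounding_boxes?: ([number, number, number, number] | null)[];  // one entry per frame, or null
}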

Request examples

TypeScript SDK (manual selection with frame_number + coordinates):

import { SyncClient } from "@sync.so/sdk";

const sync = new SyncClient();

const response = await sync.generations.create({
  input: [
    { type: "video", url: "https://assets.sync.so/docs/example-video.mp4" },
    { type: "audio", url: "https://assets.sync.so/docs/example-audio.wav" }
  ],
  model: "lipsync-2",
  options: {
    activeSpeakerDetection: {
      autoDetect: false,
      frameNumber: 240,
      coordinates: [640, 360]
    }
  }
});

The same request over the REST API with curl:

curl -X POST https://api.sync.so/v2/generate \
  -H "x-api-key: $SYNC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "lipsync-2",
    "input": [
      { "type": "video", "url": "https://assets.sync.so/docs/example-video.mp4" },
      { "type": "audio", "url": "https://assets.sync.so/docs/example-audio.wav" }
    ],
    "options": {
      "active_speaker_detection": {
        "auto_detect": false,
        "frame_number": 240,
        "coordinates": [640, 360]
      }
    }
  }'

TypeScript SDK (bounding boxes for every frame; no frame_number or coordinates):

import { SyncClient } from "@sync.so/sdk";

const sync = new SyncClient();

await sync.generations.create({
  input: [
    { type: "video", url: "https://assets.sync.so/docs/example-video.mp4" },
    { type: "audio", url: "https://assets.sync.so/docs/example-audio.wav" }
  ],
  model: "lipsync-2",
  options: {
    activeSpeakerDetection: {
      autoDetect: false,
      // boundingBoxes aligned to video frames; null where no box is present.
      boundingBoxes: [
        null,                  // frame 0
        [520, 280, 760, 520],  // frame 1 (speaker A) -> [x1, y1, x2, y2]
        [120, 260, 320, 500],  // frame 2 (speaker B)
        null                   // frame 3
        // ...one entry per frame in the clip
      ]
    }
  }
});

If you prefer auto-detection, omit the manual fields and set auto_detect to true. For manual control, either provide frame_number + coordinates from your UI selection, or supply bounding_boxes for every frame if you already ran detection (frame_number and coordinates are not needed in that case).