kwindla · 2025-08-25T19:31:24 1756150284

As someone who spends a lot of time looking at timestamped log lines to debug Pipecat pipelines, I'm a big fan of this work from Aleix.

In general, I have three pain points with debugging realtime, multi-model, multi-modal AI stuff: 1. Where is the latency creeping in? 2. What context actually got passed to the models? 3. Did the model/processor get data in the format it expected?

For 1 and 3, Whisker is a big step forward. For 2, something like Langfuse (OpenTelemetry) is very helpful.
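
For the first pain point, here is a minimal sketch of finding where latency creeps in from timestamped log lines. The log format, the stage names, and the `stage_gaps` helper are all made up for illustration; this is not Pipecat's actual log output or API.

```python
from datetime import datetime


def stage_gaps(log_lines):
    """Return (seconds_since_previous_line, message) for each line after the first.

    Assumes each line is "<ISO 8601 timestamp> <message>" (a hypothetical format).
    """
    parsed = []
    for line in log_lines:
        stamp, _, message = line.partition(" ")
        parsed.append((datetime.fromisoformat(stamp), message))
    return [
        ((cur_time - prev_time).total_seconds(), message)
        for (prev_time, _), (cur_time, message) in zip(parsed, parsed[1:])
    ]


# Hypothetical log lines for one voice turn.
SAMPLE_LOG = [
    "2025-08-25T19:31:24.000 user stopped speaking",
    "2025-08-25T19:31:24.180 transcription final",
    "2025-08-25T19:31:24.930 llm first token",
    "2025-08-25T19:31:25.140 tts first audio",
]

gaps = stage_gaps(SAMPLE_LOG)
slowest_gap, slowest_stage = max(gaps)  # largest gap and the stage that ended it

for seconds, message in gaps:
    print(f"{seconds * 1000:6.0f} ms  -> {message}")
```

With this sample data, the largest gap is the one ending at "llm first token", which is where you would start looking.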

kwindla · 2025-04-25T16:00:09 1745596809

https://www.slate.auto/en

The configurator is fun:

https://www.slate.auto/en/personalization

rawgabbit · 2025-04-25T16:55:12 1745600112

Thanks for the link. I see they sell portable Bluetooth speakers we can mount under the dash. I like the idea of DIY wrapping both the interior and exterior; I can imagine anime fanboys like my son coming up with very wild art for these wraps. I had also forgotten that cars used to have hand cranks to roll up the windows.

kwindla · 2025-03-07T17:50:31 1741369831

In general, for realtime voice AI you don't want this model to support multiple speakers because you have a separate voice input stream for each participant in a session.

We're not doing "speaker diarization" from a single audio track, here. We're streaming the input from each participant.

If there are multiple participants in a session, we still process each stream separately either as it comes in from that user's microphone (locally) or as it arrives over the network (server-side).
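
A minimal sketch of what "one stream per participant" means in practice: audio is keyed by participant from the start, so nothing ever has to work out who is speaking. The `PerParticipantAudio` class and the participant IDs are hypothetical and are not part of any real API.

```python
from collections import defaultdict


class PerParticipantAudio:
    """Keeps a separate audio buffer for each participant in a session."""

    def __init__(self):
        # participant_id -> raw audio bytes received so far
        self._buffers = defaultdict(bytearray)

    def push(self, participant_id, chunk):
        """Append an incoming audio chunk to that participant's own buffer."""
        self._buffers[participant_id].extend(chunk)

    def take(self, participant_id):
        """Return and clear the buffered audio for one participant."""
        audio = bytes(self._buffers[participant_id])
        self._buffers[participant_id].clear()
        return audio


session = PerParticipantAudio()
session.push("alice", b"\x01\x02")
session.push("bob", b"\x09")
session.push("alice", b"\x03")

alice_audio = session.take("alice")  # only Alice's chunks, in arrival order
bob_audio = session.take("bob")      # only Bob's chunks
```

Each buffer would then be handed to its own turn detection instance, so the model only ever sees a single speaker.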

kwindla · 2025-03-07T17:48:16 1741369696

I've talked about this a lot with friends.

Endpoint detection (along with phrase endpointing and end of utterance) is a term from the academic literature on this problem and related ones.

Very few people who are doing "AI Engineering" or even "Machine Learning" today know these terms. In the past, I argued that we should use the existing academic language rather than invent new terms.

But then OpenAI released the Realtime API and called this "turn detection" in their docs. And that was that. It no longer made sense to use any other verbiage.

mncharity · 2025-03-07T23:43:37 1741391017

Re SEO, I note "utterance" only occurs once, in a perhaps-ephemeral "Things to do" description.

To help with "what is?" and SEO, perhaps something like "Turn detection (aka [...], end of utterance)"... ?

lelag · 2025-03-07T18:33:04 1741372384

Thanks for the explanation. I guess it makes some sense, considering many people with no NLP background are using those models now…

kwindla · 2025-03-07T05:12:48 1741324368

A couple of interesting updates today:

- 100ms inference using CoreML: https://x.com/maxxrubin_/status/1897864136698347857

- An LSTM model (1/7th the size) trained on a subset of the data: https://github.com/pipecat-ai/smart-turn/issues/1

kwindla · 2025-03-07T04:46:25 1741322785

It takes about 45 minutes to do the current training run on an L4 GPU with these settings:

    # Training parameters
    "learning_rate": 5e-5,
    "num_epochs": 10,
    "train_batch_size": 12,
    "eval_batch_size": 32,
    "warmup_ratio": 0.2,
    "weight_decay": 0.05,

    # Evaluation parameters
    "eval_steps": 50,
    "save_steps": 50,
    "logging_steps": 5,

    # Model architecture parameters
    "num_frozen_layers": 20

I haven't seen a run do all 10 epochs recently. There's usually an early stop after about 4 epochs.

The current data set size is ~8,000 samples.
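
A rough back-of-the-envelope sketch of what those settings imply in optimizer steps. It assumes things the numbers above don't state: that all ~8,000 samples are used for training (no held-out split), a single GPU, no gradient accumulation, and that warmup is computed from the planned 10 epochs rather than from the epochs actually run.

```python
import math

# Values quoted in the settings above.
num_samples = 8000        # "~8,000 samples"
train_batch_size = 12
num_epochs = 10
warmup_ratio = 0.2
eval_steps = 50
early_stop_epoch = 4      # "early stop after about 4 epochs"

# Assumption: every sample is a training sample, one optimizer step per batch.
steps_per_epoch = math.ceil(num_samples / train_batch_size)
planned_steps = steps_per_epoch * num_epochs

# Assumption: warmup length is a fraction of the planned total steps.
warmup_steps = round(planned_steps * warmup_ratio)

steps_at_early_stop = steps_per_epoch * early_stop_epoch
evals_before_early_stop = steps_at_early_stop // eval_steps

print(f"steps per epoch:        {steps_per_epoch}")
print(f"planned total steps:    {planned_steps}")
print(f"warmup steps:           {warmup_steps}")
print(f"steps at early stop:    {steps_at_early_stop}")
print(f"evals before early stop: {evals_before_early_stop}")
```

Under those assumptions, a run that stops after 4 epochs spends about half of its steps in warmup, which may be worth keeping in mind when tuning `warmup_ratio`.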

kwindla · 2025-03-07T03:25:39 1741317939

Turn detection is deciding when a person has finished talking and expects the other party in a conversation to respond. In this case, the other party in the conversation is an LLM!

remram · 2025-03-07T03:31:17 1741318277

Oh I see. Not like segmenting a conversation where people speak in turn. Thanks.

password4321 · 2025-03-07T21:28:08 1741382888

Speaker diarization is also still a tough problem for free models.

whiddershins · 2025-03-07T17:04:42 1741367082

huh. how is analyzing conversations in the manner you described NOT the way to train such a model?

remram · 2025-03-07T18:13:18 1741371198

Did you reply to the wrong comment? No one is talking about training here.

kwindla · 2025-03-07T01:44:52 1741311892

Can you say more? There's not much open source work in this domain that I've been able to find.

I'm particularly interested in architecture variations, approaches to the classification head design and loss function, etc.

kwindla · 2025-03-07T01:44:27 1741311867

580M parameters. More info about the model architecture: https://github.com/pipecat-ai/smart-turn?tab=readme-ov-file#...

cyberbiosecure · 2025-03-07T05:15:06 1741324506

580m, awesome, incredible

meltyness · 2025-03-07T15:45:18 1741362318

... but will the model learn when to interrupt you out of frustration with your ongoing statements, and start shouting?

it seems like for the obvious use-cases there might need to be some sort of limit on how much this component knows

kwindla · on Dec 11, 2024

The Multimodal Live API is free while the model/API is in preview. My guess is that they will be pretty aggressive with pricing when it's in GA, given the 1.5 Flash multimodal pricing.

If you're interested in this stuff, here's a full chat app for the new Gemini 2 APIs with text, audio, image, camera video and screen video. It shows both how to use the WebSocket API and how to route through WebRTC infrastructure.

https://github.com/pipecat-ai/gemini-multimodal-live-demo

dandiep · on Dec 11, 2024

Thanks, this is great!