I’m confused by the clustering step: > To find the most informative examples, we...

fumeux_fume · 2025-08-08T13:35:24 1754660124

The information you're seeking appears to be left out of the post. My best guess is that a separate embedding model, specifically tuned for document similarity, is used to generate the vectors, and then a clustering algorithm is chosen to create the clusters. They may also use PCA to reduce the dimensionality of the embedding vectors before clustering.
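To make that guess concrete, here is a minimal sketch of the clustering half only. Everything in it is an assumption for illustration: the post does not say which algorithm is used, so plain k-means stands in; the `embeddings` list is made-up data rather than output from a real embedding model; and the optional PCA step is left out.

```python
import math

def kmeans(points, k, iters=20):
    """Plain k-means (Lloyd's algorithm) with a deterministic farthest-point init."""
    # Start from the first point, then repeatedly add the point that is
    # farthest from every centroid chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(xs) / len(members) for xs in zip(*members))
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids

# Hypothetical stand-ins for document embedding vectors (two obvious groups).
embeddings = [
    (0.9, 0.1, 0.0), (1.0, 0.2, 0.1), (0.8, 0.0, 0.1),
    (0.1, 0.9, 1.0), (0.0, 1.0, 0.9), (0.2, 0.8, 1.0),
]
labels, centroids = kmeans(embeddings, k=2)
print(labels)
```

In a real pipeline the vectors would have hundreds of dimensions, which is where a reduction step such as PCA would plausibly come in before the clustering call.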

overfeed · 2025-08-08T17:27:40 1754674060

> How can you get overlapping clusters if the two sets of labelled examples are disjoint?

What's disjoint are the training labels and the classifier's outputs, not the values in high-dimensional space. In classification tasks, neighboring items can sit in the same cluster yet fall on opposite sides of the hyperplane, and so be placed in different classes despite their proximity.
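A toy illustration of that point, assuming a hypothetical 2-D linear classifier (the weights `w`, bias `b`, and all three points are made up): two points that are nearly on top of each other get different classes, while a point much farther away gets the same class as the first.

```python
import math

# Hypothetical linear classifier: class 1 when w.x + b > 0, else class 0.
w = (1.0, -1.0)
b = 0.0

def predict(x):
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else 0

a = (0.50, 0.48)   # just on the positive side of the hyperplane
c = (0.50, 0.52)   # just on the negative side, very close to a
far = (3.0, 0.1)   # far from a, but on the same side

print("dist(a, c)   =", round(math.dist(a, c), 3), "classes:", predict(a), predict(c))
print("dist(a, far) =", round(math.dist(a, far), 3), "classes:", predict(a), predict(far))
```

Any distance-based clustering would group `a` with `c` long before it grouped `a` with `far`, so a single cluster can easily contain items from both classes.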

patresh · 2025-08-08T11:11:06 1754651466

If the diagram is representative of what is happening, it would seem that each cluster is represented as a hypersphere, possibly using the cluster centroid as the center and the max distance from the centroid to any cluster member as the radius. Those hyperspheres can then overlap. Not sure if that is what is actually happening though.
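Here is a small sketch of that hypersphere reading, to show that disjoint member sets can still produce overlapping spheres. The construction (centroid plus max member distance) is the guess described above, not something the post confirms, and the cluster points are invented.

```python
import math

def bounding_sphere(points):
    """Centroid of the points, plus the max distance from it to any member."""
    centroid = tuple(sum(xs) / len(points) for xs in zip(*points))
    radius = max(math.dist(centroid, p) for p in points)
    return centroid, radius

def spheres_overlap(s1, s2):
    """Two spheres overlap when their centers are closer than the sum of radii."""
    (c1, r1), (c2, r2) = s1, s2
    return math.dist(c1, c2) < r1 + r2

# Hypothetical clusters with no shared members.
cluster_a = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]   # one outlier stretches the radius
cluster_b = [(4.0, 1.0), (5.0, 1.0), (6.0, 1.0)]
cluster_c = [(20.0, 20.0), (21.0, 20.0)]

sphere_a = bounding_sphere(cluster_a)
sphere_b = bounding_sphere(cluster_b)
sphere_c = bounding_sphere(cluster_c)

print(spheres_overlap(sphere_a, sphere_b))  # a's outlier reaches into b's region
print(spheres_overlap(sphere_a, sphere_c))  # c is far away from both
```

A single outlier is enough to inflate the radius, which is why this representation tends to report overlap even when the point sets themselves never mix.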

cm228 · 2025-08-08T10:51:25 1754650285

They cluster the examples with their model and then check the predictions against the labels.