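It's probably because of the interval between consecutive I-frames (or intra-refresh sweeps, if the stream doesn't use I-frames) in the video stream: the decoder can't show a clean picture from the new stream until one arrives. Audio, on the other hand, should switch "instantaneously" (within 1 frame).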
Hmm, if that were the case, wouldn't the time spent waiting be variable? That would mean you'd sometimes get instant switches, whenever an I-frame happened to be the next frame you received after the switch.
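
To put rough numbers on that, here is a minimal sketch that lists the wait for every possible switch point inside one group of pictures. It assumes a fixed I-frame interval, a decoder that simply drops frames until the first I-frame, and a switch moment that is unrelated to the I-frame schedule. `GOP_LENGTH`, `FPS` and both function names are made up for illustration and not taken from any real stream or player, and an intra-refresh stream would need a different model.

```python
# Hypothetical stream parameters, chosen only for illustration.
GOP_LENGTH = 30  # frames from one I-frame to the next
FPS = 30         # frames per second


def frames_until_iframe(next_frame_index: int, gop_length: int) -> int:
    """Number of frames dropped before the first I-frame arrives.

    next_frame_index is the position inside the GOP of the first frame
    received after the switch; position 0 is the I-frame itself.
    """
    return (-next_frame_index) % gop_length


def wait_distribution(gop_length: int) -> list:
    """Wait, in frames, for each possible switch position in one GOP."""
    return [frames_until_iframe(k, gop_length) for k in range(gop_length)]


waits = wait_distribution(GOP_LENGTH)
mean_wait_s = sum(waits) / len(waits) / FPS

print("shortest wait (frames):", min(waits))
print("longest wait (frames):", max(waits))
print("mean wait (seconds):", round(mean_wait_s, 3))
```

Under those assumptions the wait takes every value from 0 to `GOP_LENGTH - 1` frames equally often, so an instant switch should only show up about once in every `GOP_LENGTH` attempts. If the measured delay is the same on every switch, the I-frame interval alone probably doesn't explain it.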