For big data sets or complex SIMD algorithms, the I/O bandwidth overhead is tiny compared to the speedup achieved by moving the calculation to the GPU.
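
As a rough back-of-the-envelope sketch of why that's true for compute-heavy kernels (every figure below, the element count, PCIe bandwidth, and CPU/GPU throughput, is an illustrative assumption, not a measurement of any real hardware): for an O(n^2) kernel over n elements, the copy in and copy back cost milliseconds while the kernel itself costs seconds on either device, so the transfer is noise.

    package main

    import "fmt"

    func main() {
        const (
            n            = 1 << 20 // number of elements (assumed)
            bytesPerElem = 4.0     // float32
            pcieGBps     = 12.0    // assumed effective PCIe bandwidth, GB/s
            gpuGFLOPs    = 1000.0  // assumed sustained GPU throughput, GFLOP/s
            cpuGFLOPs    = 50.0    // assumed sustained CPU throughput, GFLOP/s
        )

        transferSec := (n * bytesPerElem * 2) / (pcieGBps * 1e9) // copy in + copy back
        flops := float64(n) * float64(n)                         // O(n^2) work
        gpuSec := flops / (gpuGFLOPs * 1e9)
        cpuSec := flops / (cpuGFLOPs * 1e9)

        // With these made-up numbers: transfer ~0.0007s, GPU ~1.1s, CPU ~22s.
        fmt.Printf("transfer %.4fs, gpu kernel %.2fs, cpu %.2fs\n", transferSec, gpuSec, cpuSec)
    }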

For the calculations that don't work well on the GPU (small data sets, simple per-element work, or bandwidth constraints), we could just run the code in parallel across multiple cores using multiple goroutines.
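
Something like this minimal sketch, assuming nothing beyond the standard library: split the slice into per-core chunks and fan the work out to goroutines.

    package main

    import (
        "fmt"
        "runtime"
        "sync"
    )

    // sumSquares splits xs into one chunk per CPU and processes the chunks concurrently.
    func sumSquares(xs []float64) float64 {
        workers := runtime.NumCPU()
        chunk := (len(xs) + workers - 1) / workers

        partial := make([]float64, workers)
        var wg sync.WaitGroup
        for w := 0; w < workers; w++ {
            lo := w * chunk
            if lo >= len(xs) {
                break
            }
            hi := lo + chunk
            if hi > len(xs) {
                hi = len(xs)
            }
            wg.Add(1)
            go func(w int, part []float64) {
                defer wg.Done()
                var s float64
                for _, x := range part {
                    s += x * x
                }
                partial[w] = s // each worker writes only its own slot, so no locking needed
            }(w, xs[lo:hi])
        }
        wg.Wait()

        var total float64
        for _, s := range partial {
            total += s
        }
        return total
    }

    func main() {
        xs := make([]float64, 1_000_000)
        for i := range xs {
            xs[i] = float64(i)
        }
        fmt.Println(sumSquares(xs))
    }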

I think eventually (and this seems to be the direction companies like AMD are headed) we'll have a couple of big cores (maybe up to 4) right next to a bunch of smaller, wimpy GPU-like cores which handle SIMD, making SIMD on the big cores all but redundant. We're not there yet, but AMD and Intel are both working on getting their on-chip GPUs to share memory with the CPU directly. At the moment the focus is mainly gaming performance, so that textures, etc. don't have to be copied from main memory to the GPU; the same functionality will greatly benefit GPGPU, though. Once we have this heterogeneous architecture and newer, faster memory technologies, the problems with using the GPU for SIMD will disappear.

But for the moment, with the real-world technology constraints we have, you're absolutely right about the limitations of GPGPU.



Running code in parallel across multiple cores is going to lose to SIMD. I don't think SIMD is going away anytime soon.
