Hacker News

If you don't have a fundamentally serial workload (and usually either you don't, or you have many such workloads that you can parallelize across tasks) and you are willing to write bespoke CUDA code for that workload, Nvidia is telling the truth.

CUDA's sweet spot lies between embarrassingly parallel (for which ASICs and FPGAs rule the world because these are generally pure compute with low memory bandwidth overhead) and serial (for which CPUs are still best), a place I call "annoyingly parallel." There are a lot of workloads in this space in my experience.
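A minimal sketch of what "annoyingly parallel" looks like in practice (names and sizes are illustrative, not from the original comment): a block-level sum reduction. The work is parallel, but unlike a pure embarrassingly parallel map, threads must communicate through shared memory and synchronize at barriers, and the kernel is memory-bandwidth-bound rather than compute-bound.

```cuda
#include <cuda_runtime.h>

// Hypothetical example: sum-reduce each block's slice of `in` into
// one element of `out`. Launch with dynamic shared memory of
// blockDim.x * sizeof(float).
__global__ void blockSum(const float *in, float *out, int n) {
    extern __shared__ float sdata[];
    unsigned tid = threadIdx.x;
    unsigned i   = blockIdx.x * blockDim.x + threadIdx.x;

    // Stage one element per thread into fast shared memory.
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction: log2(blockDim.x) steps, each needing a
    // barrier. This cross-thread coordination is exactly what a
    // purely embarrassingly parallel workload never needs.
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0) out[blockIdx.x] = sdata[0];
}
```

An ASIC or FPGA pipeline handles the pure-map part trivially; a CPU handles the serial tail; the barrier-and-bandwidth middle is where the GPU earns its keep.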

But if you don't satisfy both of the aforementioned requirements, and/or you insist on doing all of this from someone else's code interfaced through a weakly typed, garbage-collected language with a global interpreter lock, your mileage will vary greatly cough deep learning frameworks cough.

Finally, it doesn't matter who's doing it, marchitecturing(tm) drives me nuts too.



They're not lying. To parallelize, you have to code in a different style - though it doesn't have to be CUDA. However, it's easier to enforce that style in a parallel-specific language, and such a language can support the parallel idioms directly.
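One concrete instance of the style difference (a hedged sketch; the `saxpy` name and signatures are illustrative): the serial version walks the array with one loop, while the idiomatic CUDA version uses a grid-stride loop so each thread owns a strided subset of indices and the kernel stays correct for any launch configuration.

```cuda
// Serial style: one thread of control walks the whole array.
void saxpy_serial(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// Parallel style: the loop index is derived from the thread's
// position in the grid, and the stride covers the whole grid,
// so correctness no longer depends on n matching the launch size.
__global__ void saxpy_parallel(int n, float a, const float *x, float *y) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x) {
        y[i] = a * x[i] + y[i];
    }
}
```

The transformation is mechanical here, but enforcing it everywhere - no hidden loop-carried dependencies, no shared mutable state - is the discipline a parallel-specific language can impose that a general-purpose one cannot.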

Controlling the language certainly helps Nvidia's economic moat.



