When the cost of individual requests varies widely, it's difficult to get load balancing right. When we rolled out Docker I saw a regression in p95 response time. I countered this by doubling our instance size and halving the instance count, which made the number of processes per machine slightly higher than the number of machines instead of far lower. I reasoned that the local load balancing on each machine would be a bit fairer, and the results bore that out.
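A rough way to sanity-check that reasoning is a toy queueing simulation. The sketch below is not our actual setup: it assumes the outer balancer picks a machine uniformly at random, that each machine hands a request to whichever local process frees up first, and that request costs are lognormal (mostly cheap, occasionally very expensive). The function name, the 18x5 and 9x10 layouts, and the 80% utilization are all made-up numbers for illustration.

```python
import heapq
import math
import random


def p95_response_time(machines, procs_per_machine, n_requests=60_000,
                      utilization=0.8, seed=1):
    """Toy model: p95 response time behind a two-level load balancer."""
    workload = random.Random(seed)      # arrival times and request costs
    routing = random.Random(seed + 1)   # outer balancer's machine choice

    # Assumption: lognormal request cost with sigma=1, whose mean is exp(0.5).
    mean_cost = math.exp(0.5)
    arrival_rate = utilization * machines * procs_per_machine / mean_cost

    # One min-heap per machine holding the time each process becomes free.
    free_at = [[0.0] * procs_per_machine for _ in range(machines)]

    now = 0.0
    responses = []
    for _ in range(n_requests):
        now += workload.expovariate(arrival_rate)
        cost = workload.lognormvariate(0.0, 1.0)

        # Outer balancer: blind, uniform random choice of machine.
        heap = free_at[routing.randrange(machines)]

        # Local balancer: the request goes to the process on that machine
        # that frees up first, i.e. a shared FIFO queue per machine.
        start = max(now, heapq.heappop(heap))
        heapq.heappush(heap, start + cost)
        responses.append(start + cost - now)

    responses.sort()
    return responses[int(0.95 * len(responses))]


if __name__ == "__main__":
    # Same total of 90 processes and the same request stream in both runs.
    print("18 machines x 5 procs :", p95_response_time(18, 5))
    print(" 9 machines x 10 procs:", p95_response_time(9, 10))
```

Both layouts see an identical stream of requests, so any difference in p95 comes only from how well each machine can absorb an expensive request locally. The model deliberately ignores everything else that changes with instance size, such as memory or CPU available per request.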
I'm not 100% sure it's just load balancing. It would depend on the details of the setup, but bigger instances also let you throw more resources at each request.
I mean, obviously there is a point where splitting up the instances doesn't help, because you're just leaving more instances completely idle or with too few resources to be useful.