AOT, JIT, bytecode: none of this matters if you can't control your data layout and access patterns. All this talk of performance, and still not a single mention of cache misses anywhere.
There seems to be a desire for this magic inliner and compiler that fixes all your performance problems when it just doesn't work like that.
Until you understand your data and have complete control over it, none of this other stuff matters.
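To make the access-pattern point concrete, here's a minimal Python sketch (sizes are made up for illustration): the same data summed in two traversal orders. Row order walks memory in the order it was laid out; column order strides across it.

```python
# Minimal sketch: the same data, two access orders.
# Row-order traversal visits elements in layout order; column-order
# traversal strides across rows, which is the cache-hostile pattern.
import timeit

N = 500
matrix = [[i * N + j for j in range(N)] for i in range(N)]

def row_order():
    total = 0
    for row in matrix:
        for x in row:
            total += x
    return total

def col_order():
    total = 0
    for j in range(N):
        for i in range(N):
            total += matrix[i][j]
    return total

assert row_order() == col_order()
# Caveat: on CPython much of the gap you measure here is interpreter
# overhead and pointer-chasing through boxed ints, not raw cache
# lines, but the access-pattern principle is the same one that
# dominates in native code.
print("row order:", timeit.timeit(row_order, number=10))
print("col order:", timeit.timeit(col_order, number=10))
```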
You have it backwards; cache effects only start to matter once you have the other things right. For example, being able to transform boxed to unboxed arithmetic, or to optimize out function calls from tight loops, will, at least in most cases, give you a bigger performance boost than rearranging data for cache-friendliness will.
That doesn't matter much if each function call accesses thousands of memory locations. But if you have many small functions (calls that (un)box objects, property accessors, or small closures passed to another function), getting rid of those calls by inlining the code makes a huge difference.
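A rough sketch of that small-call overhead in Python (the class and accessor are hypothetical, just to have something to time): a trivial getter is pure call overhead next to the equivalent direct read, which is what inlining would produce.

```python
# Hypothetical sketch: the per-call overhead that inlining removes.
# A trivial accessor is timed against the equivalent direct read.
import timeit

class Point:
    __slots__ = ("x", "y")
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def get_x(self):        # a tiny accessor: all overhead, no work
        return self.x

p = Point(1.0, 2.0)

def through_accessor():
    total = 0.0
    for _ in range(10_000):
        total += p.get_x()  # one call frame per read
    return total

def inlined():
    total = 0.0
    for _ in range(10_000):
        total += p.x        # the "inlined" version: a direct read
    return total

assert through_accessor() == inlined()
print("accessor:", timeit.timeit(through_accessor, number=100))
print("inlined: ", timeit.timeit(inlined, number=100))
```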
The reason why unboxed arithmetic and inlined function calls matter is that they allow for data locality, i.e., better control over cache and memory access patterns.
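The boxed-versus-unboxed half of that can be seen directly in the standard library: a Python list of floats stores one pointer per element, each pointing at a separately allocated float object, while `array('d')` stores the raw doubles contiguously. A small sketch:

```python
# Sketch of boxed vs. unboxed storage. A list of floats holds one
# pointer per element, each referencing a separately allocated float
# object; array('d') stores the raw 8-byte doubles contiguously.
import sys
from array import array

n = 100_000
boxed = [float(i) for i in range(n)]
unboxed = array("d", boxed)

# Per-element payload in the array: one machine double.
assert unboxed.itemsize == 8

# Versus: a pointer slot in the list, plus a whole float object
# behind it (typically 24 bytes each on CPython).
float_obj_size = sys.getsizeof(1.5)
per_boxed = sys.getsizeof(boxed) / n + float_obj_size

print(f"unboxed bytes/element: {unboxed.itemsize}")
print(f"boxed bytes/element:  ~{per_boxed:.0f} (pointer + object)")
```

The unboxed layout is both smaller and contiguous, which is exactly the data locality the comment above is describing.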
If "template JIT" means what I think it does, this is basically what you get from Cython[1] or, later, Nuitka[2], only they're "template AOT"s.
Sadly, it buys you precious little. CPython's bytecodes are "big", so very little time is spent between them relative to the time spent inside them. The motivation to do anything smarter here is lacking.
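You can see how "big" the bytecodes are with the `dis` module: an entire addition, including type dispatch, overflow handling, and result allocation, is a single instruction, so a template compiler that merely stitches the instruction bodies together saves almost nothing.

```python
# A whole Python addition compiles to a single BINARY_* instruction
# (BINARY_OP on 3.11+, BINARY_ADD earlier); all the real work happens
# inside that one opcode's C implementation.
import dis

def add(a, b):
    return a + b

dis.dis(add)

opnames = [instr.opname for instr in dis.get_instructions(add)]
assert any(name.startswith("BINARY") for name in opnames)
```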