
Was the industry ready for this concept of a computer having a number of meaningfully different kinds of cores? Has this happened before? Or did application developers just get cores as an integer count and that was it?


> Was the industry ready for this concept of a computer having a number of meaningfully different kinds of cores?

The industry didn’t have a choice. The market was demanding higher performance within the same thermal envelope and the same energy consumption. These are mobile devices; you can’t put a bigger heat sink on them and then crank up the power.

You can find tons of academic literature discussing the necessity of this development (along with many other things that have come to pass), how it would work, etc. in the decade leading up to the introduction. ARM didn’t just release it to the world and say “Surprise!”

We knew it was coming, we just didn’t do the best job of preparing for it.


Why not just run some of the high-performance cores at a much lower clock rate?

Allowing for just two different clock rates would have required much smaller silicon and software changes.

What was the argument for designing two different types of cores instead?


High performance cores use a lot of transistors and take up a lot of space. When you are aiming for high performance, this is a good tradeoff.

The efficiency cores are physically smaller, cheaper, etc. Once you reduce the performance expectations to the point where they satisfy the requirements, they are a much better choice. The efficiency core has more perf-per-watt than a clocked-down performance core.

It seems counterintuitive. But when your budget for transistors is X and your budget for power draw/heat dissipation is Y, and there is no leeway whatsoever, the big.LITTLE concept gets you more aggregate performance.
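
To make the arithmetic concrete, here is a toy calculation with completely made-up numbers (not measurements from any real chip), just to show why a little core can win on perf-per-watt even when a down-clocked big core matches its performance:

    /* Toy illustration only: the perf and power figures below are invented.
     * The point is that a big out-of-order core keeps paying for its wide
     * structures and leakage even at low clocks, so its power does not fall
     * as fast as its performance does. */
    #include <stdio.h>

    int main(void) {
        const char  *name[]  = {"big core at full clock",
                                "big core down-clocked",
                                "little core"};
        const double perf[]  = {1.00, 0.40, 0.40};   /* relative throughput */
        const double power[] = {1.00, 0.30, 0.15};   /* relative power draw */

        for (int i = 0; i < 3; i++)
            printf("%-24s perf/W = %.2f\n", name[i], perf[i] / power[i]);
        return 0;
    }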


Can you link an analysis for that, where small cores have more perf-per-watt than a clocked-down big core?

From what I understand, the transistors aren't mostly for enabling the higher clock speed at all; they allow for wider cores that do more per cycle. It doesn't seem clear at all that such a core would be less efficient than a narrower design that does less per cycle, if it were clocked much lower to yield the same final performance.


AMD’s approach is similar to this. The efficiency core is the same core with less cache and lower clocks. It’s not a different microarchitecture like in the ARM and Intel designs.


It's been around since 2011 on Android. Nothing new. https://en.wikipedia.org/wiki/ARM_big.LITTLE


The general term is asymmetric multiprocessing, which goes back a few more decades:

https://en.wikipedia.org/wiki/Asymmetric_multiprocessing

IIRC big.LITTLE implementations tended to have cores that didn't support the same instruction sets, meaning you couldn't migrate tasks between them if you needed to. Kind of like how laptops could switch between integrated and discrete GPUs, but some users would need to switch to the discrete GPU to use an external monitor even if they didn't want the power hit.

Also, "big.LITTLE" is a pretty strange brand name.


I think the term you're going for here is heterogeneous multi-processing/computing, not asymmetric multiprocessing:

https://en.wikipedia.org/wiki/ARM_big.LITTLE#Heterogeneous_m...

https://en.wikipedia.org/wiki/Heterogeneous_computing

The second article even lists big.LITTLE as a typical example. big.LITTLE itself never had different ISAs as far as I can tell, just scheduling caveats that lead to efficiency tradeoffs, like the first article mentions.


My understanding of "heterogeneous computing" is that it's more about splitting a task across a CPU and coprocessors, or writing the same code to target both. Asymmetric means there are multiple CPU cores but they're not equally performant.


big.LITTLE is typically cores with the same architectural features, just different performance due to different microarchitecture. A7/A15/A17, or A53/A57/A72, or A55/A76.

This let them run the same code, even the same system level code. The scheduler only had to optimize performance, without tasks being pinned to one core or another for correctness.


This was the case with Intel 12th gen: P cores support AVX512, E cores do not. Instead of adding OS-level support, Intel let you gain back AVX512 by disabling the E cores. Intel ended up disabling that feature in hardware later, though.
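
As a rough sketch of why that combination is painful (my own illustration, not Intel's guidance): the usual runtime feature check is per-core, so it can succeed on a P core and the approved code path can still fault once the thread migrates to an E core.

    /* Minimal sketch of runtime AVX-512 detection using a GCC/Clang builtin.
     * On a hybrid part where only the P cores implement AVX-512, this check
     * can pass while running on a P core, and the AVX-512 code path will
     * still fault if the OS later moves the thread to an E core. */
    #include <stdio.h>

    int main(void) {
        __builtin_cpu_init();
        if (__builtin_cpu_supports("avx512f")) {
            puts("AVX-512F reported: only safe if every core implements it");
        } else {
            puts("AVX-512F not reported: take the AVX2/SSE fallback path");
        }
        return 0;
    }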


This was honestly a pretty surprising mistake for a big player like Intel. Quite a big oversight in my opinion. The fact that the OS will migrate processes from big cores to little cores should not be a surprise, given that it is basically necessary for the power savings to be realized. That effectively necessitates having the same ISA on both the big and little cores. It's not like desktop class operating systems haven't been running on these types of cores for over a decade.


Not to be confused with Little Big - https://en.wikipedia.org/wiki/Little_Big_(band)


Not to be confused with LittleBigPlanet.


Yet nobody in the Windows/Mac world was ready (if you think they should have to care, which maybe they shouldn't).


Nobody in the Windows/Mac world writing software should have to care about the change, or even know about it. That's the beauty of an OS scheduler: it is optimized to execute your instructions where it sees fit, and higher-level programs don't need to know how it works, where it executes your instructions, or that it even exists. Sometimes abstraction and separation of concerns can be a beautiful thing. I'd argue that most programs shouldn't make any decisions about where to run; I don't want random engineers for some Mac app controlling my power consumption or setting their priority above anything else I run. The kernel developers, with their stream of health and performance data from the intimately connected hardware (on Apple), are the ones I want making the call on where to run. In the very few situations where E or P core assignment matters (games and VMs are all I can think of, and even that is arguable IMO), there has long been a way to inform the OS scheduler where to run.

Tangent: With today's insanely powerful hardware you should never be constrained in your programs to the point of having to consider setting core affinity, and if you go down that road you might want to reevaluate what you're doing, because you're probably doing something wrong and blaming it on the OS scheduler. Even constrained to the E cores (which are still plenty fast), your programs should perform well. I think more developers need to start writing software on slow machines on purpose, because too many apps are written on $4000+ machines with the newest chip and GPU and are never tested on the slower, more commonly used hardware, so they end up dog slow on those machines and get no attention. If more apps were built on slow hardware and were still fast, they'd be even faster on the $4k machines. The MacBook Air was great for this because it was fanless and every core was essentially an "E" core, forcing you to optimize the code you wrote for the selfish reason that running it shouldn't be annoying. Even if selfish, the net result was production code that was blazing fast once deployed on server-grade hardware.
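
For what it's worth, here's a minimal sketch of that "inform the scheduler" mechanism as I understand it on Apple platforms: you don't pin cores, you attach a QoS class to the work and let the kernel decide whether that means a P core or an E core.

    /* Sketch only: QoS classes are hints, not core assignments.
     * Compile as plain C with clang on macOS (blocks are enabled by default). */
    #include <dispatch/dispatch.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* Background QoS: a strong hint that latency doesn't matter, so the
         * scheduler is free to keep this on efficiency cores. */
        dispatch_async(dispatch_get_global_queue(QOS_CLASS_BACKGROUND, 0), ^{
            puts("low-priority maintenance work");
        });

        /* User-interactive QoS: a hint that this work is latency-sensitive. */
        dispatch_async(dispatch_get_global_queue(QOS_CLASS_USER_INTERACTIVE, 0), ^{
            puts("latency-sensitive work");
        });

        sleep(1); /* sketch only: give the async blocks a moment to run */
        return 0;
    }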


Abstractions leak. Many devs don't need to care, but anyone particularly concerned with performance or power consumption will want to know the details. (Just like many devs don't need to know about CPU cache, how GPUs work, etc.)


If instructions don't run the same between heterogeneous cores, i.e. one core supports instructions the other doesn't, then unless this information is available to the scheduler, there's no way it can make that decision. AVX512 was such an example, and sadly it ended up getting locked off from the P cores.


Something made the decision if it got "locked off from the P cores", what was it if not the OS scheduler?


What got locked off was the P-core's AVX512 instruction support, not some process's core affinity. Intel didn't have an E-core design ready to support AVX512 in Alder Lake. They initially allowed AVX512 to be enabled on the P-cores if all the E-cores were disabled at boot, but they later switched to fusing off AVX512 support permanently.


Interesting, but what does this have to do with what my original comment was about?


> With today's insanely powerful hardware you should never be constrained in your programs to the point of having to consider setting core affinity

That is wrong in an HPC environment.

Most programs tailored for HPC will set core affinity manually. And they have very good reason to do so (cache affinity, memory bindings).

The correct abstraction depends on your domain. There was indeed very little incentive up to now to play with core affinity and the scheduler in a classical desktop environment.

The story is different on systems with strong constraints on efficiency.
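
For readers outside HPC, the kind of manual pinning being described looks roughly like this on Linux (a minimal sketch; the API is Linux-specific and macOS doesn't expose an equivalent):

    /* Pin the calling thread to one logical CPU, e.g. to keep it next to the
     * memory and caches it is working on. Linux-only API. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(2, &set);                    /* logical CPU 2, chosen arbitrarily */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {  /* 0 = this thread */
            perror("sched_setaffinity");
            return 1;
        }
        puts("pinned to CPU 2");
        return 0;
    }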


I don't want you determining where my programs run on my computer, though. I want my operating system to do that: it has much more context about the hardware, a very smart scheduler, it honors my preferences for how to behave on AC versus battery power, and it takes into account the scheduling overrides I set via AppTamer or similar. As the developer of the application I'm running, you shouldn't mess with which cores it runs on; you should just make sure your program is able to be used across all of them, and that's where your reach should end. That's where my reach should end if I am writing a program you will run. The execution of the process is up to the user running it and the OS, not the person who wrote the program. Maybe that's not an opinion many people share with me, though.


Yep, but that has nothing to do with HPC. Those are supercomputing applications and you're not running those on your laptop or desktop. In HPC scenarios the developer knows more than the operating system.


Ah, I missed that we were talking about supercomputers. I thought we were talking about consumer gear.


Also applicable to smaller-scale HPC applications running on a single workstation, like an M2 Ultra.


On the Mac, developers have had the Grand Central Dispatch library since 2009. So if they were already using that, they were ready for the M1: https://en.wikipedia.org/wiki/Grand_Central_Dispatch


GCD was actually not designed for this scenario, and it's easy to mess things up with it.

It was designed for a large number of equal cores (aka SMP), meaning people did dispatch_async to the default concurrent queues all over the place, which is a bad pattern when you have to shrink down to phone size. Also, dispatch_semaphore has priority inversions and a lot of other features (like dispatch_group) are semaphores in disguise.

It does work if you use it carefully, but Swift concurrency is a different design for a reason.


According to Apple's developer documentation, GCD is currently designed for this scenario:

https://developer.apple.com/documentation/apple-silicon/tuni...

"On a Mac with Apple silicon, GCD takes into account the differences in core types, distributing tasks to the appropriate core type to get the needed performance or efficiency."


It is now, but that's mostly from evolution in the iOS era. But like I said, it's not perfect.


That's probably assuming you set the QoS correctly.


And dispatch_async loses QoS while dispatch_sync keeps it, but people often used async unnecessarily.
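
As I understand it (hedging here, since the exact propagation rules are subtle), the escape hatch if you do want an async block to carry the submitter's QoS is to wrap it explicitly:

    /* Sketch: dispatch_block_create with DISPATCH_BLOCK_ASSIGN_CURRENT snapshots
     * the submitting thread's QoS and context at creation time, so the async
     * block doesn't just take on the target queue's QoS. */
    #include <Block.h>
    #include <dispatch/dispatch.h>
    #include <stdio.h>

    int main(void) {
        dispatch_queue_t q = dispatch_get_global_queue(QOS_CLASS_DEFAULT, 0);

        /* Plain async: runs with the queue's QoS rather than the caller's. */
        dispatch_async(q, ^{ puts("plain async block"); });

        /* Wrapped async: carries the caller's QoS along with it. */
        dispatch_block_t b = dispatch_block_create(DISPATCH_BLOCK_ASSIGN_CURRENT, ^{
            puts("async block tagged with the submitter's QoS");
        });
        dispatch_async(q, b);
        dispatch_block_wait(b, DISPATCH_TIME_FOREVER);
        Block_release(b);
        return 0;
    }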


> dispatch_async loses QoS

Wait really? Where can I read more about this, this goes against what I would assume.

Edit: Or do you mean that with dispatch_async the block is run with the QoS of the target queue instead of the source? That is what I would normally expect; if you want to "inherit priority" then dispatch_sync would do that, at the expense of blocking.


Android likes to do things to check off boxes and in mediocre ways


Heterogeneous computing is as old as computer science itself. There are well-known algorithms for dealing with this.


OTOH I'd question whether we as an industry are up to using it effectively, given that we couldn't even handle symmetric multiprocessing very well :)


iPhones and iOS have had this for quite a while. My iOS dev knowledge is rather dated now, but IIRC Grand Central Dispatch let you indicate the type of workload your task needed and thus which core type it was typically scheduled on.


GCD first appeared in OS X 10.5 or 10.6, as I remember.


GCD first appeared in 10.6 Snow Leopard, which was marketed as a bug-fix-only release and is romanticised to this day because of that, but in reality it included major changes under the hood and wasn’t very stable in its first versions.


Game consoles had different kinds of cores/processors.


I mean, ever since very early on, the GPU has been a "separate CPU with different kinds of cores" compared to the main CPU.



