Hacker News | joha4270's comments

I'm sorry, I'm clearly missing something but why would page size impact L1 cache size?

When you do a cache lookup, some bits of the address form an "index" that selects a "bucket" (a set) in the cache. Once you've picked the bucket, you may need to walk a few entries in it, comparing each entry's "tag" against the address, to find the matching cache line. The number of entries you walk is the associativity of the cache, e.g. 8-way or 12-way associativity means there are 8 or 12 entries in that bucket. Higher associativity lets you grow the cache, but it also worsens latency, as you have to walk through the bucket. These are the two points you can trade off: do you want more total buckets, or do you want each bucket to have more entries?

To do this lookup in the first place, you pull a number of bits from the virtual/physical address you're looking up, which tells you what bucket to start at. The minimum page size determines how many bits you can use from these addresses to refer to unique buckets. If you don't have a lot of bits, then you can't count very high (6 bits = 2^6 = 64 buckets) -- so to increase the size of the cache, you need to instead increase the associativity, which makes latency worse. For L1 cache, you basically never want to make latency worse, so you are practically capped here.
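That bit-slicing can be sketched in a few lines of Python (a toy illustration of the arithmetic in the comment, not any real CPU's indexing):

```python
# VIPT index sketch: with 4 KB pages and 64-byte lines, only the
# untranslated page-offset bits are available for picking a bucket.
PAGE_SIZE = 4096   # low 12 bits of the address pass through paging untranslated
LINE_SIZE = 64     # low 6 bits select a byte within the cache line

INDEX_BITS = (PAGE_SIZE // LINE_SIZE).bit_length() - 1  # 12 - 6 = 6
NUM_BUCKETS = 1 << INDEX_BITS                           # 2^6 = 64

def bucket_of(addr: int) -> int:
    # Drop the byte-within-line bits, keep the next INDEX_BITS bits.
    return (addr >> 6) & (NUM_BUCKETS - 1)

print(INDEX_BITS, NUM_BUCKETS)  # 6 64
```

Note that `bucket_of(0)` and `bucket_of(4096)` land in the same bucket: an address one page apart has the same page offset, which is exactly why the index can't reach past the page-offset bits.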

Platforms like Apple Silicon instead set the minimum page size to 16k, so you get more bits to count buckets (8 bits = 256 buckets). Thus you can increase the size of the cache while keeping associativity low; L1 cache on Apple Silicon is something crazy like 192 KB, and L2 (for the same reasons) is 16 MB or more. x86 machines and software, for legacy reasons, are very much tied to a 4k page size, which puts something of a practical limit on the size of their downstream caches.

Look up "Virtually Indexed, Physically Tagged" (VIPT) caches for more info if you want it.


It’s not a hard limit, especially if you aren’t pushing the frequency wall like Intel. AMD used to use a 2-way 64 KB L1, Intel has an 8-way 64 KB L1i on Gracemont, and more to the point, high-end ARM Cortex has had 4-way 64 KB L1 caches since before they even supported 16 KB pages.

Yeah, I was more just trying to paint a broad picture. Nvidia in particular I think had fast and large-ish L1 on Tegra (X2?) despite being tied to 4k pages.

This is the most cursed part of modern CPU design, but the TLDR is that programs use virtual addresses while CPU caches are keyed (at least in part) on physical addresses, which means a cache lookup needs the virtual-to-physical translation. The problem is that for L1 cache, the latency requirement of 3-4 cycles is too strict to first do a TLB lookup and then an L1 cache lookup, so the L1 index can only use the address bits that are identical between physical and virtual addresses: the bits within a page. With a 4k page size, you only have 6 bits between the size of your cache line (64 bytes) and the size of your page, i.e. 64 buckets, which with an 8-way associative L1D gives you 64 buckets × 8 ways × 64 bytes/line = 32 KB of L1 cache. If you want to increase that while keeping the 4k page size, you need to up the associativity, but that has massive power draw and area costs, which is why L1D on x86 hasn't increased since Core 2 Duo in 2006.
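A quick sketch of that size limit as a formula (the function name is mine; the 4k/8-way and 16k/12-way numbers are the ones discussed in this thread):

```python
# Largest alias-free VIPT L1 for a given page size: the index can only
# use page-offset bits, so size = (page/line) buckets * ways * line bytes,
# which simplifies to page_size * ways.
def max_vipt_l1_bytes(page_size: int, line_size: int, ways: int) -> int:
    buckets = page_size // line_size
    return buckets * ways * line_size

print(max_vipt_l1_bytes(4096, 64, 8) // 1024)    # 32  KB: the x86 status quo
print(max_vipt_l1_bytes(16384, 64, 12) // 1024)  # 192 KB: with 16k pages
```

The simplification to `page_size * ways` makes the tradeoff explicit: once the page size is fixed, associativity is the only knob left for growing an aliasing-free VIPT L1.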

Can you not take some of those virtual bits and get more buckets that way? I am sure it will make things more complicated if nothing else by them possibly being mapped to the same physical page, but it doesn't sound like an impossible barrier. Maybe something terrible where a cache line keeps bouncing between different buckets in the rare case that does happen, but as long as you can keep the common case as fast...

Otoh L1 sizes haven't increased since my first processor, so those CPU designers probably know more than I do.


that will break if any page is mapped at two VAs, you'll end up with conflicting cache lines for the same page...

The L2 already keeps track of what lines are somewhere in L1's for managing coherency.

Divide the cache into "meta-caches" indexed by the virtual bits and treat them as separate from the L2's point of view. Duplicate the data and if somebody writes back invalidate all the other copies. The hardware already exists for doing this on any multicore system. Sure, you will end up duplicating data sometimes and it will actually be slower if you're actually writing to aliased locations. But is this happening often enough to be a problem compared to generally having a bigger cache?

It sounds to me like an engineering tradeoff that might or might not make sense, not a hard limit, which is what I think was being asserted. But as I also said, L1 sizes haven't increased in a while and smart people are working on it, so there is probably something I don't know.


this "divide" thing will add latency which you really do not want to add to L1 hits

Nice HN explanation! One hopes we will not be living with 4kb pages forever, and perhaps L1 performance will be one more reason.

I'd really hope we do live with 4k pages forever. Variable page sizes would make many remapping optimizations (e.g. contiguous ring buffers) much harder to do, so we would need more abstraction layers, and more abstraction layers will eat away all the performance gains while also making everything more fragile and harder to understand. Hardware people really love those "performance hacks" that make life more painful for the upper layers in exchange for a few 0.1%s of speed. You could also probably gain some speed by dropping byte access and saying the minimal addressable unit is now 32 bits. Please don't. If you need a larger L1 cache, just increase associativity.

The extra L1 cache from a 64k page is on its own a ~5-10% perf improvement (and it decreases power use by reducing the number of times you go out to L2).

Funny, most of what you described sums up the Alpha architecture. 8KB pages + huge pages and, initially, only word-addressable memory, no byte access.

(Of course, it only took a few years for this to be rectified with the byte-word extension, which became required by ~all "real software" that supported Alpha)

It's also one of the only architectures Windows NT supported that didn't have 4KB pages, along with Itanium. I've wondered how (or if?) it handled programs that expect 4KB pages, especially in the x86 translation subsystem.


Alas, no matter how the bits that make up my bank balance look, in practice it's still a single point of failure where I might simply lose access to my money if the right service is down. Cash has much better uptime stats, even if it can be inconvenient to carry around.

The guts of a LLM isn't something I'm well versed in, but

> to get the first N tokens sorted, only when the big model and small model diverge do you infer on the big model

suggests there is something I'm unaware of. If you compare the small and big model, don't you have to wait for the big model anyway and then what's the point? I assume I'm missing some detail here, but what?


Speculative decoding takes advantage of the fact that it's faster to validate that a big model would have produced a particular sequence of tokens than to generate that sequence of tokens from scratch, because validation can take more advantage of parallel processing. So the process is generate with small model -> validate with big model -> then generate with big model only if validation fails

More info:

* https://research.google/blog/looking-back-at-speculative-dec...

* https://pytorch.org/blog/hitchhikers-guide-speculative-decod...


See also speculative cascades which is a nice read and furthered my understanding of how it all works

https://research.google/blog/speculative-cascades-a-hybrid-a...


Verification is faster than generation: one forward pass can verify multiple tokens, versus one pass for every new token in generation.

I don't understand how it would work either, but it may be something similar to this: https://developers.openai.com/api/docs/guides/predicted-outp...

When you predict with the small model, the big model can verify those tokens more as a batch, closer in speed to processing input tokens, provided the predictions are good and the work doesn't have to be redone.

They are referring to a thing called "speculative decoding" I think.

More or less, yes. Of course, defects are not evenly distributed, so you get a lot of chips with different grades of brokenness. Normally the more broken chips get sold off as lower tier products. A six core CPU is probably an eight core with two broken cores.

Though in this case, it seems [1] that Cerebras just has so many small cores they can expect a fairly consistent level of broken cores and route around them

[1]: https://www.cerebras.ai/blog/100x-defect-tolerance-how-cereb...


Whatever our human laws and morality say about right and wrong and fault, the laws of physics usually judge the car the winner when it hits somebody.

Placing yourself somewhere pedestrians are not expected (a non-residential road), mostly hidden from oncoming traffic, for an extended period is putting yourself at undue risk.


You don't always have a choice about where you are momentarily and anybody turning a blind corner has an obligation to immediately reduce their speed (prior to turning the corner!) to where they can safely come to a stop without endangering others. That's drivers education 101. Right after 'don't text while driving', 'don't drink while driving' and 'slow down when there are pedestrians, bicycles and other fragile road users around'.


You'd be legally right, dead right.

I walk that road many times. I hug the side on the outside of the turn (there's no room on the inside) and so I can see (and be seen) from further away. I listen for cars coming. I watch for them. I am prepared to jump over the railing.

It's just common sense.

BTW, do you know that if you rear-end someone, it's your fault? I once was in heavy traffic, and the traffic in front of me stopped abruptly. I hit the brakes hard. I also glanced in the rearview mirror and realized the truck behind me was not going to stop in time. So I quickly pulled onto the shoulder. The truck hit the car in front of me.

But fortunately, that gave the truck a precious few more feet of stopping distance, and the collision with the car in front was minor.


Last year I was driving on an arterial, with a 35mph speed limit, that was a miles long downhill grade. There was a bike lane on the right. In it was a girl maybe 12, and on the back of her bike was another girl maybe 6. With the downhill, she was able to go about 20mph. Suddenly, she veers into the center of the car lane. Never looked over her shoulder. (Traffic was lining up behind me.) She then rides a bit on the stripe separating the bike lane from the car lane. Then back to center of the car lane. Then in the bike lane, then back to the car lane. Back and forth. She never looked over her shoulder. I never dared to pass her, even when she was in the bike lane.

OMG


All of Mr. Devereaux's work is wonderful, including the series you linked, but I think that one is overly focused on the household. I think his two part series on "Lonely Cities"[1][2] is a lot better at giving you a feeling for a city. It is both less in depth and in that one he spends half his time complaining about how Hollywood gets it wrong, so of course YMMV.

[1]: https://acoup.blog/2019/07/12/collections-the-lonely-city-pa...
[2]: https://acoup.blog/2019/07/19/the-lonely-city-part-ii-real-c...


While this PDF might be new, Akin's Laws of Spacecraft Design dates back to 2003.

https://web.archive.org/web/20031101212246/https://spacecraf...


Are there technical reasons to prefer GDScript over C#?

GDScript is undoubtedly better integrated in the engine, but I would have expected C# to compare more favorably in larger projects than in the game jam sized projects I have made.



I don't see how this article could possibly support the argument that C# is slower than GDScript

It compares several C# implementations of raycasts, never directly compares with GDScript, blames the C# performance on GDScript compatibility, and has a struck-out section advocating dropping GDScript to improve C# performance!

Meanwhile, Godot's official documentation[1] actually does explicitly compare C# and GDScript (unlike the article, which just blames GDScript for C#'s numbers), claiming that C# wins in raw compute while having higher overhead calling into the engine.

[1]: https://docs.godotengine.org/en/stable/about/faq.html#doc-fa...


My post could have been a bit longer. It seems to have been misunderstood.

I use GDScript because it’s currently the best supported language in Godot. Most of the ecosystem is GDScript. C# feels a bit bolted-on. (See: binding overhead) If the situation were reversed, I’d be using C#. That’s one technical reason to prefer GDScript. But you’re free to choose C# for any number of reasons, I’m just trying to answer the question.


At least in my case, I got curious about the strength of /u/dustbunny's denouncement of Godot+C#.

I would have put it as a matter of preference/right tool, with GDScript's tighter engine integration contrasted with C#'s stronger tooling and available ecosystem.

But with how it was phrased, it didn't sound like expressing a preference for GDScript+C++ over C# or C#++, it sounded like C# had some fatal flaw. And that of course makes me curious. Was it a slightly awkward phrasing, or does C# Godot have some serious footgun I'm unaware of?


Makes sense! I think dustbunny said it best: C# is “not worth the squeeze” specifically in Godot, and specifically if you’re going for performance. But maybe that’ll change soon, who knows. The engine is still improving at a good clip.


I'm unsure if you're making a joke that flies over my head, but no, Greenland is Danish, not Dutch.


A lot of people have made careers out of telling you that it's a failure, but while not everything about the F-35 is an unquestionable success, it has produced a "cheap" fighter jet that is more capable than all but a handful of other planes.

Definitely not a failure.

