Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Yes, I meant that 192GB of RAM even with the worst quantization would result in such a large model going deep into disk swap when it entirely runs out of RAM. At least 100GB worth, if MacOS will even allow that without freezing or crashing or OOM killing the process.
 help



You still have a core misunderstanding. Only one layer of weights is required in memory at a time. A forward pass can be over-simplified as a matrix multiplication of each layer, one at a time.

There is no swapping of working RAM. We're just talking about loading the weights read-only data into RAM on-demand for each layer. It is only as slow as your storage interface.


I don't think I have a core misunderstanding - I've seen the abysmal tps rate that results from being unable to load an entire model in actual system RAM (not swap space) at the same time. No matter how fast your NVME storage sequential read speed is.

Yes doing that would prevent destruction of an SSD through using disk space as swap RAM, but it will not be a good experience or usable at all. Note that the original post I was replying to referenced "swapping" which is generally meant to mean using system swap space as RAM.

The standard term for loading only portions of a model from disk as needed is memory mapping, not "swapping". https://www.google.com/search?client=firefox-b-d&q=llama-ser... , or same thing if you google "safetensors file memory mapping"

With a model of this large of a size, not being able to hold it in RAM? Even at worst quantization you'd be looking at 1tps or worse.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: