Yes, I meant that 192GB of RAM even with the worst quantization would result in such a large model going deep into disk swap when it entirely runs out of RAM. At least 100GB worth, if MacOS will even allow that without freezing or crashing or OOM killing the process.
You still have a core misunderstanding. Only one layer of weights is required in memory at a time. A forward pass can be over-simplified as a matrix multiplication of each layer, one at a time.
There is no swapping of working RAM. We're just talking about loading the weights read-only data into RAM on-demand for each layer. It is only as slow as your storage interface.
I don't think I have a core misunderstanding - I've seen the abysmal tps rate that results from being unable to load an entire model in actual system RAM (not swap space) at the same time. No matter how fast your NVME storage sequential read speed is.
Yes doing that would prevent destruction of an SSD through using disk space as swap RAM, but it will not be a good experience or usable at all. Note that the original post I was replying to referenced "swapping" which is generally meant to mean using system swap space as RAM.