We (disclosure: founder) do something similar at Trelent[1] but with an emphasis on security. Paid accounts can use OpenAI & Anthropic models, free ones just OpenAI. We have Claude 3.5 Sonnet live already. If you want to try it out lmk! Also totally respect building your own open-source version :)
So typically these providers only offer ZDR to "managed" customers, after a lengthy application process. For example, on Azure, "managed" means companies with >$1m (possibly more now) in annual spend. The providers don't want to spend time on this long application process with smaller companies, so we take some of that weight off their shoulders. They get the same revenue at the end of the day, so in many ways it pools smaller companies' LLM spend and sends it straight to their bottom line, and they still get to claim they're rolling out AI "responsibly".
Once one provider is cracked, the others fall as well, as these AI companies are all competing viciously for customers. Et voila, ZDR across multiple providers for the small(er) companies out there :)
That way, at inference time you get the speed of 36B params because you are only "using" 36B params at a time, but the next token might (and frequently does) need a different set of experts than the one before it. If that new set of experts is already loaded (i.e. you preloaded them into GPU VRAM with the full 132B params), there's no overhead, and you just keep running at 36B speed irrespective of which experts are active.
You could theoretically load in 36B at a time, but you would be severely bottlenecked by having to reload those 36B params, potentially for every new token! Even on top-of-the-line consumer GPUs that would slow you down to ~seconds per token instead of tokens per second :)
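A quick back-of-the-envelope makes the gap concrete. All bandwidth figures below are rough assumptions for illustration (not benchmarks), and the model sizes match the 36B-active / 132B-total example above:

```python
# Back-of-the-envelope: why swapping experts in every token is so slow.
# Bandwidth numbers are illustrative assumptions, not measurements.

ACTIVE_PARAMS = 36e9    # params touched per token (active experts)
BYTES_PER_PARAM = 2     # fp16/bf16 weights
PCIE_BW = 32e9          # ~PCIe 4.0 x16 host-to-GPU, bytes/sec (assumed)
VRAM_BW = 1000e9        # ~1 TB/s on-GPU memory bandwidth (assumed)

def token_time(bandwidth_bytes_per_sec):
    """Time to stream the active weights once per token (memory-bound)."""
    return ACTIVE_PARAMS * BYTES_PER_PARAM / bandwidth_bytes_per_sec

reload_time = token_time(PCIE_BW)    # experts re-copied over PCIe each token
resident_time = token_time(VRAM_BW)  # full 132B already resident in VRAM

print(f"reloading over PCIe: {reload_time:.2f} s/token")
print(f"resident in VRAM:   ~{1 / resident_time:.0f} tokens/s")
```

With these assumed numbers, reloading works out to roughly 2.25 seconds per token, while keeping everything resident is memory-bandwidth-bound at ~14 tokens/s, which is exactly the "seconds per token vs tokens per second" gap described above.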
I am the founder of Trelent. We’re building a useful agent for RPA, with a distinctly structured & hierarchical approach to working with LLMs. We have pilot customers in several industry verticals, have thousands of users, and are venture backed by multiple funds. Our goal is to ultimately enable the next billion software developers.
We’re very early (team of two) and are looking for a founding engineer who can wear several hats when necessary, but has a knack for UX. The focus right now is largely product-side, mainly on the front-end. Our stack is NextJS, Tailwind, and Node. Bonus points if you've worked with LLMs before - we strongly believe that they will change UX (but not through a chatbox alone).
Preference to North America, but fully remote and can hire anywhere for the right candidate. More details & application at https://jobs.trelent.com
Our website has a little more @ www.trelent.net. Working on a better way to showcase more examples. They'll be inherently biased as we're selecting them (even if we try to take a fair sample of good/bad results), so best would be for you to try it yourself haha
Totally get that. Working on a locally-hosted version for the enterprise use-case with this in mind. We are paid either in dollars or in data, and I'd rather be transparent about that than hide it deep in a Privacy Policy.
Please note: this does make use of a remote server, so use with caution. Working on a locally-hosted version for anyone with ~48GB of VRAM (and change) to spare.