The model architecture only uses and cites pre-2023 techniques from the GPT-2 and GPT-3 era. They probably tried deliberately to use the most bare-bones transformer architecture possible. Kudos to them for finding a clever way to play the open-weights game: hide any architectural advances used in their closed models, while still claiming moats in data quality and training techniques.
They hide many things, but here are some speculative observations:
- Their 'mini' models must be smaller than 20B.
- Does the bitter lesson once again strike recent ideas in open models?
- Some architectural ideas cannot be stripped away even if they wanted to, e.g., MoEs, mixed sparse attention, RoPE (sketched below), etc.
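Since RoPE comes up in the list above, here is a minimal NumPy sketch of rotary position embeddings for a single attention head. It's an illustrative toy, not the implementation from any particular model; the shapes and the base=10000 constant just follow the common convention.

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq_len, head_dim)."""
    seq_len, dim = x.shape
    assert dim % 2 == 0
    # One rotation frequency per pair of dimensions.
    freqs = base ** (-np.arange(0, dim, 2) / dim)      # (dim/2,)
    angles = np.outer(np.arange(seq_len), freqs)       # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                    # even/odd dimension pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                 # 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = rope(np.random.randn(16, 64))  # queries for a 16-token sequence
```

The key property is that the rotation angle depends only on position, so dot products between rotated queries and keys depend only on relative distance.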
We haven't tried that yet, but it's an interesting idea.
I'm the technical co-founder. Under me, we have three squads. Two of them are led by managers who were promoted internally from individual contributor roles. These managers still do some hands-on coding occasionally when needed, but they primarily focus on delegating tasks to their teams as much as possible. The third squad has a manager without coding responsibilities, whom we hired to test a new squad structure. It's been going really well—his squad is performing excellently, and I've been able to delegate some cross-functional tasks that used to fall on me, like collaborating with the sales team when clients have technical requirements for renewing their accounts or documenting releases and bug fixes to share within the company.
So in terms of the technology area of our company, our lower-level managers are performing quite well. We're implementing frameworks and routines that are really helping advance our company's managerial maturity. We're also working to apply these practices to other, less structured areas, so your proposal could be very timely.
It's not founder mode or manager mode. I think it is just about effective management.
When starting up, the founder needs to (1) know what should be done, (2) be able to do it themselves, and (3) do it and confirm it's done. When scaling up, the founder still needs to (1) know what should be done, (2') know who is able to do it, and (3') arrange for those people to do it and confirm it's done.
What is called manager mode is just a failure at (1), (2), or (3). And what is called founder mode is just the founder trying to remedy such failures by exerting themselves instead of fixing the structure, which is also not effective.
I agree with this. In my experience, if your "good people" are running your business into the ground, you just didn't hire good people. The challenge is that hiring "good people" is incredibly difficult. It's also fairly hard to let someone go once they're onboard. If the company is in the black, even if growth is far below where a founder or investor wants it to be, stakeholders are reluctant to make changes.
Baseball recruiters have the advantage that they can go watch a pitcher toss a few balls. For most roles in a company, you can't get that direct knowledge of someone's skills prior to hiring. After hiring, someone who is actually an expert needs to sit with the new hire and assess them critically.
As a founder, sit in on the first several meetings with the new sales guy. Did they come prepared, knowing who to talk to? What the budget likely was? What the pain points likely were? Did they hear what the client asked and respond accordingly? Or did they misunderstand the domain, need, request, etc.? Did they leave with notes and follow-up items? If you didn't come away impressed, release them and move on. After three to five meetings you'll have confidence about whether they're the right fit.
After that, let them do their job and move on to addressing the next challenge. Once someone has shown themselves to be a good hire, protect them.
Yes, for the most part. But also: trust, but verify.
Figure out how much time you can spend on re-verifying, and do it on a random-sampling basis. This will look different depending on the role, but it's whatever you need to do to verify, first-hand, that the job is still being done correctly.
This will likely be a very small percent of your time, but the key is that it needs to be non-zero.
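To make the sampling idea concrete, here is a hypothetical sketch: each week, pull a small random subset of completed work items and review them yourself. The ticket names and the 5% time budget are my assumptions, not a prescription.

```python
import random

def pick_audit_sample(completed_items, budget_fraction=0.05, rng=random):
    """Return a random subset of items to re-verify first-hand."""
    k = max(1, int(len(completed_items) * budget_fraction))  # non-zero by design
    return rng.sample(completed_items, k)

closed_tickets = [f"TICKET-{i}" for i in range(1, 201)]
for item in pick_audit_sample(closed_tickets):
    print("review first-hand:", item)
```

The `max(1, ...)` is the point of the comment above: the verification rate can be tiny, but it must never round down to zero.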
Isn't this just the technology cycle in IT? We saw PC software in the '90s, then web apps, then mobile apps, and now AI services. Except that AI is currently immature, so companies don't know exactly what to do with it yet. They are just conscious of previous cycles, so they overreact with massive layoffs to re-budget for whatever new threats and opportunities arise.
Despite their efforts, I suspect new companies will appear anyway and grow massively, as in previous cycles, and hiring will pick up again. So if you were laid off or are a new grad, now seems like a great time to start a company, or at least to learn new skills, rather than rushing to get rehired by existing companies.
Anyone know what caused the very big performance jump from Large1 to Large2 in just a few months?
Besides, parameter redundancy seems well evidenced: frontier models used to be 1.8T parameters, then 405B, and now 123B. If future frontier models shrank to <10B or even <1B, that would be a game changer.
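A quick back-of-the-envelope check of why those parameter counts matter for serving, counting weights only (ignoring KV cache and activations): memory is roughly parameters times bytes per weight.

```python
# Weights-only memory footprint for the three sizes cited above.
for name, params_b in [("1.8T", 1800), ("405B", 405), ("123B", 123)]:
    gb_fp16 = params_b * 2       # 2 bytes per weight at fp16/bf16
    gb_int4 = params_b * 0.5     # 0.5 bytes per weight at 4-bit quantization
    print(f"{name:>5}: ~{gb_fp16:,.0f} GB fp16, ~{gb_int4:,.0f} GB int4")
```

At 123B you're down to a handful of GPUs; a <10B model would fit on a single consumer card, which is why the shrinkage would be such a game changer.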
Counter-intuitively, larger models can be cheaper to train: to reach a given quality, a larger model needs fewer training tokens. Smaller models, however, are cheaper to serve, since every generated token costs compute proportional to model size. At first, everyone was focused on training, so models were much larger. Now that so many people use AI every day, companies spend more on training smaller models to save on serving.
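A rough way to see the trade-off, using the standard back-of-the-envelope approximations (training ≈ 6·N·D FLOPs for N parameters and D tokens; inference ≈ 2·N FLOPs per generated token). The token counts below are assumptions for illustration.

```python
def train_flops(n_params, n_tokens):
    # Standard approximation: ~6 FLOPs per parameter per training token.
    return 6 * n_params * n_tokens

def serve_flops(n_params, n_tokens):
    # Forward pass only: ~2 FLOPs per parameter per generated token.
    return 2 * n_params * n_tokens

train_tokens = 15e12       # assumed training-set size
served_tokens = 1e15       # assumed lifetime serving volume
for n in (405e9, 8e9):     # a large and a small model
    print(f"{n/1e9:.0f}B params: train {train_flops(n, train_tokens):.2e} FLOPs, "
          f"serve {serve_flops(n, served_tokens):.2e} FLOPs")
```

Once lifetime serving volume dwarfs the training set, serving dominates total compute, so a smaller model that matches quality wins overall even if it costs more tokens to train.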
When Zuck said spies can easily steal models, I wondered how much of that came from experience. I remember they struggled to train OPT not long ago.
On a more serious note, I don't really buy his arguments about safety. First, widespread AI does not reduce unintentional harm but increases it, because accident rates compound across deployments. Second, the chance of success for threat actors will increase, because of the asymmetric advantage of having access to all open information while hiding their own. But there is no reversing course at this point, so I'll enjoy it while it lasts; AGI will come sooner or later anyway.
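To make the "compounding" point concrete: if each independent deployment has a small per-use accident probability p, the chance of at least one accident across n uses is 1 - (1 - p)^n, which climbs toward 1 as deployment becomes widespread. The numbers here are purely illustrative assumptions.

```python
p = 1e-4                        # assumed per-deployment accident probability
for n in (1_000, 100_000, 10_000_000):
    print(f"n={n:>10,}: P(at least one accident) = {1 - (1 - p) ** n:.4f}")
```

Even a one-in-ten-thousand failure rate becomes a near-certainty of some accident at tens of millions of uses.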
I've seen such articles more and more recently. In the past, when people had a vague idea, they had to do research before writing. During this process, they often realized some flaws and thoroughly revised the idea or gave up writing. Nowadays, research can be bypassed with the help of eloquent LLMs, allowing any vague idea to turn into a write-up.
We know high-quality data helps, as evidenced by the Phi models. However, this alone can never eliminate hallucination, because a dataset can never be both consistent and complete. Moreover, hallucination is an inherent flaw of intelligence in general, if we think of intelligence as (lossy) compression.
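A toy way to see the compression argument: a smoothed bigram model trained on a tiny corpus assigns nonzero probability to word pairs it never observed, so it can fluently "state" things absent from its data. This is a deliberately simplistic stand-in for an LLM.

```python
from collections import Counter

corpus = "paris is the capital of france".split()
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = sorted(set(corpus))

def prob(w2, w1, alpha=0.1):
    """Add-alpha smoothed P(w2 | w1): nonzero even for unseen pairs."""
    num = bigrams[(w1, w2)] + alpha
    den = sum(bigrams[(w1, w)] for w in vocab) + alpha * len(vocab)
    return num / den

print(prob("france", "of"))  # seen continuation: high probability
print(prob("paris", "of"))   # never seen, still nonzero: the seed of a hallucination
```

The smoothing step is exactly the lossy-compression trade: it sacrifices exact recall of the data to generalize, and hallucination rides in with it.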