The model architecture only uses and cites pre-2023 techniques from the GPT-2 and GPT-3 era. They probably tried deliberately to use the most bare-bones transformer architecture possible. Kudos to them for finding a clever way to play the open-weights game: hide any architectural advances used in their closed models, while still claiming moats in data quality and training techniques.
They hide many things, but here are some speculative observations:
- Their 'mini' models must be smaller than 20B.
- Does the bitter lesson once again strike recent ideas in open models?
- Some architectural ideas cannot be stripped away even if they wanted to, e.g., MoEs, mixed sparse attention, RoPE (sketched below), etc.
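Since RoPE comes up in the list above, here is a minimal NumPy sketch of rotary position embeddings for a single attention head. It's an illustrative toy, not the implementation from any particular model; the shapes and the base=10000 constant just follow the common convention.

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq_len, head_dim)."""
    seq_len, dim = x.shape
    assert dim % 2 == 0
    # One rotation frequency per pair of dimensions.
    freqs = base ** (-np.arange(0, dim, 2) / dim)      # (dim/2,)
    angles = np.outer(np.arange(seq_len), freqs)       # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                    # even/odd dimension pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                 # 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = rope(np.random.randn(16, 64))  # queries for a 16-token sequence
```

The key property is that the rotation angle depends only on position, so dot products between rotated queries and keys depend only on relative distance.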
We haven't tried that yet, but it's an interesting idea.
I'm the technical co-founder. Under me, we have three squads. Two of them are led by managers who were promoted internally from individual contributor roles. These managers still do some hands-on coding occasionally when needed, but they primarily focus on delegating tasks to their teams as much as possible. The third squad has a manager without coding responsibilities, whom we hired to test a new squad structure. It's been going really well—his squad is performing excellently, and I've been able to delegate some cross-functional tasks that used to fall on me, like collaborating with the sales team when clients have technical requirements for renewing their accounts or documenting releases and bug fixes to share within the company.
So in terms of the technology area of our company, our lower-level managers are performing quite well. We're implementing frameworks and routines that are really helping advance our company's managerial maturity. We're also working to apply these practices to other, less structured areas, so your proposal could be very timely.
It's not founder mode or manager mode. I think it is just about effective management.
When starting up, the founder needs to (1) know what should be done, (2) be able to do it themselves, and (3) do it and confirm it's done. When scaling up, the founder still needs to (1) know what should be done, (2') know who is able to do it, and (3') arrange for those people to do it and confirm it's done.
What is called manager mode is just a failure at (1), (2), or (3). And what is called founder mode is just the founder trying to remedy such failures by exerting themselves instead of fixing the structure, which is also not effective.
I agree with this. In my experience, if your "good people" are running your business into the ground, you just didn't hire good people. The challenge is that hiring "good people" is incredibly difficult. It's also fairly hard to let someone go once they're onboard. If the company is in the black, even if growth is far below where a founder or investor wants it to be, stakeholders are reluctant to make changes.
Baseball recruiters have the advantage that they can go watch a pitcher toss a few balls. For most roles in a company, you can't get that direct knowledge of someone's skills prior to hiring. After hiring, someone who is actually an expert needs to sit with the new hire and assess them critically.
As a founder, sit in on the first several meetings with the new sales guy. Did they come prepared, knowing who to talk to? What the budget likely was? What the pain points likely were? Did they hear what the client asked and respond accordingly? Or did they misunderstand the domain, need, request, etc.? Did they leave with notes and follow-up items? If you didn't come away impressed, release them and move on. After three to five meetings you'll have confidence about whether they're the right fit.
After that, let them do their job and move on to addressing the next challenge. Once someone has shown themselves to be a good hire, protect them.
Yes, for the most part. But also: trust, but verify.
Figure out how much time you can spend on re-verifying, and do it on a random-sampling basis. This will look different depending on the role, but it's whatever you need to do to verify, first-hand, that the job is still being done correctly.
This will likely be a very small percent of your time, but the key is that it needs to be non-zero.
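To make the sampling idea concrete, here is a hypothetical sketch: each week, pull a small random subset of completed work items and review them yourself. The ticket names and the 5% time budget are my assumptions, not a prescription.

```python
import random

def pick_audit_sample(completed_items, budget_fraction=0.05, rng=random):
    """Return a random subset of items to re-verify first-hand."""
    k = max(1, int(len(completed_items) * budget_fraction))  # non-zero by design
    return rng.sample(completed_items, k)

closed_tickets = [f"TICKET-{i}" for i in range(1, 201)]
for item in pick_audit_sample(closed_tickets):
    print("review first-hand:", item)
```

The `max(1, ...)` is the point of the comment above: the verification rate can be tiny, but it must never round down to zero.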
Isn't this just the technology cycle in IT? We saw PC software in the '90s, then web apps, then mobile apps, and now AI services. Except that AI is currently immature, so companies don't know exactly what to do with it yet. They are just conscious of previous cycles, so they overreact with massive layoffs to re-budget for whatever new threats and opportunities arise.
Despite their efforts, I suspect new companies will appear anyway and grow massively, as in previous cycles, and hiring will pick up again. So if you were laid off or are a new grad, now seems like a great time to start a company, or at least to learn new skills, rather than rushing to get rehired by existing companies.
Anyone know what caused the very big performance jump from Large1 to Large2 in just a few months?
Besides, parameter redundancy seems well evidenced: frontier models used to be 1.8T parameters, then 405B, and now 123B. If future frontier models shrank to <10B or even <1B, that would be a game changer.
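A quick back-of-the-envelope check of why those parameter counts matter for serving, counting weights only (ignoring KV cache and activations): memory is roughly parameters times bytes per weight.

```python
# Weights-only memory footprint for the three sizes cited above.
for name, params_b in [("1.8T", 1800), ("405B", 405), ("123B", 123)]:
    gb_fp16 = params_b * 2       # 2 bytes per weight at fp16/bf16
    gb_int4 = params_b * 0.5     # 0.5 bytes per weight at 4-bit quantization
    print(f"{name:>5}: ~{gb_fp16:,.0f} GB fp16, ~{gb_int4:,.0f} GB int4")
```

At 123B you're down to a handful of GPUs; a <10B model would fit on a single consumer card, which is why the shrinkage would be such a game changer.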
Counter-intuitively, larger models can be cheaper to train: to reach a given quality, a larger model needs fewer training tokens. Smaller models, however, are cheaper to serve, since every generated token costs compute proportional to model size. At first, everyone was focused on training, so models were much larger. Now that so many people use AI every day, companies spend more on training smaller models to save on serving.
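A rough way to see the trade-off, using the standard back-of-the-envelope approximations (training ≈ 6·N·D FLOPs for N parameters and D tokens; inference ≈ 2·N FLOPs per generated token). The token counts below are assumptions for illustration.

```python
def train_flops(n_params, n_tokens):
    # Standard approximation: ~6 FLOPs per parameter per training token.
    return 6 * n_params * n_tokens

def serve_flops(n_params, n_tokens):
    # Forward pass only: ~2 FLOPs per parameter per generated token.
    return 2 * n_params * n_tokens

train_tokens = 15e12       # assumed training-set size
served_tokens = 1e15       # assumed lifetime serving volume
for n in (405e9, 8e9):     # a large and a small model
    print(f"{n/1e9:.0f}B params: train {train_flops(n, train_tokens):.2e} FLOPs, "
          f"serve {serve_flops(n, served_tokens):.2e} FLOPs")
```

Once lifetime serving volume dwarfs the training set, serving dominates total compute, so a smaller model that matches quality wins overall even if it costs more tokens to train.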
When Zuck said spies can easily steal models, I wondered how much of that came from experience. I remember they struggled to train OPT not long ago.
On a more serious note, I don't really buy his arguments about safety. First, widespread AI does not reduce unintentional harm but increases it, because accident rates compound across deployments. Second, the chance of success for threat actors will increase, because of the asymmetric advantage of having access to all open information while hiding their own. But there is no reversing course at this point, so I'll enjoy it while it lasts; AGI will come sooner or later anyway.
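To make the "compounding" point concrete: if each independent deployment has a small per-use accident probability p, the chance of at least one accident across n uses is 1 - (1 - p)^n, which climbs toward 1 as deployment becomes widespread. The numbers here are purely illustrative assumptions.

```python
p = 1e-4                        # assumed per-deployment accident probability
for n in (1_000, 100_000, 10_000_000):
    print(f"n={n:>10,}: P(at least one accident) = {1 - (1 - p) ** n:.4f}")
```

Even a one-in-ten-thousand failure rate becomes a near-certainty of some accident at tens of millions of uses.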
I've seen such articles more and more recently. In the past, when people had a vague idea, they had to do research before writing. During this process, they often realized some flaws and thoroughly revised the idea or gave up writing. Nowadays, research can be bypassed with the help of eloquent LLMs, allowing any vague idea to turn into a write-up.
We know high-quality data helps, as evidenced by the Phi models. However, this alone can never eliminate hallucination, because a dataset can never be both consistent and complete. Moreover, hallucination is an inherent flaw of intelligence in general, if we think of intelligence as (lossy) compression.
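A toy way to see the compression argument: a smoothed bigram model trained on a tiny corpus assigns nonzero probability to word pairs it never observed, so it can fluently "state" things absent from its data. This is a deliberately simplistic stand-in for an LLM.

```python
from collections import Counter

corpus = "paris is the capital of france".split()
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = sorted(set(corpus))

def prob(w2, w1, alpha=0.1):
    """Add-alpha smoothed P(w2 | w1): nonzero even for unseen pairs."""
    num = bigrams[(w1, w2)] + alpha
    den = sum(bigrams[(w1, w)] for w in vocab) + alpha * len(vocab)
    return num / den

print(prob("france", "of"))  # seen continuation: high probability
print(prob("paris", "of"))   # never seen, still nonzero: the seed of a hallucination
```

The smoothing step is exactly the lossy-compression trade: it sacrifices exact recall of the data to generalize, and hallucination rides in with it.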