Not common in Silicon Valley, but much more common in the rest of the country.
There’s an archetype for bootstrapped tech businesses:
- highly vertical-specific
- couple hundred million TAM
- founder started the business in their 30s and is now in their 40s
It’s a tensor stored in GPU memory to improve inference throughput. Check out the PagedAttention paper (which introduced vLLM) for how most systems implement it nowadays.
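The idea is simple enough to sketch in a few lines: cache each past token’s key/value projections so decoding step t only computes attention for the new token instead of re-running the whole prefix. This is a toy illustration in plain Python (real systems keep these as paged GPU tensors, per the vLLM paper); the class and names are made up for the example:

```python
import math

def attention(q, ks, vs):
    # Single-head attention of one query vector over all cached keys/values.
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in ks]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    return [sum(w * v[i] for w, v in zip(weights, vs)) for i in range(d)]

class KVCache:
    # Append-only: each decode step adds one (key, value) pair, so past
    # tokens' K/V projections are never recomputed.
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        self.keys.append(k)
        self.values.append(v)
        return attention(q, self.keys, self.values)

cache = KVCache()
out1 = cache.step([1.0, 0.0], [1.0, 0.0], [2.0, 0.0])  # 1 token cached
out2 = cache.step([1.0, 0.0], [0.0, 1.0], [0.0, 3.0])  # 2 tokens cached
```

The memory cost is what PagedAttention actually targets: the cache grows linearly with sequence length per request, and paging it lets the server pack many requests into GPU memory without fragmentation.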
I might be missing something, but DeepSeek’s recipe is right there in plain sight. Most of the cost efficiency of DeepSeek V3 seems to be attributable to MoE and FP8 training. DeepSeek R1’s improvements are from GRPO-based RL.
Interesting to note: we have no idea how much R1 cost to train.
To speculate - maybe DeepSeek’s release made an upcoming Llama release moot in comparison.
They slightly restructure their MoE [1], but I think the main difference is that other big models (e.g. Llama 405B) are dense and have higher FLOP requirements. MoE should represent a ~5x improvement; FP8 should be about a ~2x improvement.
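The source of the MoE saving is easy to see in code: a router picks the top-k of E expert FFNs per token, so only k expert forward passes run instead of one giant dense layer. A toy sketch (the gating and shapes are illustrative, not DeepSeek’s exact formulation):

```python
import math, random

random.seed(0)
E, K, D = 8, 2, 4  # total experts, active experts per token, hidden dim

# Toy "experts": each scales its input by a different constant, standing in
# for a full FFN. Real experts are large MLPs, which is why skipping E-K of
# them per token is the win.
experts = [lambda x, s=i + 1: [s * xi for xi in x] for i in range(E)]
gate_w = [[random.gauss(0, 1) for _ in range(E)] for _ in range(D)]

def moe_forward(x):
    # Router logits for every expert, but softmax and compute only over top-k.
    logits = [sum(x[d] * gate_w[d][e] for d in range(D)) for e in range(E)]
    topk = sorted(range(E), key=lambda e: logits[e], reverse=True)[:K]
    m = max(logits[e] for e in topk)
    exps = {e: math.exp(logits[e] - m) for e in topk}
    z = sum(exps.values())
    out = [0.0] * D
    for e in topk:  # only K expert forward passes, not E
        y = experts[e](x)
        out = [o + (exps[e] / z) * yi for o, yi in zip(out, y)]
    return out, topk

out, active = moe_forward([1.0, -0.5, 0.3, 0.2])
```

Per-token compute scales with the active parameters (k experts), while capacity scales with total parameters (all E), which is roughly where the cost multiple over a dense model of similar quality comes from.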
We don’t know how much of a speed improvement GRPO represents. They didn’t say how many GPU hours went into RL training for DeepSeek-R1, and we don’t have o1 numbers to compare against.
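For context on why GRPO is cheaper per step than PPO-style RLHF: per the DeepSeekMath paper, it drops the learned value/critic network and instead samples a group of completions per prompt, using the group-normalized reward as the advantage. The core computation is just this:

```python
import statistics

def group_advantages(rewards):
    # GRPO advantage: A_i = (r_i - mean(r)) / std(r), computed within the
    # group of completions sampled for one prompt. No critic network needed.
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mu) / sigma for r in rewards]

# Four sampled completions for one prompt; two passed a verifier, two failed.
advs = group_advantages([0.0, 1.0, 1.0, 0.0])
```

That removes the memory and compute of a second model the size of the policy, but it says nothing about how many rollouts R1 needed, which is the number we don’t have.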
There’s definitely lots of misinformation spreading, though. The $5.5M number refers to DeepSeek-V3, not DeepSeek-R1. I don't want to take away from High-Flyer's accomplishment, though. I think a lot of these innovations were forced by working around H800 networking limitations, and it's impressive what they've done.
It's interesting that having access only to less powerful hardware motivated (even necessitated) more efficient training -- like how tariffs can backfire if left in place too long.
LLMs are inherently bad at this due to tokenization, scaling, and lack of training on the task. Anthropic’s computer use feature has a specialized model for pixel-counting:
> Training Claude to count pixels accurately was critical. Without this skill, the model finds it difficult to give mouse commands. [1]
For a VLM trained on identifying bounding boxes, check out PaliGemma [2]
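If you do try PaliGemma for detection, its output encodes each box as four `<locNNNN>` tokens (values 0-1023, in y_min, x_min, y_max, x_max order, normalized to the image size). A hedged parser sketch -- double-check the model card for the exact format before relying on this:

```python
import re

LOC = re.compile(r"<loc(\d{4})>")

def parse_boxes(text, width, height):
    # Groups consecutive <locNNNN> values four at a time and rescales the
    # 0-1023 normalized coordinates to pixel coordinates.
    vals = [int(v) for v in LOC.findall(text)]
    boxes = []
    for i in range(0, len(vals) - 3, 4):
        ymin, xmin, ymax, xmax = vals[i:i + 4]
        boxes.append((
            xmin / 1024 * width, ymin / 1024 * height,
            xmax / 1024 * width, ymax / 1024 * height,
        ))
    return boxes

# Hypothetical model output for a "detect button" prompt on a 1024x768 image.
boxes = parse_boxes("<loc0256><loc0128><loc0512><loc0896> button", 1024, 768)
```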
You may also be able to get the computer use API to draw bounding boxes if the costs make sense.
That said, I think the correct solution is likely to use a non-VLM to draw bounding boxes. It depends on the dataset and problem.
PaliGemma on computer use data is absolutely not good. The difference between a FT YOLO model and a FT PaliGemma model is huge if generic bboxes are what you need. Microsoft's OmniParser also winds up using a YOLO backbone [1]. All of the browser use tools (like our friends at browser-use [2]) wind up trying to get a generic set of bboxes using the DOM and then applying generative models.
PaliGemma seems to fit into a completely different niche right now (VQA and Segmentation) that I don't really see having practical applications for computer use.
Conditionally, yes. There are many libraries that cannot be tree-shaken for various reasons. Libraries typically need to stick to a statically analyzable subset of JS (e.g. side-effect-free ES module exports) so bundlers can prove which code is unused.
GraphQL is very powerful when combined with Relay. It’s useless extra bloat if you just use it like REST.
The difference between the two technologies is that LangChain was developed and funded before anyone knew what to do with LLMs, while GraphQL was internal tooling built to solve a real problem at Meta.
In a lot of ways, LangChain is a poor abstraction because the layer it’s abstracting was (and still is) in its infancy.