
That seems really oddly specific. Why is an ostensibly universal system prompt going into the details of Python libraries and fonts?


Those details are part of the instructions on how to use the standard built-in tools, which the model is expected to reach for whenever it's appropriate for a response. Without information on what the tools are and how it's expected to use them, it can't do that reliably. As with anything else where precision matters, grounding in the context is far more effective at preventing errors than training alone, and if the model botches a tool call or simply forgets the tool exists, that's a big problem for doing its job.


Edge cases they couldn't tune out without generally damaging the model.


I'm naive on this topic, but I would think they would do something like detect what the question is about and then load a relevant prompt, instead of putting everything in like that?


> I'm naive on this topic, but I would think they would do something like detect what the question is about and then load a relevant prompt, instead of putting everything in like that?

So you think there should be a completely different AI model (or maybe the same model) with its own system prompt, one that receives the request, analyzes it, chooses a system prompt to use, and then runs the main model (which may be the same model) with the chosen prompt to produce the response, adding at least one round trip to every request?

You'd have to have a very effective prompt selection or generation prompt to make that worthwhile.
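
Concretely, a minimal sketch of that two-stage flow, just to make the cost visible. The prompt table, the selector instruction, and the model names below are made up for illustration; the extra latency is the first call having to finish before the second can start:

    from openai import OpenAI

    client = OpenAI()

    # Hypothetical library of specialized system prompts, keyed by topic.
    PROMPTS = {
        "python_tool": "You have a sandboxed Python tool; use it for math, data, and charts.",
        "web_search": "You have a web search tool; use it for current events.",
        "general": "You are a helpful assistant.",
    }

    def answer(user_message: str) -> str:
        # Round trip 1: ask a model which system prompt applies.
        routing = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model name
            messages=[
                {"role": "system",
                 "content": "Reply with exactly one of: " + ", ".join(PROMPTS)},
                {"role": "user", "content": user_message},
            ],
        )
        key = routing.choices[0].message.content.strip()

        # Round trip 2: answer with the selected system prompt.
        reply = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": PROMPTS.get(key, PROMPTS["general"])},
                {"role": "user", "content": user_message},
            ],
        )
        return reply.choices[0].message.content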


Not sure why you're emphasizing a round trip like these models aren't already taking a few seconds to respond. Not even sure that matters, since these all run in the same datacenter, or you can at least send requests somewhere close.

I'd probably reach for something like embeddings, though, to find relevant prompt info to include.
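
Something like this, as a rough sketch. The snippet library and the embedding model name are placeholders for illustration, not anything OpenAI actually does; the point is that retrieval replaces the extra LLM call:

    import numpy as np
    from openai import OpenAI

    client = OpenAI()

    # Hypothetical prompt fragments that could be pulled in on demand.
    SNIPPETS = [
        "When asked to chart data, use matplotlib for the plot.",
        "When asked about current events, call the web search tool.",
        "When asked to do arithmetic on user data, run it in Python.",
    ]

    def embed(texts):
        resp = client.embeddings.create(
            model="text-embedding-3-small",  # illustrative model name
            input=texts,
        )
        return np.array([d.embedding for d in resp.data])

    SNIPPET_VECS = embed(SNIPPETS)  # computed once, ahead of time

    def relevant_snippet(user_message: str) -> str:
        q = embed([user_message])[0]
        # Cosine similarity against the precomputed snippet embeddings.
        sims = SNIPPET_VECS @ q / (
            np.linalg.norm(SNIPPET_VECS, axis=1) * np.linalg.norm(q)
        )
        return SNIPPETS[int(np.argmax(sims))]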


> I'd probably reach for something like embeddings, though, to find relevant prompt info to include.

So tool selection, instead of depending only on the model's ability given the information in context, now depends on both the accuracy of a RAG-like context-stuffing step and the model then doing the right thing with whatever that step retrieved.

I can't imagine the number of input prompt tokens you save doing that would ever be worth the output-quality cost of reaching for a RAG-like workaround. The context window is large enough that you shouldn't hit the problems RAG-like workarounds mitigate very often anyway, and because the system prompt, long as it is, is tiny compared to the context window, there's only a very narrow band where shaving anything off it meaningfully relieves context pressure even when you do.

I can see something like that being a useful approach with a model with a smaller useful context window, in a toolchain doing a more narrowly scoped set of tasks, where the set of situations it needs to handle is more constrained, so identifying which functional bucket a request fits in and which prompt best suits it is easy, and where a smaller, focused prompt is a bigger win than it is for a big-window model like GPT-5.


I don't think making the prompt smaller is the only goal. Instead of having 1000 tokens of general prompt instructions, you could have 1000 tokens of specific prompt instructions.

There was also a paper that went by showing model performance drops when extra unrelated info is added to the context; that must be happening to some degree here too with a prompt like this.


Router models exist, and do something like what you describe. They run one model to make a routing decision, and then feed the request to a matching model, and return its result. They're not popular, because they add latency, cost, and variance/nondeterminism. This is all hearsay, mind you.
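
A minimal sketch of that routing pattern, assuming a hypothetical two-model setup; the categories and model names are placeholders, not any real router product:

    from openai import OpenAI

    client = OpenAI()

    # Placeholder routing table: request category -> model that handles it.
    MODELS = {"code": "gpt-4o", "chat": "gpt-4o-mini"}

    def route(user_message: str) -> str:
        # One model makes the routing decision...
        decision = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system",
                 "content": "Classify the request as 'code' or 'chat'. Reply with one word."},
                {"role": "user", "content": user_message},
            ],
        )
        category = decision.choices[0].message.content.strip().lower()

        # ...then the request is fed to the matching model and its result
        # returned. The extra call is where the added latency and cost come from.
        answer = client.chat.completions.create(
            model=MODELS.get(category, MODELS["chat"]),
            messages=[{"role": "user", "content": user_message}],
        )
        return answer.choices[0].message.content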


They are trying to create a useful tool, but they are also trying to beat the benchmarks. I'm sure they fine-tune the system prompt to score higher on the most well-known ones.


You're being facetious, but it's stochastic, and they've provided prompts that lead to a better response some higher percentage of the time.


I'm not being facetious. This is a legitimate, baffling disconnect.


Probably they ran a frequency analysis to find the most-used languages, and then focused on scoring high on those languages in any way they could, including prompt engineering or context engineering (whatever they're calling it right now).

Or they just chose Python because that's what most AI bros and ChatGPT users use nowadays. (No judging, I'm a heavy Python user.)


No, it's because that's what ChatGPT uses internally to calculate things, manipulate data, display graphs, etc. That's what its "python" tool is all about. The use cases usually have nothing to do with programming - the user is only interested in the end result, and doesn't know or care that it was generated using Python (although it is noted in the interface).

The LLM has to know how to use the tool in order to use it effectively. Hence the documentation in the prompt.
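
For a sense of what "documentation the model needs to call a tool" looks like in practice, here's a hedged sketch using the public chat-completions tools parameter. The "python" tool description below is invented for the example and isn't OpenAI's actual wording; in ChatGPT itself the equivalent instructions live in the system prompt text:

    from openai import OpenAI

    client = OpenAI()

    # Invented description of a code-execution tool, for illustration only.
    tools = [{
        "type": "function",
        "function": {
            "name": "python",
            "description": ("Execute Python in a stateful sandbox to calculate, "
                            "transform data, or render charts for the user."),
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {"type": "string",
                             "description": "Python source to run"},
                },
                "required": ["code"],
            },
        },
    }]

    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user",
                   "content": "Plot y = x**2 for x from 0 to 10."}],
        tools=tools,
    )
    # If the model decides the tool applies, the reply contains a tool call
    # with generated code rather than a plain text answer.
    print(resp.choices[0].message.tool_calls)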


Oops, I forgot about that. Still, having it in the system prompt seems fragile, but whatever, my bad.



