I’m convinced this is group hallucination. It must be so interesting to work at OpenAI, knowing you didn’t change a thing, and seeing that because of random chance, some small fraction of 100M users have all tricked each other that suddenly, something is different.
I think it's more likely that people are confused, and OpenAI is not making things any clearer either.
AFAIK, OpenAI has repeatedly stated that GPT-4 hasn't changed. People repeatedly state that when they use ChatGPT, they get a different experience today than before. Both can be true at the same time, as ChatGPT is a "packaged" experience of GPT-4; if you use the API versions, nothing has likely changed. But ChatGPT has certainly changed, for better or worse, as that's "just" integration work rather than fundamental changes to the model.
In the discussions on HN, people tend to talk past each other on this as well, saying things like "GPT-4 has for sure changed" when their only experience of GPT-4 is via ChatGPT, which has obviously changed since launch.
But ChatGPT != GPT4, which could always be made clearer.
It's a bit of both. The GPT-4 models have definitely been changing - there are multiple versions right now and you can try them out in the Playground. One of the biggest differences is that the latest model patches all of the GPT-4 jailbreak prompts; quite a big change if you were doing anything remotely spicy. But OA also says that it hasn't been changing the underlying model beyond that (that's probably the tweet you're thinking of), while people are still reporting big degradations in the ChatGPT interface, and those may be mistakes or changes in the rest of the infrastructure.
I was just getting started with ChatGPT Plus in mid-May. The exact date isn't clear to me, but I was within my first week of using GPT-4 via ChatGPT Plus to write some work Ansible code. On some day N (around May 16) it was amazing, and when I wasn't writing work stuff, I was brainstorming for my novel.
The next day, prompts that used to work suddenly gave much more generic results, the code was much more skinflinty, and it kept pulling a "no wait, I'm going to leave that long code as an exercise for you, human."
I didn't have time to buy into a hallucination; I wasn't involved in OpenAI chats to get "infected by hysteria" or whatever. I was just using the tool a ton, and there was a noticeable change on day N+1 that has persisted until now.
The fact that GPT-4 API calls appear to be similar tells me they changed their hidden meta prompt on the ChatGPT Plus website backend and are not admitting that they adjusted the meta prompt or other settings in the middleware between the JS webpage we users see and the actual GPT-4 models running.
I’d note they explicitly document that they rev GPT-4 every two weeks and provide fixed snapshots of the prior periods’ models for reference. One could reasonably benchmark the evolution of model performance and publish the results. But certainly you’re right: ChatGPT != GPT-4, and I would expect ChatGPT to perform worse than GPT-4, as it’s likely heavily constrained by the guidance, tunings, and whatever else they do to shape ChatGPT’s behavior. It might also very well be that, to scale and make revenue track costs, they’ve dumbed down ChatGPT Plus. I’ve found it increasingly less useful over time, but I sincerely feel it’s mostly because of the layers of sandbox protection they keep adding, constraining the model into non-optimal spaces. I do find that classical iterative prompt engineering still helps a great deal: give it a new identity aligned to the subject matter; insist on depth; insist on it checking its work and repeating itself; ask it if it’s sure about a response; periodically reinforce the context you want to boost the signal; etc.
Heh, this kind of reminds me of the process of enterprise support.
Working with the customer in dev: "Ok, run this SQL query and restart the service. Done, ok does the test case pass?" Done in 15 minutes.
Working with customer in production: "Ok, here is a 35 point checklist of what's needed to run the SQL query and restart the service. Have your compliance officer check it and get VP approval, then we'll run implementation testing and verification" --same query and restart now takes 6 hours.
> so if you use the API versions, nothing has likely changed
I doubt that. I don't recall them actually clearly and precisely saying they aren't changing the 'gpt-4' model - i.e. the model you're getting when specifying 'gpt-4' in an API call. That one direct tweet I recall, which I think you're referring to, could be read more narrowly as saying the pinned versions didn't change.
That is, if you issue calls against 'gpt-4-0314', then indeed nothing changed since its release. But with calls against 'gpt-4', anything goes.
This would be consistent with their documentation and overall deployment model: the whole reason behind the split between versioned (e.g. 'gpt-4-0314', 'gpt-4-0613') and unversioned models (e.g. 'gpt-4') was so that you could have both stable base and a changing tip. If that tweet is to be read as saying 'gpt-4' didn't change since release, then the whole thing with versioning is kind of redundant.
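To make the distinction concrete, here's a minimal sketch (Python; the helper name is mine, and the request shape follows the 2023-era Chat Completions API) of pinning a snapshot versus riding the tip:

```python
# Hypothetical helper contrasting pinned vs. floating model names.
# 'gpt-4-0314' is a frozen snapshot; bare 'gpt-4' tracks whatever is current.
def build_request(pin_snapshot: bool) -> dict:
    """Build the body you'd POST to /v1/chat/completions."""
    return {
        "model": "gpt-4-0314" if pin_snapshot else "gpt-4",
        "temperature": 0,  # cut run-to-run variance when comparing versions
        "messages": [{"role": "user", "content": "Summarize RFC 2119 in one line."}],
    }

pinned = build_request(pin_snapshot=True)
floating = build_request(pin_snapshot=False)
print(pinned["model"], floating["model"])  # gpt-4-0314 gpt-4
```

Benchmarking the same prompt against both names over time is the only way to see whether the unversioned tip has actually drifted.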
The -0613 version is really different! It added function calling to the API as a hint to the LLM, and in my experience if you don't use function calling it's significantly worse at code-like tasks, but if you do use it, it's roughly equivalent or better when it calls your function.
Seconded. In particular, how does function calling help restore performance in general prompts like: "Here's roughly what I'm trying to achieve: <bunch of requirements> Could you please write me such function/script/whatever?".
Maybe I lack the imagination, but what function should I give to the LLM? "insert(text: string)"?
For generating arbitrary code, I imagine you could do the same thing but swap `query_db` with the name `exec_javascript` or something similar based on your preferred language.
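As a sketch of what that could look like (the schema format is the June-2023 function-calling API's; the `exec_javascript` function itself is invented, and the reply is faked rather than fetched):

```python
import json

# Invented function schema, in the JSON-Schema shape that the `functions`
# parameter of the June 2023 chat completions API expects.
exec_javascript = {
    "name": "exec_javascript",
    "description": "Run a snippet of JavaScript and return its stdout.",
    "parameters": {
        "type": "object",
        "properties": {
            "code": {"type": "string", "description": "JavaScript source to run"},
        },
        "required": ["code"],
    },
}

# The model's reply carries a function_call whose `arguments` is a JSON string:
reply = {"name": "exec_javascript", "arguments": '{"code": "console.log(1 + 1)"}'}
args = json.loads(reply["arguments"])
print(args["code"])  # console.log(1 + 1)
```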
>But ChatGPT != GPT4, which could always be made clearer.
Isn't the thread about ChatGPT? I mean, it is helpful to know that they are not the same (I personally was not clear on this myself, so I, at least, benefited from your comment), but I think the thread is just about ChatGPT.
It’s definitely not. Our prompts that were generating JSON output went from around 95% valid JSON to about 10% overnight. The model just started inserting random commentary. We’ve reverted to the 0314 model and it’s working fine again.
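A drop like that is easy to measure. A minimal sketch of how such a valid-JSON rate can be computed (plain Python; the sample responses are invented):

```python
import json

def valid_json_rate(responses: list) -> float:
    """Fraction of raw model responses that parse as JSON with no extra commentary."""
    ok = 0
    for text in responses:
        try:
            json.loads(text)
            ok += 1
        except json.JSONDecodeError:
            pass
    return ok / len(responses)

# Invented examples: commentary wrapped around the payload breaks parsing.
samples = ['{"a": 1}', 'Sure! Here is your JSON:\n{"a": 1}', '[1, 2]', 'not json']
print(valid_json_rate(samples))  # 0.5
```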
I use ChatGPT (GPT4) to build scaffolding for python one-off scripts, and over the past 3-4 days I'm getting nonsense. Not python-looking nonsense, but markdown, weird quotes, random text, etc. Same prompts.
I had an API integration written to convert an English-language security rule into an XML object designed to instruct a remote machine how to comply with the rule programmatically. In April 2023 we had about an 86% accept rate; that number has declined to 31% with no changes to the prompt.
This is the kind of info I've been looking for. I ran some informal experiments which asked ChatGPT to mark essays along various criteria and analyzed how consistent the marking was. This was several months ago; GPT-4 performed quite well, but the data wasn't kept (it was just an ad-hoc application test written in Jupyter notebooks).
I'm certain it's now doing significantly worse on the same tests, but alas I have lost the historical data to prove it.
Have you tried using the recently released function calling API? That’s reliable at returning JSON in my experience, although I’ve just tinkered with it, not used it for anything “real.”
My guess is that the degradation of JSON capability happened recently? The gpt-4 API switched over to gpt-4-0613 (the function calling version) on June 27. And given the performance increase for ChatGPT Plus at the end of May, my guess is they started testing the new model (which is much faster) on web users around then. In my testing [1], the new version is:
a. Worse at general code-like tasks without using functions
b. Equivalent or better at code-like tasks if you use the function API
c. Much faster than the older model either way.
I'd guess it's cheaper to run, too, and that they use the presence of a function in the API signature to weight their mixture of experts differently (and cull some experts?). The degradation in general purpose coding tasks is pretty obvious and repeatable (try the same prompts in the Playground with the -0314 model vs the -0613!), but it does seem like you can regain that lost capability with the new function call API, and it's faster. The tradeoff is that you only regain the capability when it calls functions; you can't really have a mix of prose-and-code in the same response as easily, or at least not with the same quality.
I remember the first time I played Minecraft and I was in awe at how expansive the play world felt. Without thinking too much about it, I had the feeling that if I set off in any direction I would discover infinitely new things. After enough playtime I saw the repeating patterns and eventually it felt so small again.
People will always see what they want to see. I've had so many interactions with customers over the years who thought that a service or feature was removed or crippled when in fact nothing had changed on our side. The only thing that changed is their perception. Especially when they can't get something to work and they believe they have succeeded at something similar before, they'll always suspect that the software is at fault instead of their own memory.
This is an excellent theory IMO. It isn't that the AI has actually gotten much worse, it's that the novelty has worn off and they are finally starting to notice all of the repetitious patterns it has and mistakes it makes — the stuff people like me who never bought into the AI hype to begin with noticed from the start — but instead of realizing that maybe their initial impressions of the capabilities of large language models were wrong or based on partial information they are taking the Mandela effect route and just insisting that something outside them has fundamentally changed.
Pretty sure this is going on to some degree. It seems like there should be some kind of regression testing possible on these systems to definitively prove these claims, rather than these anecdotal stories that seem to rarely ever come with concrete examples.
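A bare-bones version of such a regression harness only needs timestamped transcripts. A sketch (Python; the record format is my own) of what would turn these anecdotes into diffs:

```python
import hashlib
import time

def record(prompt: str, response: str, store: list) -> None:
    """Archive a timestamped transcript so later runs can be compared concretely."""
    store.append({
        "ts": time.time(),
        "prompt": prompt,
        "response": response,
        "sha256": hashlib.sha256(response.encode()).hexdigest(),
    })

def changed_prompts(old_run: list, new_run: list) -> list:
    """Prompts whose response differs between two archived runs (exact match)."""
    old = {r["prompt"]: r["sha256"] for r in old_run}
    return [r["prompt"] for r in new_run
            if r["prompt"] in old and r["sha256"] != old[r["prompt"]]]
```

Exact hashes are too strict for sampled output; in practice you'd run at temperature 0 or score answers instead. But even this much would turn "it feels worse" into something checkable.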
Since 30 June, the API responses are making common English misspelling errors, of the type where two words sound the same with different meanings such as break and brake.
I saw this happen zero times in the prior GPT-4 model, and multiple times this July, on multiple conversation topics and multiple word pairs.
Curiously, they're behaving as misspellings rather than mismeanings, since the sentence continuation is as if the correct meaning had been used.
I acknowledge this could be a blend of pareidolia and the Baader-Meinhof phenomenon.
I don't agree. As someone who has written many jailbreak prompts, the very fact that earlier jailbreak prompts no longer work indicates to me that the integration has changed. The model might be the same, but filtering the input extensively might cause undefined behavior.
We will never know for sure; it is equally likely they made some cost-saving changes which caused a reduction in quality. I certainly noticed that too, but we have no way to prove it, and any proof can always be dismissed easily. For example, I see it generate code with hallucinated variables quite often now, which never happened before to that degree, but I'm just as easily written off as part of the group hallucination. Anecdotal evidence is useless.
We also can never independently evaluate it, OpenAI could cache messages, fine tune on public test sets, etc. etc.
This is a common tactic observed with toxic personality disorders. People will repeatedly ask for examples knowing they can dispute any example given because the topic is subjective. You can spot the loop in these threads. An army of people comment/flag asking for examples regardless of how many are given in the thread. When you provide them examples they nitpick and call your prompting bad. Not saying it's bots but there's a pattern with these "OpenAI nerfed" threads across social media right now.
> We will never know for sure, it is equally likely they did some cost savings which caused a reduction in quality.
That is not remotely equally likely, and it would be completely unprecedented at the frontier of an emerging technology that people are pumping money, and the future of the world, into to win.
No matter how much money they pump into it there is a finite number of GPUs. Money can’t make GPUs appear out of thin air. If they’re faced with the choice of lowering quality slightly or turning away new customers, it shouldn’t be surprising if they choose lowering quality.
That sounds very naive to me. You think the "future of the world" matters to corporations making the business decision to save money and increase profit short-term? That idea is so alien to me we might as well live on different planets.
The part that is unprecedented would be giving up an edge in a battle that will win you the world if you win the battle, by saving a few bucks. At the highest level (think OpenAI/Microsoft, Google) money is not going to be the lynchpin for a long, long time. This thing is too close to "forever good enough" at way too many things to lose your edge by being too clever by half.
It's gotten worse. It's at the very least been quantized resulting in much lower overall precision. That's how they have been able to speed up inference so much.
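Whether or not that's what actually happened (it's speculation; OpenAI hasn't said), quantization really does trade precision for speed. A toy sketch of symmetric uniform quantization and the error it introduces:

```python
def quantize_roundtrip(weights: list, bits: int = 8) -> list:
    """Symmetric uniform quantization: scale floats to ints and back."""
    qmax = 2 ** (bits - 1) - 1          # 127 for int8
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]

w = [0.3, -1.0, 0.7213]
w8 = quantize_roundtrip(w, bits=8)
w4 = quantize_roundtrip(w, bits=4)
# Per-weight error is bounded by half the quantization step,
# so fewer bits means a coarser grid and larger error.
print(max(abs(a - b) for a, b in zip(w, w8)))  # small at 8 bits
print(max(abs(a - b) for a, b in zip(w, w4)))  # much larger at 4 bits
```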
I have used the same prompts to design an infinite scroll-up/down image container.
Two months ago, when it first came out, ChatGPT was able to generate working code that used the IntersectionObserver API.
Now when I use the same prompts it only generates high-level suggestions, and when I ask for code it suggests using DOM scroll events and doesn't even come up with the IntersectionObserver API unless I specifically ask. And if I do, it then generates incorrect code.
It even previously correctly memoized certain functions and included performance optimizations.
I’m convinced as well. There are plenty of folks using it for production use cases who regularly run evaluations, including myself. No evidence that it has been nerfed.
It’s just not as robust and general as it felt when using it for the first time.
Folks using it in production are using the API, where you can explicitly select a model. In the post, the poster is using ChatGPT, through the web app, where you can't select exact model and they sometimes do updates.
The UI also has a system prompt you don’t control which could be changed without model changes, and may have other differences from using the API directly (and those differences may also change over time.)
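A sketch of that difference (Python; the hidden system text is invented here, since the real one isn't public):

```python
def build_messages(user_prompt: str, system=None) -> list:
    """API callers choose the system message; the ChatGPT UI prepends its own."""
    messages = []
    if system is not None:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": user_prompt})
    return messages

api_call = build_messages("Refactor this function.")
ui_call = build_messages("Refactor this function.",
                         system="You are ChatGPT... (hidden, can change any time)")
print(len(api_call), len(ui_call))  # 1 2
```

Two users can therefore send the same prompt and get different behavior purely because one path carries an extra, invisible message.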
My suspicion is that we're collectively becoming accustomed to ChatGPT failures. These failures cause problems, and become more annoying with time. The same thing happened with voice assistants.
That being said, the safety filters have definitively changed in OpenAI. ChatGPT is definitely more prone to reminding me that it is an LLM, and it refuses to participate in pretend play which it perceives as violating its safety filters. As a trivial example, ChatGPT is less willing to generate test cases for security vulnerabilities now - or engage in speculative mathematical discussions. Instead it will simply state that it is an advanced LLM blah blah blah.
I started using it relatively late, but earlier in May, you could have given it a DOI link, and it would have summarized it for you. Now, it argues that it's not a database and that it can only summarize it if you provide the full text. However, if you ask for it with the title of the paper, it will provide you with a summary.
You could have also asked it to search patents on some topic, and it would have given you a list of links. Now, it provides instructions on how to find it yourself.
yes! I was using GPT-4 as a citation engine for a bit by pasting in text and requesting related citations. The accuracy rate of 3/4 was good enough that it was still saving me hours reading irrelevant material, particularly as validating the non-existence of 25% of citations was a trivial activity.
Slowly but surely, the comment gaslighting all of the people reporting the issue, makes its way to the top, while other comments with genuine discussion are flagged and slip lower. Seen this before...
I hate to say it on HN but I see it too and it gets my conspiracy gears cranking a bit.
My theory is that the initial ChatGPT offering (3.5/4/whatever) was "too hot" for the likes of certain incumbents. In my experience, the capabilities at launch were incredible and clearly a threat for a wide range of F500 software firms. I had phone calls with people I haven't talked to in over a decade about what I was seeing. I am not seeing those things today. This was mere months ago. This is not nostalgia.
Indeed. I have a sinking feeling they realized (or were otherwise convinced) those models are too disruptive to existing businesses and whole market segments, in particular (but not limited to) when it comes to writing code. Or at least that's where it's most obvious to me just how many different classes of companies could grow and capture value[0] that GPT-4 has been providing, pay-as-you-go, for a dozen cents per use. But the same must be true in many other industries.
Come to think of it, it must be the case, because the alternative would be pretty much every player on the market taking the hit and carrying on, or pretending they don't see the untapped value source that just freely flows out of OpenAI for anyone to enjoy, for a modest fee.
As a prime example, I'd point out Microsoft and their various copilots - the code one, the Office 365 one, the Windows system-wide one, in varying stages of development. API access to GPT-4 as good as it originally was[1], directly devalues all of those.
It stands to reason that slowly making the model dumber, while also making it faster and cheaper to use, is the best way for OpenAI to safeguard big players' markets - the "faster" and "cheaper" give perfect cover, while the overall effect is salting the entire space of possibilities - making the model good enough to entertain the crowd, but just not good enough to build solutions on top, not unless you're working for one of the players with special deals.
TL;DR: too many entities with money were unhappy about all the value OpenAI was giving to the world for peanuts, so the model is being gradually nerfed in a way that allows that value to be captured, controlled, and doled out for a hefty price.
(And if that turns out to be true, I'm going to be really pissed. I guess it's in the style of humanity to slow down pace of development not because of ideology, not because of potential risks, but because it's growing too fast to fully monetize.)
--
[0] - I mean that in the most nasty, parasitic sense possible.
[1] - I'm talking about the public release. That GPT-4 version seems to have already been weakened compared to pre-"safety tuning" GPT-4 (see the TikZ Unicorn benchmark story), but we can't really talk about what we never got to play with.
I've smelt the sweet scent of anticompetitive back-room dealing around OpenAI ever since they and Microsoft started forcing people to apply for access to the APIs, including telling them what use case they were going to use it for.
It just seemed obvious that if anyone suggested a use case that was actually really high value MS would just take the idea, run with it for a month or two to see if it has legs, and then steal it if it actually worked.
All while you're waiting in the queue to have your idea validated as "safe".
Meanwhile Sam Altman was on a worldwide press tour repeatedly saying that their mission is to “democratise” AI. They’re actually doing the exact opposite: gatekeeping, building moats, and seeking legislation to entrench a monopoly position.
Can you provide some evidence to back that up? Especially because OpenAI _has_ been tinkering with ChatGPT - by trying to limit jailbreaks.
People have a strong prior that these kinds of changes will reduce model performance (because you're limiting your model), so the burden is on you to show that performance hasn't degraded.
Seriously... In that 134 replies thread, 0 transcripts showing actual performance degradation. Just endless "Yes, it seems bla bla." No evidence but just shapes in the clouds.
I don't have a transcript, but when GPT-4 was initially released I tried passing it a riddle encoded with a Caesar cipher and then base64-encoded. I gave it the prompt "This is a riddle that is encoded in some way, solve it" and it managed to do so.
Now it can't even do just the Caesar cipher without hallucinating, nor can it even do pure base64 decoding without hallucinating.
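For anyone who wants to reproduce that test, here's how such an encoded riddle can be built and graded (plain Python; the riddle text and shift are my own choices):

```python
import base64

def caesar(text: str, shift: int) -> str:
    """Shift letters by `shift` positions, leaving spaces and punctuation alone."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

riddle = "What has keys but opens no locks?"
prompt_payload = base64.b64encode(caesar(riddle, 3).encode()).decode()

# Grading the model's answer is just the inverse transform:
recovered = caesar(base64.b64decode(prompt_payload).decode(), -3)
print(recovered == riddle)  # True
```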
Are you talking about GPT-4 or ChatGPT GPT-4? The GPT-4 model hasn’t changed, and that was confirmed by developers at OpenAI a while back IIRC. But, ChatGPT is always undergoing changes. I assume they have a layer or two on top of the model that is being trained with reinforcement learning.
I am convinced they have limited inference time to save GPU compute a month or two after Bing did the same. Perhaps I am part of that group hallucination.
You are more right than you think. The initial illusions about AI and its usefulness are wearing off; people are realising that this chatbot is not much more than entertainment.
They've definitely changed something about the models, and it is in their interests to do so, both to create a low-latency experience, but most importantly, to save money.
While GPT-4 is still workable, GPT-3.5 flatly refuses requests these days, claiming that as an "AI language model" it couldn't help me write code.
Usually trying to regenerate a response works fine in these cases. However, claiming that the RLHF and subsequent fine tuning isn't having any effect is a bit dishonest on the part of OpenAI.
Hilariously when I asked GPT-4 who I was, it said I’d previously worked at OpenAI. I thought of trying to apply there and saying “well, your own model thought I did…”
I think it's because when you first use it, you're surprised to the upside about how capable it is and you don't care about small faults because you expected to correct for those anyway.
Then you get used to this new level of capability and subconsciously weight the errors more.
For all the talk, I see very few people sharing direct chat links that are the same query at different points in time with different quality of answer.
In fact, when I do similar things, I don't notice a change in quality.
This is just fascinating, isn't it? I have competing thoughts in my head:
1. It's software that offers non-deterministic output and as such is fiendishly difficult to write realistic end-to-end tests for. Of course it's experiencing regressions. Heisenbugs are the hardest bugs to catch and fix, but having millions of users will reliably uncover them. And for an LLM, almost every bug is a Heisenbug! What if OpenAI improved GPT4 on one metric and this "nerfed" it on some other more important metric? That's just a classic regression. And what would robust, realistic end-to-end tests even look like for GPT4?
2. It's software that presents itself as a human on the Internet—even worse, a human representing an institution. Of course nobody trusts it. Everyone is extremely mistrustful of the intents and motivations of other humans on the Internet, especially if those humans represent an organization. I co-ran the tiny activist nonprofit Fight for the Future for years, and it was really amazing how common it was for comments in online spaces to assume the worst intentions; I learned to expect it and react extremely patiently. Imagine what it's like for OpenAI, building a product that has become central to peoples' workflows. Of course people are paranoid and think they're the devil, and are able to hallucinate all manner of offense and model it with every paranoid theory imaginable. The funny thing is, the more successful GPT4 is at seeming human, the less some people will trust it, because they don't trust humans! And the smarter and more successful it gets, the less some people will trust it! (How much do most people trust smart, successful public figures?)
3. Maybe an overall improvement for most users (one that the data would strongly suggest is a valid change and that would pass all tests) is a regression for some smaller set of users that aren't expressed in the tests. There might be some pairs of objectives that still present genuinely zero-sum tradeoffs given the size of the model and how it's built. What then? The usefulness of GPT4 is specifically that it is general purpose, i.e. that the massive cost of training it can be amortized across tons of different use cases. But intuitively there must be limits to this, where optimization for some cases comes at a cost to others, beyond the oft-cited of Bowdlerization. Maybe an LLM is just yet another case in the real world where sharing an important resource with lots of people is a hard problem.
If I were at OpenAI, I would want some third party running a community-submitted end-to-end test suite on each new release, with accounts that were secret to OpenAI and from unknown IP addresses—via Tor Snowflake bridges or something.
It's so tempting when running into user-reported Heisenbugs to trick oneself into ignoring users and not accepting that you've shipped a real regression. In addition to wanting the world to know, I would want to know.
But there's a real question of what these community-curated tests would even be, since they'd have to be automated but objective enough to matter. Maybe GPT4 answers could be rated by an open source LLM run by a trusted entity, set to temperature: 0? Or maybe some tests could have unambiguous single-string answers, without optimizing for something unrealistic? And the tests would have to be secret or OpenAI could just finetune to the tests. It's tricky, right?
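The unambiguous-single-string variant is at least easy to sketch (Python; the questions and the normalization rule are placeholders):

```python
def grade(run: dict, expected: dict) -> float:
    """Score one run against unambiguous single-string answers.

    Normalizes case and whitespace so trivial formatting differences don't count.
    """
    norm = lambda s: " ".join(s.lower().split())
    hits = sum(norm(run.get(q, "")) == norm(a) for q, a in expected.items())
    return hits / len(expected)

expected = {"Capital of France?": "Paris", "Chemical symbol for gold?": "Au"}
run = {"Capital of France?": "  paris ", "Chemical symbol for gold?": "Gold (Au)"}
print(grade(run, expected))  # 0.5
```

Keeping the question set secret from OpenAI is the hard part; the scoring itself is the easy part.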
What if something is different but nothing has changed in the model? Sampling from these models is non-deterministic: the response to the same prompt may vary, and can be controlled somewhat by the temperature setting. Something could have gone wrong there.
Aren't they using RLHF? The feedback from humans might not always be the *right* feedback. Couldn't that possibly degrade the quality of its responses?
But they do change things in the ChatGPT web app. You can't choose the exact model there, just 3/4 and from time to time they update the models they use.
That is the fundamental nature of intelligence. Like beauty, it only exists in our minds. When you stare at something long enough, you can convince yourself it isn't actually beautiful because it is just a bunch of brush strokes.
I have yet to see any real data on this phenomenon outside of anecdotal stories, so I'm also in the same boat re: group hallucination. Would be interested in seeing some more substantial evidence.
At least for some people, it seems to be a (unconscious) way to save face after being so ridiculous with the hype and predictions and "these will replace doctors and lawyers" when this was all first trending.
The quality seriously degraded overnight a few months ago... it was quite abrupt and obvious to those who use it regularly.
It's not surprising, they were probably running the model at a huge and ultimately unacceptable loss. But they should really offer a higher paid tier to access the previous capabilities... not drop them entirely. Many would pay far more than $20/month to access a marginally but meaningfully better model.
EDIT:
Many being dismissive of LLMs don't even seem to use them. Providers are vastly overvalued from an investment perspective, but the utility is very real. To say the loss in capability is just an "illusion" is clearly wrong to anybody who actually uses it.