
This feels like an extremely weak defense to me, because if it works it is essentially a copyright backdoor.

1. There is little doubt that OpenAI has NYT articles in its dataset.

2. They claim it's OK because all of the IP is mixed together and effectively laundered.

3. But in fact if you put in a secret code you can get the IP back nearly verbatim.

4. So if OpenAI wins this case we could just trade prompts that regurgitate the articles back without ever visiting NYT.



> So if OpenAI wins this case we could just trade prompts that regurgitate the articles back without ever visiting NYT.

This seems like the inverse of the old "book cipher" scheme to "avoid" copyright infringement.

If you want to distribute something you're not allowed to, first you find some public data (e.g. a public domain book), then you xor it against the thing you want to distribute. The result is gibberish. Then you distribute the gibberish and the name of the book to use as a key and anyone can use them to recover the original. The "theory" is that neither the gibberish nor the public domain book can be used to recover the original work alone, so neither is infringing by itself, and any given party is only distributing one of them. Obviously this doesn't work and the person distributing the gibberish rather than the public domain book is going to end up in court.
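A minimal sketch of that scheme in Python (the key and "secret" strings below are placeholders, not real works) makes it obvious why the argument fails: whoever holds the gibberish plus the key trivially recovers the original.

    from itertools import cycle

    def xor_bytes(data: bytes, key: bytes) -> bytes:
        # XOR each byte of the data against the key, repeating the key as needed.
        return bytes(b ^ k for b, k in zip(data, cycle(key)))

    key_text = b"Call me Ishmael. Some years ago..."             # stands in for a public domain book
    secret_text = b"Some text you are not allowed to distribute"  # stands in for the protected work

    gibberish = xor_bytes(secret_text, key_text)   # looks like random noise on its own
    recovered = xor_bytes(gibberish, key_text)     # XOR with the same key undoes it
    assert recovered == secret_text

Neither half is readable on its own, but the pairing is exactly what a court looks at.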

So which side of the fence is ChatGPT on, and which side is the text you have to feed it to get it to emit the article? Well, it's the latter that you need both the existing ChatGPT and the original article to produce.

Notice also that this fails in the same way. The people distributing the text that can be combined with the LLM to reproduce the article are the ones with the clear intention to infringe the copyright. Moreover, you can't produce the prompt that would get ChatGPT to do that unless you already have access to the article, so people without a subscription can't use ChatGPT that way. And, rather importantly, the scheme is completely vacuous. If you already have access to the article needed to generate the relevant prompt and you want to distribute it to someone else, you don't have to give them some prompt they can feed to ChatGPT; you can just give them the text of the article.


I agree. If you gzip a NYT article and print it out, very few people would be able to read the article. But it can still be decoded ("prompt engineering" as OpenAI calls it).
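A quick sketch of that point (with placeholder text standing in for an article): the gzipped bytes are unreadable to a human, but anyone who knows to run them back through gzip gets the original.

    import gzip

    article = b"Placeholder text standing in for a NYT article."
    compressed = gzip.compress(article)            # unreadable if printed out
    assert gzip.decompress(compressed) == article  # but trivially decoded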


Copyright maximalism in the 21st century can be summed up as: when an individual makes a single copy of a song and gives it to a friend, that's piracy. When a corporation makes subtly different copies of thousands of works and sells them to customers, that's just fair use.


NYT is a corporation.

Corporation vs individual is a distraction. It’s some people (wrongly, in my view) prioritising production over consumption. If this were Altman personally producing an AI, the same people would rally to him.

The corporate/individual framing needlessly inflames the debate when it’s really one about power and money.


I don't think it's "production over consumption". At least I don't like that framing. For me it's about supporting production. The humans that write news articles every day can't produce that valuable work if they don't get fairly compensated for it. It's not that the AI produces more, it's that the AI destabilizes production. It makes it impossible to produce.


> It's not that the AI produces more

We're not debating whether they do. "Humans that write news articles" are producing. That contrasts with "an individual mak[ing] a single copy of a song and giv[ing] it to a friend." We don't put journalists in jail for plagiarism.


> We don't put journalists in jail for plagiarism.

I'm guessing you're imagining a scenario here where a journalist has copied an entire article verbatim and republished it in their newspaper. That would actually be both copyright infringement AND plagiarism. Newspapers just rarely enforce that right.

These two things aren't on a scale. They are independent infractions.


No, they wouldn't, because Altman would still be stealing other people's actual work.


> they wouldn't, because Altman would still be stealing other people's actual work

OpenAI is "stealing other people's [sic] actual work." The people rallying to it clearly don't care that much about it now. They wouldn't care whether it's a corporation or Sam Altman per se doing it.


I'd say there's some merit to that defense. Imagine, for example, a website that generated its content from the digits of Pi: technically every NYT article is in that 'dataset', and if you tell it to start at the right digit it will spit back any article you want. In a more realistic sense, though, you can make it spit back anything you want, and the NYT article is just a consequence of that behavior; finding the right 'secret code' to get a verbatim article is not something you can easily just do.
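As a toy version of that thought experiment (using the mpmath library; this is only practical for very short digit strings, since finding an entire article this way is astronomically infeasible):

    import mpmath

    mpmath.mp.dps = 100_000                      # compute pi to 100,000 decimal digits
    digits = mpmath.nstr(mpmath.mp.pi, 100_000).replace(".", "")

    target = "999999"                            # e.g. the "Feynman point"
    offset = digits.find(target)                 # the "secret code" is just an offset
    print(offset, digits[offset:offset + len(target)])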

ChatGPT is somewhere in between: you can't just ask it for a specific NYT article and have it spit it back at you verbatim (the NYT acknowledges as much; it took them ~10k prompts to do it), but with enough hints and guesses you can coax it into producing one (along with pretty much anything else you want). The question then becomes whether that's closer to the Pi example (ChatGPT is basically just spitting the prompt back at you), or whether it's easy enough to do that it's similar to ChatGPT just hosting the article.

Edit: I suppose I'd add that this is also a separate question from the training itself; training on copyrighted material may or may not be legal regardless of whether the model can spit the training material back out verbatim.


You're getting lost in the technology here. Copyright is not about producing the exact sequence of bytes, nor is it about "hosting an article". Copyright is an intellectual property right in the creative work itself, not merely in the exact reproduction seen on some website.

The law does not care about your weird edge cases. What matters is what should be and how we can make it so.


You're ignoring the point of what I'm saying, though, which is that the required prompt is relevant to determining whether ChatGPT itself is the thing violating the copyright. I can probably get ChatGPT to produce any sequence of tokens I want given enough time; that doesn't mean ChatGPT is violating every copyright in existence. Somewhere you have to draw the line.


I'm not ignoring it. I'm saying that the axis you're contemplating this problem on isn't correct. It's not about if you can get "any sequence of tokens" or the edit distance between those tokens and the actual tokens of the copyrighted work. The law is not (and should not be) an algorithm with a definite mathematical answer at some fixed point in a continuum.

Pi is not copyrighted, because that would be silly, but if you were to find the exact bytes in there that reproduce the next Marvel movie and you started sharing that offset, that would probably be copyright infringement. The fact that neither of those numbers was part of the original work, or copyrightable in isolation, or that "technically everything is present in Pi", is immaterial. It's obvious to any non-pedantic human being that you're infringing on the creative work.


You're still missing the key point. Say that website exists: if someone were to find the exact point in Pi that is the next Marvel movie and started sharing the location, is the copyright violation committed by the creator of the Pi website or by the person who found and is sharing the location?

If I give you a prompt that's just the contents of a NYT article and me telling ChatGPT to say it back to me, is ChatGPT committing the copyright violation by producing the article or am I by creating and sharing the prompt?


I will say it again: I am not missing the point, I am refusing the point. The point you are making is not a useful point in matters of law.

There is no reasonable way for us to deliberate on your made-up scenarios, because in matters of law the details matter. The website hosting Pi could very well be taking part in the copyright infringement, or it could very well not be. Our way of weighing those details is the process of the law.

You place the question of Pi in a vacuum, asking me whether it should be illegal "in principle", but that's not law. The intent, the appearance, the skill of counsel, even the judge and jury, would all matter if a case came up. You cannot separate the idealized question from the messy details of the fleshy humans.


Yes, it's almost like it's a complicated legal question and the content of the required prompt to produce a copyright-infringing response would be something that would interest the judge and jury.

You're saying "it's complicated and lots of factors would come into play", which is the same thing I'm saying. The fact that it spits out copyright-violating text does not necessarily mean ChatGPT is the one at fault, it's messy.


>Yes, it's almost like it's a complicated legal question and the content of the required prompt to produce a copyright-infringing response would be something that would interest the judge and jury.

In what way? You don't seem to know what is decided by a jury and what is decided by a judge. Specifically, what do you think the prompt evidences that makes it relevant?

> The fact that it spits out copyright-violating text does not necessarily mean ChatGPT is the one at fault, it's messy.

Actually, that's exactly what it means. There is no defense to copyright infringement of the nature you are discussing. OpenAI is responsible for what it ingests, and the fact that use of its tool can produce these outcomes is solely OpenAI's responsibility; your misunderstandings otherwise are dense and apparently impenetrable.


How is that person missing the point? You are making a legal argument and apparently without any consideration for the actual law...


They're missing my point because I'm not saying it is or isn't, I'm saying that it's messy and things like the required prompt may sway the judge and/or jury one way or the other. If you provide ChatGPT an entire copyrighted text in the prompt and then go "ah-ha, the response violated my copyright", a judge and/or jury probably won't be very impressed with you. If instead you just ask ChatGPT "please produce chapter 1 of my latest book" and it does, then ChatGPT is not looking so great.


Judge or jury one way or the other on what? You literally have no idea what you are talking about, how a lawsuit works, or what is decided by a judge versus a jury, and you keep expounding on legal issues in a way that only furthers the ignorance of readers who don't know enough to dismiss your posts.

Your hypothetical is asinine and completely removed from what is at issue in this lawsuit.

And of course now, reading other posters responding to you in this thread, I'm not the only one pointing out how you are only contributing your own misunderstandings.


The difference between "I can probably generate everything" and "I can definitely produce this copyrighted work" is substantial and in fact the core argument in the case.


Can you really say it can "definitely produce this copyrighted work" if NYT had to try thousands of prompts, some of which included parts of the articles they wanted it to produce? That's my point. I really don't know the answer, but it's not as simple as "they asked it to produce the article and it did"; they tested thousands of combinations.


Did it? Then yes, you can say it can "definitely produce this copyrighted work."

I'm not sure how that could even be controversial. Either it does or doesn't. In this case, it does.


So if I go on ChatGPT, copy in a chapter from a book and then ask it to repeat the chapter back to me, is ChatGPT violating the copyright of the book I just fed it?


That's not an issue in this lawsuit


If it is outputting verbatim copies of works it has ingested, it is doing copyright infringement. It's really not that difficult.


I think the difference here is that a human intentionally built a dataset containing that information, whereas Pi is an irrational number which is a consequence of our mathematics and number system and wasn't intentionally crafted to give you NYT articles.


Well that depends on what you're trying to prove. If you think it's a copyright violation to include the articles in the dataset _at all_ then it doesn't even matter if ChatGPT can produce NYT articles, it's a violation either way. If including the articles in the dataset is not in-and-of-itself a copyright violation then things get complicated when talking about what prompt is required to produce a copyright-violating result.


1. Anyone can get all of NYT's articles for free, along with CNN and every other major news site. This isn't in dispute; it's available here as a single 93 terabyte compressed file (see the sketch after this list for one way to query the crawl's index):

https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-05/inde...

2. I did not see any defense of this nature.

3. Yes, and this is the big deal. If the secret code needed to reproduce copyrighted material already involves large portions of that copyrighted material, then that's quite a bit different from verbatim reproductions out of thin air.

4. Yes, if OpenAI wins this case then you could feed into ChatGPT large portions of NYT articles and OpenAI could possibly respond by regurgitating similar such portions of NYT articles in response.
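For what it's worth, you don't even need the full 93 TB dump to check point 1; Common Crawl also exposes a public CDX index you can query per URL. A rough sketch follows (the collection name and query parameters are my assumptions based on Common Crawl's index server, not taken from the linked file):

    import json
    import urllib.request

    # Query the Common Crawl index for captures of nytimes.com pages.
    # "CC-MAIN-2025-05" matches the crawl referenced above; adjust if needed.
    url = ("https://index.commoncrawl.org/CC-MAIN-2025-05-index"
           "?url=nytimes.com/*&output=json&limit=5")

    with urllib.request.urlopen(url) as resp:
        for line in resp.read().decode().splitlines():
            record = json.loads(line)                      # one JSON object per capture
            print(record.get("timestamp"), record.get("url"))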




