Hacker News

I'd say there's some merit to that defense. Imagine for example if a website generated itself based on a sequence in Pi - technically all of the NYT is in that 'dataset' and if you tell it to start at the right digit it will spit back any NYT article. In a more realistic sense though you can make it spit back anything you want and the NYT article is just a consequence of that behavior - finding the right 'secret code' to get a verbatim article is not something you can easily just do.

ChatGPT is somewhere in between. You can't just ask it for a specific NYT article and have it spit it back at you verbatim (NYT acknowledges as much, it took them ~10k prompts to do it), but with enough hints and guesses you can coax it into producing one (along with pretty much anything else you want). The question then becomes whether that's closer to the Pi example (ChatGPT is basically just spitting the prompt back at you), or if it's easy enough to do that it is similar to ChatGPT just hosting the article.

Edit: I suppose I'd add, this is also a separate question from the training. Training on copyrighted material may or may not be legal regardless of whether the model can spit the training material back out verbatim.



You're getting lost in the technology here. Copyright is not about producing the exact sequence of bytes, nor is it about "hosting an article". Copyright is an intellectual property right to the creative work, not the exact reproduction that is seen on some website, but the creative work itself.

The law does not care about your weird edge cases. What matters is what should be and how we can make it so.


You're ignoring the point of what I'm saying though, which is that the required prompt is relevant to determining if ChatGPT itself is the thing violating the copyright. I can probably get ChatGPT to produce any sequence of tokens I want given enough time, but that doesn't mean ChatGPT is violating every copyright in existence. Somewhere you have to draw the line.


I'm not ignoring it. I'm saying that the axis you're contemplating this problem on isn't correct. It's not about if you can get "any sequence of tokens" or the edit distance between those tokens and the actual tokens of the copyrighted work. The law is not (and should not be) an algorithm with a definite mathematical answer at some fixed point in a continuum.

Pi is not copyrighted, because that would be silly, but if you were to find the exact bytes in there to reproduce the next Marvel movie and you started sharing that offset, that would probably be copyright infringement. The fact that neither of those numbers was part of the original work, or copyrightable in isolation, or that "technically everything is present in pi", is immaterial. It's obvious to any non-pedantic human being that you're infringing on the creative work.


You're still missing the key point. Say that website exists. If someone were to find the exact point in Pi that is the next Marvel movie and started sharing the location, is the copyright violation committed by the creator of the Pi website or by the person who found and is sharing the location?

If I give you a prompt that's just the contents of a NYT article and me telling ChatGPT to say it back to me, is ChatGPT committing the copyright violation by producing the article or am I by creating and sharing the prompt?


I will say it again. I am not missing the point, I am refusing the point. The point you are bringing across is not a useful point in matters of law.

There is no reasonable way for us to deliberate on your made up scenarios, because in matters of law the details matter. The website hosting pi could very well be taking part in the copyright infringement, it could also very well not. Our way of weighing those details is the process of the law.

You place the question of Pi in a vacuum, asking me if it should be illegal "in principle", but that's not law. The intent, appearance, skill of counsel, even the judge and jury, will matter if a case were to come up. You cannot separate the idealized question from the messy details of the fleshy humans.


Yes, it's almost like it's a complicated legal question and the content of the required prompt to produce a copyright-infringing response would be something that would interest the judge and jury.

You're saying "it's complicated and lots of factors would come into play", which is the same thing I'm saying. The fact that it spits out copyright-violating text does not necessarily mean ChatGPT is the one at fault. It's messy.


> Yes, it's almost like it's a complicated legal question and the content of the required prompt to produce a copyright-infringing response would be something that would interest the judge and jury.

In what way? You don't seem to know what is decided by a jury or what is decided by a judge. Specifically, what do you think the prompt is evidence of that makes it relevant?

> The fact that it spits out copyright-violating text does not necessarily mean ChatGPT is the one at fault, it's messy.

Actually, that's exactly what it means. There is no defense to copyright infringement of the nature you are discussing. OpenAI is responsible for what it ingests, and the fact that use of its tool can result in these outcomes is solely the responsibility of OpenAI. Your misunderstandings otherwise are dense and apparently impenetrable.


How is that person missing the point? You are making a legal argument and apparently without any consideration for the actual law...


They're missing my point because I'm not saying it is or isn't, I'm saying that it's messy and things like the required prompt may sway the judge and/or jury one way or the other. If you provide ChatGPT an entire copyrighted text in the prompt and then go "ah-ha, the response violated my copyright", a judge and/or jury probably won't be very impressed with you. If instead you just ask ChatGPT "please produce chapter 1 of my latest book" and it does, then ChatGPT is not looking so great.


Judge or jury one way or the other on what? You literally have no idea what you are talking about, how a lawsuit works, or apparently what is decided by a judge vs. what is decided by a jury, and you are constantly expounding on legal issues as if it contributes anything other than furthering the ignorance of other people who don't know better than to dismiss your posts.

Your hypothetical is asinine and completely removed from what is at issue in this lawsuit.

And of course now, reading other posters responding to you in this thread, I'm not the only one pointing out how you are only contributing your own misunderstandings.


The difference between "I can probably generate everything" and "I can definitely produce this copyrighted work" is substantial and in fact the core argument in the case.


Can you really say it can "definitely produce this copyrighted work" if NYT had to try thousands of prompts, some of which included parts of the articles they wanted it to produce? That's my point. I really don't know the answer, but it's not as simple as "they asked it to produce the article and it did". They tested thousands of combinations.


Did it? Then yes. You can say it can "definitely produce this copyrighted work".

I'm not sure how that could even be controversial. Either it does or doesn't. In this case, it does.


So if I go on ChatGPT, copy in a chapter from a book and then ask it to repeat the chapter back to me, is ChatGPT violating the copyright of the book I just fed it?


That's not an issue in this lawsuit.


If it is outputting verbatim copies of works it has ingested, it is doing copyright infringement. It's really not that difficult.


I think the difference here is that a human intentionally built a dataset containing that information, whereas Pi is an irrational number which is a consequence of our mathematics and number system and wasn't intentionally crafted to give you NYT articles.


Well that depends on what you're trying to prove. If you think it's a copyright violation to include the articles in the dataset _at all_ then it doesn't even matter if ChatGPT can produce NYT articles, it's a violation either way. If including the articles in the dataset is not in-and-of-itself a copyright violation then things get complicated when talking about what prompt is required to produce a copyright-violating result.



