
Does using copyrighted works to train a machine learning model make that model infringing?


Generally an ML model transforms the copyrighted material to the point where it isn't recognizable, so it should be treated as its own unrelated work that is neither infringing nor derivative. But then you have e.g. GPT that is reproducing some (largeish) parts of the training set word-for-word, which might be infringing.

Also, I don't think there have been any major court cases about this, so there's no clear precedent in either direction.


> But then you have e.g. GPT that is reproducing some (largeish) parts of the training set word-for-word, which might be infringing.

Easy fix: keep a Bloom filter of hashed n-grams from the training set and check generated text against it, so the output never repeats more than N consecutive words from the training set.
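
A minimal sketch of that idea in Python, under stated assumptions: the class `NgramBloomFilter`, the helpers `build_filter` and `repeats_training_text`, the whitespace tokenization, and the filter size are all hypothetical choices for illustration, not anything a real model is known to use. It stores every (N+1)-word window of the training text, so a hit means the candidate repeats more than N consecutive words. A Bloom filter can give false positives, so this would occasionally reject text that never appeared in training, but it never misses a window that was stored.

```python
import hashlib


class NgramBloomFilter:
    """Hypothetical Bloom filter over word n-grams (illustrative only)."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, ngram):
        # Derive several bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(" ".join(ngram).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, ngram):
        for pos in self._positions(ngram):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, ngram):
        return all(
            self.bits[pos >> 3] & (1 << (pos & 7))
            for pos in self._positions(ngram)
        )


def ngrams(words, n):
    """All windows of n consecutive words, as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_filter(training_texts, max_run):
    """Store every (max_run + 1)-word window of the training texts."""
    bloom = NgramBloomFilter()
    for text in training_texts:
        for gram in ngrams(text.lower().split(), max_run + 1):
            bloom.add(gram)
    return bloom


def repeats_training_text(candidate, bloom, max_run):
    """True if the candidate probably repeats more than max_run consecutive words."""
    words = candidate.lower().split()
    return any(gram in bloom for gram in ngrams(words, max_run + 1))


training = ["the quick brown fox jumps over the lazy dog"]
bloom = build_filter(training, max_run=3)
copied = repeats_training_text("a quick brown fox jumps today", bloom, max_run=3)
fresh = repeats_training_text("the quick brown cat sleeps", bloom, max_run=3)
```

Here `copied` flags the four-word run "quick brown fox jumps", while `fresh` shares only three consecutive words with the training text. In a generation loop, the same check would run on the trailing window before each new word is emitted.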


Some say that the Google Books court case is precedent for ML models. If you search back through my comment history, you will find links.


Thanks!


GP is not talking about the model but about the training data set.


I am aware. I'm asking whether the model itself is infringing. Surely you can't distribute the copyrighted works in a dataset, but is training on copyrighted data legal, and can you distribute the resulting model?


All text written by a human in the US is automatically copyrighted by its author. So if an engine trained on works under copyright is a derivative work, GPT-3 and friends have serious problems.



