"If Google could find a way to take that corpus, sliced and diced by genre, topic, time period, all the ways you can divide it, and make that available to machine-learning researchers and hobbyists at universities and out in the wild, I’ll bet there’s some really interesting work that could come out of that. Nobody knows what,” Sloan says. He assumes Google is already doing this internally. Jaskiewicz and others at Google would not say."
For books that are scanned, but with no extra licensing, would Google be allowed to do anything with the data? Create a very delocalized n-gram set (sketched below)? Use it as the "test" set (not even cross-validation, where it might influence hyperparameters) for an ML algorithm?
Edit: I'd love to know where Google's authorization derives from, with the n-gram set. Somewhere in the judge's orders? A negotiated fee with the Authors Guild?
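Edit 2: to clarify what I mean by a "delocalized" n-gram set: aggregate frequency counts only, with any record of which book, page, or position an n-gram came from thrown away. A minimal Python sketch (the sample strings are placeholders standing in for scanned book text, not anything Google has confirmed doing):

    from collections import Counter

    def ngram_counts(tokens, n=3):
        # Only window-level co-occurrence survives; which book (or where in
        # it) an n-gram appeared is deliberately not recorded.
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))

    corpus = Counter()
    for text in ["the quick brown fox", "the quick red fox"]:  # placeholder "books"
        corpus.update(ngram_counts(text.split()))

    print(corpus.most_common(3))  # e.g. [(('the', 'quick', 'brown'), 1), ...]

Once everything is folded into one Counter like this, you can recover usage frequencies (roughly what the Ngram Viewer exposes) but not reconstruct the original text.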
OK, here is one of the important opinions in the Google Books case, by Judge Chin in 2013 [0]. He basically says (paraphrasing): "I'm going to assume Google has violated copyright by creating digital copies and serving them. But it's fair use, because the new products are transformative."
For example, re: n-grams:
"""
Similarly, Google Books is also transformative in the sense that it has transformed book text into data for purposes of substantive research, including data mining and text mining in new areas, thereby opening up new fields of research. Words in books are being used in a way they have not been used before. Google Books has created something new in the use of book text - the frequency of words and trends in their usage provide substantive information. [...]
On the other hand, fair use has been found even where a defendant benefitted commercially from the unlicensed use of copyrighted works.
"""
Data mining, indexing, quotations, metadata: all of these have been extracted before. It seems the issue is more the degree to which Google does/wants to do it, rather than the idea of doing it at all?
If I get the same treatment as Google before the law, doesn't that mean I can copy any whole corpus of work, use it, recopy it, share it, make derivative works, etc., all as long as at the end I produce something new, say, a music track inspired by the original? That appears to be what the judge is saying when applied to other kinds of works?