How Google Book Search Got Lost (backchannel.com)
155 points by chrismealy on April 11, 2017 | 80 comments


This is the value-destruction side of copyright protection. With more reasonable length and more easily determinable copyright status, efforts like google books would be able to do so many more things. But without a permissive legal framework that innovation is shut down.


Blame it 100% on Mickey Mouse and Disney; that is where all the blame belongs.


Here is what I don't understand: how come copyright gets expanded left and right but patents do not? Do Disney and the entertainment industry have more power than the entirety of traditional industries which depend on patents to make a profit (pharmaceutical, manufacturing, electronics, etc.) combined?


I think it's a lot less clear what the societal benefits of forcing copyrights to expire are. For patents, it's much more clear -- a patent prevents others from copying the invention, so its expiration allows others to use that invention for economically interesting purposes, which have a net societal benefit. So on the one hand, the temporary monopoly allows the inventor to profit from their invention, and costs others their ability to exploit it, and after a certain amount of time, those needs are reversed.

For copyrights, in the purest form, what do you really get from letting it expire? The only really tangible benefit is to allow people to obtain the work without having to pay anyone, and (theoretically) make it easier to preserve and distribute orphaned works without the permission of the creator. It seems like as a lawmaker, it's really not that hard to rationalize saying "I can pass this legislation, extending copyright, or I can directly cause Disney to forfeit X million dollars in revenue".

At least with a patent expiration, you can counterweight that by saying "I can extend the patent, and Boeing continues to get X dollars, or I can allow the patent to expire, and other companies get to use that invention to make Y dollars." The tradeoff is a little clearer.

I would love to see better rules for copyrights moving to the public domain, not so much for the sake of Mickey Mouse, but rather for obscure and orphaned works, as well as simply decreasing the cost for the consumer to improve accessibility. But it's very easy in my mind to see why this is unlikely to happen.


That's an interesting point actually. Maybe there should be multiple stages to copyright expiry, where the opportunities to create derivative works open up over time, without the work itself entering the public domain just yet.

Derivative works are where most of the societal value lies. Something like videogame publishers owning any footage of their game being played is absurd.


> For copyrights, in the purest form, what do you really get from letting it expire? The only really tangible benefit is to allow people to obtain the work without having to pay anyone, and (theoretically) make it easier to preserve and distribute orphaned works without the permission of the creator.

You seem to be forgetting about derivative works.


Yeah my Columbo novel should not be denied an audience.


Just as society gained from Disney's being allowed to use the stories and characters of Sleeping Beauty or The Little Mermaid, we would similarly gain from new artists being able to give us new takes on Disney stories and characters.

Or mashups - perhaps your Columbo novel could feature Mickey Mouse...


Your Columbo novel is probably awful. (No offense.) But there's someone out there who would have written a great one.


None taken.

Let me tell you about it. There are three interleaved narratives.

One is about an elderly actor who played the TV role for so long that he's no longer sure if he's an actor or the actual detective. The second is about a real-life police detective called Frank Columbo who is plagued by being identified with the fictional detective, whom he resembles in almost every particular. The third is about the actual Columbo.

All three are simultaneously solving different murders, all of which have their exposition up front - in the classic Columbo style.


Just add a minimum annual tax for keeping a copyright. Currently, holding copyright on older works has little or no cost; once it starts costing money, that would change.


It could be a dollar per year past its original copyright term. That way there is little to no burden on the holder, but it sets lesser works free. Therefore abandoned works get released.

We did it Red.. HackerNews :)


For patents you have companies fighting on the other side to limit them, so they can use the technology eventually.

For copyrights there is no resistance with deep pockets.


A lot of businesses benefit from expired patents. Any business that uses technology would be impossible if patents never expired.

No business really benefits that much from expired copyright. When Disney lobbies to extend copyright, who is going to spend millions to fight them?


> who is going to spend millions to fight them?

I am hoping one day everyone will be fed up and there will be millions in funding.

Personally I don't think this gets fixed till we can give special copyright to Disney and other corporations and free up the other 99.5% of copyright.


That seems like a great compromise. Pay a ten million dollar fee to extend your copyright by ten years. Most material becomes public domain, some exceptionally valuable properties remain privately held and generate value for the holder. Plus, we get some additional tax revenue.


In theory, one of the large tech companies affected by these laws could do it. I mean, Google's been affected by this with regard to everything from search to YouTube. If they wanted to, I'm sure they could spend far more than Disney on lobbying.

This feels like a fight that Silicon Valley companies should take on.


> Any business that uses technology would be impossible if patents never expired.

Not completely true. Businesses could exist even with indefinite patents by licensing them (see MP3 patent).


But the tech stack is much deeper than derivative works ever could be. Some derivative three generations down from Donald Duck is probably no longer a derivative of it, judged by current standards. But the computer I'm using right now could reasonably be covered by tens of thousands of patents all the way back to the origins of metallurgy, if those didn't expire.


Patents regulate relations between more or less equal entities (in very broad terms).

Copyright regulates mostly matters between large media companies and individual consumers.


It would probably be much harder to justify. IP law exists because it's considered a net good to society (not everyone agrees on this, but it's why we have it, at least). Of course, it doesn't help that all the money goes into lobbying on behalf of IP holders who want to extend the reach of IP.

In theory, patents have good and bad side effects. The good effect of a patent is that it encourages people to spend lots of money on R&D because it gives them a period of time to recoup that investment once they bring something to market. The bad effect, obviously, is that this artificial monopoly can prevent other people from bringing similar things to market. But it's (again, in theory) a net good because the end result is that it creates an additional financial incentive to create new things, and once the patent expires then the negative aspect goes away. You are left with new stuff that may not have otherwise been brought into existence.

The goal of a patent system is to benefit the population as a whole, not patent holders, even though patent holders benefit from it as well. If it only benefitted patent holders, and didn't have this philosophical underpinning of benefitting everyone, we wouldn't have patent laws in the first place.

Copyright is similar in that it is designed with known pros and cons, but is legally rationalized as being a net good for society. The goal of copyright is to create an environment in which more stuff is created overall. The ultimate legal justification for copyright is the overall positive effect on society; it is not to benefit the copyright holder (that's just a happy side effect for them).

So it's a weird dynamic where both the patent system and copyright law exist to benefit society, but all of the money poured into changing it is going to come from IP holders who want to manipulate the law so that they can make more money than they are now. There are no lobbyists representing "all of society", unfortunately, so it's a very one-sided battle that only ends when judges and legislators push back on the IP holders.

To answer the question, I'm guessing that it's much easier to push for extending copyright because the societal cost to that is much less clear. It's easy to make an argument that (for example) pharmaceutical patents should not be extended because people will suffer. Cheaper drugs don't come until patents expire and competition is allowed to take its course.

On the other hand, it's harder to make a case for the harm done by extending the period of time in which (for example) only Disney can sell Mickey Mouse t-shirts, so it's a lot easier for the entertainment industry to lobby for longer copyright periods than it is for other industries to lobby for longer patent periods.


Life plus 75 is insane. A period of 25 years from the date of publication will be more than sufficient for copyright.


Why 25 years? Why not 20 years? Or 50 years?

We have been told that copyright makes it possible for authors to live off their hard work. This interest needs to be balanced against the needs of the common good. It is known that free access to information encourages innovation.

Some works need more time to generate revenue for the authors, others hit the market furiously and could be released after a short time.

Therefore I propose a minimum copyright term of 5 years from the date of publication. After that, the owner sets a yearly fee which they themselves pay to keep the copyright. Someone can buy the work out of copyright by paying a multiple of that fee.

Say the copyright should last around 75 years. The owner declares a fee of $10,000 for the first year after the 5-year minimum. The multiplier could then be 70. If someone pays the owner $700,000, the book immediately and irreversibly falls into the public domain.
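A rough sketch of how that buyout rule would compute the price; the fee, minimum term, and multiplier are just the example numbers from this comment, not anything that exists in law:

  // Hypothetical buyout rule from the comment above: after a minimum term,
  // the owner declares a yearly fee to keep the copyright, and anyone may
  // release the work into the public domain by paying a multiple of that fee.
  interface CopyrightTerms {
    minimumYears: number; // e.g. 5 years from publication
    yearlyFee: number;    // fee the owner pays per year to keep the copyright
    multiplier: number;   // buyout price = yearlyFee * multiplier
  }

  function buyoutPrice(terms: CopyrightTerms): number {
    return terms.yearlyFee * terms.multiplier;
  }

  // The example from the comment: $10,000/year fee, multiplier of 70.
  const example: CopyrightTerms = { minimumYears: 5, yearlyFee: 10_000, multiplier: 70 };
  console.log(buyoutPrice(example)); // 700000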


I think it is because there are rich corporations that benefit when another corporation's patents expire, and so they lobby Congress not to extend the term of patents. With most copyrights, you don't have that; all the money is in favor of extending.


"If Google could find a way to take that corpus, sliced and diced by genre, topic, time period, all the ways you can divide it, and make that available to machine-learning researchers and hobbyists at universities and out in the wild, I’ll bet there’s some really interesting work that could come out of that. Nobody knows what,” Sloan says. He assumes Google is already doing this internally. Jaskiewicz and others at Google would not say."

For books that are scanned, but with no extra licensing, would Google be allowed to do anything with the data? Create a very delocalized n-gram set? Use it as the "test" set (not even cross-validation, where it might influence hyperparams) for an ML algorithm?

Edit: would love to know where Google's authorization derives from, with the n-gram set. Somewhere in the judge's orders? A negotiated fee with the Authors Guild?
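For concreteness, a toy version of the kind of n-gram counting being discussed, just word-sequence frequencies over plain text; this is nothing like Google's actual pipeline:

  // Toy n-gram frequency count over plain text: a crude version of the kind
  // of derived, transformative data the Ngram Viewer exposes (not Google's
  // actual pipeline; ignores per-year bucketing, tokenization rules, etc.).
  function ngramCounts(text: string, n: number): Map<string, number> {
    const words = text.toLowerCase().split(/\W+/).filter(w => w.length > 0);
    const counts = new Map<string, number>();
    for (let i = 0; i + n <= words.length; i++) {
      const gram = words.slice(i, i + n).join(" ");
      counts.set(gram, (counts.get(gram) ?? 0) + 1);
    }
    return counts;
  }

  // Example: bigram counts for a snippet of book text.
  console.log(ngramCounts("the quick brown fox jumps over the lazy dog", 2));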


Ok, here is one of the important opinions in the Google Books case, by Judge Chin in 2013 [0]. He basically says (paraphrasing), "I'm going to assume Google has violated copyright by creating digital copies and serving them. But it's fair use, because the new products are transformative".

For example, re: n-grams:

""" Similarly, Google Books is also transformative in the sense that it has transformed book text into data for purposes of substantive research, including data mining and text mining in new areas, thereby opening up new fields of research. Words in books are being used in a way they have not been used before. Google Books has created something new in the use of book text-the frequency of words and trends in their usage provide substantive information. [...]

On the other hand, fair use has been found even where a defendant benefitted commercially from the unlicensed use of copyrighted works

"""

Oh man, this is mind-blowing.

[0] https://copyright-casebook.com/about/recent-cases-edited/aut...


Data mining, indexing, quotations, and metadata have all been extracted from books before. It seems more like it's the degree to which Google does/wants to do it, rather than the idea itself?

If I get the same treatment as Google before the law then doesn't this mean I can copy any whole corpus of work, use it, recopy it, share it, make derivative works, etc., all as long as at the end I write something new - a music track inspired by their work, say? That appears to be what the judge is saying when applied to other works??




> "They should have just licensed the books instead.”

The problem with orphaned works is that that can't be done, as nobody knows who owns them.


One presumes that, as the speaker is from the Authors Guild, they would probably be happy to fix that problem by accepting the licensing fees for orphaned works themselves.


Indeed, that was attempted, but it was vetoed by the judge:

https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,...


Furthermore, there's a fair bit of pushback by many content creators around orphan works legislation. This has been perhaps most pronounced with photography. There seems to be a concern that big corporations won't try very hard to reach people who have neglected to renew copyrights and will snap up their works.

(Not saying I agree with this POV but it's out there.)


>people who have neglected to renew copyrights //

Renewal, indeed registration, is largely a USA thing, since the Berne Convention (adopted nearly everywhere before the USA signed in 1988) did away with the need for registration in the 19th century.

Worth bearing in mind: the USA, as with "English" measurement, is a good step out of line with practice in the rest of the world. Any argument based on someone not registering is going to need complete rethinking outside the USA.


Fair enough. There have been proposals to make copyright terms shorter with renewal options but those aren't inherent in orphan works legislation.

In general, I favor orphan works legislation but there is both potential for abuse and ambiguity in how much effort will and should go into tracking down rights holders.


If you'd like to see an example of what innovative things you can do with book contents:

https://blog.archive.org/2016/02/09/how-will-we-explore-book...

https://books.archivelab.org/dateviz/


I'm not that impressed; am I missing something? Looks like pre-indexed search matched with full-text search??


It's an index of concepts that appear in a sentence with dates. The blog post shows an example of how this type of index surfaces important dates for the concept 'Gregorian Calendar'.


It's not terribly hard to think of good uses for Google Books[1]. It's just that the legality was a bit murky, and what are the incentives?

One idea (among many, surely): many people prefer visual explanations, and in many subject areas, books offer better visual explanations. If, when searching for something, Google also linked to some visual explanations from books (in the main search), and used machine learning to find the best ones, that could be really great; it could really improve the experience and value of the web.

But is it legal? It's unclear.

Can Google monetize that? Probably not.

Or am I wrong? Is there some way Google can monetize that?


"Google's mission is to organize the world's information and make it universally accessible and useful."

Literally why Google exists: https://www.google.com/intl/en/about/


Does anyone actually believe this? I can't think of a single example of altruistic behaviour from this company...


https://www.google.org/

That's the arm of Google whose mission is to spend $100 million per year on charitable projects.


Google made $74bn in 2015 [1]; $100m on charity is just PR. Is that a day's revenue? An hour's? I'm glad it's happening, but I'm cynical enough to consider that little more than an investment in company image.

Every company which makes such ridiculous amounts of money throws some of it at a cause or other, they get far more returns from it in PR than they would spending it another way.

Altruism? How about dropping half of it on making the world better and not making a marketing circus out of it.

1: https://abc.xyz/investor/news/earnings/2015/Q4_google_earnin...


Could you give me an example of a theoretical activity that would qualify to you as altruism but would not be dismissible as "they get far more returns from it in PR"?


There will always be a tint of 'image bias' I guess.

I think upping the numbers to where it's clearly a loss for whoever is donating would tip the scale more towards the 'good citizen' mark than these token amounts.

We'll never know if that works since no one has done it, and probably never will.

I'm not trying to call out google in particular, in fact that they even give this amount makes them better than many. These corporate statements about doing things "for the good of humanity" fall kind of flat when they do so little towards it though and I'm not sure why we buy into their marketing.

If your mission statement is to advance the species, or even give universal access to information for everyone; why would you sit on $84bn of yearly profit instead of using it to achieve that goal?

I know, life isn't as easy as that and I understand the reality of capitalism.

It's pie-in-the-sky stuff, and before I'm flamed to death, I'm not saying it's feasible and I'm not calling on it to happen, but just for the sake of discussion: am I the only one who thinks the change that Apple, Google, and Facebook could make to our species if they just gave half of their bank accounts would be significant and beneficial to everyone? Would it really make much difference to them if they had $400bn instead of $900bn in the bank?

shrug -- I don't have the answers.


I feel like you're assuming a false dichotomy between doing good and benefitting oneself. If Google organizes the world's information and makes it universally accessible and useful, and they put ads beside it, the latter part doesn't make the former untrue.


> Can Google monetize that ? probably no.

I'm sure Google could work out some affiliate fee/scheme for directing the user to a marketplace to buy the book whose image is shown. But if Google directs the user to its own marketplace, it would surely get a cut, no?


That's a good idea and Google is already putting Amazon links in Google Books.

The US publishing industry is $28B. Technical/educational books are 50% of that, Google could only capture 10% of that market (there are many other channels), and let's say they'd get a 10% affiliate fee. That equals 28B * 0.5 * 0.1 * 0.1 = $140 million/year for what they do currently.

Books are quite expensive, so adding visual links to the main search engine would, say, double that: $280 million/yr. Not much for Google.

But on the negative side, adding visual links will distract people from surfing to sites filled with ads, clicking search ads, etc., so maybe it will cost Google money. Maybe a lot of money. That could be the logic behind why Google Forums, clearly a valuable service, was dumped[1].

So Google probably needs a more serious way to monetize it to justify the effort.
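Spelling out the back-of-the-envelope math above; every number here is a guess from this thread, not a real figure:

  // Back-of-the-envelope affiliate-revenue guess from the comment above;
  // every number is an assumption, not a real figure.
  const usPublishingMarket = 28e9; // ~$28B US publishing industry
  const technicalShare = 0.5;      // guessed share that is technical/educational
  const googleChannelShare = 0.1;  // guessed share of that market Google could route
  const affiliateFee = 0.1;        // assumed affiliate cut
  const annualRevenue =
    usPublishingMarket * technicalShare * googleChannelShare * affiliateFee;
  console.log(annualRevenue); // 140000000, i.e. ~$140M/year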


I'm sad that outfits like Project Gutenberg are stonewalled at 1923 or whatever year Steamboat Willie came out. There are so many good books that can't legally be reproduced even though their authors' grandchildren are dead.


Library Genesis is your friend. (http://libgen.io or http://gen.lib.rus.ec, among other mirrors.)


This. Relatedly, losing an easy Google News Archive was killer for some of the research I'd like to do. Several papers/articles I wrote in c. 2010 would not be possible to do today.


Common Crawl has a new news archive started a few months ago (http://commoncrawl.org/2016/10/news-dataset-available/) and the Internet Archive has had one going for quite a while.


Thanks for this! I'm talking about old scanned newspapers. :-) The Internet Archive has a good start, but it's pretty heavy on Kentucky, and few have in-text search available, which is killer if you're researching an event with few/no specific dates. (That's not to knock them—IA is pretty amazing, and OCRing newspapers is notoriously difficult.)


Do you mean the old deja thing? We (google) got a copy to the archive years ago.


This reply got a bit convoluted. My apologies.

First, I'm referring to this: https://www.theatlantic.com/technology/archive/2011/05/googl...

It's still /technically/ possible to search what's there via https://news.google.com/newspapers. Still, it's not exactly user-intuitive, and not being able to sort/search by date can make historical research very difficult (especially when the OCR isn't perfect—that's common, but trying several different phrases to make sure you've found everything is way easier when you can search a range of years).

Some related thoughts can be found in an old Hacker News post: https://news.ycombinator.com/item?id=7408034

Online newspaper archives are a ridiculously awesome boon for the humanities. Chronicling America from the Library of Congress, for instance, is great. It's the de facto successor to Google News Archive in the US. I just wish that Google News Archive could get a couple of the old search features back to aid researchers. :-)

Second, on a quick tangent I just discovered: when you select "archives" at news.google.com, it says "looking for scanned newspapers?" with a link to: https://support.google.com/news/answer/3334. But there's nothing there anymore about scanned news. :-)


The article alludes to it at the end: the "corpus" of scanned books is incredibly valuable to a big-data company like Google and gives them a real edge against other companies.


If it is incredibly valuable, other companies can do it too, and sell access to the corpus to other companies.


Wouldn't they also run into the licensing problem, even more so if they try to sell access?


Seems a natural investment for Amazon.


Here's Amazon's "search inside" program, which started in 2001: http://newsbreaks.infotoday.com/NewsBreaks/Search-Inside-the...


Perhaps they can dump it as a torrent, so any hacker could build a valuable service with it.

See e.g. SciHub.


I think Book Search is entirely a red herring, one of Google's many 'say one thing, do something entirely different' moves. A great example I know of personally was the Google 411 project.

Google 411 was a project which offered 'directory services' by voice over your phone. It was pretty cool: you could call its 800 number, ask for a listing, and it would use Google search to find it and then read it back to you. Then you got to tell it how well it did. People started using it, adoption spread, and then they cut it off.

So why did Google run this service in the first place? Was it to see if they could make a business out of directory services? No, it was to collect a data set where a spoken phrase could be matched to its exact transcription (the thing the caller was looking for), which could then be processed and re-processed to train algorithms for speaker-independent speech recognition. They could either try to pay a million people to come in, say something, and then confirm or deny that the system understood them, or they could use a bit of spare hardware and collect that information without paying anyone a dime. They own that data set, it is extremely valuable for testing improvements in voice recognition, and it is yet another barrier to a new company trying to get into that space.

Now let's look at Google Books. The 'story' was: digitize all the books and make a great library available, and maybe even offer up PDF copies of out-of-print books for people. It set off a legal firestorm (as mentioned in the article), it got all of the archivists on board, libraries contributed millions of volumes, and Google ended up with a court ruling (which the Supreme Court declined to review) that digitizing a book was fair use, not copyright infringement.

But the subtle part is that while folks say 'everything is on the web' (and that may be true), literally 99.99% of everything on the web is complete and utter crap, written by people who are seeking to game advertising selection engines, not to digitize information. Most of the stuff in books is not crap, because it cost someone significant time and labor to take that information and publish it. (romance novels excluded :-))

Google had digitized the single largest collection of human knowledge ever, and put it in a form where hundreds of thousands of machines can process and re-process it to derive ontologically accurate facts for the largest knowledge base in history. If you want to test whether your algorithm can identify credible information, there is no better way to do it than to prime it with a ground truth which has much higher credibility than most of the accessible data out there.

That data set exists, they own it, and they don't have any obligation to share it with anyone. And it will allow for the creation of trained models for differentiating fact from fiction, jest from insult, and command from comment.

It is my opinion that any company that wants to be 'serious' about AI better have access to an equivalent data set or they will lose.


An aside, which doesn't necessarily affect your reasoning about Books...

You didn't mention that GOOG-411 (I still have the t-shirt and other schwag) also had rampant abuse, had seen legitimate traffic shift to smartphones, and, last but not least, had awful audio quality, so it wasn't just cut off for no good reason. The data set is not as valuable as it might appear at first. The Google Cloud Speech API documentation recommends 16 kHz 16-bit samples, not the 8 kHz 8-bit PCM (at best) you get from DS0/POTS.

The speech corpus being collected was not a secret at all, either.


If it is fair use to create transformative data from scanned books, could the Internet Archive work with universities and commercial entities to create an open-license corpus of data derived from their collection of scanned books?

As precedent, Google did contribute Freebase to Wikidata, even though it was the starting point for their proprietary Knowledge Graph (Facebook has a competing graph), https://en.wikipedia.org/wiki/Freebase


Yes, and I strongly support their effort to do so.



And also: the dawning realization that Scanning All The Books, however useful, might not change the world in any fundamental way.

I think this is actually the main takeaway.


Remember when reCAPTCHA used Google Books snippets? You may not; it's been swapped over to Street View images for a while. But Google Books has been a big deal for automated OCR.


I believe PageRank is irrelevant to Google Books


[deleted]


Occam's Razor: Copyright holders prevented it from being useful


> "the ranger account of a bear my dad fought in Yellowstone"

Link?


(Is there some protest somewhere I can join to fight this idiotic image fade-in effect? You know, the one made universally hated by medium.com.)


You want to protest lazy loading images? Cool.


Yup.


That idiotic effect saves bandwidth and allows the page to load more quickly.


I have plenty of bandwidth. (Also: exactly what bandwidth does it save, even for those that have little bandwidth? Aren't browsers better equipped to decide what images to load or not?)

The page doesn't load any faster; the only effect is that I see a stupid fade-in instead of seeing the page with the image more or less immediately.

The Cloudflare host serving the image is a 4 ms roundtrip away. The bandwidth is 250/100 Mbit/s.

  #ab -n 100 -c 10 "https://cdn-images-1.medium.com/max/1500/1*UW2Mz4yMD65VYgFZILiL_w.png"
  ..
  Requests per second:    113.23 [#/sec] (mean)
  Time per request:       88.314 [ms] (mean)
  Time per request:       8.831 [ms] (mean, across all concurrent requests)
  Transfer rate:          25649.12 [Kbytes/sec] received
So to summarize my points:

- The javascript-driven delayed loading + fade-in is completely unnecessary and primarily distracting for people with high-bandwidth connections.

- The browser is in a better spot to figure out how to handle the loading of these images. Just make sure the dimensions are set so that the layout isn't modified after the images have loaded.

- The fade-in effect was novel the first week, a couple of years ago, but is now just annoying.

If any of the client-side developers who no doubt downvoted this sub-thread would step up to actually talk about it, it would be quite welcome. :)


The browser is definitely in a better position to handle downloading these images on a slow connection without blocking the rest of the page from loading. Unfortunately, it doesn't, at least without extensions that only a very small percentage of users know exist. I have long wished for browsers to use more discretion on slow connections, or, failing that, to at least inform websites that the connection is slow, so that I can optimize my sites for slow connections without sacrificing the experience for fast ones. Alas, this is not the world we live in.
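For what it's worth, there is a non-standard Network Information API (exposed only in some Chromium-based browsers) that gives sites a rough connection hint. A minimal sketch of using it to pick a lighter image; the data-full/data-low attributes are just a made-up convention for this sketch:

  // Sketch: pick a lower-resolution image when the non-standard Network
  // Information API reports a slow connection or Save-Data. Only some
  // Chromium-based browsers expose navigator.connection, so we feature-detect
  // and fall back to the full-quality image everywhere else.
  function preferredImageSrc(fullSrc: string, lowSrc: string): string {
    const conn = (navigator as any).connection;
    if (conn && (conn.saveData || /(^|-)2g$/.test(conn.effectiveType ?? ""))) {
      return lowSrc;
    }
    return fullSrc;
  }

  // data-full / data-low are made-up attributes for this sketch.
  document.querySelectorAll<HTMLImageElement>("img[data-full][data-low]").forEach(img => {
    img.src = preferredImageSrc(img.dataset.full!, img.dataset.low!);
  });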


Please explain to me what benefit the fade-in has to

a) low-bandwidth users

b) high-bandwidth users?

Edit: answering myself.

There are at least four aspects to the delayed display of images that I can imagine.

1) delay from loading time to when the download is triggered

2) delay and annoyance caused by the time the fade-in itself takes

3) delay caused by not progressively painting the image while it is being downloaded

4) perhaps most important to modern designers: avoiding slowing down the mobile browser by avoiding decoding images that the user has already scrolled past.

I think #4 is why desktop users with 10x faster CPUs and connections suffer.

I do think this is the wrong way to go about solving problems like this.


The fade-in, as I understand it, waits until the page has loaded to actually start downloading the image. This means that the image download will not be competing for connections and bandwidth with HTML/CSS/JS files that are required for first paint.

Whether this makes a significant difference is probably highly dependent on the page and the user's connection, and I have no idea without profiling. But it's at least plausible that it would help some users on slow connections.

Of course, it could just be that devs are A/B testing time until the "load" event fires, in which case they've just found a way to cheat the metric.
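For reference, a minimal version of that pattern, deferring the image download until the window load event and then fading it in, looks roughly like this. The data-src attribute, fade-in class, and CSS are just conventions for this sketch, not Medium's actual implementation:

  // Minimal lazy-load-with-fade-in sketch (not Medium's actual code): images
  // start with data-src instead of src, so they don't compete with resources
  // needed for first paint; after the window load event we swap in the real
  // source, and a CSS class transitions opacity once the image has loaded.
  window.addEventListener("load", () => {
    document.querySelectorAll<HTMLImageElement>("img[data-src]").forEach(img => {
      img.addEventListener("load", () => img.classList.add("fade-in"), { once: true });
      img.src = img.dataset.src!; // kicks off the deferred download
    });
  });

  // Accompanying CSS, for illustration:
  //   img[data-src]         { opacity: 0; transition: opacity 0.3s ease-in; }
  //   img[data-src].fade-in { opacity: 1; }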



