GA is the standard analytics tool for a huge percentage of websites, so even if a website doesn't use GA for tracking traffic, Google still has the referral data. The same goes for things like Google AdSense (I'm pretty sure that sends back the referral data too, for tracking click fraud).
There is no way really to avoid Google knowing a lot about you/your website anymore.
Google Analytics is on a substantial proportion of the Internet: 65% of the top 10k sites, 63.9% of the top 100k, and 50.5% of the top million[1]. My own results from a research project I did using the Common Crawl[2] corpus estimate that approximately 39.7% of the 535 million pages processed so far have GA on them.
The real key to tracking is the referrer data.
For the vast majority of clicks, you land on a site that has Google Analytics or you've just left one that did.
As Google Analytics tracks your referrer, that means they still have your full browsing history if you jump from GA => !GA => GA => !GA => ...
According to my research[3], Google gets activity information on 51.43% of the 42 billion links analyzed in the 535 million page corpus, because either the start or the end of the link uses Google Analytics. This means they can accurately track browsing history on most sites, even those that don't use GA, simply because timing information, referrers, and knowledge of the web graph end up leaking user activity.
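To make the link-coverage idea concrete, here is a minimal sketch on a toy graph. It is not the Hadoop job from the write-up; the function name `link_coverage` and the `*.example` hosts are made up for illustration. A link counts as visible if the page at either end carries GA.

```python
def link_coverage(links, ga_pages):
    """Return the fraction of links whose source or destination has GA."""
    if not links:
        return 0.0
    visible = sum(1 for src, dst in links if src in ga_pages or dst in ga_pages)
    return visible / len(links)


# Toy web graph of directed links between hypothetical pages.
links = [
    ("a.example", "b.example"),
    ("b.example", "c.example"),
    ("c.example", "d.example"),
    ("d.example", "a.example"),
]
ga_pages = {"a.example"}  # only one of the four pages runs GA

print(f"{link_coverage(links, ga_pages):.0%}")  # prints "50%"
```

In this toy case one page in four has GA, yet half of the links are visible, which is the same effect as the gap between the page and link percentages above.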
Used in an anonymized fashion, this is beneficial as it helps Google understand real world web traffic and hence rank search results accordingly (far better than simulated activity based upon PageRank or similar).
The theoretical situation where you drop anonymization is where this gets troublesome.
If you're interested, there are more details at "Measuring the impact of Google Analytics"[3], though much of the discussion is on Hadoop + Common Crawl. For a privacy focused write-up (primarily worried about the NSA using Google Analytics), refer to "Google, make Google Analytics HTTPS by default"[4].
P.S. Everyone who notes "Google Analytics is easy to evade" is correct but missing the broader point -- the majority of web users will never do that.
> The theoretical situation where you drop anonymization is where this gets troublesome.
Here's something even more troubling. Take an 'anonymous' crime reporting site[1] and put Google Analytics on it. Put it on every single page, even on the page with the forms to submit anonymously. Not bad enough? How about a similar site, only this one aimed at reporting corruption[2] and try the same thing. What could possibly go wrong?
Both sites are well aware of the issue and have written me back when I pointed this out. This level of trust in an American ad company is curious. All I can do now is hope that whistle-blowers wanting to report corruption are savvy enough to avoid the web forms.
Imagine you are a government worker somewhere and you see evidence of corruption and report it, from the same machine you signed in to Gmail with. Now consider that your local government can order Google to secretly hand over tracking data and forbid them from notifying the crime reporting site(s).
> P.S. Everyone who notes "Google Analytics is easy to evade" is correct but missing the broader point -- the majority of web users will never do that.
This is an important point here because it's in stark contrast with Google e-mail. Google's e-mail servers are exceptionally difficult to evade.
I am (apparently) able to evade Google's attempts to track my web browsing habits with Google Analytics. I can take responsibility for that myself.
This technically difficult opt-out process is, IMHO, against the spirit of "don't be evil", but at least there is an option.
On the other hand, I am totally unable to prevent Google from building an accurate profile of my e-mail habits. The only way I can opt-out here is by significantly curtailing my e-mail habits (or by insisting on PGP, which will have the same effect). This is significantly more frightening to me - I have no choice.
Isn't it about time we dropped the HTTP referer header? If we lived in a world where that header didn't exist, and somebody came along today and proposed that we add it to Firefox, Chrome or IE, there would be absolute outrage. If the proponents then argued: "Yeah, but it will make tracking users easier, and we might be able to target advertising better to make more money", people would not accept that as a valid argument.
I'm a PhD student at MIT interested in doing research using the Common Crawl dataset. Are you interested in working on this together, or at least getting together and chatting about it? Drop me a line: nagaraj@mit.edu
As Google Analytics is "self reporting" (i.e. your browser tells the GA servers what it's doing) you can avoid reporting or erroneously report whatever you'd like.
It'd likely be easier for you just to block Google Analytics though if that is of concern to you.
In the unlikely event that fake activity became a problem, Google's well equipped to deal with it. They have a great deal of tech and brains in place to detect fake ad click activity, which is vaguely related.
That doesn't really help if $incriminating_website is among the real visits. I think obfuscation is a waste of time given the machine learning techniques at their disposal.
I wouldn't read too much into that. Google Engineering has huge budgets and all kinds of random projects get paid for with no better justification than "this is good for the web, therefore it's good for us".
I worked there for years. Seeing really deep, well thought out business plans there was a rarity especially for small projects like hosting web fonts or running DNS resolvers. Heck, even for very large projects sometimes the accounting was unbelievably carefree.
Pfft.. What was I thinking? It's the original Dont-Be-Evil company, right? Of course they are giving away tons of freebies just because they are awesome. They just run a money printing press for an extra minute and those huge budgets will materialize out of thin air. Yay.
Yes, that's pretty much how it is. Don't believe it if you like, but you won't convince me: I was there in some of these meetings, I read the design docs, I watched these sorts of projects get approved. They just give this stuff away because they're swimming in money and it's a place run by geeks.
Google Analytics (and other tracking cookies/scripts) are actually very easy to evade, by using browser extensions such as Ghostery, Adblock Plus and NoScript.
We should raise concerns about Facebook Like buttons too. It's impossible to find a site without a Like button, and it sends all our browsing info back to Facebook.
I always use Piwik [0] - it's excellent, open-source and most of all: it respects your users' privacy by not letting any third party like Google track them.
My advice: Use it too and let your users know that you do so because you respect them.
It's still possible to avoid leaking information to GA, by rewriting outgoing links to go through a redirector. Many sites do this (including Google Encrypted Search, but it's not perfect).
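A rough sketch of the link-rewriting half of that idea, in Python. The `REDIRECTOR` endpoint, the `rewrite_outgoing` name and the `example.org` host are all hypothetical. This only shows how outgoing links get pointed at a redirector; what referrer the destination ultimately sees depends on how the redirector itself is implemented, which is not covered here.

```python
from urllib.parse import quote, urlparse

# Hypothetical redirect endpoint on your own site (an assumption for this sketch).
REDIRECTOR = "https://example.org/out?url="


def rewrite_outgoing(href, own_host="example.org"):
    """Point external links at the redirector; leave internal links alone."""
    host = urlparse(href).netloc
    if not host or host == own_host:
        return href  # relative or same-site link, nothing to hide
    # Percent-encode the whole target so it survives as a single query value.
    return REDIRECTOR + quote(href, safe="")


print(rewrite_outgoing("/about"))
print(rewrite_outgoing("https://other.example/page?x=1"))
```

Internal links pass through unchanged, so only clicks that actually leave the site take the extra hop.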
The email problem sucks though. We need end-to-end encryption, on all mail, today.
You are right about the outgoing links on your site to avoid leakage (this is what well implemented search engines like DuckDuckGo do as well), but I think GP talked about the incoming links of your site. Your pages will appear as exit pages in Google Analytics, which you cannot do a thing about.
Except it's not, because the problem's still there for everyone else. Do you really think that being shielded yourself but allowing your non-techy friends and family to be tracked is a 'solved' problem?