Hacker News | divineg's comments

It's incredible. I can't believe it but it actually works quite nicely.

If 10K $5 subscriptions can cover its cost, maybe a community run search engine funded through donations isn't that insane?


It's been clear to anyone familiar with encoder-only LLMs that Google is effectively dead. The only reason it still lives is that it takes a while to crawl the whole web and keep the index up to date.

If someone like Common Crawl, or even a paid service, solves crawling the web in real time, then the moat Google has had for the last 25 years is gone and search is commoditized.


The team that runs the Common Crawl Foundation is well aware of how to crawl and index the web in real time. It's expensive, and it's not our mission. There are multiple companies that are using our crawl data and our web graph metadata to build up-to-date indexes of the web.


Yes, I've used your data myself on a number of occasions.

But you are pretty much the only people who can save the web from AI bots right now.

The sites I administer are drowning in bots, and the applications I build which need web data are constantly blocked. We're in the worst of all possible worlds, and the simplest way to solve it is to have a middleman that scrapes gently and has the bandwidth to provide an AI-first API.


I'm all for that.


Your terms and conditions include a lot of restrictions, some of which are ambiguous in how they can be interpreted.

Would Common Crawl do a "for all purposes and no restrictions" license if it is for AI training, computer analyses, etc? Especially given the bad actors are ignoring copyrights and terms while such restrictions only affect moral, law-abiding people?

Also, even simpler, would Common Crawl release under a permissive license a list of URLs that others could scrape themselves? Maybe with metadata per URL from your crawls, such as which sites use Cloudflare or other limiters. Being able to rescrape the CC index independently would be very helpful under some legal theories about AI training. Independent search operators would benefit, too.


Common Crawl doesn't own the content in its crawl, so no, our terms of use do not grant anyone permission to ignore the actual content owner's license.

We carefully preserve robots permissions wherever they are expressed: in robots.txt, in HTTP headers, and in HTML meta tags.
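For readers unfamiliar with those three places, here is a minimal Python sketch of how a downstream user might read each one. It is not Common Crawl's code. The function name `crawl_signals`, the `ExampleBot` user agent, and the returned dictionary layout are all made up for illustration, and the `X-Robots-Tag` parsing is simplified (it ignores per-bot prefixes such as `googlebot: noindex`).

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class _MetaRobots(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            self.directives += _split_directives(attr.get("content", ""))


def _split_directives(value):
    # "noindex, nofollow" -> ["noindex", "nofollow"]
    return [part.strip().lower() for part in value.split(",") if part.strip()]


def crawl_signals(robots_txt, headers, html, url, user_agent="ExampleBot"):
    """Hypothetical helper: gather robots signals from all three sources."""
    # 1. robots.txt rules, parsed with the standard library.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    # 2. The X-Robots-Tag HTTP response header (header names are case-insensitive).
    header_value = next(
        (v for k, v in headers.items() if k.lower() == "x-robots-tag"), ""
    )

    # 3. The robots meta tag inside the HTML document.
    meta = _MetaRobots()
    meta.feed(html)

    return {
        "robots_txt_allows": parser.can_fetch(user_agent, url),
        "header_directives": _split_directives(header_value),
        "meta_directives": meta.directives,
    }
```

A caller that wants to honor the content owner's wishes would check all three results before using a page, since any one of them can carry a restriction the others do not.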

We do publish 2 different URL indexes, if you wanted to recrawl for some reason.
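As a rough illustration of recrawling from an index: to my understanding, one of the published indexes is a CDX-style index that can be queried over HTTP at index.commoncrawl.org and returns one JSON record per line. The sketch below only builds the query URL and parses a response body; it makes no network call. The crawl ID is a placeholder, the helper names are hypothetical, and the record fields should be checked against Common Crawl's own documentation before relying on them.

```python
import json
from urllib.parse import urlencode

# Assumed base address of the public CDX index server.
INDEX_SERVER = "https://index.commoncrawl.org"


def cdx_query_url(crawl_id, url_pattern):
    """Build a query URL for one crawl's index (crawl_id is a placeholder)."""
    query = urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_SERVER}/{crawl_id}-index?{query}"


def parse_cdx_lines(body):
    """Parse a JSON-lines response body into a list of dicts."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def urls_to_recrawl(records):
    """Extract the distinct URLs from parsed records, keeping first-seen order."""
    seen = {}
    for record in records:
        seen.setdefault(record["url"], None)
    return list(seen)
```

The resulting URL list is only a starting point. Each URL would still need to be fetched under the content owner's own terms and robots rules.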


I was talking about CC's Terms of Use, which say they apply to "Crawled Content." All our uses must comply with both the copyright owners' rules and CC's Terms. The CC terms are here for those curious:

https://commoncrawl.org/terms-of-use

In it, (a), (d), and (g) have had overly-political interpretations in many places. (h) is on Reddit, where just offering the Gospel of Jesus Christ got me hit with "harassment" once. The problem is whether what our model can be, or is, used for incurs liability under such a license. Also, it hardly seems "open" if we give up our autonomy and take on liability just to use it.

Publishing a crawl, or the URLs, under CC-0, CC-by, BSD, or Apache would make them usable without restrictions or any further legal analyses. Does CC have permissively-licensed crawls somewhere?

Btw, I brought up URLs because transferring crawled content may be a copyright violation in the U.S., but sharing URLs isn't. Are the URLs released under a permissive license that overrides the Terms of Use?

Alternatively, would Common Crawl simply change their Terms so that they don't apply to the Crawled Content and URL databases? And simply release them under a permissive license?


> Publishing a crawl, or the URLs, under CC-0, CC-by, BSD, or Apache would make them usable without restrictions or any further legal analyses.

This isn't true, and I can't imagine that any lawyer would agree with this statement. CCF does not have rights ownership of any of the bytes of our crawl, so we cannot grant you any rights for the bytes in our crawl. Nothing that we could say could have any relationship to this legal issue.


It's confusing to me that you say this. Your own organization claims in the Terms of Service that it has rights over the crawls, even restricting how they are used. Now, you are telling me you believe you have none and that no lawyer would consider this. If so, why are "Crawled Content" and restrictions on its use in your terms of service?

Very simply, if what you say is true, then you need to change your Terms to reflect that. You have two options:

1. Take crawled content out of the Terms of Service. Put a permissive license on the crawls.

2. Modify your Terms to say "crawled content" can be used for any purpose and distributed free with no restrictions. You currently impose extra restrictions, though.

That's contract law, maybe with copyright elements in it. Yet you also appear to believe your crawls aren't copyrightable. That's a huge unknown because collections are copyrightable when sufficient creativity is put into them:

https://en.m.wikipedia.org/wiki/Copyright_in_compilation

Many collections claim a copyright or have a permissive license for this reason. Again, simply saying your crawls and URL databases are permissively licensed would solve that problem. It takes just one edit on a few web pages.

If crawls and DBs are truly without restrictions, please put a permissive license on their respective pages. Also, please change your terms to put no restrictions on Crawled Content. Instead, they should say something like it's free to use and distribute with no warranty or liability on you. The usual stuff.

I'll emphasize again that a permissively-licensed list of all URLs you've crawled is one of the most valuable changes you could make.


It's not dead but will take a huge hit. I still use DuckDuckGo since I get good answers and good discovery, I'm taken right to the sources (which I can cite), and the search indexes are legal, unlike all the copyright infringement in AI training.

If AI training becomes totally legal, I will definitely start using them more in place of or to supplement search. Right now, I don't even use the AI answers.


You can see their panic: in my country they are running TV ads for Google search, showing it answering LLM-prompt-like queries. They are desperately trying to win back that mind share, and if they lose traditional keyword search too, they're cooked.


Which country is that?


Kagi seems to partially be that. Yes, it's really corporate, but with way better vibes than Google. SearXNG is a bit different but also a thing.


I think, even more spectacularly, we may be witnessing the feature-by-feature obsolescence of big tech.

Models make it cheap to replicate and perform what tech companies do. Their insurmountable moats are lowering as we speak.


Yep, it seems the big guys are running out of ideas, to some degree.


Pardon me for jumping in the discussion, but I didn't know where else to ask this. Does Iroh support streaming, instead of moving blobs? I want to write a little p2p tool to forward one port from one machine to another. Also, forwarding UDP packets doesn't require the congestion control of QUIC. Does Iroh allow disabling it for a certain "message" or stream?


Yes. Iroh itself provides direct QUIC connections. iroh-blobs is a protocol on top of iroh that provides content-addressed data transfer of BLAKE3 hashed data.

What you describe sounds like https://www.dumbpipe.dev/ , a tool/demo built on top of iroh to provide a bidirectional pipe across devices, somewhat like netcat.

Dumbpipe also has a mode where it listens on a port using TCP.

It sounds like you want to basically build dumbpipe for UDP. You can of course use a QUIC stream, but QUIC has an extension, which we support, to send datagrams: https://docs.rs/iroh/latest/iroh/endpoint/struct.Connection....

This basically allows you to opt out of QUIC streams, but you still do get TLS encryption.


It looks like they have examples with unreliable channels: https://github.com/n0-computer/iroh/tree/main/iroh/examples

You'll probably have to check the max packet size that you want to forward, because QUIC adds a bit of overhead.
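To make the size check concrete, here is a small Python sketch of the forwarding step with a size budget. It uses plain UDP sockets on both sides and does not touch iroh or QUIC at all, so treat it as an illustration of the logic only. The names `open_udp` and `forward_one` are hypothetical, and `max_datagram_size` is a value you would have to obtain from the real connection; the 1200 used in the usage note is just an example number.

```python
import socket


def open_udp(host="127.0.0.1", port=0, timeout=2.0):
    """Open a bound UDP socket; port 0 lets the OS pick a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    sock.settimeout(timeout)
    return sock


def forward_one(listen_sock, out_sock, target_addr, max_datagram_size, bufsize=65535):
    """Receive one UDP datagram and forward it if it fits the size budget.

    Returns the number of payload bytes forwarded, or 0 if the datagram
    was dropped for being too large.
    """
    payload, _sender = listen_sock.recvfrom(bufsize)
    if len(payload) > max_datagram_size:
        # Unreliable datagrams are not split for us, so an oversize
        # packet is dropped rather than forwarded.
        return 0
    out_sock.sendto(payload, target_addr)
    return len(payload)
```

Calling `forward_one(listener, out, target_addr, 1200)` in a loop gives a one-directional relay. Dropping oversize packets matches UDP's own best-effort semantics, but an application that needs larger messages would have to fragment them itself or use a stream instead.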

