Really? Isn't the takeaway rather that rent-seeking AI models need to figure out a way to reimburse the companies and communities that have stored up all this capital?
Seems to me SO built and delivered huge, huge amounts of value and it’s now all at risk because multibillion dollar companies are free riding.
Users on SO created value and freely shared it with a community in the expectation that the value they created would be freely and collectively shared with everyone. In SO's case this expectation was explicit; the data backup and API were billed as a deliberate choice designed to give users the freedom to migrate and scrape data in case the company went "evil." They were designed specifically to reduce SO's ownership claim over user-generated content.
It's not that SO has a moral right to control and profit from that content. The reality is that SO holding that content at all is a conditionally granted privilege that the community affords the site, and it is a privilege that was always designed to be revocable, with the data movable, if SO started abusing its position of power as a host by trying to lock down access.
Some writing/content sites have taken steps to restrict AI access based specifically on community request. That's a very different situation; if a community (particularly a closed or close-knit community) is collectively and (mostly) uniformly trying to avoid an AI scraping the content that they created, then good for them. There are communities online that are in that position. But "how will the company get reimbursed for our valuable asset" should not be part of that conversation. And SO in particular was set up around norms that deliberately allowed this kind of scraping. It's not their asset to protect.
> rent-seeking AI models
I have issues with modern AI economic models too, but I don't think "rent-seeking" is an accurate term to use here. A better word would probably be "parasitic"; I understand (and to some extent agree with) the argument that OpenAI is looking to repackage information it didn't create in a way that redirects attention away from the original source of that information.
But I'm having a really hard time figuring out how OpenAI is hoarding a scarce asset to extract value by controlling access to that asset. The more obvious rent-seeking behavior here is coming from SO, a company trying to restrict access to Creative Commons-licensed content created for free by unpaid volunteers, and trying to reclassify that content as their corporate property.
I guess, being as charitable as possible, I do worry about the SaaS model of many AIs that are dedicated to content generation, and I worry a little bit about AI models becoming heavily integrated into creative processes and then extracting a kind of monetary "creative tax" from artists/creators while heavily restricting what they are allowed to make. That's at least adjacent to rent-seeking, but I'm still not sure it's the term I would use, and I'm not convinced it's a scenario that's applicable here.
Good point that rent-seeking may not be the correct term right now, but it looks increasingly like services will have to lock down content or shut down because AI models are frontrunning them with their own content. In that world, the AI models are in a great rent-seeking position (i.e. only they have the [old] content, which was broadly available and now is not, due to their own incentive distortion).
In any case, I buy your argument with regard to SO's stewardship of this data, and certainly my intuition was that the major contributors are not super thrilled about their content being digested by models and spit out with no attribution, but that is absolutely an assumption on my part.
Would be interested to see a poll of those users on this question!
I do think if we were having this conversation about an explicitly community-owned forum or fanfic hosting service -- i.e., a scenario where it's obvious that the community is behind the decision -- my reaction would likely be very different. I'm broadly pretty sympathetic to a forum saying, "we're doing this for us, not for a VC firm."
SO in particular, though, is an interesting site in that its value proposition was very heavily based on this information being freely available and uncontrolled. I think they're in a position where it's much less appropriate for the site owners to try to clamp down on AI scraping.
If there is a strong movement from the SO community to change that, I'm not aware of it, but who knows, maybe I'm out of the loop.
Off the top of my head, another example of the distinction I'm getting at would be something like Wikipedia -- if Wikipedia's owners started trying to outright block site backups, my immediate response would be, "well, wait a second, that was not the deal we all made around this site; we signed up to help the Wikimedia Foundation build an open encyclopedia, even if that means it gets pulled into an AI dataset. We specifically didn't want the Wikimedia Foundation to have the power to decide what usage of this data they would allow or deny."
> but it looks increasingly like services will have to lock down content or shut down due to AI models frontrunning them with their own content.
This feels slightly off to me.
An LLM that was trained on job postings to categorize them isn't trying to produce job postings ( https://wfhmap.com/algorithm/ ); it's trying to do meaningful classification of bulk unstructured data.
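Just to make that distinction concrete, here's a minimal sketch of that kind of bulk classification. I'm using an off-the-shelf zero-shot model rather than anything purpose-trained, and the label set and sample posting are invented for illustration, not taken from the wfhmap project:

```python
# Hypothetical sketch: turn unstructured job postings into categories.
# Uses Hugging Face's zero-shot-classification pipeline; labels below
# are made up for the example.
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

posting = "Senior backend engineer; fully remote, US time zones preferred."
labels = ["remote", "hybrid", "on-site"]

result = classifier(posting, candidate_labels=labels)
print(result["labels"][0], result["scores"][0])  # best-guess category
```

The output is structure extracted from the postings, not new postings, so the model isn't competing with whoever hosts them.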
An LLM trained on Reddit is... weird to talk to, but talking to it doesn't replace asking a proper subreddit with people answering and commenting back and forth. Is ChatGPT stealing views from people complaining about their jobs in /r/antiwork? Going to something in /r/news, sorting by controversial, and grabbing some popcorn turns out to be much more interesting than ChatGPT ever will be.
Maybe you can say that ChatGPT with some training on Stack Exchange sites has some utility (and that the data being carefully classified, tagged, and annotated with feedback makes it even more useful), but GitHub Copilot was trained on just GitHub content, and it's better at code than pretending "try {some broken code}, hope that helps" is going to be useful to an LLM.
To me, this feels much more like CEOs who are having difficulty with existing monetization attempting to lock up the data they have, under a questionable pretext, in order to monetize it to companies looking to train models for other things.
What the rights are on the output of models is something that needs to be sorted out -- probably by the courts. I am still of the opinion that if something that might be copyrighted is used from any source, then the person doing the copying (who has agency) needs to do a license check themselves. I know there is GPL code on Stack Overflow that looks like it's licensed under CC BY-SA 4.0, and if you copied the SO answer and put it in a BSD-licensed repository, you'd be in violation of the GPL -- and that's without touching any LLM.
There are also lots of non-copyright things the data could be used for. I'd like to make an AI-CATegorizer: train it on a representative number of images from each of the Reddit cat subs so that someone can ask it "here's a picture of my cat, what subs can this be posted to" and get back "/r/airplaneears /r/blackcats /r/stealthbombers". That's not something that is potentially generating copyrighted content (though it inherently uses it)... pretending that those images were under a CC license, would it need to attribute all of the images that were part of the training data set in order to respond with those three subreddits?
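For what it's worth, the categorizer itself is easy to sketch; the open question is purely the licensing one. A minimal version might look like this, assuming a recent torchvision and a hypothetical data/<subreddit>/ folder of scraped images per sub (all folder names and file paths here are illustrative, not real):

```python
# Hypothetical AI-CATegorizer: fine-tune a pretrained classifier on
# images sorted into one folder per cat subreddit, then report the top
# subs for a new photo. All paths and folder names are assumptions.
import torch
from torch import nn
from torchvision import datasets, models, transforms
from torchvision.datasets.folder import default_loader

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# Expects e.g. data/airplaneears/*.jpg, data/blackcats/*.jpg, ...
train_set = datasets.ImageFolder("data", transform=preprocess)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, len(train_set.classes))

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

model.train()
for _ in range(3):  # a few epochs, enough for a toy demo
    for images, labels in loader:
        optimizer.zero_grad()
        loss_fn(model(images), labels).backward()
        optimizer.step()

# "Here's a picture of my cat, what subs can this be posted to?"
model.eval()
with torch.no_grad():
    img = preprocess(default_loader("my_cat.jpg")).unsqueeze(0)
    top = model(img).softmax(dim=1).topk(3)
print([f"/r/{train_set.classes[int(i)]}" for i in top.indices[0]])
```

Note that the output is three subreddit names, not any image, which is exactly why the attribution question is so murky.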
There are a couple of different motivations a company could have around blocking API access to prevent AI scraping:
A) scraping itself is too expensive. I suspect that's probably not the case with SO, because they blocked the backups too; downloading the database from the Internet Archive doesn't cost SO any money.
B) the AI is going to replace the original creators (or more likely, devalue their work and push wages lower) and they'd like to prevent that negative social consequence. This is the charitable interpretation, and I understand writers/programmers/artists being concerned about it, even if I'm slightly more cynical myself about how AI content generation is going to work out once the "shine" has worn off. Note that I'm not saying that this concern is necessarily right or that there aren't positive uses of AI that have nothing to do with replacing jobs; just that it's a concern that a site/community could reasonably have.
C) companies are realizing that there's a lot of VC money in AI right now, and they would very much like to be in the business of selling shovels, and their feeling is that if anyone anywhere is making money off of "their" content then they are morally deserving of some kind of cut no matter what. This is obviously the case for some companies, but is (charitably) probably not the case for all of them.
One test we could use to try and distinguish between B and C is -- if a company is blocking API access, are they then turning around and licensing that data or opening up paid API access, and if they are, is any of that money going to the users that made the content? If SO turns around and makes API access paid and continues to not pay any of the volunteers writing answers, at that point it's much easier to argue that they're trying to sell shovels, not trying to protect users.
This is also part of why I take a cynical view of what Reddit is doing with its API (although Reddit claims they're in camp A more than B). Reddit is probably not doing this to protect its users from theoretical AI displacement because it's still planning to license the data. It's just pricing it so high that only giant companies would be able to afford it.
> Stack Overflow senior leadership is working on a strategy to protect Stack Overflow data from being misused by companies building LLMs. While working on this strategy, we decided to stop the dump until we could put guardrails in place.
> We are working on setting up the infrastructure to do this correctly in the age of LLMs --- where we continue to be open and share the data with our developer community but work to set up a formal framework for large AI companies that want to leverage the data.
> We are looking for ways to gate access to the Dump, APIs, and SEDE, that will allow individuals access to the data while preventing misuse by organizations looking to profit from the work of our community. We are working to design and implement appropriate safeguards and still sorting out the details and timelines. We will provide regular updates on our progress to this group.
---
On Reddit, for the A/C test... yeah, I'm going to be cynical there: they're looking to sell the information (it isn't so much about trying to protect users). But there's also the fact that third-party clients don't show ads and may be poorly behaved when provided a free API with what was (at one time) generous rate limiting.
> First, I'd like to say that the intent of what Prashanth is saying is very simple: to return value to the community for the work that you have put in. The money that we raise from charging these huge companies that have billions of dollars on their balance sheet will be used for projects that directly benefit the community.
This is worded very specifically. Is SO planning to give money to users? They don't say anything like that; instead they say that they'll be "spending that money on the platform."
Well, what does that actually mean? Every feature SO builds could be characterized as "for the benefit of the community." It's hard not to read that response as just another way of saying "we're going to profit from this as a company, but don't worry, because we use our profits to fund product development."
Heck, Reddit could make exactly the same claim, and in fact the linked Wired article makes that comparison:
> "Community platforms that fuel LLMs absolutely should be compensated for their contributions so that companies like us can reinvest back into our communities to continue to make them thrive," Stack Overflow’s Chandrasekar says. "We're very supportive of Reddit’s approach."
It seems more likely that there, too, it was the users doing the work, not SO.
The primary value on SO is generated by the users, and thus the value proposition that entices new users is also generated by the users. SO is just a forum.
I see where you're coming from calling out AI data miners for rent-seeking, but most social media platforms are also engaging in rent-seeking behavior.