The (Possible) AI Training Dilemma

AI models, specifically Large Language Models (LLMs), are trained on very large datasets. The model behind OpenAI’s ChatGPT reportedly consists of eight expert models with roughly 220 billion parameters each, stacked together like a tower made of Lego bricks. In essence, AI companies train their models on most of the publicly available text on the Internet – books, articles, discussions, and source code. In other words: human-generated content (as most of the content on the public Internet is human-generated).

So far, so good. But what do you do when AI itself generates more and more of the content on the Internet? Case in point: Stack Overflow. The site is the de facto standard Q&A site for programmers – software engineers worldwide flock to Stack Overflow to find solutions for gnarly coding problems. Coding AIs, such as Microsoft’s GitHub Copilot, and the code-generating features in popular LLMs, such as ChatGPT, are trained on the rich content generated by tens of thousands of volunteers. Stack Overflow experienced a 14% decline in traffic in March this year, a trend that continues to hold as more and more developers shun the site and rely on their AI assistants instead.

With fewer programmers caring to ask and, just as importantly, answer questions, the body of new knowledge AIs can be trained on diminishes. In turn, AI’s knowledge becomes frozen in time.

It gets worse. As if not having new data to train models on weren’t bad enough, AIs trained on their own output have been shown to degrade rapidly in quality. When the Internet is flooded with AI-generated content, which, judging from the early signals, isn’t far away, models will train on this content and exhibit irreversible defects that gradually worsen across generations. The effect resembles the weakening of a species’ gene pool through inbreeding.
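To make the feedback loop concrete, here is a toy sketch in Python. It is not how LLMs are trained: a simple bell curve (a mean and a spread) stands in for the "model," and each generation is fitted only to samples drawn from the previous generation. The function name simulate_collapse and all of its parameters are hypothetical, chosen purely for illustration.

```python
import random
import statistics


def simulate_collapse(generations=300, sample_size=20, seed=0):
    """Toy model: refit a Gaussian to its own samples, generation after generation.

    Assumption: the Gaussian is only a stand-in for a real model. Returns the
    fitted standard deviation after each generation (index 0 is the original,
    "human" data distribution).
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0: the original data distribution
    history = [std]
    for _ in range(generations):
        # The current model produces the "content" the next model is trained on.
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        # The next model is fitted to that generated content only.
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)
        history.append(std)
    return history


if __name__ == "__main__":
    spread = simulate_collapse()
    for generation in (0, 50, 100, 200, 300):
        print(f"generation {generation:3d}: spread = {spread[generation]:.6f}")
```

In this sketch, the spread tends to shrink from one generation to the next: every fit is made from a small, slightly too narrow sample, the rare values in the tails disappear first, and nothing in the loop brings them back. That is the toy version of the degradation described above.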

Beyond openly AI-generated content, which seems to be becoming the norm, humans who are specifically paid to do “human work” are starting to figure out that they can cheat and have AIs do their work – with the result that even content we assumed was human-generated comes from an AI.

This is not to say that this future is inevitable, but avoiding it requires attention and effort. (via Pascal)

radical Insights.

Weekly Research and Commentary on the Future of Business and Technology.
