Elon Musk Claims All Human Data for AI Training Is 'Exhausted'

The tech boss advocates a shift to self-learning on AI-generated 'synthetic' data, though some experts caution this could lead to a phenomenon known as 'model collapse'.

Elon Musk has said that artificial intelligence companies have run out of data for training their models and have, in effect, exhausted the sum of human knowledge. The world's wealthiest person said technology firms may have to turn to "synthetic" data (material generated by AI models themselves) to build and refine new systems, a shift already under way in the fast-developing field.

“The cumulative sum of human knowledge has been exhausted in AI training. That happened basically last year,” said Musk, who launched his own AI business, xAI, in 2023.

AI models, including the GPT-4o model underlying the ChatGPT chatbot, are trained on vast amounts of data taken from the internet. In the process they learn to spot patterns in that data, which lets them, for example, predict the next word in a sentence.
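As an illustration of this idea (a toy sketch only, not GPT-4o's actual training pipeline, which is proprietary and neural-network based), a simple bigram model in Python shows how counting patterns in text yields next-word predictions:

```python
from collections import Counter, defaultdict

# Toy bigram "language model": count which word follows which in a tiny
# corpus, then predict the most frequent successor. Real models learn
# these patterns with neural networks over internet-scale datasets.
corpus = (
    "the cat sat on the mat "
    "the dog sat on the rug "
    "the cat chased the dog"
).split()

# successors[w] counts every word observed immediately after w
successors = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    successors[current_word][next_word] += 1

def predict_next(word: str) -> str:
    """Return the most frequently observed successor of `word`."""
    counts = successors.get(word)
    return counts.most_common(1)[0][0] if counts else "<unknown>"

print(predict_next("sat"))  # -> "on" (always follows "sat" here)
print(predict_next("the"))  # -> "cat" ("cat" and "dog" tie; first seen wins)
```

Real systems replace these counts with billions of learned parameters, but the objective, predicting the next token from context, is the same.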


In a recent interview broadcast on his social media platform, X, Musk said the only way to address the shortage of source material for training new models is to move to synthetic data generated by AI.

Mentioning the exhaustion of data troves, he said: “The only way to then supplement that is with synthetic data where … it will sort of write an essay or come up with a thesis and then will grade itself and … go through this process of self-learning.”
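In rough outline, the loop Musk describes resembles what researchers call self-improvement training. The sketch below is hypothetical: generate_essay, self_grade and finetune are stand-in placeholders, not any real training API; an actual system would involve an LLM generator, a learned reward or critic model, and gradient-based fine-tuning:

```python
import random

def generate_essay(model: list[str]) -> str:
    """Placeholder generator: pretend the model writes an essay."""
    return f"essay-{random.randint(0, 9999)}"

def self_grade(essay: str) -> float:
    """Placeholder self-grader returning a score in [0, 1)."""
    return random.random()

def finetune(model: list[str], accepted: list[str]) -> list[str]:
    """Placeholder update: fold accepted outputs back into the 'model'."""
    return model + accepted

model: list[str] = []
for generation in range(3):
    candidates = [generate_essay(model) for _ in range(5)]
    # The risky step: the model keeps only what it rates highly, so if
    # the grader shares the generator's blind spots, errors compound.
    accepted = [essay for essay in candidates if self_grade(essay) > 0.7]
    model = finetune(model, accepted)
    print(f"generation {generation}: kept {len(accepted)} of 5 essays")
```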

Meta, the parent company of Facebook and Instagram, has employed synthetic data to enhance its prominent Llama AI model, while Microsoft has similarly utilized AI-generated content for its Phi-4 model. Additionally, both Google and OpenAI, the organization responsible for ChatGPT, have incorporated synthetic data into their artificial intelligence initiatives.

Nevertheless, Musk cautioned that AI models' propensity to produce "hallucinations" (erroneous or nonsensical outputs) poses a significant risk to the synthetic-data approach.

During a livestreamed discussion with Mark Penn, the chairman of the advertising firm Stagwell, Musk remarked that hallucinations have rendered the use of artificial data "challenging," questioning how one can discern whether an output is a hallucination or a legitimate response.
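One heuristic explored in the research literature for the question Musk raises is self-consistency checking: sample the same question several times and treat disagreement among the answers as a warning sign. A toy sketch, where ask_model is a hypothetical stand-in for querying a model with sampling enabled:

```python
import random
from collections import Counter

def ask_model(question: str) -> str:
    """Placeholder: a real system would sample an LLM's answer here."""
    return random.choice(["Paris", "Paris", "Paris", "Lyon"])

def agreement(question: str, n_samples: int = 10) -> float:
    """Fraction of sampled answers matching the most common answer."""
    answers = Counter(ask_model(question) for _ in range(n_samples))
    return answers.most_common(1)[0][1] / n_samples

score = agreement("What is the capital of France?")
verdict = "likely consistent" if score > 0.8 else "possible hallucination"
print(f"agreement = {score:.0%} -> {verdict}")
```

Agreement is no guarantee of truth, of course: a model can be consistently wrong, which is part of why Musk calls the problem "challenging".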

Andrew Duncan, who serves as the director of foundational AI at the Alan Turing Institute in the UK, remarked that Musk's statement aligns with a recent scholarly article suggesting that the pool of publicly accessible data for AI models may be depleted as early as 2026. He further cautioned that excessive dependence on synthetic data could lead to "model collapse," a phenomenon characterized by a decline in the quality of model outputs.

“When you start to feed a model synthetic stuff you start to get diminishing returns,” he said, with the risk that output is biased and lacking in creativity.
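The diminishing-returns dynamic Duncan describes can be seen in a toy experiment: fit a simple statistical model to data, draw "synthetic" samples from the fit, refit on those samples, and repeat. Under these assumptions (a Gaussian refitted from sample mean and standard deviation), estimation noise compounds across generations, a crude analogue of a model losing diversity when trained on its own output:

```python
import random
import statistics

# Toy model-collapse experiment: refit a Gaussian, generation after
# generation, to samples drawn from the previous generation's fit.
# Estimation noise compounds, and the fitted spread tends to drift
# toward zero over many generations, so the "model" gradually loses
# the diversity of the original data. A sketch of the effect only;
# no claim about any specific real-world system.
random.seed(0)
SAMPLES_PER_GENERATION = 20

mu, sigma = 0.0, 1.0  # generation 0: the "real" data distribution
for generation in range(101):
    if generation % 20 == 0:
        print(f"gen {generation:3d}: fitted std = {sigma:.4f}")
    synthetic = [random.gauss(mu, sigma) for _ in range(SAMPLES_PER_GENERATION)]
    mu = statistics.fmean(synthetic)   # refit using synthetic data only
    sigma = statistics.stdev(synthetic)
```

The fitted standard deviation wanders rather than falling smoothly, but over enough generations it tends to shrink: the same qualitative loss of diversity that the model-collapse literature warns about.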


Duncan noted that the increase in AI-generated content available on the internet may lead to this material being incorporated into AI training datasets.

The management of high-quality data and the associated rights has emerged as a significant legal issue amid the rapid expansion of artificial intelligence. OpenAI acknowledged last year that developing tools like ChatGPT would not be feasible without utilizing copyrighted materials, prompting calls from the creative sectors and publishers for remuneration for the inclusion of their works in the training of these models.


Musk's proposed shift to synthetic data raises serious concerns. It may offer a temporary fix, but the risk of model collapse cannot be ignored: models trained on AI-generated content can slide into diminishing returns, producing outputs that grow more biased and less creative.

As synthetic data becomes more prevalent, and as AI-generated material on the open internet seeps back into training datasets, the quality of AI models may decline, making it ever harder to guarantee the accuracy and originality of future systems.