AI's Next Lesson: Who Will Teach the Teacher?

Danny Briskin, Quality Engineering Practice Manager

Introduction

As more people adopt Artificial Intelligence (AI) tools, fewer are turning to services built on direct human expertise. A very noticeable example is the Stack Overflow website, which is much less active now. This trend poses significant challenges for the continued improvement of AI and its ability to provide dependable results. It also jeopardizes easy access to accurate technical knowledge.

From Human Communities to AI Assistants

To understand this shift, we first examine why Stack Overflow’s use has decreased. Traditionally, Stack Overflow operated as a key online community where users could ask IT-related questions and usually receive quick responses from professionals. Many experts willingly shared their knowledge there at no cost, often motivated by a system of reputation points earned through helpful contributions. While users needed to register and follow strict posting rules, the quality of answers often justified the effort.

Today, Large Language Models (LLMs) like ChatGPT offer a simpler way to get answers immediately. Users can refine questions with more details and explore different answers interactively, a flexibility not present on Stack Overflow. Furthermore, Stack Overflow users sometimes faced negative feedback from the community for questions perceived as low-quality or redundant. LLMs, in contrast, offer a judgment-free interaction.

AI tools have become effective aids for software engineers. For instance, LLMs can help generate initial drafts for requirements, user stories, system architecture, or test cases. Additionally, many code assistance tools integrated into Integrated Development Environments (IDEs) use AI to increase developer productivity. Overall, LLMs provide quick and often dependable support when needed.
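To make the drafting use case concrete, below is a minimal sketch of asking an LLM to propose test cases through an API. It assumes the OpenAI Python client and an API key in the environment; the model name and prompts are illustrative placeholders rather than a recommended setup.

    # Minimal sketch: asking an LLM to draft test cases for review by an engineer.
    # Assumes the OpenAI Python client and OPENAI_API_KEY in the environment;
    # the model name and prompts are illustrative placeholders.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any chat-capable model could be used
        messages=[
            {"role": "system",
             "content": "You are a QA engineer drafting pytest test cases."},
            {"role": "user",
             "content": "Draft test cases for a function that validates email addresses."},
        ],
    )

    print(response.choices[0].message.content)  # the draft still needs human review

The point of such a sketch is the workflow, not the specific vendor: the model produces a first draft quickly, and the engineer remains responsible for verifying it.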

AI’s Reliance on Human-Generated Knowledge

This situation raises an important question regarding potential downsides. The effectiveness of LLMs, especially in software engineering, is not accidental. These AI models achieved their current capabilities by being trained on extensive and specific datasets. A significant portion of this training data, particularly for coding tasks, was sourced directly from platforms like Stack Overflow and the vast amounts of human-written code available in public repositories.

Potential Effects of a Shrinking Knowledge Base

If the trend of declining use of human-to-human Q&A websites continues, the publicly available collection of expert knowledge, essential for AI training, will significantly decrease. While human expertise will still exist, its accessibility for training AI models will be reduced.

The absence of active platforms like Stack Overflow could hinder the ability of future AI versions to learn about new developments. Some might suggest that LLMs can infer new information from existing data. However, LLMs are designed to generate responses even with limited input. If fresh, relevant data is scarce, an LLM might produce an answer based on its older training. Such an answer could sound convincing but be inaccurate or inapplicable, a phenomenon known as “hallucination,” which is difficult to prevent.

LLMs do not possess human-like understanding; they identify patterns and combine information from their training data, which can sometimes produce seemingly new outputs. However, their capacity to provide correct and useful information for new or fast-changing technical areas depends heavily on up-to-date and varied training data. Without new information reflecting current developments, LLM-generated solutions might be based on old or incomplete patterns.

Consider a scenario where a new version of a widely used programming language or database system is released. Without ongoing human discussion and problem-solving on public platforms, AI models might provide information relevant only to older versions or, worse, generate incorrect solutions. This raises the question of how AI will learn about these new technologies.
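As a small, real-world illustration of that risk: structural pattern matching was added in Python 3.10, and a model trained mostly on older code could keep proposing the pre-3.10 style even where the newer construct is clearer. The snippet below shows both forms side by side; the function itself is invented purely for illustration.

    # Illustrative only: match/case exists only from Python 3.10 onward.
    # A model trained largely on pre-3.10 code may never suggest it.

    def describe_status_old(code: int) -> str:
        # The style an older training corpus overwhelmingly contains
        if code == 200:
            return "OK"
        elif code == 404:
            return "Not Found"
        else:
            return "Unknown"

    def describe_status_new(code: int) -> str:
        # The equivalent using structural pattern matching (Python 3.10+)
        match code:
            case 200:
                return "OK"
            case 404:
                return "Not Found"
            case _:
                return "Unknown"

Neither version is wrong, but a model that has never seen the newer construct cannot recommend it, and the gap only widens as releases accumulate without fresh public discussion.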

Furthermore, while popular technologies like Java or Python have large communities generating vast amounts of data, niche tools have much smaller user bases and fewer experts. The challenge of training AI effectively for these less common technologies becomes even greater with a reduced flow of new human-generated knowledge.

Difficulties in Keeping AI Knowledge Current

Recently, efforts have begun to update existing LLMs by fine-tuning them with new information and identifying outdated content. However, this process is complex. While it’s possible to correct general knowledge gaps, such as updating an LLM with the name of a newly elected official, addressing highly specialized technical subjects is much harder. It raises questions about the number of experts required to continually update AI across all fields of knowledge.
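For a sense of what such an update involves, the sketch below fine-tunes a small open model on a handful of invented Q&A pairs using the Hugging Face transformers and datasets libraries. The model name, hyperparameters, and data are placeholders; a real update would require far larger volumes of curated, expert-written material, which is precisely the bottleneck described above.

    # Minimal fine-tuning sketch, assuming the Hugging Face transformers and
    # datasets libraries. Model, hyperparameters, and data are placeholders.
    from datasets import Dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    model_name = "gpt2"  # stand-in for whatever base model is being updated
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Hypothetical expert-written Q&A pairs about a new framework release
    examples = [
        "Q: How do I enable the new async driver in version 5.0?\n"
        "A: Set driver='async' in the connection config.",
        "Q: Is the legacy XML config still supported in 5.0?\n"
        "A: No, it was removed; use the YAML format instead.",
    ]
    dataset = Dataset.from_dict({"text": examples}).map(
        lambda row: tokenizer(row["text"], truncation=True, max_length=256),
    )

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="ft-out", num_train_epochs=1,
                               per_device_train_batch_size=1),
        train_dataset=dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()

The mechanics are straightforward; the hard part is everything the snippet takes for granted, namely a steady supply of accurate, current, expert-authored examples for every technology the model is expected to cover.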

The community environment of platforms like Stack Overflow encouraged knowledge sharing. Motivations such as helping others and gaining recognition led to the organic creation of a large, high-quality dataset. This type of motivation is less apparent in human-AI interactions. Engineers may use AI for their own tasks, but widespread, voluntary contributions to train or correct AI systems are unlikely without clear rewards or a sense of community. This loss of intrinsic motivation weakens a key method for obtaining ongoing training data.

Furthermore, some LLM developers believe that most easily available public text data suitable for training has already been used. To improve AI further, it might be necessary to use real-time information or private data sources. For specific areas like software engineering, obtaining up-to-date, practical knowledge could mean analyzing private code, internal documents, or even developer activities. Such methods present major challenges regarding cost, privacy, ethics, and scalability, potentially making LLM usage too expensive.

Conclusion

The ease of using AI tools like LLMs is unintentionally causing a decrease in the use of important human-based knowledge platforms like Stack Overflow. This situation presents a key problem: the AI systems that depend on these platforms are also contributing to the loss of their essential data sources. This can lead to serious outcomes, such as AI development slowing down, more instances of incorrect or fabricated AI-generated information, specific problems keeping AI updated on new and specialized technologies, and major challenges in finding practical, ethical, and affordable ways to refresh AI knowledge.

This highlights the ongoing importance of promoting mutual support among software engineers, as their collective expertise will remain an indispensable resource. While AI offers unprecedented capabilities, the dynamic, evolving nature of human problem-solving and collaborative learning remains unique. Ensuring that this human element continues to thrive and contribute to the collective knowledge pool will be essential for navigating the complexities of future technological development.