The Core Problem with AI Code Assistants, and Why They Won’t Replace Developers


Since the advent of ChatGPT, large language models (LLMs) have advanced in leaps and bounds these last few years, to the point where you can throw practically anything at them and get something at least coherent in return. But with a new AI coding assistant coming out every half hour, whether it be Amazon Q, JetBrains AI Assistant, JetBrains Curie, GitHub Copilot, or TabNine, a popular idea being passed around is that the age of the developer is coming to an end, or, at the least, that these tools can aid developers and pick up the majority of their workload. In my view, this idea is a dangerous one, at least at the moment, due to how these tools work at their core and the crucial drawbacks that follow.

How these coding tools work

Most, if not all, of the current generation of “AI” code tools are LLMs at their core. While there are different flavours of LLMs and variations in training data that can change the quality of the output, the fundamental issue remains: an LLM isn’t up to the task. An LLM operates as a probability tool: it has a set of tokens, which could be words, smaller groups of letters, or the characters and keywords used in programming.

With a lot of training data and established patterns of tokens from that data, the model calculates the most likely next token. Simply put, it’s like predictive text, but running in a data centre instead of on a mobile phone, with a dataset covering a vast swathe of human knowledge and writing rather than just your typing habits.
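
To make the predictive-text analogy concrete, here is a deliberately toy sketch. A real model scores every token in a huge vocabulary with a neural network; this one just looks continuations up in a hard-coded frequency table, but the shape of the loop is the same: score the candidate continuations, pick a likely one, append it, and repeat.

// Toy sketch only: a real LLM scores every token in its vocabulary with a
// neural network; here the "model" is just a hard-coded frequency table.
const nextTokenCounts = {
    "public": { "static": 8, "void": 5, "class": 7 },
    "static": { "void": 9, "int": 3 },
    "void": { "main": 6, "setUp": 2 },
};

// Greedily pick the most frequent continuation seen in "training".
function predictNext(token) {
    const candidates = Object.entries(nextTokenCounts[token] ?? {});
    candidates.sort(([, a], [, b]) => b - a);
    return candidates.length > 0 ? candidates[0][0] : null;
}

// Generate by repeatedly appending the most likely next token.
const output = ["public"];
for (let i = 0; i < 3; i++) {
    const next = predictNext(output[output.length - 1]);
    if (next === null) break;
    output.push(next);
}
console.log(output.join(" ")); // "public static void main"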

When it comes to programming, the dataset typically includes publicly available repositories, answers on StackExchange, and documentation.* This means there is a wide range of data to train on. It includes code from various languages, frameworks, and patterns with varying levels of quality. Thanks to that, the tool can generate a response to almost any prompt, as there will always be something in the dataset from which to derive the probabilities.

* Concerns have been raised about whether some training data might include copyrighted materials without explicit permission.

Monkeys on Typewriters

On the surface, this sounds like the perfect formula for replacing a developer. However, the devil is in the details. LLMs are purely probability engines: they have no understanding of the code they generate, its syntax, or its purpose. It is akin to having one million monkeys typing on one million typewriters. Sure, one will eventually write Shakespeare, but it won’t understand it.

This lack of understanding means that LLMs can generate buggy code. They might produce code that won’t compile or, worse, code that is subtly flawed. This is compounded by the fact that the tool can’t truly debug code; it has no actual cognition with which to problem-solve. As a result, developers must debug the generated code and go over it with a fine-tooth comb to catch any subtle bugs. In some cases, this can make using these tools slower than writing the code by hand.

Spanglish as Code

I recently attempted to create a Qt application using a package that provided bindings for Golang. Since I had never written a Qt application, it was a slow process of two steps forward and one step back. During this process, I wanted to display an image on a button.

I saw in the Qt documentation a reference to a QPushButton::setIcon method, which would do what I wanted, but it hadn’t been included in the bindings I was using. I asked GitHub Copilot, JetBrains AI Assistant, or possibly another one of these tools to generate a workaround: I needed either to implement the binding manually or to call some of the lower-level methods in the library I was using.

Unfortunately, nothing it generated over the hours I spent prompt engineering worked. Most of the code it produced appeared to be raw Qt methods intended for C++ but dressed up in Golang syntax, so it often didn’t even compile. My assumption is that it got caught up in the reference to Qt; since what I was writing was relatively niche, there weren’t many examples for it to pull from.

This illustrated a couple of issues that have stuck with me regarding how I interact with these tools:

  • These tools will take a solution that worked in one programming language and write it more or less verbatim in another language’s syntax, with no concern for whether the methods or libraries they generate calls to even exist. This is likely even more common with older versions of evolving languages like Java, where the tool generates code incompatible with the older version even when you specify it.
  • There is no way to ensure that the full context of the prompt is adequately taken into account. The model can get “tunnel vision” on a particular point in the prompt, which ultimately poisons the entire session.

Promptly Taking Our Time

An example lengthy ChatGPT o1 prompt shared by OpenAI President Greg Brockman

If you ask most of these tools to generate, say, a test for a method in Java, what they spit out could be anything from a JUnit 4 test, to a JUnit 5 test with Mockito, to a Cucumber test, not to mention whichever Java version they pick. This is a problem in a professional context, where there are usually standards around library usage, style guides, and other conventions that a developer takes into account when writing code.

This highlights the need to learn prompt engineering, which is more challenging than it seems. The image above, tweeted by OpenAI president Greg Brockman, illustrates the ideal way to prompt ChatGPT o1. However, it’s concerning that even one of OpenAI’s most advanced models still requires explicit instructions not to generate false information—hardly a confidence-inspiring detail.

The level of detail and context needed for a simple hiking trail prompt raises questions about how lengthy an ideal prompt for a programming query might be. Programming tasks often require deep context, including standards, edge cases, and the application’s environment. At some point, writing out such an extensive prompt might feel more cumbersome than simply coding the solution by hand—just as searching for hiking trails in San Francisco might have been easier than crafting the perfect AI prompt.

Liability-Driven Development

Developers often focus on quality when writing code, considering both code smells and potential liabilities. They must ensure dependencies are secure, have compatible licenses, and don’t call vulnerable or deprecated methods. They also need to be mindful of attribution requirements when drawing on sources like StackExchange.

This complicates the use of these tools: it would take a behemoth of a prompt to pass all the linter rules, license policies, and the like to the model, and even then, it may still generate code that doesn’t cut the mustard. The main issue with the current tools, however, is the potential liability should there be any misappropriated code in their training sets. As far as I know, none of the cloud-based offerings guarantees its dataset is entirely free of unauthorized content; instead, some shift liability to the end user through their terms of service. Since this legal question remains untested, though cases are making their way through the courts, companies using these tools to generate code could face lawsuits.

This uncertainty makes it risky for companies to rely on AI-generated code without a thorough review. Until clear legal precedents are established, businesses should weigh the benefits of these tools against the potential legal and ethical risks.

Not a silver bullet, just another arrow in the quiver

That said, these tools are here to stay, and setting aside the legal concerns for a moment, it’s crucial to determine where they fit best and how to use them responsibly—leveraging their strengths without amplifying their drawbacks. In my view, their best use case is for small, well-defined tasks like data transformation, where correctness is easy to verify, and a developer could have written and reviewed the code themselves.

I had an example with this blog: I wanted to add tags, along with pages where posts could be filtered by tag, so I needed a map from each tag to the list of posts associated with it. Posts are represented like this:

{
	"title": "string",
	"description": "string",
	"pubDate": "date",
	"updatedDate": "date",
	"heroImage": "string",
	"tags": "list<string>"
}

And I wanted to transform that into:

[
	{
		"params": {
			"slug": "string" //the tag
		},
		"props": {
			"posts": "list<post>" //as in the post object above
		}
	}
]

Now, with some hard staring and time, I could have chucked together a method to convert this, but I decided to let ChatGPT generate it instead, and this is what it came up with:

const result = posts.reduce((acc, post) => {
    post.data.tags.forEach(tag => {
        // Find the tag group in the accumulator
        const tagGroup = acc.find(group => group.tag === tag);

        if (tagGroup) {
            // If the tag group exists, push the post to its 'posts' array
            tagGroup.posts.push(post);
        } else {
            // If the tag group doesn't exist, create a new one
            acc.push({tag: tag, posts: [post]});
        }
    });
    return acc;
}, []);

It wasn’t exactly what I needed, but then I also didn’t spend the time engineering a prompt to get exactly what I needed. If this blog weren’t statically generated and that code ran on clients, I’d be more concerned about the triply nested loop. That being said, it saved me the time of chucking it together myself; all I had to do was edit some details, and it probably saved me a few minutes.
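
As a rough illustration of the kind of edit involved (my own sketch, not the exact code this blog runs): grouping with a Map avoids the repeated find over the accumulator, and a final map step produces the params/props shape shown earlier.

// Rough sketch, not the exact code this blog uses: group posts by tag with a
// Map (avoiding the nested find), then shape each group as { params, props }.
// Assumes each post keeps its tags at post.data.tags, as in the generated snippet.
const postsByTag = new Map();

for (const post of posts) {
    for (const tag of post.data.tags) {
        if (!postsByTag.has(tag)) {
            postsByTag.set(tag, []);
        }
        postsByTag.get(tag).push(post);
    }
}

const result = [...postsByTag.entries()].map(([tag, taggedPosts]) => ({
    params: { slug: tag },        // the tag becomes the page slug
    props: { posts: taggedPosts } // every post carrying that tag
}));

Each post’s tags are visited exactly once, so the grouping stays linear in the number of tag occurrences instead of rescanning the accumulator on every iteration.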

Another potential use case for these tools is assisting with documentation—though I emphasize assisting. The same issues that affect code generation also apply to documentation, and incorrect documentation could be disastrous during an incident. However, since documentation is often an afterthought and a time-consuming task, having AI generate an initial draft—getting you 80% of the way there—can speed up the process. A thorough review afterwards ensures accuracy while still saving developers time.

Conclusion

While AI-powered coding assistants have made impressive strides, they remain fundamentally limited by their probabilistic nature. They can generate valuable snippets, automate repetitive tasks, and even suggest solutions to common problems, but they do not “understand” code like a human developer does. Their reliance on pattern recognition rather than true comprehension means they are prone to errors—some obvious, others dangerously subtle. Combined with the potential legal liabilities, this makes their use in a corporate context dubious at best for the moment.

For now, the idea that these tools can replace developers is misguided. Instead, they should be viewed as aids that can enhance productivity when used wisely but require careful oversight. Developers must still validate, debug, and refine the output to ensure quality and correctness. The more niche or complex the task, the more apparent these limitations become.

Ultimately, coding assistants are just that—assistants. They can accelerate development but cannot replace the experience, problem-solving skills, and intuition human developers bring. Until AI models evolve beyond statistical token prediction and develop a deeper understanding of programming logic, the role of a skilled developer remains irreplaceable.

#AI #Ramblings