GitHub Copilot: Can this “AI programmer” really improve developer productivity?


Image: Hinterhaus Productions/GETTY

GitHub has published a study showing that use of its recently released Copilot code-completion tool correlates with improved developer productivity.

GitHub Copilot, an AI pair programming service, was made available a month ago for $10 per user per month or $100 per user per year.

It is an extension to Microsoft’s Visual Studio Code editor that suggests code that developers can accept, reject, or modify. The suggestions are generated by OpenAI’s Codex AI model, a descendant of GPT-3, which was trained on billions of lines of publicly available source code, including code published on GitHub.

SEE: Want a happy team at work? Make sure you don’t forget this essential ingredient

Copilot has caused some controversy because not all developers are happy that their code was used to train it. But now GitHub has published a study aimed at testing its theory that Copilot raises developer productivity.

Its researchers analyzed 2,631 survey responses from developers using Copilot and compared their responses to metrics collected from the IDE (integrated development environment). Part of the challenge was finding the best way to measure the effect of Copilot on developer productivity.

“We find that the acceptance rate of displayed suggestions is a better predictor of perceived productivity than alternative measures,” explain the authors.

In other words, GitHub is attempting to define how its service’s impact on developer productivity should be measured.

GitHub’s measurement method differs from that of a non-GitHub study published in April, which assessed Copilot’s impact on developer productivity by timing repetitive tasks. Its authors concluded that Copilot did not necessarily improve task completion time or success rate, but most of its 24 participants preferred using Copilot because it often provided a useful starting point and saved them the effort of searching online.

One of the authors of the GitHub study, Albert Ziegler, likens the service to “a pair programmer with a calculator attached”: not brilliant at the tricky stuff, but reliable enough to close all the brackets in the right order. He challenges the idea that developers simply want to boost productivity by reducing online searches for code snippets to reuse.

“But the word ‘productivity’ in development holds a wide range of possible practical meanings. Do developers ideally want to save keystrokes or avoid Google and StackOverflow searches?” Ziegler asks in a blog post. “Should GitHub Copilot help them stay in the flow by giving them very specific solutions on calculator-like mechanical tasks? Or should it inspire them with speculative stubs that might help unblock them when they get stuck?”

The three key questions that the GitHub study posed to developers were:

  1. Do people feel that GitHub Copilot makes them more productive?
  2. Is this sentiment reflected in objective usage metrics?
  3. Which usage metrics best reflect this sentiment?

Ziegler notes that the study results show that Copilot “is correlated with improved developer productivity.” The strongest link was found by dividing the number of accepted suggestions by the number of suggestions shown.

“This acceptance rate reflects the number of code suggestions produced by GitHub Copilot that are deemed promising enough to be accepted,” he notes.
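As a minimal sketch of what this metric computes, the following snippet divides acceptances by suggestions shown. The event format and field names here are hypothetical, not GitHub's actual telemetry schema.

```python
# Hypothetical sketch of the acceptance-rate metric: acceptances
# normalized by the number of completions shown to the developer.

def acceptance_rate(events):
    """Return accepted suggestions divided by suggestions shown."""
    shown = sum(1 for e in events if e["type"] == "shown")
    accepted = sum(1 for e in events if e["type"] == "accepted")
    return accepted / shown if shown else 0.0

# Example session: four suggestions shown, two accepted.
events = [
    {"type": "shown"}, {"type": "accepted"},
    {"type": "shown"},
    {"type": "shown"}, {"type": "accepted"},
    {"type": "shown"},
]
print(acceptance_rate(events))  # 2 / 4 = 0.5
```

In the study, this simple ratio turned out to track self-reported productivity better than more elaborate measures.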

Additionally, developers who report the highest productivity gains with Copilot also accept the highest proportion of the code suggestions shown to them.

The study revealed different levels of acceptance rates for different languages.

“We are aware that there are significant differences in the performance of GitHub Copilot for different programming languages,” note the GitHub authors. “The most common languages among our user base are TypeScript (24.7% of all completions viewed in the observed timeframe, 21.9% for survey users), JavaScript (21.3%, 24.2%) and Python (14.1%, 14.5%). The latter two enjoy higher acceptance rates, perhaps suggesting a relative strength of neural tooling over deductive tooling for untyped languages.”

SEE: Six ways to stay productive when working remotely

The GitHub authors also note that their persistence metrics, which measure how much suggested code is retained over time, were not aligned with reported productivity.

“In common with previous work, we collected metrics on completion acceptance, but we also developed persistence metrics. This was based on the idea that, ideally, a developer should need to take no further actions after accepting a longer completion, such as deleting or correcting an erroneous one.

“We were surprised to find that acceptance rate (number of acceptances normalized by number of completions posted) correlated better with reported productivity than our measures of persistence.”

However, they argue that the weak link with persistence makes sense, because the value of Copilot lies not in how many lines of code it automates correctly, but in giving users a model to modify.

“But in hindsight, it makes sense. Coding is not typing, and the core value of GitHub Copilot is not in helping the user type as many lines of code as possible. Instead, it’s about helping the user make the best progress towards their goals,” they note.

“A suggestion that serves as a useful model to tinker with may be as good or better than a perfectly fine (but obvious) line of code that only saves the user a few keystrokes. This suggests that a narrow focus on the accuracy of the suggestions would not tell the whole story of this type of tooling.”
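To illustrate the contrast with acceptance rate, a persistence-style metric might ask how much of an accepted completion is still present in the file after later edits. The sketch below is an assumption for illustration, not GitHub's actual measurement; it uses a generic string-similarity ratio as a stand-in.

```python
import difflib

# Hypothetical sketch of a persistence metric: compare the text of an
# accepted completion against the corresponding region of the file some
# time later. A ratio of 1.0 means the suggestion was kept verbatim;
# lower values mean the developer edited or replaced it.

def persistence(accepted_text, text_later):
    """Return a similarity ratio in [0, 1] between the accepted
    completion and the same region of code observed later."""
    return difflib.SequenceMatcher(None, accepted_text, text_later).ratio()

print(persistence("for i in range(10):", "for i in range(10):"))  # 1.0, kept verbatim
print(persistence("for i in range(10):", "for j in range(20):"))  # < 1.0, partly edited
```

Under the study's finding, even a suggestion that scores low on a measure like this can still have been valuable as a starting point to tinker with, which is why acceptance rate tracked reported productivity better.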

