Code generation tool StarCoder has received a massive update that could position it as a leading open source alternative to services like GitHub Copilot.
Initially released in May 2023 as part of a collaboration between Hugging Face and ServiceNow, the latest version, StarCoder 2, now also has major industry backing in the form of Nvidia.
The code generation tool helps developers by automating code completion, similar to GitHub Copilot or Amazon CodeWhisperer. It is also capable of summarizing existing code and generating original snippets.
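Under the hood, code completion in the StarCoder family works through "fill-in-the-middle" (FIM) prompting, where the model is given the code before and after the cursor and asked to generate what goes between. A minimal sketch of building such a prompt is below; the special token names are the ones published for the StarCoder tokenizer and are an assumption here, not something stated in this article.

```python
# A minimal sketch of how an editor plugin might build a fill-in-the-middle
# (FIM) prompt for a StarCoder-style model. The special token names below
# are the ones published for the StarCoder family; treat them as an
# assumption, not a confirmed detail of this release.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"

def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a FIM prompt: the model is asked to generate the code
    that belongs between `prefix` and `suffix`."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"

# Example: completing a function body while the signature and the code
# after the cursor are already written.
prompt = build_fim_prompt(
    prefix="def mean(xs):\n    ",
    suffix="\n    return total / len(xs)",
)
```

The assembled string would then be sent to the model, which generates the missing middle section until it emits an end-of-sequence token.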
StarCoder 2 is available in three different model sizes, each trained by a different member of the collaboration.
The smallest is a three-billion-parameter model trained by ServiceNow, while Hugging Face trained the seven-billion-parameter model.
Nvidia was responsible for the largest iteration of StarCoder 2 with a 15 billion parameter model built using its NeMo generative AI platform and trained on Nvidia's accelerated AI infrastructure.
All three StarCoder 2 models support a significantly expanded range of programming languages.
The original StarCoder was trained on more than 80 programming languages, while StarCoder 2 can generate code in 619.
StarCoder 2 is powered by the Stack v2 dataset, the largest open source dataset suitable for LLM pre-training, according to Hugging Face. The AI company said the new dataset is seven times larger than the original Stack v1.
Along with new training techniques, the trio believe this will help the models understand low-resource programming languages, mathematics, and discussions about source code.
The performance of each of the new LLMs has also greatly improved, with the three-billion-parameter StarCoder 2 matching the performance of Hugging Face's original 15-billion-parameter StarCoder model.
StarCoder 2 could be a game changer for developers
Rory Bathgate is Multimedia and Features Editor at ITPro, overseeing all in-depth content and case studies. He can also be found co-hosting the ITPro podcast with Jane McCallion, swapping a keyboard for a microphone to discuss the latest learnings with thought leaders from across the tech sector.
StarCoder 2 is a big step forward for open source AI code generation. By opening the door to competition within the open source community for the title of 'best AI pair programmer' and putting the spotlight on Meta's Code Llama, it has ensured that developers have a future of solid, open options to look forward to.
In the paper accompanying the release, the team behind StarCoder 2 presented evidence that the model can go toe-to-toe with Code Llama, even at the latter's largest size of 34 billion parameters.
In MBPP, a benchmark that measures coding models against approximately 1,000 entry-level Python programming problems, StarCoder 2's 15-billion-parameter model scored 66.2, versus 65.4 for the 34-billion-parameter Code Llama.
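MBPP problems pair a natural-language task description with assert-based test cases, and a model's completion counts as a pass only if every assert holds. The sketch below illustrates the shape of that check; it is not the official evaluation harness, and the problem shown is an invented example in the MBPP style.

```python
# Illustrative sketch of how an MBPP-style problem is scored: a task
# description plus assert-based tests, with a completion passing only if
# all asserts hold. Not the official harness -- just the shape of the check.
problem = {
    "text": "Write a function to find the minimum of two numbers.",
    "tests": [
        "assert min_of_two(1, 2) == 1",
        "assert min_of_two(-5, 3) == -5",
    ],
}

# Stand-in for a model-generated completion.
completion = "def min_of_two(a, b):\n    return a if a < b else b"

def passes(completion: str, tests: list[str]) -> bool:
    """Run the completion, then run each assert against its definitions."""
    scope: dict = {}
    try:
        exec(completion, scope)   # define the candidate function
        for t in tests:
            exec(t, scope)        # each test is an assert statement
        return True
    except Exception:
        return False

result = passes(completion, problem["tests"])
```

A benchmark score like 66.2 is then simply the percentage of problems whose completions pass all of their tests.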
The fact that the training data for StarCoder is openly available through the Stack will also be a relief for many organizations.
Future legal battles will be fought over who owns the data used to train AI, and any company that discovers its source code was generated using mined proprietary data could face a very difficult and expensive replacement process.
By contrast, StarCoder 2's openness is a crowning achievement. To credit the developers whose code formed the basis of StarCoder 2, users can run the tool's output through a dataset search on Hugging Face to identify whether the generated code is original or a verbatim copy from its immense training data.
Alternatively, teams can search the dataset freely.
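The core idea behind such a lookup can be sketched locally: normalize a snippet, fingerprint it, and check the fingerprint against an index of the training data. The real Stack search is a hosted Hugging Face service, so everything below, including the normalization scheme and the toy index, is a hypothetical illustration rather than how that service actually works.

```python
import hashlib

# Hypothetical sketch of a "is this snippet a verbatim copy?" check, in the
# spirit of the Stack dataset search described above. The real service is
# hosted by Hugging Face; this only shows the basic idea of normalizing a
# snippet and looking up its fingerprint in an index.
def fingerprint(code: str) -> str:
    # Strip per-line whitespace so trivial formatting changes don't hide a copy.
    normalized = "\n".join(line.strip() for line in code.strip().splitlines())
    return hashlib.sha256(normalized.encode()).hexdigest()

# A toy "training set" index of fingerprints.
training_index = {fingerprint("def add(a, b):\n    return a + b")}

def is_verbatim_copy(generated: str) -> bool:
    return fingerprint(generated) in training_index

# Same code with different indentation still matches after normalization;
# genuinely different code does not.
reindented = "def add(a, b):\n        return a + b"
novel = "def mul(a, b):\n    return a * b"
```

In practice an index like this would be built over the whole of Stack v2, but the lookup itself stays this simple.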
It is in the interest of all developers to have solid options like this on the market, as innovation and competition in the sector will only make the models more accurate. But the precedent that StarCoder 2 sets in terms of responsible creation of AI models through open source may be its lasting legacy.