IBM's AI research division has released Project CodeNet, a dataset of 14 million code samples, to support the development of machine learning models that aid programming tasks. The name CodeNet is inspired by ImageNet, a well-known dataset of labelled photos that contributed significantly to the development of computer vision and deep learning.

CodeNet is an open-source dataset that can be used to train machine learning models for a variety of tasks and to advance the ability of artificial intelligence (AI) to understand and translate code. CodeNet’s creators describe it as a “very large scale, diverse, and high-quality dataset to accelerate the algorithmic advances in AI for Code.”

“We find ourselves in a new age where it’s essential to take advantage of today’s powerful technologies like artificial intelligence (AI) and hybrid cloud to create new solutions that can modernize processes across the information technologies (IT) pipeline,” Ruchir Puri, chief scientist at IBM Research, wrote in a blog post. “Project CodeNet specifically can drive algorithmic innovation to extract this context with sequence-to-sequence models, just like what we have applied in human languages, to make a more significant dent in machine understanding of code as opposed to machine processing of code.”

The dataset contains 14 million code samples, totaling about 500 million lines of code in more than 50 programming languages, including C++, Java, Python, Go, COBOL, Pascal and FORTRAN. The samples are submissions to about 4,000 coding challenges posted on the online coding platforms AIZU and AtCoder, and they include both correct and incorrect solutions. CodeNet also has high-quality metadata and annotations, along with sample inputs and outputs that can help researchers capture a program's intent when translating from one programming language to another.

“Our team is excited to give researchers and developers a dataset and a set of technologies that is easy to use and understand, while simultaneously assisting in the development of algorithms that will advance AI for code. With Project CodeNet, we hope to produce lasting business value as enterprises embark on their IT modernization journeys,” Puri wrote. 

While the size of the dataset and the diversity of its languages are impressive, its metadata is the most notable feature. The rich annotations make CodeNet suitable for a diverse set of tasks, unlike other code datasets that serve only specific programming tasks.

Because it spans many programming languages, the dataset can be used to train machine learning models that translate code from one language to another. Researchers say other potential uses include code search and clone detection, automatic code correction, and regression studies and prediction.
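
To make the code-correction use case concrete, here is a minimal sketch of how correct and incorrect submissions to the same challenge might be paired into training examples. The record layout (`problem_id`, `language`, `status`, `source`), the status labels and the sample data are assumptions made for illustration. They are not taken from CodeNet's actual metadata schema, which should be checked before adapting this.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    # Hypothetical record layout; the real CodeNet metadata may differ.
    problem_id: str
    language: str
    status: str   # assumed labels such as "Accepted" or "Wrong Answer"
    source: str


def correction_pairs(submissions):
    """Pair each rejected submission with an accepted one
    for the same problem and language."""
    accepted = {}
    rejected = defaultdict(list)
    for sub in submissions:
        key = (sub.problem_id, sub.language)
        if sub.status == "Accepted":
            # Keep the first accepted solution seen for this key.
            accepted.setdefault(key, sub)
        else:
            rejected[key].append(sub)

    pairs = []
    for key, bad_subs in rejected.items():
        good = accepted.get(key)
        if good is None:
            continue  # no known-good reference to learn from
        pairs.extend((bad.source, good.source) for bad in bad_subs)
    return pairs


# Made-up sample records, not real CodeNet data.
samples = [
    Submission("p001", "Python", "Wrong Answer", "print(a - b)"),
    Submission("p001", "Python", "Accepted", "print(a + b)"),
    Submission("p001", "Java", "Accepted", "System.out.println(a + b);"),
    Submission("p002", "Python", "Runtime Error", "print(1 / 0)"),
]

pairs = correction_pairs(samples)
print(pairs)
```

Each resulting pair holds a faulty program and a working one for the same challenge, which is the kind of input a correction model could learn from. Rejected submissions with no accepted counterpart in the same language are skipped.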

A more advanced use of the dataset is code generation. CodeNet is a rich library of textual descriptions of problems and their corresponding source code. Developers have already used advanced language models such as GPT-3 to generate code from natural language descriptions.
