Understanding CodeBERT
CodeBERT is an extension of the BERT model, developed by Microsoft in 2020. The model can be used for multiple downstream tasks involving programming language (PL) and natural language (NL), such as suggesting code to a developer for a particular task, aiding developers with code translation, and more. It is already used in Microsoft tools such as Visual Studio IntelliCode.
Published by: Kumud Gautami
Before diving into the model, let's look at the use cases it enables across PL and NL tasks (source: Microsoft Blog):
Use Cases of CodeBERT:
- Code to code: can be used for code completion or code translation. For example, when a developer wants to write Java code for which equivalent Python code already exists, code-to-code translation can help convert that code block.
- Code to text: can aid developers with code summarization. For example, when a developer looks at an unfamiliar piece of code, code-AI models can translate it into natural language and summarize it for the developer.
- Text to code: can provide a code-search-like feature, where a user retrieves relevant code based on a natural language query (see the sketch after this list).
- Text to text: can help translate code-domain text into different natural languages.
Refer to the image below (source: Microsoft Blog).
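To make the text-to-code search idea concrete, here is a hypothetical sketch (not from the original post) that embeds a query and a few candidate snippets with the released microsoft/codebert-base checkpoint and ranks the snippets by cosine similarity; the query and snippets are invented for illustration:

```python
# Hypothetical sketch: rank candidate code snippets against a natural language
# query using CodeBERT embeddings and cosine similarity.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

def embed(text):
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        # Use the first token's hidden state as a single summary vector.
        return model(**enc).last_hidden_state[0, 0]

query = "read a file line by line"
snippets = [
    "with open(path) as f:\n    for line in f:\n        handle(line)",
    "def add(a, b):\n    return a + b",
]
scores = [float(torch.cosine_similarity(embed(query), embed(s), dim=0)) for s in snippets]
print(scores)  # the file-reading snippet should score closer to the query
```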
Background of BERT:
BERT (Bidirectional Encoder Representations from Transformers) is a self-supervised model proposed in 2018 by Google.
BERT Architecture:
- In essence, BERT is a stack of Transformer encoder layers (Vaswani et al., 2017), each consisting of multiple self-attention “heads”.
- Consider the statements “I like you” vs. “Huh! As if I like you”. A plain token embedding would treat the word “like” identically in both; BERT additionally takes positional embeddings and segment embeddings into account, so the final representation depends on context (the sketch after this list makes this concrete).
- For every input token in a sequence, each head computes key, value, and query vectors, used to create a weighted representation/embedding.
- The outputs of all heads in the same layer are combined and run through a fully connected layer. Each layer is wrapped with a skip connection and followed by layer normalization.
- The conventional workflow for BERT consists of two stages: pre-training and fine-tuning.
- Pre-training uses two self-supervised tasks: masked language modelling (MLM, prediction of randomly masked input tokens) and next sentence prediction (NSP, predicting whether two input sentences are adjacent to each other).
- Fine-tuning targets downstream applications; one or more fully connected layers are typically added on top of the final encoder layer.
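A small sketch (using the standard bert-base-uncased checkpoint from Hugging Face, not mentioned in the original post) showing that BERT's output embeddings are context-dependent: the vector for “like” differs between the two sentences above:

```python
# Minimal sketch: compare the contextual embedding of "like" in two sentences.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(sentence, word):
    # Token, positional, and segment embeddings are summed inside the model;
    # the encoder stack then produces one contextual vector per token.
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]          # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

v1 = embedding_of("I like you", "like")
v2 = embedding_of("Huh! As if I like you", "like")
print(torch.cosine_similarity(v1, v2, dim=0))  # high, but not exactly 1.0
```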
CodeBERT Architecture:
- BERT is easily extendable to multi-modality, i.e., training with different types of data.
- CodeBERT is a bimodal extension of BERT, i.e., it takes both natural language and source code as input (unlike traditional BERT and RoBERTa, which focus primarily on natural language).
image source: https://arxiv.org/abs/2002.08155
Bimodal NL-PL pairs:
- The typical input that CodeBERT is trained on is a combination of code and well-defined text comments, e.g., a function paired with its documentation (see the sketch below).
image source: https://arxiv.org/abs/2002.08155
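A small sketch of how such an NL-PL pair is packed into one input sequence. The paper describes the layout [CLS], NL tokens, [SEP], PL tokens, [EOS]; the RoBERTa-style tokenizer shipped with the released checkpoint realizes these roles with <s> and </s>. The docstring and code below are illustrative only:

```python
# Sketch: tokenize an NL-PL pair into a single bimodal input sequence.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")

nl = "return the maximum value"              # illustrative documentation text
code = "def max_val(xs): return max(xs)"     # illustrative code

encoded = tokenizer(nl, code)
# Prints something like: ['<s>', ...NL tokens..., '</s>', '</s>', ...code tokens..., '</s>']
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```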
CodeBERT describes two pre-training objectives:
- Training CodeBERT with Masked Language Modelling (MLM): a random set of positions in both the NL and PL segments is selected and replaced with a special [MASK] token. The MLM objective is to predict the original tokens that were masked out (see the fill-mask sketch below).
- Training CodeBERT with Replaced Token Detection (RTD): here, a few tokens in the original NL and PL sequences are masked out and replaced with plausible alternatives drawn from generator models (n-gram-like probabilistic models). A discriminator model is then trained to determine whether each token is the original one or a replacement, which is a binary classification problem.
image source: https://arxiv.org/abs/2002.08155
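As an illustration of the MLM objective, the CodeBERT repository also publishes an MLM checkpoint (microsoft/codebert-base-mlm) that can be queried with the standard fill-mask pipeline; a minimal sketch, with a made-up code fragment:

```python
# Sketch: ask the MLM checkpoint to recover a masked-out code token.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="microsoft/codebert-base-mlm")

code = "if (x is not None) <mask> (x > 1)"   # <mask> is RoBERTa's mask token
for prediction in fill_mask(code):
    print(prediction["token_str"], prediction["score"])
```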
Training details of CodeBERT
- 125M parameters, 12 layers
- Takes 250 hours of training on an NVIDIA DGX-2 with FP16
Results:
- The summarized results show that CodeBERT outperforms its predecessors on the reported benchmarks.
- It is interesting to compare CodeBERT initialized with pretrained RoBERTa representations against CodeBERT trained from scratch (both trained on the CodeSearchNet corpus): initializing CodeBERT with RoBERTa performs better.
image source: https://arxiv.org/abs/2002.08155
How to use CodeBERT:
CodeBERT is available through Huggingface:
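A minimal sketch, assuming the publicly released microsoft/codebert-base checkpoint and the Hugging Face transformers library; the NL description and code snippet are invented for illustration:

```python
# Sketch: load CodeBERT and extract contextual embeddings for an NL-PL pair.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

nl = "return the maximum of two numbers"
code = "def max_of_two(a, b): return a if a > b else b"

inputs = tokenizer(nl, code, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional contextual vector per token; position 0 holds the
# summary ([CLS]-style) embedding often used for classification or retrieval.
print(outputs.last_hidden_state.shape)   # torch.Size([1, seq_len, 768])
```

These embeddings can then be fine-tuned for the downstream tasks listed above, such as code search or code summarization.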
References:
- CodeBERT: A Pre-Trained Model for Programming and Natural Languages, https://arxiv.org/abs/2002.08155
- https://github.com/microsoft/CodeBERT
- https://www.mdpi.com/2076-3417/11/11/4793/htm
- https://www.youtube.com/watch?v=YmAXluUDPPI
- https://innovation.microsoft.com/en-us/tech-minutes-codebert