Understanding CodeBERT
CodeBERT is an extension of the BERT model, developed by Microsoft in 2020. The model can be used for multiple downstream tasks involving programming language (PL) and natural language (NL), such as suggesting code to a developer for a particular task, aiding developers with code translation, and more. It is already used in Microsoft tools such as Visual Studio IntelliCode.
Published by: Kumud Gautami
Before diving into the model, let's look at the use cases it enables across PL and NL tasks (source: Microsoft Blog):
Use Cases of CodeBERT:
- Code to code: can be used for code completion or code translation. For example, when a developer wants to write Java code for which equivalent Python code already exists, code-to-code translation can help convert that code block.
- Code to text: can aid developers with code summarization. For example, when a developer looks at an unfamiliar piece of code, code-AI models can translate it into natural language and summarize it for the developer.
- Text to code: can provide a code-search-like feature, where a user retrieves relevant code based on a natural language query (see the sketch after this list).
- Text to text: can help translate code-domain text into different natural languages.
Refer to the image below (source: Microsoft Blog).
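To make the text-to-code search idea concrete, here is a hypothetical sketch (not from the original post) that embeds a query and a few candidate snippets with the released microsoft/codebert-base checkpoint and ranks the snippets by cosine similarity; the query and snippets are invented for illustration:

```python
# Hypothetical sketch: rank candidate code snippets against a natural language
# query using CodeBERT embeddings and cosine similarity.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

def embed(text):
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        # Use the first token's hidden state as a single summary vector.
        return model(**enc).last_hidden_state[0, 0]

query = "read a file line by line"
snippets = [
    "with open(path) as f:\n    for line in f:\n        handle(line)",
    "def add(a, b):\n    return a + b",
]
scores = [float(torch.cosine_similarity(embed(query), embed(s), dim=0)) for s in snippets]
print(scores)  # the file-reading snippet should score closer to the query
```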
Background of BERT:
BERT (Bidirectional Encoder Representations from Transformers) is a self-supervised model proposed in 2018 by Google.
BERT Architecture:
- In essence, BERT is a stack of Transformer encoder layers (Vaswani et al., 2017), each consisting of multiple self-attention “heads”.
- Consider the statements “I like you” vs. “Huh! As if I like you”. A plain token embedding would treat the word “like” identically in both; BERT additionally takes positional embeddings and segment embeddings into account, so the final representation depends on context (the sketch after this list makes this concrete).
- For every input token in a sequence, each head computes key, value, and query vectors, used to create a weighted representation/embedding.
- The outputs of all heads in the same layer are combined and run through a fully connected layer. Each layer is wrapped with a skip connection and followed by layer normalization.
- The conventional workflow for BERT consists of two stages: pre-training and fine-tuning.
- Pre-training uses two self-supervised tasks: masked language modelling (MLM, prediction of randomly masked input tokens) and next sentence prediction (NSP, predicting whether two input sentences are adjacent to each other).
- Fine-tuning targets downstream applications; one or more fully connected layers are typically added on top of the final encoder layer.
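A small sketch (using the standard bert-base-uncased checkpoint from Hugging Face, not mentioned in the original post) showing that BERT's output embeddings are context-dependent: the vector for “like” differs between the two sentences above:

```python
# Minimal sketch: compare the contextual embedding of "like" in two sentences.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(sentence, word):
    # Token, positional, and segment embeddings are summed inside the model;
    # the encoder stack then produces one contextual vector per token.
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]          # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

v1 = embedding_of("I like you", "like")
v2 = embedding_of("Huh! As if I like you", "like")
print(torch.cosine_similarity(v1, v2, dim=0))  # high, but not exactly 1.0
```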
CodeBERT Architecture:
- BERT is easily extendable to multi-modality, i.e., training with different types of data.
- CodeBERT is a bimodal extension of BERT, i.e., it takes both natural language and source code as input (unlike traditional BERT and RoBERTa, which focus primarily on natural language).
image source: https://arxiv.org/abs/2002.08155
Bimodal NL-PL pairs:
- The typical input that CodeBERT is trained on is a combination of code and well-defined text comments, e.g., a function paired with its documentation (see the sketch below).
image source: https://arxiv.org/abs/2002.08155
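A small sketch of how such an NL-PL pair is packed into one input sequence. The paper describes the layout [CLS], NL tokens, [SEP], PL tokens, [EOS]; the RoBERTa-style tokenizer shipped with the released checkpoint realizes these roles with <s> and </s>. The docstring and code below are illustrative only:

```python
# Sketch: tokenize an NL-PL pair into a single bimodal input sequence.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")

nl = "return the maximum value"              # illustrative documentation text
code = "def max_val(xs): return max(xs)"     # illustrative code

encoded = tokenizer(nl, code)
# Prints something like: ['<s>', ...NL tokens..., '</s>', '</s>', ...code tokens..., '</s>']
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```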
CodeBERT describes two pre-training objectives:
- Training CodeBERT with Masked Language Modelling (MLM): a random set of positions in both the NL and PL segments is selected and replaced with a special [MASK] token. The MLM objective is to predict the original tokens that were masked out (see the fill-mask sketch below).
- Training CodeBERT with Replaced Token Detection (RTD): here, a few tokens in the original NL and PL sequences are masked out and replaced with plausible alternatives drawn from generator models (n-gram-like probabilistic models). A discriminator model is then trained to determine whether each token is the original one or a replacement, which is a binary classification problem.
image source: https://arxiv.org/abs/2002.08155
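As an illustration of the MLM objective, the CodeBERT repository also publishes an MLM checkpoint (microsoft/codebert-base-mlm) that can be queried with the standard fill-mask pipeline; a minimal sketch, with a made-up code fragment:

```python
# Sketch: ask the MLM checkpoint to recover a masked-out code token.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="microsoft/codebert-base-mlm")

code = "if (x is not None) <mask> (x > 1)"   # <mask> is RoBERTa's mask token
for prediction in fill_mask(code):
    print(prediction["token_str"], prediction["score"])
```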
Training details of CodeBERT
- 125M parameters, 12 layers
- Takes 250 hours of training on an NVIDIA DGX-2 with FP16
Results:
- The summarized results show that CodeBERT outperforms its predecessors on the reported benchmarks.
- It is interesting to compare CodeBERT initialized with pretrained RoBERTa representations against CodeBERT trained from scratch (both trained on the CodeSearchNet corpus): initializing CodeBERT with RoBERTa performs better.
image source: https://arxiv.org/abs/2002.08155
How to use CodeBERT:
CodeBERT is available through Huggingface:
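A minimal sketch, assuming the publicly released microsoft/codebert-base checkpoint and the Hugging Face transformers library; the NL description and code snippet are invented for illustration:

```python
# Sketch: load CodeBERT and extract contextual embeddings for an NL-PL pair.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

nl = "return the maximum of two numbers"
code = "def max_of_two(a, b): return a if a > b else b"

inputs = tokenizer(nl, code, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional contextual vector per token; position 0 holds the
# summary ([CLS]-style) embedding often used for classification or retrieval.
print(outputs.last_hidden_state.shape)   # torch.Size([1, seq_len, 768])
```

These embeddings can then be fine-tuned for the downstream tasks listed above, such as code search or code summarization.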
References:
- CodeBERT: A Pre-Trained Model for Programming and Natural Languages, https://arxiv.org/abs/2002.08155
- https://github.com/microsoft/CodeBERT
- https://www.mdpi.com/2076-3417/11/11/4793/htm
- https://www.youtube.com/watch?v=YmAXluUDPPI
- https://innovation.microsoft.com/en-us/tech-minutes-codebert