Imagine you are solving a jigsaw puzzle. You don't look at individual pieces in isolation; you consider their shapes, colours, and the overall image to fit them together correctly.

Reading a document is similar. We do not just notice the text; we also register its position, size, and font, and how it is organised into paragraphs, bullet points, and sub-sections.

In document processing, the hard part is not the OCR itself. Many open-source OCR engines can extract text and return its location. The real challenge is labelling those pieces of text accurately and automatically.

That is where LayoutLMv3 comes into the picture. It does not just read the OCR words; it takes everything into account, including where each piece of text is located and how it relates to the other pieces of text.

How Does LayoutLMv3 Inference Work?

Humans have a natural ability to gauge the significance of different elements when looking at documents. We effortlessly recognize that a bold title at the top is important or that text inside a box may denote something noteworthy. LayoutLMv3 works in a surprisingly similar fashion to understand documents.

Let's think about how our eyes scan a document. We observe the words and their placement on the page. LayoutLMv3 starts the same way, using OCR (optical character recognition). OCR captures each word along with its location on the page; these locations are known as 'bounding boxes'.
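For a rough sense of what this step produces, here is a minimal sketch using the open-source Tesseract engine via pytesseract. The image file name and the 0-1000 box normalisation (the convention used by LayoutLM-family models) are assumptions for illustration.

```python
import pytesseract
from pytesseract import Output
from PIL import Image

# Hypothetical input: one page of a document saved as an image.
image = Image.open("invoice.png").convert("RGB")
width, height = image.size

# Tesseract returns each recognised word with its pixel-level bounding box.
ocr = pytesseract.image_to_data(image, output_type=Output.DICT)

words, boxes = [], []
for text, left, top, w, h in zip(
    ocr["text"], ocr["left"], ocr["top"], ocr["width"], ocr["height"]
):
    if not text.strip():
        continue  # skip empty OCR entries
    words.append(text)
    # Normalise boxes to a 0-1000 grid, as LayoutLM-family models expect.
    boxes.append([
        int(1000 * left / width),
        int(1000 * top / height),
        int(1000 * (left + w) / width),
        int(1000 * (top + h) / height),
    ])

print(words[:5], boxes[:5])
```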

In the next step, just as humans read the text and understand each word, LayoutLMv3 converts the words into numerical vectors through an embedding layer. These vectors capture the semantic meaning of every word.

Part of a word's context comes from where it is placed. While reading a document, we note that this is a header, this is a paragraph, this is a footer, and so on. LayoutLMv3 learns in the same way. The order of words also matters for sentence structure, so LayoutLMv3 encodes each word's position in the sequence as a positional embedding.

It also encodes the 2D layout position of the text at segment level, that is, the text's position with respect to the different parts of the document.

Combining these with the word vectors lets LayoutLMv3 see the text in context.
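To make this concrete, here is a small conceptual sketch in PyTorch of summing word, 1D position, and 2D layout embeddings. The table sizes, token ids, and boxes are made up for illustration and do not match LayoutLMv3's actual internals.

```python
import torch
import torch.nn as nn

# Conceptual sketch only: each token gets a word embedding, a 1D position
# embedding for its order in the sequence, and 2D embeddings for its box.
vocab_size, hidden = 50265, 768          # sizes assumed for illustration
word_emb = nn.Embedding(vocab_size, hidden)
pos_emb = nn.Embedding(512, hidden)      # 1D sequence positions
x_emb = nn.Embedding(1024, hidden)       # x-coordinates on a 0-1000 grid
y_emb = nn.Embedding(1024, hidden)       # y-coordinates on a 0-1000 grid

input_ids = torch.tensor([[101, 2023, 2003, 1037, 102]])   # illustrative token ids
bbox = torch.tensor([[[0, 0, 0, 0],                        # [x0, y0, x1, y1] per token
                      [70, 50, 180, 80],
                      [190, 50, 230, 80],
                      [240, 50, 270, 80],
                      [0, 0, 0, 0]]])
positions = torch.arange(input_ids.size(1)).unsqueeze(0)

# Sum the pieces so every token "knows" both what it says and where it sits.
embeddings = (
    word_emb(input_ids)
    + pos_emb(positions)
    + x_emb(bbox[..., 0]) + y_emb(bbox[..., 1])
    + x_emb(bbox[..., 2]) + y_emb(bbox[..., 3])
)
print(embeddings.shape)  # torch.Size([1, 5, 768])
```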

Image source: arXiv

When humans read a document, at first glance we can tell whether it is a letter, a resume, or something else, because our mind takes in the whole page as a picture. LayoutLMv3 also looks at the page image. Instead of relying on a CNN or Faster R-CNN backbone as older approaches did, it splits the image into sub-images, which is called patching.

We scan an image from left to right and top to bottom. Similarly, the patches are arranged in that order, from the top-left of the page to the bottom-right, and fed to the transformer as a sequence.

LayoutLMv3 then embeds these patches. A positional embedding is added to each patch embedding to preserve the positional relationships between the patches.
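Below is a minimal conceptual sketch of this ViT-style patch embedding in PyTorch; the image size, patch size, and hidden size are assumptions chosen only to show the idea.

```python
import torch
import torch.nn as nn

# Conceptual sketch with illustrative sizes.
image = torch.randn(1, 3, 224, 224)        # a resized page image
patch_size, hidden = 16, 768

# A strided convolution slices the image into 16x16 patches and projects
# each one to a vector, like cutting the page into tiles.
to_patches = nn.Conv2d(3, hidden, kernel_size=patch_size, stride=patch_size)
patches = to_patches(image)                     # (1, 768, 14, 14)
patches = patches.flatten(2).transpose(1, 2)    # (1, 196, 768), left-to-right, top-to-bottom

# Learned positional embeddings keep track of where each patch came from.
patch_pos = nn.Parameter(torch.zeros(1, patches.size(1), hidden))
patch_embeddings = patches + patch_pos
print(patch_embeddings.shape)  # torch.Size([1, 196, 768])
```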

Finally, LayoutLMv3 feeds the word embeddings and the image patch embeddings into the transformer as one combined sequence.

Prediction with Multimodal Transformer

Once both the image patch embeddings and the word embeddings are inside the multimodal transformer, processing begins.

Through attention, the model focuses on the parts of the text that are relevant to the visual features in each patch. This attention is guided by:

Positional Embeddings: These help the model identify text that's likely to be associated with the current image patch based on their spatial proximity in the document.

It is like reading a table: if a row contains "Item Description", "Price", and "Quantity", we recognise it as the header row.

Semantic Relatedness: The model also considers the semantic similarity of words to the visual features, as it learns to associate visual concepts with their textual counterparts.

It is like seeing an image of a dog: we automatically pay more attention to the words in the document that are related to dogs or pets.

Cross-Modal Learning: The model learns relationships between visual and textual features, building a unified understanding of the document's content.

It is like reading a table where prices are listed as 500, 200, or 300. At first, a reader may not know whether the numbers are in rupees or dollars.

But once they read the note at the bottom indicating that the values are in rupees, they mentally tag those numbers, and the table's formatting, as prices in rupees.

In the end, the model produces a rich understanding of the document, which is used for downstream tasks. Common tasks include document classification, document layout analysis, information extraction, and question answering.
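Below is a hedged inference sketch using the Hugging Face transformers implementation of LayoutLMv3. The file name is an assumption, and the base checkpoint is used only for illustration; in practice you would load a checkpoint fine-tuned for your task, and the built-in OCR requires Tesseract to be installed.

```python
from PIL import Image
from transformers import AutoProcessor, LayoutLMv3ForTokenClassification

# Assumptions for this sketch: a local "document.png" page image and the base
# checkpoint with a randomly initialised 5-label head; for real extraction you
# would use a checkpoint fine-tuned on labelled documents.
processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=True)
model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base", num_labels=5
)

image = Image.open("document.png").convert("RGB")

# The processor runs OCR, normalises the bounding boxes, tokenises the words,
# and prepares the page image, producing input_ids, bbox, and pixel_values.
encoding = processor(image, return_tensors="pt")

outputs = model(**encoding)
predictions = outputs.logits.argmax(-1)   # one predicted label id per token
print(predictions)
```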

Pre-training of LayoutLMv3

The magic of LayoutLMv3 comes from its pre-training process. It is like training a child's brain by showing them many puzzles and teaching them to solve each one based on shapes, colours, and the relationships between pieces. There are three main ways in which LayoutLMv3 learns:

Masked Language Modeling (MLM): Imagine hiding some words in a puzzle and asking the child to guess them based on the remaining clues.

MLM does something similar. It hides certain words in a document and trains LayoutLMv3 to predict the missing words, learning the context and relationships between words.

For example, if a document contains the sentence "Doing an internship is a great way to gain practical knowledge", MLM might hide (mask) the word 'gain'. From the surrounding words, LayoutLMv3 learns that 'gain' is the most likely missing piece.
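As a rough sketch of the mechanics, MLM can be thought of as follows; the token ids, mask id, and masking ratio here are illustrative, not LayoutLMv3's exact span-masking recipe.

```python
import torch

# Illustrative token ids for one sentence; 103 stands in for the [MASK] id.
input_ids = torch.tensor([[101, 5678, 2003, 1037, 2307, 2126, 2000, 5114, 6186, 3716, 102]])
labels = input_ids.clone()
mask_token_id = 103

# Randomly choose a fraction of tokens to hide (special tokens would normally
# be excluded; LayoutLMv3 uses its own ratio and span-masking strategy).
masked = torch.bernoulli(torch.full(input_ids.shape, 0.3)).bool()

input_ids[masked] = mask_token_id   # hide the chosen words
labels[~masked] = -100              # compute loss only on the hidden words

# During pre-training, the model is asked to reconstruct the masked ids:
# loss = model(input_ids=input_ids, bbox=bbox, pixel_values=pixel_values, labels=labels).loss
```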

Image source: arXiv

Masked Image Modeling (MIM): This is like hiding parts of the puzzle image and asking the child to describe them based on the visible parts. MIM masks certain patches in an image, and LayoutLMv3 has to predict what those hidden patches might contain. This trains the model to understand the visual features and relationships between different parts of an image.
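A conceptual sketch of MIM, with illustrative sizes and masking ratio:

```python
import torch

# Hide a fraction of the page's image patches and ask the model to predict
# what they contained (sizes and ratio here are illustrative).
patch_embeddings = torch.randn(1, 196, 768)   # 14 x 14 patches of a page
mask_ratio = 0.4
num_masked = int(mask_ratio * patch_embeddings.size(1))

# Pick random patches to hide and replace them with a mask vector.
masked_idx = torch.randperm(patch_embeddings.size(1))[:num_masked]
patch_embeddings[0, masked_idx] = torch.zeros(768)

# Pre-training then asks the model to reconstruct the content of exactly
# those masked positions (LayoutLMv3 predicts discrete visual tokens).
```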

Word-Patch Alignment (WPA): This brings the image and text understanding together. Think of it as showing the child the individual puzzle pieces and their corresponding picture sections, asking them to match them up.
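A conceptual sketch of how WPA labels can be thought of, with illustrative shapes, following the paper's idea that the model predicts, for each word, whether the image patch covering it was masked:

```python
import torch

# Illustrative setup: 11 text tokens, 196 image patches, some patches masked by MIM.
num_text_tokens = 11
token_to_patch = torch.randint(0, 196, (num_text_tokens,))  # which patch covers each word
patch_is_masked = torch.zeros(196, dtype=torch.bool)
patch_is_masked[torch.randperm(196)[:78]] = True            # patches hidden by MIM

# 1 = aligned (the word's patch is visible), 0 = unaligned (its patch was masked).
wpa_labels = (~patch_is_masked[token_to_patch]).long()
print(wpa_labels)
```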

By mastering these pre-training objectives, LayoutLMv3 learns to see documents like humans, not just as a collection of words or images, but as a unified whole with interconnected elements conveying meaning.

Conclusion

LayoutLMv3 is a groundbreaking step in document processing. It observes a document much as a human does and genuinely understands its visual layout. It does not just model individual words or images, but also how they interconnect to convey information.

You can learn more about the training details and advanced topics in the LayoutLMv3 research paper.

This blog details my learnings from my internship at Softude while working with Mradul Kanugo.

Sources of Article

LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking
