Imagine you have a task: multiply two 4x4 matrices together. It's like a mini workout for your brain. You pick one row from the first matrix and one column from the second, then begin the calculation. To get just one number in your answer, you must perform four multiplications and three additions, in the proper order: multiply each corresponding pair of elements, then sum the four products to obtain that element. With a 4x4 result, you must repeat this process 16 times. We are talking 64 multiplications and 48 additions in total, all in a line, one by one.
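In code, that grind is just three nested loops. Here is a minimal Python sketch of the naive approach (plain lists, no libraries):

def naive_matmul(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):              # pick a row of A
        for j in range(n):          # pick a column of B
            for k in range(n):      # 4 multiplications per output element
                C[i][j] += A[i][k] * B[k][j]
    return C                        # for n = 4: 4 * 4 * 4 = 64 multiplications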

There you are, carrying this heavy math load, while your friends are totally chilling. It’s just not fair, is it?  

But wait, what if you could share the load? That's where tile-based matrix multiplication comes in. This strategy breaks the heavy task into smaller chunks for easier processing. So you grab your friend and hand him a sub-part of both matrices to process, while you work on another sub-part in parallel. All the calculations you were doing alone, bearing the heavy load and breaking a sweat, are now divided into two parts. Your work becomes simpler and faster, and your friend is no longer chilling. Cheers!

Join me as we dive into the world of tiling in matrix multiplication, where the work is performed not by one, but by many, each handling a different chunk.

How does tiled matrix multiplication work?

Matrix multiplication requires significant computational resources. To optimize this process, a method known as 'Tiling' is often used. Tiling involves breaking down the matrices into smaller sub-matrices or 'tiles' that fit into the cache memory, which is much faster than RAM. This allows for more efficient computation as it reduces the time spent on memory access. 

Let's consider an example with two matrices:

Input matrix : [ [1, 2, 3, 4], [5, 2, 4, 3], [4, 3, 1, 2], [2, 1, 3, 1] ]

Weight matrix : [ [5, 4, 1, 2], [1, 2, 3, 3], [3, 1, 5, 1], [4, 2, 2, 5] ]

Assuming the cache memory size allows for a 2x2 tile, we can perform matrix multiplication in smaller chunks. This enables multiple multiplications to run in parallel on different processor cores. For example:

C1 = [ [1, 2], [5, 2]] * [ [5, 4], [1, 2]] 

C2 = [ [3, 4], [4, 3] ] * [ [3, 1], [4, 2] ] 

Ans1 = C1 + C2

Ans1 = [ [32, 19], [51, 34] ] 

C1 and C2 can be computed in parallel.

Then, from these small tiles, the final matrix is assembled like this:

Ans = [ [32, 19, 30, 31], [51, 34, 37, 35], [34, 27, 22, 28], [24, 15, 22, 15] ]
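Here is the same computation as a short Python sketch using NumPy (a 2x2 tile size is assumed, matching the example above; a real implementation would pick the tile size from the cache size):

import numpy as np

A = np.array([[1, 2, 3, 4], [5, 2, 4, 3], [4, 3, 1, 2], [2, 1, 3, 1]])
B = np.array([[5, 4, 1, 2], [1, 2, 3, 3], [3, 1, 5, 1], [4, 2, 2, 5]])

# The top-left 2x2 tile of the answer is the sum of two tile products:
C1 = A[0:2, 0:2] @ B[0:2, 0:2]   # [[ 7,  8], [27, 24]]
C2 = A[0:2, 2:4] @ B[2:4, 0:2]   # [[25, 11], [24, 10]]
print(C1 + C2)                   # [[32, 19], [51, 34]] -> matches Ans1

# Generalizing: every output tile is an accumulation of small tile products
# (assumes the matrix size n is a multiple of the tile size)
def tiled_matmul(A, B, tile=2):
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

print(tiled_matmul(A, B))        # reproduces the full 4x4 answer above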

How does tiled matrix multiplication work on a CPU?

Matrix multiplication is a foundational operation in many computational tasks. Modern computer architectures feature hierarchical memory systems, and tiling is designed to exploit exactly that hierarchy.

Let us suppose we have two matrices in our secondary storage.

Initially, our data resides in secondary memory: non-volatile storage such as a hard drive or SSD. Upon initiating the inference task, the system loads the matrices into Random Access Memory (RAM). RAM offers much faster access times than secondary storage, so it serves as a ready-to-use workbench for our data.

Image source: Dive Into Systems

Once in RAM, we accelerate processing further by utilizing the CPU's L3 cache. The L3 cache is faster than RAM but smaller, and it is shared across all processor cores; it is in turn larger (and slower) than the L2 and L1 caches. During matrix multiplication, subsets of both matrices, the 'tiles', are loaded into the L3 cache according to its capacity.

The process continues to refine as smaller, different subsets of the cached matrices are allocated to various L2 caches. Each processor core generally possesses its own private L2 cache. By distributing the data across multiple L2 caches, we're essentially parceling out the workload across the available cores.

The last part of our journey before computation involves moving a small part of our matrices into the L1 cache. When the CPU performs matrix multiplication, it first looks for the necessary data in the L1 cache. Because the L1 cache is the fastest of all, hitting it greatly increases the throughput of matrix operations.

After a core finishes a partial product, the result is written back to the L3 cache whenever it needs to be shared between cores.

The partial products corresponding to the same output segment are then added together to produce the final tiles.

These tiles are then placed in their correct positions to assemble the final matrix.

In this way, all the tiles of the answer are generated on different cores, in parallel.
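As a rough illustration of this 'one tile per core' idea, here is a Python sketch that farms each output tile out to a separate worker thread (NumPy releases the interpreter lock inside @, so the threads can genuinely overlap; real implementations control cache placement far more carefully than this):

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def compute_tile(A, B, i, j, tile):
    # One core's job: accumulate every partial product for output tile (i, j)
    acc = np.zeros((tile, tile), dtype=A.dtype)
    for k in range(0, A.shape[1], tile):
        acc += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return i, j, acc

def parallel_tiled_matmul(A, B, tile=2):
    n = A.shape[0]                       # assumes n is a multiple of tile
    C = np.zeros((n, n), dtype=A.dtype)
    with ThreadPoolExecutor() as pool:
        tasks = [pool.submit(compute_tile, A, B, i, j, tile)
                 for i in range(0, n, tile) for j in range(0, n, tile)]
        for t in tasks:
            i, j, acc = t.result()       # place each finished tile in position
            C[i:i+tile, j:j+tile] = acc
    return C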

How does tiled matrix multiplication work on a GPU?

Now we will see how tile-based matrix multiplication is implemented on the GPU.

Initially, matrix data is stored in the system's RAM. For processing on the GPU, data first needs to be transferred from the CPU's RAM to the GPU's global memory. 

Image source : Penny Xu

Now that the matrix data is in the GPU's global memory, individual GPU threads load elements of the matrices into shared memory, which is much faster than global memory but limited in size.

Each thread block computes a "tile" of the resulting matrix. 

The threads in a block accumulate the partial products for their tile as they step through the shared-memory tiles of both input matrices.

The finished tile is then written back to global memory.

Image source : Penny Xu
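For the curious, here is roughly what such a kernel looks like in Python using Numba's CUDA support. This is an illustrative sketch modeled on the tiled example in Numba's documentation; it assumes a CUDA-capable GPU, the numba package, and matrix sizes that are exact multiples of TILE:

import numpy as np
from numba import cuda, float32

TILE = 16  # each thread block computes one TILE x TILE output tile

@cuda.jit
def tiled_matmul_kernel(A, B, C):
    # Shared memory: fast, small, visible to every thread in the block
    sA = cuda.shared.array(shape=(TILE, TILE), dtype=float32)
    sB = cuda.shared.array(shape=(TILE, TILE), dtype=float32)
    tx, ty = cuda.threadIdx.x, cuda.threadIdx.y
    row, col = cuda.grid(2)
    acc = float32(0.0)
    for m in range(A.shape[1] // TILE):
        # Each thread copies one element of the current tiles from global memory
        sA[tx, ty] = A[row, m * TILE + ty]
        sB[tx, ty] = B[m * TILE + tx, col]
        cuda.syncthreads()               # wait until both tiles are fully loaded
        for k in range(TILE):
            acc += sA[tx, k] * sB[k, ty]
        cuda.syncthreads()               # wait before the tiles get overwritten
    C[row, col] = acc                    # write the result back to global memory

n = 64
A = np.random.rand(n, n).astype(np.float32)
B = np.random.rand(n, n).astype(np.float32)
C = np.zeros_like(A)
blocks = (n // TILE, n // TILE)
tiled_matmul_kernel[blocks, (TILE, TILE)](A, B, C)  # Numba handles the global-memory transfers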

Role of SIMD in tiled multiplication:

When using SIMD with tiled multiplication, a CPU core can perform multiple operations on different pieces of data within the same instruction cycle, thus speeding up computation for each tile. If the CPU has a SIMD width of 'q', it can process 'q' data elements simultaneously within a single core. For instance, consider computing one output element:

c1 = a1 * b1 + a2 * b3 + a3 * b5

With SIMD, the multiplications a1 * b1 and a2 * b3 can be completed in a single instruction. However, it's important to note that one SIMD operation works within a single core, not between cores.
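You can see the flavor of this from Python with NumPy, whose vectorized operations are backed by SIMD-capable loops (a sketch; which instructions actually run depends on the CPU and the NumPy build):

import numpy as np

a = np.array([1.0, 2.0, 3.0])   # a1, a2, a3: a row of the left tile
b = np.array([4.0, 5.0, 6.0])   # b1, b3, b5: a column of the right tile

products = a * b                # all pairwise multiplications in one vectorized step
c1 = products.sum()             # then reduce the products to a single element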

With the tiled matrix multiplication, the computation time can be reduced by parallelization. If 'p' is the number of cores and 'q' is the SIMD width, the operations are distributed across the cores, and each core can use SIMD to process 'q' operations simultaneously. 

The ideal computation time reduction could be expressed as: 

Total time = (Total operational time) / (p × q)
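For example, with p = 4 cores and a SIMD width of q = 4, the 64 multiplications and 48 additions of our 4x4 example, 112 operations in all, would ideally take the time of 112 / (4 × 4) = 7 sequential operations.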

However, it's important to note that this is a highly theoretical and optimistic view, because there are limitations such as load balancing and parallelization overhead.

Advantages of tile-based multiplication

Effective utilization of resources: It optimizes the use of computational resources by fully using the available cache memory and processor cores.

Speed and efficiency: By using the full potential of all cores and cache memory, it executes multiplications in parallel. This parallelism decreases computation time, increasing speed and efficiency.

Improved throughput: Because it utilizes hardware resources efficiently, it can significantly reduce computation time and improve throughput.

Conclusion

In conclusion, tile-based matrix multiplication divides a big math problem into smaller pieces. By doing this, computers can make better use of their memory and speed up the calculation process. The approach applies to both brains of our computers, the CPU and the GPU, making complex calculations more efficient and quick.

This blog details my learnings from my internship at Softude while working with Mradul Kanugo.

Sources of Article

Wikipedia, Dive into Systems, Techopedia, Penny Xu
