---
layout: post
title: CPU Vectorized Acceleration (SIMD)
author: JackCarterSmith
categories: Programming C C++
thumbnail: cpu_simd
highlight: true
---

* Contents
{:toc}

Processors (CPUs) are a jewel of technology, capable of executing more than a billion operations per second
(OPS)! However, their sequential design means they can only perform one operation at a time (and per core)...

How do CPUs manage to perform so many OPS despite this handicap, and what can we, as developers, do about it? That's
what we're going to try to find out in this blog post.

<div class="image300">
  <img src="{{"/static/images/posts/sleep_nerd.jpg" | prepend: site.baseurl }}"
  alt="Good reading" title="Good reading!" />
</div>

### __Do one thing at a time, but do it quick!__

First of all, let's take a brief look at how a CPU works.

The CPU is the main data processing unit in a computer. Its role is not limited to reading or writing data
from the memory it accesses: it can also perform **binary** operations (AND, OR, XOR, etc.) and **mathematical**
operations (ADD, SUB, MUL, etc.). With successive generations of architectures, CPUs have become increasingly
capable in terms of supported operations, faster at executing them, but also more energy-efficient and bigger!

With this simplified view of a computing unit executing one operation per clock pulse, we can quickly
deduce that to increase performance, we simply need to increase the number of pulses per second, i.e. the **frequency**.
Indeed, increasing the frequency of the processor was a first step towards improving performance. But physics
dictated that increasing the frequency meant raising the temperature of the CPU (*Joule effect*) towards the
critical temperature of the silicon (the chip material), leading to its destruction.

Among the notable advances, the move to **multi-core** in a single chip is one of the most important! The first
commercial example was IBM's "*POWER4*" (<3) in 2001. It was the start of the parallelization of data processing! And
the start of the resource-sharing hassle... Thanks to this technology, it was possible to increase performance
without raising the CPU frequency. Of course, the extra layer of multi-core management added latency to the chain,
but the gains in computing performance more than made up for it.

The principle of parallelization was not new: it was already commonly used in supercomputers such as the "*Cray X-MP*"
released in 1982, but at the time it was reserved for a lucky few. Today, having a CPU with at least 2 cores has
become commonplace.

But there's also something else that has been salvaged from the supercomputers - I'm talking about SIMD, of course!

Before going any further, I must first talk about CPU architecture and the instruction/data processing paradigm,
also known as **Flynn's taxonomy**. CPUs can have different internal structures, and the paths that data takes and
how it is transformed/computed are described by the architecture of the CPU. The most common in today's PCs is the
**AMD64** (aka. x86_64) architecture, but there are many others, such as the very recent *RISC-V* or even *ARM*,
which has been powering our portable devices for years.

Returning to SIMD: according to Flynn's taxonomy, CPU architectures can be classified by the answers to 2 questions:

- are one or more **data** elements processed per calculation cycle?
- are one or more **instructions** processed per calculation cycle?

<div class="image512">
  <img src="{{"/static/images/posts/flynn_table.png" | prepend: site.baseurl }}"
  alt="Flynn's taxonomy" title="Flynn's taxonomy" />
</div>

SISD is the standard mode of operation of the CPU in everyday use. It is the easiest of the 4 to implement, but it
is also the slowest.

A CPU architecture can operate in SISD and switch to SIMD during operation.

### __Single Instruction, Multiple Data (SIMD)__

<div class="image640">
  <img src="{{"/static/images/posts/simd_evol.png" | prepend: site.baseurl }}"
  alt="SIMD evolution" title="Evolution of x86_64 SIMD extensions" />
</div>

SIMD (more often called SIMD acceleration) is a way of processing data on one or more CPU cores. It generally
takes the form of an extension to the instruction set of the CPU architecture that wishes to use it. On x86_64,
it is known as **MMX, SSE, AVX** and more recently **AMX**. On ARM it is known as **NEON/SVE**. RISC-V is no more
complicated than that, with a simple "**vector extension**".

Let's stay with x86_64 for the explanations. Some curious people will notice that this architecture has several
MM0..MMx registers. Well, well... yes! These are the registers added by the MMX extension! This dates back to
1997, with the release of Intel's *Pentium MMX*, a few years before the first multi-core processors. SIMD works at
the scale of a single CPU core first and foremost, although it can be applied on a larger scale across multiple
cores - but that's a (likely) next topic! MMX was the first SIMD extension on x86_64, later replaced by SSE and
complemented by AVX from 2008.

<div class="image512">
  <img src="{{"/static/images/posts/simd_concept.gif" | prepend: site.baseurl }}"
  alt="SIMD concept" title="SISD vs. SIMD" />
</div>

How does SIMD work in practice? Well, by way of comparison, data is normally processed through 1 or 2 registers at a
time. These registers are now 64 bits wide, and whatever the size of the data to be processed, the register keeps its
64-bit size. So if we do a simple addition between two 8-bit integers, 112 of the registers' combined 128 bits
(2 × 64 − 2 × 8 = 112) are not used in the operation!

<div class="image300">
  <img src="{{"/static/images/posts/mmx_registers.png" | prepend: site.baseurl }}"
  alt="MMX registers" title="SSE/AVX registers" />
</div>

SIMD allows the use of 128-bit (SSE), 256-bit (AVX) or 512-bit (AVX-512) registers with the data placed consecutively.
For example, with 32-bit floats, 4 of them can be stored in an SSE register. In this way, it is possible to compute
2×4 floats in parallel (register against register)! However, the operation will be the same for each element in the
register. Single Instruction, Multiple Data!

### __Play with intrinsics__

*That's all very nice, but doesn't the compiler do it all by itself? What do you need to add to the source code or compiler*
*options to use the SIMD extensions?*

Here we go! To keep things simple, we're going to use an **x86_64** machine with **GCC-12** (Linux or Windows, it doesn't matter).

GCC provides several headers allowing low-level calls to use MMx registers and SIMD instructions. There is a header for each
extension, but you can directly include **\<immintrin.h\>** to pull in all of them. These functions are called
"**intrinsics**" because they use internal CPU features. Please note, however, that you will have access to all the extensions
at compile time even if your target CPU does not implement some or all of them. When your CPU tries to execute an
instruction from an extension that it **doesn't have**, you will get a **SIGILL** (*Illegal Instruction* error).

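A common way to stay safe is to check at runtime which extensions the host CPU actually implements, and dispatch to the
right code path. Below is a minimal sketch using GCC's `__builtin_cpu_supports()` builtin and its `target` function
attribute; the `add_avx`/`add_scalar` helpers are my own illustrative names, not a standard API.

{% highlight c %}
#include <immintrin.h>
#include <stdio.h>

// Scalar fallback: compiles and runs on any x86_64 CPU.
static void add_scalar(const float* a, const float* b, float* out, unsigned n) {
    for (unsigned i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

// AVX path: the target attribute lets GCC emit AVX instructions for this
// function only, without compiling the whole file with -mavx.
__attribute__ ((target ("avx")))
static void add_avx(const float* a, const float* b, float* out, unsigned n) {
    for (unsigned i = 0; i + 8 <= n; i += 8)   // 8 floats per 256-bit register
        _mm256_storeu_ps(out + i,
                         _mm256_add_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i)));
}

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8}, b[8] = {8, 7, 6, 5, 4, 3, 2, 1}, out[8];

    // Non-zero if the CPU we are *running* on implements AVX.
    if (__builtin_cpu_supports("avx"))
        add_avx(a, b, out, 8);     // safe: the instructions exist on this CPU
    else
        add_scalar(a, b, out, 8);  // graceful fallback instead of a SIGILL

    for (unsigned i = 0; i < 8; ++i)
        printf("%f\n", out[i]);
    return 0;
}
{% endhighlight %}
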
Below is an example of an operation using intrinsics.

{% highlight c %}
// Intrinsics global header
#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h>   // rand(), RAND_MAX, EXIT_SUCCESS

float lhs[1024];
__attribute__ ((aligned (16))) float rhs[1024];
float dst[1024] = {0.f};

int main(int argc, char** argv) {
    unsigned int i;

    for (i = 0; i < 1024; ++i) {
        lhs[i] = rand() / (float) RAND_MAX;
        rhs[i] = rand() / (float) RAND_MAX;
    }

    for (i = 0; i < 1024; i += 4) {
        // Load 4 float values from the left and right input arrays into MMx registers
        __m128 reg_mmx_1 = _mm_loadu_ps(lhs + i); // Load from the un-aligned left array
        __m128 reg_mmx_2 = _mm_load_ps(rhs + i);  // Load from the 16-byte aligned right array

        // Do the vectorized add operation on our registers
        __m128 tmp_result = _mm_add_ps(reg_mmx_1, reg_mmx_2);

        // Store the computed result in the output array
        _mm_storeu_ps(dst + i, tmp_result);
    }

    for (i = 0; i < 1024; ++i)
        printf("%f\n", dst[i]);

    return EXIT_SUCCESS;
}
{% endhighlight %}

{% highlight nasm %}
...

        lea     r8, dst[rip]
        lea     rcx, lhs[rip]
        lea     rdx, rhs[rip]
        .p2align 5

.for_loop:
        movaps  xmm0, XMMWORD PTR [rcx+rax]   # _mm_loadu_ps()
        addps   xmm0, XMMWORD PTR [rdx+rax]   # _mm_add_ps()
        movaps  XMMWORD PTR [r8+rax], xmm0    # _mm_storeu_ps()

        add     rax, 16
        cmp     rax, 4096
        jne     .for_loop

...
{% endhighlight %}

To tell the compiler that you want to use the MMx registers, use the variable type `__m128` for a 128-bit register. There are
also `__m256` and `__m512`, but the matching 256- or 512-bit intrinsics functions must be used with them. You may also notice
in the assembly above that GCC emitted `movaps` even for `_mm_loadu_ps()`/`_mm_storeu_ps()`: it laid out these global arrays
itself, so it knows they are in fact 16-byte aligned and can promote the unaligned accesses.

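For instance, widening the earlier loop to AVX just means swapping the type and the intrinsic prefix together - a minimal
sketch (assuming the file is compiled with `-mavx` and the same 1024-element arrays as above):

{% highlight c %}
#include <immintrin.h>

extern float lhs[1024], rhs[1024], dst[1024];

void mul_avx(void) {
    // __m256 holds 8 floats; the matching intrinsics use the _mm256_ prefix.
    for (unsigned int i = 0; i < 1024; i += 8) {
        __m256 a = _mm256_loadu_ps(lhs + i);
        __m256 b = _mm256_loadu_ps(rhs + i);
        _mm256_storeu_ps(dst + i, _mm256_mul_ps(a, b));   // vectorized multiply
    }
}
{% endhighlight %}
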
You will also notice the difference between `load_ps` and `loadu_ps`: the latter allows loading from memory which is
not aligned to 16 bytes, whereas the former requires that alignment. Loading or storing to unaligned memory with
`load/store_ps` will cause a **SEGFAULT** (*Segmentation Fault* error). The same aligned/unaligned variants exist for
`_mm_store`. And you can replace `_ps` with `_pd` to use **double** instead of **float**.

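Heap allocations need the same care: the aligned variants are only safe if you requested the alignment yourself. Here is
a minimal sketch with C11's `aligned_alloc()` and the `_pd` (double) flavour of the intrinsics:

{% highlight c %}
#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    // aligned_alloc (C11) guarantees the requested alignment; the size must be
    // a multiple of it. 16 bytes is what _mm_load_pd/_mm_store_pd expect.
    double* a = aligned_alloc(16, 8 * sizeof(double));
    double* b = aligned_alloc(16, 8 * sizeof(double));
    double* c = aligned_alloc(16, 8 * sizeof(double));
    if (!a || !b || !c) return EXIT_FAILURE;

    for (int i = 0; i < 8; ++i) { a[i] = i; b[i] = 10.0 * i; }

    for (int i = 0; i < 8; i += 2) {              // 2 doubles per 128-bit register
        __m128d va = _mm_load_pd(a + i);          // aligned load: safe here
        __m128d vb = _mm_load_pd(b + i);
        _mm_store_pd(c + i, _mm_add_pd(va, vb));  // aligned store
    }

    for (int i = 0; i < 8; ++i)
        printf("%f\n", c[i]);

    free(a); free(b); free(c);
    return EXIT_SUCCESS;
}
{% endhighlight %}
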
A list of intrinsics functions for the x86_64 architecture, sorted by extension, is available [at this website](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html).

GCC is capable of adding/replacing SIMD instructions according to the structure of the code it is currently processing, aka.
**auto-vectorisation**. However, it must be told which extensions it has access to, with the GCC options `-msse`, `-mavx`, etc.,
or directly the CPU architecture/model (if known) via `-march=` and `-mtune=`. By default, GCC tunes for `generic`, so the
compiler will only use instructions with maximum compatibility (and therefore no optimisation with the newer extensions).

> If you are compiling for exclusive use on your machine, compile with `-march=native` and `-mtune=native`. GCC will automatically
> detect your target CPU and use the known extensions for that CPU.

Auto-vectorisation is enabled by default at the higher optimisation levels (such as `-O2` or `-O3`), but it can fail for various
reasons, and **without informing the developer**. We recommend that you keep auto-vectorisation active, but explicitly indicate
where and how you want to use intrinsics. This way, in the event of failure (for alignment reasons, for example), the compiler
will warn us that it was unable to apply our instructions, and why.

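To see what the compiler actually did, GCC's `-fopt-info-vec` and `-fopt-info-vec-missed` reports are invaluable. Below is a
sketch of a typical auto-vectorisation candidate; `__builtin_assume_aligned()` is a GCC builtin that removes one common
reason for a missed vectorisation, and the compile command in the comment is just an example invocation:

{% highlight c %}
// Example invocation:
//   gcc -O3 -march=native -fopt-info-vec -fopt-info-vec-missed -c add.c
// -fopt-info-vec lists the loops that were vectorized,
// -fopt-info-vec-missed lists the ones that were not, and why.

#define N 1024

float lhs[N], rhs[N], dst[N];

void add_arrays(void) {
    // Promise GCC that these pointers are 16-byte aligned.
    float* a = __builtin_assume_aligned(lhs, 16);
    float* b = __builtin_assume_aligned(rhs, 16);
    float* d = __builtin_assume_aligned(dst, 16);

    // Plain scalar loop: GCC will turn it into SIMD instructions on its own.
    for (int i = 0; i < N; ++i)
        d[i] = a[i] + b[i];
}
{% endhighlight %}
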
### __A little word about OpenMP__

OpenMP is an all-in-one API in many respects. This implementation of multithreading, which has been extended to SIMD since
version 4.0, allows you to optimise your code with a few well-placed `#pragma`s.

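For example, handing the addition loop from earlier to OpenMP takes a single pragma - a minimal sketch, to be compiled
with GCC's `-fopenmp` option (or `-fopenmp-simd` to enable only the SIMD constructs, without the threading runtime):

{% highlight c %}
#include <stdio.h>

#define N 1024

float lhs[N], rhs[N], dst[N];

int main(void) {
    for (int i = 0; i < N; ++i) { lhs[i] = i; rhs[i] = N - i; }

    // OpenMP 4.0 'simd' construct: asks the compiler to vectorize this loop,
    // without spawning any threads (combine with 'parallel for' for that).
    #pragma omp simd
    for (int i = 0; i < N; ++i)
        dst[i] = lhs[i] + rhs[i];

    printf("%f\n", dst[0]);   // always 1024.0 here
    return 0;
}
{% endhighlight %}
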
However, it should be noted that each tool comes with its own advantages and disadvantages. Firstly, the optimisations made by
OpenMP will only be visible to you by re-reading the assembly code of your program. As with the auto-vectorisation mentioned
above, it's tricky to rely solely on the work of the compiler. And although there are profilers that support OpenMP, you may
find it even more complicated to investigate any bugs in your threads.

OpenMP is also not available on all toolchains (especially older or niche ones) and requires an OS to work (R.I.P. MCUs).

<div class="image512">
  <img src="{{"/static/images/posts/omp_platforms.png" | prepend: site.baseurl }}"
  alt="OpenMP platforms" title="OpenMP supported environments" />
</div>

If your application needs to run in a multi-CPU environment, in clusters, or with dedicated accelerators (GPGPU, FPGA, ASIC,
etc.), then OpenMP will be able to exploit its environment without you having to worry about it.

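As a teaser, here is roughly what offloading that same loop to an accelerator looks like with OpenMP's `target` construct -
a sketch that assumes a toolchain built with offloading support (without one, the region simply runs on the host CPU):

{% highlight c %}
#define N 1024

float lhs[N], rhs[N], dst[N];

void add_offload(void) {
    // 'target' sends the region to an available device (GPU, etc.);
    // the 'map' clauses describe which arrays travel to and from it.
    #pragma omp target teams distribute parallel for \
            map(to: lhs[0:N], rhs[0:N]) map(from: dst[0:N])
    for (int i = 0; i < N; ++i)
        dst[i] = lhs[i] + rhs[i];
}
{% endhighlight %}
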
We'll no doubt be talking about this API again!

See you in the next post!

**[END_OF_REPORT-240301.1027]**