---
layout: post
title: CPU Vectorized Acceleration (SIMD)
author: JackCarterSmith
categories: Programming C C++
thumbnail: cpu_simd
highlight: true
---

* Contents
{:toc}

Processors (CPUs) are a jewel of technology, capable of executing more than a billion operations per second (OPS)! However, their sequential design means they can only perform one operation at a time (and per core)...

How do CPUs manage to perform so many OPS despite this handicap, and what can we, as developers, do about it? That's what we're going to try to find out in this blog post.

Happy reading!

## Do one thing at a time, but do it quickly!

First of all, let's take a brief look at how a CPU works.

The CPU is the main data processing unit in a computer. Its role is not limited to reading or writing data from the memory it accesses, but it can also perform binary operations (AND, OR, XOR, etc.) and mathematical operations (ADD, SUB, MUL, etc.). With successive generations of architectures, CPUs have become increasingly complete in terms of possible operations, faster in executing them, but also more energy-efficient and bigger!

With this simplified picture of a computing unit executing one operation per clock pulse, we can quickly deduce that to increase performance, we simply need to increase the number of pulses per second, i.e. the frequency. Indeed, increasing the processor's frequency was a first step towards improving performance. But physics got in the way: increasing the frequency raises the temperature of the CPU (Joule effect) towards the critical temperature of the silicon (the chip material), leading to its destruction.

Among the notable advances, the move to multiple cores on a single chip is one of the most important! The first commercial multi-core CPU was IBM's "POWER4" (<3) in 2001. It was the start of parallelized data processing! And the start of the resource-sharing hassle... Thanks to this technology, it was possible to increase performance without touching the CPU frequency. Of course, the extra layer of multi-core management added latency to the chain, but the gains in computing performance more than made up for it.

The principle of parallelization was not new: it was already commonly used in supercomputers such as the "Cray X-MP" released in 1982, but at the time it was reserved for a lucky few. Today, having a CPU with at least 2 cores has become a common thing.

But there's also something else that has been salvaged from the supercomputers - I'm talking about SIMD, of course!

Before going any further, I must first talk about CPU architecture and the instruction/data processing paradigm, also known as Flynn's taxonomy. CPUs can have different internal structures: the paths that data takes, and how it is transformed/computed, are described by the architecture of the CPU. The most common in today's PCs is the AMD64 (aka x86_64) architecture, but there are many others, such as the very recent RISC-V, or ARM, which has been powering our portable devices for years.

Returning to SIMD: according to Flynn's taxonomy, CPU architectures can be classified by the answers to two questions:

  • are one or several data items processed per calculation cycle?
  • are one or several instructions processed per calculation cycle?

The four possible combinations give the classes SISD, SIMD, MISD and MIMD.

Flynn's taxonomy

SISD is the CPU's standard mode of operation in everyday use. It is the easiest of the four to implement, but it is also the slowest.

A CPU can operate in SISD mode and switch to SIMD mode during operation.

## Single Instruction, Multiple Data (SIMD)

SIMD evolution

SIMD (more often called SIMD acceleration) is a way of processing data on one or more CPU cores. It generally takes the form of an extension to the instruction set of the CPU architecture that wishes to use it. On x86_64, it is known as MMX, SSE, AVX and, more recently, AMX. On ARM it is known as NEON/SVE. RISC-V keeps it simple with a plain "vector extension".

Let's stay with x86_64 for the explanations. Curious readers will have noticed that this architecture has several MM0..MMx registers. Well, yes: these are the registers added by the MMX extension! This dates back to 1997, with the release of Intel's Pentium MMX, a few years before the first multi-core processors. SIMD works at the scale of a single CPU core first and foremost, although it can also be applied on a larger scale across multiple cores; but that's (probably) a topic for another post! MMX was the first SIMD extension on x86, later superseded by SSE and extended by AVX from 2008.

SIMD concept

How does SIMD work in practice? Well, by way of comparison, data is normally processed through one or two registers at a time. On x86_64 these general-purpose registers are 64 bits wide, and whatever the size of the data to be processed, the register keeps its 64-bit size. So if we do a simple addition between two 8-bit integers, only 16 of the 2x64 = 128 register bits do useful work: 112 bits take no part in the operation!

MMX registers

SIMD allows the use of 128-bit (SSE), 256-bit (AVX) or 512-bit (AVX-512) registers with the data packed consecutively. For example, with 32-bit floats, 4 of them fit in an SSE register. In this way, it is possible to compute 2x4 floats in parallel (register against register)! However, the operation is the same for every element in the register. Single Instruction, Multiple Data!

## Play with intrinsics

That's all very nice, but doesn't the compiler do it all by itself? What do you need to add to the source code or compiler options to use the SIMD extensions?

Here we go! To keep things simple, we're going to use an x86_64 machine with GCC 12 (Linux or Windows, it doesn't matter).

GCC provides several headers exposing the low-level calls that use the SIMD registers and instructions. There is one header per extension, but you can directly include <immintrin.h> to pull in all of them. These functions are called intrinsics because they use internal CPU features. Note, however, that you have access to all the extensions at compile time even if your target CPU does not support some (or any) of them. When your CPU tries to execute an instruction from an extension it doesn't have, you get a SIGILL (Illegal Instruction) error.
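
To guard against that, here is a minimal sketch of runtime feature detection, using GCC's __builtin_cpu_supports() builtin, before dispatching to a SIMD code path:

{% highlight c %}
// Minimal sketch: detect at runtime which SIMD extensions the CPU
// actually supports before calling into intrinsics-based code,
// so we never trigger a SIGILL on older hardware.
#include <stdio.h>

int main(void) {
	__builtin_cpu_init();	// initialize GCC's CPU detection (optional on recent GCC)

	if (__builtin_cpu_supports("sse2"))
		puts("SSE2 available");
	if (__builtin_cpu_supports("avx"))
		puts("AVX available");
	if (__builtin_cpu_supports("avx512f"))
		puts("AVX-512F available");

	// A real program would use these checks at startup to select
	// a scalar or vectorized implementation of its hot functions.
	return 0;
}
{% endhighlight %}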

Below is an example of an operation using intrinsics.

{% highlight c %}
// Intrinsics global header
#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h>	// rand(), RAND_MAX, EXIT_SUCCESS

float lhs[1024];
__attribute__ ((aligned (16))) float rhs[1024];
float dst[1024] = {0.f};

int main(int argc, char** argv) {
	unsigned int i;

	for (i = 0; i < 1024; ++i) {
		lhs[i] = rand() / (float) RAND_MAX;
		rhs[i] = rand() / (float) RAND_MAX;
	}

	for (i = 0; i < 1024; i += 4) {
		// Load into XMM registers 4 float values from the left and right input arrays
		__m128 reg_mmx_1 = _mm_loadu_ps(lhs + i);	// Load from (possibly) un-aligned memory, left array
		__m128 reg_mmx_2 = _mm_load_ps(rhs + i);	// Load from 16-byte aligned memory, right array

		// Do the vectorized add operation on our registers
		__m128 tmp_result = _mm_add_ps(reg_mmx_1, reg_mmx_2);

		// Store the computed result in the output array
		_mm_storeu_ps(dst + i, tmp_result);
	}

	for (i = 0; i < 1024; ++i)
		printf("%f\n", dst[i]);

	return EXIT_SUCCESS;
}
{% endhighlight %}
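
For the curious, here is roughly the assembly GCC produces for the hot loop (Intel syntax; the exact output depends on the GCC version and flags):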

{% highlight nasm %}
...

	lea	r8, dst[rip]
	lea	rcx, lhs[rip]
	lea	rdx, rhs[rip]
	.p2align 5

.for_loop:
	movaps	xmm0, XMMWORD PTR [rcx+rax]	# _mm_loadu_ps()
	addps	xmm0, XMMWORD PTR [rdx+rax]	# _mm_add_ps()
	movaps	XMMWORD PTR [r8+rax], xmm0	# _mm_storeu_ps()

	add	rax, 16
	cmp	rax, 4096
	jne	.for_loop

...
{% endhighlight %}

To tell the compiler that you want to use the 128-bit XMM registers, use the variable type __m128. There are also __m256 and __m512, but the corresponding 256- or 512-bit intrinsic functions must be used with them.
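
To give an idea, here is a minimal sketch of the same addition with 256-bit AVX registers (compile with -mavx; the _mm256_* intrinsics are the AVX counterparts of the SSE ones used above):

{% highlight c %}
// Minimal sketch: vectorized addition with 256-bit AVX registers,
// processing 8 floats per instruction instead of 4.
#include <immintrin.h>

void add8(float* dst, const float* a, const float* b) {
	__m256 reg_a = _mm256_loadu_ps(a);	// load 8 floats (no alignment required)
	__m256 reg_b = _mm256_loadu_ps(b);
	__m256 sum = _mm256_add_ps(reg_a, reg_b);	// 8 additions in one instruction
	_mm256_storeu_ps(dst, sum);	// store the 8 results
}
{% endhighlight %}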

You will also notice the difference between _mm_load_ps and _mm_loadu_ps: the latter allows loading from memory that is not aligned to 16 bytes, whereas the former requires that alignment. Loading from or storing to unaligned memory with _mm_load_ps/_mm_store_ps will cause a SEGFAULT (Segmentation Fault error). The same pair exists for stores (_mm_store_ps/_mm_storeu_ps). You can replace _ps with _pd to work on doubles instead of floats.
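
For illustration, a minimal sketch of common ways to obtain 16-byte aligned storage (aligned_alloc is standard C11; _mm_malloc/_mm_free come along with the intrinsics headers):

{% highlight c %}
// Minimal sketch: obtaining 16-byte aligned storage so that the
// aligned variants _mm_load_ps()/_mm_store_ps() can be used safely.
#include <immintrin.h>
#include <stdlib.h>

__attribute__ ((aligned (16))) static float static_buf[1024];	// aligned global

int main(void) {
	// C11 aligned_alloc: the size must be a multiple of the alignment
	float* heap_buf = aligned_alloc(16, 1024 * sizeof(float));

	// _mm_malloc/_mm_free, provided alongside the intrinsics
	float* mm_buf = _mm_malloc(1024 * sizeof(float), 16);

	// ... fill and process the buffers with aligned loads/stores ...

	_mm_free(mm_buf);
	free(heap_buf);
	return EXIT_SUCCESS;
}
{% endhighlight %}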

A list of intrinsic functions for the x86_64 architecture, sorted by extension, is available at this website.

GCC is capable of adding or substituting SIMD instructions according to the structure of the code it is processing, aka auto-vectorisation. However, it must be told which instructions it has access to, with the GCC options -msse, -mavx, etc., or directly via the target CPU architecture/model (if known) with -march= and -mtune=. By default, march and mtune are set to generic, so the compiler will only use instructions with maximum compatibility (and therefore no optimisation with extensions).

If you are compiling for exclusive use on your machine, compile with -march=native and -mtune=native. GCC will automatically detect your target CPU and use the known extensions for that CPU.

Auto-vectorisation is enabled by default at the higher optimisation levels (-O2 or -O3), but it can fail for various reasons, without informing the developer. I recommend keeping auto-vectorisation active, but explicitly indicating where and how you want vectorization to be applied, as in the sketch below. This way, in the event of failure (for alignment reasons, for example), the compiler can warn us that it was unable to apply our request, and why.
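
As a minimal sketch, here is the kind of loop GCC happily auto-vectorizes, together with a build line using GCC's -fopt-info-vec-* reporting options:

{% highlight c %}
// Minimal sketch: a loop shaped for auto-vectorisation. Build with:
//   gcc -O3 -march=native -fopt-info-vec-optimized -fopt-info-vec-missed vec.c
// so GCC reports which loops it vectorized and why it skipped the others.
#define N 1024

void add_arrays(float* restrict dst, const float* restrict a, const float* restrict b) {
	// 'restrict' promises the arrays do not overlap; aliasing doubts
	// are one of the most common reasons auto-vectorisation fails.
	for (int i = 0; i < N; ++i)
		dst[i] = a[i] + b[i];
}
{% endhighlight %}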

## A little word about OpenMP

OpenMP is an all-in-one API in many respects. This multithreading implementation, which has been extended to SIMD since version 4, allows you to optimise your code with a few well-placed #pragmas.
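
For example, a minimal sketch of the OpenMP 4 simd construct (compile with -fopenmp, or -fopenmp-simd to honour only the simd pragmas):

{% highlight c %}
// Minimal sketch: asking the compiler to vectorize a loop with the
// OpenMP 4 'simd' construct instead of hand-written intrinsics.
#define N 1024

void add_arrays(float* dst, const float* a, const float* b) {
	#pragma omp simd
	for (int i = 0; i < N; ++i)
		dst[i] = a[i] + b[i];
}
{% endhighlight %}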

However, it should be noted that each tool comes with its own advantages and disadvantages. Firstly, the optimisations made by OpenMP will only be visible to you by re-reading the assembly code of your program. As with the auto-vectorisation mentioned above, it is risky to rely solely on the compiler's work. And although there are profilers that support OpenMP, you may find it even more complicated to investigate bugs in your threads.

OpenMP is also not available on all toolchains (especially older or niche ones) and requires an OS to work (R.I.P. MCUs).

OpenMP platforms

If your application needs to run in a multi-CPU environment, in clusters, with dedicated accelerators (GPGPU, FPGA, ASIC, etc.), then OpenMP will be able to exploit its environment without you having to worry about it.
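
As an illustration only (this assumes a toolchain built with offloading support), the OpenMP target construct can send a loop to an accelerator and fall back to the host otherwise:

{% highlight c %}
// Minimal sketch: offloading a loop with the OpenMP 'target' construct.
// Without an accelerator (or offloading support), it runs on the host.
#define N 1024

void add_arrays(float* dst, const float* a, const float* b) {
	#pragma omp target teams distribute parallel for \
			map(to: a[0:N], b[0:N]) map(from: dst[0:N])
	for (int i = 0; i < N; ++i)
		dst[i] = a[i] + b[i];
}
{% endhighlight %}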

We'll no doubt be talking about this API again!

See you in the next post!

[END_OF_REPORT-240301.1027]