---
layout: post
title: CPU Vectorized Acceleration (SIMD)
author: JackCarterSmith
categories: Programming C C++
thumbnail: cpu_simd
highlight: true
---

* Contents
{:toc}

Processors (CPUs) are a jewel of technology, capable of executing more than a billion operations per second
(OPS)! However, their sequential design means they can only perform one operation at a time (and per core)...

How do CPUs manage to perform so many OPS despite this handicap, and what can we, as developers, do about it? That's
what we're going to try to find out in this blog post.

<div class="image300">
  <img src="{{"/static/images/posts/sleep_nerd.jpg" | prepend: site.baseurl }}"
  alt="Good reading" title="Good reading!" />
</div>

### __Do one thing at a time, but do it quick!__

First of all, let's take a brief look at how a CPU works.

The CPU is the main data processing unit in a computer. Its role is not limited to reading or writing data
from the memory it accesses: it can also perform **binary** operations (AND, OR, XOR, etc.) and **mathematical**
operations (ADD, SUB, MUL, etc.). With successive generations of architectures, CPUs have become increasingly
capable in terms of supported operations, faster at executing them, but also more energy-efficient and bigger!

With this simplified view of a computing unit executing one operation per clock pulse, we can quickly
deduce that to increase performance, we simply need to increase the number of pulses per second, i.e. the **frequency**.
Indeed, increasing the frequency of the processor was a first step towards improving performance. But physics
dictated that increasing the frequency meant raising the temperature of the CPU (*Joule effect*) towards the
critical temperature of the silicon (the chip material), leading to its destruction.

Among the notable advances, the move to **multi-core** in a single chip is one of the most important! The first
commercial example was IBM's "*POWER4*" (<3) in 2001. It was the start of the parallelization of data processing! And
the start of the resource-sharing hassle... Thanks to this technology, it was possible to increase performance
without raising the CPU frequency. Of course, the extra layer of multi-core management added latency to the chain,
but the gains in computing performance more than made up for it.

The principle of parallelization was not new: it was already commonly used in supercomputers such as the "*Cray X-MP*"
released in 1982, but at the time it was reserved for a lucky few. Today, having a CPU with at least 2 cores has
become commonplace.

But there's also something else that has been salvaged from the supercomputers - I'm talking about SIMD, of course!

Before going any further, I must first talk about CPU architecture and the instruction/data processing paradigm,
also known as **Flynn's taxonomy**. CPUs can have different internal structures, and the paths that data takes and
how it is transformed/computed are described by the architecture of the CPU. The most common in today's PCs is the
**AMD64** (aka. x86_64) architecture, but there are many others, such as the very recent *RISC-V* or even *ARM*,
which has been powering our portable devices for years.

Returning to SIMD: according to Flynn's taxonomy, CPU architectures can be classified by the answers to 2 questions:

- are one or more **data** elements processed per calculation cycle?
- are one or more **instructions** processed per calculation cycle?

<div class="image512">
  <img src="{{"/static/images/posts/flynn_table.png" | prepend: site.baseurl }}"
  alt="Flynn's taxonomy" title="Flynn's taxonomy" />
</div>

SISD is the standard mode of operation of the CPU in everyday use. It is the easiest of the 4 to implement, but it
is also the slowest.

A CPU architecture can operate in SISD and switch to SIMD during operation.

### __Single Instruction, Multiple Data (SIMD)__

<div class="image640">
  <img src="{{"/static/images/posts/simd_evol.png" | prepend: site.baseurl }}"
  alt="SIMD evolution" title="Evolution of x86_64 SIMD extensions" />
</div>

SIMD (more often called SIMD acceleration) is a way of processing data on one or more CPU cores. It generally
takes the form of an extension to the instruction set of the CPU architecture that wishes to use it. On x86_64,
it is known as **MMX, SSE, AVX** and more recently **AMX**. On ARM it is known as **NEON/SVE**. RISC-V is no more
complicated than that, with a simple "**vector extension**".

Let's stay with x86_64 for the explanations. Some curious people will notice that this architecture has several
MM0..MMx registers. Well, well... yes! These are the registers added by the MMX extension! This dates back to
1997, with the release of Intel's *Pentium MMX*, a few years before the first multi-core processors. SIMD works at
the scale of a single CPU core first and foremost, although it can be applied on a larger scale across multiple
cores - but that's a (likely) next topic! MMX was the first SIMD extension on x86_64, later replaced by SSE and
complemented by AVX from 2008.

<div class="image512">
  <img src="{{"/static/images/posts/simd_concept.gif" | prepend: site.baseurl }}"
  alt="SIMD concept" title="SISD vs. SIMD" />
</div>

How does SIMD work in practice? Well, by way of comparison, data is normally processed through 1 or 2 registers at a
time. These registers are now 64 bits wide, and whatever the size of the data to be processed, the register keeps its
64-bit size. So if we do a simple addition between two 8-bit integers, 112 of the registers' combined 128 bits
(2 × 64 − 2 × 8 = 112) are not used in the operation!

<div class="image300">
  <img src="{{"/static/images/posts/mmx_registers.png" | prepend: site.baseurl }}"
  alt="MMX registers" title="SSE/AVX registers" />
</div>

SIMD allows the use of 128-bit (SSE), 256-bit (AVX) or 512-bit (AVX-512) registers with the data placed consecutively.
For example, with 32-bit floats, 4 of them can be stored in an SSE register. In this way, it is possible to compute
2×4 floats in parallel (register against register)! However, the operation will be the same for each element in the
register. Single Instruction, Multiple Data!

### __Play with intrinsics__

*That's all very nice, but doesn't the compiler do it all by itself? What do you need to add to the source code or compiler*
*options to use the SIMD extensions?*

Here we go! To keep things simple, we're going to use an **x86_64** machine with **GCC-12** (Linux or Windows, it doesn't matter).

GCC provides several headers allowing low-level calls to use MMx registers and SIMD instructions. There is a header for each
extension, but you can directly include **\<immintrin.h\>** to pull in all of them. These functions are called
"**intrinsics**" because they use internal CPU features. Please note, however, that you will have access to all the extensions
at compile time even if your target CPU does not implement some or all of them. When your CPU tries to execute an
instruction from an extension that it **doesn't have**, you will get a **SIGILL** (*Illegal Instruction* error).

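A common way to stay safe is to check at runtime which extensions the host CPU actually implements, and dispatch to the
right code path. Below is a minimal sketch using GCC's `__builtin_cpu_supports()` builtin and its `target` function
attribute; the `add_avx`/`add_scalar` helpers are my own illustrative names, not a standard API.

{% highlight c %}
#include <immintrin.h>
#include <stdio.h>

// Scalar fallback: compiles and runs on any x86_64 CPU.
static void add_scalar(const float* a, const float* b, float* out, unsigned n) {
    for (unsigned i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

// AVX path: the target attribute lets GCC emit AVX instructions for this
// function only, without compiling the whole file with -mavx.
__attribute__ ((target ("avx")))
static void add_avx(const float* a, const float* b, float* out, unsigned n) {
    for (unsigned i = 0; i + 8 <= n; i += 8)   // 8 floats per 256-bit register
        _mm256_storeu_ps(out + i,
                         _mm256_add_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i)));
}

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8}, b[8] = {8, 7, 6, 5, 4, 3, 2, 1}, out[8];

    // Non-zero if the CPU we are *running* on implements AVX.
    if (__builtin_cpu_supports("avx"))
        add_avx(a, b, out, 8);     // safe: the instructions exist on this CPU
    else
        add_scalar(a, b, out, 8);  // graceful fallback instead of a SIGILL

    for (unsigned i = 0; i < 8; ++i)
        printf("%f\n", out[i]);
    return 0;
}
{% endhighlight %}
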
Below is an example of an operation using intrinsics.

{% highlight c %}
// Intrinsics global header
#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h>   // rand(), RAND_MAX, EXIT_SUCCESS

float lhs[1024];
__attribute__ ((aligned (16))) float rhs[1024];
float dst[1024] = {0.f};

int main(int argc, char** argv) {
    unsigned int i;

    for (i = 0; i < 1024; ++i) {
        lhs[i] = rand() / (float) RAND_MAX;
        rhs[i] = rand() / (float) RAND_MAX;
    }

    for (i = 0; i < 1024; i += 4) {
        // Load 4 float values from the left and right input arrays into MMx registers
        __m128 reg_mmx_1 = _mm_loadu_ps(lhs + i); // Load from the un-aligned left array
        __m128 reg_mmx_2 = _mm_load_ps(rhs + i);  // Load from the 16-byte aligned right array

        // Do the vectorized add operation on our registers
        __m128 tmp_result = _mm_add_ps(reg_mmx_1, reg_mmx_2);

        // Store the computed result in the output array
        _mm_storeu_ps(dst + i, tmp_result);
    }

    for (i = 0; i < 1024; ++i)
        printf("%f\n", dst[i]);

    return EXIT_SUCCESS;
}
{% endhighlight %}

{% highlight nasm %}
...

        lea     r8, dst[rip]
        lea     rcx, lhs[rip]
        lea     rdx, rhs[rip]
        .p2align 5

.for_loop:
        movaps  xmm0, XMMWORD PTR [rcx+rax]   # _mm_loadu_ps()
        addps   xmm0, XMMWORD PTR [rdx+rax]   # _mm_add_ps()
        movaps  XMMWORD PTR [r8+rax], xmm0    # _mm_storeu_ps()

        add     rax, 16
        cmp     rax, 4096
        jne     .for_loop

...
{% endhighlight %}

To tell the compiler that you want to use the MMx registers, use the variable type `__m128` for a 128-bit register. There are
also `__m256` and `__m512`, but the matching 256- or 512-bit intrinsics functions must be used with them. You may also notice
in the assembly above that GCC emitted `movaps` even for `_mm_loadu_ps()`/`_mm_storeu_ps()`: it laid out these global arrays
itself, so it knows they are in fact 16-byte aligned and can promote the unaligned accesses.

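For instance, widening the earlier loop to AVX just means swapping the type and the intrinsic prefix together - a minimal
sketch (assuming the file is compiled with `-mavx` and the same 1024-element arrays as above):

{% highlight c %}
#include <immintrin.h>

extern float lhs[1024], rhs[1024], dst[1024];

void mul_avx(void) {
    // __m256 holds 8 floats; the matching intrinsics use the _mm256_ prefix.
    for (unsigned int i = 0; i < 1024; i += 8) {
        __m256 a = _mm256_loadu_ps(lhs + i);
        __m256 b = _mm256_loadu_ps(rhs + i);
        _mm256_storeu_ps(dst + i, _mm256_mul_ps(a, b));   // vectorized multiply
    }
}
{% endhighlight %}
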
You will also notice the difference between `load_ps` and `loadu_ps`: the latter allows loading from memory which is
not aligned to 16 bytes, whereas the former requires that alignment. Loading or storing to unaligned memory with
`load/store_ps` will cause a **SEGFAULT** (*Segmentation Fault* error). The same aligned/unaligned variants exist for
`_mm_store`. And you can replace `_ps` with `_pd` to use **double** instead of **float**.

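Heap allocations need the same care: the aligned variants are only safe if you requested the alignment yourself. Here is
a minimal sketch with C11's `aligned_alloc()` and the `_pd` (double) flavour of the intrinsics:

{% highlight c %}
#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    // aligned_alloc (C11) guarantees the requested alignment; the size must be
    // a multiple of it. 16 bytes is what _mm_load_pd/_mm_store_pd expect.
    double* a = aligned_alloc(16, 8 * sizeof(double));
    double* b = aligned_alloc(16, 8 * sizeof(double));
    double* c = aligned_alloc(16, 8 * sizeof(double));
    if (!a || !b || !c) return EXIT_FAILURE;

    for (int i = 0; i < 8; ++i) { a[i] = i; b[i] = 10.0 * i; }

    for (int i = 0; i < 8; i += 2) {              // 2 doubles per 128-bit register
        __m128d va = _mm_load_pd(a + i);          // aligned load: safe here
        __m128d vb = _mm_load_pd(b + i);
        _mm_store_pd(c + i, _mm_add_pd(va, vb));  // aligned store
    }

    for (int i = 0; i < 8; ++i)
        printf("%f\n", c[i]);

    free(a); free(b); free(c);
    return EXIT_SUCCESS;
}
{% endhighlight %}
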
A list of intrinsics functions for the x86_64 architecture, sorted by extension, is available [at this website](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html).

GCC is capable of adding/replacing SIMD instructions according to the structure of the code it is currently processing, aka.
**auto-vectorisation**. However, it must be told which extensions it has access to, with the GCC options `-msse`, `-mavx`, etc.,
or directly the CPU architecture/model (if known) via `-march=` and `-mtune=`. By default, GCC tunes for `generic`, so the
compiler will only use instructions with maximum compatibility (and therefore no optimisation with the newer extensions).

> If you are compiling for exclusive use on your machine, compile with `-march=native` and `-mtune=native`. GCC will automatically
> detect your target CPU and use the known extensions for that CPU.

Auto-vectorisation is enabled by default at the higher optimisation levels (such as `-O2` or `-O3`), but it can fail for various
reasons, and **without informing the developer**. We recommend that you keep auto-vectorisation active, but explicitly indicate
where and how you want to use intrinsics. This way, in the event of failure (for alignment reasons, for example), the compiler
will warn us that it was unable to apply our instructions, and why.

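To see what the compiler actually did, GCC's `-fopt-info-vec` and `-fopt-info-vec-missed` reports are invaluable. Below is a
sketch of a typical auto-vectorisation candidate; `__builtin_assume_aligned()` is a GCC builtin that removes one common
reason for a missed vectorisation, and the compile command in the comment is just an example invocation:

{% highlight c %}
// Example invocation:
//   gcc -O3 -march=native -fopt-info-vec -fopt-info-vec-missed -c add.c
// -fopt-info-vec lists the loops that were vectorized,
// -fopt-info-vec-missed lists the ones that were not, and why.

#define N 1024

float lhs[N], rhs[N], dst[N];

void add_arrays(void) {
    // Promise GCC that these pointers are 16-byte aligned.
    float* a = __builtin_assume_aligned(lhs, 16);
    float* b = __builtin_assume_aligned(rhs, 16);
    float* d = __builtin_assume_aligned(dst, 16);

    // Plain scalar loop: GCC will turn it into SIMD instructions on its own.
    for (int i = 0; i < N; ++i)
        d[i] = a[i] + b[i];
}
{% endhighlight %}
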
### __A little word about OpenMP__

OpenMP is an all-in-one API in many respects. This implementation of multithreading, which has been extended to SIMD since
version 4.0, allows you to optimise your code with a few well-placed `#pragma`s.

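For example, handing the addition loop from earlier to OpenMP takes a single pragma - a minimal sketch, to be compiled
with GCC's `-fopenmp` option (or `-fopenmp-simd` to enable only the SIMD constructs, without the threading runtime):

{% highlight c %}
#include <stdio.h>

#define N 1024

float lhs[N], rhs[N], dst[N];

int main(void) {
    for (int i = 0; i < N; ++i) { lhs[i] = i; rhs[i] = N - i; }

    // OpenMP 4.0 'simd' construct: asks the compiler to vectorize this loop,
    // without spawning any threads (combine with 'parallel for' for that).
    #pragma omp simd
    for (int i = 0; i < N; ++i)
        dst[i] = lhs[i] + rhs[i];

    printf("%f\n", dst[0]);   // always 1024.0 here
    return 0;
}
{% endhighlight %}
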
However, it should be noted that each tool comes with its own advantages and disadvantages. Firstly, the optimisations made by
OpenMP will only be visible to you by re-reading the assembly code of your program. As with the auto-vectorisation mentioned
above, it's tricky to rely solely on the work of the compiler. And although there are profilers that support OpenMP, you may
find it even more complicated to investigate any bugs in your threads.

OpenMP is also not available on all toolchains (especially older or niche ones) and requires an OS to work (R.I.P. MCUs).

<div class="image512">
  <img src="{{"/static/images/posts/omp_platforms.png" | prepend: site.baseurl }}"
  alt="OpenMP platforms" title="OpenMP supported environments" />
</div>

If your application needs to run in a multi-CPU environment, in clusters, or with dedicated accelerators (GPGPU, FPGA, ASIC,
etc.), then OpenMP will be able to exploit its environment without you having to worry about it.

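As a teaser, here is roughly what offloading that same loop to an accelerator looks like with OpenMP's `target` construct -
a sketch that assumes a toolchain built with offloading support (without one, the region simply runs on the host CPU):

{% highlight c %}
#define N 1024

float lhs[N], rhs[N], dst[N];

void add_offload(void) {
    // 'target' sends the region to an available device (GPU, etc.);
    // the 'map' clauses describe which arrays travel to and from it.
    #pragma omp target teams distribute parallel for \
            map(to: lhs[0:N], rhs[0:N]) map(from: dst[0:N])
    for (int i = 0; i < N; ++i)
        dst[i] = lhs[i] + rhs[i];
}
{% endhighlight %}
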
We'll no doubt be talking about this API again!

See you in the next post!

**[END_OF_REPORT-240301.1027]**