Publish lost simd post

@ -10,4 +10,222 @@ highlight: true
* Contents
{:toc}

WIP

Processors (CPUs) are a jewel of technology, capable of executing more than a billion operations per second
(OPS)! However, their sequential design means they can only perform one operation at a time (per core)...

How do CPUs manage to perform so many OPS despite this handicap, and what can we, as developers, do about it?
That's what we're going to find out in this blog post.

<div class="image300">
    <img src="{{"/static/images/posts/sleep_nerd.jpg" | prepend: site.baseurl }}"
    alt="Good reading" title="Good reading!" />
</div>

### __Do one thing at a time, but do it quick!__

First of all, let's take a brief look at how a CPU works.

The CPU is the main data processing unit in a computer. Its role is not limited to reading or writing data
from the memory it accesses: it can also perform **binary** operations (AND, OR, XOR, etc.) and **mathematical**
operations (ADD, SUB, MUL, etc.). With successive generations of architectures, CPUs have become increasingly
complete in terms of available operations, faster at executing them, but also more energy-efficient and bigger!

With this simplified architecture of a computing unit executing one operation per clock pulse, we can quickly
deduce that to increase performance, we simply need to increase the number of pulses per second, i.e. the **frequency**.
Indeed, increasing the frequency of the processor was a first step towards improving performance. But physics
dictated that increasing the frequency also raised the temperature of the CPU (*Joule effect*) towards the
critical temperature of the silicon (the chip material), leading to its destruction.

Among the notable advances, the move to **multi-core** in a single chip is one of the most important! The first
commercial multi-core CPU was IBM's "*POWER4*" (<3) in 2001. It was the start of parallelized data processing! And
the start of the resource-sharing hassle... Thanks to this technology, it was possible to increase performance
without raising the CPU frequency. Of course, the extra layer of multi-core management added latency to the chain,
but the gains in computing performance more than made up for it.

The principle of parallelization was not new: it was already commonly used in supercomputers such as the "*Cray X-MP*"
released in 1982, but at the time it was reserved for a lucky few. Today, having a CPU with at least 2 cores has
become commonplace.

But there's also something else that has been salvaged from supercomputers: I'm talking about SIMD, of course!

Before going any further, I must first talk about CPU architecture and the instruction/data processing paradigm,
also known as **Flynn's taxonomy**. CPUs can have different internal structures, and the paths that data takes and
the way it is transformed/computed are described by the architecture of the CPU. The most common in today's PCs is
the **AMD64** (aka. x86_64) architecture, but there are many others, such as the very recent *RISC-V* or even *ARM*,
which has been powering our portable equipment for years.

Returning to SIMD: according to Flynn's taxonomy, CPU architectures can be classified by the answers to 2 questions:

- are one or more **data** processed per calculation cycle?
- are one or more **instructions** processed per calculation cycle?

<div class="image512">
    <img src="{{"/static/images/posts/flynn_table.png" | prepend: site.baseurl }}"
    alt="Flynn's taxonomy" title="Flynn's taxonomy" />
</div>

SISD is the standard mode of operation of a CPU in everyday use. It is the easiest of the 4 to implement, but it is
also the slowest.

A CPU architecture can operate in SISD and switch during operation to SIMD.

### __Single Instruction Multiple Data (SIMD)__

<div class="image640">
    <img src="{{"/static/images/posts/simd_evol.png" | prepend: site.baseurl }}"
    alt="SIMD evolution" title="Evolution of x86_64 SIMD extensions" />
</div>

SIMD (often called SIMD acceleration) is a way of processing data on one or more CPU cores. It generally
takes the form of an extension to the instruction set of the CPU architecture that wishes to use it. On x86_64,
it is known as **MMX, SSE, AVX** and more recently **AMX**. On ARM it is known as **NEON/SVE**. RISC-V keeps it
simple, with a single "**vector extension**".

Let's stay with x86_64 for the explanations. Curious readers will have noticed that this architecture has several
MM0..MMx registers. Well, yes! These are the registers added by the MMX extension! This dates back to 1997, with
the release of Intel's *Pentium*, a few years before the first multi-core processors. SIMD works at the scale of a
single CPU core first and foremost, although it can be applied on a larger scale across multiple cores, but that's
a (likely) next topic! MMX was the first SIMD extension on x86_64, later replaced by SSE and complemented by AVX
from 2008.

<div class="image512">
    <img src="{{"/static/images/posts/simd_concept.gif" | prepend: site.baseurl }}"
    alt="SIMD concept" title="SISD vs. SIMD" />
</div>

How does SIMD work in practice? Well, by way of comparison, data is normally processed by 1 or 2 registers at a time.
These general-purpose registers are now 64 bits long, and whatever the size of the data to be processed, a register
keeps its 64-bit size. So if we do a simple addition between two 8-bit integers, only 16 of the 128 bits held by the
two registers take part in the operation: 112 bits are wasted!

<div class="image300">
    <img src="{{"/static/images/posts/mmx_registers.png" | prepend: site.baseurl }}"
    alt="MMX registers" title="SSE/AVX registers" />
</div>

SIMD allows the use of 128-bit (SSE), 256-bit (AVX) or 512-bit (AVX-512) registers with the data placed consecutively.
For example, with 32-bit floats, 4 can be stored in an SSE register. In this way, it is possible to compute 2x4 floats
in parallel (register against register)! However, the operation will be the same for every element in the register.
Single Instruction, Multiple Data!

### __Play with intrinsics__

*That's all very nice, but doesn't the compiler do it all by itself? What do you need to add to the source code or compiler*
*options to use the SIMD extensions?*

Here we go! To keep things simple, we're going to use an **x86_64** machine with **GCC 12** (Linux or Windows, it doesn't matter).

GCC provides several headers allowing low-level calls that use the MMx/XMM registers and SIMD instructions. There is a
header for each extension, but you can directly include **\<immintrin.h\>** to pull in all of them. These functions are
called "**intrinsics**" because they expose internal CPU features. Please note, however, that at compile time you have
access to all the extensions, even those your target CPU does not implement. When your CPU tries to execute an
instruction from an extension that it **doesn't have**, you will get a **SIGILL** (*Illegal Instruction*) error.

Below is an example of an operation using intrinsics.

{% highlight c %}
// Intrinsics global header
#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h> // rand(), RAND_MAX, EXIT_SUCCESS

float lhs[1024];
__attribute__ ((aligned (16))) float rhs[1024];
float dst[1024] = {0.f};

int main(int argc, char** argv) {
    unsigned int i;

    for (i = 0; i < 1024; ++i) {
        lhs[i] = rand() / (float) RAND_MAX;
        rhs[i] = rand() / (float) RAND_MAX;
    }

    for (i = 0; i < 1024; i += 4) {
        // Load 4 float values from the left and right input arrays into XMM registers
        __m128 reg_mmx_1 = _mm_loadu_ps(lhs + i); // Load from the (possibly unaligned) left array
        __m128 reg_mmx_2 = _mm_load_ps(rhs + i);  // Load from the 16-byte aligned right array

        // Perform a vectorized add on our registers
        __m128 tmp_result = _mm_add_ps(reg_mmx_1, reg_mmx_2);

        // Store the computed result in the output array
        _mm_storeu_ps(dst + i, tmp_result);
    }

    for (i = 0; i < 1024; ++i)
        printf("%f\n", dst[i]);

    return EXIT_SUCCESS;
}
{% endhighlight %}

{% highlight nasm %}
...

        lea     r8, dst[rip]
        lea     rcx, lhs[rip]
        lea     rdx, rhs[rip]
        .p2align 5

.for_loop:
        movaps  xmm0, XMMWORD PTR [rcx+rax]   # _mm_loadu_ps()
        addps   xmm0, XMMWORD PTR [rdx+rax]   # _mm_add_ps()
        movaps  XMMWORD PTR [r8+rax], xmm0    # _mm_storeu_ps()

        add     rax, 16
        cmp     rax, 4096
        jne     .for_loop

...
{% endhighlight %}

To tell the compiler that you want to use these registers, use the variable type `__m128` for a 128-bit register. There
are also `__m256` and `__m512`, but the corresponding 256- or 512-bit intrinsic functions must be used with them.

You will also notice the difference between `load_ps` and `loadu_ps`: the latter allows loading from memory that is
not aligned to 16 bytes, whereas the former requires that alignment. Loading from or storing to unaligned memory with
`_mm_load_ps`/`_mm_store_ps` will cause a **SEGFAULT** (*Segmentation Fault* error). The same variants exist for
`_mm_store`. You can also replace `_ps` with `_pd` to work with **double** instead of **float**.

A list of intrinsic functions for the x86_64 architecture, sorted by extension, is available [at this website](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html).

GCC is capable of adding/replacing SIMD instructions according to the structure of the code it is compiling, aka
**auto-vectorisation**. However, it must be told which instructions it has access to, with the GCC options `-msse`,
`-mavx`, etc., or directly the CPU architecture/model (if known) via `-march=` and `-mtune=`. By default, `march` and
`mtune` are set to `generic`, so the compiler will only use instructions with maximum compatibility (and therefore no
optimisation with extensions).

> If you are compiling for exclusive use on your machine, compile with `-march=native` and `-mtune=native`. GCC will
> automatically detect your target CPU and use the extensions known to be supported by that CPU.

Auto-vectorisation is enabled by default at higher optimisation levels (such as `-O2` or `-O3`), but it can fail for
various reasons, and **without informing the developer**. I recommend keeping auto-vectorisation active, but explicitly
indicating where and how you want to use intrinsics. This way, in the event of failure (for alignment reasons, for
example), the compiler will warn us that it was unable to apply our instructions, and why.

### __A little word about OpenMP__

OpenMP is an all-in-one API in many respects. This multithreading implementation, which has been extended to SIMD
since version 4, allows you to optimise your code with a few well-placed `#pragma`s.

However, it should be noted that every tool comes with its own advantages and disadvantages. Firstly, the optimisations
made by OpenMP will only be visible to you by re-reading the assembly code of your program. As with the
auto-vectorisation mentioned above, it's tricky to rely solely on the work of the compiler. And although there are
profilers that support OpenMP, you may find it even more complicated to investigate bugs in your threads.

OpenMP is also not available on all toolchains (especially older or niche ones) and requires an OS to work (R.I.P. MCUs).

<div class="image512">
    <img src="{{"/static/images/posts/omp_platforms.png" | prepend: site.baseurl }}"
    alt="OpenMP platforms" title="OpenMP supported environments" />
</div>

If your application needs to run in a multi-CPU environment, in clusters, or with dedicated accelerators (GPGPU, FPGA,
ASIC, etc.), then OpenMP will be able to exploit its environment without you having to worry about it.

We'll no doubt be talking about this API again!

See you in the next post!

**[END_OF_REPORT-240301.1027]**

@ -29,7 +29,10 @@ that don't have hardware acceleration (GPU).

I haven't planned any particular guidelines for these blog posts. I'll be writing about the various topics as I
progress and the various issues I come across!

<div class="image512">
    <img src="{{"/static/images/posts/computer_nerd.jpg" | prepend: site.baseurl }}"
    alt="Trust me, I'm an engineer!" title="Trust me, I'm an engineer!" />
</div>

### __My problem with Unreal, Unity, Godot, etc.__

@ -63,7 +66,10 @@ engine in general.

No pain, no gain.

<div class="image640">
    <img src="{{"/static/images/posts/sfml_games.jpg" | prepend: site.baseurl }}"
    alt="SFML examples" title="A lot of indie games have been developed using SFML" />
</div>

### __Minimum equipment__

@ -229,6 +229,7 @@ a:hover code { color: rgb(115, 139, 170); text-decoration: underline; }

.post {
    padding: 10px 30px;
    font-size: 16px;
    color: rgb(12, 12, 12);
    line-height: 1.5;
}

@ -344,6 +345,28 @@ img + em {

// Explicitly sized images (centered in parent container):
// ========================================================

.image720 {
    width: auto;
    height: auto;
    max-width: 720px;
    max-height: 720px;
    position: relative;
    display: table;
    margin: auto;
    margin-bottom: 8px;
}

.image640 {
    width: auto;
    height: auto;
    max-width: 640px;
    max-height: 640px;
    position: relative;
    display: table;
    margin: auto;
    margin-bottom: 8px;
}

.image512 {
    width: auto;
    height: auto;

@ -521,6 +544,7 @@ img + em {

    float: left;
    margin-bottom: 15px;
    padding-left: 15px;
    color: rgb(38, 38, 38);
}

.footer-col-1, .footer-col-2 {

BIN static/images/posts/flynn_table.png (new file) After: 66 KiB, Before: 156 KiB
BIN static/images/posts/mmx_registers.png (new file) After: 13 KiB
BIN static/images/posts/omp_platforms.png (new file) After: 64 KiB
BIN static/images/posts/simd_concept.gif (new file) After: 9.6 KiB
BIN static/images/posts/simd_evol.png (new file) After: 47 KiB
BIN static/images/posts/sleep_nerd.jpg (new file) After: 23 KiB