diff --git a/_posts/2024-03-01-cpu-vectorized-acceleration.markdown b/_posts/2024-03-01-cpu-vectorized-acceleration.markdown index 6917a86..0963f8a 100644 --- a/_posts/2024-03-01-cpu-vectorized-acceleration.markdown +++ b/_posts/2024-03-01-cpu-vectorized-acceleration.markdown @@ -10,4 +10,222 @@ highlight: true * Contents {:toc} -WIP +Processors (CPUs) are a jewel of technology, capable of executing more than a billion operations per second +(OPS)! However, their sequential design means they can only perform one operation at a time (and per core)... + +How do CPUs manage to perform so many OPS despite this handicap, and what can we, developers, do about it? That's +what we're going to try to find out in this blog post. + +
+Good reading +
+

### __Do one thing at a time, but do it quick!__

First of all, let's take a brief look at how a CPU works.

The CPU is the main data processing unit in a computer. Its role is not limited to reading or writing data
from the memory it accesses: it can also perform **binary** operations (AND, OR, XOR, etc.) and **mathematical**
operations (ADD, SUB, MUL, etc.). With successive generations of architectures, CPUs have become increasingly
complete in terms of supported operations and faster at executing them, while also growing more energy-efficient
and ever denser in transistors!

With this simplified picture of a computing unit executing one operation per clock pulse, we can quickly
deduce that to increase performance, we simply need to increase the number of pulses per second, i.e. the **frequency**.
Indeed, increasing the frequency of the processor was the first lever used to improve performance. But physics
got in the way: raising the frequency raises the temperature of the CPU (*Joule effect*) towards the
critical temperature of the silicon (the chip material), leading to its destruction.

Among the notable advances, the move to **multi-core** in a single chip is one of the most important! The first
commercial multi-core CPU was IBM's "*POWER4*" (<3) in 2001. It was the start of parallelized data processing! And
the start of the resource-sharing hassle... Thanks to this technology, it was possible to increase performance
without raising the CPU frequency. Of course, the extra layer of multi-core management added latency to the chain,
but the gains in computing performance more than made up for it.

The principle of parallelization was not new: it was already commonly used in supercomputers such as the "*Cray X-MP*"
released in 1982, but at the time it was reserved for a lucky few. Today, having a CPU with at least 2 cores has
become commonplace.

But there's also something else that has been borrowed from supercomputers - I'm talking about SIMD, of course!

Before going any further, I must first talk about CPU architecture and the instruction/data processing paradigm,
also known as **Flynn's taxonomy**. CPUs can have different internal structures, and the paths that data takes and
the way it is transformed/computed are described by the architecture of the CPU. The most common in today's PCs is the
**AMD64** (aka x86_64) architecture, but there are many others, such as the very recent *RISC-V* or even *ARM*,
which has been powering our portable equipment for years.

Returning to SIMD: according to Flynn's taxonomy, CPU architectures can be classified by two questions:

- are one or more **data** items processed per calculation cycle?
- are one or more **instructions** processed per calculation cycle?

The two answers combine into four classes: SISD, SIMD, MISD and MIMD.

+Flynn's taxonomy +

SISD is how the CPU operates in everyday use. It is the easiest of the 4 to implement, but it is also
the slowest.

A CPU architecture can operate in SISD and switch to SIMD during execution.

### __Single Instruction, Multiple Data (SIMD)__

+SIMD evolution +

SIMD (often called SIMD acceleration) is a way of processing data on one or more CPU cores. It generally
takes the form of an extension to the instruction set of the CPU architecture that wishes to use it. On x86_64,
it is known as **MMX, SSE, AVX** and more recently **AMX**. On ARM it is known as **NEON/SVE**. RISC-V keeps it
simple with its "**vector extension**".

Let's stay with x86_64 for the explanations. Curious readers will notice that this architecture has several
MM0..MMx registers. Well, yes: these are the registers added by the MMX extension! This dates back to
1997, with the release of Intel's *Pentium MMX*, a few years before the first multi-core processors. SIMD works at the
scale of a single CPU core first and foremost, although it can be applied on a larger scale across multiple cores -
but that's a (likely) next topic! MMX was the first SIMD extension on x86_64, later superseded by SSE and extended
by AVX from 2008.

+SIMD concept +

How does SIMD work in practice? By way of comparison, data is normally processed one or two registers at a time.
These general-purpose registers are now 64 bits wide, and whatever the size of the data to be processed, a register keeps its
64-bit size. So if we do a simple addition between two 8-bit integers, 112 of the 128 register bits take no part in the
operation!

+MMX registers +

SIMD allows the use of 128-bit (SSE), 256-bit (AVX) or 512-bit (AVX-512) registers with the data packed consecutively.
For example, with 32-bit floats, 4 of them fit in an SSE register. In this way, it is possible to add 2x4 floats in parallel
(register against register)! However, the operation is the same for every element in the register:
Single Instruction, Multiple Data!

### __Play with intrinsics__

*That's all very nice, but doesn't the compiler do it all by itself? What do you need to add to the source code or compiler
options to use the SIMD extensions?*

Here we go! To keep things simple, we're going to use an **x86_64** machine with **GCC 12** (Linux or Windows, it doesn't matter).

GCC provides several headers exposing low-level calls that use the SIMD registers and instructions. There is a header for each
extension, but you can directly include **\<immintrin.h\>** to get all the extensions at once. These functions are called
"**intrinsics**" because they map to internal CPU features. Please note, however, that you will have access to all the extensions
at compile time even if your target CPU does not implement some or all of them. When your CPU hits an
instruction from an extension that it **doesn't have**, you will get a **SIGILL** (*Illegal Instruction* error).

Below is an example of an operation using intrinsics.

{% highlight c %}
// Intrinsics global header
#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h>

float lhs[1024];
__attribute__ ((aligned (16))) float rhs[1024];
float dst[1024] = {0.f};

int main(int argc, char** argv) {
    unsigned int i;

    for (i = 0; i < 1024; ++i) {
        lhs[i] = rand() / (float) RAND_MAX;
        rhs[i] = rand() / (float) RAND_MAX;
    }

    for (i = 0; i < 1024; i += 4) {
        // Load 4 float values into an XMM register from each input array
        __m128 reg_mmx_1 = _mm_loadu_ps(lhs + i); // Load from the (possibly) un-aligned left array
        __m128 reg_mmx_2 = _mm_load_ps(rhs + i);  // Load from the 16-byte aligned right array

        // Do the vectorized add operation on our registers
        __m128 tmp_result = _mm_add_ps(reg_mmx_1, reg_mmx_2);

        // Store the computed result in the output array
        _mm_storeu_ps(dst + i, tmp_result);
    }

    for (i = 0; i < 1024; ++i)
        printf("%f\n", dst[i]);

    return EXIT_SUCCESS;
}
{% endhighlight %}

{% highlight nasm %}
...

    lea r8, dst[rip]
    lea rcx, lhs[rip]
    lea rdx, rhs[rip]
    .p2align 5

.for_loop:
    movaps xmm0, XMMWORD PTR [rcx+rax] # _mm_loadu_ps()
    addps xmm0, XMMWORD PTR [rdx+rax]  # _mm_add_ps()
    movaps XMMWORD PTR [r8+rax], xmm0  # _mm_storeu_ps()

    add rax, 16
    cmp rax, 4096
    jne .for_loop

...
{% endhighlight %}

To tell the compiler that you want to use the 128-bit registers, use the variable type `__m128`. There are also
`__m256` and `__m512`, but the matching 256- or 512-bit intrinsics functions must be used with them.

You will also notice the difference between `_mm_load_ps` and `_mm_loadu_ps`: the latter allows loading from memory that is
**not** aligned to 16 bytes, whereas the former requires that alignment. Loading from or storing to unaligned memory with the
aligned `_mm_load_ps`/`_mm_store_ps` will cause a **SEGFAULT** (*Segmentation Fault* error). The same aligned/unaligned pair
exists for `_mm_store`. You can replace `_ps` with `_pd` to use **double** instead of **float**.

A list of the intrinsics functions for the x86_64 architecture, sorted by extension, is available [at this website](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html).

GCC is capable of adding/replacing SIMD instructions according to the structure of the code it is compiling, aka
**auto-vectorisation**. However, it must be told which instructions it may use, either with the options `-msse`, `-mavx`, etc.,
or directly with the target CPU architecture/model (if known) via `-march=` and `-mtune=`. By default, GCC tunes for `generic`,
so it will only emit instructions with maximum compatibility (and therefore no optimisation with extensions).

> If you are compiling for exclusive use on your machine, compile with `-march=native` and `-mtune=native`. GCC will automatically
> detect your target CPU and use the extensions known to be available on it.

Auto-vectorisation is enabled by default at the higher optimisation levels (such as `-O2` or `-O3`), but it can fail for various
reasons, and **without informing the developer**. We recommend keeping auto-vectorisation active, but explicitly indicating where
and how you want to use intrinsics. This way, in the event of failure (for alignment reasons, for example), the compiler can
tell us that it was unable to vectorise, and why (see `-fopt-info-vec-missed`).

### __A little word about OpenMP__

OpenMP is an all-in-one API in many respects. This multithreading implementation, extended to SIMD since version 4.0,
allows you to optimise your code with a few well-placed `#pragma`s.

However, each tool comes with its own advantages and disadvantages. Firstly, the optimisations made by
OpenMP will only be visible to you by re-reading the assembly code of your program. As with the auto-vectorisation mentioned above,
it's tricky to rely solely on the work of the compiler alone.
And although there are profilers that support OpenMP, investigating bugs in your threads may prove even more complicated.

OpenMP is also not available on all toolchains (especially older or niche ones) and requires an OS to run (R.I.P. MCUs).

+OpenMP platforms +
+ +If your application needs to run in a multi-CPU environment, in clusters, with dedicated accelerators (GPGPU, FPGA, ASIC, etc.), +then OpenMP will be able to exploit its environment without you having to worry about it. + +We'll no doubt be talking about this API again! + +See you in the next post! + +**[END_OF_REPORT-240301.1027]** + + diff --git a/_posts/2025-01-14-journey-into-3D-game-engine-p1.markdown b/_posts/2025-01-14-journey-into-3D-game-engine-p1.markdown index 75f871b..4753acc 100644 --- a/_posts/2025-01-14-journey-into-3D-game-engine-p1.markdown +++ b/_posts/2025-01-14-journey-into-3D-game-engine-p1.markdown @@ -29,7 +29,10 @@ that don't have hardware acceleration (GPU). I haven't planned any particular guidelines for these blog posts. I'll be writing the various topics progress and the various issues I come across! -![SFML exemples]({{ "/static/images/posts/computer_nerd.jpg" | prepend: site.baseurl }} "Trust me, I'm an engineer!") +
+Trust me, I'm an engineer! +
### __My problem with Unreal, Unity, Godot, etc.__ @@ -63,7 +66,10 @@ engine in general. No pain, no gain. -![SFML exemples]({{ "/static/images/posts/sfml_games.jpg" | prepend: site.baseurl }} "A lot of indies games have been develop using SFML") +
SFML examples
### __Minimum equipment__ diff --git a/css/main.scss b/css/main.scss index aede421..2234859 100644 --- a/css/main.scss +++ b/css/main.scss @@ -229,6 +229,7 @@ a:hover code { color: rgb(115, 139, 170); text-decoration: underline; } .post { padding: 10px 30px; font-size: 16px; + color: rgb(12, 12, 12); line-height: 1.5; } @@ -344,6 +345,28 @@ img + em { // Explicitly sized images (centered in parent container): // ======================================================== +.image720 { + width: auto; + height: auto; + max-width: 720px; + max-height: 720px; + position: relative; + display: table; + margin: auto; + margin-bottom: 8px; +} + +.image640 { + width: auto; + height: auto; + max-width: 640px; + max-height: 640px; + position: relative; + display: table; + margin: auto; + margin-bottom: 8px; +} + .image512 { width: auto; height: auto; @@ -521,6 +544,7 @@ img + em { float: left; margin-bottom: 15px; padding-left: 15px; + color: rgb(38, 38, 38); } .footer-col-1, .footer-col-2 { diff --git a/static/images/posts/flynn_table.png b/static/images/posts/flynn_table.png new file mode 100644 index 0000000..2400a09 Binary files /dev/null and b/static/images/posts/flynn_table.png differ diff --git a/static/images/posts/good-news.jpg b/static/images/posts/good-news.jpg deleted file mode 100644 index 47414a9..0000000 Binary files a/static/images/posts/good-news.jpg and /dev/null differ diff --git a/static/images/posts/mmx_registers.png b/static/images/posts/mmx_registers.png new file mode 100644 index 0000000..a0f4225 Binary files /dev/null and b/static/images/posts/mmx_registers.png differ diff --git a/static/images/posts/omp_platforms.png b/static/images/posts/omp_platforms.png new file mode 100644 index 0000000..f50e9b4 Binary files /dev/null and b/static/images/posts/omp_platforms.png differ diff --git a/static/images/posts/simd_concept.gif b/static/images/posts/simd_concept.gif new file mode 100644 index 0000000..20bdf17 Binary files /dev/null and 
b/static/images/posts/simd_concept.gif differ diff --git a/static/images/posts/simd_evol.png b/static/images/posts/simd_evol.png new file mode 100644 index 0000000..807ad89 Binary files /dev/null and b/static/images/posts/simd_evol.png differ diff --git a/static/images/posts/sleep_nerd.jpg b/static/images/posts/sleep_nerd.jpg new file mode 100644 index 0000000..2887455 Binary files /dev/null and b/static/images/posts/sleep_nerd.jpg differ