Publish lost simd post

@ -10,4 +10,222 @@ highlight: true
* Contents
{:toc}

WIP

Processors (CPUs) are a jewel of technology, capable of executing more than a billion operations per second
(OPS)! However, their sequential design means they can only perform one operation at a time (per core)...

How do CPUs manage to perform so many OPS despite this handicap, and what can we, as developers, do about it?
That's what we're going to find out in this blog post.

<div class="image300">
    <img src="{{"/static/images/posts/sleep_nerd.jpg" | prepend: site.baseurl }}"
    alt="Good reading" title="Good reading!" />
</div>

### __Do one thing at a time, but do it quick!__

First of all, let's take a brief look at how a CPU works.

The CPU is the main data processing unit in a computer. Its role is not limited to reading or writing data
from the memory it accesses: it can also perform **binary** operations (AND, OR, XOR, etc.) and **mathematical**
operations (ADD, SUB, MUL, etc.). With successive generations of architectures, CPUs have become increasingly
complete in terms of available operations, faster at executing them, but also more energy-efficient and bigger!

With this simplified architecture of a computing unit executing one operation per clock pulse, we can quickly
deduce that to increase performance, we simply need to increase the number of pulses per second, i.e. the **frequency**.
Indeed, increasing the frequency of the processor was a first step towards improving performance. But physics
dictated that increasing the frequency also raised the temperature of the CPU (*Joule effect*) towards the
critical temperature of the silicon (the chip material), leading to its destruction.

Among the notable advances, the move to **multi-core** in a single chip is one of the most important! The first
commercial multi-core CPU was IBM's "*POWER4*" (<3) in 2001. It was the start of parallelized data processing! And
the start of the resource-sharing hassle... Thanks to this technology, it was possible to increase performance
without raising the CPU frequency. Of course, the extra layer of multi-core management added latency to the chain,
but the gains in computing performance more than made up for it.

The principle of parallelization was not new: it was already commonly used in supercomputers such as the "*Cray X-MP*"
released in 1982, but at the time it was reserved for a lucky few. Today, having a CPU with at least 2 cores has
become commonplace.

But there's also something else that has been salvaged from supercomputers: I'm talking about SIMD, of course!

Before going any further, I must first talk about CPU architecture and the instruction/data processing paradigm,
also known as **Flynn's taxonomy**. CPUs can have different internal structures, and the paths that data takes and
the way it is transformed/computed are described by the architecture of the CPU. The most common in today's PCs is
the **AMD64** (aka. x86_64) architecture, but there are many others, such as the very recent *RISC-V* or even *ARM*,
which has been powering our portable equipment for years.

Returning to SIMD: according to Flynn's taxonomy, CPU architectures can be classified by the answers to 2 questions:

- are one or more **data** processed per calculation cycle?
- are one or more **instructions** processed per calculation cycle?

<div class="image512">
    <img src="{{"/static/images/posts/flynn_table.png" | prepend: site.baseurl }}"
    alt="Flynn's taxonomy" title="Flynn's taxonomy" />
</div>

SISD is the standard mode of operation of a CPU in everyday use. It is the easiest of the 4 to implement, but it is
also the slowest.

A CPU architecture can operate in SISD and switch during operation to SIMD.

### __Single Instruction Multiple Data (SIMD)__

<div class="image640">
    <img src="{{"/static/images/posts/simd_evol.png" | prepend: site.baseurl }}"
    alt="SIMD evolution" title="Evolution of x86_64 SIMD extensions" />
</div>

SIMD (often called SIMD acceleration) is a way of processing data on one or more CPU cores. It generally
takes the form of an extension to the instruction set of the CPU architecture that wishes to use it. On x86_64,
it is known as **MMX, SSE, AVX** and more recently **AMX**. On ARM it is known as **NEON/SVE**. RISC-V keeps it
simple, with a single "**vector extension**".

Let's stay with x86_64 for the explanations. Curious readers will have noticed that this architecture has several
MM0..MMx registers. Well, yes! These are the registers added by the MMX extension! This dates back to 1997, with
the release of Intel's *Pentium*, a few years before the first multi-core processors. SIMD works at the scale of a
single CPU core first and foremost, although it can be applied on a larger scale across multiple cores, but that's
a (likely) next topic! MMX was the first SIMD extension on x86_64, later replaced by SSE and complemented by AVX
from 2008.

<div class="image512">
    <img src="{{"/static/images/posts/simd_concept.gif" | prepend: site.baseurl }}"
    alt="SIMD concept" title="SISD vs. SIMD" />
</div>

How does SIMD work in practice? Well, by way of comparison, data is normally processed by 1 or 2 registers at a time.
These general-purpose registers are now 64 bits long, and whatever the size of the data to be processed, a register
keeps its 64-bit size. So if we do a simple addition between two 8-bit integers, only 16 of the 128 bits held by the
two registers take part in the operation: 112 bits are wasted!

<div class="image300">
    <img src="{{"/static/images/posts/mmx_registers.png" | prepend: site.baseurl }}"
    alt="MMX registers" title="SSE/AVX registers" />
</div>

SIMD allows the use of 128-bit (SSE), 256-bit (AVX) or 512-bit (AVX-512) registers with the data placed consecutively.
For example, with 32-bit floats, 4 can be stored in an SSE register. In this way, it is possible to compute 2x4 floats
in parallel (register against register)! However, the operation will be the same for every element in the register.
Single Instruction, Multiple Data!

### __Play with intrinsics__

*That's all very nice, but doesn't the compiler do it all by itself? What do you need to add to the source code or compiler*
*options to use the SIMD extensions?*

Here we go! To keep things simple, we're going to use an **x86_64** machine with **GCC 12** (Linux or Windows, it doesn't matter).

GCC provides several headers allowing low-level calls that use the MMx/XMM registers and SIMD instructions. There is a
header for each extension, but you can directly include **\<immintrin.h\>** to pull in all of them. These functions are
called "**intrinsics**" because they expose internal CPU features. Please note, however, that at compile time you have
access to all the extensions, even those your target CPU does not implement. When your CPU tries to execute an
instruction from an extension that it **doesn't have**, you will get a **SIGILL** (*Illegal Instruction*) error.

Below is an example of an operation using intrinsics.

{% highlight c %}
// Intrinsics global header
#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h> // rand(), RAND_MAX, EXIT_SUCCESS

float lhs[1024];
__attribute__ ((aligned (16))) float rhs[1024];
float dst[1024] = {0.f};

int main(int argc, char** argv) {
    unsigned int i;

    for (i = 0; i < 1024; ++i) {
        lhs[i] = rand() / (float) RAND_MAX;
        rhs[i] = rand() / (float) RAND_MAX;
    }

    for (i = 0; i < 1024; i += 4) {
        // Load 4 float values from the left and right input arrays into XMM registers
        __m128 reg_mmx_1 = _mm_loadu_ps(lhs + i); // Load from the (possibly unaligned) left array
        __m128 reg_mmx_2 = _mm_load_ps(rhs + i);  // Load from the 16-byte aligned right array

        // Perform a vectorized add on our registers
        __m128 tmp_result = _mm_add_ps(reg_mmx_1, reg_mmx_2);

        // Store the computed result in the output array
        _mm_storeu_ps(dst + i, tmp_result);
    }

    for (i = 0; i < 1024; ++i)
        printf("%f\n", dst[i]);

    return EXIT_SUCCESS;
}
{% endhighlight %}

{% highlight nasm %}
...

        lea     r8, dst[rip]
        lea     rcx, lhs[rip]
        lea     rdx, rhs[rip]
        .p2align 5

.for_loop:
        movaps  xmm0, XMMWORD PTR [rcx+rax]   # _mm_loadu_ps()
        addps   xmm0, XMMWORD PTR [rdx+rax]   # _mm_add_ps()
        movaps  XMMWORD PTR [r8+rax], xmm0    # _mm_storeu_ps()

        add     rax, 16
        cmp     rax, 4096
        jne     .for_loop

...
{% endhighlight %}

To tell the compiler that you want to use these registers, use the variable type `__m128` for a 128-bit register. There
are also `__m256` and `__m512`, but the corresponding 256- or 512-bit intrinsic functions must be used with them.

You will also notice the difference between `load_ps` and `loadu_ps`: the latter allows loading from memory that is
not aligned to 16 bytes, whereas the former requires that alignment. Loading from or storing to unaligned memory with
`_mm_load_ps`/`_mm_store_ps` will cause a **SEGFAULT** (*Segmentation Fault* error). The same variants exist for
`_mm_store`. You can also replace `_ps` with `_pd` to work with **double** instead of **float**.

A list of intrinsic functions for the x86_64 architecture, sorted by extension, is available [at this website](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html).

GCC is capable of adding/replacing SIMD instructions according to the structure of the code it is compiling, aka
**auto-vectorisation**. However, it must be told which instructions it has access to, with the GCC options `-msse`,
`-mavx`, etc., or directly the CPU architecture/model (if known) via `-march=` and `-mtune=`. By default, `march` and
`mtune` are set to `generic`, so the compiler will only use instructions with maximum compatibility (and therefore no
optimisation with extensions).

> If you are compiling for exclusive use on your machine, compile with `-march=native` and `-mtune=native`. GCC will
> automatically detect your target CPU and use the extensions known to be supported by that CPU.

Auto-vectorisation is enabled by default at higher optimisation levels (such as `-O2` or `-O3`), but it can fail for
various reasons, and **without informing the developer**. I recommend keeping auto-vectorisation active, but explicitly
indicating where and how you want to use intrinsics. This way, in the event of failure (for alignment reasons, for
example), the compiler will warn us that it was unable to apply our instructions, and why.

### __A little word about OpenMP__

OpenMP is an all-in-one API in many respects. This multithreading implementation, which has been extended to SIMD
since version 4, allows you to optimise your code with a few well-placed `#pragma`s.

However, it should be noted that every tool comes with its own advantages and disadvantages. Firstly, the optimisations
made by OpenMP will only be visible to you by re-reading the assembly code of your program. As with the
auto-vectorisation mentioned above, it's tricky to rely solely on the work of the compiler. And although there are
profilers that support OpenMP, you may find it even more complicated to investigate bugs in your threads.

OpenMP is also not available on all toolchains (especially older or niche ones) and requires an OS to work (R.I.P. MCUs).

<div class="image512">
    <img src="{{"/static/images/posts/omp_platforms.png" | prepend: site.baseurl }}"
    alt="OpenMP platforms" title="OpenMP supported environments" />
</div>

If your application needs to run in a multi-CPU environment, in clusters, or with dedicated accelerators (GPGPU, FPGA,
ASIC, etc.), then OpenMP will be able to exploit its environment without you having to worry about it.

We'll no doubt be talking about this API again!

See you in the next post!

**[END_OF_REPORT-240301.1027]**

@ -29,7 +29,10 @@ that don't have hardware acceleration (GPU).

I haven't planned any particular guidelines for these blog posts. I'll be writing about the various topics as I
progress and the various issues I come across!

<div class="image512">
    <img src="{{"/static/images/posts/computer_nerd.jpg" | prepend: site.baseurl }}"
    alt="Trust me, I'm an engineer!" title="Trust me, I'm an engineer!" />
</div>

### __My problem with Unreal, Unity, Godot, etc.__

@ -63,7 +66,10 @@ engine in general.

No pain, no gain.

<div class="image640">
    <img src="{{"/static/images/posts/sfml_games.jpg" | prepend: site.baseurl }}"
    alt="SFML examples" title="A lot of indie games have been developed using SFML" />
</div>

### __Minimum equipment__

@ -229,6 +229,7 @@ a:hover code { color: rgb(115, 139, 170); text-decoration: underline; }

.post {
    padding: 10px 30px;
    font-size: 16px;
    color: rgb(12, 12, 12);
    line-height: 1.5;
}

@ -344,6 +345,28 @@ img + em {

// Explicitly sized images (centered in parent container):
// ========================================================

.image720 {
    width: auto;
    height: auto;
    max-width: 720px;
    max-height: 720px;
    position: relative;
    display: table;
    margin: auto;
    margin-bottom: 8px;
}

.image640 {
    width: auto;
    height: auto;
    max-width: 640px;
    max-height: 640px;
    position: relative;
    display: table;
    margin: auto;
    margin-bottom: 8px;
}

.image512 {
    width: auto;
    height: auto;

@ -521,6 +544,7 @@ img + em {

    float: left;
    margin-bottom: 15px;
    padding-left: 15px;
    color: rgb(38, 38, 38);
}

.footer-col-1, .footer-col-2 {

BIN static/images/posts/flynn_table.png (new file) After: 66 KiB, Before: 156 KiB
BIN static/images/posts/mmx_registers.png (new file) After: 13 KiB
BIN static/images/posts/omp_platforms.png (new file) After: 64 KiB
BIN static/images/posts/simd_concept.gif (new file) After: 9.6 KiB
BIN static/images/posts/simd_evol.png (new file) After: 47 KiB
BIN static/images/posts/sleep_nerd.jpg (new file) After: 23 KiB