diff --git a/_posts/2024-03-01-cpu-vectorized-acceleration.markdown b/_posts/2024-03-01-cpu-vectorized-acceleration.markdown
index 6917a86..0963f8a 100644
--- a/_posts/2024-03-01-cpu-vectorized-acceleration.markdown
+++ b/_posts/2024-03-01-cpu-vectorized-acceleration.markdown
@@ -10,4 +10,222 @@ highlight: true
* Contents
{:toc}
-WIP
+Processors (CPUs) are a jewel of technology, capable of executing more than a billion operations per second
+(OPS)! However, their sequential design means that each core can only perform one operation at a time...
+
+How do CPUs manage to perform so many OPS despite this handicap and what can we, developers, do about it? That's
+what we're going to try to find out in this blog post.
+
+
+

+
+
+### __Do one thing at a time, but do it fast!__
+
+First of all, let's take a brief look at how a CPU works.
+
+The CPU is the main data processing unit in a computer. Its role is not limited to reading and writing data
+in the memory it accesses: it can also perform **binary** operations (AND, OR, XOR, etc.) and **mathematical**
+operations (ADD, SUB, MUL, etc.). With successive generations of architectures, CPUs have become increasingly
+complete in terms of available operations, faster at executing them, more energy-efficient, and ever denser in transistors!
+
+With this simplified view of a computing unit executing one operation per clock pulse, we can quickly
+deduce that to increase performance, we simply need to increase the number of pulses per second, i.e. the **frequency**.
+Indeed, increasing the processor frequency was the first lever for improving performance. But physics set
+its limit: raising the frequency raises the temperature of the CPU (*Joule effect*), up to the
+critical temperature of the silicon (the chip material), leading to its destruction.
+
+Among the notable advances, the move to **multi-core** in a single chip is one of the most important! The first
+commercial one was IBM's "*POWER4*" (<3) in 2001. It was the start of parallelized data processing! And
+the start of the resource-sharing hassle... Thanks to this technology, it was possible to increase performance
+without raising the CPU frequency. Of course, the extra layer of multi-core management added latency to the chain,
+but the gains in computing performance more than made up for it.
+
+The principle of parallelization was not new: it was already commonly used in supercomputers such as the "*Cray X-MP*"
+released in 1982, but at the time it was reserved for a lucky few. Today, having a CPU with at least 2 cores has
+become a common thing.
+
+But there's also something else that has been salvaged from the supercomputers - I'm talking about SIMD, of course!
+
+Before going any further, I must first talk about CPU architecture and the instruction/data processing paradigm,
+also known as **Flynn's taxonomy**. CPUs can have different internal structures, and the paths that data takes and
+the way it is transformed/computed are described by the architecture of the CPU. The most common in today's PCs is
+the **AMD64** (aka. x86_64) architecture, but there are many others, such as the very recent *RISC-V* or even *ARM*,
+which has been powering our portable devices for years.
+
+Returning to SIMD: according to Flynn's taxonomy, CPU architectures can be classified by the answers to 2 questions:
+
+- are one or more **data** elements processed per calculation cycle?
+- are one or more **instructions** processed per calculation cycle?
+
+Crossing the two answers gives 4 categories: SISD, SIMD, MISD and MIMD.
+
+
+

+
+
+SISD is the standard mode of operation of a CPU in everyday use. It is the easiest of the 4 to implement, but it is also
+the slowest.
+
+A CPU can operate in SISD mode and switch to SIMD during execution.
+
+### __Single Instruction Multiple Data (SIMD)__
+
+
+

+
+
+SIMD (often called SIMD acceleration) is a way of processing data on one or more CPU cores. It generally
+takes the form of an extension to the instruction set of the CPU architecture that adopts it. On x86_64,
+it is known as **MMX, SSE, AVX** and more recently **AMX**. On ARM it is known as **NEON/SVE**. RISC-V keeps
+it simple with a plain "**vector extension**".
+
+Let's stay with x86_64 for the explanations. Curious readers will have noticed that this architecture has several
+MM0..MMx registers. Well, yes! These are the registers added by the MMX extension! They date back to
+1997, with the release of Intel's *Pentium MMX*, a few years before the first multi-core processors. SIMD works at
+the scale of a single CPU core first and foremost, although it can be applied on a larger scale across multiple cores,
+but that's a (likely) future topic! MMX was the first SIMD extension on x86_64, later replaced by SSE and extended
+by AVX from 2008.
+
+
+

+
+
+How does SIMD work in practice? By way of comparison, data is normally processed through 1 or 2 general-purpose
+registers at a time. On x86_64 these registers are 64 bits wide, and whatever the size of the data to be processed,
+the register keeps its 64-bit size. So if we do a simple addition between two 8-bit integers, 112 of the 128 register
+bits take no part in the operation!
+
+
+

+
+
+SIMD allows the use of 128-bit (SSE), 256-bit (AVX) or 512-bit (AVX-512) registers with the data packed consecutively.
+For example, with 32-bit floats, 4 of them fit in an SSE register. This makes it possible to process 4 pairs of floats
+in parallel (register against register)! However, the operation is the same for every element in the register.
+Single Instruction, Multiple Data!
+
+### __Play with intrinsics__
+
+*That's all very nice, but doesn't the compiler do it all by itself? What do you need to add to the source code or compiler*
+*options to use the SIMD extensions?*
+
+Here we go! To keep things simple, we're going to use a **x86_64** machine with **GCC-12** (Linux or Windows, it doesn't matter).
+
+GCC provides several headers exposing low-level calls to use the MMx registers and SIMD instructions. There is a header
+for each extension, but you can directly include **immintrin.h** to pull in all of them. These functions are called
+"**intrinsics**" because they map to internal CPU features. Please note, however, that all the extensions are available
+at compile time even if your target CPU does not support some or all of them. When your CPU tries to execute an
+instruction from an extension that it **doesn't have**, you will get a **SIGILL** (*Illegal Instruction* error).
+
+Below is an example of an operation using intrinsics.
+
+{% highlight c %}
+// Intrinsics global header
+#include <immintrin.h>
+// rand(), RAND_MAX, EXIT_SUCCESS and printf()
+#include <stdlib.h>
+#include <stdio.h>
+
+float lhs[1024];
+__attribute__ ((aligned (16))) float rhs[1024];
+float dst[1024] = {0.f};
+
+int main(int argc, char** argv) {
+ unsigned int i;
+
+ for (i = 0; i < 1024; ++i) {
+ lhs[i] = rand() / (float) RAND_MAX;
+ rhs[i] = rand() / (float) RAND_MAX;
+ }
+
+ for (i = 0; i < 1024; i += 4) {
+ // Load into MMx register 4 float values from input left and right memory
+ __m128 reg_mmx_1 = _mm_loadu_ps(lhs + i); // Load from un-aligned memory left array
+ __m128 reg_mmx_2 = _mm_load_ps(rhs + i); // Load from 16 bytes aligned memory right array
+
+ // Do vectorized add operation on our registers
+ __m128 tmp_result = _mm_add_ps(reg_mmx_1, reg_mmx_2);
+
+ // Store the computed result in output array
+ _mm_storeu_ps(dst + i, tmp_result);
+ }
+
+ for (i = 0; i < 1024; ++i)
+ printf("%f\n", dst[i]);
+
+ return EXIT_SUCCESS;
+}
+{% endhighlight %}
+
+{% highlight nasm %}
+...
+
+ lea r8, dst[rip]
+ lea rcx, lhs[rip]
+ lea rdx, rhs[rip]
+ .p2align 5
+
+.for_loop:
+ movaps xmm0, XMMWORD PTR [rcx+rax] # _mm_loadu_ps()
+ addps xmm0, XMMWORD PTR [rdx+rax] # _mm_add_ps()
+ movaps XMMWORD PTR [r8+rax], xmm0 # _mm_storeu_ps()
+
+ add rax, 16
+ cmp rax, 4096
+ jne .for_loop
+
+...
+{% endhighlight %}
+
+To tell the compiler that you want to use MMx registers, use the variable type `__m128` for a 128-bit register. There are
+also `__m256` and `__m512`, but the matching 256-bit or 512-bit intrinsic functions must be used with them.
+
+You will also notice the difference between `_mm_load_ps` and `_mm_loadu_ps`: the latter allows loading from memory that is
+not aligned to 16 bytes, whereas the former requires that alignment. Loading from or storing to unaligned memory with the
+aligned `load/store_ps` variants will cause a **SEGFAULT** (*Segmentation Fault* error). The same aligned/unaligned variants
+exist for `_mm_store`. You can also replace `_ps` with `_pd` to work with **double** instead of **float**.
+
+A list of intrinsics functions for architecture x86_64, sorted by extension, is available [at this website](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html).
+
+GCC is capable of adding/replacing SIMD instructions according to the structure of the code it is compiling, aka.
+**auto-vectorisation**. However, it must be told which instruction sets it has access to, either with the gcc options
+`-msse`, `-mavx`, etc., or directly with the CPU architecture/model (if known) via `-march=` and `-mtune=`. By default,
+`-mtune` is set to `generic` and `-march` to the most compatible baseline, so the compiler only emits instructions with
+maximum compatibility (and therefore no optimisation with extensions).
+
+> If you are compiling for exclusive use on your machine, compile with `-march=native` and `-mtune=native`. GCC will automatically
+> detect your target CPU and use the known extensions for that CPU.
+
+Auto-vectorisation is enabled by default at higher optimisation levels (such as `-O2` or `-O3`), but it can fail for various
+reasons, and **without informing the developer** (unless you ask, e.g. with gcc's `-fopt-info-vec-missed`). I recommend
+keeping auto-vectorisation active, but explicitly indicating where and how you want to use intrinsics. This way, in the
+event of failure (for alignment reasons, for example), the compiler can warn you that it was unable to apply your
+instructions, and for which reason.
+
+### __A little word about OpenMP__
+
+OpenMP is an all-in-one API in many respects. This multithreading implementation, extended to SIMD since version 4.0,
+lets you optimise your code with a few well-placed `#pragma`s.
+
+However, each tool comes with its own advantages and disadvantages. Firstly, the optimisations made by
+OpenMP will only be visible to you by re-reading the assembly code of your program. As with the auto-vectorisation
+mentioned above, it's risky to rely solely on the work of the compiler. And although there are profilers that support
+OpenMP, you may find it even more complicated to investigate bugs in your threads.
+
+OpenMP is also not available on all toolchains (especially older or niche ones) and requires an OS to work (R.I.P MCUs).
+
+
+

+
+
+If your application needs to run in a multi-CPU environment, in clusters, with dedicated accelerators (GPGPU, FPGA, ASIC, etc.),
+then OpenMP will be able to exploit its environment without you having to worry about it.
+
+We'll no doubt be talking about this API again!
+
+See you in the next post!
+
+**[END_OF_REPORT-240301.1027]**
+
+
diff --git a/_posts/2025-01-14-journey-into-3D-game-engine-p1.markdown b/_posts/2025-01-14-journey-into-3D-game-engine-p1.markdown
index 75f871b..4753acc 100644
--- a/_posts/2025-01-14-journey-into-3D-game-engine-p1.markdown
+++ b/_posts/2025-01-14-journey-into-3D-game-engine-p1.markdown
@@ -29,7 +29,10 @@ that don't have hardware acceleration (GPU).
I haven't planned any particular guidelines for these blog posts. I'll be writing the various topics
progress and the various issues I come across!
-
+
+

+
### __My problem with Unreal, Unity, Godot, etc.__
@@ -63,7 +66,10 @@ engine in general.
No pain, no gain.
-
+
+

+
### __Minimum equipment__
diff --git a/css/main.scss b/css/main.scss
index aede421..2234859 100644
--- a/css/main.scss
+++ b/css/main.scss
@@ -229,6 +229,7 @@ a:hover code { color: rgb(115, 139, 170); text-decoration: underline; }
.post {
padding: 10px 30px;
font-size: 16px;
+ color: rgb(12, 12, 12);
line-height: 1.5;
}
@@ -344,6 +345,28 @@ img + em {
// Explicitly sized images (centered in parent container):
// ========================================================
+.image720 {
+ width: auto;
+ height: auto;
+ max-width: 720px;
+ max-height: 720px;
+ position: relative;
+ display: table;
+ margin: auto;
+ margin-bottom: 8px;
+}
+
+.image640 {
+ width: auto;
+ height: auto;
+ max-width: 640px;
+ max-height: 640px;
+ position: relative;
+ display: table;
+ margin: auto;
+ margin-bottom: 8px;
+}
+
.image512 {
width: auto;
height: auto;
@@ -521,6 +544,7 @@ img + em {
float: left;
margin-bottom: 15px;
padding-left: 15px;
+ color: rgb(38, 38, 38);
}
.footer-col-1, .footer-col-2 {
diff --git a/static/images/posts/flynn_table.png b/static/images/posts/flynn_table.png
new file mode 100644
index 0000000..2400a09
Binary files /dev/null and b/static/images/posts/flynn_table.png differ
diff --git a/static/images/posts/good-news.jpg b/static/images/posts/good-news.jpg
deleted file mode 100644
index 47414a9..0000000
Binary files a/static/images/posts/good-news.jpg and /dev/null differ
diff --git a/static/images/posts/mmx_registers.png b/static/images/posts/mmx_registers.png
new file mode 100644
index 0000000..a0f4225
Binary files /dev/null and b/static/images/posts/mmx_registers.png differ
diff --git a/static/images/posts/omp_platforms.png b/static/images/posts/omp_platforms.png
new file mode 100644
index 0000000..f50e9b4
Binary files /dev/null and b/static/images/posts/omp_platforms.png differ
diff --git a/static/images/posts/simd_concept.gif b/static/images/posts/simd_concept.gif
new file mode 100644
index 0000000..20bdf17
Binary files /dev/null and b/static/images/posts/simd_concept.gif differ
diff --git a/static/images/posts/simd_evol.png b/static/images/posts/simd_evol.png
new file mode 100644
index 0000000..807ad89
Binary files /dev/null and b/static/images/posts/simd_evol.png differ
diff --git a/static/images/posts/sleep_nerd.jpg b/static/images/posts/sleep_nerd.jpg
new file mode 100644
index 0000000..2887455
Binary files /dev/null and b/static/images/posts/sleep_nerd.jpg differ