Wed 11 October 2023 in gnu-linux by Regalis
Building a powerful GNU/Linux workstation with a blazingly fast storage
It's been a while since the launch of the last processors for the AM4 platform. That is exactly why it could be the best moment to build a powerful workstation based on this platform (when it comes to performance and stability vs. price). PCIe 4.0 NVMEs are cheap and have successfully passed the test of time, not to mention DDR4 memory, which is widely available and also relatively cheap.
It so happens that I have some AM4-based equipment left in the lab - it is a great opportunity to start a new series about building the ultimate GNU/Linux workstation from scratch.
Have you ever wondered what really affects the speed of your computer? Is it the CPU? Or the RAM? Not at all... The perceived speed of your system depends mainly on the speed of your storage. I bet you can imagine how sluggish a computer with the latest processor but an old HDD would feel.
I will try to demonstrate how to push the speed of mass storage to its limits using a regular home PC. I can assure you that this option is much cheaper than buying a new computer with PCIe 5.0 and what's even more important - the result is much better!
The goal of this series is to present an approach that will lead to building a stable and powerful workstation capable of achieving enormous storage speeds (~30GB/s and more).
All this using Free (as in freedom) production-ready, server grade technology - GNU/Linux.
Know your hardware
Any such project should begin with familiarizing yourself with the capabilities of the equipment. The most important elements are, of course:
- the CPU,
- the chipset,
- the motherboard.
A lack of understanding of the platform can lead to wrong decisions and may significantly affect the final results.
Our goal is to maximize storage speeds so we should mainly focus on the PCIe bandwidth.
CPU
Ryzen 5000 series processors based on AMD's Zen 3 architecture have up to 24 PCI Express 4.0 lanes. Four of those lanes are reserved as the link to the chipset. Each PCIe 4.0 lane can provide a throughput of about 1.969 GB/s (16 GT/s after 128b/130b encoding, before protocol overhead).
NVME drives use four PCIe lanes, so we can expect maximum throughputs of around 7.8 GB/s (taking the remaining protocol overhead into account, realistically around 7.4 GB/s).
By using sixteen PCI Express 4.0 lanes, we can achieve a storage bandwidth of up to 30 GB/s.
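The back-of-the-envelope arithmetic behind these numbers (16 GT/s raw, 128b/130b encoding, protocol overhead ignored):

```
16 GT/s × 128/130 ÷ 8 bits ≈ 1.969 GB/s   (one PCIe 4.0 lane)
 4 lanes × 1.969 GB/s      ≈ 7.88  GB/s   (one NVME drive, x4 link)
16 lanes × 1.969 GB/s      ≈ 31.5  GB/s   (full x16 slot, four drives)
```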
Does this mean we can choose any Ryzen 5000 series processor? Unfortunately - NO!
Some of them will not meet our requirements at all - in particular all of the so-called APUs (CPUs with integrated graphics). The same goes for some of the lower-end models, e.g. the Ryzen 5100, Ryzen 5500 or Ryzen 5700. The reason is the lack of PCIe 4.0 support - these processors only provide PCIe 3.0 lanes.
Always check technical specifications!
Always check the technical specifications carefully and make sure that the CPU has an appropriate number of PCI Express lanes (of an appropriate generation).
All the below CPUs will be OK for our build:
- AMD Ryzen 5600X,
- AMD Ryzen 5600X3D,
- AMD Ryzen 5700X,
- AMD Ryzen 5800X,
- AMD Ryzen 5800X3D,
- AMD Ryzen 5900X,
- AMD Ryzen 5950X.
Chipset and motherboard
When it comes to the chipset, we basically have two options:
- AMD B550,
- AMD X570.
The X570 chipset uses four PCI Express 4.0 lanes to connect to the CPU, while the B550 connects to the processor via a four-lane PCI Express 3.0 link. What is more, the X570 chipset provides an additional sixteen (multiplexed) PCIe 4.0 lanes, while the B550 provides only ten PCIe 3.0 lanes.
Chipset | CPU connection | Additional PCIe 4.0 lanes | Additional PCIe 3.0 lanes |
---|---|---|---|
X570 | PCIe 4.0 x4 | 16 | 0 |
B550 | PCIe 3.0 x4 | 0 | 10 |
Please remember that regardless of the chipset - you should still have 20 independent lanes straight from your CPU.
Note!
The final connection of the PCIe lanes depends on the motherboard's manufacturer. Read the manual to determine how the manufacturer allocated the additional lanes.
PCIe bifurcation
PCIe bifurcation is a feature that allows the division of the data lanes in a single PCIe slot. For example, a single PCIe x16 slot (the longest one) can be configured as two x8 links or even four independent x4 links.
PCIe bifurcation does not affect overall connection speed. The only purpose of this feature is to allow a larger slot to act like multiple smaller slots.
This is exactly what we need to connect four PCIe 4.0 NVMEs directly into the CPU via the first PCIe x16 slot in our motherboard.
So we have another requirement for the correct selection of equipment: we need support for PCIe bifurcation.
Warning
Always consult your motherboard's manufacturer's user manual and check whether PCIe bifurcation is supported.
In the case of ASUS - here is a list of supported motherboards divided by chipset and bifurcation modes.
My choice of the motherboard
Since I have an ASUS TUF X570-PLUS lying around, I will go with it for this build.
Warning
This article is not sponsored and I am not affiliated with ASUS in any way.
We will use the first PCIe x16 slot (marked as `PCIEX16_1`) configured for bifurcation (mode `x4/x4/x4/x4`) as in the picture below.
Selecting NVME drives
I have decided to go for the Seagate FireCuda 530 1TB (four of them). This is my first time building a new system without Samsung's NVME drives and there are a few specific reasons for this.
First of all, Samsung had very serious problems when introducing their flagship models. Secondly, Seagate provides better tools - for example, they have released freedom-respecting, MPL-licensed, cross-platform utilities for performing various operations (like firmware updates) on SATA, SAS, NVMe and USB storage devices.
Here are some of the most interesting specifications of the drives I chose:
- Sequential Read: 7300 MB/s,
- Sequential Write: 6000 MB/s,
- Random Read: 800,000 IOPS,
- Random Write: 1,000,000 IOPS,
- TBW (Total Bytes Written): 1275 TB.
Should be enough to max out our rig.
Putting things together
Assembling the PCIe card and NVME drives:
Configuring the motherboard (BIOS/UEFI)
In this section we will configure the BIOS/UEFI.
Flashing the new firmware
The first and most important thing to do when working with new equipment is updating the firmware. You should always carefully read the release notes and update the firmware before starting any configuration.
This will save you many incompatibility problems and eliminate potentially serious issues that may even lead to hardware destruction (as in the case of Samsung's NVME drives).
Tuning CPU and RAM
Let's start with the RAM configuration. We must enable the DOCP (Direct Overclocking Profile) profile in order to fully use the potential of our RAM. After installing new RAM (and probably after updating the BIOS/UEFI), the CPU and memory will start with the slowest possible settings.
Go to `AI Tweaker` -> `Ai Overclock Tuner` and select `D.O.C.P`:
The one and only Noctua NH-D15 gives me the headroom to slightly tune my CPU, so I can't resist. 😇 Setting all core clocks to 4.5 GHz is a good starting point for my workloads - it allows me to keep CPU temperatures below 65 degrees while still being able to compile whole operating systems on all cores for hours.
I have changed the following options (suitable only for Ryzen 9 5950X):
- `CPU Core Ratio`: 45.00,
- `VDDCR CPU Voltage`: Manual,
- `VDDCR CPU Voltage Override`: 1.18125.
Note on CPU overclocking
Please note that this step is 100% optional and will not affect the final result when it comes to storage speeds.
Enabling PCIe bifurcation
Go to `Advanced` -> `Onboard Devices Configuration` -> `PCIEX16_1 Bandwidth Bifurcation Configuration` and select `PCIe RAID Mode`.
Please note that this setting does not enable any built-in RAID mode (aka Fake RAID). In my opinion, this option is named incorrectly. What it actually does is split the full x16 slot into four separate x4 links, enabling us to use all four of our NVME drives.
Enabling virtualization support
Go to `Advanced` -> `Advanced CPU Configuration` -> `SVM Mode` and select `Enabled` to enable CPU virtualization.
I am going to use network/GPU passthrough/virtualization so I also need to change the PCI subsystem settings.
I have changed the following options:
- `Above 4G Decoding`: Enabled,
- `SR-IOV Support`: Enabled.
Disabling CSM
We are going to configure this system with full disk encryption (including encryption of the `/boot` partition). This option is only available in UEFI mode, so we can completely disable the Compatibility Support Module (CSM).
CSM is a component of the UEFI firmware that provides legacy BIOS compatibility. It is useful for booting an old operating system or some specific OSes, like those prepared by disk manufacturers to provide a universal way to update firmware.
Go to `Boot` -> `CSM (Compatibility Support Module)` -> `Launch CSM` and select `Disabled`.
Make sure that `Secure Boot` is disabled and `OS Type` is set to `Other OS`.
Final state
After rebooting - you should be able to see all four NVMEs. Also, please verify that you are running at your desired clocks and voltages (for both CPU and RAM):
Booting GNU/Linux - LiveCD
It looks like it's time to boot a LiveCD and run some initial tests.
But before we start, there is one more very important thing
For many years I have been professionally involved in building Free (as in freedom), independent and stable IT infrastructure. Recently I have also been teaching and running courses on advanced use of GNU/Linux systems (in Cracow, Poland).
Over the years, I have encountered data centers that were so badly configured (in terms of storage) that it was beyond my imagination. Seriously. I've seen very expensive servers packed with dozens of SSDs, surrounded by completely pointless hardware - all "designed" to run at less than 10% of the achievable storage speed. It all happened inside companies that work in the IT industry on a daily basis. It really looked as if it had been possible to make an environment several times faster, less complicated (and therefore more stable) for much less money.
Some very important facts and rules, or simply... how not to become a monkey 🐒:
- Before starting any configuration - determine and write down the expected results. It is very important to know what to expect before you start. Without it - how would you know that you haven't screwed something up completely? When you buy four NVME drives, each capable of reading sequential data at ~7 GB/s, the overall expected result is to read files at around 28 GB/s. This requires many things to be done right - configuring the BIOS/UEFI, preparing the drives (setting the appropriate physical sector size), configuring RAID, setting a proper chunk size, configuring the filesystem and so on. You need to verify the results after each step.
- Don't count on others, especially on ready-to-use boxed solutions. When it comes to storage - you can't count on universality, each use case is different.
- Don't believe in myths. There is a belief that "software RAID" is a bad thing. They say that we should instead use a "dedicated hardware solution" such as Synology. But guess what... Under the hood, Synology uses Debian GNU/Linux and configures standard in-kernel, software RAID. The only reason someone calls this solution a "hardware RAID" is because it is sold in a closed housing. Don't believe the marketers! Check the facts! Buying a "hardware RAID" to get a generally pre-configured, suboptimal device with an extremely outdated Debian GNU/Linux operating system, plus buying network cards and switches and thus complicating the infrastructure - this is the perfect way not only to become a monkey, but also to witness the magic 10% result when it comes to performance.
- The right results don't just happen by accident. Only bad results do that. If you feel uncomfortable while configuring a given element, be prepared to step back for a while and catch up on your knowledge. It will all pay off.
- Taking the time to understand how things work allows you to become independent in creating complete, stand-alone solutions. On the other hand, relying 100% on ready-to-use, proprietary solutions leads to a situation in which people are unable to perform even the simplest task on their own. This is exactly the situation that, on a larger scale, leads to technological enslavement.
- GNU/Linux is everywhere and powers the most critical elements of human infrastructure. It gives freedom, independence and stability, and allows you to move forward without being chained to some big company (or to your credit card). GNU/Linux is the only option that gives you complete, unique control over each and every element.
Booting Arch Linux
I have been using Arch Linux for all kinds of operations related to system installation, troubleshooting, and performance measurement for as long as I can remember. Arch usually has the latest packages and is very well equipped with the basic tools for this kind of work.
First of all, let's verify that the system detects all of our NVME drives:
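A minimal check, assuming the `nvme-cli` package is available on the live system (the exact listing will of course differ):

```
# list every NVME controller and namespace the kernel has detected
nvme list

# or simply look at the block devices exposed by the kernel
ls -l /dev/nvme*n1
```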
So far so good.
Preparing the NVME drives
Once again, before you begin - please update the firmware of your NVME drives. Skipping this step may result in faster disk wear (which is exactly what happened in the case of the previously mentioned Samsung drives).
Updating NVME firmware with GNU/Linux
The process of updating a firmware will be explained in a dedicated article.
LBA
NVME disks have something called a namespace
, which is simply a set of logical
block addresses (LBA) that are available to the host (in our case - to the
Linux kernel). One physical NVME disk can consist of multiple namespaces. A
namespace ID (NSID
) is an identifier used by a controller to provide access to
a namespace.
In Linux, namespaces appear each with a unique identifier in devices;
/dev/nvme0n1
is simply the namespace 1 of controller 0).
Since we have four NVME drives (each with a single controller) with a single namespace on each of them, the kernel has exposed four devices to us:
- `/dev/nvme0n1`,
- `/dev/nvme1n1`,
- `/dev/nvme2n1`,
- `/dev/nvme3n1`.
A very important attribute of an NVME drive is the `LBA format`. This is exactly where the kernel takes its information about the optimal size of the commands to send to the namespace.
Most standard filesystems use `4k` as the block size, which means each IO operation issued to the filesystem will perform a read/write of size `4k`. What happens if the disk uses a block size of `512b`? You guessed it - the drive will have to do eight independent reads in order to return a single `4k` block.
As you can imagine, this can cause huge performance losses.
Unfortunately, most standard drives come preconfigured with the LBA format set to 512b, even though they also support 4k. It's all due to compatibility problems with Windows systems 🐒.
How do we check the supported LBA formats for a given disk? We can use the `nvme` utility:
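A sketch of such a check with `nvme-cli` (the namespace path is the first drive from the list above):

```
# print the namespace data structure in human-readable form and
# keep only the lines describing the supported LBA formats
nvme id-ns -H /dev/nvme0n1 | grep "LBA Format"
```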
As you can see, our Seagate NVME drive supports both the `512b` and `4k` formats, and the one currently in use is `512b`. At this point we have a great opportunity to shoot ourselves in the foot - by leaving the format at `512b` (the default state). Changing the LBA format later without losing data is impossible.
Let's change the LBA format to `4k` for all of our drives:
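With `nvme-cli` this boils down to one `format` command per namespace. The `--lbaf` index must match the 4k entry reported by `id-ns` above (often `1` when `0` is the 512b format, but verify it on your drives), and the operation destroys all data on the namespace:

```
# switch every namespace to the 4k LBA format
# WARNING: this wipes the namespace - double-check the --lbaf index first
for dev in /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1; do
    nvme format "$dev" --lbaf=1
done
```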
Let's verify this step:
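Re-running the same query should now report the 4k format as the one in use:

```
# the 4096-byte LBA format should now be marked as the one "in use"
nvme id-ns -H /dev/nvme0n1 | grep "LBA Format"
```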
Verify the PCIe link status
We need to make sure that all our NVMEs are running at full speed, which means we use PCIe 4.0 instead of PCIe 3.0 and we use all four available PCIe lanes (x4).
We can use `lspci` to obtain information about the PCIe devices. Let's find our NVME drives:
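The controllers show up in the "Non-Volatile memory controller" class, so a simple filter does the job:

```
# list the NVME controllers together with their PCI addresses
lspci | grep -i "non-volatile memory controller"
```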
In order to show more details, we can use the `-vv` flag of the `lspci` utility. We can filter the device(s) with the `-s` option; for example, to show details about our first NVME drive we can use `lspci -s 0a:00.0 -vv`.
Since we are interested in the PCIe link details, we can use `grep` to filter them:
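Something along these lines (replace `0a:00.0` with the address of your drive and run it as root so the capability registers are readable):

```
# LnkCap shows what the device can do, LnkSta shows the negotiated link
lspci -s 0a:00.0 -vv | grep -E "LnkCap:|LnkSta:"
```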
That is the information about the capabilities of our endpoint device. As you can see, our NVME drive supports speeds from 2.5 GT/s (PCIe 1.0) to 16 GT/s (PCIe 4.0) and is equipped with four PCIe lanes.
Another important piece of information is where the device is connected. We can use `lspci -tv` to display a tree-like diagram containing all buses, bridges, devices and the connections between them.
In that diagram we can see a single `Root Complex` and a `Host Bridge` (at `03.0`); our first NVME drive is connected to the first port of that bridge (`03.1`), and so on.
Let's find out more information about the `03.1` port:
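Assuming the bridge sits on bus `00`, the full address of that port is `00:03.1`:

```
# inspect the upstream port of the first NVME drive (run as root)
lspci -s 00:03.1 -vv | grep -E "LnkCap:|LnkSta:"
```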
As you can see in `LnkSta`, the port is configured at `16GT/s` (PCIe 4.0) and has four lanes (`x4`).
An example of bad configuration
If you happen to notice that your disk reports a "downgraded" link status (for example `8GT/s` instead of `16GT/s` in `LnkSta`), this is most likely because it is connected to a port running at a lower speed (for example a PCIe 3.0 port).
In this case - the speed of the NVME drive will be cut in half. I have seen too many cases like this.
There may be a lot of reasons for this situation:
- incorrect BIOS/UEFI configuration,
- you connected the device to the wrong port,
- your CPU does not support PCIe 4.0,
- the motherboard's manufacturer designed the motherboard in this way,
- you let the 🐒 configure your hardware.
Initial performance tests
Let's install the powerful tool for measuring performance of storage devices -
fio
(Flexible I/O Tester ).
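On the Arch live system this is a single `pacman` call (network access assumed):

```
# refresh the package databases and install fio
pacman -Sy fio
```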
Let's measure the performance of a single NVME drive:
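A sketch of such a job - a direct, random 32k read from the raw device (the `io_uring` engine and queue depth are illustrative, tune them to your setup):

```
fio --name=nvme0-randread \
    --filename=/dev/nvme0n1 \
    --readonly \
    --direct=1 \
    --rw=randread \
    --bs=32k \
    --ioengine=io_uring \
    --iodepth=64 \
    --numjobs=1 \
    --time_based --runtime=30 \
    --group_reporting
```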
Understanding fio's options and output
Fio will be explained in detail in the following articles in this series.
In my case, a single process could perform direct (unbuffered) reads of randomly selected 32-kilobyte blocks from a single NVME drive at a constant, stable 7491 MB/s, issuing 229k operations per second.
Here are my results for all common block sizes (running a single process on a single core):
block size | IOPS | bandwidth |
---|---|---|
4k | 420k | 1721 MB/s |
8k | 341k | 2794 MB/s |
16k | 305k | 4996 MB/s |
32k | 229k | 7486 MB/s |
64k | 115k | 7509 MB/s |
128k | 57.3k | 7515 MB/s |
256k | 28.7k | 7514 MB/s |
512k | 14.3k | 7515 MB/s |
1m | 7.1k | 7514 MB/s |
To measure the maximum number of I/O operations the drive can handle, we need to run multiple instances of these tests in parallel. This should be easy, since `fio` has a dedicated option for this called `--numjobs`.
Let's check how many random `4k` blocks we can read from 16 processes at once:
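The same job shape works, just with a `4k` block size and sixteen parallel workers (again an illustrative sketch, not the exact invocation):

```
fio --name=nvme0-randread-4k \
    --filename=/dev/nvme0n1 \
    --readonly --direct=1 \
    --rw=randread --bs=4k \
    --ioengine=io_uring --iodepth=64 \
    --numjobs=16 \
    --time_based --runtime=30 \
    --group_reporting
```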
But what about reading from all of our NVME drives at once? We can pass more devices to fio's `--filename` option and run multiple jobs in parallel.
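`--filename` accepts multiple devices separated with colons; a rough sketch of such a job:

```
fio --name=all-nvme-randread \
    --filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1 \
    --readonly --direct=1 \
    --rw=randread --bs=32k \
    --ioengine=io_uring --iodepth=64 \
    --numjobs=16 \
    --time_based --runtime=30 \
    --group_reporting
```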
Looks great, we are hitting 29.9 GB/s from all NVMEs 🚀🚀🚀.
The output of the `iostat` program, captured while `fio` was running, showed all four drives reading in parallel at full speed.
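If you want to watch this live, `iostat` from the sysstat package can print per-device throughput once a second:

```
# extended statistics, values in megabytes, refreshed every second
iostat -xm 1 nvme0n1 nvme1n1 nvme2n1 nvme3n1
```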
Therefore, we have proof that our configuration allows parallel reading from all disks, stably reaching approximately 30 GB/s of bandwidth.
This is in line with our expectations - so far we have done everything right.
Here are the test results (all were run with the `--numjobs=16` option):
block size | IOPS | bandwidth |
---|---|---|
4k | 5556k | 22.8 GB/s |
32k | 914k | 30.0 GB/s |
256k | 115k | 30.0 GB/s |
512k | 57.2k | 30.0 GB/s |
1m | 28.5k | 30.0 GB/s |
I would like to make it clear that we will achieve exactly the same results with the Ryzen 5600X processor.
If an ordinary PC is able to achieve a bandwidth of 30 GB/s and more than 5 million IOPS - imagine what a professional server can do... And on the other hand - how much of a server's potential can be wasted if a series of mistakes is made during configuration. The "worst" is yet to come, because we still have to configure RAID and all the filesystems properly in order to maintain these speeds. Several layers are waiting for us.
Summary
This is the end of the first part. In the next articles I will cover several elements:
- useful utilities to measure performance (including `fio`, `iostat`, `dstat`, etc.),
- partitioning,
- preparing RAID, testing performance of multiple RAID configurations,
- preparing filesystems,
- performing a manual Debian GNU/Linux installation,
- performing a manual Debian GNU/Linux installation with full disk encryption for multiple users with the ability of unlocking volumes with multiple FIDO USB keys,
- compiling the whole system from scratch - performing a manual Gentoo GNU/Linux installation for maximum performance.