Hakutaku: SMP on x86-64

Jul 30, 2020
Hakutaku Operating Systems OSDev X86 SMP multicore

This is one post in my blog series about developing an operating system for the x86-64 platform in Rust. I aim to cover a small topic in each post, along with some code samples to get you started if you are interested in doing such a project yourself. The complete project is on GitHub.

Multi-core

Multi-core support often sounds like a scary topic to hobbyist OS developers. Today we will tackle it together in our Rust OS, Hakutaku.

Background

X86-64 SMP Start-Up General Process

On x86-64, the system firmware (BIOS) selects one processor as the Bootstrap Processor (BSP); the rest become Application Processors (APs). All the APs are held in a “waiting” state until they receive an INIT/SIPI (Start-up IPI) sequence. Unless you do something about it, all of your code will run on the BSP alone. To enable the other processors, we need to do a few things:

Locate and identify all the other processors on the system.
Send INIT and SIPI Inter-Processor Interrupts (IPI) to each enabled processor.
Set up a temporary GDT / IDT for the APs to bootstrap into Long Mode.
Add the APs to the scheduler.

APIC / LAPIC

APIC, or the Advanced Programmable Interrupt Controller, is the interrupt controller Intel introduced as part of the Intel Multi-Processor (MP) specification. It is the mechanism modern processors use to manage interrupts.

The APIC system consists of two main components: the Local APIC (LAPIC) and the IO APIC. Each logical processor has its own LAPIC, while the IO APIC is shared by all processors (small systems usually have a single one, although a system may contain several). On boot, the IO APIC should be configured to pass/emulate the Legacy (8259) PIC’s interrupts. For now we will only be configuring the LAPIC. The IO APIC is relatively complicated, so we will leave it for a later post.

The Local APIC

The Local APIC is relatively simple. For our purposes its registers fall into three groups: the Interrupt Command Register (ICR), the APIC Local Timer, and the Local Vector Table (LVT). Information on the register layout can be found in the Intel Manual (Vol. 3, Section 10.4.1, Table 10-1).

Interrupt Command Register (ICR)

This is probably the most important register for waking up the other cores, because it is the register you write to in order to send IPIs. The ICR controls what kind of interrupt is sent, which vector it is delivered to, and which processor receives it.
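To make the fields concrete, here is a minimal sketch of how the two halves of this register (the Interrupt Command Register in Intel's terminology) could be encoded for an IPI aimed at a single LAPIC in xAPIC mode. The names `DeliveryMode` and `encode_ipi` are made up for illustration and are not taken from Hakutaku; the register offsets and bit positions are the ones documented for the xAPIC.

```rust
/// Delivery modes needed during AP start-up (ICR bits 8..=10).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum DeliveryMode {
    Init = 0b101,
    StartUp = 0b110,
}

/// xAPIC offsets of the two ICR halves, relative to the LAPIC MMIO base.
pub const ICR_LOW: usize = 0x300;
pub const ICR_HIGH: usize = 0x310;

/// Level bit (bit 14): "assert".
const LEVEL_ASSERT: u32 = 1 << 14;

/// Returns `(high, low)`: the dwords to store at ICR_HIGH and ICR_LOW.
/// Physical destination mode, no shorthand. `encode_ipi` is a
/// hypothetical helper, not an existing API.
pub fn encode_ipi(dest_apic_id: u8, mode: DeliveryMode, vector: u8) -> (u32, u32) {
    // The destination LAPIC ID lives in bits 24..=31 of the high dword.
    let high = (dest_apic_id as u32) << 24;
    // Vector in bits 0..=7, delivery mode in bits 8..=10.
    let low = LEVEL_ASSERT | ((mode as u32) << 8) | vector as u32;
    (high, low)
}
```

Write the high dword first: storing to the low dword is what actually sends the IPI.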

Local APIC Timer

The Local APIC Timer, as the name suggests, is local to each logical processor and can operate in one of three modes: One-Shot, Periodic, and TSC-Deadline. Since this timer will generate the periodic interrupt for our scheduler on each core, we will configure it in periodic mode.
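As a rough illustration, the timer's mode is selected through its LVT entry. The sketch below only computes the value to write; `TimerMode` and `lvt_timer_value` are illustrative names, and the vector you pass in is whatever your scheduler interrupt uses.

```rust
/// Timer modes, encoded in bits 17..=18 of the LVT Timer register.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum TimerMode {
    OneShot = 0b00,
    Periodic = 0b01,
    TscDeadline = 0b10,
}

/// xAPIC register offsets relative to the LAPIC MMIO base.
pub const LVT_TIMER: usize = 0x320;
pub const TIMER_INITIAL_COUNT: usize = 0x380;
pub const TIMER_DIVIDE_CONFIG: usize = 0x3E0;

/// Builds the value for the LVT Timer register (hypothetical helper).
pub fn lvt_timer_value(vector: u8, mode: TimerMode, masked: bool) -> u32 {
    (vector as u32)              // bits 0..=7: interrupt vector to raise
        | ((masked as u32) << 16) // bit 16: 1 = interrupt masked
        | ((mode as u32) << 17)   // bits 17..=18: timer mode
}
```

In periodic mode the timer starts counting once a non-zero value is written to the initial count register, and reloads that value each time it reaches zero.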

Local Vector Table

Unlike the Interrupt Descriptor Table (IDT), which determines what action to take for each interrupt vector, the LVT controls which vector is raised when a given local event is triggered. In our case we need to configure the Local Timer entry of the LVT, along with the Spurious Interrupt Vector register that sits next to it.
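The spurious interrupt vector lives in its own register rather than in an LVT entry, and that register also carries the LAPIC's software-enable bit. A small sketch of the value to write, with `svr_value` being a name invented for this example:

```rust
/// Offset of the Spurious Interrupt Vector Register from the LAPIC base.
pub const SPURIOUS_VECTOR_REGISTER: usize = 0xF0;

/// Bit 8 of the register: APIC software enable.
pub const APIC_SOFTWARE_ENABLE: u32 = 1 << 8;

/// Value that enables the LAPIC and routes spurious interrupts to `vector`
/// (hypothetical helper).
pub fn svr_value(vector: u8) -> u32 {
    APIC_SOFTWARE_ENABLE | vector as u32
}
```

A common choice is vector 0xFF, which gives the value 0x1FF.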

Advanced Configuration and Power Interface (ACPI)

I am sure many of you have heard or seen this term before. ACPI provides a set of tables that describe the configuration of the hardware our OS is running on. They include a lot of information about I/O mapping, IRQ routing, processor topology, and much more.

Although some of the tables in ACPI are quite hard to parse (e.g. the DSDT and SSDT), the MADT we need right now is rather easy.

Multiple APIC Description Table (MADT)

The MADT is a simple table whose entries describe every LAPIC and, by association, the logical processor it belongs to. Each entry describes the state of the processor (in a flags field) as well as its processor ID and LAPIC ID.
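Here is a minimal sketch of walking those entries, assuming the whole MADT has already been copied or mapped into a byte slice. `LocalApic` and `local_apics` are illustrative names; the offsets follow the ACPI layout (a 36-byte table header, 8 bytes of MADT-specific fields, then variable-length entries). It returns a `Vec` for brevity, so a kernel without an allocator would need a fixed-size array instead.

```rust
/// One "Processor Local APIC" entry (MADT entry type 0).
#[derive(Debug, PartialEq, Eq)]
pub struct LocalApic {
    pub processor_id: u8,
    pub apic_id: u8,
    pub enabled: bool,
}

/// Entries start after the 36-byte SDT header, the 4-byte LAPIC address
/// and the 4-byte MADT flags field.
const MADT_ENTRIES_OFFSET: usize = 44;
const ENTRY_LOCAL_APIC: u8 = 0;

/// Collects every Local APIC entry found in `madt` (hypothetical helper).
pub fn local_apics(madt: &[u8]) -> Vec<LocalApic> {
    let mut found = Vec::new();
    let mut offset = MADT_ENTRIES_OFFSET;
    // Each entry begins with a type byte and a length byte.
    while offset + 2 <= madt.len() {
        let kind = madt[offset];
        let len = madt[offset + 1] as usize;
        if len < 2 || offset + len > madt.len() {
            break; // malformed or truncated entry, stop walking
        }
        if kind == ENTRY_LOCAL_APIC && len >= 8 {
            let flags = u32::from_le_bytes([
                madt[offset + 4],
                madt[offset + 5],
                madt[offset + 6],
                madt[offset + 7],
            ]);
            found.push(LocalApic {
                processor_id: madt[offset + 2],
                apic_id: madt[offset + 3],
                enabled: flags & 1 != 0, // flags bit 0: processor enabled
            });
        }
        offset += len; // unknown entry types are simply skipped
    }
    found
}
```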

Implementation

To add multi-core capability to Hakutaku, we first need to locate all the other logical processors. There are multiple ways of doing this; at the time of writing, the most reliable and widely used method is to locate and parse the ACPI tables.

ACPI Tables

There are two main ways to locate the ACPI tables, depending on whether you are booting from BIOS or EFI. On EFI, the firmware hands you the pointer through its configuration table. In traditional BIOS boot mode, the ACPI table pointer is placed in a certain section of system memory (the BIOS / BIOS shadow area). You can simply search for the 8-byte string "RSD PTR " (note the trailing space; it is not null-terminated). When you find the pointer, verify its checksum and read its revision number to determine whether you are looking at ACPI 1.0 (revision 0) or a later version of the tables.
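The search itself can be sketched as a scan over a byte slice. This assumes you have already mapped the candidate memory region and exposed it as `region`; `find_rsdp` is an illustrative name. The structure sits on a 16-byte boundary, and the first 20 bytes (the ACPI 1.0 part) must sum to zero modulo 256.

```rust
const RSDP_SIGNATURE: &[u8; 8] = b"RSD PTR ";
/// Size of the ACPI 1.0 part of the structure, covered by the checksum.
const RSDP_V1_LEN: usize = 20;
/// Byte offset of the revision field inside the structure.
const RSDP_REVISION_OFFSET: usize = 15;

/// Scans `region` on 16-byte boundaries and returns the offset and the
/// revision of the first valid root pointer (hypothetical helper).
pub fn find_rsdp(region: &[u8]) -> Option<(usize, u8)> {
    let mut offset = 0;
    while offset + RSDP_V1_LEN <= region.len() {
        let candidate = &region[offset..offset + RSDP_V1_LEN];
        // All 20 bytes, checksum byte included, must add up to 0 mod 256.
        let sum = candidate.iter().fold(0u8, |acc, b| acc.wrapping_add(*b));
        if candidate.starts_with(RSDP_SIGNATURE) && sum == 0 {
            return Some((offset, candidate[RSDP_REVISION_OFFSET]));
        }
        offset += 16;
    }
    None
}
```

A revision of 0 means the ACPI 1.0 layout; anything higher means the extended fields that follow the first 20 bytes are also present.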

Following the RSD PTR, you should be able to locate the remaining tables and find the previously mentioned MADT, which lists all the available processors. Before you go on to wake those processors, always check that each one is marked as enabled in its MADT entry: a disabled processor will not be sitting in the wait-for-SIPI state.

Waking Up

Now comes the interesting part: actually waking up the sleeping cores. I recommend starting with only one AP, which makes it easier to spot problems in the process.

NOTE: When you wake up an application processor, it boots into Real Mode and has access to only the first 1 MiB of physical memory. Plan for that and place some code there for the APs to run initially. Here I just have a halt loop and an output to the POST code hex display (often on I/O port 0x80).

; Intel syntax, NASM

align 4096
bits 16

ap_boot:
    mov al, 0x45            ; arbitrary marker value
    out 0x80, al            ; write it to the POST code port
.loop:
    hlt
    jmp ap_boot.loop

Note that the entry point has to be aligned to 4 KiB and located in the first MiB of memory.

According to the Intel manual, the protocol for waking up an AP consists of the following steps:

Send an INIT.
Wait for 10 ms.
Send the first SIPI.
Wait for 200 µs.
If the processor is still not running, send a second SIPI.
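The steps above can be sketched as a small routine. To keep the example independent of any real hardware access, the IPIs and delays go through a trait; `ApStarter`, its methods and `wake_ap` are hypothetical names made up for this sketch, and how `is_running` finds out that the AP has started (for example a flag the AP sets in shared memory) is left to the implementation.

```rust
/// Hardware operations the wake-up sequence needs (hypothetical trait).
pub trait ApStarter {
    fn send_init(&mut self, apic_id: u8);
    fn send_sipi(&mut self, apic_id: u8, vector: u8);
    fn delay_us(&mut self, micros: u64);
    /// Should report whether the AP has signalled that it is alive.
    fn is_running(&mut self, apic_id: u8) -> bool;
}

/// INIT, 10 ms, SIPI, 200 µs, and a second SIPI only if needed.
/// Returns true if the AP came up.
pub fn wake_ap<S: ApStarter>(hw: &mut S, apic_id: u8, vector: u8) -> bool {
    hw.send_init(apic_id);
    hw.delay_us(10_000);

    hw.send_sipi(apic_id, vector);
    hw.delay_us(200);
    if hw.is_running(apic_id) {
        return true;
    }

    // The first SIPI was not enough: try once more.
    hw.send_sipi(apic_id, vector);
    hw.delay_us(200);
    hw.is_running(apic_id)
}
```

Keeping the sequence behind a trait also makes it possible to exercise the ordering in an ordinary unit test before trying it on real hardware.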

The SIPI Interrupt

The SIPI is a very special interrupt: it kicks the processor out of its default “idling” state. Since we need to tell the processor where to begin execution, the vector field of this interrupt is interpreted differently.

Instead of specifying an interrupt vector, the value in the vector field selects a 4 KiB frame, counted from physical address 0, at which execution begins. A vector of 0x08, for example, starts the AP at physical address 0x8000. This is why the code needs to be aligned on a 4 KiB boundary below 1 MiB.
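The mapping between the start address and the vector is just a shift by 12 bits. A small sketch, with `sipi_vector` and `sipi_entry` being names chosen for this example:

```rust
/// Converts the physical address of the AP entry point into a SIPI vector.
/// Returns None if the address is not 4 KiB aligned or not below 1 MiB.
pub fn sipi_vector(entry: u32) -> Option<u8> {
    if entry % 0x1000 != 0 || entry >= 0x10_0000 {
        return None;
    }
    Some((entry >> 12) as u8)
}

/// The inverse: the physical address an AP starts at for a given vector.
pub fn sipi_entry(vector: u8) -> u32 {
    (vector as u32) << 12
}
```

Checking the address like this at start-up catches a misplaced trampoline early, instead of leaving you with an AP that silently runs garbage.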

Getting to Long Mode

Now that the cores are awake and executing code, we can think about how to get them into long mode. If you wrote your own bootloader, this simply means repeating its steps, minus loading the kernel binary (it is already in memory). In my case, however, the kernel is booted into Protected Mode by a Multiboot-compatible bootloader, so there are quite a few steps to go.

Setup Protected Mode GDT

Although there is nothing preventing you from directly entering long mode from real mode, I decided to go through protected mode as a launch pad.

According to the Intel Manual, there are two main steps to enter protected mode:

Set the protected mode bit in CR0 (bit 0).
Set up at least a minimal segmentation system (we will not be enabling paging in protected mode).

For the second step, the easiest approach is to hard-code a GDT with two entries: one for CS (code segment) and one for DS (data segment). Then all we need to do is load this GDT and point ds, ss, and cs at the correct selectors, using mov for the data segments and a far jmp for cs.

core_wakeup:
    cli                     ; Disable interrupts, we want to be left alone

    xor ax, ax
    mov ds, ax              ; Set DS-register to 0 - used by lgdt

    lgdt [gdt_desc]         ; Load the GDT descriptor

    mov eax, cr0            ; Copy the contents of CR0 into EAX
    or eax, 1               ; Set bit 0
    mov cr0, eax            ; Copy the contents of EAX into CR0

    jmp dword 08h:smp_protected_entry   ; Far jump reloads CS; dword keeps the full 32-bit offset

gdt:                    ; Address for the GDT
gdt_null:               ; Null Segment
        dd 0
        dd 0

gdt_code:               ; Code segment, read/execute, nonconforming
        dw 0xFFFF
        dw 0
        db 0
        db 10011010b
        db 11001111b
        db 0

gdt_data:               ; Data segment, read/write, expand up
        dw 0xFFFF
        dw 0
        db 0
        db 10010010b
        db 11001111b
        db 0

gdt_end:                ; Used to calculate the size of the GDT

gdt_desc:                       ; The GDT descriptor
        dw gdt_end - gdt - 1    ; Limit (size)
        dd gdt                  ; Address of the GDT

bits 32
section .smp.protected

smp_protected_entry:
    mov ax, 0x10
    mov ds, ax
    mov ss, ax

    jmp _ap_start
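The GDT entries in the assembly above are hand-encoded bytes, which are easy to get wrong. As a cross-check, here is a sketch of a helper that packs a base, limit, access byte and flags nibble into the 8-byte descriptor format; `gdt_entry` is a name invented for this example and is not part of Hakutaku.

```rust
/// Packs a legacy segment descriptor into its 8-byte representation.
/// `limit` is 20 bits wide, `flags` is the upper nibble of byte 6
/// (granularity, size, long mode, available).
pub fn gdt_entry(base: u32, limit: u32, access: u8, flags: u8) -> u64 {
    let base = base as u64;
    let limit = (limit & 0xF_FFFF) as u64;
    (limit & 0xFFFF)                        // limit bits 0..=15
        | ((base & 0xFF_FFFF) << 16)        // base bits 0..=23
        | ((access as u64) << 40)           // access byte
        | ((limit >> 16) << 48)             // limit bits 16..=19
        | (((flags & 0x0F) as u64) << 52)   // flags nibble
        | ((base >> 24) << 56)              // base bits 24..=31
}
```

With a base of 0, a limit of 0xFFFFF, flags of 0b1100 and the access bytes 0b10011010 and 0b10010010, this produces 0x00CF9A000000FFFF and 0x00CF92000000FFFF, which in little-endian byte order match the code and data entries written out by hand above.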

Jump to Long Mode

Now all that’s left is the last bit of code to get us into Long Mode. Since we already have the initial GDT from the kernel’s own boot process, we can simply reuse that structure and code. The process is very similar to what we did above for protected mode, except that this time we will enable paging and load CR3 with the initial page table. (This will later be replaced with a page table assigned to each core.)

_ap_start:
    ; load P4 to cr3 register (cpu uses this to access the P4 table)
    mov eax, p4_table
    mov cr3, eax

    ; enable PAE-flag in cr4 (Physical Address Extension)
    mov eax, cr4
    or eax, 1 << 5
    mov cr4, eax

    ; set the long mode bit in the EFER MSR (model specific register)
    mov ecx, 0xC0000080
    rdmsr
    or eax, 1 << 8
    wrmsr

    ; enable paging in the cr0 register
    mov eax, cr0
    or eax, 1 << 31
    mov cr0, eax

    ; JMP to long
    lgdt [gdt64.pointer]

    jmp gdt64.code:_ap_long_mode_start

.loop:
    hlt
    jmp _ap_start.loop


enable_paging:
    ; load P4 to cr3 register (cpu uses this to access the P4 table)
    mov eax, p4_table
    mov cr3, eax

    ; enable PAE-flag in cr4 (Physical Address Extension)
    mov eax, cr4
    or eax, 1 << 5
    mov cr4, eax

    ; set the long mode bit in the EFER MSR (model specific register)
    mov ecx, 0xC0000080
    rdmsr
    or eax, 1 << 8
    wrmsr

    ; enable paging in the cr0 register
    mov eax, cr0
    or eax, 1 << 31
    mov cr0, eax

    ret

section .rodata.init
gdt64:
    dq 0 ; zero entry
.code: equ $ - gdt64 ; selector (byte offset) of the code segment
    dq (1<<43) | (1<<44) | (1<<47) | (1<<53) ; code segment
.pointer:
    dw $ - gdt64 - 1
    dq gdt64
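For reference, the magic numbers in the assembly above can be given names. The sketch below mirrors the same bits as Rust constants; the constant names are my own and only serve to label what each shift means.

```rust
/// CR4 bit 5: Physical Address Extension.
pub const CR4_PAE: u32 = 1 << 5;
/// MSR number of the Extended Feature Enable Register.
pub const EFER_MSR: u32 = 0xC000_0080;
/// EFER bit 8: Long Mode Enable.
pub const EFER_LME: u32 = 1 << 8;
/// CR0 bit 31: Paging.
pub const CR0_PG: u32 = 1 << 31;

/// Descriptor bits used by the 64-bit code segment.
pub const DESC_EXECUTABLE: u64 = 1 << 43;
pub const DESC_CODE_OR_DATA: u64 = 1 << 44;
pub const DESC_PRESENT: u64 = 1 << 47;
pub const DESC_LONG_MODE: u64 = 1 << 53;

/// Same content as the `gdt64` table: a null entry and one code segment.
pub const GDT64: [u64; 2] = [
    0,
    DESC_EXECUTABLE | DESC_CODE_OR_DATA | DESC_PRESENT | DESC_LONG_MODE,
];

/// Selector of the code segment: its byte offset into the table.
pub const GDT64_CODE_SELECTOR: u16 = 8;

/// Limit field of the GDT pointer: table size in bytes minus one.
pub fn gdt64_limit() -> u16 {
    (core::mem::size_of::<[u64; 2]>() - 1) as u16
}
```

The code descriptor works out to 0x0020980000000000, and the limit to 15.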

Summary

The whole process sure looks daunting at first, but it is actually rather straightforward to implement if you have got this far in your kernel. Once you are in long mode, simply proceed to replace the page table and GDT / TSS with custom ones. Just be careful: you cannot reuse one TSS on multiple cores.

Hopefully this guide is helpful to you. Feel free to contact me about any typos or mistakes; you can find my contact information in the About tab.