Hakutaku: SMP on x86-64
This is one post in my blog series about developing an operating system for the x86-64
platform in Rust.
I aim to cover a small topic in each post, along with some code samples to get you started if you are
also interested in doing such a project. The complete project is on GitHub.
Multi-core
Multi-core often sounds like a scary topic for most hobbyist OS developers. Today we will tackle this problem together in our Rust OS, Hakutaku.
Background
X86-64 SMP Start-Up General Process
On x86-64, the system firmware (BIOS) selects one processor as the Bootstrap Processor
(BSP), and the rest become Application Processors (APs). All the
APs are put into a “waiting” state until an INIT/SIPI (Start-up IPI) sequence is
received. Without any special operations, all of your code will run on the
BSP. In order to enable the other processors, we need to do a few
things.
- Locate and identify all the other processors on the system.
- Send INIT and SIPI Inter-Processor Interrupts (IPI) to each enabled processor.
- Set up a temporary GDT / IDT for the APs to bootstrap into Long Mode.
- Add the APs into the scheduler.
APIC / LAPIC
APIC, or the Advanced Programmable Interrupt Controller, is the interrupt mechanism Intel introduced as part of the Intel Multi-Processor (MP) specification. It is what modern processors use to manage interrupts.
The APIC system consists of two main components, the Local APIC (LAPIC) and the IO APIC. Each logical processor on the system has its own LAPIC, while the IO APIC is shared between processors (most small systems have a single one, although a system may have several). On boot, the IOAPIC should be configured to pass/emulate the legacy (8259) PIC’s interrupts. For now we will only be configuring the LAPIC. The IOAPIC is relatively complicated, so we will leave it for a later post.
The Local APIC
The Local APIC is relatively simple. For our purposes it consists of three sets of registers: the Interrupt Command Register (ICR), the Local APIC Timer, and the Local Vector Table (LVT). Information on the register layout can be found in the Intel Manual (Vol. 3, Section 10.4.1, Table 10-1).
Interrupt Command Register (ICR)
This is probably the most important register when it comes to waking up the other cores. It is the register you write to in order to send IPIs. The ICR controls what kind of interrupt is sent, which vector it is delivered to, and which processor receives it.
Local APIC Timer
The Local APIC Timer, as the name suggests, is local to each logical processor and can operate in one of three modes: One-Shot, Periodic and TSC-Deadline. In our case, since this timer will generate the periodic interrupt for our scheduler on each core, we will configure it in periodic mode.
Local Vector Table
Unlike the Interrupt Descriptor Table (IDT), which determines what action to take for each interrupt vector, the LVT controls which vector / interrupt is raised when a given local event is triggered. In our case we will need to configure the Spurious Interrupt vector and the Local Timer Interrupt vector.
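As a rough illustration of what "configuring" these registers means, here is a minimal sketch that only computes the values to write. The register offsets and bit positions follow the xAPIC layout in the Intel manual; the function names, the divide-by-16 setting and the idea of returning a list of writes are my own assumptions for the example, not code taken from Hakutaku.

```rust
// xAPIC register offsets, relative to the LAPIC MMIO base.
const REG_SPURIOUS: u32 = 0x0F0;
const REG_LVT_TIMER: u32 = 0x320;
const REG_TIMER_INITIAL_COUNT: u32 = 0x380;
const REG_TIMER_DIVIDE: u32 = 0x3E0;

/// Timer mode, stored in bits 17-18 of the LVT Timer register.
#[derive(Clone, Copy)]
enum TimerMode {
    OneShot = 0b00,
    Periodic = 0b01,
    TscDeadline = 0b10,
}

/// Spurious Interrupt Vector register: vector in bits 0-7,
/// bit 8 is the APIC software-enable bit.
fn spurious_value(vector: u8) -> u32 {
    (1 << 8) | vector as u32
}

/// LVT Timer register: vector in bits 0-7, mode in bits 17-18.
/// The mask bit (16) is left clear, so the interrupt is delivered.
fn lvt_timer_value(vector: u8, mode: TimerMode) -> u32 {
    ((mode as u32) << 17) | vector as u32
}

/// Hypothetical helper: the (register, value) pairs to write, in order.
/// Writing the initial count last is what starts the timer.
fn timer_setup_writes(timer_vector: u8, spurious_vector: u8, initial_count: u32) -> [(u32, u32); 4] {
    [
        (REG_SPURIOUS, spurious_value(spurious_vector)),
        (REG_TIMER_DIVIDE, 0b0011), // divide the bus clock by 16
        (REG_LVT_TIMER, lvt_timer_value(timer_vector, TimerMode::Periodic)),
        (REG_TIMER_INITIAL_COUNT, initial_count),
    ]
}
```

In a kernel each pair would become a volatile 32-bit write to the LAPIC base address plus the offset. The initial count itself has to be calibrated against another clock, which is outside the scope of this sketch.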
Advanced Configuration and Power Interface (ACPI)
I am sure many of you have heard of or seen this term before. ACPI defines a set of tables that describe the configuration of the hardware our OS is currently running on. They include a lot of information about IO mapping, IRQ routing, processor topology and much more.
Although some of the tables in ACPI are quite hard to parse (e.g. DSDT & SSDT), the MADT table we need right now is rather easy.
Multiple APIC Description Table (MADT)
MADT is a simple table whose entries describe all the LAPICs and, by association, their logical processors. Each LAPIC entry gives the processor ID and the LAPIC ID, plus flags describing the state the processor is in (enabled or disabled).
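To make the entry layout concrete, here is a minimal sketch of walking the MADT entries. It assumes the caller has already skipped the 44-byte MADT header (the 36-byte common table header, the 4-byte LAPIC address and the 4-byte flags) and hands in only the entry bytes. The `LocalApic` struct and `parse_local_apics` are names made up for this example, and `Vec` is used for brevity, so a kernel would need an allocator or a fixed-size array instead.

```rust
/// One "Processor Local APIC" entry (MADT entry type 0).
#[derive(Debug, PartialEq)]
struct LocalApic {
    processor_id: u8,
    apic_id: u8,
    enabled: bool,
}

/// Walks the variable-length MADT entries and collects the type 0 ones.
/// Every entry starts with a type byte and a length byte.
fn parse_local_apics(entries: &[u8]) -> Vec<LocalApic> {
    let mut found = Vec::new();
    let mut offset = 0;
    while offset + 2 <= entries.len() {
        let kind = entries[offset];
        let len = entries[offset + 1] as usize;
        // A too-short or overlong length means the table is malformed; stop.
        if len < 2 || offset + len > entries.len() {
            break;
        }
        if kind == 0 && len >= 8 {
            let flags = u32::from_le_bytes([
                entries[offset + 4],
                entries[offset + 5],
                entries[offset + 6],
                entries[offset + 7],
            ]);
            found.push(LocalApic {
                processor_id: entries[offset + 2],
                apic_id: entries[offset + 3],
                enabled: flags & 1 != 0, // bit 0 of the flags: processor enabled
            });
        }
        offset += len; // skip to the next entry, whatever its type
    }
    found
}
```

Entries of other types (IO APIC, interrupt source overrides and so on) are simply skipped by their length field, which is what makes this table easy to parse.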
Implementation
To add multi-core capability to Hakutaku, we first need to locate all the other logical processors. There are multiple ways of doing this. At the time of writing, the most reliable and widely used method is to locate and parse the ACPI tables.
ACPI Tables
There are two main ways to locate the ACPI tables, depending on whether you are booting
from BIOS or EFI. On EFI, the pointer is published through the EFI system table's
configuration table. In traditional BIOS boot mode, the ACPI
table pointer is placed in a certain section of system memory (the BIOS / BIOS
shadow area: the first 1 KiB of the EBDA, or the range 0xE0000-0xFFFFF).
You can simply search for the 8-byte string "RSD PTR " (with a trailing space,
not null-terminated), which always sits on a 16-byte boundary.
When you find the pointer, check its checksum and its revision
number to determine whether you are looking at ACPI 1.0 (revision 0) or later
ACPI tables.
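The search itself can be sketched as below. This is a minimal example that scans a byte slice rather than raw physical memory, so it assumes you have already mapped the candidate region and that the slice starts on a 16-byte boundary. `find_rsdp` is a name chosen for this example.

```rust
const RSDP_SIGNATURE: &[u8; 8] = b"RSD PTR ";

/// Scans `region` in 16-byte steps for a valid root pointer and returns
/// `(offset, revision)` of the first match.
fn find_rsdp(region: &[u8]) -> Option<(usize, u8)> {
    let mut offset = 0;
    // The ACPI 1.0 part of the structure is 20 bytes long:
    // signature (8), checksum (1), OEM ID (6), revision (1), RSDT address (4).
    while offset + 20 <= region.len() {
        let candidate = &region[offset..offset + 20];
        // All 20 bytes, checksum byte included, must sum to zero modulo 256.
        let sum = candidate.iter().fold(0u8, |acc, b| acc.wrapping_add(*b));
        if candidate.starts_with(RSDP_SIGNATURE) && sum == 0 {
            return Some((offset, candidate[15])); // revision lives at byte 15
        }
        offset += 16;
    }
    None
}
```

A revision of 0 means the 32-bit RSDT address at byte 16 is all you get. For revision 2 and later the structure is longer and also carries a 64-bit XSDT address with its own extended checksum, which this sketch does not verify.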
By following the RSD PTR, you should be able to locate the remaining tables
and find the previously mentioned MADT to list all the available
processors. Before you proceed to wake those processors up, you should
always check that they are not disabled, i.e. that they are actually waiting for a SIPI.
Waking Up
Now comes the interesting part: actually waking up the sleeping cores. To do this, I recommend starting with only one AP. This will help you detect problems in the process.
NOTE: When you wake up an application processor, it will boot into Real Mode
and have access to only the first 1 MiB of physical memory. So you should plan
for that and place some code there for the APs to run initially. Here I just have a
halt loop and an output to the POST code hex display (often on I/O port 0x80).
; Intel syntax, NASM
align 4096
bits 16
ap_boot:
    mov al, 0x45
    out 0x80, al        ; write a marker byte to the POST code port
.loop:
    hlt
    jmp ap_boot.loop
Note that the entry point has to be aligned to 4 KiB and located in the first MiB of memory.
According to the Intel manual, the protocol for waking up APs consists of the following:
- Send an INIT.
- Wait for 10 ms.
- Send the first SIPI.
- Wait for 200 µs.
- If the processor is still not running, send a second SIPI.
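The sequence above can be sketched in Rust as follows. The ICR offsets (0x300 / 0x310) and the delivery mode encodings come from the Intel manual for xAPIC mode; the `Platform` trait is purely an assumption of this example, standing in for whatever MMIO access, delay routine and "AP has checked in" flag your kernel provides.

```rust
const ICR_LOW: u32 = 0x300;
const ICR_HIGH: u32 = 0x310;
const ICR_INIT: u32 = 0x0000_4500; // delivery mode INIT (101b), level assert
const ICR_STARTUP: u32 = 0x0000_4600; // delivery mode Start-up (110b), level assert

/// Hypothetical abstraction over the hardware, so the logic can be tested.
trait Platform {
    fn lapic_write(&mut self, reg: u32, value: u32);
    fn delay_us(&mut self, us: u64);
    fn ap_is_running(&mut self) -> bool;
}

fn send_ipi<P: Platform>(p: &mut P, apic_id: u8, command: u32) {
    // Destination first: writing the low dword is what actually sends the IPI.
    p.lapic_write(ICR_HIGH, (apic_id as u32) << 24);
    p.lapic_write(ICR_LOW, command);
}

/// INIT, 10 ms, SIPI, 200 us, and a second SIPI only if needed.
/// Returns whether the AP reported in.
fn start_ap<P: Platform>(p: &mut P, apic_id: u8, sipi_vector: u8) -> bool {
    send_ipi(p, apic_id, ICR_INIT);
    p.delay_us(10_000);
    send_ipi(p, apic_id, ICR_STARTUP | sipi_vector as u32);
    p.delay_us(200);
    if !p.ap_is_running() {
        send_ipi(p, apic_id, ICR_STARTUP | sipi_vector as u32);
        p.delay_us(200);
    }
    p.ap_is_running()
}
```

A real implementation would probably also poll the delivery status bit (bit 12 of the low ICR dword) before sending the next IPI. I left that out to keep the sketch short.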
The SIPI Interrupt
The SIPI interrupt is a very special one: it kicks the processor out of its default “idling” state. Since we need to tell the processor where to begin execution, the vector field for this interrupt is special.
Instead of specifying an interrupt vector, the value in the vector field selects a 4 KiB frame, counted from physical address 0, at which to begin execution. For example, vector 0x08 starts the AP at physical address 0x8000. This is why the code needs to be aligned on a 4 KiB boundary below 1 MiB.
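The mapping from entry address to vector is a one-liner, but it is worth guarding with the two constraints. A small sketch (`sipi_vector` is a name made up for this example):

```rust
/// Converts the physical address of the AP entry point into the value for
/// the SIPI vector field, or `None` if the address cannot be expressed.
fn sipi_vector(entry: u32) -> Option<u8> {
    // Must sit on a 4 KiB boundary and below 1 MiB.
    if entry % 4096 != 0 || entry >= 0x10_0000 {
        return None;
    }
    // The AP starts executing at CS:IP = (vector << 8):0000,
    // which is physical address vector * 4096.
    Some((entry >> 12) as u8)
}
```

For instance, `sipi_vector(0x8000)` yields `Some(0x08)`, while an unaligned address such as `0x8010` is rejected.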
Getting to Long Mode
Now that the cores are all awake and executing code, we can think about how to get them into long mode. If you happened to write your own bootloader, this simply means repeating the same steps, but without having to load the binary this time (it is already in memory). In my case, however, the BSP is booted into Protected Mode by a multiboot-compatible bootloader, so there are quite a few steps to go for the APs.
Setup Protected Mode GDT
Although there is nothing preventing you from directly entering long mode from real mode, I decided to go through protected mode as a launch pad.
According to the Intel Manual, to enter protected mode, there are two main steps:
- Set the protected mode bit in CR0 (bit 0).
- Set up at least some sort of segmentation system (we will not be enabling paging in protected mode).
For #2, the easiest way to achieve this is to hard-code a GDT with two entries: one for CS (code segment) and one for DS (data segment). The GDT has to be loaded before the protected mode bit is set. Then all we need to do is point ds and ss at the data selector with mov, and cs at the code selector with a far jump.
bits 16                         ; the AP starts in real mode
core_wakeup:
    cli                         ; Disable interrupts, we want to be left alone
    xor ax, ax
    mov ds, ax                  ; Set DS-register to 0 - used by lgdt
    lgdt [gdt_desc]             ; Load the GDT descriptor
    mov eax, cr0                ; Copy the contents of CR0 into EAX
    or eax, 1                   ; Set bit 0 (protection enable)
    mov cr0, eax                ; Copy the contents of EAX into CR0
    ; Far jump to reload CS. "dword" forces a 32-bit offset in case the
    ; target is linked above 64 KiB.
    jmp dword 08h:smp_protected_entry
gdt:                            ; Address for the GDT
gdt_null:                       ; Null Segment
    dd 0
    dd 0
gdt_code:                       ; Code segment, read/execute, nonconforming
    dw 0xFFFF
    dw 0
    db 0
    db 10011010b
    db 11001111b
    db 0
gdt_data:                       ; Data segment, read/write, expand up
    dw 0xFFFF
    dw 0
    db 0
    db 10010010b
    db 11001111b
    db 0
gdt_end:                        ; Used to calculate the size of the GDT
gdt_desc:                       ; The GDT descriptor
    dw gdt_end - gdt - 1        ; Limit (size)
    dd gdt                      ; Address of the GDT
bits 32
section .smp.protected
smp_protected_entry:
    mov ax, 0x10                ; Selector of gdt_data
    mov ds, ax
    mov ss, ax
    jmp _ap_start
Jump to Long Mode
Now all that’s left is to set up the last bit of code to get us into Long Mode. Since we already have the initial GDT from the kernel’s initial booting process, we can simply reuse that structure and code. The process of getting to long mode is very similar to what we did above for protected mode, except that this time we will be enabling paging and loading CR3 with the initial page table. (This will later be replaced with a page table assigned to each core.)
_ap_start:
    ; load P4 to cr3 register (cpu uses this to access the P4 table)
    mov eax, p4_table
    mov cr3, eax
    ; enable PAE-flag in cr4 (Physical Address Extension)
    mov eax, cr4
    or eax, 1 << 5
    mov cr4, eax
    ; set the long mode bit in the EFER MSR (model specific register)
    mov ecx, 0xC0000080
    rdmsr
    or eax, 1 << 8
    wrmsr
    ; enable paging in the cr0 register
    mov eax, cr0
    or eax, 1 << 31
    mov cr0, eax
    ; JMP to long
    lgdt [gdt64.pointer]
    jmp gdt64.code:_ap_long_mode_start
.loop:
    hlt
    jmp _ap_start.loop
enable_paging:
    ; load P4 to cr3 register (cpu uses this to access the P4 table)
    mov eax, p4_table
    mov cr3, eax
    ; enable PAE-flag in cr4 (Physical Address Extension)
    mov eax, cr4
    or eax, 1 << 5
    mov cr4, eax
    ; set the long mode bit in the EFER MSR (model specific register)
    mov ecx, 0xC0000080
    rdmsr
    or eax, 1 << 8
    wrmsr
    ; enable paging in the cr0 register
    mov eax, cr0
    or eax, 1 << 31
    mov cr0, eax
    ret
section .rodata.init
gdt64:
    dq 0 ; zero entry
.code: equ $ - gdt64 ; selector of the code segment
    dq (1<<43) | (1<<44) | (1<<47) | (1<<53) ; code segment
.pointer:
    dw $ - gdt64 - 1
    dq gdt64
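The magic numbers in the listing are easier to review when they have names. The following sketch restates them as Rust constants. The bit positions are the ones used in the assembly, while the constant and function names are my own labels for them.

```rust
const CR4_PAE: u32 = 1 << 5; // Physical Address Extension
const IA32_EFER: u32 = 0xC000_0080; // MSR number of EFER
const EFER_LME: u32 = 1 << 8; // Long Mode Enable
const CR0_PG: u32 = 1 << 31; // Paging

/// The single code descriptor of the 64-bit GDT.
fn long_mode_code_descriptor() -> u64 {
    const EXECUTABLE: u64 = 1 << 43; // code segment, not data
    const CODE_OR_DATA: u64 = 1 << 44; // not a system descriptor
    const PRESENT: u64 = 1 << 47;
    const LONG_MODE: u64 = 1 << 53; // the L bit: 64-bit code
    EXECUTABLE | CODE_OR_DATA | PRESENT | LONG_MODE
}

/// The limit field of the GDT pointer is the table size in bytes minus one.
fn gdt_limit(entries: u16) -> u16 {
    entries * 8 - 1
}
```

The descriptor evaluates to 0x0020980000000000, and with two entries (null plus code) the limit is 15, which is what `$ - gdt64 - 1` computes at the `.pointer` label.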
Summary
The whole process sure looks daunting at first, but it is actually rather straightforward to implement if you have got this far in your kernel. Once you are in long mode, simply proceed to replace the page table and GDT / TSS with custom ones. Just be careful: you cannot reuse a TSS on multiple cores.
Hopefully this guide is helpful to you. Feel free to contact me about any typos or mistakes. You can find my contact information in the about tab.