Exploring the scene behind the power button

I am a sophomore at Northwestern University studying computer science in the McCormick School of Engineering and Applied Science. I work with Professor Peter Dinda of the Department of Electrical Engineering and Computer Science, who heads the Prescience Lab. I became involved with Professor Dinda's research through the V3VEE project, an open-source virtual machine framework for modern architectures. We work on numerous projects, including Palacios, an OS-independent and embeddable virtual machine monitor. My research interests lie in the broad area of computer systems, including virtualization, operating systems, security, and compilers. Outside of class I work as a part-time software engineering intern for Shmoop, an education startup based in Mountain View, California, or tinker with toy projects like TCCLK (Tiny C Compiler in Linux Kernel).

Page Allocation in Palacios

In today’s somewhat delayed post, I will be talking about the page allocator that I wrote to port the Palacios VMM to the Nautilus kernel.

In operating systems and kernels, a page refers to a fixed-size block of memory. It is the basic unit in which the kernel allocates memory, both for itself and for user-level programs. For example, a page table is usually allocated within a page by the kernel. Page size differs across platforms (e.g., 4KB on a Linux kernel), and it affects the performance of the kernel's page allocator.

A page allocator, on the other hand, is what divides memory into page-sized chunks and hands them out to the kernel. In my project, the Palacios VMM needs its own page allocator, because Nautilus and Palacios have different page sizes. Nautilus has a page size of 2MB, which is 512 times larger than the standard 4KB pages that Palacios uses (the same size Linux kernels use). This means that if Palacios asks its host, Nautilus, for a page, it will receive a 2MB page instead of a 4KB one. When Palacios was ported to the Linux kernel this was not a problem, because page sizes are 4KB in Linux kernels. Here, however, I need a page allocator that grabs a certain number of 2MB pages from Nautilus and then hands them out to the guest in 4KB chunks. This process is (somewhat) better shown in the diagram below.

[Diagram: the page allocator splitting 2MB Nautilus pages into 4KB Palacios pages]
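To put concrete numbers on the diagram, here is a quick back-of-the-envelope snippet (an illustration, not code from the project):

#include <stdio.h>

#define NAUTILUS_PAGE_SIZE (2UL * 1024 * 1024)  /* 2MB pages from the host */
#define PALACIOS_PAGE_SIZE (4UL * 1024)         /* 4KB pages Palacios expects */

int main(void) {
    /* Each 2MB page from Nautilus can be carved into 512 4KB pages. */
    printf("%lu small pages per large page\n",
           NAUTILUS_PAGE_SIZE / PALACIOS_PAGE_SIZE);
    return 0;
}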

There are numerous algorithms for this, and one of the best known is the buddy allocator. I was tempted to use it at first, but my advisor Prof. Peter Dinda suggested starting with a simple allocator "that works". The reasoning: if I happen to have a bug in the page allocator itself, I won't be able to proceed until I catch it. Or, even worse, the allocator might seem bug-free when I write it but harbor a bug that only gets revealed at a later stage of the research, which would be hard to track down and fix. Taking his advice, I decided to use the simplest allocator possible: the bitmap allocator. This algorithm uses a single bit per page to indicate whether that page is free (0) or allocated (1). Hence, it significantly reduces the extra space the allocator itself needs (a memory allocator takes up memory, too). The diagram below shows visually how this algorithm works.

[Diagram: bitmap allocation]

Red = Allocated, Green = Free

My implementation of the bitmap page allocator keeps an array of bits, each bit representing one page in Palacios. The first row of the diagram shows a starting state: some pages are allocated to the kernel, and most are free. If Palacios calls alloc(2), meaning "give me 2 pages", the allocator walks through the bits looking for 2 contiguous bits that are marked 0 (green). It reaches the third bit and sees that it is free, but the next one is already occupied, so those pages cannot satisfy this request. It keeps going until it reaches the 8th bit, where there are indeed 2 contiguous free bits. It then allocates those pages and marks their bits as allocated (colored orange in the second row of the diagram above). Similarly, if another request comes in for 1 page, the allocator can now hand out the page behind the 3rd bit, since that request needs only 1 page.
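For the curious, a minimal sketch of this kind of bitmap allocator is below. It illustrates the algorithm as described above; it is not the actual code from my port, and the names and the fixed pool size are my own simplifications.

#include <stdint.h>
#include <stddef.h>

#define NUM_PAGES 1024
#define PAGE_SIZE 4096

static uint8_t bitmap[NUM_PAGES / 8];   /* 1 bit per page: 1 = allocated */
static uintptr_t pool_base;             /* set at init to the start of the pool */

static int  test_bit(long i)  { return bitmap[i / 8] &  (1 << (i % 8)); }
static void set_bit(long i)   {        bitmap[i / 8] |= (1 << (i % 8)); }
static void clear_bit(long i) {        bitmap[i / 8] &= ~(1 << (i % 8)); }

/* Find n contiguous free pages, mark them allocated, return their address. */
void *bitmap_alloc(long n) {
    long run = 0;
    for (long i = 0; i < NUM_PAGES; i++) {
        run = test_bit(i) ? 0 : run + 1;   /* length of free run ending at i */
        if (run == n) {
            long first = i - n + 1;
            for (long j = first; j <= i; j++)
                set_bit(j);
            return (void *)(pool_base + first * PAGE_SIZE);
        }
    }
    return NULL;  /* no run of n free pages */
}

void bitmap_free(void *addr, long n) {
    long first = ((uintptr_t)addr - pool_base) / PAGE_SIZE;
    for (long j = first; j < first + n; j++)
        clear_bit(j);
}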

It is important to note that this algorithm is not particularly efficient, because it scans the bits one by one in linear time (actually, my initial implementation was quadratic, but I fixed it). However, it is very simple (~150 lines of C code), and simplicity prevents bugs. Eventually I plan to dig further into this and write an optimized allocator that best suits the needs of this particular project, but I decided to move on for now.

Porting Palacios to Nautilus

Porting software to another platform is not as easy as it sounds. For those unfamiliar with what porting means, think of how you can't run Windows programs directly on a Mac (i.e., you can't run a program with a .exe extension on a Mac, or one with a .dmg extension on a Windows machine). This is because parts of the software depend on the platform they run on. Even creating a simple thread differs widely among kernels: under Windows NT (the kernel that modern Windows OSes like Windows 7 and Windows 8 use), the CreateThread function creates a new thread, whereas in the Linux kernel you can use kthread_create(), which is declared in <linux/kthread.h>.
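For instance, spawning a kernel thread inside a Linux kernel module looks roughly like this (a minimal sketch using the standard kthread API; my_worker is just an illustrative name):

#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/err.h>

static int my_worker(void *data) {
    /* do work until someone calls kthread_stop() on us */
    while (!kthread_should_stop())
        schedule();
    return 0;
}

static void spawn_worker(void) {
    struct task_struct *t = kthread_create(my_worker, NULL, "my_worker");
    if (!IS_ERR(t))
        wake_up_process(t);   /* kthread_create leaves the thread stopped */
}

A port has to map each such platform-specific call onto whatever the target kernel provides.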

For that reason, the Palacios VMM cannot run directly on the Nautilus kernel as of now. However, it is possible to port Palacios to Nautilus, because Palacios is designed to be embeddable in different OSes and has already been ported to numerous platforms, such as the Linux kernel and the Kitten lightweight kernel.

The way this works is shown in this rather mediocre diagram I drew:

[Diagram: Palacios compiled into the static library libv3vee.a and linked into the Nautilus kernel]

All the functions defined in the Palacios source code can be compiled into a static library called libv3vee.a. Nautilus can then be recompiled to link against this library, so that all the relevant VMM functionality is embedded in the kernel. Interaction with the guest OS is supported by the definitions the libv3vee library provides. libv3vee itself is about 8MB in size, and combined with Nautilus it produces a roughly 20MB image, which is not an impossible size to embed in a ROM.

Palacios needs a defined set of OS hooks to work properly. The code below shows this:

[Code screenshot: the OS hooks struct, with the palacios_-prefixed implementations]

This struct defines the set of hooks that Palacios needs to run a guest. Most of them come down to functionality for memory and page allocation, CPU and process control for multiprocessing, and threads and locks. The functions on the right, which begin with the prefix palacios_, are functions I had to write; most are just direct calls into functions Nautilus provides. There was a slight complication with the memory-related functions, which I will explain in another post.
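Since screenshots don't reproduce well, here is an illustrative approximation of what such a hooks struct looks like. The real struct v3_os_hooks in Palacios has more fields and somewhat different signatures, so treat this as a sketch of the idea rather than the actual interface:

/* Illustrative approximation of the Palacios OS-hooks interface;
 * the real struct has more fields and different details. */
struct v3_os_hooks {
    void  (*print)(const char *fmt, ...);

    /* memory and page allocation */
    void *(*allocate_pages)(int num_pages, unsigned int alignment);
    void  (*free_pages)(void *page, int num_pages);
    void *(*vaddr_to_paddr)(void *vaddr);
    void *(*paddr_to_vaddr)(void *paddr);

    /* threads and CPU control */
    void *(*start_kernel_thread)(int (*fn)(void *), void *arg, char *name);
    void  (*start_thread)(void *thread);
    void  (*yield_cpu)(void);

    /* locks */
    void *(*mutex_alloc)(void);
    void  (*mutex_lock)(void *mutex);
    void  (*mutex_unlock)(void *mutex);
    void  (*mutex_free)(void *mutex);
};

/* Each hook gets wired to a palacios_-prefixed wrapper, e.g.
 * .allocate_pages = palacios_allocate_pages, and so on. */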

Some of these functions were not provided by the Nautilus kernel, such as starting a thread that has already been created (start_thread). To provide this, I had to extend the Nautilus kernel's thread implementation. It wasn't too much work – I just added a way to associate a thread id with the information about that thread, like the function to execute and its input arguments. After that, running a thread came down to just 3 lines of code, which look something like this:

int nk_thread_run (nk_thread_id_t t) {
    nk_thread_t * tid = (nk_thread_t*)t;

    // set up the thread's stack before running it
    thread_setup_init_stack(tid, tid->fun, tid->input);

    // put the thread on the run queue of the CPU it is bound to
    nk_enqueue_thread_on_runq(tid, tid->bound_cpu);

    return 0;
}
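With that call in place, wiring up the corresponding Palacios hook presumably becomes a one-liner; a hypothetical wrapper might look like this (illustrative, not the actual project code):

/* Hypothetical glue: the palacios_-prefixed hook just forwards
 * the opaque thread handle to the new Nautilus call. */
static void palacios_start_thread(void *thread_ptr) {
    nk_thread_run((nk_thread_id_t)thread_ptr);
}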

If only it had worked after just that, my project would be almost over, and I would have posted many more blog posts during the past few weeks I have been silent (…). But life is never that easy in systems programming: lots of things have been broken and fixed, and some remain broken.

I'll be posting, possibly tomorrow, about the page allocator I had to write, and about a bizarre bug that was solved even though I still cannot figure out why the fix worked.

Progress so far

It's been quite a while since my last blog post – I've dived deep since then, and I thought it would be good timing to post about the progress so far before I go deeper.

To recap, my research project breaks down into 3 parts:

1. Flashing Coreboot onto the Gizmosphere machine provided to me.
2. Getting Coreboot to boot the Nautilus kernel, a tiny kernel written by Kyle Hale, a Ph.D. student in our lab.
3. Porting the Palacios virtual machine monitor to the Nautilus kernel.

The first part was (relatively) easily done, because Gizmosphere provides an interface for flashing its ROM. However, for the sake of debugging, I decided to use QEMU, an open-source machine emulator (you can point it at a custom firmware image with something like qemu-system-x86_64 -bios coreboot.rom). Debugging Coreboot under QEMU is much easier, because debugging at the hardware level is difficult, especially for processes that happen before the OS comes into the picture.

For the past week, I've been struggling with Coreboot. Coreboot, previously known as the LinuxBIOS project, is an open-source project aimed at completely replacing the BIOS. As mentioned in previous posts, the BIOS is often troublesome: it may contain strange bugs, yet its code is not open at all – in short, it's like an ugly duckling that no one wants to take care of, mainly because of its nature.

Coreboot supports many different CPU architectures, including the most popular ones, such as x86.

Compiling Coreboot

The most difficult part of dealing with Coreboot was compiling it. Coreboot aims to eventually replace the BIOS, which means it has to set up the hardware of many different architectures. Cross compilation is the term for this situation: compiling a program targeted at one architecture on a machine with a different architecture. Cross compilation requires very specific toolchains, a requirement that is quite common for low-level code that deals with hardware behavior. However, Coreboot used by far the most specific toolchain I have ever seen. Not only did it require particular versions of compilers such as gcc, it also applied patches to specific versions of its library dependencies. In addition, for some reason Coreboot didn't compile on Ubuntu, probably due to toolchain issues. I tried to compile Coreboot on multiple server machines in our lab, and there was one machine where compilation succeeded (an x86 machine running the ancient Fedora 15... but whatever, it works). I decided not to struggle with installing the toolchain on my local Ubuntu or Mac machine and moved forward.

Initial Attempt – Coreboot + SeaBIOS + GRUB2 + Nautilus

Coreboot uses something called a payload, which is just an ELF-formatted program that it jumps to after doing basic setup of the machine. Coreboot can also be configured to use an open-source BIOS like SeaBIOS and a bootloader like GRUB. To see whether Nautilus would boot under a (more normal) scenario, I configured Coreboot to use SeaBIOS as its BIOS and GRUB2 as the bootloader.

This worked without any problems, as expected, and I started moving forward.

Successfully booted Nautilus kernel – a screen I really want to see without GRUB and SeaBIOS

Next: Coreboot + GRUB2 + Nautilus

I then configured Coreboot to not use any BIOS at all, and instead use GRUB2 directly as the payload to boot the Nautilus kernel. This also worked without any problems, and I was quite surprised that Coreboot did everything needed to set up Nautilus correctly. Booting was also extremely fast thanks to Nautilus's minimal boot time (the Nautilus kernel boots in roughly the time it takes a Linux kernel to fork – basically clone – a process).

The real trouble: Coreboot + Nautilus

This didn't go so well. I configured Coreboot to load the Nautilus kernel image directly, and the machine went into an infinite loop of boot -> kernel panic -> reboot. This indicated a double fault, which happens when a fault occurs in the middle of handling another fault. But it was unclear whether the fault originated in Coreboot or in Nautilus.

Since then it has been a long debugging process that hasn't ended yet. First, I wasn't even sure whether Coreboot loaded the Nautilus kernel into memory correctly, so I looked at the different segments. Using readelf (readelf -l lists each segment's physical load address), I compared the physical addresses of the segments of the Nautilus executable with the outputs from Coreboot, and they matched. I then looked into the Coreboot code to see what was happening, and it simply did a jmp instruction to the location of Nautilus. I then debugged by modifying the Nautilus kernel's main() function (the place where it starts executing) and seeing what changed.

outputs from Coreboot – I don’t see anything from Nautilus 🙁

Interestingly enough, the kernel didn't panic anymore when I commented out the code that draws the cool "Nautilus" sign at the start of kernel setup. I started to wonder what could be causing this, and discussed the problem with Professor Dinda and Kyle. While we're not entirely sure, the double fault may have been caused by the multiboot tables not being set up correctly. Even with the print removed, problems remained: the kernel got stuck while setting up a serial output device.

The root of the problem is that Nautilus is a multiboot-compliant kernel, and Coreboot does not support the version of multiboot that Nautilus uses (which is quite strange, since supporting it is not too much work). I started modifying the Coreboot code to do the setup according to the Multiboot 2 specification.
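For context, a Multiboot 2 compliant kernel identifies itself with a header near the start of its image, and the bootloader is then expected to load the kernel and hand it a boot information structure. Below is a rough sketch of such a header, based on the Multiboot 2 specification; Nautilus's actual header may differ:

#include <stdint.h>

#define MB2_HEADER_MAGIC 0xe85250d6u  /* identifies a Multiboot 2 header */
#define MB2_ARCH_I386    0u

struct mb2_header {
    uint32_t magic;
    uint32_t architecture;
    uint32_t header_length;
    uint32_t checksum;       /* magic + arch + length + checksum == 0 (mod 2^32) */
    /* a list of tags follows; here, just the terminating end tag */
    uint16_t end_tag_type;   /* 0 */
    uint16_t end_tag_flags;  /* 0 */
    uint32_t end_tag_size;   /* 8 */
};

__attribute__((section(".multiboot"), aligned(8)))
const struct mb2_header mb2 = {
    .magic         = MB2_HEADER_MAGIC,
    .architecture  = MB2_ARCH_I386,
    .header_length = sizeof(struct mb2_header),
    .checksum      = (uint32_t)-(MB2_HEADER_MAGIC + MB2_ARCH_I386
                                 + (uint32_t)sizeof(struct mb2_header)),
    .end_tag_size  = 8,
};

The part I am adding to Coreboot is essentially the other side of this contract: finding the header, loading the image, and building the boot information tables before jumping to the kernel.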

My experience with research so far

So far, I'm going through the pain that comes with any systems research project: debugging. I can't debug using tools like GDB, and have to rely on reading memory dumps and register values. Yet I'm learning a lot from this project, since I get to look into different parts of the system, such as virtual machines, the OS, and bootloaders.

In the near future I will be posting on my work with porting Palacios to Nautilus.


Shortening the Boot Time – Why Bother?

To give a better sense of the "big picture" of my research, I'd like to talk about why I am doing this. Shortening the boot time of a computer seems nice, but not necessary. One important application I'd like to focus on is in data centers.

The demand for large data centers has been rising along with the Internet boom. From familiar consumer-facing social media like Facebook, Twitter, and Instagram to business-facing services like Amazon AWS and Akamai, these services need a lot of servers to operate. The servers reside in data centers, which consist of thousands of computers. Every time we post a photo on Instagram or watch a video on YouTube, we're communicating with one of these servers.

One thing to note about servers in a data center is that they are left on all the time, even when they are not being used. On average, 90% of the servers in a data center are idle (not doing anything). The main reason is unpredictable peak usage from the consumer side. If 90% of these servers were turned off and the number of users suddenly increased, the service could not meet consumers' needs. (Who wants to wait a minute for a Google search, 3 minutes for a photo on Facebook, or 2 hours for a video on YouTube? Certainly not me.) That can seriously damage the company's business, so data center operators are hesitant to turn off any servers even when they aren't being used.

These idle servers lead to enormous power consumption. A report from the Natural Resources Defense Council says that U.S. data centers consumed 91 billion kWh of electricity in a single year – enough to power all of New York City's households twice over.

Researchers have been working on algorithms that turn off idle servers. An example is AutoScale, a policy based on leaving a "padding" of idle servers. Servers cannot simply be turned off on demand, because they take a very long time to boot – usually around 300 seconds. To bridge this gap, AutoScale leaves some extra servers on and turns off the rest; when requests increase, more servers are turned on, and the padding of extra servers handles the increased load in the meantime.

The idea of my research is that if the long boot time of servers – currently around 300 seconds – can be brought down to a few seconds, fewer "extra" servers need to be left on when turning policies like AutoScale into actual mechanisms. Eventually this could lead to a more realistic and efficient adoption of algorithms like AutoScale, helping reduce the energy consumption of data centers.

Getting started

Not many people know what goes on behind the scenes when we press the power button on a computer. Fewer still know that the time it takes to fully boot a computer is inflated by legacy code in the BIOS, making our brand-new machines go through the very same steps that early computers in the 70s went through. Regardless, our personal computers boot fast enough – perhaps a minute or two at most. The story is different, however, when we're talking about huge server racks in data centers. Bringing down the boot time of these servers has a range of applications, one of them being a potential reduction in the power consumption of large data centers.

This summer I will be conducting research to show that it is possible to bring down the boot time through virtualization – more specifically, by replacing the BIOS firmware with a kernel that is small enough to fit in the ROM where the BIOS usually resides, yet capable enough to launch a virtual machine monitor, which will run the server OS as its guest.

I anticipate this blog's posts to be a little more technical and jargon-heavy than what most people are used to, so I will try to post not only my research progress but also, from time to time, background information on the topic.