Livepatching the kernel

One of the interesting features of modern operating systems is live patching: the ability to update the kernel without rebooting the system. Updates without downtime are valuable for businesses such as data centers and other service providers. In this post, we will take a look at how different kernels implement this feature. In general, users are reluctant to reboot running production systems, and rebooting is a hot discussion topic. Reboots are usually required by security updates and involve some risk management (deciding when to reboot). Well-known techniques such as clustering can help with this process, but they also bring additional costs. A more recent technique that some kernels have implemented is patching the kernel during normal operation. Applying patches to the running kernel without bringing the whole system down minimizes downtime while keeping the reliability and security of the infrastructure at the required level.

Live Patching concept

The idea of modifying running binary code is not new. There is a Microsoft patent from 2002, ‘Patching of in-use functions on a running computer system’. However, as we can read in one of the discussions in the Linux community, people were already patching in-memory code back in the PDP-11 days:

Microsoft came up with this novel new technique in the distant past: 2002. The posting immediately brought out a crowd of surprised graybeards who distinctly remember using such techniques on their PDP-11 systems some decades before Microsoft “invented” hot-patching. The basic claim of the patent would thus appear to be invalidated by some decades’ worth of prior art, but some of the dependent claims include features (such as capturing all other processors on the system) which were unlikely to be useful on PDP-11s.

What are we really trying to do?

Let’s imagine the situation when we need to change a constant value from 2 << 10 to 0xbadc0ffe in the following code:

# Function A before changes

int get_caps(int version)
{
	return (2 << 10) + get_cp(version);
}

# Function B after changes

int get_caps(int version)
{
	return 0xbadc0ffe + get_cp(version);
}

# Function get_caps before modification compiled with clang-902  

__Z8get_capsi:
      50:       55      pushq   %rbp
      51:       48 89 e5        movq    %rsp, %rbp
      54:       48 83 ec 10     subq    $16, %rsp
      58:       89 7d f8        movl    %edi, -8(%rbp)
      5b:       e8 00 00 00 00  callq   0 <__Z8get_capsi+0x10>
      60:       8b 7d f8        movl    -8(%rbp), %edi
      63:       89 7d fc        movl    %edi, -4(%rbp)
      66:       8b 7d fc        movl    -4(%rbp), %edi
      69:       e8 00 00 00 00  callq   0 <__Z8get_capsi+0x1E>
      6e:       05 00 08 00 00  addl    $2048, %eax
      73:       48 83 c4 10     addq    $16, %rsp
      77:       5d      popq    %rbp
      78:       c3      retq

# Function get_caps after modification compiled with clang-902 

__Z8get_capsi:
      50:       55      pushq   %rbp
      51:       48 89 e5        movq    %rsp, %rbp
      54:       48 83 ec 10     subq    $16, %rsp
      58:       89 7d f8        movl    %edi, -8(%rbp)
      5b:       e8 00 00 00 00  callq   0 <__Z8get_capsi+0x10>
      60:       8b 7d f8        movl    -8(%rbp), %edi
      63:       89 7d fc        movl    %edi, -4(%rbp)
      66:       8b 7d fc        movl    -4(%rbp), %edi
      69:       e8 00 00 00 00  callq   0 <__Z8get_capsi+0x1E>
      6e:       05 fe 0f dc ba  addl    $3134984190, %eax
      73:       48 83 c4 10     addq    $16, %rsp
      77:       5d      popq    %rbp
      78:       c3      retq
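
The only bytes that differ between the two listings above are the 32-bit immediate of the addl instruction. The sketch below is not part of the original build; it uses hypothetical helper names (encode_imm32, encode_add_eax_imm32) and assumes the short x86 form "05 imm32" for "addl $imm32, %eax", which is what both listings show at offset 6e.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: store a 32-bit immediate the way x86 encodes it,
 * least significant byte first (little-endian). */
void encode_imm32(uint32_t imm, uint8_t out[4])
{
	out[0] = (uint8_t)(imm & 0xff);
	out[1] = (uint8_t)((imm >> 8) & 0xff);
	out[2] = (uint8_t)((imm >> 16) & 0xff);
	out[3] = (uint8_t)((imm >> 24) & 0xff);
}

/* Hypothetical helper: build "addl $imm32, %eax" (opcode 0x05 + imm32). */
void encode_add_eax_imm32(uint32_t imm, uint8_t out[5])
{
	out[0] = 0x05;
	encode_imm32(imm, out + 1);
}
```

Feeding 2 << 10 (2048) and 0xbadc0ffe to encode_add_eax_imm32 reproduces the byte sequences 05 00 08 00 00 and 05 fe 0f dc ba from the two listings, which is why an in-place patch is enough in this particular case.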

As we can see, the only change in the machine code is the single value at offset 6e. The source change is just one line, and the amount of machine code it affects is equally small: as expected, only one constant value changed. But what happens if we compile this code in a slightly different way? The code above was compiled without optimizations, so let's see what happens if we use -O2.

# Function get_caps after modification compiled with -O2 with clang-902

__Z8get_capsi:
      60:       55                        pushq   %rbp
      61:       48 89 e5                  movq    %rsp, %rbp
      64:       85 ff                     testl   %edi, %edi
      66:       7e 1a                     jle     26 <__Z8get_capsi+0x22>
      68:       0f 1f 84 00 00 00 00 00   nopl    (%rax,%rax)
      70:       89 f8                     movl    %edi, %eax
      72:       d1 ef                     shrl    %edi
      74:       75 fa                     jne     -6 <__Z8get_capsi+0x10>
      76:       83 e0 01                  andl    $1, %eax
      79:       8d 84 00 fe 0f dc ba      leal    -1159983106(%rax,%rax), %eax
      80:       5d                        popq    %rbp
      81:       c3                        retq
      82:       b8 fe 0f dc ba            movl    $3134984190, %eax
      87:       5d                        popq    %rbp
      88:       c3                        retq

Now the machine code generated with optimizations does not look like the previous versions at all. What’s worse, the optimized machine code is not shorter than the unoptimized code, as we might have expected! The curious reader may also notice the two call instructions in the unoptimized listings. This is because in this example I passed the -pg flag to the compiler, which will be explained later. Unfortunately, changing a few machine instructions in place is not always possible in the general case. Many changes add instructions, and the resulting binary code is bigger than the original code. Below is an example of a recent security fix from the ZFS codebase. Such security checks usually bring additional conditional jump instructions. Replacing the code in place would not work: the new version of the function has more instructions, so it won’t fit within the original function boundaries.

@@ -60,10 +60,14 @@ zfs_init_vattr(vattr_t *vap, uint64_t mask, uint64
 {
 	VATTR_NULL(vap);
 	vap->va_mask = (uint_t)mask;
-	vap->va_type = IFTOVT(mode);
-	vap->va_mode = mode & MODEMASK;
-	vap->va_uid = (uid_t)uid;
-	vap->va_gid = (gid_t)gid;
+	if (mask & AT_TYPE)
+		vap->va_type = IFTOVT(mode);
+	if (mask & AT_MODE)
+		vap->va_mode = mode & MODEMASK;
+	if (mask & AT_UID)
+		vap->va_uid = (uid_t)uid;
+	if (mask & AT_GID)
+		vap->va_gid = (gid_t)gid;
 	vap->va_rdev = zfs_cmpldev(rdev);
 	vap->va_nodeid = nodeid;
 }

Jumping

To avoid the limitations imposed by the binary layout of functions, a jump instruction can be used. Instead of patching in place, which is not always possible, it is much better to allocate new memory and copy the patched code to that location. Then we compute the offset from the old code to the new code and add a trampoline jump to the new code.
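
The offset computation can be sketched in a few lines. This is a minimal userspace illustration, not code from any of the kernels discussed here: the helper name make_trampoline is hypothetical, and it assumes the x86-64 near jump "jmp rel32" (opcode 0xE9), whose displacement is counted from the end of the 5-byte instruction.

```c
#include <assert.h>
#include <stdint.h>

#define JMP_REL32_OPCODE 0xE9
#define JMP_REL32_SIZE   5

/*
 * Hypothetical helper: fill buf with a 5-byte "jmp rel32" that, when written
 * at old_addr, transfers control to new_addr.
 * Returns 0 on success, -1 if new_addr is out of rel32 range (+/- 2 GiB).
 */
int make_trampoline(uint64_t old_addr, uint64_t new_addr,
		    uint8_t buf[JMP_REL32_SIZE])
{
	/* The displacement is relative to the end of the jmp instruction. */
	int64_t disp = (int64_t)(new_addr - (old_addr + JMP_REL32_SIZE));

	if (disp < INT32_MIN || disp > INT32_MAX)
		return -1;

	uint32_t rel32 = (uint32_t)(int32_t)disp;

	buf[0] = JMP_REL32_OPCODE;
	/* x86 stores the displacement least significant byte first. */
	buf[1] = (uint8_t)(rel32 & 0xff);
	buf[2] = (uint8_t)((rel32 >> 8) & 0xff);
	buf[3] = (uint8_t)((rel32 >> 16) & 0xff);
	buf[4] = (uint8_t)((rel32 >> 24) & 0xff);
	return 0;
}
```

The range check matters in practice: if the newly allocated memory is more than 2 GiB away from the old function, a 5-byte relative jump cannot reach it and a different redirection mechanism is needed.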

Linux Live Patching

Bit of History

The first example of patching a running kernel is the Linux implementation. Over time, three different implementations of live patching were developed; then SUSE and Red Hat combined their efforts and provided a generic live-patching infrastructure. Below we briefly go through their history and differences.

kGraft (SUSE)
kPatch (Red Hat)
Ksplice (Ksplice, acquired by Oracle)
Linux live-patching (combined effort)

Ksplice

The first solution for patching a running Linux kernel was Ksplice, created by four MIT students and based on Jeff Arnold’s master’s thesis (initial release 2008). On 21 July 2011, Oracle Corporation announced that it had acquired Ksplice. Ksplice takes a modified kernel binary as input. It then compares the original running kernel with the modified one and extracts the modified symbols (function code). Next, Ksplice stops all CPUs except the one doing the patching; once the changes have been applied, the system returns to normal work.

kPatch

kPatch was developed by Red Hat (initial release February 2014). It provides a kernel part, which is responsible for applying the patch, and utilities for creating a patch. kPatch operates on functions (patching is done by replacing function bodies) by hooking function entry with ftrace. When a function is entered, control is passed at its very beginning to ftrace, which redirects execution to the replacement function (see the image below). kPatch makes patching safe by applying changes while all running processes are stopped and by ensuring that no thread is executing the functions being patched. As a side effect, a small latency is introduced to the system during the patching process.

picture 1: During live patching, calls to patched kernel functions are redirected to their replacements using ftrace. Source Wikipedia

kGraft

kGraft was developed by SUSE and released at almost the same time as kPatch (March 2014). kGraft aims to maximize system uptime and availability. Like kPatch, it is divided into kernel code and a userspace utility.
The biggest difference between the two is that kGraft does not need to stop the kernel (as kPatch does). To achieve this, kGraft patches functions per process: it tracks the execution context of each process (its "universe" view) and switches a process to the patched function only when the function being replaced is not on that process’s execution stack.

picture 2: Each process is monitored so it executes a patched function consistently within a single system call. Source Wikipedia

picture 3: After everything migrates to a new "universe", trampoline-style checks are no longer needed. Source Wikipedia
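
The per-process switch can be modelled in userspace as follows. This is only a sketch under simplifying assumptions, not kGraft code: all names (struct task, resolve, reach_safe_point, all_migrated) are hypothetical, and a task is assumed to move to the new universe at some safe point, such as a system call boundary, when the old function cannot be on its stack.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef int (*impl_fn)(int);

/* Hypothetical, simplified task: one flag records which universe it sees. */
struct task {
	bool in_new_universe;
};

/* Hypothetical record holding both versions of one patched function. */
struct patched_func {
	impl_fn old_fn;
	impl_fn new_fn;
};

/* Trampoline-style check: pick the version this task is allowed to run. */
impl_fn resolve(const struct patched_func *pf, const struct task *t)
{
	return t->in_new_universe ? pf->new_fn : pf->old_fn;
}

/* Called when a task reaches a safe point and may start using new code. */
void reach_safe_point(struct task *t)
{
	t->in_new_universe = true;
}

/* Once every task has migrated, the per-call check is no longer needed. */
bool all_migrated(const struct task *tasks, size_t count)
{
	for (size_t i = 0; i < count; i++)
		if (!tasks[i].in_new_universe)
			return false;
	return true;
}

/* Two toy implementations standing in for the old and patched function. */
int old_impl(int x) { return x + 2048; }
int new_impl(int x) { return x + 1; }
```

Until all_migrated() returns true, two tasks may legitimately run different versions of the same function, but each task sees one version consistently.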

Livepatching implementation

The current live-patching implementation is a combination of two techniques: kPatch and kGraft. The first upstream version used the technique of stopping the machine during patching. Later this approach was replaced with a more complicated implementation that can patch incrementally without suspending the machine. The move from the stop-machine approach to incremental patching required some additional changes in the kernel itself (see Livepatching challenges).

Patching the function

As we saw earlier, the most reliable way to patch a function is to jump to the new, patched function body; when the new function finishes, it returns to the caller's address. This jump can be implemented in a few different ways: we can place a jmp instruction with the address, which takes 5 bytes, or use a software interrupt, either carrying the address or performing the jump from the interrupt handler. The jump has to be placed at the very beginning of the function body, before any registers are pushed to the stack or space is allocated for local variables. Instead of overwriting the first 5 bytes of the original function, another feature of the kernel can be exploited. By default the kernel is compiled with the ftrace framework enabled, which requires the special gcc flag -pg. The gcc documentation says:

-pg Generate extra code to write profile information suitable for the analysis program gprof. You must use this option when compiling the source files you want data about, and you must also use it when linking.

With -pg, the compiler emits a profiling call at the start of every instrumented function, and the kernel replaces that call with a 5-byte nop until tracing is needed. So we do have empty space at the beginning of almost every function, and we can insert the jump there. But here is another detail of the Linux kernel: we do not need to insert the jump instruction ourselves. Coming back to the ftrace tracing framework, one of its features is the ability to hook functions. By using the tracing framework, execution can be redirected, and some additional logic can be performed before the new function is called. This process is illustrated below:

picture 4. The function using livepatching
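
The redirection step can be modelled in plain C. The sketch below is a userspace model, not kernel code and not the ftrace API: struct fake_regs, struct patch_entry and redirect_hook are hypothetical names. It only assumes what the text describes, namely that a hook running at function entry may change the address at which execution resumes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the saved CPU registers a tracing hook can edit. */
struct fake_regs {
	uint64_t ip; /* where execution resumes after the hook returns */
};

/* Hypothetical record of one patched function. */
struct patch_entry {
	uint64_t old_addr; /* entry address of the original function */
	uint64_t new_addr; /* entry address of the replacement */
	int enabled;       /* non-zero while the patch is active */
};

/*
 * Model of the hook that runs at function entry: if the traced function has
 * an active patch, resume execution in the replacement instead.
 * Returns 1 if execution was redirected, 0 otherwise.
 */
int redirect_hook(uint64_t traced_addr, struct fake_regs *regs,
		  const struct patch_entry *patches, size_t count)
{
	for (size_t i = 0; i < count; i++) {
		if (patches[i].enabled && patches[i].old_addr == traced_addr) {
			regs->ip = patches[i].new_addr;
			return 1;
		}
	}
	return 0; /* no active patch: the original body runs as usual */
}
```

Disabling a patch in this model is just clearing the enabled flag: the hook then leaves the resume address alone and the original function body runs again.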

We can also build our own live patch and verify this process using a debugger. Below is the beginning of the unpatched function that we are going to patch: its first 5 bytes are empty (a nopl instruction). After the patch is applied, there is a hook in place of the empty instruction.

# Dump of assembler code for function cmdline_proc_show before patching

0xffffffff814982e0 <+0>:     nopl   0x0(%rax,%rax,1)                         
0xffffffff814982e5 <+5>:     push   %rbp                                     
0xffffffff814982e6 <+6>:     mov    %rsp,%rbp                                
0xffffffff814982e9 <+9>:     push   %rbx                                     
0xffffffff814982ea <+10>:    mov    %rdi,%rbx                                
0xffffffff814982ed <+13>:    mov    $0xffffffff83558200,%rdi                 
0xffffffff814982f4 <+20>:    callq  0xffffffff8138a5a0 <__asan_load8>        
0xffffffff814982f9 <+25>:    mov    0x20bff00(%rip),%rdx
0xffffffff81498300 <+32>:    mov    %rbx,%rdi                                
0xffffffff81498303 <+35>:    mov    $0xffffffff825418a0,%rsi                 
0xffffffff8149830a <+42>:    callq  0xffffffff814087f0 <seq_printf>          
0xffffffff8149830f <+47>:    xor    %eax,%eax                                
0xffffffff81498311 <+49>:    pop    %rbx                                     
0xffffffff81498312 <+50>:    pop    %rbp                                     
0xffffffff81498313 <+51>:    retq

# Dump of assembler code for function cmdline_proc_show after patching

0xffffffff814982e0 <+0>:     callq  0xffffffffc04c8000                       
0xffffffff814982e5 <+5>:     push   %rbp                                     
0xffffffff814982e6 <+6>:     mov    %rsp,%rbp                                
0xffffffff814982e9 <+9>:     push   %rbx                                     
...                                   
0xffffffff81498312 <+50>:    pop    %rbp                                     
0xffffffff81498313 <+51>:    retq

Stop Machine vs Incremental Patching:

The most important part of changing code in memory is making sure that the function being patched is in a deterministic state. The worst thing that can happen is changing code that is currently being executed: in the best case this triggers some error as a side effect, but generally it causes a system crash. The first implementation of Linux live patching was designed to perform the patching while the machine is halted. During the stop, all processes except the patching process are suspended and interrupts are disabled. The patching process checks the execution stacks of all threads to make sure that the function can be switched to the newer version; if this is not possible, the attempt is aborted and repeated after some time. This operation is illustrated below.

picture 5. Patching with machine stop. Source: [6] Kpatch Without Stop Machine
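
The stack check performed while the machine is stopped can be sketched as follows. This is a simplified userspace model with hypothetical names (func_range, thread_stack, safe_to_patch); it assumes each thread's stack has already been unwound into a list of return addresses, which is the hard part in a real kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of the function about to be replaced. */
struct func_range {
	uint64_t start; /* address of the first instruction */
	uint64_t size;  /* length of the function body in bytes */
};

/* Hypothetical, already unwound call stack of one thread. */
struct thread_stack {
	const uint64_t *return_addrs;
	size_t depth;
};

static bool addr_in_func(uint64_t addr, const struct func_range *f)
{
	return addr >= f->start && addr - f->start < f->size;
}

/*
 * The function may be switched only if no thread has a frame inside it.
 * If this returns false, the caller aborts and retries later.
 */
bool safe_to_patch(const struct thread_stack *threads, size_t nthreads,
		   const struct func_range *old_func)
{
	for (size_t t = 0; t < nthreads; t++)
		for (size_t i = 0; i < threads[t].depth; i++)
			if (addr_in_func(threads[t].return_addrs[i], old_func))
				return false;
	return true;
}
```

The check is only as good as the unwound stacks it is given, which is why unreliable call stacks are a recurring problem for live patching.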

Patching a function while the machine is stopped is simple and safe. However, for some types of workloads, such as control or network appliances, halting the machine even for a couple of milliseconds is unacceptable. Another difficult situation can arise in virtual environments where CPUs are not pinned to the guest: the whole process can take longer, because all vCPUs have to be scheduled on the host machine. Because of these factors, a new approach based on the initial kGraft implementation was developed to provide live patching without halting the machine.

picture 6. Livepatching without machine stop. Source: [6] Kpatch Without Stop Machine

Xen Live Patching

The first official work on Xen live patching was presented at the Linux Plumbers Conference in 2014 by Martin Pohlack. Amazon did not open source its implementation of live patching, but instead shared its design and some architectural decisions (see [1]). A year later, work on live patching the Xen hypervisor was taken over by a joint effort from Oracle and Citrix, and in 2016 the first version of live patching was released. The Xen live-patching design is similar to Linux kpatch; however, the design of the Xen hypervisor kernel is a little different from that of the Linux kernel. The biggest difference is that Xen is a monolithic kernel without support for loadable kernel modules, so support for linking and for parsing binary files was not implemented in the kernel. The next significant difference is that Xen has no ftrace-like framework, so inserting a jump hook had to be implemented standalone. A further difficulty is that the hypervisor kernel is not compiled with gcc's -pg flag, so no dedicated space is present and the trampoline code overwrites the first 5 bytes of the patched function. Another area where the hypervisor community paid more attention is the correctness of the patched code. Because Xen previously never linked in machine code, it had no need for safeguards of this kind. Now, before any patch is applied, the build-id is checked. Thanks to this mechanism, the kernel knows that the code about to be applied comes from the same hypervisor and compiler version.
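
The build-id check amounts to a byte-for-byte comparison made before anything is applied. The sketch below is not Xen code; struct build_id and build_id_matches are hypothetical names, and the build-id is assumed to be an opaque byte string carried by both the running hypervisor and the patch payload.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical container for a build-id: an opaque byte string. */
struct build_id {
	const uint8_t *bytes;
	size_t len;
};

/* A payload may be applied only if it names exactly the running build. */
bool build_id_matches(const struct build_id *running,
		      const struct build_id *wanted)
{
	if (running->len == 0 || running->len != wanted->len)
		return false;
	return memcmp(running->bytes, wanted->bytes, running->len) == 0;
}
```

A mismatch means the payload was built against different binary code, so offsets and function layouts cannot be trusted and the patch has to be rejected.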

Xen patching process

The Xen patching process is similar to the techniques described for Linux. However, the hypervisor kernel has a slightly different design, which influences the logic required for the consistency check. For consistency, the stack should be checked, as we cannot patch a function that is in use by another CPU. To simplify this problem, patching is performed at a deterministic point, when the hypervisor has no stack. This implementation required some changes to the scheduling code.

picture 7. Live Update on Xen. Source: [2] Patching with Xen Livepatch

AIX Live Update

One interesting implementation of live operating system updates was introduced by IBM in 2015 as part of the AIX operating system. It is interesting because it went a different way than all the Linux solutions and Xen (which follows the Linux solutions).
AIX Live Update requires an additional partition at least as large as the existing root partition. During patching, a surrogate kernel is created on the additional partition while the old kernel continues to run the existing workload. Once the surrogate is created, the surrogate partition goes through a checkpoint process, and processes are paused and migrated to the new kernel. Once this is finished, the migrated processes are restarted (un-paused) and continue to run on the new partition. The process of live update is illustrated in the scheme below.
Unfortunately, the only place where I found information about the live update feature, the IBM tech blog, does not contain details about the implementation or about how the migration process makes sure that functions can be safely migrated.
What is interesting about the AIX solution is that, due to the design of live update, the problem of the consistency check could in theory be eliminated. One discussion in the Linux community also argued for such a solution instead of incremental patching [15].

picture 8. Live Update on AIX. Source: [3] AIX Live-update

Building own trivial livepatch on Linux

Almost all live-patching frameworks provide specific build tools to prepare a patch. However, on Linux, as long as we do not need special ELF symbols and the module does not need to access non-exported parts of the code, we can build a live-patch module using the default kernel build system.

Dummy Example

Some examples come inside <linux-upstream>/samples/livepatch. We will compile and apply livepatch-sample. This code comes as an LKM and is a single *.c file; however, it needs to be built together with the entire kernel (and it also needs to be selected in menuconfig under samples). As we don’t want to wait a long time for an entire kernel to compile just to get the examples (I believe there might be a way to skip the whole build and compile only the samples, but I didn’t find it), we will provide an external makefile:

TARGET = livepatch-sample

# Specify your kernel source directory here, e.g.:
KDIR := /root/kernel/linux-4.15.13
PWD := $(shell pwd)

# Some macros require declarations after statements
ccflags-y := -std=gnu99 -Wno-declaration-after-statement

obj-m += $(TARGET).o

.PHONY: default clean

# M= tells kbuild to build this directory as an external module.
# Recipe lines must start with a tab character.
default:
	$(MAKE) -C $(KDIR) M=$(PWD) modules

clean:
	$(MAKE) -C $(KDIR) M=$(PWD) clean

Now, after compilation (by typing make inside the <linux-upstream>/samples/livepatch folder), we can test our livepatch:

 $ cat /proc/cmdline
 <your cmdline>

 $ insmod livepatch-sample.ko
 $ cat /proc/cmdline
 this has been live patched

 $ echo 0 > /sys/kernel/livepatch/livepatch_sample/enabled
 $ cat /proc/cmdline
 <your cmdline>

Livepatching challenges:

Consistency check

Since April 2015, there has been ongoing work on porting kGraft to the common live-patching core provided by the Linux kernel mainline [14]. However, implementation of the required function-level consistency mechanisms has been delayed because the call stacks provided by the Linux kernel may be unreliable in situations that involve assembly code without proper stack frames; as a result, the porting work remained in progress as of September 2015. In an attempt to improve the reliability of the kernel’s call stacks, a specialized stack sanity-check userspace utility has also been developed.

Other challenges (not solved entirely)

Although live patching looks like a well-studied topic, there are still challenges that are hard to mitigate in a generic way. Below is a list collected from existing discussions and conference talks.

Inline assembly patching
NMI and MCE handling when patching
Patching Scheduler functions
Unhelpful compiler optimizations: -fipa-sra -fipa-pure-const -fipa-icf -fipa-ra

References:

Xen Live-patching

[1] A design proposal for Xen hotpatching Martin Pohlack 2014-10-17 Slides
[2] Patching with Xen LivePatch Non disruptive patching of hypervisor Konrad Rzeszutek Wilk, Ross Lagerwall YT presentation Slides

AIX Live Update:

[3] AIX Live Update - No Reboot Required! Non-disruptive OS Updates! Slides

Linux Live-patching:

[4] kpatch Have your security and eat it too! Josh Poimboeuf LinuxCon North America August 22, 2014 Presentation
[5] kGraft Live patching of the Linux kernel Presentation
[6] Kpatch Without Stop Machine The Next Step of Kernel Live Patching Presentation
[7] Livepatching kernel documentation
[8] Ksplice wiki
[9] Ksplice: Automatic Rebootless Kernel Updates
[10] kGraft wiki
[11] kpatch wiki
[12] Linux Stack Validation LWN
[14] Unhelpful compiler optimizations LWN
[15] A unified consistency model LWN

GNU GCC

[13] Gcc GNU documentation