How new-lines affect the Linux kernel performance

The Linux kernel strives to be fast and efficient. As it is written mostly in C, it can mostly control how the generated machine code looks. Nevertheless, since the kernel code is compiled into machine code, the compiler optimizes the generated code to improve its performance. The kernel code, however, employs uncommon coding techniques, which can foil these optimizations. In this blog-post, I share my experience in analyzing the reasons for poor inlining of kernel code. Although the performance improvements are not significant in most cases, understanding these issues is valuable in preventing them from growing larger. New-lines, as promised, will be one of the reasons, though not the only one.

New lines in inline assembly

One fine day, I encountered a strange phenomenon: minor changes I made to the Linux source code caused a small but noticeable performance degradation. As I expected these changes to actually improve performance, I decided to disassemble the functions I had changed. To my surprise, I realized that my change caused functions that were previously inlined not to be inlined anymore. The decision not to inline these functions seemed dubious, as they were short.

I decided to investigate this issue further and to check whether it affects other parts of the kernel. Arguably, it is rather hard to say whether a function should be inlined, so some sort of indication of bad inlining decisions is needed. C functions that are declared with the inline keyword are not bound to be inlined by the compiler, so having a non-inlined function that is marked with the inline keyword is not, by itself, an indication of a bad inlining decision.

Arguably, there are two simple heuristics to find functions which were suspiciously not inlined for the wrong reason. One heuristic is to look for short (binary-wise) functions by looking at the static symbols. A second heuristic is to look for functions which appear in multiple translation units (objects), as this might indicate they were declared as inline but were eventually not inlined, and that they are in common use. In both cases, there may be valid reasons for the compiler not to inline functions even if they are short, for example if they are used as a value for a function pointer. However, they can give an indication if something is "very wrong" in how inlining is performed, or more correctly, ignored.

In practice, I used both heuristics, but in this post I will only use the second one to check whether inlining decisions seem dubious. To do so, I rebuilt the kernel, using the localyesconfig make target to incorporate the modules into the core. I ensured the "kernel hacking" features in the config were off, as those tend to bloat the size of the code and rightfully cause functions not to be inlined. I then looked for the static functions which had the most instances in the built kernel:
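One way to run this search can be sketched with nm(1) over the built image. The vmlinux path and the exact filter here are my own reconstruction, not necessarily the commands used for the original measurement:

```shell
# Count how many instances each local (static) text symbol has in the
# built kernel image; many copies of the same name suggest an "inline"
# function that was not actually inlined. nm --size-sort prints the
# symbol size in the first column.
nm --size-sort vmlinux |
  awk '$2 == "t" { count[$3]++; size[$3] = $1 }
       END { for (f in count) print count[f], size[f], f }' |
  sort -rn | head
```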

The results were suspicious. As mentioned before, in some cases there are good reasons for functions not to be inlined. jhash() is a big function (303 bytes), so it is reasonable for it not to be inlined. dst_output()'s address is used as a function pointer, which causes it not to be inlined. Yet the other functions seem to be great candidates for inlining, and it is not clear why they are not inlined. Let's look at the source code of copy_overflow(), which has many instances in the binary:

static inline void copy_overflow(int size, unsigned long count) 
{ 
  WARN(1, "Buffer overflow detected (%d < %lu)!\n", size, count); 
}

Will the disassembly tell us anything?


0xffffffff819315e0 <+0>: push   %rbp 
0xffffffff819315e1 <+1>: mov    %rsi,%rdx 
0xffffffff819315e4 <+4>: mov    %edi,%esi 
0xffffffff819315e6 <+6>: mov    $0xffffffff820bc4b8,%rdi  
0xffffffff819315ed <+13>: mov    %rsp,%rbp 
0xffffffff819315f0 <+16>: callq  0xffffffff81089b70 <__warn_printk> 
0xffffffff819315f5 <+21>: ud2 
0xffffffff819315f7 <+23>: pop    %rbp  
0xffffffff819315f8 <+24>: retq 
 
Apparently not. Notice that out of the 9 assembly instructions shown above, 6 deal with the function entry and exit - setting up the frame pointer and shuffling the arguments into place - and only 3 (loading the format string, the call to __warn_printk(), and the ud2) do the actual work.

To understand the problem, we must dig deeper and look at the warning mechanism in Linux. On x86, this mechanism shares its infrastructure with the bug reporting mechanism. When a bug or a warning is triggered, the kernel prints the filename and the line number in the source code that triggered it, which can then be used to analyze the root cause. A naive implementation, however, would pollute the code-cache with this information, as well as with the function call to the function that prints the error message, consequently causing performance degradation.

Linux therefore uses a different scheme: it emits an exception-triggering instruction (ud2 on x86) and saves the warning information in a bug table that resides in a different section of the executable. When a warning is issued using the WARN() macro, an exception is triggered, and the exception handler looks up the warning information - the source-code filename and line - in the table.
Inline assembly is used to save this information in _BUG_FLAGS(). Here is its code after some simplifications to ease readability: 

asm volatile("1: ud2\n" 
 ".pushsection __bug_table,\"aw\"\n" 
 "2: .long 1b - 2b\n" /* bug_entry::bug_addr */ 
 " .long %c0 - 2b\n" /* bug_entry::file */ 
 " .word %c1\n" /* bug_entry::line */ 
 " .word %c2\n" /* bug_entry::flags */ 
 " .org 2b+%c3\n" 
 ".popsection" 
 : : "i" (__FILE__), "i" (__LINE__), 
 "i" (flags), 
 "i" (sizeof(struct bug_entry)));
 
Ignoring the assembly shenanigans that this code uses, we can see that in practice it generates a single ud2 instruction. However, the compiler considers this code to be "big" and consequently oftentimes does not inline functions that use WARN() or similar functions.

The reason turns out to be the newline characters (marked as '\n' above). GCC, the compiler used to build the kernel, is unaware of the code size that will be generated by the inline assembly. It therefore tries to estimate the size based on the newline characters and statement separators (';' on x86). In GCC, we can see the code that performs this estimation in the estimate_num_insns() function:

int estimate_num_insns (gimple *stmt, eni_weights *weights)
{
  ...
    case GIMPLE_ASM:
      {
        int count = asm_str_count (gimple_asm_string (as_a <gasm *> (stmt)));
        /* 1000 means infinity. This avoids overflows later
           with very long asm statements.  */
        if (count > 1000)
          count = 1000;
        return count;
      }
  ...
}
 
Note that this pattern, of saving data using inline assembly, is not limited to bugs and warnings. The kernel uses it for many additional purposes: exception tables, which gracefully handle exceptions that are triggered inside the kernel; the alternative instructions table, which tailors the kernel at boot time to the specific CPU architecture extensions that are supported; annotations that are used for stack metadata validation by objtool, and so on.

Before we get to solving this problem, a question needs to be raised: is the current behavior flawed at all? Eventually, the size of the kernel will increase if functions that use WARN(), for example, will be inlined. This increase in size can cause the kernel image to be bigger, and since the Linux kernel cannot be paged out, will also increase memory consumption. However, the main reason that the compiler strives to avoid inflation of the code size is to avoid pressure on the instruction cache, whose impact may offset inlining benefits. Moreover, the heuristics of other compiler optimizations (e.g., loop optimizations) depend on the size of the code.

Solving the problem is not trivial. Ideally, GCC would use an integrated assembler, as LLVM does, which would give a better estimation of the generated code size of inline assembly. Experimentally, LLVM seems to make the right inlining decisions and is not affected by new-lines or by data that is set in other sections of the executable. Interestingly, it appears to do so even when the integrated assembler is not used for assembly. GCC, however, invokes the GNU assembler after the code is compiled, which prevents it from getting a correct estimation of the code size.
Alternatively, the problem could have been solved by overriding GCC's code size estimation through a directive or a built-in function. However, looking at GCC code does not reveal a direct or indirect way to achieve this goal.

One may think that using the always_inline function attribute to force the compiler to inline functions would solve the problem. It appears that some have encountered the problem of poor inlining decisions in the past and, without understanding the root cause, used this solution. However, this solution has several drawbacks. First, it is hard to add and maintain these annotations. Second, this solution does not address other code optimizations that rely on code-size estimation. Third, the kernel uses various configurations and supports multiple CPU architectures, which may require a certain function to be inlined in some setups and not inlined in others. Finally, and most importantly, using always_inline can just push the problem upwards to the calling functions, as we will later see.

Therefore, a more systematic solution is needed. The solution comes in the form of assembler macros that hold the long assembly code, leaving a single line inside the inline assembly that invokes the macro. This solution not only improves the generated machine code, but also makes the assembly code more readable, as it avoids various quirks that are required in inline assembly, for example new-line characters. Moreover, in certain cases this change allows consolidating the currently separate implementations that are used in C and assembly, which eases code maintenance.
Addressing the issue shows a performance improvement of tens of cycles for certain system calls, which is admittedly not too notable. After addressing these issues, we see copy_overflow() and other functions disappear from the list of commonly non-inlined inline functions:

Instances Size               Function Name 
 9        000000000000012f t jhash 
 8        0000000000000011 t kzalloc 
 7        0000000000000017 t dst_output 
 5        000000000000002f t acpi_os_allocate_zeroed 
 5        0000000000000029 t acpi_os_allocate

However, we got some new ones. Let's try to understand where they come from.

Constant computations and inlining

As shown, kzalloc() is not always inlined, although its code is very simple. 

static inline void *kzalloc(size_t size, gfp_t flags) 
{ 
  return kmalloc(size, flags | __GFP_ZERO); 
}
 
The assembly, again, does not provide any answers as to why it is not inlined:

0xffffffff817929e0 <+0>: push %rbp 
0xffffffff817929e1 <+1>: mov $0x14080c0,%esi 
0xffffffff817929e6 <+6>: mov %rsp,%rbp 
0xffffffff817929e9 <+9>: callq 0xffffffff8125d590 <__kmalloc> 
0xffffffff817929ee <+14>: pop %rbp 
0xffffffff817929ef <+15>: retq 
 
The answer to our question lies in kmalloc(), which is called by kzalloc() and is considered to have many instructions according to GCC's heuristics. kmalloc() is inlined since it is marked with the always_inline attribute, but its estimated instruction count is then attributed to the calling function, kzalloc() in this case. This result exemplifies why the use of the always_inline attribute is not a sufficient solution to the inlining problem.
Still, it is not clear why GCC estimates that kmalloc() would be compiled into many instructions. As shown, it is compiled into a single call to __kmalloc(). To answer this question, we need to follow kmalloc() code, which eventually uses the ilog2() macro to compute the log2 of an integer, in order to compute the page allocation order.

Here is a shortened version of ilog2():

#define ilog2(n) \
( \
  __builtin_constant_p(n) ? ( \
    /* Optimized version for constants */ \
    (n) < 2 ? 0 : \
    (n) & (1ULL << 63) ? 63 : \
    (n) & (1ULL << 62) ? 62 : \
    ...
    (n) & (1ULL << 3) ? 3 : \
    (n) & (1ULL << 2) ? 2 : \
    1 ) : \
  /* Another version for non-constants */ \
  (sizeof(n) <= 4) ? \
    __ilog2_u32(n) : \
    __ilog2_u64(n) \
)

As shown, the macro first uses the built-in function __builtin_constant_p() to determine whether n is known to be a constant at compilation time. If n is known to be constant, a long series of conditions is evaluated to compute the result at compilation time, which allows further optimizations. Otherwise, if n is not known to be constant, short code is emitted that computes the result at runtime. Yet, regardless of whether n is constant or not, all of the conditions in the ilog2() macro are evaluated at compilation time and do not translate into any machine code instructions.

However, although the generated code is efficient, it causes GCC, again, to incorrectly estimate the number of instructions that ilog2() takes. Apparently, the number of instructions is estimated before inlining decisions take place, and at this stage the compiler usually does not yet know whether n is constant. Later, after inlining decisions are made, GCC cannot update the instruction count estimation accordingly.

This inlining problem is not as common as the previous one, yet it is not rare. Bit operations (e.g., test_bit()) and bitmaps commonly use __builtin_constant_p() in the described manner. As a result, functions that use these facilities, for example cpumask_weight(), are not inlined.
A possible solution for this problem is to use the built-in __builtin_choose_expr() to test __builtin_constant_p(), instead of using C if-conditions and conditional operators (?:):

#define ilog2(n) \
( \
  __builtin_choose_expr(__builtin_constant_p(n), \
    ((n) < 2 ? 0 : \
    (n) & (1ULL << 63) ? 63 : \
    (n) & (1ULL << 62) ? 62 : \
    ...
    (n) & (1ULL << 3) ? 3 : \
    (n) & (1ULL << 2) ? 2 : \
    1), \
    (sizeof(n) <= 4) ? \
    __ilog2_u32(n) : \
    __ilog2_u64(n)) \
)

This built-in is evaluated earlier in the compilation process, before inlining decisions are made. Yet, there is a catch: as this built-in is evaluated earlier, GCC is only able to determine that an argument is constant for constant expressions, which can cause less efficient code to be generated. For instance, if a constant is given as a function argument, GCC will not be able to determine it is constant. In the following case, for example, the non-constant version will be used:

int bar(int n) 
{ 
  return ilog2(n); 
} 

int foo(int n)
{
  return bar(n); 
} 

v = foo(bar(5)); /* will use the non-constant version */
 
It is therefore questionable whether using __builtin_choose_expr() is an appropriate solution. Perhaps it is better to just mark functions such as kzalloc() with the always_inline attribute. Compiling using LLVM reveals, again, that LLVM inlining decisions are not negatively affected by the use of __builtin_constant_p().

Function attributes

Finally, there are certain function attributes that affect inlining decisions. Using function attributes to set an optimization level for specific functions can prevent the compiler from inlining those functions or the functions that they call. The Linux kernel rarely uses such attributes, but one instance is in the KVM function vmx_vcpu_run(), a very hot function that launches or resumes the virtual machine. The optimization attribute is actually used in this function just to prevent cloning of the function. Its side-effect, however, is that none of the functions it uses are inlined, including, for example, the function to_vmx():

0x0000000000000150 <+0>: push %rbp
0x0000000000000151 <+1>: mov %rdi,%rax 
0x0000000000000154 <+4>: mov %rsp,%rbp
0x0000000000000157 <+7>: pop %rbp
0x0000000000000158 <+8>: retq 
 
This function does nothing more than return the same argument it received. Not inlining functions that are called by vmx_vcpu_run() induces significant overhead, which can be as high as 10% for a VM-exit.

Finally, the cold function attribute causes inlining to be done less aggressively. This attribute informs the compiler that a function is unlikely to be executed, and the compiler, among other things, optimizes such functions for size rather than speed, which can result in very conservative inlining decisions. All the __init and __exit functions, which are used during kernel and module (de)initialization, are marked as cold. It is questionable whether this is the desired behavior.

Conclusions

Despite the fact that C appears to give us great control over the generated code, this is not always the case. Compiler extensions may be needed to give programmers greater control. Tools that analyze whether the generated binary is efficient, given the source code, may be needed as well. In the meantime, there is no alternative to manual inspection of the generated binary code.

Thanks to Linus Torvalds, H. Peter Anvin, Masahiro Yamada, Josh Poimboeuf, Peter Zijlstra, Kees Cook, Ingo Molnar and others for their assistance in the analysis and in solving this problem.

Are you protected against Spectre & Meltdown? Probably not

Several months ago, everybody was surprised to find out that CPU speculative execution can be exploited to leak privileged information, using attacks that were named Spectre and Meltdown. Technical blogs and even the mainstream media reported broadly about these vulnerabilities. CPU vendors have struggled to provide firmware patches that would prevent the attacks in a timely manner. OS and other software providers introduced software solutions, such as the retpoline and page-table isolation (PTI), to protect against these vulnerabilities. All of these mitigations caused performance degradation, required extensive engineering effort, and caused various problems.

So half a year later - are you protected? Probably not. Recently I found that the Linux protection against Spectre v2 is broken in virtual machines, and Dave Hansen found a bug in the Meltdown protection. Spectre v1 was never considered fully resolved, with mitigations still coming in, and even the existing ones were found to be buggy.
There is clearly a problem, as it is currently hard for people to realize whether they are protected against these attacks. Sure, one can check whether the OS or another piece of software reports that it is protected, but these reports might be wrong. There is a fundamental problem with these protections that, in a way, is the same one that caused Linus Torvalds to politely (yes, politely) decline a Meltdown mitigation technique we proposed*:

Sure, I can see it working, but it's some really shady stuff, and now the scheduler needs to save/restore/check one more subtle bit.
And if you get it wrong, things will happily work, except you've now defeated PTI. But you'll never notice, because you won't be testing for it, and the only people who will are the black hats.
This is exactly the "security depends on it being in sync" thing that makes me go "eww" about the whole model. Get one thing wrong, and you'll blow all the PTI code out of the water.

Linus's criticism of our work is valid, yet it does not seem that other protection mechanisms against these vulnerabilities are much better. And even if the OS is well protected against these vulnerabilities, nobody guarantees that the system will remain safe after, say, an out-of-tree module is loaded. All it takes for the Spectre v2 protection to be broken, for example, is a single indirect branch that was not converted into a retpoline.

It seems that in order to make the protection work, independent tools that validate the protection mechanisms are needed. I found the Spectre v2 issue by using the hardware performance counters to count indirect branches executed by the kernel, and finding that the count was not zero. Dumping the page-tables and tracing translation-lookaside buffer (TLB) invalidations can be used to find PTI bugs. Anti-malware tools should take up the gauntlet and perform these checks.

Yet, perhaps there is an additional problem of over-hyping Spectre. Side-channel attacks were known long before Spectre, and invoking them using speculative execution may not be such a game-changer. Unlike Meltdown, which is a real CPU bug, the Spectre family of vulnerabilities may pose lower risk as they are not easily exploitable. The Spectre v2 proof-of-concept exploited some Linux wrongdoings (e.g., not zeroing registers after a context switch from a virtual machine), which were relatively easily fixed and became a good mitigation against other OS bugs. Some new Spectre attacks were not even reported to be successfully exploitable other than in artificial demos.

It might be that the industry over-reacted to Spectre. Even if Spectre vulnerabilities are addressed, software might still leak privileged data through side-channels, so it is not as if the existing protection schemes are complete. Now that the media frenzy is gone, perhaps it is time to reconsider whether paying in performance for questionable "generic" protection schemes against these attacks makes sense, or whether protection should be done on a case-by-case basis.

Update: I wonder whether Windows is indeed safe. Windows uses retpolines, but it is not clear whether they are used exclusively or together with alternative solutions that use hardware mitigations (IBPB/IBRS). Anyhow, measuring the performance counters of Windows 10 (running in a VM) raises some questions, as they show that there are indirect branches inside the Windows kernel. Here are the performance counters as measured in a KVM guest:

$ sudo perf stat -e br_inst_exec.taken_indirect_jump_non_call_ret:Gk \ 
 -e br_inst_exec.taken_indirect_near_call:Gk -a -- sleep 5 

Performance counter stats for 'system wide':

    1,682,939 br_inst_exec.taken_indirect_jump_non_call_ret:Gk 
    1,102,037 br_inst_exec.taken_indirect_near_call:Gk
  5.001077704 seconds time elapsed

* Based on work with Michael Wei and Dan Tsafrir (who may not share my views)

roundup() vs round_up()

Occasionally, you make a mistake that makes you think that surely others have made a similar one.
Such a case happened to me one day when I was trying, in Linux, to round up a number (x) to a multiple of another number (y). The Linux kernel provides two macros for such an operation:
  • round_up(x, y) which only works when y is a power of 2, as it does the rounding using bit operations for efficiency. If y is not a power of 2, it quietly returns the wrong result.
  • roundup(x, y) which works for any value of y.
Yes, this is not a typo. One macro is named round_up() and the other roundup(). Both macros are defined in the same file (kernel.h), so their developers should have known that there are similarly named macros. They are not defined, however, next to each other and there is no comment in the code to explain the difference between the two, so it is very easy for a developer to use the wrong macro. 

There is also no assertion to ensure that y is a power of 2 in round_up(), and unfortunately, adding such an assertion is not as simple as adding a single line, since the macro is also used to define the size of some arrays.

Both macros are widely used (~500 times each), which made me suspect that in some cases round_up() might have been used inappropriately (i.e., when y is not a power of 2). Out of curiosity, I added a static assertion just for the cases in which y is known to be constant at compilation time.

Immediately, two bugs were found, one in a GPIO driver and a second one in the USB video class driver. The impact of these bugs is not clear. There might actually be more bugs in cases where y is not known during compilation time or when a different kernel configuration is used.

A less lazy person would have written a Coccinelle script to rename one of the functions and to add the required assertions, but making such tree-wide changes is a time-consuming task. I have just submitted patches to fix the issues I found.

People can take different lessons from this story. Personally, it just made me think: how many more silly bugs exist that can easily be prevented by adding simple assertions?

Can you break Windows Page Table Isolation?

The Meltdown and L1TF attacks proved that Intel CPUs are susceptible to attacks that use speculative execution to leak data from the kernel address-space. The root cause of both of these attacks is the ability of unprivileged code to speculatively read data even when the page-table permissions should prevent the access. To address this security vulnerability, Page Table Isolation (PTI) has been introduced by all common OSes. When PTI is used, a different set of page-tables is used while user-space code runs.

The key for PTI to be effective is that it only allows access to a minimal set of privileged pages, whose content can be leaked. These pages include memory that holds various privileged architectural data-structures, for example the Global Descriptor Table (GDT), and privileged trampoline code. This trampoline code runs when a system call, interrupt or exception is invoked, and switches to the kernel page-table hierarchy so the kernel can handle the event.

Is PTI implemented correctly? Well, in Linux, the PTI implementation is open for review, reasonable, and easy to audit. Microsoft provides a lot of information about the implementation of PTI in their system, but the code is obviously not open for review. Out of curiosity, I decided to check Windows 10. Using KVM and a simple script, I dumped the pages which are mapped as supervisor pages in the user-space page-tables. The results were surprising:

$ sudo ./pti_test.py --socket ~/vm-images/qmp-sock

 kernel frames: 124 
 page tables accessible by the user: 0 
 page tables accessible by the kernel: 115 
 writable page tables: 115 
 page tables: 115 
 global kernel frames: 9 
 kernel w+x: 118 
 kernel/user aliases: 0

These results surprised me for two reasons. First, Windows 10 does not implement W^X, and maps pages which are both writable and executable even in the user page-tables. Inspecting the content of these pages shows that at least some of them are guaranteed not to hold code, and should have been marked as non-executable in the page-tables. I contacted Microsoft, which claimed that this is intended, since "in some cases the kernel is mapped with large pages", and that this can be prevented by enabling virtualization-based security (VBS). However, dumping the page-tables (you can use the attached script) shows that no huge pages were used, at least in my experiment, and still pages were mapped as writable and executable. VBS is not commonly used, and it might induce significant performance overheads, especially when Windows runs inside a virtual machine.

The second issue is that the page-tables themselves are mapped in the user page-tables. Actually, Microsoft previously had a bug in which these page-tables could even be modified by user-space applications (in some Windows versions). That obviously was a terrible security vulnerability, since user-space applications could have established page-table entries that would allow them to access (read or write) any piece of memory. Today, the page-tables are still mapped as writable, but are only accessible by privileged code. Microsoft, again, claimed that this is necessary. I cannot understand why - Linux, for example, does not need to do so.

Can anyone exploit this behavior? Well, besides the fact that the lack of W^X eases fully compromising a system once another security vulnerability is found, Windows' behavior might potentially enable more Meltdown-like attacks. As shown by the Meltdown and L1TF vulnerabilities, Intel CPUs defer at least some PTE validity and permission checks, yet still use the PTE contents during speculative execution. Mapping as few pages as possible in the user-space page-tables is necessary for good Meltdown mitigation, as Microsoft itself acknowledges. If one could somehow speculatively inject code, or modify the page-tables or the executable code and use the modified versions, this would constitute a security vulnerability.

Still, it is not clear whether this actually poses a security vulnerability. Initially, I thought that mapping the page-tables might be exploited by speculatively-executed code that sets page-table entries which grant user-space code access to privileged data, and then reads the data and leaks it. However, this is unlikely to work. Accessing the data can only be done after the page-table entry is cached in the translation lookaside buffer (TLB), and the TLB should not be able to hold entries whose content is based on speculative execution. If TLB entries were set based on speculative execution results, it would cause correctness issues.

Feel free to use my silly script (through this link) and let me know what you think. And if you are a security researcher, it might be worthwhile to have a look at Windows' Spectre v2 mitigations. As I noted before, performance counters indicate that there are indirect branches inside the Windows kernel, which might not be safe if Windows relies solely on retpolines for Spectre v2.
