| @node Security |
| @chapter Security |
| |
| @section Overview |
| |
| This chapter explains the security requirements that QEMU is designed to meet |
| and principles for securely deploying QEMU. |
| |
| @section Security Requirements |
| |
| QEMU supports many different use cases, some of which have stricter security |
| requirements than others. The community has agreed on the overall security |
| requirements that users may depend on. These requirements define what is |
| considered supported from a security perspective. |
| |
| @subsection Virtualization Use Case |
| |
| The virtualization use case covers cloud and virtual private server (VPS) |
| hosting, as well as traditional data center and desktop virtualization. These |
| use cases rely on hardware virtualization extensions to execute guest code |
| safely on the physical CPU at close-to-native speed. |
| |
| The following entities are untrusted, meaning that they may be buggy or |
| malicious: |
| |
| @itemize |
| @item Guest |
| @item User-facing interfaces (e.g. VNC, SPICE, WebSocket) |
| @item Network protocols (e.g. NBD, live migration) |
| @item User-supplied files (e.g. disk images, kernels, device trees) |
| @item Passthrough devices (e.g. PCI, USB) |
| @end itemize |
| |
| Bugs affecting these entities are evaluated on whether they can cause damage in |
| real-world use cases and are treated as security bugs if they can. |
| |
| @subsection Non-virtualization Use Case |
| |
| The non-virtualization use case covers emulation using the Tiny Code Generator |
| (TCG). In principle, the TCG and device emulation code used in conjunction with |
| the non-virtualization use case should meet the same security requirements as |
| the virtualization use case. However, for historical reasons, much of the |
| non-virtualization use case code was not written with these security |
| requirements in mind. |
| |
| Bugs affecting the non-virtualization use case are not considered security |
| bugs at this time. Users with non-virtualization use cases must not rely on |
| QEMU to provide guest isolation or any security guarantees. |
| |
| @section Architecture |
| |
| This section describes the design principles that ensure the security |
| requirements are met. |
| |
| @subsection Guest Isolation |
| |
| Guest isolation is the confinement of guest code to the virtual machine. When |
| guest code gains control of execution on the host, this is called escaping the |
| virtual machine. Isolation also includes resource limits such as throttling of |
| CPU, memory, disk, or network. Guests must be unable to exceed their resource |
| limits. |
| |
| QEMU presents an attack surface to the guest in the form of emulated devices. |
| The guest must not be able to gain control of QEMU. Bugs in emulated devices |
| could allow malicious guests to gain code execution in QEMU. At this point the |
| guest has escaped the virtual machine and is able to act in the context of the |
| QEMU process on the host. |
| |
| Guests often interact with other guests and share resources with them. A |
| malicious guest must not gain control of other guests or access their data. |
| Disk image files and network traffic must be protected from other guests unless |
| explicitly shared between them by the user. |
| |
| @subsection Principle of Least Privilege |
| |
| The principle of least privilege states that each component only has access to |
| the privileges necessary for its function. In the case of QEMU, this means that |
| each QEMU process only has access to resources belonging to its own guest. |
| |
| The QEMU process should not have access to any resources that are inaccessible |
| to the guest. This way the guest does not gain anything by escaping into the |
| QEMU process since it already has access to those same resources from within |
| the guest. |
| |
| Following the principle of least privilege immediately fulfills guest isolation |
| requirements. For example, guest A only has access to its own disk image file |
| @code{a.img} and not guest B's disk image file @code{b.img}. |
| |
| In reality, certain resources are inaccessible to the guest but must be |
| available to QEMU to perform its function. For example, host system calls are |
| necessary for QEMU but are not exposed to guests. A guest that escapes into |
| the QEMU process can then begin invoking host system calls. |
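| |
| The reasoning above can be illustrated with a toy model. This is only a sketch: |
| resources are represented as plain strings, and the function name |
| @code{excess_privileges} and the resource labels are hypothetical, not part of |
| QEMU. Whatever the QEMU process can reach beyond what its guest can already |
| reach is what an escaping guest would gain. |
| |
```python
def excess_privileges(qemu_resources, guest_resources):
    """Return the resources QEMU can reach that its guest cannot.

    An empty result means a guest gains nothing by escaping into QEMU.
    """
    return set(qemu_resources) - set(guest_resources)


# Resources guest A may already use from inside the virtual machine.
guest_a = {"a.img"}

# Ideal case: the QEMU process holds exactly what guest A already has.
print(excess_privileges({"a.img"}, guest_a))  # set()

# Realistic case: QEMU also needs host system calls to do its job,
# so an escaped guest would gain the ability to invoke them.
print(sorted(excess_privileges({"a.img", "host system calls"}, guest_a)))
# ['host system calls']
```
| |
| In this model, the isolation mechanisms of a deployment aim to keep the |
| returned set as small as possible. |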
| |
| New features must be designed to follow the principle of least privilege. |
| Should this not be possible for technical reasons, the security risk must be |
| clearly documented so users are aware of the trade-off of enabling the feature. |
| |
| @subsection Isolation Mechanisms |
| |
| Several isolation mechanisms are available to realize this architecture of |
| guest isolation and the principle of least privilege. With the exception of |
| Linux seccomp, these mechanisms are all deployed by management tools that |
| launch QEMU, such as libvirt. They are also platform-specific, so only the |
| Linux mechanisms are described briefly here. |
| |
| The fundamental isolation mechanism is that QEMU processes must run as |
| unprivileged users. Sometimes it seems more convenient to launch QEMU as |
| root to give it access to host devices (e.g. @code{/dev/net/tun}) but this poses a |
| huge security risk. File descriptor passing can be used to give an otherwise |
| unprivileged QEMU process access to host devices without running QEMU as root. |
| It is also possible to launch QEMU as a non-root user and configure UNIX groups |
| for access to @code{/dev/kvm}, @code{/dev/net/tun}, and other device nodes. |
| Some Linux distributions already ship with UNIX groups for these devices by default. |
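| |
| The following is a minimal sketch of file descriptor passing in general, not of |
| how any particular management tool hands descriptors to QEMU. It assumes |
| Python 3.9 or later on a UNIX host, where @code{socket.send_fds} and |
| @code{socket.recv_fds} are available. A temporary file stands in for a host |
| device node, and the helper names @code{send_fd}, @code{recv_fd}, and |
| @code{demo} are hypothetical. For brevity both ends of the socket live in one |
| process; in a real deployment the receiver would be a separate, unprivileged |
| process. |
| |
```python
import os
import socket
import tempfile


def send_fd(sock, fd):
    """Send one open file descriptor over a UNIX socket (SCM_RIGHTS)."""
    socket.send_fds(sock, [b"fd"], [fd])


def recv_fd(sock):
    """Receive one file descriptor sent with send_fd()."""
    _msg, fds, _flags, _addr = socket.recv_fds(sock, 16, 1)
    return fds[0]


def demo():
    # Stand-in for a host resource that only the launcher is allowed to open.
    with tempfile.NamedTemporaryFile() as host_resource:
        host_resource.write(b"host resource")
        host_resource.flush()

        launcher, worker = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
        with launcher, worker:
            # The launcher opens the resource and passes the descriptor on.
            fd = os.open(host_resource.name, os.O_RDONLY)
            send_fd(launcher, fd)
            os.close(fd)  # the receiver gets its own duplicate

            # The worker uses the descriptor without ever opening the path.
            received = recv_fd(worker)
            try:
                return os.read(received, 64)
            finally:
                os.close(received)


print(demo())  # b'host resource'
```
| |
| The receiving side never needs permission to open the path itself, which is |
| why this technique lets the QEMU process stay unprivileged. |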
| |
| @itemize |
| @item SELinux and AppArmor make it possible to confine processes beyond the |
| traditional UNIX process and file permissions model. They restrict the QEMU |
| process from accessing processes and files on the host system that are not |
| needed by QEMU. |
| |
| @item Resource limits and cgroup controllers provide throughput and utilization |
| limits on key resources such as CPU time, memory, and I/O bandwidth. |
| |
| @item Linux namespaces can be used to make process, file system, and other system |
| resources unavailable to QEMU. A namespaced QEMU process is restricted to only |
| those resources that were granted to it. |
| |
| @item Linux seccomp is available via the QEMU @option{--sandbox} option. It disables |
| system calls that are not needed by QEMU, thereby reducing the host kernel |
| attack surface. |
| @end itemize |
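| |
| As an illustration of the resource limits item above, the following sketch |
| applies POSIX resource limits to a child process before it starts. It is not how |
| libvirt or any other management tool is implemented, and cgroup controllers are |
| not shown. The helper names @code{lower_limit} and @code{run_with_limits} and |
| the limit values are assumptions chosen for the example, and a small Python |
| probe stands in for a QEMU command line. |
| |
```python
import resource
import subprocess
import sys


def lower_limit(limit, value):
    """Set both the soft and hard limit to value, never above the current hard limit."""
    _soft, hard = resource.getrlimit(limit)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)
    resource.setrlimit(limit, (value, value))


def run_with_limits(argv, cpu_seconds, max_open_files):
    """Run argv in a child process with CPU time and open file limits applied."""

    def apply_limits():
        # Runs in the child between fork() and exec(), so the parent
        # process keeps its own limits.
        lower_limit(resource.RLIMIT_CPU, cpu_seconds)
        lower_limit(resource.RLIMIT_NOFILE, max_open_files)

    return subprocess.run(
        argv,
        preexec_fn=apply_limits,
        capture_output=True,
        text=True,
        check=True,
    )


# Stand-in for a QEMU command line: the child reports the limit it inherited.
probe = [
    sys.executable,
    "-c",
    "import resource; print(resource.getrlimit(resource.RLIMIT_NOFILE)[0])",
]
result = run_with_limits(probe, cpu_seconds=60, max_open_files=64)
print(result.stdout.strip())
```
| |
| Because the hard limit is lowered along with the soft limit, an unprivileged |
| child cannot raise it again. |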