| QEMU virtio-fs shared file system daemon |
| ======================================== |
| |
| Synopsis |
| -------- |
| |
| **virtiofsd** [*OPTIONS*] |
| |
| Description |
| ----------- |
| |
| Share a host directory tree with a guest through a virtio-fs device. This |
| program is a vhost-user backend that implements the virtio-fs device. Each |
| virtio-fs device instance requires its own virtiofsd process. |
| |
| This program is designed to work with QEMU's ``--device vhost-user-fs-pci`` |
| but should work with any virtual machine monitor (VMM) that supports |
| vhost-user. See the Examples section below. |
| |
| This program must be run as the root user. The program drops privileges where |
| possible during startup although it must be able to create and access files |
| with any uid/gid: |
| |
| * The ability to invoke syscalls is limited using seccomp(2). |
| * Linux capabilities(7) are dropped. |
| |
| In "namespace" sandbox mode the program switches into a new file system |
| namespace and invokes pivot_root(2) to make the shared directory tree its root. |
| New pid and net namespaces are also created to isolate the process. |
| |
| In "chroot" sandbox mode the program invokes chroot(2) to make the shared |
| directory tree its root. This mode is intended for container environments where |
| the container runtime has already set up the namespaces and the program does |
| not have permission to create namespaces itself. |
| |
| Both sandbox modes prevent "file system escapes" due to symlinks and other file |
| system objects that might lead to files outside the shared directory. |
| |
| Options |
| ------- |
| |
| .. program:: virtiofsd |
| |
| .. option:: -h, --help |
| |
| Print help. |
| |
| .. option:: -V, --version |
| |
| Print version. |
| |
| .. option:: -d |
| |
| Enable debug output. |
| |
| .. option:: --syslog |
| |
| Print log messages to syslog instead of stderr. |
| |
| .. option:: -o OPTION |
| |
| * debug - |
| Enable debug output. |
| |
| * flock|no_flock - |
| Enable/disable flock. The default is ``no_flock``. |
| |
| * modcaps=CAPLIST - |
| Modify the list of capabilities allowed; CAPLIST is a colon-separated |
| list of capabilities, each preceded by either + or -, e.g. |
| ``+sys_admin:-chown``. |
| |
| * log_level=LEVEL - |
| Print only log messages matching LEVEL or more severe. LEVEL is one of |
| ``err``, ``warn``, ``info``, or ``debug``. The default is ``info``. |
| |
| * posix_lock|no_posix_lock - |
| Enable/disable remote POSIX locks. The default is ``no_posix_lock``. |
| |
| * readdirplus|no_readdirplus - |
| Enable/disable readdirplus. The default is ``readdirplus``. |
| |
| * sandbox=namespace|chroot - |
| Sandbox mode: |
| - namespace: Create mount, pid, and net namespaces and pivot_root(2) into |
| the shared directory. |
| - chroot: chroot(2) into shared directory (use in containers). |
| The default is ``namespace``. |
| |
| * source=PATH - |
| Share host directory tree located at PATH. This option is required. |
| |
| * timeout=TIMEOUT - |
| I/O timeout in seconds. The default depends on the selected cache mode |
| (see the ``--cache`` option). |
| |
| * writeback|no_writeback - |
| Enable/disable writeback cache. The cache allows the FUSE client to buffer |
| and merge write requests. The default is ``no_writeback``. |
| |
| * xattr|no_xattr - |
| Enable/disable extended attributes (xattr) on files and directories. The |
| default is ``no_xattr``. |
| |
| * posix_acl|no_posix_acl - |
| Enable/disable POSIX ACL support. POSIX ACLs are disabled by default. |
| |
| * security_label|no_security_label - |
| Enable/disable security label support. Security labels are disabled by |
| default. When enabled, the client can send a MAC label for a file during |
| file creation. Typically this is expected to be an SELinux security |
| label. The server will try to set that label on the newly created file |
| atomically wherever possible. |
| |
| .. option:: --socket-path=PATH |
| |
| Listen on vhost-user UNIX domain socket at PATH. |
| |
| .. option:: --socket-group=GROUP |
| |
| Set the vhost-user UNIX domain socket gid to GROUP. |
| |
| .. option:: --fd=FDNUM |
| |
| Accept connections from vhost-user UNIX domain socket file descriptor FDNUM. |
| The file descriptor must already be listening for connections. |
| |
| .. option:: --thread-pool-size=NUM |
| |
| Restrict the number of worker threads per request queue to NUM. The default |
| is 64. |
| |
| .. option:: --cache=none|auto|always |
| |
| Select the desired trade-off between coherency and performance. ``none`` |
| forbids the FUSE client from caching to achieve best coherency at the cost of |
| performance. ``auto`` acts similarly to NFS with a 1 second metadata cache |
| timeout. ``always`` sets a long cache lifetime at the expense of coherency. |
| The default is ``auto``. |
| |
| Extended attribute (xattr) mapping |
| ---------------------------------- |
| |
| By default the names of xattrs used by the client are passed through to the server |
| file system. This can be a problem where either those xattr names are used |
| by something on the server (e.g. SELinux client/server confusion) or if |
| ``virtiofsd`` is running in a container with restricted privileges where it |
| cannot access some attributes. |
| |
| Mapping syntax |
| ~~~~~~~~~~~~~~ |
| |
| A mapping of xattr names can be made using ``-o xattrmap=mapping`` where the |
| ``mapping`` string consists of a series of rules. |
| |
| The first matching rule terminates the mapping. |
| The set of rules must include a terminating rule to match any remaining attributes |
| at the end. |
| |
| Each rule consists of a number of fields separated with a separator that is the |
| first non-white space character in the rule. This separator must then be used |
| for the whole rule. |
| White space may be added before and after each rule. |
| |
| Using ':' as the separator a rule is of the form: |
| |
| ``:type:scope:key:prepend:`` |
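The separator convention described above can be illustrated with a small Python sketch. It is not the daemon's parser: the names ``Rule`` and ``parse_xattrmap`` are hypothetical, and it only handles the four-field rule form shown above.

```python
from typing import List, NamedTuple


class Rule(NamedTuple):
    """One four-field rule; the class name is hypothetical."""
    type: str
    scope: str
    key: str
    prepend: str


def parse_xattrmap(mapping: str) -> List[Rule]:
    """Split a mapping string into four-field rules (illustrative only)."""
    rules = []
    pos = 0
    while True:
        # White space may appear before and after each rule.
        while pos < len(mapping) and mapping[pos].isspace():
            pos += 1
        if pos == len(mapping):
            return rules
        # The first non-white-space character is this rule's separator.
        sep = mapping[pos]
        pos += 1
        fields = []
        for _ in range(4):
            end = mapping.find(sep, pos)
            if end == -1:
                raise ValueError("rule is missing a closing separator")
            fields.append(mapping[pos:end])
            pos = end + 1
        rules.append(Rule(*fields))
```

Because the separator is re-read at the start of every rule, a mapping may mix separators between rules as long as each rule is internally consistent.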
| |
| **scope** is: |
| |
| - 'client' - match 'key' against an xattr name from the client for |
| setxattr/getxattr/removexattr |
| - 'server' - match 'prepend' against an xattr name from the server |
| for listxattr |
| - 'all' - can be used to make a single rule where both the server |
| and client matches are triggered. |
| |
| **type** is one of: |
| |
| - 'prefix' - is designed to prepend and strip a prefix; the modified |
| attributes then being passed on to the client/server. |
| |
| - 'ok' - Causes the rule set to be terminated when a match is found |
| while allowing matching xattrs through unchanged. |
| It is intended both as a way of explicitly terminating |
| the list of rules, and to allow some xattrs to skip following rules. |
| |
| - 'bad' - If a client tries to use a name matching 'key' it is |
| denied using EPERM; when the server passes an attribute |
| name matching 'prepend' it is hidden. In many ways its use is very like |
| 'ok' as either an explicit terminator or for special handling of certain |
| patterns. |
| |
| - 'unsupported' - If a client tries to use a name matching 'key' it is |
| denied using ENOTSUP; when the server passes an attribute |
| name matching 'prepend' it is hidden. In many ways its use is very like |
| 'ok' as either an explicit terminator or for special handling of certain |
| patterns. |
| |
| **key** is a string tested as a prefix on an attribute name originating |
| on the client. It may be empty, in which case a 'client' rule |
| will always match on client names. |
| |
| **prepend** is a string tested as a prefix on an attribute name originating |
| on the server, and used as a new prefix. It may be empty |
| in which case a 'server' rule will always match on all names from |
| the server. |
| |
| e.g.: |
| |
| ``:prefix:client:trusted.:user.virtiofs.:`` |
| |
| will match 'trusted.' attributes in client calls and prefix them before |
| passing them to the server. |
| |
| ``:prefix:server::user.virtiofs.:`` |
| |
| will strip 'user.virtiofs.' from all server replies. |
| |
| ``:prefix:all:trusted.:user.virtiofs.:`` |
| |
| combines the previous two cases into a single rule. |
| |
| ``:ok:client:user.::`` |
| |
| will allow get/set xattr for 'user.' xattrs and ignore |
| following rules. |
| |
| ``:ok:server::security.:`` |
| |
| will pass 'security.' xattrs in listxattr from the server |
| and ignore following rules. |
| |
| ``:ok:all:::`` |
| |
| will terminate the rule search passing any remaining attributes |
| in both directions. |
| |
| ``:bad:server::security.:`` |
| |
| would hide 'security.' xattrs in listxattr from the server. |
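The client-side behaviour of these rule types can be sketched in Python as follows. This is a simplified model and not the daemon's code: ``map_client_name`` is a hypothetical name, rules are plain ``(type, scope, key, prepend)`` tuples, and raising ``LookupError`` when no rule matches is an assumption of the sketch (the text above requires a terminating rule).

```python
import errno
from typing import List, Tuple

Rule = Tuple[str, str, str, str]  # (type, scope, key, prepend)


def map_client_name(rules: List[Rule], name: str) -> str:
    """Return the name passed to the server for a client xattr name."""
    for rtype, scope, key, prepend in rules:
        # 'server' rules do not apply to names coming from the client.
        if scope not in ("client", "all"):
            continue
        # 'key' is tested as a prefix; an empty key matches every name.
        if not name.startswith(key):
            continue
        # The first matching rule terminates the search.
        if rtype == "prefix":
            return prepend + name
        if rtype == "ok":
            return name
        if rtype == "bad":
            raise OSError(errno.EPERM, "xattr name denied", name)
        if rtype == "unsupported":
            raise OSError(errno.ENOTSUP, "xattr name not supported", name)
        raise ValueError("unknown rule type: " + rtype)
    # Assumption of this sketch: a rule set without a terminating rule is an error.
    raise LookupError("no rule matched " + name)
```

With the rules ``:prefix:client:trusted.:user.virtiofs.:`` followed by ``:ok:client:user.::``, the name ``trusted.x`` is rewritten to ``user.virtiofs.trusted.x`` and ``user.y`` passes through unchanged.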
| |
| A simpler 'map' type provides a shorter syntax for the common case: |
| |
| ``:map:key:prepend:`` |
| |
| The 'map' type adds a number of separate rules to add **prepend** as a prefix |
| to the matched **key** (or all attributes if **key** is empty). |
| There may be at most one 'map' rule and it must be the last rule in the set. |
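As a rough illustration, the expansion of a 'map' rule can be sketched in Python using the equivalences that this document states under "Mapping examples". The function names ``expand_map`` and ``to_mapping`` are hypothetical, and the daemon's actual expansion may differ in detail.

```python
from typing import List, Tuple

Rule = Tuple[str, str, str, str]  # (type, scope, key, prepend)


def expand_map(key: str, prepend: str) -> List[Rule]:
    """Expand a 'map' rule into explicit rules (sketch based on the examples)."""
    if key == "":
        # Prefix everything, then hide anything on the host that lacks the prefix.
        return [
            ("prefix", "all", "", prepend),
            ("bad", "all", "", ""),
        ]
    return [
        ("prefix", "all", key, prepend),  # remap the selected names
        ("bad", "server", "", key),       # hide unprefixed host names under key
        ("bad", "client", prepend, ""),   # block direct access to the target
        ("ok", "all", "", ""),            # let everything else through
    ]


def to_mapping(rules: List[Rule], sep: str = ":") -> str:
    """Render rules back into an xattrmap string using one separator."""
    return "".join(sep + sep.join(rule) + sep for rule in rules)
```

For instance, ``to_mapping(expand_map("", "user.virtiofs."))`` yields ``:prefix:all::user.virtiofs.::bad:all:::``.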
| |
| Note: When the 'security.capability' xattr is remapped, the daemon has to do |
| extra work to remove it during many operations, which the host kernel normally |
| does itself. |
| |
| Security considerations |
| ~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| Operating systems typically partition the xattr namespace using |
| well defined name prefixes. Each partition may have different |
| access controls applied. For example, on Linux there are multiple |
| partitions: |
| |
| * ``system.*`` - access varies depending on attribute & filesystem |
| * ``security.*`` - only processes with CAP_SYS_ADMIN |
| * ``trusted.*`` - only processes with CAP_SYS_ADMIN |
| * ``user.*`` - any process granted by file permissions / ownership |
| |
| Other operating systems, such as FreeBSD, have different name prefixes |
| and access control rules. |
| |
| When remapping attributes on the host, it is important to |
| ensure that the remapping does not allow a guest user to |
| evade the guest access control rules. |
| |
| Consider if ``trusted.*`` from the guest was remapped to |
| ``user.virtiofs.trusted.*`` in the host. An unprivileged |
| user in a Linux guest has the ability to write to xattrs |
| under ``user.*``. Thus the user can evade the access |
| control restriction on ``trusted.*`` by instead writing |
| to ``user.virtiofs.trusted.*``. |
| |
| As noted above, the partitions used and the access controls |
| applied will vary across guest operating systems, so it is not wise to |
| try to predict what the guest OS will use. |
| |
| The simplest way to avoid an insecure configuration is |
| to remap all xattrs at once, to a given fixed prefix. |
| This is shown in example (1) below. |
| |
| If selectively mapping only a subset of xattr prefixes, |
| then rules must be added to explicitly block direct |
| access to the target of the remapping. This is shown |
| in example (2) below. |
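The bypass described above can be made concrete with a small Python model of client-side name mapping. It is only a sketch: ``host_name`` and the two rule lists are hypothetical, and returning ``None`` stands in for a denied request.

```python
from typing import List, Optional, Tuple

Rule = Tuple[str, str, str, str]  # (type, scope, key, prepend)


def host_name(rules: List[Rule], name: str) -> Optional[str]:
    """Return the host-side name for a client xattr name, or None if denied."""
    for rtype, scope, key, prepend in rules:
        if scope in ("client", "all") and name.startswith(key):
            if rtype == "prefix":
                return prepend + name
            if rtype == "ok":
                return name
            return None  # 'bad' and 'unsupported' both deny the request
    return None  # assumption: names matched by no rule are denied


# Selective remapping without a blocking rule (insecure).
INSECURE = [
    ("prefix", "all", "trusted.", "user.virtiofs."),
    ("ok", "all", "", ""),
]

# The same remapping with the target of the remap blocked explicitly.
SAFER = [
    ("prefix", "all", "trusted.", "user.virtiofs."),
    ("bad", "server", "", "trusted."),
    ("bad", "client", "user.virtiofs.", ""),
    ("ok", "all", "", ""),
]

# Under INSECURE, a privileged write to trusted.secret and an unprivileged
# write to user.virtiofs.trusted.secret land on the same host attribute.
collision = (
    host_name(INSECURE, "trusted.secret"),
    host_name(INSECURE, "user.virtiofs.trusted.secret"),
)
```

Under ``SAFER`` the direct write to ``user.virtiofs.trusted.secret`` is denied, so only the guest's ``trusted.*`` path reaches the remapped attribute.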
| |
| Mapping examples |
| ~~~~~~~~~~~~~~~~ |
| |
| 1) Prefix all attributes with 'user.virtiofs.' |
| |
| :: |
| |
| -o xattrmap=":prefix:all::user.virtiofs.::bad:all:::" |
| |
| |
| This uses two rules, using : as the field separator; |
| the first rule prefixes and strips 'user.virtiofs.', |
| the second rule hides any non-prefixed attributes that |
| the host set. |
| |
| This is equivalent to the 'map' rule: |
| |
| :: |
| |
| -o xattrmap=":map::user.virtiofs.:" |
| |
| 2) Prefix 'trusted.' attributes, allow others through |
| |
| :: |
| |
| "/prefix/all/trusted./user.virtiofs./ |
| /bad/server//trusted./ |
| /bad/client/user.virtiofs.// |
| /ok/all///" |
| |
| |
| Here there are four rules, using / as the field |
| separator, and also demonstrating that new lines can |
| be included between rules. |
| The first rule is the prefixing of 'trusted.' and |
| stripping of 'user.virtiofs.'. |
| The second rule hides unprefixed 'trusted.' attributes |
| on the host. |
| The third rule stops a guest from explicitly setting |
| the 'user.virtiofs.' path directly to prevent access |
| control bypass on the target of the earlier prefix |
| remapping. |
| Finally, the fourth rule lets all remaining attributes |
| through. |
| |
| This is equivalent to the 'map' rule: |
| |
| :: |
| |
| -o xattrmap="/map/trusted./user.virtiofs./" |
| |
| 3) Hide 'security.' attributes, and allow everything else |
| |
| :: |
| |
| "/bad/all/security./security./ |
| /ok/all///" |
| |
| The first rule combines what could be separate client and server |
| rules into a single 'all' rule, matching 'security.' in either |
| client arguments or lists returned from the host. This stops |
| the client seeing any 'security.' attributes on the server and |
| stops it setting any. |
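The server-side half of the examples above (what the client sees from listxattr) can be modelled with the following Python sketch. ``map_server_names`` is a hypothetical name, and silently dropping names that no rule matches is an assumption of the sketch, not documented behaviour.

```python
from typing import List, Tuple

Rule = Tuple[str, str, str, str]  # (type, scope, key, prepend)


def map_server_names(rules: List[Rule], names: List[str]) -> List[str]:
    """Return the xattr names a client would see for a host listxattr result."""
    visible = []
    for name in names:
        for rtype, scope, _key, prepend in rules:
            # 'client' rules do not apply to names coming from the server.
            if scope not in ("server", "all"):
                continue
            # 'prepend' is tested as a prefix; an empty value matches all names.
            if not name.startswith(prepend):
                continue
            if rtype == "prefix":
                visible.append(name[len(prepend):])  # strip the prefix
            elif rtype == "ok":
                visible.append(name)  # pass through unchanged
            # 'bad' and 'unsupported' hide the name from the client.
            break  # the first matching rule terminates the search
    return visible


# Example (3): hide 'security.' attributes and allow everything else.
EXAMPLE_3 = [
    ("bad", "all", "security.", "security."),
    ("ok", "all", "", ""),
]
```

With ``EXAMPLE_3``, a host list of ``security.selinux`` and ``user.comment`` is reduced to ``user.comment`` alone.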
| |
| SELinux support |
| --------------- |
| One can enable support for SELinux by running virtiofsd with the option |
| "-o security_label". However, this will try to save the guest's security |
| context in the xattr security.selinux on the host, and it might fail if the |
| host's SELinux policy does not permit virtiofsd to do this operation. |
| |
| Hence, it is preferred to remap the guest's "security.selinux" xattr to, say, |
| "trusted.virtiofs.security.selinux" on the host. |
| |
| "-o xattrmap=:map:security.selinux:trusted.virtiofs.:" |
| |
| This will make sure that the guest's and host's SELinux xattrs on the same |
| file remain separate and do not interfere with each other. It also allows |
| both host and guest to implement their own separate SELinux policies. |
| |
| Setting a trusted xattr on the host requires CAP_SYS_ADMIN, so one will need |
| to add this capability to the daemon. |
| |
| "-o modcaps=+sys_admin" |
| |
| Giving CAP_SYS_ADMIN increases the risk to the system. virtiofsd becomes more |
| powerful and, if it gets compromised, it can do a lot of damage to the host |
| system. Keep this trade-off in mind while making a decision. |
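For launch scripts, the options from this section can be collected into one argument list. The Python sketch below only assembles the arguments and does not start the daemon. The helper name ``virtiofsd_argv`` is hypothetical, the socket and source paths are borrowed from the Examples section, and passing several separate ``-o`` options on one command line is an assumption of the sketch.

```python
import shlex
from typing import List, Sequence


def virtiofsd_argv(socket_path: str, source: str,
                   options: Sequence[str] = ()) -> List[str]:
    """Build a virtiofsd argument list from options named in this document."""
    argv = ["virtiofsd", "--socket-path=" + socket_path, "-o", "source=" + source]
    for opt in options:
        # Assumption: each extra option is passed as its own "-o OPTION" pair.
        argv += ["-o", opt]
    return argv


argv = virtiofsd_argv(
    "/var/run/vm001-vhost-fs.sock",
    "/var/lib/fs/vm001",
    options=[
        "security_label",
        "xattrmap=:map:security.selinux:trusted.virtiofs.:",
        "modcaps=+sys_admin",
    ],
)
# shlex.join quotes each argument for display in a shell-compatible form.
command_line = shlex.join(argv)
```

Keeping the arguments as a list avoids shell quoting problems with the separator characters used in xattrmap rules.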
| |
| Examples |
| -------- |
| |
| Export ``/var/lib/fs/vm001/`` on vhost-user UNIX domain socket |
| ``/var/run/vm001-vhost-fs.sock``: |
| |
| .. parsed-literal:: |
| |
| host# virtiofsd --socket-path=/var/run/vm001-vhost-fs.sock -o source=/var/lib/fs/vm001 |
| host# |qemu_system| \\ |
| -chardev socket,id=char0,path=/var/run/vm001-vhost-fs.sock \\ |
| -device vhost-user-fs-pci,chardev=char0,tag=myfs \\ |
| -object memory-backend-memfd,id=mem,size=4G,share=on \\ |
| -numa node,memdev=mem \\ |
| ... |
| guest# mount -t virtiofs myfs /mnt |