Diffstat (limited to 'docs/system/devices')
 docs/system/devices/can.rst            | 189
 docs/system/devices/canokey.rst        | 158
 docs/system/devices/ccid.rst           | 171
 docs/system/devices/cxl.rst            | 386
 docs/system/devices/ivshmem.rst        |  64
 docs/system/devices/net.rst            | 100
 docs/system/devices/nvme.rst           | 323
 docs/system/devices/usb.rst            | 408
 docs/system/devices/vhost-user-rng.rst |  39
 docs/system/devices/vhost-user.rst     |  59
 docs/system/devices/virtio-pmem.rst    |  76
11 files changed, 1973 insertions, 0 deletions
diff --git a/docs/system/devices/can.rst b/docs/system/devices/can.rst
new file mode 100644
index 00000000..0af3d991
--- /dev/null
+++ b/docs/system/devices/can.rst
@@ -0,0 +1,189 @@
CAN Bus Emulation Support
=========================
The CAN bus emulation provides a mechanism to connect multiple emulated CAN controller chips together by one or multiple CAN buses (the controller device "canbus" parameter). The individual buses can be connected to the host system CAN API (at this time only Linux SocketCAN is supported).

The concept of buses is generic and different CAN controllers can be implemented.

The initial submission implemented the SJA1000 controller, which is common and well supported by drivers for most operating systems.

The PCI addon card hardware has been selected as the first CAN interface to implement because such a device can be easily connected to systems with different CPU architectures (x86, PowerPC, Arm, etc.).

In 2020, the CTU CAN FD controller model was added as part of the bachelor thesis of Jan Charvat. This controller is a complete open-source design/hardware solution. The core designer of the project is Ondrej Ille; the financial support has been provided by CTU and other companies, including Volkswagen subsidiaries.

The project was initially started within an RTEMS GSoC 2013 slot by Jin Yang under our mentoring. The initial idea was to provide a generic CAN subsystem for RTEMS. However, the lack of a common environment for code and RTEMS testing led to a change of goal: to provide a complete emulated environment for testing, and the RTEMS GSoC slot was donated to work on CAN hardware emulation in QEMU.

Examples of how to use CAN emulation for SJA1000 based boards
--------------------------------------------------------------
When QEMU is compiled with CAN PCI support, one of the following CAN boards can be selected:

(1) CAN bus Kvaser PCI CAN-S (single SJA1000 channel) board. QEMU startup options::

      -object can-bus,id=canbus0
      -device kvaser_pci,canbus=canbus0

Add a "can-host-socketcan" object to connect the device to the host system CAN bus::

      -object can-host-socketcan,id=canhost0,if=can0,canbus=canbus0

(2) CAN bus PCM-3680I PCI (dual SJA1000 channel) emulation::

      -object can-bus,id=canbus0
      -device pcm3680_pci,canbus0=canbus0,canbus1=canbus0

Another example::

      -object can-bus,id=canbus0
      -object can-bus,id=canbus1
      -device pcm3680_pci,canbus0=canbus0,canbus1=canbus1

(3) CAN bus MIOe-3680 PCI (dual SJA1000 channel) emulation::

      -device mioe3680_pci,canbus0=canbus0

The ``kvaser_pci`` board/device model is compatible with, and has been tested with, the ``kvaser_pci`` driver included in the mainline Linux kernel. The tested setup was a Linux 4.9 kernel on both the host and guest side.
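Inside such a guest, the emulated board is handled like any other SocketCAN adapter. A minimal sketch, assuming a Linux guest with the mainline ``kvaser_pci`` driver and that the board shows up as ``can0`` (the interface name may differ)::

      modprobe kvaser_pci
      ip link set can0 type can bitrate 1000000
      ip link set can0 up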

Example for qemu-system-x86_64::

      qemu-system-x86_64 -accel kvm -kernel /boot/vmlinuz-4.9.0-4-amd64 \
        -initrd ramdisk.cpio \
        -virtfs local,path=shareddir,security_model=none,mount_tag=shareddir \
        -object can-bus,id=canbus0 \
        -object can-host-socketcan,id=canhost0,if=can0,canbus=canbus0 \
        -device kvaser_pci,canbus=canbus0 \
        -nographic -append "console=ttyS0"

Example for qemu-system-arm::

      qemu-system-arm -cpu arm1176 -m 256 -M versatilepb \
        -kernel kernel-qemu-arm1176-versatilepb \
        -hda rpi-wheezy-overlay \
        -append "console=ttyAMA0 root=/dev/sda2 ro init=/sbin/init-overlay" \
        -nographic \
        -virtfs local,path=shareddir,security_model=none,mount_tag=shareddir \
        -object can-bus,id=canbus0 \
        -object can-host-socketcan,id=canhost0,if=can0,canbus=canbus0 \
        -device kvaser_pci,canbus=canbus0,host=can0

The CAN interface of the host system has to be configured for the proper bitrate and set up. The configuration is not propagated from the emulated devices through the bus to the physical host device. Example configuration for 1 Mbit/s::

      ip link set can0 type can bitrate 1000000
      ip link set can0 up

A virtual (host local only) CAN interface can be used on the host side instead of a physical interface::

      ip link add dev can0 type vcan

The CAN interface on the host side can be used to analyze CAN traffic with the "candump" command, which is included in "can-utils"::

      candump can0

CTU CAN FD support examples
---------------------------
This open-source core provides CAN FD support. CAN FD frames are delivered to the host system as well when the SocketCAN interface is found to be CAN FD capable.

The PCIe board emulation is provided for now (the device identifier is ctucan_pci). The default build defines two CTU CAN FD cores on the board.

The example below connects the canbus0-bus (virtual wire) to the host Linux system (using SocketCAN) and to both CTU CAN FD cores emulated on the corresponding PCI card. It expects that the host system CAN bus is set up according to the previous SJA1000 section::

      qemu-system-x86_64 -enable-kvm -kernel /boot/vmlinuz-4.19.52+ \
        -initrd ramdisk.cpio \
        -virtfs local,path=shareddir,security_model=none,mount_tag=shareddir \
        -vga cirrus \
        -append "console=ttyS0" \
        -object can-bus,id=canbus0-bus \
        -object can-host-socketcan,if=can0,canbus=canbus0-bus,id=canbus0-socketcan \
        -device ctucan_pci,canbus0=canbus0-bus,canbus1=canbus0-bus \
        -nographic

Setup of the CTU CAN FD controller in a guest Linux system::

      insmod ctucanfd.ko || modprobe ctucanfd
      insmod ctucanfd_pci.ko || modprobe ctucanfd_pci

      for ifc in /sys/class/net/can* ; do
         if [ -e $ifc/device/vendor ] ; then
           if ! grep -q 0x1760 $ifc/device/vendor ; then
             continue;
           fi
         else
           continue;
         fi
         if [ -e $ifc/device/device ] ; then
           if ! grep -q 0xff00 $ifc/device/device ; then
             continue;
           fi
         else
           continue;
         fi
         ifc=$(basename $ifc)
         /bin/ip link set $ifc type can bitrate 1000000 dbitrate 10000000 fd on
         /bin/ip link set $ifc up
      done

The test can be run, for example, with::

      candump can1

in the guest system and the following commands in the host system; for basic CAN::

      cangen can0

for CAN FD without bitrate switch::

      cangen can0 -f

and with bitrate switch::

      cangen can0 -b

The test can also be run the other way around, generating messages in the guest system and capturing them in the host system. Other combinations are also possible.
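A single, hand-crafted CAN FD frame can also be sent from the host to verify the FD data path. This is only an illustrative sketch using the frame syntax of the can-utils ``cansend`` tool (``##1`` selects a CAN FD frame with bit-rate switching; the identifier and data bytes are arbitrary)::

      cansend can0 123##1DEADBEEF001122334455AABB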

Links to other resources
------------------------

 (1) `CAN related projects at Czech Technical University, Faculty of Electrical Engineering <http://canbus.pages.fel.cvut.cz>`_
 (2) `Repository with development can-pci branch at Czech Technical University <https://gitlab.fel.cvut.cz/canbus/qemu-canbus>`_
 (3) `RTEMS page describing project <https://devel.rtems.org/wiki/Developer/Simulators/QEMU/CANEmulation>`_
 (4) `RTLWS 2015 article about the project and its use with CANopen emulation <http://cmp.felk.cvut.cz/~pisa/can/doc/rtlws-17-pisa-qemu-can.pdf>`_
 (5) `GNU/Linux, CAN and CANopen in Real-time Control Applications Slides from LinuxDays 2017 (include updated RTLWS 2015 content) <https://www.linuxdays.cz/2017/video/Pavel_Pisa-CAN_canopen.pdf>`_
 (6) `Linux SocketCAN utilities <https://github.com/linux-can/can-utils>`_
 (7) `CTU CAN FD project including core VHDL design, Linux driver, test utilities etc. <https://gitlab.fel.cvut.cz/canbus/ctucanfd_ip_core>`_
 (8) `CTU CAN FD Core Datasheet Documentation <http://canbus.pages.fel.cvut.cz/ctucanfd_ip_core/doc/Datasheet.pdf>`_
 (9) `CTU CAN FD Core System Architecture Documentation <http://canbus.pages.fel.cvut.cz/ctucanfd_ip_core/doc/System_Architecture.pdf>`_
 (10) `CTU CAN FD Driver Documentation <https://canbus.pages.fel.cvut.cz/ctucanfd_ip_core/doc/linux_driver/build/ctucanfd-driver.html>`_
 (11) `Integration with PCIe interfacing for Intel/Altera Cyclone IV based board <https://gitlab.fel.cvut.cz/canbus/pcie-ctu_can_fd>`_
diff --git a/docs/system/devices/canokey.rst b/docs/system/devices/canokey.rst
new file mode 100644
index 00000000..cfa6186e
--- /dev/null
+++ b/docs/system/devices/canokey.rst
@@ -0,0 +1,158 @@
.. _canokey:

CanoKey QEMU
------------

CanoKey [1]_ is an open-source secure key with support for

* U2F / FIDO2 with Ed25519 and HMAC-secret
* OpenPGP Card V3.4 with RSA4096, Ed25519 and more [2]_
* PIV (NIST SP 800-73-4)
* HOTP / TOTP
* NDEF

All these platform-independent features are in canokey-core [3]_.

For different platforms, CanoKey has different implementations, including both hardware implementations and virtual cards:

* CanoKey STM32 [4]_
* CanoKey Pigeon [5]_
* (virt-card) CanoKey USB/IP
* (virt-card) CanoKey FunctionFS

In QEMU, yet another CanoKey virt-card is implemented. CanoKey QEMU exposes itself as a USB device to the guest OS.

With the same software configuration as a hardware key, the guest OS can use all the functionalities of a secure key as if there were actually a hardware key plugged in.

CanoKey QEMU provides much convenience for debugging:

* libcanokey-qemu supports debugging output thus developers can
  inspect what happens inside a secure key
* CanoKey QEMU supports trace events thus what happens inside
  ``canokey.c`` can be followed
* QEMU USB stack supports pcap thus USB packets between the guest
  and key can be captured and analysed

For developers this means:

* Developers working on software with secure key support (e.g. FIDO2, OpenPGP)
  can see what happens inside the secure key
* Secure key developers can easily capture and analyse USB packets
  between the guest OS and CanoKey

Also, since this is a virtual card, it can easily be used in CI for testing code that works with secure keys.

Building
========

libcanokey-qemu is required to use CanoKey QEMU.

..
code-block:: shell + + git clone https://github.com/canokeys/canokey-qemu + mkdir canokey-qemu/build + pushd canokey-qemu/build + +If you want to install libcanokey-qemu in a different place, +add ``-DCMAKE_INSTALL_PREFIX=/path/to/your/place`` to cmake below. + +.. code-block:: shell + + cmake .. + make + make install # may need sudo + popd + +Then configuring and building: + +.. code-block:: shell + + # depending on your env, lib/pkgconfig can be lib64/pkgconfig + export PKG_CONFIG_PATH=/path/to/your/place/lib/pkgconfig:$PKG_CONFIG_PATH + ./configure --enable-canokey && make + +Using CanoKey QEMU +================== + +CanoKey QEMU stores all its data on a file of the host specified by the argument +when invoking qemu. + +.. parsed-literal:: + + |qemu_system| -usb -device canokey,file=$HOME/.canokey-file + +Note: you should keep this file carefully as it may contain your private key! + +The first time when the file is used, it is created and initialized by CanoKey, +afterwards CanoKey QEMU would just read this file. + +After the guest OS boots, you can check that there is a USB device. + +For example, If the guest OS is an Linux machine. You may invoke lsusb +and find CanoKey QEMU there: + +.. code-block:: shell + + $ lsusb + Bus 001 Device 002: ID 20a0:42d4 Clay Logic CanoKey QEMU + +You may setup the key as guided in [6]_. The console for the key is at [7]_. + +Debugging +========= + +CanoKey QEMU consists of two parts, ``libcanokey-qemu.so`` and ``canokey.c``, +the latter of which resides in QEMU. The former provides core functionality +of a secure key while the latter provides platform-dependent functions: +USB packet handling. + +If you want to trace what happens inside the secure key, when compiling +libcanokey-qemu, you should add ``-DQEMU_DEBUG_OUTPUT=ON`` in cmake command +line: + +.. code-block:: shell + + cmake .. -DQEMU_DEBUG_OUTPUT=ON + +If you want to trace events happened in canokey.c, use + +.. parsed-literal:: + + |qemu_system| --trace "canokey_*" \\ + -usb -device canokey,file=$HOME/.canokey-file + +If you want to capture USB packets between the guest and the host, you can: + +.. parsed-literal:: + + |qemu_system| -usb -device canokey,file=$HOME/.canokey-file,pcap=key.pcap + +Limitations +=========== + +Currently libcanokey-qemu.so has dozens of global variables as it was originally +designed for embedded systems. Thus one qemu instance can not have +multiple CanoKey QEMU running, namely you can not + +.. parsed-literal:: + + |qemu_system| -usb -device canokey,file=$HOME/.canokey-file \\ + -device canokey,file=$HOME/.canokey-file2 + +Also, there is no lock on canokey-file, thus two CanoKey QEMU instance +can not read one canokey-file at the same time. + +References +========== + +.. [1] `<https://canokeys.org>`_ +.. [2] `<https://docs.canokeys.org/userguide/openpgp/#supported-algorithm>`_ +.. [3] `<https://github.com/canokeys/canokey-core>`_ +.. [4] `<https://github.com/canokeys/canokey-stm32>`_ +.. [5] `<https://github.com/canokeys/canokey-pigeon>`_ +.. [6] `<https://docs.canokeys.org/>`_ +.. [7] `<https://console.canokeys.org/>`_ diff --git a/docs/system/devices/ccid.rst b/docs/system/devices/ccid.rst new file mode 100644 index 00000000..3b8c2ab4 --- /dev/null +++ b/docs/system/devices/ccid.rst @@ -0,0 +1,171 @@ +Chip Card Interface Device (CCID) +================================= + +USB CCID device +--------------- +The USB CCID device is a USB device implementing the CCID specification, which +lets one connect smart card readers that implement the same spec. 
For more +information see the specification:: + + Universal Serial Bus + Device Class: Smart Card + CCID + Specification for + Integrated Circuit(s) Cards Interface Devices + Revision 1.1 + April 22rd, 2005 + +Smartcards are used for authentication, single sign on, decryption in +public/private schemes and digital signatures. A smartcard reader on the client +cannot be used on a guest with simple usb passthrough since it will then not be +available on the client, possibly locking the computer when it is "removed". On +the other hand this device can let you use the smartcard on both the client and +the guest machine. It is also possible to have a completely virtual smart card +reader and smart card (i.e. not backed by a physical device) using this device. + +Building +-------- +The cryptographic functions and access to the physical card is done via the +libcacard library, whose development package must be installed prior to +building QEMU: + +In redhat/fedora:: + + yum install libcacard-devel + +In ubuntu:: + + apt-get install libcacard-dev + +Configuring and building:: + + ./configure --enable-smartcard && make + +Using ccid-card-emulated with hardware +-------------------------------------- +Assuming you have a working smartcard on the host with the current +user, using libcacard, QEMU acts as another client using ccid-card-emulated:: + + qemu -usb -device usb-ccid -device ccid-card-emulated + +Using ccid-card-emulated with certificates stored in files +---------------------------------------------------------- +You must create the CA and card certificates. This is a one time process. +We use NSS certificates:: + + mkdir fake-smartcard + cd fake-smartcard + certutil -N -d sql:$PWD + certutil -S -d sql:$PWD -s "CN=Fake Smart Card CA" -x -t TC,TC,TC -n fake-smartcard-ca + certutil -S -d sql:$PWD -t ,, -s "CN=John Doe" -n id-cert -c fake-smartcard-ca + certutil -S -d sql:$PWD -t ,, -s "CN=John Doe (signing)" --nsCertType smime -n signing-cert -c fake-smartcard-ca + certutil -S -d sql:$PWD -t ,, -s "CN=John Doe (encryption)" --nsCertType sslClient -n encryption-cert -c fake-smartcard-ca + +Note: you must have exactly three certificates. + +You can use the emulated card type with the certificates backend:: + + qemu -usb -device usb-ccid -device ccid-card-emulated,backend=certificates,db=sql:$PWD,cert1=id-cert,cert2=signing-cert,cert3=encryption-cert + +To use the certificates in the guest, export the CA certificate:: + + certutil -L -r -d sql:$PWD -o fake-smartcard-ca.cer -n fake-smartcard-ca + +and import it in the guest:: + + certutil -A -d /etc/pki/nssdb -i fake-smartcard-ca.cer -t TC,TC,TC -n fake-smartcard-ca + +In a Linux guest you can then use the CoolKey PKCS #11 module to access +the card:: + + certutil -d /etc/pki/nssdb -L -h all + +It will prompt you for the PIN (which is the password you assigned to the +certificate database early on), and then show you all three certificates +together with the manually imported CA cert:: + + Certificate Nickname Trust Attributes + fake-smartcard-ca CT,C,C + John Doe:CAC ID Certificate u,u,u + John Doe:CAC Email Signature Certificate u,u,u + John Doe:CAC Email Encryption Certificate u,u,u + +If this does not happen, CoolKey is not installed or not registered with +NSS. 
Registration can be done from Firefox or the command line:: + + modutil -dbdir /etc/pki/nssdb -add "CAC Module" -libfile /usr/lib64/pkcs11/libcoolkeypk11.so + modutil -dbdir /etc/pki/nssdb -list + +Using ccid-card-passthru with client side hardware +-------------------------------------------------- +On the host specify the ccid-card-passthru device with a suitable chardev:: + + qemu -chardev socket,server=on,host=0.0.0.0,port=2001,id=ccid,wait=off \ + -usb -device usb-ccid -device ccid-card-passthru,chardev=ccid + +On the client run vscclient, built when you built QEMU:: + + vscclient <qemu-host> 2001 + +Using ccid-card-passthru with client side certificates +------------------------------------------------------ +This case is not particularly useful, but you can use it to debug +your setup. + +Follow instructions above, except run QEMU and vscclient as follows. + +Run qemu as per above, and run vscclient from the "fake-smartcard" +directory as follows:: + + qemu -chardev socket,server=on,host=0.0.0.0,port=2001,id=ccid,wait=off \ + -usb -device usb-ccid -device ccid-card-passthru,chardev=ccid + vscclient -e "db=\"sql:$PWD\" use_hw=no soft=(,Test,CAC,,id-cert,signing-cert,encryption-cert)" <qemu-host> 2001 + + +Passthrough protocol scenario +----------------------------- +This is a typical interchange of messages when using the passthru card device. +usb-ccid is a usb device. It defaults to an unattached usb device on startup. +usb-ccid expects a chardev and expects the protocol defined in +cac_card/vscard_common.h to be passed over that. +The usb-ccid device can be in one of three modes: + +* detached +* attached with no card +* attached with card + +A typical interchange is (the arrow shows who started each exchange, it can be client +originated or guest originated):: + + client event | vscclient | passthru | usb-ccid | guest event + ------------------------------------------------------------------------------------------------ + | VSC_Init | | | + | VSC_ReaderAdd | | attach | + | | | | sees new usb device. + card inserted -> | | | | + | VSC_ATR | insert | insert | see new card + | | | | + | VSC_APDU | VSC_APDU | | <- guest sends APDU + client <-> physical | | | | + card APDU exchange | | | | + client response -> | VSC_APDU | VSC_APDU | | receive APDU response + ... + [APDU<->APDU repeats several times] + ... + card removed -> | | | | + | VSC_CardRemove | remove | remove | card removed + ... + [(card insert, apdu's, card remove) repeat] + ... + kill/quit | | | | + vscclient | | | | + | VSC_ReaderRemove | | detach | + | | | | usb device removed. + +libcacard +--------- +Both ccid-card-emulated and vscclient use libcacard as the card emulator. +libcacard implements a completely virtual CAC (DoD standard for smart +cards) compliant card and uses NSS to retrieve certificates and do +any encryption. The backend can then be a real reader and card, or +certificates stored in files. diff --git a/docs/system/devices/cxl.rst b/docs/system/devices/cxl.rst new file mode 100644 index 00000000..f25783a4 --- /dev/null +++ b/docs/system/devices/cxl.rst @@ -0,0 +1,386 @@ +Compute Express Link (CXL) +========================== +From the view of a single host, CXL is an interconnect standard that +targets accelerators and memory devices attached to a CXL host. +This description will focus on those aspects visible either to +software running on a QEMU emulated host or to the internals of +functional emulation. 
As such, it will skip over many of the electrical and protocol elements that are of more interest for real hardware and that dominate more general introductions to CXL. It will also completely ignore the fabric management aspects of CXL by considering only a single host and a static configuration.

CXL shares many concepts and much of the infrastructure of PCI Express, with CXL Host Bridges, which have CXL Root Ports that may be directly attached to CXL or PCI End Points. Alternatively there may be CXL Switches with CXL and PCI Endpoints attached below them. In many cases additional control and capabilities are exposed via PCI Express interfaces. This sharing of interfaces and hence emulation code is reflected in how the devices are emulated in QEMU. In most cases the various CXL elements are built upon equivalent PCIe devices.

CXL devices support the following interfaces:

* Most conventional PCIe interfaces

  - Configuration space access
  - BAR mapped memory accesses used for registers and mailboxes.
  - MSI/MSI-X
  - AER
  - DOE mailboxes
  - IDE
  - Many other PCI Express defined interfaces...

* Memory operations

  - Equivalent of accessing DRAM / NVDIMMs. Any access / feature
    supported by the host for normal memory should also work for
    CXL attached memory devices.

* Cache operations. These are mostly irrelevant to QEMU emulation as
  QEMU is not emulating a coherency protocol. Any emulation related
  to these will be device specific and is out of the scope of this
  document.

CXL 2.0 Device Types
--------------------
CXL 2.0 End Points are often categorized into three types.

**Type 1:** These support coherent caching of host memory. An example might be a crypto accelerator. They may also have device private memory accessible via means such as PCI memory reads and writes to BARs.

**Type 2:** These support coherent caching of host memory and host managed device memory (HDM) for which the coherency protocol is managed by the host. This is a complex topic, so for more information on CXL coherency see the CXL 2.0 specification.

**Type 3 Memory devices:** These devices act as a means of attaching additional memory (HDM) to a CXL host, including both volatile and persistent memory. The CXL topology may support interleaving across a number of Type 3 memory devices using HDM Decoders in the host, host bridge, switch upstream port and endpoints.

Scope of CXL emulation in QEMU
------------------------------
The focus of CXL emulation is CXL revision 2.0 and later. Earlier CXL revisions defined a smaller set of features, leaving much of the control interface as implementation defined or device specific, which makes generic emulation challenging: host specific firmware is responsible for setup and the Endpoints are presented to operating systems as Root Complex Integrated End Points. CXL rev 2.0 looks a lot more like PCI Express, with fully specified discoverability of the CXL topology.

CXL System components
----------------------
A CXL system is made up of a Host with a number of 'standard components', the control and capabilities of which are discoverable by system software using means described in the CXL 2.0 specification.

CXL Fixed Memory Windows (CFMW)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A CFMW consists of a particular range of Host Physical Address space which is routed to particular CXL Host Bridges.
At the time of generic software initialization it will have a particular interleaving configuration and an associated Quality of Service Throttling Group (QTG). This information is available to system software when making decisions about how to configure interleave across available CXL memory devices. It is provided as CFMW Structures (CFMWS) in the CXL Early Discovery Table (CEDT), an ACPI table.

Note: QTG 0 is the only one currently supported in QEMU.

CXL Host Bridge (CXL HB)
~~~~~~~~~~~~~~~~~~~~~~~~
A CXL host bridge is similar to the PCIe equivalent, but with a specification defined register interface called CXL Host Bridge Component Registers (CHBCR). The location of this CHBCR MMIO space is described to system software via a CXL Host Bridge Structure (CHBS) in the CEDT ACPI table. The actual interfaces are identical to those used for other parts of the CXL hierarchy as CXL Component Registers in PCI BARs.

Interfaces provided include:

* Configuration of HDM Decoders to route CXL Memory accesses with
  a particular Host Physical Address range to the target port
  below which the CXL device servicing that address lies. This
  may be a mapping to a single Root Port (RP) or across a set of
  target RPs.

CXL Root Ports (CXL RP)
~~~~~~~~~~~~~~~~~~~~~~~
A CXL Root Port serves the same purpose as a PCIe Root Port. There are a number of CXL specific Designated Vendor Specific Extended Capabilities (DVSEC) in PCIe Configuration Space and associated component register access via PCI BARs.

CXL Switch
~~~~~~~~~~
Here we consider a simple CXL switch with only a single virtual hierarchy. Whilst more complex devices exist, their visibility to a particular host is generally the same as for a simple switch design. Hosts often have no awareness of complex rerouting and device pooling; they simply see devices being hot added or hot removed.

A CXL switch has a similar architecture to those in PCIe, with a single upstream port, an internal PCI bus and multiple downstream ports.

Both the CXL upstream and downstream ports have CXL specific DVSECs in configuration space, and component registers in PCI BARs. The Upstream Port has the configuration interfaces for the HDM decoders which route incoming memory accesses to the appropriate downstream port.

A CXL switch is created in a similar fashion to PCI switches by creating an upstream port (cxl-upstream) and a number of downstream ports on the internal switch bus (cxl-downstream).

CXL Memory Devices - Type 3
~~~~~~~~~~~~~~~~~~~~~~~~~~~
CXL type 3 devices use a PCI class code and are intended to be supported by a generic operating system driver. They also have HDM decoders, though in these EP devices the decoder is responsible not for routing but for translation of the incoming host physical address (HPA) into a Device Physical Address (DPA).

CXL Memory Interleave
---------------------
To understand the interaction of the different CXL hardware components which are emulated in QEMU, let us consider a memory read in a fully configured CXL topology. Note that system software is responsible for configuration of all components with the exception of the CFMWs. System software is responsible for allocating appropriate ranges from within the CFMWs and exposing those via normal memory configurations as would be done for system RAM.

Example system Topology.
x marks the match in each decoder level:: + + |<------------------SYSTEM PHYSICAL ADDRESS MAP (1)----------------->| + | __________ __________________________________ __________ | + | | | | | | | | + | | CFMW 0 | | CXL Fixed Memory Window 1 | | CFMW 1 | | + | | HB0 only | | Configured to interleave memory | | HB1 only | | + | | | | memory accesses across HB0/HB1 | | | | + | |__________| |_____x____________________________| |__________| | + | | | | + | | | | + | | | | + | Interleave Decoder | | + | Matches this HB | | + \_____________| |_____________/ + __________|__________ _____|_______________ + | | | | + (2) | CXL HB 0 | | CXL HB 1 | + | HB IntLv Decoders | | HB IntLv Decoders | + | PCI/CXL Root Bus 0c | | PCI/CXL Root Bus 0d | + | | | | + |___x_________________| |_____________________| + | | | | + | | | | + A HB 0 HDM Decoder | | | + matches this Port | | | + | | | | + ___________|___ __________|__ __|_________ ___|_________ + (3)| Root Port 0 | | Root Port 1 | | Root Port 2| | Root Port 3 | + | Appears in | | Appears in | | Appears in | | Appear in | + | PCI topology | | PCI Topology| | PCI Topo | | PCI Topo | + | As 0c:00.0 | | as 0c:01.0 | | as de:00.0 | | as de:01.0 | + |_______________| |_____________| |____________| |_____________| + | | | | + | | | | + _____|_________ ______|______ ______|_____ ______|_______ + (4)| x | | | | | | | + | CXL Type3 0 | | CXL Type3 1 | | CXL type3 2| | CLX Type 3 3 | + | | | | | | | | + | PMEM0(Vol LSA)| | PMEM1 (...) | | PMEM2 (...)| | PMEM3 (...) | + | Decoder to go | | | | | | | + | from host PA | | PCI 0e:00.0 | | PCI df:00.0| | PCI e0:00.0 | + | to device PA | | | | | | | + | PCI as 0d:00.0| | | | | | | + |_______________| |_____________| |____________| |______________| + +Notes: + +(1) **3 CXL Fixed Memory Windows (CFMW)** corresponding to different + ranges of the system physical address map. Each CFMW has + particular interleave setup across the CXL Host Bridges (HB) + CFMW0 provides uninterleaved access to HB0, CFW2 provides + uninterleaved access to HB1. CFW1 provides interleaved memory access + across HB0 and HB1. + +(2) **Two CXL Host Bridges**. Each of these has 2 CXL Root Ports and + programmable HDM decoders to route memory accesses either to + a single port or interleave them across multiple ports. + A complex configuration here, might be to use the following HDM + decoders in HB0. HDM0 routes CFMW0 requests to RP0 and hence + part of CXL Type3 0. HDM1 routes CFMW0 requests from a + different region of the CFMW0 PA range to RP2 and hence part + of CXL Type 3 1. HDM2 routes yet another PA range from within + CFMW0 to be interleaved across RP0 and RP1, providing 2 way + interleave of part of the memory provided by CXL Type3 0 and + CXL Type 3 1. HDM3 routes those interleaved accesses from + CFMW1 that target HB0 to RP 0 and another part of the memory of + CXL Type 3 0 (as part of a 2 way interleave at the system level + across for example CXL Type3 0 and CXL Type3 2. + HDM4 is used to enable system wide 4 way interleave across all + the present CXL type3 devices, by interleaving those (interleaved) + requests that HB0 receives from from CFMW1 across RP 0 and + RP 1 and hence to yet more regions of the memory of the + attached Type3 devices. Note this is a representative subset + of the full range of possible HDM decoder configurations in this + topology. + +(3) **Four CXL Root Ports.** In this case the CXL Type 3 devices are + directly attached to these ports. 
+ +(4) **Four CXL Type3 memory expansion devices.** These will each have + HDM decoders, but in this case rather than performing interleave + they will take the Host Physical Addresses of accesses and map + them to their own local Device Physical Address Space (DPA). + +Example topology involving a switch:: + + |<------------------SYSTEM PHYSICAL ADDRESS MAP (1)----------------->| + | __________ __________________________________ __________ | + | | | | | | | | + | | CFMW 0 | | CXL Fixed Memory Window 1 | | CFMW 1 | | + | | HB0 only | | Configured to interleave memory | | HB1 only | | + | | | | memory accesses across HB0/HB1 | | | | + | |____x_____| |__________________________________| |__________| | + | | | | + | | | | + | | | + Interleave Decoder | | | + Matches this HB | | | + \_____________| |_____________/ + __________|__________ _____|_______________ + | | | | + | CXL HB 0 | | CXL HB 1 | + | HB IntLv Decoders | | HB IntLv Decoders | + | PCI/CXL Root Bus 0c | | PCI/CXL Root Bus 0d | + | | | | + |___x_________________| |_____________________| + | | | | + | + A HB 0 HDM Decoder + matches this Port + ___________|___ + | Root Port 0 | + | Appears in | + | PCI topology | + | As 0c:00.0 | + |___________x___| + | + | + \_____________________ + | + | + --------------------------------------------------- + | Switch 0 USP as PCI 0d:00.0 | + | USP has HDM decoder which direct traffic to | + | appropriate downstream port | + | Switch BUS appears as 0e | + |x__________________________________________________| + | | | | + | | | | + _____|_________ ______|______ ______|_____ ______|_______ + (4)| x | | | | | | | + | CXL Type3 0 | | CXL Type3 1 | | CXL type3 2| | CLX Type 3 3 | + | | | | | | | | + | PMEM0(Vol LSA)| | PMEM1 (...) | | PMEM2 (...)| | PMEM3 (...) | + | Decoder to go | | | | | | | + | from host PA | | PCI 10:00.0 | | PCI 11:00.0| | PCI 12:00.0 | + | to device PA | | | | | | | + | PCI as 0f:00.0| | | | | | | + |_______________| |_____________| |____________| |______________| + +Example command lines +--------------------- +A very simple setup with just one directly attached CXL Type 3 device:: + + qemu-system-aarch64 -M virt,gic-version=3,cxl=on -m 4g,maxmem=8G,slots=8 -cpu max \ + ... + -object memory-backend-file,id=cxl-mem1,share=on,mem-path=/tmp/cxltest.raw,size=256M \ + -object memory-backend-file,id=cxl-lsa1,share=on,mem-path=/tmp/lsa.raw,size=256M \ + -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \ + -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \ + -device cxl-type3,bus=root_port13,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \ + -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G + +A setup suitable for 4 way interleave. Only one fixed window provided, to enable 2 way +interleave across 2 CXL host bridges. Each host bridge has 2 CXL Root Ports, with +the CXL Type3 device directly attached (no switches).:: + + qemu-system-aarch64 -M virt,gic-version=3,cxl=on -m 4g,maxmem=8G,slots=8 -cpu max \ + ... 
+ -object memory-backend-file,id=cxl-mem1,share=on,mem-path=/tmp/cxltest.raw,size=256M \ + -object memory-backend-file,id=cxl-mem2,share=on,mem-path=/tmp/cxltest2.raw,size=256M \ + -object memory-backend-file,id=cxl-mem3,share=on,mem-path=/tmp/cxltest3.raw,size=256M \ + -object memory-backend-file,id=cxl-mem4,share=on,mem-path=/tmp/cxltest4.raw,size=256M \ + -object memory-backend-file,id=cxl-lsa1,share=on,mem-path=/tmp/lsa.raw,size=256M \ + -object memory-backend-file,id=cxl-lsa2,share=on,mem-path=/tmp/lsa2.raw,size=256M \ + -object memory-backend-file,id=cxl-lsa3,share=on,mem-path=/tmp/lsa3.raw,size=256M \ + -object memory-backend-file,id=cxl-lsa4,share=on,mem-path=/tmp/lsa4.raw,size=256M \ + -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \ + -device pxb-cxl,bus_nr=222,bus=pcie.0,id=cxl.2 \ + -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \ + -device cxl-type3,bus=root_port13,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \ + -device cxl-rp,port=1,bus=cxl.1,id=root_port14,chassis=0,slot=3 \ + -device cxl-type3,bus=root_port14,memdev=cxl-mem2,lsa=cxl-lsa2,id=cxl-pmem1 \ + -device cxl-rp,port=0,bus=cxl.2,id=root_port15,chassis=0,slot=5 \ + -device cxl-type3,bus=root_port15,memdev=cxl-mem3,lsa=cxl-lsa3,id=cxl-pmem2 \ + -device cxl-rp,port=1,bus=cxl.2,id=root_port16,chassis=0,slot=6 \ + -device cxl-type3,bus=root_port16,memdev=cxl-mem4,lsa=cxl-lsa4,id=cxl-pmem3 \ + -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.targets.1=cxl.2,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=8k + +An example of 4 devices below a switch suitable for 1, 2 or 4 way interleave:: + + qemu-system-aarch64 -M virt,gic-version=3,cxl=on -m 4g,maxmem=8G,slots=8 -cpu max \ + ... + -object memory-backend-file,id=cxl-mem0,share=on,mem-path=/tmp/cxltest.raw,size=256M \ + -object memory-backend-file,id=cxl-mem1,share=on,mem-path=/tmp/cxltest1.raw,size=256M \ + -object memory-backend-file,id=cxl-mem2,share=on,mem-path=/tmp/cxltest2.raw,size=256M \ + -object memory-backend-file,id=cxl-mem3,share=on,mem-path=/tmp/cxltest3.raw,size=256M \ + -object memory-backend-file,id=cxl-lsa0,share=on,mem-path=/tmp/lsa0.raw,size=256M \ + -object memory-backend-file,id=cxl-lsa1,share=on,mem-path=/tmp/lsa1.raw,size=256M \ + -object memory-backend-file,id=cxl-lsa2,share=on,mem-path=/tmp/lsa2.raw,size=256M \ + -object memory-backend-file,id=cxl-lsa3,share=on,mem-path=/tmp/lsa3.raw,size=256M \ + -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \ + -device cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=0 \ + -device cxl-rp,port=1,bus=cxl.1,id=root_port1,chassis=0,slot=1 \ + -device cxl-upstream,bus=root_port0,id=us0 \ + -device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=4 \ + -device cxl-type3,bus=swport0,memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-pmem0,size=256M \ + -device cxl-downstream,port=1,bus=us0,id=swport1,chassis=0,slot=5 \ + -device cxl-type3,bus=swport1,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem1,size=256M \ + -device cxl-downstream,port=2,bus=us0,id=swport2,chassis=0,slot=6 \ + -device cxl-type3,bus=swport2,memdev=cxl-mem2,lsa=cxl-lsa2,id=cxl-pmem2,size=256M \ + -device cxl-downstream,port=3,bus=us0,id=swport3,chassis=0,slot=7 \ + -device cxl-type3,bus=swport3,memdev=cxl-mem3,lsa=cxl-lsa3,id=cxl-pmem3,size=256M \ + -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=4k + +Kernel Configuration Options +---------------------------- + +In Linux 5.18 the following options are necessary to make use of +OS management of CXL memory devices as described here. 
+ +* CONFIG_CXL_BUS +* CONFIG_CXL_PCI +* CONFIG_CXL_ACPI +* CONFIG_CXL_PMEM +* CONFIG_CXL_MEM +* CONFIG_CXL_PORT +* CONFIG_CXL_REGION + +References +---------- + + - Consortium website for specifications etc: + http://www.computeexpresslink.org + - Compute Express link Revision 2 specification, October 2020 + - CEDT CFMWS & QTG _DSM ECN May 2021 diff --git a/docs/system/devices/ivshmem.rst b/docs/system/devices/ivshmem.rst new file mode 100644 index 00000000..b03a48af --- /dev/null +++ b/docs/system/devices/ivshmem.rst @@ -0,0 +1,64 @@ +.. _pcsys_005fivshmem: + +Inter-VM Shared Memory device +----------------------------- + +On Linux hosts, a shared memory device is available. The basic syntax +is: + +.. parsed-literal:: + + |qemu_system_x86| -device ivshmem-plain,memdev=hostmem + +where hostmem names a host memory backend. For a POSIX shared memory +backend, use something like + +:: + + -object memory-backend-file,size=1M,share,mem-path=/dev/shm/ivshmem,id=hostmem + +If desired, interrupts can be sent between guest VMs accessing the same +shared memory region. Interrupt support requires using a shared memory +server and using a chardev socket to connect to it. The code for the +shared memory server is qemu.git/contrib/ivshmem-server. An example +syntax when using the shared memory server is: + +.. parsed-literal:: + + # First start the ivshmem server once and for all + ivshmem-server -p pidfile -S path -m shm-name -l shm-size -n vectors + + # Then start your qemu instances with matching arguments + |qemu_system_x86| -device ivshmem-doorbell,vectors=vectors,chardev=id + -chardev socket,path=path,id=id + +When using the server, the guest will be assigned a VM ID (>=0) that +allows guests using the same server to communicate via interrupts. +Guests can read their VM ID from a device register (see +ivshmem-spec.txt). + +Migration with ivshmem +~~~~~~~~~~~~~~~~~~~~~~ + +With device property ``master=on``, the guest will copy the shared +memory on migration to the destination host. With ``master=off``, the +guest will not be able to migrate with the device attached. In the +latter case, the device should be detached and then reattached after +migration using the PCI hotplug support. + +At most one of the devices sharing the same memory can be master. The +master must complete migration before you plug back the other devices. + +ivshmem and hugepages +~~~~~~~~~~~~~~~~~~~~~ + +Instead of specifying the <shm size> using POSIX shm, you may specify a +memory backend that has hugepage support: + +.. parsed-literal:: + + |qemu_system_x86| -object memory-backend-file,size=1G,mem-path=/dev/hugepages/my-shmem-file,share,id=mb1 + -device ivshmem-plain,memdev=mb1 + +ivshmem-server also supports hugepages mount points with the ``-m`` +memory path argument. diff --git a/docs/system/devices/net.rst b/docs/system/devices/net.rst new file mode 100644 index 00000000..4b2640c4 --- /dev/null +++ b/docs/system/devices/net.rst @@ -0,0 +1,100 @@ +.. _pcsys_005fnetwork: + +Network emulation +----------------- + +QEMU can simulate several network cards (e.g. PCI or ISA cards on the PC +target) and can connect them to a network backend on the host or an +emulated hub. The various host network backends can either be used to +connect the NIC of the guest to a real network (e.g. by using a TAP +devices or the non-privileged user mode network stack), or to other +guest instances running in another QEMU process (e.g. by using the +socket host network backend). 
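For example (a sketch only; the ``virtio-net-pci`` model and the user-mode backend are arbitrary choices here), a guest NIC is usually created by pairing a backend defined with ``-netdev`` with a NIC model defined with ``-device``, or by using the combined ``-nic`` shorthand::

   qemu-system-x86_64 [...] -netdev user,id=net0 -device virtio-net-pci,netdev=net0

   # equivalent shorthand for the same pairing
   qemu-system-x86_64 [...] -nic user,model=virtio-net-pci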
+ +Using TAP network interfaces +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This is the standard way to connect QEMU to a real network. QEMU adds a +virtual network device on your host (called ``tapN``), and you can then +configure it as if it was a real ethernet card. + +Linux host +^^^^^^^^^^ + +As an example, you can download the ``linux-test-xxx.tar.gz`` archive +and copy the script ``qemu-ifup`` in ``/etc`` and configure properly +``sudo`` so that the command ``ifconfig`` contained in ``qemu-ifup`` can +be executed as root. You must verify that your host kernel supports the +TAP network interfaces: the device ``/dev/net/tun`` must be present. + +See :ref:`sec_005finvocation` to have examples of command +lines using the TAP network interfaces. + +Windows host +^^^^^^^^^^^^ + +There is a virtual ethernet driver for Windows 2000/XP systems, called +TAP-Win32. But it is not included in standard QEMU for Windows, so you +will need to get it separately. It is part of OpenVPN package, so +download OpenVPN from : https://openvpn.net/. + +Using the user mode network stack +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +By using the option ``-net user`` (default configuration if no ``-net`` +option is specified), QEMU uses a completely user mode network stack +(you don't need root privilege to use the virtual network). The virtual +network configuration is the following:: + + guest (10.0.2.15) <------> Firewall/DHCP server <-----> Internet + | (10.0.2.2) + | + ----> DNS server (10.0.2.3) + | + ----> SMB server (10.0.2.4) + +The QEMU VM behaves as if it was behind a firewall which blocks all +incoming connections. You can use a DHCP client to automatically +configure the network in the QEMU VM. The DHCP server assign addresses +to the hosts starting from 10.0.2.15. + +In order to check that the user mode network is working, you can ping +the address 10.0.2.2 and verify that you got an address in the range +10.0.2.x from the QEMU virtual DHCP server. + +Note that ICMP traffic in general does not work with user mode +networking. ``ping``, aka. ICMP echo, to the local router (10.0.2.2) +shall work, however. If you're using QEMU on Linux >= 3.0, it can use +unprivileged ICMP ping sockets to allow ``ping`` to the Internet. The +host admin has to set the ping_group_range in order to grant access to +those sockets. To allow ping for GID 100 (usually users group):: + + echo 100 100 > /proc/sys/net/ipv4/ping_group_range + +When using the built-in TFTP server, the router is also the TFTP server. + +When using the ``'-netdev user,hostfwd=...'`` option, TCP or UDP +connections can be redirected from the host to the guest. It allows for +example to redirect X11, telnet or SSH connections. + +Hubs +~~~~ + +QEMU can simulate several hubs. A hub can be thought of as a virtual +connection between several network devices. These devices can be for +example QEMU virtual ethernet cards or virtual Host ethernet devices +(TAP devices). You can connect guest NICs or host network backends to +such a hub using the ``-netdev +hubport`` or ``-nic hubport`` options. The legacy ``-net`` option also +connects the given device to the emulated hub with ID 0 (i.e. the +default hub) unless you specify a netdev with ``-net nic,netdev=xxx`` +here. + +Connecting emulated networks between QEMU instances +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Using the ``-netdev socket`` (or ``-nic socket`` or ``-net socket``) +option, it is possible to create emulated networks that span several +QEMU instances. 
See the description of the ``-netdev socket`` option in +:ref:`sec_005finvocation` to have a basic +example. diff --git a/docs/system/devices/nvme.rst b/docs/system/devices/nvme.rst new file mode 100644 index 00000000..30f841ef --- /dev/null +++ b/docs/system/devices/nvme.rst @@ -0,0 +1,323 @@ +============== +NVMe Emulation +============== + +QEMU provides NVMe emulation through the ``nvme``, ``nvme-ns`` and +``nvme-subsys`` devices. + +See the following sections for specific information on + + * `Adding NVMe Devices`_, `additional namespaces`_ and `NVM subsystems`_. + * Configuration of `Optional Features`_ such as `Controller Memory Buffer`_, + `Simple Copy`_, `Zoned Namespaces`_, `metadata`_ and `End-to-End Data + Protection`_, + +Adding NVMe Devices +=================== + +Controller Emulation +-------------------- + +The QEMU emulated NVMe controller implements version 1.4 of the NVM Express +specification. All mandatory features are implement with a couple of exceptions +and limitations: + + * Accounting numbers in the SMART/Health log page are reset when the device + is power cycled. + * Interrupt Coalescing is not supported and is disabled by default. + +The simplest way to attach an NVMe controller on the QEMU PCI bus is to add the +following parameters: + +.. code-block:: console + + -drive file=nvm.img,if=none,id=nvm + -device nvme,serial=deadbeef,drive=nvm + +There are a number of optional general parameters for the ``nvme`` device. Some +are mentioned here, but see ``-device nvme,help`` to list all possible +parameters. + +``max_ioqpairs=UINT32`` (default: ``64``) + Set the maximum number of allowed I/O queue pairs. This replaces the + deprecated ``num_queues`` parameter. + +``msix_qsize=UINT16`` (default: ``65``) + The number of MSI-X vectors that the device should support. + +``mdts=UINT8`` (default: ``7``) + Set the Maximum Data Transfer Size of the device. + +``use-intel-id`` (default: ``off``) + Since QEMU 5.2, the device uses a QEMU allocated "Red Hat" PCI Device and + Vendor ID. Set this to ``on`` to revert to the unallocated Intel ID + previously used. + +Additional Namespaces +--------------------- + +In the simplest possible invocation sketched above, the device only support a +single namespace with the namespace identifier ``1``. To support multiple +namespaces and additional features, the ``nvme-ns`` device must be used. + +.. code-block:: console + + -device nvme,id=nvme-ctrl-0,serial=deadbeef + -drive file=nvm-1.img,if=none,id=nvm-1 + -device nvme-ns,drive=nvm-1 + -drive file=nvm-2.img,if=none,id=nvm-2 + -device nvme-ns,drive=nvm-2 + +The namespaces defined by the ``nvme-ns`` device will attach to the most +recently defined ``nvme-bus`` that is created by the ``nvme`` device. Namespace +identifiers are allocated automatically, starting from ``1``. + +There are a number of parameters available: + +``nsid`` (default: ``0``) + Explicitly set the namespace identifier. + +``uuid`` (default: *autogenerated*) + Set the UUID of the namespace. This will be reported as a "Namespace UUID" + descriptor in the Namespace Identification Descriptor List. + +``eui64`` + Set the EUI-64 of the namespace. This will be reported as a "IEEE Extended + Unique Identifier" descriptor in the Namespace Identification Descriptor List. + Since machine type 6.1 a non-zero default value is used if the parameter + is not provided. For earlier machine types the field defaults to 0. 

``bus``
  If there are more ``nvme`` devices defined, this parameter may be used to
  attach the namespace to a specific ``nvme`` device (identified by an ``id``
  parameter on the controller device).

NVM Subsystems
--------------

Additional features become available if the controller device (``nvme``) is linked to an NVM Subsystem device (``nvme-subsys``).

The NVM Subsystem emulation allows features such as shared namespaces and multipath I/O.

.. code-block:: console

   -device nvme-subsys,id=nvme-subsys-0,nqn=subsys0
   -device nvme,serial=deadbeef,subsys=nvme-subsys-0
   -device nvme,serial=deadbeef,subsys=nvme-subsys-0

This will create an NVM subsystem with two controllers. Having controllers linked to an ``nvme-subsys`` device allows additional ``nvme-ns`` parameters:

``shared`` (default: ``on`` since 6.2)
  Specifies that the namespace will be attached to all controllers in the
  subsystem. If set to ``off``, the namespace will remain a private namespace
  and may only be attached to a single controller at a time. Shared namespaces
  are always automatically attached to all controllers (also when controllers
  are hotplugged).

``detached`` (default: ``off``)
  If set to ``on``, the namespace will be available in the subsystem, but
  not attached to any controllers initially. A shared namespace with this set
  to ``on`` will never be automatically attached to controllers.

Thus, adding

.. code-block:: console

   -drive file=nvm-1.img,if=none,id=nvm-1
   -device nvme-ns,drive=nvm-1,nsid=1
   -drive file=nvm-2.img,if=none,id=nvm-2
   -device nvme-ns,drive=nvm-2,nsid=3,shared=off,detached=on

will cause NSID 1 to be a shared namespace that is initially attached to both controllers. NSID 3 will be a private namespace due to ``shared=off`` and only attachable to a single controller at a time. Additionally it will not be attached to any controller initially (due to ``detached=on``) or to hotplugged controllers.

Optional Features
=================

Controller Memory Buffer
------------------------

``nvme`` device parameters related to the Controller Memory Buffer support:

``cmb_size_mb=UINT32`` (default: ``0``)
  This adds a Controller Memory Buffer of the given size at offset zero in
  BAR 2.

``legacy-cmb`` (default: ``off``)
  By default, the device uses the "v1.4 scheme" for the Controller Memory
  Buffer support (i.e., the CMB is initially disabled and must be explicitly
  enabled by the host). Set this to ``on`` to behave as a v1.3 device wrt. the
  CMB.

Simple Copy
-----------

The device includes support for TP 4065 ("Simple Copy Command"). A number of additional ``nvme-ns`` device parameters may be used to control the Copy command limits:

``mssrl=UINT16`` (default: ``128``)
  Set the Maximum Single Source Range Length (``MSSRL``). This is the maximum
  number of logical blocks that may be specified in each source range.

``mcl=UINT32`` (default: ``128``)
  Set the Maximum Copy Length (``MCL``). This is the maximum number of logical
  blocks that may be specified in a Copy command (the total for all source
  ranges).

``msrc=UINT8`` (default: ``127``)
  Set the Maximum Source Range Count (``MSRC``). This is the maximum number of
  source ranges that may be used in a Copy command. This is a 0's based value.

Zoned Namespaces
----------------

A namespace may be "Zoned" as defined by TP 4053 ("Zoned Namespaces"). Set ``zoned=on`` on an ``nvme-ns`` device to configure it as a zoned namespace.
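For instance, a minimal sketch of a zoned namespace using some of the zone parameters described below (the image path and the chosen sizes/limits are arbitrary example values):

.. code-block:: console

   -drive file=zns.img,if=none,id=nvm-zns
   -device nvme-ns,drive=nvm-zns,zoned=on,zoned.zone_size=64M,zoned.max_open=16,zoned.max_active=32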
+ +The namespace may be configured with additional parameters + +``zoned.zone_size=SIZE`` (default: ``128MiB``) + Define the zone size (``ZSZE``). + +``zoned.zone_capacity=SIZE`` (default: ``0``) + Define the zone capacity (``ZCAP``). If left at the default (``0``), the zone + capacity will equal the zone size. + +``zoned.descr_ext_size=UINT32`` (default: ``0``) + Set the Zone Descriptor Extension Size (``ZDES``). Must be a multiple of 64 + bytes. + +``zoned.cross_read=BOOL`` (default: ``off``) + Set to ``on`` to allow reads to cross zone boundaries. + +``zoned.max_active=UINT32`` (default: ``0``) + Set the maximum number of active resources (``MAR``). The default (``0``) + allows all zones to be active. + +``zoned.max_open=UINT32`` (default: ``0``) + Set the maximum number of open resources (``MOR``). The default (``0``) + allows all zones to be open. If ``zoned.max_active`` is specified, this value + must be less than or equal to that. + +``zoned.zasl=UINT8`` (default: ``0``) + Set the maximum data transfer size for the Zone Append command. Like + ``mdts``, the value is specified as a power of two (2^n) and is in units of + the minimum memory page size (CAP.MPSMIN). The default value (``0``) + has this property inherit the ``mdts`` value. + +Metadata +-------- + +The virtual namespace device supports LBA metadata in the form separate +metadata (``MPTR``-based) and extended LBAs. + +``ms=UINT16`` (default: ``0``) + Defines the number of metadata bytes per LBA. + +``mset=UINT8`` (default: ``0``) + Set to ``1`` to enable extended LBAs. + +End-to-End Data Protection +-------------------------- + +The virtual namespace device supports DIF- and DIX-based protection information +(depending on ``mset``). + +``pi=UINT8`` (default: ``0``) + Enable protection information of the specified type (type ``1``, ``2`` or + ``3``). + +``pil=UINT8`` (default: ``0``) + Controls the location of the protection information within the metadata. Set + to ``1`` to transfer protection information as the first eight bytes of + metadata. Otherwise, the protection information is transferred as the last + eight bytes. + +Virtualization Enhancements and SR-IOV (Experimental Support) +------------------------------------------------------------- + +The ``nvme`` device supports Single Root I/O Virtualization and Sharing +along with Virtualization Enhancements. The controller has to be linked to +an NVM Subsystem device (``nvme-subsys``) for use with SR-IOV. + +A number of parameters are present (**please note, that they may be +subject to change**): + +``sriov_max_vfs`` (default: ``0``) + Indicates the maximum number of PCIe virtual functions supported + by the controller. Specifying a non-zero value enables reporting of both + SR-IOV and ARI (Alternative Routing-ID Interpretation) capabilities + by the NVMe device. Virtual function controllers will not report SR-IOV. + +``sriov_vq_flexible`` + Indicates the total number of flexible queue resources assignable to all + the secondary controllers. Implicitly sets the number of primary + controller's private resources to ``(max_ioqpairs - sriov_vq_flexible)``. + +``sriov_vi_flexible`` + Indicates the total number of flexible interrupt resources assignable to + all the secondary controllers. Implicitly sets the number of primary + controller's private resources to ``(msix_qsize - sriov_vi_flexible)``. + +``sriov_max_vi_per_vf`` (default: ``0``) + Indicates the maximum number of virtual interrupt resources assignable + to a secondary controller. 
The default ``0`` resolves to + ``(sriov_vi_flexible / sriov_max_vfs)`` + +``sriov_max_vq_per_vf`` (default: ``0``) + Indicates the maximum number of virtual queue resources assignable to + a secondary controller. The default ``0`` resolves to + ``(sriov_vq_flexible / sriov_max_vfs)`` + +The simplest possible invocation enables the capability to set up one VF +controller and assign an admin queue, an IO queue, and a MSI-X interrupt. + +.. code-block:: console + + -device nvme-subsys,id=subsys0 + -device nvme,serial=deadbeef,subsys=subsys0,sriov_max_vfs=1, + sriov_vq_flexible=2,sriov_vi_flexible=1 + +The minimum steps required to configure a functional NVMe secondary +controller are: + + * unbind flexible resources from the primary controller + +.. code-block:: console + + nvme virt-mgmt /dev/nvme0 -c 0 -r 1 -a 1 -n 0 + nvme virt-mgmt /dev/nvme0 -c 0 -r 0 -a 1 -n 0 + + * perform a Function Level Reset on the primary controller to actually + release the resources + +.. code-block:: console + + echo 1 > /sys/bus/pci/devices/0000:01:00.0/reset + + * enable VF + +.. code-block:: console + + echo 1 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs + + * assign the flexible resources to the VF and set it ONLINE + +.. code-block:: console + + nvme virt-mgmt /dev/nvme0 -c 1 -r 1 -a 8 -n 1 + nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 8 -n 2 + nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 9 -n 0 + + * bind the NVMe driver to the VF + +.. code-block:: console + + echo 0000:01:00.1 > /sys/bus/pci/drivers/nvme/bind
\ No newline at end of file diff --git a/docs/system/devices/usb.rst b/docs/system/devices/usb.rst new file mode 100644 index 00000000..37cb9b33 --- /dev/null +++ b/docs/system/devices/usb.rst @@ -0,0 +1,408 @@ +.. _pcsys_005fusb: + +USB emulation +------------- + +QEMU can emulate a PCI UHCI, OHCI, EHCI or XHCI USB controller. You can +plug virtual USB devices or real host USB devices (only works with +certain host operating systems). QEMU will automatically create and +connect virtual USB hubs as necessary to connect multiple USB devices. + +USB controllers +~~~~~~~~~~~~~~~ + +XHCI controller support +^^^^^^^^^^^^^^^^^^^^^^^ + +QEMU has XHCI host adapter support. The XHCI hardware design is much +more virtualization-friendly when compared to EHCI and UHCI, thus XHCI +emulation uses less resources (especially CPU). So if your guest +supports XHCI (which should be the case for any operating system +released around 2010 or later) we recommend using it: + + qemu -device qemu-xhci + +XHCI supports USB 1.1, USB 2.0 and USB 3.0 devices, so this is the +only controller you need. With only a single USB controller (and +therefore only a single USB bus) present in the system there is no +need to use the bus= parameter when adding USB devices. + + +EHCI controller support +^^^^^^^^^^^^^^^^^^^^^^^ + +The QEMU EHCI Adapter supports USB 2.0 devices. It can be used either +standalone or with companion controllers (UHCI, OHCI) for USB 1.1 +devices. The companion controller setup is more convenient to use +because it provides a single USB bus supporting both USB 2.0 and USB +1.1 devices. See next section for details. + +When running EHCI in standalone mode you can add UHCI or OHCI +controllers for USB 1.1 devices too. Each controller creates its own +bus though, so there are two completely separate USB buses: One USB +1.1 bus driven by the UHCI controller and one USB 2.0 bus driven by +the EHCI controller. Devices must be attached to the correct +controller manually. + +The easiest way to add a UHCI controller to a ``pc`` machine is the +``-usb`` switch. QEMU will create the UHCI controller as function of +the PIIX3 chipset. The USB 1.1 bus will carry the name ``usb-bus.0``. + +You can use the standard ``-device`` switch to add a EHCI controller to +your virtual machine. It is strongly recommended to specify an ID for +the controller so the USB 2.0 bus gets an individual name, for example +``-device usb-ehci,id=ehci``. This will give you a USB 2.0 bus named +``ehci.0``. + +When adding USB devices using the ``-device`` switch you can specify the +bus they should be attached to. Here is a complete example: + +.. parsed-literal:: + + |qemu_system| -M pc ${otheroptions} \\ + -drive if=none,id=usbstick,format=raw,file=/path/to/image \\ + -usb \\ + -device usb-ehci,id=ehci \\ + -device usb-tablet,bus=usb-bus.0 \\ + -device usb-storage,bus=ehci.0,drive=usbstick + +This attaches a USB tablet to the UHCI adapter and a USB mass storage +device to the EHCI adapter. + + +Companion controller support +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The UHCI and OHCI controllers can attach to a USB bus created by EHCI +as companion controllers. This is done by specifying the ``masterbus`` +and ``firstport`` properties. ``masterbus`` specifies the bus name the +controller should attach to. ``firstport`` specifies the first port the +controller should attach to, which is needed as usually one EHCI +controller with six ports has three UHCI companion controllers with +two ports each. 
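A hand-written sketch of such a companion setup could look like this, using the ICH9 controller devices (note this is only an illustration: the bundled config file mentioned below additionally places all four controllers on a single multifunction PCI slot, which is closer to real hardware):

.. parsed-literal::

   |qemu_system| -M pc ${otheroptions} \\
      -device ich9-usb-ehci1,id=ehci \\
      -device ich9-usb-uhci1,masterbus=ehci.0,firstport=0 \\
      -device ich9-usb-uhci2,masterbus=ehci.0,firstport=2 \\
      -device ich9-usb-uhci3,masterbus=ehci.0,firstport=4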
+ +There is a config file in docs which will do all this for +you, which you can use like this: + +.. parsed-literal:: + + |qemu_system| -readconfig docs/config/ich9-ehci-uhci.cfg + +Then use ``bus=ehci.0`` to assign your USB devices to that bus. + +Using the ``-usb`` switch for ``q35`` machines will create a similar +USB controller configuration. + + +.. _Connecting USB devices: + +Connecting USB devices +~~~~~~~~~~~~~~~~~~~~~~ + +USB devices can be connected with the ``-device usb-...`` command line +option or the ``device_add`` monitor command. Available devices are: + +``usb-mouse`` + Virtual Mouse. This will override the PS/2 mouse emulation when + activated. + +``usb-tablet`` + Pointer device that uses absolute coordinates (like a touchscreen). + This means QEMU is able to report the mouse position without having + to grab the mouse. Also overrides the PS/2 mouse emulation when + activated. + +``usb-storage,drive=drive_id`` + Mass storage device backed by drive_id (see the :ref:`disk images` + chapter in the System Emulation Users Guide). This is the classic + bulk-only transport protocol used by 99% of USB sticks. This + example shows it connected to an XHCI USB controller and with + a drive backed by a raw format disk image: + + .. parsed-literal:: + + |qemu_system| [...] \\ + -drive if=none,id=stick,format=raw,file=/path/to/file.img \\ + -device nec-usb-xhci,id=xhci \\ + -device usb-storage,bus=xhci.0,drive=stick + +``usb-uas`` + USB attached SCSI device. This does not create a SCSI disk, so + you need to explicitly create a ``scsi-hd`` or ``scsi-cd`` device + on the command line, as well as using the ``-drive`` option to + specify what those disks are backed by. One ``usb-uas`` device can + handle multiple logical units (disks). This example creates three + logical units: two disks and one cdrom drive: + + .. parsed-literal:: + + |qemu_system| [...] \\ + -drive if=none,id=uas-disk1,format=raw,file=/path/to/file1.img \\ + -drive if=none,id=uas-disk2,format=raw,file=/path/to/file2.img \\ + -drive if=none,id=uas-cdrom,media=cdrom,format=raw,file=/path/to/image.iso \\ + -device nec-usb-xhci,id=xhci \\ + -device usb-uas,id=uas,bus=xhci.0 \\ + -device scsi-hd,bus=uas.0,scsi-id=0,lun=0,drive=uas-disk1 \\ + -device scsi-hd,bus=uas.0,scsi-id=0,lun=1,drive=uas-disk2 \\ + -device scsi-cd,bus=uas.0,scsi-id=0,lun=5,drive=uas-cdrom + +``usb-bot`` + Bulk-only transport storage device. This presents the guest with the + same USB bulk-only transport protocol interface as ``usb-storage``, but + the QEMU command line option works like ``usb-uas`` and does not + automatically create SCSI disks for you. ``usb-bot`` supports up to + 16 LUNs. Unlike ``usb-uas``, the LUN numbers must be continuous, + i.e. for three devices you must use 0+1+2. The 0+1+5 numbering from the + ``usb-uas`` example above won't work with ``usb-bot``. + +``usb-mtp,rootdir=dir`` + Media transfer protocol device, using dir as root of the file tree + that is presented to the guest. + +``usb-host,hostbus=bus,hostaddr=addr`` + Pass through the host device identified by bus and addr + +``usb-host,vendorid=vendor,productid=product`` + Pass through the host device identified by vendor and product ID + +``usb-wacom-tablet`` + Virtual Wacom PenPartner tablet. This device is similar to the + ``tablet`` above but it can be used with the tslib library because in + addition to touch coordinates it reports touch pressure. + +``usb-kbd`` + Standard USB keyboard. Will override the PS/2 keyboard (if present). 
+
+``usb-serial,chardev=id``
+  Serial converter. This emulates an FTDI FT232BM chip connected to
+  host character device id.
+
+``usb-braille,chardev=id``
+  Braille device. This emulates a Baum Braille device USB port. id has to
+  specify a character device defined with ``-chardev …,id=id``. One will
+  normally use BrlAPI to display the braille output on a BRLTTY-supported
+  device with
+
+  .. parsed-literal::
+
+     |qemu_system| [...] -chardev braille,id=brl -device usb-braille,chardev=brl
+
+  or alternatively, use the following equivalent shortcut:
+
+  .. parsed-literal::
+
+     |qemu_system| [...] -usbdevice braille
+
+``usb-net[,netdev=id]``
+  Network adapter that supports CDC ethernet and RNDIS protocols. id
+  specifies a netdev defined with ``-netdev …,id=id``. For instance,
+  user-mode networking can be used with
+
+  .. parsed-literal::
+
+     |qemu_system| [...] -netdev user,id=net0 -device usb-net,netdev=net0
+
+``usb-ccid``
+  Smartcard reader device
+
+``usb-audio``
+  USB audio device
+
+``u2f-{emulated,passthru}``
+  Universal Second Factor device
+
+``canokey``
+  An Open-source Secure Key implementing FIDO2, OpenPGP, PIV and more.
+  For more information, see :ref:`canokey`.
+
+Physical port addressing
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+For all the above USB devices, by default QEMU will plug the device
+into the next available port on the specified USB bus, or onto
+some available USB bus if you didn't specify one explicitly.
+If you need to, you can also specify the physical port where
+the device will show up in the guest. This can be done using the
+``port`` property. UHCI has two root ports (1,2). EHCI has six root
+ports (1-6), and the emulated (1.1) USB hub has eight ports.
+
+Plugging a tablet into UHCI port 1 works like this::
+
+  -device usb-tablet,bus=usb-bus.0,port=1
+
+Plugging a hub into UHCI port 2 works like this::
+
+  -device usb-hub,bus=usb-bus.0,port=2
+
+Plugging a virtual USB stick into port 4 of the hub just plugged in works
+like this::
+
+  -device usb-storage,bus=usb-bus.0,port=2.4,drive=...
+
+In the monitor, the ``device_add`` command also accepts a ``port``
+property specification. If you also want to unplug devices, you should
+specify a unique id which you can use to refer to the device.
+You can then use ``device_del`` to unplug the device later.
+For example::
+
+  (qemu) device_add usb-tablet,bus=usb-bus.0,port=1,id=my-tablet
+  (qemu) device_del my-tablet
+
+Hotplugging USB storage
+~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``usb-bot`` and ``usb-uas`` devices can be hotplugged. In the hotplug
+case they are added with ``attached = false`` so the guest will not see
+the device until the ``attached`` property is explicitly set to true.
+This allows you to attach one or more SCSI devices before making the
+device visible to the guest. The workflow looks like this:
+
+#. ``device_add usb-bot,id=foo``
+#. ``device_add scsi-{hd,cd},bus=foo.0,lun=0``
+#. optionally add more devices (luns 1 ... 15)
+#. ``scripts/qmp/qom-set foo.attached = true``
+
+.. _host_005fusb_005fdevices:
+
+Using host USB devices on a Linux host
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+WARNING: this is an experimental feature. QEMU will slow down when using
+it. USB devices requiring real time streaming (i.e. USB Video Cameras)
+are not supported yet.
+
+1. If you use an early Linux 2.4 kernel, verify that no Linux driver is
+   actually using the USB device. A simple way to do that is simply to
+   disable the corresponding kernel module by renaming it from
+   ``mydriver.o`` to ``mydriver.o.disabled``.
+
+2. Verify that ``/proc/bus/usb`` is working (most Linux distributions
+   should enable it by default). You should see something like this:
+
+   ::
+
+      ls /proc/bus/usb
+      001  devices  drivers
+
+3. Since only root can access the USB devices directly, you can
+   either launch QEMU as root or change the permissions of the USB
+   devices you want to use. For testing, the following suffices:
+
+   ::
+
+      chown -R myuid /proc/bus/usb
+
+4. Launch QEMU and do in the monitor:
+
+   ::
+
+      info usbhost
+      Device 1.2, speed 480 Mb/s
+        Class 00: USB device 1234:5678, USB DISK
+
+   You should see the list of devices you can use (never try to use
+   hubs, it won't work).
+
+5. Add the device in QEMU by using:
+
+   ::
+
+      device_add usb-host,vendorid=0x1234,productid=0x5678
+
+   Normally the guest OS should report that a new USB device is plugged.
+   You can use the option ``-device usb-host,...`` to do the same.
+
+6. Now you can try to use the host USB device in QEMU.
+
+When relaunching QEMU, you may have to unplug and re-plug the USB
+device to make it work again (this is a bug).
+
+``usb-host`` properties for specifying the host device
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The example above uses the ``vendorid`` and ``productid`` properties to
+specify which host device to pass through, but this is not
+the only way to specify the host device. ``usb-host`` supports
+the following properties:
+
+``hostbus=<nr>``
+  Specifies the bus number the device must be attached to
+``hostaddr=<nr>``
+  Specifies the device address the device was assigned by the host
+  operating system
+``hostport=<str>``
+  Specifies the physical port the device is attached to
+``vendorid=<hexnr>``
+  Specifies the vendor ID of the device
+``productid=<hexnr>``
+  Specifies the product ID of the device
+
+In theory you can combine all these properties as you like. In
+practice only a few combinations are useful:
+
+- ``vendorid`` and ``productid`` -- match a specific device, pass it to
+  the guest when it shows up somewhere in the host.
+
+- ``hostbus`` and ``hostport`` -- match a specific physical port in the
+  host, any device which is plugged in there gets passed to the
+  guest.
+
+- ``hostbus`` and ``hostaddr`` -- most useful for ad-hoc pass through as the
+  hostaddr isn't stable. The next time you plug the device into the host it
+  will get a new hostaddr.
+
+Note that on the host USB 1.1 devices are handled by UHCI/OHCI and USB
+2.0 devices by EHCI. That means different USB devices plugged into the
+very same physical port on the host may show up on different host buses
+depending on the speed. Supposing that devices plugged into a given
+physical port appear as bus 1 + port 1 for 2.0 devices and bus 3 + port 1
+for 1.1 devices, you can pass through any device plugged into that port
+and also assign it to the correct USB bus in QEMU like this:
+
+.. parsed-literal::
+
+   |qemu_system| -M pc [...] \\
+     -usb \\
+     -device usb-ehci,id=ehci \\
+     -device usb-host,bus=usb-bus.0,hostbus=3,hostport=1 \\
+     -device usb-host,bus=ehci.0,hostbus=1,hostport=1
+
+``usb-host`` properties for reset behavior
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The ``guest-reset`` and ``guest-resets-all`` properties control
+whether the guest is allowed to reset the physical USB device on the
+host. There are three cases:
+
+``guest-reset=false``
+  The guest is not allowed to reset the (physical) USB device.
+
+``guest-reset=true,guest-resets-all=false``
+  The guest is allowed to reset the device when it is not yet
+  initialized (i.e. no USB bus address has been assigned yet).
+  Usually this results in one guest reset being allowed. This is the
+  default behavior.
+
+``guest-reset=true,guest-resets-all=true``
+  The guest is allowed to reset the device as it pleases.
+
+The reason these options exist is that some USB devices are broken. In
+theory one should be able to reset (and re-initialize) USB devices at
+any time. In practice that may result in buggy USB device firmware
+crashing and the device not responding any more until you power-cycle
+(i.e. unplug and re-plug) it.
+
+What works best pretty much depends on the behavior of the specific
+USB device at hand, so it's a trial-and-error game. If the default
+doesn't work, try another option and see whether the situation
+improves.
+
+Recording USB transfers
+^^^^^^^^^^^^^^^^^^^^^^^
+
+All USB devices have support for recording the USB traffic. This can
+be enabled using the ``pcap=<file>`` property, for example:
+
+``-device usb-mouse,pcap=mouse.pcap``
+
+The pcap files are compatible with the Linux kernel's usbmon. Many
+tools, including ``wireshark``, can decode and inspect these trace
+files.
diff --git a/docs/system/devices/vhost-user-rng.rst b/docs/system/devices/vhost-user-rng.rst
new file mode 100644
index 00000000..a145d410
--- /dev/null
+++ b/docs/system/devices/vhost-user-rng.rst
@@ -0,0 +1,39 @@
+QEMU vhost-user-rng - RNG emulation
+===================================
+
+Background
+----------
+
+What follows builds on the material presented in vhost-user.rst - it should
+be reviewed before moving forward with the content in this file.
+
+Description
+-----------
+
+The vhost-user-rng device implementation was designed to work with a random
+number generator daemon such as the one found in the vhost-device crate of
+the rust-vmm project available on github [1].
+
+[1]. https://github.com/rust-vmm/vhost-device
+
+Examples
+--------
+
+The daemon should be started first:
+
+::
+
+  host# vhost-device-rng --socket-path=rng.sock -c 1 -m 512 -p 1000
+
+The QEMU invocation needs to create a chardev socket the device can
+use to communicate, as well as to share the guest's memory over a memfd.
+
+::
+
+  host# qemu-system \
+      -chardev socket,path=$(PATH)/rng.sock,id=rng0 \
+      -device vhost-user-rng-pci,chardev=rng0 \
+      -m 4096 \
+      -object memory-backend-file,id=mem,size=4G,mem-path=/dev/shm,share=on \
+      -numa node,memdev=mem \
+      ...
diff --git a/docs/system/devices/vhost-user.rst b/docs/system/devices/vhost-user.rst
new file mode 100644
index 00000000..86128114
--- /dev/null
+++ b/docs/system/devices/vhost-user.rst
@@ -0,0 +1,59 @@
+.. _vhost_user:
+
+vhost-user back ends
+--------------------
+
+vhost-user back ends are a way to service the requests of VirtIO devices
+outside of QEMU itself. To do this, a number of things are required.
+
+vhost-user device
+=================
+
+These are simple stub devices that ensure the VirtIO device is visible
+to the guest. The code is mostly boilerplate although each device has
+a ``chardev`` option which specifies the ID of the ``--chardev``
+device that connects via a socket to the vhost-user *daemon*.
+
+vhost-user daemon
+=================
+
+This is a separate process that QEMU connects to via a socket
+following the :ref:`vhost_user_proto`. A number of daemons can be built
+as part of the QEMU project when enabled, although any daemon that
+meets the specification for a given device can be used.
+
+Shared memory object
+====================
+
+In order for the daemon to access the VirtIO queues and process the
+requests, it needs access to the guest's address space.
+This is
+achieved via the ``memory-backend-file`` or ``memory-backend-memfd``
+objects. A reference to a file descriptor which can access this object
+will be passed via the socket as part of the protocol negotiation.
+
+Currently the shared memory object needs to match the size of the main
+system memory as defined by the ``-m`` argument.
+
+Example
+=======
+
+First start your daemon:
+
+.. parsed-literal::
+
+  $ virtio-foo --socket-path=/var/run/foo.sock $OTHER_ARGS
+
+Then you start your QEMU instance, specifying the device, chardev and
+memory objects:
+
+.. parsed-literal::
+
+  $ |qemu_system| \\
+      -m 4096 \\
+      -chardev socket,id=ba1,path=/var/run/foo.sock \\
+      -device vhost-user-foo,chardev=ba1,$OTHER_ARGS \\
+      -object memory-backend-memfd,id=mem,size=4G,share=on \\
+      -numa node,memdev=mem \\
+      ...
+
diff --git a/docs/system/devices/virtio-pmem.rst b/docs/system/devices/virtio-pmem.rst
new file mode 100644
index 00000000..c82ac067
--- /dev/null
+++ b/docs/system/devices/virtio-pmem.rst
@@ -0,0 +1,76 @@
+
+===========
+virtio pmem
+===========
+
+This document explains the setup and usage of the virtio pmem device.
+The virtio pmem device is a paravirtualized persistent memory device
+on regular (i.e. non-NVDIMM) storage.
+
+Use case
+--------
+
+Virtio pmem allows the guest to bypass its page cache and directly use
+the host page cache. This reduces the guest memory footprint as the host
+can make efficient memory reclaim decisions under memory pressure.
+
+How does virtio-pmem compare to the nvdimm emulation?
+-----------------------------------------------------
+
+NVDIMM emulation on regular (i.e. non-NVDIMM) host storage does not
+persist the guest writes as there are no defined semantics in the device
+specification. The virtio pmem device provides guest write persistence
+on non-NVDIMM host storage.
+
+virtio pmem usage
+-----------------
+
+A virtio pmem device backed by a memory-backend-file can be created on
+the QEMU command line as in the following example::
+
+  -object memory-backend-file,id=mem1,share,mem-path=./virtio_pmem.img,size=4G
+  -device virtio-pmem-pci,memdev=mem1,id=nv1
+
+where:
+
+  - "object memory-backend-file,id=mem1,share,mem-path=<image>,size=<image size>"
+    creates a backend file with the specified size.
+
+  - "device virtio-pmem-pci,id=nv1,memdev=mem1" creates a virtio pmem
+    PCI device whose storage is provided by the above memory backend device.
+
+Multiple virtio pmem devices can be created if multiple pairs of "-object"
+and "-device" are provided.
+
+Hotplug
+-------
+
+Virtio pmem devices can be hotplugged via the QEMU monitor. First, the
+memory backing has to be added via 'object_add'; afterwards, the virtio
+pmem device can be added via 'device_add'.
+
+For example, the following commands add another 4GB virtio pmem device to
+the guest::
+
+ (qemu) object_add memory-backend-file,id=mem2,share=on,mem-path=virtio_pmem2.img,size=4G
+ (qemu) device_add virtio-pmem-pci,id=virtio_pmem2,memdev=mem2
+
+Guest Data Persistence
+----------------------
+
+Guest data persistence on non-NVDIMM host storage requires guest userspace
+applications to perform fsync/msync. This is different from a real nvdimm
+backend where no additional fsync/msync is required. This is needed to
+persist guest writes to the host backing file, which otherwise remain in
+the host page cache, risking data loss in case of a power failure.
+
+With the virtio pmem device, the MAP_SYNC mmap flag is not supported. This
+provides a hint to the application to perform fsync/msync for write
+persistence.
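+
+As an illustration (assuming the guest kernel has virtio-pmem support and
+exposes the device as ``/dev/pmem0``), the device can be used inside the
+guest like a DAX-capable block device, with writes made persistent by an
+explicit fsync::
+
+  mkfs.ext4 /dev/pmem0
+  mount -o dax /dev/pmem0 /mnt
+  dd if=/dev/urandom of=/mnt/data bs=1M count=16 conv=fsync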
+
+Limitations
+-----------
+
+- A real nvdimm device backend is not supported.
+- virtio pmem hotunplug is not supported.
+- ACPI NVDIMM features like regions/namespaces are not supported.
+- The ndctl command is not supported.