| ## @file | |
| # | |
| # Technical notes for the virtio-net driver. | |
| # | |
| # Copyright (C) 2013, Red Hat, Inc. | |
| # | |
| # This program and the accompanying materials are licensed and made available | |
| # under the terms and conditions of the BSD License which accompanies this | |
| # distribution. The full text of the license may be found at | |
| # http://opensource.org/licenses/bsd-license.php | |
| # | |
| # THE PROGRAM IS DISTRIBUTED UNDER THE BSD LICENSE ON AN "AS IS" BASIS, WITHOUT | |
| # WARRANTIES OR REPRESENTATIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED. | |
| # | |
| ## | |
| Disclaimer | |
| ---------- | |
| All statements concerning standards and specifications are informative and not | |
| normative. They are made in good faith. Corrections are most welcome on the | |
| edk2-devel mailing list. | |
| The following documents have been perused while writing the driver and this | |
| document: | |
| - Unified Extensible Firmware Interface Specification, Version 2.3.1, Errata C; | |
| June 27, 2012 | |
| - Driver Writer's Guide for UEFI 2.3.1, 03/08/2012, Version 1.01; | |
| - Virtio PCI Card Specification, v0.9.5 DRAFT, 2012 May 7. | |
| Summary | |
| ------- | |
| The VirtioNetDxe UEFI_DRIVER implements the Simple Network Protocol for | |
| virtio-net devices. Higher level protocols are automatically installed on top | |
| of it by the DXE Core / the ConnectController() boot service, enabling for | |
| virtio-net devices eg. DHCP configuration, TCP transfers with edk2 StdLib | |
| applications, and PXE booting in OVMF. | |
| UEFI driver structure | |
| --------------------- | |
| A driver instance, belonging to a given virtio-net device, can be in one of | |
| four states at any time. The states stack up as follows below. The state | |
| transitions are labeled with the primary function (and its important callees | |
| faithfully indented) that implement the transition. | |
| | ^ | |
| | | | |
| [DriverBinding.c] | | [DriverBinding.c] | |
| VirtioNetDriverBindingStart | | VirtioNetDriverBindingStop | |
| VirtioNetSnpPopulate | | VirtioNetSnpEvacuate | |
| VirtioNetGetFeatures | | | |
| v | | |
| +-------------------------+ | |
| | EfiSimpleNetworkStopped | | |
| +-------------------------+ | |
| | ^ | |
| [SnpStart.c] | | [SnpStop.c] | |
| VirtioNetStart | | VirtioNetStop | |
| | | | |
| v | | |
| +-------------------------+ | |
| | EfiSimpleNetworkStarted | | |
| +-------------------------+ | |
| | ^ | |
| [SnpInitialize.c] | | [SnpShutdown.c] | |
| VirtioNetInitialize | | VirtioNetShutdown | |
| VirtioNetInitRing {Rx, Tx} | | VirtioNetShutdownRx [SnpSharedHelpers.c] | |
| VirtioRingInit | | VirtioNetShutdownTx [SnpSharedHelpers.c] | |
| VirtioNetInitTx | | VirtioRingUninit {Tx, Rx} | |
| VirtioNetInitRx | | | |
| v | | |
| +-----------------------------+ | |
| | EfiSimpleNetworkInitialized | | |
| +-----------------------------+ | |
| The state at the top means "nonexistent" and is hence unnamed on the diagram -- | |
| a driver instance actually doesn't exist at that point. The transition | |
| functions out of and into that state implement the Driver Binding Protocol. | |
| The lower three states characterize an existent driver instance and are all | |
| states defined by the Simple Network Protocol. The transition functions between | |
| them are member functions of the Simple Network Protocol. | |
| Each transition function validates its expected source state and its | |
| parameters. For example, VirtioNetDriverBindingStop will refuse to disconnect | |
| from the controller unless it's in EfiSimpleNetworkStopped. | |
| Driver instance states (Simple Network Protocol) | |
| ------------------------------------------------ | |
| In the EfiSimpleNetworkStopped state, the virtio-net device is (has been) | |
| re-set. No resources are allocated for networking / traffic purposes. The MAC | |
| address and other device attributes have been retrieved from the device (this | |
| is necessary for completing the VirtioNetDriverBindingStart transition). | |
| The EfiSimpleNetworkStarted is completely identical to the | |
| EfiSimpleNetworkStopped state for virtio-net, in the functional and | |
| resource-usage sense. This state is mandated / provided by the Simple Network | |
| Protocol for flexibility that the virtio-net driver doesn't exploit. | |
| In particular, the EfiSimpleNetworkStarted state is the target of the Shutdown | |
| SNP member function, and must therefore correspond to a hardware configuration | |
| where "[it] is safe for another driver to initialize". (Clearly another UEFI | |
| driver could not do that due to the exclusivity of the driver binding that | |
| VirtioNetDriverBindingStart() installs, but a later OS driver might qualify.) | |
| The EfiSimpleNetworkInitialized state is the live state of the virtio NIC / the | |
| driver instance. Virtio and other resources required for network traffic have | |
| been allocated, and the following SNP member functions are available (in | |
| addition to VirtioNetShutdown which leaves the state): | |
| - VirtioNetReceive [SnpReceive.c]: poll the virtio NIC for an Rx packet that | |
| may have arrived asynchronously; | |
| - VirtioNetTransmit [SnpTransmit.c]: queue a Tx packet for asynchronous | |
| transmission (meant to be used together with VirtioNetGetStatus); | |
| - VirtioNetGetStatus [SnpGetStatus.c]: query link status and status of pending | |
| Tx packets; | |
| - VirtioNetMcastIpToMac [SnpMcastIpToMac.c]: transform a multicast IPv4/IPv6 | |
| address into a multicast MAC address; | |
| - VirtioNetReceiveFilters [SnpReceiveFilters.c]: emulate unicast / multicast / | |
| broadcast filter configuration (not their actual effect -- a more liberal | |
| filter setting than requested is allowed by the UEFI specification). | |
| The following SNP member functions are not supported [SnpUnsupported.c]: | |
| - VirtioNetReset: reinitialize the virtio NIC without shutting it down (a loop | |
| from/to EfiSimpleNetworkInitialized); | |
| - VirtioNetStationAddress: assign a new MAC address to the virtio NIC, | |
| - VirtioNetStatistics: collect statistics, | |
| - VirtioNetNvData: access non-volatile data on the virtio NIC. | |
| Missing support for these functions is allowed by the UEFI specification and | |
| doesn't seem to trip up higher level protocols. | |
| Events and task priority levels | |
| ------------------------------- | |
| The UEFI specification defines a sophisticated mechanism for asynchronous | |
| events / callbacks (see "6.1 Event, Timer, and Task Priority Services" for | |
| details). Such callbacks work like software interrupts, and some notion of | |
| locking / masking is important to implement critical sections (atomic or | |
| exclusive access to data or a device). This notion is defined as Task Priority | |
| Levels. | |
| The virtio-net driver for OVMF must concern itself with events for two reasons: | |
| - The Simple Network Protocol provides its clients with a (non-optional) WAIT | |
| type event called WaitForPacket: it allows them to check or wait for Rx | |
| packets by polling or blocking on this event. (This functionality overlaps | |
| with the Receive member function.) The event is available to clients starting | |
| with EfiSimpleNetworkStopped (inclusive). | |
| The virtio-net driver is informed about such client polling or blockage by | |
| receiving an asynchronous callback (a software interrupt). In the callback | |
| function the driver must interrogate the driver instance state, and if it is | |
| EfiSimpleNetworkInitialized, access the Rx queue and see if any packets are | |
| available for consumption. If so, it must signal the WaitForPacket WAIT type | |
| event, waking the client. | |
| For simplicity and safety, all parts of the virtio-net driver that access any | |
| bit of the driver instance (data or device) run at the TPL_CALLBACK level. | |
| This is the highest level allowed for an SNP implementation, and all code | |
| protected in this manner satisfies even stricter non-blocking requirements | |
| than what's documented for TPL_CALLBACK. | |
| The task priority level for the WaitForPacket callback too is set by the | |
| driver, the choice is TPL_CALLBACK again. This in effect serializes the | |
| WaitForPacket callback (VirtioNetIsPacketAvailable [Events.c]) with "normal" | |
| parts of the driver. | |
| - According to the Driver Writer's Guide, a network driver should install a | |
| callback function for the global EXIT_BOOT_SERVICES event (a special NOTIFY | |
| type event). When the ExitBootServices() boot service has cleaned up internal | |
| firmware state and is about to pass control to the OS, any network driver has | |
| to stop any in-flight DMA transfers, lest it corrupts OS memory. For this | |
| reason EXIT_BOOT_SERVICES is emitted and the network driver must abort | |
| in-flight DMA transfers. | |
| This callback (VirtioNetExitBoot) is synchronized with the rest of the driver | |
| code just the same as explained for WaitForPacket. In | |
| EfiSimpleNetworkInitialized state it resets the virtio NIC, halting all data | |
| transfer. After the callback returns, no further driver code is expected to | |
| be scheduled. | |
| Virtio internals -- Rx | |
| ---------------------- | |
| Requests (Rx and Tx alike) are always submitted by the guest and processed by | |
| the host. For Tx, processing means transmission. For Rx, processing means | |
| filling in the request with an incoming packet. Submitted requests exist on the | |
| "Available Ring", and answered (processed) requests show up on the "Used Ring". | |
| Packet data includes the media (Ethernet) header: destination MAC, source MAC, | |
| and Ethertype (14 bytes total). | |
| The following structures implement packet reception. Most of them are defined | |
| in the Virtio specification, the only driver-specific trait here is the static | |
| pre-configuration of the two-part descriptor chains, in VirtioNetInitRx. The | |
| diagram is simplified. | |
| Available Index Available Index | |
| last processed incremented | |
| by the host by the guest | |
| v -------> v | |
| Available +-------+-------+-------+-------+-------+ | |
| Ring |DescIdx|DescIdx|DescIdx|DescIdx|DescIdx| | |
| +-------+-------+-------+-------+-------+ | |
| =D6 =D2 | |
| D2 D3 D4 D5 D6 D7 | |
| Descr. +----------+----------++----------+----------++----------+----------+ | |
| Table |Adr:Len:Nx|Adr:Len:Nx||Adr:Len:Nx|Adr:Len:Nx||Adr:Len:Nx|Adr:Len:Nx| | |
| +----------+----------++----------+----------++----------+----------+ | |
| =A2 =D3 =A3 =A4 =D5 =A5 =A6 =D7 =A7 | |
| A2 A3 A4 A5 A6 A7 | |
| Receive +---------------+---------------+---------------+ | |
| Destination |vnet hdr:packet|vnet hdr:packet|vnet hdr:packet| | |
| Area +---------------+---------------+---------------+ | |
| Used Index Used Index incremented | |
| last processed by the guest by the host | |
| v -------> v | |
| Used +-----------+-----------+-----------+-----------+-----------+ | |
| Ring |DescIdx:Len|DescIdx:Len|DescIdx:Len|DescIdx:Len|DescIdx:Len| | |
| +-----------+-----------+-----------+-----------+-----------+ | |
| =D4 | |
| In VirtioNetInitRx, the guest allocates the fixed size Receive Destination | |
| Area, which accommodates all packets delivered asynchronously by the host. To | |
| each packet, a slice of this area is dedicated; each slice is further | |
| subdivided into virtio-net request header and network packet data. The | |
| (guest-physical) addresses of these sub-slices are denoted with A2, A3, A4 and | |
| so on. Importantly, an even-subscript "A" always belongs to a virtio-net | |
| request header, while an odd-subscript "A" always belongs to a packet | |
| sub-slice. | |
| Furthermore, the guest lays out a static pattern in the Descriptor Table. For | |
| each packet that can be in-flight or already arrived from the host, | |
| VirtioNetInitRx sets up a separate, two-part descriptor chain. For packet N, | |
| the Nth descriptor chain is set up as follows: | |
| - the first (=head) descriptor, with even index, points to the fixed-size | |
| sub-slice receiving the virtio-net request header, | |
| - the second descriptor (with odd index) points to the fixed (1514 byte) size | |
| sub-slice receiving the packet data, | |
| - a link from the first (head) descriptor in the chain is established to the | |
| second (tail) descriptor in the chain. | |
| Finally, the guest populates the Available Ring with the indices of the head | |
| descriptors. All descriptor indices on both the Available Ring and the Used | |
| Ring are even. | |
| Packet reception occurs as follows: | |
| - The host consumes a descriptor index off the Available Ring. This index is | |
| even (=2*N), and fingers the head descriptor of the chain belonging to packet | |
| N. | |
| - The host reads the descriptors D(2*N) and -- following the Next link there | |
| --- D(2*N+1), and stores the virtio-net request header at A(2*N), and the | |
| packet data at A(2*N+1). | |
| - The host places the index of the head descriptor, 2*N, onto the Used Ring, | |
| and sets the Len field in the same Used Ring Element to the total number of | |
| bytes transferred for the entire descriptor chain. This enables the guest to | |
| identify the length of Rx packets. | |
| - VirtioNetReceive polls the Used Ring. If a new Used Ring Element shows up, it | |
| copies the data out to the caller, and recycles the index of the head | |
| descriptor (ie. 2*N) to the Available Ring. | |
| - Because the host can process (answer) Rx requests in any order theoretically, | |
| the order of head descriptor indices on each of the Available Ring and the | |
| Used Ring is virtually random. (Except right after the initial population in | |
| VirtioNetInitRx, when the Available Ring is full and increasing, and the Used | |
| Ring is empty.) | |
| - If the Available Ring is empty, the host is forced to drop packets. If the | |
| Used Ring is empty, VirtioNetReceive returns EFI_NOT_READY (no packet | |
| available). | |
| Virtio internals -- Tx | |
| ---------------------- | |
| The transmission structure erected by VirtioNetInitTx is similar, it differs | |
| in the following: | |
| - There is no Receive Destination Area. | |
| - Each head descriptor, D(2*N), points to a read-only virtio-net request header | |
| that is shared by all of the head descriptors. This virtio-net request header | |
| is never modified by the host. | |
| - Each tail descriptor is re-pointed to the caller-supplied packet buffer | |
| whenever VirtioNetTransmit places the corresponding head descriptor on the | |
| Available Ring. The caller is responsible to hang on to the unmodified buffer | |
| until it is reported transmitted by VirtioNetGetStatus. | |
| Steps of packet transmission: | |
| - Client code calls VirtioNetTransmit. VirtioNetTransmit tracks free descriptor | |
| chains by keeping the indices of their head descriptors in a stack that is | |
| private to the driver instance. All elements of the stack are even. | |
| - If the stack is empty (that is, each descriptor chain, in isolation, is | |
| either pending transmission, or has been processed by the host but not | |
| yet recycled by a VirtioNetGetStatus call), then VirtioNetTransmit returns | |
| EFI_NOT_READY. | |
| - Otherwise the index of a free chain's head descriptor is popped from the | |
| stack. The linked tail descriptor is re-pointed as discussed above. The head | |
| descriptor's index is pushed on the Available Ring. | |
| - The host moves the head descriptor index from the Available Ring to the Used | |
| Ring when it transmits the packet. | |
| - Client code calls VirtioNetGetStatus. In case the Used Ring is empty, the | |
| function reports no Tx completion. Otherwise, a head descriptor's index is | |
| consumed from the Used Ring and recycled to the private stack. The client | |
| code's original packet buffer address is fetched from the tail descriptor | |
| (where it has been stored at VirtioNetTransmit time) and returned to the | |
| caller. | |
| - The Len field of the Used Ring Element is not checked. The host is assumed to | |
| have transmitted the entire packet -- VirtioNetTransmit had forced it below | |
| 1514 bytes (inclusive). The Virtio specification suggests this packet size is | |
| always accepted (and a lower MTU could be encountered on any later hop as | |
| well). Additionally, there's no good way to report a short transmit via | |
| VirtioNetGetStatus; EFI_DEVICE_ERROR seems too serious from the specification | |
| and higher level protocols could interpret it as a fatal condition. | |
| - The host can theoretically reorder head descriptor indices when moving them | |
| from the Available Ring to the Used Ring (out of order transmission). Because | |
| of this (and the choice of a stack over a list for free descriptor chain | |
| tracking) the order of head descriptor indices on either Ring is | |
| unpredictable. |