 Booting from real channel-attached devices on s390x
 ===================================================

 s390 hardware IPL
 -----------------

 The s390 hardware IPL process consists of the following steps.

 1. A READ IPL ccw is constructed in memory location ``0x0``.
    This ccw, by definition, reads the IPL1 record which is located on the disk
    at cylinder 0 track 0 record 1. Note that the chain flag is on in this ccw
    so when it is complete another ccw will be fetched and executed from memory
    location ``0x08``.

 2. Execute the Read IPL ccw at ``0x00``, thereby reading IPL1 data into ``0x00``.
    IPL1 data is 24 bytes in length and consists of the following pieces of
    information: ``[psw][read ccw][tic ccw]``. When the machine executes the Read
    IPL ccw it reads the 24 bytes of IPL1 into memory starting at
    location ``0x0``. Then the ccw program at ``0x08`` which consists of a read
    ccw and a tic ccw is automatically executed because of the chain flag from
    the original READ IPL ccw. The read ccw will read the IPL2 data into memory
    and the TIC (Transfer In Channel) will transfer control to the channel
    program contained in the IPL2 data. The TIC channel command is the
    equivalent of a branch/jump/goto instruction for channel programs.

    NOTE: The ccws in IPL1 are defined by the architecture to be format 0.

 3. Execute IPL2.
    The TIC ccw instruction at the end of the IPL1 channel program will begin
    the execution of the IPL2 channel program. IPL2 is stage-2 of the boot
    process and will contain a larger channel program than IPL1. The point of
    IPL2 is to find and load either the operating system or a small program that
    loads the operating system from disk. At the end of this step all or some of
    the real operating system is loaded into memory and we are ready to hand
    control over to the guest operating system. At this point the guest
    operating system is entirely responsible for loading any more data it might
    need to function.

    NOTE: The IPL2 channel program might read data into memory
    location ``0x0`` thereby overwriting the IPL1 psw and channel program. This is OK
    as long as the data placed in location ``0x0`` contains a psw whose instruction
    address points to the guest operating system code to execute at the end of
    the IPL/boot process.

    NOTE: The ccws in IPL2 are defined by the architecture to be format 0.

 4. Start executing the guest operating system.
    The psw that was loaded into memory location ``0x0`` as part of the IPL process
    should contain the needed flags for the operating system we have loaded. The
    psw's instruction address will point to the location in memory where we want
    to start executing the operating system. This psw is loaded (via the LPSW
    instruction) causing control to be passed to the operating system code.
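
 The 24-byte IPL1 layout from step 2 can be sketched in C as follows. This is
 only an illustration: the ``struct ccw0`` type and the ``ccw0_decode`` and
 ``ipl1_decode`` helpers are hypothetical names, and the byte offsets assume the
 usual format-0 ccw layout (command code in byte 0, a 24-bit data address in
 bytes 1-3, flags in byte 4, and a 16-bit count in bytes 6-7, all big-endian).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CCW_CMD_TIC  0x08  /* Transfer In Channel */
#define CCW_FLAG_CC  0x40  /* command chaining flag */
#define IPL1_LEN     24    /* psw (8) + read ccw (8) + tic ccw (8) */

/* A decoded format-0 ccw (host-endian fields, for illustration only). */
struct ccw0 {
    uint8_t  cmd;    /* byte 0: command code */
    uint32_t cda;    /* bytes 1-3: 24-bit data address */
    uint8_t  flags;  /* byte 4: flags */
    uint16_t count;  /* bytes 6-7: byte count */
};

/* Decode one big-endian format-0 ccw from 8 raw bytes. */
static struct ccw0 ccw0_decode(const uint8_t *p)
{
    struct ccw0 c;

    c.cmd   = p[0];
    c.cda   = ((uint32_t)p[1] << 16) | ((uint32_t)p[2] << 8) | p[3];
    c.flags = p[4];
    c.count = (uint16_t)((p[6] << 8) | p[7]);
    return c;
}

/* Split a 24-byte IPL1 record into its read ccw and its tic ccw. */
static void ipl1_decode(const uint8_t ipl1[IPL1_LEN],
                        struct ccw0 *read_ccw, struct ccw0 *tic_ccw)
{
    assert(ipl1 && read_ccw && tic_ccw);
    *read_ccw = ccw0_decode(ipl1 + 8);   /* bytes 8-15 follow the psw */
    *tic_ccw  = ccw0_decode(ipl1 + 16);  /* bytes 16-23 */
}
```

 After decoding, ``tic_ccw->cda`` holds the address at which the IPL2 channel
 program is expected to start.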

 In a non-virtualized environment this process, handled entirely by the hardware,
 is kicked off by the user initiating a "Load" procedure from the hardware
 management console. This "Load" procedure crafts a special "Read IPL" ccw in
 memory location ``0x0`` that reads IPL1. It then executes this ccw thereby kicking
 off the reading of IPL1 data. Since the channel program from IPL1 will be
 written immediately after the special "Read IPL" ccw, the IPL1 channel program
 will be executed immediately (the special read ccw has the chaining bit turned
 on). The TIC at the end of the IPL1 channel program will cause the IPL2 channel
 program to be executed automatically. After this sequence completes the "Load"
 procedure then loads the psw from ``0x0``.

 How this all pertains to QEMU (and the kernel)
 ----------------------------------------------

 In theory we should merely have to do the following to IPL/boot a guest
 operating system from a DASD device:

 1. Place a "Read IPL" ccw into memory location ``0x0`` with chaining bit on.
 2. Execute channel program at ``0x0``.
 3. LPSW ``0x0``.
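
 As a rough sketch of step 1, the "Read IPL" ccw could be built like this in a
 byte buffer that stands in for guest memory. The function name
 ``make_read_ipl`` is hypothetical, and the constants are assumptions: ``0x02``
 for the Read IPL command code and ``0x40`` for the command-chaining flag of a
 format-0 ccw.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CCW_CMD_READ_IPL 0x02  /* assumed Read IPL command code */
#define CCW_FLAG_CC      0x40  /* assumed command-chaining flag */
#define IPL1_LEN         24    /* number of bytes in the IPL1 record */

/*
 * Write a format-0 Read IPL ccw at offset 0 of a simulated guest memory.
 * If "chain" is nonzero the chaining bit is set, so the ccw at offset 8
 * would be fetched and executed next.
 */
static void make_read_ipl(uint8_t *mem, int chain)
{
    assert(mem);
    memset(mem, 0, 8);                 /* clear the 8-byte ccw slot */
    mem[0] = CCW_CMD_READ_IPL;         /* byte 0: command code */
    /* bytes 1-3: data address 0x000000, already zero */
    mem[4] = chain ? CCW_FLAG_CC : 0;  /* byte 4: flags */
    mem[6] = 0;                        /* bytes 6-7: count, big-endian */
    mem[7] = IPL1_LEN;
}
```

 Calling ``make_read_ipl(mem, 1)`` corresponds to the hardware behavior described
 above; passing ``0`` produces the same ccw without chaining.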

 However, our emulation of the machine's channel program logic within the kernel
 is missing one key feature that is required for this process to work:
 non-prefetch of ccw data.

 When we start a channel program we pass the channel subsystem parameters via an
 ORB (Operation Request Block). One of those parameters is a prefetch bit. If the
 bit is on then the vfio-ccw kernel driver is allowed to read the entire channel
 program from guest memory before it starts executing it. This means that any
 channel commands that read additional channel commands will not work as expected
 because the newly read commands will only exist in guest memory and NOT within
 the kernel's channel subsystem memory. The kernel vfio-ccw driver currently
 requires this bit to be on for all channel programs. This is a problem because
 the IPL process consists of transferring control from the "Read IPL" ccw
 immediately to the IPL1 channel program that was read by "Read IPL".

 Not being able to turn off prefetch will also prevent the TIC at the end of the
 IPL1 channel program from transferring control to the IPL2 channel program.

 Lastly, in some cases (the zipl bootloader for example) the IPL2 program also
 transfers control to another channel program segment immediately after reading
 it from the disk. So we need to be able to handle this case.
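
 The effect of prefetch described in this section can be illustrated with a toy
 model. This is not how vfio-ccw is implemented; every name here
 (``run_prefetch``, ``run_no_prefetch``, ``CMD_MARK`` and so on) is made up for
 the example. A one-ccw program reads 8 bytes from a fake device into the slot
 right after itself, and the chaining flag asks for that slot to be executed
 next.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MEM_SIZE  64
#define MAX_CCWS  (MEM_SIZE / 8)
#define CMD_READ  0x02  /* toy: copy device data into guest memory */
#define CMD_MARK  0x7f  /* toy: stands in for a ccw that was just read */
#define FLAG_CC   0x40  /* command chaining */

/* Execute one toy ccw; returns nonzero if chaining continues. */
static int exec_ccw(const uint8_t *ccw, uint8_t *mem, const uint8_t *dev,
                    uint8_t *last_cmd)
{
    uint32_t cda = ((uint32_t)ccw[1] << 16) | ((uint32_t)ccw[2] << 8) | ccw[3];
    uint16_t count = (uint16_t)((ccw[6] << 8) | ccw[7]);
    int chain = ccw[4] & FLAG_CC;  /* sample the flag before memory changes */

    *last_cmd = ccw[0];
    if (ccw[0] == CMD_READ && cda + count <= MEM_SIZE) {
        memcpy(mem + cda, dev, count);  /* data always lands in guest memory */
    }
    return chain;
}

/* Non-prefetch: each ccw is fetched from guest memory right before it runs. */
static uint8_t run_no_prefetch(uint8_t *mem, const uint8_t *dev)
{
    uint8_t last = 0;

    for (size_t i = 0; i < MAX_CCWS; i++) {
        if (!exec_ccw(mem + 8 * i, mem, dev, &last)) {
            break;
        }
    }
    return last;
}

/* Prefetch: the whole program is copied first and then run from the copy. */
static uint8_t run_prefetch(uint8_t *mem, const uint8_t *dev)
{
    uint8_t snapshot[MEM_SIZE];
    uint8_t last = 0;

    memcpy(snapshot, mem, MEM_SIZE);
    for (size_t i = 0; i < MAX_CCWS; i++) {
        if (!exec_ccw(snapshot + 8 * i, mem, dev, &last)) {
            break;
        }
    }
    return last;
}
```

 Without prefetch the last command executed is the freshly read ``CMD_MARK``.
 With prefetch the executor runs the stale copy of that slot instead, even
 though guest memory now contains the new ccw.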

 What QEMU does
 --------------

 Since we are forced to live with prefetch we cannot use the very simple IPL
 procedure we defined in the preceding section. So we compensate by doing the
 following.

 1. Place "Read IPL" ccw into memory location ``0x0``, but turn off chaining bit.
 2. Execute "Read IPL" at ``0x0``.

    So now IPL1's psw is at ``0x0`` and IPL1's channel program is at ``0x08``.

 3. Write a custom channel program that will seek to the IPL2 record and then
    execute the READ and TIC ccws from IPL1.  Normally the seek is not required
    because after reading the IPL1 record the disk is automatically positioned
    to read the very next record which will be IPL2. But since we are not reading
    both IPL1 and IPL2 as part of the same channel program we must manually set
    the position.

 4. Grab the target address of the TIC instruction from the IPL1 channel program.
    This address is where the IPL2 channel program starts.

    Now IPL2 is loaded into memory somewhere, and we know the address.

 5. Execute the IPL2 channel program at the address obtained in step #4.

    Because this channel program can be dynamic, we must use a special algorithm
    that detects a READ immediately followed by a TIC and breaks the ccw chain
    by turning off the chain bit in the READ ccw. When control is returned from
    the kernel/hardware to the QEMU bios code we immediately issue another start
    subchannel to execute the remaining TIC instruction. This causes the entire
    channel program (starting from the TIC) and all needed data to be refetched
    thereby stepping around the limitation that would otherwise prevent this
    channel program from executing properly.

    Now the operating system code is loaded somewhere in guest memory and the psw
    in memory location ``0x0`` will point to entry code for the guest operating
    system.

 6. LPSW ``0x0``

    LPSW transfers control to the guest operating system and we're done.
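
 The chain-breaking idea from step 5 might look roughly like the sketch below.
 It is not QEMU's actual bios code: the function names are hypothetical, guest
 memory is modeled as a plain byte buffer, and the read command codes
 (``0x06`` and ``0x86``) are assumptions. The sketch reports the TIC's target
 address, on the assumption that resuming there is equivalent to executing
 the TIC. It also ignores a TIC that appears anywhere other than directly
 after a chained READ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CCW_SIZE         8
#define CCW_CMD_TIC      0x08
#define CCW_CMD_READ     0x06  /* assumed DASD "read data" command code */
#define CCW_CMD_READ_MT  0x86  /* assumed multitrack variant */
#define CCW_FLAG_CC      0x40  /* command chaining */

/* Extract the 24-bit big-endian data address of a format-0 ccw. */
static uint32_t ccw_cda(const uint8_t *ccw)
{
    return ((uint32_t)ccw[1] << 16) | ((uint32_t)ccw[2] << 8) | ccw[3];
}

static bool is_read(uint8_t cmd)
{
    return cmd == CCW_CMD_READ || cmd == CCW_CMD_READ_MT;
}

/*
 * Walk the ccw chain starting at offset cpa. At the first READ that is
 * chained to a TIC, clear the READ's chain flag, store the TIC's target
 * address in *next_cpa and return true. Return false if the chain ends
 * (or memory runs out) before such a pair is found.
 */
static bool break_read_tic_chain(uint8_t *mem, size_t mem_size,
                                 uint32_t cpa, uint32_t *next_cpa)
{
    assert(mem && next_cpa);
    for (size_t off = cpa; off + 2 * CCW_SIZE <= mem_size; off += CCW_SIZE) {
        uint8_t *cur = mem + off;
        const uint8_t *next = cur + CCW_SIZE;

        if (!(cur[4] & CCW_FLAG_CC)) {
            return false;                      /* end of the chain */
        }
        if (is_read(cur[0]) && next[0] == CCW_CMD_TIC) {
            cur[4] &= (uint8_t)~CCW_FLAG_CC;   /* stop after the READ */
            *next_cpa = ccw_cda(next);
            return true;
        }
    }
    return false;
}
```

 A caller would run the (now shortened) program, and if the function returned
 true, repeat the same procedure starting at ``next_cpa`` until it returns
 false.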