Any resources on VLIW?

Discussion:

Anne & Lynn Wheeler

2006-07-20 19:02:18 UTC

Might be relevant if Lynn Wheeler could expand on the unreleased VAMPS
microcode to speed up 370 SMP, and also provided logical processors
with similarities to those on current zSeries LPARs, although that may
just have dropped parts of 370 sequential code down into microcode.

so presumably this recent post vis-a-vis vamps and the later i432
http://www.garlic.com/~lynn/2006n.html#42 Why is zSeries so CPU poor?

misc. collected past vamps postings
http://www.garlic.com/~lynn/subtopic.html#bounce

early microcode effort was "VMA", originally for 370/158, that helped
virtual machine performance. for subset of "supervisor" state
instructions, microcode was added to execute the instruction using
"virtual machine" rules (to avoid interrupting into the virtual
machine hypervisor where the instruction was simulated).

concurrent with VAMPS effort was "ECPS" for 370 138&148. ECPS did some
more stuff like VMA on the 158 (direct supervisor state instruction
execution) ... but it also identified parts of the hypervisor kernel
and moved that kernel code into microcode. the issue on 138&148
machines was that there was an avg. of 10:1 microcode instructions
executed for every 370 instruction. Much of the kernel code moved to
microcode on a straight 1:1 basis, resulting in a ten times performance
speed up. old posting identifying specific kernel code segments for
migrating into microcode.
http://www.garlic.com/~lynn/94.html#21 370 ECPS VM microcode assist
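
The arithmetic behind the ten times figure can be sketched as follows.
This is only an illustration: the 10:1 ratio is the average quoted
above, while the function names and the 500-instruction path length
are made up for the example, and it assumes every 370 instruction in
the path maps to exactly one microcode instruction.

```python
# Illustrative arithmetic only. The 10:1 ratio is the average quoted in
# the post; the function names and path length are hypothetical.
MICRO_PER_370 = 10  # avg. microcode instructions executed per 370 instruction


def micro_ops_as_370_code(n_370_instructions, ratio=MICRO_PER_370):
    """Microcode instructions executed when the path runs as 370 code."""
    return n_370_instructions * ratio


def micro_ops_as_microcode(n_370_instructions):
    """Microcode instructions executed after a straight 1:1 move."""
    return n_370_instructions


def speedup(n_370_instructions, ratio=MICRO_PER_370):
    """Ratio of work before the move to work after the move."""
    before = micro_ops_as_370_code(n_370_instructions, ratio)
    after = micro_ops_as_microcode(n_370_instructions)
    return before / after


print(speedup(500))  # 10.0
```

The speedup applies only to the kernel paths that were moved, not to
the whole system.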

the VMA-related efforts eventually evolved into SIE ... where nearly
all supervisor state instructions had microcode enhancement for
directly executing with regard to virtual machine rules (avoiding a
lot of interruption into virtual machine hypervisor to simulate
supervisor state instructions). SIE was a state change instruction
that gathered up all the fields needed by various supervisor state
instructions to execute according to "virtual machine" rules. post of
old SIE discussion about implementation issue differences between 3081
and "trout" (3090)
http://www.garlic.com/~lynn/2006j.html#27 virtual memory

there were still things like page faults for the virtual machine that
resulted in interruptions into the hypervisor kernel for handling. a
special case was defined involving things like dedicated real storage
for a virtual machine ... eliminating need to interrupt into the
hypervisor kernel. This resulted in being able to operate a virtual
machine subset directly supported by hardware ... w/o the need for a
virtual machine kernel. This was called "PR/SM" ... and PR/SM
capability eventually evolved into the current LPARs (logical
partitions). a reference discussing some current LPAR and PR/SM
http://researchweb.watson.ibm.com/journal/rd/483/siegel.html

current machines can have a configurable limited number of LPARs ...
and it is possible to run a virtual machine hypervisor in an LPAR,
which in turn supports a much larger number of virtual machines. There
has been an evolution of the SIE support. Initially, SIE was not
virtualized but LPARs make use of SIE for support. That meant that a
virtual machine hypervisor running in an LPAR wouldn't have
performance assist of SIE for running its virtual machines (all
virtual machine supervisor instructions would interrupt into the
hypervisor for simulation). Enhancements were required to virtualize
SIE for at least one level (so it could be used both by LPAR function
and also by hypervisor running in an LPAR).

Since I was doing both VAMPS and ECPS ... I borrowed a lot of stuff
done for ECPS for doing VAMPS. However, for VAMPS, I wanted it
extended in a much more architected way ... rather than simply doing a
1-for-1 movement of existing kernel 370 code into microcode. VAMPS was
to have up to five processors ... and I defined a microcode hardware
queued work interface where the hypervisor put units of work on the
queued work interface (and the microcode took the queued work and
executed it on whatever processors were available). The hardware
microcode also placed queued work for the hypervisor to handle ...
like things that were i/o interrupts in traditional 370 or page fault
interrupts (from executing virtual machines), etc.
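
A minimal model of the two directions of queued work described above
might look like the following. It is a sketch under stated
assumptions, not the VAMPS design: the class, method, and queue names
are hypothetical, and real microcode would not be structured as Python
objects.

```python
from collections import deque


class QueuedWorkInterface:
    """Hypothetical model of the queued work interface (names invented)."""

    def __init__(self, n_processors=5):
        self.dispatch_queue = deque()    # hypervisor -> microcode
        self.hypervisor_queue = deque()  # microcode -> hypervisor
        self.idle_processors = list(range(n_processors))

    def hypervisor_enqueue(self, work):
        """Hypervisor puts a unit of work on the interface."""
        self.dispatch_queue.append(work)

    def microcode_dispatch(self):
        """Microcode hands queued work to whatever processors are idle."""
        assignments = []
        while self.dispatch_queue and self.idle_processors:
            cpu = self.idle_processors.pop(0)
            assignments.append((cpu, self.dispatch_queue.popleft()))
        return assignments

    def microcode_post_event(self, cpu, event):
        """Processor stops its work (page fault, i/o) and queues an event."""
        self.hypervisor_queue.append(event)
        self.idle_processors.append(cpu)

    def hypervisor_next_event(self):
        """Hypervisor takes the next queued event, or None if there is none."""
        return self.hypervisor_queue.popleft() if self.hypervisor_queue else None


qwi = QueuedWorkInterface(n_processors=2)
for vm in ("vm-a", "vm-b", "vm-c"):
    qwi.hypervisor_enqueue(vm)

first = qwi.microcode_dispatch()        # two processors take two units
qwi.microcode_post_event(0, ("page_fault", "vm-a"))
second = qwi.microcode_dispatch()       # freed processor takes the third
event = qwi.hypervisor_next_event()
```

The point of the shape is that neither side interrupts the other
directly; each only adds to or removes from a queue.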

The VAMPS abstraction of queued work for multiprocessor environment
was somewhat akin to the definition found later in i432. Some of
the VAMPS abstraction for i/o work queueing was somewhat akin to what
showed up later for 370-xa i/o operations.

After VAMPS was killed, I adapted the multiprocessing microcode queued
processing to a software implementation. A lot of the SMP kernel
implementations used a single, global kernel SPIN lock to serialize
all kernel execution. This drastically minimized the amount of code
changes to adapt a single-processor operating system to support a
multiprocessor operation.

In adapting the VAMPS multiprocessing microcode support to software, I
took the equivalent kernel software functions (that had been moved to
microcode in VAMPS) and made them multiprocessing parallelized with
fine-grain locking. This amounted to the majority of the software
kernel execution time ... but a relatively small amount of the total
kernel instructions. The majority of the kernel instructions relied on
a somewhat traditional global kernel lock. However, whenever the
"parallelized" kernel code needed to transition into the "sequential"
kernel code ... rather than "spinning" on the global kernel lock
... it "bounced". If it obtained the global kernel lock, then it
proceeded as normal. If it couldn't obtain the global kernel lock, it
would queue a super lightweight work request ... and go off and look
for other "parallelized" work.

This approach obtained almost all the thruput benefit of having a
kernel fine-grain locking implementation, avoided the degradation of
single kernel spin-lock implementation ... but the kernel code changes
were not significantly more than required for a single kernel
spin-lock implementation. This implementation shipped in VM370 release
four.
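
The "bounce" idea above can be sketched with a non-blocking lock
attempt. This is an illustration only, not the VM370 code: the names
are hypothetical, and having the lock holder drain the queued requests
before releasing the lock is an assumption the post does not spell
out.

```python
import threading
from collections import deque

global_kernel_lock = threading.Lock()  # guards the "sequential" kernel code
deferred = deque()                     # lightweight queued work requests


def enter_sequential_kernel(request, handler):
    """Try the global lock once; never spin.

    Returns True if the request ran under the lock, False if it
    "bounced" (was queued for later) so the caller can go look for
    other parallelized work.
    """
    if not global_kernel_lock.acquire(blocking=False):
        deferred.append(request)  # bounce: queue and return immediately
        return False
    try:
        handler(request)
        # Assumption: the lock holder also runs requests queued by
        # processors that bounced while it held the lock.
        while deferred:
            handler(deferred.popleft())
    finally:
        global_kernel_lock.release()
    return True
```

A request queued just after the holder finds the queue empty, but
before it releases the lock, waits until the next entry into the
sequential code. A real implementation would have to close that
window.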

John Ahlstrom

2006-07-20 19:20:45 UTC

-- snip snip

Post by Anne & Lynn Wheeler
concurrent with VAMPS effort was "ECPS" for 370 138&148. ECPS did some
more stuff like VMA on the 158 (direct supervisor state instruction
execution) ... but it also identified parts of the hypervisor kernel
and moved that kernel code into microcode. the issue on 138&148
machines was that there was an avg. of 10:1 microcode instructions
executed for every 370 instruction. Much of the kernel code moved to
microcode on a straight 1:1 basis, resulting in a ten times performance
speed up. old posting identifying specific kernel code segments for
migrating into microcode.
http://www.garlic.com/~lynn/94.html#21 370 ECPS VM microcode assist

-- snip snip

Lynn:

Can you point us to any information on 138/148 microprogramming and
microarchitecture? Examples of the 10:1 microcode to 370 instruction
expansion would be fascinating.

JKA

--
Smart is thinking you are right.
Wise is knowing you might be wrong.

Anne & Lynn Wheeler

2006-07-20 19:42:47 UTC

Post by John Ahlstrom
Can you point us to any information on 138/148 microprogramming and
microarchitecture? Examples of the 10:1 microcode to 370 instruction
expansion would be fascinating.

re:
http://www.garlic.com/~lynn/2006n.html#44 Any resources on VLIW?

i don't have any left ... and am not aware of any online
resources. possibly somebody has some old field engineering manuals
with instruction description.

the high-end 370s had horizontal microcode ... more akin to VLIW.

the low and mid-range 370s tended to be relatively straightforward
processor engines ... and the 370 "microcode" was relatively
straightforward sequential instruction sequences (i.e. "vertical"
microcode) ... and the avg. of 10:1 microcode instructions per 370
instruction was relatively the same across a variety of engines
(i.e. the microprocessor MIP rate had to be on the order of ten times
that of whatever 370 model it was being used in).

the large variety of these different (microcode) processing engines
gave rise to the "fort knox" effort circa 1980 ... to replace most of
the internal microcode processing engines with 801s (aka risc).
http://www.garlic.com/~lynn/subtopic.html#801

where the standard 801/risc instruction set was extended with some
instructions that aided in instruction simulation.

the follow-on to the 138/148 was the 4331/4341. the follow-on to the
4331/4341 (4361/4381) was going to have 801 risc processors as the
microcode engine. i helped author a white paper that killed that effort.
the issue was that technology was advancing to the point where it was
possible to implement nearly the whole 370 directly in silicon
... avoiding much of the instruction emulation overhead altogether
(i.e. 4381 was much more of a direct silicon implementation).

some number of the 370 instructions required a lot more than 10
microprocessor instructions and for which there wouldn't be a direct
simple microcode instruction ... however, the typical high usage 370
kernel instructions tended to be a lot of testing bits/state and
branching ... for which there typically was an exact correspondence in
the microcode instruction set (i.e. eliminate microcode decode of the
370 instruction, manipulate the 370 registers, etc)

Anne & Lynn Wheeler

2006-07-20 19:52:23 UTC

re:
http://www.garlic.com/~lynn/2006n.html#44 Any resources on VLIW?
http://www.garlic.com/~lynn/2006n.html#47 Any resources on VLIW?

as an aside ... some number of the relatively recent 370 emulators
written for intel platforms have quoted avg. instruction ratio numbers
around 10:1 also (have to play some real tricks to get it much below
10:1).