Post by Barton Robinson:
The real reason for ensuring a hierarchy of storage in z/VM
is to meet the objective of paging less to DASD.
The page stealing algorithm used to take pages from main storage
is not as efficient as the LRU algorithm for moving pages from
Expanded Storage.
Memory constrained systems found that their external paging rate
dropped when they converted some real storage to expanded.
The stealing algorithm steals a lot of the wrong pages, often
taking needed pages and moving them to dasd. bad.
Sure moving pages back and forth between expanded and real cost
CPU - but paging to disk is orders of magnitude worse.
that was one of the issues in the initial morph from cp67 to
vm370. i had introduced global lru into cp67 as an
undergraduate.
in the morph from cp67 to vm370, they severely perverted the global
LRU implementation (besides changing the dispatching algorithm and
other things). the morph to vm370 introduced a threaded list of all
real pages (as opposed to the real storage index table). in theory the
threaded list was supposed to approximate the real storage index table
... however, at queue transitions ... all pages for a specific address
space were collected and put on a flush list. if the virtual machine
re-entered queue ... any pages from the flush list were collected and
put back on the "in-q" list. the result was that the order in which
pages were examined for stealing tended toward fifo, with most pages
for the same virtual machine clustered together.
the original global LRU implementation was based on having a
relatively uniform time between examining pages; aka a page was
examined and had its page reference bit reset. then all the other
pages in real storage were examined before that page was examined
again. this was uniformly true for all pages in real storage. the
only change was that if the demand for real storage increased, the
time it took to cycle around all real pages decreased ... but
decreased relatively uniformly for all pages.
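the straight clock idea can be sketched in a few lines; this is my
illustrative reconstruction in python, not the actual cp67 code:

```python
# illustrative reconstruction of a straight clock page steal (not the
# actual cp67 code): a single hand sweeps all real pages in a fixed
# cyclic order, resetting reference bits, so the interval between
# examinations is relatively uniform for every page.

class ClockSteal:
    def __init__(self, nframes):
        self.referenced = [False] * nframes  # per-frame reference bits
        self.hand = 0                        # current sweep position

    def touch(self, frame):
        # hardware sets the reference bit whenever the page is used
        self.referenced[frame] = True

    def steal(self):
        # advance the hand, resetting bits, until an unreferenced
        # frame is found; heavier steal demand just makes the hand
        # cycle faster -- uniformly faster for all pages
        while self.referenced[self.hand]:
            self.referenced[self.hand] = False
            self.hand = (self.hand + 1) % len(self.referenced)
        victim = self.hand
        self.hand = (self.hand + 1) % len(self.referenced)
        return victim

c = ClockSteal(4)
for f in (0, 1, 3):
    c.touch(f)
print(c.steal())  # 2: the only frame not referenced since last sweep
```

note that the steal pass doubles as the examine/reset pass ... the
uniform examination interval falls out of the single cyclic sweep.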
in any case, the list approach introduced in the morph from cp67 to
vm370 would drastically and constantly reorder how pages were
examined. there is an implicit assumption in LRU algorithms (local or
global) that the examination process is uniform for all pages. the
list manipulation totally invalidated that assumption ... and while
the code appeared to still examine and reset reference bits ... it was
no longer approximately true LRU (either global or local) ... and the
number of "bad" decisions went way up.
this was "fixed" when the resource manager was shipped ... and the
code put back like I had originally done in cp67 ... and restored
having true LRU. now the resource manager was still a straight clock
(as defined later in the stanford PHD thesis). basically the way i had
implemented clock had a bunch of implicit characteristics that
drastically reduced the implementation pathlength ... however that
made a lot of things about the implementation "implicit" ... there not
being necessarily an obvious correlation between the careful way that
pages were examined and how it preserved a faithful implementation of
LRU.
i had a somewhat similar argument with the people putting virtual
memory support into mvt for vs2 (first svs and then mvs). they
observed that if they "stole" non-changed pages before "stealing"
changed pages (while still cycling, examining and resetting reference
bits corresponding to some supposedly LRU paradigm) ... they could be
more efficient. no matter how hard i argued against doing it (that it
violated fundamental principles of LRU theory) ... they still
insisted. well into mvs (3.8 or later) ... somebody finally realized
that they were stealing high-use linkpack (shared executable)
instruction/non-changed pages before stealing much lower-use, private
data pages. another example: if you violate fundamental principles of
the implementation ... you no longer really have an LRU
implementation.
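the effect can be illustrated with a hypothetical biased scan; names
and structure here are mine, not the vs2 code:

```python
# hypothetical sketch (not the vs2 code) of a steal pass biased to
# take non-changed pages first. a hot, read-only (never "changed")
# shared code page gets stolen before a rarely used but dirty private
# page whenever both have their reference bits off at scan time.

def biased_steal(frames):
    """frames: list of dicts with 'referenced' and 'changed' flags.
    two passes: first steal any unreferenced+unchanged frame, only
    then fall back to unreferenced+changed frames."""
    for want_changed in (False, True):
        for i, f in enumerate(frames):
            if not f["referenced"] and f["changed"] == want_changed:
                return i
    # everything referenced: reset all bits and sweep again
    for f in frames:
        f["referenced"] = False
    return biased_steal(frames)

frames = [
    {"referenced": False, "changed": False},  # shared code page (bit just reset)
    {"referenced": False, "changed": True},   # cold, dirty private data page
]
print(biased_steal(frames))  # 0: the shared code page is taken first
```

the bias saves a page write on each steal, but it systematically
punishes clean high-use pages ... exactly the linkpack problem above.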
there was a side issue. shortly after joining the science center,
http://www.garlic.com/~lynn/subtopic.html#545tech
i discovered another variation (although this was not deployed
in the resource manager). basically it involved two observations:
1) the usefulness of the history information degrades over time.
implicit in LRU is that if a page has been referenced ... it is more
likely to be used in the future than a page that hasn't been
referenced. since there is only a single bit, all you can determine is
that the page was referenced at some point since it was reset. if it
is taking a long time between examinations ... some bits may have a
lot more meaning than other bits ... but it isn't determinable. in
this time frame, the guys on the 5th floor also published an article
about having multiple reference bits ... where instead of a straight
reset operation ... it became a one-bit shift operation (with a zero
being shifted into the vacated bit). their article covered the
performance effects of using one, two, three, four, etc. bits.
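a hedged sketch of the multi-bit idea (the usual "aging" formulation;
the bit width and shift direction here are my assumptions, not
necessarily what the article described):

```python
# sketch of multi-bit reference history ("aging"): instead of simply
# resetting a single reference bit at each examination, the current
# reference bit is shifted into a per-page history field, so recent
# references carry more weight than old ones. width/direction assumed.

HISTORY_BITS = 4

def examine(history, referenced):
    """one examination pass: shift the current reference bit into the
    high-order end of the history, aging older references rightward."""
    history >>= 1
    if referenced:
        history |= 1 << (HISTORY_BITS - 1)
    return history & ((1 << HISTORY_BITS) - 1)

# page A referenced on every pass, page B referenced only on the first
a = b = 0
for i in range(3):
    a = examine(a, True)
    b = examine(b, i == 0)

print(bin(a), bin(b))  # 0b1110 0b10
assert a > b           # victim selection would prefer page B
```

with a single bit, both pages would look identical ("referenced at
some point since reset"); the extra bits recover some of the timing
information that a long examination interval throws away.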
2) LRU is based on applications actually exhibiting LRU-like reference
behavior ... that if a page has been recently used ... it is more
likely to be used in the near future than pages that haven't been
recently used.
However, there are (sometimes pathological) cases where that isn't
true. one case that can crop up is when you have an LRU implementation
running under an LRU implementation (the 2nd level can be a virtual
machine doing its own LRU page approximation or a database system that
is managing a cache of records using an LRU-like implementation). So I
invented this sleight-of-hand implementation ... it looked and tasted
almost exactly like my standard global LRU implementation except it
had the characteristic that in situations when LRU would nominally
perform well, it approximated LRU ... but in situations when LRU was
not a good solution, it magically was doing random replacement
selection. It was hard to understand this ... because the code still
cycled around resetting bits ... and it was a true sleight of hand
that would result in it selecting based on LRU or random (you had to
really understand some very intricate implicit relationships behind
the code implementation and the way each instruction related to a true
LRU implementation).
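one classic case where the LRU assumption fails is a cyclic reference
pattern slightly larger than memory; this small simulation is my own
illustration of that failure mode (not the sleight-of-hand code) ...
exact LRU faults on every reference while random replacement does
better:

```python
# illustration (not the actual sleight-of-hand code): a cyclic
# reference pattern one page larger than memory is a classic case
# where LRU's assumption fails -- exact LRU always evicts precisely
# the page needed next, faulting on every reference, while random
# replacement keeps some of the loop resident.

import random
from collections import OrderedDict

def lru_faults(refs, nframes):
    mem, faults = OrderedDict(), 0
    for p in refs:
        if p in mem:
            mem.move_to_end(p)           # mark most recently used
        else:
            faults += 1
            if len(mem) >= nframes:
                mem.popitem(last=False)  # evict least recently used
            mem[p] = True
    return faults

def random_faults(refs, nframes, seed=0):
    rng, mem, faults = random.Random(seed), set(), 0
    for p in refs:
        if p not in mem:
            faults += 1
            if len(mem) >= nframes:
                mem.discard(rng.choice(sorted(mem)))  # random victim
            mem.add(p)
    return faults

refs = list(range(5)) * 100   # cycle over 5 pages, 4 frames of memory
print(lru_faults(refs, 4))    # 500: every single reference faults
print(random_faults(refs, 4)) # fewer faults than LRU on this pattern
```

in this regime "least recently used" is exactly the page that will be
referenced next ... so anything, even random, beats faithful LRU.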
the science center was doing a lot of performance work including lots
of detailed traces and modeling ... both event-based models and
analytical models. this included a lot of stuff that today
is taken for granted ... including the early foundation for
capacity planning. some past collected posts on performance work
http://www.garlic.com/~lynn/subtopic.html#bench
this included what was eventually made available on HONE as the
performance predictor ... an APL analytical model ... SEs and salesmen
could get on HONE ... feed in customer performance, configuration and
workload information and ask "what-if" questions about changing
configuration and/or workload.
http://www.garlic.com/~lynn/subtopic.html#hone
in any case, there was a detailed virtual memory and page replacement
model. we got exact page reference traces and fed them into the model,
simulating lots of different page replacement algorithms. for one, the
model had a true, exact LRU implementation, as well as various
operating system global LRU approximation implementations, local LRU
implementations, etc. It turned out that the modeling also showed
that global LRU was better than local LRU ... and that "true" LRU
tended to be 5-15 percent better than the global LRU approximation.
However, the magic, sleight-of-hand implementation tended to be 5-10
percent better than true LRU. It turned out that the sleight-of-hand
implementation was magically changing from LRU approximation
replacement to random replacement in situations where LRU didn't apply
(i.e. the assumption that least recently used pages are the least
likely to be used next wasn't holding true). So in environments
where the LRU assumption tended to hold, the code tended to approx
LRU (but not quite as good as "exact" LRU ... where all pages in real
memory are exactly ordered as to when they were most recently
referenced). However, in execution periods when LRU assumptions
weren't applicable ... the implementation started randomly selecting
pages for replacement. It was in these situations that the LRU
algorithm started making bad choices ... and random tended to be
better than LRU-based decisions.
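the trace-driven approach can be sketched with a tiny harness (my
illustration, not the science center model; note that on a short toy
trace the clock approximation can even come out ahead ... the 5-15
percent figures above came from long, real reference traces):

```python
# tiny trace-driven harness (my illustration, not the science center
# model): feed one reference trace to an exact LRU simulator and to a
# clock approximation, then compare fault counts.

from collections import OrderedDict

def exact_lru_faults(refs, nframes):
    # all resident pages kept exactly ordered by most recent reference
    mem, faults = OrderedDict(), 0
    for p in refs:
        if p in mem:
            mem.move_to_end(p)
        else:
            faults += 1
            if len(mem) >= nframes:
                mem.popitem(last=False)  # evict true least recently used
            mem[p] = True
    return faults

def clock_faults(refs, nframes):
    # one reference bit per frame; the hand sweeps, resetting bits
    pages, ref, hand, faults = [None] * nframes, [False] * nframes, 0, 0
    for p in refs:
        if p in pages:
            ref[pages.index(p)] = True
            continue
        faults += 1
        while ref[hand]:
            ref[hand] = False
            hand = (hand + 1) % nframes
        pages[hand], ref[hand] = p, True
        hand = (hand + 1) % nframes
    return faults

trace = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(exact_lru_faults(trace, 3))  # 10
print(clock_faults(trace, 3))      # 9 -- toy traces can go either way
```

the same harness shape extends to local LRU, random, and hybrid
policies ... which is essentially what the detailed model did.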
In any case, there are a lot of assumptions about execution patterns
built into LRU replacement algorithms. Furthermore, there are several
implementation pitfalls ... where you may think you have an LRU
implementation and it is, in fact, exhibiting radically different
replacement selections. An issue is to know that you are really doing
LRU replacement when LRU replacement is appropriate ... and to really
know you are doing something else ... when something else is more
applicable (particularly when assumptions about execution patterns and
the applicability of LRU replacement don't hold).
So there are a lot of pitfalls having to do with stealing pages ...
both because 1) the implementation can have significant problems with
correctly implementing any specific algorithm and 2) assumptions
about a specific algorithm being valid may not apply to the specific
conditions at the moment.
Either of these deficiencies may appear to be randomly and/or
inexplicably affected by changes in configuration. trading off real
executable memory for expanded storage can easily be shown to cause
more overhead and lower thruput (i.e. pages start exhibiting brownian
motion ... moving back and forth between real storage and expanded
storage). however, the configuration change may have secondary effects
on a poorly implemented page steal/replacement implementation which
result in fewer wrong pages being shipped off to disk. the
inexplicable effect on the poorly implemented page steal/replacement
algorithm, resulting in fewer bad choices being sent to disk, may more
than offset any of the brownian motion of pages moving back and forth
between real storage and expanded storage.
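the trade-off can be put in rough numbers; the latencies below are my
illustrative assumptions, not measurements:

```python
# illustrative cost model with assumed (not measured) latencies:
# moving a page between real and expanded storage is a synchronous
# memory-to-memory copy, while a DASD page-in is physical i/o,
# several orders of magnitude slower.

EXPANDED_MOVE_US = 20.0      # assumed: page copy real <-> expanded
DASD_PAGE_READ_US = 20000.0  # assumed: ~20ms DASD page-in

def paging_cost_us(expanded_moves, dasd_reads):
    return expanded_moves * EXPANDED_MOVE_US + dasd_reads * DASD_PAGE_READ_US

# a lot of "brownian motion" between real and expanded storage can
# still cost less than a handful of extra page-ins from disk
print(paging_cost_us(expanded_moves=1000, dasd_reads=0))  # 20000.0
print(paging_cost_us(expanded_moves=0, dasd_reads=10))    # 200000.0
```

so even a modest reduction in wrong pages going to disk can swamp the
extra CPU spent shuffling pages between the two kinds of storage.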
the original purpose of expanded store was to add more electronic
memory than could be configured as straight processor execution
memory (used for paging in lieu of doing some sort of real i/o). in
the current situation you are trading off real executable memory for a
memory construct that has fundamentally more overhead. however, this
trade-off has secondary effects on a steal/replacement implementation
that is otherwise making bad choices (that it shouldn't be making).
various collected posts about clock, local lru, global lru, magically
switching between lru and random (wsclock was the stanford phd on
global lru ... some ten plus years after i had done it as an
undergraduate), etc:
http://www.garlic.com/~lynn/subtopic.html#wsclock
for even more drift ... one of the other things done with the detailed
tracing was a product that eventually came out of the science center
(announced and shipped two months before i announced and shipped the
resource manager) was called vs/repack. this basically took detailed
program storage traces and did semi-automated program re-organization
to improve page working set characteristics. i'm not sure how much
customer use the product got, but it was used extensively internally
by lots of development groups ... especially the big production stuff
that was migrating from os/360 real storage to virtual storage
operation (big user that comes to mind was the ims development group).
the traces also turned out to be useful for "hot-spot" identification
(particular parts of applications that were responsible for majority
of execution).
misc. past vs/repack posts
http://www.garlic.com/~lynn/94.html#7 IBM 7090 (360s, 370s, apl, etc)
http://www.garlic.com/~lynn/99.html#68 The Melissa Virus or War on Microsoft?
http://www.garlic.com/~lynn/2000g.html#30 Could CDR-coding be on the way back?
http://www.garlic.com/~lynn/2001b.html#83 Z/90, S/390, 370/ESA (slightly off topic)
http://www.garlic.com/~lynn/2001c.html#31 database (or b-tree) page sizes
http://www.garlic.com/~lynn/2001c.html#33 database (or b-tree) page sizes
http://www.garlic.com/~lynn/2001i.html#20 Very CISC Instuctions (Was: why the machine word size ...)
http://www.garlic.com/~lynn/2002c.html#28 OS Workloads : Interactive etc
http://www.garlic.com/~lynn/2002c.html#45 cp/67 addenda (cross-post warning)
http://www.garlic.com/~lynn/2002c.html#46 cp/67 addenda (cross-post warning)
http://www.garlic.com/~lynn/2002c.html#49 Swapper was Re: History of Login Names
http://www.garlic.com/~lynn/2002e.html#50 IBM going after Strobe?
http://www.garlic.com/~lynn/2002f.html#50 Blade architectures
http://www.garlic.com/~lynn/2003f.html#15 Alpha performance, why?
http://www.garlic.com/~lynn/2003f.html#21 "Super-Cheap" Supercomputing
http://www.garlic.com/~lynn/2003f.html#53 Alpha performance, why?
http://www.garlic.com/~lynn/2003g.html#15 Disk capacity and backup solutions
http://www.garlic.com/~lynn/2003h.html#8 IBM says AMD dead in 5yrs ... -- Microsoft Monopoly vs. IBM
http://www.garlic.com/~lynn/2003j.html#32 Language semantics wrt exploits
http://www.garlic.com/~lynn/2004.html#14 Holee shit! 30 years ago!
http://www.garlic.com/~lynn/2004c.html#21 PSW Sampling
http://www.garlic.com/~lynn/2004m.html#22 Lock-free algorithms
http://www.garlic.com/~lynn/2004n.html#55 Integer types for 128-bit addressing
http://www.garlic.com/~lynn/2004o.html#7 Integer types for 128-bit addressing
http://www.garlic.com/~lynn/2004q.html#73 Athlon cache question
http://www.garlic.com/~lynn/2004q.html#76 Athlon cache question
http://www.garlic.com/~lynn/2005.html#4 Athlon cache question
http://www.garlic.com/~lynn/2005d.html#41 Thou shalt have no other gods before the ANSI C standard
http://www.garlic.com/~lynn/2005d.html#48 Secure design
http://www.garlic.com/~lynn/2005h.html#15 Exceptions at basic block boundaries
http://www.garlic.com/~lynn/2005j.html#62 More on garbage collection
http://www.garlic.com/~lynn/2005k.html#17 More on garbage collection
http://www.garlic.com/~lynn/2005m.html#28 IBM's mini computers--lack thereof
http://www.garlic.com/~lynn/2005n.html#18 Code density and performance?
http://www.garlic.com/~lynn/2005o.html#5 Code density and performance?
--
Anne & Lynn Wheeler | http://www.garlic.com/~lynn/