Discussion: Re: Expanded Storage
Barton Robinson
2006-02-07 15:12:55 UTC
The real reason for ensuring a hierarchy of storage in z/VM
is to meet the objective of paging less to DASD.

The page stealing algorithm used to take pages from main storage
is not as efficient as the LRU algorithm for moving pages from
Expanded Storage.

Memory constrained systems found that their external paging rate
dropped when they converted some real storage to expanded.
The stealing algorithm steals a lot of the wrong pages, often
taking needed pages and moving them to DASD, which is bad.

Sure, moving pages back and forth between expanded and real costs
CPU - but paging to disk is orders of magnitude worse.
Yes, in your configuration you should define expanded storage. It's
for providing a hierarchy in storage management as well as a
circumvention to reduce the impact of contention under the bar.
Especially when the total active memory of your Linux server is
getting close to 2G (and unless you take steps to prevent it, eventually
the entire Linux virtual machine's main memory will appear active to VM).
25% has been suggested as a starting point, but measurements should
help you determine the right value. The right value depends a lot on
what Linux is doing. And make sure to disable MDC into expanded
storage as I suggested yesterday.
note if you have 16gbytes of expanded store and 16gbytes of regular
storage ... then only stuff in the 16gbytes of regular store can be
used/executed. stuff in expanded store has to be brought into regular
store to be accessed (and something in regular store pushed out
... possibly exchanging places with stuff in expanded store).
if you have 32gbytes of regular store ... then everything in regular
store can be directly used/executed ... w/o being moved around.
"If you can't measure it, I'm Just NOT interested!"(tm)

/************************************************************/
Barton Robinson - CBW Internet: ***@VELOCITY-SOFTWARE.COM
Velocity Software, Inc Mailing Address:
196-D Castro Street P.O. Box 390640
Mountain View, CA 94041 Mountain View, CA 94039-0640

VM Performance Hotline: 650-964-8867
Fax: 650-964-9012 Web Page: WWW.VELOCITY-SOFTWARE.COM
/************************************************************/
Anne & Lynn Wheeler
2006-02-07 17:30:36 UTC
Post by Barton Robinson
The real reason for ensuring a hierarchy of storage in z/VM
is to meet the objective of paging less to DASD.
The page stealing algorithm used to take pages from main storage
is not as efficient as the LRU algorithm for moving pages from
Expanded Storage.
Memory constrained systems found that their external paging rate
dropped when they converted some real storage to expanded.
The stealing algorithm steals a lot of the wrong pages, often
taking needed pages and moving them to DASD, which is bad.
Sure, moving pages back and forth between expanded and real costs
CPU - but paging to disk is orders of magnitude worse.
that was one of the issues that happened in the initial morph from
cp67 to vm370. i had introduced global lru into cp67 as an
undergraduate.

in the morph from cp67 to vm370, they severely perverted the global
LRU implementation (besides changing the dispatching algorithm and
other things). the morph to vm370 introduced a threaded list of all
real pages (as opposed to the real storage index table). in theory the
threaded list was supposed to approximate the real storage index table
... however, at queue transitions ... all pages for a specific address
space were collected and put on a flush list. if the virtual machine
re-entered the queue ... any pages from the flush list were collected and
put back on the "in-q" list. the result was that the order in which
pages were examined for stealing tended toward FIFO, with most pages
for the same virtual machine clustered together.

the original global LRU implementation was based on having a
relatively uniform time between examining pages; aka a page was
examined and had its page reference bit reset, then all the other
pages in real storage were examined before that page was examined
again. this was uniformly true for all pages in real storage. the
only change was that if the demand for real storage increased, the
time it took to cycle around all real pages decreased ... but it
decreased relatively uniformly for all pages.
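
a minimal sketch of that clock-style cycle (illustrative python, not the
actual cp67/vm370 code; the frame and replacer names are made up for the
example):

```python
# illustrative clock-style global LRU approximation: frames are examined in a
# fixed circular order, so the interval between resetting a frame's reference
# bit and testing it again is roughly uniform for every frame in real storage.

class Frame:
    def __init__(self):
        self.page = None          # virtual page currently in this frame
        self.referenced = False   # models the hardware reference bit
        self.changed = False      # models the hardware change bit

class ClockReplacer:
    def __init__(self, nframes):
        self.frames = [Frame() for _ in range(nframes)]
        self.hand = 0             # current position of the clock hand

    def touch(self, frame):
        frame.referenced = True   # hardware sets the bit on any access

    def steal(self):
        """return the first unreferenced frame past the hand, resetting bits on the way"""
        while True:
            frame = self.frames[self.hand]
            self.hand = (self.hand + 1) % len(self.frames)
            if frame.referenced:
                frame.referenced = False   # give the page another trip around
            else:
                return frame               # not referenced since last reset: steal it
```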

in any case, the list approach introduced in the morph from cp67 to
vm370 would drastically and constantly reorder how pages were
examined. there is an implicit requirement of LRU algorithms (local or
global) that the examination process be uniform for all pages. the
list manipulation totally invalidated that implicit requirement
... and while it appeared to still examine and reset
reference bits ... it was no longer approximating true LRU (either
global or local) ... and the number of "bad" decisions went way up.

this was "fixed" when the resource manager was shipped ... and the
code put back like I had originally done in cp67 ... and restored
having true LRU. now the resource manager was still a straight clock
(as defined later in the stanford PHD thesis). basically the way i had
implemented clock had a bunch of implicit charactierstics that had a
drastically reduced the pathlength implementation ... however that
made a lot of things about the implementation "implicit" ... there not
being necessarily an obvious correlation between the careful way that
pages were examined and how it preserved faithful implementation of
LRU.

i had a somewhat similar argument with the people putting virtual memory
support into mvt for vs2 (first svs and then mvs). they observed that if
they "stole" non-changed pages before "stealing" changed pages (while
still cycling around, examining and resetting reference bits according to
some supposedly LRU paradigm) ... they could be more efficient. No
matter how hard i argued against doing it (that it violated
fundamental principles of LRU theory) ... they still insisted. so well
into mvs (3.8 or later) ... somebody finally realized that they were
stealing high-use linkpack (shared executable) instruction/non-changed
pages before stealing much lower-use, private data pages. another
example: if you are going to violate some fundamental principles of the
implementation ... you no longer really have an LRU implementation.
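
a sketch of the variant being argued against (again illustrative python,
reusing the made-up frame model from the earlier sketch ... not the actual
svs/mvs code):

```python
# illustrative "steal non-changed pages first" variant: pass 1 only takes
# unreferenced frames that are unchanged (no page-out write needed), pass 2
# falls back to unreferenced changed frames. high-use shared executable pages
# are never changed, so they get stolen ahead of lower-use private data pages,
# which is no longer LRU ordering.

def steal_unchanged_first(replacer):
    for take_changed in (False, True):
        for _ in range(len(replacer.frames)):
            frame = replacer.frames[replacer.hand]
            replacer.hand = (replacer.hand + 1) % len(replacer.frames)
            if frame.referenced:
                frame.referenced = False       # still resets bits like "real" LRU
                continue
            if frame.changed == take_changed:
                return frame
    return None                                # everything was re-referenced meanwhile
```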

there was a side issue. shortly after joining the science center,
http://www.garlic.com/~lynn/subtopic.html#545tech
i discovered another variation (although this was not deployed
in the resource manager). basically it involved two observations

1) the usefulness of the history information degrades over time.
implicit in LRU is that if a page has been referenced ... it is more
likely to be used in the future than a page that hasn't been
referenced. since there is only a single bit, all you can determine is
that the page was referenced at some point since the bit was reset. if it
is taking a long time between examinations ... some bits may have a
lot more meaning than other bits ... but it isn't determinable. in
this time frame, the guys on the 5th floor also published an article
about having multiple reference bits ... instead of a straight
reset operation ... it became a one-bit shift operation (with zero
being shifted into the nearest bit). their article was on the performance
effects of using one, two, three, four, etc. bits.
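
a sketch of that multi-reference-bit idea (illustrative python; it assumes
each frame also carries a small per-frame history value, an addition made up
for the example):

```python
# illustrative reference-bit aging: at each examination the single hardware
# reference bit is shifted into a small per-frame history instead of just
# being reset, so more recently/frequently referenced frames carry larger values.

AGE_BITS = 4   # the cited article compared the effect of one, two, three, four, ... bits

def age_frames(frames):
    for frame in frames:
        history = getattr(frame, "history", 0)
        history = (history >> 1) | (int(frame.referenced) << (AGE_BITS - 1))
        frame.history = history & ((1 << AGE_BITS) - 1)
        frame.referenced = False          # sample and reset the hardware bit

def pick_victim(frames):
    # the frame with the smallest history value has the oldest reference history
    return min(frames, key=lambda f: getattr(f, "history", 0))
```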

2) LRU is based on applications actually exhibiting LRU-like reference
behavior ... that if a page has been recently used ... it is more likely to
be used in the near future than pages that haven't been recently used.
However, there are (sometimes pathological) cases where that isn't
true. one case that can crop up is when you have an LRU implementation
running under an LRU implementation (the 2nd level can be a virtual
machine doing its own LRU page approximation or a database system that is
managing a cache of records using an LRU-like implementation). So I
invented this sleight-of-hand implementation ... it looked and tasted
almost exactly like my standard global LRU implementation except it
had the characteristic that in situations when LRU would nominally
perform well, it approximated LRU ... but in situations when LRU was
not a good solution, it magically was doing random replacement
selection. It was hard to understand this ... because the code still
cycled around resetting bits ... and it was a true sleight of hand that
would result in it selecting based on LRU or random (you had to really
understand some very intricate implicit relationships between the code
implementation and the way each instruction related to a true LRU
implementation).
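
this isn't the actual sleight-of-hand code, but a tiny illustrative
simulation (python) of the kind of pathological pattern where the LRU
assumption breaks down: a reference string cycling thru one more page than
fits in real storage. strict LRU keeps evicting exactly the page needed next
and misses on every reference, while random replacement does noticeably
better:

```python
# compare strict LRU against random replacement on a cyclic reference string
# that is one page larger than memory (the classic LRU worst case).

import random
from collections import OrderedDict

def lru_faults(refs, nframes):
    frames, faults = OrderedDict(), 0
    for page in refs:
        if page in frames:
            frames.move_to_end(page)            # mark most recently used
        else:
            faults += 1
            if len(frames) >= nframes:
                frames.popitem(last=False)      # evict least recently used
            frames[page] = True
    return faults

def random_faults(refs, nframes):
    frames, faults = set(), 0
    for page in refs:
        if page not in frames:
            faults += 1
            if len(frames) >= nframes:
                frames.remove(random.choice(tuple(frames)))
            frames.add(page)
    return faults

refs = list(range(11)) * 100                    # cycle over 11 pages, 10 frames
print(lru_faults(refs, 10), random_faults(refs, 10))
```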

the science center was doing a lot of performance work, including lots
of detailed traces and modeling ... both event-based and
analytical models. this included a lot of stuff that today
is taken for granted ... including the early foundation for
capacity planning. some past collected posts on performance work
http://www.garlic.com/~lynn/subtopic.html#bench

this included what was eventually made available on HONE as the
performance predictor ... an APL analytical model ... SEs and salesmen
could get on HONE ... feed in customer performance, configuration and
workload information and ask "what-if" questions about changing
configuration and/or workload.
http://www.garlic.com/~lynn/subtopic.html#hone

in any case, there was a detailed virtual memory and page replacement
model. we got exact page reference traces and fed them into the model,
simulating lots of different page replacement algorithms. for one, the
model had a true, exact LRU implementation, as well as various
operating system global LRU approximation implementations, local LRU
implementations, etc. It turns out that the modeling also showed
that global LRU was better than local LRU ... and that "true" LRU
tended to be 5-15 percent better than the global LRU approximation.
However, the magic, sleight-of-hand implementation tended to be 5-10
percent better than true LRU. It turned out that the sleight-of-hand
implementation was magically changing from LRU approximation
replacement to random replacement in situations where LRU didn't apply
(i.e. the assumption that the least recently used pages are the least
likely to be used next wasn't holding true). So in the environments
where the LRU assumption tended to hold true, the code tended to
approximate LRU (but not quite as well as "exact" LRU ... where all
pages in real memory are exactly ordered by when they were most recently
referenced). However, in execution periods when the LRU assumptions
weren't applicable ... the implementation started randomly selecting
pages for replacement. It was in these situations that the LRU algorithm
started making bad choices ... and random tended to be better than
LRU-based decisions.

In any case, there are a lot of assumptions about execution patterns
built into LRU replacement algorithms. Furthermore, there are several
implementation pitfalls ... where you may think you have an LRU
implementation and it is, in fact, exhibiting radically different
replacement selections. An issue is to know that you are really doing
LRU replacement when LRU replacement is appropriate ... and to really
know you are doing something else ... when something else is more
applicable (particularly when the assumptions about execution patterns
and the applicability of LRU replacement don't hold).

So there are a lot of pitfalls having to do with stealing pages ...
both because 1) the implementation can have significant problems with
correctly implementing any specific algorithm and 2) the assumptions
behind a specific algorithm may not apply to the specific
conditions at the moment.

Either of these deficiencies may appear to be randomly and/or
inexplicably affected by changes in configuration. trading off real
executable memory for expanded storage can be easily shown to cause
more overhead and lower thruput (i.e. pages start exhibiting brownian
motion ... moving back and forth between real storage and expanded
storage). however, the configuration change may have secondary effects
on a poorly implemented page steal/replacement implementation which
result in fewer wrong pages being shipped off to disk. the
inexplicable effects on the poorly implemented page
steal/replacement algorithm, resulting in fewer bad choices being sent
to disk, may more than offset any of the brownian motion of pages
moving back and forth between normal storage and expanded storage.

the original purpose of expanded store was to attach more
electronic memory than could be made available as straight processor
execution memory (used for paging in lieu of doing some sort of
real i/o). in the current situation you are trading off real
executable memory for a memory construct that has fundamentally more
overhead. however, this trade-off has secondary effects on a
steal/replacement implementation that is otherwise making bad choices
(that it shouldn't be making).

various collected posts about clock, local lru, global lru, magically
switching between lru and random, etc. (wsclock was the stanford phd on
global lru ... some ten-plus years after i had done it as an
undergraduate)
http://www.garlic.com/~lynn/subtopic.html#wsclock

for even more drift ... one of the other things done with the detailed
tracing was a product that eventually came out of the science center
(announced and shipped two months before i announced and shipped the
resource manager) called vs/repack. this basically took detailed
program storage traces and did semi-automated program re-organization
to improve page working set characteristics. i'm not sure how much
customer use the product got, but it was used extensively internally
by lots of development groups ... especially the big production stuff
that was migrating from os/360 real storage to virtual storage
operation (a big user that comes to mind was the ims development group).
the traces also turned out to be useful for "hot-spot" identification
(particular parts of applications that were responsible for the majority
of execution).

misc. past vs/repack posts
http://www.garlic.com/~lynn/94.html#7 IBM 7090 (360s, 370s, apl, etc)
http://www.garlic.com/~lynn/99.html#68 The Melissa Virus or War on Microsoft?
http://www.garlic.com/~lynn/2000g.html#30 Could CDR-coding be on the way back?
http://www.garlic.com/~lynn/2001b.html#83 Z/90, S/390, 370/ESA (slightly off topic)
http://www.garlic.com/~lynn/2001c.html#31 database (or b-tree) page sizes
http://www.garlic.com/~lynn/2001c.html#33 database (or b-tree) page sizes
http://www.garlic.com/~lynn/2001i.html#20 Very CISC Instuctions (Was: why the machine word size ...)
http://www.garlic.com/~lynn/2002c.html#28 OS Workloads : Interactive etc
http://www.garlic.com/~lynn/2002c.html#45 cp/67 addenda (cross-post warning)
http://www.garlic.com/~lynn/2002c.html#46 cp/67 addenda (cross-post warning)
http://www.garlic.com/~lynn/2002c.html#49 Swapper was Re: History of Login Names
http://www.garlic.com/~lynn/2002e.html#50 IBM going after Strobe?
http://www.garlic.com/~lynn/2002f.html#50 Blade architectures
http://www.garlic.com/~lynn/2003f.html#15 Alpha performance, why?
http://www.garlic.com/~lynn/2003f.html#21 "Super-Cheap" Supercomputing
http://www.garlic.com/~lynn/2003f.html#53 Alpha performance, why?
http://www.garlic.com/~lynn/2003g.html#15 Disk capacity and backup solutions
http://www.garlic.com/~lynn/2003h.html#8 IBM says AMD dead in 5yrs ... -- Microsoft Monopoly vs. IBM
http://www.garlic.com/~lynn/2003j.html#32 Language semantics wrt exploits
http://www.garlic.com/~lynn/2004.html#14 Holee shit! 30 years ago!
http://www.garlic.com/~lynn/2004c.html#21 PSW Sampling
http://www.garlic.com/~lynn/2004m.html#22 Lock-free algorithms
http://www.garlic.com/~lynn/2004n.html#55 Integer types for 128-bit addressing
http://www.garlic.com/~lynn/2004o.html#7 Integer types for 128-bit addressing
http://www.garlic.com/~lynn/2004q.html#73 Athlon cache question
http://www.garlic.com/~lynn/2004q.html#76 Athlon cache question
http://www.garlic.com/~lynn/2005.html#4 Athlon cache question
http://www.garlic.com/~lynn/2005d.html#41 Thou shalt have no other gods before the ANSI C standard
http://www.garlic.com/~lynn/2005d.html#48 Secure design
http://www.garlic.com/~lynn/2005h.html#15 Exceptions at basic block boundaries
http://www.garlic.com/~lynn/2005j.html#62 More on garbage collection
http://www.garlic.com/~lynn/2005k.html#17 More on garbage collection
http://www.garlic.com/~lynn/2005m.html#28 IBM's mini computers--lack thereof
http://www.garlic.com/~lynn/2005n.html#18 Code density and performance?
http://www.garlic.com/~lynn/2005o.html#5 Code density and performance?
--
Anne & Lynn Wheeler | http://www.garlic.com/~lynn/
Anne & Lynn Wheeler
2006-02-07 17:59:18 UTC
ref:
http://www.garlic.com/~lynn/2006b.html#14 Expanded Storage
http://www.garlic.com/~lynn/2006b.html#15 {SPAM?} Expanded Storage

minor addenda to what went wrong in the initial morph from cp67 to
vm370.

as i mentioned, LRU is based on the reference history being
a correct predictor of future references.

a one-bit clock basically orders all pages in real memory and then
cycles around them testing and resetting the reference bit. the
cycle interval looking at all other pages in storage establishes
a uniform interval between resetting the bit and testing it
again.

the initial vm370 implementation went wrong by both reordering
all pages at queue transitions as well as resetting the reference
bits.

for small storage sizes ... the time it took to cycle thru all pages
in memory was less than the nominal queue stay ... so we are looking
at a reference bit that represents an elapsed period less than a queue
stay. as real storage sizes got larger ... the time to cycle thru all pages
became longer than the avg. queue stay. that required that the
reference period represented by the reference bit be a period longer
than the queue stay. however, at queue transition ... the pages were
both being reordered and the reference bits being reset. as a result, it
only had memory of the most recent queue stay ... even tho pages
had real storage lifetimes that were becoming much longer than the
most recent queue stay. as a result of both the queue transition reset
and the constant reordering ... the testing and resetting
implementation bore little actual resemblance to any algorithm based
on a theoretical foundation (even tho the testing and resetting code
looked the same).

on the other hand, the same could be said of my sleight-of-hand change
to the testing and resetting code. however, I could demonstrate that
my change actually corresponded to well provable theoretical
principles and had well describable and predictable behavior under all
workloads and configurations.

again, collected postings related to wsclock, global lru, local lru,
etc.
http://www.garlic.com/~lynn/subtopic.html#wsclock
--
Anne & Lynn Wheeler | http://www.garlic.com/~lynn/
Anne & Lynn Wheeler
2006-02-07 18:27:07 UTC
Post by Anne & Lynn Wheeler
on the other hand, the same could be said of my sleight-of-hand change
to the testing and resetting code. however, I could demonstrate that
my change actually corresponded to well provable theoretical
principles and had well describable and predictable behavior under all
workloads and configurations.
re:
http://www.garlic.com/~lynn/2006b.html#14 Expanded Storage
http://www.garlic.com/~lynn/2006b.html#15 {SPAM?} Expanded Storage
http://www.garlic.com/~lynn/2006b.html#16 {SPAM?} Expanded Storage

ok, and for even more drift.

one of the things done for the resource manager was the development of
an automated benchmarking process.
http://www.garlic.com/~lynn/subtopic.html#bench

over the years, there had been lots of work done on workload and
configuration profiling (leading into the evolution of things like
capacity planning). one of these that saw a lot of exposure was the
performance predictor analytical model available to SEs and salesmen
on HONE
http://www.garlic.com/~lynn/subtopic.html#hone

based on lots of customer and internal datacenter activity, in some
cases spanning nearly a decade ... an initial set of 1000 benchmarks
were defined for calibrating the resource manager ... selecting a wide
variety of workload profiles and configuration profiles. these were
specified and run by the automated benchmarking process.

in parallel, a highly modified version of the performance predictor was
developed. the modified performance predictor would take
all workload, configuration and benchmark results done to date. the
model would then select a new workload/configuration combination,
predict the benchmark results and dynamically specify the
workload/configuration profile to the automated benchmark process.
after the benchmark was run, the results would be fed back to the
model and checked against the predictions. then it would select
another workload/configuration and repeat the process. this was done
for an additional 1000 benchmarks ... each time validating that the
actual operation (cpu usage, paging rate, distribution of cpu across
different tasks, etc) corresponded to the predicted.
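
a rough sketch of the shape of that predict/run/compare loop (illustrative
python only ... all the function and object names are made-up stand-ins; the
real thing was the modified APL performance predictor driving the automated
benchmark process):

```python
# illustrative predict/run/compare calibration loop: the model picks the next
# workload/configuration profile, predicts the results, the benchmark driver
# runs it, and the measurements are fed back and checked against the prediction.

def report_discrepancy(profile, metric, expected, actual):
    print(f"{profile}: {metric} predicted {expected}, measured {actual}")

def calibration_loop(model, driver, history, rounds=1000, tolerance=0.10):
    for _ in range(rounds):
        profile = model.select_next(history)          # new workload/configuration combo
        predicted = model.predict(profile, history)   # e.g. cpu usage, paging rate, i/o
        measured = driver.run(profile)                # automated benchmark run
        history.append((profile, predicted, measured))
        for metric, expected in predicted.items():
            actual = measured[metric]
            if abs(actual - expected) > tolerance * max(abs(expected), 1e-9):
                report_discrepancy(profile, metric, expected, actual)
    return history
```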

the full 2000 automated benchmarks took three months elapsed time to
run. however, at the end, we were relatively confident that the
resource manager (cpu, dispatching, scheduling, paging, i/o, etc) was
operating consistently and predictably with respect to theory, as well
as with the developed analytical models, across an extremely wide range
of workloads and configurations.

as a side issue, one of the things that we started with (before the
actual 2000 final benchmarks were run)
was a set of extremely pathological and extreme benchmarks
(i.e. numbers of users, total virtual pages, etc. that were ten to
twenty times more than anybody had ever run before). this put extreme
stress on the operating system and initially resulted in lots of
system failures. as a result, before starting the final resource
manager phase ... i redesigned and rewrote the internal serialization
mechanism ... and then went thru the whole kernel fixing up all sorts
of things to use the new synchronization and serialization process. when
i was done, all cases of zombie/hung users had been eliminated, as well
as all cases of system failures because of
synchronization/serialization bugs. this code was then incorporated
into (and shipped as part of) the resource manager.

unfortunately, over the years, various rewrites and fixes corrupted
the purity of this serialization/synchronization rework ... and you
started to again see hung/zombie users as well as some
serialization/synchronization failures.

misc. collected past posts on debugging, zombie/hung users, etc
http://www.garlic.com/~lynn/subtopic.html#dumprx
--
Anne & Lynn Wheeler | http://www.garlic.com/~lynn/
Anne & Lynn Wheeler
2006-02-07 20:03:26 UTC
Post by Anne & Lynn Wheeler
based on lots of customer and internal datacenter activity, in some
cases spanning nearly a decade ... an initial set of 1000 benchmarks
were defined for calibrating the resource manager ... selecting a wide
variety of workload profiles and configuration profiles. these were
specified and run by the automated benchmarking process.
re:
http://www.garlic.com/~lynn/2006b.html#14 Expanded Storage
http://www.garlic.com/~lynn/2006b.html#15 {SPAM?} Expanded Storage
http://www.garlic.com/~lynn/2006b.html#16 {SPAM?} Expanded Storage
http://www.garlic.com/~lynn/2006b.html#17 {SPAM?} Expanded Storage

and minor addenda to actual (implementation) benchmarking results
corresponding to theory/model/prediction ...

the modified predictor not only specified the workload profile (things
like batch, interactive, mixed-mode, etc) and configuration ... but
also scheduling priority. so not only did the actual (implementation)
overall benchmarking results have to correspond to
theory/model/prediction ... but each individual virtual machine's
measured benchmark resource use (cpu, paging, i/o, etc) also had to
correspond to the theory/model/prediction for that virtual machine
... including any variations introduced by changing the individual
virtual machine's scheduling priority.

a side issue was that when i released the resource manager ... they wanted
me to do updated releases on the same schedule as the monthly PLC
releases for the base product. my problem was that i was responsible
for doing all the documentation, classes, support, changes,
maintenance, and (initially) answering all trouble calls ... basically
as a sideline hobby ... independent of other stuff I was supposed to be
doing at the science center (aka i was part of the development
organization ... at the time, occupying the old SBC building in
burlington mall). I argued for and won ... only having to put out a
new release every three months instead of along with every monthly
PLC.

part of this was that it was just a sideline hobby ... the other part was
that i insisted on repeating at least 100-200 benchmarks before each
new minor (3-month) release to validate that nothing had affected the
overall infrastructure (and major changes to the underlying system
might require several hundred or thousands of benchmarks to be
repeated).
--
Anne & Lynn Wheeler | http://www.garlic.com/~lynn/