w.f.j.mueller@gsi.de [H390-MVS]
2018-06-16 15:25:29 UTC
Hi,
the s370_perf instruction time benchmark (see project https://github.com/wfjm/s370-perf and posting https://groups.yahoo.com/neo/groups/hercules-390/conversations/topics/83415) currently uses the elapsed time retrieved with STCK for all timing measurements. Mark L. Gaubatz rightly commented that CPU time should be used as the basis of all performance measurements (see message https://groups.yahoo.com/neo/groups/hercules-390/conversations/messages/83140 in topic https://groups.yahoo.com/neo/groups/hercules-390/conversations/topics/83139). To implement this one needs a low-overhead method to access the task CPU time with high resolution, ideally with microsecond precision.
s370_perf is designed to run on MVS 3.8J, as freely available with the Tur(n)key 4- System distribution http://wotho.ethz.ch/tk4-/. The methods used for CPU time retrieval on modern MVS systems are not available:
- the TIMEUSED macro is not available (it was introduced later)
- the TCBTTIME field in the TCB is not available (it was introduced later)
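For comparison, on later MVS releases one would simply call the TIMEUSED service; a minimal sketch, assuming the later macro interface, which is exactly what is missing on 3.8J:
TIMEUSED STORADR=CPUT,LINKAGE=SYSTEM,CPU=TOD accumulated CPU time
...
CPUT DS 1D result area, TOD format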
Tom Armstrong pointed me to a function CPUTIM which is included in the Algol F IVP code IEXSAMP4 and used for algorithm timing. It essentially returns the sum of the ASCB fields ASCBEJST and ASCBSRBT. The key drawback is that these fields are only updated after each dispatch, and a test quickly showed that the CPU time retrieved this way has errors of up to 170 msec (seen on a 2-CPU system with only one user job active).
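For reference, this naive ASCB sum boils down to something like the following sketch (my own reconstruction, not the IEXSAMP4 code; the ASCB is assumed mapped with IHAASCB):
L R7,PSAAOLD get ASCB ptr
USING ASCB,R7 map ASCB
LM R0,R1,ASCBEJST job step time as of last dispatch
AL R1,ASCBSRBT+4 low order: sum+=ASCBSRBT
BC 12,*+4+4 check for carry
AL R0,=F'1' and correct if needed
AL R0,ASCBSRBT high order: sum+=ASCBSRBT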
When searching for other 'cpu time from ASCBEJST' methods I stumbled across the TIMED function included in the 1998 edition of the KERNLIB section of the CERN PACKLIB (see source code https://accserv.lepp.cornell.edu/svn/packages/cern/98/src/packlib/kernlib/kerngen/obsolete/kernibm/symvs/timexxa.s). This function calls the SVC CALLDISP to force an update of ASCBEJST. It turns out that this works under MVS/SE, but not under MVS 3.8J. A WAIT on a fall-through ECB (see posting https://groups.google.com/forum/#!topic/bit.listserv.ibm-main/FRuYz33wUGs) also doesn't work under MVS 3.8J.
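For completeness, the fall-through ECB trick amounts to a WAIT on an ECB whose post bit is already set, so the WAIT completes immediately but is supposed to pass through the dispatcher; a sketch (names are mine):
XC FECB,FECB clear the ECB
OI FECB,X'40' set post bit up front
WAIT ECB=FECB completes at once
...
FECB DS 1F fall-through ECB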
A comment in the TIMED function finally pointed me to the proper solution. The comment states, in effect:
on non-SE systems use ASCBEJST + (TOD - LCCADTOD).
LCCADTOD is an LCCA field which holds the TOD of the last dispatch. So I searched the whole MVS 3.8J code base for code using LCCADTOD. This is easy to do after a full export with the hercexport tool (see postings on hercexport https://groups.yahoo.com/neo/groups/hercules-390/conversations/topics/83578 and tk4- takeout https://groups.yahoo.com/neo/groups/turnkey-mvs/conversations/topics/10884); a simple
find -name "*.mac" -type f -print0 | xargs -0 grep -l ",LCCADTOD"
lists all PDS members with code directly accessing LCCADTOD. The essential hits are
./cbt001.341/cbtcov.file185/cputime.mac
./cbt001.341/cbtcov.file185/cputim.mac
which contain two versions of a CPUTIM function from a CBT volume, one for FORTRAN and one for PL/I. The essential core of both implementations is
STCK TOD store current TOD
LM R4,R5,TOD load TOD into register pair
$SLD R4,LCCADTOD subtract TOD of last dispatch
$ALD R4,ASCBEJST add job step time up to last dispatch
where $ALD and $SLD are macros for double word integer add/sub (available on the same CBT volume in PDS cbtcov.file188). A quick test immediately shows that this works fine and returns a high resolution CPU time.
However, the key trick of this method, that the time since the last dispatch is added, also creates a vulnerability. If a dispatch occurs between the subtract of LCCADTOD and the add of ASCBEJST, the CPU time spent during the previous dispatch is double counted. Not very likely to happen, but possible, especially on a single CPU system with some I/O load. To protect against this vulnerability, I store the LCCADTOD value at the beginning of the function and check after all arithmetic whether the LCCADTOD value changed. If it did, a dispatch happened, and the whole procedure is simply re-tried. Taken all together, the core of my CPUTIM method looks like
L R6,PSALCCAV get LCCA ptr
L R7,PSAAOLD get ASCB ptr
LA R10,9 init retry loop count
*
CPUTIMR LM R8,R9,LCCADTOD get initial LCCADTOD
STM R8,R9,SAVDTOD and save it
*
STCK CKBUF store TOD
LM R0,R1,CKBUF
SLR R1,R9 low order: sum=TOD-LCCADTOD
BC 3,*+4+4 check for borrow
SL R0,=F'1' and correct if needed
SLR R0,R8 high order: sum=TOD-LCCADTOD
*
LM R8,R9,ASCBEJST load ASCBEJST
ALR R1,R9 low order: sum+=ASCBEJST
BC 12,*+4+4 check for carry
AL R0,=F'1' and correct if needed
ALR R0,R8 high order: sum+=ASCBEJST
*
LM R8,R9,ASCBSRBT load ASCBSRBT
ALR R1,R9 low order: sum+=ASCBSRBT
BC 12,*+4+4 check for carry
AL R0,=F'1' and correct if needed
ALR R0,R8 high order: sum+=ASCBSRBT
*
LM R8,R9,LCCADTOD get final LCCADTOD
C R9,SAVDTOD+4 check low order
BNE CPUTIMN if ne, dispatch detected
C R8,SAVDTOD check high order
BE CPUTIME if eq, all fine
*
CPUTIMN BCT R10,CPUTIMR retry in case dispatch detected
*
CPUTIME <return handling> CPUTIM in register pair R0,R1
...
CKBUF DS 1D
SAVDTOD DS 1D
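For the record: since bit 51 of the TOD clock corresponds to 1 microsecond, a single double shift converts the result to microseconds:
SRDL R0,12 TOD format -> microseconds in R0,R1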
Any comments on the function logic are very much appreciated, especially on:
- is this solution also portable to current MVS versions?
- is there still any hidden vulnerability which affects timing precision?
- should the SRB time be included (as in the ALGOL example), or not (as in CERNLIB and CBT)?
With best regards, Walter
P.S.: a test code with the logic described above is available as part of the s370_perf project https://github.com/wfjm/s370-perf, see source file https://github.com/wfjm/s370-perf/blob/master/codes/test_cputime.asm and JCL template https://github.com/wfjm/s370-perf/blob/master/codes/test_cputime.JES, to be used with the hercjis preprocessor https://github.com/wfjm/herc-tools/blob/master/doc/hercjis.md.