SUMMARY: CPU performance: 10/30 faster than 10/41

From: p.ross (ross@bio-medical-physics.aberdeen.ac.uk)
Date: Fri Oct 08 1993 - 18:54:39 CDT


Yesterday I asked why a 10/41 runs some programs more slowly than a 10/30.

....stuff deleted...
> with an FFT application the 10/41 is 12% SLOWER!!
> ^^^^^^^^^^^^^^^^^
>
> 1) Can anyone tell me if this is the normal behaviour of a 10/41? I could
> possibly imagine there might be little improvement compared to the 10/30
> but why should the 10/41 ever be worse than the 10/30?
>
> 2) What is the effect of processor cache on a 10/41 uniprocessor
> (theoretically) and is there any easy way to switch it off to see its
> effect? (ie. is the greater mismatch in speed between processor and
> main memory causing extra wait states to be added during cache reads?)
>
> 3) How would the performance of a single threaded task be affected by
> adding a second processor? (the reason we bought a 10/41 was because
> we were told that was the minimum configuration which could have a
> second processor addded).
....stuff deleted...

As I suspected (hence question 2), it appears that the external cache on the
10/41 can give rise to performance anomalies relative to the 10/30.

Several replies essentially stated that programs with large working sets,
stepping through arrays with a non-unit stride, suffer frequent cache
misses. When that happens, the 'prefetch' by the cache wastes memory
bandwidth, which is often enough to drop the performance of a 10/41 below
that of a 10/30.

I was also pointed in the direction of several Sun white papers about the
SuperSPARC processor and performance tuning. Specifically:

 SuperSPARCWhitePaper.ps
 sun_performance_tuning_overview.ps
 sun_performance_tuning_overview_contents.ps

which are all available by anonymous ftp (see 'archie' for suitable sites).

I was also sent a copy of a report by Sun Microsystems titled
'SPARCstation 10/30 and 10/41 Relative Performance', by David Hough (93/01/29).
As I was not sent this directly by Sun I don't think I can send a copy to this
list. I have, however, included a summary of the salient points and some excerpts.

"...
Below are some suggestions for programmers that may provide better
performance from existing 10/41's with existing SC2.0.1 compilers.
Some suggestions that help the 10/41+SC2.0.1 combination may be
counter-productive with other hardware or other software releases. All
suggestions should be tried by comparing performance before and after,
on each specific application, to ensure that they actually confer a
benefit:

* Divide large computations into 1 MB chunks.
* Avoid combining simple unit-stride array operations.
* Avoid operations on structs or unions in inner loops.
* Use tmpfs for large sequential files.
* Transpose matrices to maximize the number of unit-stride inner loops.
* Vary the leftmost subscript in Fortran inner loops.
* Explicitly unroll rolled inner loops in source.
* Explicitly roll unrolled inner loops in source.
* Avoid traversing huge data structures.
* Use Winograd-Strassen matrix multiplication methods for sufficiently
  large matrices.

SPARCstation 10's contain Viking (SuperSPARC) CPU chips with internal
physical-address 16KB 4-way associative data cache and 20KB 5-way
associative instruction cache, with cache line sizes of 32 bytes.
10/30 operates at 36 MHz and 10/41 at 40.3 MHz. Thus the expected
performance ratio of 10/41 execution times divided by 10/30 execution
times is 36/40.3 = 89%, based on clock rate alone.

However 10/41 also contains an external physical-address 1MB MXCC
(SuperCache) direct-mapped combined I&D cache, with a cache line size
of 128 bytes. This cache imposes a higher miss penalty - a minimum of
24 cycles, a maximum of over 100 cycles - than the miss penalty for the
internal cache.

Thus programs that have small instruction and working data sets will reside
in the internal cache in both systems and should display a relative performance
of 89% due solely to clock rate.

Programs with somewhat larger requirements, but fitting in the external cache,
may display significantly better relative performance - lower than 89% - by
eliminating many miss penalty cycles because data is supplied from the external
cache rather than main memory.

But programs with very large working sets may display significantly
worse relative performance - greater than 89% - as data is missed in both
the internal and external caches and incurs the greater external miss penalties.

Poor relative performance might also arise due to other factors:

"MXCC prefetch"
The MXCC has a data prefetch feature. It may be disadvantageous if it
consistently prefetches the wrong data, as is possible when following
linked lists or accessing arrays with non-unit stride.
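
My own aside, not part of the Sun report: a minimal C sketch of the sort of
pointer-chasing loop meant here (the struct and function names are invented
for illustration).

    #include <stddef.h>

    /* Pointer chasing: each node can live anywhere in memory, so the
     * 128-byte line the MXCC prefetches after the current node is
     * usually the wrong one.  (Allocating nodes contiguously from a
     * pool would restore a unit-stride access pattern.)                 */
    struct node {
        struct node *next;
        double       value;
    };

    double sum_list(const struct node *p)
    {
        double s = 0.0;
        while (p != NULL) {
            s += p->value;
            p = p->next;          /* jump to an unrelated cache line     */
        }
        return s;
    }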

"write cycles"
If write cycles are bunched together or the same data is alternately
read and written.

"used once"
If a large array is traversed just once and the data used once then
miss penalties will apply to almost all accesses.

"direct-map conflicts" If a code loop references data whose address
differs by almost exactly a multiple of the external cache size of
2**20, or calls a subroutine whose address is similarly offset, then
conflicts will occur because the direct-mapped external cache will map
both code and data, or caller and callee, to the same cache line. The
solution is to move code around by a different linking order, or move
data by inserting some in an appropriate place.
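
Again my own aside, not from the report: a C sketch of how such a conflict
can arise and how padding removes it. Whether 'a' and 'b' really end up
exactly 1 MB apart depends on the linker, so treat the layout as an
assumption.

    #include <stddef.h>

    #define CACHE_BYTES (1 << 20)        /* external cache size, 2**20   */

    /* If the linker happens to place b exactly CACHE_BYTES after a, then
     * a[i] and b[i] map to the same line of the direct-mapped MXCC and
     * evict each other on every iteration of the loop below.            */
    static double a[CACHE_BYTES / sizeof(double)];
    /* static char pad[128]; */          /* one 128-byte line of padding
                                            here would shift b and remove
                                            the conflict                 */
    static double b[CACHE_BYTES / sizeof(double)];

    void add_arrays(double *dst, size_t n)
    {
        size_t i;
        for (i = 0; i < n; i++)
            dst[i] = a[i] + b[i];        /* a[i], b[i]: same cache line  */
    }
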
...."

Programmer suggestions from the report:

1) Try dividing up large computations in order to limit working set size
   to 1 MB (see the sketch after this list).
2) Try doing simple unit-stride array operations one at a time rather
   than combining them in one loop.
3) Avoid operations on structs or unions in inner loops. Try dividing
   operations on structs into their components.
4) Try using tmpfs for large sequential temporary files to be accessed
   more than once.
5) Try explicitly transposing matrices to maximize the number of
   unit-stride inner loops. (The overhead of the transpose may or may
   not be worthwhile: explicit transposition didn't help on cfft2d).
6) Try varying the leftmost subscript in the Fortran inner loop. For
   C, vary the rightmost subscript!
7) Try explicitly unrolling rolled inner loops in source. Try
   explicitly rolling unrolled inner loops in source.
8) Allocate data to bss instead of stack, when possible. Avoid 2KB
   strides.
9) Use Winograd-Strassen methods when multiplying sufficiently large
   matrices.
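
To show what suggestions 1 and 2 might look like in practice, here is a
small C sketch of my own (the function, the operation and the chunk size
are all just examples, not from the report):

    #include <stddef.h>

    #define CHUNK ((size_t)(1 << 20) / sizeof(double))   /* ~1 MB chunk  */

    /* Two simple unit-stride steps, each kept in its own loop
     * (suggestion 2), applied one ~1 MB chunk at a time (suggestion 1)
     * so the second loop finds its data already in the external cache.  */
    void scale_then_shift(double *x, size_t n, double s, double c)
    {
        size_t base, i, len;

        for (base = 0; base < n; base += CHUNK) {
            len = (n - base < CHUNK) ? n - base : CHUNK;

            for (i = 0; i < len; i++)    /* step 1 on this chunk         */
                x[base + i] *= s;
            for (i = 0; i < len; i++)    /* step 2, chunk still cached   */
                x[base + i] += c;
        }
    }

As the report stresses, any such change should be timed before and after on
the real application; what helps a 10/41 with SC2.0.1 may hurt elsewhere.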

Final suggestion to programmers:
"It's often not worthwhile to port code to a new supercomputer unless
you're willing to tune the code for the new supercomputer. It will be
increasingly necessary to plan to tune when porting to high-performance
RISC workstations."

Thanks to all those people who have replied (so far):
doug@edu.berkeley.perry
dan@com.bellcore
echard@de.dlr.go.ts
B.McCrone@uk.ac.daresbury
bernards@nl.ECN

Philip Ross.

/-------------------------------------------------------------------\
| Philip Ross, Computer Officer  | email: ross@biomed.abdn.ac.uk    |
| Dept. Bio-Medical Physics      |                                  |
| University of Aberdeen         |                                  |
| Foresterhill                   | [+44]-224-685645 (Fax)           |
| Aberdeen. AB9 2ZD              | [+44]-224-681818 x53210 (Voice)  |
|-------------------------------------------------------------------|
|       3 is not equal to 2, not even for large values of 2.        |
\-------------------------------------------------------------------/


