From help@cs.rice.edu Wed Jul 5 19:58:49 1995
Date: Wed, 5 Jul 95 09:28:40 PDT
From: marc@efn.org (Marc Baber)
To: hpff@cs.rice.edu
Subject: APR Releases xHPF 2.1 and NAS Benchmark Results
Content-Length: 14861


APR RELEASES xHPF 2.1, WORLD'S FIRST HPF TO TURN IN NAS BENCHMARK RESULTS
July 5, 1995
=========================================================================

Sacramento, CA -- Applied Parallel Research (APR) announced it will
begin shipping xHPF 2.1, the latest version of its industry-leading
High Performance Fortran (HPF) compilation system. xHPF 2.1 is the
first HPF implementation to compile NAS Parallel Benchmark (NPB)
programs and has thus set a new standard for end-user achievable
performance on a wide range of parallel platforms.

The NPB suite is used to measure sustainable performance of computer
systems when running five computational kernels and three simulated CFD
programs. The programs represent typical applications used in NASA's
NAS project. The benchmarks are considered a "pathfinder" in searching
out the best parallel systems for grand challenge problems such as
modeling whole aircraft.

APR's president, John Levesque said, "xHPF may be the only HPF
implementation capable of successfully parallelizing NAS parallel
benchmarks today. To date, no other HPF vendor has published even a
single result for any of these eight benchmarks. The speed-ups
achieved by xHPF on the [Cray] T3D, the [IBM] SP-2, and the [Intel]
Paragon are impressive enough that I believe it will be months or even
years before other HPF vendors can offer comparable performance."

"We expect 1995 will be the watershed year for parallel programming
of distributed memory systems and clusters. Before 1995, hand-tuned
message-passing programming was the norm. Beginning this year,
automatic parallelization by sophisticated, production-quality HPF
compilers will be the norm and APR's xHPF is well-positioned to become
the de facto industry standard for HPF compilation. This is a wake-up
call for application programmers who've been waiting since the early
days of the hypercubes for good parallel Fortran compilers."

From UNI-C, The Danish Computing Center for Research and Education,
Jorgen Moth commented, "Parallelization of standard Fortran programs is
made practical for our busy scientists by FORGE Explorer and xHPF. We
have found these tools to be a bridge between Fortran 77, Fortran 90,
and HPF, thus removing many obstacles from the exploitation of parallel
machines."

At the Cornell Theory Center, where the largest IBM SP-2 (512 nodes)
is installed, Donna Bergmark summarized over two years of experience
with xHPF, saying, "It [APR's xHPF] has proven to be a 'quick and easy'
way to get a program to run in parallel, without having to learn a
message passing protocol." She also noted, "At the present time, there
are on the average 500-800 invocations of xHPF per month [at the CTC]."

With xHPF, automatic parallelization has now reached the point where
gains achievable by hand-parallelization are often not cost-effective
when the expense of re-programming is factored into the
price-performance equation. Nonetheless, for users who demand the very
highest performance, APR provides ForgeX, an interactive, Motif-based
GUI Fortran code browser and parallelization system that is fully
compatible with xHPF. With ForgeX, users can interactively fine-tune
their parallelized codes, using their knowledge of the underlying
algorithms as well as execution timings obtained with ForgeX's code
instrumentation features.

To underscore the importance of the latest release, during the month
of July, APR is offering free ForgeX licenses, including interactive
parallelization for distributed memory systems, for sites purchasing
xHPF licenses. The number of concurrent interactive users is related
to the number of processors the xHPF-parallelized codes will be run
on. Contact APR for details.

The NAS Benchmark results include the EP, SP, BT, FT and MG
programs. These are slightly modified versions of the standard
Fortran-77 programs from NASA supplemented with HPF directives. While
many MPP vendors worked months on optimizing the sequential versions of
these programs to utilize cache more effectively, or to perform table
lookups for some operations, no similar restructurings were performed
with APR's versions. Therefore, the APR versions of the NAS benchmarks
tend to be closer to end-user programs and the results obtained should
be more representative of what might be expected by the general user
community.

The timings in the following tables were obtained using xHPF and
APR's shared memory parallelization system -- spf. With these results
APR is demonstrating the ability to maintain portable code across
varied MPP and SMP parallel systems. All of the benchmarks also run
sequentially on a uni-processor.

The results in the tables following this article are for xHPF 2.1
(APR development version 2029) and, for shared-memory systems, spf. As
development versions and new releases of xHPF achieve even better
results, APR will update the timings available on its web pages at
http://www.infomall.org/apri. APR encourages other HPF vendors to
respond in kind by making their HPF benchmark results available in
their web pages accessible from the HPFF (High Performance Fortran
Forum) web page at http://www.erc.msstate.edu/hpff/home.html.

The speedups obtained for the Fortran-77 versions of the benchmarks
highlight xHPF's superior capabilities in the area of parallelizing
DO-loops in addition to Fortran-90 array syntax and HPF FORALL
statements. Some other HPF implementations either do not attempt to
parallelize DO loops at all, or lack xHPF's robust dependence analysis
capabilities and so fail to parallelize DO loops that pose no problem
for xHPF.
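
As a purely illustrative sketch (the arrays, distribution, and loop
below are hypothetical, not taken from the benchmark sources), the same
relaxation step can be written either as a directive-annotated
Fortran-77 DO loop or as an HPF FORALL statement; xHPF can parallelize
both forms:

      REAL A(1000), B(1000)
!HPF$ DISTRIBUTE B(BLOCK)
!HPF$ ALIGN A(I) WITH B(I)

! A conventional Fortran-77 DO loop; the INDEPENDENT directive
! asserts that the iterations may execute in parallel.
!HPF$ INDEPENDENT
      DO 10 I = 2, 999
         A(I) = 0.5 * (B(I-1) + B(I+1))
   10 CONTINUE

! The equivalent HPF FORALL statement.
      FORALL (I = 2:999) A(I) = 0.5 * (B(I-1) + B(I+1))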

These Fortran programs were processed without modification by APR's
xHPF and code generated for the Cray T3D, IBM SP2, Digital ALPHA
Cluster, SGI Power Challenge Cluster and Intel PARAGON. Then some of
the benchmarks were processed without modification by APR's spf and
code generated for the Sun SPARCcenter 2000.

Readers will notice that the IBM SP-2 does very well compared to the
other MPP systems. APR's timings do come closer to the timings supplied
by IBM than they do on the other platforms, but no special "tuning" was
done for the SP-2. The superior performance can be attributed to IBM's
Fortran-77 compiler (xlf), which compiles the parallelized SPMD
Fortran-77 code output by xHPF; xlf is more successful at achieving
maximum single-processor performance than the other vendors' Fortran-77
compilers. The scaling of the timings as the number of processors
increases is good on all the platforms.

APR is a leading supplier of software tools for Fortran program
analysis, performance measurement, parallelization, restructuring,
and dialect translation.


======== NAS Benchmark Results ========

NOTE: The following times are between 2-10 times slower than the
timings reported by the various vendors. The major difference is due
to the vendors' extensive rewriting of the benchmarks to obtain the
best possible single node performance. APR has asked and will continue
to ask the vendors to supply their optimized single node versions of
the benchmarks so everyone can start with the same sequential
programs. To date, however, all vendors have refused, saying their
versions of the benchmarks are proprietary.


Benchmark SP: Simulated CFD Application
---------------------------------------------------------------------
Platform                    Processors      Time(Sec.)
---------------------------------------------------------------------
Cray C90                         1            7634. **
---------------------------------------------------------------------
Cray T3D                        16            2368.
                                32            1353.
                                64             728.
---------------------------------------------------------------------
IBM SP2-WIDE                    16             576.
                                32             320.
                                64             192.
---------------------------------------------------------------------
Intel Paragon                   16            3435.
                                32            2202.
                                64            1257.
---------------------------------------------------------------------
Sun SPARCcenter                  8            2382.
  2000 (40 MHz)                 16            1617.
---------------------------------------------------------------------

This benchmark was not run on the SGI PowerChallenge or the DEC
ALPHA cluster. The SP benchmark contains several transposes
that require a faster communication fabric.

** C90 timings were obtained by taking the same Fortran 77 code
that was input to xHPF and spf for the other timings and
compiling it with cf77, with no special optimization switches.
Though these results are below the speeds a C90 is capable of,
the purpose here is to show simple compile-and-run performance
for a portable code on parallel systems versus the C90, the
widely recognized single-processor supercomputing standard.

The C90 results are indicative of sequential scalar performance
on the C90, since Cray's cf77 did not vectorize some of the
major loops in the benchmarks for one reason or another. For
example, the EP benchmark calls a subroutine from within its
major loop and, because cf77 doesn't provide global
interprocedural analysis (as xHPF does), it was unable to
vectorize the main loop. It is interesting that xHPF can
parallelize many loops that the best vectorizing compilers in
the industry cannot vectorize.

As with the parallel machines, no attempt was made to make the C90
timings the best that could be achieved. The objective was not to
hit the "Macho-flop" performance ratings, but rather to indicate
what performance the normal compile-and-run user might expect.


Benchmark EP: Embarrassingly Parallel Benchmark
---------------------------------------------------------------------
Platform                    Processors      Time(Sec.)
---------------------------------------------------------------------
Cray C90                         1             694.
---------------------------------------------------------------------
Cray T3D                        16             100.
                                32              50.
                                64              25.
---------------------------------------------------------------------
IBM SP2-WIDE                    16              79.
                                32              40.
                                64              23.
---------------------------------------------------------------------
Intel Paragon                   16             251.
                                32             126.
                                64              64.
---------------------------------------------------------------------
DEC ALPHA                        4             261.
  3000/900 (275 MHz)             8             131.
---------------------------------------------------------------------
SGI PowerChallenge               4             459.
  MIPS R8000                     8             233.
                                16             116.
---------------------------------------------------------------------


Benchmark BT: Simulated CFD Application
---------------------------------------------------------------------
Platform                    Processors      Time(Sec.)
---------------------------------------------------------------------
Cray C90                         1           10615.
---------------------------------------------------------------------
Cray T3D                        16            1958.
                                32            1044.
                                64             551.
---------------------------------------------------------------------
IBM SP2-WIDE                    16             446.
                                32             245.
                                64             164.
---------------------------------------------------------------------
Intel Paragon                   16            5741.
                                32            3091.
                                64            1809.
---------------------------------------------------------------------
Sun SPARCcenter                  8            3393.
  2000 (40 MHz)                 16            1759.
---------------------------------------------------------------------

This benchmark was not run on the SGI PowerChallenge or the DEC ALPHA
cluster. The BT benchmark contains several transposes that require a
faster communication fabric.


Benchmark FT
---------------------------------------------------------------------
Platform                    Processors      Time(Sec.)
---------------------------------------------------------------------
Cray C90                  Code requires more memory than available
---------------------------------------------------------------------
Intel Paragon                   16            1165.
                                32             249.7
                                64             247.7
---------------------------------------------------------------------
Cray T3D                        16             279.4
                                32             192.41
---------------------------------------------------------------------
IBM SP2-WIDE                    16             104.8
                                32              67.52
---------------------------------------------------------------------


This benchmark was not run on the SGI PowerChallenge or the DEC ALPHA
cluster. The FT benchmark contains several transposes that require a
faster communication fabric.


Benchmark MG
---------------------------------------------------------------------
Platform                    Processors      Time(Sec.)
---------------------------------------------------------------------
Cray C90                  Code requires more memory than available
---------------------------------------------------------------------
IBM SP2-WIDE                    16              17.48
                                32              12.25
                                64              12.07
---------------------------------------------------------------------


Source code for these benchmarks can be found via anonymous FTP to
ftp.infomall.org in subdirectory /tenants/apri/Bench or via WWW to the
URL http://www.infomall.org/apri/

Printed hardcopy of this report can be obtained by contacting:

___ _____ _____
__________/==|__|==__=\__|==__=\ Applied Parallel Research, Inc.
_________/===|__|=|__\=\_|=|__\=\ 1723 Professional Drive
________/=/|=|__|=|__/=/_|=|__/=/ Sacramento, CA 95825
_______/=/_|=|__|==___/__|====_/_______________________________________________
______/=___==|__|=|______|=|\=\________________________________________________
_____/=/___|=|__|=|______|=|_\=\_______________________________________________
/_/ |_| |_| |_| \_\
Voice: (916)481-9891 E-mail: support@apri.com
FAX: (916)481-7924 APR Web Page: http://www.infomall.org/apri
-------------------------------------------------------------------------------