Cray-1 vs Raspberry Pi
source link: http://www.roylongbottom.org.uk/Cray%201%20Supercomputer%20Performance%20Comparisons%20With%20Home%20Computers%20Phones%20and%20Tablets.htm
Contents
Summary and Background Activities
My PC Benchmarks
Detailed Results and Comparisons
Also celebrating the 50th anniversary of the Whetstone Benchmark - 1972 to 2022.
Summary
This report is mainly based on the comprehensive benchmark used to verify performance of the first Cray 1. This comprises the Lawrence Livermore Laboratory program kernels (aka Livermore Loops), which provide a range of Millions of Floating Point Operations Per Second (MFLOPS) measurements. In this case, results from my 1990s conversion to all C code are used.
To support these performance ratings, results are also considered from two similar vintage benchmarks. These are the Linpack and Whetstone benchmarks. The first is linpack-pc.c, my accepted conversion for PCs, available at Netlib. For the second, I took over design responsibility from Harold Curnow, the original author, and developed enhanced variations, including one with 100% vectorisation, the initial target being the first Cray 1 system delivered to the UK.
A selection of available results is provided to demonstrate performance variations and comparisons over the years. Other important issues can be considered, based on the information provided in my first Raspberry Pi report.
"In 1978, the Cray 1 supercomputer cost $7 Million, weighed 10,500 pounds and had a 115 kilowatt power supply. It was, by far, the fastest computer in the world. The Raspberry Pi costs around $70 (CPU board, case, power supply, SD card), weighs a few ounces, uses a 5 watt power supply and is more than 4.5 times faster than the Cray 1"
Background Activities - This provides details of my involvement in evaluating, acceptance testing and benchmarking mainframe and supercomputer systems for UK Government and University projects, including hands-on Cray 1 program development of benchmarks and stress tests.
Results Provided - Livermore Loops MFLOPS minimum, geometric mean (official average) and maximum, Linpack MFLOPS, Whetstone overall MWIPS and average MFLOPS of appropriate tests. These are all single core benchmarks.
Raspberry Pi ARM CPUs - The comment above was for the 2012 Pi 1. In 2020, the Pi 400 average Livermore Loops, Linpack and Whetstone MFLOPS reached 78.8, 49.5 and 95.5 times faster than the Cray 1.
Android ARM CPUs - 2012 Android tablet results identified Cray 1 gains with a range from barely there up to 10 times. My 2021 mid priced phone produced MFLOPS gains of 123, 74 and 151 times.
Windows and Linux PCs Intel CPUs - The first PC to reach the average Cray 1 Livermore Loops score is indicated as a 1994 100 MHz Pentium. Best results for the original benchmarks are for a medium range laptop with a 2021 11th generation 4150 MHz Core i5 CPU. The three MFLOPS gains were 117, 131 and 134 times.
Advanced SIMD compilations lead to i5 gains of 359, 337 and 226 times.
Multiprogramming Livermore Loops - Four copies of the Advanced SIMD Livermore Loops Benchmark were run at the same time. This resulted in a MFLOPS throughput gain of 1134 times.
Vector Whetstones - This single core benchmark uses large data arrays that produce 100% vectorisation for all test functions and was produced to benchmark the first UK Cray 1. Results are included for thirteen 1978 to 1991 supercomputers. For this benchmark, Single and Double Precision (SP and DP) versions are available, the latter could be appropriate for comparison with supercomputer longer words. Top SP and DP MFLOPS measurements for the Core i5 were 602 and 433 times faster than Cray 1.
Multithreading MP Whetstones - Results are provided essentially from running multiple copies of the mainly scalar version of the Whetstone benchmark, using 1, 2, 4, and 8 threads, via a single program. It highlights complications due to varying CPU MHz, according to the number of threads, and benefits of PC Hyperthreading. Single and double precision versions were run, in this case obtaining similar performance. Eight thread throughput gains over the Cray 1 were Raspberry Pi 400 times, Android phone 757 times and Core i5 laptop 1521 times.
MP MFLOPS - This executes combinations of floating point multiplications and additions handling SP or DP variables, intended to demonstrate near maximum performance, again from a single program. For Intel, assembly code listings are provided for the normally fastest test. Based on the mix of floating point operations, an estimate of Cray 1 maximum speed, running these, is reduced from 160 MFLOPS to 122. The Core i5 laptop gains, over the revised Cray 1 maximum rating, were SP 2671 and DP 1317 times via 326 and 161 GFLOPS.
Gains on the other devices were Android phone SP 293 times and Raspberry Pi SP 247 times, both at greater than 30 GFLOPS. This benchmark has a run time parameter to use up to 64 threads that should demonstrate far superior performance of more advanced CPUs.
Background Activities
Reliability Studies - I worked for the UK Government Central Computer Agency from 1960, initially analysing fault returns that were contractually required for all new systems. These provided the first detailed statistics included in my book “Computer System Reliability” published in 1980. I also provided assistance in running acceptance tests, gathering similar information, over the years, for inclusion in my book.
Acceptance Trials and First Supercomputer Involvement - During the late 1960s, with 20 staff, I became in charge of all acceptance trials, taking personal responsibility for top of the range computers. This included organising and supervising trials of the UK Atlas 2 for Cambridge University Mathematical Laboratory, the 1962 earlier version said to be the most powerful supercomputer in the world.
Scientific Systems Evaluation and Acceptance Trials - In the 1970s and early 1980s, with up to 15 staff, I covered evaluating and acceptance testing of scientific systems, with continuing responsibility for design and acceptance trial supervision of the larger systems. Between 1972 and 1973 these included an IBM 360/195 for UK Met Office and a CDC 7600 for ULCC (University of London), again said to be the current fastest supercomputers.
Stress Testing Programs - In order to stress test all computers, during acceptance tests and under Operating Systems, I produced a range of Fortran programs, a few for testing CPUs, with others covering everything from paper tape punches to disk drives. These had parameters to run for extended periods and were used during hundreds of acceptance tests from 1972 up until the 1990s. The tests included the Whetstone benchmark, produced by my CCTA colleague Harold Curnow and the first accepted general purpose computer benchmark. I collected running times of most programs for use in performance evaluation.
Met Seymour Cray - It must have been 1969, when I was visiting the Control Data manufacturing facility in Minnesota, that I was asked to visit Chippewa Falls in Wisconsin to witness a UK Met Office benchmark run on the CDC 7600. Then, I had a brief encounter with Seymour Cray, who appeared to run the benchmark. After setting it up, it was all over in a flash, with Seymour reporting that it took not a lot of milliseconds. This influenced my later development of general purpose benchmarks to have noticeable running times with ongoing displays of progress.
External Consultancy - CCTA had contractual responsibility for handling procurement of centrally funded university computers, leading to me becoming an advisor to the Computer Board for Universities and Research Councils, and later a member of the Technical Sub-Group for Advanced Research Computers. In 1976, I was appointed, as an expert from a Member State, to join a European Centre for Medium-Range Weather Forecasts Committee, involving procurement of a new supercomputer, where a Cray 1 became the obvious choice.
Hands-on Cray 1 - My detailed involvement in real supercomputers started in 1978, including a second visit to Chippewa Falls to evaluate the Cray 1. This was followed by a pre-delivery factory trial, in 1979, for the new AWRE Aldermaston system. Meanwhile, Cray 1 serial 1 was at the UK Rutherford Laboratory, where I converted all my appropriate test programs, and the Whetstone benchmark, to use the new vector instructions. The on-site acceptance trials were carried out later in 1979, where the Cray 1 passed with flying colours. This was followed by the same factory and on-site testing procedures on the serial 1 CDC Cyber 205 for UK Met Office, in 1981. That one failed its first factory trial due to my I/O stress testing program identifying a design fault.
Collecting Performance Data - Next, until my early retirement in 1993, I was mainly involved in performance consultancy of data processing systems, personally covering more than 60 projects. I also took over design responsibility for the Whetstone benchmark and continued consultancy on university procurements. Starting during this period, I collected published details of computers, amounting to more than 2000 mainframes, minicomputers, supercomputers and workstations, from around 120 suppliers. I also continued collecting Whetstone results, ending with more than 700 from 53 computer manufacturers, covering minicomputers, through mainframes, to supercomputers.
These provided the beginning for my ResearchGate reports, starting with Whetstone Benchmark History and Results.pdf and Computer_Speed_Claims_1980_to_1996.pdf, also Computer Speeds From Instruction Mixes pre-1960 to 1971.pdf.
Met Key Supercomputer Benchmark Authors - As part of a university benchmark investigation team, I visited the USA in 1987, including to the creator of the Linpack range of benchmarks, Jack Dongarra in Tennessee, and to the Lawrence Livermore Research Laboratory, who produced the Livermore Loops (Livermore Fortran Kernels) benchmark. This became the key supercomputer benchmark for a number of years.
Influencing Supercomputer Choice - In 1988, the Director of University of Manchester Regional Computer Centre requested my performance analysis of the two competing supercomputers, after being told by part of the evaluation team that I would support one of the proposals. I demonstrated the case for the opposite choice, which was accepted, using my scalar and vector Whetstone benchmark results. This is based on a large number of users, where 90% of programs can be vectorised. Then, the one with the fastest maximum vector speed, but the lowest for scalar code, lost the battle.
Benchmarking Supercomputers in Japan - My last involvement in supercomputers was for a new one for University of London Computer Centre, over 1991 and 1992, when I became the independent observer of a benchmark, based on numerous real applications, at Fujitsu and NEC in Japan. My colleague dealt with Cray, in the USA, that won the contract with a Y-MP configuration. As confirmed with my simple scalar and vector Whetstone, that I ran then, it was really comparing multiple pipelines against multiple CPUs, each of the latter with scalar and vector processing capabilities.
My PC Benchmarks
Classic Benchmarks - Following retirement came part time consultancy and eventually creating my website (roylongbottom.org.uk) to house copies of performance data, collected during my CCTA days (with approval), and a range of benchmarks, initially concentrating on those for PCs, all for free with no adverts. The first are detailed in my Classic Benchmarks report, covering Whetstone, Dhrystone, Linpack 100 and Livermore Loops, using C/C++ compiled programs. Early PCs had poor timer resolution, with benchmarks or functions requiring running times of 5 seconds for consistent performance. Other requirements included logging results in text files, checking output for consistent numeric results and, where possible, displaying performance and results as the tests progressed.
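The 5 second running time requirement can be met by calibrating the number of passes before the timed run. The following Python sketch is only an illustration of that idea, not the method used in the C/C++ Classic Benchmarks; the names `calibrate` and `busy_kernel`, and the pass doubling strategy, are assumptions made for this example.

```python
import time

def calibrate(kernel, target_seconds=5.0, start_passes=1):
    """Double the pass count until one timed run lasts at least target_seconds.

    Hypothetical helper: returns (passes, elapsed_seconds) for the final run.
    """
    passes = start_passes
    while True:
        start = time.perf_counter()
        for _ in range(passes):
            kernel()
        elapsed = time.perf_counter() - start
        if elapsed >= target_seconds:
            return passes, elapsed
        passes *= 2

def busy_kernel():
    # Stand-in workload; a real benchmark would run its test loop here.
    total = 0.0
    for i in range(1000):
        total += i * 0.5
    return total

if __name__ == "__main__":
    # A short target keeps the demonstration quick; the report quotes 5 seconds.
    passes, elapsed = calibrate(busy_kernel, target_seconds=0.05)
    print(f"{passes} passes took {elapsed:.3f} s")
```

With a coarse timer, a longer target run makes the timer resolution a smaller fraction of the measured time, which is why the measured speed becomes more consistent.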
Besides benchmark reports, identified below, I developed programs in assembly code, with reports now at ResearchGate in PC CPUID 1994 to 2013, plus Measured Maximum Speeds via Assembler Code.pdf and PC CPU Specifications 1994 to 2014 plus Measured MIPS and MFLOPS per MHz.pdf.
Netlib Involvement - Other than for my Whetstone programs, the initial source code was obtained from Netlib, where my linpack-pc.c code was later accepted and included, for use on PCs. Livermore Loops conversion was time consuming, where C code was available for the calculations, but data generation, checking and other activities were in Fortran, that I converted to C. All these followed conversion routes from running under DOS, through variations of Windows, Linux, including Raspberry Pi, and Android, most at 32 bits and 64 bits. The majority of early results were collected through involvement in Compuserve Benchmark and Standards Forum.
Livermore Loops - Results shown here are for the second version of this benchmark, comprising 24 loops. Besides MFLOPS measurements for each of these, summary minimum, maximum and various averages are produced, with geometric mean being the official average. During my visit to LLL, I was given a copy of the 1986 report “The Livermore Fortran Kernels: A Computer Test of the Numerical Performance Range”, with over 200 pages of results for minicomputers to supercomputers. The document appears to be available via Internet, but I never managed to obtain a free download. My ResearchGate report Livermore Loops Benchmark Results On PCs and Later Devices.pdf contains results up to 2017.
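To show why the choice of average matters, here is a minimal Python sketch comparing the arithmetic and geometric means of a set of per-loop MFLOPS figures. The three sample values are invented purely for illustration and are not measured results; the function name is also an assumption.

```python
import math

def geometric_mean(values):
    """Geometric mean via the mean of logarithms (values must be positive)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

if __name__ == "__main__":
    # Illustrative values only: one slow, one medium and one fast loop.
    sample_mflops = [1.0, 10.0, 100.0]
    arithmetic = sum(sample_mflops) / len(sample_mflops)
    geometric = geometric_mean(sample_mflops)
    print(f"arithmetic mean {arithmetic:.1f}, geometric mean {geometric:.1f}")
```

For these sample values the arithmetic mean is 37.0 while the geometric mean is 10.0, so a single very fast loop has far less influence on the geometric mean.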
As for the other Classic Benchmarks, later details are included in reports covering the different platforms.
Linpack 100 - Performance of this original version of the benchmark is dependent on a function that simply calculates dy[i] = dy[i] + da*dx[i]; but with too many overheads that prevent performance from approaching maximum possible. However, this can be compiled to use linked (fused) multiply and add instructions, such as available on the Cray 1 and later computers, particularly with SIMD vector capabilities. As for Livermore Loops, run on systems with smaller word sizes than Cray 1, it is compiled to use double precision arithmetic. Detailed results of this original version, and that for the later HPL variety for modern supercomputers, are available from Netlib in the document Performance of Various Computers Using Standard Linear Equations Software. My Linpack Benchmark results On PCs and Later Devices is available from ResearchGate.
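The key calculation quoted above can be written out as a loop. This is a Python rendering for illustration only (the benchmark itself is C code); the function signature and the flop counting helper are assumptions made for this sketch.

```python
def daxpy(n, da, dx, dy):
    """Apply dy[i] = dy[i] + da*dx[i] to the first n elements, in place."""
    for i in range(n):
        dy[i] = dy[i] + da * dx[i]
    return dy

def daxpy_flops(n):
    """Each element needs one multiply and one add."""
    return 2 * n

if __name__ == "__main__":
    dx = [1.0, 1.0, 1.0]
    dy = [1.0, 2.0, 3.0]
    print(daxpy(3, 2.0, dx, dy), daxpy_flops(3))
```

Because every element needs exactly one multiply followed by one add, this loop is a natural fit for a linked (fused) multiply and add instruction.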
Whetstone - The current versions include the changes I made to produce performance ratings of each of the eight test functions, particularly to identify cases where some of the code was not being executed (deliberately by some, in the days of minicomputers). Here, the original overall MWIPS ratings are quoted, along with average MFLOPS for the three tests carrying out straightforward calculations. Two other tests include functions such as SIN and LOG, where changes in maths libraries can significantly affect overall MWIPS. This benchmark is based on simple code sequences used in the 1960s, where performance is more inclined to be proportional to CPU MHz until dramatic changes are made in hardware, like the introduction of additional instructions.
My ResearchGate report for this benchmark is Whetstone Benchmark Detailed Later Results.
Vector Whetstone - The vector version, converted for the Cray 1, executes the same functions as the scalar version, but covering a number of sequential memory locations, defined by a vector length parameter. Cray 1 performance achieved maximum performance at vector lengths of 64 words and above, with a sawtooth pattern. I have a previously unpublished C/C++ version that I ran on Raspberry Pi 400 and Windows and Linux based PCs, with results, at vector lengths of 256, being provided below.
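One way to picture a sawtooth of this kind is a toy model in which a long vector is processed in segments of 64 words (the length of a Cray 1 vector register), with a fixed startup cost per segment. The sketch below is not measured data and not the author's analysis; the startup cost of 10 cycles and the function name are assumed placeholders.

```python
import math

VECTOR_REGISTER_LENGTH = 64  # a Cray 1 vector register holds 64 words

def relative_speed(n, startup_cycles=10):
    """Toy model: useful work divided by total cycles for a vector of n words.

    startup_cycles is a hypothetical per-segment overhead, not a measured value.
    """
    segments = math.ceil(n / VECTOR_REGISTER_LENGTH)
    cycles = n + segments * startup_cycles
    return n / cycles

if __name__ == "__main__":
    for n in (16, 32, 64, 65, 96, 128, 129, 192, 256):
        print(f"vector length {n:3d}: relative speed {relative_speed(n):.3f}")
```

In this model the speed peaks at each multiple of 64 and drops just after it, because one extra word forces a whole extra segment and its startup cost.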
MP Whetstone - None of the other benchmarks, covered here so far, exercise more than one CPU core. This MP benchmark currently executes 1, 2, 4 and 8 copies of the standard code via multithreading in a single program. Some results are included to highlight performance gains over the single CPU in the Cray 1.
MP MFLOPS - Results for this benchmark have been included for comparison with the maximum MFLOPS possible on the Cray 1. The benchmark executes tests with 2, 8 and 32 floating point operations per data word, covering larger caches and RAM. Default operation uses 8 threads but can be changed, up to 64 threads, with a run time parameter. Description and earlier results are available in MultiThreading Benchmarks.
Detailed Results and Comparisons
Variations - Results provided below, cover Livermore Loops, Linpack and Whetstone benchmarks. Note that the main Livermore report contains numerous results, covering different compilers and possibly hardware changes. As for the Whetstone benchmark, slow performance of single test functions can severely impact overall ratings. The chosen results represent my choice of typical performance.
Cray 1 - The maximum possible hardware performance was said to be 160 MFLOPS for the 80 MHz Cray 1, comprising linked multiply and add, for two results per clock cycle. The LLL benchmark maximum shown is 82.1 MFLOPS with average 11.9 and minimum 1.2. Then Linpack and Whetstone are 27 and 6 MFLOPS, Linpack benefiting from linked multiply and add.
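The peak figure follows directly from the clock speed and the two results per cycle, and the measured results can be expressed as a fraction of it. A small Python sketch of that arithmetic, using only the numbers quoted above (function names are assumptions):

```python
def peak_mflops(clock_mhz, results_per_cycle):
    """Peak MFLOPS = clock in MHz multiplied by floating point results per clock."""
    return clock_mhz * results_per_cycle

def fraction_of_peak(measured_mflops, peak):
    return measured_mflops / peak

if __name__ == "__main__":
    peak = peak_mflops(80, 2)  # linked multiply and add: two results per clock
    measured = (("LLL maximum", 82.1), ("LLL average", 11.9),
                ("Linpack", 27.0), ("Whetstone", 6.0))
    for label, mflops in measured:
        share = fraction_of_peak(mflops, peak)
        print(f"{label}: {mflops} MFLOPS is {share:.1%} of {peak} MFLOPS peak")
```

Even the best Livermore loop reaches only about half of the 160 MFLOPS peak, which is why the comparisons in this report use measured benchmark results and not peak ratings.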
Raspberry Pi - In 2013 I ran my benchmarks on the first version of Raspberry Pi. These were essentially the same as those used on PCs, under Linux. The programs included the Livermore Loops benchmark and that led to me including the following in a report.
“In 1978, the Cray 1 supercomputer cost $7 Million, weighed 10,500 pounds and had a 115 kilowatt power supply. It was, by far, the fastest computer in the world. The Raspberry Pi costs around $70 (CPU board, case, power supply, SD card), weighs a few ounces, uses a 5 watt power supply and is more than 4.5 times faster than the Cray 1.” This refers to official average geometric mean results.
In 2019 (aged 84), I was invited to become a voluntary member of Raspberry Pi Foundation’s Alpha Testing Team, which I accepted. This led to me running my benchmarks and stress tests on new top of the range systems before announcement. The supply of new hardware and software, so far, has led to me producing eight additional PDF reports that are available from ResearchGate. See Project Log for Performance of Raspberry Pi and Android Devices. Early results are included in the benchmark specific reports, identified above. Later reports that include links to download the benchmarks are Raspberry Pi 400 PC 32 Bit and 64 Bit Benchmarks and Stress Tests.pdf and Raspberry Pi 32 Bit and 64 Bit Benchmarks and Stress Tests.pdf.
The following MFLOPS comparisons are in the order of Livermore Loops average.
Comparison - The three 700 MHz Pi 1 main measurements (Loops, Linpack and Whetstone) were 55, 42 and 94 MFLOPS, with the four gains over Cray 1 being 8.8 times for MHz and 4.6, 1.6, 15.7 times for MFLOPS.
The 2020 1800 MHz Pi 400 provided 819, 1147 and 498 MFLOPS, with MHz speed gains of 23 times and 69, 42 and 83 times for MFLOPS. With more advanced SIMD options, the 64 bit compilation produced Cray 1 MFLOPS gains of 78.8, 49.5 and 95.5 times.
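LLLoops columns are Livermore Loops MFLOPS.

| Device | CPU | MHz | LLLoops Max | LLLoops Gmean | LLLoops Min | Linpack MFLOPS | Whets MWIPS | Whets MFLOPS | CPU Year | Device Year |
|---|---|---|---|---|---|---|---|---|---|---|
| Cray | | | | | | | | | | |
| Cray 1 | | 80 | 82.1 | 11.9 | 1.2 | 27 | 16.2 | 6 est | | 1978 |
| XMP1 | | 118 | 162.2 | 17.3 | 2.1 | 121 | 30.3 | 11 | | 1985 |
| Raspberry Pi 32 bit | | | | | | | | | | |
| Pi 1 | 1176JZF | 700 | 148 | 55 | 17 | 42 | 271 | 94 | 2001 | 2012 |
| Pi 2 | A7 | 900 | 248 | 115 | 42 | 120 | 525 | 244 | 2011 | 2014 |
| Pi 3 | A53 | 1200 | 436 | 184 | 56 | 176 | 725 | 324 | 2012 | 2016 |
| Pi 4 | A72 | 1500 | 1861 | 679 | 180 | 764 | 1883 | 415 | 2015 | 2019 |
| Pi 400 | A72 | 1800 | 2262 | 819 | 217 | 1147 | 2258 | 498 | 2015 | 2020 |
| Raspberry Pi 64 bit | | | | | | | | | | |
| Pi 400 | A72 | 1800 | 3353 | 938 | 242 | 1337 | 2505 | 573 | 2015 | 2020 |
| Ratios | | | | | | | | | | |
| Rpi 1/Cray 1 | | 8.8 | 1.8 | 4.6 | 13.8 | 1.6 | 16.7 | 15.7 | | |
| 64 bit/32 bit | | 1.0 | 1.5 | 1.1 | 1.1 | 1.2 | 1.1 | 1.1 | | |
| 64 bit/Cray 1 | | 22.5 | 40.8 | 78.8 | 201.7 | 49.5 | 154.6 | 95.5 | | |

Cray 1 Whets MFLOPS estimated based on XMP results. Main columns: MHz, LLLoops Gmean, Linpack MFLOPS and Whets MFLOPS.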
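The gains quoted in these comparisons are simple ratios against the Cray 1 figures. The following Python sketch reproduces the calculation for the 64 bit Pi 400 results from the table; the dictionary layout and function name are assumptions made for this example.

```python
# Cray 1 reference values from the table: MHz and the three main MFLOPS columns.
CRAY_1 = {"mhz": 80.0, "loops": 11.9, "linpack": 27.0, "whetstone": 6.0}

def gains_over_cray_1(device):
    """Ratio of each device measurement to the matching Cray 1 measurement."""
    return {name: device[name] / CRAY_1[name] for name in CRAY_1}

if __name__ == "__main__":
    pi_400_64bit = {"mhz": 1800.0, "loops": 938.0, "linpack": 1337.0, "whetstone": 573.0}
    for name, gain in gains_over_cray_1(pi_400_64bit).items():
        print(f"{name}: {gain:.1f} times")
```

This gives 22.5 times for MHz and 78.8, 49.5 and 95.5 times for the three MFLOPS measurements, matching the 64 bit/Cray 1 row.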
Android
In 2012, I converted my benchmarks to run via Android, in native ARM code, requiring Java front end programs. The latest versions identify the hardware, automatically running code for ARM or Intel CPUs, under 32 bit or 64 bit Operating Systems. In the early days I obtained lots of results with similar performance, a sample of these being provided below.
Comparison - The first results were for tablets that did not have hardware or software to support fast floating point calculations. The earliest with appropriate facilities, from 2012, used the ARM Cortex-A9 processors, starting with 800 MHz versions. This is indicated as having the three MFLOPS speeds of 20, 11 and 22, or at 10 times Cray 1 CPU MHz, with gains of 1.7, 0.4 and 3.7 in MFLOPS.
A later 800 MHz V7-A9 obtained 115, 101 and 155 MFLOPS, or Cray 1 gains of 9.7, 3.7 and 25.8 times.
Fastest results provided are for a 2021 mid priced phone with a Kryo 570 CPU, said to be based on ARM Cortex-A77. At 2000 MHz, this obtained an average LLL speed of 1468 MFLOPS, with Linpack at 1986 and 905 for Whetstone and Cray 1 performance gains of 123, 74 and 151 times, at 25 times CPU MHz.
The latest versions of the benchmarks can be downloaded and installed from the following (see security warning). Android 9 Benchmarks and Stress Tests On 32 Bit and 64 Bit CPUs. Then Android 10 and 11 Benchmarks and ARM big.LITTLE Architecture Issues might be of interest, with Android Benchmarks For 32 Bit and 64 Bit CPUs from ARM and Intel providing more information, results and access to older (out of date) apps.
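LLLoops columns are Livermore Loops MFLOPS.

| CPU | MHz | LLLoops Max | LLLoops Gmean | LLLoops Min | Linpack MFLOPS | Whets MWIPS | Whets MFLOPS | CPU Year | Device Year |
|---|---|---|---|---|---|---|---|---|---|
| Cray 1 | 80 | 82.1 | 11.9 | 1.2 | 27 | 16.2 | 6.0 | | 1978 |
| Android 32 bits | | | | | | | | | |
| V7-A9 a | 800 | 36 | 20 | 11 | 11 | 171 | 22 | 2012 | 2012 |
| V7-A9 a later | 800 | 253 | 115 | 47 | 101 | 687 | 155 | 2012 | 2012 |
| v7-A9 | 1200 | 208 | 176 | 27 | 159 | 731 | 259 | 2012 | 2012 |
| v8-A53 | 1300 | 397 | 164 | 28 | 348 | 868 | 332 | 2012 | 2015 |
| v7-A15 | 1700 | 471 | 342 | 34 | 826 | 907 | 329 | 2012 | 2013 |
| QU-800 | 2150 | 447 | 356 | 112 | 630 | 1974 | 610 | 2013 | 2013 |
| V8-A72 | 1800 | 674 | 584 | 136 | 1023 | 2053 | 465 | 2015 | 2015 |
| Android 64 bits | | | | | | | | | |
| v8-A53 | 1300 | 805 | 238 | 101 | 338 | 1494 | 319 | 2012 | 2015 |
| Exynos 8890 | 2300 | 188 | 158 | 27 | 999 | 3342 | 760 | 2016 | 2017 |
| v8-A57 | 2000 | 724 | 641 | 245 | 1163 | 1988 | 390 | 2013 | 2015 |
| v8-A73 | 2000 | 877 | 786 | 269 | 1122 | 2927 | 497 | 2016 | 2019 |
| Kryo 570 | 2000 | 1620 | 1468 | 514 | 1986 | 4650 | 905 | 2020 | 2021 |
| Ratios | | | | | | | | | |
| A53 64/32bit | 1.0 | 2.0 | 1.5 | 3.6 | 1.0 | 1.7 | 1.0 | | |
| V7-A9 a/Cray 1 | 10.0 | 0.4 | 1.7 | 9.2 | 0.4 | 10.6 | 3.7 | | |
| v7-A9 later | 10.0 | 3.1 | 9.7 | 39.2 | 3.7 | 42.4 | 25.8 | | |
| 32b A72/Cray 1 | 22.5 | 8.2 | 49.0 | 113.5 | 37.9 | 126.7 | 77.5 | | |
| 64b 570/Cray 1 | 25.0 | 19.7 | 123.3 | 428.0 | 73.6 | 287.1 | 150.8 | | |

Main columns: MHz, LLLoops Gmean, Linpack MFLOPS and Whets MFLOPS.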
Windows Intel and AMD CPUs
I developed my benchmarks for Intel CPUs in the 1990s, starting with DOS and OS/2, through varieties of Windows and Linux. The compiled benchmarks and source codes are all available for download via my PDF reports at ResearchGate. I received numerous results up to 2005. By 2013, my interests became ARM CPUs, with MHz of those from Intel not increasing sufficiently to show real improvements in performance of my single core benchmarks. Then, my 3.9 GHz CPU was close to maximum speed and, in 2021, this appears to have only reached 5.5 GHz, but now with 16 CPU cores. In order to obtain some up to date performance data, I bought a new laptop with an 11th generation Core i5 CPU that runs at a maximum speed of 4.15 GHz.
Comparison - Below are samples of results where details for the three benchmarks were available. The first PC to reach the average Cray 1 Livermore Loops score is indicated as a 1994 100 MHz Pentium, shown as 12 MFLOPS, with Linpack and Whetstone at 12 and 16. This gives approximate Cray 1 comparisons of MHz and the three MFLOPS measurements of 1.3, 1.0, 0.44 and 2.6 times.
PCs with faster Pentium processors continued to produce performance proportional to CPU MHz, with improvements appearing with the 1995 Pentium Pro. At 200 MHz the three MFLOPS measurements were 34, 49 and 41 and four comparisons 2.5, 2.9, 1.8 and 6.8 times.
Next came various Pentium II and III models with improvements to these benchmarks mainly proportional to CPU MHz. Then the 2002 Pentium 4 is shown to achieve 187, 382 and 146 MFLOPS, but at 1700 MHz, producing the four Cray 1 comparisons of 21, 16, 14 and 24 times, with decreases in MFLOPS per MHz, compared with earlier Pentiums.
With alternative CPU technology, the per MHz ratio improved with a single core of a 1820 MHz 2007 Core 2 processor obtaining 413, 998 and 374 MFLOPS or Cray 1 improvements of 23, 35, 37 and 62 times.
The 2010 Core i7 range produced an improvement in MFLOPS per MHz, with the 3900 MHz 2013 model obtaining 1108, 2684 and 716 MFLOPS and comparisons 49, 93, 99 and 119 times.
The 2021 laptop with a Core i5 1135G7 CPU provided further gains with a higher MFLOPS per MHz rating for Livermore Loops and Linpack but not much with Whetstone. MFLOPS identified were 1387, 3541 and 802, and Cray 1 comparisons of 117, 131 and 134 times.
These results are from running optimised versions of the original Windows Classic Benchmarks livecont.exe, linpcont.exe and whetcont.exe, available in downloadable benchnt.zip.
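LLLoops columns are Livermore Loops MFLOPS.

| CPU | MHz | LLLoops Max | LLLoops Gmean | LLLoops Min | Linpack MFLOPS | Whets MWIPS | Whets MFLOPS | Device Year | LLLoops Gmean MFLOPS per MHz |
|---|---|---|---|---|---|---|---|---|---|
| Cray 1 | 80 | 82.1 | 11.9 | 1.2 | 27 | 16.2 | 6.0 | 1978 | 0.15 |
| Windows PCs | | | | | | | | | |
| AMD 80386 | 40 | 1.2 | 0.6 | 0.2 | 0.5 | 5.7 | 0.8 | 1991 | 0.02 |
| 80486 DX2 | 66 | 4.9 | 2.7 | 0.7 | 2.6 | 15 | 3.3 | 1992 | 0.04 |
| Pentium | 75 | 24 | 7.7 | 1.3 | 7.6 | 48 | 11 | 1994 | 0.10 |
| Pentium | 100 | 34 | 12 | 2.1 | 12 | 66 | 16 | 1994 | 0.12 |
| Pentium | 200 | 66 | 22 | 3.8 | | 132 | 31 | 1996 | 0.11 |
| AMD K6 | 200 | 68 | 22 | 2.7 | 23 | 124 | 26 | 1997 | 0.11 |
| Pentium Pro | 200 | 121 | 34 | 3.6 | 49 | 161 | 41 | 1995 | 0.17 |
| Pentium II | 300 | 177 | 51 | 5.5 | 48 | 245 | 61 | 1997 | 0.17 |
| AMD K62 | 500 | 172 | 55 | 6.0 | 46 | 309 | 67 | 1999 | 0.11 |
| Pentium III | 450 | 267 | 77 | 8.3 | 62 | 368 | 92 | 1999 | 0.17 |
| Pentium 4 | 1700 | 1043 | 187 | 19 | 382 | 603 | 146 | 2002 | 0.11 |
| Athlon Tbird | 1000 | 1124 | 201 | 23 | 373 | 769 | 161 | 2000 | 0.20 |
| Core 2 | 1830 | 1650 | 413 | 40 | 998 | 1557 | 374 | 2007 | 0.23 |
| Core i5 | 2300 | 2326 | 438 | 35 | 1065 | 1813 | 428 | 2009 | 0.19 |
| Athlon 64 | 2150 | 2484 | 447 | 48 | 812 | 1720 | 355 | 2005 | 0.21 |
| Phenom II | 3000 | 3894 | 644 | 64 | 1413 | 2145 | 424 | 2009 | 0.21 |
| Core i7 930 | 3066 | 2751 | 732 | 68 | 1765 | 2496 | 576 | 2010 | 0.24 |
| Core i7 4820K | 3900 | 5508 | 1108 | 88 | 2680 | 3114 | 716 | 2013 | 0.28 |
| Core i5 1135G7 | 4150 | 7505 | 1387 | 92 | 3541 | 3293 | 802 | 2021 | 0.33 |
| Ratios | | | | | | | | | |
| Pentium/Cray 1 | 1.3 | 0.4 | 1.0 | 1.8 | 0.4 | 4.1 | 2.6 | | |
| i5/Cray 1 | 52 | 91 | 117 | 77 | 131 | 203 | 134 | | |
| i5/i7 | 1.1 | 1.4 | 1.3 | 1.1 | 1.3 | 1.1 | 1.1 | | |

Main columns: MHz, LLLoops Gmean, Linpack MFLOPS and Whets MFLOPS.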
SIMD Windows and Linux PCs
Following are results from running the benchmarks compiled with SSE, AVX and AVX-512 SIMD options. These employ 128, 256 or 512 bit vector registers simultaneously operating on 4, 8 or 16 single precision (SP) and 2, 4 or 8 double precision (DP) numbers, historically stated as producing maximum SP performance of 4, 8 or 16 MFLOPS per MHz and half those for DP. With 100% fused multiply and add (FMA) type operations these maximum expectations would be doubled. FMA was only available on the Core i5 laptop tested here. It should be noted that using fused operations can result in slightly different accuracy in computed results. The benchmarks report these as errors. See Error Reports. Similar variations were encountered, in the past, using different versions of the compilers.
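The stated maximum MFLOPS per MHz figures follow from the register width divided by the word size, assuming one result per lane per clock. A short Python sketch of that arithmetic (function names are assumptions, and real CPUs can differ from this simple rule):

```python
def lanes(register_bits, element_bits):
    """Number of values that fit in one vector register."""
    return register_bits // element_bits

def nominal_flops_per_cycle(register_bits, element_bits, fma=False):
    """One result per lane per clock, doubled if every operation is a fused multiply and add."""
    per_cycle = lanes(register_bits, element_bits)
    return per_cycle * 2 if fma else per_cycle

if __name__ == "__main__":
    for name, bits in (("SSE", 128), ("AVX", 256), ("AVX-512", 512)):
        sp = nominal_flops_per_cycle(bits, 32)
        dp = nominal_flops_per_cycle(bits, 64)
        print(f"{name}: SP {sp}, DP {dp} MFLOPS per MHz; "
              f"with 100% FMA SP {sp * 2}, DP {dp * 2}")
```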
Windows benchmarks, used in this area, were lloops64.exe, linpack64.exe and whetsSSE.exe. These and source code files are included in Windows-Benchmarks.zip. Compared with the earlier results, performance increased to achieve Cray 1 MFLOPS gains of 238, 190 and 182 times. For this area, double precision Whetstone results are also shown to run at the same speed as the single precision version.
I have had difficulties in using the latest C compilers for Windows, but a new bootable flash drive for Ubuntu 20.04 provided the compiler, enabling more advanced options to be used under Linux. The new benchmarks were initially compiled on an older PC as it did not seem possible to boot the latest flash drive on my new Core i5 based laptop. For the latter, I installed WSL (Windows Subsystem for Linux) in order to compile and run the programs.
Linux - The first compilations under Linux were slightly faster than those from Windows. Those used here were compiled on the i5 laptop using the latest gcc 9.3.0 compiler, under Ubuntu. Disassembly code was examined to show that SSE, AVX and AVX-512 instructions were being used, as appropriate. This cannot be guaranteed by relying on compile options. These benchmarks can be downloaded in Linux-Benchmarks.tar.xz. The first Linux results, using the AVX SIMD instructions, increased the three i5 Cray 1 gains to 300, 259 and 179 times. AVX-512 hardware was only available on the Core i5 CPU, providing the three MFLOPS gains of 359, 337 and 226 times.
The table provides MFLOPS per MHz calculations for Livermore Loops average and maximum results. A major surprise is that the latter for SSE and AVX, of 3.56 and 4.77 were higher than recognised maximum double precision ratios, without FMA, of 2.0 and 4.0. This also applied for SSE for the Core i7 at 3.05. The AVX-512 FMA 47692 MFLOPS ratio of 11.49 suggests significant FMA was being used. See also Faster Than Expected below.
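The ratios discussed above can be checked with a few lines of Python, using the Livermore Loops maximum MFLOPS and CPU MHz values from the table. The nominal limits are the double precision figures without FMA stated in the text (8.0 for AVX-512 follows the same rule); the data layout and function names are assumptions made for this sketch.

```python
# (label, LLLoops maximum MFLOPS, CPU MHz, nominal DP MFLOPS per MHz without FMA)
RESULTS = [
    ("Core i7 SSE", 11881, 3900, 2.0),
    ("Core i7 AVX", 12878, 3900, 4.0),
    ("Core i5 SSE", 14786, 4150, 2.0),
    ("Core i5 AVX", 19794, 4150, 4.0),
    ("Core i5 AVX-512 FMA", 47692, 4150, 8.0),
]

def mflops_per_mhz(mflops, mhz):
    return mflops / mhz

def exceeds_nominal(mflops, mhz, nominal):
    """True when the measured ratio is above the nominal non-FMA limit."""
    return mflops_per_mhz(mflops, mhz) > nominal

if __name__ == "__main__":
    for label, mflops, mhz, nominal in RESULTS:
        ratio = mflops_per_mhz(mflops, mhz)
        flag = "above" if exceeds_nominal(mflops, mhz, nominal) else "within"
        print(f"{label}: {ratio:.2f} MFLOPS per MHz, {flag} nominal {nominal}")
```

The Core i7 AVX result of 3.30 stays within its nominal limit of 4.0, while the other four entries are above theirs.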
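LLLoops columns are Livermore Loops MFLOPS. The two per MHz columns are for the double precision (DP) Livermore Loops results.

| CPU | MHz | LLLoops Max | LLLoops Gmean | LLLoops Min | Linpack MFLOPS | Whets MWIPS | Whets MFLOPS | Device Year | LLLoops Gmean MFLOPS per MHz | LLLoops Max MFLOPS per MHz |
|---|---|---|---|---|---|---|---|---|---|---|
| Cray 1 | 80 | 82.1 | 11.9 | 1.2 | 27 | 16.2 | 6.0 | 1978 | 0.15 | 1.03 |
| Windows PCs Earlier SSE Compiler | | | | | | | | | | |
| Core i7 4820K | 3900 | 6145 | 2037 | 327 | 3601 | 6385 | 1081 | 2013 | 0.52 | 1.58 |
| Core i5 1135G7 | 4150 | 8313 | 2828 | 386 | 5132 | 7466 | 1094 | 2021 | 0.68 | 2.00 |
| Core i5 DP | 4150 | | | | | 7256 | 1098 | | | |
| i5/Cray 1 | 52 | 101 | 238 | 321 | 190 | 461 | 182 | | | |
| i5/i7 | 1.1 | 1.4 | 1.4 | 1.2 | 1.4 | 1.2 | 1.0 | | | |
| Linux PCs SSE New Compiler | | | | | | | | | | |
| Core i7 4820K | 3900 | 11881 | 2578 | 569 | 5306 | 6007 | 1182 | 2013 | 0.66 | 3.05 |
| Core i5 1135G7 | 4150 | 14786 | 3364 | 575 | 7322 | 6586 | 1052 | 2021 | 0.81 | 3.56 |
| i5/Cray 1 | 52 | 180 | 283 | 479 | 271 | 407 | 175 | | | |
| i5/i7 | 1.1 | 1.2 | 1.3 | 1.0 | 1.4 | 1.1 | 0.9 | | | |
| Linux PCs AVX New Compiler | | | | | | | | | | |
| Core i7 4820K | 3900 | 12878 | 2615 | 597 | 5098 | 5887 | 1174 | 2013 | 0.67 | 3.30 |
| Core i5 1135G7 | 4150 | 19794 | 3568 | 943 | 6998 | 6477 | 1077 | 2021 | 0.86 | 4.77 |
| Core i5 DP | | | | | | 6861 | 1076 | | | |
| i5/Cray 1 | 52 | 241 | 300 | 786 | 259 | 400 | 179 | | | |
| i5/i7 | 1.1 | 1.5 | 1.4 | 1.6 | 1.4 | 1.1 | 0.9 | | | |
| SP/DP | | | | | | 0.9 | 1.0 | | | |
| i7 AVX/SSE | | 1.1 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | | | |
| i5 AVX/SSE | | 1.3 | 1.1 | 1.6 | 1.0 | 1.0 | 1.0 | | | |
| Linux AVX 512 FMA New Compiler | | | | | | | | | | |
| Core i5 1135G7 | 4150 | 47692 | 4273 | 965 | 9088 | 8193 | 1353 | 2021 | 1.03 | 11.49 |
| i5/Cray 1 | 52 | 581 | 359 | 805 | 337 | 506 | 226 | | | |

Main columns: MHz, LLLoops Gmean, Linpack MFLOPS and Whets MFLOPS.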
The following is a summary of the range of results on the 2021 Core i5, to show the impact of compilers that support newer technology. Note that trying to run the AVX-512 variety on earlier CPUs, without this option, results in a program failure report. The No SSE results are from the earlier table.
The i5 CPU MHz is 52 times that of the Cray 1, compared with over 300 times for Livermore Loops and Linpack benchmarks using AVX-512 functions and more than 200 times for Whetstone. Later are multithreading results for the latter, and for a vector version, to highlight the benefits of using more advanced facilities.
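| Compilation | LLOOPS MFLOPS | Linpack MFLOPS | Whets MFLOPS | LLOOPS i5/Cray 1 | Linpack i5/Cray 1 | Whets i5/Cray 1 |
|---|---|---|---|---|---|---|
| No SSE | 1387 | 3541 | 802 | 117 | 131 | 134 |
| SSE | 3364 | 7322 | 1052 | 283 | 271 | 175 |
| AVX | 3568 | 6998 | 1077 | 300 | 259 | 179 |
| AVX512 | 4273 | 9088 | 1353 | 359 | 337 | 226 |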
Vector Whetstone Benchmark
Below are details of supercomputer Whetstone Scalar and Vector benchmark results included in my ResearchGate Whetstone Benchmark History and Results report. Details of the vector program version are included above. As far as I remember, all these results are from systems using a single scalar CPU, possibly with more than one vector pipeline. Cray was the first manufacturer to produce systems with multiple scalar CPUs but it is not clear if any of the others followed this line in the timescale considered. From the details here, both Cray Y-MP MHz clock speed and scalar MFLOPS are indicated as around twice as fast as Cray 1, with vector MFLOPS four times faster, the system having two vector units. This benchmark is included in Linux-Benchmarks.tar.xz and Windows-Benchmarks.zip.
Best results, from the next table, for Core i5 and Raspberry Pi 400 are provided, to demonstrate their superiority over 1991 supercomputers. On top of this, both have multiple cores, with the potential of four times higher throughput or raw performance. See MP Whetstone results and those for MP MFLOPS.
| System | MHz | Scalar MWIPS | Scalar MFLOPS | Vector MWIPS | Vector MFLOPS | Vector/Scalar MFLOPS | Date |
|---|---|---|---|---|---|---|---|
| Cray 1 | 80 | 16.2 | 5.9 | 98 | 47 | 8.0 | 1978 |
| CDC Cyber 205 | 50 | 11.9 | 4.9 | 161 | 57 | 11.7 | 1981 |
| Cray XMP1 | 118 | 30.3 | 11.0 | 313 | 151 | 13.7 | 1982 |
| Cray 2/1 | 244 | 25.8 | N/A | 425 | N/A | | 1984 |
| Amdahl VP 500 # | 143 | 21.7 | 7.5 | 250 | 103 | 13.8 | 1984 |
| Amdahl VP 1100 # | 143 | 21.7 | 7.5 | 374 | 146 | 19.5 | 1984 |
| Amdahl VP 1200 # | 143 | 21.7 | 7.5 | 581 | 264 | 35.3 | 1984 |
| IBM 3090-150 VP | 54 | 12.1 | 4.9 | 60 | 17 | 3.6 | 1986 |
| (CDC) ETA 10E | 95 | 15.7 | 6.5 | 335 | 124 | 19.2 | 1987 |
| Cray YMP1 | 154 | 31.0 | 12.0 | 449 | 195 | 16.3 | 1987 |
| Fujitsu VP-2400/4 | 312 | 71.7 | 25.4 | 1828 | 794 | 31.3 | 1991 |
| NEC SX-3/11 | 345 | 42.9 | 17.0 | 1106 | 441 | 25.9 | 1991 |
| NEC SX-3/12 | 345 | 42.9 | 17.0 | 1667 | 753 | 44.3 | 1991 |
| Core i5 AVX512 SP | 4150 | 7780 | 1353 | 21039 | 28303 | 20.9 | 2021 |
| Core i5 AVX512 DP | 4150 | 8193 | 1353 | 21464 | 20346 | 15.0 | 2021 |
| Pi 400 SP | 1800 | 2505 | 573 | 3755 | 2131 | 3.7 | 2020 |
| Pi 400 DP | 1800 | 2684 | 575 | 3407 | 1184 | 2.1 | 2020 |

\# Fujitsu Systems
The following table includes all three MFLOPS measurements, to identify the maximum, as the second test sometimes falls behind. Single and double precision results are provided, as either could be valid, depending on numeric precision requirements.
The fastest Whetstone floating point code is not well suited to fused multiply and add operations, with one multiply associated with four additions or subtractions. The maximum Core i5 speed of 75.1 GFLOPS is quite impressive. Average i5 MFLOPS gains over the Cray 1 were 602 and 433 times, for single then double precision calculations. Note that some SP SSE MFLOPS per MHz were again greater than 4.0 and AVX above 8.0, with half these values for DP. The Raspberry Pi 400 vector performance was not that good but, as shown above, somewhat faster than the scalar speed.
| System | Mode | MHz | MWIPS | MFLOPS 1 | MFLOPS 2 | MFLOPS 3 | Average MFLOPS | Average MFLOPS per MHz | Maximum MFLOPS per MHz | Average MFLOPS xCray 1 |
|---|---|---|---|---|---|---|---|---|---|---|
| Windows SSE | | | | | | | | | | |
| Phenom II | 64b SP | 3000 | 4869 | 4429 | 3067 | 751 | 1593 | 0.5 | 1.5 | 34 |
| Phenom II | 64b DP | 3000 | 4897 | 2418 | 1722 | 751 | 1290 | 0.4 | 0.8 | 27 |
| Phenom II | 32b SP | 3000 | 4624 | 1798 | 1584 | 701 | 1148 | 0.4 | 0.6 | 24 |
| Core i7 4820K | 64b SP | 3900 | 7256 | 14233 | 12655 | 958 | 2513 | 0.6 | 3.6 | 53 |
| Core i7 4820K | 64b DP | 3900 | 7299 | 7416 | 7019 | 953 | 2261 | 0.6 | 1.9 | 48 |
| Core i7 4820K | 32b SP | 3900 | 10494 | 10362 | 9748 | 9468 | 9846 | 2.5 | 2.7 | 209 |
| Core i5 1135G7 | 64b SP | 4150 | 8435 | 23709 | 21246 | 1043 | 2862 | 0.7 | 5.7 | 61 |
| Core i5 1135G7 | 64b DP | 4150 | 8621 | 12375 | 11475 | 1041 | 2659 | 0.6 | 3.0 | 57 |
| Core i5 1135G7 | 32b SP | 4150 | 13387 | 18221 | 17254 | 13739 | 16162 | 3.9 | 4.4 | 344 |
| Linux | | | | | | | | | | |
| Core i7 4820K | Op3 SP | 3900 | 12012 | 12896 | 6248 | 17131 | 10136 | 2.6 | 4.4 | 216 |
| Core i7 4820K | AVX SP | 3900 | 11924 | 20394 | 7124 | 23551 | 12938 | 3.3 | 6.0 | 275 |
| Core i7 4820K | Op3 DP | 3900 | 11383 | 6259 | 4601 | 8711 | 6099 | 1.6 | 2.2 | 130 |
| Core i7 4820K | AVX DP | 3900 | 11526 | 10509 | 5789 | 11950 | 8533 | 2.2 | 3.1 | 182 |
| Core i5 1135G7 | Op3 SP | 4150 | 20870 | 21024 | 10721 | 28800 | 17088 | 4.1 | 6.9 | 364 |
| Core i5 1135G7 | AVX SP | 4150 | 20294 | 37170 | 12353 | 39126 | 22487 | 5.4 | 9.4 | 478 |
| Core i5 1135G7 | A512 SP | 4150 | 21039 | 62592 | 13037 | 75094 | 28303 | 6.8 | 18.1 | 602 |
| Core i5 1135G7 | Op3 DP | 4150 | 20045 | 10884 | 8035 | 14575 | 10528 | 2.5 | 3.5 | 224 |
| Core i5 1135G7 | AVX DP | 4150 | 20526 | 19270 | 10311 | 20360 | 15152 | 3.7 | 4.9 | 322 |
| Core i5 1135G7 | A512 DP | 4150 | 21464 | 33188 | 11504 | 32907 | 20346 | 4.9 | 8.0 | 433 |
| Raspberry Pi 400 | SP | 1800 | 3755 | 2413 | 1683 | 2506 | 2131 | 1.2 | 1.4 | 45 |
| Raspberry Pi 400 | DP | 1800 | 3407 | 1216 | 1151 | 1186 | 1184 | 0.7 | 0.7 | 25 |
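For most rows in this table, the average MFLOPS figure is consistent with a harmonic mean of the three MFLOPS measurements, and the per MHz columns follow by dividing by the clock speed. This is an inference from the numbers, not a statement of how the benchmark computes them, and not every row fits exactly. A minimal Python sketch, with illustrative function names:

```python
def harmonic_mean(values):
    """Harmonic mean: the count divided by the sum of reciprocals."""
    return len(values) / sum(1.0 / value for value in values)


def per_mhz(mflops, mhz):
    """MFLOPS achieved per MHz of CPU clock."""
    return mflops / mhz


# Core i7 4820K, Linux Op3 SP row: three MFLOPS measurements at 3900 MHz.
i7_op3_sp = [12896, 6248, 17131]
average = harmonic_mean(i7_op3_sp)
print(f"Average MFLOPS  {average:.0f}")
print(f"Average per MHz {per_mhz(average, 3900):.1f}")
print(f"Maximum per MHz {per_mhz(max(i7_op3_sp), 3900):.1f}")
```

The harmonic mean is dominated by the slowest measurement, which is why rows with one low result, such as the 64 bit Windows versions, show a low average despite a high maximum.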
MP Whetstone Benchmark
Previous results compared Cray 1 performance with single CPU cores on the later systems. Here we consider the possible implications of using multiple cores, with this benchmark effectively representing 1, 2, 4 and 8 users concurrently executing the same application, but using different data.
Details shown are overall MWIPS ratings, the three MFLOPS measurements, overall harmonic mean MFLOPS, recorded running times, MFLOPS performance gains over the Cray 1 and MFLOPS per MHz ratios for single core activity. Note nominal running time varies due to rough calibration of the number of passes to use. The benchmark is also included in Linux-Benchmarks.tar.xz and Windows-Benchmarks.zip .
Phenom, Windows 7 - This demonstrates almost perfect speed gains using 1 to 2 and 2 to 4 cores, with no further increase using 8 threads.
Core i7 Desktop - This can use 4 cores or 8 independent threads at the same time. This application appeared to demonstrate near best case performance gains using 8 threads.
Core i5 Laptop - Performance Monitor indicated that this ran at around 4150 MHz using 1 and 2 threads, but reduced to about 3800 MHz for 4 and 8 threads.
Windows vs Linux - Average MFLOPS performance was quite similar, on both the i7 and i5 PCs, at the lower level of optimisation shown here.
Single vs Double Precision - Results indicated similar performance, as expected from scalar operation.
PC Performance Gains - Some of the Core i7 speeds were faster than on the i5. For the latter, eight thread Cray 1 MFLOPS gains were 1521 times.
Android Phone - The Kryo 570 CPU has out-of-order execution, which may be responsible for the highest MFLOPS per MHz ratio of 0.42. But the big/LITTLE CPU arrangement, of 2 fast and 6 slow cores, led to 8 core performance being only 5 times faster than that for 1 core. Still, the Cray 1 gain was 757 times.
Raspberry Pi 400 - As might be expected, this quad core system produced the same elapsed time using 1, 2 and 4 threads, with a little extra throughput using 8 threads. Maximum Cray 1 gain was 400 times.
| System | Threads | MWIPS | MFLOPS 1 | MFLOPS 2 | MFLOPS 3 | Average MFLOPS | Secs | Average MFLOPS xCray 1 | Average MFLOPS Gain | Average MFLOPS Per MHz |
|---|---|---|---|---|---|---|---|---|---|---|
| Desktop Win 7, Phenom II, 4 core, 3000 MHz | 1 | 4086 | 817 | 817 | 752 | 794 | 5.0 | 132 | 1.0 | 0.26 |
| | 2 | 8149 | 1635 | 1616 | 1501 | 1582 | 5.0 | 264 | 2.0 | |
| | 4 | 16199 | 3261 | 3234 | 2968 | 3149 | 5.1 | 525 | 4.0 | |
| | 8 | 16602 | 3428 | 3461 | 3056 | 3304 | 10.1 | 551 | 4.2 | |
| Desktop Win 10, Core i7 4820K, 4 Core 8 Thread, 3900 MHz | 1 | 6169 | 1236 | 1236 | 856 | 1077 | 4.5 | 179 | 1.0 | 0.28 |
| | 2 | 13106 | 2601 | 2604 | 1910 | 2322 | 4.2 | 387 | 2.2 | |
| | 4 | 25343 | 5181 | 5197 | 3723 | 4587 | 4.5 | 764 | 4.3 | |
| | 8 | 46579 | 10310 | 10263 | 7403 | 9104 | 5.0 | 1517 | 8.5 | |
| Laptop Win 10, Core i5 1135G7, 4 Core 8 Thread, 4150 MHz or less | 1 | 7555 | 1195 | 1216 | 1046 | 1147 | 4.9 | 191 | 1.0 | 0.28 |
| | 2 | 15048 | 2385 | 2424 | 2083 | 2287 | 5.0 | 381 | 2.0 | |
| | 4 | 27290 | 4339 | 4407 | 3787 | 4158 | 5.6 | 693 | 3.6 | |
| | 8 | 53037 | 8619 | 8773 | 7538 | 8272 | 5.9 | 1379 | 7.2 | |
| Linux | | | | | | | | | | |
| Desktop SP, Core i7 4820K, 4 Core 8 Thread, 3900 MHz | 1 | 6157 | 1189 | 1146 | 931 | 1076 | 4.7 | 179 | 1.0 | 0.28 |
| | 2 | 12641 | 2529 | 2608 | 1931 | 2314 | 4.6 | 386 | 2.1 | |
| | 4 | 25490 | 5204 | 5213 | 3900 | 4685 | 4.6 | 781 | 4.4 | |
| | 8 | 43907 | 10217 | 10440 | 7714 | 9279 | 5.7 | 1547 | 8.6 | |
| Desktop DP, Core i7 4820K, 4 Core 8 Thread, 3900 MHz | 1 | 6500 | 1235 | 1252 | 972 | 1138 | 3.9 | 190 | 1.0 | 0.29 |
| | 2 | 13098 | 2542 | 2636 | 1938 | 2328 | 3.9 | 388 | 2.0 | |
| | 4 | 26298 | 5105 | 5273 | 3906 | 4676 | 3.9 | 779 | 4.1 | |
| | 8 | 44758 | 10268 | 10435 | 7755 | 9312 | 5.2 | 1552 | 8.2 | |
| Laptop SP, Core i5 1135G7, 4 Core 8 Thread, 4150 MHz or less | 1 | 7640 | 1140 | 1199 | 1015 | 1113 | 5.0 | 185 | 1.0 | 0.27 |
| | 2 | 14662 | 2347 | 2262 | 1997 | 2192 | 5.4 | 365 | 2.0 | |
| | 4 | 26754 | 4320 | 4387 | 3752 | 4133 | 6.1 | 689 | 3.7 | |
| | 8 | 46016 | 7885 | 8264 | 6701 | 7556 | 7.5 | 1259 | 6.8 | |
| Laptop SP AVX512, Core i5 1135G7, 4 Core 8 Thread, 4150 MHz or less | 1 | 8432 | 1281 | 1280 | 1248 | 1269 | 5.0 | 212 | 1.0 | 0.31 |
| | 2 | 16728 | 2542 | 2548 | 2471 | 2520 | 5.0 | 420 | 2.0 | |
| | 4 | 29816 | 4625 | 4617 | 4523 | 4588 | 6.0 | 765 | 3.6 | |
| | 8 | 54985 | 9203 | 9188 | 8994 | 9127 | 6.6 | 1521 | 7.2 | |
| Laptop DP AVX512, Core i5 1135G7, 4 Core 8 Thread, 4150 MHz or less | 1 | 8748 | 1278 | 1278 | 1248 | 1268 | 4.9 | 211 | 1.0 | 0.31 |
| | 2 | 17372 | 2542 | 2542 | 2481 | 2521 | 5.0 | 420 | 2.0 | |
| | 4 | 31459 | 4622 | 4622 | 4514 | 4585 | 5.5 | 764 | 3.6 | |
| | 8 | 57024 | 9187 | 9210 | 8985 | 9126 | 6.0 | 1521 | 7.2 | |
| Android Phone, Kryo 570, 2 x 2200 MHz + 6 x 1800 MHz | 1 | 4327 | 1010 | 984 | 782 | 913 | 4.6 | 152 | 1.0 | 0.42 |
| | 2 | 8782 | 1850 | 2126 | 1604 | 1836 | 4.5 | 306 | 2.0 | |
| | 4 | 13969 | 3189 | 3373 | 2641 | 3034 | 6.9 | 506 | 3.3 | |
| | 8 | 21039 | 4535 | 4985 | 4171 | 4540 | 7.9 | 757 | 5.0 | |
| Raspberry Pi 400, 4 x Cortex A72, 1800 MHz | 1 | 2266 | 644 | 645 | 376 | 520 | 5.0 | 87 | 1.0 | 0.29 |
| | 2 | 4533 | 1285 | 1284 | 751 | 1038 | 5.0 | 173 | 2.0 | |
| | 4 | 9065 | 2562 | 2498 | 1505 | 2062 | 5.0 | 344 | 4.0 | |
| | 8 | 9611 | 3284 | 3375 | 1543 | 2402 | 10.1 | 400 | 4.6 | |
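The Gain and xCray 1 columns can be reproduced from the average MFLOPS values, dividing by the single thread result and by the Cray 1 scalar Whetstone figure of 6 MFLOPS respectively. The following Python sketch does this for two of the systems. The function and variable names are illustrative, not part of the benchmark.

```python
CRAY1_WHETSTONE_MFLOPS = 6.0  # Cray 1 scalar Whetstone MFLOPS used in this report


def mp_summary(average_mflops_by_threads):
    """Return {threads: (gain over 1 thread, gain over Cray 1)}."""
    single = average_mflops_by_threads[1]
    return {
        threads: (mflops / single, mflops / CRAY1_WHETSTONE_MFLOPS)
        for threads, mflops in average_mflops_by_threads.items()
    }


# Average MFLOPS by thread count, copied from the table above.
core_i5_sp_avx512 = {1: 1269, 2: 2520, 4: 4588, 8: 9127}
raspberry_pi_400 = {1: 520, 2: 1038, 4: 2062, 8: 2402}

for name, results in (("Core i5", core_i5_sp_avx512), ("Pi 400", raspberry_pi_400)):
    for threads, (mp_gain, cray_gain) in mp_summary(results).items():
        print(f"{name} {threads} threads: gain {mp_gain:.1f}, xCray 1 {cray_gain:.0f}")
```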
MP MFLOPS Linux - Intel Single Precision Results
The benchmark aims at producing maximum measured performance of floating point operation for comparison with the theoretically possible 160 MFLOPS on Cray 1. Here, a Linux benchmark is used, running SSE and AVX Intel SIMD instructions (in Linux-Benchmarks.tar.xz).
Calculations are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f, with 2, 8 or 32 operations per input data word. In each case, tests access 102400, 1024000 and 10240000 data words, covering caches and RAM. Up to 64 threads can be used, each using a dedicated segment of the data, the default being 8 threads. Data is checked for consistent values at the end.
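As an illustration of how the operation counts arise, the following Python sketch applies the calculation with a variable number of (add, multiply) terms joined by alternating subtraction and addition. It assumes that the 2 operation test is the single term (x[i] + a) * b and that the 32 operation test uses 11 terms. The names `ops_per_word` and `apply_kernel` are hypothetical, and the real benchmark is compiled C code, not Python.

```python
def ops_per_word(terms):
    """Floating point operations per data word for a given number of terms.

    Each term costs one add and one multiply, and joining the terms
    costs one further add or subtract per extra term.
    """
    return 3 * terms - 1


def apply_kernel(x, constants):
    """Update x in place: x[i] = (x[i]+a)*b - (x[i]+c)*d + (x[i]+e)*f ...

    constants is a list of (add, multiply) pairs, one pair per term.
    """
    for i, value in enumerate(x):
        add_c, mul_c = constants[0]
        total = (value + add_c) * mul_c
        for term, (add_c, mul_c) in enumerate(constants[1:], start=1):
            product = (value + add_c) * mul_c
            # Odd numbered terms are subtracted, even numbered terms added.
            total = total - product if term % 2 == 1 else total + product
        x[i] = total
    return x


print([ops_per_word(terms) for terms in (1, 3, 11)])
print(apply_kernel([1.0], [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]))
```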
Below are measured MFLOPS using 1, 2, 4 and 8 threads for the Core i7 and i5 computers, executing SSE and AVX instructions, plus AVX-512 on the i5. As for MP Whetstones, performance improvements, from doubling the number of threads (MP Gains), are shown to be non-linear for the Core i5 laptop.
Single core MFLOPS per MHz ratios are also shown. Without FMA instructions, maximum single precision expectations are 4 for SSE, 8 for AVX and 16 for AVX-512, rising to 32 for the latter where FMA is used. Double precision expectations are half these values.
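These expectations follow from the number of data words held in one SIMD register (128 bits for SSE, 256 for AVX, 512 for AVX-512), doubled where each instruction is a fused multiply and add. The sketch below assumes one vector instruction completing per clock cycle, which is a simplification, and the function name is illustrative.

```python
def expected_mflops_per_mhz(register_bits, element_bits, fma=False):
    """Expected floating point results per clock cycle.

    Assumes one vector instruction per cycle; an FMA instruction
    counts as two operations per data word.
    """
    lanes = register_bits // element_bits
    return lanes * (2 if fma else 1)


for name, bits in (("SSE", 128), ("AVX", 256), ("AVX-512", 512)):
    sp = expected_mflops_per_mhz(bits, 32)
    dp = expected_mflops_per_mhz(bits, 64)
    print(f"{name}: SP {sp}, DP {dp}")
print("AVX-512 with FMA: SP", expected_mflops_per_mhz(512, 32, fma=True))
```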
It can be seen that, for both i7 and i5, SSE and AVX MFLOPS/MHz ratios were higher than these. I have been unable to identify the reason for these levels of performance, without FMA type instructions being used. For further information see Faster Than Expected below.
AVX-512 MFLOPS per MHz was less than 32, one reason being that not all instructions were of the FMA variety, as shown in the code disassemblies below. These indicate that the highest expected speed achievable by the compiled code is just over 76% of the maximum with complete FMA instructions, or 24.4 MFLOPS/MHz, close to that obtained.
The performance adjustment is also shown to produce a likely reduction in Cray 1 maximum speed to 122 MFLOPS, executing these functions. The maximum Core i5 single precision speed of 325915 MFLOPS indicates a Cray 1 gain of 2671 times. Maximum double precision result, from the next page, was 160641 MFLOPS with a gain of 1317 times.
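The figures quoted in the last two paragraphs can be checked with a few lines of arithmetic. This Python sketch uses only numbers given in this report: 16 instructions for a full FMA version, 21 in the compiled AVX-512 code, a 32 MFLOPS/MHz ceiling and the 160 MFLOPS Cray 1 maximum. The variable names are illustrative.

```python
FULL_FMA_INSTRUCTIONS = 16   # minimum if every operation pair were fused
COMPILED_INSTRUCTIONS = 21   # arithmetic instructions reported for AVX-512 code

efficiency = FULL_FMA_INSTRUCTIONS / COMPILED_INSTRUCTIONS
expected_per_mhz = 32 * efficiency            # AVX-512 SP ceiling with FMA
cray1_adjusted = round(160 * efficiency)      # adjusted Cray 1 maximum MFLOPS

print(f"Efficiency       {efficiency * 100:.2f}%")
print(f"Expected per MHz {expected_per_mhz:.1f}")
print(f"Cray 1 adjusted  {cray1_adjusted} MFLOPS")
print(f"i5 SP gain       {325915 / cray1_adjusted:.0f} times")
print(f"i5 DP gain       {160641 / cray1_adjusted:.0f} times")
```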
Also, single precision results on the next page indicate a maximum 8 thread speed on the Raspberry Pi of 30150 MFLOPS, or a Cray 1 gain of 247 times, and on the Android phone of 35686 MFLOPS, or a gain of 293 times, both using SIMD at a level equivalent to Intel SSE.
Threads 1 2 4 8 1 2 4 8 1 2 4 8 KWDs Ops SSE SSE SSE SSE AVX AVX AVX AVX AVX512 AVX512 AVX512 AVX512 Core i7 3900 MHz MFLOPS 102 2 10106 22704 47224 54668 11379 27114 56982 63095 N/A N/A N/A N/A 1024 2 9801 19227 36849 42389 10542 20127 39567 45256 10240 2 5856 9342 10120 9951 6004 9400 10165 9936 102 8 24258 48818 91871 97077 36354 82307 169881 184765 1024 8 24356 49258 91911 96902 34820 67057 130960 161412 10240 8 19421 34454 39855 39777 22340 36088 40372 39578 102 32 23355 46711 88383 93448 45374 88045 171961 177649 1024 32 23284 46883 88776 93381 45459 91277 172443 178895 10240 32 23107 46102 85346 92767 43834 86697 152019 157381 Maximum 24356 49258 91911 97077 45459 91277 172443 184765 MP Gains 1.0 2.0 3.8 4.0 1.0 2.0 3.8 4.1 AVX/SSE 1.9 1.9 1.9 1.9 Max/MHz 6.2 11.7 Core i5 MFLOPS 102 2 24612 48845 46738 80544 29021 30791 86020 93812 37656 74288 72164 121973 1024 2 21362 42345 43579 79180 21656 44753 44415 93920 23333 46844 58968 122122 10240 2 7495 12295 13298 14067 7620 11160 13454 14020 9274 13455 13337 13995 102 8 33271 65364 71105 119460 64946 128515 153955 210177 71895 142743 142554 241880 1024 8 32614 65504 63763 118933 62120 127095 121959 210157 66304 134081 144756 239841 10240 8 22467 38871 50079 56166 24963 42384 53438 56122 30345 49693 54170 56226 102 32 33273 58673 69365 119426 64941 124972 133637 225265 94417 170909 324843 325915 1024 32 32997 39974 86194 119313 64304 125772 125365 224014 91558 185785 324870 324936 10240 32 32777 64727 82112 116115 61061 114491 127026 200120 77458 140903 182219 222231 Maximum 33273 65504 86194 119460 64946 128515 153955 225265 94417 185785 324870 325915 MP Gains 1.0 2.0 2.6 3.6 1.0 2.0 2.4 3.5 1.0 2.0 3.4 3.5 AVX/SSE 2.0 2.0 1.8 1.9 a512/AVX 1.5 1.4 2.1 1.4 i5/i7 1.4 1.3 0.9 1.2 1.4 1.4 0.9 1.2 MHz 4150 4150 3600 3600 4150 4150 3600 3600 4150 4150 3600 3600 Max/MHz 8.0 15.6 22.8 |
MP MFLOPS 2 - Intel DP, Android and Raspberry Pi SP
The first columns in this table provide Core i7 and Core i5 double precision MP MFLOPS results, using 1 and 8 threads. Calculations below show that performance at 32 operations per word, and in other high performing areas, was effectively at half single precision speed, as expected with SIMD. The lower ratios probably reflect half speed double precision calculations and overheads dealing with 64 bit numbers.
Single precision results are also included for the 2000 MHz Kryo 570 Android phone and 1800 MHz Raspberry Pi 400.
For these, SIMD level used is equivalent to Intel SSE.
Worse than for MP Whetstone, this time the Kryo 570, using 8 threads, was only three times faster than in the single thread test. During this test, a CPU monitoring app indicated that six cores were running at 1804 MHz, with two at 768 MHz.
Core i7 Core i5 Phone RPi Threads 1 8 1 8 1 8 1 8 1 8 1 8 1 8 Ops/word SSE SSE AVX AVX SSE SSE AVX AVX AVX512 AVX512 SP SP SP SP 102 2 4921 28537 5290 32337 12437 38391 14606 43320 18872 60955 6977 15998 4015 10169 1024 2 4820 21214 4772 19551 4978 29821 6351 32157 8120 35674 8034 14536 3865 9622 10240 2 2949 4923 2946 4950 3604 6562 3683 6728 4442 6514 2984 2442 447 585 102 8 12233 48924 17683 95178 16500 59285 32504 104046 35958 120212 1024 8 12074 48679 16145 78149 12762 54904 19300 92706 22226 105465 10240 8 9929 19774 10969 19845 10941 26897 12157 27045 14806 26544 102 32 11742 46894 22880 89459 16602 58258 32420 111461 47200 160641 12178 34803 7902 28978 1024 32 11697 46848 22667 88958 16314 59325 31215 107323 42515 151251 12139 35686 7860 30150 10240 32 11615 46395 21983 78687 16315 57399 30488 99303 38532 105812 12137 34050 7326 8537 Maximum 12233 48924 22880 95178 16602 59325 32504 111461 47200 160641 12178 35686 7902 30150 MP Gain 1.0 4.0 1.0 4.2 1.0 3.6 1.0 3.4 1.0 3.4 1.0 2.9 1.0 3.8 MF/MHZ 3.14 5.87 4.0 7.8 11.4 6.1 4.4 Double/Single Precision 102 2 0.49 0.52 0.46 0.51 0.51 0.48 0.50 0.46 0.50 0.50 1024 2 0.49 0.50 0.45 0.43 0.23 0.38 0.29 0.34 0.35 0.29 10240 2 0.50 0.49 0.49 0.50 0.48 0.47 0.48 0.48 0.48 0.47 102 8 0.50 0.50 0.49 0.52 0.50 0.50 0.50 0.50 0.50 0.50 1024 8 0.50 0.50 0.46 0.48 0.39 0.46 0.31 0.44 0.34 0.44 10240 8 0.51 0.50 0.49 0.50 0.49 0.48 0.49 0.48 0.49 0.47 102 32 0.50 0.50 0.50 0.50 0.50 0.49 0.50 0.49 0.50 0.49 1024 32 0.50 0.50 0.50 0.50 0.49 0.50 0.49 0.48 0.46 0.47 10240 32 0.50 0.50 0.50 0.50 0.50 0.49 0.50 0.50 0.50 0.48 |
MP Livermore Loops
Four copies of the Livermore Loops Benchmark were run at the same time on the i5 laptop, with a longer parameter for seconds per loop, each program running for around 15 minutes. Using all cores led to the usual reduction in CPU MHz, but there may have been more throttling to counteract heating effects. Single thread (Geomean) official average speed was 4273 MFLOPS, compared with a per thread average of 3375 here. However, the total throughput of 13500 MFLOPS indicates an increase over Cray 1 of 1134 times.
AVX-512 DP MFLOPS

| Thread | Maximum | Average | Geomean | Harmean | Minimum |
|---|---|---|---|---|---|
| 1 | 33413.3 | 5809.5 | 3430.8 | 2293.0 | 493.7 |
| 2 | 35648.5 | 5576.5 | 3275.7 | 2223.1 | 552.1 |
| 3 | 35422.7 | 5953.9 | 3449.2 | 2300.6 | 505.1 |
| 4 | 36895.5 | 5746.0 | 3344.4 | 2190.7 | 459.4 |
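The throughput figures quoted above come from adding the four Geomean results and comparing the total with the Cray 1 Livermore Loops average of 11.9 MFLOPS. A short Python sketch of that calculation, with illustrative variable names:

```python
CRAY1_LLOOPS_GMEAN = 11.9  # Cray 1 Livermore Loops average MFLOPS

# Geomean MFLOPS of the four concurrent copies, from the table above.
geomeans = [3430.8, 3275.7, 3449.2, 3344.4]

total = sum(geomeans)
per_thread = total / len(geomeans)
cray1_gain = total / CRAY1_LLOOPS_GMEAN

print(f"Total throughput   {total:.0f} MFLOPS")
print(f"Per thread average {per_thread:.0f} MFLOPS")
print(f"Gain over Cray 1   {cray1_gain:.0f} times")
```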
MP MFLOPS 4 to 64 Threads
As indicated earlier, the MP MFLOPS benchmark can handle up to 64 threads, using an execution command under Linux such as ./MPmflops64AVX512 Threads 64. For correct operation, the specified number must be 1, 2, 4, 8, 16, 32 or 64. Following are results on the i5 between 4 and 64 threads. These show that performance can be significantly improved using additional threads, the reason being that data is then allocated to a faster, lower level cache.
MFLOPS Using Number Of Threads

| 4 Byte Words | Ops/Word | Repeat Passes | 4 Threads | 8 Threads | 16 Threads | 32 Threads | 64 Threads |
|---|---|---|---|---|---|---|---|
| 102400 | 2 | 75000 | 72164 | 112210 | 155132 | 158133 | 153968 |
| 1024000 | 2 | 7500 | 58968 | 108429 | 119118 | 117709 | 122011 |
| 10240000 | 2 | 750 | 13337 | 13824 | 17251 | 60342 | 116964 |
| 102400 | 8 | 75000 | 142554 | 210116 | 253359 | 270576 | 275220 |
| 1024000 | 8 | 7500 | 144756 | 212406 | 233939 | 236110 | 242271 |
| 10240000 | 8 | 750 | 54170 | 54988 | 64520 | 174245 | 235583 |
| 102400 | 32 | 75000 | 324843 | 312508 | 316881 | 318233 | 327762 |
| 1024000 | 32 | 7500 | 324870 | 308995 | 310405 | 325996 | 327897 |
| 10240000 | 32 | 750 | 182219 | 204563 | 243408 | 301543 | 322605 |
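To see why more threads can move data into a faster cache, it helps to work out how much data each thread handles. The Python sketch below assumes the data words are divided equally between threads, which is my reading of "a dedicated segment of the data" and is not confirmed by the benchmark source. The names are illustrative, and actual cache sizes are not considered here.

```python
VALID_THREADS = (1, 2, 4, 8, 16, 32, 64)
BYTES_PER_WORD = 4  # single precision data words


def segment_kbytes(words, threads):
    """KBytes of data handled by each thread, assuming an equal split."""
    if threads not in VALID_THREADS:
        raise ValueError("threads must be 1, 2, 4, 8, 16, 32 or 64")
    return words * BYTES_PER_WORD / threads / 1024


for words in (102400, 1024000, 10240000):
    sizes = ", ".join(
        f"{threads}: {segment_kbytes(words, threads):.0f}" for threads in (4, 8, 16, 32, 64)
    )
    print(f"{words} words, KB per thread - {sizes}")
```

With 10240000 words, the segment size falls from 10000 KB per thread at 4 threads to 625 KB at 64 threads, in line with the large MFLOPS increases shown for that data size.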
Performance Summary
Following is a summary of most results, intended to show best case performance gains, over the Cray 1, for different classes of work. Considering the Core i5 details, the first four are for programs that only use (or are intended to use) a single CPU core, the main one for Cray 1 comparison being the Livermore Loops average. Linpack is the only one that provides a single measurement. Whetstones identify relative performance of scalar and vector processing. As for the MP benchmarks, vector single and double precision results are provided. The former can be used for comparison with those produced via the long word used by Cray, if the numeric accuracy is acceptable.
The MP benchmark results can be used to represent multiple users running the same program or a single program executing multiple threads, each handling a dedicated segment of shared data. Again for the Core i5, MP Whetstone MFLOPS were similar for double and single precision versions, with little opportunity for vectorisation. The simpler Whetstone calculations demonstrate the benefit of hyperthreading with the 4 core, 8 thread throughput being nearly seven times faster than the standalone run. On the other hand, MP MFLOPS suffered from the i5 running at a lower MHz when four cores were being used, leading to 8 thread performance being less than four times faster than via 1 thread. This benchmark identified the highest Cray 1 performance gains of over 2600 times for single precision calculations, but half of this at double precision.
On cost/performance grounds, the Raspberry Pi 400 was better than the Core i5 laptop, in some of the early cases, but worse on others, then fell far behind on benchmarks that could benefit from compilation using Intel Advanced Vector instructions. Compared with the Cray 1, MP performance gains of up to 400 times were recorded.
Just considering performance of the Android phone, the more advanced ARM CPU used provided some significant gains over the Raspberry Pi, but lost the advantage, due to the big/LITTLE architecture, on running the MP MFLOPS 8 thread test. Still, best Cray 1 performance gain was 757 times through using multiple cores.
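| Measure | Cray 1 | Core i5 AVX-512 | X Cray | Android Phone | X Cray | Raspberry Pi 400 | X Cray |
|---|---|---|---|---|---|---|---|
| CPU MHz 1 Thread | 80 | 4150 | 52 | 2000 | 25 | 1800 | 23 |
| CPU MHz 8 Thread | N/A | 3600 | | <1800 | | 1800 | |
| 1. Livermore Loops | | | | | | | |
| MFLOPS Max | 82.1 | 47692 | 581 | 1620 | 20 | 3353 | 41 |
| MFLOPS Average | 11.9 | 4273 | 359 | 1468 | 123 | 938 | 79 |
| 2. Linpack MFLOPS | 27 | 9088 | 337 | 1986 | 74 | 1337 | 50 |
| 3. Whetstone MFLOPS | 6 | 1353 | 226 | 905 | 151 | 573 | 96 |
| 4. Vector Whetstone | | | | | | | |
| MFLOPS DP Average | 47 | 20346 | 433 | | | 1184 | 25 |
| MFLOPS DP Maximum | | 32907 | | | | 1216 | |
| MFLOPS SP Average | | 28303 | 602 | | | 2131 | 45 |
| MFLOPS SP Maximum | | 75094 | | | | 2506 | |
| 5. MP Whetstone | | | | | | | |
| MFLOPS DP Average | 6 | 9126 | 1521 | | | | |
| MFLOPS DP Maximum | | 9210 | | | | | |
| MFLOPS SP Average | | 9127 | 1521 | 4540 | 757 | 2402 | 400 |
| MFLOPS SP Maximum | | 9203 | | 4985 | | 3284 | |
| 6. MP MFLOPS | | | | | | | |
| MFLOPS DP 1 Thread | 122 | 47200 | 387 | | | | |
| MFLOPS DP 8 Thread | | 160641 | 1317 | | | | |
| MFLOPS SP 1 Thread | | 94417 | 774 | 12178 | 100 | 7902 | 65 |
| MFLOPS SP 8 Thread | | 325915 | 2671 | 35686 | 293 | 30150 | 247 |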
More Advanced Hardware
Here, relative Cray 1 performance calculations, for Android devices and PCs, have been for mid range hardware. It is useful to consider apparently more powerful processors.
CPU MHz - In a given processing architecture, performance is usually proportional to CPU MHz. This was clear in earlier times, when Pentium, Celeron and Xeon processors had the same core processor. The above benchmarks were run on a Core i5 with maximum turbo speed of 4150 MHz and an ARM CPU at 2000 MHz. The latest 2022 processors appear to be rated at up to 5500 MHz for PCs and 3000 MHz for ARM based phones. These would affect the single core benchmarks but not excessively.
Multiple Cores - At least for the laptop and phone used here, full benefits of multiple cores were not apparent. The laptop switched to a lower MHz and the phone’s 8 core big/LITTLE processor maximum performance became not much better than the 4 core Raspberry Pi. Performance appears to be becoming even more unpredictable. The latest that I have seen is an Intel processor with 24 threads over 16 cores, 8 at up to 5.1 GHz and 8 at up to 3.8 GHz, then an ARM processor with 1 core at 3200 MHz, 3 at 2420 MHz and 4 at 1800 MHz.
More Advanced CPU Options - Some CPUs in the Core range have two 512-bit fused-multiply add (FMA) units that can, potentially, double SIMD performance of the right sort of application. Judging by the improvement in adopting a higher level of SIMD here and consideration of heating effects, I would not bet on it.
Disassembled Code
Disassembled code, compiled to use AVX-512 and AVX instructions, is listed below. The former includes vector fused multiply and add or subtract instructions. With AVX-512 there are 21 arithmetic vector instructions, and the expected 32 with AVX, the latter also applying to SSE code. The minimum number of instructions, if all were of the fused multiply and add type, is 16, reducing achievable speed to 16/21, or 76.19%, of the maximum. This would lead to the maximum Cray 1 MFLOPS for this code becoming 122, instead of 160.
AVX-512

    L22:
      vmovupd (%rax), %zmm0
      addq $64, %rax
      vaddpd %zmm0, %zmm28, %zmm31
      vaddpd %zmm0, %zmm30, %zmm1
      vmulpd %zmm27, %zmm31, %zmm31
      vfmsub132pd %zmm29, %zmm31, %zmm
      vaddpd %zmm0, %zmm26, %zmm31
      vfmadd231pd %zmm31, %zmm25, %zmm
      vaddpd %zmm24, %zmm0, %zmm31
      vfnmadd132pd %zmm23, %zmm1, %zmm3
      vaddpd %zmm22, %zmm0, %zmm1
      vfmadd231pd %zmm21, %zmm1, %zmm3
      vaddpd %zmm20, %zmm0, %zmm1
      vfnmadd132pd %zmm19, %zmm31, %zmm
      vaddpd %zmm18, %zmm0, %zmm31
      vfmadd231pd %zmm17, %zmm31, %zmm
      vaddpd %zmm16, %zmm0, %zmm31
      vfnmadd132pd %zmm15, %zmm1, %zmm3
      vaddpd %zmm14, %zmm0, %zmm1
      vfmadd231pd %zmm13, %zmm1, %zmm3
      vaddpd %zmm12, %zmm0, %zmm1
      vaddpd %zmm10, %zmm0, %zmm0
      vfnmadd132pd %zmm11, %zmm31, %zmm
      vfmadd132pd %zmm9, %zmm1, %zmm0
      vmovupd %zmm0, -64(%rax)
      cmpq %rax, %rcx
      jne .L22

AVX

    L60:
      vmovups (%rax), %xmm1
      vinsertf128 $0x1, 16(%rax), %ymm1, %ymm1
      addq $32, %rax
      vaddps -24(%rsp), %ymm1, %ymm0
      vmulps 8(%rsp), %ymm0, %ymm15
      vaddps 40(%rsp), %ymm1, %ymm0
      vmulps 72(%rsp), %ymm0, %ymm0
      vsubps %ymm0, %ymm15, %ymm0
      vaddps 104(%rsp), %ymm1, %ymm15
      vmulps 136(%rsp), %ymm15, %ymm15
      vaddps %ymm15, %ymm0, %ymm0
      vaddps 168(%rsp), %ymm1, %ymm15
      vmulps -56(%rsp), %ymm15, %ymm15
      vsubps %ymm15, %ymm0, %ymm0
      vaddps %ymm14, %ymm1, %ymm15
      vmulps -88(%rsp), %ymm15, %ymm15
      vaddps %ymm15, %ymm0, %ymm0
      vaddps %ymm13, %ymm1, %ymm15
      vmulps %ymm12, %ymm15, %ymm15
      vsubps %ymm15, %ymm0, %ymm0
      vaddps %ymm11, %ymm1, %ymm15
      vmulps %ymm10, %ymm15, %ymm15
      vaddps %ymm15, %ymm0, %ymm0
      vaddps %ymm9, %ymm1, %ymm15
      vmulps %ymm8, %ymm15, %ymm15
      vsubps %ymm15, %ymm0, %ymm0
      vaddps %ymm7, %ymm1, %ymm15
      vmulps %ymm6, %ymm15, %ymm15
      vaddps %ymm15, %ymm0, %ymm0
      vaddps %ymm5, %ymm1, %ymm15
      vaddps %ymm3, %ymm1, %ymm1
      vmulps %ymm4, %ymm15, %ymm15
      vsubps %ymm15, %ymm0, %ymm0
      vmulps %ymm2, %ymm1, %ymm15
      vaddps %ymm15, %ymm0, %ymm0
      vmovups %xmm0, -32(%rax)
      vextractf128 $0x1, %ymm0, -16(%rax)
      cmpq %rdx, %rax
      jne .L60
Benchmark Error Reports
The Livermore Loops benchmark displays details of results for each of the three sets of 24 calculations. These include the final numeric results, whose expected values are included in the program and can vary slightly, depending on the hardware and compiler options. The values under OK indicate accuracy in terms of the number of decimal places, double precision numbers being said to be accurate up to 16 decimal places, but possibly subject to rounding errors.
As indicated, there were differences in numeric results from the Core i5 laptop, with accuracy reducing from 15 or 16 decimal places to 12 or 13, using the AVX512 compile option. Apparently, there is only one rounding for fused operations, as opposed to one for each separate instruction.
Kernel Floating Pt ops No Passes E No Total Secs. MFLOPS Span Checksums OK Earlier Compilation 6 3 x 658 2 1.566566e+09 0.89 1751.62 64 4.375116344729986e+03 16 7 4 x 529 16 6.737344e+09 0.89 7529.02 995 6.104251075174761e+04 16 18 2 x 703 44 6.124536e+09 0.89 6867.09 100 1.015727037502299e+05 15 Log Program report - Numeric results were as expected AVX Compilation 6 3 x 814 2 1.937971e+09 1.00 1929.85 64 4.375116344729986e+03 16 7 4 x 616 16 7.845376e+09 1.00 7835.67 995 6.104251075174761e+04 16 18 2 x1711 44 1.490623e+10 1.0014869.06 100 1.015727037502299e+05 15 Log Program report - Numeric results were as expected AVX512 Compilation 6 3 x 757 2 1.802266e+09 1.00 1802.82 64 4.375116344743195e+03 12 7 4 x3738 16 4.760717e+10 1.0047691.47 995 6.104251075174966e+04 13 18 2 x2393 44 2.084782e+10 1.0020781.51 100 1.015727037502806e+05 12 Log Program report - Examples of different numeric results Test 6 result was 4.375116344743195e+03 expected 4.375116344729986e+03 Test 7 result was 6.104251075174966e+04 expected 6.104251075174761e+04 Test 18 result was 1.015727037502806e+05 expected 1.015727037502299e+05 |
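The effect of a single rounding for fused operations can be demonstrated without FMA hardware. The Python sketch below emulates a fused multiply and add by computing a*b + c exactly with rational arithmetic and rounding once, then compares it with the usual two step calculation. The input values are chosen for illustration and are not taken from the Livermore Loops code, and the function names are hypothetical.

```python
from fractions import Fraction


def separate_multiply_add(a, b, c):
    """Round after the multiply, then round again after the add."""
    return (a * b) + c


def fused_multiply_add(a, b, c):
    """Emulate a fused operation: exact a*b + c, rounded only once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# a*b is exactly 1 - 2**-60, which cannot be held in a double and rounds to 1.0.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

print("Separate instructions:", separate_multiply_add(a, b, c))
print("Fused (single rounding):", fused_multiply_add(a, b, c))
```

Neither answer is wrong in the sense of a hardware fault. The two methods round at different points, which is enough to change the trailing digits of a long running sumcheck.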
MP Linpack and Whetstone Benchmarks Error Reports
Similar sumcheck variations were recorded on running the Linpack and Whetstone benchmarks on the Core i5 based laptop. In both cases, as for the Livermore Loops example, the errors were not reported running on older hardware or from alternative compilations.
Linpack AVX-512 Linpack Double Precision Unrolled Benchmark n @ 100 Optimisation AVX512 64 Bit, Tue Dec 7 11:38:24 2021 Speed 5151.83 MFLOPS Variable norm. resid Non-standard result was 1.9 instead of 1.7 Variable resid Non-standard result was 8.46778499e-14 instead of 7.41628980e-14 Variable x[0]-1 Non-standard result was -1.11799459e-13 instead of -1.49880108e-14 Variable x[n-1]-1 Non-standard result was -9.60342916e-14 instead of -1.89848137e-14 Whetstone SSE Whetstone Double Precision SSE2 Benchmark Tue Jan 11 19:34:50 2022 Test 5 Non-standard result was 0.49902937281518372 instead of 0.49902937281518167 Log file result Loop content Result MFLOPS MOPS Seconds N5 sin,cos etc. 0.49902937281518372 281.276 2.089 |
Run Time Displays
The running times of the benchmarks used here are calibrated to run for a given noticeable time, with Windows and Linux versions displaying details on completion of individual test functions. This automatic adjustment has so far survived a more than 100 times increase in CPU MHz. The aim for Livermore Loops is currently 1 second for each of the 72 tests, with Linpack 1 second for each of the 10 calculations and Whetstone 10 seconds overall.
L.L.N.L. 'C' KERNELS: MFLOPS P.C. VERSION 4.0 Calculating outer loop overhead 1000 times 0.00 seconds 10000 times 0.00 seconds 100000 times 0.00 seconds 1000000 times 0.01 seconds 10000000 times 0.04 seconds 20000000 times 0.08 seconds 40000000 times 0.16 seconds 80000000 times 0.31 seconds Overhead for each loop 3.9288e-09 seconds Calibrating part 1 of 3 Loop count 4 0.00 seconds Loop count 16 0.00 seconds Loop count 64 0.00 seconds Loop count 256 0.01 seconds Loops 200 x 1 x Passes Kernel Floating Pt ops No Passes E No Total Secs. MFLOPS Span Checksums OK ------------ -- ------------- ----- ------- ---- ---------------------- -- 1 7 x1566 5 1.097296e+10 0.9811171.25 1001 5.114652693224671e+04 16 2 67 x 595 4 3.093524e+09 0.98 3164.40 101 1.539721811668385e+03 15 3 9 x 657 2 2.367565e+09 0.95 2494.82 1001 1.000742883066363e+01 15 4 14 x 728 2 2.446080e+09 0.96 2555.68 1001 5.999250595473891e-01 16 5 10 x 234 2 9.360000e+08 0.95 980.20 1001 4.548871642387267e+03 16 6 3 x 904 2 2.152243e+09 0.95 2276.20 64 4.375116344729986e+03 16 7 4 x 975 16 1.241760e+10 1.0312101.10 995 6.104251075174761e+04 16 8 10 x 385 36 5.488560e+09 0.95 5788.45 100 1.501268005625795e+05 15 9 36 x 536 17 6.626246e+09 0.96 6926.99 101 1.189443609974981e+05 16 10 34 x 456 9 2.818627e+09 0.95 2973.11 101 7.310369784325296e+04 16 11 11 x 565 1 1.243000e+09 0.95 1309.65 1001 3.342910972650109e+07 16 12 12 x1201 1 2.882400e+09 0.95 3030.87 1000 2.907141294167248e-05 16 13 36 x 177 7 5.709312e+08 0.95 600.71 64 1.202533961842805e+11 15 14 2 x 290 11 1.277276e+09 0.95 1347.14 1001 3.165553044000335e+09 15 15 1 x 660 33 2.178000e+09 0.96 2268.96 101 3.943816690352044e+04 15 16 25 x 768 10 2.035200e+09 0.94 2153.77 75 5.650760000000000e+05 16 17 35 x 368 9 2.341584e+09 0.96 2447.92 101 1.114641772902486e+03 16 18 2 x 733 44 6.385896e+09 0.97 6567.18 100 1.015727037502299e+05 15 19 39 x 215 6 1.016262e+09 0.95 1070.62 101 5.421816960147207e+02 16 20 1 x 187 26 9.724000e+08 0.95 1021.36 1000 
3.040644339351239e+07 16
21   1 x 302  2  7.625500e+09  0.95  8021.31  101  1.597308280710199e+08 15
22  11 x 356 17  1.344754e+09  0.95  1416.60  101  2.938604376566697e+02 16
23   8 x 223 11  1.942776e+09  0.95  2045.20  100  3.549900501563623e+04 16
24   5 x1553  1  1.553000e+09  0.95  1637.44 1001  5.000000000000000e+02 16

 Maximum   Rate 12101.10
 Average   Rate  3557.12
 Geometric Mean  2580.73
 Harmonic  Mean  1966.74
 Minimum   Rate   600.71
 Do Span    471

 Calibrating part 2 of 3

 Loop count      8   0.00 seconds
 Loop count     32   0.00 seconds
 Loop count    128   0.00 seconds

 Loops 200 x 2 x Passes

 Kernel       Floating Pt ops
 No  Passes E No     Total     Secs.   MFLOPS Span       Checksums        OK
------------ -- ------------- ----- -------- ---- ---------------------- --
 1  40 x1061  5  8.572880e+09  0.98  8769.29  101  5.253344778937972e+02 16
 2  40 x 495  4  3.072960e+09  1.01  3046.39  101  1.539721811668385e+03 15
 3  53 x 595  2  2.548028e+09  1.00  2536.39  101  1.009741436578952e+00 16
 4  70 x 949  2  3.188640e+09  1.00  3194.69  101  5.999250595473891e-01 16
 5  55 x 247  2  1.086800e+09  1.00  1082.99  101  4.589031939600982e+01 16
 6   7 x 760  2  2.042880e+09  0.98  2081.44   32  8.631675645333210e+01 16
 7  22 x 858 16  1.220145e+10  0.99 12378.97  101  6.345586315784055e+02 16
 8   6 x 338 36  5.782234e+09  1.00  5784.83  100  1.501268005625795e+05 15
 9  21 x 483 17  6.966212e+09  1.00  6934.93  101  1.189443609974981e+05 16
10  19 x 431  9  2.977520e+09  1.01  2952.22  101  7.310369784325296e+04 16
11  64 x 536  1  1.372160e+09  1.00  1366.60  101  3.433560407475758e+04 16
12  68 x 931  1  2.532320e+09  1.04  2443.43  100  7.127569130821465e-06 16
13  41 x 165  7  6.061440e+08  1.00   603.71   32  9.816387810944356e+10 15
14  10 x 373 11  1.657612e+09  1.01  1640.50  101  3.039983465145392e+07 15
15   1 x 348 33  2.296800e+09  1.00  2295.02  101  3.943816690352044e+04 15
16  27 x 748 10  2.261952e+09  1.01  2241.47   40  6.480410000000000e+05 16
17  20 x 340  9  2.472480e+09  1.01  2441.10  101  1.114641772902486e+03 16
18   1 x 753 44  6.560136e+09  0.99  6608.43  100  1.015727037502299e+05 15
19  23 x 192  6  1.070438e+09  1.02  1053.29  101  5.421816960147207e+02 16
20   8 x 125 26  1.040000e+09  1.01  1031.93  100  3.126205178815431e+04 16
21   1 x 324  2  8.100000e+09  1.00  8099.88   50  7.824524877232093e+07 16
22   7 x 295 17  1.418242e+09  1.00  1415.93  101  2.938604376566697e+02 16
23   5 x 188 11  2.047320e+09  1.00  2044.93  100  3.549900501563623e+04 16
24  31 x 881  1  1.092440e+09  1.00  1087.94  101  5.000000000000000e+01 16

 Maximum   Rate 12378.97
 Average   Rate  3464.01
 Geometric Mean  2544.88
 Harmonic  Mean  1951.83
 Minimum   Rate   603.71
 Do Span     90

 Calibrating part 3 of 3

 Loop count     32   0.00 seconds
 Loop count    128   0.00 seconds
 Loop count    512   0.00 seconds
 Loop count   2048   0.01 seconds

 Loops 200 x 8 x Passes

 Kernel       Floating Pt ops
 No  Passes E No     Total     Secs.   MFLOPS Span       Checksums        OK
------------ -- ------------- ----- -------- ---- ---------------------- --
 1  28 x1795  5  1.085616e+10  1.00 10866.23   27  3.855104502494961e+01 16
 2  46 x 748  4  2.422323e+09  1.00  2415.89   15  3.953296986903059e+01 16
 3  37 x1126  2  3.599597e+09  1.01  3575.44   27  2.699309089320672e-01 16
 4  38 x1471  2  2.683104e+09  1.00  2685.18   27  5.999250595473891e-01 16
 5  40 x 473  2  1.574144e+09  1.02  1546.96   27  3.182615248447483e+00 16
 6  21 x1047  2  1.688602e+09  1.01  1665.66    8  1.120309393467088e+00 15
 7  20 x1082 16  1.163366e+10  0.94 12311.54   21  2.845720217644024e+01 16
 8   9 x 427 36  5.755277e+09  1.00  5741.74   14  2.960543667875005e+03 15
 9  26 x 664 17  7.043712e+09  1.00  7015.77   15  2.623968460874250e+03 16
10  25 x 557  9  3.007800e+09  1.02  2959.56   15  1.651291227698265e+03 16
11  46 x1015  1  1.942304e+09  1.02  1901.09   27  6.551161335845770e+02 16
12  48 x1714  1  3.422515e+09  1.02  3359.57   26  1.943435981130448e-06 16
13  31 x 226  7  6.277376e+08  1.01   621.29    8  3.847124199949431e+10 15
14   8 x 490 11  1.862784e+09  1.01  1853.19   27  2.923540598672009e+06 15
15   1 x 639 33  2.361744e+09  1.00  2361.30   15  1.108997288134785e+03 16
16  14 x 974 10  2.399936e+09  1.00  2394.06   15  5.152160000000000e+05 16
17  26 x 513  9  2.881008e+09  1.00  2875.92   15  2.947368618589361e+01 16
18   2 x 647 44  5.921344e+09  1.00  5907.72   14  9.700646212337041e+02 16
19  28 x 273  6  1.100736e+09  1.02  1074.17   15  1.268230698051003e+01 15
20   7 x 145 26  1.097824e+09  1.02  1077.78   26  5.987713249475302e+02 16
21   1 x 212  2  8.480000e+09  1.01  8435.26   20  5.009945671204667e+07 16
22   8 x 421 17  1.374144e+09  1.00  1373.68   15  6.109968728263972e+00 16
23   7 x 356 11  2.850848e+09  1.00  2862.02   14  4.850340602749970e+02 16
24  23 x 952  1  9.108736e+08  1.00   910.11   27  1.300000000000000e+01 16

 Maximum   Rate 12311.54
 Average   Rate  3657.96
 Geometric Mean  2704.87
 Harmonic  Mean  2064.49
 Minimum   Rate   621.29
 Do Span     19

 Overall

 Part 1 weight 1
 Part 2 weight 2
 Part 3 weight 1

 Maximum   Rate 12378.97
 Average   Rate  3535.78
 Geometric Mean  2593.02
 Harmonic  Mean  1982.64
 Minimum   Rate   600.71
 Do Span    167

 Unrolled Double Precision Linpack Benchmark - PC Version in 'C/C++'

 Optimisation AVX 64 Bit

 norm resid      resid           machep         x[0]-1          x[n-1]-1
    1.7     7.41628980e-14  2.22044605e-16 -1.49880108e-14 -1.89848137e-14

 Times are reported for matrices of order 100
 1 pass times for array with leading dimension of 201

   dgefa    dgesl    total    Mflops     unit    ratio
 0.00016  0.00001  0.00017   4091.04   0.0005   0.0030

 Calculating matgen overhead
     10 times   0.00 seconds
    100 times   0.00 seconds
   1000 times   0.03 seconds
  10000 times   0.28 seconds
  20000 times   0.54 seconds
  40000 times   1.02 seconds
 Overhead for 1 matgen   0.00003 seconds

 Calculating matgen/dgefa passes for 1 seconds
     10 times   0.00 seconds
    100 times   0.02 seconds
   1000 times   0.17 seconds
   2000 times   0.32 seconds
   4000 times   0.64 seconds
   8000 times   1.27 seconds
 Passes used 6311

 Times for array with leading dimension of 201

   dgefa    dgesl    total    Mflops     unit    ratio
 0.00013  0.00000  0.00014   5049.22   0.0004   0.0024
 0.00013  0.00000  0.00014   4949.89   0.0004   0.0025
 0.00013  0.00000  0.00014   4956.75   0.0004   0.0025
 0.00013  0.00000  0.00014   5048.17   0.0004   0.0024
 0.00013  0.00000  0.00014   5049.18   0.0004   0.0024
 Average                     5010.64

 Calculating matgen2 overhead
 Overhead for 1 matgen   0.00003 seconds

 Times for array with leading dimension of 200

   dgefa    dgesl    total    Mflops     unit    ratio
 0.00012  0.00000  0.00013   5372.95   0.0004   0.0023
 0.00012  0.00000  0.00013   5374.23   0.0004   0.0023
 0.00012  0.00000  0.00013   5370.76   0.0004   0.0023
 0.00012  0.00000  0.00013   5462.18   0.0004   0.0022
 0.00012  0.00000  0.00013   5463.74   0.0004   0.0022
 Average                     5408.77

 Unrolled Double Precision   5010.64 Mflops

 Single Precision C Whetstone Benchmark AVX 64 Bit, Tue Jan 18 23:29:03 2022

 Calibrate
   0.01 Seconds      1 Passes (x 100)
   0.01 Seconds      5 Passes (x 100)
   0.05 Seconds     25 Passes (x 100)
   0.23 Seconds    125 Passes (x 100)
   1.12 Seconds    625 Passes (x 100)
   5.49 Seconds   3125 Passes (x 100)

 Use 5695 passes (x 100)

 Single Precision C/C++ Whetstone Benchmark

 Loop content                Result            MFLOPS      MOPS   Seconds

 N1 floating point   -1.12475013732910156    1324.013               0.083
 N2 floating point   -1.12274742126464844    1319.724               0.580
 N3 if then else      1.00000000000000000                 0.000     0.000
 N4 fixed point      12.00000000000000000              6512.599     0.275
 N5 sin,cos etc.      0.49911010265350342               134.391     3.526
 N6 floating point    0.99999982118606567     977.216               3.144
 N7 assignments       3.00000000000000000              3908.815     0.269
 N8 exp,sqrt etc.     0.75110864639282227                99.207     2.135

 MWIPS                                       5688.198              10.012
Faster Than Expected
The following is intended to show that my MP MFLOPS benchmark is not the only one that produces higher performance than expected when compiled to use SSE and AVX instructions, in this case with double precision variables. The other benchmark considered is the Livermore Loops, which is naturally double precision. In both cases, the disassembled code was checked to ensure that it contained no Fused Multiply-Add (FMA) instructions. According to the documentation consulted, maximum performance without FMA is then 2.0 MFLOPS per MHz using SSE instructions with 128-bit xmm registers, and 4.0 using AVX instructions with 256-bit ymm registers.
Below, the Livermore Loops example shows the full displayed output for the kernel producing maximum MFLOPS, along with its source code, which has 16 floating point operations, and the compile commands used. The SSE result indicates 3.56 MFLOPS per MHz, thought to be impossible without FMA. The AVX result provides 4.86 MFLOPS per MHz, 21.5% higher than the expected maximum.
The same range of results, source code and compile options is provided for the MP MFLOPS benchmark, using the same SSE and AVX options and running via a single thread. Looking at the results for the first data size (102400 words), least likely to involve RAM data transfers, SSE produced 12437 to 16602 MFLOPS, equating to 3.00 to 4.00 per MHz, and AVX 14606 to 32420 MFLOPS, or 3.52 to 7.81 per MHz. These ranges include the Livermore Loops ratios, but the larger ones are higher than might be expected using FMA, with the particular combination of instructions shown.
Unexpectedly high levels of performance were also produced when running the benchmarks on the much older Core i7 PC. The Livermore Loops SSE maximum was 3.05 MFLOPS per MHz, and single thread DP MP-MFLOPS reached 3.14 MFLOPS per MHz with SSE and 5.87 with AVX.
 4150 MHz Core i5 Livermore Loops Benchmark

 Kernel       Floating Pt ops
 No  Passes E No     Total     Secs.   MFLOPS Span       Checksums        OK
------------ -- ------------- ----- -------- ---- ---------------------- --
 SSE2
 7   4 x1037 16  1.320723e+10  0.89 14782.74  995  6.104251075174761e+04 16
 AVX
 7   4 x1423 16  1.812333e+10  0.90 20184.30  995  6.104251075174761e+04 16

 Kernel 7 C Code

    for ( k=0 ; k < n ; k++ )
    {
        x[k] = u[k] + r*( z[k] + r*y[k] ) +
               t*( u[k+3] + r*( u[k+2] + r*u[k+1] ) +
               t*( u[k+6] + q*( u[k+5] + q*u[k+4] ) ) );
    }

 Compiled With

    gcc lloops.c -O3 -msse2 -m64 -lrt -lc -lm -o lloopssse2
 and
    gcc lloops.c -O3 -mavx -m64 -lrt -lc -lm -o lloopsavx

 #####################################################

 4150 MHz Core i5 MP DP MFLOPS Benchmark 1 Thread

  Test             4 Byte  Ops/   Repeat     Seconds  MFLOPS     First   All
                    Words  Word   Passes                       Results  Same
 SSE2
 Data in & out     102400    2    75000     1.234995   12437  0.414016   Yes
 Data in & out    1024000    2     7500     3.085865    4978  0.812316   Yes
 Data in & out   10240000    2      750     4.262126    3604  0.977908   Yes
 Data in & out     102400    8    75000     3.723678   16500  0.563491   Yes
 Data in & out    1024000    8     7500     4.814260   12762  0.883058   Yes
 Data in & out   10240000    8      750     5.615416   10941  0.986707   Yes
 Data in & out     102400   32    75000    14.803324   16602  0.353716   Yes
 Data in & out    1024000   32     7500    15.063927   16314  0.723569   Yes
 Data in & out   10240000   32      750    15.063069   16315  0.964957   Yes
 AVX
 Data in & out     102400    2    75000     1.051636   14606  0.414016   Yes
 Data in & out    1024000    2     7500     2.418388    6351  0.812316   Yes
 Data in & out   10240000    2      750     4.170949    3683  0.977908   Yes
 Data in & out     102400    8    75000     1.890234   32504  0.563491   Yes
 Data in & out    1024000    8     7500     3.183412   19300  0.883058   Yes
 Data in & out   10240000    8      750     5.054079   12157  0.986707   Yes
 Data in & out     102400   32    75000     7.580423   32420  0.353716   Yes
 Data in & out    1024000   32     7500     7.873082   31215  0.723569   Yes
 Data in & out   10240000   32      750     8.061002   30488  0.964957   Yes

 C Function Code 8 Operations per Word

    for(i=0; i < n; i++) x[i] = (x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f;

 Compiled With

    gcc mpmflops2dp.c -lpthread -msse2 -lrt -lc -lm -O3 -o MPmflops64SSE2DP
 and
    gcc mpmflops2dp.c -lpthread -mavx -lrt -lc -lm -O3 -o MPmflops64AVXDP