White Paper

www.mrcy.com

As shown in Table 7, the measured time to run the STAP benchmark on four cores is 28.29 ms, which is below the CPI time of 32.5 ms. Four cores deliver an overall 45.4 Gflop/s, which is also above the required 39.82 Gflop/s.

Measured scalability

Using MPI, the STAP benchmark has been scaled to run on a varying number of cores. The measured aggregate operation rate is shown in Figure 6.

Figure 6. Measured scalability

The measured operation rate per core is illustrated more clearly in Figure 7.

Figure 7. Measured scalability, Gflop/s per core

Most parts of the application scale reasonably well; some scale factors are more efficient than others, depending on the application and the data size. Because of the latency induced by data movement, also using the second processor (cores 13-24) reduces the per-core performance. The dual QPI links between the processors help to reduce this effect.

Since the problem size used here can be solved using four cores, scaling it to 24 cores (as shown in this section) is more academic than of practical use. A more demanding application (e.g. with larger data sizes) would scale more evenly over multiple processors. Instead of splitting a single QRD over many cores, it would be more efficient to run multiple QRDs in parallel on groups of cores. A STAP application might be partitioned differently from this benchmark, resulting in improved scalability.

The level of scalability depends on the type of computation and on the partitioning. For compute-bound problems (small FFT sizes) the scalability of algorithms is straightforward, but for larger FFT sizes the problem becomes memory bound (the data does not fit in cache). This is shown in Figure 8.

Figure 8. Measured Gflop/s, 12 vs 24 cores

Figure 8 shows both Xeon D and Xeon E performance. The difference between these is discussed in the next section.
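The suggestion above, to run multiple QRDs in parallel on groups of cores rather than splitting one QRD over many cores, can be illustrated with a minimal sketch. It is not the benchmark's code. The helper names `assign_core_groups` and `distribute_qrds`, the group size of four cores, and the count of 12 independent QRD problems are all assumptions made for illustration. The group index plays the role of the `color` argument that an MPI program would pass to `MPI_Comm_split` to form one sub-communicator per group.

```python
def assign_core_groups(num_cores, cores_per_group):
    """Map each core (rank) index to a group index.

    Hypothetical helper: the returned value per rank is what an MPI
    program could pass as 'color' to MPI_Comm_split.
    """
    if cores_per_group <= 0 or num_cores % cores_per_group != 0:
        raise ValueError("num_cores must be a positive multiple of cores_per_group")
    return [rank // cores_per_group for rank in range(num_cores)]


def distribute_qrds(num_qrds, num_groups):
    """Deal independent QRD problems round-robin to the core groups."""
    work = [[] for _ in range(num_groups)]
    for qrd_index in range(num_qrds):
        work[qrd_index % num_groups].append(qrd_index)
    return work


# Assumed configuration: 24 cores in groups of four, 12 independent QRDs.
groups = assign_core_groups(num_cores=24, cores_per_group=4)
num_groups = max(groups) + 1
work = distribute_qrds(num_qrds=12, num_groups=num_groups)

for group_index, qrds in enumerate(work):
    ranks = [r for r, g in enumerate(groups) if g == group_index]
    print(f"group {group_index}: ranks {ranks[0]}-{ranks[-1]} solve QRDs {qrds}")
```

With ranks numbered consecutively per processor (an assumption about core numbering), groups of four never straddle the two 12-core processors, so each QRD's data movement stays within one processor.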
| Functional block | Function | Float operations per CPI | Share of total float operations [%] | Operation rate required [Gflop/s] | Four cores measured time [s] | Share of total time [%] | Four cores operation rate [Gflop/s] |
|---|---|---|---|---|---|---|---|
| Pre-processing | Short to float cast | | | | 0.00085 | 3.0% | |
| | Demodulation to baseband | 5,406,720 | 0.4% | 0.17 | 0.00055 | 1.9% | 9.84 |
| | Low-pass filter (FIR) and decimation | 72,990,720 | 5.7% | 2.26 | 0.00105 | 3.7% | 69.52 |
| | Pulse compression array calibration | 92,995,584 | 7.2% | 2.88 | 0.00085 | 3.0% | 109.41 |
| | Misc. pre-processing | | | | 0.00015 | 0.5% | 0 |
| Adaptive processing | Doppler processing | 21,626,880 | 1.7% | 0.67 | 0.0002 | 0.7% | 108.14 |
| | QRD | 1,070,530,560 | 83.4% | 33.19 | 0.01731 | 61.2% | 61.85 |
| | Solve for adaptive weights | 4,460,544 | 0.3% | 0.14 | 0.00097 | 3.4% | 4.6 |
| | Weights application | 16,220,160 | 1.3% | 0.50 | 0.00041 | 1.4% | 39.57 |
| Other | Corner-turn 2x | | | | 0.00314 | 11.1% | |
| | Misc. processing | | | | 0.0028 | 9.9% | |
| Total run time | | 1,284,231,168 | 100% | 39.82 | 0.02829 | 100.0% | 45.4 |

Table 7. Third-order factored-Doppler STAP - four cores
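The operation rates in Table 7 follow from dividing each function's float-operation count by its measured time. The sketch below recomputes them from the table's own figures. The variable and function names are illustrative only, and because the times in the table are rounded, the recomputed rates may differ from the printed ones in the last digit.

```python
# Rows of Table 7 that list a float-operation count:
# (function, float operations per CPI, measured time on four cores [s])
TABLE_7_ROWS = [
    ("Demodulation to baseband", 5_406_720, 0.00055),
    ("Low-pass filter (FIR) and decimation", 72_990_720, 0.00105),
    ("Pulse compression array calibration", 92_995_584, 0.00085),
    ("Doppler processing", 21_626_880, 0.0002),
    ("QRD", 1_070_530_560, 0.01731),
    ("Solve for adaptive weights", 4_460_544, 0.00097),
    ("Weights application", 16_220_160, 0.00041),
]

TOTAL_OPS = 1_284_231_168      # "Total run time" row, float operations per CPI
TOTAL_TIME_S = 0.02829         # "Total run time" row, measured time [s]
REQUIRED_GFLOPS = 39.82        # required operation rate from Table 7


def gflops(float_ops, seconds):
    """Operation rate in Gflop/s for a given operation count and run time."""
    return float_ops / seconds / 1e9


for name, ops, seconds in TABLE_7_ROWS:
    print(f"{name:38s} {gflops(ops, seconds):8.2f} Gflop/s")

total_rate = gflops(TOTAL_OPS, TOTAL_TIME_S)
margin = total_rate / REQUIRED_GFLOPS
print(f"Total: {total_rate:.1f} Gflop/s, {margin:.2f}x the required rate")
```

The per-function counts sum to the table's total, and the total rate rounds to the 45.4 Gflop/s quoted in the text. Note that the total includes time spent in functions with no counted float operations (casts, corner turns), so it is lower than the rate of the QRD alone.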


Xeon-D Vs Xeon-E for Embedded Radar Applications
