Original document: http://www.mrao.cam.ac.uk/projects/aavp/presentations/AhmedSaid_Software_beamformer.pdf

Evaluation Of The 2-PAD Aperture Array Software Beamformer System
Dr Aziz Ahmedsaid
University of Manchester December 2010

Outline
· Objectives
· Hardware platform
· Beamforming architecture
· Software developments
· Implementation
· Performance evaluation
· Conclusions

Objectives/Beamformer requirements
· The ability to create multiple simultaneous beams (1 to 8 beams)
· Two polarisations
· A relatively large bandwidth (several MHz)
· Ability to trade bandwidth for beams
· Ability to perform accurate calibration
· Ability to mitigate RFI

Hardware Platform
IBM Cyclops System:
· Very Dense Multi-Node HPC system
· 3D grid configuration
· Max size 24X24X24 nodes
· Each node consisting of one multi-processor Cyclops chip
· Each Cyclops Device contains 80 Processing Elements, each with dual Thread Units, dual ALUs and a single FPMAC unit

The Cyclops Processor Device

Simplified Cyclops Diagram

Cyclops Capabilities
· 500 MHz system clock frequency
· 80 GFLOPS peak processing power
· 9 GBytes/s data throughput per port
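As a rough consistency check on the figures above, the quoted peak rate matches 80 FPMAC units each completing one multiply-accumulate per clock cycle, provided a multiply-accumulate is counted as two floating-point operations. That counting convention, and reading 1 GByte as 10^9 bytes, are assumptions made here and are not stated in the slides. The function names are illustrative only.

```python
# Back-of-envelope check of the Cyclops figures quoted above.
CLOCK_HZ = 500e6        # system clock frequency (from the slide)
FPMAC_UNITS = 80        # one FPMAC unit per processing element (from the slides)
FLOPS_PER_MAC = 2       # ASSUMPTION: one multiply-accumulate counts as 2 FLOPs
PORT_BYTES_PER_S = 9e9  # per-port throughput, ASSUMING 1 GByte = 1e9 bytes


def peak_gflops(units, clock_hz, flops_per_mac):
    """Peak rate if every unit issues one multiply-accumulate per clock."""
    return units * clock_hz * flops_per_mac / 1e9


def bytes_per_clock(bytes_per_s, clock_hz):
    """Convert a throughput in bytes per second to bytes per clock cycle."""
    return bytes_per_s / clock_hz


print(peak_gflops(FPMAC_UNITS, CLOCK_HZ, FLOPS_PER_MAC), "GFLOPS")
print(bytes_per_clock(PORT_BYTES_PER_S, CLOCK_HZ), "bytes per clock cycle per port")
```

Expressing the port throughput in bytes per clock cycle puts it in the same unit as the data-rate plots later in the deck.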

Beamforming architecture

Frequency domain beamforming

Logical Mapping

Physical Mapping on a 3X3X3 system

Software development

Single port IO Data Rates utilising ETI Libraries*
[Figure: data rate in bytes per clock cycle (axis 0 to 0.04) against number of bytes (64 to 16384)]
* Simulated using Cyclops Simulator

Single port IO Data Rates utilising Custom Developed Libraries*
[Figure: data rate in bytes per clock cycle (axis 0 to 2.5) against number of bytes (64 to 16384)]
* Simulated using Cyclops Simulator

Data rates relative comparison
[Figure: Custom lib vs ETI lib; speed up (Custom/ETI, axis 0 to 70) against number of bytes (64 to 16384)]

Implementation

Beamformer implementation
· 2 polarisations
· 1, 2, 4 or 8 beams
· 1 byte input data and 2 bytes complex output data
· Mixed arithmetic precision: coefficients are applied in double-precision floating point and partial beams are accumulated in 8-bit integer format
· On-the-fly updatable beamforming coefficients
· Can trade off between the bandwidth and the number of beams
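The mixed-precision scheme listed above can be illustrated with a minimal sketch of a weighted sum for one frequency channel. It is not the Cyclops implementation. The function names, the use of Python complex numbers for the inputs, and the choice to round and saturate once per call are all assumptions, since the slides do not say where in the pipeline quantisation happens.

```python
def saturate_int8(x):
    """Round to the nearest integer and clip to the signed 8-bit range.
    ASSUMPTION: saturation on overflow; the slides do not specify this."""
    return max(-128, min(127, int(round(x))))


def form_beams(samples, weights):
    """Form one output per beam for a single frequency channel.

    samples: one complex value per antenna element (small integer parts).
    weights: weights[b][e] is the complex coefficient of element e in beam b,
             applied in double precision (Python floats are doubles).
    Returns a list of (real, imag) pairs quantised to 8-bit integers,
    i.e. 2 bytes of complex output per beam.
    """
    beams = []
    for beam_weights in weights:
        acc = sum(w * s for w, s in zip(beam_weights, samples))
        beams.append((saturate_int8(acc.real), saturate_int8(acc.imag)))
    return beams


samples = [1 + 1j, 2 - 1j, -3 + 0j, 4 + 2j]
weights = [
    [1, 1, 1, 1],      # beam 0: plain sum of the elements
    [1j, 1j, 1j, 1j],  # beam 1: same sum rotated by 90 degrees
]
print(form_beams(samples, weights))
```

Doubling the number of rows in `weights` doubles the work per input sample, which is the cost that the bandwidth-for-beams trade-off exchanges against.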

Program Execution
· At the node level, operations are synchronised using barriers
· Between nodes, operations are data driven
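The two synchronisation styles above can be sketched with ordinary Python threads. This illustrates the pattern only, not the Cyclops runtime. The thread count, the gains, the queue-based links and the `None` end-of-stream marker are all hypothetical.

```python
import queue
import threading

NUM_THREADS = 4  # hypothetical number of thread units per node


def process_packet(packet, gain):
    """Scale a packet with NUM_THREADS workers synchronised by a barrier."""
    scaled = [0] * len(packet)
    result = []
    barrier = threading.Barrier(NUM_THREADS)

    def worker(tid):
        # Phase 1: each worker scales its own interleaved slice.
        for i in range(tid, len(packet), NUM_THREADS):
            scaled[i] = gain * packet[i]
        barrier.wait()  # nobody proceeds until every slice is written
        # Phase 2: one worker reads the buffer, which is now complete.
        if tid == 0:
            result.append(list(scaled))

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(NUM_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]


def node(inbox, outbox, gain):
    """Data-driven node: runs only when a packet arrives; None ends the stream."""
    while True:
        packet = inbox.get()  # blocks until the upstream node delivers data
        if packet is None:
            outbox.put(None)  # pass the end-of-stream marker downstream
            return
        outbox.put(process_packet(packet, gain))


def run_chain(packets):
    """Push packets through two nodes connected by queues."""
    q0, q1, q2 = queue.Queue(), queue.Queue(), queue.Queue()
    nodes = [
        threading.Thread(target=node, args=(q0, q1, 2)),
        threading.Thread(target=node, args=(q1, q2, 3)),
    ]
    for n in nodes:
        n.start()
    for p in packets:
        q0.put(p)
    q0.put(None)
    out = []
    while True:
        item = q2.get()
        if item is None:
            break
        out.append(item)
    for n in nodes:
        n.join()
    return out


print(run_chain([[1, 2, 3, 4, 5], [6]]))
```

Inside a node the barrier separates the write phase from the read phase. Between nodes there is no shared clock, and each node simply blocks until its input queue has data.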

Activity diagram for Node 0 (Outer node)

Packet size 256, 64 packets, 64 processing thread units, 1 beam

Activity diagram for Node 1 (Inner node)

Packet size 256, 64 packets, 64 processing thread units, 1 beam

Activity diagram for Node 10 (Inner node)

Packet size 256, 64 packets, 64 processing thread units, 1 beam

Activity diagram for Node 13 (Centre node)

Packet size 256, 64 packets, 64 processing thread units, 1 beam

Performance evaluation

[Figure: Computation time vs number of processing thread units; clock cycles (axis 0 to 80000) against number of processing thread units (8 to 128)]
(Packet size 512, 16 packets per buffer)

Activity diagram for Node 0 (Outer node)

Packet size 256, 64 packets, 32 processing thread units, 1 beam

Activity diagram for Node 0 (Outer node)

Packet size 256, 64 packets, 64 processing thread units, 1 beam

[Figure: Data rate vs packet size; bytes per clock cycle (axis 0 to 1) against packet size (32 to 512)]
(32 processing thread units, 64 packets)

[Figure: Data rate vs number of packets; bytes per clock cycle (axis 0 to 1.4) against number of packets (4 to 64)]
(Packet size 256, 64 processing thread units)

Performance Summary
Highest performance obtained for:
· Largest packet size
· Largest number of packets
· Largest number of processing thread units

Bandwidth performance
Bandwidth = DataRate / (2 X NumberOfPolarisations X NumberOfBeams)

Number of beams   Computation time (clock cycles)   Data rate (MBytes/s)   Bandwidth per beam, per polarisation (MHz)
1                 42510                             771                    193
2                 41826                             783                    98
4                 35725                             917                    57
8                 42168                             777                    24

Packet size 512, 64 packets, 64 processing thread units, buffer size 64K
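The bandwidth figures follow from the measured data rates and the formula on this slide. The short check below reproduces them. The leading factor of 2 is taken from the slide formula as given, and the function and variable names are illustrative.

```python
def bandwidth_mhz(data_rate_mbytes_s, n_beams, n_polarisations=2):
    """Bandwidth per beam, per polarisation, from the slide formula:
    Bandwidth = DataRate / (2 x NumberOfPolarisations x NumberOfBeams)."""
    return data_rate_mbytes_s / (2 * n_polarisations * n_beams)


# Measured data rates in MBytes/s, keyed by number of beams (from the slide).
measured = {1: 771, 2: 783, 4: 917, 8: 777}

for beams, rate in measured.items():
    print(beams, "beam(s):", round(bandwidth_mhz(rate, beams)), "MHz")
```

Because the data rate stays between roughly 770 and 920 MBytes/s for every configuration, doubling the number of beams approximately halves the bandwidth available to each one.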

Hardware utilisation efficiency
· 15 out of 27 nodes
  · 70X15 out of 160X15 thread units (44%)
  · 64X8 out of 80X15 FPMAC units (43%)
  · 617KBytes out of 5120KBytes internal memory (12%)
  · 3 out of 6 ports (50%)
· Overall utilisation ~ 46% (for the 15 nodes)
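The per-resource percentages can be recomputed from the used and available counts quoted on the slide. The sketch below does only that. The overall figure of about 46% is the slide's own, and its derivation is not given, so it is not reproduced here.

```python
def percent(used, total):
    """Utilisation of a resource as a percentage."""
    return 100.0 * used / total


# (used, available) pairs as quoted on the slide, for the 15 active nodes.
resources = {
    "thread units": (70 * 15, 160 * 15),
    "FPMAC units": (64 * 8, 80 * 15),
    "internal memory (KBytes)": (617, 5120),
    "ports": (3, 6),
}

for name, (used, total) in resources.items():
    print(f"{name}: {used} of {total} = {percent(used, total):.1f}%")
```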

Conclusions
· Heavily optimised low-level software development is required in order to achieve high performance.
· Software productivity is low for this type of software development.
· Raw number-crunching power does not guarantee high performance. Fixed-point arithmetic is likely to be adequate within the signal processing functions, so floating-point capability is a luxury item which needs to be targeted appropriately.
· High-speed interconnects do not guarantee high data rates without carefully matched, localised, dedicated buffer memories.
· Memory structure and bandwidth is likely to be a key parameter.
· Achieving an optimal level of resource utilisation is difficult, resulting in hardware inefficiency.
· Beamforming process "shape" is not a natural fit onto typical "cube" style general-purpose multi-core processing systems.
· Even the most highly capable hardware platforms, such as Cyclops, are extremely difficult to exploit to full potential.

Continuation Of Work
· Further analysis of the relationship between hardware utilisation, data movement, processing activity and power consumption.
· Further analysis of numerical precision, data growth and utilisation efficiency.
· Further research into programming methodology and software implementation productivity issues.
· Comparative analysis of other hardware architectures.

Acknowledgments
Bruce Elmegreen, Maria Butrico, Ben Maron, Brett Patane and Steven Janssen (IBM Watson Research Center)
John Tully (ETI)
Andrew Faulkner (University Of Cambridge)
Prof Mike Jones, Kris Zarb Adami (University of Oxford)
Steph Salvini, Ben Mort, Fred Dulwich (Oxford ERC)
Chris Shenton, Tim Ikin, Georgina Harris, Prof Tony Brown (University Of Manchester)