Evaluation Of The 2-PAD Aperture Array Software Beamformer System
Dr Aziz Ahmedsaid
University of Manchester
December 2010


Outline
· Objectives
· Hardware platform
· Beamforming architecture
· Software developments
· Implementation
· Performance evaluation
· Conclusions


Objectives/Beamformer requirements
· The ability to create multiple simultaneous beams (1 to 8 beams)
· Two polarisations
· A relatively large bandwidth (several MHz)
· Ability to trade bandwidth for beams
· Ability to perform accurate calibration
· Ability to mitigate RFI


Hardware Platform
IBM Cyclops System:
· Very dense multi-node HPC system
· 3D grid configuration
· Maximum size 24×24×24 nodes
· Each node consists of one multi-processor Cyclops chip
· Each Cyclops device contains 80 processing elements, each with dual thread units (dual ALU) and a single FPMAC unit


The Cyclops Processor Device


Simplified Cyclops Diagram


Cyclops Capabilities
· 500 MHz system clock frequency
· 80 GFLOPS peak processing power
· 9 GBytes/s data throughput per port
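
The 80 GFLOPS peak figure is consistent with the 500 MHz clock and the 80 FPMAC units per chip, if each fused multiply-accumulate is counted as two floating-point operations per cycle. That counting rule is an assumption, not something stated on the slide. A minimal Python check, which also converts the "bytes per clock cycle" unit used in later plots into MBytes/s (taking 1 MByte as 10^6 bytes, again an assumption):

```python
CLOCK_HZ = 500e6        # system clock frequency quoted on the slide
FPMAC_UNITS = 80        # one FPMAC unit per processing element
FLOPS_PER_FPMAC = 2     # assumption: one multiply plus one add per cycle


def peak_gflops(clock_hz=CLOCK_HZ, units=FPMAC_UNITS,
                flops_per_unit=FLOPS_PER_FPMAC):
    """Peak floating-point rate of one chip in GFLOPS."""
    return clock_hz * units * flops_per_unit / 1e9


def bytes_per_cycle_to_mbytes_per_s(bytes_per_cycle, clock_hz=CLOCK_HZ):
    """Convert a rate in bytes per clock cycle into MBytes/s (10^6 bytes)."""
    return bytes_per_cycle * clock_hz / 1e6


print(peak_gflops())                          # peak rate in GFLOPS
print(bytes_per_cycle_to_mbytes_per_s(1.0))   # 1 byte per cycle in MBytes/s
```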


Beamforming architecture


Frequency domain beamforming


Logical Mapping


Physical Mapping on a 3X3X3 system
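
As a rough illustration of how node numbers relate to positions in a 3×3×3 grid, the sketch below maps a linear node index to (x, y, z) coordinates and lists face-adjacent neighbours. The numbering scheme (x varying fastest) and the absence of wrap-around links are assumptions made for illustration; the slides do not state how nodes are numbered. Under this assumed scheme, node 13 falls at the centre of the cube and has six neighbours, one per port.

```python
def node_coords(index, size=3):
    """Map a linear node index to (x, y, z), assuming x varies fastest."""
    if not 0 <= index < size ** 3:
        raise ValueError("node index out of range")
    x = index % size
    y = (index // size) % size
    z = index // (size * size)
    return x, y, z


def neighbours(index, size=3):
    """Indices of face-adjacent nodes (assumes no wrap-around links)."""
    x, y, z = node_coords(index, size)
    steps = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))
    found = []
    for dx, dy, dz in steps:
        nx, ny, nz = x + dx, y + dy, z + dz
        if all(0 <= c < size for c in (nx, ny, nz)):
            found.append(nx + size * ny + size * size * nz)
    return sorted(found)


print(node_coords(13))   # position of node 13 under the assumed numbering
print(neighbours(13))    # its face-adjacent neighbours
print(neighbours(0))     # a corner node has fewer neighbours
```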


Software development


Single port IO Data Rates utilising ETI Libraries*
[Figure: data rate in bytes per clock cycle (axis 0 to 0.04) against number of bytes (64 to 16384)]
* Simulated using Cyclops Simulator


Single port IO Data Rates utilising Custom Developed Libraries*
[Figure: data rate in bytes per clock cycle (axis 0 to 2.5) against number of bytes (64 to 16384)]
* Simulated using Cyclops Simulator


Data rates relative comparison
[Figure: Custom lib vs ETI lib; speed-up (Custom/ETI, axis 0 to 70) against number of bytes (64 to 16384)]


Implementation


Beamformer implementation
· 2 polarisations
· 1, 2, 4 or 8 beams
· 1 byte input data and 2 bytes complex output data
· Mixed arithmetic precision: coefficients are applied in double-precision floating point and partial beams are accumulated in 8-bit integer format
· On-the-fly updatable beamforming coefficients
· Can trade off between the bandwidth and the number of beams
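
The core weight-and-sum operation of a frequency-domain beamformer can be sketched as below. This is a simplified illustration, not the Cyclops implementation: the function names, the data layout, and the choice to quantise only once at the output (rather than after each partial-beam accumulation) are all assumptions. Coefficients are applied in double precision, since Python's `complex` uses doubles, and each output sample is saturated to a pair of signed 8-bit integers, matching the 2-byte complex output format.

```python
def clip_int8(value):
    """Round to the nearest integer and saturate to the signed 8-bit range."""
    return max(-128, min(127, int(round(value))))


def beamform(samples, weights):
    """
    samples: samples[antenna][channel], complex frequency-domain values.
    weights: weights[beam][antenna][channel], complex coefficients.
    Returns beams[beam][channel] as (re, im) pairs of 8-bit integers.
    """
    n_channels = len(samples[0])
    beams = []
    for beam_weights in weights:
        row = []
        for ch in range(n_channels):
            acc = 0j
            # Weighted sum over antennas for this beam and channel.
            for ant, ant_samples in enumerate(samples):
                acc += beam_weights[ant][ch] * ant_samples[ch]
            row.append((clip_int8(acc.real), clip_int8(acc.imag)))
        beams.append(row)
    return beams


# Two antennas, two channels, one beam (illustrative values only).
samples = [[1 + 0j, 2 + 0j], [1 + 0j, 0 + 2j]]
weights = [[[1 + 0j, 1 + 0j], [1 + 0j, -1j]]]
print(beamform(samples, weights))
```

Because the coefficients are plain arguments, swapping in a new `weights` array between calls is one simple way to model on-the-fly coefficient updates.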


Program Execution
· At the node level, operations are synchronised using barriers
· Between nodes, operations are data driven
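
Barrier synchronisation within a node can be illustrated with Python's standard `threading.Barrier`. The sketch is only an analogy for the pattern: the function name and the summing workload are hypothetical, and the actual Cyclops thread units are programmed at a much lower level. Each worker processes its own slice of a buffer, all workers wait at the barrier, and only then does one of them combine the partial results.

```python
import threading


def run_node(n_threads, data):
    """Each thread sums its slice, waits at a barrier, then thread 0 combines."""
    barrier = threading.Barrier(n_threads)
    partial = [0] * n_threads
    result = {}

    def worker(tid):
        # Phase 1: every thread processes its own strided slice of the buffer.
        partial[tid] = sum(data[tid::n_threads])
        # No thread proceeds until all partial sums have been written.
        barrier.wait()
        # Phase 2: a single thread combines the partial results.
        if tid == 0:
            result["total"] = sum(partial)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result["total"]


print(run_node(4, list(range(100))))
```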


Activity diagram for Node 0 (Outer node)

Packet size 256, 64 packets, 64 processing thread units, 1 beam


Activity diagram for Node 1 (Inner node)

Packet size 256, 64 packets, 64 processing thread units, 1 beam


Activity diagram for Node 10 (Inner node)

Packet size 256, 64 packets, 64 processing thread units, 1 beam


Activity diagram for Node 13 (Centre node)

Packet size 256, 64 packets, 64 processing thread units, 1 beam


Performance evaluation


[Figure: Computation time vs number of processing thread units; clock cycles (axis 0 to 80000) against number of processing thread units (8 to 128)]

(Packet size 512, 16 packets per buffer)


Activity diagram for Node 0 (Outer node)

Packet size 256, 64 packets, 32 processing thread units, 1 beam


Activity diagram for Node 0 (Outer node)

Packet size 256, 64 packets, 64 processing thread units, 1 beam


[Figure: Data rate vs packet size; bytes per clock cycle (axis 0 to 1) against packet size (32 to 512)]

(32 processing thread units, 64 packets)


[Figure: Data rate vs number of packets; bytes per clock cycle (axis 0 to 1.4) against number of packets (4 to 64)]

Packet size 256, 64 processing thread units


Performance Summary
Highest performance obtained for:
· Largest packet size
· Largest number of packets
· Largest number of processing thread units


Bandwidth performance
Bandwidth = DataRate / (2 × NumberOfPolarisations × NumberOfBeams)

Number of beams   Computation time (clock cycles)   Data rate (MBytes/s)   Bandwidth per beam, per polarisation (MHz)
1                 42510                             771                    193
2                 41826                             783                    98
4                 35725                             917                    57
8                 42168                             777                    24

Packet size 512, 64 packets, 64 processing thread units, buffer size 64K
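
The bandwidth formula on this slide can be checked against the measured data rates with a few lines of Python. The function and variable names are illustrative; the data rates are the ones quoted on the slide, and two polarisations are assumed as stated in the beamformer requirements. The factor of 2 in the denominator presumably reflects the 2-byte complex output sample, though the slide does not say so explicitly.

```python
def bandwidth_mhz(data_rate_mbytes_s, n_polarisations, n_beams):
    """Bandwidth per beam, per polarisation, in MHz."""
    return data_rate_mbytes_s / (2 * n_polarisations * n_beams)


# Measured data rates (MBytes/s) keyed by number of beams, from the slide.
measured = {1: 771, 2: 783, 4: 917, 8: 777}

table = {beams: round(bandwidth_mhz(rate, 2, beams))
         for beams, rate in measured.items()}

for beams, mhz in table.items():
    print(f"{beams} beam(s): {mhz} MHz per beam, per polarisation")
```

The rounded results reproduce the bandwidth column (193, 98, 57 and 24 MHz), which shows how doubling the number of beams roughly halves the bandwidth available to each.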


Hardware utilisation efficiency
· 15 out of 27 nodes
  - ~70×15 out of 160×15 thread units (44%)
  - 64×8 out of 80×15 FPMAC units (43%)
  - 617 KBytes out of 5120 KBytes internal memory (12%)
  - 3 out of 6 ports (50%)
· Overall utilisation ~46% (for the 15 nodes)
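
The per-resource percentages follow directly from the quoted counts, as the short Python check below shows. The helper name is illustrative. The slide does not state how the individual figures are weighted to give the overall value, so that figure is not reproduced here.

```python
def percent(used, available):
    """Utilisation as a percentage of the available resource."""
    return 100.0 * used / available


utilisation = {
    "thread units": percent(70 * 15, 160 * 15),
    "FPMAC units": percent(64 * 8, 80 * 15),
    "internal memory": percent(617, 5120),
    "ports": percent(3, 6),
}

for resource, value in utilisation.items():
    print(f"{resource}: {value:.0f}%")
```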


Conclusions
· Heavily optimised low-level software development is required in order to achieve high performance.
· Software productivity is low for this type of software development.
· Raw number-crunching power does not guarantee high performance. Fixed-point arithmetic is likely to be adequate within the signal processing functions, so floating-point capability is a luxury item which needs to be targeted appropriately.
· High-speed interconnects do not guarantee high data rates without carefully matched, localised, dedicated buffer memories.
· Memory structure and bandwidth are likely to be key parameters.
· Achieving an optimal level of resource utilisation is difficult, resulting in hardware inefficiency.
· The beamforming process "shape" is not a natural fit onto typical "cube" style general-purpose multi-core processing systems.
· Even the most highly capable hardware platforms, such as Cyclops, are extremely difficult to exploit to their full potential.


Continuation Of Work
· Further analysis of the relationship between hardware utilisation, data movement, processing activity and power consumption.
· Further analysis of numerical precision, data growth and utilisation efficiency.
· Further research into programming methodology and software implementation productivity issues.
· Comparative analysis of other hardware architectures.


Acknowledgments
Bruce Elmegreen, Maria Butrico, Ben Maron, Brett Patane and Steven Janssen (IBM Watson Research Center)
John Tully (ETI)
Andrew Faulkner (University of Cambridge)
Prof Mike Jones, Kris Zarb Adami (University of Oxford)
Steph Salvini, Ben Mort, Fred Dulwich (Oxford ERC)
Chris Shenton, Tim Ikin, Georgina Harris, Prof Tony Brown (University of Manchester)