MOTIVATE (Many-cOre Technology Investigating Value, Application, deploymenT and Efficiency)

Many-core technologies providing energy and cost efficiency for radio astronomy HPC

MOTIVATE is a pathfinder project that investigates the latest many-core technologies, with the aim of delivering energy and cost efficiency in radio astronomy HPC.

MOTIVATE stands for Many-cOre Technology Investigating Value, Application, deploymenT and Efficiency.

The MOTIVATE project is funded by the Oxford-Martin School through the Institute for the Future of Computing.

Astro - Accelerate

The initial pathfinder project for MOTIVATE was to optimize the signal processing and detection software used in the ARTEMIS project, which searches for transient radio events in data recorded by radio telescopes. These signals originate from compact sources (such as pulsars) within our galaxy and beyond.

Dispersion

The ionized part of the interstellar medium (ISM) causes dispersion of radio signals traveling through it: lower frequencies arrive later than higher ones. To recover a dispersed radio signal, the data must be integrated over frequency along the dispersion curve, increasing the signal to a level that is detectable above the noise of the instrument.
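As a rough illustration of the size of this effect, the sketch below evaluates the standard cold-plasma dispersion delay between two frequencies. The function name, the units (MHz, seconds, pc cm^-3) and the example values are assumptions chosen for illustration and are not taken from the MOTIVATE code.

```python
# Dispersion constant in s MHz^2 pc^-1 cm^3 (standard cold-plasma value).
K_DM = 4.148808e3


def dispersion_delay(dm, f_lo_mhz, f_hi_mhz):
    """Delay in seconds of f_lo_mhz relative to f_hi_mhz for a DM in pc cm^-3."""
    return K_DM * dm * (f_lo_mhz ** -2 - f_hi_mhz ** -2)


if __name__ == "__main__":
    # Illustrative values only: DM = 100 pc cm^-3 across a 120-150 MHz band.
    print(f"{dispersion_delay(100.0, 120.0, 150.0):.2f} s")
```

With these example values the delay across the band is roughly ten seconds, which is why an uncorrected sum over frequency smears a short pulse out.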

De-dispersion

Since we do not know in advance how dispersed a signal might be, we must integrate over many different trial dispersion measures (depicted below).
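The idea can be sketched as a brute-force search in plain Python. This is a minimal, unoptimized sketch and not the ARTEMIS implementation: the function names, the list-of-lists data layout `data[channel][time]`, non-negative trial dispersion measures, and rounding of the delay to whole samples are all assumptions made for illustration.

```python
K_DM = 4.148808e3  # dispersion constant, s MHz^2 pc^-1 cm^3


def shift_samples(dm, f_mhz, f_ref_mhz, t_samp):
    """Delay, in whole samples, of channel f_mhz relative to f_ref_mhz."""
    return int(round(K_DM * dm * (f_mhz ** -2 - f_ref_mhz ** -2) / t_samp))


def dedisperse(data, freqs_mhz, trial_dms, t_samp):
    """Sum data[channel][time] along each trial dispersion curve.

    Returns out[dm_index][time]. Only output times for which every
    shifted sample lies inside the data are produced.
    """
    f_ref = max(freqs_mhz)  # delays are measured from the highest frequency
    shifts = [[shift_samples(dm, f, f_ref, t_samp) for f in freqs_mhz]
              for dm in trial_dms]
    n_out = len(data[0]) - max(max(row) for row in shifts)
    out = []
    for row in shifts:
        series = [0.0] * max(n_out, 0)
        for chan, s in zip(data, row):
            for t in range(len(series)):
                series[t] += chan[t + s]
        out.append(series)
    return out


if __name__ == "__main__":
    # Synthetic example: one pulse dispersed with DM = 2 in four channels.
    freqs = [150.0, 140.0, 130.0, 120.0]
    t_samp = 0.01
    data = [[0.0] * 64 for _ in freqs]
    for chan, f in zip(data, freqs):
        chan[10 + shift_samples(2.0, f, 150.0, t_samp)] = 1.0
    result = dedisperse(data, freqs, [0.0, 1.0, 2.0, 3.0], t_samp)
    peaks = [max(series) for series in result]
    print("best trial index:", peaks.index(max(peaks)))
```

The trial whose curve matches the true dispersion collects the pulse from every channel into a single time sample, so it shows the highest peak.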

[Figure: de-dispersion over trial dispersion measures]

Acceleration via GPU computing

To produce a GPU kernel that achieves a significant proportion of the GPU's peak performance, we need to ensure three things. First, the accumulator that stores the integrated intensity (along the trial dispersion curve) must sit in the fastest area of memory. Second, the correct data from the (f, t) domain must always be available to the streaming multiprocessors. Third, the shift value must be calculated using as few operations as possible.

The GPU algorithm presented here is designed to exploit the fast L1 cache introduced with the NVIDIA Fermi hardware. It reuses cache lines that are already present in L1, vastly reducing the need to transfer the same data from main memory multiple times. This is achieved by having each thread process several time elements for its given value of dispersion, holding these values in local registers (below).

[Figure: each thread holding several time elements in registers for one dispersion value]

As a result, each thread block processes a rectangular region of dispersion-time (dm, t) space, so that cache lines of (f, t) data are reused many times (below).

[Figure: thread blocks covering rectangular regions of (dm, t) space]
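The work decomposition described above can be emulated sequentially in Python. This is only a sketch of the loop structure, not the actual GPU kernel: the tile sizes, function names and data layout are assumptions, a small local list stands in for the per-thread registers, and no real caching takes place.

```python
K_DM = 4.148808e3  # dispersion constant, s MHz^2 pc^-1 cm^3


def shift_samples(dm, f_mhz, f_ref_mhz, t_samp):
    """Delay, in whole samples, of channel f_mhz relative to f_ref_mhz."""
    return int(round(K_DM * dm * (f_mhz ** -2 - f_ref_mhz ** -2) / t_samp))


def dedisperse_tiled(data, freqs_mhz, trial_dms, t_samp,
                     dms_per_block=2, threads_t=4, times_per_thread=4):
    """De-disperse data[channel][time] one (dm, t) rectangle at a time."""
    f_ref = max(freqs_mhz)
    max_shift = max(shift_samples(dm, f, f_ref, t_samp)
                    for dm in trial_dms for f in freqs_mhz)
    n_out = max(len(data[0]) - max_shift, 0)
    out = [[0.0] * n_out for _ in trial_dms]
    tile_t = threads_t * times_per_thread  # time extent of one "thread block"

    for dm0 in range(0, len(trial_dms), dms_per_block):   # block row in dm
        for t0 in range(0, n_out, tile_t):                # block column in t
            dm_end = min(dm0 + dms_per_block, len(trial_dms))
            for d in range(dm0, dm_end):                  # one "thread" per
                for th in range(threads_t):               # (dm, th) pair
                    base = t0 + th * times_per_thread
                    acc = [0.0] * times_per_thread        # stands in for registers
                    for chan, f in zip(data, freqs_mhz):
                        # One shift per channel, reused for every time
                        # element this thread owns.
                        s = shift_samples(trial_dms[d], f, f_ref, t_samp)
                        for k in range(times_per_thread):
                            if base + k < n_out:
                                acc[k] += chan[base + k + s]
                    for k in range(times_per_thread):
                        if base + k < n_out:
                            out[d][base + k] = acc[k]
    return out
```

On a GPU the two innermost "thread" loops would run in parallel, and neighbouring threads in a block would read overlapping stretches of each channel, which is where the cache-line reuse comes from.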

Comparisons of GPU and CPU algorithms

The following plot compares results from our GPU kernel with a vector-parallel CPU code that exploits either the SSE registers on a multiprocessor Intel Xeon machine or the AVX registers on a new Intel i7 Sandy Bridge machine (overclocked from 3.2 GHz to 4.2 GHz, with 1600 MHz DDR3 SDRAM). Both CPU codes were designed with maximum cache-line usage in mind and use Intel intrinsics in their vector parts. Results from code vectorized by the Intel auto-vectorizer are not presented because they are consistently slower (approximately 3x in our region of interest) than our hand-vectorized code.

 

[Figure: proportion of real time taken by the CPU and GPU codes against number of frequency channels]

The plot shows the proportion of real time taken by the CPU and GPU codes (on the different platforms) against a varying number of frequency channels. Importantly, we hold the maximum dispersion measure at 200 (this relates to the lowest gradient of the maximum dispersion search). However, to ensure that we do not sub-sample the data, we set the total number of dispersion measures equal to the total number of channels.
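For readers unfamiliar with the metric, the short sketch below shows one way to compute a proportion of real time and to build a trial grid with one dispersion measure per channel. Both function names are hypothetical, and the evenly spaced grid from 0 up to and including the maximum is an assumption; the source does not state how the trials were spaced.

```python
def real_time_fraction(processing_seconds, n_samples, t_samp):
    """Processing time divided by the duration of the data processed.

    A value below 1.0 means the code keeps up with the incoming data.
    """
    return processing_seconds / (n_samples * t_samp)


def trial_dm_grid(dm_max, n_channels):
    """Evenly spaced trial DMs, one per channel (assumed spacing).

    Requires n_channels >= 2 so that both 0 and dm_max are included.
    """
    return [dm_max * i / (n_channels - 1) for i in range(n_channels)]


if __name__ == "__main__":
    # Illustrative numbers only, not measured results.
    print(real_time_fraction(2.0, 1000, 0.01))
    print(trial_dm_grid(200.0, 5))
```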

 

Recent Talks:

GPU accelerated de-dispersion for LOFAR transient searches

LOFAR TKP Meeting, Oxford, 12th-15th June 2012

A GPU based brute force de-dispersion algorithm for LOFAR

LOFAR Single Station Meeting, Nançay, 9th May 2012

Searches for millisecond radio bursts with GPUs on LOFAR

ASPERA, Computing and Astroparticle Physics Workshop, Hannover, 3rd May 2012

Real-time de-dispersion in astrophysics

Edinburgh University, 9th-10th January 2012

ARTEMIS, MDSM and the search for radio transients.

The Oxford e-Research Centre GPU Seminar Series Michaelmas 2011, OeRC Oxford, 18th November 2011

A GPU-based survey for millisecond radio transients using ARTEMIS.

ADASS XXI, Paris, 7th November 2011

De-dispersion for LOFAR using GPUs.

PrepSKA, OeRC Oxford, 26th October 2011

GPU computing for real-time de-dispersion in astrophysics

Oxford-Man Institute, 29th September 2011

Signal detection and data processing for LOFAR using GPUs.

Oxford e-Research Centre Seminar Series, OeRC Oxford, 8th July 2011

Signal detection and data processing for LOFAR and the SKA using GPUs.

Institute for the Future of Computing Seminars, OeRC Oxford, 20th May 2011

 

Recent Publications:

M. Serylak, A. Karastergiou, C. Williams, W. Armour and LOFAR Pulsar Working Group. Observations of transients and pulsars with LOFAR international stations.

To appear in the proceedings of the Electromagnetic Radiation from Pulsars and Magnetars conference, Zielona Góra, 2012.

W. Armour, A. Karastergiou, M. Giles, C. Williams, A. Magro, K. Zagkouris, S. Roberts, S. Salvini, F. Dulwich and B. Mort. A GPU-based survey for millisecond radio transients using ARTEMIS.

To appear in the proceedings of ADASS XXI, ed. P. Ballester and D. Egret, ASP Conf. Ser.