IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS

Propagation pattern for moment representation of the lattice Boltzmann method
Gounley J, Vardhan M, Draeger EW, Valero-Lara P, Moore SV and Randles A
A propagation pattern for the moment representation of the regularized lattice Boltzmann method (LBM) in three dimensions is presented. Using effectively lossless compression, the simulation state is stored as a set of moments of the lattice Boltzmann distribution function, instead of the distribution function itself. An efficient cache-aware propagation pattern for this moment representation has the effect of substantially reducing both the storage and memory bandwidth required for LBM simulations. This paper extends recent work with the moment representation by expanding the performance analysis on central processing unit (CPU) architectures, considering how boundary conditions are implemented, and demonstrating the effectiveness of the moment representation on a graphics processing unit (GPU) architecture.
Runtime and Architecture Support for Efficient Data Exchange in Multi-Accelerator Applications
Cabezas J, Gelado I, Stone JE, Navarro N, Kirk DB and Hwu WM
Heterogeneous parallel computing applications often process large data sets that require multiple GPUs to jointly meet their needs for physical memory capacity and compute throughput. However, the lack of high-level abstractions in previous heterogeneous parallel programming models force programmers to resort to multiple code versions, complex data copy steps and synchronization schemes when exchanging data between multiple GPU devices, which results in high software development cost, poor maintainability, and even poor performance. This paper describes the HPE runtime system, and the associated architecture support, which enables a simple, efficient programming interface for exchanging data between multiple GPUs through either interconnects or cross-node network interfaces. The runtime and architecture support presented in this paper can also be used to support other types of accelerators. We show that the simplified programming interface reduces programming complexity. The research presented in this paper started in 2009. It has been implemented and tested extensively in several generations of HPE runtime systems as well as adopted into the NVIDIA GPU hardware and drivers for CUDA 4.0 and beyond since 2011. The availability of real hardware that support key HPE features gives rise to a rare opportunity for studying the effectiveness of the hardware support by running important benchmarks on real runtime and hardware. Experimental results show that in a exemplar heterogeneous system, peer DMA and double-buffering, pinned buffers, and software techniques can improve the inter-accelerator data communication bandwidth by 2×. They can also improve the execution speed by 1.6× for a 3D finite difference, 2.5× for 1D FFT, and 1.6× for merge sort, all measured on real hardware. The proposed architecture support enables the HPE runtime to transparently deploy these optimizations under simple portable user code, allowing system designers to freely employ devices of different capabilities. We further argue that simple interfaces such as HPE are needed for most applications to benefit from advanced hardware features in practice.
Cluster-based Epidemic Control Through Smartphone-based Body Area Networks
Zhang Z, Wang H, Wang C and Fang H
Increasing population density, closer social contact and interactions make epidemic control difficult. Traditional offline epidemic control methods (e.g., using medical survey or medical records) or model-based approach are not effective due to its inability to gather health data and social contact information simultaneously or impractical statistical assumption about the dynamics of social contact networks, respectively. In addition, it is challenging to find optimal sets of people to be quarantined to contain the spread of epidemics for large populations due to high computational complexity. Unlike these approaches, in this paper, a novel cluster-based epidemic control scheme is proposed based on Smartphone-based body area networks. The proposed scheme divides the populations into multiple clusters based on their physical location and social contact information. The proposed control schemes are applied within the cluster or between clusters. Further, we develop a computational efficient approach called UGP to enable an effective cluster-based quarantine strategy using graph theory for large scale networks (i.e., populations). The effectiveness of the proposed methods is demonstrated through both simulations and experiments on real social contact networks.
Comparative Performance Analysis of Intel Xeon Phi, GPU, and CPU: A Case Study from Microscopy Image Analysis
Teodoro G, Kurc T, Kong J, Cooper L and Saltz J
We study and characterize the performance of operations in an important class of applications on GPUs and Many Integrated Core (MIC) architectures. Our work is motivated by applications that analyze low-dimensional spatial datasets captured by high resolution sensors, such as image datasets obtained from whole slide tissue specimens using microscopy scanners. Common operations in these applications involve the detection and extraction of objects (object segmentation), the computation of features of each extracted object (feature computation), and characterization of objects based on these features (object classification). In this work, we have identify the data access and computation patterns of operations in the object segmentation and feature computation categories. We systematically implement and evaluate the performance of these operations on modern CPUs, GPUs, and MIC systems for a microscopy image analysis application. Our results show that the performance on a MIC of operations that perform regular data access is comparable or sometimes better than that on a GPU. On the other hand, GPUs are significantly more efficient than MICs for operations that access data irregularly. This is a result of the low performance of MICs when it comes to random data access. We also have examined the coordinated use of MICs and CPUs. Our experiments show that using a performance aware task strategy for scheduling application operations improves performance about 1.29× over a first-come-first-served strategy. This allows applications to obtain high performance efficiency on CPU-MIC systems - the example application attained an efficiency of 84% on 192 nodes (3072 CPU cores and 192 MICs).