PARALLEL COMPUTING

Benchmarking the Performance of Irregular Computations in AutoDock-GPU Molecular Docking
Solis-Vasquez L, Tillack AF, Santos-Martins D, Koch A, LeGrand S and Forli S
Irregular applications can be found in many scientific fields. In computer-aided drug design, molecular docking simulations play an important role in finding promising drug candidates. AutoDock is a software application widely used for predicting molecular interactions at close distances. It is characterized by irregular computations and long execution runtimes. In recent years, a hardware-accelerated version of AutoDock, called AutoDock-GPU, has been under active development. This work benchmarks the recent code and algorithmic enhancements incorporated into AutoDock-GPU. In particular, we analyze the impact on execution runtime of techniques based on early termination. These enable AutoDock-GPU to explore the molecular space as needed while safely avoiding redundant computations. Our results indicate that these techniques can achieve average runtime reductions of 50%. Furthermore, a comprehensive literature review is provided, comparing our work to relevant approaches that leverage hardware acceleration for molecular docking.
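The early-termination idea lends itself to a compact sketch. The C++ outline below is a hypothetical illustration, not AutoDock-GPU code: all names (Pose, evolve, dock) and the stopping parameters are ours, and the stand-in evolve step merely jitters scores so the sketch is self-contained.

```cpp
#include <algorithm>
#include <limits>
#include <random>
#include <vector>

struct Pose { double score; };  // lower score = better docking energy

// Stand-in for one genetic-algorithm generation; a real implementation
// would evaluate the docking energy function. Random jitter keeps the
// sketch self-contained and runnable.
void evolve(std::vector<Pose>& pop, std::mt19937& rng) {
    std::uniform_real_distribution<double> d(-0.5, 0.1);
    for (auto& p : pop) p.score += d(rng);
}

double best_score(const std::vector<Pose>& pop) {
    double best = std::numeric_limits<double>::infinity();
    for (const auto& p : pop) best = std::min(best, p.score);
    return best;
}

// Run up to max_gens generations, but terminate early once the best score
// has not improved by more than `tol` for `patience` consecutive
// generations, avoiding redundant computation after convergence.
int dock(std::vector<Pose>& pop, int max_gens, double tol, int patience,
         std::mt19937& rng) {
    double prev = best_score(pop);
    int stalled = 0, g = 0;
    for (; g < max_gens && stalled < patience; ++g) {
        evolve(pop, rng);
        double cur = best_score(pop);
        stalled = (prev - cur > tol) ? 0 : stalled + 1;
        prev = std::min(prev, cur);
    }
    return g;  // generations actually spent
}
```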
Asynchronous Parallel Stochastic Quasi-Newton Methods
Tong Q, Liang G, Cai X, Zhu C and Bi J
Although first-order stochastic algorithms, such as stochastic gradient descent, have been the main force behind scaling up machine learning models such as deep neural nets, second-order quasi-Newton methods have started to draw attention due to their effectiveness on ill-conditioned optimization problems. The L-BFGS method is one of the most widely used quasi-Newton methods. We propose an asynchronous parallel stochastic quasi-Newton (AsySQN) method. Unlike prior attempts, which parallelize only the gradient calculation or the two-loop recursion of L-BFGS, our algorithm is the first to truly parallelize L-BFGS with a convergence guarantee. By adopting a variance reduction technique, a prior stochastic L-BFGS method, which was not designed for parallel computing, achieves a linear convergence rate. We prove that our asynchronous parallel scheme maintains the same linear convergence rate while achieving significant speedup. Empirical evaluations on both simulations and benchmark datasets demonstrate the speedup over non-parallel stochastic L-BFGS, as well as better performance than first-order methods on ill-conditioned problems.
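The two-loop recursion mentioned above is the standard way to apply the L-BFGS inverse-Hessian approximation to a gradient; the paper's contribution lies in parallelizing beyond this step. A minimal serial C++ sketch, with our names:

```cpp
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

static double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Two-loop recursion: given the gradient g and the m most recent curvature
// pairs s_i = x_{i+1} - x_i, y_i = g_{i+1} - g_i (newest at index m-1),
// returns an approximation of H_k * g; the search direction is its negative.
Vec two_loop(const Vec& g, const std::vector<Vec>& S, const std::vector<Vec>& Y) {
    const std::size_t m = S.size();
    Vec q = g;
    std::vector<double> alpha(m), rho(m);
    for (std::size_t j = m; j-- > 0;) {      // backward pass, newest first
        rho[j] = 1.0 / dot(Y[j], S[j]);
        alpha[j] = rho[j] * dot(S[j], q);
        for (std::size_t i = 0; i < q.size(); ++i) q[i] -= alpha[j] * Y[j][i];
    }
    // Initial Hessian scaling H0 = gamma * I, gamma = (s.y) / (y.y).
    double gamma = m ? dot(S[m-1], Y[m-1]) / dot(Y[m-1], Y[m-1]) : 1.0;
    for (double& v : q) v *= gamma;
    for (std::size_t j = 0; j < m; ++j) {    // forward pass, oldest first
        double beta = rho[j] * dot(Y[j], q);
        for (std::size_t i = 0; i < q.size(); ++i) q[i] += (alpha[j] - beta) * S[j][i];
    }
    return q;
}
```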
Multiscale modeling and cinematic visualization of photosynthetic energy conversion processes from electronic to cell scales
Sener M, Levy S, Stone JE, Christensen AJ, Isralewitz B, Patterson R, Borkiewicz K, Carpenter J, Hunter CN, Luthey-Schulten Z and Cox D
Conversion of sunlight into chemical energy, namely photosynthesis, is the primary energy source of life on Earth. A visualization depicting this process, based on multiscale computational models from electronic to cell scales, is presented in the form of an excerpt from a fulldome show. This accessible visual narrative shows a lay audience, including children, how the energy of sunlight is captured, converted, and stored through a chain of proteins to power living cells. The visualization is the result of a multi-year collaboration among biophysicists, visualization scientists, and artists, which, in turn, is based on a decade-long experimental-computational collaboration on structural and functional modeling that produced an atomic detail description of a bacterial bioenergetic organelle, the chromatophore. Software advancements necessitated by this project have led to significant performance and feature advances, including hardware-accelerated cinematic ray tracing and instanced visualizations for efficient cell-scale modeling. The energy conversion steps depicted feature an integration of function from electronic to cell levels, spanning nearly 12 orders of magnitude in time scales. This atomic detail description uniquely enables a modern retelling of one of humanity's earliest stories: the interplay between light and life.
Atomic Detail Visualization of Photosynthetic Membranes with GPU-Accelerated Ray Tracing
Stone JE, Sener M, Vandivort KL, Barragan A, Singharoy A, Teo I, Ribeiro JV, Isralewitz B, Liu B, Goh BC, Phillips JC, MacGregor-Chatwin C, Johnson MP, Kourkoutis LF, Hunter CN and Schulten K
The cellular process responsible for providing energy for most life on Earth, namely photosynthetic light-harvesting, requires the cooperation of hundreds of proteins across an organelle, involving length and time scales spanning several orders of magnitude over quantum and classical regimes. Simulation and visualization of this fundamental energy conversion process pose many unique methodological and computational challenges. We present, in two accompanying movies, light-harvesting in the photosynthetic apparatus found in purple bacteria, the so-called chromatophore. The movies are the culmination of three decades of modeling efforts, featuring the collaboration of theoretical, experimental, and computational scientists. We describe the techniques that were used to build, simulate, analyze, and visualize the structures shown in the movies, and we highlight cases where scientific needs spurred the development of new parallel algorithms that efficiently harness GPU accelerators and petascale computers.
Visualizing multiphysics, fluid-structure interaction phenomena in intracranial aneurysms
Perdikaris P, Insley JA, Grinberg L, Yu Y, Papka ME and Karniadakis GE
This work presents recent advances in visualizing multi-physics, fluid-structure interaction (FSI) phenomena in cerebral aneurysms. Realistic FSI simulations produce very large and complex data sets, creating the need for parallel data processing and visualization. Here we present our efforts to develop an interactive tool for visualizing such FSI simulation data. Specifically, we present a ParaView-NekTar interface that couples the ParaView visualization engine with NekTar's parallel libraries, which are employed for the calculation of derived fields in both the fluid and solid domains with spectral accuracy. This interface allows the flexibility of independently choosing the resolution for visualizing both the volume data and the surface data from each of the solid and fluid domains, which significantly facilitates the visualization of complex structures under large deformations. The animation of the fluid and structure data is synchronized in time, while the ParaView-NekTar interface enables different fields to be superimposed, e.g., fluid jet and structural stress, to better understand the interactions in this multi-physics environment. Such visualizations are key to elucidating important biophysical interactions in health and disease, as well as to disseminating the insight gained from our simulations and further engaging the medical community in this effort of bringing computational science to the bedside.
A global perspective of atmospheric carbon dioxide concentrations
Putman WM, Ott L, Darmenov A and daSilva A
A high-resolution (7 km) non-hydrostatic global mesoscale simulation using the Goddard Earth Observing System (GEOS-5) model is used to visualize the flow and fluxes of carbon dioxide throughout the year. Carbon dioxide (CO2) is the most important greenhouse gas affected by human activity. About half of the CO2 emitted from fossil fuel combustion remains in the atmosphere, contributing to rising temperatures, while the other half is absorbed by natural land and ocean carbon reservoirs. Despite the importance of CO2, many questions remain regarding the processes that control these fluxes and how they may change in response to a changing climate. This visualization shows how column CO2 mixing ratios are strongly affected by local emissions and large-scale weather systems. In order to fully understand carbon flux processes, observations and atmospheric models must work closely together to determine when and where observed CO2 came from. Together, the combination of high-resolution data and models will guide climate models towards more reliable predictions of future conditions.
Parallel Simulated Annealing Using an Adaptive Resampling Interval
Lou Z and Reinitz J
This paper presents a parallel simulated annealing algorithm that achieves 90% parallel efficiency in iterations on up to 192 processors and up to 40% parallel efficiency in time when applied to a 5000-dimension Rastrigin function. Our algorithm breaks the scalability barriers in the method of Chu (1999) by abandoning adaptive cooling based on variance. The resulting gains in parallel efficiency are much larger than the loss of serial efficiency from the lack of adaptive cooling. Our algorithm resamples the states across processors periodically, with the resampling interval tuned according to the success rate for each specific number of processors. We further present an adaptive method that determines the resampling interval from the adoption rate. This adaptive method achieves nearly identical parallel efficiency but higher success rates compared to a fixed-interval scheme using the best interval found.
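The periodic resampling step admits a compact illustration. The C++ sketch below is hypothetical (our names and weighting choice, not the paper's exact scheme): walkers that would run on separate processors are simulated serially, and every R iterations their states are resampled with Boltzmann weights so that low-energy states are duplicated. In the adaptive variant, R would be adjusted from the observed adoption rate.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// One annealing walker: a candidate state and its energy.
struct Walker { std::vector<double> x; double energy; };

// Resample walker states across "processors". Call every R iterations of
// the annealing loop; R is either fixed and tuned per processor count, or
// adapted at runtime from the adoption rate.
void resample(std::vector<Walker>& w, double T, std::mt19937& rng) {
    // Boltzmann weights relative to the current best energy, so that
    // low-energy states are duplicated and poor states are dropped.
    double emin = w[0].energy;
    for (const auto& wi : w) emin = std::min(emin, wi.energy);
    std::vector<double> weight(w.size());
    for (std::size_t i = 0; i < w.size(); ++i)
        weight[i] = std::exp(-(w[i].energy - emin) / T);
    std::discrete_distribution<std::size_t> pick(weight.begin(), weight.end());
    std::vector<Walker> next(w.size());
    for (auto& n : next) n = w[pick(rng)];
    w = std::move(next);
}
```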
Region Templates: Data Representation and Management for High-Throughput Image Analysis
Teodoro G, Pan T, Kurc T, Kong J, Cooper L, Klasky S and Saltz J
We introduce a region template abstraction and framework for the efficient storage, management, and processing of common data types in the analysis of large datasets of high-resolution images on clusters of hybrid computing nodes. The region template abstraction provides a generic container template for common data structures, such as points, arrays, regions, and object sets, within a spatial and temporal bounding box. It allows for different data management strategies and I/O implementations, while providing a homogeneous, unified interface to applications for data storage and retrieval. A region template application is represented as a hierarchical dataflow in which each computing stage may be represented as another dataflow of finer-grain tasks. The execution of the application is coordinated by a runtime system that implements optimizations for hybrid machines, including performance-aware scheduling for maximizing the utilization of computing devices and techniques to reduce the impact of data transfers between CPUs and GPUs. An experimental evaluation on a state-of-the-art hybrid cluster using a microscopy imaging application shows that the abstraction adds negligible overhead (about 3%) and achieves good scalability and high data transfer rates. Optimizations in a high-speed disk-based storage implementation of the abstraction to support asynchronous data transfers and computation result in an application performance gain of about 1.13×. Finally, a processing rate of 11,730 4K×4K tiles per minute was achieved for the microscopy imaging application on a cluster with 100 nodes (300 GPUs and 1,200 CPU cores). This computation rate enables studies with very large datasets.
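As an illustration of the container abstraction only, the C++ sketch below is ours, not the framework's API: it shows a generic container that groups named data structures under a shared spatial/temporal bounding box while hiding the storage backend behind a uniform insert/get interface.

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Spatial and temporal extent shared by all data in one region template.
struct BoundingBox { int x0, y0, x1, y1; int t0, t1; };

// One named data structure, e.g. a mask tile or an object set, stored as a
// backend-agnostic payload so different I/O implementations can be swapped.
struct DataRegion {
    std::string name;
    std::vector<std::uint8_t> bytes;
};

class RegionTemplate {
public:
    explicit RegionTemplate(BoundingBox b) : box_(b) {}

    void insert(std::shared_ptr<DataRegion> r) {
        std::string key = r->name;           // unified storage interface
        regions_[key] = std::move(r);
    }
    std::shared_ptr<DataRegion> get(const std::string& name) const {
        auto it = regions_.find(name);       // unified retrieval interface
        return it == regions_.end() ? nullptr : it->second;
    }
private:
    BoundingBox box_;  // all contained data share this extent
    std::unordered_map<std::string, std::shared_ptr<DataRegion>> regions_;
};
```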
Simulation of reaction diffusion processes over biologically relevant size and time scales using multi-GPU workstations
Hallock MJ, Stone JE, Roberts E, Fry C and Luthey-Schulten Z
Simulation of cellular processes with the reaction-diffusion master equation (RDME) is a computationally expensive task. Our previous software enabled simulation of inhomogeneous biochemical systems for small bacteria over long time scales using the MPD-RDME method on a single GPU. Simulations of larger eukaryotic systems exceed the on-board memory capacity of individual GPUs, and long-time simulations of modest-sized cells such as yeast are impractical on a single GPU. We present a new multi-GPU parallel implementation of the MPD-RDME method based on a spatial decomposition approach that supports dynamic load balancing for workstations containing GPUs of varying performance and memory capacity. We take advantage of high-performance features of CUDA for peer-to-peer GPU memory transfers and evaluate the performance of our algorithms on state-of-the-art GPU devices. We present parallel efficiency and performance results for simulations using multiple GPUs as system size, particle counts, and number of reactions grow. We also demonstrate multi-GPU performance in simulations of the Min protein system in E. coli. Moreover, our multi-GPU decomposition and load balancing approach can be generalized to other lattice-based problems.
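The load-balancing idea, splitting the simulation lattice so each GPU's share matches its capability, can be sketched host-side. The C++ outline below is a hypothetical illustration (function and parameter names ours); in the actual method, neighboring slabs would additionally exchange boundary planes, e.g. via peer-to-peer GPU transfers.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Split `nz` lattice z-planes into one contiguous slab per GPU, sized in
// proportion to a measured per-device throughput weight, so that faster or
// larger-memory devices receive more planes. Assumes a non-empty weight
// list; rebalancing can rerun this as measured throughput changes.
std::vector<int> slab_sizes(int nz, const std::vector<double>& throughput) {
    double total = std::accumulate(throughput.begin(), throughput.end(), 0.0);
    std::vector<int> planes(throughput.size());
    int assigned = 0;
    for (std::size_t g = 0; g < throughput.size(); ++g) {
        planes[g] = static_cast<int>(nz * throughput[g] / total);
        assigned += planes[g];
    }
    planes.back() += nz - assigned;  // rounding remainder goes to the last GPU
    return planes;
}
```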
Efficient Irregular Wavefront Propagation Algorithms on Hybrid CPU-GPU Machines
Teodoro G, Pan T, Kurc T, Kong J, Cooper L and Saltz J
We address the problem of efficient execution of a computation pattern, referred to here as the irregular wavefront propagation pattern (IWPP), on hybrid systems with multiple CPUs and GPUs. The IWPP is common in several image processing operations. In the IWPP, data elements in the wavefront propagate waves to their neighboring elements on a grid if a propagation condition is satisfied. Elements receiving the propagated waves become part of the wavefront. This pattern results in irregular data accesses and computations. We develop and evaluate strategies for efficient computation and propagation of wavefronts using a multi-level queue structure. This queue structure improves the utilization of fast memories in a GPU and reduces synchronization overheads. We also develop a tile-based parallelization strategy to support execution on multiple CPUs and GPUs. We evaluate our approaches on a state-of-the-art GPU-accelerated machine (equipped with 3 GPUs and 2 multicore CPUs) using the IWPP implementations of two widely used image processing operations: morphological reconstruction and Euclidean distance transform. Our results show significant performance improvements on GPUs. The use of multiple CPUs and GPUs cooperatively attains speedups of 50× and 85× with respect to single-core CPU executions for morphological reconstruction and Euclidean distance transform, respectively.
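A serial, single-queue analogue of the IWPP makes the pattern concrete. The C++ sketch below (our names) performs the queue-based propagation step of grayscale morphological reconstruction: a wavefront pixel raises lower neighbors, bounded by the mask, and any changed neighbor joins the wavefront. The paper's GPU version replaces this single FIFO with a multi-level queue structure and tiles the grid across devices.

```cpp
#include <algorithm>
#include <queue>
#include <vector>

// Propagation step of morphological reconstruction on a w x h grid stored
// row-major. `wavefront` is seeded with the indices of active pixels.
void propagate(std::vector<int>& marker, const std::vector<int>& mask,
               int w, int h, std::queue<int>& wavefront) {
    const int dx[4] = {1, -1, 0, 0}, dy[4] = {0, 0, 1, -1};
    while (!wavefront.empty()) {
        int p = wavefront.front(); wavefront.pop();
        int px = p % w, py = p / w;
        for (int k = 0; k < 4; ++k) {
            int nx = px + dx[k], ny = py + dy[k];
            if (nx < 0 || nx >= w || ny < 0 || ny >= h) continue;
            int q = ny * w + nx;
            int v = std::min(marker[p], mask[q]);
            if (v > marker[q]) {       // propagation condition satisfied
                marker[q] = v;
                wavefront.push(q);     // neighbor joins the wavefront
            }
        }
    }
}
```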
Parallelization of Nullspace Algorithm for the computation of metabolic pathways
Jevremović D, Trinh CT, Srienc F, Sosa CP and Boley D
Elementary mode analysis is a useful metabolic pathway analysis tool for understanding and analyzing cellular metabolism, since elementary modes can represent metabolic pathways with unique and minimal sets of enzyme-catalyzed reactions of a metabolic network under steady-state conditions. However, computing the elementary modes of a genome-scale metabolic network with 100-1000 reactions is very expensive and sometimes not feasible with the commonly used serial Nullspace Algorithm. In this work, we develop a distributed-memory parallelization of the Nullspace Algorithm to handle efficiently the computation of the elementary modes of a large metabolic network. We give an implementation in C++ using MPI library functions for parallel communication. Our proposed algorithm is accompanied by an analysis of the complexity and identification of major bottlenecks during computation of all possible pathways of a large metabolic network. The algorithm includes methods to achieve load balancing among the compute nodes and specific communication patterns to reduce the communication overhead and improve efficiency.
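To show only the work-distribution idea, not the Nullspace Algorithm itself, here is a minimal MPI program in C++ (all names and the pair count are made up): each rank takes a strided share of the candidate mode pairs generated at one iteration, and partial counts are combined with a reduction.

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long n_pairs = 1000000;   // hypothetical candidate pair count
    long local = 0;
    for (long i = rank; i < n_pairs; i += size)
        local += 1;                 // placeholder for testing/combining pair i

    long total = 0;
    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        std::printf("pairs processed: %ld\n", total);  // == n_pairs
    MPI_Finalize();
    return 0;
}
```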
GPU computing with Kaczmarz's and other iterative algorithms for linear systems
Elble JM, Sahinidis NV and Vouzis P
The graphics processing unit (GPU) is used to solve large linear systems derived from partial differential equations. The differential equations studied are strongly convection-dominated, of various sizes, and common to many fields, including computational fluid dynamics, heat transfer, and structural mechanics. The paper presents comparisons between GPU and CPU implementations of several well-known iterative methods, including Kaczmarz's, Cimmino's, component averaging, conjugate gradient normal residual (CGNR), symmetric successive overrelaxation-preconditioned conjugate gradient, and conjugate-gradient-accelerated component-averaged row projections (CARP-CG). Computations are performed with dense as well as general banded systems. The results demonstrate that our GPU implementation outperforms CPU implementations of these algorithms, as well as previously studied parallel implementations on Linux clusters and shared-memory systems. Although the CGNR method had begun to fall out of favor for solving such problems, for the problems studied in this paper the GPU implementation of CGNR performed better than the other methods, including a cluster implementation of the CARP-CG method.
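Kaczmarz's method is easy to state: each iteration projects the current iterate onto the hyperplane defined by one equation of the system. A minimal serial C++ sketch (dense row storage, our names; assumes nonzero rows):

```cpp
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// One sweep of Kaczmarz's method for A x = b: for each row a_i in turn,
// project x onto its hyperplane, x += ((b_i - a_i.x) / ||a_i||^2) * a_i.
void kaczmarz_sweep(const std::vector<Vec>& A, const Vec& b, Vec& x) {
    for (std::size_t i = 0; i < A.size(); ++i) {
        double dot = 0.0, norm2 = 0.0;
        for (std::size_t j = 0; j < x.size(); ++j) {
            dot += A[i][j] * x[j];
            norm2 += A[i][j] * A[i][j];
        }
        double scale = (b[i] - dot) / norm2;  // residual of equation i
        for (std::size_t j = 0; j < x.size(); ++j) x[j] += scale * A[i][j];
    }
}
```

Repeating the sweep until the residual is small gives the classic serial method; Cimmino's method and component averaging instead apply all row projections simultaneously and average, which is what makes them attractive for parallel hardware.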
Optimizing Data Intensive GPGPU Computations for DNA Sequence Alignment
Trapnell C and Schatz MC
MUMmerGPU uses highly parallel commodity graphics processing units (GPUs) to accelerate the data-intensive computation of aligning next-generation DNA sequence data to a reference sequence, for use in diverse applications such as disease genotyping and personal genomics. MUMmerGPU 2.0 features a new stackless depth-first-search print kernel and is 13× faster than the serial CPU version of the alignment code, and nearly 4× faster in total computation time than MUMmerGPU 1.0. We exhaustively examined 128 GPU data layout configurations to improve register footprint and running time, and conclude that higher occupancy has a greater impact than reduced latency. MUMmerGPU is available open-source at http://mummergpu.sourceforge.net.
Multilevel Summation of Electrostatic Potentials Using Graphics Processing Units
Hardy DJ, Stone JE and Schulten K
Physical and engineering practicalities involved in microprocessor design have resulted in flat performance growth for traditional single-core microprocessors. The urgent need for continuing increases in the performance of scientific applications requires the use of many-core processors and accelerators such as graphics processing units (GPUs). This paper discusses GPU acceleration of the multilevel summation method for computing electrostatic potentials and forces for a system of charged atoms, which is a problem of paramount importance in biomolecular modeling applications. We present and test a new GPU algorithm for the long-range part of the potentials that computes a cutoff pair potential between lattice points, essentially convolving a fixed 3-D lattice of "weights" over all sub-cubes of a much larger lattice. The implementation exploits the different memory subsystems provided on the GPU to stream optimally sized data sets through the multiprocessors. We demonstrate for the full multilevel summation calculation speedups of up to 26× using a single GPU and 46× using multiple GPUs, enabling the computation of a high-resolution map of the electrostatic potential for a system of 1.5 million atoms in under 12 seconds.
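The long-range kernel described above is, in essence, a large 3-D stencil: a fixed lattice of weights applied around every point of a larger lattice within a cutoff. A plain serial C++ sketch of that computation (our names; output assumed pre-sized; the paper's GPU version streams optimally sized tiles through the GPU's memory subsystems):

```cpp
#include <vector>

// Convolve a fixed (2r+1)^3 stencil of weights over an nx*ny*nz lattice of
// charges, accumulating the cutoff pair potential at every lattice point.
void lattice_cutoff(const std::vector<double>& charge,  // nx*ny*nz, input
                    std::vector<double>& potential,     // nx*ny*nz, output
                    const std::vector<double>& w,       // (2r+1)^3 weights
                    int nx, int ny, int nz, int r) {
    const int d = 2 * r + 1;
    for (int z = 0; z < nz; ++z)
     for (int y = 0; y < ny; ++y)
      for (int x = 0; x < nx; ++x) {
        double sum = 0.0;
        for (int k = -r; k <= r; ++k)
         for (int j = -r; j <= r; ++j)
          for (int i = -r; i <= r; ++i) {
            int xx = x + i, yy = y + j, zz = z + k;
            if (xx < 0 || xx >= nx || yy < 0 || yy >= ny ||
                zz < 0 || zz >= nz) continue;       // clip at the boundary
            sum += w[((k + r) * d + (j + r)) * d + (i + r)]
                 * charge[(zz * ny + yy) * nx + xx];
          }
        potential[(z * ny + y) * nx + x] = sum;
      }
}
```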
Guest Editors' Introduction, Special Issue on High-Performance Computational Biology
Bader DA and Aluru S
RNAVLab: A virtual laboratory for studying RNA secondary structures based on grid computing technology
Taufer M, Leung MY, Solorio T, Licon A, Mireles D, Araiza R and Johnson KL
As ribonucleic acid (RNA) molecules play important roles in many biological processes, including gene expression and regulation, their secondary structures have been the focus of many recent studies. Despite the computing power of supercomputers, computationally predicting secondary structures with thermodynamic methods is still not feasible when the RNA molecules have long nucleotide sequences and include complex motifs such as pseudoknots. This paper presents RNAVLab (RNA Virtual Laboratory), a virtual laboratory for studying RNA secondary structures, including pseudoknots, that allows scientists to address this challenge. Two important case studies show the versatility and functionalities of RNAVLab. The first study quantifies its capability to rebuild longer secondary structures from motifs found in systematically sampled nucleotide segments. The extensive sampling and predictions are made feasible with a short turnaround time because of the grid technology used. The second study shows how RNAVLab allows scientists to study the viral RNA genome replication mechanisms used by members of the virus family Nodaviridae.
Explicit Design of FPGA-Based Coprocessors for Short-Range Force Computations in Molecular Dynamics Simulations
Gu Y, Vancourt T and Herbordt MC
FPGA-based acceleration of molecular dynamics (MD) simulations has been the subject of several recent studies. The short-range force computation, which dominates the execution time, is the primary focus. Here we combine a high level of FPGA-specific design, including cell lists, systematically determined interpolation and precision, handling of exclusions, and support for MD simulations of up to 256K particles. The target system consists of a standard PC with a 2004-era COTS FPGA board. There are several innovations: new microarchitectures for several major components, including the cell list processor and the off-chip memory controller, and a novel arithmetic mode. Extensive experimentation was required to optimize precision, interpolation order, interpolation mode, table sizes, and simulation quality. We obtain a substantial speed-up over a highly tuned production MD code.
Single Pass Streaming BLAST on FPGAs
Herbordt MC, Model J, Sukhwani B, Gu Y and Vancourt T
Approximate string matching is fundamental to bioinformatics and has been the subject of numerous FPGA acceleration studies. We address issues with respect to FPGA implementations of both BLAST-based and dynamic-programming-based (DP) methods. Our primary contribution is a new algorithm for emulating the seeding and extension phases of BLAST. This operates in a single pass through a database at streaming rate, and with no preprocessing other than loading the query string. Moreover, it emulates parameters tuned to maximum possible sensitivity with no slowdown. While current DP-based methods also operate at streaming rate, generating results can be cumbersome. We address this with a new structure for data extraction. We present results from several implementations showing order-of-magnitude acceleration over serial reference code. A simple extension assures compatibility with NCBI BLAST.
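For readers unfamiliar with the seeding and extension phases being emulated, here is a plain software analogue in C++, not the FPGA design: query k-mers are indexed once, the database is streamed in a single pass, and each seed hit is extended while characters match exactly, a simplification of BLAST's scored ungapped extension. All names and parameters are ours.

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Seed-and-extend over a streamed database. Returns (db position, length)
// for every extended match of at least min_len characters.
std::vector<std::pair<int,int>> seed_and_extend(const std::string& query,
                                                const std::string& db,
                                                int k, int min_len) {
    std::unordered_multimap<std::string, int> seeds;   // k-mer -> query pos
    for (int q = 0; q + k <= (int)query.size(); ++q)
        seeds.emplace(query.substr(q, k), q);          // one-time indexing

    std::vector<std::pair<int,int>> hits;
    for (int d = 0; d + k <= (int)db.size(); ++d) {    // single streaming pass
        auto range = seeds.equal_range(db.substr(d, k));
        for (auto it = range.first; it != range.second; ++it) {
            int q = it->second, lo = 0, hi = k;        // extend the seed hit
            while (q - lo > 0 && d - lo > 0 &&
                   query[q - lo - 1] == db[d - lo - 1]) ++lo;
            while (q + hi < (int)query.size() && d + hi < (int)db.size() &&
                   query[q + hi] == db[d + hi]) ++hi;
            if (lo + hi >= min_len) hits.emplace_back(d - lo, lo + hi);
        }
    }
    return hits;
}
```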