Relative Suffix Trees
Suffix trees are one of the most versatile data structures in stringology, with many applications in bioinformatics. Their main drawback is their size, which can be tens of times larger than the input sequence. Much effort has been put into reducing the space usage, leading ultimately to compressed suffix trees. These compressed data structures can efficiently simulate the suffix tree, while using space proportional to a compressed representation of the sequence. In this work, we take a new approach to compressed suffix trees for repetitive sequence collections, such as collections of individual genomes. We compress the suffix trees of individual sequences relative to the suffix tree of a reference sequence. These relative data structures provide competitive time/space trade-offs, being almost as small as the smallest compressed suffix trees for repetitive collections, and competitive in time with the largest and fastest compressed suffix trees.
Developing a smartphone software package for predicting atmospheric pollutant concentrations at mobile locations
There is considerable evidence that exposure to air pollution is harmful to health. In the U.S., ambient air quality is monitored by Federal and State agencies for regulatory purposes. There are limited options, however, for people to access this data in real-time which hinders an individual's ability to manage their own risks. This paper describes a new software package that models environmental concentrations of fine particulate matter (PM), coarse particulate matter (PM), and ozone concentrations for the state of Oregon and calculates personal health risks at the smartphone's current location. Predicted air pollution risk levels can be displayed on mobile devices as interactive maps and graphs color-coded to coincide with EPA air quality index (AQI) categories. Users have the option of setting air quality warning levels via color-coded bars and were notified whenever warning levels were exceeded by predicted levels within 10 km. We validated the software using data from participants as well as from simulations which showed that the application was capable of identifying spatial and temporal air quality trends. This unique application provides a potential low-cost technology for reducing personal exposure to air pollution which can improve quality of life particularly for people with health conditions, such as asthma, that make them more susceptible to these hazards.
NVR-BIP: Nuclear Vector Replacement using Binary Integer Programming for NMR Structure-Based Assignments
Nuclear magnetic resonance (NMR) spectroscopy is an important experimental technique that allows one to study protein structure and dynamics in solution. An important bottleneck in NMR protein structure determination is the assignment of NMR peaks to the corresponding nuclei. Structure-based assignment (SBA) aims to solve this problem with the help of a template protein which is homologous to the target and has applications in the study of structure-activity relationship, protein-protein and protein-ligand interactions. We formulate SBA as a linear assignment problem with additional nuclear overhauser effect constraints, which can be solved within nuclear vector replacement's (NVR) framework (Langmead, C., Yan, A., Lilien, R., Wang, L. and Donald, B. (2003) A Polynomial-Time Nuclear Vector Replacement Algorithm for Automated NMR Resonance Assignments. , Berlin, Germany, April 10-13, pp. 176-187. ACM Press, New York, NY. , (2004), 11, pp. 277-298; Langmead, C. and Donald, B. (2004) An expectation/maximization nuclear vector replacement algorithm for automated NMR resonance assignments. , 29, 111-138). Our approach uses NVR's scoring function and data types and also gives the option of using CH and NH residual dipolar coupling (RDCs), instead of NH RDCs which NVR requires. We test our technique on NVR's data set as well as on four new proteins. Our results are comparable to NVR's assignment accuracy on NVR's test set, but higher on novel proteins. Our approach allows partial assignments. It is also complete and can return the optimum as well as near-optimum assignments. Furthermore, it allows us to analyze the information content of each data type and is easily extendable to accept new forms of input data, such as additional RDCs.