JOURNAL OF CHEMOMETRICS

Alternative weighting schemes for fine-tuned extended similarity indices
López Pérez K, Rácz A, Bajusz D, Gonzalez C, Héberger K and Miranda-Quintana RA
Extended similarity indices (i.e., generalization of pairwise similarity) have recently gained importance because of their simplicity, fast computation and superiority in tasks like diversity picking. However, they operate with several meta parameters that should be optimized. Earlier, we extended the binary similarity indices to 'discrete non-binary' and 'continuous' data; now we continue with introducing and comparing multiple weighting functions. As a case study, the similarity of CYP enzyme inhibitors (4016 molecules after curation) was characterized by their extended similarities, based on 2D descriptors, MACCS and Morgan fingerprints. A statistical workflow based on sum of ranking differences (SRD) and analysis of variance (ANOVA) was used for finding the optimal weight function(s). Overall, the best weighting function is the fraction ("frac"), which corresponds to the principle of parsimony. Optimal extended similarity indices were also found, and their differences are revealed across different data sets. We intend this work to be a guideline for users of extended similarity indices regarding the various weighting options available. Source code for the calculations is available at https://github.com/mqcomplab/MultipleComparisons.
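As an illustration of what a fraction-style weight does, the following minimal Python sketch computes a Jaccard-Tanimoto-like extended index over several binary fingerprints at once. It is a sketch under stated assumptions, not the authors' implementation: the function name, the weight Δ/n for similarity counters, the default coincidence threshold n mod 2, and the choice to leave the denominator unweighted are all assumptions made here. The linked repository is the authoritative source.

```python
def extended_jaccard_frac(fingerprints, threshold=None):
    """Jaccard-Tanimoto-like index for n binary fingerprints compared at once.

    Assumptions (not taken from the abstract): columns with |2k - n| above
    the threshold count as similarity, others as dissimilarity; columns where
    ones dominate are weighted by the fraction |2k - n| / n; the denominator
    uses unweighted counts.
    """
    n = len(fingerprints)
    if threshold is None:
        threshold = n % 2  # assumed minimal coincidence threshold
    weighted_one_sim = 0.0
    one_sim = 0
    dissim = 0
    for column in zip(*fingerprints):
        k = sum(column)            # number of fingerprints with this bit set
        delta = abs(2 * k - n)     # how far the column is from an even split
        if delta > threshold:
            if 2 * k > n:          # ones dominate: 1-similarity column
                one_sim += 1
                weighted_one_sim += delta / n
            # columns dominated by zeros are ignored by a Jaccard-style index
        else:
            dissim += 1
    denominator = one_sim + dissim
    return weighted_one_sim / denominator if denominator else 0.0


fingerprints = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
index = extended_jaccard_frac(fingerprints)
```

With only two fingerprints this sketch reduces to the ordinary pairwise Jaccard-Tanimoto coefficient, which is a useful sanity check for any extended index.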
Implications of confounding from unmodeled interactions between explanatory variables when using latent variable regression models to make inferences
Kvalheim OM, Vidar WS, Baumeister TUH, Linington RG and Cech NB
With linear dependency between the explanatory variables, partial least squares (PLS) regression is commonly used for regression analysis. If the response variable correlates to a high degree with the explanatory variables, a model with excellent predictive ability can usually be obtained. Ranking of variable importance is commonly used to interpret the model and sometimes this interpretation guides further experimentation. For instance, when analyzing natural product extracts for bioactivity, an underlying assumption is that the highest ranked compounds represent the best candidates for isolation and further testing. A problem with this approach is that in most cases the number of compounds is larger than the number of samples (and usually much larger) and that the concentrations of the compounds correlate. Furthermore, compounds may interact as synergists or as antagonists. If the modelling process does not account for this possibility, the interpretation can be thoroughly wrong since unmodelled variables that strongly influence the response will give rise to confounding of a first order PLS model and send the experimenter on a wrong track. We show the consequences of this by a practical example from natural product research. Furthermore, we show that by including the possibility of interactions between explanatory variables, visualization using a selectivity ratio plot may provide model interpretation that can be used to make inferences.
Coherent Point Drift Peak Alignment Algorithms Using Distance and Similarity Measures for Two-Dimensional Gas Chromatography Mass Spectrometry Data
Li Z, Kim S, Zhong S, Zhong Z, Kato I and Zhang X
Peak alignment is a vital preprocessing step before downstream analysis, such as biomarker discovery and pathway analysis, for two-dimensional gas chromatography mass spectrometry (2DGCMS)-based metabolomics data. Due to uncontrollable experimental conditions, e.g., differences in temperature or pressure, matrix effects on samples, and stationary phase degradation, a shift of retention times among samples inevitably occurs during 2DGCMS experiments, making it difficult to align peaks. Various peak alignment algorithms have been developed to correct retention time shifts for homogeneous data, heterogeneous data, or both types of mass spectrometry data. However, almost all existing algorithms focus on local alignment and suffer from low accuracy, especially when aligning dense biological data with many peaks. We have developed four global peak alignment (GPA) algorithms using coherent point drift (CPD) point matching algorithms: retention time-based CPD-GPA (RT), CPD-GPA (P), mixture CPD-GPA (M), and mixture CPD-GPA (P+M). Method RT performs the peak alignment based only on the retention time distance, while methods P, M, and P+M carry out the peak alignment using both the retention time distance and mass spectral similarity. Method P incorporates the mass spectral similarity through information and methods M and P+M use the mixture distance measure. The four developed algorithms are applied to homogeneous and heterogeneous spiked-in data as well as two real biological data sets and compared with three existing algorithms, mSPA, SWPA, and BiPACE-2D. The results show that our CPD-GPA algorithms perform better than all existing algorithms in terms of F1 score.
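The F1 score used for this comparison can be sketched for a peak alignment as follows. This is a minimal illustration, assuming that an alignment is represented as a set of (peak index in sample A, peak index in sample B) pairs and that a reference set of true pairs is known, as it would be for spiked-in data. The function name and the toy pairs are hypothetical.

```python
def alignment_f1(true_pairs, predicted_pairs):
    """F1 score of a predicted peak alignment against a reference alignment."""
    true_set = set(true_pairs)
    predicted_set = set(predicted_pairs)
    true_positives = len(true_set & predicted_set)
    false_positives = len(predicted_set - true_set)
    false_negatives = len(true_set - predicted_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: four true matches, three predicted, two of them correct
reference = [(0, 0), (1, 1), (2, 2), (3, 3)]
predicted = [(0, 0), (1, 1), (2, 3)]
score = alignment_f1(reference, predicted)
```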
Composition of cometary particles collected during two periods of the Rosetta mission: multivariate evaluation of mass spectral data
Varmuza K, Filzmoser P, Fray N, Cottin H, Merouane S, Stenzel O, Paquette J, Kissel J, Briois C, Baklouti D, Bardyn A, Siljeström S, Silén J and Hilchenbach M
The instrument COSIMA (COmetary Secondary Ion Mass Analyzer) on board the European Space Agency mission Rosetta collected and analyzed dust particles in the neighborhood of comet 67P/Churyumov-Gerasimenko. The chemical composition of the particle surfaces was characterized by time-of-flight secondary ion mass spectrometry. A set of 2213 spectra has been selected, and relative abundances for CH-containing positive ions as well as positive elemental ions define a set of multivariate data with nine variables. Evaluation by complementary chemometric techniques shows different compositions of sample groups collected during two periods of the mission. The first period was August to November 2014 (far from the Sun); the second period was January 2015 to February 2016 (nearer to the Sun). The applied data evaluation methods consider the compositional nature of the mass spectral data and comprise robust principal component analysis as well as classification with discriminant partial least squares regression, k-nearest neighbor search, and random forest decision trees. The results indicate a high importance of the relative abundances of the secondary ions C+ and Fe+ for the group separation and demonstrate an enhanced content of carbon-containing substances in samples collected in the period with smaller distances to the Sun.
Cellwise outlier detection and biomarker identification in metabolomics based on pairwise log ratios
Walach J, Filzmoser P, Kouřil Š, Friedecký D and Adam T
Data outliers can carry very valuable information and might be most informative for the interpretation. Nevertheless, they are often neglected. An algorithm called cellwise outlier diagnostics using robust pairwise log ratios (cell-rPLR) is proposed for the identification of outliers in single cells of a data matrix. The algorithm is designed for metabolomic data, where, due to the size effect, the measured values are not directly comparable. Pairwise log ratios between the variable values form the elemental information for the algorithm, and the aggregation of appropriate outlyingness values results in outlyingness information. A further feature of cell-rPLR is that it is useful for biomarker identification, particularly in the presence of cellwise outliers. Real data examples and simulation studies underline the good performance of this algorithm in comparison with alternative methods.
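The reason pairwise log ratios sidestep the size effect can be seen in a short Python sketch: multiplying every value of a sample by the same factor leaves all log ratios unchanged. This only illustrates the elemental information named above. It is not the cell-rPLR algorithm, and the function name and toy values are hypothetical.

```python
import math
from itertools import combinations


def pairwise_log_ratios(sample):
    """Return log(x_j / x_k) for every pair j < k of strictly positive values."""
    return {
        (j, k): math.log(sample[j] / sample[k])
        for j, k in combinations(range(len(sample)), 2)
    }


# Hypothetical concentrations of three metabolites in one sample
sample = [2.0, 4.0, 8.0]
# The same sample measured at ten times the overall amount (size effect)
scaled = [10.0 * value for value in sample]

ratios = pairwise_log_ratios(sample)
scaled_ratios = pairwise_log_ratios(scaled)
```

Because the two dictionaries agree, any outlyingness measure built on these ratios is unaffected by the overall sample size.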
Comparative Chemometric Analysis for Classification of Acids and Bases via a Colorimetric Sensor Array
Kangas MJ, Burks RM, Atwater J, Lukowicz RM, Garver B and Holmes AE
With the increasing availability of digital imaging devices, colorimetric sensor arrays are rapidly becoming a simple, yet effective tool for the identification and quantification of various analytes. Colorimetric arrays utilize colorimetric data from many colorimetric sensors, with the multidimensional nature of the resulting data necessitating the use of chemometric analysis. Herein, an 8-sensor colorimetric array was used to analyze select acidic and basic samples (0.5 - 10 M) to determine which chemometric methods are best suited for classification and quantification of analytes within clusters. PCA, HCA, and LDA were used to visualize the data set. All three methods showed well-separated clusters for each of the acid or base analytes and moderate separation between analyte concentrations, indicating that the sensor array can be used to identify and quantify samples. Furthermore, PCA could be used to determine which sensors showed the most effective analyte identification. LDA, KNN, and HQI were used for identification of analyte and concentration. HQI and KNN could be used to correctly identify the analytes in all cases, while LDA correctly identified 95 of 96 analytes. Additional studies demonstrated that controlling for solvent and image effects was unnecessary for all chemometric methods utilized in this study.
Quinolone carboxylic acid derivatives as HIV-1 integrase inhibitors: Docking-based HQSAR and topomer CoMFA analyses
Tong J, Zhan P, Wang XS and Wu Y
Quinolone carboxylic acid derivatives as inhibitors of HIV-1 integrase were investigated as a potential class of drugs for the treatment of acquired immunodeficiency syndrome (AIDS). Hologram quantitative structure-activity relationships (HQSAR) and translocation comparative molecular field vector analysis (topomer CoMFA) were applied to a series of 48 quinolone carboxylic acid derivatives. The most effective HQSAR model was obtained using atoms and bonds as fragment distinctions: cross-validated q² = 0.796, standard error of prediction = 0.36, non-cross-validated r² = 0.967, non-cross-validated standard error = 0.17, correlation coefficient of external validation = 0.955, and best hologram length = 180. Topomer CoMFA models were built based on different fragment cutting models, with the most effective model giving cross-validated q² = 0.775, standard error of prediction = 0.37, non-cross-validated r² = 0.967, non-cross-validated standard error = 0.15, correlation coefficient of external validation = 0.915, and F = 163.255. These results show that the models generated from HQSAR and topomer CoMFA were able to effectively predict the inhibitory potency of this class of compounds. The molecular docking method was also used to study the interactions of these drugs by docking the ligands into the HIV-1 integrase active site, which revealed the likely bioactive conformations. This study showed that there are extensive interactions between the quinolone carboxylic acid derivatives and THR80, VAL82, GLY27, ASP29, and ARG8 residues in the active site of HIV-1 integrase. These results provide useful insights for the design of potent new inhibitors of HIV-1 integrase.
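The difference between a cross-validated and a non-cross-validated coefficient of determination can be sketched in plain Python. The example below is only an illustration: it assumes a single descriptor, an ordinary least-squares line instead of HQSAR or topomer CoMFA, leave-one-out cross-validation, and the overall response mean in the total sum of squares. The function names and toy data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one descriptor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r2_and_q2(xs, ys):
    """Return (fitted r^2, leave-one-out cross-validated q^2)."""
    n = len(ys)
    mean_y = sum(ys) / n
    ss_total = sum((y - mean_y) ** 2 for y in ys)

    # Fitted residuals: every compound is used to build the model
    slope, intercept = fit_line(xs, ys)
    rss = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

    # PRESS: each compound is predicted by a model built without it
    press = 0.0
    for i in range(n):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        s, b = fit_line(train_x, train_y)
        press += (ys[i] - (s * xs[i] + b)) ** 2

    return 1 - rss / ss_total, 1 - press / ss_total


# Hypothetical descriptor values and activities
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
activity = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0]
r2, q2 = r2_and_q2(descriptor, activity)
```

For a least-squares model the leave-one-out residuals are never smaller in magnitude than the fitted ones, so the cross-validated value cannot exceed the fitted value. This is consistent with the pattern of the figures reported above.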
Adaptive penalties for generalized Tikhonov regularization in statistical regression models with application to spectroscopy data
Randolph TW, Ding J, Kundu MG and Harezlak J
Tikhonov regularization was proposed for multivariate calibration by Andries and Kalivas [1]. We use this framework for modeling the statistical association between spectroscopy data and a scalar outcome. In both the calibration and regression settings this regularization process has advantages over methods of spectral pre-processing and dimension-reduction approaches such as feature extraction or principal component regression. We propose an extension of this penalized regression framework by adaptively refining the penalty term to optimally focus the regularization process. We illustrate the approach using simulated spectra and compare it with other penalized regression models and with a two-step method that first pre-processes the spectra then fits a dimension-reduced model using the processed data. The methods are also applied to magnetic resonance spectroscopy data to identify brain metabolites that are associated with cognitive function.
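For orientation, the generalized Tikhonov problem can be written in its standard textbook form. The symbols below are assumptions introduced here and do not come from the abstract: X is the matrix of spectra, y the scalar outcome, β the coefficient vector, L the penalty operator, and λ ≥ 0 the tuning parameter. The adaptive refinement proposed by the authors is not reproduced.

```latex
\hat{\beta}_{\lambda}
  = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda \, \lVert L\beta \rVert_2^2
\quad\Longrightarrow\quad
\hat{\beta}_{\lambda}
  = \left( X^{\top} X + \lambda \, L^{\top} L \right)^{-1} X^{\top} y
```

The closed form holds whenever the matrix in parentheses is invertible. Setting L to the identity recovers ordinary ridge regression, while a difference operator for L penalizes roughness of the coefficient vector along the spectral axis.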
Discovery of False Identification Using Similarity Difference in GC-MS based Metabolomics
Kim S and Zhang X
Compound identification is a critical process in metabolomics. The widely used approach for compound identification in gas chromatography-mass spectrometry (GC-MS) based metabolomics is spectrum matching, in which the mass spectral similarity between an experimental mass spectrum and each mass spectrum in a reference library is calculated. While various similarity measures have been developed to improve the overall accuracy of compound identification, little attention has been paid to reducing the false discovery rate. We, therefore, develop an approach for controlling the false identification rate using the distribution of the difference between the first and the second highest spectral similarity scores. We further propose a model-based approach to achieving a desired true positive rate. The developed method is applied to the NIST mass spectral library and its performance is compared with the conventional approach that uses only the maximum spectral similarity score. The results show that the developed method achieves a significantly higher F1 score and positive predictive value than those of the conventional approach.
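The quantity at the centre of this approach, the gap between the best and second-best library match, can be sketched as follows. This is a minimal illustration that assumes cosine similarity as the spectral similarity measure and spectra stored as equal-length intensity vectors. The function names and the toy library are hypothetical, and the authors' distribution-based and model-based procedures are not reproduced.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length intensity vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_two_difference(query, library):
    """Return (best match name, best score, best score minus second-best score)."""
    scores = sorted(
        ((cosine_similarity(query, spectrum), name)
         for name, spectrum in library.items()),
        reverse=True,
    )
    (best, best_name), (second, _) = scores[0], scores[1]
    return best_name, best, best - second


# Hypothetical library: A and B have nearly identical spectra, C is distinct
library = {
    "A": [10.0, 0.0, 5.0, 0.0],
    "B": [9.0, 1.0, 5.0, 0.0],
    "C": [0.0, 10.0, 0.0, 5.0],
}
name, best_score, difference = top_two_difference([10.0, 0.0, 5.0, 0.0], library)
```

In this toy case the best score is essentially perfect, yet the small gap to the runner-up shows that the maximum score alone does not reveal how ambiguous the identification is.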
Artificial neural networks as supervised techniques for FT-IR microspectroscopic imaging
Lasch P, Diem M, Hänsch W and Naumann D
In this report the applicability of an improved method of image segmentation of infrared microspectroscopic data from histological specimens is demonstrated. Fourier transform infrared (FT-IR) microspectroscopy was used to record hyperspectral data sets from human colorectal adenocarcinomas and to build up a database of spatially resolved tissue spectra. This database of colon microspectra comprised 4120 high-quality FT-IR point spectra from 28 patient samples and 12 different histological structures. The spectral information contained in the database was employed to teach and validate multilayer perceptron artificial neural network (MLP-ANN) models. These classification models were then employed for database analysis and utilised to produce false colour images from complete tissue maps of FT-IR microspectra. An important aspect of this study was also to demonstrate how the diagnostic sensitivity and specificity can be specifically optimised. An example is given which shows that changes of the number of teaching patterns per class can be used to modify these two interrelated test parameters. The definition of ANN topology turned out to be crucial to achieve a high degree of correspondence between the gold standard of histopathology and IR spectroscopy. Particularly, a hierarchical scheme of ANN classification proved to be superior for the reliable classification of tissue spectra. It was found that unsupervised methods of clustering, specifically agglomerative hierarchical clustering (AHC), were helpful in the initial phases of model generation. Optimal classification results could be achieved if the class definitions for the ANNs were carried out by considering the classification information provided by cluster analysis.