Least-Squares Support Vector Machine Approach to Viral Replication Origin Prediction
For many viruses, replication of the DNA genome is a central step in reproduction. Procedures to find replication origins, which are the initiation sites of the DNA replication process, are therefore of great importance for controlling the growth and spread of such viruses. Existing computational methods for viral replication origin prediction have mostly been tested within the family of herpesviruses. This paper proposes a new approach based on least-squares support vector machines (LS-SVMs) and tests its performance not only on the herpes family but also on a collection of caudoviruses from three viral families in the order Caudovirales. The LS-SVM approach provides sensitivities and positive predictive values superior or comparable to those given by the previous methods. When suitably combined with previous methods, the LS-SVM approach further improves the prediction accuracy for the herpesvirus replication origins. Furthermore, through recursive feature elimination, the LS-SVM has also helped identify the most significant features of the data sets. The results suggest that LS-SVMs will be a highly useful addition to the set of computational tools for viral replication origin prediction and illustrate the value of optimization-based computing techniques in biomedical applications.
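For readers unfamiliar with LS-SVMs, the following is a minimal sketch of the standard LS-SVM classifier, in which training reduces to solving one linear system rather than a quadratic program. It is not the paper's implementation: the RBF kernel, the values of `gamma` and `sigma`, the toy two-dimensional points, and all function names are assumptions made for illustration, and the sequence-derived features used for replication origin prediction are not shown.

```python
import math

def rbf_kernel(u, v, sigma=1.0):
    # Gaussian (RBF) kernel; the kernel choice is an assumption of this sketch.
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def solve_linear_system(A, rhs):
    # Gaussian elimination with partial pivoting on the augmented matrix.
    n = len(rhs)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def train_ls_svm(X, y, gamma=10.0, sigma=1.0):
    # Solve [[0, y^T], [y, Omega + I/gamma]] [b; alpha] = [0; 1],
    # where Omega[i][j] = y_i * y_j * K(x_i, x_j) and labels are +1 / -1.
    n = len(X)
    A = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        A[0][i + 1] = y[i]
        A[i + 1][0] = y[i]
        for j in range(n):
            A[i + 1][j + 1] = y[i] * y[j] * rbf_kernel(X[i], X[j], sigma)
        A[i + 1][i + 1] += 1.0 / gamma
    solution = solve_linear_system(A, [0.0] + [1.0] * n)
    return solution[0], solution[1:]  # bias b, multipliers alpha

def predict(x, X, y, b, alpha, sigma=1.0):
    # Sign of the decision function sum_i alpha_i * y_i * K(x, x_i) + b.
    score = b + sum(a * yi * rbf_kernel(x, xi, sigma)
                    for a, yi, xi in zip(alpha, y, X))
    return 1 if score >= 0 else -1

# Hypothetical toy data: two well-separated clusters.
X_toy = [[0.0, 0.0], [0.0, 1.0], [3.0, 3.0], [3.0, 4.0]]
y_toy = [-1, -1, 1, 1]
b_toy, alpha_toy = train_ls_svm(X_toy, y_toy)
print(predict([0.0, 0.5], X_toy, y_toy, b_toy, alpha_toy))
print(predict([3.0, 3.5], X_toy, y_toy, b_toy, alpha_toy))
```

Because every training point receives a nonzero multiplier, an LS-SVM is generally not sparse; the trade-off is that fitting needs only a linear solve.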
Palindromes in SARS and Other Coronaviruses
With the identification of a novel coronavirus associated with severe acute respiratory syndrome (SARS), computational analysis of its RNA genome sequence is expected to give useful clues to help elucidate the origin, evolution, and pathogenicity of the virus. In this paper, we study the collective counts of palindromes in the SARS genome along with all the completely sequenced coronaviruses. Based on a Markov-chain model for the genome sequence, the mean and standard deviation for the number of palindromes at or above a given length are derived. These theoretical results are complemented by extensive simulations to provide empirical estimates. Using a score obtained from these mathematical and empirical means and standard deviations, we have observed that palindromes of length four are significantly underrepresented in all the coronaviruses in our data set. In contrast, length-six palindromes are significantly underrepresented only in the SARS coronavirus. Two other features are unique to the SARS sequence. First, there is a length-22 palindrome TCTTTAACAAGCTTGTTAAAGA spanning positions 25962-25983. Second, there are two repeating length-12 palindromes TTATAATTATAA spanning positions 22712-22723 and 22796-22807. Some further investigations into possible biological implications of these palindrome features are proposed.
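In nucleotide sequences, a palindrome is a stretch that equals its own reverse complement. The sketch below checks this property for the two sequences quoted in the abstract and counts palindromes at or above a given length in a short string. The counting convention (one maximal palindrome per center), the function names, and the `GAATTC` example are assumptions of this sketch and may differ from the definition used in the paper.

```python
# Complementary bases, written in the DNA alphabet used in the abstract.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq):
    # Complement every base, then reverse the order.
    return "".join(PAIR[base] for base in reversed(seq))

def is_palindrome(seq):
    # A nucleotide palindrome reads the same as its reverse complement.
    return seq == reverse_complement(seq)

def count_palindromes(genome, min_length):
    # For each gap between adjacent bases, extend outward while the flanking
    # bases are complementary; count the center if the maximal palindrome
    # found there has length >= min_length.
    count = 0
    n = len(genome)
    for center in range(1, n):
        half = 0
        left, right = center - 1, center
        while left >= 0 and right < n and PAIR.get(genome[left]) == genome[right]:
            half += 1
            left -= 1
            right += 1
        if 2 * half >= min_length:
            count += 1
    return count

print(is_palindrome("TCTTTAACAAGCTTGTTAAAGA"))  # length-22 sequence from the abstract
print(is_palindrome("TTATAATTATAA"))            # length-12 sequence from the abstract
print(count_palindromes("GAATTC", 4))           # hypothetical short example
```

Applied to a full genome string, `count_palindromes` gives the observed count that a score based on the model mean and standard deviation would then be computed from.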
Supervised t-distributed stochastic neighbor embedding for data visualization and classification
We propose a novel supervised dimension reduction method, called supervised t-distributed stochastic neighbor embedding (St-SNE), which achieves dimension reduction by preserving the similarities of data points in both feature and outcome spaces. The proposed method can be used for both prediction and visualization tasks, with the ability to handle high-dimensional data. We show through a variety of datasets that when compared with a comprehensive list of existing methods, St-SNE has superior prediction performance in the ultra-high dimensional setting where the number of features p exceeds the sample size n, and has competitive performance in the p ≤ n setting. We also show that St-SNE is a competitive visualization tool that is capable of capturing within-cluster variations. In addition, we propose a penalized Kullback-Leibler divergence criterion to automatically select the reduced dimension size for St-SNE.
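As background for the Kullback-Leibler criterion, the sketch below shows the two ingredients of ordinary (unsupervised) t-SNE: the Student-t similarities computed from an embedding, and the KL divergence between two similarity matrices. It does not reproduce St-SNE itself; how outcome information enters the similarities and the form of the penalty term are not given in the abstract, and the function names and example coordinates here are hypothetical.

```python
import math

def student_t_similarities(Y):
    # Low-dimensional similarities of standard t-SNE:
    # q_ij proportional to 1 / (1 + ||y_i - y_j||^2), normalized over all i != j.
    n = len(Y)
    weights = [[0.0] * n for _ in range(n)]
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                sq_dist = sum((a - b) ** 2 for a, b in zip(Y[i], Y[j]))
                weights[i][j] = 1.0 / (1.0 + sq_dist)
                total += weights[i][j]
    return [[w / total for w in row] for row in weights]

def kl_divergence(P, Q, eps=1e-12):
    # KL(P || Q) = sum_ij p_ij * log(p_ij / q_ij); zero entries of P are skipped.
    return sum(p * math.log(p / max(q, eps))
               for row_p, row_q in zip(P, Q)
               for p, q in zip(row_p, row_q)
               if p > 0)

# Hypothetical two-dimensional embeddings of three points.
Q_a = student_t_similarities([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
Q_b = student_t_similarities([[0.0, 0.0], [3.0, 0.0], [0.0, 0.5]])
print(kl_divergence(Q_a, Q_a))
print(kl_divergence(Q_a, Q_b))
```

A dimension-selection criterion of the kind described would add a penalty that grows with the embedding dimension to this divergence; the specific penalty is left out because the abstract does not state it.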
Predictive Analytics with Strategically Missing Data
We study strategically missing data problems in predictive analytics with regression. In many real-world situations, such as financial reporting, college admission, job application, and marketing advertisement, data providers often conceal certain information on purpose in order to gain a favorable outcome. It is important for the decision-maker to have a mechanism to deal with such strategic behaviors. We propose a novel approach to handle strategically missing data in regression prediction. The proposed method derives imputation values of strategically missing data based on the Support Vector Regression models. It provides incentives for the data providers to disclose their true information. We show that with the proposed method imputation errors for the missing values are minimized under some reasonable conditions. An experimental study on real-world data demonstrates the effectiveness of the proposed approach.
A High-Fidelity Model to Predict Length-of-Stay in the Neonatal Intensive Care Unit (NICU)
Having an interpretable dynamic length-of-stay (LOS) model can help hospital administrators and clinicians make better decisions and improve the quality of care. The widespread implementation of electronic medical record (EMR) systems has enabled hospitals to collect massive amounts of health data. However, how to integrate this deluge of data into healthcare operations remains unclear. We propose a framework grounded in established clinical knowledge to model patients' lengths-of-stay. In particular, we impose expert knowledge when grouping raw clinical data into medically meaningful variables, which summarize patients' health trajectories. We use dynamic predictive models to output patients' remaining lengths-of-stay (RLOS), future discharges, and census probability distributions based on their health trajectories up to the current stay. Evaluated with large-scale EMR data, the dynamic model significantly improves predictive power over the performance of any model in previous literature and remains medically interpretable.
Combination Chemotherapy Optimization with Discrete Dosing
Chemotherapy drug administration is a complex problem that often requires expensive clinical trials to evaluate potential regimens; one way to alleviate this burden and better inform future trials is to build reliable models for drug administration. This paper presents a mixed-integer program for combination chemotherapy (utilization of multiple drugs) optimization that incorporates various important operational constraints and, besides dose and concentration limits, controls treatment toxicity based on its effect on the count of white blood cells. To address the uncertainty of tumor heterogeneity, we also propose chance constraints that guarantee reaching an operable tumor size with a high probability in a neoadjuvant setting. We present analytical results pertinent to the accuracy of the model in representing biological processes of chemotherapy and establish its potential for clinical applications through a numerical study of breast cancer.