Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

The use of classification trees for bioinformatics
Chen X, Wang M and Zhang H
Classification trees are non-parametric statistical learning methods that incorporate feature selection and interactions, possess intuitive interpretability, are efficient, and have high prediction accuracy when used in ensembles. This paper provides a brief introduction to classification tree-based methods, a review of recent developments, and a survey of applications in bioinformatics and statistical genetics.
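As an illustration of how a tree performs feature selection while it grows, the sketch below searches for the single axis-aligned split that minimizes weighted Gini impurity, the criterion commonly used in CART-style trees. It is not taken from the paper: the function names (gini, best_split), the toy feature values, and the "case"/"ctrl" labels are all assumptions made for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_impurity) of the best split.

    Every observed value of every feature is tried as a threshold; samples
    with value <= threshold go left, the rest go right. If no split improves
    on the unsplit node, feature_index and threshold are None.
    """
    best = (None, None, gini(labels))
    n = len(rows)
    for j in range(len(rows[0])):
        for t in sorted({row[j] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[j] <= t]
            right = [y for row, y in zip(rows, labels) if row[j] > t]
            if not left or not right:
                continue  # a split must send samples to both children
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best

# Hypothetical toy data: two numeric features per sample, binary phenotype.
rows = [(1.0, 5.0), (2.0, 4.0), (3.0, 7.0), (4.0, 6.0)]
labels = ["ctrl", "ctrl", "case", "case"]
feature, threshold, impurity = best_split(rows, labels)
```

A full tree would apply best_split recursively to each child node, and an ensemble such as a random forest would repeat this over bootstrap samples and random feature subsets.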
A comprehensive survey of error measures for evaluating binary decision making in data science
Emmert-Streib F, Moutari S and Dehmer M
Binary decision making is a topic of great interest in many fields, including biomedical science, economics, management, politics, medicine, natural science and social science, and much effort has been spent on developing novel computational methods to address problems arising in these fields. However, to evaluate the effectiveness of any prediction method for binary decision making, the choice of the most appropriate error measures is of paramount importance. Because of the variety of error measures available, evaluating binary decision making can be a complex task. The main objective of this study is to provide a comprehensive survey of error measures for evaluating the outcome of binary decision making applicable to many data-driven fields. This article is categorized under: Fundamental Concepts of Data and Knowledge > Key Design Issues in Data Mining; Technologies > Prediction; Algorithmic Development > Statistics.
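To make the variety of error measures concrete, the following sketch derives several widely used ones from the four cells of a confusion matrix. The selection of measures, the function name binary_error_measures, and the example counts are assumptions for illustration and are not a summary of the survey itself.

```python
import math

def binary_error_measures(tp, fp, tn, fn):
    """Compute common error measures from confusion-matrix counts.

    tp, fp, tn, fn are the numbers of true positives, false positives,
    true negatives and false negatives. A measure whose denominator is
    zero is returned as NaN rather than raising an error.
    """
    def ratio(num, den):
        return num / den if den else float("nan")

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),   # true positive rate, recall
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "precision": ratio(tp, tp + fp),     # positive predictive value
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),  # Matthews correlation
    }

# Hypothetical counts from evaluating a classifier on 100 samples.
m = binary_error_measures(tp=40, fp=10, tn=45, fn=5)
```

Different measures can rank the same classifiers differently, particularly on imbalanced data, which is why the choice of measure matters.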
Data mining of functional RNA structures in genomic sequences
Le SY and Shapiro BA
The normal functions of genomes depend on the precise expression of messenger RNAs and noncoding RNAs (ncRNAs) such as transfer RNAs and microRNAs in eukaryotes. These ncRNAs and functional RNA structures (FRSs) act as regulators or response elements for cellular factors and participate in transcription, posttranscriptional processing, and translation. Knowledge discovery of these FRSs in huge DNA/RNA sequence databases is a very important step to reach our goal of going from genomic sequence data to biological knowledge for understanding RNA-based regulation. Analyses of a large number of FRSs have indicated that the FRS can be well characterized by some quantitative measures such as significance and well-ordered scores of the local segment. Various data mining tools have been developed and successfully applied to FRS discovery in genomic sequence databases. Here, we summarize our efforts in the computational discovery of structured features of ncRNAs and FRSs within complex genomes by EDscan and SigED.
Blockchain networks: Data structures of Bitcoin, Monero, Zcash, Ethereum, Ripple, and Iota
Akcora CG, Gel YR and Kantarcioglu M
Blockchain is an emerging technology that has enabled many applications, from cryptocurrencies to digital asset management and supply chains. Due to this surge of popularity, analyzing the data stored on blockchains poses a new critical challenge in data science. To assist data scientists in various analytic tasks for a blockchain, in this tutorial, we provide a systematic and comprehensive overview of the fundamental elements of blockchain network models. We discuss how we can abstract blockchain data as various types of networks and further use such associated network abstractions to reap important insights on blockchains' structure, organization, and functionality. This article is categorized under: Technologies > Data Preprocessing; Application Areas > Business and Industry; Fundamental Concepts of Data and Knowledge > Data Concepts; Fundamental Concepts of Data and Knowledge > Knowledge Representation.
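One way to picture such a network abstraction is an address graph built from UTXO-style transactions, sketched below. The transaction records, address names, and the address_graph helper are hypothetical and are not drawn from the tutorial. Linking every input address to every output address is a simplifying assumption, because a transaction with several inputs and outputs does not record which input funded which output.

```python
from collections import defaultdict

# Hypothetical UTXO-style transactions: input addresses, and output
# addresses mapped to the amount each one receives.
transactions = [
    {"txid": "t1", "inputs": ["a1"], "outputs": {"a2": 1.5, "a3": 0.5}},
    {"txid": "t2", "inputs": ["a2", "a3"], "outputs": {"a4": 1.9}},
]

def address_graph(txs):
    """Build a directed address graph: an edge src -> dst means that src
    appeared as an input of a transaction in which dst was an output."""
    graph = defaultdict(set)
    for tx in txs:
        for src in tx["inputs"]:
            for dst in tx["outputs"]:
                graph[src].add(dst)
    return graph

def in_degree(graph):
    """Count, for each address, how many distinct addresses point to it."""
    counts = defaultdict(int)
    for targets in graph.values():
        for dst in targets:
            counts[dst] += 1
    return counts

graph = address_graph(transactions)
degrees = in_degree(graph)
```

Alternative abstractions treat transactions themselves as nodes, or, for account-based chains, treat each transfer as a single directed edge between two accounts.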
Causability and explainability of artificial intelligence in medicine
Holzinger A, Langs G, Denk H, Zatloukal K and Müller H
Explainable artificial intelligence (AI) is attracting much interest in medicine. Technically, the problem of explainability is as old as AI itself and classic AI represented comprehensible retraceable approaches. However, their weakness was in dealing with uncertainties of the real world. Through the introduction of probabilistic learning, applications became increasingly successful, but increasingly opaque. Explainable AI deals with the implementation of transparency and traceability of statistical black-box machine learning methods, particularly deep learning (DL). We argue that there is a need to go beyond explainable AI. To reach a level of explainable medicine we need causability. In the same way that usability encompasses measurements for the quality of use, causability encompasses measurements for the quality of explanations. In this article, we provide some necessary definitions to discriminate between explainability and causability as well as a use-case of DL interpretation and of human explanation in histopathology. The main contribution of this article is the notion of causability, which is differentiated from explainability in that causability is a property of a person, while explainability is a property of a system. This article is categorized under: Fundamental Concepts of Data and Knowledge > Human Centricity and User Interaction.
Epidemiological challenges in pandemic coronavirus disease (COVID-19): Role of artificial intelligence
Dasgupta A, Bakshi A, Mukherjee S, Das K, Talukdar S, Chatterjee P, Mondal S, Das P, Ghosh S, Som A, Roy P, Kundu R, Sarkar A, Biswas A, Paul K, Basak S, Manna K, Saha C, Mukhopadhyay S, Bhattacharyya NP and De RK
The world is now experiencing a major health calamity due to the coronavirus disease (COVID-19) pandemic, caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The foremost challenge facing the scientific community is to explore the growth and transmission capability of the virus. The use of artificial intelligence (AI), such as deep learning, in (i) rapid disease detection from X-ray, computed tomography (CT), or high-resolution CT (HRCT) images, (ii) accurate prediction of the epidemic patterns and their saturation throughout the globe, (iii) forecasting the disease and psychological impact on the population from social networking data, and (iv) prediction of drug-protein interactions for repurposing drugs, has attracted much attention. In the present study, we describe the role of various AI-based technologies for rapid and efficient detection from CT images, complementing quantitative real-time polymerase chain reaction and immunodiagnostic assays. AI-based technologies for anticipating the current pandemic pattern, preventing the spread of disease, and detecting face masks are also discussed. We inspect how the virus transmits depending on different factors. We investigate the deep learning technique used to assess the affinity of the most probable drugs to treat COVID-19. This article is categorized under: Application Areas > Health Care; Algorithmic Development > Biological Data Mining; Technologies > Machine Learning.
Machine learning in postgenomic biology and personalized medicine
Ray A
In recent years, artificial intelligence in the form of machine learning has been revolutionizing biology, biomedical sciences, and gene-based agricultural technology. On the one hand, the massive data generated in the biological sciences by rapid and deep gene sequencing and by protein or other molecular structure determination require data analysis capabilities using machine learning that are distinctly different from classical statistical methods. On the other hand, these large datasets are enabling the adoption of novel data-intensive machine learning algorithms for solving biological problems that until recently had relied on computationally expensive mechanistic model-based approaches. This review provides a bird's eye view of the applications of machine learning in post-genomic biology. An attempt is also made to indicate, as far as possible, the areas of research that are poised to make further impacts in these areas, including the importance of explainable artificial intelligence (XAI) in human health. Further contributions of machine learning are expected to transform medicine, public health, and agricultural technology, as well as to provide invaluable gene-based guidance for the management of complex environments in this age of global warming.
Distributional regression modeling via generalized additive models for location, scale, and shape: An overview through a data set from learning analytics
Marmolejo-Ramos F, Tejo M, Brabec M, Kuzilek J, Joksimovic S, Kovanovic V, González J, Kneib T, Bühlmann P, Kook L, Briseño-Sánchez G and Ospina R
Technological developments are making it possible to gather large amounts of data in several research fields. Learning analytics (LA)/educational data mining has access to big observational unstructured data captured from educational settings and relies mostly on unsupervised machine learning (ML) algorithms to make sense of such data. Generalized additive models for location, scale, and shape (GAMLSS) are a supervised statistical learning framework that allows modeling all the parameters of the distribution of the response variable with respect to the explanatory variables. This article gives an overview of the power and flexibility of GAMLSS in relation to some ML techniques. GAMLSS' capability to be tailored toward causality via causal regularization is also briefly discussed. This overview is illustrated via a data set from the field of LA. This article is categorized under: Application Areas > Education and Learning; Algorithmic Development > Statistics; Technologies > Machine Learning.
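For orientation, the idea of modeling every distribution parameter as a function of the explanatory variables can be written in the usual GAMLSS form below. This is the standard textbook formulation with up to four parameters, given here as an assumption about notation and not quoted from the article: D is the chosen response distribution, g_k are link functions, X_k are design matrices for the parametric terms, and h_jk are smooth functions of the covariates.

```latex
\begin{aligned}
Y_i &\sim \mathcal{D}(\mu_i, \sigma_i, \nu_i, \tau_i), \qquad i = 1, \dots, n, \\
g_k(\boldsymbol{\theta}_k) &= \mathbf{X}_k \boldsymbol{\beta}_k
  + \sum_{j=1}^{J_k} h_{jk}(\mathbf{x}_{jk}), \qquad k = 1, \dots, 4, \\
(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \boldsymbol{\theta}_3, \boldsymbol{\theta}_4)
  &= (\boldsymbol{\mu}, \boldsymbol{\sigma}, \boldsymbol{\nu}, \boldsymbol{\tau}).
\end{aligned}
```

A conventional generalized additive model corresponds to the special case in which only the location parameter has a predictor of this kind, while the scale and shape parameters are held constant.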
A Survey on Artificial Intelligence in Pulmonary Imaging
Saha PK, Nadeem SA and Comellas AP
Over the last decade, deep learning (DL) has brought about a paradigm shift in computer vision and image recognition, creating widespread opportunities for using artificial intelligence in research as well as industrial applications. DL has been extensively studied in medical imaging applications, including those related to pulmonary diseases. Chronic obstructive pulmonary disease, asthma, lung cancer, pneumonia, and, more recently, COVID-19 are common lung diseases affecting nearly 7.4% of the world population. Pulmonary imaging has been widely investigated toward improving our understanding of disease etiologies and early diagnosis and assessment of disease progression and clinical outcomes. DL has been broadly applied to solve various pulmonary image processing challenges including classification, recognition, registration, and segmentation. This paper presents a survey of pulmonary diseases, roles of imaging in translational and clinical pulmonary research, and applications of different DL architectures and methods in pulmonary imaging with emphasis on DL-based segmentation of major pulmonary anatomies such as lung volumes, lung lobes, pulmonary vessels, and airways as well as thoracic musculoskeletal anatomies related to pulmonary diseases.
Ethical issues when using digital biomarkers and artificial intelligence for the early detection of dementia
Ford E, Milne R and Curlewis K
Dementia poses a growing challenge for health services but remains stigmatized and under-recognized. Digital technologies to aid the earlier detection of dementia are approaching market. These include traditional cognitive screening tools presented on mobile devices, smartphone native applications, passive data collection from wearable, in-home and in-car sensors, as well as machine learning techniques applied to clinic and imaging data. It has been suggested that earlier detection and diagnosis may help patients plan for their future, achieve a better quality of life, and access clinical trials and possible future disease modifying treatments. In this review, we explore whether digital tools for the early detection of dementia can or should be deployed, by assessing them against the principles of ethical screening programs. We conclude that while the importance of dementia as a health problem is unquestionable, significant challenges remain. There is no available treatment which improves the prognosis of diagnosed disease. Progression from early-stage disease to dementia is neither given nor currently predictable. Available technologies are generally not both minimally invasive and highly accurate. Digital deployment risks exacerbating health inequalities due to biased training data and inequity in digital access. Finally, the acceptability of early dementia detection is not established, and resources would be needed to ensure follow-up and support for those flagged by any new system. We conclude that early dementia detection deployed at scale via digital technologies does not meet standards for a screening program and we offer recommendations for moving toward an ethical mode of implementation. This article is categorized under: Application Areas > Health Care; Commercial, Legal, and Ethical Issues > Ethical Considerations; Technologies > Artificial Intelligence.