STIF: Intuitionistic fuzzy Gaussian membership function with statistical transformation weight of evidence and information value for private information preservation
Sharing data with multiple organizations is essential for analysis in many situations. The shared data contains individuals' private and sensitive information, and releasing it can result in privacy breaches. To overcome these privacy challenges, privacy-preserving data mining (PPDM) has emerged as a solution. This work addresses the problem of PPDM by proposing the statistical transformation with intuitionistic fuzzy (STIF) algorithm for data perturbation. The STIF algorithm combines the statistical methods weight of evidence and information value with an intuitionistic fuzzy Gaussian membership function. The STIF algorithm is applied to three benchmark datasets: adult income, bank marketing, and lung cancer. The classifier models decision tree, random forest, extreme gradient boosting, and support vector machine are used for accuracy and performance analysis. The results show that the STIF algorithm achieves 99% accuracy on the adult income dataset and 100% accuracy on both the bank marketing and lung cancer datasets. Further, the results highlight that the STIF algorithm outperforms state-of-the-art algorithms in data perturbation capacity and privacy-preserving capacity, without any information loss on either numerical or categorical data.
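As a rough illustration of the statistical building blocks named in this abstract (not the authors' STIF implementation), the following Python sketch computes weight of evidence and information value for one categorical feature against a binary target, plus a plain Gaussian membership degree; all function names and defaults are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def weight_of_evidence(df, feature, target, eps=1e-9):
    """Standard WoE and IV for a categorical feature and a binary (0/1) target."""
    grouped = df.groupby(feature)[target].agg(events="sum", total="count")
    grouped["non_events"] = grouped["total"] - grouped["events"]
    dist_event = grouped["events"] / max(grouped["events"].sum(), eps)
    dist_non_event = grouped["non_events"] / max(grouped["non_events"].sum(), eps)
    woe = np.log((dist_event + eps) / (dist_non_event + eps))      # per-category WoE
    iv = ((dist_event - dist_non_event) * woe).sum()               # total information value
    return woe, iv

def gaussian_membership(x, c, sigma):
    """Gaussian membership degree; an intuitionistic variant would additionally
    track non-membership and hesitation degrees."""
    return np.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))
```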
MICAR: multi-inhabitant context-aware activity recognition in home environments
The sensor-based recognition of Activities of Daily Living (ADLs) in smart-home environments enables several important applications, including the continuous monitoring of fragile subjects in their homes for healthcare systems. The majority of the approaches in the literature assume that only one resident is living in the home. Multi-inhabitant ADLs recognition is significantly more challenging, and only a limited effort has been devoted to this setting by the research community. One of the major open problems is data association: correctly associating each environmental sensor event (e.g., the opening of a fridge door) with the inhabitant who actually triggered it. Moreover, existing multi-inhabitant approaches rely on supervised learning, assuming a high availability of labeled data. However, collecting a comprehensive training set of ADLs (especially in multiple-resident settings) is prohibitive. In this work, we propose MICAR: a novel multi-inhabitant ADLs recognition approach that combines semi-supervised learning and knowledge-based reasoning. Data association is performed by semantic reasoning, combining high-level context information (e.g., residents' postures and semantic locations) with triggered sensor events. The personalized stream of sensor events is processed by an incremental classifier, which is initialized with a limited amount of labeled ADLs. A novel cache-based active learning strategy is adopted to continuously improve the classifier. Our results on a dataset where up to 4 subjects perform ADLs at the same time show that MICAR reliably recognizes individual and joint activities while triggering only a small number of active learning queries.
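As a toy illustration of what data association amounts to (my own simplification, not MICAR's semantic reasoning engine), the sketch below assigns a sensor event to the single resident whose context is consistent with it; all field names and the posture check are hypothetical.

```python
def associate_event(event, residents):
    """event: {'sensor': ..., 'location': ...};
    residents: dict name -> {'location': ..., 'posture': ...} from context recognition."""
    candidates = [name for name, ctx in residents.items()
                  if ctx["location"] == event["location"]
                  and ctx["posture"] in ("standing", "walking")]  # illustrative consistency rule
    # The association is considered unambiguous only if exactly one resident fits.
    return candidates[0] if len(candidates) == 1 else None

event = {"sensor": "fridge_door", "location": "kitchen"}
residents = {"alice": {"location": "kitchen", "posture": "standing"},
             "bob": {"location": "living_room", "posture": "sitting"}}
assert associate_event(event, residents) == "alice"
```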
Abstract Cost Models for Distributed Data-Intensive Computations
We consider data analytics workloads on distributed architectures, in particular clusters of commodity machines. To find a job partitioning that minimizes running time, a cost model, which we more accurately refer to as a makespan model, is needed. In the search for the simplest possible, yet sufficiently accurate, such model, we explore piecewise linear functions of input, output, and computational complexity. They are abstract in the sense that they capture fundamental algorithm properties, but do not require explicit modeling of system and implementation details such as the number of disk accesses. We show how the simplified functional structure can be exploited to reduce optimization cost. In the general case, we identify a lower bound that can be used for search-space pruning. For applications with homogeneous tasks, we further demonstrate how to directly integrate the model into the makespan optimization process, reducing search-space dimensionality and thus complexity by orders of magnitude. Experimental results provide evidence of good prediction quality and successful makespan optimization across a variety of operators and cluster architectures.
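To make the modeling idea concrete, here is a minimal sketch of an abstract makespan model in this spirit: each task's cost is a piecewise linear function of its input size, output size, and computational complexity, and the makespan of a partitioning is the cost of the slowest worker. All coefficients and breakpoints below are made-up placeholders, not values from the paper.

```python
def piecewise_linear(x, breakpoints, slopes, intercepts):
    """Evaluate a 1-D piecewise linear function defined by per-segment slope/intercept."""
    for b, m, c in zip(breakpoints, slopes, intercepts):
        if x <= b:
            return m * x + c
    return slopes[-1] * x + intercepts[-1]

def task_cost(input_size, output_size, complexity):
    # One term per abstract resource; all weights are illustrative placeholders.
    return (piecewise_linear(input_size, [1e6, float("inf")], [2e-6, 1e-6], [0.0, 1.0])
            + 5e-7 * output_size
            + 1e-8 * complexity)

def makespan(tasks_per_worker):
    """tasks_per_worker: list (one entry per worker) of lists of
    (input_size, output_size, complexity) tuples; makespan = slowest worker."""
    return max(sum(task_cost(*t) for t in tasks) for tasks in tasks_per_worker)
```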
: A Cloud MapReduce Based High Performance Whole Slide Image Analysis Framework
Recent advancements in the systematic analysis of high-resolution whole slide images have increased the efficiency of diagnosis, prognosis, and prediction of cancer and other important diseases. Due to the enormous sizes and dimensions of whole slide images, the analysis requires extensive computing resources which are not commonly available. Images have to be tiled for processing due to computer memory limitations, which leads to inaccurate results because objects crossing tile boundaries are ignored. Thus, we propose a generic and highly scalable cloud-based image analysis framework for whole slide images. The framework enables parallelized integration of image analysis steps, such as segmentation and aggregation of micro-structures, in a single pipeline, and the generation of final objects manageable by databases. The core concept relies on the abstraction of objects in whole slide images as different classes of spatial geometries, which in turn can be handled as text-based records in MapReduce. The framework applies an overlapping partitioning scheme on images, and provides parallelization of tiling and image segmentation based on the MapReduce architecture. It further provides robust object normalization and graceful handling of boundary objects, with an efficient spatial-indexing-based matching method to generate accurate results. Our benchmark experiments on Amazon EMR show that the framework is highly scalable, generic, and extremely cost-effective.
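The core abstraction can be pictured with a toy map/reduce pair (illustrative only, not the framework's code): segmented micro-structures are serialized as WKT text records keyed by tile, and the reducer is where boundary-crossing objects would be matched and merged via spatial indexing.

```python
def map_tile(tile_id, polygons):
    """Emit (tile_id, WKT) text records; polygons is a list of [(x, y), ...] rings."""
    for ring in polygons:
        wkt = "POLYGON ((" + ", ".join(f"{x} {y}" for x, y in ring) + "))"
        yield tile_id, wkt

def reduce_tile(tile_id, wkt_records):
    """Placeholder reducer: in the real pipeline, objects touching tile boundaries
    would be matched against neighboring tiles and merged here."""
    return tile_id, list(wkt_records)

# Usage on a toy tile:
records = list(map_tile("tile_0_0", [[(0, 0), (0, 10), (10, 10), (10, 0), (0, 0)]]))
```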
Scalable and flexible management of medical image big data
Digital imaging plays a critical role in image-guided diagnosis and clinical trials, and the amount of image data is growing fast. There are two major requirements for image data management: scalability to massive scales and support for comprehensive queries. Traditional Picture Archiving and Communication Systems (PACS for short) are based on relational data management systems and suffer from limited scalability and query support. Therefore, new systems that support fast, scalable and comprehensive queries on image data are highly demanded. In this paper, we introduce two alternative approaches: DCMRL/XMLStore (RL/XML for short), a parallel, hybrid relational and XML data management approach, and DCMDocStore (DOC for short), a NoSQL document store approach. DCMRL/XMLStore manages DICOM images as binary large objects and metadata as relational tables and XML documents based on IBM DB2, which is parallelized through data partitioning. DCMDocStore manages DICOM metadata as JSON objects, and DICOM images as encoded attachments in MongoDB running on multiple nodes. We have delivered two open source systems, DCMRL/XMLStore and DCMDocStore. Both systems support scalable data management and comprehensive queries. We also evaluated them with nearly one million DICOM images from the National Biomedical Imaging Archive. The results show that DCMDocStore demonstrates high data loading speed, high scalability and fault tolerance, while DCMRL/XMLStore provides efficient queries but comes with slower data loading. Traditional PACS systems have inherent limitations on flexible queries and scalability for massive amounts of images.
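The DCMDocStore-style pattern, storing DICOM metadata as JSON documents and the image bytes as attachments, can be sketched roughly as follows using pydicom, pymongo, and GridFS. This is an assumption-laden illustration of the general approach (collection names, use of GridFS, and the query example are my own choices), not the system's actual code.

```python
import pydicom
import gridfs
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["imaging"]          # hypothetical database name
fs = gridfs.GridFS(db)          # attachment storage for the raw DICOM files

def store_dicom(path):
    # Read header/metadata only; the raw file goes into GridFS as an attachment.
    ds = pydicom.dcmread(path, stop_before_pixels=True)
    metadata = ds.to_json_dict()                     # DICOM tags as a JSON document
    with open(path, "rb") as f:
        file_id = fs.put(f, filename=path)
    db.studies.insert_one({"metadata": metadata, "image_file": file_id})

# Metadata queries then become ordinary document queries, e.g. by patient name tag:
# db.studies.find({"metadata.00100010.Value.Alphabetic": "DOE^JOHN"})
```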
High-dimensional similarity searches using query driven dynamic quantization and distributed indexing
The concept of similarity is used as the basis for many data exploration and data mining tasks. Nearest neighbor (NN) queries identify the most similar items, or in terms of distance, the closest points to a query point. Similarity is traditionally characterized using a distance function between multi-dimensional feature vectors. However, when the data is high-dimensional, traditional distance functions fail to significantly distinguish between the closest and furthest points, as a few dissimilar dimensions dominate the distance function. Localized similarity functions, i.e. functions that only consider dimensions close to the query, quantize each dimension independently and only compute similarity for the dimensions where the query and the points fall into the same bin. These quantizations are query-agnostic, and there is potential to improve accuracy when a query-dependent quantization is used. In this work we propose a query-dependent equi-depth (QED) on-the-fly quantization method to improve high-dimensional similarity searches. The quantization is done for each dimension at query time, and localized scores are generated for the closest fraction of the points while a constant penalty is applied to the rest of the points. QED not only improves the quality of the distance metric, but also improves query-time performance by filtering out non-relevant data. We propose a distributed indexing and query algorithm to efficiently compute QED. Our experimental results show improvements in classification accuracy as well as query performance up to one order of magnitude faster than Manhattan-based sequential-scan NN queries over datasets with hundreds of dimensions. Furthermore, similarity searches with QED show linear or better scalability in relation to the number of dimensions and the number of compute nodes.
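A rough reading of the QED scoring idea (my simplification, not the authors' reference implementation) can be sketched as follows: for each dimension, the closest fraction of points around the query contributes a distance-based score, and all other points incur a constant penalty. The fraction and penalty values are placeholders.

```python
import numpy as np

def qed_scores(data, query, fraction=0.1, penalty=1.0):
    """data: (n, d) array; query: (d,) array; returns per-point scores (lower = closer)."""
    n, d = data.shape
    scores = np.zeros(n)
    for j in range(d):
        dist_j = np.abs(data[:, j] - query[j])
        cutoff = np.quantile(dist_j, fraction)        # equi-depth cut around the query
        scores += np.where(dist_j <= cutoff, dist_j, penalty)
    return scores

# Nearest neighbors under this localized score:
# idx = np.argsort(qed_scores(points, q))[:k]
```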
Sensitive attribute privacy preservation of trajectory data publishing based on l-diversity
The wide application of positioning technology has made it feasible to collect people's movements for knowledge-based decision making. Data in its original form often contains sensitive attributes, and publishing such data will leak individuals' privacy. In particular, a privacy threat occurs when an attacker can link a record to a specific individual based on some known partial information. Therefore, maintaining privacy in the published data is a critical problem. To prevent record linkage, attribute linkage, and similarity attacks based on background knowledge of trajectory data, we propose a data privacy preservation scheme with enhanced l-diversity. First, we determine those critical spatial-temporal sequences which are more likely to cause privacy leakage. Then, we perturb these sequences by adding or deleting some spatial-temporal points while ensuring that the published data satisfy our enhanced privacy model derived from l-diversity. Our experiments on both synthetic and real-life datasets suggest that our proposed scheme can achieve better privacy while still ensuring high utility, compared with existing privacy preservation schemes for trajectory data.
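The basic l-diversity requirement such schemes build on can be illustrated with a small check (a simplified sketch, not the paper's enhanced model): records that share the same published spatio-temporal sequence must contain at least l distinct sensitive values.

```python
from collections import defaultdict

def satisfies_l_diversity(records, l):
    """records: list of (spatio_temporal_sequence_tuple, sensitive_value) pairs."""
    groups = defaultdict(set)
    for seq, sensitive in records:
        groups[seq].add(sensitive)            # group by published quasi-identifier sequence
    return all(len(values) >= l for values in groups.values())

# Two trajectories share the same published sequence but differ in the
# sensitive attribute, so together they satisfy 2-diversity.
recs = [((("08:00", "A"), ("09:00", "B")), "flu"),
        ((("08:00", "A"), ("09:00", "B")), "cold")]
assert satisfies_l_diversity(recs, 2)
```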
Sentimental analysis from imbalanced code-mixed data using machine learning approaches
Knowledge discovery from various perspectives has become a crucial asset in almost all fields. Sentiment analysis is a classification task used to classify sentences based on the meaning of their context. This paper addresses the class imbalance problem, which is one of the important issues in sentiment analysis; few works have focused on sentiment analysis with an imbalanced class label distribution. The paper also focuses on another aspect of the problem, a concept called "code mixing". Code-mixed data consists of text alternating between two or more languages, and a class-imbalanced distribution is a commonly noted phenomenon in such data. Existing works have focused more on analyzing sentiments in monolingual data than in code-mixed data. This paper addresses all of these issues and proposes a solution for analyzing sentiments in class-imbalanced code-mixed data using a sampling technique combined with the Levenshtein distance metric. Furthermore, the paper compares the performance of various machine learning approaches, namely Random Forest, Logistic Regression, XGBoost, Support Vector Machine, and Naïve Bayes classifiers, using the F1-score.
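The two building blocks named above can be sketched as follows (illustrative only, not the paper's exact pipeline): a Levenshtein distance used to map noisy code-mixed spellings onto a canonical vocabulary, and simple random oversampling to balance the class distribution.

```python
import random

def levenshtein(a, b):
    """Classic edit-distance DP between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def normalize(token, vocabulary):
    """Map a noisy token to its closest canonical spelling."""
    return min(vocabulary, key=lambda w: levenshtein(token, w))

def oversample(texts, labels, seed=0):
    """Duplicate minority-class samples until all classes match the majority size."""
    rng = random.Random(seed)
    by_label = {}
    for t, y in zip(texts, labels):
        by_label.setdefault(y, []).append(t)
    target = max(len(v) for v in by_label.values())
    out = [(t, y) for y, ts in by_label.items() for t in ts]
    for y, ts in by_label.items():
        out += [(rng.choice(ts), y) for _ in range(target - len(ts))]
    return out
```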
RETRACTED ARTICLE: Application of machine learning (ML) and internet of things (IoT) in healthcare to predict and tackle pandemic situation
Subscribing to big data at scale
Today, data is being actively generated by a variety of devices, services, and applications. Such data is important not only for the information that it contains, but also for its relationships to other data and to interested users. Most existing Big Data systems focus on answering queries from users, rather than collecting data, processing it, and serving it to users. To satisfy both passive and active requests at scale, application developers need either to heavily customize an existing passive Big Data system or to glue one together with separate streaming and messaging systems. Either choice requires significant effort and incurs additional overhead. In this paper, we present the BAD (Big Active Data) system as an end-to-end, out-of-the-box solution for this challenge. It is designed to preserve the merits of passive Big Data systems and introduces new features for actively serving Big Data to users at scale. We show the design and implementation of the BAD system, demonstrate how BAD facilitates providing both passive and active data services, investigate the BAD system's performance at scale, and illustrate the complexities that would result from instead providing BAD-like services with a "glued" system.
Scalable probabilistic truss decomposition using central limit theorem and H-index
Truss decomposition is a popular notion of hierarchical dense substructures in graphs. In a nutshell, the k-truss is the largest subgraph in which every edge is contained in at least k-2 triangles. Truss decomposition aims to compute the k-trusses for each possible value of k. There are many works that study truss decomposition in deterministic graphs. However, in probabilistic graphs, truss decomposition is significantly more challenging and has received much less attention; state-of-the-art approaches do not scale well to large probabilistic graphs. Finding the tail probabilities of the number of triangles that contain each edge is a critical challenge of those approaches. This is achieved using dynamic programming, which has quadratic run-time and is thus not scalable to real large networks which, quite commonly, can have edges contained in many triangles (in the millions). To address this challenge, we employ a special version of the Central Limit Theorem (CLT) to obtain the tail probabilities efficiently. Based on our CLT approach, we propose a peeling algorithm for truss decomposition that scales to large probabilistic graphs and offers significant improvement over the state of the art. We also design a second method, which progressively tightens the estimate of the truss value of each edge and is based on h-index computation. In contrast to our CLT-based approach, our h-index algorithm (1) is progressive, allowing the user to see intermediate results along the way, (2) does not sacrifice the exactness of the final result, and (3) achieves all this while processing only one edge and its immediate neighbors at a time, thus resulting in a smaller memory footprint. We perform extensive experiments to show the scalability of both of our proposed algorithms.
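The tail-probability idea can be illustrated with a standard normal approximation (a sketch that treats the triangle indicators of an edge as independent Bernoulli variables, not the authors' exact CLT estimator): the number of triangles containing an edge then has mean sum(p_i) and variance sum(p_i * (1 - p_i)), so its tail probability can be read off a Gaussian.

```python
import math

def triangle_tail_probability(p, k):
    """Approximate P(X >= k), where X is the sum of Bernoulli(p_i) triangle indicators."""
    mu = sum(p)
    var = sum(pi * (1 - pi) for pi in p)
    if var == 0:
        return 1.0 if mu >= k else 0.0
    z = (k - 0.5 - mu) / math.sqrt(var)       # continuity correction
    return 0.5 * math.erfc(z / math.sqrt(2))  # 1 - Phi(z)

# e.g. an edge whose 1000 candidate triangles each exist with probability 0.3:
# triangle_tail_probability([0.3] * 1000, 280)
```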
Bio-SODA UX: enabling natural language question answering over knowledge graphs with user disambiguation
The problem of natural language processing over structured data has become a growing research field, both within the relational database and the Semantic Web communities, with significant efforts devoted to question answering over knowledge graphs (KGQA). However, many of these approaches are either specifically targeted at question answering over DBpedia, or require translating a natural language question into SPARQL in order to query the knowledge graph. Hence, these approaches often cannot be applied directly to complex, domain-specific knowledge graphs for which no prior training data is available. In this paper, we focus on the challenges of natural language processing over knowledge graphs of scientific datasets. In particular, we introduce Bio-SODA, a natural language processing engine that does not require training data in the form of question-answer pairs for generating SPARQL queries. Bio-SODA uses a generic graph-based approach for translating user questions into a ranked list of SPARQL candidate queries. Furthermore, Bio-SODA uses a novel ranking algorithm that includes node centrality as a measure of relevance for selecting the best SPARQL candidate query. Our experiments with real-world datasets across several scientific domains, including the official Question Answering over Linked Data (QALD) challenge as well as the CORDIS dataset of European projects, show that Bio-SODA outperforms publicly available KGQA systems by an F1-score of at least 20%, and by an even higher margin on more complex bioinformatics datasets. Finally, we introduce Bio-SODA UX, a graphical user interface designed to assist users in the exploration of large knowledge graphs and in dynamically disambiguating natural language questions that target the data available in these graphs.
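As an illustration of how node centrality could enter candidate ranking (my simplification, not Bio-SODA's actual scoring function), the sketch below combines a keyword-match score with the average centrality of the matched nodes; the weighting parameter alpha and the candidate tuple layout are hypothetical.

```python
def rank_candidates(candidates, centrality, alpha=0.7):
    """candidates: list of (sparql, matched_nodes, match_score in [0, 1]);
    centrality: dict node -> precomputed centrality score in [0, 1]."""
    def score(candidate):
        sparql, nodes, match = candidate
        node_centrality = sum(centrality.get(n, 0.0) for n in nodes) / max(len(nodes), 1)
        return alpha * match + (1 - alpha) * node_centrality
    # Best candidate query first.
    return sorted(candidates, key=score, reverse=True)
```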