Big Data

DMHANT: DropMessage Hypergraph Attention Network for Information Propagation Prediction
Ouyang Q, Chen H, Liu S, Pu L, Ge D and Fan K
Predicting propagation cascades is crucial for understanding information propagation in social networks. Existing methods typically focus on the structure or order of infected users within a single cascade sequence, ignoring the global dependencies among cascades and users, which is insufficient to characterize their dynamic interaction preferences. Moreover, existing methods handle model robustness poorly. To address these issues, we propose a prediction model named the DropMessage Hypergraph Attention Network (DMHANT), which constructs a hypergraph from the cascade sequences. Specifically, to dynamically capture user preferences, we divide the diffusion hypergraph into multiple subgraphs according to timestamps, develop hypergraph attention networks to explicitly learn complete interactions, and adopt a gated fusion strategy to connect them for user cascade prediction. In addition, a message-dropping method, DropMessage, is incorporated to increase the robustness of the model. Experimental results on three real-world datasets indicate that the proposed model significantly outperforms state-of-the-art information propagation prediction models on both MAP@K and Hits@K metrics, and experiments also show that the model retains stronger prediction performance than existing models under data perturbation.
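A minimal, illustrative sketch of the DropMessage idea referenced in this abstract: instead of dropping whole nodes or edges, individual entries of the propagated message matrix are dropped at random during message passing. The mean aggregation and all names below are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def drop_message(messages: np.ndarray, drop_rate: float, rng=None) -> np.ndarray:
    """Randomly zero out individual message entries and rescale the rest."""
    rng = rng or np.random.default_rng()
    mask = rng.random(messages.shape) >= drop_rate
    return messages * mask / (1.0 - drop_rate)

def propagate(node_feats: np.ndarray, incidence: np.ndarray, drop_rate: float = 0.1):
    """One message-passing step over a hypergraph incidence matrix (nodes x hyperedges)."""
    # node -> hyperedge messages, with element-wise dropping for robustness
    edge_msgs = drop_message(incidence.T @ node_feats, drop_rate)
    edge_msgs /= np.maximum(incidence.sum(axis=0, keepdims=True).T, 1)  # mean over members
    # hyperedge -> node aggregation
    return incidence @ edge_msgs / np.maximum(incidence.sum(axis=1, keepdims=True), 1)
```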
Maximizing Influence in Social Networks Using Combined Local Features and Deep Learning-Based Node Embedding
Bouyer A, Beni HA, Oskouei AG, Rouhi A, Arasteh B and Liu X
The influence maximization problem has several issues, including low infection rates and high time complexity. Many proposed methods are not suitable for large-scale networks due to their time complexity or free parameter usage. To address these challenges, this article proposes a local heuristic called Embedding Technique for Influence Maximization (ETIM) that uses shell decomposition, graph embedding, and reduction, as well as combined local structural features. The algorithm selects candidate nodes based on their connections among network shells and topological features, reducing the search space and computational overhead. It uses a deep learning-based node embedding technique to create a multidimensional vector of candidate nodes and calculates the dependency on spreading for each node based on local topological features. Finally, influential nodes are identified using the results of the previous phases and newly defined local features. The proposed algorithm is evaluated using the independent cascade model, showing its competitiveness and ability to achieve the best performance in terms of solution quality. Compared with the collective influence global algorithm, ETIM is significantly faster and improves the infection rate by an average of 12%.
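A hedged sketch of the candidate-filtering step described above: k-shell (core) decomposition prunes the search space, and the surviving nodes are ranked by a simple combined local score. The scoring formula and quantile cutoff are placeholders, not ETIM's actual features.

```python
import networkx as nx

def candidate_seeds(G: nx.Graph, k: int, shell_quantile: float = 0.5):
    core = nx.core_number(G)                       # shell index of every node
    cutoff = sorted(core.values())[int(len(core) * shell_quantile)]
    candidates = [v for v, c in core.items() if c >= cutoff]   # keep upper shells only
    def score(v):                                  # toy combined local feature
        return G.degree(v) + sum(core[u] for u in G.neighbors(v)) / max(G.degree(v), 1)
    return sorted(candidates, key=score, reverse=True)[:k]

seeds = candidate_seeds(nx.karate_club_graph(), k=3)
```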
Special Issue: Big Scientific Data and Machine Learning in Science and Engineering
Pourkamali-Anaraki F
Content-Aware Human Mobility Pattern Extraction
Li S, Fan C, Li T, Chen R, Liu Q and Gong J
Extracting meaningful patterns of human mobility from accumulating trajectories is essential for understanding human behavior. However, previous works identify human mobility patterns based on the spatial co-occurrence of trajectories, ignoring the effect of activity content, which leaves challenges in effectively extracting and understanding patterns. To bridge this gap, this study incorporates the activity content of trajectories to extract human mobility patterns and proposes a content-aware mobility pattern model. The model first embeds the activity content in a distributed continuous vector space, taking points-of-interest as agents, and then extracts representative and interpretable mobility patterns from sets of human trajectories using a derived topic model. To investigate the performance of the proposed model, several evaluation metrics are developed, including pattern coherence, pattern similarity, and manual scoring. A real-world case study is conducted, and its experimental results show that the proposed model improves interpretability and helps in understanding mobility patterns. This study provides not only a novel solution and several evaluation metrics for human mobility patterns but also a methodological reference for fusing the content semantics of human activities into trajectory analysis and mining.
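A rough sketch of the topic-model step: each trajectory is treated as a "document" whose tokens are visited POI categories, and a topic model surfaces recurring activity patterns. Plain LDA is used here as a stand-in for the paper's derived topic model; the trajectories and category names are toy assumptions.

```python
from gensim import corpora, models

trajectories = [
    ["home", "coffee_shop", "office", "restaurant", "office", "gym", "home"],
    ["home", "school", "park", "supermarket", "home"],
    ["home", "office", "restaurant", "office", "bar", "home"],
]
dictionary = corpora.Dictionary(trajectories)
corpus = [dictionary.doc2bow(t) for t in trajectories]
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, terms in lda.print_topics(num_words=4):
    print(topic_id, terms)   # each topic reads as an interpretable mobility pattern
```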
A Fast Survival Support Vector Regression Approach to Large Scale Credit Scoring via Safe Screening
Wang H and Hong L
Survival models have found increasingly wide application in credit scoring due to their ability to estimate the dynamics of risk over time. In this research, we propose a Buckley-James safe sample screening support vector regression (BJS4VR) algorithm to model large-scale survival data by combining the Buckley-James transformation and support vector regression. Different from previous support vector regression survival models, censored samples are imputed here using a censoring-unbiased Buckley-James estimator. Safe sample screening is then applied to discard samples that are guaranteed to be non-active at the final optimal solution, improving efficiency. Experimental results on the large-scale real-world Lending Club loan data show that the proposed BJS4VR model outperforms existing popular survival models such as RSFM, CoxRidge, and CoxBoost in terms of both prediction accuracy and time efficiency. Important variables highly correlated with credit risk are also identified with the proposed method.
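A heavily simplified sketch of the Buckley-James idea mentioned above: censored targets are replaced by an estimate of their conditional expectation given that the true time exceeds the censoring time, after which an ordinary SVR is fit on the completed data. The crude empirical conditional mean below stands in for the proper Buckley-James estimator, and safe screening is omitted; all data are simulated.

```python
import numpy as np
from sklearn.svm import SVR

def buckley_james_impute(times: np.ndarray, observed: np.ndarray) -> np.ndarray:
    """Replace censored times by the mean of observed event times beyond them."""
    imputed = times.astype(float).copy()
    event_times = times[observed]
    for i in np.where(~observed)[0]:
        later = event_times[event_times > times[i]]
        if later.size:                       # crude E[T | T > censoring time]
            imputed[i] = later.mean()
    return imputed

# toy data: X features, survival times, and an event indicator (False = censored)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
times = np.exp(X[:, 0] + rng.normal(scale=0.3, size=200))
observed = rng.random(200) > 0.3
y = buckley_james_impute(times, observed)
model = SVR().fit(X, np.log(y))
```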
Research on the Influence of Information Iterative Propagation on Complex Network Structure
Qian Y, Nian F, Wang Z and Yao Y
Dynamic propagation affects the evolution of network structure, and different networks are affected by the iterative propagation of information to different degrees. The iterative propagation of information in a network changes the connection strength of the edges between nodes. Most studies on temporal networks build networks based on temporal characteristics, and the iterative propagation of information in a network can also reflect the temporal characteristics of network evolution. The change of network structure is a macro-level manifestation of these temporal characteristics, whereas the dynamics within the network are a micro-level manifestation. How to concretely visualize the changes in network structure driven by propagation dynamics is the focus of this article. The appearance of an edge is a micro-level change of network structure, and the division of communities is a macro-level change. Based on this, a node participation measure is proposed to quantify the influence of different users on information propagation in the network, and it is simulated in different types of networks. By analyzing the iterative propagation of information, weighted versions of different networks are constructed based on this propagation. Finally, the edges and community divisions in the network are analyzed to quantify the influence of propagation on complex network structure.
Balancing Protection and Quality in Big Data Analytics Pipelines
Polimeno A, Mignone P, Braghin C, Anisetti M, Ceci M, Malerba D and Ardagna CA
Existing data engine implementations do not properly manage the conflict between the need to protect data and the need to share it, which hampers the spread of big data applications and limits their impact. These two requirements have often been studied and defined independently, leading to a conceptual and technological misalignment. This article presents the architecture and technical implementation of a data engine that addresses this conflict by integrating a new governance solution based on access control within a big data analytics pipeline. Our data engine enriches traditional components for data governance with an access control system that enforces access to data in a big data environment through data transformations. Data are then used along the pipeline only after sanitization, protecting sensitive attributes before their usage, in an effort to balance protection and quality. The solution was tested in a real-world smart city scenario using data from the Oslo city transportation system. Specifically, we compared predictive models trained with the data views obtained by applying the secure transformations required by different user roles to the same data set. The results show that the predictive models, built on data manipulated according to access control policies, remain effective.
A Basketball Big Data Platform for Box Score and Play-by-Play Data
Vinué G
This is the second part of a research diptych devoted to improving basketball data management in Spain. The Spanish ACB (Association of Basketball Clubs, acronym in Spanish) is the top European national competition. It attracts most of the best foreign players outside the NBA (National Basketball Association, in North America) and also accelerates the development of Spanish players who ultimately contribute to the success of the Spanish national team. However, this sporting excellence is not reciprocated by an advanced treatment of the data generated by teams and players, the so-called statistics. On the contrary, their use is still very rudimentary. An earlier article published in this journal in 2020 introduced the first open web application for interactive visualization of the box score data from three European competitions, including the ACB. Box score data refer to the data provided once the game is finished. Following the same inspiration, this new research aims to present the work carried out with more advanced data, namely, play-by-play data, which are provided as the game runs. This type of data allows us to gain greater insight into basketball performance, providing information that cannot be revealed with box score data. A new dashboard is developed to analyze play-by-play data from a number of different and novel perspectives. Furthermore, a comprehensive data platform encompassing the visualization of the ACB box score and play-by-play data is presented.
Dual-Path Graph Neural Network with Adaptive Auxiliary Module for Link Prediction
Yang Z, Lin Z, Yang Y and Li J
Link prediction, which has important applications in many fields, predicts the likelihood of a link between two nodes in a graph. Link prediction based on Graph Neural Networks (GNNs), which learn node representations from the graph structure, has attracted growing attention recently. However, existing GNN-based link prediction approaches have some shortcomings. On the one hand, because a graph contains different types of nodes, aggregating information and learning node representations from neighbor nodes is challenging. On the other hand, the attention mechanism has been an effective instrument for enhancing link prediction performance, but the traditional attention mechanism is always monotonic for query nodes, which limits its benefit for link prediction. To address these two problems, a Dual-Path Graph Neural Network (DPGNN) for link prediction is proposed in this study. First, we propose a novel Local Random Features Augmentation for Graph Convolution Network as the baseline of one path. Meanwhile, Graph Attention Network version 2, based on a dynamic attention mechanism, is adopted as the baseline of the other path. We then capture more meaningful node representations and more accurate link features by concatenating the information from these two paths. In addition, we propose an adaptive auxiliary module that better balances the weights of auxiliary tasks, which brings further benefit to link prediction. Finally, extensive experiments verify the effectiveness and superiority of the proposed DPGNN for link prediction.
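A minimal PyTorch Geometric sketch of the dual-path idea: one GCN-style path and one GATv2 (dynamic-attention) path run in parallel, and their node embeddings are concatenated to score candidate links. The local random feature augmentation, adaptive auxiliary module, and training loop are omitted; layer sizes and the dot-product scorer are illustrative assumptions.

```python
import torch
from torch_geometric.nn import GCNConv, GATv2Conv

class DualPathEncoder(torch.nn.Module):
    def __init__(self, in_dim: int, hid_dim: int):
        super().__init__()
        self.gcn = GCNConv(in_dim, hid_dim)             # structural path
        self.gat = GATv2Conv(in_dim, hid_dim, heads=1)  # dynamic-attention path

    def forward(self, x, edge_index):
        # concatenate the two paths' node embeddings
        return torch.cat([self.gcn(x, edge_index), self.gat(x, edge_index)], dim=-1)

    def score_links(self, h, pairs):
        # dot-product scores for candidate (source, target) node pairs
        return (h[pairs[0]] * h[pairs[1]]).sum(dim=-1)
```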
Gaussian Adapted Markov Model with Overhauled Fluctuation Analysis-Based Big Data Streaming Model in Cloud
Ananthi M, Gopal A, Ramalakshmi K and Mohan Kumar P
Accurate resource usage prediction for big data streaming applications remains a complex process. In existing works, various resource scaling techniques have been developed for forecasting resource usage in big data streaming systems. However, the baseline streaming mechanisms suffer from inefficient resource scaling, inaccurate forecasting, high latency, and long running times. Therefore, this work develops a new framework, named the Gaussian adapted Markov model (GAMM)-overhauled fluctuation analysis (OFA), for efficient big data streaming in cloud systems. The purpose of this work is to efficiently manage time-bounded big data streaming applications with a reduced error rate. In this study, a gating strategy is used to extract the set of features capturing the nonlinear distribution of the data and to obtain a fast-converging solution, which is used to perform the fluctuation analysis. Moreover, a layered architecture is developed to simplify resource forecasting in streaming applications. During experimentation, the results of the proposed streaming model GAMM-OFA are validated and compared using different measures.
Investigating the Co-Movement and Asymmetric Relationships of Oil Prices on the Shipping Stock Returns: Evidence from Three Shipping-Flagged Companies from Germany, South Korea, and Taiwan
Saputra J, Mokhtar K, Abu Bakar A and Ruslan SMM
In the last 2 years, there has been a significant upswing in oil prices, leading to a decline in economic activity and demand. This trend holds substantial implications for the global economy, particularly within the emerging business landscape. Among the influential risk factors impacting the returns of shipping stocks, none looms larger than the volatility in oil prices. Yet, only a limited number of studies have explored the complex relationship between oil price shocks and the dynamics of the liner shipping industry, with specific focus on uncertainty linkages and potential diversification strategies. This study aims to investigate the co-movements and asymmetric associations between oil prices (specifically, West Texas Intermediate and Brent) and the stock returns of three prominent shipping companies from Germany, South Korea, and Taiwan. The results unequivocally highlight the indispensable role of oil prices in shaping both short-term and long-term shipping stock returns. In addition, the research underscores the statistical significance of exchange rates and interest rates in influencing these returns, with their effects varying across different time horizons. Notably, shipping stock prices exhibit heightened sensitivity to positive movements in oil prices, while exchange rates and interest rates exert contrasting impacts, one being positive and the other negative. These findings collectively illuminate the profound influence of market sentiment regarding crucial economic indicators within the global shipping sector.
Long- and Short-Term Memory Model of Cotton Price Index Volatility Risk Based on Explainable Artificial Intelligence
Xia H, Hou X and Zhang JZ
Market uncertainty greatly interferes with the decisions and plans of market participants, increasing decision-making risk and compromising the interests of decision-makers. Cotton price index (hereinafter referred to as cotton price) volatility is highly noisy, nonlinear, and stochastic, and is susceptible to supply and demand, climate, substitutes, and other policy factors, which are subject to large uncertainties. To reduce decision risk and provide decision support for policymakers, this article integrates 13 factors affecting cotton price volatility identified in existing research and further divides them into transaction data and interaction data. A long short-term memory (LSTM) model is constructed, and a comparison experiment is conducted to analyze cotton price volatility. To make the constructed model explainable, we use explainable artificial intelligence (XAI) techniques to perform statistical analysis of the input features. The experimental results show that the LSTM model can accurately analyze the cotton price fluctuation trend but cannot accurately predict the actual price of cotton; transaction data plus interaction data are more sensitive than transaction data alone in analyzing the cotton price fluctuation trend and have a positive effect on the fluctuation analysis. This study can accurately reflect the fluctuation trend of the cotton market, provide a reference for the state, enterprises, and cotton farmers in decision-making, and reduce the risk caused by frequent fluctuations of cotton prices. The analysis of the model using XAI techniques builds decision-makers' confidence in the model.
The Impact of Big Data Analytics on Decision-Making Within the Government Sector
Faridoon L, Liu W and Spence C
The government sector has started adopting big data analytics capability (BDAC) to enhance its service delivery. This study examines the relationship between BDAC and decision-making capability (DMC) in the government sector. It investigates the mediating role of decision makers' cognitive style and organizational culture in the relationship between BDAC and DMC, drawing on the resource-based view of the firm. It further investigates the impact of BDAC on organizational performance (OP). This study attempts to extend existing research through significant findings and recommendations to enhance decision-making processes for the successful utilization of BDAC in the government sector. A survey method was adopted to collect data from government organizations in the United Arab Emirates, and partial least-squares structural equation modeling was deployed to analyze the collected data. The results empirically validate the proposed theoretical framework and confirm that BDAC positively impacts DMC via cognitive style and organizational culture, which in turn positively impacts OP overall.
Modeling of Machine Learning-Based Extreme Value Theory in Stock Investment Risk Prediction: A Systematic Literature Review
Melina M, Sukono, Napitupulu H and Mohamed N
The stock market is heavily influenced by global sentiment, which is full of uncertainty and is characterized by extreme values and both linear and nonlinear variables. High-frequency data generally refer to data collected at a very fast rate, by day, hour, minute, or even second. Stock prices fluctuate rapidly, and even to extremes, along with changes in the variables that affect stock fluctuations. Research on stock market investment risk estimation that can identify extreme values, handle nonlinearity, remain reliable in multivariate cases, and use high-frequency data is therefore very important. The extreme value theory (EVT) approach can detect extreme values; it is reliable in univariate cases but becomes very complicated in multivariate cases. The purpose of this research was to collect, characterize, and analyze the investment risk estimation literature to identify research gaps. The literature was selected by applying the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines and was sourced from the ScienceDirect and Scopus databases. A total of 1107 articles were returned by the search at the identification stage, reduced to 236 at the eligibility stage and 90 articles in the included-studies set. The bibliometric networks were visualized using the VOSviewer software, and the main keyword used as the search criterion was "VaR." The visualization showed that EVT, Generalized Autoregressive Conditional Heteroskedasticity (GARCH) models, and historical simulation are the models most often used to estimate investment risk; machine learning (ML)-based investment risk estimation models are rarely applied, and no research has combined EVT and ML to estimate investment risk. The results showed that hybrid models produce better Value-at-Risk (VaR) accuracy under uncertainty and nonlinear conditions. Generally, models use only daily return data as input. Based on these research gaps, a hybrid model framework for estimating risk measures is proposed that combines EVT and ML, using multivariate and high-frequency data to identify extreme values in the data distribution. The goal is to produce risk estimates that are accurate and robust to extreme changes and shocks in the stock market. Mathematics Subject Classification: 60G25; 62M20; 6245; 62P05; 91G70.
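An illustrative peaks-over-threshold (POT) sketch of the univariate EVT-based VaR estimate discussed in this review: losses above a high threshold are fit with a generalized Pareto distribution, and the tail quantile gives VaR. This is the standard textbook recipe, not the hybrid EVT-ML framework the article proposes; the simulated returns and the 95th-percentile threshold are assumptions.

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(42)
losses = -rng.standard_t(df=4, size=5000) * 0.01       # simulated daily return losses
u = np.quantile(losses, 0.95)                           # high threshold
exceedances = losses[losses > u] - u
xi, _, sigma = genpareto.fit(exceedances, floc=0)       # tail shape and scale

def pot_var(q: float) -> float:
    """POT estimator: VaR_q = u + (sigma/xi) * ((n/N_u * (1-q))**(-xi) - 1)."""
    n, n_u = losses.size, exceedances.size
    return u + (sigma / xi) * (((n / n_u) * (1 - q)) ** (-xi) - 1)

print(f"99% VaR: {pot_var(0.99):.4f}")
```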
A MapReduce-Based Approach for Fast Connected Components Detection from Large-Scale Networks
Bhat SY and Abulaish M
Owing to the increasing size of real-world networks, their processing using classical techniques has become infeasible. The amount of storage and central processing unit time required to process large networks is far beyond the capabilities of a high-end computing machine. Moreover, real-world network data are generally distributed in nature because they are collected and stored on distributed platforms. This has popularized the use of MapReduce, a distributed data processing framework, for analyzing real-world network data. Existing MapReduce-based methods for connected components detection mainly struggle to minimize the number of MapReduce rounds and the amount of data generated and forwarded to subsequent rounds. This article presents an efficient MapReduce-based approach for finding connected components that does not forward the complete set of connected components to subsequent rounds; instead, it writes them to the Hadoop Distributed File System as soon as they are found, reducing the amount of data forwarded. It also presents an application of the proposed method to contact tracing. The proposed method is evaluated on several network data sets and compared with two state-of-the-art methods. The empirical results reveal that the proposed method performs significantly better and scales to find connected components in large-scale networks.
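A small in-memory simulation of the iterative map/reduce pattern for connected components: in each round, every node broadcasts the smallest component label it has seen to its neighbors (map), and each node keeps the minimum label received (reduce), iterating until labels stabilize. Emitting finished components early, rather than forwarding them to later rounds, is the spirit of the optimization described above; this is a didactic sketch, not Hadoop code.

```python
from collections import defaultdict

def connected_components(edges):
    """Label each node with the smallest node id in its component."""
    adj = defaultdict(set)
    label = {}
    for u, v in edges:
        adj[u].add(v); adj[v].add(u)
        label[u] = u; label[v] = v
    changed = True
    while changed:
        changed = False
        inbox = defaultdict(list)
        for u in adj:                      # "map": emit (neighbor, my_label)
            for v in adj[u]:
                inbox[v].append(label[u])
        for u, msgs in inbox.items():      # "reduce": keep the minimum label per node
            new = min([label[u]] + msgs)
            if new != label[u]:
                label[u], changed = new, True
    return label

print(connected_components([(1, 2), (2, 3), (7, 8)]))   # {1,2,3} -> 1, {7,8} -> 7
```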
Sharing Medical Big Data While Preserving Patient Confidentiality in Innovative Medicines Initiative: A Summary and Case Report from BigData@Heart
Schröder M, Muller SHA, Vradi E, Mielke J, Lim YMF, Couvelard F, Mostert M, Koudstaal S, Eijkemans MJC and Gerlinger C
Sharing individual patient data (IPD) is a simple concept but is complex to achieve due to data privacy and data security concerns, underdeveloped guidelines, and legal barriers. Sharing IPD is additionally difficult in big data-driven collaborations such as BigData@Heart in the Innovative Medicines Initiative, due to competing interests between diverse consortium members. One project within BigData@Heart, case study 1, needed to pool data from seven heterogeneous data sets: five randomized controlled trials from three different industry partners and two disease registries. Sharing IPD was not considered feasible due to legal requirements and the sensitive medical nature of these data. In addition, harmonizing the data sets for a federated data analysis was difficult due to capacity constraints and the heterogeneity of the data sets. An alternative option was to share summary statistics through contingency tables. Here it is demonstrated that this method, together with anonymization methods to ensure patient anonymity, incurred minimal loss of information. Although sharing IPD should continue to be encouraged and strived for, our approach achieved a good balance between data transparency and the protection of patient privacy. It also enabled a successful collaboration between industry and academia.
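A toy illustration of the summary-statistics approach: instead of sharing individual patient data, a data holder shares an aggregated contingency table, optionally suppressing small cells to protect anonymity. The column names, toy records, and suppression threshold are illustrative assumptions, not the consortium's actual procedure.

```python
import pandas as pd

ipd = pd.DataFrame({
    "treatment": ["A", "A", "B", "B", "A", "B", "A", "B"],
    "outcome":   ["event", "no event", "event", "no event",
                  "event", "event", "no event", "no event"],
})
table = pd.crosstab(ipd["treatment"], ipd["outcome"])
table = table.mask(table < 2)   # suppress cells below a disclosure threshold
print(table)                    # only these aggregates leave the data holder
```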
The Impact of the COVID-19 Pandemic on Stock Market Performance in G20 Countries: Evidence from Long Short-Term Memory with a Recurrent Neural Network Approach
Fitriana PM, Saputra J and Halim ZA
Spanning developing and industrialized nations, the G20 economies account for roughly two-thirds of the world's population and are the largest economies globally. Public emergencies have occasionally arisen due to the rapid global spread of COVID-19, impacting many people's lives, especially in G20 countries. This study therefore investigates the impact of the COVID-19 pandemic on stock market performance in G20 countries. It uses daily stock market data of G20 countries from January 1, 2019 to June 30, 2020, divided into G7 and non-G7 countries and analyzed using a Long Short-Term Memory with Recurrent Neural Network (LSTM-RNN) approach. The results indicate a gap between the actual stock market index and the forecasted time series that would have occurred without COVID-19. Owing to movement restrictions, this study found that the stock markets in six countries, namely Argentina, China, South Africa, Turkey, Saudi Arabia, and the United States, were affected negatively. In addition, movement restrictions in the G7 countries excluding the United States, and in the non-G7 countries excluding Argentina, China, South Africa, Turkey, and Saudi Arabia, significantly impacted stock market performance. Generally, the LSTM predictions estimate performance in relative terms, except for the stock markets of the United Kingdom, the Republic of Korea, South Africa, and Spain. Stock market performance in the United Kingdom and Spain declined significantly during and after the onset of COVID-19. This indicates that the COVID-19 pandemic considerably influenced the stock markets of 14 G20 countries while impacting the 6 remaining countries less severely. In conclusion, our empirical evidence shows that the pandemic had restricted effects on stock market performance in G20 countries.
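A compact sketch of the forecasting setup described above: an LSTM is trained on pre-pandemic index values using sliding windows and then rolled forward to produce a counterfactual path for comparison against the actual index. The synthetic series, window length, and network size are all assumptions, not the study's configuration.

```python
import numpy as np
import tensorflow as tf

def make_windows(series, lookback=20):
    X = np.array([series[i:i + lookback] for i in range(len(series) - lookback)])
    y = series[lookback:]
    return X[..., None], y

series = np.cumsum(np.random.default_rng(1).normal(size=400)) + 100.0  # toy index
X, y = make_windows(series)
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(X.shape[1], 1)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
forecast = model.predict(X[-1:], verbose=0)   # next-step counterfactual estimate
```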
Big Data-Driven Futuristic Fabric System in Societal Digital Transformation
Chakraborty C and Khan MK
Consumer Segmentation Based on Location and Timing Dimensions Using Big Data from Business-to-Customer Retailing Marketplaces
Ehsani F and Hosseini M
Consumer segmentation is an electronic marketing practice that involves dividing consumers into groups with similar features to discover their preferences. In the business-to-customer (B2C) retailing industry, marketers explore big data to segment consumers based on various dimensions. However, among these dimensions, the motives of location and time of shopping have received relatively little attention. In this study, we use the recency, frequency, monetary, and tenure (RFMT) method to segment consumers into 10 groups based on their time and geographical features. To explore location, we investigate market distribution, revenue distribution, and consumer distribution. Geographical coordinates and peculiarities are estimated based on consumer density. Regarding time exploration, we evaluate the accuracy of product delivery and the timing of promotions. To pinpoint the target consumers, we display the main hotspots on the distribution heatmap. Furthermore, we identify the optimal time for purchase and the most densely populated locations of beneficial consumers. In addition, we evaluate product distribution to determine the most popular product categories. Based on the RFMT segmentation and product popularity, we develop a product recommender system to assist marketers in attracting and engaging potential consumers. Through a case study using data from a massive B2C retailing marketplace, we conclude that the proposed segmentation provides superior insights into consumer behavior and improves product recommendation performance.
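A hedged sketch of RFMT scoring on transaction logs: recency, frequency, monetary value, and tenure are computed per consumer and then clustered into segments. The clustering step (k-means), the toy records, and the column names are assumptions; the location and timing analyses from the article are not reproduced here.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "order_date": pd.to_datetime(["2023-01-05", "2023-03-01", "2023-02-10",
                                  "2023-02-20", "2023-03-15", "2023-01-25"]),
    "amount": [30.0, 45.0, 12.5, 20.0, 18.0, 99.0],
})
now = orders["order_date"].max()
rfmt = orders.groupby("customer_id").agg(
    recency=("order_date", lambda d: (now - d.max()).days),   # days since last order
    frequency=("order_date", "count"),
    monetary=("amount", "sum"),
    tenure=("order_date", lambda d: (now - d.min()).days),    # days since first order
)
rfmt["segment"] = KMeans(n_clusters=min(10, len(rfmt)), n_init=10, random_state=0) \
    .fit_predict(StandardScaler().fit_transform(rfmt))
```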
Big Data Confidentiality: An Approach Toward Corporate Compliance Using a Rule-Based System
Vranopoulos G, Clarke N and Atkinson S
Organizations have been investing in analytics relying on internal and external data to gain a competitive advantage. However, the legal and regulatory acts imposed nationally and internationally have become a challenge, especially for highly regulated sectors such as health or finance/banking. Data handlers such as Facebook and Amazon have already sustained considerable fines or are under investigation due to violations of data governance. The era of big data has further intensified the challenge of minimizing the risk of data loss by introducing the dimensions of Volume, Velocity, and Variety into confidentiality. Although Volume and Velocity have been extensively researched, Variety, "the ugly duckling" of big data, is often neglected and difficult to address, increasing the risk of data exposure and data loss. To mitigate this risk, this article proposes a framework that utilizes algorithmic classification and workflow capabilities to provide a consistent approach to data evaluation across the organization. A rule-based system implementing the corporate data classification policy minimizes the risk of exposure by helping users identify the approved guidelines and enforce them quickly. The framework includes an exception-handling process with appropriate approvals for extenuating circumstances. The system was implemented in a proof-of-concept working prototype to showcase its capabilities and provide hands-on experience. The information system was evaluated and accredited by a diverse audience of academics and senior business executives in the fields of security and data management. The audience had an average experience of ∼25 years and amassed a total experience of almost three centuries (294 years). The results confirmed that the 3Vs are of concern and that Variety, cited by 90% of the commentators, is the most troubling. In addition, approximately 60% on average confirmed that appropriate policies, procedures, and prerequisites for classification are in place while implementation tools are lagging.
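A toy sketch of the rule-based classification idea: each rule maps record attributes to a confidentiality label, the most restrictive matching label wins, and unmatched records are routed to an exception-handling queue. The rules, labels, and record fields are invented for illustration and do not represent the article's policy.

```python
RULES = [
    (lambda rec: "ssn" in rec or "diagnosis" in rec, "restricted"),
    (lambda rec: rec.get("region") == "EU" and "email" in rec, "confidential"),
    (lambda rec: "order_id" in rec, "internal"),
]
LEVEL = {"internal": 0, "confidential": 1, "restricted": 2}

def classify(record: dict) -> str:
    matched = [label for rule, label in RULES if rule(record)]
    # most restrictive matching label wins; unmatched records go to exception handling
    return max(matched, key=LEVEL.get) if matched else "needs_review"

print(classify({"email": "a@b.eu", "region": "EU"}))   # -> confidential
print(classify({"free_text": "unknown"}))              # -> needs_review
```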
An Expert Panel Discussion Embedding Ethics and Equity in Artificial Intelligence and Machine Learning Infrastructure
Simmons M, Hendricks-Sturrup R, Waters G, Novak L, Were M and Hussain S