Earth Science Informatics

A Linked Science Investigation: Enhancing Climate Change Data Discovery with Semantic Technologies
Pouchard LC, Branstetter ML, Cook RB, Devarakonda R, Green J, Palanisamy G, Alexander P and Noy NF
Linked Science is the practice of inter-connecting scientific assets by publishing, sharing and linking scientific data and processes in end-to-end, loosely coupled workflows that allow the sharing and re-use of scientific data. Much of this data does not live in the cloud or on the Web, but rather in multi-institutional data centers that provide tools and add value through quality assurance, validation, curation, dissemination, and analysis of the data. In this paper, we make the case for the use of scientific scenarios in Linked Science. We propose a scenario in river-channel transport that requires biogeochemical experimental data and global climate-simulation model data from many sources. We focus on the use of ontologies (formal, machine-readable descriptions of the domain) to facilitate search and discovery of this data. Mercury, developed at Oak Ridge National Laboratory, is a tool for distributed metadata harvesting, search and retrieval. Mercury currently provides uniform access to more than 100,000 metadata records; 30,000 scientists use it each month. We augmented search in Mercury with ontologies, such as those in the Semantic Web for Earth and Environmental Terminology (SWEET) collection, by prototyping a component that provides access to ontology terms from Mercury. We evaluate the coverage of SWEET for the ORNL Distributed Active Archive Center (ORNL DAAC).
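To make the ontology-augmented search idea concrete, here is a minimal sketch of keyword expansion against a SWEET ontology file using rdflib. This is not the Mercury component itself; the local file name and the seed concept URI are illustrative assumptions.

```python
# Hedged sketch: expand a search keyword with subclasses from a SWEET module.
# The file name and concept URI below are illustrative, not Mercury's inputs.
from rdflib import Graph, RDFS, URIRef

g = Graph()
g.parse("sweetAll.ttl")  # assumed local copy of the SWEET ontology

# Hypothetical seed concept a user searched for.
seed = URIRef("http://sweetontology.net/phenAtmoPrecipitation/Precipitation")

# Gather the seed plus its direct subclasses as extra search terms.
expanded = {seed} | set(g.subjects(RDFS.subClassOf, seed))

# Use the local names of the concepts as additional metadata-search keywords.
keywords = sorted(str(term).rsplit("/", 1)[-1] for term in expanded)
print("Expanded search keywords:", keywords)
```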
A comprehensive social media data processing and analytics architecture by using big data platforms: a case study of twitter flood-risk messages
Podhoranyi M
The main objective of this article is to propose an advanced architecture and workflow based on the Apache Hadoop and Apache Spark big data platforms. The primary purpose of the presented architecture is collecting, storing, processing, and analysing intensive data from social media streams. This paper presents how the proposed architecture and data workflow can be applied to analyse tweets on a specific flood topic. A secondary objective is demonstrated as well: describing a flood alert situation using only tweet messages, and exploring the informative potential of such data. A predictive machine-learning approach based on Bayes' theorem was used to classify flood and no-flood messages. For this study, approximately 100,000 Twitter messages were processed and analysed. The messages were related to the flooding domain and collected over a period of 5 days (14-18 May 2018). A Spark application was developed to run data-processing commands automatically and to generate the appropriate output data. The results confirmed the advantages of many well-known features of Spark and Hadoop for social media data processing. It was noted that such technologies are ready to deal with social media data streams, but there are still challenges to take into account. Based on the flood tweet analysis, it was observed that Twitter messages are, with some caveats, informative enough to estimate general flood alert situations in particular regions. Text analysis showed that Twitter messages contain valuable spatial flood information.
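As an illustration of the classification step, the sketch below trains a Naive Bayes model on labeled tweets with Spark's MLlib. The toy data, column names, and pipeline parameters are illustrative assumptions, not the paper's exact configuration.

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("flood-tweets").getOrCreate()

# Toy stand-in for the harvested tweet stream (label 1 = flood, 0 = no flood).
tweets = spark.createDataFrame(
    [("river burst its banks near the bridge", 1.0),
     ("flood warning issued for the valley", 1.0),
     ("lovely sunny afternoon in the park", 0.0),
     ("new cafe opened downtown today", 0.0)],
    ["text", "label"],
)

# Tokenize, hash term frequencies, then fit a multinomial Naive Bayes model.
pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="features", numFeatures=1 << 16),
    NaiveBayes(featuresCol="features", labelCol="label", smoothing=1.0),
])
model = pipeline.fit(tweets)
model.transform(tweets).select("text", "prediction").show(truncate=False)
```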
Context aware benchmarking and tuning of a TByte-scale air quality database and web service
Betancourt C, Hagemeier B, Schröder S and Schultz MG
We present context-aware benchmarking and performance engineering of a mature TByte-scale air quality database system which was created by the Tropospheric Ozone Assessment Report (TOAR) and contains one of the world's largest collections of near-surface air quality measurements. A special feature of our data service https://join.fz-juelich.de is on-demand processing of several air quality metrics directly from the TOAR database. As a service used by more than 350 users of the international air quality research community, our web service must be easily accessible and functionally flexible, while delivering good performance. The current on-demand calculation of air quality metrics outside the database, together with the necessary transfer of large volumes of raw data, is identified as the major performance bottleneck. In this study, we therefore explore and benchmark in-database approaches to the statistical processing, which result in performance enhancements of up to 32%.
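As a toy illustration of the in-database approach, the snippet below computes a monthly-mean metric with a SQL aggregate so that only the finished statistic, not the raw series, crosses the database boundary. The schema and values are invented for the demo; the TOAR service itself runs on a full relational database, not SQLite.

```python
# Toy illustration: compute a monthly-mean ozone metric in the database
# instead of transferring raw rows to the client. Hypothetical schema.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE o3 (station TEXT, ts TEXT, value REAL)")
con.executemany(
    "INSERT INTO o3 VALUES (?, ?, ?)",
    [("DE001", "2018-07-01T12:00", 41.2),
     ("DE001", "2018-07-02T12:00", 44.8),
     ("DE001", "2018-08-01T12:00", 39.5)],
)

# One round-trip returns the finished metric rather than the raw series.
rows = con.execute(
    "SELECT station, substr(ts, 1, 7) AS month, AVG(value) "
    "FROM o3 GROUP BY station, month"
).fetchall()
print(rows)  # [('DE001', '2018-07', 43.0), ('DE001', '2018-08', 39.5)]
```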
Distribution and origin of anomalously high permeability zones in Weizhou formation, Weizhou 12-X oilfield, Weixinan Sag, China
Chen G, Meng Y, Huan J, Wang Y, Zhang L and Xiao L
To study the dominant seepage channels of the third member of the Weizhou Formation (E) in the Weizhou 12-X oilfield, Weixinan Sag, Beibu Gulf Basin, and to tap the potential of the remaining oil, the distribution and origin of the anomalously high permeability zones in the Weizhou Formation were studied using conventional core physical-property analysis, scanning electron microscopy, laser particle-size analysis, X-ray diffraction, and thin-section microscopic identification. The results show that, vertically, there are three anomalously high permeability zones in the A, A and A micro-stages of the middle diagenetic stage, at depth ranges of 2300-2400 m, 2400-2600 m, and 2600-2900 m, respectively. Grain size, sorting, dissolution, and early emplacement of hydrocarbons are the main causes of the anomalously high permeability zones. Although both grain size and sorting affect porosity and permeability, the effect of grain size on permeability is stronger than that of sorting, while sorting has a stronger effect on porosity than grain size. Magmatic hydrothermal fluids and organic acids promote dissolution, and the concomitant increase in porosity and permeability, by dissolving unstable minerals. The early emplacement of hydrocarbons retards cementation, and the accompanying reduction in porosity and permeability, by reducing the water-rock ratio. Overall, the sandstone reservoirs in the E are characterized by anomalously high permeability zones.
STARE into the future of geodata integrative analysis
Rilee ML, Kuo KS, Frew J, Gallagher J, Griessbaum N, Neumiller K and Wolfe RE
Different kinds of observations feature different strengths, e.g. visible-infrared imagery for clouds and radar for precipitation, and when integrated they better constrain scientific models and hypotheses. Even critical, fundamental operations such as cross-calibrations of related sensors operating on different platforms or orbits, e.g. spacecraft and aircraft, are integrative analyses. The great variety of Earth Science data types and the spatiotemporal irregularity of important low-level (ungridded) data have so far made their integration a customized, tedious process which scales in neither variety nor volume. Generic, higher-level (gridded) data products are easier to use, at the cost of being farther from the original observations and having to settle for grids, interpolation assumptions, and uncertainties that limit their applicability. The root cause of the difficulty in scalably bringing together diverse data is the current rectilinear geo-partitioning of Earth Science data into conventional arrays indexed using consecutive integer indices and then packaged into files. Such indices suffice for archival, search, and retrieval, but lack a common geospatial semantics, which is mitigated by adding floating-point encoded longitude-latitude (lon-lat) information for registration. As an alternative to floating-point lon-lat, the SpatioTemporal Adaptive Resolution Encoding (STARE) provides an encoding for geo-spatiotemporal location and neighborhood that transcends the use of files and native array indexing, allowing diverse data to be organized on scalable, distributed computing and storage platforms.
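The sketch below illustrates the core idea of a hierarchical location encoding with a simplified quadtree scheme: one integer carries both position and resolution, so coarser index prefixes name enclosing regions, and sorting by index co-locates spatial neighbors. Real STARE is built on hierarchical triangular mesh trisection rather than this lon-lat quadtree, so treat this purely as an analogy.

```python
# Simplified stand-in for the STARE idea: encode a lon-lat point as one
# integer whose trailing bits record the resolution level. Not real STARE.
def encode(lon: float, lat: float, level: int) -> int:
    lon_lo, lon_hi, lat_lo, lat_hi = -180.0, 180.0, -90.0, 90.0
    index = 0
    for _ in range(level):
        lon_mid = (lon_lo + lon_hi) / 2
        lat_mid = (lat_lo + lat_hi) / 2
        # 2 bits per level: which quadrant of the current cell holds the point.
        quadrant = (lon >= lon_mid) | ((lat >= lat_mid) << 1)
        index = (index << 2) | quadrant
        lon_lo, lon_hi = (lon_mid, lon_hi) if lon >= lon_mid else (lon_lo, lon_mid)
        lat_lo, lat_hi = (lat_mid, lat_hi) if lat >= lat_mid else (lat_lo, lat_mid)
    return (index << 5) | level  # 5 trailing bits store the resolution level

# Two nearby points share a long index prefix, so sorting by index clusters
# spatial neighbors on distributed storage.
a = encode(-105.27, 40.01, 10)
b = encode(-105.26, 40.02, 10)
print(bin(a), bin(b))
```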
An ensemble method to forecast 24-h ahead solar irradiance using wavelet decomposition and BiLSTM deep learning network
Singla P, Duhan M and Saroha S
In recent years, the penetration of solar power at residential and utility levels has progressed exponentially. However, due to its stochastic nature, predicting solar global horizontal irradiance (GHI) with high accuracy is a challenging task, yet vital for grid management: planning, scheduling, and balancing. Therefore, this paper proposes an ensemble model using the extended scope of the wavelet transform (WT) and a bidirectional long short-term memory (BiLSTM) deep learning network to forecast 24-h-ahead solar GHI. The WT decomposes the input time series into different finite intrinsic mode function (IMF) sub-series to extract its statistical features. Further, the study reduces the number of IMF series by combining the wavelet-decomposed component (D1-D6) series on the basis of comprehensive experimental analysis, with the aim of improving forecasting accuracy. Next, trained standalone BiLSTM networks are allocated to each IMF sub-series to execute the forecasting. Finally, the forecasted values of each sub-series from the BiLSTM networks are reconstructed to deliver the final solar GHI forecast. The study performed monthly solar GHI forecasting on a one-year dataset using a one-month moving-window mechanism for the location of Ahmedabad, Gujarat, India. For performance comparison, a naïve predictor as a benchmark model, standalone long short-term memory (LSTM), gated recurrent unit (GRU), and BiLSTM networks, and two other wavelet-based BiLSTM models were also simulated. From the results, it is observed that the proposed model outperforms the other models in terms of root mean square error (RMSE), mean absolute percentage error (MAPE), coefficient of determination (R²), and forecast skill (FS). The proposed model reduces the monthly average RMSE by 26.04-58.89%, 5.17-31.35%, 23.26-56.06%, and 21.08-57% in comparison with the benchmark, standalone BiLSTM, GRU, and LSTM networks, respectively. On the other hand, the monthly average MAPE is reduced by 9-51.18%, 12.59-28.14%, 30.43-59.19%, and 26.54-58.92% in comparison to the benchmark, standalone BiLSTM, GRU, and LSTM, respectively. Further, the proposed model obtained an R² of 0.94 and a forecast skill of 47% with reference to the benchmark model.
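A condensed sketch of the decompose-forecast-reconstruct scheme is given below on a synthetic hourly series. The wavelet, decomposition level, window width, and network size are illustrative choices rather than the paper's tuned settings; each wavelet component gets its own small BiLSTM, as described above.

```python
import numpy as np
import pywt
import tensorflow as tf

# Synthetic stand-in for an hourly GHI series.
ghi = np.maximum(np.sin(np.linspace(0, 60 * np.pi, 2000)), 0)

def make_windows(series, width=24):
    """Sliding 24-step windows and the next-step target."""
    X = np.stack([series[i:i + width] for i in range(len(series) - width)])
    return X[..., np.newaxis], series[width:]

# Decompose, then rebuild one time-domain component per coefficient band.
coeffs = pywt.wavedec(ghi, "db4", level=3)
components = []
for i in range(len(coeffs)):
    kept = [c if j == i else np.zeros_like(c) for j, c in enumerate(coeffs)]
    components.append(pywt.waverec(kept, "db4")[:len(ghi)])

# One standalone BiLSTM per component; sum the per-component forecasts.
forecasts = []
for comp in components:
    X, y = make_windows(comp)
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(24, 1)),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X, y, epochs=3, verbose=0)
    forecasts.append(float(model.predict(X[-1:], verbose=0).ravel()[0]))

print("next-step GHI forecast:", sum(forecasts))
```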
Real time deep learning framework to monitor social distancing using improved single shot detector based on overhead position
Gopal B and Ganesan A
The current COVID-19 coronavirus infection has caused a severe catastrophe with its deadly spread. Despite the rollout of vaccines, the severity of the infection has not diminished; it has become stronger and more destructive. The primary means of protecting ourselves from infection therefore remains social distancing. Although social distancing has been in practice for a long time, in most places it is not effectively followed, and it is very difficult to check manually at all times whether people are following it or not. Therefore, we introduce a newly developed deep-learning framework to automatically identify whether people maintain social distancing, using remote-sensing top-view images. Initially, we detect the context of the image, which includes information about the environment. Our detection model recognizes individuals using bounding boxes. A centroid is then determined for every detected bounding box. Applying the Euclidean distance, the pairwise distances between the detected bounding-box centroids are computed. A violation threshold is established to evaluate whether a measured distance falls below the minimum social-distance limit. We use an Improved Single Shot Detector model for detecting persons in an image. Experiments are carried out on remote sensing images widely collected from various environments. A variety of performance metrics for deep-learning object-detection algorithms are compared to evaluate the efficiency of the proposed model. The results show that our proposed model performs well in recognizing and detecting people.
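A minimal sketch of the distance-check stage is shown below: centroids from detector bounding boxes, pairwise Euclidean distances, and a violation threshold. The boxes and the pixel threshold are made-up values; in practice the threshold would come from the overhead camera's calibration.

```python
# Minimal sketch of the violation check: pairwise Euclidean distances between
# detected-person centroids against a minimum-distance threshold.
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical detector output: boxes as (x1, y1, x2, y2) in pixels.
boxes = np.array([[10, 10, 50, 90], [60, 12, 100, 95], [300, 20, 340, 100]])
centroids = np.column_stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                             (boxes[:, 1] + boxes[:, 3]) / 2])

MIN_DISTANCE = 75.0  # pixels; hypothetical calibration-derived threshold
dists = squareform(pdist(centroids))          # symmetric distance matrix
i, j = np.triu_indices(len(centroids), k=1)   # each unordered pair once
violations = [(int(a), int(b)) for a, b in zip(i, j)
              if dists[a, b] < MIN_DISTANCE]
print("Violating pairs:", violations)  # here: [(0, 1)]
```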
EZ-InSAR: An easy-to-use open-source toolbox for mapping ground surface deformation using satellite interferometric synthetic aperture radar
Hrysiewicz A, Wang X and Holohan EP
Satellite Interferometric Synthetic Aperture Radar (InSAR) is a space-borne geodetic technique that can map ground displacement at millimetre accuracy. With the new era for InSAR applications opened by the Copernicus Sentinel-1 SAR satellites, several open-source software packages now exist for processing SAR data. These packages enable one to obtain high-quality ground deformation maps, but still require a deep understanding of InSAR theory and the related computational tools, especially when dealing with a large stack of images. Here we present an open-source toolbox, EZ-InSAR, for a user-friendly implementation of InSAR displacement time series analysis with multi-temporal SAR images. EZ-InSAR integrates the three most popular and renowned open-source tools (i.e., ISCE, StaMPS, and MintPy) to generate interferograms and displacement time series using these state-of-the-art algorithms within a seamless Graphical User Interface. EZ-InSAR reduces the user's workload by automatically downloading the Sentinel-1 SAR imagery and the digital elevation model data for the user's area of interest, and by streamlining preparation of the input data stacks for the time series InSAR analysis. We illustrate the EZ-InSAR processing capabilities by mapping recent ground deformation at the Campi Flegrei (> 100 mm/yr) and Long Valley (~ 10 mm/yr) calderas with both Persistent Scatterer InSAR and Small-Baseline Subset approaches. We also validate the test results by comparing the InSAR displacements with Global Navigation Satellite System measurements at those volcanoes. Our tests indicate that the EZ-InSAR toolbox provided here can serve as a valuable contribution to the community for ground deformation monitoring and geohazard evaluation, as well as for disseminating bespoke InSAR observations to all.
Modeling the stubble burning generated airborne contamination with air pollution components through MATLAB
Mathur S
Air pollution is one of the most significant environmental issues, causing adverse health effects such as cancer and cardiovascular disease and raising mortality rates. High population density is a huge contributory factor to air pollution in cities and urbanized areas. Other sources of air pollution are transport, local heating, and possibly pollution transfer from neighboring industrial regions. Information about the opening and closing of industrial plants, stubble burning, and fireworks can be considered added value for this work. In recent years, pollution levels in the Delhi/National Capital Region (NCR) have spiked multiple times during the months of November and March. The most significant root cause of this trouble is air pollution from stubble burning in the neighboring states of Punjab and Haryana. Another concern is the burning of fireworks in these months for Diwali and other festivals in the cities. This research paper performs a data-based, month-wise analysis of PM2.5 and PM10 concentration levels from the past six years, drawn from authenticated sources, to find the causes of extreme air pollution levels. The paper deals mainly with long-term time series of the air pollutants PM2.5 and PM10 and other meteorological variables, and models and analyzes the PM2.5 and PM10 values for 2014-2018. The present work also deals with the facts and methods for handling the high air pollution rate in Delhi/NCR caused by stubble burning, and analyzes the environmental effect of the call by the Honorable Prime Minister of India, Shri Narendra Modi, for the lighting of candles as a symbolic fight against COVID-19.
Improving access to geodetic imaging crustal deformation data using GeoGateway
Donnellan A, Parker J, Heflin M, Glasscoe M, Lyzenga G, Pierce M, Wang J, Rundle J, Ludwig LG, Granat R, Mirkhanian M and Pulver N
GeoGateway (http://geo-gateway.org) is a web-based interface for analysis and modeling of geodetic imaging data and for supporting response to related disasters. Geodetic imaging data products currently supported by GeoGateway include Global Navigation Satellite System (GNSS) daily position time series with derived velocities and displacements, and airborne Interferometric Synthetic Aperture Radar (InSAR) from NASA's UAVSAR platform. GeoGateway allows users to layer data products in a web map interface and extract information with various tools. Extracted products can be downloaded for further analysis. GeoGateway includes overlays of California fault traces, seismicity from user-selected search parameters, and user-supplied map files. GeoGateway also provides earthquake nowcasts and hazard maps, as well as products created in response to natural disasters. A user guide is available within the GeoGateway interface. The GeoGateway development team is also growing the user base through workshops, webinars, and video tutorials. GeoGateway is used in the classroom and for research by experts and non-experts alike, including students.
OpenAltimetry - rapid analysis and visualization of Spaceborne altimeter data
Khalsa SJS, Borsa A, Nandigam V, Phan M, Lin K, Crosby C, Fricker H, Baru C and Lopez L
NASA's Ice, Cloud, and land Elevation Satellite-2 (ICESat-2) carries a laser altimeter that fires 10,000 pulses per second towards Earth and records the travel time of individual photons to measure the elevation of the surface below. The volume of data produced by ICESat-2, nearly a TB per day, presents significant challenges for users wishing to efficiently explore the dataset. NASA's National Snow and Ice Data Center (NSIDC) Distributed Active Archive Center (DAAC), which is responsible for archiving and distributing ICESat-2 data, provides search and subsetting services on mission data products, but providing interactive data discovery and visualization tools needed to assess data coverage and quality in a given area of interest is outside of NSIDC's mandate. The OpenAltimetry project, a NASA-funded collaboration between NSIDC, UNAVCO and the University of California San Diego, has developed a web-based cyberinfrastructure platform that allows users to locate, visualize, and download ICESat-2 surface elevation data and photon clouds for any location on Earth, on demand. OpenAltimetry also provides access to elevations and waveforms for ICESat (the predecessor mission to ICESat-2). In addition, OpenAltimetry enables data access via APIs, opening opportunities for rapid access, experimentation, and computation via third party applications like Jupyter notebooks. OpenAltimetry emphasizes ease-of-use for new users and rapid access to entire altimetry datasets for experts and has been successful in meeting the needs of different user groups. In this paper we describe the principles that guided the design and development of the OpenAltimetry platform and provide a high-level overview of the cyberinfrastructure components of the system.
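For the API access mentioned above, a request from a Jupyter notebook might look like the sketch below. The endpoint path and parameter names are illustrative placeholders patterned on a typical bounding-box query; consult the OpenAltimetry API documentation for the authoritative routes and required fields.

```python
import requests

# Illustrative parameters: a small bounding box, a date, and a ground track.
params = {
    "date": "2020-06-15",                 # hypothetical acquisition date
    "minx": -108.3, "miny": 38.8,
    "maxx": -108.1, "maxy": 39.0,
    "trackId": 1234,                      # hypothetical reference ground track
    "outputFormat": "json",
}

# Hypothetical route; check the OpenAltimetry docs for the real endpoint.
resp = requests.get("https://openaltimetry.org/data/api/icesat2/atl06",
                    params=params, timeout=60)
resp.raise_for_status()
elevations = resp.json()
```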
PYTAF: A Python Tool for Spatially Resampling Earth Observation Data
Zhao G, Yang M, Gao Y, Zhan Y, Lee HJ and Di Girolamo L
Earth observation data have revolutionized Earth science and significantly enhanced the ability to forecast weather, climate and natural hazards. The storage format of the majority of Earth observation data can be classified into swath, grid or point structures. Earth science studies frequently involve resampling between swath, grid and point data when combining measurements from multiple instruments, which can provide more insights into geophysical processes than using any single instrument alone. As the amount of Earth observation data increases each day, the demand for a computationally efficient tool to resample and fuse Earth observation data has never been greater. We present a software tool, called pytaf, that resamples Earth observation data stored in swath, grid or point structures using a novel block indexing algorithm. This tool is specially designed to process large-scale datasets. The core functions of pytaf were implemented in C with OpenMP to enable parallel computations in a shared-memory environment. A user-friendly Python interface was also built. The tool has been extensively tested on supercomputers and successfully used to resample the data from five instruments on the EOS-Terra platform at a mission-wide scale.
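The sketch below shows one way a block-indexing idea can speed up nearest-neighbor resampling of ungridded swath data: source pixels are binned into coarse lon-lat blocks so each target point only searches nearby bins rather than the whole swath. This is an illustration of the concept, not pytaf's actual C implementation.

```python
# Illustrative block indexing for nearest-neighbor swath resampling.
import numpy as np
from collections import defaultdict

def block_index(lon, lat, block=5.0):
    """Map every source pixel to a coarse (lon, lat) block id."""
    bins = defaultdict(list)
    for k, (x, y) in enumerate(zip(lon, lat)):
        bins[(int(x // block), int(y // block))].append(k)
    return bins

def nearest_in_blocks(tx, ty, lon, lat, bins, block=5.0):
    """Search only the 3x3 neighborhood of blocks around a target point."""
    bx, by = int(tx // block), int(ty // block)
    cand = [k for dx in (-1, 0, 1) for dy in (-1, 0, 1)
            for k in bins.get((bx + dx, by + dy), ())]
    if not cand:
        return None
    d2 = (lon[cand] - tx) ** 2 + (lat[cand] - ty) ** 2
    return cand[int(np.argmin(d2))]

rng = np.random.default_rng(0)
lon, lat = rng.uniform(-30, 30, 10_000), rng.uniform(-15, 15, 10_000)
bins = block_index(lon, lat)
print(nearest_in_blocks(3.2, -7.9, lon, lat, bins))
```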
Enhancing the classification metrics of spectroscopy spectrums using neural network based low dimensional space
Yousuff M and Babu R
Spectroscopy is a methodology for gaining knowledge of particles, especially biomolecules, by quantifying the interactions between matter and light. By examining the level of light absorbed, reflected or released by a specimen, its constituents, properties, and volume can be determined. Spectral acquisition is quick, harmless and contactless, and hence is now preferred in chemometrics. Due to the high-dimensional nature of the spectra, it is challenging to build a robust classifier with good performance metrics. Many linear and nonlinear dimensionality-reduction-based classification models have been previously implemented to overcome this issue. However, they fail to capture the subtle details of the spectra in the low-dimensional space, or cannot efficiently handle the nonlinearity present in the spectral data. We propose a graph-based neural network embedding approach to extract appropriate features into a latent space and circumvent the spectra's nonlinearity problem. Our approach performs dimensionality reduction in two phases: constructing a nearest-neighbor graph and producing an almost linear embedding using a fully connected neural network. Further, the low-dimensional embedding is subjected to classification using the Random Forest algorithm. In this paper, we have implemented and compared our technique with four nonlinear dimensionality-reduction techniques widely used for spectral data analysis. In this study, we have considered five different spectral datasets belonging to specific applications. The various classification performance metrics of all the techniques are evaluated. The proposed approach performs competitively on six different low-dimensional spaces for each dataset, with an accuracy score above 95% and a Matthews correlation coefficient value close to 1. A trustworthiness score of almost 1 shows that the presented dimensionality-reduction approach preserves the closest-neighbor structure of the high-dimensional spectral inputs in the latent space.
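The sketch below mirrors the overall pipeline on synthetic "spectra", with one openly declared substitution: scikit-learn's spectral embedding (which likewise builds a nearest-neighbor graph) stands in for the paper's neural-network embedding stage, followed by Random Forest classification and the Matthews correlation coefficient.

```python
# Stand-in pipeline: kNN-graph embedding (spectral, not the paper's neural
# network) into a low-dimensional space, then Random Forest classification.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.manifold import SpectralEmbedding
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split

# Synthetic 600-channel "spectra" stand in for real spectroscopy data.
X, y = make_classification(n_samples=500, n_features=600, n_informative=40,
                           random_state=0)

# Graph-based embedding into a 6-dimensional latent space.
Z = SpectralEmbedding(n_components=6, n_neighbors=15).fit_transform(X)

Z_tr, Z_te, y_tr, y_te = train_test_split(Z, y, random_state=0)
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(Z_tr, y_tr)
print("MCC:", matthews_corrcoef(y_te, clf.predict(Z_te)))
```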
Mapping burn severity and monitoring CO content in Türkiye's 2021 Wildfires, using Sentinel-2 and Sentinel-5P satellite data on the GEE platform
Yilmaz OS, Acar U, Sanli FB, Gulgen F and Ates AM
This study investigated the forest fires in the Mediterranean region of Türkiye between July 28, 2021, and August 11, 2021. Burn severity maps were produced with the difference normalised burn ratio (dNBR) and the difference normalised difference vegetation index (dNDVI) using Sentinel-2 images on the Google Earth Engine (GEE) cloud platform. The burned areas were estimated based on the determined burn severity degrees. Vegetation density losses in burned areas were analysed using normalised difference vegetation index (NDVI) time series. At the same time, the post-fire carbon monoxide (CO) column number densities were determined using Sentinel-5P satellite data. According to the burn severity maps obtained with dNBR, the sum of the high- and moderate-severity areas constitutes 34.64%, 20.57%, 46.43%, 51.50% and 18.88% of the entire area in the Manavgat, Gündoğmuş, Marmaris, Bodrum and Köyceğiz districts, respectively. Likewise, according to the burn severity maps obtained with dNDVI, the sum of the very-high-severity and high-severity areas constitutes 41.17%, 30.16%, 30.50%, 42.35%, and 10.40% of the entire region, respectively. In the post-fire NDVI time series analyses, sharp decreases were observed in NDVI values, from 0.8 to 0.1, in all burned areas. While the tropospheric CO column number density was 0.03 mol/m² in all the burned regions before the fire, this value increased to 0.14 mol/m² after the fire. Moreover, when the area was examined more broadly with Sentinel-5P data, the amount of CO was observed to increase up to a maximum value of 0.333 mol/m². The results of this study provide significant information for determining the severity of the 2021 forest fires in the Mediterranean region and the CO column number density after the fires. In addition, monitoring polluting gases with remote sensing techniques after forest fires is essential for understanding the extent of the damage they can cause to the environment.
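A minimal Google Earth Engine (Python API) sketch of the dNBR computation is given below, using the standard NBR = (B8 − B12)/(B8 + B12) band combination for Sentinel-2. The date windows and the bounding box for one district are placeholders, not the study's exact inputs.

```python
# Hedged GEE sketch: pre-fire minus post-fire NBR from Sentinel-2 medians.
import ee

ee.Initialize()
aoi = ee.Geometry.Rectangle([28.3, 36.7, 28.9, 37.1])  # placeholder Marmaris box

def median_nbr(start, end):
    """Cloud-filtered median composite, reduced to an NBR band."""
    img = (ee.ImageCollection("COPERNICUS/S2_SR")
           .filterBounds(aoi)
           .filterDate(start, end)
           .filter(ee.Filter.lt("CLOUDY_PIXEL_PERCENTAGE", 20))
           .median())
    return img.normalizedDifference(["B8", "B12"]).rename("NBR")

# dNBR = pre-fire NBR minus post-fire NBR; placeholder date windows.
dnbr = median_nbr("2021-07-01", "2021-07-27").subtract(
       median_nbr("2021-08-12", "2021-09-10")).rename("dNBR")
```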
A spectral optical flow method for determining velocities from digital imagery
Hurlburt N and Jaffey S
We present a method for determining surface flows from solar images based upon optical flow techniques. We apply the method to sets of images obtained by a variety of solar imagers to assess its performance. The opflow3d procedure is shown to extract accurate velocity estimates when provided with perfect test data, and it quickly generates results consistent with completely distinct methods when applied on global scales. We also validate it in detail by comparing it with an established method applied to high-resolution datasets, and find that it provides comparable results without the need to tune, filter or otherwise preprocess the images before its application.
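For readers who want to experiment with flow estimation on an image pair, the snippet below runs OpenCV's dense Farneback algorithm on a synthetic example. This is a generic optical-flow stand-in for illustration; it is not the spectral opflow3d procedure described in the paper.

```python
# Generic dense optical flow on a synthetic image pair (OpenCV Farneback).
import cv2
import numpy as np

# Synthetic pair: a bright blob shifted 3 px to the right between frames.
frame1 = np.zeros((128, 128), np.uint8)
cv2.circle(frame1, (50, 64), 10, 255, -1)
frame2 = np.roll(frame1, 3, axis=1)

flow = cv2.calcOpticalFlowFarneback(frame1, frame2, None,
                                    pyr_scale=0.5, levels=3, winsize=15,
                                    iterations=3, poly_n=5, poly_sigma=1.2,
                                    flags=0)

# Mean horizontal velocity inside the blob should be close to +3 px/frame.
mask = frame1 > 0
print("mean u:", flow[..., 0][mask].mean())
```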
Spatial and temporal changes analysis of air quality before and after the COVID-19 in Shandong Province, China
Xing H, Zhu L, Chen B, Niu J, Li X, Feng Y and Fang W
Due to the outbreak of the COVID-19 pandemic, a home quarantine policy was implemented to control the spread of the disease, which may have had a positive impact on air quality in China. In this study, the Google Earth Engine (GEE) cloud computing platform was used to obtain CO, NO₂, SO₂ and aerosol optical depth (AOD) data for December 2018-March 2019, December 2019-March 2020, and December 2020-March 2021 in Shandong Province. These data were used to study the spatial and temporal distribution of air quality changes in Shandong Province before and after the pandemic, and to analyze the reasons for the changes. The results show that: (1) Compared with the same period a year earlier, CO and NO₂ showed a decreasing trend from December 2019 to March 2020, with average total changes of 4082.36 mol/m² and 167.25 mol/m², and average total change rates of 4.80% and 38.11%, respectively. SO₂ did not decrease significantly. This is inextricably linked to the reduction in human travel and production activities under the home quarantine policy. (2) The spatial and temporal variation of AOD was similar to that of the pollutants, but showed a significant increase in January 2020, with an average total increase of 1.69 × 10, up about 2.54%, from December 2019 to March 2020. This is attributed to urban heating and the reduction of pollutants such as NO₂. (3) Pollutants and AOD were significantly correlated with meteorological data (e.g., average temperature, average humidity, average wind speed, and average precipitation). This study provides data support for atmospheric protection and air quality monitoring in Shandong Province, as well as a theoretical basis and technical guidance for policy formulation and urban planning.
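As an illustration of the data-retrieval step, the sketch below averages Sentinel-5P tropospheric NO₂ over two matched winter periods in Earth Engine and differences them. The dataset and band identifiers are the public S5P OFFL products; the rectangle is a rough placeholder for Shandong's boundary, which would normally come from an administrative-boundary asset.

```python
# Hedged GEE sketch: mean tropospheric NO2 for two periods, then their change.
import ee

ee.Initialize()
shandong = ee.Geometry.Rectangle([114.8, 34.4, 122.7, 38.4])  # rough bounds

def mean_no2(start, end):
    """Period-mean tropospheric NO2 clipped to the study region."""
    return (ee.ImageCollection("COPERNICUS/S5P/OFFL/L3_NO2")
            .select("tropospheric_NO2_column_number_density")
            .filterDate(start, end)
            .mean()
            .clip(shandong))

# Pandemic winter minus the preceding winter (placeholder date windows).
change = mean_no2("2019-12-01", "2020-04-01").subtract(
         mean_no2("2018-12-01", "2019-04-01"))
```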
Utility of the Python package Geoweaver_cwl for improving workflow reusability: an illustration with multidisciplinary use cases
Kale A, Sun Z and Ma X
Computational workflows are widely used in data analysis, enabling automated tracking of steps and storage of provenance information, leading to innovation and decision-making in the scientific community. However, the growing popularity of workflows has raised concerns about reproducibility and reusability, which can hinder collaboration between institutions and users. To address these concerns, it is important to standardize workflows or provide tools that offer a framework for describing workflows and enabling computational reusability. One such set of standards that has recently emerged is the Common Workflow Language (CWL), which offers a robust and flexible framework for data analysis tools and workflows. To promote portability, reproducibility, and interoperability of AI/ML workflows, we developed geoweaver_cwl, a Python package that automatically translates AI/ML workflows from a workflow management system (WfMS) named Geoweaver into CWL. In this paper, we test our Python package on multiple use cases from different domains. Our objective is to demonstrate and verify the utility of this package. We make all the code and datasets openly available online and briefly describe the experimental implementation of the package in this paper, confirming that geoweaver_cwl can lead to a well-described AI process while disclosing opportunities for further extensions. The package is publicly released at https://pypi.org/project/geoweaver-cwl/0.0.1/ and exemplar results are accessible at https://github.com/amrutakale08/geoweaver_cwl-usecases.