Expanding the concept of ID conversion in TogoID by introducing multi-semantic and label features
TogoID (https://togoid.dbcls.jp/) is an identifier (ID) conversion service designed to link IDs across diverse categories of life science databases. With its ability to obtain IDs related in different semantic relationships, a user-friendly web interface, and a regular automatic data update system, TogoID has been a valuable tool for bioinformatics.
FAIR Data Cube, a FAIR data infrastructure for integrated multi-omics data analysis
We are witnessing an enormous growth in the amount of molecular profiling (-omics) data. The integration of multi-omics data is challenging. Moreover, human multi-omics data may be privacy-sensitive and can be misused to de-anonymize and (re-)identify individuals. Hence, most biomedical data is kept in secure and protected silos. Therefore, it remains a challenge to re-use these data without infringing the privacy of the individuals from whom the data were derived. Federated analysis of Findable, Accessible, Interoperable, and Reusable (FAIR) data is a privacy-preserving solution to make optimal use of these multi-omics data and transform them into actionable knowledge.
MeSH2Matrix: combining MeSH keywords and machine learning for biomedical relation classification based on PubMed
Biomedical relation classification has been significantly improved by the application of advanced machine learning techniques on the raw texts of scholarly publications. Despite this improvement, the reliance on large chunks of raw text makes these algorithms suffer in terms of generalization, precision, and reliability. The use of the distinctive characteristics of bibliographic metadata can prove effective in achieving better performance for this challenging task. In this research paper, we introduce an approach for biomedical relation classification using the qualifiers of co-occurring Medical Subject Headings (MeSH). First of all, we introduce MeSH2Matrix, our dataset consisting of 46,469 biomedical relations curated from PubMed publications using our approach. Our dataset includes a matrix that maps associations between the qualifiers of subject MeSH keywords and those of object MeSH keywords. It also specifies the corresponding Wikidata relation type and the superclass of semantic relations for each relation. Using MeSH2Matrix, we build and train three machine learning models (Support Vector Machine [SVM], a dense model [D-Model], and a convolutional neural network [C-Net]) to evaluate the efficiency of our approach for biomedical relation classification. Our best model achieves an accuracy of 70.78% for 195 classes and 83.09% for five superclasses. Finally, we provide a confusion matrix analysis and extensive feature analyses to better examine the relationship between the MeSH qualifiers and the biomedical relations being classified. Our results will hopefully shed light on developing better algorithms for biomedical ontology classification based on the MeSH keywords of PubMed publications. For reproducibility purposes, MeSH2Matrix and all our source code are publicly accessible at https://github.com/SisonkeBiotik-Africa/MeSH2Matrix.
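As a rough illustration of a matrix that maps subject qualifiers to object qualifiers, the following minimal sketch counts qualifier pairs across publications in which a subject and an object MeSH keyword co-occur. The abstract does not state how the MeSH2Matrix cells are actually computed, so the counting rule, the three-qualifier vocabulary, the function name qualifier_matrix, and the toy publications are all assumptions made for illustration only.

```python
from itertools import product

# Assumption: a tiny qualifier vocabulary chosen for illustration;
# the real dataset covers the full set of MeSH qualifiers.
QUALIFIERS = ["adverse effects", "drug therapy", "therapeutic use"]
INDEX = {q: i for i, q in enumerate(QUALIFIERS)}


def qualifier_matrix(publications):
    """Count (subject qualifier, object qualifier) pairs.

    `publications` is a list of (subject_qualifiers, object_qualifiers)
    tuples, one per publication in which both keywords co-occur.
    Rows index subject qualifiers, columns index object qualifiers.
    """
    n = len(QUALIFIERS)
    matrix = [[0] * n for _ in range(n)]
    for subject_quals, object_quals in publications:
        for s, o in product(subject_quals, object_quals):
            matrix[INDEX[s]][INDEX[o]] += 1
    return matrix


# Hypothetical publications for one (subject, object) keyword pair.
toy_publications = [
    (["therapeutic use"], ["drug therapy"]),
    (["therapeutic use", "adverse effects"], ["drug therapy"]),
]
matrix = qualifier_matrix(toy_publications)
for qualifier, row in zip(QUALIFIERS, matrix):
    print(f"{qualifier:>16}: {row}")
```

A matrix of this kind could then be flattened or fed directly to a classifier; which of these the three models in the abstract do is not specified there.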
Dynamic Retrieval Augmented Generation of Ontologies using Artificial Intelligence (DRAGON-AI)
Ontologies are fundamental components of informatics infrastructure in domains such as biomedical, environmental, and food sciences, representing consensus knowledge in an accurate and computable form. However, their construction and maintenance demand substantial resources and close collaboration between domain experts, curators, and ontology experts. We present Dynamic Retrieval Augmented Generation of Ontologies using AI (DRAGON-AI), an ontology generation method employing Large Language Models (LLMs) and Retrieval Augmented Generation (RAG). DRAGON-AI can generate textual and logical ontology components, drawing from existing knowledge in multiple ontologies and unstructured text sources.
Annotation of epilepsy clinic letters for natural language processing
Natural language processing (NLP) is increasingly being used to extract structured information from unstructured text to assist clinical decision-making and aid healthcare research. The availability of expert-annotated documents for the development and validation of NLP applications is limited. We created synthetic clinical documents to address this, and to validate the Extraction of Epilepsy Clinical Text version 2 (ExECTv2) NLP pipeline.
Mapping vaccine names in clinical trials to vaccine ontology using cascaded fine-tuned domain-specific language models
Vaccines have revolutionized public health by providing protection against infectious diseases. They stimulate the immune system and generate memory cells to defend against targeted diseases. Clinical trials evaluate vaccine performance, including dosage, administration routes, and potential side effects.
Concretizing plan specifications as realizables within the OBO foundry
Within the Open Biological and Biomedical Ontology (OBO) Foundry, many ontologies represent the execution of a plan specification as a process in which a realizable entity that concretizes the plan specification, a "realizable concretization" (RC), is realized. This representation, which we call the "RC-account", provides a straightforward way to relate a plan specification to the entity that bears the realizable concretization and the process that realizes the realizable concretization. However, the adequacy of the RC-account has not been evaluated in the scientific literature. In this manuscript, we provide this evaluation and, thereby, give ontology developers sound reasons to use or not use the RC-account pattern.
An extensible and unifying approach to retrospective clinical data modeling: the BrainTeaser Ontology
Automatic disease progression prediction models require large amounts of training data, which are seldom available, especially when it comes to rare diseases. A possible solution is to integrate data from different medical centres. Nevertheless, various centres often follow diverse data collection procedures and assign different semantics to collected data. Ontologies, used as schemas for interoperable knowledge bases, represent a state-of-the-art solution to homologate the semantics and foster data integration from various sources. This work presents the BrainTeaser Ontology (BTO), an ontology that models the clinical data associated with two brain-related rare diseases (ALS and MS) in a comprehensive and modular manner. BTO assists in organizing and standardizing the data collected during patient follow-up. It was created by harmonizing schemas currently used by multiple medical centres into a common ontology, following a bottom-up approach. As a result, BTO effectively addresses the practical data collection needs of various real-world situations and promotes data portability and interoperability. BTO captures various clinical occurrences, such as disease onset, symptoms, diagnostic and therapeutic procedures, and relapses, using an event-based approach. Developed in collaboration with medical partners and domain experts, BTO offers a holistic view of ALS and MS for supporting the representation of retrospective and prospective data. Furthermore, BTO adheres to Open Science and FAIR (Findable, Accessible, Interoperable, and Reusable) principles, making it a reliable framework for developing predictive tools to aid in medical decision-making and patient care. Although BTO is designed for ALS and MS, its modular structure makes it easily extendable to other brain-related diseases, showcasing its potential for broader applicability. Database URL: https://zenodo.org/records/7886998
Chemical entity normalization for successful translational development of Alzheimer's disease and dementia therapeutics
Identifying chemical mentions within the Alzheimer's and dementia literature can provide a powerful tool to further therapeutic research. Leveraging the Chemical Entities of Biological Interest (ChEBI) ontology, which is rich in hierarchical and other relationship types, for entity normalization can provide an advantage for future downstream applications. We provide a reproducible hybrid approach that combines an ontology-enhanced PubMedBERT model for disambiguation with a dictionary-based method for candidate selection.
Optimized continuous homecare provisioning through distributed data-driven semantic services and cross-organizational workflows
In healthcare, collaboration between different caregivers is increasing, especially given the shift to homecare. To provide optimal patient care, efficient coordination of data and workflows between these different stakeholders is required. To achieve this, data should be exposed in a machine-interpretable, reusable manner. In addition, there is a need for smart, dynamic, personalized and performant services provided on top of this data. Flexible workflows should be defined that realize their desired functionality, adhere to use case specific quality constraints and improve coordination across stakeholders. User interfaces should allow configuring all of this in an easy, user-friendly way.
Empowering standardization of cancer vaccines through ontology: enhanced modeling and data analysis
The exploration of cancer vaccines has yielded a multitude of studies, resulting in a diverse collection of information. The heterogeneity of cancer vaccine data significantly impedes effective integration and analysis. While CanVaxKB serves as a pioneering database for over 670 manually annotated cancer vaccines, it is important to distinguish that a database, on its own, does not offer the structured relationships and standardized definitions found in an ontology. Recognizing this, we expanded the Vaccine Ontology (VO) to include those cancer vaccines present in CanVaxKB that were not initially covered, enhancing VO's capacity to systematically define and interrelate cancer vaccines.
Correction to: Semantic units: organizing knowledge graphs into semantically meaningful units of representation
Multi-task transfer learning for the prediction of entity modifiers in clinical text: application to opioid use disorder case detection
The semantics of entities extracted from a clinical text can be dramatically altered by modifiers, including entity negation, uncertainty, conditionality, severity, and subject. Existing models for determining modifiers of clinical entities involve regular expressions or feature weights that are trained independently for each modifier.
Elucidating the semantics-topology trade-off for knowledge inference-based pharmacological discovery
Leveraging AI for synthesizing the deluge of biomedical knowledge has great potential for pharmacological discovery with applications including developing new therapeutics for untreated diseases and repurposing drugs as emergent pandemic treatments. Creating knowledge graph representations of interacting drugs, diseases, genes, and proteins enables discovery via embedding-based ML approaches and link prediction. Previously, it has been shown that these predictive methods are susceptible to biases from network structure, namely that they are driven not by discovering nuanced biological understanding of mechanisms, but based on high-degree hub nodes. In this work, we study the confounding effect of network topology on biological relation semantics by creating an experimental pipeline of knowledge graph semantic and topological perturbations. We show that the drop in drug repurposing performance from ablating meaningful semantics increases by 21% and 38% when mitigating topological bias in two networks. We demonstrate that new methods for representing knowledge and inferring new knowledge must be developed for making use of biomedical semantics for pharmacological innovation, and we suggest fruitful avenues for their development.
Leveraging logical definitions and lexical features to detect missing IS-A relations in biomedical terminologies
Biomedical terminologies play a vital role in managing biomedical data. Missing IS-A relations in a biomedical terminology could be detrimental to its downstream usages. In this paper, we investigate an approach combining logical definitions and lexical features to discover missing IS-A relations in two biomedical terminologies: SNOMED CT and the National Cancer Institute (NCI) thesaurus. The method is applied to unrelated concept-pairs within non-lattice subgraphs: graph fragments within a terminology likely to contain various inconsistencies. Our approach first compares whether the logical definition of a concept is more general than that of the other concept. Then, we check whether the lexical features of the concept are contained in those of the other concept. If both constraints are satisfied, we suggest a potentially missing IS-A relation between the two concepts. The method identified 982 potential missing IS-A relations for SNOMED CT and 100 for NCI thesaurus. In order to assess the efficacy of our approach, a random sample of results belonging to the "Clinical Findings" and "Procedure" subhierarchies of SNOMED CT and results belonging to the "Drug, Food, Chemical or Biomedical Material" subhierarchy of the NCI thesaurus were evaluated by domain experts. The evaluation results revealed that 118 out of 150 suggestions are valid for SNOMED CT and 17 out of 20 are valid for NCI thesaurus.
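The two-constraint check described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes a logical definition can be flattened into a set of (attribute, value) pairs, treats "more general" as set inclusion of those pairs, and takes lexical features to be the lowercase words of a concept name. The concept identifiers C1 and C2, their names, and their definitions are hypothetical.

```python
from typing import Dict, FrozenSet, Tuple

# Assumption: a logical definition is modelled as a flat set of
# (attribute, value) pairs; real definitions are richer than this.
Definition = FrozenSet[Tuple[str, str]]


def lexical_features(name: str) -> FrozenSet[str]:
    """Lexical features, simplified here to the set of lowercase words."""
    return frozenset(name.lower().split())


def is_more_general(general: Definition, specific: Definition) -> bool:
    """True if every pair in `general` also appears in `specific`."""
    return len(general) > 0 and general <= specific


def suggest_missing_isa(
    child: str,
    parent: str,
    definitions: Dict[str, Definition],
    names: Dict[str, str],
) -> bool:
    """Suggest `child IS-A parent` only when both constraints hold."""
    logical_ok = is_more_general(definitions[parent], definitions[child])
    lexical_ok = lexical_features(names[parent]) <= lexical_features(names[child])
    return logical_ok and lexical_ok


# Hypothetical, unrelated concept pair used only for illustration.
NAMES = {"C1": "fracture", "C2": "fracture of femur"}
DEFINITIONS = {
    "C1": frozenset({("associated morphology", "fracture")}),
    "C2": frozenset(
        {("associated morphology", "fracture"), ("finding site", "femur")}
    ),
}

print(suggest_missing_isa("C2", "C1", DEFINITIONS, NAMES))
print(suggest_missing_isa("C1", "C2", DEFINITIONS, NAMES))
```

Requiring both constraints makes the check conservative: a pair that passes only the logical test or only the lexical test is not suggested. The expert evaluation reported above corresponds to 118/150 (about 78.7%) valid suggestions for SNOMED CT and 17/20 (85%) for the NCI thesaurus.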
Semantic units: organizing knowledge graphs into semantically meaningful units of representation
In today's landscape of data management, knowledge graphs and ontologies are becoming increasingly important as mechanisms aligned with the FAIR Guiding Principles, which ensure that data and metadata are Findable, Accessible, Interoperable, and Reusable. We discuss three challenges that may hinder the effective exploitation of the full potential of FAIR knowledge graphs.
Explanatory argumentation in natural language for correct and incorrect medical diagnoses
A huge amount of research is carried out nowadays in Artificial Intelligence to propose automated ways to analyse medical data with the aim of supporting doctors in delivering medical diagnoses. However, a major issue with these approaches is the lack of transparency and interpretability of the achieved results, making it hard to employ such methods for educational purposes. It is therefore necessary to develop new frameworks to enhance explainability in these solutions.
RecSOI: recommending research directions using statements of ignorance
The more science advances, the more questions are asked. This compounding growth can make it difficult to keep up with current research directions. Furthermore, this difficulty is exacerbated for junior researchers who enter fields with already large bases of potentially fruitful research avenues. In this paper, we propose a novel task and a recommender system for research directions, RecSOI, that draws from statements of ignorance (SOIs) found in the research literature. By building researchers' profiles based on textual elements, RecSOI generates personalized recommendations of potential research directions tailored to their interests. In addition, RecSOI provides context for the recommended SOIs, so that users can quickly evaluate how relevant the research direction is for them. In this paper, we provide an overview of RecSOI's functioning, implementation, and evaluation, demonstrating its effectiveness in guiding researchers through the vast landscape of potential research directions.
Comparing generative and extractive approaches to information extraction from abstracts describing randomized clinical trials
Systematic reviews of Randomized Controlled Trials (RCTs) are an important part of the evidence-based medicine paradigm. However, the creation of such systematic reviews by clinical experts is costly as well as time-consuming, and results can quickly become outdated after publication. Most RCTs are structured based on the Patient, Intervention, Comparison, Outcomes (PICO) framework, and many approaches exist that aim to extract PICO elements automatically. The automatic extraction of PICO information from RCTs has the potential to significantly speed up the creation of systematic reviews and thereby benefit the field of evidence-based medicine.
Ontological representation, modeling, and analysis of parasite vaccines
Pathogenic parasites are responsible for multiple diseases, such as malaria and Chagas disease, in humans and livestock. Traditionally, pathogenic parasites have been a largely elusive target for vaccine design, with most successful vaccines only emerging recently. To aid vaccine design, the VIOLIN vaccine knowledgebase has collected vaccines from all sources to serve as a comprehensive vaccine knowledgebase. VIOLIN utilizes the Vaccine Ontology (VO) to standardize the modeling of vaccine data. However, VO did not model complex life cycles as seen in parasites. With the inclusion of successful parasite vaccines, an update in parasite vaccine modeling was needed.
Enriching the FIDEO ontology with food-drug interactions from online knowledge sources
The increasing number of articles on adverse interactions that may occur when specific foods are consumed with certain drugs makes it difficult to keep up with the latest findings. Conflicting information is available in the scientific literature and specialized knowledge bases because interactions are described in an unstructured or semi-structured format. The FIDEO ontology aims to integrate and represent information about food-drug interactions in a structured way. This article reports on the new version of this ontology in which more than 1700 interactions are integrated from two online resources: DrugBank and Hedrine. These food-drug interactions have been represented in FIDEO in the form of precompiled concepts, each of which specifies both the food and the drug involved. Additionally, competency questions that can be answered are reviewed, and avenues for further enrichment are discussed.