Data Science Curriculum in the iField
Many disciplines, including the broad Field of Information (iField), have been offering Data Science (DS) programs. There have been significant efforts exploring an individual discipline's identity and unique contributions to the broader DS education landscape. To advance DS education in the iField, the iSchool Data Science Curriculum Committee (iDSCC) was formed and charged with building and recommending a DS education framework for iSchools. This paper reports on the research process and findings of a series of studies addressing important questions: What is the iField identity in the multidisciplinary DS education landscape? What is the status of DS education in iField schools? What knowledge and skills should be included in the core curriculum for iField DS education? What jobs are available for DS graduates from the iField? What are the differences between graduate-level and undergraduate-level DS education? Answers to these questions not only distinguish an iField approach to DS education but also define critical components of the DS curriculum. The results will inform individual DS programs in the iField as they develop curricula to support undergraduate and graduate DS education in their local contexts.
The NLM indexer assignment dataset: a new large-scale dataset for reviewer assignment research
MEDLINE is the National Library of Medicine's (NLM) journal citation database. It contains over 28 million references to biomedical and life science journal articles, and a key feature of the database is that all articles are indexed with NLM Medical Subject Headings (MeSH). The library employs a team of MeSH indexers, and in recent years they have been asked to index close to 1 million articles per year in order to keep MEDLINE up to date. An important part of the MEDLINE indexing process is the assignment of articles to indexers. High quality and timely indexing is only possible when articles are assigned to indexers with suitable expertise. This paper introduces the NLM indexer assignment dataset: a large dataset of 4.2 million indexer article assignments for articles indexed between 2011 and 2019. The dataset is shown to be a valuable testbed for expert matching and assignment algorithms, and indexer article assignment is also found to be useful domain-adaptive pre-training for the closely related task of reviewer assignment.
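As an illustration of the kind of expert-matching baseline such a dataset could benchmark, here is a minimal content-based assignment sketch. The indexer profiles, article texts, and the TF-IDF similarity approach are hypothetical stand-ins, not the paper's method:

```python
# Minimal content-based indexer-assignment sketch: score each indexer by the
# TF-IDF cosine similarity between an incoming article and the concatenated
# text of articles that indexer has handled before, then pick the best match.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data: past assignments as (indexer_id, article_text) pairs.
past_assignments = [
    ("indexer_a", "randomized trial of statin therapy for coronary disease"),
    ("indexer_a", "cholesterol lowering drugs and cardiovascular outcomes"),
    ("indexer_b", "gene expression profiling in breast cancer tumors"),
    ("indexer_b", "BRCA1 mutations and hereditary cancer risk"),
]

# Build one expertise "profile" document per indexer.
profiles = {}
for indexer, text in past_assignments:
    profiles.setdefault(indexer, []).append(text)
indexers = sorted(profiles)
profile_docs = [" ".join(profiles[i]) for i in indexers]

new_article = "statin use and risk of myocardial infarction"

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(profile_docs + [new_article])
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()

best = indexers[scores.argmax()]
print(f"assign to {best}; scores: {dict(zip(indexers, scores.round(3)))}")
```

Real assignment systems must also balance workload and availability; a dataset of 4.2 million historical assignments makes it possible to evaluate such refinements empirically.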
Do funding sources complement or substitute? Examining the impact of cancer research publications
Academic research often draws on multiple funding sources. This paper investigates whether complementarity or substitutability emerges when different types of funding are used. Scholars have examined this phenomenon at the university and scientist levels, but not at the publication level. This gap is significant, since acknowledgement sections in scientific papers indicate that publications are often supported by multiple funding sources. To address this gap, we examine the extent to which different funding types are jointly used in publications, and to what extent certain combinations of funding are associated with higher academic impact (citation count). We focus on three types of funding accessed by UK-based researchers: national, international, and industry. The analysis builds on data extracted from all UK cancer-related publications in 2011, thus providing a 10-year citation window. Findings indicate that, although there is complementarity between national and international funding in terms of their co-occurrence (where both are acknowledged in the same publication), when we evaluate funding complementarity in relation to academic impact (employing the supermodularity framework), we find no evidence of such a relationship. Rather, our results suggest substitutability between national and international funding. We also observe substitutability between international and industry funding.
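The supermodularity test the abstract alludes to can be stated compactly. Below is the standard binary formulation of complementarity as supermodularity of the impact function; this is the textbook form, not necessarily the authors' exact specification:

```latex
% Let C(a,b) be expected impact with funding type A present (a=1) or
% absent (a=0) and funding type B present (b=1) or absent (b=0).
% Complementarity = supermodularity; substitutability = submodularity.
\[
  C(1,1) + C(0,0) \;\ge\; C(1,0) + C(0,1)
  \quad\Longleftrightarrow\quad
  \underbrace{C(1,1) - C(0,1)}_{\text{marginal gain of } A \text{ given } B}
  \;\ge\;
  \underbrace{C(1,0) - C(0,0)}_{\text{marginal gain of } A \text{ alone}}.
\]
```

Under this framing, the paper's finding of substitutability means the inequality runs the other way: acknowledging one funding type yields a smaller marginal impact gain when the other is already present.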
Feedback beyond accuracy: Using eye-tracking to detect comprehensibility and interest during reading
Knowing what information a user wants is a paramount challenge for information science and technology. Implicit feedback is key to meeting this challenge, as it allows information systems to learn about a user's needs and preferences. The available feedback, however, tends to be limited, and its interpretation proves difficult. To tackle this challenge, we present a user study that explores whether tracking the eyes can unpack part of the complexity inherent to relevance and relevance decisions. The eye behavior of 30 participants reading 18 news articles was compared with their subjectively appraised comprehensibility and interest at a discourse level. Using linear regression models, the eye-tracking signal explained 49.93% (comprehensibility) and 30.41% (interest) of variance (p < .001). We conclude that eye behavior provides implicit feedback beyond accuracy that enables new forms of adaptation and interaction support for personalized information systems.
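A minimal sketch of the modeling step, showing how explained variance (R²) is obtained from a linear regression of a self-reported rating on eye-tracking features. The feature names and data are hypothetical; only the general recipe mirrors the study:

```python
# Sketch: regress a rating (e.g., comprehensibility) on eye-tracking features
# and report explained variance (R^2). Synthetic data for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 540  # e.g., 30 readers x 18 articles

# Hypothetical predictors: mean fixation duration, re-reading (regression)
# rate, and word-skipping rate per article reading.
X = rng.normal(size=(n, 3))
# Synthetic outcome loosely driven by the features plus noise.
y = 0.6 * X[:, 0] - 0.4 * X[:, 1] + 0.2 * X[:, 2] + rng.normal(size=n)

model = LinearRegression().fit(X, y)
r2 = model.score(X, y)  # proportion of variance explained
print(f"R^2 = {r2:.2%}")  # the paper reports 49.93% / 30.41% for its models
```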
Analysis of noise and bias errors in intelligence information systems
An intelligence information system (IIS) is a particular kind of information system (IS) devoted to the analysis of intelligence relevant to national security. Professional and military intelligence analysts play a key role in this, but their judgments can be inconsistent, mainly due to noise and bias. The team-oriented aspects of the intelligence analysis process complicate the situation further. To enable analysts to achieve better judgments, the authors designed, implemented, and validated an innovative IIS for analyzing UK Military Signals Intelligence (SIGINT) data. The developed tool, the Team Information Decision Engine (TIDE), relies on an innovative preference learning method along with an aggregation procedure that combines scores from individual analysts into aggregated scores. This paper reports on a series of validation trials in which the performance of individual and team-oriented analysts was assessed with respect to their effectiveness and efficiency. Results show that the use of the developed tool enhanced the effectiveness and efficiency of the intelligence analysis process at both the individual and team levels.
"Death of social encounters": Investigating COVID-19's initial impact on virtual reference services in academic libraries
This investigation explores the initial impact of the COVID-19 pandemic on live chat virtual reference services (VRS) in academic libraries and on user behaviors from March to December 2020 using Goffman's theoretical framework (1956, 1967, 1971). Data from 300 responses by academic librarians to two longitudinal online surveys and 28 semi-structured interviews were quantitatively and qualitatively analyzed. Results revealed that academic librarians were well-positioned to provide VRS as university information hubs during pandemic shutdowns. Qualitative analysis revealed that participants received gratitude for VRS help, but also experienced frustrations and angst with limited accessibility during COVID-19. Participants reported changes including VRS volume, level of complexity, and question topics. Results reveal the range and frequency of new services with librarians striving to make personal connections with users through VRS, video consultations, video chat, and other strategies. Participants found it difficult to maintain these connections, coping through grit and mutual support when remote work became necessary. They adapted to challenges, including isolation, technology learning curves, and disrupted work routines. Librarians' responses chronicle their innovative approaches, fierce determination, emotional labor, and dedication to helping users and colleagues through this unprecedented time. Results have vital implications for the future of VRS.
How do properties of data, their curation, and their funding relate to reuse?
Despite large public investments in facilitating the secondary use of data, there is little information about the specific factors that predict data's reuse. Using data download logs from the Inter-university Consortium for Political and Social Research (ICPSR), this study examines how data properties, curation decisions, and repository funding models relate to data reuse. We find that datasets deposited by institutions, subject to many curatorial tasks, and whose access and preservation is funded externally, are used more often. Our findings confirm that investments in data collection, curation, and preservation are associated with more data reuse.
Analysis of shared research data in Spanish scientific papers about COVID-19: A first approach
During the coronavirus pandemic, changes occurred in the way science is done and shared, which motivates meta-research to help understand science communication in crises and improve its effectiveness. The objective is to study how many Spanish scientific papers on COVID-19 published during 2020 share their research data. This is a qualitative and descriptive study applying nine attributes: (a) availability, (b) accessibility, (c) format, (d) licensing, (e) linkage, (f) funding, (g) editorial policy, (h) content, and (i) statistics. We analyzed 1,340 papers, of which 1,173 (87.5%) did not share research data. A total of 12.5% share their research data, of which 2.1% share their data in repositories, 5% share their data upon simple request, 0.2% do not have permission to share their data, and 5.2% share their data as supplementary material. Only a small percentage share their research data; moreover, the results point to researchers' limited knowledge of how to share their research data properly, and even of what constitutes research data.
Trust in COVID-19 public health information
Understanding the factors that influence trust in public health information is critical for designing successful public health campaigns during pandemics such as COVID-19. We present findings from a cross-sectional survey of 454 US adults (243 older, 65+, and 211 younger, 18-64) who responded to questionnaires on human values, trust in COVID-19 information sources, attention to information quality, self-efficacy, and factual knowledge about COVID-19. Path analysis showed that trust in direct personal contacts (β = 0.071, p = .04) and attention to information quality (β = 0.251, p < .001) were positively related to self-efficacy for coping with COVID-19. The human value of self-transcendence, which emphasizes valuing others as equals and being concerned with their welfare, had significant positive indirect effects on self-efficacy in coping with COVID-19 (mediated by attention to information quality; effect = 0.049, 95% CI 0.001-0.104) and on factual knowledge about COVID-19 (also mediated by attention to information quality; effect = 0.037, 95% CI 0.003-0.089). Our path model offers guidance for fine-tuning strategies for effective public health messaging and serves as a basis for further research to better understand the societal impact of COVID-19 and other public health crises.
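For readers unfamiliar with indirect effects, the following is a generic mediation sketch on synthetic data, showing how an effect of the form predictor → mediator → outcome and its bootstrap confidence interval are typically estimated. This is a standard recipe, not the authors' full path model:

```python
# Sketch: estimate an indirect effect a*b (value -> attention -> self-efficacy)
# with a percentile bootstrap CI, on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 454
value = rng.normal(size=n)                        # e.g., self-transcendence
attention = 0.3 * value + rng.normal(size=n)      # mediator
efficacy = 0.25 * attention + rng.normal(size=n)  # outcome

def indirect(idx):
    # a: predictor -> mediator; b: mediator -> outcome, controlling for predictor
    a = LinearRegression().fit(value[idx, None], attention[idx]).coef_[0]
    b = LinearRegression().fit(
        np.column_stack([attention[idx], value[idx]]), efficacy[idx]
    ).coef_[0]
    return a * b

boot = [indirect(rng.integers(0, n, n)) for _ in range(2000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"indirect effect = {indirect(np.arange(n)):.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```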
Pandemics are catalysts of scientific novelty: Evidence from COVID-19
Scientific novelty drives the efforts to invent new vaccines and solutions during the pandemic. First-time collaboration and international collaboration are two pivotal channels to expand teams' search activities for a broader scope of resources required to address the global challenge, which might facilitate the generation of novel ideas. Our analysis of 98,981 coronavirus papers suggests that scientific novelty measured by the BioBERT model that is pretrained on 29 million PubMed articles, and first-time collaboration increased after the outbreak of COVID-19, and international collaboration witnessed a sudden decrease. During COVID-19, papers with more first-time collaboration were found to be more novel and international collaboration did not hamper novelty as it had done in the normal periods. The findings suggest the necessity of reaching out for distant resources and the importance of maintaining a collaborative scientific community beyond nationalism during a pandemic.
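One simple way to operationalize embedding-based novelty is distance from the nearest prior paper in BioBERT's representation space. The sketch below is an illustrative proxy under that assumption, not the paper's exact measure (the checkpoint name is the public dmis-lab release):

```python
# Sketch: score a paper's novelty as distance from its nearest neighbor among
# earlier papers, using mean-pooled BioBERT embeddings.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")
model = AutoModel.from_pretrained("dmis-lab/biobert-v1.1")

def embed(texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return (out * mask).sum(1) / mask.sum(1)  # mean-pooled embeddings

prior = embed(["spike protein structure of SARS coronavirus",
               "epidemiology of influenza transmission"])
new = embed(["mRNA vaccine design for SARS-CoV-2 spike protein"])

sims = torch.nn.functional.cosine_similarity(new, prior)
novelty = 1 - sims.max().item()  # far from all prior work => more novel
print(f"novelty score: {novelty:.3f}")
```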
Domain-topic models with chained dimensions: Charting an emergent domain of a major oncology conference
This paper presents a contribution to the study of bibliographic corpora through science mapping. From a graph representation of documents and their textual dimension, stochastic block models can provide a simultaneous clustering of documents and words that we call a domain-topic model. Previous work investigated the resulting topics, or word clusters, while ours focuses on the study of the document clusters we call domains. To enable the description and interactive navigation of domains, we introduce measures and interfaces that consider the structure of the model to relate both types of clusters. We then present a procedure that extends the block model to cluster metadata attributes of documents, which we call a domain-chained model, noting that our measures and interfaces transpose to metadata clusters. We provide an example application to a corpus relevant to current science, technology and society (STS) research and an interesting case for our approach: the abstracts presented between 1995 and 2017 at the American Society of Clinical Oncology Annual Meeting, the major oncology research conference. Through a sequence of domain-topic and domain-chained models, we identify and describe a group of domains that have notably grown through the last decades and which we relate to the establishment of "oncopolicy" as a major concern in oncology.
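A minimal version of the underlying construction is a stochastic block model fitted to a bipartite document-word graph, so that document clusters (domains) and word clusters (topics) emerge jointly. The sketch below uses graph-tool (assumed installed) on toy data and omits the paper's nested and metadata-chained extensions:

```python
# Sketch: domain-topic modeling as an SBM on a bipartite document-word graph.
import graph_tool.all as gt

docs = ["tumor growth inhibition trial",
        "patient survival after chemotherapy",
        "health policy and drug pricing"]

g = gt.Graph(directed=False)
name = g.vp["name"] = g.new_vertex_property("string")
kind = g.vp["kind"] = g.new_vertex_property("int")  # 0 = doc, 1 = word

vertices = {}
def get_vertex(label, k):
    if label not in vertices:
        v = g.add_vertex()
        name[v], kind[v] = label, k
        vertices[label] = v
    return vertices[label]

for i, doc in enumerate(docs):
    d = get_vertex(f"doc{i}", 0)
    for word in doc.split():
        g.add_edge(d, get_vertex(word, 1))

# Fit an SBM; block assignments on doc vertices give "domains", on word
# vertices give "topics".
state = gt.minimize_blockmodel_dl(g)
blocks = state.get_blocks()
for label, v in vertices.items():
    print(label, "-> block", blocks[v])
```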
Understanding the effects of message cues on COVID-19 information sharing on Twitter
Analyzing and documenting human information behaviors in the context of global public health crises such as the COVID-19 pandemic are critical to informing crisis management. Drawing on the Elaboration Likelihood Model, this study investigates how three types of peripheral cues-content richness, emotional valence, and communication topic-are associated with COVID-19 information sharing on Twitter. We used computational methods, combining Latent Dirichlet Allocation topic modeling with psycholinguistic indicators obtained from the Linguistic Inquiry and Word Count dictionary to measure these concepts and built a research model to assess their effects on information sharing. Results showed that content richness was negatively associated with information sharing. Tweets with negative emotions received more user engagement, whereas tweets with positive emotions were less likely to be disseminated. Further, tweets mentioning advisories tended to receive more retweets than those mentioning support and news updates. More importantly, emotional valence moderated the relationship between communication topics and information sharing-tweets discussing news updates and support conveying positive sentiments led to more information sharing; tweets mentioning the impact of COVID-19 with negative emotions triggered more sharing. Finally, theoretical and practical implications of this study are discussed in the context of global public health communication.
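A minimal sketch of the topic-derivation step with LDA in gensim, of the kind used to assign communication topics before relating them to retweet counts. The tweets are toy data, and the study's LIWC-based emotion indicators are omitted here:

```python
# Sketch: LDA topic modeling over tweets with gensim.
from gensim import corpora, models

tweets = [
    "stay home wash your hands follow cdc advisories",
    "new case numbers reported today in the state",
    "sending support and supplies to frontline workers",
    "mask mandate advisory issued for public transit",
]
texts = [t.split() for t in tweets]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(corpus, num_topics=3, id2word=dictionary,
                      random_state=0, passes=10)
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)

# Each tweet's dominant topic can then serve as a predictor of sharing.
print(lda.get_document_topics(corpus[0]))
```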
Understanding the spread of COVID-19 misinformation on social media: The effects of topics and a political leader's nudge
The spread of misinformation on social media has become a major societal issue in recent years. In this work, we used the ongoing COVID-19 pandemic as a case study to systematically investigate factors associated with the spread of multi-topic misinformation related to one event on social media, based on the heuristic-systematic model. Among factors related to systematic processing of information, we discovered that the topics of a misinformation story matter, with conspiracy theories being the most likely to be retweeted. As for factors related to heuristic processing of information, such as when citizens look to their leaders during such a crisis, our results demonstrate that the behavior of a political leader, former US President Donald J. Trump, may have nudged people's sharing of COVID-19 misinformation. The outcomes of this study help social media platforms and users better understand and prevent the spread of misinformation on social media.
A bridge too far for artificial intelligence?: Automatic classification of stanzas in Spanish poetry
The use of artificial intelligence and natural language processing techniques has increased considerably over the last few decades. Historically, the focus has been primarily on texts expressed in prose form, leaving figurative and poetic expressions of language mostly aside due to their rich semantics and syntactic complexity. The creation and analysis of poetry have commonly been carried out by hand, with a few computer-assisted approaches. In the Spanish context, the promise of machine learning is starting to pan out in specific tasks such as metrical annotation and syllabification. However, one task remains unexplored and underdeveloped: stanza classification. Classifying the inner structures of the verses upon which a poem is built is an especially relevant task for poetry studies, since it complements the structural information of a poem. In this work, we analyzed different computational approaches to stanza classification in the Spanish poetic tradition. These approaches show that the task remains hard for computer systems, whether based on classical machine learning approaches or on statistical language models, and that they cannot compete with traditional computational paradigms based on expert knowledge.
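A minimal classical-ML baseline for this task might use character n-gram TF-IDF features (which crudely capture rhyme and metre cues) with a linear classifier. The stanza texts and labels below are toy placeholders, not the paper's corpus or label set:

```python
# Sketch: character n-gram TF-IDF + logistic regression as a stanza-type
# classification baseline. Toy data for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

stanzas = [
    "en tanto que de rosa y azucena / se muestra la color en vuestro gesto",
    "volverán las oscuras golondrinas / en tu balcón sus nidos a colgar",
    "caminante no hay camino / se hace camino al andar",
    "puedo escribir los versos más tristes esta noche",
]
labels = ["cuarteto", "cuarteta", "copla", "verso libre"]  # hypothetical

clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(stanzas, labels)
print(clf.predict(["en tanto que el cabello que en la vena"]))
```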
Integrated interdisciplinary workflows for research on historical newspapers: Perspectives from humanities scholars, computer scientists, and librarians
This article considers the interdisciplinary opportunities and challenges of working with digital cultural heritage, such as digitized historical newspapers, and proposes an integrated digital hermeneutics workflow to combine purely disciplinary research approaches from computer science, the humanities, and library work. Common interests and motivations of the above-mentioned disciplines have resulted in interdisciplinary projects and collaborations such as the NewsEye project, which is working on novel solutions for how digital heritage data are (re)searched, accessed, used, and analyzed. We argue that collaborations between different disciplines can benefit from a good understanding of the workflows and traditions of each discipline involved, but must find integrated approaches to successfully exploit the full potential of digitized sources. The paper furthermore provides insight into digital tools, methods, and hermeneutics in action, showing that integrated interdisciplinary research needs to build something in between the disciplines while respecting and understanding each other's expertise and expectations.
Forensically reconstructing biomedical maintenance labor: PDF metadata under the epistemic conditions of COVID-19
This study examines the documents circulated among biomedical equipment repair technicians in order to build a conceptual model that accounts for multilayered temporality in technical healthcare professional communities. A metadata analysis informed by digital forensics and trace ethnography is employed to model the overlapping temporal, format-related, and annotation characteristics present in a corpus of repair manual files crowdsourced during collaborations between volunteer archivists and professional technicians. The corpus originates within iFixit.com's Medical Device Repair collection, a trove of more than 10,000 manuals contributed by working technicians in response to the strain placed on their colleagues and institutions due to the COVID-19 pandemic. The study focuses in particular on the Respiratory Analyzer subcategory of documents, which aid in the maintenance of equipment central to the care of COVID-19 patients experiencing respiratory symptoms. The 40 Respiratory Analyzer manuals in iFixit's collection are examined in terms of their original publication date, the apparent status of their original paper copies, the version of PDF used to encode them, and any additional metadata that is present. Based on these characteristics, the study advances a conceptual model accounting for circulation among multiple technicians, as well as alteration of documents during the course of their lifespans.
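A minimal sketch of the extraction step behind such a metadata analysis, reading the document-information fields and PDF version with the pypdf library; the file path is hypothetical:

```python
# Sketch: extract the kinds of PDF metadata the study analyzes
# (creation/modification dates, producer software, PDF version).
from pypdf import PdfReader

reader = PdfReader("respiratory_analyzer_manual.pdf")  # hypothetical file
info = reader.metadata

print("PDF header version:", reader.pdf_header)  # e.g., "%PDF-1.4"
if info is not None:
    print("Producer:", info.producer)
    print("Creator:", info.creator)
    print("Created:", info.creation_date)
    print("Modified:", info.modification_date)
```

Comparing a manual's stated publication date with the PDF creation and modification dates is what lets such a study reconstruct when a paper document was scanned, re-encoded, or annotated over its lifespan.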
Knowledge creation through collaboration: The role of shared institutional affiliations and physical proximity
This paper examines how shared affiliations within an institution (e.g., same primary appointment, same secondary appointment, same research center, same laboratory/facility) and physical proximity (e.g., walking distance between collaborator offices) shape knowledge creation through biomedical science collaboration in general, and interdisciplinary collaboration in particular. Using archival and publication data, we examine pairwise research collaborations among 1,138 faculty members over a 12-year period at a medical school in the United States. Modeling at the dyadic level, we find that shared institutional affiliations between faculty members are positively associated with knowledge creation and knowledge impact, and that this association is moderated by the physical proximity of the collaborators. We further find that the positive influence of disciplinary diversity (e.g., collaborators from different fields) on knowledge impact is stronger among pairs that share more affiliations and is significantly reduced as the physical distance between collaborators increases. These results support the idea that shared institutional affiliations and physical proximity can increase interpersonal contact, providing more opportunities to develop trust and mutual understanding, and thus alleviating some of the coordination issues that can arise with higher disciplinary diversity. We discuss the implications for future research on scientific collaborations, managerial practice regarding office space allocation, and strategic planning of initiatives aimed at promoting interdisciplinary collaboration.
The impact of emotional signals on credibility assessment
Fake news is considered one of the main threats to our society. The aim of fake news is usually to confuse readers and trigger intense emotions in them so that it spreads through social networks. Even though recent studies have explored the effectiveness of different linguistic patterns for fake news detection, the role of emotional signals has not yet been explored. In this paper, we focus on extracting emotional signals from claims and evaluating their effectiveness for credibility assessment. First, we explore different methodologies for extracting the emotional signals that can be triggered in users when they read a claim. Then, we present emoCred, a model based on long short-term memory (LSTM) networks that incorporates emotional signals extracted from the text of claims to differentiate between credible and non-credible ones. In addition, we perform an analysis to understand which emotional signals and which terms are most useful for the different credibility classes. We conduct extensive experiments and a thorough analysis on real-world datasets. Our results indicate the importance of incorporating emotional signals into the credibility assessment problem.
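A sketch of an emoCred-style architecture in PyTorch, where an LSTM encoding of the claim text is concatenated with a vector of emotional-signal intensities before classification. The dimensions, feature set, and data are illustrative assumptions, not the authors' configuration:

```python
# Sketch: LSTM claim encoder + emotion-feature fusion for credibility.
import torch
import torch.nn as nn

class EmoLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=64, n_emotions=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim + n_emotions, 2)  # credible / not

    def forward(self, token_ids, emotion_scores):
        _, (h, _) = self.lstm(self.embed(token_ids))
        # Fuse the final hidden state with per-claim emotion intensities.
        features = torch.cat([h[-1], emotion_scores], dim=1)
        return self.head(features)

model = EmoLSTMClassifier(vocab_size=1000)
tokens = torch.randint(0, 1000, (4, 20))  # batch of 4 claims, 20 tokens each
emotions = torch.rand(4, 8)               # e.g., anger, fear, joy intensities
logits = model(tokens, emotions)
print(logits.shape)  # torch.Size([4, 2])
```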
Ethnicity-based name partitioning for author name disambiguation using supervised machine learning
In several author name disambiguation studies, some ethnic name groups, such as East Asian names, are reported to be more difficult to disambiguate than others. This implies that disambiguation approaches might be improved if ethnic name groups are distinguished before disambiguation. We explore the potential of ethnic name partitioning by comparing the performance of four machine learning algorithms trained and tested on the entire data or specifically on individual name groups. Results show that ethnicity-based name partitioning can substantially improve disambiguation performance because the individual models are better suited to their respective name groups. The improvements occur across all ethnic name groups, with different magnitudes. Performance gains in predicting matched name pairs outweigh losses in predicting nonmatched pairs. Feature (e.g., coauthor name) similarities of name pairs vary across ethnic name groups. Such differences may enable the development of ethnicity-specific feature weights to improve prediction for specific ethnic name categories. These findings are observed for three labeled datasets with a natural distribution of problem sizes, as well as one in which all ethnic name groups are controlled to have the same number of ambiguous names. This study is expected to motivate scholars to group author names by ethnicity prior to disambiguation.
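A minimal sketch of the partitioning idea: train one pairwise-classification model per name group rather than a single pooled model, and apply each only within its group. The features, grouping, and synthetic labels are hypothetical placeholders for the study's real pairwise features:

```python
# Sketch: pooled vs. per-group disambiguation models on synthetic name pairs.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def make_pairs(n):
    """Hypothetical pairwise features (e.g., coauthor, venue, title
    similarities) and matched/nonmatched labels for name pairs."""
    X = rng.random((n, 3))
    y = (X.sum(axis=1) + rng.normal(scale=0.3, size=n) > 1.5).astype(int)
    return X, y

groups = {"east_asian": make_pairs(500), "english": make_pairs(500)}

# Pooled baseline: one model over all name pairs.
X_all = np.vstack([X for X, _ in groups.values()])
y_all = np.concatenate([y for _, y in groups.values()])
pooled = RandomForestClassifier(random_state=0).fit(X_all, y_all)

# Partitioned: one model per name group, applied only within its group.
per_group = {g: RandomForestClassifier(random_state=0).fit(X, y)
             for g, (X, y) in groups.items()}

for g, (X, y) in groups.items():
    print(g, "pooled:", pooled.score(X, y),
          "partitioned:", per_group[g].score(X, y))
```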
Medieval Spanish (12th-15th centuries) named entity recognition and attribute annotation system based on contextual information
The recognition of named entities in Spanish medieval texts presents great complexity and involves specific challenges: first, the complex morphosyntactic characteristics of proper-noun use in medieval texts; second, the lack of strict orthographic standards; and finally, diachronic and geographical variation in Spanish from the 12th to the 15th century. In this period, named entities usually appear as complex text structures; for example, it was frequent to add nicknames and information about a person's role in society and geographic origin. To tackle this complexity, a named entity recognition and classification system has been implemented. The system uses contextual cues based on semantics to detect entities and assign them a type. Given the occurrence of entities with attached attributes, entity contexts are also parsed to determine entity-type-specific dependencies for these attributes. Moreover, the system uses a variant generator to handle the diachronic evolution of Spanish medieval terms from a phonetic and morphosyntactic viewpoint. The tool iteratively enriches its own lexica, dictionaries, and gazetteers. The system was evaluated on a corpus of over 3,000 manually annotated entities of different types and periods, obtaining F1 scores between 0.74 and 0.87. Attribute annotation was evaluated for person and role name attributes, with an overall F1 of 0.75.
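A toy sketch of contextual-cue entity detection of the kind such a system relies on, where trigger words signal an entity type for the capitalized span that follows, alongside a simple orthographic variant generator. The rules, cues, and examples are illustrative, not the authors' grammar:

```python
# Sketch: cue-triggered entity typing plus a toy diachronic variant generator.
import re

CUES = {
    r"\bdon\b": "PERSON",
    r"\bobispo de\b": "ROLE+PLACE",
    r"\bvilla de\b": "PLACE",
}

def variants(term):
    """Toy diachronic variant generator (e.g., f/h alternation: fijo~hijo)."""
    out = {term}
    if term.startswith("f"):
        out.add("h" + term[1:])
    return out

def find_entities(text):
    entities = []
    for cue, etype in CUES.items():
        # A cue followed by a capitalized token yields a typed entity.
        for m in re.finditer(cue + r"\s+([A-ZÁÉÍÓÚ][\wáéíóúñ]+)", text):
            entities.append((m.group(1), etype))
    return entities

print(find_entities("Ferrand Gonçález, obispo de Burgos, e don Rodrigo"))
print(variants("fijo"))
```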
Loosen control without losing control: Formalization and decentralization within commons-based peer production
This study considers commons-based peer production (CBPP) by examining the organizational processes of the free/libre open-source software community, Drupal. It does so by exploring the sociotechnical systems that have emerged around both Drupal's development and its face-to-face communitarian events. There has been criticism of the simplistic nature of previous research into free software; this study addresses this by linking studies of CBPP with a qualitative study of Drupal's organizational processes. It focuses on the evolution of organizational structures, identifying the intertwined dynamics of formalization and decentralization, resulting in coexisting sociotechnical systems that vary in their degrees of organicity.