International Journal of Multimedia Information Retrieval

Instance Search Retrospective with Focus on TRECVID
Awad G, Kraaij W, Over P and Satoh S
This paper presents an overview of the Video Instance Search benchmark, which was run over a period of six years (2010-2015) as part of the TREC Video Retrieval (TRECVID) workshop series. The main contributions of the paper include i) an examination of the evolving design of the evaluation framework and its components (system tasks, data, measures); ii) an analysis of the influence of topic characteristics (such as rigid/non-rigid, planar/non-planar, stationary/mobile) on performance; iii) a high-level overview of results and best-performing approaches. The Instance Search (INS) benchmark worked with a variety of large data collections, including Sound & Vision, Flickr, and BBC (British Broadcasting Corporation) Rushes for the first three pilot years, and with the small world of the BBC EastEnders series for the last three years.
3D object retrieval using salient views
Atmosukarto I and Shapiro LG
This paper presents a method for selecting salient 2D views to describe 3D objects for the purpose of retrieval. The views are obtained by first identifying salient points via a learning approach that uses shape characteristics of the 3D points (Atmosukarto and Shapiro in International workshop on structural, syntactic, and statistical pattern recognition, 2008; Atmosukarto and Shapiro in ACM multimedia information retrieval, 2008). The salient views are selected by choosing views with multiple salient points on the silhouette of the object. Silhouette-based similarity measures from Chen et al. (Comput Graph Forum 22(3):223-232, 2003) are then used to calculate the similarity between two 3D objects. Retrieval experiments were performed on three datasets: the Heads dataset, the SHREC2008 dataset, and the Princeton dataset. Experimental results show that the retrieval results using the salient views are comparable to the existing light field descriptor method (Chen et al. in Comput Graph Forum 22(3):223-232, 2003), and our method achieves a 15-fold speedup in the feature extraction computation time.
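The core selection step described above can be illustrated with a minimal sketch (not the authors' code; the view representation and salience counts here are invented for illustration): rank candidate 2D views by how many salient 3D points project onto the object's silhouette, and keep the top few.

```python
# Illustrative sketch only: each candidate view is paired with the number of
# salient points that land on the object's silhouette in that projection
# (computing those counts from a mesh is the paper's contribution, omitted here).
def select_salient_views(views, k=2):
    """views: list of (view_id, n_silhouette_salient_points) pairs.
    Returns the ids of the k views with the most salient silhouette points."""
    ranked = sorted(views, key=lambda v: v[1], reverse=True)
    return [view_id for view_id, _ in ranked[:k]]

candidates = [("v0", 3), ("v1", 7), ("v2", 5), ("v3", 1)]
print(select_salient_views(candidates, k=2))  # ['v1', 'v2']
```

Because only the selected views (rather than all candidate views) are passed to the silhouette-based matcher, far fewer view descriptors need to be extracted, which is where the reported 15-fold speedup in feature extraction comes from.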
On-the-fly learning for visual search of large-scale image and video datasets
Chatfield K, Arandjelović R, Parkhi O and Zisserman A
The objective of this work is to visually search large-scale video datasets for semantic entities specified by a text query. The paradigm we explore is constructing visual models for such semantic entities on-the-fly, i.e. at run time, by using an image search engine to source visual training data for the text query. The approach combines fast and accurate learning and retrieval, and enables videos to be returned within seconds of specifying a query. We describe three classes of queries, each with its associated visual search method: object instances (using a bag of visual words approach for matching); object categories (using a discriminative classifier for ranking key frames); and faces (using a discriminative classifier for ranking face tracks). We discuss the features suitable for each class of query, for example Fisher vectors or features derived from convolutional neural networks (CNNs), and how these choices impact on the trade-off between three important performance measures for a real-time system of this kind, namely: (1) accuracy, (2) memory footprint, and (3) speed. We also discuss and compare a number of important implementation issues, such as how to remove 'outliers' in the downloaded images efficiently, and how to best obtain a single descriptor for a face track. We also sketch the architecture of the real-time on-the-fly system. Quantitative results are given on a number of large-scale image and video benchmarks (e.g. TRECVID INS, MIRFLICKR-1M), and we further demonstrate the performance and real-world applicability of our methods over a dataset sourced from 10,000 h of unedited footage from BBC News, comprising 5M+ key frames.
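The on-the-fly paradigm can be sketched in miniature (illustrative only; the paper trains a discriminative classifier against a fixed negative pool, whereas this toy version ranks by similarity to the mean descriptor of the sourced images, and all features are invented):

```python
# Toy on-the-fly ranking: descriptors of images returned by an image search
# engine for the text query act as positives; database keyframes are then
# ranked by dot-product similarity to the mean positive descriptor.
def mean_vec(vecs):
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(len(vecs[0]))]

def rank_keyframes(query_feats, frame_feats):
    """query_feats: descriptors of images sourced for the text query.
    frame_feats: {frame_id: descriptor}. Returns ids, most similar first."""
    centroid = mean_vec(query_feats)
    score = lambda v: sum(a * b for a, b in zip(centroid, v))
    return sorted(frame_feats, key=lambda f: score(frame_feats[f]), reverse=True)

downloaded = [[1.0, 0.0], [0.9, 0.1]]            # stand-in sourced descriptors
frames = {"f1": [0.1, 0.9], "f2": [0.8, 0.2]}    # stand-in keyframe descriptors
print(rank_keyframes(downloaded, frames))         # ['f2', 'f1']
```

The real system's trade-offs (accuracy vs. memory vs. speed) then hinge on the choice of descriptor and classifier substituted into this ranking loop.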
Investigating country-specific music preferences and music recommendation algorithms with the LFM-1b dataset
Schedl M
Recently, the LFM-1b dataset has been proposed to foster research and evaluation in music retrieval and music recommender systems (Schedl, in Proceedings of the ACM International Conference on Multimedia Retrieval (ICMR), New York, 2016). It contains more than one billion music listening events created by more than 120,000 users of Last.fm. Each listening event is characterized by artist, album, and track name, and further includes a timestamp. Basic demographic information and a selection of more elaborate listener-specific descriptors are included as well, for anonymized users. In this article, we reveal information about LFM-1b's acquisition and content and compare it to existing datasets. We furthermore provide an extensive statistical analysis of the dataset, including basic properties of the item sets, demographic coverage, distribution of listening events (e.g., over artists and users), and aspects related to music preference and consumption behavior (e.g., temporal features and mainstreaminess of listeners). Exploiting country information of users and genre tags of artists, we also create taste profiles for populations and determine similar and dissimilar countries in terms of their populations' music preferences. Finally, we illustrate the dataset's usage in a simple artist recommendation task, whose results are intended to serve as a baseline against which more elaborate techniques can be assessed.
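A simple artist recommendation baseline of the kind the article describes can be sketched as follows (illustrative only; the paper's exact baseline may differ, and the listening log here is invented): recommend the globally most-played artists the target user has not yet listened to.

```python
from collections import Counter

# Popularity baseline over a listening-event log shaped like LFM-1b's
# (user, artist) pairs; timestamps, albums, and tracks are omitted here.
def recommend(events, user, k=2):
    """events: list of (user_id, artist) listening events.
    Returns up to k most-played artists the user has not heard."""
    plays = Counter(artist for _, artist in events)
    heard = {a for u, a in events if u == user}
    ranked = [a for a, _ in plays.most_common() if a not in heard]
    return ranked[:k]

log = [("u1", "A"), ("u2", "A"), ("u3", "A"),
       ("u2", "B"), ("u3", "B"), ("u3", "C")]
print(recommend(log, "u1", k=2))  # ['B', 'C']
```

More elaborate techniques (e.g., country-aware or genre-aware recommenders built on the dataset's taste profiles) can then be compared against this kind of baseline.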
Design ensemble deep learning model for pneumonia disease classification
El Asnaoui K
With the recent spread of the SARS-CoV-2 virus, computer-aided diagnosis (CAD) has received more attention. The most important CAD application is to detect and classify pneumonia diseases using X-ray images, especially during a critical period such as the COVID-19 pandemic, COVID-19 being a form of pneumonia. In this work, we aim to evaluate the performance of single and ensemble learning models for pneumonia disease classification. The ensembles used are mainly based on fine-tuned versions of InceptionResNet_V2, ResNet50, and MobileNet_V2. We collected a new dataset containing 6087 chest X-ray images on which we conduct comprehensive experiments. As a result, for a single model, we found that InceptionResNet_V2 gives an F1 score of 93.52%. In addition, the ensemble of three models (ResNet50 with MobileNet_V2 with InceptionResNet_V2) is more accurate than the other ensembles constructed (94.84% F1 score).
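The ensembling idea can be illustrated with a minimal soft-voting sketch (the model names come from the abstract, but the combination rule, class labels, and probability values here are assumptions for illustration, not the paper's implementation):

```python
# Soft-voting ensemble: average the class-probability vectors produced by the
# three fine-tuned networks and predict the class with the highest mean.
CLASSES = ["normal", "bacterial", "viral"]

def ensemble_predict(prob_lists):
    """prob_lists: one probability vector per model, aligned with CLASSES."""
    n = len(prob_lists)
    mean = [sum(p[i] for p in prob_lists) / n for i in range(len(CLASSES))]
    return CLASSES[mean.index(max(mean))]

probs = [
    [0.1, 0.7, 0.2],   # InceptionResNet_V2 (made-up outputs)
    [0.2, 0.5, 0.3],   # ResNet50            (made-up outputs)
    [0.3, 0.4, 0.3],   # MobileNet_V2        (made-up outputs)
]
print(ensemble_predict(probs))  # bacterial
```

Averaging smooths out individual models' errors, which is one common explanation for why the three-model ensemble outperforms each single network.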
A faceted approach to reachability analysis of graph modelled collections
Sabetghadam S, Lupu M, Bierig R and Rauber A
Nowadays, there is a proliferation of available information sources from different modalities: text, images, audio, video, and more. Information objects are no longer isolated; they are frequently connected via metadata, semantic links, etc. This leads to various challenges in graph-based information retrieval. This paper is concerned with the reachability analysis of multimodal graph-modelled collections. We use our framework to leverage the combination of features of different modalities through our formulation of faceted search. This study highlights the effect of different facets and link types in improving the reachability of relevant information objects. The experiments are performed on the ImageCLEF 2011 Wikipedia collection with about 400,000 documents and images. The results demonstrate that the combination of different facets is conducive to obtaining higher reachability. We obtain a 373% recall gain for very hard topics by using our graph model of the collection. Further, by adding semantic links to the collection, we gain a 10% increase in overall recall.
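The reachability idea can be made concrete with a small sketch (illustrative, not the authors' framework; the graph, node names, and link types are invented): restrict traversal to edges whose type belongs to the chosen facet set, and count how many objects a query node can reach.

```python
from collections import deque

# Faceted reachability: breadth-first search that may only follow edges
# whose link type is in the active facet set.
def reachable(graph, start, facets):
    """graph: {node: [(neighbor, link_type), ...]}. Returns reachable nodes."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nbr, ltype in graph.get(node, []):
            if ltype in facets and nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return seen

g = {"q":  [("d1", "text"), ("i1", "visual")],
     "d1": [("d2", "semantic")],
     "i1": [("d3", "metadata")]}
print(len(reachable(g, "q", {"text", "semantic"})))                     # 3
print(len(reachable(g, "q", {"text", "semantic", "visual", "metadata"})))  # 5
```

As the toy output shows, enabling more facets (here, adding visual and metadata links) strictly enlarges the reachable set, which mirrors the paper's finding that combining facets and adding semantic links improves recall.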
Interactive video retrieval evaluation at a distance: comparing sixteen interactive video search systems in a remote setting at the 10th Video Browser Showdown
Heller S, Gsteiger V, Bailer W, Gurrin C, Jónsson BÞ, Lokoč J, Leibetseder A, Mejzlík F, Peška L, Rossetto L, Schall K, Schoeffmann K, Schuldt H, Spiess F, Tran LD, Vadicamo L, Veselý P, Vrochidis S and Wu J
The Video Browser Showdown addresses difficult video search challenges through an annual interactive evaluation campaign attracting research teams focusing on interactive video retrieval. The campaign aims to provide insights into the performance of participating interactive video retrieval systems, tested by selected search tasks on large video collections. For the first time in its ten-year history, the Video Browser Showdown 2021 was organized in a fully remote setting and hosted a record number of sixteen scoring systems. In this paper, we describe the competition setting, tasks, and results, and give an overview of state-of-the-art methods used by the competing systems. By looking at query result logs provided by ten systems, we analyze differences in retrieval model performance and browsing times before a correct submission. Through advances in data gathering methodology and tools, we provide a comprehensive analysis of ad-hoc video search tasks, and discuss results, task design, and methodological challenges. We highlight that almost all top-performing systems utilize some sort of joint embedding for text-image retrieval and enable specification of temporal context in queries for known-item search. While a combination of these techniques drives the currently top-performing systems, we identify several future challenges for interactive video search engines and the Video Browser Showdown competition itself.
A review on deep learning in medical image analysis
Suganyadevi S, Seethalakshmi V and Balasamy K
Ongoing improvements in AI, particularly concerning deep learning techniques, are assisting to identify, classify, and quantify patterns in clinical images. Deep learning is the fastest-developing field in artificial intelligence and has lately been effectively utilized in numerous areas, including medicine. A brief outline is given of studies carried out by region of application: neuro/brain, retinal, pulmonary, digital pathology, breast, heart, bone, abdominal, and musculoskeletal imaging. For information exploration, knowledge deployment, and knowledge-based prediction, deep learning networks can be successfully applied to big data. In the field of medical image processing methods and analysis, fundamental information and state-of-the-art approaches with deep learning are presented in this paper. The primary goals of this paper are to present research on medical image processing as well as to define and implement the key guidelines that are identified and addressed.
Anomaly detection using edge computing in video surveillance system: review
Patrikar DR and Parate MR
The current concept of smart cities influences urban planners and researchers to provide modern, secure, and sustainable infrastructure that gives a decent quality of life to residents. To fulfill this need, video surveillance cameras have been deployed to enhance the safety and well-being of citizens. Despite technical developments in modern science, abnormal event detection in surveillance video systems remains challenging and requires exhaustive human effort. In this paper, we focus on the evolution of anomaly detection, followed by a survey of the various methodologies developed to detect anomalies in intelligent video surveillance. Further, we revisit the surveys on anomaly detection from the last decade. We then present a systematic categorization of methodologies for anomaly detection. As the notion of anomaly depends on context, we identify different objects-of-interest and publicly available datasets in anomaly detection. Since anomaly detection is a time-critical application of computer vision, we explore anomaly detection using edge devices and approaches explicitly designed for them. The confluence of edge computing and anomaly detection for real-time and intelligent surveillance applications is also explored. Further, we discuss the challenges and opportunities involved in anomaly detection using edge devices.
How can users' comments posted on social media videos be a source of effective tags?
Ellouze M
This paper proposes a new approach for extracting tags from users' comments made about videos. In fact, videos on social media platforms like Facebook and YouTube are usually accompanied by comments in which users may give opinions about things evoked in the video. The main challenge is how to extract relevant tags from them. To the best of the authors' knowledge, this is the first research work to present an approach to extract tags from comments posted about videos on social media. We do not claim that comments are a perfect solution for tagging videos; rather, we investigate the reliability of comments for tagging videos and study how they can serve as a source of tags. The proposed approach is based on filtering the comments to retain only the words that could be possible tags. We relied on self-organizing map clustering, considering that the tags of a given video are semantically and contextually close. We tested our approach on the Google YouTube 8M dataset, and the achieved results show that we can rely on comments to extract tags. As a second area of application, they could also be used to enrich and refine existing uploaders' tags, which can mitigate the bias of uploaders' tags, which are generally subjective.
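The filtering step described above can be sketched in a few lines (illustrative only; the paper additionally clusters the surviving words with a self-organizing map, which is omitted here, and the stopword list and comments are invented):

```python
import re
from collections import Counter

# Toy candidate-tag filter: keep content words that recur across the
# comments of a video, on the assumption that relevant tags are repeated.
STOPWORDS = {"the", "is", "a", "this", "so", "i", "it", "what"}  # toy list

def candidate_tags(comments, min_count=2):
    words = []
    for c in comments:
        words += re.findall(r"[a-z]+", c.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return sorted(w for w, n in counts.items() if n >= min_count)

comments = ["This goal was amazing", "what a goal!", "Amazing match, amazing goal"]
print(candidate_tags(comments))  # ['amazing', 'goal']
```

In the full approach, clustering the candidates exploits the observation that a video's true tags tend to be semantically and contextually close to one another.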
Generative adversarial networks and its applications in the biomedical image segmentation: a comprehensive survey
Iqbal A, Sharif M, Yasmin M, Raza M and Aftab S
Recent advancements in deep generative models have shown significant potential in the tasks of image synthesis, detection, segmentation, and classification. Segmenting medical images is considered a primary challenge in the biomedical imaging field. Various GAN-based models have been proposed in the literature to resolve medical segmentation challenges. Our search identified 151 papers; after twofold screening, 138 papers were selected for the final survey. A comprehensive survey is conducted on the application of GANs to medical image segmentation, primarily focused on the various GAN-based models, performance metrics, loss functions, datasets, augmentation methods, paper implementations, and source codes. Secondly, this paper provides a detailed overview of GAN applications in the segmentation of different human diseases. We conclude our research with a critical discussion, limitations of GANs, and suggestions for future directions. We hope this survey is beneficial and increases awareness of GAN implementations for biomedical image segmentation tasks.
A unified approach of detecting misleading images via tracing its instances on web and analyzing its past context for the verification of multimedia content
Varshney D and Vishwakarma DK
The verification of multimedia content on social media is one of the challenging and crucial issues in the current scenario, gaining prominence in an age where user-generated content and online social web platforms are the leading sources in shaping and propagating news stories. As these sources allow users to share their opinions without restriction, opportunistic users often post misleading/unreliable content on social media such as Twitter, Facebook, etc. At present, to lure users toward a news story, the text is often attached to some multimedia content (images/videos/audio). Verifying these contents to maintain the credibility and reliability of social media information is of paramount importance. Motivated by this, we propose a generalized system that supports the automatic classification of images as credible or misleading. In this paper, we investigate machine learning-based as well as deep learning-based approaches for verifying misleading multimedia content, where the available image traces are used to identify the credibility of the content. The experiment is performed on a real-world dataset (MediaEval 2015) collected from Twitter. It also demonstrates the efficiency of our proposed approach and features using both machine learning and deep learning models (bidirectional LSTM). The experimental results reveal that the Microsoft Bing image search engine is quite effective in retrieving titles and performs better than the Google image search engine in our study. They also show that gathering clues from the attached multimedia content (image) is more effective than detecting only posted content-based features.
Emotion-aware music tower blocks (EmoMTB ): an intelligent audiovisual interface for music discovery and recommendation
Melchiorre AB, Penz D, Ganhör C, Lesota O, Fragoso V, Fritzl F, Parada-Cabaleiro E, Schubert F and Schedl M
Music listening has experienced a sharp increase during the last decade thanks to music streaming and recommendation services. While they offer text-based search functionality and provide recommendation lists of remarkable utility, their typical mode of interaction is unidimensional, i.e., they provide lists of consecutive tracks, which are commonly inspected in sequential order by the user. The user experience with such systems is heavily affected by cognitive biases (e.g., position bias, the human tendency to pay more attention to the first positions of ordered lists) as well as algorithmic biases (e.g., popularity bias, the tendency of recommender systems to overrepresent popular items). This may cause dissatisfaction among users by preventing them from finding novel music to enjoy. In light of such systems and biases, we propose an intelligent audiovisual music exploration system named EmoMTB. It allows the user to browse the entirety of a given collection in a free, nonlinear fashion. The navigation is assisted by a set of personalized emotion-aware recommendations, which serve as starting points for the exploration experience. EmoMTB adopts the metaphor of a city, in which each track (visualized as a colored cube) represents one floor of a building. Highly similar tracks are located in the same building; moderately similar ones form neighborhoods that mostly correspond to genres. Tracks situated between distinct neighborhoods create a gradual transition between genres. Users can navigate this music city using their smartphones as control devices. They can explore districts of well-known music or decide to leave their comfort zone. In addition, EmoMTB integrates an emotion-aware music recommendation system that re-ranks the list of suggested starting points for exploration according to the user's self-identified emotion or the collective emotion expressed in EmoMTB's Twitter channel.
Evaluation of EmoMTB has been carried out in a threefold way: by quantifying the homogeneity of the clustering underlying the construction of the city, by measuring the accuracy of the emotion predictor, and by carrying out a web-based survey composed of open questions to obtain qualitative feedback from users.