VIsual Seek for Interactive Image Retrieval (VISIIR) is a project that explores new methods for semantic image annotation, i.e. the ability to predict a semantic concept from the visual content of an image. This topic has been studied extensively for more than a decade because of its many applications in areas as diverse as Information Retrieval, Computer Vision, Image Processing, and Artificial Intelligence.
This demo uses a deep convolutional neural network (CNN) to recognize food images across 101 categories. The CNN follows the OverFeat architecture (Sermanet et al. 2013). It was initialized with the weights of a network trained on the ImageNet dataset and then re-trained (fine-tuned) on our own dataset for food recognition.
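The initialization step described above can be illustrated with a toy, framework-free sketch. It is not the code behind the demo: the layer names, the tiny weight matrices, and the helper `init_for_fine_tuning` are all hypothetical, and a real setup would use a deep learning framework. The sketch only shows the general idea of fine-tuning: keep the pretrained feature weights and replace the final classifier with a freshly initialized one that has 101 outputs.

```python
import random

NUM_FOOD_CLASSES = 101  # from the text: the demo recognizes 101 food categories


def init_for_fine_tuning(pretrained, num_classes, seed=0):
    """Copy every pretrained layer except the last one, which is re-initialized.

    `pretrained` is a list of (layer_name, weight_matrix) pairs, where each row
    of a weight matrix holds the incoming weights of one output unit.
    """
    rng = random.Random(seed)
    *feature_layers, (head_name, head_weights) = pretrained
    in_features = len(head_weights[0])
    # New classifier head: small random weights, one row per target class.
    new_head = [[rng.gauss(0.0, 0.01) for _ in range(in_features)]
                for _ in range(num_classes)]
    # Feature layers are copied so that later training does not alter the originals.
    model = [(name, [row[:] for row in weights]) for name, weights in feature_layers]
    model.append((head_name, new_head))
    return model


# Toy stand-in for an ImageNet-pretrained network (shapes are illustrative only).
pretrained = [
    ("features", [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]),  # 4 units x 2 inputs
    ("classifier", [[0.0] * 4 for _ in range(1000)]),                # 1000-way head
]
model = init_for_fine_tuning(pretrained, NUM_FOOD_CLASSES)
```

After this step, training continues on the food images so that both the copied layers and the new head adapt to the target task.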
For this project, we created the UPMC Food-101 dataset, which contains 101 food categories. For each category, we gathered roughly 800 to 950 images from a Google Image search on the category title. Because the images come from a web search, the dataset may contain some noise. Below are 6 randomly chosen images from the dataset. Feel free to explore the dataset further and give us feedback about the images.
UPMC-G20 is a dataset based on UPMC Food-101 with gaze annotations. We selected 20 food categories and 100 images per category from UPMC Food-101. For each image, we collected about 15 fixations across 3 subjects, with a total duration of 2.5 seconds. The selected categories are apple-pie, bread-pudding, beef-carpaccio, beet-salad, chocolate-cake, chocolate-mousse, donuts, beignets, eggs-benedict, croque-madame, gnocchi, shrimp-and-grits, grilled-salmon, pork-chop, lasagna, ravioli, pancakes, french-toast, spaghetti-bolognese, pad-thai.
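One common way to summarize such gaze annotations is a fixation density map. The sketch below is only an illustration under stated assumptions: the text does not specify how fixations are stored, so the `(x, y, duration)` tuples, the coarse grid, and the function `fixation_density_map` are hypothetical. Each fixation contributes a Gaussian weighted by its duration, and the map is normalized to sum to 1.

```python
import math


def fixation_density_map(fixations, width, height, sigma=1.0):
    """Accumulate a duration-weighted Gaussian around each fixation.

    `fixations` is an iterable of (x, y, duration_in_seconds) tuples expressed
    in grid coordinates (an assumed format, not the dataset's actual one).
    """
    grid = [[0.0] * width for _ in range(height)]
    for fx, fy, duration in fixations:
        for y in range(height):
            for x in range(width):
                squared_dist = (x - fx) ** 2 + (y - fy) ** 2
                grid[y][x] += duration * math.exp(-squared_dist / (2.0 * sigma ** 2))
    total = sum(sum(row) for row in grid)
    if total > 0:
        # Normalize so that the map can be read as a distribution over cells.
        grid = [[value / total for value in row] for row in grid]
    return grid


# Made-up fixations on a coarse 8 x 6 grid, for illustration only.
fixations = [(2, 3, 0.20), (2, 2, 0.15), (6, 1, 0.10)]
density = fixation_density_map(fixations, width=8, height=6)

# Cell that attracted the most (duration-weighted) attention.
peak_value, peak_cell = max(
    (value, (x, y)) for y, row in enumerate(density) for x, value in enumerate(row)
)
```

Pooling the fixations of all subjects into a single map, as done here, is a design choice; per-subject maps could be kept separate instead.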
VIsual Seek for Interactive Image Retrieval (VISIIR) is a project that explores new methods for semantic image annotation. This topic has been studied extensively for more than a decade because of its many applications in areas as diverse as Information Retrieval, Computer Vision, Image Processing, and Artificial Intelligence. Semantic annotation refers to the ability to predict a semantic concept from the visual content of an image. Bridging the semantic gap between visual data and concepts is the main goal pursued by researchers in the field. In supervised learning, a large amount of labeled data is required to build effective semantic annotation tools. In interactive Content-Based Image Retrieval (CBIR) systems, the user formulates the query with an example, i.e. an image. User feedback is then commonly used to refine the query concept interactively, by asking the user whether some selected images are relevant or not. To be effective, one major challenge in interactive CBIR is to minimize the number of feedback loops required to grasp the user's semantic query.
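The feedback loop described above can be sketched with Rocchio's classical relevance feedback rule. This is a textbook technique chosen for illustration, not the method developed in VISIIR; the 2-D feature vectors, image identifiers, and weights are made up. The query starts as the feature vector of the example image and is moved toward images marked relevant and away from images marked irrelevant.

```python
def mean_vector(vectors, dim):
    """Component-wise mean of a list of vectors (zero vector if the list is empty)."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def rocchio_update(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward relevant examples and away from irrelevant ones."""
    dim = len(query)
    positive = mean_vector(relevant, dim)
    negative = mean_vector(irrelevant, dim)
    return [alpha * q + beta * p - gamma * n
            for q, p, n in zip(query, positive, negative)]


def rank(query, database):
    """Return image ids sorted by decreasing dot-product similarity to the query."""
    def score(image_id):
        return sum(q * x for q, x in zip(query, database[image_id]))
    return sorted(database, key=score, reverse=True)


# Hypothetical database of unit-norm feature vectors.
database = {
    "img_a": [1.0, 0.0],
    "img_b": [0.8, 0.6],
    "img_c": [0.0, 1.0],
}
query = list(database["img_a"])        # query by example: the user submits img_a
relevant = [database["img_c"]]         # simulated feedback: img_c is what the user wants
irrelevant = [database["img_a"]]       # ... and img_a is not

rankings = [rank(query, database)]
for _ in range(2):                     # two feedback rounds
    query = rocchio_update(query, relevant, irrelevant)
    rankings.append(rank(query, database))
```

In this toy run the ranking changes after each round, which illustrates why the number of feedback loops matters: every extra round costs the user an additional labeling effort.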
The VISIIR project proposes new interactive methods for providing powerful semantic annotation systems. The originality of the proposal is three-fold:
In terms of methodology, the first challenge for semantic annotation lies in the representation of visual content. To go one step beyond current state-of-the-art methods, we want to develop new bio-inspired representations. One key idea is to provide a hybrid representation that combines visual saliency models and unsupervised deep networks.
In the second part of VISIIR, we design new interactive learning schemes. We exploit the additional source of information provided by the eye-tracker to boost the learning quality (i.e. the active learning convergence) at two different levels. First, eye-tracker features are used in conjunction with the user's annotations to jointly optimize the classification function and the visual representations learned off-line in task 1. Second, beyond this gaze analysis, we propose to use the eye-tracker to control the learning process and develop new Human-Computer Interactions (HCI). Typically, eye-tracking statistics will act as user feedback.
Finally, one strong axis of VISIIR is the rigorous evaluation of the proposed semantic annotation methods in a specific web filtering application dedicated to food retrieval. A complete database will be provided through the project, with the goal of finding images of recipes. This fine-grained classification task will serve as a use case to validate the visual representations and interactive learning methods of tasks 1 and 2. A methodological aspect addressed in this task is the scalability of interactive search when applied to the huge number of images harvested from the web. We want to tackle this scalability bottleneck by combining efficient hashing structures for indexing and search with exploration techniques.
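To give an idea of what a hashing structure for indexing and search can look like, here is a minimal sketch of random-hyperplane locality-sensitive hashing. The text does not say which hashing scheme VISIIR uses, so this is an assumed, generic example; the feature vectors and image identifiers are hypothetical. Each image is assigned a short binary code, and a query is compared only against the images that fall in the same bucket instead of the whole collection.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw `num_bits` random hyperplanes (normal vectors) in `dim` dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def hash_code(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        int(sum(h * x for h, x in zip(plane, vector)) >= 0.0)
        for plane in hyperplanes
    )


def build_index(features, hyperplanes):
    """Group image ids into buckets keyed by their binary code."""
    index = {}
    for image_id, vector in features.items():
        index.setdefault(hash_code(vector, hyperplanes), []).append(image_id)
    return index


def candidates(query, index, hyperplanes):
    """Return the ids stored in the query's bucket (possibly an empty list)."""
    return index.get(hash_code(query, hyperplanes), [])


# Hypothetical 3-D feature vectors for a handful of images.
features = {
    "pancakes_01": [0.9, 0.1, 0.0],
    "pancakes_02": [0.8, 0.2, 0.1],
    "lasagna_01": [-0.7, 0.6, 0.2],
}
hyperplanes = make_hyperplanes(num_bits=8, dim=3)
index = build_index(features, hyperplanes)
shortlist = candidates(features["pancakes_01"], index, hyperplanes)
```

Vectors separated by a small angle are likely, though not guaranteed, to share a code. More bits give smaller buckets and faster lookups at the price of more missed neighbors, which is why such schemes are often combined with several hash tables or with exploration of nearby buckets.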
The skills required to carry out VISIIR will be provided by the consortium partners. UPMC will bring expertise in image classification and statistical learning, I3S in CBIR and scalability, L3I in visual saliency and attentional models, and Tobii in eye-tracking technology.
Université Pierre et Marie CURIE
Université Nice Sophia Antipolis
Université de La Rochelle
Industrial Partner
Financer