
Image Search or Collection Guiding?
Stéphane Marchand-Maillet

http://www.medienkunstnetz.de/themes/mapping_and_text/image-search/

The volume of electronic documents generated worldwide grows consistently and exponentially. It was recently reported that in 2002 alone, five billion gigabytes of original data were created worldwide (enough to fill fifty million current generic hard disk drives). Part of this massive volume corresponds to textual exchanges (such as e-mails) and may thus be managed at a semantic level by techniques parallel to text information retrieval, applied successfully in Google, for example. However, a significant part of this data corresponds to visual multimedia information such as images or videos. For these visual documents, management cannot be performed automatically with high accuracy. This is due to the well-known semantic gap, defined as the discrepancy between the capability of a machine and that of a human to perceive visual content. Despite several decades of research, automated image (and video) content analysis is still too poor to reliably replace humans in management tasks. Locally, in the Viper group [VIPER], we follow several research directions that should lead to complementary solutions to the problem of inferring semantically meaningful interpretations of visual content. Our initial research on Content-based Image Retrieval, «GIFT» [GIFT], has led us to consider annotated image collections. The problem of annotation is itself far from trivial, and we look at how to assign textual labels to still pictures [ANNOTATION]. Note that this is very much in line with the Semantic Web initiative [SEMWEB]. We also look at how to extract text automatically (using learning machines) from visual content [1], and describe our advances in multimedia data visualization.

We face a context in which the management of document collections must be automated as much as possible, but where a human operator remains necessary to reach a sufficient level of efficiency and accuracy. Even the simple example of a private user managing his or her own digital photo and video collection already calls for a number of tools to keep track of all the content efficiently. Content-based tools address such problems. They aim at facilitating document search and retrieval based solely on the automated analysis and characterization of visual content. While they do succeed at search tasks, this offers only a partial solution to the management problem. It may well be that the problem is not so much finding something in a collection as investigating the collection itself.

Here, we look at visual multimedia information management in a ‹queryless› context. The user is faced with a (large) collection of multimedia documents, and the tools should help him or her gain a comprehensive view of its content effectively. The baseline is a simple viewing system that shows items in random order. While this may look like a purely technical challenge, its development involves understanding human perception of visual content and leads to problems that do not have purely analytical solutions.

Starting from the context in which these tools should be developed, we look at what tools we have at hand for achieving our goal and assemble them in a unified framework. This directs us towards the concept of «Collection Guiding» [CGUIDE], where the user embarks on a visit through a multimedia space automatically created using state-of-the-art techniques for automated visual document analysis. Our approach re-locates the user at the center of the system and places the emphasis back on Human-Computer Interaction.

Human perception of visual multimedia documents

Studies show that a person can simultaneously handle fewer than a thousand photos (clearly, this depends on the diversity of the images and on the task). Today's photo cameras can store several hundred photos. The well-known GIMP-Savvy free photo archive contains about 27,000 images, and Google indexes 425 million images acquired from the WWW (as of February 2004). A commercial image provider like Corbis must manage a catalog of more than a million items to remain competitive.

In the latter case, it is important that no ‹dark zone› is created; in other words, the manager should maintain fluent access to every item in the database. The problem is therefore twofold: the manager must first know that an image of a given type exists, and then know how to formulate a request to the system to actually retrieve it.

While the Content-based Visual Indexing tools that we discuss in the next section may solve the second part of the problem, keeping an accurate overview of one's visual assets is far from trivial. However, as discussed towards the end of this text, on the way to solving this part one finds interesting visual properties and features that go beyond the technicalities of building such a system.

Searching for visual multimedia documents

Most current Multimedia Information Management frameworks are query-based. That is, they rely on the assumption that the user is looking for something and has a good idea of what (s)he is looking for. This maps onto the concept of Query-by-Example, where the user is able to produce an example of what (s)he is looking for. Browsing is another concept for searching information; it, too, assumes that the user holds the definition of a specific target. In both cases, the user must be able to produce a query for the information needed.

Following the classical phrase that «an image is worth a thousand words,» the query-by-visual-example (QBE) paradigm simply seeks to avoid the tedious and imprecise textual description of the wanted item. When looking for an image, the user of a QBE system is asked to supply one or more positive or negative image examples to express the desired features. Clearly, the underlying goal is to capture content semantics so as to reach an accurate level of retrieval. However, recent studies have converged towards the use of relatively basic visual content characteristics. The most used features are:

• Color, characterized by numerical values or an index in a palette;

• Texture, capturing the regular pattern of the visual content at hand. Numerical values associated with it may be the dominant orientation or some measure of coarseness;

• Shape, encoding the composition of recognizable objects. Geometrical values such as area and perimeter may be used here.

Clearly these aspects characterize visual content but carry little or no semantics. To extract more meaningful visual properties, system designers have moved towards specific classification, such as:

• Visual properties, capturing the setting of an image, such as landscape, cityscape or seascape;

• Text, looking for any textual cue within the visual content for better identification;

• Human faces, whose detection may be performed automatically in a reliable fashion and which form an efficient cue for classification;

• Object detectors (such as a «car detector»), which may finally be designed and fine-tuned in a very generic setting.
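The low-level features listed above can be sketched in a few lines. The following is a minimal illustration, not the feature set of any particular system: it assumes an image given as a NumPy array of shape (height, width, 3) with 8-bit RGB values, and the function names are our own.

```python
import numpy as np

def color_histogram(image, bins_per_channel=4):
    """Quantize each RGB channel into a small palette and count pixels."""
    quantized = (image // (256 // bins_per_channel)).astype(int)
    # Map each (r, g, b) bin triple to a single palette index.
    index = (quantized[..., 0] * bins_per_channel + quantized[..., 1]) * bins_per_channel + quantized[..., 2]
    hist = np.bincount(index.ravel(), minlength=bins_per_channel ** 3)
    return hist / hist.sum()  # normalize so histograms are comparable

def coarseness(image):
    """Crude texture measure: mean absolute intensity difference between
    horizontally adjacent pixels (low = smooth, high = busy)."""
    gray = image.mean(axis=2)
    return float(np.abs(np.diff(gray, axis=1)).mean())

def features(image):
    """Concatenate color and texture into one feature vector, as in
    histogram-based content-based retrieval."""
    return np.concatenate([color_histogram(image), [coarseness(image)]])
```

Images are then compared by a distance between such vectors, which is exactly where the semantic gap appears: two semantically unrelated images may have very similar histograms.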

 

However, the more specific one gets, the less robust to errors the characterization will be. Despite this criticism, automated image analysis has led to unquestionable successes. The performance of image compression systems such as the JPEG standard, which allows efficient Web image transfers, relies on a shallow understanding of the content. Further, as mentioned, faces and text are items that automated systems may handle well. In our GIFT system [GIFT], images may be searched interactively by visual content based on color and texture. By successively marking negative and positive examples, the search is refined and characteristics are filtered so as to match an underlying semantic concept.
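The refinement loop just described can be sketched with a Rocchio-style query update, a standard relevance-feedback technique borrowed from text retrieval (GIFT's actual weighting scheme differs; this is only an illustration of the principle):

```python
import numpy as np

def refine_query(query, positives, negatives, alpha=1.0, beta=0.75, gamma=0.25):
    """Move the query vector towards the marked positive examples and away
    from the negatives (Rocchio update applied to visual feature vectors)."""
    q = alpha * query
    if len(positives):
        q = q + beta * np.mean(positives, axis=0)
    if len(negatives):
        q = q - gamma * np.mean(negatives, axis=0)
    return q

def rank(query, collection):
    """Return collection indices sorted by Euclidean distance to the query."""
    dists = np.linalg.norm(collection - query, axis=1)
    return np.argsort(dists)
```

Each round of user feedback produces a new query vector, and the collection is re-ranked against it; iterating this is what "filters" the characteristics towards the intended concept.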

While the system achieves its aim of retrieving images of a given class, a careful study of the results shows that, even towards the end of the search, our GIFT system is not actually able to express the underlying semantic concepts. That is, it is still confused by unrelated visual examples. This would clearly not be the case with a text-based system, but that would require a complete and exhaustive annotation of the visual content, which is known to be impractical and which we try to avoid here. Summarizing, if we take the case of art, we may easily characterize, e.g., a given painting by its color and associated layout, so that we would be able to distinguish copies of that specific painting within a collection of images. This would be useful for tracking unlawful appearances of images of that painting on the Web, for example. [2] At the other end of the scale, we may also characterize a school of painters by the colors and strokes they use. Impressionism is easily characterized in this way, which may help to pre-classify paintings.

However, automated systems have real difficulties coping with the middle range of the problem: the automated characterization of paintings from a given painter. Evidently, this shows how much such a distinction relies on semantic and cultural background. A first derivative of the QBE paradigm is the concept of target search, which proposes samples of the collection to the user, from which (s)he chooses directions to move in (as opposed to marking valid or invalid items). Here, decisions are made relative to each other and not as absolute relevance judgments. One may therefore expect that, at some point in the process, images of opponent colors will be proposed so as to disambiguate the question of the dominant image color. In a way, such a system successively poses to the user a number of questions whose answers help resolve the search problem. The underlying aim of target search is to locate a target image as quickly as possible, ideally at first glance. There is therefore no intention to purposely propose a visit of the complete collection to the user.
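A minimal sketch of such a target search: at each round the system displays a handful of spread-out candidates, the user picks the one closest to the mental target, and the search narrows to the neighbourhood of that choice. The display-selection and halving strategy here are simple illustrative stand-ins, not the scheme of any published system.

```python
import numpy as np

def target_search(collection, user_choice, k=3, rounds=5):
    """Iteratively narrow down on a target image. `collection` is an array of
    feature vectors; `user_choice(candidates)` returns the index (within the
    displayed candidates) that the user judges closest to the target."""
    active = np.arange(len(collection))
    for _ in range(rounds):
        if len(active) <= 1:
            break
        # Display a small, roughly spread-out subset of the active items.
        shown = active[np.linspace(0, len(active) - 1, min(k, len(active))).astype(int)]
        picked = shown[user_choice(collection[shown])]
        # Keep the half of the active set nearest to the picked item.
        d = np.linalg.norm(collection[active] - collection[picked], axis=1)
        active = active[np.argsort(d)[: max(1, len(active) // 2)]]
    return active[0]
```

Each displayed set acts as one of the «questions» mentioned above: the relative choice among candidates halves the remaining search space.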

Collection Guiding

Yet we advocate that this may indeed be a useful tool for collection management. We contrast the above search context, where the user acts as a customer of the system (i.e., requiring a search service), with a new context where the user acts as its manager. We define the aim as collection-level operations (as opposed to the above document-level operations), such as collection sorting, filtering and summarization. The basic aim of the «Collection Guiding» [CGUIDE] development is to provide a tool that allows a (naïve) user to grasp the content of an image collection as quickly as possible. Our developments are to be compared to straightforward approaches such as:

 

• Linear visit of the collection: the images are simply shown one after the other in some order (as in the Visual Summary of the texts in «Media Art Net»);

 

• Random sampling: samples of manageable size are extracted from the image collection and successively shown to the user (as in the Start Page of «Media Art Net»).

Instead, we propose to direct our tool towards the ability of performing:

 

• Intelligent sampling: subsets of the image collection are extracted that represent well the diversity of the original collection;

• Organized visit: a coherent path is defined for the visit of the collection. The analogy with a museum guide applies fully here;

 

• Hierarchical visit: the collection is organized in a hierarchy so as to explore only interesting parts, or all parts, in an interactive manner. Here again the analogy with museum rooms and wings applies.

One interesting fact is that, in this context, the basic visual characteristics perform well enough to achieve a level of organization that allows the user to keep track of the collection content. The point here is not so much to understand the images (as before) but rather to capture the diversity of the collection. The highlighted properties of diversity, coherence and interest are reachable using our low-level feature set of color, texture and shape. The key part of our system is the user. Within this context, emphasis is placed back on human-machine communication, which is accepted to be one of the important keys to the development of semantic-based systems. It is now the role of designers (e.g. of interfaces) to create appropriate ways of transferring collection information onto maps such as that shown above.
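The sampling and guided-visit modes above can be sketched over the same low-level feature vectors. In this illustration, intelligent sampling picks one representative (medoid) per cluster of a small k-means run, and an organized visit is a greedy nearest-neighbour tour; both are stand-ins for whatever clustering and path-planning a real system would use.

```python
import numpy as np

def intelligent_sample(features, k, iterations=10, seed=0):
    """Pick up to k representative items: run a small k-means, then return
    the index of the item nearest each cluster centre, covering diversity."""
    rng = np.random.default_rng(seed)
    centres = features[rng.choice(len(features), k, replace=False)].astype(float)
    for _ in range(iterations):
        d = np.linalg.norm(features[:, None] - centres[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = features[labels == j].mean(axis=0)
    d = np.linalg.norm(features[:, None] - centres[None], axis=2)
    return np.unique(d.argmin(axis=0))  # one medoid index per centre

def organized_visit(features, start=0):
    """Greedy nearest-neighbour tour: each step moves to the most similar
    unvisited item, yielding a coherent path through the collection."""
    unvisited = set(range(len(features)))
    path, current = [start], start
    unvisited.remove(start)
    while unvisited:
        rest = list(unvisited)
        d = np.linalg.norm(features[rest] - features[current], axis=1)
        current = rest[int(d.argmin())]
        path.append(current)
        unvisited.remove(current)
    return path
```

A hierarchical visit follows the same idea applied recursively: sample representatives of the whole collection, then of each cluster, and so on, mirroring museum wings and rooms.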

Conclusion

Today's volumes of multimedia data force the use of automated management tools. While feasibility has been proven for textual data with the development of search engines such as Google, the problem remains open for visual data.

Recent research and development have focused on a query-based context that places the user as a customer of the system. We propose the «Collection Guiding» context, which proves better suited to several current needs and more flexible in accommodating user requirements and interaction modalities. Further research will concentrate on extending data visualization techniques so as to allow the discovery of useful structures in visual data sets. We envisage this will be truly relevant for video documents, where these techniques should create new facets of video by breaking its rigid linear temporal structure. This will permit alternative navigation modes that we think will reveal interesting features of video documents.

Clearly, this will go hand in hand with the development of Human-Computer Interaction (HCI) techniques to validate these advances. The goal is, and remains, to maintain feedback between developments in Content-based Visual Data interpretation and relevant advances made in parallel fields. We do not wish to abandon the concept of fully automated visual content interpretation systems, but rather to make them practical by including human interaction where it is (temporarily) needed.

© Media Art Net 2004