INFRASTRUCTURE AND SYSTEMS

Exposing the Hidden Web for Chemical Digital Libraries (Full Paper)
Sascha Tonnies, Benjamin Kohncke, Oliver Koepler and Wolf-Tilo Balke
Abstract. In recent years the vast amount of digitally available content has led to the creation of many topic-centered digital libraries. In the domain of chemistry, too, more and more digital collections are available, but complex query formulation still hampers their intuitive adoption. This is because information seeking in chemical documents is focused on chemical entities, for which current standard search relies on complex structures that are hard to extract from documents. Moreover, although simple keyword searches would often be sufficient, current collections simply cannot be indexed by Web search providers due to the ambiguity of chemical substance names. In this paper we present a framework for automatically generating metadata-enriched index pages for all documents in a given chemical collection. All information is then linked to the respective documents and thus provides an easy-to-crawl metadata repository promising to open up digital chemical libraries. Our experiments indexing an open access journal show not only that the documents can be found using a simple Google search via the automatically created index pages, but also that this search outperforms full-text indexing in terms of both precision/recall and performance. Finally, we compare our indexing against a classical structure search and find that keyword-based search can indeed solve at least some of the daily tasks in chemical workflows. Using our framework thus promises to expose a large part of the currently still hidden chemical Web, making the techniques employed interesting for chemical information providers like digital libraries and open access journals.

oreChem ChemxSeer: A Semantic Digital Library (Full Paper)
Na Li, Leilei Zhu, Prasenjit Mitra and C. Lee Giles
Abstract. Bringing semantics to unstructured scientific publications is vital as the amount of scientific literature increases explosively. However, current digital libraries are limited by classic flat structured metadata when modeling scientific publications that contain rich semantic metadata and semantic relations. Furthermore, how to search scientific literature using those linked semantic metadata and relations remains unsolved. We have developed a semantic digital library, oreChem ChemxSeer, that models chemistry papers with semantic metadata and semantic relations, and stores and indexes metadata extracted from a chemistry paper repository, ChemxSeer, in a form called a “compound object”. A compound object is defined using Object Reuse and Exchange (ORE), a new data model that can be used to represent linked resources as an object using Resource Description Framework (RDF) graphs. Creating aggregates of metadata related to a particular object allows us to manage and retrieve the linked metadata easily as one unit. ORE objects are created on demand; thus, we are able to search for a set of linked metadata with one query. We were also able to model new types of metadata easily. For example, chemists are especially interested in finding information related to experiments in documents. We show how paragraphs containing experiment information in papers can be extracted, tagged based on a chemistry ontology with 470 classes, and represented in ORE. Our algorithm uses a classifier whose features are words that are typically used only to describe experiments, such as “apparatus”, “prepare”, etc. Using a dataset of documents downloaded from the Royal Society of Chemistry digital library, we show that our proposed method performs well in extracting experiment-related paragraphs from the chemistry documents.
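
The idea of using experiment-specific words as classifier features can be illustrated with a minimal Python sketch. It is not the authors' classifier: the cue list contains only the two words named in the abstract, and the function names and the fixed threshold rule (standing in for a trained model) are assumptions made for illustration.

```python
import re

# Cue words: only the two examples named in the abstract; a real feature
# set would be larger and is not given in the source.
EXPERIMENT_CUES = {"apparatus", "prepare"}


def cue_features(paragraph):
    """Count exact, case-insensitive occurrences of each cue word."""
    tokens = re.findall(r"[a-z]+", paragraph.lower())
    return {cue: tokens.count(cue) for cue in sorted(EXPERIMENT_CUES)}


def looks_experimental(paragraph, min_hits=1):
    """Hypothetical stand-in for a trained classifier: a simple threshold
    on the total number of cue-word hits."""
    return sum(cue_features(paragraph).values()) >= min_hits
```

In practice the feature counts would be passed to a learned model rather than compared against a fixed threshold, and inflected forms such as "prepared" would need stemming to be matched.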

BinarizationShop: A User-Assisted Software Suite for Converting Old Documents to Black-and-White (Short Paper)
Fanbo Deng, Zheng Wu, Zheng Lu and Michael S. Brown
Abstract. Converting a scanned document to a binary format (black and white) is a key step in the digitization process. While many existing binarization algorithms operate robustly for well-kept documents, these algorithms often produce less than satisfactory results when applied to old documents, especially documents degraded with stains and other discolorations. For these challenging documents, user assistance can be advantageous in directing the binarization procedure. Many existing algorithms, however, are poorly designed to incorporate user assistance. In this paper, we discuss a software framework, BinarizationShop, that combines a series of binarization approaches that have been tailored to exploit user assistance. This framework provides a practical approach for converting difficult documents to black and white.

Using an Ontology and a Multilingual Glossary for Enhancing the Nautical Archaeology Digital Library (Short Paper)
Carlos Monroy, Richard Furuta and Filipe Castro
Abstract. Access to materials in digital collections has been extensively studied within digital libraries. Exploring a collection requires customized indexes and novel interfaces to allow users new exploration mechanisms. Materials or objects can then be found by way of full-text, faceted, or thematic indexes. There has been a marked interest not only in finding objects in a collection, but in discovering relationships and properties. For example, multiple representations of the same object enable the use of visual aids to augment collection exploration. Depending on the domain and characteristics of the objects in a collection, relationships among components can be used to enrich the process of understanding their contents. In this context, the Nautical Archaeology Digital Library (NADL) includes multilingual textual- and visual-rich objects (shipbuilding treatises, illustrations, photographs, and drawings). In this paper we describe an approach for enhancing access to a collection of ancient technical documents, illustrations, and photographs documenting archaeological excavations. Because of the nature of our collection, we exploit a multilingual glossary along with an ontology. Preliminary tests of our prototype suggest the feasibility of our method for enhancing access to the collection.