JCDL International Digital Libraries Conference

Papers

Annotations and Markup

Making Web Annotations Persistent over Time (Full Paper)

Robert Sanderson and Herbert Van de Sompel

Abstract. As Digital Libraries (DL) become more aligned with the web architecture, their functional components need to be fundamentally rethought in terms of URIs and HTTP. Annotation, a core scholarly activity enabled by many DL solutions, exhibits a clearly unacceptable characteristic when existing models are applied to the web: due to the representations of web resources changing over time, an annotation made about a web resource today may no longer be relevant to the representation that is served from that same resource tomorrow.

We assume the existence of archived versions of resources, and combine the temporal features of the emerging Open Annotation data model with the capability offered by the Memento framework that allows seamless navigation from the URI of a resource to archived versions of that resource, and arrive at a solution that provides guarantees regarding the persistence of web annotations over time. More specifically, we provide theoretical solutions and proof-of-concept experimental evaluations for two problems: reconstructing an existing annotation so that the correct archived version is displayed for all resources involved in the annotation, and retrieving all annotations that involve a given archived version of a web resource.
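As a rough illustration of the reconstruction problem described above, the sketch below pairs each resource in an annotation with the `Accept-Datetime` request header that Memento uses for datetime negotiation. The annotation record, its field names (`created`, `body`, `targets`) and the example URIs are assumptions made for illustration and are not taken from the paper; no network request is made.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def memento_request_headers(annotation_time: datetime) -> dict:
    """Build headers asking for the version closest to annotation_time."""
    # Accept-Datetime carries an HTTP-date, which is always expressed in GMT.
    utc_time = annotation_time.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(utc_time, usegmt=True)}


def plan_reconstruction(annotation: dict) -> list:
    """Pair every resource in the annotation with the headers to request for it."""
    headers = memento_request_headers(annotation["created"])
    return [(uri, headers) for uri in [annotation["body"], *annotation["targets"]]]


# Hypothetical annotation record; the field names are illustrative only.
example = {
    "created": datetime(2010, 4, 1, 12, 0, tzinfo=timezone.utc),
    "body": "http://example.org/comment",
    "targets": ["http://example.org/page"],
}
for uri, headers in plan_reconstruction(example):
    print(uri, headers["Accept-Datetime"])
```

Using the annotation's creation time as the negotiation datetime is one plausible choice; the abstract does not state which timestamp the authors use.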

Transferring Structural Markup Across Translations Using Multilingual Alignment and Projection (Full Paper)

David Bamman, Alison Babeu and Gregory Crane

Abstract. We present here a method for automatically projecting structural information across translations, including canonical citation structure (such as chapters and sections), speaker information, quotations, markup for people and places, and any other element in TEI-compliant XML that delimits spans of text that are linguistically symmetrical in two languages. We evaluate this technique on two datasets, one containing perfectly transcribed texts and one containing errorful OCR, and achieve an accuracy rate of 88.2% projecting 13,023 XML tags from source documents to their transcribed translations, with an 83.6% accuracy rate when projecting to texts containing uncorrected OCR. This approach has the potential to allow a highly granular multilingual digital library to be bootstrapped by applying the knowledge contained in a small, heavily curated collection to a much larger but unstructured one.

ProcessTron: Efficient Semi-Automated Markup Generation for Scientific Documents (Full Paper)

Guido Sautter, Klemens Böhm and Conny Kühne

Abstract. Digitizing legacy documents and marking them up with XML is important for many scientific domains. However, creating comprehensive semantic markup of high quality is challenging. Respective processes consist of many steps, with automated markup generation and intermediate manual correction. These corrections are extremely laborious. To reduce this effort, this paper makes two contributions: First, it proposes ProcessTron, a lightweight markup-process-control mechanism. ProcessTron assists users in two ways: It ensures that the steps are executed in the appropriate order, and it points the user to possible errors during manual correction. Second, ProcessTron has been deployed in real-world projects, and this paper reports on our experiences. A core observation is that ProcessTron more than halves the time users need to mark up a document. Results from laboratory experiments, which we have conducted as well, confirm this finding.

Scholarly Publications

Scholarly Paper Recommendation via User's Recent Research Interests (Full Paper)

Kazunari Sugiyama and Min-Yen Kan

Abstract. We examine the effect of modeling a researcher's past works in recommending scholarly papers to the researcher. Our hypothesis is that an author's published works constitute a clean signal of the latent interests of a researcher. A key part of our model is to enhance the profile derived directly from past works with information coming from the past works' referenced papers as well as papers that cite the work. In our experiments, we differentiate between junior researchers who have published only one paper and senior researchers who have multiple publications. We show that filtering these sources of information is advantageous: when we additionally prune noisy citations, referenced papers and publication history, we achieve statistically significantly higher levels of recommendation accuracy.

Effective Self-Training Author Name Disambiguation in Scholarly Digital Libraries (Full Paper)

Anderson Ferreira, Adriano Veloso, Marcos Goncalves and Alberto Laender

Abstract. Name ambiguity in the context of bibliographic citation records is a hard problem that affects the quality of services and content in digital libraries and similar systems. Supervised methods that exploit training examples in order to distinguish ambiguous author names are among the most effective solutions for the problem, but they require skilled human annotators in a laborious and continuous process of manually labeling citations in order to provide enough training examples. Thus, such systems urgently need (i) automatic acquisition of examples and (ii) highly effective disambiguation even when only a few examples are available. In this paper, we propose a novel two-step disambiguation method, SAND (Self-training Associative Name Disambiguator), that deals with these two issues. The first step eliminates the need for any manual labeling effort by automatically acquiring examples using a clustering method that groups citation records based on the similarity among coauthor names. The second step uses a supervised disambiguation method that is able to detect unseen authors not included in any of the given training examples. Experiments conducted with standard public collections, using the minimum set of attributes present in a citation (i.e., author names, work title and publication venue), demonstrated that our proposed method outperforms representative unsupervised disambiguation methods that exploit similarities between citation records and has effectiveness close, and in some cases superior, to supervised ones, without manually labeling any training example.

Citing for High Impact (Full Paper)

Xiaolin Shi, Jure Leskovec and Daniel McFarland

Abstract. The question of citation behavior has always intrigued scientists from various disciplines. While general citation patterns have been widely studied, we develop the notion of citation projection graphs by investigating the references between the publications that a given paper cites. We investigate how patterns of citations vary between scientific disciplines and how such patterns reflect the impact of the paper. We find that idiosyncratic citation patterns are used by low impact papers, while narrow, discipline-focused citation patterns are used by medium impact papers. Crossing-community, or bridging, citation patterns are high risk and high reward since these result in either low or high impact papers. Last, we observe a trend in paper citation networks over time toward more bridging and interdisciplinary forms.
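The notion of a citation projection graph can be sketched as the set of citation links among the publications that a given paper cites. The code below is a minimal illustration under that reading of the abstract; the data layout (a dictionary mapping each paper to the set of papers it cites) and the identifiers are hypothetical.

```python
def citation_projection(paper: str, cites: dict) -> set:
    """Return the citation edges among the papers that `paper` cites."""
    references = cites.get(paper, set())
    return {
        (src, dst)
        for src in references
        for dst in cites.get(src, set())
        # Keep only links that stay inside the reference list of `paper`.
        if dst in references
    }


# Toy citation data: P cites A, B and C; A cites B; C cites X.
cites = {"P": {"A", "B", "C"}, "A": {"B"}, "C": {"X"}}
print(citation_projection("P", cites))
```

In this toy data only the link from A to B survives, because X is not among the references of P.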

Search

Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure (Full Paper)

Martin Klein and Michael L. Nelson

Abstract. Missing web pages (pages that return the 404 "Page Not Found" error) are part of the browsing experience. The manual use of search engines to rediscover missing pages can be frustrating and unsuccessful. We compare four automated methods for rediscovering web pages. We extract the page's title, generate the page's lexical signature (LS), query the bookmarking website delicious.com for the page's tags and generate an LS from the page's link neighborhood. We use all methods to query Internet search engines and analyze their retrieval performance. Our results show that both LSs and titles perform fairly well, with over 60% of URIs returned top ranked from Yahoo. However, the combination of methods improves the retrieval performance. Considering the complexity of the LS generation, querying the title first and, in case of insufficient results, querying the LSs second is the preferable setup. This combination accounts for more than 75% of URIs being returned top ranked.
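A lexical signature is commonly built as the handful of page terms with the highest TF-IDF weight. The sketch below follows that common definition and is not the authors' implementation; the document-frequency table, the corpus size and the fallback of df = 1 for unseen terms are all assumptions.

```python
import math
from collections import Counter


def lexical_signature(tokens, doc_freq, n_docs, k=5):
    """Return the k terms of a page with the highest TF-IDF weight."""
    tf = Counter(tokens)
    scores = {
        # Terms missing from the (hypothetical) corpus statistics count as df = 1.
        term: count * math.log(n_docs / doc_freq.get(term, 1))
        for term, count in tf.items()
    }
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


page = "memento archive web archive annotation the the the".split()
doc_freq = {"the": 1000, "web": 400, "archive": 20, "memento": 2, "annotation": 10}
print(lexical_signature(page, doc_freq, n_docs=1000, k=3))
```

The resulting terms would then be joined into a search engine query, which is the step the abstract evaluates.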

Search Behaviors in Different Task Types (Full Paper)

Jingjing Liu, Michael Cole, Chang Liu, Ralf Bierig, Jacek Gwizdka, Nick Belkin, Jun Zhang and Xiangmin Zhang

Abstract. Personalization of information retrieval tailors search towards individual users to meet their particular information needs by taking into account information about users and their contexts, often through implicit sources of evidence such as user behaviors. Task types have been shown to influence search behaviors including usefulness judgments. This paper reports on an investigation of user behaviors associated with different task types. Twenty-two undergraduate journalism students participated in a controlled lab experiment, each searching on four tasks which varied on four dimensions: complexity, task product, task goal and task level. Results indicate regular differences associated with different task characteristics in several search behaviors, including decision time (the time taken to decide whether a document is useful or not) and eye fixations. We suggest these behaviors can be used as implicit indicators of the user's task type.

Exploiting Time-based Synonyms in Searching Document Archives (Full Paper)

Nattiya Kanhabua and Kjetil Norvag

Abstract. Recently a large number of easily accessible information resources have become available. To increase search quality, document creation time can be taken into account to increase precision, and query expansion of named entities can be employed to increase recall. A peculiarity of named entities compared to other vocabulary terms is that they are very dynamic in appearance. In this paper, we present an approach to extract synonyms of named entities over time from the whole history of Wikipedia. In addition, we use their temporal patterns as a feature in ranking and classifying them into two types, i.e., time-independent or time-dependent. Time-independent synonyms are invariant to time, while time-dependent synonyms are relevant to a particular time period, i.e., the synonym relation changes over time. Further, we describe how to make use of both types of synonyms in order to increase the retrieval effectiveness (precision and recall), i.e., query expansion with time-independent synonyms for an ordinary search, and query expansion with time-dependent synonyms for a search with respect to temporal criteria. Finally, through an evaluation based on TREC collections we demonstrate how retrieval performance of queries consisting of named entities can be improved using our approach.
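The two kinds of query expansion can be sketched as follows. The synonym table, its layout (a validity period of `None` marks a time-independent synonym) and the entity names are hypothetical. Adding time-independent synonyms to temporal searches as well is an assumption of this sketch, not something the abstract states.

```python
def expand_query(entity, synonyms, year=None):
    """Expand a named-entity query with synonyms valid for the given year."""
    terms = [entity]
    for name, period in synonyms.get(entity, []):
        if period is None:
            # Time-independent synonym: always usable.
            terms.append(name)
        elif year is not None and period[0] <= year <= period[1]:
            # Time-dependent synonym: only usable inside its validity period.
            terms.append(name)
    return terms


# Hypothetical synonym table: (synonym, validity period or None).
synonyms = {"acme": [("acme corp", None), ("old acme", (1990, 2000))]}
print(expand_query("acme", synonyms))        # ordinary search
print(expand_query("acme", synonyms, 1995))  # search with a temporal criterion
```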

Historical Text and Documents

Using Word Sense Discrimination on Historic Document Collection (Full Paper)

Nina Tahmasebi, Kai Niklas, Thomas Theuerkauf and Thomas Risse

Abstract. Word sense discrimination is the first, important step towards automatic detection of language evolution within large, historic document collections. By comparing found word senses over time, important information can be revealed and used to improve understanding and accessibility of a digital archive. Algorithms for word sense discrimination have been developed while keeping today's language in mind and have thus been evaluated on well selected, modern datasets. The quality of the word senses found in the discrimination step has a large impact on the detection of language evolution. Therefore, as a first step, we verify that word sense discrimination can successfully be applied to digitized historic documents and that the results correctly correspond to word senses. Because accessibility of digitized historic collections is influenced also by the quality of the optical character recognition (OCR), as a second step we investigate the effects of OCR errors on word sense discrimination results. All evaluations in this paper are performed on The Times Archive, a collection of newspaper articles from 1785 to 1985.

Chinese Calligraphy Specific Style Rendering System (Full Paper)

Zhenting Zhang, Jiangqin Wu and Kai Yu

Abstract. Rendering handwritten characters in the specific style of a famous artwork is fascinating. In this paper, a system is built to render the user's handwritten characters in a specific style. First, a stroke database is established. When rendering a character, the strokes are extracted and recognized, then proper radicals and strokes are filtered, and finally these strokes are deformed and the result is generated. The Special Nine Grid (SNG) is presented to help recognize radicals and strokes. The Rule-base Stroke Deformation Algorithm (RSDA) is proposed to deform the original strokes according to the handwritten strokes. The rendering result manifests the specific style with high quality. It is feasible for people to generate tablets or other artworks with the proposed system.

Translating Handwritten Bushman Texts (Full Paper)

Kyle Williams and Hussein Suleman

Abstract. The Lloyd and Bleek Collection is a collection of artefacts documenting the life and language of the Bushman people of southern Africa in the 19th century. Included in this collection is a handwritten dictionary that contains English words and their corresponding |xam Bushman language translations. This dictionary allows for the manual translation of |xam words that appear in the notebooks of the Lloyd and Bleek collection. This, however, is not practical due to the size of the dictionary, which contains over 14000 entries. To solve this problem a content-based image retrieval system was built that allows for the selection of a |xam word from a notebook and returns matching words from the dictionary. The system shows promise with some search keys returning relevant results.

Collaborative Information Environments

Do Wikipedians Follow Domain Experts? : A Domain-specific Study on Wikipedia Knowledge Building (Full Paper)

Yi Zhang, Aixin Sun, Anwitaman Datta, Kuiyu Chang and Ee-Peng Lim

Abstract. Wikipedia is one of the most successful online knowledge bases, attracting millions of visits daily. Not surprisingly, its huge success has attracted much research interest in better understanding the collaborative knowledge building process. In this paper, we perform a domain-specific analysis, comparing and contrasting knowledge building in Wikipedia with a knowledge base created by domain experts. In particular, we compare Wikipedia knowledge building in the terrorism domain with the Terrorism Knowledge Base (TKB) developed by experts at MIPT. In total, the revision histories of 409 Wikipedia articles, each matching a TKB record, were studied from three aspects: creation, revision and link evolution. We found that knowledge building in Wikipedia in the terrorism domain was unlikely to follow TKB despite the online availability of the latter. To find possible reasons, we conducted a detailed analysis of the contribution behavior of Wikipedians. We found that most Wikipedians each contribute to a relatively small set of articles, with a contribution focus biased toward one particular article. At the same time, for a given article, contributions are often championed by very few active contributors, including the article's creator. Our interpretation is that contributions in Wikipedia aim more at knowledge coverage at the article level than at the domain level.

Spatiotemporal Mapping of Wikipedia Concepts (Full Paper)

Adrian Popescu and Gregory Grefenstette

Abstract. Space and time are important dimensions in the representation of a large number of concepts. However, there exists no available resource that provides spatiotemporal mappings of concepts. Here we present a link-analysis based method for extracting the main locations and periods associated with all Wikipedia concepts. Relevant locations are selected from a set of geotagged articles, while relevant periods are discovered using a list of people with associated life periods. We analyze article versions over multiple languages and consider the strength of a spatial/temporal reference to be proportional to the number of languages in which it appears. To illustrate the utility of the spatiotemporal mapping of Wikipedia concepts, we present an analysis of cultural interactions and a temporal analysis of two domains. The Wikipedia mapping can also be used to perform rich spatiotemporal document indexing by extracting implicit spatial and temporal references from texts.

Crowdsourcing the Assembly of Concept Hierarchies (Full Paper)

Kai Eckert, Mathias Niepert, Christof Niemann, Cameron Buckner, Colin Allen and Heiner Stuckenschmidt

Abstract. The "wisdom of crowds" is accomplishing tasks that are cumbersome for single human beings but cannot yet be fully automated by means of specialized computer algorithms. One such task is the construction of thesauri and other types of concept hierarchies. Human expert feedback on the relatedness and relative generality of terms, however, can be aggregated to construct dynamically changing concept hierarchies. The InPhO (Indiana Philosophy Ontology) project bootstraps feedback from volunteer users unskilled in ontology design into a precise representation of a specific domain. The approach combines statistical text processing methods with expert feedback and logic programming to create a dynamic semantic representation of the discipline of philosophy.

In this paper, we show that results of comparable quality can be achieved by leveraging the workforce of crowdsourcing services such as Amazon's Mechanical Turk (AMT). In an extensive empirical study, we compare the feedback obtained from AMT's workers with that from the InPhO volunteer users providing an insight into qualitative differences of the two groups. Furthermore, we present a set of strategies for assessing the quality of different users when gold standards are missing. We finally use these methods to construct a concept hierarchy based on the feedback acquired from AMT workers.

Personal Collections

A User-Centered Design of a Personal Digital Library for Music Exploration (Full Paper)

David Bainbridge, Brook Novak and Sally Jo Cunningham

Abstract. We describe the evaluation of a system to help musicians capture, enrich and archive their ideas using a spatial hypermedia paradigm. The target user group is musicians who primarily use audio and text for composition and arrangement, rather than formal music notation. Using the principle of user-centered design, the software implementation was guided by a diary study involving nine musicians, which suggested five requirements for the software to support: capturing, overdubbing, developing, archiving, and organizing. Moreover, the underlying spatial data model was exploited to give raw audio compositions a hierarchical structure and, to aid musicians in retrieving previous ideas, a search facility is available to support both query by humming and text-based queries. A user evaluation of the completed design with eleven subjects indicated that musicians, in general, would find the hypermedia environment useful for capturing and managing their moments of musical creativity and exploration. More specifically, they would make use of the query by humming facility and the hierarchical track organization, but not the overdubbing facility as implemented.

Improving Mood Classification in Music Digital Libraries by Combining Lyrics and Audio (Full Paper)

Xiao Hu and J. Stephen Downie

Abstract. Mood is an emerging metadata type and access point in music digital libraries (MDL) and online music repositories. In this study, we present a comprehensive investigation of the usefulness of lyrics in music mood classification by evaluating and comparing a wide range of lyric text features including linguistic and text stylistic features. We then combine the best lyric features with features extracted from music audio using two fusion methods. The results show that combining lyrics and audio significantly outperformed systems using audio-only features. In addition, the examination of learning curves shows that the hybrid lyric + audio system needed fewer training samples to achieve the same or better classification accuracies than systems using lyrics or audio singularly. These experiments were conducted on a unique large-scale dataset of 5,296 songs (with both audio and lyrics for each) representing 18 mood categories derived from social tags. The findings push forward the state-of-the-art on lyric sentiment analysis and automatic music mood classification and will help make mood a practical access point in music digital libraries.

Visualizing Personal Digital Collections (Short Paper)

Maria Esteva, Weijia Xu and Suyog Dott Jain

Abstract. This paper describes the use of RDBMS and treemap visualization to represent and analyze a group of personal digital collections created in the context of work and with no external metadata. We evaluated the visualization vis-à-vis the results of previous personal information management (PIM) studies. We suggest that this visualization affords analysis and understanding of how people organize and maintain their personal information over time.

Interpretation of Web Page Layouts by Blind Users (Short Paper)

Luis Francisco-Revilla and Jeff Crow

Abstract. Digital libraries must support assistive technologies that allow people with disabilities such as blindness to use, navigate and understand their documents. Increasingly, many documents are Web-based and present their contents using complex layouts. However, approaches that translate 2-dimensional layouts to 1-dimensional speech produce a very different user experience and loss of information. To address this issue, we conducted a study of how blind people navigate and interpret layouts of news and shopping Web pages using current assistive technology. The study revealed that blind people do not parse Web pages fully during their first visit, and that they often miss important parts. The study also provided useful insights for improving assistive technologies.

Visualization

Supporting Document Triage via Annotation-based Multi-Application Visualizations (Full Paper)

Soonil Bae, DoHyoung Kim, Konstantinos Meintanis, J. Michael Moore, Anna Zacchi, Frank Shipman, Haowei Hsieh and Catherine Marshall

Abstract. best use their time in sifting through many potentially relevant documents, a practice we refer to as document triage. Normally, people perform triage using multiple applications in concert: a search engine interface presents lists of potentially relevant documents; a document reader displays their contents; and a third tool (a text editor or personal information management application) is used to record notes and assessments. To support document triage, we have developed a multi-application environment that combines an information workspace with a modified document reader. This environment infers users' interests based on their interactions with both applications, coupled with an analysis of the characteristics and content of the documents they are interacting with. It then uses this interest profile to generate visualizations to direct users' attention to documents or parts of documents that match the inferred interests.

Flexible Access to Photo Libraries via Time, Place, Tags, and Visual Features (Full Paper)

Andreas Girgensohn, Frank Shipman, Thea Turner and Lynn Wilcox

Abstract. Photo libraries are growing in quantity and size, requiring better support for locating desired photographs. MediaGLOW is an interactive visual workspace designed to address this concern. It uses attributes such as visual appearance, GPS locations, user-assigned tags, and dates to filter and group photos. An automatic layout algorithm positions photos with similar attributes near each other to support users in serendipitously finding multiple relevant photos. In addition, the system can explicitly select photos similar to specified photos. We conducted a user evaluation to determine the benefit provided by similarity layout and the relative advantages offered by the different layout similarity criteria and attribute filters. Study participants had to locate photos matching probe statements. In some tasks, participants were restricted to a single layout similarity criterion and filter option. Participants used multiple attributes to filter photos. Layout by similarity without additional filters turned out to be one of the most used strategies and was especially beneficial for geographical similarity. Lastly, the relative appropriateness of the single similarity criterion to the probe significantly affected retrieval performance.

Interactively Browsing Movies in terms of Action, Foreshadowing and Resolution (Short Paper)

Stewart Greenhill, Brett Adams and Svetha Venkatesh

Abstract. We describe a novel video player that uses Temporal Semantic Compression (TSC) to present a compressed summary of a movie. Compression is based on tempo which is derived from film rhythms. The technique identifies periods of action, drama, foreshadowing and resolution, which can be mixed in different amounts to vary the kind of summary presented. The compression algorithm is embedded in a video player, so that the summary can be interactively recomputed during playback.

Timeline Interactive Multimedia Experience (TIME): On Location Access to Aggregate Event Information (Short Paper)

Jeff Crow, Eryn Whitworth, Ame Wongsa and Swati Pendyala

Abstract. Attending a complex scheduled social event, such as a multi-day music festival, requires a significant amount of planning before and during its progression. Advancements in mobile technology and social networks enable attendees to contribute content in real-time that can provide useful information to many. Currently, such information is difficult to access and use during an event. The Timeline Interactive Multimedia Experience (TIME) system aggregates information posted to multiple social networks and presents the flow of information in a multi-touch timeline interface. TIME is designed to be placed on location to allow real-time access to relevant information that helps attendees to make plans and navigate their crowded surroundings.

Data Mining

Domain-Specific Iterative Readability Computation (Full Paper)

Jin Zhao and Min-Yen Kan

Abstract. A growing number of domain-specific resources are becoming available online. Although the topic in these specialized resources may be the same, they may cater to different audiences, ranging from children to expert researchers. However, current search engines seldom provide information on the readability of their indexed resources or perform readability-based ranking. Consequently, when users with different levels of expertise search for specialized material, they must sieve through many results to find ones suitable to their level. Calculating a domain-specific readability level for such resources can address this issue. We thus present a new algorithm to measure domain-specific readability. It iteratively computes the readability of domain-specific resources based on the difficulty of domain-specific concepts and vice versa, in a style reminiscent of other bipartite graph algorithms such as Hyperlink-Induced Topic Search (HITS) and the Stochastic Approach for Link-Structure Analysis (SALSA). While simple, our algorithm outperforms standard heuristic measures and remains competitive among supervised-learning approaches. Moreover, it is less domain-dependent and portable across domains as it does not rely on an annotated corpus or expensive expert knowledge that supervised or domain-specific methods require.
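The mutual dependence between resource readability and concept difficulty can be illustrated with a simple alternating update on a bipartite resource-concept graph. The abstract does not give the authors' update rule or initialization, so the plain averaging, the initial scores and the example data below are assumptions chosen only to show the general shape of such an iteration.

```python
def iterate_readability(resource_concepts, initial_scores, rounds=3):
    """Alternately update concept difficulty and resource readability scores."""
    scores = dict(initial_scores)
    # Invert the bipartite graph: which resources mention each concept.
    concept_resources = {}
    for resource, concepts in resource_concepts.items():
        for concept in concepts:
            concept_resources.setdefault(concept, []).append(resource)
    difficulty = {}
    for _ in range(rounds):
        # A concept's difficulty is the mean score of the resources using it.
        difficulty = {
            c: sum(scores[r] for r in rs) / len(rs)
            for c, rs in concept_resources.items()
        }
        # A resource's score is the mean difficulty of the concepts it uses.
        scores = {
            r: sum(difficulty[c] for c in cs) / len(cs)
            for r, cs in resource_concepts.items()
        }
    return scores, difficulty


# Hypothetical resources, concepts and initial (heuristic) readability levels.
resource_concepts = {
    "kids_page": ["triangle", "angle"],
    "textbook": ["angle", "eigenvalue"],
    "paper": ["eigenvalue", "manifold"],
}
initial_scores = {"kids_page": 1.0, "textbook": 5.0, "paper": 9.0}
print(iterate_readability(resource_concepts, initial_scores, rounds=1))
```

Repeated averaging pulls all scores toward a common value, so this sketch uses a small, fixed number of rounds and assumes every resource mentions at least one concept.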

Evaluating Topic Models for Digital Libraries (Full Paper)

David Newman, Youn Noh, Edmund Talley, Sarvnaz Karimi and Timothy Baldwin

Abstract. Topic models could have a huge impact on improving the ways users find and discover content in digital libraries and search interfaces, through their ability to automatically learn and apply subject tags to each and every item in a collection, and their ability to dynamically create virtual collections on the fly. However, much remains to be done to tap this potential, and empirically evaluate the true value of a given topic model to humans. In this work, we sketch out some sub-tasks that we suggest pave the way towards this goal, and present methods for assessing the coherence and interpretability of topics learned by topic models. Our large-scale user study includes over 70 human subjects evaluating and scoring almost 500 topics learned from collections from a wide range of genres and domains. We show how a scoring model — based on pointwise mutual information of word-pairs using Wikipedia, Google and MEDLINE as external data sources — performs well at predicting human scores. This automated scoring of topics is an important first step to integrating topic modeling into digital libraries.
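A score based on pointwise mutual information of word pairs can be sketched as the mean PMI over all pairs of a topic's top words, with probabilities estimated from document co-occurrence in a reference collection. The tiny reference collection, the document-level co-occurrence window and the `eps` smoothing are assumptions of this sketch; the abstract names Wikipedia, Google and MEDLINE as the authors' actual data sources.

```python
import math
from itertools import combinations


def pmi_coherence(topic_words, documents, eps=1e-12):
    """Mean pointwise mutual information over all pairs of topic words."""
    docs = [set(d) for d in documents]
    n = len(docs)

    def prob(*words):
        # Fraction of reference documents containing all the given words.
        return sum(all(w in d for w in words) for d in docs) / n

    pairs = list(combinations(topic_words, 2))
    total = 0.0
    for a, b in pairs:
        # eps avoids log(0) when a pair never co-occurs (an assumed smoothing choice).
        total += math.log((prob(a, b) + eps) / (prob(a) * prob(b) + eps))
    return total / len(pairs)


# Toy reference collection standing in for an external corpus.
documents = [
    ["music", "audio", "lyrics"],
    ["music", "audio"],
    ["library", "catalog"],
    ["library", "catalog", "music"],
]
print(pmi_coherence(["music", "audio"], documents))
print(pmi_coherence(["audio", "catalog"], documents))
```

The function expects at least two topic words. Words that tend to appear together score above zero, while pairs that never co-occur score strongly negative.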

FRBRization of MARC records in multiple catalogs (Full Paper)

Hugo Manguinhas, Nuno Freire and Jose Borbinha

Abstract. This paper addresses the problem of using the FRBR model to support the presentation of results. It describes a service implementing new algorithms and techniques for transforming existing MARC records into the FRBR model for this specific purpose. This work was developed in the context of the TELPlus project and processed 100,000 bibliographic and authority records from multilingual catalogs of 12 European countries.

Infrastructure and Systems

Exposing the Hidden Web for Chemical Digital Libraries (Full Paper)

Sascha Tönnies, Benjamin Köhncke, Oliver Koepler and Wolf-Tilo Balke

Abstract. In recent years the vast amount of digitally available content has led to the creation of many topic-centered digital libraries. In the domain of chemistry, too, more and more digital collections are available, but complex query formulation still hampers their intuitive adoption. This is because information seeking in chemical documents is focused on chemical entities, for which current standard search relies on complex structures which are hard to extract from documents. Moreover, although simple keyword searches would often be sufficient, current collections simply cannot be indexed by Web search providers due to the ambiguity of chemical substance names. In this paper we present a framework for automatically generating metadata-enriched index pages for all documents in a given chemical collection. All information is then linked to the respective documents and thus provides an easy-to-crawl metadata repository promising to open up digital chemical libraries. Our experiments indexing an open access journal show that not only can the documents be found using a simple Google search via the automatically created index pages, but also that the search is much more effective than fulltext indexing in terms of both precision/recall and performance. Finally, we compare our indexing against a classical structure search and find that keyword-based search can indeed solve at least some of the daily tasks in chemical workflows. Using our framework thus promises to expose a large part of the currently still hidden chemical Web, making the techniques employed interesting for chemical information providers like digital libraries and open access journals.

oreChem ChemxSeer: A Semantic Digital Library (Full Paper)

Na Li, Leilei Zhu, Prasenjit Mitra and C. Lee Giles

Abstract. Bringing semantics to unstructured scientific publications is vital as the amount of scientific literature increases explosively. However, current digital libraries are limited by classic flat structured metadata when modeling scientific publications that contain rich semantic metadata and semantic relations. Furthermore, how to search scientific literature using those linked semantic metadata and relations remains unsolved. We have developed a semantic digital library, oreChem ChemxSeer, that models chemistry papers with semantic metadata and semantic relations, and stores and indexes metadata extracted from a chemistry paper repository, ChemxSeer, in a form called "compound object". A compound object is defined using Object Reuse and Exchange (ORE), which is a new data model that can be used to represent linked resources as an object using Resource Description Framework (RDF) graphs. Creating aggregates of metadata related to a particular object can allow us to manage and retrieve the linked metadata easily as one unit. ORE objects are created on demand; thus, we are able to search for a set of linked metadata with one query. We were also able to model new types of metadata easily. For example, chemists are especially interested in finding information related to experiments in documents. We show how paragraphs containing experiment information in papers can be extracted, tagged based on a chemistry ontology with 470 classes and represented in ORE. Our algorithm uses a classifier whose features are words that are typically only used to describe experiments, such as "apparatus", "prepare", etc. Using a dataset comprising documents downloaded from the Royal Society of Chemistry digital library, we show that our proposed method performs well in extracting experiment-related paragraphs from the chemistry documents.

BinarizationShop: A User-Assisted Software Suite for Converting Old Documents to Black-and-White (Short Paper)

Fanbo Deng, Zheng Wu, Zheng Lu and Michael S. Brown

Abstract. Converting a scanned document to a binary format (black and white) is a key step in the digitization process. While many existing binarization algorithms operate robustly for well-kept documents, these algorithms often produce less than satisfactory results when applied to old documents, especially documents degraded with stains and other discolorations. For these challenging documents, user assistance can be advantageous in directing the binarization procedure. Many existing algorithms, however, are poorly designed to incorporate user assistance. In this paper, we discuss a software framework, BinarizationShop, that combines a series of binarization approaches that have been tailored to exploit user assistance. This framework provides a practical approach for converting difficult documents to black and white.

Using an Ontology and a Multilingual Glossary for Enhancing the Nautical Archaeology Digital Library (Short Paper)

Carlos Monroy, Richard Furuta and Filipe Castro

Abstract. Access to materials in digital collections has been extensively studied within digital libraries. Exploring a collection requires customized indexes and novel interfaces to allow users new exploration mechanisms. Materials or objects can then be found by way of full-text, faceted, or thematic indexes. There has been a marked interest not only in finding objects in a collection, but in discovering relationships and properties. For example, multiple representations of the same object enable the use of visual aids to augment collection exploration. Depending on the domain and characteristics of the objects in a collection, relationships among components can be used to enrich the process of understanding their contents. In this context, the Nautical Archaeology Digital Library (NADL) includes multilingual textual- and visual-rich objects (shipbuilding treatises, illustrations, photographs, and drawings). In this paper we describe an approach for enhancing access to a collection of ancient technical documents, illustrations, and photographs documenting archaeological excavations. Because of the nature of our collection, we exploit a multilingual glossary along with an ontology. Preliminary tests of our prototype suggest the feasibility of our method for enhancing access to the collection.

Integration of Physical and Digital Media

In-depth Utilization of Chinese Ancient Maps: A Hybrid Approach to Digitizing Map Resources in CADAL (Full Paper)

Zhenchao YE, Ling ZHUANG, Jiangqin WU, Chengyang DU, Baogang WEI and Yin ZHANG

Abstract. Electronic maps have recently become increasingly popular as an intuitive and interactive platform for data presentation, and applications integrated with electronic maps have attracted much attention. However, no off-the-shelf systems or services are available when the time span of the maps is extended to historical ones. The CADAL digital library holds a large number of valuable ancient atlases, but they are seldom used because atlases in image format are not easy for users to read or to search for specific information. In this paper, we propose a novel hybrid approach to utilizing these atlases directly and constructing applications based on ancient maps. We call it CAMAME, which stands for Chinese Ancient Maps Automatic Marking and Extraction. We create a gazetteer, use a kernel method to perform the regression, and correct the estimated results with image processing and local regression methods. The empirical results show that CAMAME is effective and efficient: most valuable data in the map images is marked and identified. On this basis we developed Chinese literary chronicle applications that display ancient literary and related historical information over the digitized atlas resources in the CADAL digital library.

The Fused Library: Integrating Digital and Physical Libraries with Location-Aware Sensors (Full Paper)

George Buchanan

Abstract. This paper reports an investigation into the connection of the workspace of physical libraries with digital library services. Using simple sensor technology, we provide focused access to digital resources on the basis of the user's physical context, including the topic of the stacks they are next to, and the content of books on their reading desks. Our research developed the technological infrastructure to support this fused interaction, investigated current patron behavior in physical libraries, and evaluated our system in a user-centred pilot study. The outcome of this research demonstrates the potential utility of the fused library, and provides a starting point for future exploitation.

What Humanists Want: How Scholars Use Primary Source Documents (Full Paper)

Neal Audenaert and Richard Furuta

Abstract. Despite the growing prominence of digital libraries as tools to support humanities scholars, little is known about the work practices and needs of these scholars as they pertain to working with primary source documents. In this paper we present our findings from a formative user study consisting of semi-structured interviews with eight scholars.

We find that the use of primary source materials in scholarship is not a simple, straightforward examination of a document in isolation. Instead, scholars study primary source materials as an integral part of a complex ecosystem of inquiry that seeks to understand both the text being studied and the context in which that text was created, transmitted and used. Drawing examples from our interviews, we address critical questions of why scholars use primary source documents and what information they hope to gain by studying them. We also briefly summarize key note-taking practices as a means for assessing the potential to design user interfaces that support scholarly work practices.

Search 2

Context Identification of Sentences in Related Work Sections using Conditional Random Fields: Towards Intelligent Digital Libraries (Full Paper)

Angrosh M.A., Stephen Cranefield and Nigel Stanger

Abstract. Identification of contexts associated with sentences is becoming increasingly necessary for developing intelligent information retrieval systems. This article describes a supervised learning mechanism employing conditional random fields (CRFs) for context identification and sentence classification. Specifically, we focus on sentences in related work sections in research articles. Based on the generic rhetorical pattern, a framework for modeling the sequential flow in these sections is proposed. Adopting a generalization strategy, each of these sentences is transformed into a set of features, which forms our dataset. Prominently, we distinguish between two kinds of features for each of these sentences viz., citation features and sentence features. While an overall accuracy of 96.51% is achieved by using a combination of both citation and sentence features, the use of sentence features alone yields an accuracy of 93.22%. The results also show F-Score ranging from 0.99 to 0.90 for various classes indicating the robustness of our application.

Can an intermediary help users search image databases without annotations? (Full Paper)

Robert Villa, Martin Halvey, Hideo Joho, David Hannah and Joemon Jose

Abstract. Developing methods for searching image databases is a challenging and ongoing area of research. A common approach is to use manual annotations, although generating annotations can be expensive in terms of time and money, and in many situations the cost may not be justifiable. Content-based search techniques which extract visual features from image data can be used, but users are typically forced to search using example images or sketching interfaces. This can be difficult if no visual example of the information need is available, or if the need is difficult to represent with a drawing.

In this paper, we consider an alternative approach which allows a user to search for images through an intermediate database. In this approach, a user can search using text in the intermediate database as a way of finding visual examples of their information need. The visual examples can then be used to search a database that lacks annotations. An interface is developed to support this process, and a user study is presented which compares the intermediary interface to text search, where we consider text as an upper bound of performance. Results show that while performance does not match manual annotations, users are able to find relevant material without requiring collection annotations.

SNDocRank: Document Ranking Based on Social Networks (Full Paper)

Liang Gou, Hung-Hsuan Chen, Jung-Hyun Kim, Xiaolong (Luke) Zhang and C. Lee Giles

Abstract. Ranking algorithms used by search engines can be user-neutral and measure the importance and relevance of documents mainly based on the contents and relationships of documents. However, users with diverse interests may demand different documents even with the same queries. To improve search results by using user preferences, we propose a ranking framework, Social Network Document Rank (SNDocRank), that considers both document contents and the relationship between a searcher and document owners in a social network. SNDocRank combines the traditional tf-idf ranking with our Multi-level Actor Similarity (MAS) algorithm, which measures the similarity between the social networks of the searcher and document owners. We tested the SNDocRank method on video data and social network data extracted from YouTube. The results show that compared with the tf-idf algorithm, the SNDocRank algorithm returns more relevant documents. By using SNDocRank, a searcher can get more relevant search results by joining larger social networks, having more friends in a social network, and belonging to larger local communities in a social network.
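
The idea of blending a content score with a searcher-to-owner social score can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not define the Multi-level Actor Similarity (MAS) algorithm, so a plain Jaccard overlap of friend sets stands in for it, and the function names, the linear blend, and the weight `alpha` are all hypothetical.

```python
import math
from collections import Counter


def tfidf_scores(query, docs):
    """Return one simple tf-idf score per document (whitespace tokens)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for other in tokenized if term in other)
            if df == 0 or not tokens:
                continue
            tf = counts[term] / len(tokens)
            idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed idf
            score += tf * idf
        scores.append(score)
    return scores


def friend_overlap(a, b):
    """Jaccard overlap of two friend sets (a stand-in for MAS, not MAS itself)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank(query, docs, owners, friends, searcher, alpha=0.5):
    """Order document indices by alpha * content + (1 - alpha) * social score."""
    text = tfidf_scores(query, docs)
    top = max(text, default=0.0) or 1.0  # normalize content scores to [0, 1]
    mine = friends.get(searcher, set())
    combined = []
    for i, owner in enumerate(owners):
        social = friend_overlap(mine, friends.get(owner, set()))
        combined.append((alpha * text[i] / top + (1 - alpha) * social, i))
    # Highest combined score first; ties keep the original document order.
    return [i for _, i in sorted(combined, key=lambda pair: (-pair[0], pair[1]))]


docs = ["funny cat video", "funny cat video"]
owners = ["bob", "carol"]
friends = {"alice": {"x", "y"}, "bob": {"z"}, "carol": {"x", "y"}}
ranking = rank("cat video", docs, owners, friends, searcher="alice")
```

With identical document text, the document owned by the user whose friend set overlaps more with the searcher's is ranked first; setting `alpha=1.0` reduces the ranking to the content score alone.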

Theory and Frameworks

A Mathematical Framework for Modeling and Analyzing Migration Time (Full Paper)

Feng Luan, Mads Nygård and Thomas Mestl

Abstract. File format obsolescence has so far been considered the major risk in long-term storage of digital objects. There are, however, growing indications that file transfer may be a real threat, as the migration time, i.e., the time required to migrate petabytes of data, may easily span years. Hardware support, however, is usually limited to 3-4 years, so a situation can emerge in which a new migration has to be started although the previous one is not yet finished. This paper chooses a process modeling approach to obtain estimates of upper and lower bounds for the required migration time. The advantage is that information about potential bottlenecks can be acquired. Our theoretical considerations are validated by migration tests at the National Library of Norway (NB) as well as at our department.
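
To make the notion of upper and lower bounds concrete, here is a minimal sketch of a stage-based estimate. It is not the model from the paper, whose details the abstract does not give. It assumes, purely for illustration, that a migration is a chain of stages with fixed throughputs, that a fully pipelined run is limited by the slowest stage (lower bound), and that a strictly sequential run pays for every stage in turn (upper bound). The stage names and rates are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def migration_time_bounds(data_bytes, stage_rates):
    """Return (lower_days, upper_days) for moving data_bytes through all stages.

    stage_rates maps a stage name to its throughput in bytes per second.
    Lower bound: stages overlap perfectly, so the slowest stage dominates.
    Upper bound: stages run one after another over the whole data set.
    """
    if not stage_rates or any(rate <= 0 for rate in stage_rates.values()):
        raise ValueError("every stage needs a positive throughput")
    lower = data_bytes / min(stage_rates.values())
    upper = sum(data_bytes / rate for rate in stage_rates.values())
    return lower / SECONDS_PER_DAY, upper / SECONDS_PER_DAY


def bottleneck(stage_rates):
    """Name of the slowest stage, i.e. the one that sets the lower bound."""
    return min(stage_rates, key=stage_rates.get)


# Hypothetical figures: 8,640,000 bytes through a two-stage chain.
rates = {"read": 100.0, "write": 50.0}
bounds = migration_time_bounds(8_640_000, rates)
slowest = bottleneck(rates)
```

Comparing either bound with the hardware support window mentioned in the abstract shows whether a migration could still be running when the next one has to start.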

Digital Libraries for Scientific Data Discovery and Reuse: From Vision to Practical Reality (Full Paper)

Jillian Wallis, Matthew Mayernik, Christine Borgman and Alberto Pepe

Abstract. Science and technology research is becoming not only more distributed and collaborative, but more highly instrumented. Digital libraries provide a means to capture, manage, and access the data deluge that results from these research enterprises. We have conducted research on data practices and participated in developing data management services for the Center for Embedded Networked Sensing since its founding in 2002 as a National Science Foundation Science and Technology Center. Over the course of 8 years, our digital library strategy has shifted dramatically in response to changing technologies, practices, and policies. We report on the development of several DL systems and on the lessons learned, which include the difficulty of anticipating data requirements from nascent technologies, building systems for highly diverse work practices and data types, the need to bind together multiple single-purpose systems, the lack of incentives to manage and share data, the complementary nature of research and development in understanding practices, and sustainability.

Ensemble PDP-8: Eight Principles for Distributed Portals (Short Paper)

Edward Fox, Yinlin Chen, Monika Akbar, Clifford Shaffer, Stephen Edwards, Peter Brusilovsky, Dan Garcia, Lois Delcambre, Felicia Decker, David Archer, Richard Furuta, Frank Shipman, Stephen Carpenter and Lillian Cassel

Abstract. Ensemble, the National Science Digital Library (NSDL) Pathways project for Computing, builds upon a diverse group of prior NSDL, DL-I, and other projects. Ensemble has shaped its activities according to principles related to design, development, implementation, and operation of distributed portals. Here we articulate 8 key principles for distributed portals (PDPs). While our focus is on education and pedagogy, we expect that our experiences will generalize to other digital library application domains. These principles inform, facilitate, and enhance the Ensemble R&D and production activities. They allow us to provide a broad range of services, from personalization to coordination across communities. The eight PDPs can be briefly summarized as: (1) Articulation across communities using ontologies. (2) Browsing tailored to collections. (3) Integration across interfaces and virtual environments. (4) Metadata interoperability and integration. (5) Social graph construction using logging and metrics. (6) Superimposed information and annotation integrated across distributed systems. (7) Streamlined user access with IDs. (8) Web 2.0 multiple social network system interconnection.

Discovering Australia's Research Data (Short Paper)

Stefanie Kethers, Xiaobin Shen, Andrew Treloar and Ross Wilkinson

Abstract. Access to data crucial to research is often slow and difficult. When research problems cross disciplinary boundaries, problems are exacerbated. This paper argues that it is important to make it easier to find and access data that might be found in an institution, in a disciplinary data store, in a government department, or held privately. We explore how to meet ad hoc needs that cannot easily be supported by a disciplinary ontology, and argue that web pages that describe data collections with rich links and rich text are valuable. We describe the approach followed by the Australian National Data Service (ANDS) in making such pages available. Finally, we discuss how we plan to evaluate this approach.

Social Aspects

This is What I'm Doing and Why: Reflections on a Think-Aloud Study of Digital Library Users' Information Behaviour (Short Paper)

Stephann Makri, Ann Blandford and Anna Cox

Abstract. Many user-centred studies of digital libraries include a think-aloud element, where users are asked to verbalise their thoughts, interface actions and sometimes their feelings whilst using digital libraries to help them complete one or more information tasks. These studies are usually conducted with the purpose of identifying usability issues related to the system(s) used or understanding aspects of users' information behaviour. However, few of these studies present detailed accounts of how their think-aloud data was collected and analysed or provide detailed reflection on their methodologies. In this paper, we discuss and reflect on the decisions made when planning and conducting a think-aloud study of lawyers' interactive information behaviour. Our discussion is framed by Blandford et al.'s PRET A Rapporter ('ready to report') framework, a framework that can be used to plan, conduct and describe user-centred studies of digital library use from an information work perspective.

Customizing Science Instruction with Educational Digital Libraries (Short Paper)

Tamara Sumner

Abstract. The Curriculum Customization Service enables science educators to customize their instruction with interactive digital library resources. Preliminary results from a field trial with 124 middle and high school teachers suggest that the Service offers a promising model for embedding educational digital libraries into teaching practices and for supporting teachers to integrate customizing into their curriculum planning.

Impact and Prospect of Social Bookmarks for Bibliographic Information Retrieval (Short Paper)

Kazuhiro Seki, Huawai Qin and Kuniaki Uehara

Abstract. This paper presents our ongoing study of the current/future impact of social bookmarks (or social tags) on information retrieval (IR). The main research question asked in the present work is "How do social tags compare with conventional, yet reliable manual indexing from the viewpoint of IR performance?". To answer the question, we look at the biomedical literature and begin by examining basic statistics of social tags from CiteULike in comparison with Medical Subject Headings (MeSH) annotated in the Medline bibliographic database. Then, using the data, we conduct various experiments in an IR setting, which reveal that social tags work complementarily with MeSH and that retrieval performance would improve as the coverage of CiteULike grows.

Merging Metadata: A Sociotechnical Study of Crosswalking and Interoperability (Short Paper)

Michael Khoo

Abstract. Digital library interoperability relies on the use of a common metadata format. However, implementing a common metadata format among multiple digital libraries is not always a straightforward exercise. This paper presents a case study of the metadata and interoperability issues that arose during the merger of two digital libraries, the Internet Public Library and the Internet Librarian's Index. As part of the merger, each library's metadata was crosswalked to Dublin Core. These crosswalks required considerable work, partly because the metadata for each library had been shaped in complex ways over time by local institutional factors. Not all of these differences in the metadata were obvious, and some were ignored for some time, negatively impacting the crosswalk process. A sociotechnical analysis suggests that local metadata knowledge can be complex in nature and hard to crosswalk, and also that this complexity can be hard to understand for those working in the libraries themselves. Some implications of this finding for digital library interoperability are discussed.
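
Mechanically, a crosswalk is a field-by-field mapping, which the following sketch illustrates. The local field names and the mapping are hypothetical and are not taken from either library's actual schema; only the target element names (`title`, `creator`, `subject`, `identifier`) are Dublin Core elements. Fields with no mapping are returned separately, since those are where local knowledge of the kind described above tends to be needed.

```python
# Hypothetical local-to-Dublin-Core mapping for one library's records.
LOCAL_TO_DC = {
    "name": "title",
    "author": "creator",
    "keywords": "subject",
    "url": "identifier",
}


def crosswalk(record, mapping):
    """Split a local record into Dublin Core fields and unmapped leftovers.

    Dublin Core elements are repeatable, so mapped values are kept in lists.
    """
    mapped = {}
    unmapped = {}
    for field, value in record.items():
        target = mapping.get(field)
        if target is None:
            unmapped[field] = value  # needs a human decision
        else:
            mapped.setdefault(target, []).append(value)
    return mapped, unmapped


record = {
    "name": "Example Resource",
    "url": "http://example.org/resource",
    "reviewed_by": "staff",  # local field with no obvious Dublin Core home
}
dc_record, leftovers = crosswalk(record, LOCAL_TO_DC)
```

The code is the easy part. Deciding what a field such as the hypothetical `reviewed_by` meant in local practice, and whether it should map anywhere at all, is the kind of question the case study addresses.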

Emulation Based Services in Digital Preservation (Short Paper)

Klaus Rechert, Dirk von Suchodoletz and Randolph Welte

Abstract. The creation of most digital objects occurs solely in the interactive graphical user interfaces that were available at the particular time period. Archiving and preservation organizations are faced with large amounts of such objects of various types. At some point they will need to process these automatically to make them available to their users or to convert them to a commonly used format. A substantial problem is to provide a wide range of different users with access to ancient environments and to allow the original environment to be used for a given object. We propose an abstract architecture for emulation services in digital preservation to provide remote user interfaces to emulation over computer networks without the need to install additional software components. Furthermore, we describe how these ideas can be integrated in a framework of web services for common preservation tasks like viewing or migrating digital objects.

Digital Libraries - 10 years past, 10 years forward, a 2020 Vision