DATA MINING

Domain-Specific Iterative Readability Computation (Full Paper)
Jin Zhao and Min-Yen Kan
Abstract. A growing number of domain-specific resources are becoming available online. Although the topic of these specialized resources may be the same, they may cater to different audiences, ranging from children to expert researchers. However, current search engines seldom provide information on the readability of their indexed resources or perform readability-based ranking. Consequently, users with different levels of expertise who search for specialized material must sift through many results to find ones suited to their level. Calculating a domain-specific readability level for such resources can address this issue. We thus present a new algorithm to measure domain-specific readability. It iteratively computes the readability of domain-specific resources from the difficulty of domain-specific concepts and vice versa, in a style reminiscent of other bipartite graph algorithms such as Hyperlink-Induced Topic Search (HITS) and the Stochastic Approach for Link-Structure Analysis (SALSA). While simple, our algorithm outperforms standard heuristic measures and remains competitive with supervised-learning approaches. Moreover, it is less domain-dependent and more portable across domains, as it does not rely on the annotated corpora or expensive expert knowledge that supervised or domain-specific methods require.
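
The abstract leaves the exact update rule unspecified; the following Python sketch illustrates one plausible HITS-style mutual recursion between document readability and concept difficulty. The function name, the doc_concepts input, and the averaging and normalization steps are assumptions made for illustration, not the authors' actual method.

from collections import defaultdict

def iterative_readability(doc_concepts, n_iterations=50):
    """doc_concepts: dict mapping document id -> set of concept ids."""
    # Invert the bipartite graph: concept -> documents that mention it.
    concept_docs = defaultdict(set)
    for doc, concepts in doc_concepts.items():
        for concept in concepts:
            concept_docs[concept].add(doc)

    readability = {d: 1.0 for d in doc_concepts}   # higher score = harder text
    difficulty = {c: 1.0 for c in concept_docs}    # higher score = harder concept

    for _ in range(n_iterations):
        # A document is as hard as the average difficulty of its concepts.
        for d, cs in doc_concepts.items():
            readability[d] = sum(difficulty[c] for c in cs) / len(cs)
        # A concept is as hard as the average readability of documents using it.
        for c, ds in concept_docs.items():
            difficulty[c] = sum(readability[d] for d in ds) / len(ds)
        # Normalize each side so the scores stay bounded, as in HITS.
        r_max = max(readability.values())
        readability = {d: r / r_max for d, r in readability.items()}
        d_max = max(difficulty.values())
        difficulty = {c: v / d_max for c, v in difficulty.items()}

    return readability, difficulty

Under this kind of iteration, a document dominated by rarely used, specialist concepts drifts toward a high difficulty score, while a concept that appears mostly in easy documents drifts toward a low one.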

Evaluating Topic Models for Digital Libraries (Full Paper)
David Newman, Youn Noh, Edmund Talley, Sarvnaz Karimi and Timothy Baldwin
Abstract. Topic models could have a huge impact on improving the ways users find and discover content in digital libraries and search interfaces, through their ability to automatically learn and apply subject tags to every item in a collection, and to dynamically create virtual collections on the fly. However, much remains to be done to tap this potential and to empirically evaluate the true value of a given topic model to humans. In this work, we sketch out sub-tasks that we suggest pave the way towards this goal, and present methods for assessing the coherence and interpretability of topics learned by topic models. Our large-scale user study includes over 70 human subjects evaluating and scoring almost 500 topics learned from collections spanning a wide range of genres and domains. We show that a scoring model based on pointwise mutual information of word pairs, using Wikipedia, Google and MEDLINE as external data sources, performs well at predicting human scores. This automated scoring of topics is an important first step towards integrating topic modeling into digital libraries.
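
For readers unfamiliar with this kind of scoring model, the Python sketch below shows one common way to compute a PMI-based coherence score for a topic's top words against an external corpus. The function name, the document-frequency probability estimates, and the zero fallback for unseen word pairs are illustrative assumptions rather than the paper's exact formulation.

import math
from itertools import combinations

def pmi_coherence(topic_words, reference_docs):
    """Mean PMI over all pairs of a topic's top words.

    topic_words: the topic's top-N words (e.g. its ten most probable words).
    reference_docs: iterable of token collections from an external corpus
                    such as Wikipedia articles or MEDLINE abstracts.
    """
    docs = [set(d) for d in reference_docs]
    n_docs = len(docs)

    def doc_prob(*words):
        # Fraction of reference documents containing all the given words.
        return sum(1 for d in docs if all(w in d for w in words)) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p_joint = doc_prob(w1, w2)
        p1, p2 = doc_prob(w1), doc_prob(w2)
        if p_joint > 0 and p1 > 0 and p2 > 0:
            scores.append(math.log(p_joint / (p1 * p2)))
        else:
            scores.append(0.0)   # assumed fallback for unseen pairs
    return sum(scores) / len(scores)

A topic whose top words frequently co-occur in the reference corpus receives a high mean PMI, which is the kind of signal the paper correlates with human coherence judgments.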

FRBRization of MARC records in multiple catalogs (Full Paper)
Hugo Manguinhas, Nuno Freire and Jose Borbinha
Abstract. This paper addresses the problem of using the FRBR model to support the presentation of search results. It describes a service implementing new algorithms and techniques for transforming existing MARC records into the FRBR model for this specific purpose. This work was developed in the context of the TELPlus project and processed 100,000 bibliographic and authority records from multilingual catalogs of 12 European countries.
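
The paper's algorithms are not detailed in the abstract; as a rough illustration of one common FRBRization step, the Python sketch below clusters bibliographic records into candidate FRBR Works using a normalized author/title key. The field names and the normalization are assumptions for illustration only, not the service described in the paper.

from collections import defaultdict

def normalize(value):
    # Crude normalization: lowercase and collapse whitespace.
    return " ".join(value.lower().split()) if value else ""

def group_into_works(marc_records):
    """marc_records: iterable of dicts with 'author' and 'title' strings."""
    works = defaultdict(list)
    for record in marc_records:
        key = (normalize(record.get("author")), normalize(record.get("title")))
        works[key].append(record)   # each cluster approximates one FRBR Work
    return works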