Using Word Sense Discrimination on Historic Document Collection (Full Paper)
Nina Tahmasebi, Kai Niklas, Thomas Theuerkauf and Thomas Risse
Abstract. Word sense discrimination is the first, important step towards automatic detection of language evolution within large, historic document collections. By comparing found word senses over time, important information can be revealed and used to improve understanding and accessibility of a digital archive. Algorithms for word sense discrimination have been developed with today’s language in mind and have thus been evaluated on well-selected, modern datasets. The quality of the word senses found in the discrimination step has a large impact on the detection of language evolution. Therefore, as a first step, we verify that word sense discrimination can successfully be applied to digitized historic documents and that the results correctly correspond to word senses. Because accessibility of digitized historic collections is also influenced by the quality of the optical character recognition (OCR), as a second step we investigate the effects of OCR errors on word sense discrimination results. All evaluations in this paper are performed on The Times Archive, a collection of newspaper articles from 1785 to 1985.
Chinese Calligraphy Specific Style Rendering System (Full Paper)
Zhenting Zhang, Jiangqin Wu and Kai Yu
Abstract. Rendering handwritten characters in the specific style of a famous artwork is fascinating. In this paper, a system is built to render the user’s handwritten characters in a specific style. A stroke database is established first. When rendering a character, the strokes are extracted and recognized, then the proper radicals and strokes are filtered, and finally these strokes are deformed and the result is generated. The Special Nine Grid (SNG) is presented to help recognize radicals and strokes. The Rule-base Stroke Deformation Algorithm (RSDA) is proposed to deform the original strokes according to the handwritten strokes. The rendering result manifests the specific style with high quality. It is feasible for people to generate the tablet or other artworks with the proposed system.
Translating Handwritten Bushman Texts (Full Paper)
Kyle Williams and Hussein Suleman
Abstract. The Lloyd and Bleek Collection is a collection of artefacts documenting the life and language of the Bushman people of southern Africa in the 19th century. Included in this collection is a handwritten dictionary that contains English words and their corresponding |xam Bushman language translations. This dictionary allows for the manual translation of |xam words that appear in the notebooks of the Lloyd and Bleek Collection. This, however, is not practical due to the size of the dictionary, which contains over 14,000 entries. To solve this problem, a content-based image retrieval system was built that allows for the selection of a |xam word from a notebook and returns matching words from the dictionary. The system shows promise, with some search keys returning relevant results.