Indexing

This document briefly summarizes the indexing scheme used to store and retrieve the VK data efficiently in the eXist XML database engine.

Apart from the built-in structural index offered by eXist, we define three types of index:

The goal of the first index is to allow snippets of text to be resolved at various levels in the hierarchy.

The second index allows search results to be returned in the original book order. vk:chron attributes exist only at the level of vk:p and contain a number such that if $x/@vk:chron is less than $y/@vk:chron, then the fragment $x occurs earlier in the book than $y. This is purely an optimization and has no effect for the user other than speed.
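
The ordering property of vk:chron can be illustrated with a minimal Python sketch. It is not the query that runs inside eXist; the namespace URI, the sample fragments, the helper name sort_in_book_order, and the choice to treat the values as integers are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI; the real vk namespace is not given in this document.
VK = "http://example.org/vk"

# Made-up fragments, deliberately out of book order.
SAMPLE = f"""
<vk:book xmlns:vk="{VK}">
  <vk:p vk:chron="30">third</vk:p>
  <vk:p vk:chron="4">first</vk:p>
  <vk:p vk:chron="12">second</vk:p>
</vk:book>
"""

def sort_in_book_order(fragments):
    """Return vk:p elements ordered by their numeric vk:chron value."""
    # Compare as numbers: a string comparison would put "12" before "4".
    return sorted(fragments, key=lambda p: int(p.get(f"{{{VK}}}chron")))

root = ET.fromstring(SAMPLE)
hits = root.findall(f"{{{VK}}}p")
ordered = [p.text for p in sort_in_book_order(hits)]
print(ordered)
```

The point of the sketch is that a single numeric comparison per pair of fragments is enough to restore book order, without walking the document tree.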

The full-text indexes are more complex. We create several of these with Lucene, for the following types of elements:

Text in the elements vk:b (bold) and vk:i (italic) is taken to belong to the containing element, usually a vk:p. vk:page and vk:header elements are ignored to prevent noise.
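
The rule above can be sketched in Python as a text-extraction step that runs before indexing. This is only a model of the behaviour described, not the eXist or Lucene configuration itself; the namespace URI, the sample paragraph, and the function name indexable_text are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI; the real vk namespace is not given in this document.
VK = "http://example.org/vk"

# Elements whose content is left out of the index to prevent noise.
IGNORED = {f"{{{VK}}}page", f"{{{VK}}}header"}

def indexable_text(elem):
    """Collect the text of elem, folding in inline children and skipping ignored ones."""
    parts = [elem.text or ""]
    for child in elem:
        # Simplification: every non-ignored child (vk:b, vk:i, ...) is treated as inline.
        if child.tag not in IGNORED:
            parts.append(indexable_text(child))
        # Text that follows a child still belongs to the containing element.
        parts.append(child.tail or "")
    return "".join(parts)

p = ET.fromstring(
    f'<vk:p xmlns:vk="{VK}">Een <vk:b>vet</vk:b> en '
    f'<vk:i>cursief</vk:i> woord<vk:page>12</vk:page>.</vk:p>'
)
text = indexable_text(p)
print(text)  # Een vet en cursief woord.
```

The bold and italic words end up in the text of the containing vk:p, while the page number does not.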

The indexes are built by running the text through Lucene's StandardAnalyzer. Although designed for English text, this analyzer class works quite well for Dutch and is much faster than DutchAnalyzer, which was found to be prohibitively slow. (While index construction is an off-line task, the same analyzer also has to be used to parse users' query strings online, so its speed matters at query time as well.)
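
Why the same analyzer has to be used on both sides can be shown with a toy inverted index in Python. The analyze function below is a deliberately simple stand-in (lowercasing plus splitting on non-word characters) and does not reproduce StandardAnalyzer; all names and sample texts are assumptions for illustration.

```python
import re
from collections import defaultdict

def analyze(text):
    """Toy analyzer: lowercase, then split on runs of non-word characters."""
    return [t for t in re.split(r"\W+", text.lower()) if t]

def build_index(docs):
    """Map each analyzed token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index

def search(index, query, analyzer=analyze):
    """Return ids of documents that contain every analyzed query token."""
    tokens = analyzer(query)
    if not tokens:
        return set()
    return set.intersection(*(index.get(t, set()) for t in tokens))

docs = {1: "De Fiets staat buiten.", 2: "Een boek over fietsen."}
index = build_index(docs)

# Same analyzer at query time: "FIETS" becomes "fiets" and matches document 1.
same = search(index, "FIETS")

# Different analyzer at query time: the token stays "FIETS" and matches nothing.
different = search(index, "FIETS", analyzer=str.split)
print(same, different)
```

Because the index only contains tokens in the form the analyzer produced, a query analyzed differently looks up tokens that were never stored.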