Deliverables Verrijkt Koninkrijk

D1

The data source for the VerrijktKoninkrijk project is the PDF collection at http://www.niod.nl/koninkrijk/default.asp, which comprises a color-scanned and OCR'ed version of the complete scientific edition of “Het Koninkrijk der Nederlanden in de Tweede Wereldoorlog”, written by L. de Jong.

Each of these PDF files was transformed into XML with the open-source tool pdf2xml:

http://sourceforge.net/projects/pdf2xml

To remove the most obvious OCR errors, such as floating illegible characters caused by dirt on the scanner bed or page

        (combinations of ·.,;:"'|/\^~`•_=><)

we ran a pre-processing clean-up step on all documents with the following XSLT script:

http://transformer.loedejongdigitaal.nl/pdf2htmlcleanup.xsl
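
The clean-up idea can be illustrated with a minimal Python sketch. This is not the XSLT script linked above; it assumes that a "floating" artifact is a whitespace-delimited token made up entirely of the characters listed earlier, and the function name strip_floating_noise is hypothetical.

```python
# Characters the text lists as typical scanner-dirt debris.
NOISE_CHARS = set("·.,;:\"'|/\\^~`•_=><")


def strip_floating_noise(line: str) -> str:
    """Drop whitespace-delimited tokens that consist only of noise characters.

    Assumption: punctuation attached to a word ("1940.") is legitimate,
    while a free-standing run such as "·.'" is OCR debris.
    Runs of whitespace are collapsed to single spaces as a side effect.
    """
    kept = [
        token
        for token in line.split()
        if not all(ch in NOISE_CHARS for ch in token)
    ]
    return " ".join(kept)


if __name__ == "__main__":
    print(strip_floating_noise("De oorlog ·.' begon ~` in 1940."))
```

A rule this simple would also delete a legitimate free-standing ellipsis, so a production clean-up would probably need a more careful test.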

The resulting XML documents were transformed into the final book format with the following XSLT script:

http://transformer.loedejongdigitaal.nl/loedejong.xsl

This produced XML files that validate against the following schema:

http://schema.loedejongdigitaal.nl/book.rnc (Note that this links to an html version of the schema. The original file on which it is based can be found by changing .html back to .rnc)

Description of the data format

The data is made available in the following formats at EASY DANS (link through the assigned Persistent Identifier):

General Description

This is a collection of datasets related to the work of Loe de Jong: "Het Koninkrijk der Nederlanden in de Tweede Wereldoorlog". This is the standard reference on the history of the Netherlands during World War II, and was digitally enriched and curated for CLARIN-NL.

The collection includes:

1) The loedejongdigitaal.nl xml data

This dataset comprises all of the XML enriched books from the loedejongdigitaal.nl xml collection. The collection is organized as a set of 30 XML files, each one corresponding to one of the bound printed volumes and based on the associated PDF with scan and OCR data.

2) The loedejongdigitaal.nl xml data further enriched with semantic analysis

This dataset comprises all of the XML enriched books from the loedejongdigitaal.nl xml collection. The collection is organized as a set of 30 XML files, each one corresponding to one of the bound printed volumes and based on the associated PDF with scan and OCR data. This set contains the semanticized text in FoLiA annotation.

3) The Named Entities detected in the loedejongdigitaal.nl xml data in table format

This dataset comprises all of the detected named entities in the loedejongdigitaal.nl xml collection. The database is organized as a table containing the named entity text, type, and paragraph identifier. If relevant and available, it also contains a link to the Dutch Wikipedia and a link to the English Wikipedia.

For related datasets see the thematic collection: 'Verrijkt Koninkrijk'. You can find a link to this collection under 'Relations'.

4) The Semantic Layer (RDF/XML Data)

This dataset contains RDF data in XML format for the Linked Data version of the semanticized Named Entities and back-of-the-book terms of the loedejongdigitaal.nl collection.

Description of the data xml format

Each document is a UTF-8 encoded XML file that is valid with respect to the RELAX NG compact schema book.rnc. The structure of the documents is as follows. The root element of each document, root, contains 3 elements:

The book element is created based on an automatic detection of a number of visual and textual cues that can be found throughout the different pages. Thanks to a relatively consistent layout used throughout the different parts (with an unfortunate exception of 'deel 14') it was possible to use the same feature detector for all books.

The elements detected were:

Data post processing

The back of the book will be enriched with data from the books themselves, and the pages from those books will be enriched with data from the back of the book.

This creates a circular dependency in the transformation process, which is resolved by running the transformation a second time once the first run has finished.

The order in which all data is processed is as follows:

  1. Transform all XML files with http://transformer.loudejongdigitaal.nl/d/vk/loudejong.xsl
  2. Place the resulting back of the book (vk.d.reg.xml) at http://transformer.loudejongdigitaal.nl/d/vk/nl.vk.d.reg.xml
  3. Create an XML file that maps every paragraph ID to the IDs of the pages on which that paragraph appears, using http://www.loedejongdigitaal.nl/parids.xq
  4. Save the resulting XML file as http://transformer.loudejongdigitaal.nl/d/vk/parids.xml
  5. Transform all XML files again with http://transformer.loudejongdigitaal.nl/d/vk/loudejong.xsl
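
The ordering of these steps can be sketched in Python as follows. The real pipeline uses XSLT and XQuery; every name here (run_two_pass, transform, build_register, build_parids) is a hypothetical stand-in, and the sketch only shows why the second pass can see data that did not exist during the first.

```python
from typing import Callable, Dict, Optional

# Hypothetical signature: source text plus the two optional lookups.
Transform = Callable[[str, Optional[dict], Optional[dict]], dict]


def run_two_pass(
    sources: Dict[str, str],
    transform: Transform,
    build_register: Callable[[Dict[str, dict]], dict],
    build_parids: Callable[[Dict[str, dict]], dict],
) -> Dict[str, dict]:
    # Step 1: first pass; no register or paragraph map exists yet.
    first = {name: transform(src, None, None) for name, src in sources.items()}

    # Steps 2-4: derive the shared lookup data from the first-pass output.
    register = build_register(first)
    parids = build_parids(first)

    # Step 5: same transformation again, now with both lookups available.
    return {
        name: transform(src, register, parids)
        for name, src in sources.items()
    }
```

Because the lookups are built from the first-pass output and not from the second, one repetition is enough as long as the enrichment does not change paragraph or page IDs.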

As a result we will have the following enrichment:

Back of the book

The lemmas in the back-of-the-book (nl.vk.d.reg) contain page references. Because paragraphs are the smallest resolvable element in the curated collection, paragraph references to all paragraphs that have some or complete overlap with a given page have been added.
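
As a rough illustration of how a page reference can be expanded into paragraph references, the Python sketch below inverts a paragraph-to-pages mapping. The mapping format and all identifiers in the example are assumptions made for illustration; the actual data lives in the parids XML file.

```python
from collections import defaultdict
from typing import Dict, List


def paragraphs_by_page(parids: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Invert {paragraph id: [page ids]} into {page id: [paragraph ids]}.

    A paragraph that runs across a page break is listed under both pages,
    which matches the "some or complete overlap" rule in the text.
    """
    by_page: Dict[str, List[str]] = defaultdict(list)
    for par_id, pages in parids.items():
        for page in pages:
            by_page[page].append(par_id)
    return dict(by_page)


if __name__ == "__main__":
    # Hypothetical IDs: the second paragraph spans pages 12 and 13.
    example = {
        "par-1": ["page-12"],
        "par-2": ["page-12", "page-13"],
        "par-3": ["page-13"],
    }
    print(paragraphs_by_page(example)["page-13"])
```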

Pages

A backofbook-ref element is added to each page element if one or more lemmas refer to that specific page. These lemma references can serve as a 'summary' of a given page, or can be used in a visualization that lets users navigate easily to other pages related to the current page via the lemmas. Example:

        <backofbook-ref>
            <lemma-ref>Anti-communisme</lemma-ref>
            <lemma-ref>Anti-fascisme</lemma-ref>
            <lemma-ref>Centrale Inlichtingsdienst (voor de oorlog)</lemma-ref>
            <lemma-ref>Concordaat (20 juli 1933)</lemma-ref>
            <lemma-ref>Consulaat, Duits, in Amsterdam</lemma-ref>
            <lemma-ref>Foreign Office/State Department Document Center</lemma-ref>
            <lemma-ref>Heerlen</lemma-ref>
            <lemma-ref>Jansen, J. H. G.</lemma-ref>
            <lemma-ref>Limburg</lemma-ref>
            <lemma-ref>Noorr, G. C. van</lemma-ref>
            <lemma-ref>Pius XI, paus</lemma-ref>
            <lemma-ref>Poels, H. A.</lemma-ref>
            <lemma-ref>Rooms-Katholiek Episcopaat</lemma-ref>
            <lemma-ref>Rooms-Katholieke Mijnwerkersbond</lemma-ref>
            <lemma-ref>Rooms-Katholieke Staatspartij (RKSP)</lemma-ref>
        </backofbook-ref>
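
A consumer of the data could read such a block with a few lines of Python. The sketch below uses the standard library and assumes the element names appear without a namespace, as in the example above; in the published files they may carry the vk namespace, in which case the tag names would need to be qualified.

```python
import xml.etree.ElementTree as ET
from typing import List

# Shortened copy of the example, used only for demonstration.
SNIPPET = """<backofbook-ref>
  <lemma-ref>Heerlen</lemma-ref>
  <lemma-ref>Limburg</lemma-ref>
  <lemma-ref>Pius XI, paus</lemma-ref>
</backofbook-ref>"""


def lemma_refs(xml_text: str) -> List[str]:
    """Return the text of every lemma-ref element, in document order."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter("lemma-ref") if el.text]


if __name__ == "__main__":
    for lemma in lemma_refs(SNIPPET):
        print(lemma)
```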

Statistics for the collection

All elements were counted with the following result:

Element Count
vk:book 30
vk:chapter 226
vk:section 1885
vk:subsection 4708
vk:p 86257
vk:quote 56547
vk:foreword 6
vk:statement 2
vk:appendix 92
vk:corrections 2
vk:header 16015
vk:footer 7881
vk:page 16922
vk:backofbook 1
vk:block 80
vk:lemma 16186
vk:lemma[.//vk:page-ref/@vk:page-ref] 15369
vk:lemma-ref 148370
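
Counts of this kind can be reproduced with a short script. The Python sketch below is one possible approach, not the method used for the table above; it counts elements by local name and ignores namespaces, and the sample document with its namespace URI is made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical miniature document; the namespace URI is a placeholder.
SAMPLE = """<vk:book xmlns:vk="http://example.org/vk">
  <vk:chapter>
    <vk:section><vk:p>Eerste alinea.</vk:p><vk:p>Tweede alinea.</vk:p></vk:section>
  </vk:chapter>
</vk:book>"""


def count_elements(xml_text: str) -> Counter:
    """Count elements by local name (the part after the namespace)."""
    counts: Counter = Counter()
    for el in ET.fromstring(xml_text).iter():
        # ElementTree expands prefixes to "{uri}local"; keep only "local".
        local_name = el.tag.rsplit("}", 1)[-1]
        counts[local_name] += 1
    return counts


if __name__ == "__main__":
    for name, n in sorted(count_elements(SAMPLE).items()):
        print(f"vk:{name} {n}")
```

Summing the counters of all 30 files would give collection-wide totals; the predicate row (lemmas that contain a page reference) would need an additional filter.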

D3

For the purpose of this research, Het Koninkrijk has been subdivided hierarchically, as follows, and each element of the hierarchy has been given a unique identifier. The XML snippet corresponding to each element/identifier can be obtained via:

http://resolver.loedejongdigitaal.nl/<id>

where <id> is the identifier.

Each of these element types has an identifier attribute @vk:id, which reflects the hierarchical structure. Each identifier consists of the prefix nl.vk.d., followed by a dot-separated list of parts denoting book, chapter, section, subsection, paragraph and, where applicable, footnote. E.g., in the second half of Volume 11a, we find a footnote with the identifier:

nl.vk.d.11a-2.2.1.2.6.6

meaning

11a-2
The vk:book for the second subvolume of Volume 11a (regarded as a single identifier part; the separator is ".", not "-").
2
Chapter "Het gouvernement en de nationalisten", the second element below vk:book. It is the tenth chapter in De Jong's scheme; this is not reflected in the identifier but in a separate attribute.
1
First vk:section (untitled, as the first section of a chapter always is).
1
First vk:subsection.
6
Sixth paragraph (vk:p).
6
The actual footnote.
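
The identifier scheme lends itself to simple string handling. The Python sketch below splits an identifier into its parts, lists the identifiers of the enclosing elements, and builds the resolver URL. The level names and function names are assumptions made for illustration; identifiers that point to other element types (for example the back of the book) may not follow this six-level layout.

```python
from typing import Dict, List

PREFIX = "nl.vk.d."
RESOLVER = "http://resolver.loedejongdigitaal.nl/"
# Assumed level names, taken from the breakdown in the text.
LEVELS = ("book", "chapter", "section", "subsection", "paragraph", "footnote")


def split_parts(identifier: str) -> List[str]:
    """Strip the prefix and split on '.'; '11a-2' stays one part."""
    if not identifier.startswith(PREFIX):
        raise ValueError(f"not a vk identifier: {identifier!r}")
    return identifier[len(PREFIX):].split(".")


def parse_vk_id(identifier: str) -> Dict[str, str]:
    """Label each part with its assumed level name."""
    parts = split_parts(identifier)
    if len(parts) > len(LEVELS):
        raise ValueError(f"too many parts in {identifier!r}")
    return dict(zip(LEVELS, parts))


def ancestors(identifier: str) -> List[str]:
    """Identifiers of all enclosing elements, outermost first."""
    parts = split_parts(identifier)
    return [PREFIX + ".".join(parts[:i]) for i in range(1, len(parts))]


def resolver_url(identifier: str) -> str:
    return RESOLVER + identifier


if __name__ == "__main__":
    vk_id = "nl.vk.d.11a-2.2.1.2.6.6"
    print(parse_vk_id(vk_id))
    print(ancestors(vk_id))
    print(resolver_url(vk_id))
```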