Marine Lives guide to creating a Transkribus Ground Truth

From MarineLives
Revision as of 10:21, March 3, 2022 by ColinGreenstreet


This page is the main page of a guide that Marine Lives is creating to cover practical aspects of creating a Transkribus Ground Truth.

Objective


Our objective is to create a C17th English secretarial hand HTR model, which we will use on our collection of 80,000 images of English High Court of Admiralty depositions.

We are aiming to create two models. The first will be based on a Ground Truth of 500,000 words (roughly 1,000 manuscript images); the second on one million words (roughly 2,000 manuscript images).
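
As a rough check on these figures, the arithmetic can be sketched in Python. The numbers are the ones quoted above; the function names are hypothetical and used for illustration only, and the words-per-image value is an average implied by the round figures, not a measurement.

```python
# Figures quoted in the text above.
COLLECTION_IMAGES = 80_000   # images of HCA depositions in the collection
MODEL_1_WORDS = 500_000      # Ground Truth size for the first model
MODEL_1_IMAGES = 1_000
MODEL_2_WORDS = 1_000_000    # Ground Truth size for the second model
MODEL_2_IMAGES = 2_000


def words_per_image(words: int, images: int) -> float:
    """Average number of words per manuscript image (hypothetical helper)."""
    return words / images


def share_of_collection(images: int, total_images: int = COLLECTION_IMAGES) -> float:
    """Fraction of the whole collection covered by a Ground Truth (hypothetical helper)."""
    return images / total_images


if __name__ == "__main__":
    for name, words, images in [
        ("Model 1", MODEL_1_WORDS, MODEL_1_IMAGES),
        ("Model 2", MODEL_2_WORDS, MODEL_2_IMAGES),
    ]:
        density = words_per_image(words, images)
        share = share_of_collection(images)
        print(f"{name}: about {density:.0f} words per image, "
              f"{share:.2%} of the collection")
```

Both Ground Truths imply about 500 words per image, and even the larger one covers only a small fraction of the 80,000 images, which is why the trained model is needed for the remainder.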

For our first model, we are using existing semi-diplomatic transcriptions of the HCA 13/72 volume [late 1650s], made between 2013 and 2015 by Marine Lives volunteers.



Tools



Using Transkribus Expert Client


We are using Transkribus Expert Client as our main tool to run automated layout recognition, to correct the resulting layouts manually, and to enter and modify existing semi-diplomatic transcriptions of material in HCA 13/72.

We are then using Transkribus Lite version 2.0 to view completed Ground Truth pages, and to keep an overview of our work.



Using Transkribus Lite version 2.0


Transkribus has recently introduced an improved version of its web browser interface (Transkribus Lite version 2.0).

Transkribus has a useful online guide to using Transkribus Lite version 2.0.

We are finding this improved browser interface to be responsive, with short lag times as we browse images.

The browser interface also has useful functionality not available in Transkribus Expert Client.

Most useful to date are:

  1. Large thumbnails
  2. Ability to display thumbnails by status of manuscript pages within our work process


Marine Lives wiki


The Marine Lives wiki is a Semantic MediaWiki. It is organised into volumes and pages.

We are working from volume HCA 13/72 and are inputting existing semi-diplomatic transcriptions from this volume by hand into Transkribus Expert Client.



Work process


We have set up a simple work process:

1. Automatic layout recognition of all 1518 images in HCA 13/72

- Used the CITlab Advanced Tool

Layout Analysis controls in Tools section of Transkribus Expert Client controls panel

- Modified the layout page by page after manual inspection of automatically generated layouts

  • We are only just beginning to think through what makes sense in terms of use of Text Regions when creating our Ground Truth
  • We are finding that the automatic tool is typically producing between one and three Text Regions per manuscript image
  • Typically the tool is NOT identifying text blocks on the left-hand side of an image as separate from the main body of text, even though they are structurally distinct
  • Ideally, we would train the automatic layout recognition tool to be sensitive to the typical structures of HCA legal depositions, and we are looking into this
  • In the short term, we are manually adding Text Regions, and changing the shape and size of Text Regions
  • However, lines of text have already been recognised and allocated to specific Text Regions, so we are having to reallocate lines of text to our new Text Regions after we have entered the text. This is cumbersome, so we are looking into alternatives
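
One alternative to reallocating lines by hand is to assign each line to whichever Text Region contains its centre point. The Python sketch below illustrates that idea only. It is not a Transkribus feature or API: the `Box`, `Region` and `Line` classes and the `assign_lines` function are hypothetical, and it assumes simple rectangles, whereas real Text Regions and lines are polygons.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in image pixel coordinates (a simplification)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


@dataclass(frozen=True)
class Region:
    region_id: str
    box: Box


@dataclass(frozen=True)
class Line:
    line_id: str
    box: Box

    def centre(self) -> Tuple[float, float]:
        return ((self.box.left + self.box.right) / 2,
                (self.box.top + self.box.bottom) / 2)


def assign_lines(lines: List[Line], regions: List[Region]) -> Dict[str, Optional[str]]:
    """Map each line id to the first region whose box contains the line's centre.

    Lines whose centre falls in no region map to None, so they can be
    checked manually.
    """
    assignment: Dict[str, Optional[str]] = {}
    for line in lines:
        x, y = line.centre()
        assignment[line.line_id] = next(
            (region.region_id for region in regions if region.box.contains(x, y)),
            None,
        )
    return assignment


if __name__ == "__main__":
    # Invented coordinates: a left-hand margin region and a main text region.
    regions = [
        Region("margin", Box(0, 0, 100, 1000)),
        Region("body", Box(100, 0, 800, 1000)),
    ]
    lines = [
        Line("l1", Box(10, 50, 90, 80)),
        Line("l2", Box(120, 50, 780, 80)),
    ]
    print(assign_lines(lines, regions))
```

Using the centre point rather than full containment tolerates lines that slightly overrun a region's edge. Lines left unassigned would still need a manual check.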


Layout of HCA 13/72 f.14v after we manually adjusted the Text Regions, creating six Text Regions and reallocating lines to those regions


2. Input of existing semi-diplomatic transcriptions of HCA 13/72 manuscript pages into Transkribus Expert Client

[ADD TEXT]