Difference between revisions of "Marine Lives guide to creating a Transkribus Ground Truth"

From MarineLives
Line 17: Line 17:
 
For our first model, we are using existing semi-diplomatic transcriptions of the HCA 13/72 volume [late 1650s], made between 2013 and 2015 by Marine Lives volunteers. We have to convert these semi-diplomatic transcriptions back to full diplomatic, and mark up contractions, as part of the process of creating a highly reliable Ground Truth. We also have to treat interlineation differently, with each interline shown separately and the insertion point or points marked in the line on which the interlineation depends.
 
 
----
 
 
 
==Tools==
 
----
 
===Using Transkribus Expert Client===
 
  
We are using Transkribus Expert Client as our main tool to perform automated layout recognition, manual correction of these layouts, and to enter and modify existing semi-diplomatic transcriptions of material in HCA 13/72.
+
We are working with several related Transkribus tools and with our own semantic media wiki:
  
We are then using Transkribus Lite version 2.0 to view completed Ground Truth pages, and to keep an overview of our work.
+
[http://www.marinelives.org/wiki/Tools_to_create_our_Ground_Truth#Using_Transkribus_Expert_Client 1. Transkribus Expert Client]
  
----
+
[http://www.marinelives.org/wiki/Tools_to_create_our_Ground_Truth#Using_Transkribus_Lite_version_2.0 2. Transkribus Lite version 2.0]
===Using Transkribus Lite version 2.0===
+
  
[[File:Transkribus Lite HCA1372 Done Thumbnails 03032022.png|750px|thumb|left|Transkribus Lite Version 2.0 thumbnail display showing images with status "Done" in our work process]]
+
[http://www.marinelives.org/wiki/Tools_to_create_our_Ground_Truth#Marine_Lives_wiki 3. Marine Lives semantic media wiki]
 
+
 
+
Transkribus has recently introduced an improved version of its web browser interface ([https://transkribus.eu/lite/ Transkribus Lite Version 2.0]).
+
 
+
Transkribus has a [https://readcoop.eu/transkribus/howto/getting-started-with-transkribus-lite/ useful online guide to using Transkribus Lite Version 2.0].
+
 
+
We are finding this improved browser interface responsive, with short lag times as we browse images.
+
 
+
The browser interface also has useful functionality not available in Transkribus Expert Client.
+
 
+
Most useful to date are:
+
 
+
    Large thumbnails
+
    Ability to display thumbnails by status of manuscript pages within our work process
+
----
+
 
+
===Marine Lives wiki===
+
 
+
The [http://www.marinelives.org/wiki/MarineLives Marine Lives wiki] is a Semantic Media Wiki. It is organised into volumes and pages.
+
 
+
We are working from volume [http://www.marinelives.org/wiki/HCA_13/72 HCA 13/72] and are inputting existing semi-diplomatic transcriptions from this volume by hand into Transkribus Expert Client.
+
  
 
----
 
Line 61: Line 36:
 
==We are experimenting==
 
  
We are experimenting with a range of Transkribus tools related to layout analysis and HTR.
+
We are experimenting with a range of Transkribus tools related to [http://www.marinelives.org/wiki/Customized_Structural_Analysis layout analysis and HTR].
 
+
One tool we are trying out is the manual naming of structural elements in the legal depositions which form our corpus. We are using the customizable structural analysis tools available in Transkribus Expert Client, and hope to train a model to recognise these different structural types in our data.
+
 
+
[[File:Structural Analysis HCA Depositions 03032022.png|750px|thumb|left|Text regions (and other structural elements of text) can be labelled with customizable structural tags within Transkribus Expert Client]]
+
  
 
----
 
==QUESTIONS==
+
==Questions==
  
 
'''We are developing a [http://www.marinelives.org/wiki/Running_List_of_Questions running list of questions.]'''
 

Revision as of 21:55, March 3, 2022

This page is the main page of a guide that Marine Lives is creating to cover practical aspects of creating a Transkribus Ground Truth

Objective


Transkribus web capability has simple but useful search functionality
We want to have 80,000 images covering HCA 13/20 to HCA 13/79 publicly available and searchable by end 2022

Our overall objective is to make 80,000 images of English High Court of Admiralty depositions covering 1570 to 1690 publicly available and searchable.

To do this, our immediate objective is to create a handwritten text recognition (HTR) model for C17th English secretarial hand, which we will use on our collection of 80,000 images of English High Court of Admiralty depositions.

We are aiming to create two models. The first will be based on a Ground Truth of 500,000 words (roughly 1,000 manuscript images), and the second on one million words (roughly 2,000 manuscript images).
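
As a rough check on these figures, the short Python sketch below works out the word density implied by each Ground Truth target and the share of the 80,000-image collection that each Ground Truth would cover. The function and variable names are illustrative only; the numbers are the approximate ones quoted in this guide, not measured values.

```python
# Approximate figures quoted in this guide (not measured values).
TOTAL_IMAGES = 80_000
MODELS = {
    "first model": (500_000, 1_000),     # (words, manuscript images)
    "second model": (1_000_000, 2_000),
}


def words_per_image(words, images):
    """Average number of words per manuscript image."""
    return words / images


def share_of_collection(images, total=TOTAL_IMAGES):
    """Percentage of the whole collection covered by a Ground Truth."""
    return 100 * images / total


for name, (words, images) in MODELS.items():
    print(
        f"{name}: {words_per_image(words, images):.0f} words per image, "
        f"{share_of_collection(images):.2f}% of the collection"
    )
```

Both targets assume about 500 words per image, and even the larger Ground Truth covers only a small fraction of the collection, which is why a trained model is needed for the remainder.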

For our first model, we are using existing semi-diplomatic transcriptions of the HCA 13/72 volume [late 1650s], made between 2013 and 2015 by Marine Lives volunteers. We have to convert these semi-diplomatic transcriptions back to full diplomatic, and mark up contractions, as part of the process of creating a highly reliable Ground Truth. We also have to treat interlineation differently, with each interline shown separately and the insertion point or points marked in the line on which the interlineation depends.



Tools


We are working with several related Transkribus tools and with our own semantic media wiki:

1. Transkribus Expert Client

2. Transkribus Lite version 2.0

3. Marine Lives semantic media wiki



Work Process


1. Automatic layout recognition of all 1518 images in HCA 13/72

2. Input of existing semi-diplomatic transcriptions of HCA 13/72 manuscript pages into Transkribus Expert Client

We are experimenting


We are experimenting with a range of Transkribus tools related to layout analysis and HTR.



Questions


We are developing a running list of questions.

Some of these questions we will probably be able to answer ourselves, as we get more experience of building our Ground Truth.

But, in the meantime, we would appreciate suggestions from fellow Transkribus users.