Difference between revisions of "Marine Lives guide to creating a Transkribus Ground Truth"

From MarineLives
(Using Transkribus Lite version 2.0)
 
(16 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
'''This page is the main page of a guide that Marine Lives is creating to cover practical aspects of creating a Transkribus Ground Truth'''
 
'''This page is the main page of a guide that Marine Lives is creating to cover practical aspects of creating a Transkribus Ground Truth'''
 
+
----
 
__TOC__
 
__TOC__
 
+
----
 
==Objective==
 
==Objective==
  
Line 9: Line 9:
 
[[File:Transkribus Lite Search Bahia Full TexT03032022.png|750px|thumb|left|We want to have 80,000 images covering HCA 13/20 to HCA 13/79 publicly available and searchable by end 2022]]
 
[[File:Transkribus Lite Search Bahia Full TexT03032022.png|750px|thumb|left|We want to have 80,000 images covering HCA 13/20 to HCA 13/79 publicly available and searchable by end 2022]]
  
'''Our overall objective is to make 80,000 images of English High Court of Admiralty depoitions covering 1570 to 1690 publicly available and searchable.'''
+
'''Our overall objective is to make 80,000 images of English High Court of Admiralty depositions covering 1570 to 1690 publicly available and searchable.'''
  
 
'''To do this our immediate objective is to create a C17th English secretarial hand HTR model, which we will use on our collection of 80,000 images of English High Court of Admiralty depositions.'''
 
'''To do this our immediate objective is to create a C17th English secretarial hand HTR model, which we will use on our collection of 80,000 images of English High Court of Admiralty depositions.'''
Line 15: Line 15:
 
We are aiming to create two models. The first based on a Ground Truth of 500,000 words (roughly 1,000 manuscript images). The second based on one million words (roughly 2,000 manuscript images).
 
We are aiming to create two models. The first based on a Ground Truth of 500,000 words (roughly 1,000 manuscript images). The second based on one million words (roughly 2,000 manuscript images).
  
For our first model, we are using existing semi-diplomatic transcriptions of the HCA 13/72 volume [late 1650s], made between 2013 and 2015 by Marine Lives volunteers. We have to convert these semi-diplomatic transcriptions back to full diplomatic, and mark up contractions, as part of the process of creating a highly reliable Ground Truth. We also have to treate interlineation differently, with each interline shown separately and the insertion point or points marked in the line on which the interlineation depends.
+
For our first model, we are using existing semi-diplomatic transcriptions of the HCA 13/72 volume [late 1650s], made between 2013 and 2015 by Marine Lives volunteers. We have to convert these semi-diplomatic transcriptions back to full diplomatic, and mark up contractions, as part of the process of creating a highly reliable Ground Truth. We also have to treat interlineation differently, with each interline shown separately and the insertion point or points marked in the line on which the interlineation depends.
 
----
 
----
 
 
==Tools==
 
==Tools==
----
 
===Using Transkribus Expert Client===
 
  
We are using Transkribus Expert Client as our main tool to perform automated layout recognition, manual correction of these layouts, and to enter and modify existing semi-diplomatic transcriptions of material in HCA 13/72.
+
We are working with several related Transkribus Tools and with our own semantic media wiki
  
We are then using Transkribus Lite version 2.0 to view completed Ground Truth pages, and to keep an overview of our work.
+
[http://www.marinelives.org/wiki/Tools_to_create_our_Ground_Truth#Using_Transkribus_Expert_Client 1. Transkribus Expert Client]
  
----
+
[http://www.marinelives.org/wiki/Tools_to_create_our_Ground_Truth#Using_Transkribus_Lite_version_2.0 2. Transkribus Lite version 2.0]
===Using Transkribus Lite version 2.0===
+
  
[[File:Transkribus Lite HCA1372 Done Thumbnails 03032022.png|750px|thumb|left|Transkribus Lite Version 2.0 Thumbnail display showing images with status "Done" in our work process]]
+
[http://www.marinelives.org/wiki/Tools_to_create_our_Ground_Truth#Marine_Lives_wiki 3. Marine Lives semantic media wiki]
  
 
Transkribus has recently introduced an improved version of its web browser interface ([https://transkribus.eu/lite/ Transkribus Lite Version 2.0]).
 
 
Transkribus has a [https://readcoop.eu/transkribus/howto/getting-started-with-transkribus-lite/ useful online guide to using Transkribus Lite Version 2.0].
 
 
We are finding this improved browser interface to be pretty responsive in terms of short lag times as we browse images.
 
 
The browser interface also has useful functionality not available in Transkribus Expert Client.
 
 
Most useful to date are:
 
 
    Large thumbnails
 
    Ability to display thumbnails by status of manuscript pages within our work process
 
 
----
 
----
 +
==Work Process==
  
===Marine Lives wiki===
+
[http://www.marinelives.org/wiki/Ground_Truth_Work_Process#Automatic_layout_recognition_of_all_1518_images_in_HCA_13.2F72 1 Automatic layout recognition of all 1518 images in HCA 13/72]
  
The [http://www.marinelives.org/wiki/MarineLives Marine Lives wiki] is a Semantic Media Wiki. It is organised into volumes and pages.
+
[http://www.marinelives.org/wiki/Ground_Truth_Work_Process#Input_of_existing_semi-diplomatic_transcriptions_of_HCA_13.2F72_manuscript_pages_into_Transkribus_Expert_Client 2 Input of existing semi-diplomatic transcriptions of HCA 13/72 manuscript pages into Transkribus Expert Client]
  
We are working from volume [http://www.marinelives.org/wiki/HCA_13/72 HCA 13/72] and are inputting existing semi-diplomatic transcriptions from this volume by hand into Transkribus Expert Client.
+
==We are experimenting==
  
----
+
We are experimenting with a range of Transkribus tools related to [http://www.marinelives.org/wiki/Customized_Structural_Analysis layout analysis and HTR].
 
+
==Work process==
+
 
+
We have set up a simple work process
+
 
+
1. Automatic layout recognition of all 1518 images in HCA 13/72
+
 
+
- Used the CITlab Advanced Tool
+
 
+
[[File:CITlab Advanced Tool ML 03032022.png|500px|thumb|left|Layout Analysis controls in Tools section of Transkribus Expert Client controls panel]]
+
 
+
- Modified the layout page by page after manual inspection of automatically generated layouts
+
 
+
    We are only just beginning to think through what makes sense in terms of use of Text Regions when creating our Ground Truth
+
    We are finding that the automatic tool is typically producing between one and three Text Regions per manuscript image
+
    Typically the tool is NOT identifying text blocks on the left hand side of an image as separate from structurally separate text in the main body of text
+
    Ideally, we would train the automatic layout recognition tool to be sensitive to the typical structures of HCA legal depositions, and we are looking into this
+
    In the short term, we are manually adding Text Regions, and changing the shape and size of Text Regions
+
    However, base lines of text have already been recognised and allocated to specific text regions.
+
    We have found an easy way using Transkribus layout tools to reallocate the base lines [see below]
+
 
+
- The three key modifications we are making are
+
 
+
(a) Adjusting the number, size and shape of Text Regions
+
(b) Checking all automatically generated base lines (which themselves are "children" of a parent Text Region)
+
    Look for breaks in base lines
+
    Look for incomplete base lines
+
    Connect broken base lines
+
    Extend incomplete base lines
+
(c) Reallocating base lines to our newly created and/or adjusted Text Regions
+
 
+
[[File:Transkribus Expert Client Layout HCA 1372 f.14v.png|750px|thumb|left|Layout of HCA 13/72 f.14v once we have manually adjusted the Text Regions, creating six Text Regions and reallocating lines to those regions]]
+
 
+
[[File:Reallocating Base Lines To New Text Regions One 03032022.png|750px|thumb|left|Reallocating base lines to new Text Regions: Part One]]
+
 
+
[[File:Reallocating Base Lines To New Text Regions Two 03032022.png|750px|thumb|left|Reallocating base lines to new Text Regions: Part Two]]
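
The base line checks in step (b) above are done by eye in Transkribus Expert Client. As a complement, the following is a minimal sketch of how candidate problems might be listed from a PAGE XML export of a page. It assumes the export uses the 2013-07-15 PAGE namespace and that each TextLine carries a Baseline element with a "points" attribute. The function names, the two thresholds and the sample page are our own illustrations, not part of Transkribus, and the region width is estimated from the base lines themselves rather than from the region outline.

```python
import xml.etree.ElementTree as ET

# Assumed namespace (2013-07-15 PAGE schema); check the root element of your own export.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}


def parse_points(points):
    """Turn a PAGE 'points' string such as '10,20 300,22' into [(10, 20), (300, 22)]."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def baseline_report(page_xml, min_width_ratio=0.5, same_line_tolerance=15):
    """List base lines that look incomplete or broken. Both thresholds are guesses to tune."""
    root = ET.fromstring(page_xml)
    report = []
    for region in root.iterfind(".//pc:TextRegion", NS):
        lines = []  # (line id, leftmost x, rightmost x, mean y)
        for line in region.iterfind("pc:TextLine", NS):
            baseline = line.find("pc:Baseline", NS)
            if baseline is None:
                continue
            pts = parse_points(baseline.get("points"))
            xs = [x for x, _ in pts]
            ys = [y for _, y in pts]
            lines.append((line.get("id"), min(xs), max(xs), sum(ys) / len(ys)))
        if not lines:
            continue
        # Width spanned by all base lines in the region (an estimate, not the region outline).
        region_width = max(l[2] for l in lines) - min(l[1] for l in lines)
        for line_id, x0, x1, _ in lines:
            if region_width and (x1 - x0) < min_width_ratio * region_width:
                report.append((region.get("id"), line_id, "possibly incomplete"))
        # Neighbours at nearly the same height with a horizontal gap may be one broken line.
        ordered = sorted(lines, key=lambda l: (l[3], l[1]))
        for a, b in zip(ordered, ordered[1:]):
            gap = b[1] > a[2] or a[1] > b[2]
            if abs(a[3] - b[3]) <= same_line_tolerance and gap:
                report.append((region.get("id"), f"{a[0]}+{b[0]}", "possibly broken"))
    return report


# Invented sample page: l2 and l3 are two halves of one manuscript line.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="sample.jpg" imageWidth="2000" imageHeight="3000">
    <TextRegion id="r1">
      <TextLine id="l1"><Baseline points="100,200 1000,205"/></TextLine>
      <TextLine id="l2"><Baseline points="100,300 400,302"/></TextLine>
      <TextLine id="l3"><Baseline points="450,304 1000,306"/></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

for entry in baseline_report(SAMPLE):
    print(entry)
```

A short final line of a paragraph will also be reported as "possibly incomplete", so the output is only a list of places to look at, not a list of errors.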
+
  
 
----
 
----
2. Input of existing semi-diplomatic transcriptions of HCA 13/72 manuscript pages into Transkribus Expert Client
+
==Questions==
  
Once the automatically generated Text Regions have been adjusted for a specific image page
+
'''We are developing a [http://www.marinelives.org/wiki/Running_List_of_Questions running list of questions.]'''
 
+
* Input the semi-diplomatic Marine Lives transcription for the relevant page, matching each line of transcribed text to the correct automatically generated line within the correct Text Region
+
 
+
* The chart below shows our workflow for manuscript page HCA 13/72 f.11v. 
+
    We have the Marine Lives wiki open at the correct page on the left hand side of our screen.
+
    In the middle and on the right hand of our screen we have the Transkribus Expert Client open with the Layout Tab open in Transcription View.
+
    This enables us to see the relevant part of the image, with the relevant Text Region.
+
    We are pasting transcribed text against the correct lines.
+
    To ensure a good human overview, we have pasted two or three lines of transcribed text into each Text Region
+
    This gives us good human oversight of the document.
+
    Then we work methodically through all the text
+
 
+
[[File:Workflow Page HCA1372f.11v.png|750px|thumb|left|Our workflow showing Marine Lives wiki page and Transkribus Expert Client with Layout Tab open in Transcription View: Part Two]]
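
Matching each transcribed line to the correct base line within the correct Text Region depends on the two line counts agreeing. The sketch below shows one way this could be checked before pasting. It is not part of our current manual workflow, and the function name, region names, line ids and placeholder text are all invented for illustration.

```python
def pair_lines(baseline_ids, transcription):
    """Pair each transcribed line with a base line id, region by region.

    Both arguments map a Text Region name to a list in reading order.
    Returns (pairs, problems); regions whose counts differ are left unpaired.
    """
    pairs, problems = {}, []
    for region in sorted(set(baseline_ids) | set(transcription)):
        ids = baseline_ids.get(region, [])
        text = transcription.get(region, [])
        if len(ids) != len(text):
            problems.append(
                f"{region}: {len(ids)} base lines but {len(text)} transcribed lines"
            )
            continue
        pairs[region] = list(zip(ids, text))
    return pairs, problems


# Invented example: region_2 has a base line with no transcribed text yet.
baseline_ids = {"region_1": ["r1l1", "r1l2"], "region_2": ["r2l1"]}
transcription = {
    "region_1": ["placeholder line one", "placeholder line two"],
    "region_2": [],
}

pairs, problems = pair_lines(baseline_ids, transcription)
print(pairs)
print(problems)
```

A mismatch usually means either a broken or missing base line in the layout, or a line of text allocated to the wrong Text Region.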
+
 
+
----
+
 
+
==QUESTIONS==
+
 
+
'''We are developing a running list of questions.'''
+
  
 
'''Some of these questions we will probably be able to answer ourselves, as we get more experience of building our Ground Truth.'''
 
'''Some of these questions we will probably be able to answer ourselves, as we get more experience of building our Ground Truth.'''
  
 
'''But, in the meantime, we would appreciate suggestions from fellow Transkribus users.'''
 
'''But, in the meantime, we would appreciate suggestions from fellow Transkribus users.'''
 
----
 
===Question One===
 
 
'''Question One: Is this a sensible division of this page into Text Regions?'''
 
 
[[File:Use Of Transkribus Text Regions 03032022.png|750px|thumb|left|Question One: Is this a sensible division of this page into Text Regions?]]
 
----
 
===Question Two===
 
 
'''Is it best practice to avoid overlapping Text Regions by using irregular shapes? Or, is it better to keep rectangular shapes, aligned horizontally with image, and to accept overlapping Text Regions?'''
 
 
----
 
===Question Three===
 
 
'''How can we train the Transkribus automatic layout tools to understand the range of document structures we have?'''
 
 
<u>Typical structure and variations</u>
 
 
HCA depositions are typically structured with three implied columns of text. Depositions can be as short as a quarter of an image page, or as long as ten image pages.
 
 
Most HCA deposition image pages have an implied three column structure.
 
 
At the start of a deposition, there may be a date, which is usually in the central column. There is then the long or short form name of the cause, which is usually in the left hand column. At the same horizontal level, or somewhat lower, follows the full name of the individual being examined (the deponent), together with their residential location, their occupation and age. These data, which for convenience we call the "Personal Front matter", typically run across the central and right hand columns.
 
 
The main body of the deposition (answers to an allegation or libel, or to interrogatories) is in the centre and right of a page.
 
 
At the same horizontal level as the main body of the deposition, there may be additional data in the left hand column, for example merchant marks (typically pictograms) referred to in the main body.
 
 
At the foot of the main body of a deposition, there is a signature, mark or initial(s) (which we describe together as "Signoffs") of the person being examined. This may be in the central column, the right hand column, or running over both the central and right hand columns.
 
 
Near the horizontal level of the signoff, there is usually some legal boilerplate in the left hand column.
 
 
<u>Human reading of our documents</u>
 
 
Human beings read legal depositions by starting in the top left hand side of a page, then moving their eyes to the first block of text on the left and then the right, in a zig zag
 
'''
 

Latest revision as of 22:25, March 3, 2022

This page is the main page of a guide that Marine Lives is creating to cover practical aspects of creating a Transkribus Ground Truth






Objective


Transkribus web capability has simple, but useful search functionality
We want to have 80,000 images covering HCA 13/20 to HCA 13/79 publicly available and searchable by end 2022

Our overall objective is to make 80,000 images of English High Court of Admiralty depositions covering 1570 to 1690 publicly available and searchable.

To do this our immediate objective is to create a C17th English secretarial hand HTR model, which we will use on our collection of 80,000 images of English High Court of Admiralty depositions.

We are aiming to create two models. The first based on a Ground Truth of 500,000 words (roughly 1,000 manuscript images). The second based on one million words (roughly 2,000 manuscript images).
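
As a rough cross-check on these figures, the short sketch below works out the words per image they imply and the share of the full collection each Ground Truth would cover. The numbers are those given above; the function and variable names are our own.

```python
# Figures taken from the text above; names are illustrative only.
TOTAL_IMAGES = 80_000
GROUND_TRUTHS = {
    "first model": {"words": 500_000, "images": 1_000},
    "second model": {"words": 1_000_000, "images": 2_000},
}


def summarise(words, images, total_images=TOTAL_IMAGES):
    """Return (words per image, percentage of the full collection)."""
    return words / images, 100 * images / total_images


for name, gt in GROUND_TRUTHS.items():
    per_image, share = summarise(gt["words"], gt["images"])
    print(f"{name}: about {per_image:.0f} words per image, {share:.2f}% of all images")
```

Both Ground Truths assume roughly 500 words per manuscript image, and even the larger one covers only a small fraction of the 80,000 images.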

For our first model, we are using existing semi-diplomatic transcriptions of the HCA 13/72 volume [late 1650s], made between 2013 and 2015 by Marine Lives volunteers. We have to convert these semi-diplomatic transcriptions back to full diplomatic, and mark up contractions, as part of the process of creating a highly reliable Ground Truth. We also have to treat interlineation differently, with each interline shown separately and the insertion point or points marked in the line on which the interlineation depends.



Tools


We are working with several related Transkribus Tools and with our own semantic media wiki

1. Transkribus Expert Client

2. Transkribus Lite version 2.0

3. Marine Lives semantic media wiki



Work Process


1 Automatic layout recognition of all 1518 images in HCA 13/72

2 Input of existing semi-diplomatic transcriptions of HCA 13/72 manuscript pages into Transkribus Expert Client

We are experimenting


We are experimenting with a range of Transkribus tools related to layout analysis and HTR.



Questions


We are developing a running list of questions.

Some of these questions we will probably be able to answer ourselves, as we get more experience of building our Ground Truth.

But, in the meantime, we would appreciate suggestions from fellow Transkribus users.