Ground Truth Work Process

From MarineLives

Revision as of 21:31, March 3, 2022

We have set up a simple work process

Automatic layout recognition of all 1518 images in HCA 13/72

- Used the CITlab Advanced Tool

Layout Analysis controls in Tools section of Transkribus Expert Client controls panel

- Modified the layout page by page after manual inspection of automatically generated layouts

    We are only just beginning to think through how best to use Text Regions when creating our Ground Truth.
    The automatic tool typically produces between one and three Text Regions per manuscript image.
    It typically does NOT separate text blocks on the left hand side of an image from the structurally separate main body of text.
    Ideally, we would train the automatic layout recognition tool to be sensitive to the typical structures of HCA legal depositions, and we are looking into this.
    In the short term, we are manually adding Text Regions and changing the shape and size of existing ones.
    However, by that point the base lines of text have already been recognised and allocated to specific Text Regions.
    We have found an easy way, using the Transkribus layout tools, to reallocate the base lines [see below].

- The three key modifications we are making are

(a) Adjusting the number, size and shape of Text Regions
(b) Checking all automatically generated base lines (which are themselves "children" of a parent Text Region)

    Look for breaks in base lines
    Look for incomplete base lines
    Connect broken base lines
    Extend incomplete base lines

(c) Reallocating base lines to our newly created and/or adjusted Text Regions
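
The base line checks in (b) are done by eye in Transkribus, but the idea can be illustrated with a minimal sketch. It assumes each base line is available as a left-to-right list of (x, y) pixel coordinates, which is how a PAGE XML export represents them. The function names and the pixel thresholds `max_gap` and `max_dy` are hypothetical and would need tuning against real HCA 13/72 images.

```python
from typing import List, Tuple

Point = Tuple[int, int]
Baseline = List[Point]  # polyline, ordered left to right


def find_broken_pairs(baselines: List[Baseline],
                      max_gap: int = 60,
                      max_dy: int = 15) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) where base line j looks like a continuation of i.

    A pair is flagged when j starts a short distance to the right of the
    end of i and at roughly the same height. Thresholds are illustrative.
    """
    pairs = []
    for i, left in enumerate(baselines):
        end_x, end_y = left[-1]
        for j, right in enumerate(baselines):
            if i == j:
                continue
            start_x, start_y = right[0]
            gap = start_x - end_x
            if 0 <= gap <= max_gap and abs(start_y - end_y) <= max_dy:
                pairs.append((i, j))
    return pairs


def merge_baselines(left: Baseline, right: Baseline) -> Baseline:
    """Connect two fragments into one base line by concatenating their points."""
    return left + right
```

Flagged pairs are only candidates: a human still needs to confirm on the image that the two fragments belong to the same written line before connecting them.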

Layout of HCA 13/72 f.14v after manual adjustment of the Text Regions: six Text Regions created and lines reallocated to those regions
Reallocating base lines to new Text Regions: Part One
Reallocating base lines to new Text Regions: Part Two
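
We reallocate base lines by hand in the Transkribus layout tools. As an illustration of the underlying logic only, the sketch below assigns each base line to the Text Region whose outline contains the centre of that base line. It assumes regions are simple polygons and base lines are lists of (x, y) points. The function names and the region and line identifiers are hypothetical, and this is not how Transkribus itself performs the reallocation.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]


def point_in_polygon(pt: Point, polygon: List[Point]) -> bool:
    """Ray casting test: count polygon edges crossed by a ray going right from pt."""
    x, y = pt
    inside = False
    n = len(polygon)
    for k in range(n):
        x1, y1 = polygon[k]
        x2, y2 = polygon[(k + 1) % n]
        if (y1 > y) != (y2 > y):  # edge spans the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def baseline_centre(baseline: List[Point]) -> Point:
    """Mean of the base line's points, used as a single representative point."""
    xs = [p[0] for p in baseline]
    ys = [p[1] for p in baseline]
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def reallocate(baselines: Dict[str, List[Point]],
               regions: Dict[str, List[Point]]) -> Dict[str, Optional[str]]:
    """Map each line id to the first region containing its centre, else None."""
    result: Dict[str, Optional[str]] = {}
    for line_id, baseline in baselines.items():
        centre = baseline_centre(baseline)
        result[line_id] = next(
            (rid for rid, poly in regions.items() if point_in_polygon(centre, poly)),
            None,
        )
    return result
```

Lines mapped to `None` sit outside every region and would need either a new Text Region or a manual decision.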

Input of existing semi-diplomatic transcriptions of HCA 13/72 manuscript pages into Transkribus Expert Client

Once the automatically generated Text Regions have been adjusted for a specific image page:

  • Input the semi-diplomatic Marine Lives transcription for the relevant page, matching each line of transcribed text to the correct automatically generated line within the correct Text Region

  • The chart below shows our workflow for manuscript page HCA 13/72 f.11v.
    We have the Marine Lives wiki open at the correct page on the left hand side of our screen.
    In the middle and on the right hand side of our screen we have the Transkribus Expert Client open, with the Layout Tab open in Transcription View.
    This enables us to see the relevant part of the image, with the relevant Text Region.
    We paste transcribed text against the correct lines.
    To keep a good human overview of the document, we first paste two or three lines of transcribed text into each Text Region.
    Then we work methodically through all the text.

Our workflow showing Marine Lives wiki page and Transkribus Expert Client with Layout Tab open in Transcription View: Part Two
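
The line matching described above is manual, but its logic can be sketched as follows: order the lines of a Text Region from top to bottom, pair them one by one with the lines of the transcription, and report anything left over. The sketch assumes base lines are lists of (x, y) points keyed by a line id and that the transcription for one Text Region is a plain string with one transcribed line per row. All names are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def order_line_ids(baselines: Dict[str, List[Point]]) -> List[str]:
    """Sort line ids from top to bottom by the mean y coordinate of each base line."""
    def mean_y(line_id: str) -> float:
        pts = baselines[line_id]
        return sum(p[1] for p in pts) / len(pts)
    return sorted(baselines, key=mean_y)


def pair_transcription(line_ids: List[str],
                       transcription: str
                       ) -> Tuple[List[Tuple[str, str]], List[str], List[str]]:
    """Pair ordered line ids with transcribed rows.

    Returns (pairs, unmatched_line_ids, unmatched_text_rows). Blank rows
    in the transcription are ignored.
    """
    rows = [row.strip() for row in transcription.splitlines() if row.strip()]
    pairs = list(zip(line_ids, rows))
    unmatched_lines = line_ids[len(rows):]
    unmatched_rows = rows[len(line_ids):]
    return pairs, unmatched_lines, unmatched_rows
```

A non-empty leftover list is a useful warning sign: it usually means a base line is broken or missing, or a line has been allocated to the wrong Text Region.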