Team Giovanni/Patrizia

From MarineLives

Team Colin

Editorial history

23/08/12: CSG, created page






Suggested links


Team Colin
Team Jill
Team William

TEI: Text Encoding Initiative
TEI Lite
http://www.w3.org/XML/



Tasks for the week



Week commencing 20th August 2012




Week commencing 27th August 2012


- Patrizia, how should we deal with quantities and currencies database-wise? <quantity value="hour">6. howers</quantity> OR <quantity value="hour">6</quantity>. howers? I suggested the latter.
Giovanni: I agree
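
A minimal sketch of why the second form may suit a database better: with only the number inside the tag, the unit (from the value attribute) and the amount can be read off directly. The <line> wrapper element and the function name are assumptions made for this illustration, not part of the agreed tag set.

```python
import xml.etree.ElementTree as ET

# One transcribed line in the agreed form: only the number is tagged.
# The <line> wrapper is hypothetical; it just gives the parser a single root.
line = '<line>about <quantity value="hour">6</quantity>. howers</line>'


def extract_quantities(xml_text):
    """Return (unit, amount) pairs for every <quantity> element."""
    root = ET.fromstring(xml_text)
    return [(q.get("value"), q.text) for q in root.iter("quantity")]


print(extract_quantities(line))
```

With the first form, the amount would come back as "6. howers" and would need further cleaning before it could be stored as a number.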

- We'll need to decide which elements we want in the header, to mimic some of those in a TEI header. If you can give this some thought, it would be great.
I'm afraid I'm not yet familiar with the structure of the papers. Which elements do we want to include?

- One further concern: units of meaning in the text (depositions, cases, etc.). We need to identify them properly I think. Followup: we should probably have one header per document (ie picture), and a separate header for cases, depositions (each case contains many, scattered across different pages). Ideally we'll have some metadata to associate the document with a picture, but we'll also represent the units of meaning of the source, which is paramount.
Patrizia: isn't this something that should mimic the structure of the national archives? (I mean the documentary unit)



Week commencing 3rd September 2012




Useful email records



Patrizia to Giovanni, Colin, Charlene, Stuart, William, Jill; 31/08/12: 23:34


Re: Buttons-Tags: current status (Comments from Patrizia)

GIOVANNI: 2012/8/31 Giovanni Colavizza <giovannicolavizza@gmail.com>

   Dears,


   a brief mail to summarize our current status and focus your attention on some issues that need feedback.


   - help: use it to point the facilitator to a dubious passage: does it fit its role properly?
   - HEADER: this will probably undergo some modifications. It's meant to describe the document being transcribed (i.e. 1 picture), which is an arbitrary unit of work. I suggest we keep it, but move the description of documentary units of meaning (cases, depositions) outside, with separate metadata fields. Discussion is under way.
   - Italic (abbreviations and such), strikes and underlines are pretty straightforward, right?
   - Alt/Sic is meant to cover misspelling and word variants. Do you find it sound and sufficient for this task?
   - Person: is it ok?
   - Profession: this also covers titles, occupations, etc. Do you think it's too generic or shall suffice?


PATRIZIA: I would really split the two categories for the reasons explained yesterday: in my opinion many people may have both attributes

   - Ship?
   - Commodity, Currency, Quantity: do you like the way they work? Would you prefer some disambiguation for example for quantities (ie distinction between weights, distances, etc.)?
   - Place: too generic? Would you prefer to distinguish among cities, countries, regions? If so, would you do it with an attribute (like the value system for Commodity, Currency, Quantity) or with different tags?


PATRIZIA: In my opinion we can leave it to the database: the purpose of a database is to capture everything that is 'fixed'. If Paris is a city, it will be a city at any time it's mentioned. There is no reason to repeat it at any time it's tagged, no? Conversely, for example, an occupation or a title are transient. Thus, it's fine to tag them in relation to a given document, because they can be time-bound.

   - Date, Note?
   - Special characters: is the menu fine? Any more special character you might need?


PATRIZIA: Very good solution!

   GIOVANNI: Anything else left out, comments? Also, I'd appreciate it if someone with an aesthetic sense would help me with the colours and layout of the buttons. For now, it's pretty random.


   NB
   If you save to the server, once a transcription is complete, please use the document title (for example HCA 13/71 f.137r P1130424).


   Thank you for your collaboration, all the best




Patrizia to Giovanni, Colin, Charlene; 31/08/12: 23:26


Database vs free text

       COLIN: (2) I like the idea of coding from a fixed list for the charge or substance of a case, e.g. alleged fraudulent acquisition of three bales of cotton by the master of the Red Hand, in contravention of the charterparty; e.g. failure to pay the wages of the crew of the Goulden Eagle; e.g. negligence of the Master of the Prosperous leading to its sinking and the loss of one thousand pounds sterling of cargo. There would be a fixed list like Fraud, Failure to pay wages, and Negligence, but the charge entry should ideally contain some longer text, so that a researcher, after interrogating the relational database for all cases involving (1) French plaintiffs OR defendants AND (2) Fraud, could parse through a list of summary cases, rather than having to go to every page which satisfied the Boolean search criteria


   GIOVANNI: Exactly my idea. The longer version should go into the summary, or into another tag made for the purpose. With the database, we'll then be able to search for every case regarding a specific infraction, get all the summaries, and go to the specific cases we need. Top down refining.


PATRIZIA: If I may add something: this is in my opinion a very important point, and it also highlights what makes the database approach so important.

Broadly speaking, a free text reports things 'in your own words', while a database tries to normalize the description of what happened and privileges consistency.
In other words: being obliged to write what historians call the 'regesto' (you say 'summary' in English?) requires grouping things into categories, and is in my opinion a very good exercise, also because it asks the transcriber to review his work.

Think how a copy book or a notary book was usually kept:


       Felix Don Lewis, Thomas Pattrick and S - who were convicted for Forgery, were ordered stand in the Pillory at the Royal Exchange in Cornhil, and one of each of their Ears to be cut off, and to continue in Prison for twelve Months after, according to the Intent of the Statute in that Case made and provided.


This comes from the Old Bailey Proceedings, and lists the essential people and facts involved.

As Giovanni says, if we marry a 'bottom up' transcription with a 'top down' fine-tuning, we may be able to keep the best of both worlds: in phase one the transcribers do the awful work of understanding what is written (i.e. they simply transcribe), and in phase two, once they have finished the transcription, they summarize the document (like above).
On this basis, the database fundamentalists (i.e. Giovanni and myself) build the categories and the relevant database.

I understand Charlene's point (from the point of view of somebody interested in legal inquiry), but trying to fix all the categories and problems NOW is in my opinion too early.

Does it make sense?



Jill to Giovanni, Patrizia, Colin, Stuart, William; 31/08/12: 22:45


Buttons and tags & need for an editorial policy for category markup (Comments from Patrizia)

Dear all,

alas, the Riesling was very good and abundant, but I find Jill's suggestion below very wise. All the more so after reading Charlene's very careful comments, which are very appropriate but go much too far for 'new' transcribers who still need to get into the mood of the papers.

I really think that we need a couple of weeks of *pure* focus on transcriptions before we are able to make any point.
Catching up with two weeks of transcription won't add too much work, and I guess that both Giovanni and I are eager to volunteer for that.

Conversely, if we assume and demand too much of the transcribers from 'moment zero', we may not be able to focus on collecting all the problems and bottlenecks that they meet.

I wouldn't really go into more detail than what we have accumulated until now.

Quite the opposite: I would also rely on what the transcribers who didn't take part in this conversation are able to report about their contact with the papers.

What do you think?



Charlene to Giovanni, Patrizia, Stuart, cc William, Jill


On 31 August 2012 18:36, Charlene Eska <ceska@vt.edu> wrote:

Colin,

I’m going through the page as per your directions below. For items you have indicated in red with the Help button, is there a system/procedure yet for being able to receive help/feedback? In other words, I have confirmed your reading of the word ‘since’ which you indicated you needed help with, but I don’t know what to do next…. Thanks!



Jill to Giovanni, Patrizia, Colin, Stuart, William; 31/08/12: XX


Buttons and tags & need for an editorial policy for category markup (Comments from Jill)

2012/8/31 Jill Wilcox <jillwlcx@gmail.com>

   Dear all,


   I have been reading through the emails about using the buttons and transcribing.  If I have read them correctly, you are proposing that for the first 4 weeks we just transcribe and then go back and add in the buttons afterwards.  Whilst I can see the logic for this it does appear to be creating extra work in some respects.


   I am wondering whether we could maybe have two weeks of just transcribing, and then the person who first checks the document could mark it up as they check it. What do you think?


   Best wishes Jill




Giovanni to Colin, Patrizia, Charlene; 31/08/12: 16:22


Good remarks; since I am unacquainted with the documents, they are much needed. Patrizia and Charlene, please drop a comment on this when you can, thank you. My replies follow.

2012/8/31 Colin Greenstreet <colin.greenstreet@googlemail.com>

   Hi Giovanni et al


   I'm just back from a mind clearing long walk with Bron, my Hungarian vizsla.


   I think I agree with your suggestion Giovanni, but want to mull for a couple more hours


   (1) Essentially, as I understand it, you are proposing that we will be able to identify with metadata (1) all manuscript pages (2) all cases (3) all depositions, and in the instances of cases and depositions, we will be able to identify all manuscript pages which display entirely or in part a case or deposition, and all depositions which display entirely or in part a deposition within a specific case


yes. This aim will require more work making the database, but providing this metadata now is essential.


   (2) I like the idea of coding from a fixed list for the charge or substance of a case, e.g. alleged fraudulent acquisition of three bales of cotton by the master of the Red Hand, in contravention of the charterparty; e.g. failure to pay the wages of the crew of the Goulden Eagle; e.g. negligence of the Master of the Prosperous leading to its sinking and the loss of one thousand pounds sterling of cargo. There would be a fixed list like Fraud, Failure to pay wages, and Negligence, but the charge entry should ideally contain some longer text, so that a researcher, after interrogating the relational database for all cases involving (1) French plaintiffs OR defendants AND (2) Fraud, could parse through a list of summary cases, rather than having to go to every page which satisfied the Boolean search criteria


Exactly my idea. The longer version should go into the summary, or into another tag made for the purpose. With the database, we'll then be able to search for every case regarding a specific infraction, get all the summaries, and go to the specific cases we need. Top down refining.
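
The top down refining described here could be sketched roughly as follows. This is only an illustration under stated assumptions: the Case class, its field names and the party nationalities are invented for the example, and the summaries are adapted from Colin's hypothetical cases above. No database schema has been decided.

```python
from dataclasses import dataclass


@dataclass
class Case:
    charge: str                      # term from the fixed list, e.g. "Fraud"
    summary: str                     # longer free-text description
    party_nationalities: frozenset   # plaintiffs and defendants together


# Illustrative records only; the nationalities are made up for the sketch.
cases = [
    Case("Fraud",
         "Alleged fraudulent acquisition of three bales of cotton "
         "by the master of the Red Hand",
         frozenset({"French", "English"})),
    Case("Failure to pay wages",
         "Failure to pay the wages of the crew of the Goulden Eagle",
         frozenset({"English"})),
    Case("Negligence",
         "Negligence of the Master of the Prosperous leading to its sinking",
         frozenset({"Dutch", "English"})),
]


def find_summaries(cases, charge, nationality):
    """Step 1: filter on the fixed-list charge and on a party nationality.
    Step 2: return only the summaries, which the researcher reads before
    deciding which manuscript pages to open."""
    return [c.summary for c in cases
            if c.charge == charge and nationality in c.party_nationalities]


for summary in find_summaries(cases, "Fraud", "French"):
    print(summary)
```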


   (3) Folios and foliation


   Unless I misunderstand your intention behind:


   <folio>99r</folio>
   <foliation>XXwhatever (blank if missing or verso)</foliation>


   There is no value in having a coding field for foliation that I can see.


   A volume is either foliated or not. Very occasionally (but not in the case of HCA 13/71) you have a volume which has both original handwritten foliation (i.e. folio numbers, f.101r, f.101v, f.102r, f.102v...) and block printed foliation added by an archivist, probably in the early C20th (i.e. folio numbers, f.101r, f.101v, f.102r, f.102v...).  And yes, the two series of folio numbers can be out of sync, with the handwritten leaf stating f.108 and the block printed leaf stating f.109.  To be crystal clear, whether handwritten or block printed, the folio numbers do not include "v" = verso, and "r" = recto.  These supplementary terms (v and r) are, however, always included in an academic citation, and are imputed by the academic or public historian who determines whether the page is facing or reverse (recto or verso)


ok fine, if the foliation is inconsistent, no point in adding an extra field for it. I just remembered an A.A.2 somewhere on a corner, and thought it was some sort of foliation. Let's stick with the folio field then.
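
Since the folio field will carry both the printed folio number and the imputed recto/verso letter, a small parser could split a citation into its parts. This is a sketch only: the regular expression assumes citations written like "HCA 13/71 f.99v" (optionally with a space after "f."), and the function and key names are invented for the illustration.

```python
import re

# Assumed citation shape: "<series> f.<number><r|v>", e.g. "HCA 13/71 f.99v".
FOLIO_RE = re.compile(
    r"^(?P<series>HCA \d+/\d+) f\.\s?(?P<number>\d+)(?P<side>[rv])$"
)


def parse_folio(citation):
    """Split a folio citation into series, folio number and side."""
    match = FOLIO_RE.match(citation)
    if match is None:
        raise ValueError(f"not a recognised folio citation: {citation!r}")
    return {
        "series": match["series"],
        "number": int(match["number"]),
        # r and v are imputed by the historian, not printed on the leaf
        "side": "recto" if match["side"] == "r" else "verso",
    }


print(parse_folio("HCA 13/71 f.99v"))
```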


   (4) Correctly, I believe, you attach the status and first transcriber codes to the manuscript page, since the workload will be allocated to individuals by page, and it is desirable for pages to be "owned" by individuals.  Team members will feel, I believe, a sense of pride in completing a page, getting the transcription signed off, and then, later, getting the markup signed off. I also think facilitators will allocate pages, not depositions or cases, each week, after a discussion with team members about their availability and estimated productivity.


   <status>First cut transcription completed; Requires editorial input</status>
   <first-transcriber>Colin Greenstreet, 28/08/12</first-transcriber>


yes, exactly


   (5) Case metadata


   Plaintiff and defendant feel right, though I think there may have been different C17th terminology.  Please note that plaintiffs and defendants are typically groups of people. Sometimes they are groups of named individuals, and sometimes they are companies (note that investors could be one off investors in a specific voyage, or can be grouped together in twos and threes as partners in a partnership). "Sir John Frederick and Company" is a partnership, with the partners other than Sir John Frederick not identified.


   <plaintiff>ny</plaintiff>
   <defendant>yx</defendant>


of course the terminology is not my field here, and you surely have a better idea. If there are groups of people, it's fine to have verbatim content in these fields. Also, these fields can contain more markup if needed. But groups of persons raise a potential issue: if we have a company, how do we tag it? It's a juridical person, but not a physical person. Your example is perfect for raising questions:
<defendant><profession>Sir</profession> <person>John Frederick</person> and <person>Company</person></defendant> this is what we have at the moment, and I see it's weak. A title tag and a company tag would greatly improve the semantics of our markup. Another way around would be to consider the whole Sir John Frederick and Company as a person, meaning a company. Thus: <defendant><person><profession>Sir</profession> <person>John Frederick</person> and <person>Company</person></person></defendant>. Cumbersome. Ideas?


   (6) Deposition metadata


   You propose:


   <witness>name</witness>


   Do you want us simply to input


   <witness>John Brown</witness>


   Or do you want <witness>John Brown, of <place>Deptford in the County of Kent</place>, <profession>Mariner</profession></witness>


I'd prefer the second, with a <person> tag for John Brown.
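
To illustrate what the second form buys us, here is a rough sketch that reads a witness element with nested tags. It assumes the richer markup with a <person> tag added, as preferred above; the variable names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Colin's second example, with the <person> tag added around the name.
witness_xml = (
    "<witness><person>John Brown</person>, of "
    "<place>Deptford in the County of Kent</place>, "
    "<profession>Mariner</profession></witness>"
)

witness = ET.fromstring(witness_xml)

# Each nested tag becomes a field that a database could index.
record = {child.tag: child.text for child in witness}

# The verbatim wording is still recoverable by dropping the tags.
full_text = "".join(witness.itertext())

print(record)
print(full_text)
```

The plain <witness>John Brown</witness> form would give only the verbatim text, with nothing to tell the name apart from the place or the occupation.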



Colin to Giovanni, Patrizia, Charlene; 31/08/12; 15:09


I'm just back from a mind clearing long walk with Bron, my Hungarian vizsla.

I think I agree with your suggestion Giovanni, but want to mull for a couple more hours

(1) Essentially, as I understand it, you are proposing that we will be able to identify with metadata (1) all manuscript pages (2) all cases (3) all depositions, and in the instances of cases and depositions, we will be able to identify all manuscript pages which display entirely or in part a case or deposition, and all depositions which display entirely or in part a deposition within a specific case

(2) I like the idea of coding from a fixed list for the charge or substance of a case, e.g. alleged fraudulent acquisition of three bales of cotton by the master of the Red Hand, in contravention of the charterparty; e.g. failure to pay the wages of the crew of the Goulden Eagle; e.g. negligence of the Master of the Prosperous leading to its sinking and the loss of one thousand pounds sterling of cargo. There would be a fixed list like Fraud, Failure to pay wages, and Negligence, but the charge entry should ideally contain some longer text, so that a researcher, after interrogating the relational database for all cases involving (1) French plaintiffs OR defendants AND (2) Fraud, could parse through a list of summary cases, rather than having to go to every page which satisfied the Boolean search criteria

(3) Folios and foliation

Unless I misunderstand your intention behind:

<folio>99r</folio>
<foliation>XXwhatever (blank if missing or verso)</foliation>

There is no value in having a coding field for foliation that I can see.

A volume is either foliated or not. Very occasionally (but not in the case of HCA 13/71) you have a volume which has both original handwritten foliation (i.e. folio numbers, f.101r, f.101v, f.102r, f.102v...) and block printed foliation added by an archivist, probably in the early C20th (i.e. folio numbers, f.101r, f.101v, f.102r, f.102v...). And yes, the two series of folio numbers can be out of sync, with the handwritten leaf stating f.108 and the block printed leaf stating f.109. To be crystal clear, whether handwritten or block printed, the folio numbers do not include "v" = verso, and "r" = recto. These supplementary terms (v and r) are, however, always included in an academic citation, and are imputed by the academic or public historian who determines whether the page is facing or reverse (recto or verso)

(4) Correctly, I believe, you attach the status and first transcriber codes to the manuscript page, since the workload will be allocated to individuals by page, and it is desirable for pages to be "owned" by individuals. Team members will feel, I believe, a sense of pride in completing a page, getting the transcription signed off, and then, later, getting the markup signed off. I also think facilitators will allocate pages, not depositions or cases, each week, after a discussion with team members about their availability and estimated productivity.


<status>First cut transcription completed; Requires editorial input</status>
<first-transcriber>Colin Greenstreet, 28/08/12</first-transcriber>

(5) Case metadata

Plaintiff and defendant feel right, though I think there may have been different C17th terminology. Please note that plaintiffs and defendants are typically groups of people. Sometimes they are groups of named individuals, and sometimes they are companies (note that investors could be one off investors in a specific voyage, or can be grouped together in twos and threes as partners in a partnership). "Sir John Frederick and Company" is a partnership, with the partners other than Sir John Frederick not identified.

<plaintiff>ny</plaintiff>
<defendant>yx</defendant>

(6) Deposition metadata

You propose:

<witness>name</witness>

Do you want us simply to input

<witness>John Brown</witness>

Or do you want <witness>John Brown, of <place>Deptford in the County of Kent</place>, <profession>Mariner</profession></witness>


Giovanni to Patrizia, Charlene, Colin: 31/08/12; 13:30


thank you Colin, very clarifying. A few comments below and a proposal.

I suggest we have 1 header per document/picture, with this structure:
<header>
<series>HCA 13/71</series>
<folio>99r</folio>
<foliation>XXwhatever (blank if missing or verso)</foliation>
<picture>P1130401</picture>
<summary>Brief description of contents by the transcriber/facilitator, to be done at the end of the work. For example: Answers to the Allegations 3, 4 and part of 5 by xy, For the case regarding yx... Brief summary: the witness talks about.. To be thought of as a summary catalogue entry.</summary>
<document-date>25/02/1655</document-date> directly in modern form

<status>First cut transcription completed; Requires editorial input</status>
<first-transcriber>Colin Greenstreet, 28/08/12</first-transcriber>
</header>

Then, we should have 2 more headers, per case and per deposition:

<case>
<summary>as in the document</summary>
<date value="normalized form">as in document</date>
<charge>piracy, mutiny, etc. we might build a vocabulary with these terms</charge>
<plaintiff>ny</plaintiff>
<defendant>yx</defendant>
anything else?
</case>

<deposition>
<summary>as in the document</summary>
<date value="normalized form">as in document</date>
<witness>name</witness>
anything else?
</deposition>

these two metadata blocks will go at the very start of a case/deposition, so only whoever transcribes the page where the case or deposition begins needs to fill them in. What do you think?
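
As a rough sketch of how the transcription tool might generate the per-picture header proposed above, the following builds it from a few values. The function name and parameters are assumptions for the illustration; only some of the proposed fields are included (foliation, summary and document-date are left out for brevity).

```python
import xml.etree.ElementTree as ET


def build_header(series, folio, picture, status, first_transcriber):
    """Build a <header> element with one child per metadata field."""
    header = ET.Element("header")
    fields = [
        ("series", series),
        ("folio", folio),
        ("picture", picture),
        ("status", status),
        ("first-transcriber", first_transcriber),
    ]
    for tag, value in fields:
        ET.SubElement(header, tag).text = value
    return header


header = build_header(
    "HCA 13/71",
    "99r",
    "P1130401",
    "First cut transcription completed; Requires editorial input",
    "Colin Greenstreet, 28/08/12",
)
print(ET.tostring(header, encoding="unicode"))
```

The case and deposition blocks could be produced the same way, with their own field lists.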

Comments:
- pay extreme attention not to have overlapping tags. Tags can contain any amount of tagged elements, but in a hierarchical fashion, not with overlaps (see below person and help, Patrizia already pointed this out. It's very important).
- I like the way you use notes for signatures and format elements.
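
The point about overlapping tags can be checked mechanically. A minimal sketch: an XML parser rejects overlapping tags and accepts hierarchical ones. The is_well_formed helper and the <root> wrapper are assumptions for this example; the overlapping fragment is the person/help case from the sample header.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if the tagged fragment parses as XML.
    The fragment is wrapped in a hypothetical <root> element so that
    plain text around the tags is allowed."""
    try:
        ET.fromstring(f"<root>{fragment}</root>")
        return True
    except ET.ParseError:
        return False


# <help> opens inside <person> but closes after it: an overlap.
overlapping = "<person>Daniel <help>Curetson</person></help>"
# <help> opens and closes inside <person>: properly hierarchical.
nested = "<person>Daniel <help>Curetson</help></person>"

print(is_well_formed(overlapping))
print(is_well_formed(nested))
```

A check like this could run when a transcription is saved, so that overlaps are caught before the page reaches a facilitator.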



Colin to William; 31/08/12: 11:49


Hi William

Good to hear from you, and thanks for confirming that you are following the debate and thinking about the issues. We are not copying you and Jill and Stuart in on absolutely everything, simply not to swamp you all. Thanks too for your steer on the various substantive issues mentioned in your email.

I agree on need to distinguish countries, cities, and other places, though there will have to be a post initial classification check of cities vs. others.

I also agree on excluding "said" and "the" and "a" and "an" from names, so "Lord Protector", not "The Lord Protector" and "Councell" not "The Councell" and "King", not "The King".

The one problem with leaving out "said" (and I agree with leaving it out) is that when (as I assume we will) someone goes through all references to "ship" and tries to link "ship" to named ships elsewhere on the same manuscript page, or on nearby pages, for inclusion in the relational database that Patrizia rightly wants to build, we have lost the ability to go first to a "said ship" category. I am presuming that if the language of "said ship" is in the manuscript, then probably somewhere in the deposition the ship was named, or at least referred to as "the english friggatt" or "the french man of warr". But simply in terms of the way that TEI codes appear to be sorted and parsed, it appears that the first character of the text which has been coded is key.
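
One possible way to reconcile the two concerns, offered only as a sketch: strip the leading words when computing a sort key, but record whether "said" was present. The function name, the word list and the inclusion of the spelling "sayd" are assumptions for this illustration, not an agreed editorial rule.

```python
import re

# Leading words to ignore when sorting; "sayd" is an assumed variant spelling.
LEADING_WORDS = re.compile(r"^(?:(?:the|a|an|said|sayd)\s+)+", re.IGNORECASE)


def name_key(text):
    """Return (sort_key, had_said) for a tagged name or noun.
    sort_key drops leading articles and 'said'; had_said remembers
    whether the manuscript wording pointed back to an earlier mention."""
    stripped = text.strip()
    match = LEADING_WORDS.match(stripped)
    prefix = match.group(0) if match else ""
    had_said = bool(re.search(r"\bsa[iy]d\b", prefix, re.IGNORECASE))
    return stripped[len(prefix):], had_said


print(name_key("The Lord Protector"))
print(name_key("the said ship"))
```

With a flag like this, all "said ship" references could still be listed first when linking ships to names elsewhere in the deposition.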

I have begun reading up on TEI, but I am a complete novice. When you have time, you might like to look at:

TEI: Text Encoding Initiative
TEI Lite

I need further convincing of Patrizia's assertion that "Callice in France" is one place, not two places which are linked in the mind of the deponent (and scribe). We need to be careful of having the database drive the linking of "Callice" to "France", since the fact that the deponent speaks of "Callice" as being in France is in itself interesting (as opposed to Callice in Europe, or Callice in Northern France). This is particularly important if we want (as I do) to use the geographical coding to explore how people spoke about and conceptualised geographical (and social) space. I have seen Barbados described as "in America" and "in the Caribee". I have seen "Lima" described as "in America" etc.

But I am posting everything of importance to a new Team Giovanni/Patrizia discussion area: Team Giovanni/Patrizia

This discussion area is under the umbrella of Team discussion area, which I have added as a tab in the main menu bar. You will also find your, Jill's and my team discussion areas there, for when you want to start using yours. William Kellett posted a short note to your area before he went off on holiday

The discussion Giovanni mentions in the email to which you then responded directly to me is captured in the following two posts. Read them if interested, and you are welcome to comment:

Giovanni to Patrizia, Colin, cc. Charlene; 31/08/12: 09:04

Colin to Giovanni, Patrizia, Charlene; 31/08/12; 10:56

I'm delighted you will be able to join Jill and me on Monday at Westminster School. Jill's online training material is shaping up very well

See: Online Training Activities

Best wishes



Colin to Giovanni, Patrizia, Charlene; 31/08/12; 10:56


In answer to Giovanni's points:

(1) Foliation

The foliation I am citing in the names of images, e.g. HCA 13/71 f.101v is modern archival block printed foliation. Each folio number is printed in the top RH corner of the verso page. There is no original foliation in this volume

- Foliation is variable in HCA volumes and records. Bound volumes account for 95% of all HCA records at TNA. However, many (possibly most) such bound volumes have neither modern (archival) nor older handwritten foliation.

- Where there is original foliation, I presume it was added at some stage, probably contemporaneous with the creation of the bound volume.

- Note that the bound volumes are not pre-bound prior to being filled out by clerks. They consist of folded leaves which have been stitched together to create one to two year groupings (occasionally for a longer period) of court records.

- Occasionally there are clear collating errors, since some later pages (in order within the volume, whether or not actually foliated) bear deposition or other type of legal statement dates before earlier pages in the same volume

For further information on volumes and bindings see: Leather bindings

For further information on foliation and example see: Double page spreads

For further information on the range of document types (which we will NOT see in HCA 13/71, but which exist in other HCA records), see:

- Numbered paragraphs: Interrogatories are in the HCA records series HCA 23/XX (Image shows a page view of Interrogatories for HCA 23/19, which are numbered questions). In HCA 13/71, part of the legal statements we are transcribing are answers to the Allegations ("allon") which have been made (the allegation is also a formal court document, but is not included in HCA 13/71); and part of the statements (but not always) are answers to the Interrogatories, with reference being made in the answer to the number of the interrogatory. These numbered answers do not cite the text of the interrogatory, but often paraphrase the language of the interrogatory in giving the answer. I am hoping that as part of the linking exercise either in this project, or later in 2013, it will be possible to go to the appropriate volume of interrogatories, image these pages, and then link the pages to the answers we will have transcribed for HCA 13/71.

(2) Data types within our chosen HCA 13/71 volume

For further information on data types within HCA 13/71, see Page layout, and specifically Typical single leaf layout

- The page illustrated in the Typical single leaf layout is the first page in the volume, HCA 13/68 f. 1r, and thus comes from the same record series as our chosen volume, HCA 13/71
- The page shows most of the characteristics we need to understand (and to your point, Giovanni), possibly identify as sub areas within the record

      • Case summary details top left


      • Date of the court session (here it was the 22nd of September 1659), written as:


The 22:th day of September 1659

Note the modern block printed folio number in the top right hand corner (here it is folio one, and should be described when transcribing as f. 1r: folio one recto; recto = right, or front)

      • Brief statement as to nature of the legal record (here it is an examination)


Examined upon an Allon on the behalfe of
the sayd Keepers of the Liberty of the Liberty of England by
Authority of Parliament

      • Witness name, place of abode, and estimated age at top of main text


Mark Harrison of Wapping in
the County of Midds Mariner aged
seven and twenty yeares or thereabouts
a witnes sworne and examined deposeth and
saith as followeth. vizt

      • Abbreviation in left hand margin "Ren:dt" (contraction for latin word, XXX = XXX)


      • Number in left hand margin stating which number witness in the specific legal case (here this is the first witness)


      • Main body of text (here consisting of thirty seven lines, divided into four paragraphs)


      • Paragraphs in main body of text introduced with the phrase "To the first(second/third/fourth) arle of the sayd allon this deponent saith and deposeth that..."


      • First word of next page at bottom right hand side of page, below end of main text


Missing from this page is a sense that the "main body of text" (in terms of the vast proportion of the words of an individual record relating to one witness statement) can have several sections

The typical sections (though not always present) can be seen by looking sequentially at the six images below, which are from successive folio sides in our chosen volume. I have added my comments on structure by each image link below

13/71 f.99v P1130401

  • This page starts just after the introductory material, which is on the previous page (HCA 13/71 f.99r)


  • The page starts part of the way through the answer of the deponent to the second article (arle) of the allegation (allon)

- confusingly the first full paragraph starts "To the third hee saith...", but it is clear from the next paragraph that the transcriber is within the response to the article portion of the witness deposition ("To the 4th arle hee saith...")

  • From inspection of the previous page XXXX I have added the following metadata for this page. The case summary is a verbatim extract from the relevant part of the front matter on the previous page. Likewise the deposition metadata. Note that "4." refers to the deposition of Charles Anquestil being the fourth deposition, of which there is at least one more, which comes later in the volume


<header>
<folio>HCA 13/71 f.99v</folio>
<picture>P1130401</picture>
<case-summary>The <person>Lord Protector</person> against a certaine shipp called the <ship>fortune</ship>: whereof <person>Daniel <help>Curetson</person></help> is <profession>Master</profession> taken with wynes, and againt the <person>Earle of Charott</person> and <person>others</person> Owners of a shipp of warr called the <ship>golden Eagle</ship> of <place>Callice</place>, and against the said <person>Earle of Charrott</person> and others owners of the <ship>Royal Mary</ship>, a shipp of warr and against all others</case-summary>
<deposition>4. <person>Charles Anquestil</person>, of <place>Callice</place> in <place>ffrance</place> <profession>Mariner</profession> and <profession>Gunner</profession> of the said shipp the <ship>Mary Royall</ship>, aged <quantity value="year">40</quantity>: yeares or thereabouts a Wittnesse sworne and examined saith as followeth </deposition>
<document-date normalized="25/02/1655"></document-date>
<status>First cut transcription completed; Requires editorial input</status>
<first-transcriber>Colin Greenstreet, 28/08/12</first-transcriber>
</header>

HCA 13/71 f.100r P1130402

  • This page continues the answers of the deponent to the allegation


  • Note that the page starts partway through the deponent's answer to the 6th article of the allegation. This is presumed, but would need to be checked against the previous page, to ensure that it is not in fact the answer in one paragraph to multiple articles. Having checked the previous image and transcription, I see that it is indeed just the answer to the single article 6 (LINE 46 of HCA 13/71 f.99v P1130401: "To the 6th hee saith..")


  • Note that some paragraphs address more than one article of the allegation (e.g. "LINE 6: To the 7th and 8th arles...")


  • Note that there are apparently eleven articles in the original allegation, since the deponent answers eleven articles, but with the last numbered article being the tenth ("LINE 34: To the 10th hee saith..."). The presumed eleventh is introduced as "LINE 42: To the last hee saith..."


  • Towards the bottom of this page, a new section of the same deposition begins, and is marked by the centred heading:


"LINE 45: To the Crosse Interries:-TEXT IS CENTRED"

- The first and second cross-interrogatories are addressed on this page

HCA 13/71 f.100v P1130403

  • The top third of this page contains the remainder of the deposition which I labelled in the metadata as


<deposition>4. <person>Charles Anquestil</person>, of <place>Callice</place> in <place>ffrance</place> <profession>Mariner</profession> and <profession>Gunner</profession> of the said shipp the <ship>Mary Royall</ship>, aged <quantity value="year">40</quantity>: yeares or thereabouts a Wittnesse sworne and examined saith as followeth</deposition>

  • This deposition ends with a signature, which is the original signature of the witness, and is a signature, I believe, to the verity of the whole clerical record, including the front data. Logically I guess it is a separate data type. Remember some signatures are "marks", where the witness is illiterate, or semi-literate, and a standard piece of analysis would be to look at literacy by occupation, witness type, date etc.


  • The lower two thirds of this page is a new deposition, prefaced by new front material


  • I have recorded the metadata for this part of the page as:


<header>
<folio>HCA 13/71 f.100v</folio>
<picture>P1130403</picture>
<case-summary>The <person>Lord Protector</person> against a certaine shipp called the <ship>fortune</ship></case-summary>
<deposition><person>Charles Anquestil</person>, of <place>Callice</place> in <place>ffrance</place> <profession>Mariner</profession> and <profession>Gunner</profession> of the said shipp the <ship>Mary Royall</ship>, aged <quantity value="year">40</quantity>: yeares or thereabouts a Wittnesse</deposition>
<document-date normalized="25/02/1655"></document-date>
<status>First cut transcription completed; requires checking</status>
<first-transcriber>Colin Greenstreet, 29/08/12</first-transcriber>
</header>
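
As a rough illustration of how a header like the one above could later be read into database fields, here is a minimal sketch. It assumes the header is well-formed XML, uses only Python's standard library, and shortens the header to a few fields; the function name read_header is hypothetical and not part of Scripto.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the header recorded for HCA 13/71 f.100v (illustrative only).
HEADER = """<header>
<folio>HCA 13/71 f.100v</folio>
<picture>P1130403</picture>
<case-summary>The <person>Lord Protector</person> against a certaine shipp called the <ship>fortune</ship></case-summary>
<document-date normalized="25/02/1655"></document-date>
<status>First cut transcription completed; requires checking</status>
<first-transcriber>Colin Greenstreet, 29/08/12</first-transcriber>
</header>"""


def read_header(xml_text):
    """Return the header fields as a plain dictionary (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    fields = {}
    for child in root:
        # Flatten mixed content (text plus nested category tags) to plain text.
        fields[child.tag] = "".join(child.itertext()).strip()
    date = root.find("document-date")
    if date is not None:
        # The date is carried in an attribute, not in the element text.
        fields["document-date"] = date.get("normalized", "")
    return fields


header = read_header(HEADER)
print(header["folio"], header["document-date"])
```

The nested person and ship tags are flattened here only to keep the sketch short; a real import would probably keep them as separate records.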

  • Note that I have chosen myself to summarise the case-summary material, which I had recorded verbatim in the previous metadata entry, which is clearly bad practice


  • Note that the front material does not formally repeat the date which was on HCA 13/71 f.99r. Instead it states "same day", so the transcriber needs to track back in the images (or the transcriptions) to see which day is being referred to


  • Note that the deposition in the lower two thirds of the page is for the same case as on the previous page, and is the fifth deposition


- The text of the front material states:

LINE 27. 5.us/ John Mercier of callice in france Mariner Quartermaster
LINE 28. Of the sayd shipp Goulden Eagle aged <quantity value="year">29</quantity>. or thereabouts a [?Wittnes?s?e?]
LINE 29. sworne and examined saith as followeth./

  • This fifth deposition starts by addressing the same allegation as the fourth deposition, and this page has the answer to the first and part of the second allegation


LINE 31: . To the first Arle of the said Allegation hee saith, That hee this deponent 31
LINE 49: To the second arle hee saith, That the said shipps being at Sea in...

  • As a very small point of detail (but details can cause problems later for finer grained markup), I have had to deal with the use of the indefinite article as a quantity and marked it up accordingly:


LINE 88. friggat about <quantity value="league">a</quantity> league of the English Coast, shee not being then
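
A possible way for later processing to cope with quantities written as words, including the indefinite article, is sketched below. The word table and the function name read_quantities are assumptions made for illustration; the project would need to agree its own list.

```python
import re

# Hypothetical lookup table for quantities written as words.
WORD_VALUES = {"a": 1, "an": 1, "one": 1, "two": 2, "three": 3,
               "four": 4, "five": 5, "six": 6}

# Matches the quantity tag in the form used on this page.
QUANTITY = re.compile(r'<quantity value="([^"]+)">([^<]*)</quantity>')


def read_quantities(line):
    """Return (amount, unit) pairs; amount is None if the word is unknown."""
    results = []
    for unit, raw in QUANTITY.findall(line):
        token = raw.strip().rstrip(".").lower()
        if token.isdigit():
            amount = int(token)
        else:
            amount = WORD_VALUES.get(token)
        results.append((amount, unit))
    return results


line = 'friggat about <quantity value="league">a</quantity> league of the English Coast'
print(read_quantities(line))
```

Unknown words come back as None rather than being guessed, so that a facilitator can review them.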

  • Note that I have not marked up the bottom of this page


HCA 13/71 f.101r P1130404

  • Nothing remarkable


  • No front matter, since on the previous page, and deposition goes over to the following page


HCA 13/71 f.101v P1130405

  • I have not finished transcribing or marking up this manuscript page


  • It contains cross-interrogatories


LINE 31: 31. To the Crosse Interries/:-Centre heading

  • It is signed by Feban Merchier (I think I am correct in reading this as an "F"), which is possibly a variant of the modern French name "Fabian", and poses a name equivalence point which we will occasionally face on other non-English deponents, since the name was anglicised in the front matter as "John Mercier"


  • Please note that there is a seventeen folio jump between images HCA 13/71 f.101v P1130405 and HCA 13/71 f.118r P1130406, since I had already photographed f.102r to f.117v earlier. Giovanni, I know this is not ideal.


Well, rather a long email, but I hope this helps.

I am posting the email to the Team Giovanni/Patrizia discussion area, to which I am also posting earlier relevant emails:

Useful email records

       Giovanni to Patrizia, Colin, cc. Charlene; 31/08/12: 09:04
       Giovanni to Patrizia, Charlene, Stuart, Colin, Jill, William; 31/08/12: 08:36
       Patrizia to Giovanni, Colin 30/08/12; 23:25
       Patrizia to Giovanni, Colin, Charlene, Stuart, Jill, William: 30/08/12: 23:22
       Giovanni to Patrizia, Charlene, Colin, Jill, William; 30/08/12: 12:05
       Patrizia to Giovanni, Colin: 29/08/12; XXX



Giovanni to Patrizia, Charlene, Stuart, Colin, Jill, William; 31/08/12: 09:19


A brief mail to summarize our current status and focus your attention on some issues that need feedback.

- help: use it to point the facilitator to dubious passages: does it fit its role properly?
- HEADER: this will probably undergo some modifications. It's meant to describe the document being transcribed (ie 1 picture), which is an arbitrary unit of work. I suggest we keep it, but move the description of documental units of meaning (cases, depositions), outside, with separate metadata fields. Discussion is under way.
- Italic (abbreviations and such), strikes and underlines are pretty straightforward, right?
- Alt/Sic is meant to cover misspelling and word variants. Do you find it sound and sufficient for this task?
- Person: is it ok?
- Profession: this also covers titles, occupations, etc. Do you think it's too generic or shall suffice?
- Ship?
- Commodity, Currency, Quantity: do you like the way they work? Would you prefer some disambiguation for example for quantities (ie distinction between weights, distances, etc.)?
- Place: too generic? Would you prefer to distinguish among cities, countries, regions? If so, would you do it with an attribute (like the value system for Commodity, Currency, Quantity) or with different tags?
- Date, Note?
- Special characters: is the menu fine? Any more special character you might need?

Anything else left out, comments? Also, I'd appreciate it if someone with an aesthetic sense would help me with the colours and layout of the buttons. For now, it's pretty random.

NB
If you save to the server, once a transcription is complete, please use the document title (for example HCA 13/71 f.137r P1130424).

Thank you for your collaboration, all the best



Giovanni to Patrizia, Colin, cc. Charlene; 31/08/12: 09:04


cc. Charlene since I think she needs to comment and be aware of this.

Regarding documental structure, units of meaning in the text and the cataloguing from the NA.

I feel we need to know, in detail, what kind of records they have at the National Archives about our documents, whether they have some summary (Patrizia, I mean the regesto, am I correct?), and what archival unit(s) we'll transcribe (also in broader context). We need this in order to be consistent.

Secondly, we need to know what the structure of the documents is, and how it relates (if it does) to pagination (by the way, is there some sort of pagination or foliation? Is it original or made by modern archivists? We need to track it in the header if so). Colin, you mentioned to me that we have cases and, within each case, one or more depositions. Is this all? Are there other identifiable units of meaning? How do these relate to pagination? I see there can be many depositions on one page, and they can start anywhere. Is it the same for cases, or do they start on a blank page?

We'll probably need a few tags to deal with all this: the header is meant to describe the document being transcribed, as a picture (which is our arbitrary assumption, but it's a unit of work, not of meaning, that we cannot avoid using). We'll then need (probably) a case tag (with a summary? date? parties?) and a deposition tag (again, summary, date, witness, ..). These will deal with units of meaning, not of (our) work. Needless to say, we need them both: for us these documents are pictures, but the original was/is divided in cases and the like.
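
To make the distinction between units of work and units of meaning concrete, here is a minimal sketch in Python. The class and function names are hypothetical, and the page lists only cover the folios mentioned elsewhere in this record, so they may be incomplete.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Page:
    """Unit of work: one photographed page."""
    folio: str
    picture: str


@dataclass
class Deposition:
    """Unit of meaning: one witness statement, possibly spread over pages."""
    number: int
    witness: str
    pages: list = field(default_factory=list)


@dataclass
class Case:
    """Unit of meaning: one case holding one or more depositions."""
    summary: str
    depositions: list = field(default_factory=list)


def shared_pages(case):
    """Folios that carry more than one deposition of the same case."""
    counts = Counter(page for dep in case.depositions for page in dep.pages)
    return [page.folio for page, n in counts.items() if n > 1]


p99v = Page("HCA 13/71 f.99v", "P1130401")
p100r = Page("HCA 13/71 f.100r", "P1130402")
p100v = Page("HCA 13/71 f.100v", "P1130403")
p101r = Page("HCA 13/71 f.101r", "P1130404")
p101v = Page("HCA 13/71 f.101v", "P1130405")

fourth = Deposition(4, "Charles Anquestil", [p99v, p100r, p100v])
fifth = Deposition(5, "John Mercier", [p100v, p101r, p101v])
case = Case("The Lord Protector against a certaine shipp called the fortune",
            [fourth, fifth])

print(shared_pages(case))
```

The point of the sketch is only that a page can belong to several depositions, so page metadata and deposition metadata cannot live in the same record.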

I feel this is important, and also my bad not to have asked before. All the best,



Giovanni to Patrizia, Charlene, Stuart, Colin, Jill, William; 31/08/12: 08:36


I'm happy to see Patrizia and I broadly concur on our views: I basically agree with everything. Just a couple of points:

- about the infamous Callice in ffrance: as I said before, I agree we either mark just Callice, or the whole Callice in ffrance as a single place (the reason might be that whoever said/wrote it found it useful to disambiguate, even if in this case it's straightforward). As for marking countries separately, for me it's no issue: I made only a generic place tag to avoid overwhelming transcribers with buttons, but it would be better for us markuppers not to have places, but cities, countries, etc. If you like, we can have the place tag work like the currency or date one, with a value in which the transcriber can specify what kind of place it is: <place value="city">Callice</place> in ffrance. Bear in mind it's hard to find the right balance between richness of input provided and ease of use.

- I'm finally starting to appreciate what Patrizia has in mind for the database: we'll just need to discuss technology, but the aim I share. I also think we agreed that a) we cannot split transcription and markupping, but we can't have the two done wholly at the same time, so a finer grain mark up plus database building will follow (especially dealing with non palaeographic but more semantic stuff) b) we might need a few more months to do it properly c) we should have more people helping us on that task too. So, this provided, my initial assumption of processing almost everything automatically due to time and workforce constraints is no longer that imperative. This probably leaves a bit more room for a finer grain tagging by transcribers.



Patrizia to Giovanni, Colin 30/08/12; 23:25


I tried to catch up some issues, probably in an untidy way. I hope something is useful.
Giovanni, tell me if you read my comments in our area. I'm not clear how it works.
I'll try to make an example of excel file for tomorrow, using one of the transcriptions of Colin.



Patrizia to Giovanni, Colin, Charlene, Stuart, Jill, William: 30/08/12: 23:22


PATRIZIA: Only some thoughts about some (not all) of the questions. As Colin said, I am travelling, and see pages only in a very limited and uncomfortable way. Sorry if some comment is unclear.



COLIN: For example:

The button ship: do we highlight "fortune", or "the fortune", or "the shipp the fortune". Does it matter? Giovanni, the "ship" category is currently not displaying in colour the HTML driven publication of the transcribed page at the bottom (it displays as e.g. "the said shipp < style="color:blue">fortune")

PATRIZIA: Yes, it matters. If you include the article in the tag, everything will be sorted under 'the'. Think how the names are painted on true ships: it's likely 'Fortune', not 'The Fortune.' It's the same problem I already highlighted in my last email about 'said' before the name of persons
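
The sorting problem described here can be shown with a short sketch. It also shows one possible database-side workaround, which is an assumption for illustration and not a substitute for tagging consistently; the list of leading words and the name sort_key are hypothetical.

```python
# Hypothetical list of words to ignore at the start of a tagged name.
LEADING_WORDS = ("the ", "said ")


def sort_key(tagged_text):
    """Lower-case the name and drop leading articles or 'said' for sorting."""
    key = tagged_text.strip().lower()
    stripped = True
    while stripped:
        stripped = False
        for word in LEADING_WORDS:
            if key.startswith(word):
                key = key[len(word):]
                stripped = True
    return key


names = ["the fortune", "golden Eagle", "The Royal Mary", "fortune"]

# Plain sorting files names under their article (and separates capitals).
print(sorted(names))

# Sorting on the cleaned key brings both forms of "fortune" together.
print(sorted(names, key=sort_key))
```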



COLIN: The button "person": do we highlight only personal names, or do we include clear individuals such as "the king of Spain", or should it be "King of Spain".

PATRIZIA: Idem: 'King of Spain', and absolutely yes, include him (as it is). Then it is a problem of the database to give him a name, like here: We saw the king and queen during Mass. It's from Mozart's letters.

If you hover with the mouse over king and queen you read 'Ferdinando IV di Borbone Napoli, born 12/01/1752, died 04/01/1825' and Maria Caroltta (Carolina) d'Asburgo-Lorena - born 13/08/1752, died 07/09/1814.



COLIN: Another example would be the Lord Protector, which I have marked up as The Lord Protector (The <person>Lord Protector </person>against).

PATRIZIA: Sorry: be careful again with spaces within the tags.

This is being used as a name, despite being a title (or is it an occupation). Once we have the occupation button, would we mark this up in preference as an occupation, or is it both a name and an occupation and requires double markup? If it requires double markup, is there a syntax which requires one to come within the others? (I don't see any brackets or other syntactic like devices being generated in the HTML code)

No, I would say that this is only a way to refer to the person. It is not the same case of

<person>Charles Anquestil</person>, <profession>Mariner</profession> and <profession>Gunner</profession>

In this case 'mariner' is a predicate of the person, and indicates his profession (i.e.: I find your mark correct), with the already mentioned warning that I would mark 'gunner' with a tag 'role' or 'title', as you find best (after all I find that 'role' would better suit).
If I may take again an example from Mozart's letter, see this:

http://letters.mozartways.com/index.php?lang=eng&theme=people&name=1200&alpha=C

You see that Antonio Colonna Branciforte is mentioned in letter 171. If you click on 'View', you will see the term 'Cardinal' highlighted within the text of letter 171.

'Cardinal' is a way to refer to the 'person' Antonio Colonna Branciforte, who had, in time, different roles:

1. Assistente al Soglio Pontificio (27/02/1754)
2. Nunzio Apostolico a Venezia (02/04/1754)
3. Cardinale (06/04/1766)
4. Legato Pontificio a Bologna (1769 — 1775)



COLIN: Would "King of Spain" be marked up both as a person and "Spain" as a place, or is this a clear case of where the transcriber is distinguishing person and place?

PATRIZIA: No, again: King of Spain is only a way to refer to that person. Semantically, it does not have reference to a place.



COLIN: In the case of "place" I have assumed (i.e. made an editorial policy assumption) that compound places will be marked up twice, e.g. "of Callice in ffrance" is marked up as "of <place>Callice</place> in <place>ffrance</place>".

PATRIZIA: See my previous email. The document mentions one place, not two. France is again an attribute of Callice. We can tag the countries, if we find it useful. (This is not needed to find out where Callice is, because this is done with the database, but it could be useful in all the cases where ONLY a country is mentioned. Giovanni, what do you think?)



COLIN: If we were wanting to use markup (converted into TEI compliant markup) to drive searches such as How many legal depositions refer to French war ships (as opposed to merchant ships) in the (English) channel, you would need to know that the "the golden Eagle of Callice" and "the Royal Mary" were french ships and were ships of war, and that Callice (presumably Calais), and other ports such as Dinkirk (with its spelling variants) had been grouped for the purpose of the search under the broader term (English) "Channel"

PATRIZIA: I agree that these will likely be normal searches that people will perform on such a website, but this is the task of the database, in my opinion. The lack of a relational structure (versus a textual search) is exactly what does not allow you to answer these questions correctly. Yesterday I sent Giovanni a paper that highlights the problem very well. Consider the wonderful Old Bailey Proceedings. Despite being a fantastic resource, there is no way to say if two persons with the same name are two persons or a double reference to the same person. This is why for Mozart we use a relational database, and why I think that we should do the same for MarineLives.
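
A minimal sketch of what such a relational structure might look like follows, using Python's built-in sqlite3 module. The table and column names are assumptions for illustration only, and the rows reuse the name equivalence noted in this record (the witness written as "John Mercier" in the front matter and signing as "Feban Merchier").

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (
    id INTEGER PRIMARY KEY,
    label TEXT
);
CREATE TABLE mention (
    id INTEGER PRIMARY KEY,
    person_id INTEGER REFERENCES person(id),
    page TEXT,
    text_as_written TEXT
);
""")

# One row per real individual, however the name is spelled in the text.
conn.executemany("INSERT INTO person (id, label) VALUES (?, ?)",
                 [(1, "John Mercier"), (2, "Charles Anquestil")])

# One row per tagged occurrence, linked to the individual it refers to.
conn.executemany(
    "INSERT INTO mention (person_id, page, text_as_written) VALUES (?, ?, ?)",
    [(1, "HCA 13/71 f.100v", "John Mercier"),
     (1, "HCA 13/71 f.101v", "Feban Merchier"),
     (2, "HCA 13/71 f.100v", "Charles Anquestil")])

rows = conn.execute("""
SELECT person.label, COUNT(mention.id)
FROM person JOIN mention ON mention.person_id = person.id
GROUP BY person.id
ORDER BY person.id
""").fetchall()
print(rows)
```

A plain text search for "Mercier" would miss the signature spelling, while the link through person_id keeps both mentions with the same individual.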



COLIN: I am also clear (for discussion) that we need the transcribers first to create a "clean" transcription, without using any category buttons, and that this then needs review and perfection and signoff, before the categories are added. Otherwise palaeographical questions and learning will get all mixed up with category editorial policy, and I think that is a very big ask for the first four weeks of transcription post training. So I think team facilitators, to the extent that they are acting as page editors, will need to take two passes at each page. For discussion please.

PATRIZIA: Colin: this is probably very, very wise. In Italy we say that Rome was not made in a day. It's already so troubling to get through the paleographical issues, that probably you would better concentrate on this. We can easily go back to these discussions after 3/4 weeks, no?


COLIN: I also understand Giovanni's (and Patrizia's) points about the weakness of Scripto being the absence of a cumulative aggregating robust database, which accumulates the category markup, so that at any one time you can inspect all places, people, etc input by transcribers to date. However, that is clearly not soluble for this project. It does mean that any "nice to have" but NOT essential functionality such as mapping out places referred to in the transcribed documents would have to be done as a one off piece of analysis almost certainly on a sample basis. Playing with mapped data looks like it can only really happen after we have that robust database, which will be generated by the markup/analysis team

PATRIZIA: Totally agree.



COLIN: Finally, it is clear to me that we should move any planned end of project conference from the tentative end of January to a tentative end of March or early April, to give us LOTS of wiggle time on the database/markup/analysis stage of the project

These are my thoughts for what they are worth. I am very keen to hear back as soon as possible from Stuart and Charlene, and also to get William and Jill's input.

I am also going to contact Dr Elaine Murphy today, who I am meeting in Cambridge on Thursday 6th September, to ask her if she would be prepared to take a look at our Scripto modifications and to comment.

Patrizia is leaving today for Austria, and will only be back properly into the conversation Friday week. She and Giovanni are clearly already establishing a very productive relationship, and will jointly be leading the database/semantic markup/analysis team, and will be joint facilitators of that team. I am looking to the two of them, once Patrizia is back, to produce an outline plan for their team, showing broad milestones, and very importantly providing an estimate of the numbers of associates they need on their team, with what sorts of prior experience and training.



Giovanni to Colin, Patrizia, Stuart, cc William, Jill


URGENT - RESPONSE NEEDED: Buttons and tags & need for an editorial policy for category markup (i.e. semantic markup) to be done by transcribers: GIOVANNI'S RESPONSE

Aug 30 (2 days ago)

to me, Patrizia, Charlene, Stuart, William, Jill


GIOVANNI: comments below (i.e. what I would suggest). Please still consider Scripto under maintenance for this morning.

All the best

2012/8/30 Colin Greenstreet <colin.greenstreet@googlemail.com>

   COLIN: Dear Scripto improvement team,


   cc FYI: William and Jill,


   I spoke yesterday to both Patrizia and Giovanni via SKYPE.  I am representing the team facilitators in the Scripto improvement team, and would encourage William and Jill, who I am copying in to comment on the experimental functionality now in beta form in http://www.marinelives-transcript.org.


   All my examples below refer to my experimental markup, which I have saved and protected, for
   HCA 13/71 f.100r P1130402
   (http://marinelives-transcript.org/scripto/scripto/?scripto_action=transcribe&scripto_doc_id=363&scripto_doc_page_id=361)


   (1) Giovanni, you continue to move at lightning speed, and are effectively contributing to, synthesising, and prioritising possible functionality improvements and system adjustments.


   (2) I have found working through practical markup examples to be the most effective way to highlight any ambiguities, technical problems, and, very importantly, the fact that we will need a set of editorial policies regarding the use of these new buttons.


   (3) Charlene and Stuart - we now urgently need reaction from you both to the new system.


   * May I suggest to you both that ideally today each of you take a clean page which I have transcribed, and mark them up using the new suite of tools.


   * Perhaps Charlene could markup HCA 13/71 f.100v P1130403 (http://marinelives-transcript.org/scripto/scripto/?scripto_action=transcribe&scripto_doc_id=363&scripto_doc_page_id=362)?


   * Perhaps Stuart could mark up HCA 13/71 f.101r P1130404 (http://marinelives-transcript.org/scripto/scripto/?scripto_action=transcribe&scripto_doc_id=363&scripto_doc_page_id=364)?


   (4) My comments: use of category markup (to inform the semantic markup/data analysis team when they add further TEI compliant markup).


   Using the categories: "person", "place", "ship", "commodity", "currency", "quantity", and "date" on a real page (HCA 13/71 f.100r P1130402) has highlighted to me the need for conceptual clarity before we mark up what until then look like relatively "clean" to read text input boxes. I still think we should mark up the categories, and agree that we need a category for "occupation" (which should probably, but not necessarily subsume "title"), which Giovanni has not yet implemented.


   For example:


   * the button ship: do we highlight "fortune", or "the fortune", or "the shipp the fortune". Does it matter?  Giovanni, the "ship" category is currently not displaying in colour the HTML driven publication of the transcribed page at the bottom (it displays as e.g. "the said shipp < style="color:blue">fortune")


GIOVANNI: thanks, will fix. I would markup 'the fortune' since we know it's a ship with the tag. I would just markup 'fortune' if it is referred as such in other parts of the document (i.e. the article is to be tagged if it's part of the name). Try to markup consistently: the same name for every entry referring to the same entity, with a sic inner tag if it's misspelled.

   COLIN: * the button person: do we highlight only personal names, or do we include clear individuals such as "the king of Spain", or should it be "King of Spain".  Another example would be the Lord Protector, which I have marked up as The Lord Protector (The <person>Lord Protector </person>against). This is being used as a name, despite being a title (or is it an occupation). Once we have the occupation button, would we mark this up in preference as an occupation, or is it both a name and an occupation and requires double markup? If it requires double markup, is there a syntax which requires one to come within the others? (I don't see any brackets or other syntactic like devices being generated in the HTML code)


GIOVANNI: person names are distinguished from occupation. We mark them separately, and will try to connect them afterwards. Do not try to overlap tags: each tag should markup the very minimum and specific amount of text for what is meant for. Also, try to use space accordingly: The <person>Lord Protector </person>against should be The <person>Lord Protector</person> against.
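
The spacing rule above could be checked automatically before a page is saved. The following is a minimal sketch using regular expressions; the function name tidy_tag_spaces is hypothetical and nothing like it is known to exist in Scripto.

```python
import re

# Whitespace sitting just before a closing tag, e.g. "Protector </person>".
TRAILING = re.compile(r'(\s+)(</[A-Za-z][^<>]*>)')
# Whitespace sitting just after an opening tag, e.g. "<ship> fortune".
LEADING = re.compile(r'(<[A-Za-z][^<>]*>)(\s+)')


def tidy_tag_spaces(text):
    """Move spaces at the inner edges of a tag to the outside of the tag."""
    text = TRAILING.sub(r'\2\1', text)
    text = LEADING.sub(r'\2\1', text)
    return text


before = 'The <person>Lord Protector </person>against'
print(tidy_tag_spaces(before))
```

With the example from the email this gives 'The <person>Lord Protector</person> against'. The sketch only moves whitespace and never changes the transcribed words.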

   COLIN: * Would "King of Spain" be marked up both as a person and "Spain" as a place, or is this a clear case of where the transcriber is distinguishing person and place?


GIOVANNI: in this context Spain is part of the title/occupation, not a place.

   COLIN: * Are we going to keep titles out of names? I have marked up "the said Captaine John Coveruserat" as "the said Captaine <person>John Coveruserat</person>" rather than "the said <person>Captaine John Coveruserat</person>". Once we have an occupation button, subsuming title, would we therefore mark up "Captaine" and "John Coveruserat" separately?


GIOVANNI: exactly, that is how it is now. Open for discussion of course.

   COLIN: * the buttons "place" and "commodity" seem easier to use conceptually.


   - In the case of "place" I have assumed (i.e. made an editorial policy assumption) that compound places will be marked up twice, e.g. "of Callice in ffrance" is marked up as "of <place>Callice</place> in <place>ffrance</place>".


GIOVANNI: I disagree on this, mark Callice in ffrance as one place: it's a unique entity in the mind of the scribe, so I think you either mark it all or leave ffrance out.

   COLIN: - In the case of "commodity" it is obvious that lists should be divided into individual commodities, e.g. "in wynes and brandewine" is marked up as "in <commodity>wynes</commodity> and <commodity>brandewine</commodity>"


GIOVANNI: yes. Sorry for this horrible yellow, it'll disappear.

   COLIN: - I am assuming that if we had the phrase "two hogsheads of wine" this would require markup as quantity and commodity, but I am unclear whether it is the "two" which is highlighted, as in <quantity value="hogshead">two</quantity>, or whether it is as in <quantity value="hogshead">two hogsheads </quantity>.


GIOVANNI: correct, use the former

   COLIN: - I am also unclear whether spelling for the units to be entered into the code expression for "currency" and "quantity" should be normalised, as in "ton" not "tonne", "tonnes", "tons"... and if so we will need an agreed set of normalisations.


GIOVANNI: normalized, and yes an agreed set would help a lot.
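
An agreed set of normalisations could be held as a simple lookup table. The sketch below assumes a small, hypothetical table and hypothetical function names; the real list would have to be agreed by the team. It builds the tag in the form preferred above, with only the number inside the tag and the unit as written left in the text.

```python
# Hypothetical normalisation table: spelling as written -> agreed unit value.
UNIT_FORMS = {
    "ton": "ton", "tons": "ton", "tonne": "ton", "tonnes": "ton",
    "hogshead": "hogshead", "hogsheads": "hogshead",
    "hower": "hour", "howers": "hour", "hour": "hour", "hours": "hour",
    "yeare": "year", "yeares": "year", "year": "year", "years": "year",
}


def normalise_unit(as_written):
    """Return the agreed unit value, or None if the spelling is not listed."""
    return UNIT_FORMS.get(as_written.strip().lower())


def quantity_tag(number_text, unit_as_written):
    """Wrap only the number in the tag and keep the unit verbatim after it."""
    unit = normalise_unit(unit_as_written)
    if unit is None:
        raise ValueError(f"no agreed normalisation for {unit_as_written!r}")
    return f'<quantity value="{unit}">{number_text}</quantity> {unit_as_written}'


print(quantity_tag("2", "hogsheads"))
```

Raising an error for unlisted spellings is one design choice among several; it forces new variants to be added to the agreed table rather than guessed.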

   COLIN: - I am also not clear whether it is just the numeric value in a quantity which should be marked up, as in "2" or "two", or whether the transcribed quantity description should be included in the expression which is marked up as in "2 hogsheads" or "two hogsheads".  I presume that "2 hogsheads of wine" will be marked up separately for quantity and commodity.


GIOVANNI: see above. I would do this: <quantity value="hogshead">2</quantity> hogsheads of <commodity>wine</commodity>

   COLIN: - I notice that the code expression for date requires normalisation to modern format. Do we want transcribers to adjust the year of the date for dates between January and late March to the following year, or do we want that done later (and probably more accurately) by the markup/analysis team?


GIOVANNI: let's do this later and stick to the text now. 25 March is it?

   COLIN: -- I have chosen for the moment not to correct the date, though I, in my original transcription, had added the modern date in brackets. So I have rendered "25th day of February 1655 (1656)" as "The <date value="25/02/1655">25th day of February 1655</date> (1656)", which displays as "The <date value="25/02/1655">25th day of February 1655</date> (1656)"


GIOVANNI: do not add anything to the text. Let's go verbatim now, then we'll process dates and change the modern form to account for the different year start.
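
The later processing of dates mentioned here could look something like the sketch below. It assumes the DD/MM/YYYY form used in the normalized value and the English civil year that began on 25 March; it adjusts the year only and does not convert between the Julian and Gregorian calendars. The function names are hypothetical.

```python
def parse_normalized(value):
    """Split a DD/MM/YYYY string into integers (day, month, year)."""
    day, month, year = (int(part) for part in value.split("/"))
    return day, month, year


def modern_year(day, month, year_as_written):
    """Return the year counted from 1 January instead of 25 March."""
    # Dates from 1 January to 24 March still carried the previous year's
    # number under the old reckoning, so they move forward by one year.
    if (month, day) < (3, 25):
        return year_as_written + 1
    return year_as_written


day, month, year = parse_normalized("25/02/1655")
print(f"{day:02d}/{month:02d}/{modern_year(day, month, year)}")
```

For the date on this page the sketch prints 25/02/1656, matching the modern year given in brackets in the original transcription, while the verbatim text stays untouched.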

   COLIN: * The note function is neat.  I highlighted XXXX and got the code: " their respective Commons <note>[WHAT IS THIS A CONTRACTION FOR]</note>", which displayed as: "their respective Commons [WHAT IS THIS A CONTRACTION FOR]"


   * I can see that there is some risk of "over-marking up" the text at the transcription stage, and I think team facilitators need to be consistent in the guidance they give, particularly since otherwise they are going to find themselves editing out over-enthusiastic transcribers' work, and I don't find it that easy moving between the published display at the bottom of the page, the text input box, and the manuscript image above to do such editing. My point is that once code is on the page, it is harder to take out and alter than it is to get it right or not put it in in the first place.


   - an example (possibly, but for discussion) is I don't think you would mark up any of "the Captaines of the said frenchmen of warr"?


   So no occupation button for "Captaines" and no ship button for "frenchmen of warr".


GIOVANNI: I would mark Captaines as person, not profession, and frenchmen of warr as a ship.

   Or, alternatively, is this exactly a case of where a transcriber is best placed to know these are occupations and ships worth marking up?


   If we were wanting to use markup (converted into TEI compliant markup) to drive searches such as How many legal depositions refer to French war ships (as opposed to merchant ships) in the (English) channel, you would need to know that the "the golden Eagle of Callice" and "the Royal Mary" were french ships and were ships of war, and that Callice (presumably Calais), and other ports such as Dinkirk (with its spelling variants) had been grouped for the purpose of the search under the broader term (English) "Channel"


GIOVANNI: yes.. The question is: how deep we go and when we stop. I can add a value tag for ships, to record nationality, and a value tag for persons, to record title/profession, if present. Would that suffice?

   COLIN: * If text in the manuscript flows over onto a second line and the expression being marked up also flows over into a second line, the published text at the bottom of the page is displaying the generated line number also in the colour of the chosen category markup expression. I think that is fine, and probably too hard to change.


GIOVANNI: thank you, will fix

   COLIN: * I don't understand what the "Header" button is meant to do - it looks like it generates an input area for metadata? I mistakenly highlighted "To the Crosse Interries:-", which I had already marked up in simple text as [TEXT IS CENTRED] thinking that I was marking a centre heading, but my highlighted text disappeared (I have replaced it now) and the following code was generated:


   <header>
   <folio></folio>
   <picture></picture>
   <case-summary></case-summary>
   <deposition></deposition>
   <document-date normalized=""></document-date>
   <status></status>
   <first-transcriber></first-transcriber>
   </header>


   <document-start>
   DOCUMENT TRANSCRIPTION HERE
   </document-end>


GIOVANNI: the header is meant to be used when you start a new document. It provides the correct framework into which transcribers should work. Please, do use it instead of other means, we're going to need it afterwards. I wouldn't provide something to center text and such.

   * I am clear that we should not be marking up values and/or behaviours at the transcription stage (if at all), though I believe this is a subject which would merit some considerable thought and discussion at the advisor level, and possibly post January some trial proof of concept markup once we have a cumulative rigorous database which can be interrogated


   * I think I have now "got it" as to why, in an ideal world, the transcription and full semantic markup are not separated. However, I am even more convinced that with an amateur, but motivated and well trained set of transcribers working on a short project, it would be a major mistake to try to semantically mark up any more than person, place, ship, commodity, occupation/title, currency, quantity, date.  Frankly, even those eight categories put a lot of "noise" into the text input box.


GIOVANNI: good point. The why is that if we don't do it, we'll need markuppers to process slowly through transcribed pages, with the image at hand, to do that job. We won't need to do the work again, but sort of. If we think we'll be able to get a good number of markuppers, we can try to enhance it further. I personally think we're keeping the markup by transcribers rather low, which is correct, and would stick to what we planned, no more.


   COLIN: * I am also clear (for discussion) that we need the transcribers first to create a "clean" transcription, without using any category buttons, and that this then needs review and perfection and signoff, before the categories are added.  Otherwise palaeographical questions and learning will get all mixed up with category editorial policy, and I think that is a very big ask for the first four weeks of transcription post training. So I think team facilitators, to the extent that they are acting as page editors, will need to take two passes at each page.  For discussion please.


GIOVANNI: many things can be marked up on the fly, since they're rather easy. A second pass, supervised by a facilitator, will probably be needed.

   COLIN: * I also understand Giovanni's (and Patrizia's) points about the weakness of Scripto being the absence of a cumulative, aggregating, robust database which accumulates the category markup, so that at any one time you can inspect all places, people, etc. input by transcribers to date. However, that is clearly not soluble for this project.  It does mean that any "nice to have" but NOT essential functionality, such as mapping out places referred to in the transcribed documents, would have to be done as a one-off piece of analysis, almost certainly on a sample basis.  Playing with mapped data looks like it can only really happen after we have that robust database, which will be generated by the markup/analysis team


GIOVANNI: agree

   COLIN. * Finally, it is clear to me that we should move any planned end of project conference from the tentative end of January to a tentative end of March or early April, to give us LOTS of wiggle time on the database/markup/analysis stage of the project


GIOVANNI: agree



Patrizia to Colin, Giovanni; 29/08/12


URGENT - RESPONSE NEEDED: Buttons and tags & need for an editorial policy for category markup (i.e. semantic markup) to be done by transcribers: Response by Patrizia


On Thu, Aug 30, 2012 at 12:05 AM, Giovanni Colavizza <giovannicolavizza@gmail.com> wrote:

   Dears,


   a few general points follow, together with specific choices I propose, open for discussion.


   After collecting suggestions from everybody, we're approaching a choice that I feel I should justify. We have many assumptions to make, which make this project both stimulating and constrained: lack of time, crowdsourcing, lack of money and the necessity for open source solutions are the most important for me in dealing with our IT solution. I think we agreed that we'll have a semi-diplomatic transcription, leaving most of the document format out. At the same time we plan to build as rich a database as possible for further use of this data. Given all this, we needed an easy-to-use environment, knowing that the tuning of the markup and the actual buildup of the database will follow, and will be done by people other than the transcribers. Yet the burden of highlighting meaningful information while transcribing is on their shoulders, as we all agreed it's not possible to avoid this. Afterwards, we'll adjust the markup to be TEI compliant (with customisations) and build a database, for further use.


   Scripto is very probably going to be our solution, not because it's perfect (far from), but because I fail to see any other free and easy to use framework that gives us the same amount of customization (even if not native). What it really lacks, and is a concern for me, are two things: the possibility to build a richer database as we go (with all ships, persons, etc.), and support for consistency checks to facilitate transcribers and reduce mistakes (the two things come together). Yet I fail to see comparable solutions providing these things, so we'll need extra care by transcribers and facilitators, as well as subsequent work on the database. So, I'm aware it's not a perfect solution, but I hope it'll do.


   Let's get specific on what I plan to add to basic Scripto (please also refer to Patrizia's mail attached below):


   - pretty much everything Charlene required. For now I'm just refraining from text align buttons (if we do those, many other format properties should be added: we might perhaps leave all this out and just assume the image will sit beside the transcription during future use? This is what Patrizia suggests, and I endorse it). Would it perhaps be preferable to just add a tag for marginal/inserted text, as Charlene proposed?
   - everything Patrizia suggests (see below), except: abbreviations, dashes and ampersands, for which I suggest we stick to Charlene's policy, and deal with TEI requirements later on.
   - special things that will be tagged are: persons, places, ships, commodities, currencies, weights, measures, titles/professions. Should we add behaviours/values? We'll need some rules of thumb for transcribers if so, but I think it might be worth a try. Anything else? Please, tell me.


   Patrizia's idea to track ownership brings us to a major problem, I think: unique identifiers for things and relations between data. Ideally we should strive for a good transcription and a prosopographical framework built upon it, to allow navigation throughout the information. This is huge work, ideally done with semantic web tools, which we won't use now. My question is, how and to what extent can we approach and move towards such an end, with our possibilities? We do not have a growing database to work upon, which would allow us to uniquely identify things and add data/relations to them. As each transcriber works, s/he might find mentions of a person already named in other documents, and we won't be able to assume the two are the same (or that they are different, even with the same name). Some work to this end can and will be done on database building, but I think we'll still end up with unlinked data. While I plan to hear more from Patrizia about this, and follow her judgement considering her ample experience, I would like to hear from you all.


   Also, any other idea to improve Scripto, or even a last minute new fancy solution, is welcome. All the best,




Giovanni to Patrizia, Charlene, Colin, Jill, William; 30/08/12: 12:05


URGENT - RESPONSE NEEDED: Buttons and tags & need for an editorial policy for category markup (i.e. semantic markup) to be done by transcribers: Response by Giovanni

On Thu, Aug 30, 2012 at 12:05 AM, Giovanni Colavizza <giovannicolavizza@gmail.com> wrote:

A few general points follow, together with specific choices I propose, open for discussion.

After collecting suggestions from everybody, we're approaching a choice that I feel I should justify. We have many assumptions to make, which make this project both stimulating and constrained: lack of time, crowdsourcing, lack of money and the necessity for open source solutions are the most important for me in dealing with our IT solution. I think we agreed that we'll have a semi-diplomatic transcription, leaving most of the document format out. At the same time we plan to build as rich a database as possible for further use of this data. Given all this, we needed an easy-to-use environment, knowing that the tuning of the markup and the actual buildup of the database will follow, and will be done by people other than the transcribers. Yet the burden of highlighting meaningful information while transcribing is on their shoulders, as we all agreed it's not possible to avoid this. Afterwards, we'll adjust the markup to be TEI compliant (with customisations) and build a database, for further use.

Scripto is very probably going to be our solution, not because it's perfect (far from), but because I fail to see any other free and easy to use framework that gives us the same amount of customization (even if not native). What it really lacks, and is a concern for me, are two things: the possibility to build a richer database as we go (with all ships, persons, etc.), and support for consistency checks to facilitate transcribers and reduce mistakes (the two things come together). Yet I fail to see comparable solutions providing these things, so we'll need extra care by transcribers and facilitators, as well as subsequent work on the database. So, I'm aware it's not a perfect solution, but I hope it'll do.

Let's get specific on what I plan to add to basic Scripto (please also refer to Patrizia's mail attached below):

- pretty much everything Charlene required. For now I'm just refraining from text align buttons (if we do those, many other format properties should be added: we might perhaps leave all this out and just assume the image will sit beside the transcription during future use? This is what Patrizia suggests, and I endorse it). Would it perhaps be preferable to just add a tag for marginal/inserted text, as Charlene proposed?

- everything Patrizia suggests (see below), except: abbreviations, dashes and ampersands, for which I suggest we stick to Charlene's policy, and deal with TEI requirements later on.

- special things that will be tagged are: persons, places, ships, commodities, currencies, weights, measures, titles/professions. Should we add behaviours/values? We'll need some rules of thumb for transcribers if so, but I think it might be worth a try. Anything else? Please, tell me.

Patrizia's idea to track ownership brings us to a major problem, I think: unique identifiers for things and relations between data. Ideally we should strive for a good transcription and a prosopographical framework built upon it, to allow navigation throughout the information. This is huge work, ideally done with semantic web tools, which we won't use now. My question is, how and to what extent can we approach and move towards such an end, with our possibilities? We do not have a growing database to work upon, which would allow us to uniquely identify things and add data/relations to them. As each transcriber works, s/he might find mentions of a person already named in other documents, and we won't be able to assume the two are the same (or that they are different, even with the same name). Some work to this end can and will be done on database building, but I think we'll still end up with unlinked data. While I plan to hear more from Patrizia about this, and follow her judgement considering her ample experience, I would like to hear from you all.

Also, any other idea to improve Scripto, or even a last minute new fancy solution, is welcome. All the best,



Patrizia to Giovanni, Colin: 29/08/12; XXX


URGENT - RESPONSE NEEDED: Buttons and tags & need for an editorial policy for category markup (i.e. semantic markup) to be done by transcribers

2012/8/29 Patrizia Rebulla <patrizia.rebulla@gmail.com>

Hi Colin, hi Giovanni,

We all talked together today. Only to sum it up before leaving (Giovanni already had a longer and more detailed mail on technicalities), I'd suggest for the moment the following

  • add a menu that opens the letters with diacritic and accents (Giovanni got my list)


  • add a button for


- currencies (20 shillings)
- weights (ship of 50 tons; 8 oz of butter...)
- measures (Dunkirk is 20 miles from Southampton; this ship is 18 feet long)
- title (this is beyond occupation: in time of privateering, when a war started somebody who was a fisherman could become the master of a ship; somebody who is a merchant could become an alderman)
- ownership (the ship belonged to...; then I know we will have the problem of the proportion of ownership: one third, the half. I'd like to save it, but we may discuss it upon my return. I have some idea)

  • I'd limit the diplomatic aspects of the papers to the minimum, and concentrate more on the content. What I proposed to Giovanni is to use these tags, which are TEI standards:


- misspellings can be marked with the <sic> tag. This is used with the 'corr' attribute to reassure the reader that this is not a faulty transcription, e.g.: but rather shaken by their <sic corr="nervewracking">nerveracking</sic>

- abbreviations, if we want to expand them (for readers' sake), have this tag altho

  • I saw some notes about dashes and ampersands. If we want to leave them as they are, in TEI they are replaced using ISO values. Dashes are encoded as &mdash; and ampersands as &amp;


  • Finally, as said today with Giovanni, we need to save dates not only in their narrative form (8 June 1654) but also in their date_form (08/06/1654). This allows us to display the papers in chronological order.
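
A minimal sketch of how the stored date_form could drive chronological display. The function names and the sample records are hypothetical, and it assumes the date_form is always written day/month/year as in the example above.

```python
from datetime import date, datetime

def parse_date_form(value: str) -> date:
    """Parse a day/month/year string such as '08/06/1654'."""
    return datetime.strptime(value, "%d/%m/%Y").date()

def sort_chronologically(records):
    """Sort (narrative form, date_form) pairs by the machine-readable date."""
    return sorted(records, key=lambda record: parse_date_form(record[1]))

# Hypothetical records: narrative form as transcribed, plus the date_form.
records = [
    ("25th day of February 1655", "25/02/1655"),
    ("8 June 1654", "08/06/1654"),
]

ordered = sort_chronologically(records)
for narrative, date_form in ordered:
    print(date_form, "|", narrative)
```

Sorting on the parsed date rather than on the text avoids the day-first string order putting 08/06/1654 and 25/02/1655 in the wrong sequence by accident.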


Sorry for this short and untidy note. I'm really struggling with time. Hope this helps.



to Giovanni, Patrizia, Stuart, cc William, Jill


Dear Scripto improvement team,

cc FYI: William and Jill,

I spoke yesterday to both Patrizia and Giovanni via SKYPE. I am representing the team facilitators in the Scripto improvement team, and would encourage William and Jill, who I am copying in to comment on the experimental functionality now in beta form in http://www.marinelives-transcript.org.

All my examples below refer to my experimental markup, which I have saved and protected, for
HCA 13/71 f.100r P1130402
(http://marinelives-transcript.org/scripto/scripto/?scripto_action=transcribe&scripto_doc_id=363&scripto_doc_page_id=361)

(1) Giovanni, you continue to move at lightning speed, and are effectively contributing to, synthesising, and prioritising possible functionality improvements and system adjustments.

(2) I have found working through practical markup examples to be the most effective way to highlight any ambiguities, technical problems, and, very importantly, the fact that we will need a set of editorial policies regarding the use of these new buttons.

(3) Charlene and Stuart - we now urgently need reaction from you both to the new system.

  • May I suggest to you both that ideally today each of you take a clean page which I have transcribed, and mark them up using the new suite of tools.




(4) My comments: use of category markup (to inform the semantic markup/data analysis team when they add further TEI compliant markup).

Using the categories "person", "place", "ship", "commodity", "currency", "quantity", and "date" on a real page (HCA 13/71 f.100r P1130402) has highlighted to me the need for conceptual clarity before we mark up what until then look like relatively "clean" to read text input boxes. I still think we should mark up the categories, and agree that we need a category for "occupation" (which should probably, but not necessarily, subsume "title"), which Giovanni has not yet implemented.

For example:

  • the button ship: do we highlight "fortune", or "the fortune", or "the shipp the fortune". Does it matter? Giovanni, the "ship" category is currently not displaying in colour the HTML driven publication of the transcribed page at the bottom (it displays as e.g. "the said shipp < style="color:blue">fortune")


  • the button person: do we highlight only personal names, or do we include clear individuals such as "the king of Spain", or should it be "King of Spain". Another example would be the Lord Protector, which I have marked up as The Lord Protector (The <person>Lord Protector </person>against). This is being used as a name, despite being a title (or is it an occupation). Once we have the occupation button, would we mark this up in preference as an occupation, or is it both a name and an occupation and requires double markup? If it requires double markup, is there a syntax which requires one to come within the others? (I don't see any brackets or other syntactic like devices being generated in the HTML code)


  • Would "King of Spain" be marked up both as a person and "Spain" as a place, or is this a clear case of where the transcriber is distinguishing person and place?


  • Are we going to keep titles out of names? I have marked up "the said Captaine John Coveruserat" as "the said Captaine <person>John Coveruserat</person>" rather than "the said <person>Captaine John Coveruserat</person>". Once we have an occupation button, subsuming title, would we therefore mark up "Captaine" and "John Coveruserat" separately?


  • the buttons "place" and "commodity" seem easier to use conceptually.


- In the case of "place" I have assumed (i.e. made an editorial policy assumption) that compound places will be marked up twice, e.g. "of Callice in ffrance" is marked up as "of <place>Callice</place> in <place>ffrance</place>".

- In the case of "commodity" it is obvious that lists should be divided into individual commodities, e.g. "in wynes and brandewine" is marked up as "in <commodity>wynes</commodity> and <commodity>brandewine</commodity>".

- I am assuming that if we had the phrase "two hogsheads of wine" this would require markup as quantity and commodity, but I am unclear whether it is the "two" which is highlighted, as in <quantity value="hogshead">two</quantity>, or whether it is as in <quantity value="hogshead">two hogsheads </quantity>.

- I am also unclear whether spelling for the units to be entered into the code expression for "currency" and "quantity" should be normalised, as in "ton" not "tonne", "tonnes", "tons"... and if so we will need an agreed set of normalisations.

- I am also not clear whether it is just the numeric value in a quantity which should be marked up, as in "2" or "two", or whether the transcribed quantity description should be included in the expression which is marked up as in "2 hogsheads" or "two hogsheads". I presume that "2 hogsheads of wine" will be marked up separately for quantity and commodity.

- I notice that the code expression for date requires normalisation to modern format. Do we want transcribers to adjust the year of the date for dates between January and late March to the following year, or do we want that done later (and probably more accurately) by the markup/analysis team?

-- I have chosen for the moment not to correct the date, though I, in my original transcription, had added the modern date in brackets. So I have rendered "25th day of February 1655 (1656)" as "The <date value="25/02/1655">25th day of February 1655</date> (1656)", which displays as "The <date value="25/02/1655">25th day of February 1655</date> (1656)"
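
A rough sketch of the year adjustment discussed above, in case it is left to the markup/analysis team. It assumes the documents follow the English convention of the period, in which the year number changed on 25 March, so a date written between 1 January and 24 March belongs to the following modern year. The function names are hypothetical.

```python
def modern_year(day: int, month: int, written_year: int) -> int:
    """Return the modern (1 January start) year for a date whose written
    year is assumed to change on 25 March."""
    if month < 3 or (month == 3 and day < 25):
        return written_year + 1
    return written_year

def normalised_value(day: int, month: int, written_year: int) -> str:
    """Build a day/month/year value with the modern year."""
    return f"{day:02d}/{month:02d}/{modern_year(day, month, written_year)}"

# "25th day of February 1655" as written corresponds to 1656 in modern reckoning.
print(normalised_value(25, 2, 1655))
# A June date needs no adjustment.
print(normalised_value(8, 6, 1654))
```

Keeping the written year in the transcription and computing the modern year separately means the original reading is never overwritten.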

  • The note function is neat. I highlighted XXXX and got the code: " their respective Commons <note>[WHAT IS THIS A CONTRACTION FOR]</note>", which displayed as: "their respective Commons [WHAT IS THIS A CONTRACTION FOR]"


  • I can see that there is some risk of "over-marking up" the text at the transcription stage, and I think team facilitators need to be consistent in the guidance they give, particularly since otherwise they are going to find themselves editing out over-enthusiastic transcribers' work, and I don't find it that easy moving between the published display at the bottom of the page, the text input box, and the manuscript image above to do such editing. My point is that once code is on the page, it is harder to take out and alter than it is to get it right, or not put it in, in the first place.


- an example (possibly, but for discussion) is I don't think you would mark up any of "the Captaines of the said frenchmen of warr"?

So no occupation button for "Captaines" and no ship button for "frenchmen of warr".

Or, alternatively, is this exactly a case of where a transcriber is best placed to know these are occupations and ships worth marking up?

If we wanted to use markup (converted into TEI compliant markup) to drive searches such as "How many legal depositions refer to French war ships (as opposed to merchant ships) in the (English) Channel?", you would need to know that "the golden Eagle of Callice" and "the Royal Mary" were French ships and were ships of war, and that Callice (presumably Calais) and other ports such as Dunkirk (with its spelling variants) had been grouped for the purpose of the search under the broader term "(English) Channel".

  • If text in the manuscript flows over onto a second line and the expression being marked up also flows over into a second line, the published text at the bottom of the page is displaying the generated line number also in the colour of the chosen category markup expression. I think that is fine, and probably too hard to change.


  • I don't understand what the "Header" button is meant to do - it looks like it generates an input area for metadata? I mistakenly highlighted "To the Crosse Interries:-", which I had already marked up in simple text as [TEXT IS CENTRED], thinking that I was marking a centre heading, but my highlighted text disappeared (I have replaced it now) and the following code was generated:


<header>
<folio></folio>
<picture></picture>
<case-summary></case-summary>
<deposition></deposition>
<document-date normalized=""></document-date>
<status></status>
<first-transcriber></first-transcriber>
</header>

<document-start>
DOCUMENT TRANSCRIPTION HERE
</document-end>
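
A minimal sketch of how the header fields could later be read back out of a transcribed page. It only looks at the <header> block (the document-start/document-end pair is not a matching pair of tags, so the page as a whole is not parsed as XML). The function name and the sample values are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

def read_header(page_text: str) -> dict:
    """Return the header fields of a page as a dict, or {} if there is no header."""
    match = re.search(r"<header>.*?</header>", page_text, flags=re.DOTALL)
    if match is None:
        return {}
    header = ET.fromstring(match.group(0))
    fields = {}
    for child in header:
        fields[child.tag] = (child.text or "").strip()
        # Keep attributes such as normalized="" under a "tag@attribute" key.
        for name, value in child.attrib.items():
            fields[f"{child.tag}@{name}"] = value
    return fields

# Hypothetical filled-in header, using the page reference quoted in this email.
sample_page = """<header>
<folio>HCA 13/71 f.100r</folio>
<picture>P1130402</picture>
<document-date normalized="25/02/1655">25th day of February 1655</document-date>
</header>
DOCUMENT TRANSCRIPTION HERE
"""

header_fields = read_header(sample_page)
print(header_fields["folio"], header_fields["picture"])
```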

  • I am clear that we should not be marking up values and/or behaviours at the transcription stage (if at all), though I believe this is a subject which would merit considerable thought and discussion at the advisor level, and possibly, post January, some trial proof of concept markup once we have a cumulative rigorous database which can be interrogated


  • I think I have now "got it" as to why, in an ideal world, the transcription and full semantic markup are not separated. However, I am even more convinced that with an amateur, but motivated and well trained, set of transcribers working on a short project, it would be a major mistake to try to semantically mark up any more than person, place, ship, commodity, occupation/title, currency, quantity, date. Frankly, even those eight categories put a lot of "noise" into the text input box.


  • I am also clear (for discussion) that we need the transcribers first to create a "clean" transcription, without using any category buttons, and that this then needs review and perfection and signoff, before the categories are added. Otherwise palaeographical questions and learning will get all mixed up with category editorial policy, and I think that is a very big ask for the first four weeks of transcription post training. So I think team facilitators, to the extent that they are acting as page editors, will need to take two passes at each page. For discussion please.


  • I also understand Giovanni's (and Patrizia's) points about the weakness of Scripto being the absence of a cumulative, aggregating, robust database which accumulates the category markup, so that at any one time you can inspect all places, people, etc. input by transcribers to date. However, that is clearly not soluble for this project. It does mean that any "nice to have" but NOT essential functionality, such as mapping out places referred to in the transcribed documents, would have to be done as a one-off piece of analysis, almost certainly on a sample basis. Playing with mapped data looks like it can only really happen after we have that robust database, which will be generated by the markup/analysis team


  • Finally, it is clear to me that we should move any planned end of project conference from the tentative end of January to a tentative end of March or early April, to give us LOTS of wiggle time on the database/markup/analysis stage of the project
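
A rough sketch of the kind of one-off analysis mentioned above: collecting every category-tagged mention from saved pages into a single index. It is an illustration only, not part of Scripto; the function name and the sample page are hypothetical, and nested tags are not handled.

```python
import re
from collections import defaultdict

CATEGORIES = ("person", "place", "ship", "commodity", "currency", "quantity", "date")

# Matches <place>Callice</place> as well as tags carrying attributes,
# e.g. <quantity value="hogshead">two</quantity>.
TAG_PATTERN = re.compile(
    r"<(%s)(?:\s[^>]*)?>(.*?)</\1>" % "|".join(CATEGORIES), re.DOTALL
)

def collect_mentions(pages: dict) -> dict:
    """Map category -> mention text -> list of page ids where it occurs."""
    index = defaultdict(lambda: defaultdict(list))
    for page_id, text in pages.items():
        for category, mention in TAG_PATTERN.findall(text):
            index[category][mention.strip()].append(page_id)
    return index

# Hypothetical input: page id -> transcribed text with category markup.
pages = {
    "HCA 13/71 f.100r": (
        "of <place>Callice</place> in <place>ffrance</place> "
        "in <commodity>wynes</commodity> and <commodity>brandewine</commodity>"
    ),
}

mentions = collect_mentions(pages)
print(sorted(mentions["place"]))
```

An index like this records spellings exactly as transcribed; deciding whether two mentions refer to the same place or person would still be a separate editorial step.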


These are my thoughts for what they are worth. I am very keen to hear back as soon as possible from Stuart and Charlene, and also to get William and Jill's input.

I am also going to contact Dr Elaine Murphy today, who I am meeting in Cambridge on Thursday 6th September, to ask her if she would be prepared to take a look at our Scripto modifications and to comment.

Patrizia is leaving today for Austria, and will only be back properly into the conversation Friday week. She and Giovanni are clearly already establishing a very productive relationship, and will jointly be leading the database/semantic markup/analysis team, and will be joint facilitators of that team. I am looking to the two of them, once Patrizia is back, to produce an outline plan for their team, showing broad milestones, and very importantly providing an estimate of the numbers of associates they need on their team, with what sorts of prior experience and training.




Queries to team leaders

Colin




Comment Box


Comments