End of the day

I have completed a program that reads a set of manually corrected adorned XML files, allows several XML tag categories to be included or excluded, and emits the tabular input which the MorphAdorner tagger trainer suite uses.

I still have to deal with an ancient training file based upon Mary Wroth’s Urania for which I do not have the matching XML.  I have looked at several available TEI transcriptions.  None matches exactly, but the version in EEBO is close enough for me to try writing a program which collates the token streams and allows me to reconstruct a TEI version of the Urania training data.
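The collation idea can be sketched with Python's standard difflib.  This is only an illustration of the approach, not the actual program: the token lists below are made up, and real EEBO transcriptions would need spelling normalization beyond simple lowercasing before the streams align well.

```python
import difflib

# Hypothetical token streams: one from the old Urania training file,
# one from the EEBO TEI transcription.  Both lists are illustrative.
training_tokens = ["Now", "the", "Spring", "began", "to", "appeare"]
tei_tokens = ["Now", "the", "Spring", "time", "began", "to", "appeare"]

# Collate the two streams case-insensitively so that matching runs
# of tokens can carry the training file's adornments over to the TEI.
matcher = difflib.SequenceMatcher(
    a=[t.lower() for t in training_tokens],
    b=[t.lower() for t in tei_tokens],
)

for op, a1, a2, b1, b2 in matcher.get_opcodes():
    if op == "equal":
        # TEI tokens b1..b2 align with training tokens a1..a2.
        print("match:", training_tokens[a1:a2])
    else:
        # Insertions, deletions, and replacements need hand inspection.
        print(op, training_tokens[a1:a2], tei_tokens[b1:b2])
```

The gaps reported by the non-equal opcodes are exactly the places where the reconstructed TEI version would need manual review.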

Meanwhile I have enough other training data segregated by prose/verse, paratext/no paratext, errata divs ejected, etc. to try adorning some files.  I expect I’ll be working on that the rest of this week.  If it turns out that using separate training data helps, we’ll switch over to using that approach for all the texts.  If it doesn’t, or makes no difference, we’ll continue using the combined training data.

With that I come to the end of my Day of Digital Humanities.  Just a small slice of life in the digital world.

Element specific training data

I am now starting to work on creating segregated training data for MorphAdorner’s part of speech taggers.  This is to see if using element-specific training data helps improve part of speech tagging.

MorphAdorner’s training data is created from manually revised versions of TEI (or TEI-like) XML files which were previously adorned with MorphAdorner.  Hence it is possible to separate sections of the training data by XML element.  Crudely, text which descends from <l> elements is considered verse, while text which descends from <p> elements is considered prose.
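The ancestry test can be sketched in a few lines of Python.  This is a toy illustration, not MorphAdorner's implementation: the tiny document and the use of `<w>` word elements are assumptions for the example.

```python
import xml.etree.ElementTree as ET

# Illustrative TEI-like fragment with one verse line and one prose paragraph.
xml = """<text>
  <l><w>Shall</w> <w>I</w> <w>compare</w></l>
  <p><w>It</w> <w>was</w> <w>the</w> <w>best</w></p>
</text>"""

root = ET.fromstring(xml)

# ElementTree has no parent pointers, so build a child-to-parent map.
parents = {child: parent for parent in root.iter() for child in parent}

def category(elem):
    # Walk toward the root until we hit <l> (verse) or <p> (prose).
    while elem is not None:
        if elem.tag == "l":
            return "verse"
        if elem.tag == "p":
            return "prose"
        elem = parents.get(elem)
    return "other"

for w in root.iter("w"):
    print(w.text, category(w))
```

Each word inherits its category from the nearest `<l>` or `<p>` ancestor, which is all the crude verse/prose split requires.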

We also want to separate text which appears in paratext (e.g., <note> elements) from the main text.

We also want to consider the content of <speaker> elements separately.  These contain lots of odd abbreviations that are in many cases work-specific.  The text is also rarely in sentence form.

Some <div> types should probably be ejected from the training data altogether.  A good example is the contents of <div type="errata"> in the TCP texts.  These contain text that doesn’t comprise good English sentences and are unlikely to generate useful statistical values for adorning actual sentences or lines of poetry.

What I’m doing today

I’m writing this as I finish eating lunch.  I spent the morning working on stuff unrelated to humanities computing.  This afternoon I’m working on the grant project which I mentioned in my About me posting:  the linguistic annotation of several Text Creation Partnership corpora (EEBO, ECCO, and Evans).  The project has been ongoing for a few months, and will continue for a few more.

Today I’m working on two items.   Neither is glamorous, but both are a typical part of the daily project work.

  • Creating separate training data sets for different sets of XML elements.  We want to see if training the part of speech tagger using different data for prose and poetry (for example) improves the results.  An important new feature of MorphAdorner v2 is its ability to use different training data for different XML elements.
  • Updating the MorphAdorner server data files to incorporate a number of corrections we’ve made over the past couple of weeks.  The MorphAdorner server (MAServer) allows access to many MorphAdorner facilities through a simple REST-like HTTP interface with a choice of XML or JSON formatted results.  Currently only the plain text based facilities are implemented. The XML based facilities will be coming Real Soon Now.

About me

My background is in mathematics, statistics, and computer science, as applied to a wide variety of academic disciplines.

My tenure with “digital” humanities stretches back to the early 1970s. At that time I worked with the classicist Alexander MacGregor at the University of Illinois at Chicago to apply classification methods to fragments of ancient Greek manuscripts in an attempt to identify potential authorship or at least authorial influence.

Over the past nine years I’ve worked with Martin Mueller, professor emeritus of English and Classics here at Northwestern University, on several projects involving pieces of a grand vision which Martin calls “The Book of English.”

Projects in which I’ve been involved that explore ways to implement parts of the Book of English include WordHoard, Monk, VOSPOS, MorphAdorner, and Project Bamboo. MorphAdorner is my attempt to contribute to the solution of the “light but consistent structural and linguistic annotation” puzzle piece of the Book of English.

I typically have less than 5% of my time per year available for humanities computing. The other 95% is spent on ordinary programming projects that keep the University running. This year is an exception. Martin and I secured a grant which allows me to spend about 2/3 of my time improving MorphAdorner and working on several collections from the Text Creation Partnership.