I have completed a program that reads a set of manually corrected adorned XML files, lets me specify which XML tag categories to include or exclude, and emits the tabular input that the MorphAdorner tagger trainer suite uses.
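For readers who want a feel for what such a filter does, here is a minimal Python sketch. It is not the actual program. The `w` element name, the `pos` and `lem` attribute names, the three-column tab-separated output, and the sample tags are all assumptions made for illustration; the real trainer input may be laid out differently.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def training_rows(element, excluded):
    """Yield (spelling, pos, lemma) for each <w> outside excluded elements."""
    name = local_name(element.tag)
    if name in excluded:
        return  # skip this element and everything inside it
    if name == "w":
        # Attribute names 'pos' and 'lem' are assumptions for illustration.
        spelling = "".join(element.itertext()).strip()
        yield spelling, element.get("pos", ""), element.get("lem", "")
        return
    for child in element:
        yield from training_rows(child, excluded)


def to_table(xml_text, excluded=frozenset()):
    """Return one tab-separated line per word (hypothetical column order)."""
    root = ET.fromstring(xml_text)
    return "\n".join("\t".join(row) for row in training_rows(root, excluded))


# Made-up sample; the tags and lemmata are illustrative only.
SAMPLE = """<text>
  <front><w lem="to" pos="acp">To</w> <w lem="the" pos="dt">the</w></front>
  <body><l><w lem="love" pos="n1">Loue</w> <w lem="be" pos="vbz">is</w></l></body>
</text>"""

print(to_table(SAMPLE, excluded={"front"}))
```

Passing `excluded={"front"}` drops the paratext words, so only the two words inside `body` are emitted. Passing an empty set emits all four.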
I still have to deal with an ancient training file based upon Mary Wroth’s Urania for which I do not have the matching XML. I have looked at several available TEI transcriptions. None matches exactly, but the version in EEBO is close enough for me to try writing a program which collates the token streams and allows me to reconstruct a TEI version of the Urania training data.
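The collation step could be sketched roughly as follows. This is only an outline of the idea, using Python's standard `difflib.SequenceMatcher` as a stand-in for whatever alignment method the real program ends up using. The function name `collate` and the sample tokens are hypothetical.

```python
from difflib import SequenceMatcher


def collate(training_tokens, tei_tokens):
    """Return (tag, training_slice, tei_slice) for each aligned run.

    tag is one of 'equal', 'replace', 'delete', or 'insert'.
    """
    # autojunk=False keeps frequent tokens (e.g. "the") eligible for matching.
    matcher = SequenceMatcher(a=training_tokens, b=tei_tokens, autojunk=False)
    return [
        (tag, training_tokens[i1:i2], tei_tokens[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


# Illustrative token streams, not taken from the actual files.
training = ["When", "the", "Spring", "began", "to", "appeare"]
tei = ["When", "the", "Spring", "begun", "to", "appeare", "like"]

for tag, left, right in collate(training, tei):
    print(tag, left, right)
```

Runs tagged `equal` are where the tags from the training file could be carried over to the TEI tokens directly. The `replace`, `delete`, and `insert` runs are the places that would need a closer look by hand.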
Meanwhile I have enough other training data, segregated by prose/verse and paratext/no paratext, with errata divs ejected, and so on, to try adorning some files. I expect I’ll be working on that for the rest of this week. If using separate training data turns out to help, we’ll switch over to that approach for all the texts. If it hurts or makes no difference, we’ll continue using the combined training data.
With that I come to the end of my Day of Digital Humanities. Just a small slice of life in the digital world.