The big dataset I’ve been playing with is a ‘small’ subset of the Archive Team Geocities Snapshot. I’ve gone into that collection and downloaded the subsites of the ca.geocities.com domain.
It’s funny that I consider that ‘small,’ when the compressed .tar file that results from downloading and combining them all is something like 2.74GB. But anyhoo, compared to the complete collection, which is in the 680GB+ range, it’s ‘small.’ In any case, it’s a good training set – a decent size, able to be digested and analyzed, and mostly focused on Canadian topics.
What the heck do you do with all that data?
Since I’m thinking methodologically these days, I’m considering that part and parcel of my research process. So far, I’ve done the following:
(1) created a mirrored version of them all on my hard drive, which lets me go in and see what each site looked like back then. At some point, I’ll need a legacy browser emulator to do that properly.
(2) backed it all up so I can mess around with things with impunity
(3) Most importantly to date, I’ve created a full-text repository of each site. To do this, I went through each site with Mathematica, scraped out the plaintext, and aggregated each individual account’s site into one big text file. Some of these are massive, some small. This lets me start thinking about them at the site level, and also lets me quickly find the information I might be looking for.
(4) From that, I’ve created initial unigram data – the word frequencies for each site, one list with stopwords, one without. So for each site I have something like this:
{{"sara", 70}, {"said", 63}, {"jack", 60}, {"kyle", 56}, {"time", 54},
{"mother", 48}, {"just", 48}, {"asked", 46}, {"like", 43}, {"man", 41},
{"did", 39}, {"tower", 38}, {"didn", 38}, {"medea", 37}, {"father", 36},
{"saw", 35}, {"way", 35}, {"woman", 35}, {"face", 33}, {"knew", 30},
{"began", 30}, {"simon", 30}, {"came", 30}, {"jason", 29}, {"edward", 29},
{"eyes", 29}, {"new", 29}, {"thought", 29}, {"felt", 28}, {"moment", 28},
{"know", 28}, {"stories", 28}, {"village", 27}, {"good", 27}, {"hand", 26},
{"replied", 25}, {"beautiful", 25}, {"looked", 24}, {"evie", 24},
{"love", 24}, {"selena", 23}, {"glass", 23}, {"turned", 23}, {"light", 23}, ...}
One of the goals with this is to find out what we can learn from collections like this, and then compare it to what we can learn from collections that are compiled in WARC files. What should historians be pushing for? How should we try to advance digital preservation for our own needs? What does the metadata look like? And, more importantly for me, what can we learn? Can I detect the voices of young Canadians? Can we extract date information to see how things have evolved over time?
So many questions.
But now, a few quick pieces of paperwork await, and it’s off to the interview.