Maintained by: David J. Birnbaum (djbpitt@gmail.com) Last modified: 2013-08-24T11:06:36+0000
David J. Birnbaum
University of Pittsburgh
Email: djbpitt@gmail.com
URL: http://www.obdurodon.org

Jeffrey A. Rydberg-Cox
University of Missouri, Kansas City
Email: rydbergcoxj@umkc.edu
URL: http://r.web.umkc.edu/rydbergcoxj/

This workshop focuses on the use of analytical tools (especially the statistical package R and the topic-modeling toolkit Mallet) and methods (especially Bayesian classification and SVG visualization) to discover and explore information within XML data. By the end of the sessions participants will have learned how to apply the techniques and methods discussed to the analysis and visualization of their own XML texts.
The workshop is intended for beginners, and no prior experience with any of the technologies is required, although participants will need to prepare the outside readings (see below) before each of the two working days. The workshop sessions will then guide the participants through the process of selecting a text, preparing it for processing with XML-related tools, and analyzing the text using R and Mallet.
Day 1 (6 hours of instruction) provides an overview of XML and XML-related technologies, including the tools needed to extract information from the XML in the formats required by the toolkits. Day 2 (6.5 hours of instruction) concentrates on the actual analysis of the data and on formatting it for textual and graphic presentation.
Goals: Ensure that participants’ computers are configured properly before the beginning of the workshop
Topics: Installing software
Which users | Which tools | Notes |
---|---|---|
All users | <oXygen> | Patrick (and others) |
Windows users | Cygwin | |
Goals: Getting started
Topics: Using the command line and XML
Time | Topic | Notes |
---|---|---|
9:00–10:00 (60) | Using the command line (lecture and hands-on) | pwd, cd, less, cp, mv, grep (including regex), wc; redirection and piping (slides) Prep: Jeff |
10:00–10:15 (15) | Coffee break | |
10:15–10:45 (30) | Introduction to XML and TEI Lite (lecture) | Document analysis, OHCO, elements, attributes, well-formedness, validity, entities and character references; TEI Lite tutorial. Prep: David |
10:45–11:30 (45) | XML tagging (hands-on) | Examine a TEI Lite text (WordHoard Hamlet); autotag a plain text with regex search-and-replace (Gutenberg Hamlet). Prep: David. Autotag Arienne’s dataset. Prep: Jeff |
11:30–11:45 (15) | Brief overview of the XML family of standards | Schema languages, schematron, namespaces, XPath, XSLT, XQuery Prep: David |
11:45–12:00 (15) | Brief overview of web standards | (X)HTML and HTML5, CSS (drive-by JavaScript, PHP) Prep: Jeff |
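
The autotagging exercise listed above (regex search-and-replace over a plain text) can be sketched roughly as follows. This is a minimal illustration in Python rather than in an editor's search-and-replace dialog. The sample lines, the `Name. speech` line format, and the names `PLAIN`, `SPEECH`, and `autotag` are assumptions made for the example; the actual Gutenberg Hamlet is laid out differently and would need its own patterns.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical plain-text input: one speech per line, "Speaker. words".
PLAIN = """Ham. To be, or not to be, that is the question.
Oph. Good my lord, how does your honour?
Ham. I humbly thank you; well, well, well."""

# A capitalized speaker abbreviation, a period, then the rest of the line.
SPEECH = re.compile(r"^(?P<speaker>[A-Z][a-z]+)\.\s+(?P<words>.+)$", re.MULTILINE)

def autotag(plain_text):
    """Wrap each matching line in TEI-like <sp>, <speaker>, and <p> tags."""
    # Escape &, <, and > first so the result stays well-formed XML.
    escaped = escape(plain_text)
    body = SPEECH.sub(
        r"<sp><speaker>\g<speaker></speaker><p>\g<words></p></sp>", escaped
    )
    return f"<body>\n{body}\n</body>"

if __name__ == "__main__":
    print(autotag(PLAIN))
```

Escaping before tagging matters: doing it afterwards would also escape the angle brackets of the tags that were just inserted.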
Lunch (provided) (12:00–1:00)
Goals: Getting data out
Topics: Using XPath and XSLT
Time | Topic | Notes |
---|---|---|
1:00–1:30 (30) | XPath paths and axes (lecture and hands-on) | Prep: David |
1:30–2:15 (45) | XPath predicates and functions (lecture and hands-on) | Prep: David |
2:15–2:30 (15) | Coffee break | |
2:30–4:00 (90) | XSLT (lecture and hands-on) | Output should be plain text that can be used as input on Day 2, including result-document; drive-by TEI-to-HTML. Prep: Jeff |
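
The goal of this session, getting plain text out of XML by way of path expressions and predicates, can be illustrated outside XSLT as well. The sketch below uses the limited XPath subset supported by Python's standard `xml.etree.ElementTree` module. The `SAMPLE` fragment, the `who` attribute values, and the function name `speeches_as_text` are assumptions for illustration; real TEI files also declare a namespace, which is omitted here to keep the paths short.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-like fragment (no namespace, invented attribute values).
SAMPLE = """<text>
  <sp who="Hamlet"><speaker>Ham.</speaker><l>To be, or not to be,</l><l>that is the question.</l></sp>
  <sp who="Ophelia"><speaker>Oph.</speaker><l>Good my lord,</l></sp>
  <sp who="Hamlet"><speaker>Ham.</speaker><l>I humbly thank you.</l></sp>
</text>"""

def speeches_as_text(xml_string, who=None):
    """Return one tab-separated 'speaker<TAB>words' row per speech."""
    root = ET.fromstring(xml_string)
    # Without a speaker, select every <sp>; otherwise add an XPath predicate.
    path = ".//sp" if who is None else f".//sp[@who='{who}']"
    rows = []
    for sp in root.findall(path):
        words = " ".join((line.text or "").strip() for line in sp.findall("l"))
        rows.append(f"{sp.get('who')}\t{words}")
    return rows

if __name__ == "__main__":
    for row in speeches_as_text(SAMPLE):
        print(row)
```

One row per speech with a speaker label in the first column is a convenient shape for counting and statistical work later on, although the format actually used in the workshop may differ.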
Goals: Ensure that participants’ computers are configured properly before the beginning of the workshop
Topics: Installing software
Which users | Which tools | Notes |
---|---|---|
All users | R, Mallet, Saxon | Patrick (and others) |
Goals: Is this statistically unusual or interesting? What is it about?
Topics: Using XSLT, Perl, and R; calculating TFxIDF
Time | Topic | Notes |
---|---|---|
9:00–9:55 (55) | Introduction to Perl; tokenizing, stemming and parsing, counting features using Perl hashes; Bayesian classification (lecture and hands-on) | Do men and women speak differently in Hamlet? Count stuff in Arienne’s dataset. Prep: Jeff |
9:55–10:10 (15) | Coffee break | |
10:10–11:00 (50) | Variation, standard deviation and z-scores in R (lecture and hands-on; R for digital humanities [web site]) | Are Hamlet’s speeches significantly longer than anyone else’s? Correlation and difference in Arienne’s dataset (test significance based on counts of etymology) Prep: Jeff |
11:00–12:00 (60) | Quantifying what a text is about; keyword extraction (lecture); calculating TFxIDF as a keyword metric (hands-on) | What words characterize each speaker’s vocabulary in Hamlet? Prep: Jeff |
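
As a rough illustration of TFxIDF as a keyword metric, the sketch below treats each speaker's words as one "document" and scores every term by its relative frequency for that speaker times the natural log of (number of speakers / number of speakers who use the term). This is only one of several common TFxIDF variants, and it is written in Python rather than the Perl and R used in the sessions. The toy token lists and the names `tf_idf` and `top_terms` are assumptions for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs maps a speaker name to a list of tokens.

    Returns {speaker: {term: score}} where
    score = (count / total tokens) * ln(number of docs / docs containing term).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document
    scores = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores

def top_terms(scores, name, n=3):
    """Highest-scoring terms for one speaker; ties broken alphabetically."""
    ranked = sorted(scores[name].items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]

if __name__ == "__main__":
    # Toy data, not real speech contents.
    toy = {
        "Hamlet": "to be or not to be".split(),
        "Ophelia": "my lord to you".split(),
    }
    print(top_terms(tf_idf(toy), "Hamlet"))
```

A term used by every speaker gets a score of zero under this variant, which is why function words such as "to" drop out and speaker-specific vocabulary rises to the top.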
Lunch (provided) (12:00–1:00)
Goals: What is this text about? How can I make the information accessible?
Topics: Using Mallet; creating SVG and other visualizations
Time | Topic | Notes |
---|---|---|
1:00–2:00 (60) | Mallet (lecture and hands-on) | What do the topic models look like for Hamlet? Prep: Jeff |
2:00–2:30 (30) | Why SVG: scalability, integration with HTML and JavaScript. SVG basics: lines, circles, rectangles, text; the coordinate space and transformations | Prep: David |
2:30–2:45 (15) | Coffee break | |
2:45–4:30 (105) | XSLT transformation to SVG (hands-on) | Bar chart of speech lengths by character with z-score thresholds Scatter plot of average sentence length (y) over text chunks (x)? Visualize some of Arienne’s materials? Alternative visualizations Prep: David |
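
The first visualization listed above, a bar chart of speech lengths with a z-score threshold, might look something like the sketch below. It is generated with Python string formatting instead of XSLT, purely to show the SVG coordinate arithmetic. The word counts in `LENGTHS` are invented, not real figures from Hamlet, and the threshold of 1.5, the colors, the dimensions, and the function names are all assumptions for the example.

```python
from statistics import mean, pstdev
from xml.sax.saxutils import escape

# Hypothetical word counts per character, NOT real figures from Hamlet.
LENGTHS = {"Hamlet": 100, "Claudius": 40, "Polonius": 35, "Ophelia": 25, "Horatio": 30}

def z_scores(values):
    """Distance of each value from the mean, in population standard deviations."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]

def bar_chart_svg(lengths, threshold=1.5, bar_width=50, gap=10, scale=2):
    names = list(lengths)
    values = [lengths[name] for name in names]
    mu, sigma = mean(values), pstdev(values)
    cutoff = mu + threshold * sigma              # value at the z-score threshold
    top = max(max(values), cutoff) * scale       # tallest thing to be drawn
    width = gap + len(names) * (bar_width + gap)
    height = top + 30                            # 10 above, 20 below for labels
    baseline = top + 10                          # SVG y grows downward
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">']
    for i, (name, value, z) in enumerate(zip(names, values, z_scores(values))):
        x = gap + i * (bar_width + gap)
        bar_height = value * scale
        fill = "firebrick" if z >= threshold else "steelblue"
        parts.append(f'<rect x="{x}" y="{baseline - bar_height}" '
                     f'width="{bar_width}" height="{bar_height}" fill="{fill}"/>')
        parts.append(f'<text x="{x + bar_width / 2}" y="{baseline + 15}" '
                     f'text-anchor="middle" font-size="10">{escape(name)}</text>')
    line_y = baseline - cutoff * scale
    parts.append(f'<line x1="0" y1="{line_y}" x2="{width}" y2="{line_y}" '
                 f'stroke="black" stroke-dasharray="4"/>')
    parts.append("</svg>")
    return "\n".join(parts)

if __name__ == "__main__":
    print(bar_chart_svg(LENGTHS))
```

Because the SVG y axis points down, each bar's `y` is the baseline minus its height. With only a handful of bars, population z-scores cannot get very large, so a cutoff of 2 would flag nothing in this toy data.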