Looking at scale in humanities research

I think the idea of the humanities moving from a document-centric ideology to a data-centric ideology is an interesting one. Perhaps what the author of the text is getting at is that examining ideas in microcosm can be just as reductive a form of investigation as looking only at a large data set. Traditional anthropological fieldwork, for instance, which relies on firsthand interviews and case studies, is a document-centric form of research that can only go so far in understanding the practices of a particular culture. When data collection and statistical analysis come into play in a field that has previously neglected quantitative ways of knowing, the scale and scope of the investigation broaden too. I got excited about the assertion the author makes that “if print culture foregrounds answers and pushes questions into the background, then perhaps data culture may do the opposite: it privileges queries and treats answers as if they are ephemeral.” At first I was resistant to this idea, but I think there is some truth to it. I am all for privileging queries; I think the ways in which we acquire knowledge should be messy and creative, if that makes any sense. There is a lot of pressure on someone conducting research, in creating something like an ethnography (to use the example of anthropology again), to draw immediate conclusions or produce conclusive findings from a small set of data. The same could be said of the minutiae of literary analysis: looking at the work of one author, or one poem in its historical context. This seems to be how systems of knowledge and education in the humanities are set up: a kind of defensive strategy based on specificity and rhetorical prowess. In response to some of the questions put forward at the end of Emily’s post, I think that placing data culture at the forefront of research can broaden the scope of a researcher’s investigation and therefore allow new pathways of knowledge to open up.
This is what might allow fields of study in the humanities to receive more time and attention (which I think they deserve).

A Storm of Carbon and Attributes: Similarities Between Docs and Databases

Databases are extremely exciting for my future work in Digital Humanities. This is the infrastructure for the type of inquiry that I would like to do—overlapping discursive lexicons. It offers opportunities to track the minutiae that sometimes get overlooked in an interdisciplinary project. The major difficulty at this stage of inquiry, however, is finesse. I know that for my projects the appearance of an entity (word) is not determined by the text. The rhetorical play in scientific non-fiction during the sixteenth and seventeenth centuries is much more like literary play than a contemporary reader might expect. This at least seems to present a problem in classifying words with regard to genre. A sphere is a metaphor in a poem and a mathematical construct in a scientific monograph, for example. The overlapping rhetorical strategies and cultural assumptions, as well as the overlapping lexicon, may mean that the questions that “databased” (I had to) inquiry foregrounds are “how do I get from here to there?” and “how do I find ‘x’ and ‘y’ data while accounting for complexities ‘a’ and ‘b’?”
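To make that classification problem concrete, here is a speculative sketch using Python’s built-in sqlite3 module. The table, columns, and rows are all my own invention, not anything from the slides: the same word carries a different sense depending on genre, so the query has to carry that complexity explicitly.

```python
# Hypothetical table of word occurrences; schema and data are invented
# for illustration only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE occurrences (word TEXT, genre TEXT, sense TEXT)")
conn.executemany("INSERT INTO occurrences VALUES (?, ?, ?)", [
    ("sphere", "poem", "metaphor"),
    ("sphere", "scientific monograph", "mathematical construct"),
])

# "How do I find 'x' accounting for complexity 'a'?" -- here the genre
# column is the complexity the query must account for.
rows = conn.execute(
    "SELECT genre, sense FROM occurrences WHERE word = 'sphere' ORDER BY genre"
).fetchall()
print(rows)
```

The query returns both senses side by side, which is exactly the kind of overlap a single-genre reading would flatten.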

The slideshows we read for today pointed to a difference between traditional document-centric inquiry and databased inquiry. One of the authors claimed that one foregrounds questions while the other pushes them into the background. I do wonder what happens when we explore the similarities between documents and databases. It seems that all words (as entities) have several attributes—some contextual and some grammatical. The pronoun “she” has the grammatical attribute “pronoun” as well as a contextual attribute of referring to an agent that is itself a collection of data entities, each with a string of attributes. It is difficult to remember at times that Lady Macbeth and Hamlet are mere words on a page—entities with attributes.

Hamlet is a man who wears black, not green.

It is tempting to view “Hamlet” as a single entity—but the concept of “Hamlet” is more of a table that includes “black” but excludes “green,” which in turn excludes “pronoun.” Character, then, seems to be a series of inclusions and exclusions, one layered on another. The data that makes up “Hamlet” is defined by some sort of linguistic logic made of “is” and “is not” (I am going to stop this train of thought because I am colliding with Derrida’s trace).
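As a thought experiment—entirely my own construction, not anything from the readings—the inclusion/exclusion idea can be sketched in a few lines of Python:

```python
# "Hamlet" modeled as a table of inclusions and exclusions layered on
# one another; the attribute sets are invented for illustration.
hamlet = {
    "entity": "Hamlet",
    "includes": {"man", "black", "prince of Denmark"},
    "excludes": {"green", "pronoun"},
}

def is_consistent(entity):
    """Character as is/is-not: nothing may be both included and excluded."""
    return not (entity["includes"] & entity["excludes"])

def has(entity, attribute):
    """An entity 'has' an attribute only by inclusion."""
    return attribute in entity["includes"]

print(is_consistent(hamlet), has(hamlet, "black"), has(hamlet, "green"))
```

The consistency check is the interesting part: a character stays coherent only so long as its inclusions and exclusions never overlap.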

Defining both textual information and data in this way suggests to me that document data and database data do behave in similar ways. One of the chief differences is actually in labor. Narrative is a dominant mode perhaps because of the labor required to extract its data, and labor requires time. Extracting data from a database (depending on the complexity of the query and the power of the machine) is the labor of waiting while the machine crunches the data (I am aware that I am, for the moment, excluding database construction (authorship?) and query formation). The experiences of reading and waiting are different—especially if you are productive while waiting, perhaps using that time to do some light reading—and our experience of time is altered as a result. I am unable here to address the similarities between database design and authorship, but I think there are connections there as well. My conclusions after thinking and writing about databases are fuzzy at best, but it still feels enlightening and exciting. My understanding of the nature of databases reaffirms my approach to texts, but perhaps it’s partly how I am sorting the data.

Whatever we understand, we understand according to our own nature.

Databases: Doc or Not?

After reading the two slide shows, I think I’m more confused about databases than when I began. I had always thought of a database the way it’s described in the second set of slides: sort of like spreadsheets where one doesn’t see the whole sheet at once and only pulls out the records needed for a particular task (the difference between databases and spreadsheets being that databases can interact with one another). In the first slide show, however, Quamen seems to view databases as pure data, not as documents, since he says both “documents and databases” can co-exist. Is a database sort of like Plato’s ideal solids, in that it doesn’t physically exist anywhere as a document? Is the spreadsheet comparison just a way for us humans to give a structure to something that is really only bits of data scattered over a server?

In the second slide show, the description of a database as “a high-quality representation of the real world” muddied the waters further for me. How can a collection of data represent the real world at “high quality”? To use the example database, a table of information listing species of birds could, I suppose, literally represent real birds, but I don’t see this representation as being high quality, or really anything above rudimentary. The table doesn’t even stand in for actual individual birds, just species. Likewise, the table of data about club members could be said to represent them, but a person’s name and phone number are such a tiny fraction of who s/he is that I take issue with calling it either a high-quality or a real-world representation. I suppose I’m just arguing semantics, as I am when I question whether a database is a document or not. And I guess the answer doesn’t really matter as long as I understand how databases work, which I do.
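For what it’s worth, the thinness of that representation is easy to see by mocking up the two example tables with Python’s sqlite3. The column names and rows are my guesses at the slides’ examples, not their actual schema:

```python
# Mock-ups of the example tables: one row per species (not per bird),
# one row per member (a name and a phone number, nothing more).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE birds (species TEXT, family TEXT)")
conn.executemany("INSERT INTO birds VALUES (?, ?)", [
    ("American Robin", "Turdidae"),
    ("Blue Jay", "Corvidae"),
])
conn.execute("CREATE TABLE members (name TEXT, phone TEXT)")
conn.execute("INSERT INTO members VALUES ('Jane Doe', '555-0100')")

# The database's "representation" of real birds is just these thin rows:
rows = conn.execute("SELECT species, family FROM birds ORDER BY species").fetchall()
print(rows)
```

Each row stands in for a whole species, which is exactly the rudimentary-rather-than-high-quality representation the post describes.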

Randomly:

Data culture privileges queries and treats answers as if they are ephemeral.

This quote reminded me of the part of The Hitchhiker’s Guide to the Galaxy where, after millions of years of computation, the supercomputer Deep Thought calculated the answer to the question of the meaning of life: 42. It wasn’t until after Deep Thought announced the answer that anyone realized they’d forgotten to ask him what the question was.

Maybe the question is “Are databases documents or not?”

Databases in DH

I echo Matt Smith’s sentiments about the author assessing value between two mediums, that is, print and digital. I don’t believe that digital is the only medium that foregrounds questions. Many research articles begin with questions, provide some answers, and more often than not end with more questions for the future. However, the author of “Databases: Intro to Relational Databases & Structured Query Language” may intend that digital mediums are more question-agnostic than print media. When using a database, answers are perceived as less concrete since, in data cultures or digital mediums, they rely purely on the query. Print media require disaggregating data from a research question and reconstructing it when trying to use it for another purpose. Databases leave more room for manipulation of data, yielding more uses from a data set. The data can be contextualized for more than one purpose because the queries themselves can be changed. So data cultures are transforming the humanities: researchers no longer reverse engineer documents to find the questions; instead, the questions guide the data that is being collected in the digital medium.
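The point about one data set serving multiple purposes can be sketched with Python’s sqlite3 (the table, columns, and rows are invented for illustration): the same records answer two unrelated questions just by changing the query, with no disaggregation and reconstruction in between.

```python
# One invented table of early-modern letters; two purposes, two queries.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE letters (author TEXT, year INTEGER, topic TEXT)")
conn.executemany("INSERT INTO letters VALUES (?, ?, ?)", [
    ("Donne", 1610, "astronomy"),
    ("Donne", 1612, "theology"),
    ("Bacon", 1610, "astronomy"),
])

# Purpose 1: everything one author wrote, in order.
by_author = conn.execute(
    "SELECT year, topic FROM letters WHERE author = 'Donne' ORDER BY year"
).fetchall()

# Purpose 2: everyone writing on one topic in one year -- the same data,
# recontextualized by the query rather than rebuilt from a document.
by_topic = conn.execute(
    "SELECT author FROM letters WHERE topic = 'astronomy' AND year = 1610 ORDER BY author"
).fetchall()
print(by_author, by_topic)
```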

Image vs. Print Culture

One of the fundamental issues concerning the disparity between print and image culture is the notion that “print culture foregrounds answers and pushes questions into the background.” I tend to balk when one thing is contrasted—at its very essence—with another, when lines are drawn and values are assessed. Personally, I found that books foregrounded information and pushed questions into the background until I began reading critically. So is this a problem of the medium the information is delivered in, or a problem of the consumer of that information? There are those who see images—a graph or a politically charged image—and immediately assume they are indisputable information.

Information, whether image or data, will always be interpreted differently, but I think the important thing here is that the information is “as question-agnostic as they can be.” Print and image culture are capable of asking and answering the same questions; they merely perform the task in different ways. The idea that the scholar is able to “see emergent patterns in the chaos of data” is integral to expanding new projects and facilitating current ones, but ultimately those emergent patterns will be read, analyzed, and processed by a skilled critical eye. The tools and languages we have learned about seem to facilitate and organize, but the print culture and the print critic are still necessary in order to make use of those tools and languages.

I found the brief introduction to how databases actually work to be quite interesting and quite useful for how I go about my own research. Of the many useful tidbits I’ve picked up, I find that digital humanists have an incredibly scientific approach to building a set of queries. I often forget the traditional inductive approach in favor of allowing my interest in a certain subject to govern my research. Not only do they build hypotheses scientifically, but they also collect and organize their data in a necessarily logical pattern. I see the appeal of performing traditionally subjective research in a quantitative, scientific way. I grew up in the age that saw Dead Poets Society—jumping on chairs, reciting poetry in one’s tighty-whities*, and chanting carpe diem—as the definition of English, so it is a nice breath of calculated air to entertain the prospect of rigor in research and eidetic certainty in our results.

*Don’t act like you didn’t have a pair.

Privileging Questions in Databases and Digital Humanities

“If print culture foregrounds answers and pushes questions into the background, then perhaps data culture may do the opposite: it privileges queries and treats answers as if they are ephemeral.” In the slides we read for today, this was marked as an “interesting idea,” and I think it is interesting, not only for discussions about databases versus documents, but for the humanities as a whole. I ran into this idea recently in a different context while reading for the Shakespeare Performance class that many of us are in. We were assigned a 2008 essay called “Adaptation Studies at a Crossroads” by Thomas Leitch, which reviewed the most recent scholarly work focusing on the adaptation and appropriation of literary texts. One of Leitch’s criticisms of undergraduate textbooks on the subject is that “they are limited not because they give incorrect answers to the questions they pose, but because those questions themselves are so limited in their general implications” (68). Leitch further asserts that sometimes “the question is more valuable than any answer” (75) and endorses textbooks that raise productive questions about adaptation studies even, or especially, when those questions are unanswerable.

It’s interesting that Leitch would raise some of the same issues as the DHSI slides, as he is largely dealing with the same kind of shift from print (books) to digital (movies). I think the privileging of questions rather than answers that Leitch and the DHSI lecture bring up is a productive approach to the humanities, and one that is sometimes itself underprivileged. Asking the right kinds of questions is always more important and more productive than definitively answering the wrong ones. And our answers to these questions often change over time anyway. In light of these observations, I’m not going to attempt a conclusion to this post; instead, here are some questions that might prove productive:  How can or will databases change our modes of research? What are the relative advantages and disadvantages of these changes? How can we use quantitative data provided by databases within the framework of disciplines that focus on the “human,” or at least on things that we believe to be largely unquantifiable?

A Database for “One Idea”

I found the PowerPoint on Intro to Relational Databases and SQL fairly straightforward and informative. However, a large portion of “Why Databases?” was unclear without the author’s explanation. In an attempt to expand on the argument for databases in Digital Humanities, I’d like to propose a possible interpretation of the section so nicely decorated with Simpsons characters. As an introduction to the “One Idea” model, I wonder if the photos of the Table of Contents and Index represent inflexible data sets, an old model for cross-referencing that is static and limited compared to databases. The “One Idea” seems to be something that could be stored as a database or as a document, perhaps a research topic about which a scholar wants to gather and curate essays. The first contributor adds four essays to this document or database. The second contributor adds three essays, which become part of the curated information on the topic. The pattern continues until the “One Idea” includes 18 essays. If the essays and the data about them, including subject tags, are stored in a database rather than only in document files, it becomes possible through querying to pick out which essays meet certain additional criteria—in this case, perhaps those that reference Shakespeare’s works. Using the database, this list of essays (generated by querying the subgroup Shakespeare) can also be sorted using other data attached to each essay, perhaps in this case the year the essay was written. This allows for much more flexibility and more useful searching in a digital archive than links to documents alone provide. I’d be interested to hear other theories about this section, if anyone else was curious about its contribution to the overall argument.
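Here is a minimal sqlite3 sketch of this reading of the “One Idea” section. The essay titles, tags, and years are invented, since the slides don’t specify a schema: we query the Shakespeare subgroup, then sort by year.

```python
# Invented "One Idea" essay collection: title, subject tag, and year.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE essays (title TEXT, subject_tag TEXT, year INTEGER)")
conn.executemany("INSERT INTO essays VALUES (?, ?, ?)", [
    ("Adapting the Tempest", "Shakespeare", 2008),
    ("Hamlet on Film", "Shakespeare", 1999),
    ("Balloons over London", "WWII", 2012),
])

# Pick out the essays that reference Shakespeare, then sort that subgroup
# by year -- two criteria a static table of contents cannot recombine.
shakespeare = conn.execute(
    "SELECT title, year FROM essays WHERE subject_tag = ? ORDER BY year",
    ("Shakespeare",),
).fetchall()
print(shakespeare)
```

Swapping `ORDER BY year` for any other column re-sorts the same subgroup without touching the underlying documents, which is the flexibility the post describes.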

Boo-Boo the Barrage Balloon Meets TEI

I guess I’m officially a nerd, because TEI is exciting. Until this reading, I had never heard of SGML, much less that HTML originated from it. Since TEI is also a form of SGML, it and HTML must be siblings, or at least cousins (although HTML sounds like the black sheep of the SGML family). Fun! I also appreciate the explanation of XML, as I wasn’t really sure what it was. My only complaints about the reading were all the broken links (which made it hard to understand some of the instructions, since they referred to documents that aren’t there anymore), the typos, and the very 1990s frame formatting. I’m guessing “A Very Gentle Introduction” is also very old in computer years.

I was interested in the idea that “preservation is a key problem for an emerging digital culture,” something I hadn’t really considered before this class. Our discussions on bitrot and other issues of digital deterioration have helped make me more aware of the problem, but I’m still a bit stuck in the mindset of “going digital means preserving.” Part of my impetus for my barrage balloon DH project is to preserve original photos and other balloon-related memorabilia in digital scans and to disseminate them online. Sharing the photos I collect is still best accomplished digitally, but could the actual photographs be better preserved than the digital images I make of them, despite the threat of fire, acid, vermin, etc.? To bring my questions more in line with textual documents, what about my very fragile copy of the World War II children’s book Boo-Boo the Barrage Balloon? What can TEI do for Boo-Boo and his compatriots Blossom and Bulgy?

After reading about TEI’s encoding options for various text elements, I can guess that TEI would let me encode the text of Boo-Boo the Barrage Balloon with indicators of quotations and formatting, milestone events (like when Boo-Boo saves London from the Nazis), a bibliography, and a header about the book itself. (Incidentally, I appreciate Mueller’s inclusion of hyperlinks in his TeiXBaby language to update TEI for the Web.) I could then use CSS to format the text of Boo-Boo to make it approximate the text in the book – though without the charming illustrations. Once I get the hang of TEI, I might take a stab at encoding Boo-Boo. I doubt he will be of much interest to scholars, but it will give me some practice!
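As a rough sketch of where I might start, here is the skeleton of such an encoding built with Python’s xml.etree. The element names follow TEI conventions like `<teiHeader>` and `<titleStmt>`, but a real TEI file needs the full header and namespace, and the sample sentence is invented, so treat this as illustration only:

```python
# Skeleton of a TEI-style encoding of Boo-Boo; a real TEI document would
# carry the TEI namespace and a fuller header than this sketch.
import xml.etree.ElementTree as ET

tei = ET.Element("TEI")

# Header: metadata about the book itself.
header = ET.SubElement(tei, "teiHeader")
fileDesc = ET.SubElement(header, "fileDesc")
titleStmt = ET.SubElement(fileDesc, "titleStmt")
title = ET.SubElement(titleStmt, "title")
title.text = "Boo-Boo the Barrage Balloon"

# Body: the encoded text, ready to be styled separately with CSS.
text = ET.SubElement(tei, "text")
body = ET.SubElement(text, "body")
p = ET.SubElement(body, "p")
p.text = "Boo-Boo floated bravely over London."  # invented sample line

print(ET.tostring(tei, encoding="unicode"))
```

The point of the separation shows up at the end: the serialized markup records structure only, and any formatting to approximate the printed book would live in a stylesheet.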

Speaking of encoding, the reading’s instructions for encoding TEI were a bit confusing to me, although familiarity with HTML helps (especially with containers like <head>, <div>, and <p>, which are the same in HTML). I expect it will make more sense once I actually start encoding, and I’ll learn the language as I go. I can’t wait to get started!

Validator!

Text encoding is a language. Through a post-colonial lens, it is of interest that this language is construed as “universal” despite being designed specifically around English alphanumeric characters. After understanding the basic references, one is then able to implement and categorize various languages—French, for example, was shown in the Mueller introduction. Reading this introduction was similar to reading an instruction manual, although I found much of the information helpful in laying the foundation for what TEI is and is capable of creating.

One specific point of interest in relation to this topic is the idea of digital decomposition. The introduction itself seems to have fallen victim to a lack of upkeep, as some of the hyperlinks contained within the document lead readers to error messages, such as the following:
[Screenshot (2015-02-16): error message from a broken hyperlink]

Despite this, I found the article to be insightful and a good reference for possible future endeavors within the TEI arena. Some of the other links, such as the First World War Poetry Digital Archive, did function, and provided necessary examples of the types of output that can be created by following the various steps listed. Digital upkeep is important and necessary when the only other available form of preserving literary texts is the costly and burdensome literal preservation of books as tangible objects.

The other somewhat revelatory tidbit of information that I gleaned from the text had to do with the somewhat poetic form of TEI. One is able to create elegant code. That is, code that performs its function while simultaneously appearing clean and extremely concise. This type of code is only functional once it successfully passes through the “Validator,” which is ripe for poetic/philosophical inquiry in its own right.

The second page we were led to on the syllabus was also very helpful and, in my opinion, much more beneficial for would-be TEI encoders. Here we have a straightforward tutorial, complete with examples, tests, exercises, et cetera, that gets those who are interested actually doing the task rather than simply talking about it. With any language, immersion is the most successful strategy for retention and practice. This is a great way of learning TEI, at least at a basic level.

TEI

I am doing this blog entry differently than normal — I am writing my questions as I go along and I will strike them if they are answered after further reading. This discusses the origins of TEI.

1. Ok, so SGML is a markup meta-language: a set of rules for making a markup language. An SGML-conformant language is composed of containers and rules about those containers and their “content models.” A “document type definition” (DTD) follows the rules specified by SGML.

2. Why would I put a “declarative markup” with a “style sheet” instead of just making the words styled/formatted correctly in the first place? The author says separating declarative markup from style is counterintuitive because the marriage of style and structure is deeply ingrained in our beliefs.

3. The author says style and structure are disjointed in markup, but then he says:

“The strength and weakness of SGML derive from the same fact: you need a document type definition, which means that you have to think ahead. Writing in SGML or any of its variants involves a willingness to shoulder upfront investments for the sake of downstream benefits.”

So are style and structure really disjointed? We still marry style and structure because we’re thinking ahead about the DTD and consequently design the SGML accordingly.

XML is the answer to the problems of SGML and HTML. “You can use XML without thinking ahead and make up your elements en route as long as they nest within each other. This is called writing ‘well-formed’ rather than ‘valid’ XML. Purists discourage this but people will do it anyhow.” So I was essentially correct in my question above: they created a spin-off language known as XML to let us marry style and structure without losing the computing power of SGML.
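A quick way to see the difference: Python’s xml.etree will happily parse made-up elements so long as they nest (well-formed), but nothing checks them against a DTD (valid). The tags below are invented.

```python
# Well-formed vs. not: invented tags are fine as long as they nest.
import xml.etree.ElementTree as ET

well_formed = "<poem><line>Full fathom five</line></poem>"      # nests properly
not_well_formed = "<poem><line>Full fathom five</poem></line>"  # tags overlap

parsed = ET.fromstring(well_formed)  # parses fine, though no DTD "validates" it

try:
    ET.fromstring(not_well_formed)
    rejected = False
except ET.ParseError:
    rejected = True  # overlapping tags break well-formedness
print(parsed.tag, rejected)
```

Note that the parser never asks whether `<poem>` or `<line>` are legal elements; that question only arises when validating against a DTD or schema.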

4. Seriously, why not HTML if it is universal? “HTML always lives in sin because it constantly violates the cardinal rule of separating information from the mode of its display.”

5. But why is it important to separate information from the mode of its display? Why not call it humanities computing if we want to separate the two? “If you want to use the Internet to move stuff in and out of databases, it becomes very useful to have a markup language with clearly defined containers and content models. That is the impetus behind XML, the ‘Extensible Markup Language,’ which will supersede HTML wherever complex and precise information is at a premium.”