DARTH Position Available: Media Intern

Arts and Humanities Research Computing (DARTH) is hiring a media intern to assist in interviewing faculty who have received funding from the Barajas Dean’s Innovation Fund to pursue research in the digital arts and humanities. The media intern will also write blog posts on these research projects, based on the interviews, for web publication. Depending on the media intern’s skills and interests, these posts could also incorporate or take the form of short videos or audio recordings. Please see http://darthcrimson.org/barajas-deans-innovation-fund/ for example posts on past projects.

Projects from the 2017-2018 round of funding include:

  • EVQ (Electronic Versions of the Quran) – Shady Nasser (NELC)
  • Exploring Mary Magdalene Online – Racha Kirakosian (German)
  • Interoperable Digital Audio Annotation Tool – Richard Wolf (Music)
  • Mind the Gap: Diasporic Archives in Italy and the US – Lorgia García Peña (RLL)
  • Augmented Reality and the “Dream Stela” – Peter Der Manuelian (NELC)

Basic Qualifications:

  • Current Harvard College student
  • Interest in arts and humanities research, especially in relation to the digital humanities and emerging computational methods

Key Duties:

  • Prepare questions for faculty interviews
  • Schedule and conduct faculty interviews
  • Create blog posts and/or produce short videos or audio recordings of interviews
  • Publish blog posts to the DARTH website

Required skills and competencies:

  • Ability to plan and execute successful interviews
  • Strong organizational and interpersonal communication skills, including the ability to converse with faculty about their research
  • Demonstrated writing ability, preferably with experience writing for an online audience

Additional Qualifications:

  • Photography / videography skills and basic knowledge of A/V production strongly desired
  • Basic knowledge of WordPress is preferable
  • Ability to attend and cover additional DARTH events is desired

Submit applications and questions to Rashmi Singhal, Interim Director of Arts and Humanities Research Computing, at rashmi_singhal@harvard.edu. Please include a CV or resume, a short cover letter, a writing sample, and, optionally, a link to your media portfolio.

Pay Range: $12-17/hour, depending on experience.

Research Databases and the Future of Digital Humanities Applications

Introduction: Next Generation Research Computing

Research computing within the arts and humanities has evolved in tandem with rapidly advancing digital methodologies, increasingly nuanced datasets, and ever more robust web programming environments. More than ever, scholars are engaging with shared, scalable research ecosystems that often include content annotation, text and network analysis, data visualization, and crowdsourcing functionality.

Despite these major shifts within the digital research landscape, commonly adopted databases for content storage and retrieval do not always prioritize the needs of an increasingly sophisticated user base, nor have they been optimized for the immediacy and scalability that modern research applications demand.

The following document surveys modern databases and theories of data modeling in order to compare and contrast differing approaches to database-driven digital research. All examples within this overview represent a selection of projects designed by Harvard University faculty in collaboration with Arts & Humanities Research Computing. Each makes use of varying database technologies, both common and cutting-edge, and was designed with long-term, flexible, and sustainable research applications in mind.

Relational Databases as a Point of Departure

Relational databases are the most commonly used database technology. They were originally developed to keep track of large numbers of interrelated people, objects, and processes, such as patients in a hospital, students at a university, or books that Amazon sells. The software has become so standard and cost-effective that relational databases are now the default technology, even when they may not be the best tool for the job.

There are proprietary variations among relational database applications, but generally speaking, all share the same fundamental data model: organizing information in one or more tables of rows and columns. Tables contain people, objects, events, etc. of a single type (patients, students, books). Rows describe one instance of that type (a single patient, a particular student, a specific book). Columns contain fields, the attributes that describe each item in a table (patient name, student campus address, book title). Typically, one field serves as a unique identifier that labels a specific row in the table (patient ID, student ID, ISBN).

The most familiar example of this row-and-column structure is the Excel spreadsheet and its related forms (comma- or tab-separated values). Relational databases have been in use for decades, and they remain a dependable choice for data storage and retrieval: many are open-source and have large communities of practice. MySQL is perhaps the most popular open-source database for digital research projects, blogs, and other web applications.

Good & Bad Relations: Design Thinking for an Opera Social Network

In order to illustrate relational database fundamentals, the following example will walk through the process of creating a simple social network of operas and their performances around the world. This demonstration draws on data from Operabase, an online archive containing information about performances, artists, opera companies, and more.


Fidelio, the only opera written by Ludwig van Beethoven, tells the story of Leonore, who disguises herself as a prison guard to free her husband Florestan from certain death within a political prison. It narrates a heroic story about liberty and justice, and has multiple performances playing around the world in 2017.

Beethoven’s opera is one of many within the Operabase archives. In a relational database model, these entries would exist within a table called operas. Each row would represent one opera and each column would represent a particular type of data about the opera such as NAME, ID, COMPOSER, FIRST_PERFORMANCE, etc. Operas are only one type of table. Other tables may include artists, opera companies, theaters, and more. Here is an example of what the operas table may look like:

[/vc_column_text][vc_single_image media=”49588″ caption=”yes” media_width_percent=”100″ alignment=”center”][vc_column_text]

In order to connect performance dates, locations, and other information with each opera, the user must create new tables to represent these additional entities. The first table will be called theaters, and the second table, called performances, will combine information from operas and theaters into a single item: a performance of a particular opera at a specific opera house. First, here is an example of the theaters table:

[Image: the theaters table]

Each theater has a unique THEATER_ID, NAME, COUNTRY, CITY, CAPACITY, SEASON, and WEBSITE. According to Operabase, each theater above will host a production of Fidelio during the 2017 season, and each performance will have its own respective beginning and closing dates.

The performances table illustrates the fundamental behavior of the relational database model: creating relationships between tables. This new table will store performance begin/end dates, and will also draw from the operas table and the theaters table to create instances of connected, queryable information. Connecting data in this manner is called joining. Joins can create permanent new tables within a database, or connect data transiently at query time, with the result discarded from memory once the query completes. In this example, the join connects information within one table that will become a permanent part of the underlying database.

[Image: the performances table]

Within this table structure, PERFORMANCE_BEGIN and PERFORMANCE_END connect to a respective opera and theater, referenced by OPERA_ID and THEATER_ID. The unique ID representing each opera in the original operas table, also called a primary key, is used to refer to that opera from within another table. Establishing a primary key allows the relational database to distinguish between unique entries within a table.

When an item from one table is referenced within another table, the primary key in the original table links to the secondary table in the form of a foreign key. In other words, in the operas table, Fidelio has a primary key (ID number) of 1. In the performances table, that same unique ID number instructs the database to pull opera number 1 from the original table; within the new table, however, the primary key of 1 becomes a foreign key, which tells the database that the key originally comes from the operas table, and that is where it can look to find the original values.
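To make these mechanics concrete, here is a minimal sketch of the opera schema in Python using the built-in sqlite3 module. The table and column names follow the example above, while the theater row and the performance dates are invented placeholders:

import sqlite3

# Build a tiny in-memory version of the operas/theaters/performances model.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()

cur.execute('CREATE TABLE operas (opera_id INTEGER PRIMARY KEY, name TEXT, composer TEXT)')
cur.execute('CREATE TABLE theaters (theater_id INTEGER PRIMARY KEY, name TEXT, city TEXT)')
cur.execute('''CREATE TABLE performances (
                   performance_id INTEGER PRIMARY KEY,
                   opera_id INTEGER REFERENCES operas(opera_id),        -- foreign key
                   theater_id INTEGER REFERENCES theaters(theater_id),  -- foreign key
                   performance_begin TEXT,
                   performance_end TEXT)''')

cur.execute("INSERT INTO operas VALUES (1, 'Fidelio', 'Ludwig van Beethoven')")
cur.execute("INSERT INTO theaters VALUES (1, 'Example Opera House', 'Vienna')")
cur.execute("INSERT INTO performances VALUES (1, 1, 1, '2017-09-15', '2017-09-30')")

# The join reassembles a readable performance record from all three tables.
cur.execute('''SELECT o.name, t.name, p.performance_begin, p.performance_end
               FROM performances p
               JOIN operas o ON p.opera_id = o.opera_id
               JOIN theaters t ON p.theater_id = t.theater_id''')
print(cur.fetchall())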

This is a basic example of a growing network of operas, theaters, and performances. For a more robust social network, it would be necessary to create separate tables for performers, composers, opera companies, and more. This would dramatically increase the complexity of the database, as new relationships would have to be mapped to performance instances. Beyond performances alone, the complexity of the data would also increase sharply if relationships between performers, composers, and other artists were described.

Mapping data to a specific structure within a database is called data modeling, and with relational databases the data model is extremely important for the integrity of the database. One of the most important factors in the use of relational databases is the time spent planning before any data can be entered. This planning must account for potential changes or augmentations to the original data in advance, because increasingly complex datasets can be challenging to append to pre-structured tables. Graph Databases: New Opportunities for Connected Data asserts “whereas relational databases were initially designed to codify paper forms and tabular structures—something they do exceedingly well—they struggle when attempting to model the ad hoc, exceptional relationships that crop up in the real world. Ironically, relational databases deal poorly with relationships” (Robinson, Webber & Eifrem, 11). 

For arts and humanities research computing projects, creating meaning amongst data is as important as, if not more important than, storing the data in the first place. Scholars increasingly look to database technologies with the express purpose of distant reading their data, finding patterns, and testing queries. Thus, the fundamental goal is always to create and expose relationships among data, and the manner of approaching this problem differs depending on the type of project in question.

The following use cases will demonstrate how newer types of databases, known collectively as NoSQL, can more elegantly accommodate arts and humanities data. This new generation of technologies breaks the dominance of relational tables and fields to allow for a more intuitive and robust environment to explore and ask questions of data.

Faculty Case Study: The Giza Project

The Giza Project at Harvard University provides access to “the largest collection of information, media, and research materials ever assembled about the Pyramids and related sites on the Giza Plateau.” The project is led by Peter Der Manuelian, Philip J. King Professor of Egyptology and Director of the Harvard Semitic Museum. Rashmi Singhal of Arts & Humanities Research Computing oversees data architecture and technical development for the project’s website Digital Giza.

Digital Giza is a full-fledged research environment consisting of highly structured scholarly big data: photos, bibliographies, dates, dig-site findings, drawings, documents (published and unpublished), diary entries, ancient people (e.g., identified bodies within a tomb), modern people (e.g., archaeologists, photographers, scholars), and a host of other archaeological information. The goal of the integrated research platform, featuring unique pages for each entity within the database and its associated information, is illustrated in the following images of the tomb of Meresankh III:

[Images: Digital Giza pages for the tomb of Meresankh III]

The underlying database that informs Digital Giza is called The Museum System (TMS), a relational database optimized for museums and archival collections. TMS serves the needs of the Giza team, allowing for complex, systematic data entry and manipulation. In addition, the team receives support from Harvard University because TMS is in use elsewhere on campus. There was, however, one major drawback: extremely complex queries against TMS were required in order to build the unique website displays on the Digital Giza web application.

In the above example of the tomb of Meresankh III, there are hundreds of photos, finds, videos, and people associated with the object. A typical query to pull together all of this information involves multiple data joins and easily surpasses one hundred lines of code. In addition, future iterations of the dataset may include new types of objects and relationships not previously included within the data model. This poses a challenge for the efficiency and flexibility of the web environment.

It became clear while working on the project that the cost and time required to switch the Giza team to a completely new workflow and database technology would be prohibitive, so the solution changed: a NoSQL database called Elasticsearch would sit atop TMS, acting as a layer between TMS and the web application.

NoSQL loosens the rigidity of the relational data model by implementing a number of alternative data storage solutions. The term itself, NoSQL, covers a wide variety of database types with differing data models; for Digital Giza, this specifically translated to a document model.

This type of data model creates a unique document for each object within the database. The format is self-contained: each document describes how its own contents should be parsed. This is in direct contrast to relational models, where tables have universal, well-defined structures that correspond to data entities. In a document model, no two documents need to have the exact same schema. Below are two examples of documents within the database (in JSON format).

[Images: two example documents from the Digital Giza database, in JSON format]

In addition, documents are typically encoded using commonly accepted web language standards: XML, JSON, BSON, etc. This makes it easy for developers to incorporate the results of database queries into web applications directly. Document models solve a structural problem created by relational databases; they support a flexible data model that can withstand the demands of a rapidly evolving project with polymorphic data and multiple users logging in from around the world.
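As a rough illustration of how the document model behaves in practice, the sketch below indexes two differently shaped documents with the Elasticsearch Python client and then searches across both. The index name, field names, and values are hypothetical, and client call signatures vary slightly across library versions:

from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a local Elasticsearch instance

# Two documents in the same index need not share a schema: the site record
# carries excavation fields, while the photo record carries media fields.
site = {'type': 'site', 'name': 'Tomb of Meresankh III'}
photo = {'type': 'photo', 'caption': 'Chapel, north wall', 'date': '1927-03-14'}

es.index(index='giza', id=1, body=site)
es.index(index='giza', id=2, body=photo)
es.indices.refresh(index='giza')  # make the new documents searchable immediately

# A single query searches across both document shapes at once.
results = es.search(index='giza', body={'query': {'match': {'name': 'Meresankh'}}})
for hit in results['hits']['hits']:
    print(hit['_source'])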

In this use case, there is a translation of data from one system (TMS) to another (Elasticsearch). Each database solves a different problem: the former serves as a tool for the storage and long-term upkeep of the data using universally accepted museum standards, while the latter efficiently populates a user interface for navigating the archaeological data on a research platform built for the web.

Faculty Case Study: Russian Modules

Russian Modules, led by Steven Clancy, Senior Lecturer in Slavic Languages and Literatures and Director of the Slavic Language Program, is a Russian linguistics application designed to support curriculum building, language learning, and related research on the structure of the Russian language. Christopher Morse of Arts & Humanities Research Computing oversees application development, and the project uses a NoSQL graph database called Neo4j.

In its current iteration, users can type Russian text into a form that will automatically parse the information into a variety of categories. These categories range from part of speech, to word inflection, to clusters of common meaning, also called domains. In addition, the tool provides word difficulty levels based on the Russian language curriculum within the Slavic Department in order to gauge whether a particular text is too challenging (or not challenging enough) for students to undertake.

[Image: the Russian Modules text parser]

Data modeling within a graph database environment is quite different from modeling within a tabular structure. In fact, graph databases do not have a set data modeling style, because they were designed to emphasize the relationships between data more than the data itself. They are typically used to model social networks, families, organizations, epidemiological contagion paths, and other interconnected ideas.

The graph structure is straightforward: data must take the form of a node, relationship, or property. The resulting structure is reminiscent of a mind map: a series of circles connected by lines that represent some kind of relationship. Each circle within a graph database is referred to as a node or vertex, and each line that connects two nodes is referred to as a link or edge.

Here is an example view of the word “book” in the Russian Modules database:

Lemma Search: книга (book)

MATCH (l:Lemma)-[r:INFLECTS_TO]->(f:Form) WHERE l.label = 'книга' RETURN l,r,f;

Within the Russian Modules database, all word forms are nodes. Dictionary forms of a word are labeled lemmas, and grammatical inflections of the lemma are labeled forms (e.g. lemma: книга; form: книги [genitive, ‘of the book’]). Encoded within the relationship are additional properties, such as which part of speech a word inflects to, or what difficulty level the word has. In the graph view of this query, the central word is the lemma, and the connections fanning out from the center each represent a different form of that lemma. The relationship between the lemma and its forms also has a unique name: INFLECTS_TO. This name makes the relationship between the nodes explicit.

Cypher, the query language used to interface with Neo4j, encourages modeling data in plain English. This philosophy echoes the distant yet perennial wisdom of Abelson and Sussman, who wrote in their preface to Structure and Interpretation of Computer Programs that “programs must be written for people to read, and only incidentally for machines to execute.” The Neo4j back end includes built-in graph visualization powered by D3.js, and the Cypher query language allows users to work with their data intuitively. The above example can be translated into plain English without much of a jump:

MATCH (l:Lemma)-[r:INFLECTS_TO]->(f:Form) WHERE l.label = 'книга' RETURN l,r,f;

The query searches for all lemmas that match the label “книга”, and then follows all associated relationships. The relationship is called INFLECTS_TO, but it also has a directional component: (Lemma)-[INFLECTS_TO]->(Form) contains "->", a graphical way of describing the direction of the relationship.
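To work with the graph from a script rather than the Neo4j browser, the same query can be run through the official Neo4j Python driver. The following is a minimal sketch under assumed defaults; the connection URI, credentials, and the label property are placeholders:

from neo4j import GraphDatabase

# Connection URI and credentials are placeholders for a local Neo4j instance.
driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'password'))

query = (
    "MATCH (l:Lemma)-[r:INFLECTS_TO]->(f:Form) "
    "WHERE l.label = 'книга' "
    "RETURN l.label AS lemma, f.label AS form"
)

with driver.session() as session:
    for record in session.run(query):
        # Each record pairs the dictionary form with one of its inflections.
        print(record['lemma'], '->', record['form'])

driver.close()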

This free-form data model allows users to create and remove nodes and connections on the fly with very little code. Traversing the various relationships across the graph is also simple in comparison with highly complex relational databases, where multiple intermediary tables connect the data. Take, for example, the following query, which searches for the word sister, all of its inflected forms, the domain of words it belongs to, and the other words that make up that domain:

MATCH (l:Lemma {label:'сестра'})-[r:INFLECTS_TO]->(f:Form), (l)-[r1:HAS_DOMAIN]->(d)<-[r2:HAS_DOMAIN]-(o)
RETURN l,r,f,r1,r2,o

[Image: graph of сестра, its inflected forms, and the other words in its domain]

Future iterations of the project endeavor to include more robust natural language processing functionality, interactive word visualizations, and the integration of the database into an eBook for Russian language study.


Conclusion: Looking Forward

The technologies discussed herein represent a limited selection of database options, each of which varies in complexity, cost, and community size. At Harvard, successful iterations of digital humanities projects making use of both SQL and NoSQL databases have contributed to local communities of practice that continuously push the envelope to explore and describe modern (and emerging) practices for architecting data.

Looking forward, researchers and scholars should always expect at least a minimal learning curve with any new technology. In the end, however, they should not be completely overwhelmed by their digital tools; otherwise, the technology has failed to serve its intended purpose. That said, new does not always mean better, and sometimes the traditional model or approach is the right one. Finally, sustainability and the projected lifespan of a project are important considerations for any research application built for the web. Be aware of the required upkeep (yearly hosting costs, software upgrades, deprecations, general maintenance, etc.), and use caution with any technology that lacks compatibility with accepted open standards.

Omeka Sugar: Tutorials for Digital Exhibits

Serving up spoonfuls of knowledge to make work with Omeka go smoothly.

Jeremy Guillette, Digital Scholarship Facilitator for the History Department, is the author of Omeka Sugar, a growing series of tutorials designed to familiarize users with Omeka functionality. Current tutorials include creating items, associating items with collections, building exhibits, and making use of the mapping plugin Neatline.

For more information, or to access the tutorials, please visit Omeka Sugar.

Digital Japanese Literature: Aozora Bunko

Introduction

Higuchi Ichiyō, featured at Aozora


Aozora Bunko (青空文庫) is a digital archive of Japanese literature in the public domain. In addition to its web presence, the corpus is also available on GitHub, where it can be downloaded in its entirety. This makes it possible to perform a distant reading of the collection, and the following information serves as a general introduction to data analysis and parsing.

First and foremost, you will need to clone the GitHub repository to your computer. A warning: the download is quite large (~4 GB) and may therefore take some time. To clone the repository locally, open the command line and input the following:

git clone https://github.com/aozorabunko/aozorabunko.git

If you have not yet installed git on your computer you will first need to follow these directions for your respective operating system.

Aozora Bunko Schema

A large majority of Aozora Bunko story entries maintain a shared structure: header information, main text, and bibliographic information. Please see the following example of Higuchi Ichiyō’s short story Tsuki no Yo.

It is possible to isolate the elements that make up each story thanks to the standardized HTML output format of the archive; however, one particular challenge is filtering out all furigana in order to properly mine the text. Furigana is used as a reading aid in Japanese—syllabic characters can be appended to ideographic characters (kanji), especially for kanji that are rare or have a special pronunciation. Within the first few lines of Tsuki no Yo there are several instances of furigana usage.

Behind the scenes this looks a bit messy, and it can be difficult to parse. Furigana entries use ruby markup, a web-standard element that behaves as an annotation of sorts for logographic characters. For more information on the ruby specification, please see the W3C documentation page. This is the markup that appears throughout Tsuki no Yo when it is viewed as HTML.
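The pattern looks roughly like this (an illustrative fragment built around the word 月, tsuki, ‘moon’; not copied from the actual file):

<ruby>月<rp>（</rp><rt>つき</rt><rp>）</rp></ruby>

The <rt> element carries the phonetic reading, and the optional <rp> elements supply fallback parentheses for browsers that cannot render ruby; stripping the <rt> and <rp> tags leaves only the base text 月.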

Parsing Aozora Texts

Molly Des Jardin, Japanese Studies Librarian at the University of Pennsylvania, has written a script in Python that will automatically strip the ruby characters from a text. This script and additional resources for Japanese language analysis can be found on her Japanese Text Analysis library guide. In order to run this script you will need Python installed on your computer, and you will also need to install two dependencies: BeautifulSoup and TinySegmenter. If you are new to Python, make sure to install the pip tool first; thereafter you can install the two libraries from the terminal as follows:

pip install beautifulsoup4

pip install tinysegmenter

import os
import glob
import re
import sys
from bs4 import BeautifulSoup
from tinysegmenter import *

for filename in glob.iglob('*.html'):

    # Remove ruby <rt> and <rp> tags (and their contents) from the text
    with open(filename, 'r') as f:
        input = f.read()
    print filename

    soup = BeautifulSoup(input)
    tagname = 'rt'
    for tag in soup.findAll(tagname):
        tag.extract()

    tagname = 'rp'
    for tag in soup.findAll(tagname):
        tag.extract()

    tagname = 'span'
    for tag in soup.findAll(tagname):
        tag.extract()

    nonruby = unicode(soup)

    # Remove all remaining HTML tags and attributes
    nonruby = re.sub('<[^<]+?>', '', nonruby)

    # Tokenize the text and drop everything from the 底本 (source edition)
    # colophon onward, leaving only the story itself
    segmenter = TinySegmenter()
    tokenized = segmenter.tokenize(nonruby)
    tokenized = tokenized[0:(tokenized.index(u'底本'))-1]
    tokenized = ' '.join(tokenized)

    # Write the result to (filename).txt
    file = open(filename + '.txt', 'w')
    file.write(tokenized.encode('utf-8'))
    file.close()

Running Molly’s script is quite simple: place the script in a folder containing the HTML files of each story you would like to parse. From the terminal, simply execute the following command:

python rubydetokenize.py

Please note that this script was designed for Python 2, but can be converted for Python 3 by making small changes to the code (notably, changing the print statements). The script will iterate through the various files to remove all HTML tags and ruby annotations, and will output them as text files with only the text of the story remaining.

Before (tsuki_no_yo.html):

After (tsuki_no_yo.txt):

In order for this script to work correctly on Tsuki no Yo, it was first necessary to add a few lines of additional code. The reason is that there is a language usage note made within the text itself, represented as an HTML <span> element that the script did not originally scan for:

The following code was added to the original script to target any HTML elements called “span,” thereby removing the language usage note entirely. While working with various stories, you may discover internal inconsistencies that require you to target specific HTML elements that cause the script to either break or parse incorrectly.

tagname = 'span'
for tag in soup.findAll(tagname):
    tag.extract()

You may also notice that the script tokenizes words, meaning it attempts to split the running text into words based on common lexical patterns in Japanese. This work is done by TinySegmenter, one of many parsing tools for East Asian languages. Another useful parsing tool is MeCab, which also works with Python. No parser is 100% accurate (at least not yet), especially for stories within the Aozora database, which may contain antiquated morphological patterns that are no longer in use.
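As a quick illustration of the segmenter at work (in Python 3 here; the sample sentence is arbitrary, and exact token boundaries can vary by version):

from tinysegmenter import TinySegmenter

segmenter = TinySegmenter()
# Japanese is written without spaces between words, so the segmenter
# infers word boundaries statistically rather than from a dictionary.
print(' '.join(segmenter.tokenize('私は本を読みます。')))
# Roughly: 私 は 本 を 読み ます 。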

Header Image: Harvard Art Museum, 1977.202. “The Former Deeds of Bodhisattva Medicine King,” Chapter 23 of the Lotus Sutra (Hokekyô) Calligraphy.

Hacking the News: The NYTimes API

Why just read the news when you can hack it?

The above question posed by the New York Times makes an important point: data is not merely raw material to be finessed into content, but content in its own right. The New York Times goes on to say that “when you build applications, create mashups and otherwise reveal the potential of our data, we learn more about what our readers want and gain insight into how news and information can be reimagined.” The following tutorial demonstrates one possible reimagining of New York Times content for use in research and digital scholarship.

In order to access the API, you will need to request an API key from the New York Times developer website. This will provide access with certain limitations (most APIs have usage limits in order to prevent individual users from overburdening the server). With that in mind, please also note that data retrieved from the New York Times API is not exhaustive. This is because intellectual property considerations often prevent certain data from being made publicly available in the first place.

There are multiple ways of querying the API, but this particular example will use Python to make the calls. You will write a script in your text editor of choice that can communicate with the API, save it as nytimes.py, and run it locally with the command python nytimes.py. A sketch of the full script follows the walkthrough below.


The script begins with three import statements: json, nytimes, and time. These are references to specific Python packages (prebuilt collections of code). In the case of nytimes, you will need to install the package using pip. For more information on how to install specific Python packages, including pip, check out the following documentation.

The nytimes package provides the tools for querying the New York Times API, the json package converts the results into JSON format for ease of use, and the time package allows us to set a small delay between API calls. There are limits on how many times you can query the API per second, and adding a one-second delay prevents connection errors.

In the variable search_obj, you will need to change YOUR_API_KEY_HERE to the API key that you received from the New York Times developer page. Make sure you preserve the single quotation marks inside the parentheses.

In this example, we are searching for the term ‘cybersecurity’ beginning after the date listed, in this case 20000101 (January 1, 2000). This date can be modified to a date of your choosing, or removed completely to search for all instances across time. Other parameters are available as well; for more information, please see the API documentation.

The API returns query results in the form of pages. Each page contains ten results—in this case, ten unique articles. At present, the maximum number of pages the API will return is 100, so this script will iterate 100 times before stopping. If there are more than 100 pages of content (more than 1,000 individual results), you may need a workaround. Perhaps the easiest approach is to change the parameters in the initial query to pull results one year at a time using the begin_date and end_date parameters, and then combine the results from the individual years into one large dataset.
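Putting the walkthrough together, a minimal version of the script might look like the following sketch. It calls the Article Search endpoint directly with the requests library rather than the nytimes helper package; the endpoint and parameter names follow the public API documentation, but treat the details as assumptions to verify against the current docs:

import json
import time

import requests

API_KEY = 'YOUR_API_KEY_HERE'
URL = 'https://api.nytimes.com/svc/search/v2/articlesearch.json'

articles = []
for page in range(100):  # the API caps paging at 100 pages of 10 results each
    params = {
        'q': 'cybersecurity',
        'begin_date': '20000101',
        'page': page,
        'api-key': API_KEY,
    }
    response = requests.get(URL, params=params)
    response.raise_for_status()
    docs = response.json()['response']['docs']
    if not docs:  # stop early once the results run out
        break
    articles.extend(docs)
    time.sleep(1)  # one-second pause to stay under the per-second rate limit

# Dump everything retrieved into results.json, as described above.
with open('results.json', 'w') as f:
    json.dump(articles, f)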

When you run the script you’ve created, it will dump all of the data it retrieves into a file called results.json. The following is an example of an individual article’s metadata:

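The record below is illustrative: the field names follow the Article Search documentation, but the values are invented.

{
  "web_url": "https://www.nytimes.com/2015/01/01/technology/example-article.html",
  "abstract": "A short example abstract describing the article.",
  "headline": { "main": "An Example Headline" },
  "pub_date": "2015-01-01T12:00:00Z",
  "word_count": 850,
  "byline": { "original": "By A. Reporter" },
  "keywords": [ { "name": "subject", "value": "Computer Security" } ]
}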

New York Times article metadata does not provide the complete text of an article, but it does provide useful information including headlines, abstracts, author information, word count, keywords, and more. The above example represents one single article, and therefore it’s likely that your script will return a substantial amount of content in one single file.

Using the resulting JSON data, you can create all manner of data visualizations. A few basic examples are included below.

The next two examples, “Number of Articles Published from 2000-now” and “Network Graph of Keywords,” relied on Google Fusion Tables, a product that was shut down in 2019.

Visualisation des Billets Vendus

Visualizing Theater History


Odéon Theater seating layout

Visualisation des Billets Vendus (Visualization of Tickets Sold), a data interactive created by Christophe Schuwey, Lecturer at the Université de Fribourg (Switzerland), and Christopher Morse, Senior Research Computing Specialist at DARTH, reveals ticket sales for performances during the 1784-1785 season at the Odéon-Théâtre de l’Europe in Paris. The project was conceived during the May 2016 conference Pratiques théâtrales & archives numérisées: Projet des Registres de la Comédie-Française (1680-1793) (Theatrical Practices & Digitized Archives: The Comédie-Française Registers Project), cohosted by Harvard and MIT.

Inspired by the work of conference presenters Pannill Camp, Associate Professor of Drama at Washington University in St. Louis, Juliette Cherbuliez, Associate Professor of French at the University of Minnesota, and Derek Miller, Assistant Professor of English at Harvard University, Christophe and Christopher deliberated over possible methods of representing an interactive theater space within a browser. Although the project is still in its nascent stages, it has already revealed interesting perspectives on performance attendance.

Thanks to the meticulous recordkeeping of the Comédie-Française, it is possible to reconstruct with some certainty how full or empty the theater was during each performance between 1680 and 1793. In addition to cast lists, show dates, and other relevant performance information, The Comédie-Française Registers Project database contains digitized receipts of daily ticket sales. Visualisation des Billets Vendus uses this data to reveal how crowded the theater was on a given night in the form of a heat map—the hotter the color, the busier the performance.


Ticket sales during the opening night of The Marriage of Figaro (April 27, 1784)

Designing a Theater


Odéon theater layout

D3.js is a data visualization library built in JavaScript for use in web programming. It was an obvious choice for this particular project because D3 simplifies the task of rendering interactive shapes within the browser and associating them with dynamic data. Using various theater layout diagrams, it was possible to abstract the general shape of the theater into a seating chart.

The Odéon consists of several seating areas for which there is recorded data: the parterre assis, galerie, première loge, deuxième loge, troisième loge, and, at the top of the theater, the paradis. As in other theaters, each seating area is divided into additional sections. For now, the individual sections on each floor were combined to create a total count per floor.

Each floor of the theater was represented as an arc and assigned its own unique color. A slider at the bottom of the page allows users to cycle through each performance date, and in the top left there is a legend that helps to distinguish how busy a particular floor was in relation to the others. With every movement of the slider, lines representing each floor will appear on the legend to show how that night’s attendance compares with the entire season.

When The Marriage of Figaro opened on April 27, 1784, after years of rewrites and censorship, the Odéon was completely packed. The floor plan is bright red, save for one row: the troisième étage. Could this be an error in the data, or perhaps attributable to special seating arrangements, or to season passes purchased in advance? With a visual guide to each performance, it becomes far easier to discover and query these inconsistencies.


Seating chart during the opening of The Marriage of Figaro (April 27, 1784)

Continuing the Tradition

Visualisation des Billets Vendus, while still in its early stages, has been an interesting thought experiment in theater representation and history, and presents a number of unique challenges. For example, how should one visualize a theater? Does it suffice to abstract a theater into shapes like a seating chart one might see on a website like Ticketmaster? What can be learned (or not) by specificity, that is to say, by attempting to recreate each individual seating area, or even each seat?


Afterlife of the 1680 Comédie-Française Repertoire, by Derek Miller

Moving forward, the visualization seeks to encompass the entirety of the Comédie-Française registers collection, totaling over one hundred years of ticket sales, and incremental user interface improvements will make it easier to work with the heat map in more detail.

In addition to this project, the conference hackathon also inspired a number of other data visualizations and digital presentations, including one by Derek Miller, which can be read here: Four Perspectives on the Comédie-Française Repertoire.

Harvard IIIF

DARTH has just released its newest project in collaboration with the Harvard Library, the Harvard Art Museums, HarvardX, Harvard University Information Technology (Arts & Humanities Research Computing, the FAS Academic Technology Group, Library Technology Services), and various academic departments: Harvard IIIF. The Harvard University International Image Interoperability Framework (IIIF) website is a centralized resource for documentation, development, and use case scenarios regarding the display and sharing of cultural heritage materials stored within the various Harvard University collections.

Harvard University has adopted the International Image Interoperability Framework standards as developed by the IIIF Consortium for describing and sharing digital assets, and has co-developed the IIIF image viewing software Mirador.

Visitors can explore the site to learn more about Harvard’s work with IIIF, the Mirador viewer, and exciting innovations happening around campus that are made possible by these new standards and technologies.

Dr. Michael Puett and the Power of Imagination

Harvard Professor Michael Puett, inspired by the ancient Chinese text “Zhuangzi,” invites us to view the world from multiple perspectives and unleash the power of our imagination.

Focus: Vassiliki Rapti

Dr. Vassiliki Rapti, Preceptor in Modern Greek, pushes the boundaries of digital media instruction in her classroom by engaging her students in new ways and challenging traditional language teaching methods. DARTH Crimson spotlights her work as an example of the successful integration of technology into course design. Dr. Rapti’s approach allows for greater creativity in the classroom, and this experimentation has had a direct, positive effect on student enthusiasm and performance.

In one assignment, Rapti instructs her students to set up Facebook profiles for the class. They use the Greek-language user interface setting and correspond entirely in Greek. The student responses she has collected have emphasized their enthusiasm for using the language in an immersive, everyday manner, giving them routine practice in a practical space. Some students have chosen to keep their accounts set to the Greek interface indefinitely to hone their skills further. There are also more artistic assignments that provide the opportunity to use technology for self-expression. This approach has earned her recognition, including an Elson Family Award for the integration of the arts into the curriculum. “I realized that allowing the students to be creative is a great motivation for them to improve their language skills,” Rapti said, going on to describe the most ambitious project she tackles with her language students: film inspired by the chorus in Greek tragedy.

All of Rapti’s Modern Greek classes produce a film as a final project. Students participate in every step along the way: research, scripting, acting, filming, and post-production. There is a job for everyone, and English subtitles make the all-Greek performances easily accessible to all audiences. She devised this project while seeking “a more creative and lasting way to teach Modern Greek, and one which could be assessed beyond the final exam.” Although students were receptive to replacing a final exam with an ongoing project, Rapti knew it would be a challenge. “I wondered how I could assess my students in all four language skills (writing, reading, listening, speaking) both in their individual and collective contribution. I was looking, in particular, for a collective experience that would naturally engage each student separately and encourage all students to give their best.”

Rapti found help in the Media Production Center, which continually offers its expertise in film production at Harvard, as well as the Language Resource Center in Lamont Library. Rapti stresses the importance of finding resources within the community to support ambitious ideas such as the film project, and she also recognizes her other collaborators: Rhea Karabelas-Lesage, Head Bibliographer of the Modern Greek Special Collection; the Woodberry Poetry Room, which has rare materials in Modern Greek; the Sackler Museum; the Arts @29 Garden; the Greek Film Society at Harvard; the Harvard College Hellenic Society; and the Greek Institute.

It does not stop at film either. Students have produced other creative projects as well. One student painted a plate in imitation of an Ancient Greek relic, incorporating a well-researched Greek text. Another developed a playable board game around the Odyssey. Others have designed video projects and slideshows that blend Greek poetry with self-made imagery.

Digital media technologies are more accessible than ever before, and Rapti has shown them to be effective tools for cultivating creativity and self-expression in pedagogy. Her courses engage and enthuse, raising the bar where traditional methods fall flat; when given the opportunity, students will surprise you with their ingenuity.


“Omeropolis” by Yannis Koulias, Advanced Modern Greek 100