Workshop Prep
You will need three things for this workshop:
- some content (hopefully you have your own; we will also provide some)
- a laptop (we have some to hand out if you need one)
- a text editor (although really you could get away without one)
Some Content
Textual analysis works by converting one or more texts into a proverbial “bag of words.” These words become the “data” that the computer then counts, weighs, and measures in a variety of ways. Before running computational analyses, you’ll often need to clean or reformat your text(s) in some fashion.
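To make the “bag of words” idea concrete, here is a minimal Python sketch. It is not the method any particular tool uses; the function name `bag_of_words` and the very simple word-splitting rule are assumptions chosen for illustration. It lowercases a line from Sonnet 116, splits it into words, and counts them.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, pull out runs of letters/apostrophes, and count them."""
    # A deliberately simple rule: a "word" is any run of a-z or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

# A line from Sonnet 116, used only as sample input.
line = "Love is not love which alters when it alteration finds"
bag = bag_of_words(line)

# Word order is gone; only the counts remain.
for word, count in bag.most_common():
    print(word, count)
```

Note that “Love” and “love” are counted together only because the text was lowercased first. Choices like that are exactly the kind of preparation the paragraph above describes.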
If you don’t have text of your own, we’ve provided some here, using Shakespeare’s Sonnets:
- Shakespeare’s Sonnets URLs in a CSV
- Shakespeare’s Sonnets in a CSV
- Shakespeare’s Sonnets in Plain Text
- Shakespeare’s Sonnets URLs in a list for Voyant
Text Editors, Get One
Some tools will allow you to use PDFs and Word documents, for instance, but other tools and processes might require you to transform your text(s) into a plain text format (TXT). To do this, you’ll need to download and install a text editor that will allow you to import, clean, and save text in this format, or in other plain formats, like a Comma-Separated Values (CSV) file, which stores information in a table (tabularly!) and is very much like a spreadsheet.
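As a rough illustration of how a CSV file stores a table as plain text, the following Python sketch reads a tiny, made-up CSV with the standard `csv` module. The column names `number` and `first_line` are hypothetical and are not necessarily the columns in the provided Sonnets files.

```python
import csv
import io

# A hypothetical two-column table written as plain text.
# The first line holds the column names; each later line is one row.
csv_text = (
    "number,first_line\n"
    "18,Shall I compare thee to a summer's day?\n"
    "116,Let me not to the marriage of true minds\n"
)

# io.StringIO lets us treat the string like an open file.
rows = list(csv.DictReader(io.StringIO(csv_text)))

for row in rows:
    print(row["number"], "->", row["first_line"])
```

Every value comes back as text, including the sonnet numbers, which is one reason CSV files often need some cleaning before analysis.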
Many of us here use Visual Studio Code, from Microsoft, which is free and works on both Mac and PC computers. Another cross-platform text editor to consider is Atom.
Windows users should consider Notepad++, which is simple to use and free. It also has a good find and replace function that works across files in a directory or files open in the editor. This is especially helpful for editing a larger number of files at the same time.
Note that Windows Notepad, especially in older versions of Windows, does not reliably handle the UTF-8 encoding or UNIX line endings that are standard for cross-platform applications.
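If you end up with files that have mixed line endings, a few lines of Python can tidy them up. This is only a sketch under simple assumptions: the helper names `normalize_newlines` and `save_plain_text` are made up for illustration, and the text is assumed to fit comfortably in memory.

```python
def normalize_newlines(text):
    """Convert Windows (CRLF) and old Mac (CR) line endings to UNIX (LF)."""
    return text.replace("\r\n", "\n").replace("\r", "\n")

def save_plain_text(path, text):
    """Write text as UTF-8 with UNIX line endings, whatever the operating system."""
    # newline="\n" stops Python from translating "\n" to "\r\n" on Windows.
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(normalize_newlines(text))

# Two lines of Sonnet 1 with Windows-style line endings, as sample input.
windows_text = (
    "From fairest creatures we desire increase,\r\n"
    "That thereby beauty's rose might never die,\r\n"
)
unix_text = normalize_newlines(windows_text)
print(repr(unix_text))
```

Calling `save_plain_text("sonnet1.txt", windows_text)` would write the cleaned version to disk; the filename here is only an example.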
Mac users can also look at TextWrangler or BBEdit. TextEdit, the default text editor on Mac, is also useful for simple tasks, like making a document plain text.
On Linux, gedit will suffice.
What is Plain Text
Plain text is basically just what it describes: a file format that saves the bare minimum amount of information presented to it. Whereas a Microsoft Word document file will save within it a variety of information about the font, layout, and properties of the document, a plain text file will save just the text itself.
The following exercise will demonstrate the differences. This is adapted from a tweet by Alex Gil (that I can’t find at the moment).
Plain Text vs. Word Doc
- Open your text editor and create a TXT file that includes only the letter “a” (no spaces or breaks), then save this as “a.txt” in a new folder.
- Now open Word and create a similar file that only includes “a” and save it as “a.docx” in that same folder.
- Open your new folder via the Finder or Explorer and examine the file sizes. Notice anything?
- Optional: If you’re using a Windows computer, change the file extension of the .docx file to “.zip.” Open the .zip file by clicking on it and examine all the accompanying files within. What do you see? Try opening some of those files with your text editor!
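If you are comfortable running a script, the same comparison can be sketched in Python on any operating system. This assumes you saved “a.txt” and “a.docx” in the folder you run the script from; the helper names `file_size` and `list_archive_parts` are made up for illustration. It relies on the fact that a .docx file is a ZIP archive, which is why renaming it to “.zip” works.

```python
import os
import zipfile

def file_size(path):
    """Return the size of a file in bytes."""
    return os.path.getsize(path)

def list_archive_parts(path):
    """Return the names of the files packed inside a ZIP-based file such as .docx."""
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()

if __name__ == "__main__":
    # Compare sizes of the two files from the exercise, if they exist here.
    for name in ("a.txt", "a.docx"):
        if os.path.exists(name):
            print(name, file_size(name), "bytes")

    # Peek inside the Word document without renaming it.
    if os.path.exists("a.docx"):
        for part in list_archive_parts("a.docx"):
            print(part)
```

The plain text file holding a single “a” should come out at one byte, while the Word document will be far larger because of everything packed inside it.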