Analyzing Documents with TF-IDF

Key Info
Description - a brief synopsis, abstract or summary of what the learning resource is about: 

This lesson focuses on a core natural language processing and information retrieval method called Term Frequency - Inverse Document Frequency (tf-idf). You may have heard about tf-idf in the context of topic modeling, machine learning, or other approaches to text analysis. Tf-idf comes up a lot in published work because it’s both a corpus exploration method and a pre-processing step for many other text-mining measures and models.

Looking closely at tf-idf will leave you with an immediately applicable text analysis method. This lesson will also introduce you to some of the questions and concepts of computationally oriented text analysis. Namely, this lesson addresses how you can isolate a document’s most important words from the kinds of words that tend to be highly frequent across a set of documents in that language. In addition to tf-idf, there are a number of computational methods for determining which words or phrases characterize a set of documents, and I highly recommend Ted Underwood’s 2011 blog post as a supplement.
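
As a rough illustration of the idea described above, the following minimal sketch scores words in a tiny invented corpus. It assumes one common textbook formulation (raw term count multiplied by the natural log of the number of documents divided by the number of documents containing the term); the lesson itself and libraries such as scikit-learn may use different weighting or smoothing. The corpus, the `documents` variable, and the `tf_idf` function name are hypothetical and exist only for this example.

```python
import math
from collections import Counter

# Hypothetical toy corpus, invented for illustration only.
documents = {
    "doc_a": "the senator spoke about the railroad",
    "doc_b": "the nurse spoke about the hospital",
    "doc_c": "the railroad reached the town",
}


def tf_idf(docs):
    """Return {doc_name: {term: score}} with tf = raw count, idf = ln(N / df).

    This is one textbook variant (an assumption), not a library's exact formula.
    """
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: how many documents contain each term at least once.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        scores[name] = {
            term: count * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return scores


scores = tf_idf(documents)

# Show the three highest-scoring terms for each document.
for name, term_scores in scores.items():
    top = sorted(term_scores.items(), key=lambda kv: kv[1], reverse=True)[:3]
    print(name, top)
```

In this sketch a word such as "the", which appears in every document, receives a score of zero, while words confined to a single document score highest. That is the sense in which tf-idf separates a document's distinctive words from words that are frequent everywhere.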

Suggested Prior Skills
- Prior familiarity with Python or a similar programming language. Code for this lesson is written in Python 3.6, but you can run tf-idf in several different versions of Python, using one of several packages, or in various other programming languages. The precise level of code literacy or familiarity recommended is hard to estimate, but you will want to be comfortable with basic types and operations. To get the most out of this lesson, it is recommended that you work your way through something like Codecademy’s “Introduction to Python” course, or that you complete some of the introductory Python lessons on the Programming Historian.
- In lieu of the above recommendation, you should review Python’s basic types (string, integer, float, list, tuple, dictionary), working with variables, writing loops in Python, and working with object classes/instances.
- Experience with Excel or an equivalent spreadsheet application if you wish to examine the linked spreadsheet files. You can also use the pandas library in Python to view the CSV files.

Authoring Person(s) Name: 
Matthew J. Lavin
Authoring Organization(s) Name: 
The Programming Historian
License - link to legal statement specifying the copyright status of the learning resource: 
Creative Commons Attribution 4.0 International - CC BY 4.0
Access Cost: 
No fee
Citation - format of the preferred citation for the learning resource: 
Matthew J. Lavin, "Analyzing Documents with TF-IDF," The Programming Historian 8 (2019), https://doi.org/10.46430/phen0082.
Primary language(s) in which the learning resource was originally published or made available: 
English
More info about
Keywords - short phrases describing what the learning resource is about: 
Data analysis
Data visualization
Data visualization tools
Digital humanities
Historical data
Humanities data
Natural language processing
Programming
Python
Software management
Text and data mining
Subject Discipline - subject domain(s) toward which the learning resource is targeted: 
Arts and Humanities
Arts and Humanities: Digital Humanities
Arts and Humanities: History
Published / Broadcast: 
Monday, May 13, 2019
ID - identifier that provides the means to locate the learning resource or its citation: 
10.46430/phen0082
Type - namespace prefix for the citable locator, if any: 
DOI
Publisher - organization credited with publishing or broadcasting the learning resource: 
The Programming Historian
Media Type - designation of the form in which the content of the learning resource is represented, e.g., moving image: 
Interactive Resource - requires a user to take action or make a request in order for the content to be understood, executed or experienced.
Educational Info
Purpose - primary educational reason for which the learning resource was created: 
Instruction - detailed information about aspects or processes related to data management or data skills.
Learning Resource Type - category of the learning resource from the point of view of a professional educator: 
Learning Activity - guided or unguided activity engaged in by a learner to acquire skills, concepts, or knowledge that may or may not be defined by a lesson. Examples: data exercises, data recipes.
Target Audience - intended audience for which the learning resource was created: 
Data manager
Data professional
Early-career research scientist
Graduate student
Mid-career research scientist
Software engineer
Technology expert group
Intended time to complete - approximate amount of time the average student will take to complete the learning resource: 
More than 1 hour (but less than 1 day)