Text mining with n-gram variables

Citation:

Schonlau, M., Guenther, N., & Sucholutsky, I. (2017). Text mining with n-gram variables. Stata Journal, 17(4), 866-881. Retrieved from https://www.stata-journal.com/article.html?article=st0502

Abstract:

Text mining is the process of turning free text into numerical variables and then analyzing them with statistical techniques. We introduce the command ngram, which implements the most common approach to text mining, the "bag of words". An n-gram is a contiguous sequence of words in a text. Broadly speaking, ngram creates hundreds or thousands of variables, each recording how often the corresponding n-gram occurs in a given text. This is more useful than it sounds. We illustrate ngram with the categorization of text answers from two open-ended questions.
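
To make the "bag of words" idea concrete, here is a minimal Python sketch of turning texts into n-gram count variables. It is not the Stata ngram command and does not reproduce its options or output. The function names (ngrams, ngram_counts), the whitespace tokenization, the lowercasing, and the sample texts are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(text, n):
    """Return the contiguous n-word sequences in text.

    Assumption: words are lowercased and split on whitespace only.
    """
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_counts(texts, max_n=2):
    """Build one count variable per n-gram (n = 1..max_n) over all texts.

    Returns the sorted list of n-grams and one row of counts per text.
    """
    per_text = [
        Counter(g for n in range(1, max_n + 1) for g in ngrams(t, n))
        for t in texts
    ]
    vocabulary = sorted(set().union(*per_text))
    # A Counter returns 0 for n-grams that do not occur in a given text.
    rows = [[counts[g] for g in vocabulary] for counts in per_text]
    return vocabulary, rows


# Hypothetical example texts, not data from the article.
texts = ["the cat sat", "the cat saw the dog"]
vocabulary, rows = ngram_counts(texts)

for text, row in zip(texts, rows):
    print(text, dict(zip(vocabulary, row)))
```

Each entry in vocabulary plays the role of one variable and each row holds its counts for one text. With real survey answers the vocabulary quickly grows to hundreds or thousands of columns, which is why a frequency threshold or stop-word removal is commonly applied before modeling.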

Notes:

Publisher's Version

Last updated on 11/17/2018

Contact

M3 4128

519-888-4567, ext. 31528

isucholu@uwaterloo.ca

Ilia Sucholutsky

PhD Student, Statistics
