As I said, I could not read the whole thing. As I was skimming, I noticed the te...

minimaxir · on Feb 10, 2016

I wouldn't say easily. Keep in mind that to check whether a word is "unique," it has to be compared against every other character as well.

For example, the Top 5 Unique Words for Randy Marsh per the analysis are:

stan, stanley, lorde, shelly, son

I downloaded the dataset and quickly calculated the Top 5 Most Frequently Said Words for Randy from the entire population. Those are:

what, stan, yeah, ok, huh

All characters on the show say those words (except "stan"). That's why log-likelihood/tf-idf is used on a per-character basis.
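To make the difference concrete, here is a minimal sketch of scoring one character's words against everyone else's. It uses the two-term log-likelihood (G2) form common in corpus keyword comparison, which may not be exactly what the analysis computed. The function names and the toy token lists are made up for illustration and are not the real dataset.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Two-term G2 score for one word.

    a: word count for the target character, b: word count for everyone else,
    c: total tokens for the target character, d: total tokens for everyone else.
    """
    # Expected counts if the word were used at the same rate by both groups.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2


def top_distinctive(target_tokens, rest_tokens, n=5):
    """Rank words the target overuses relative to the rest (both lists non-empty)."""
    target, rest = Counter(target_tokens), Counter(rest_tokens)
    c, d = sum(target.values()), sum(rest.values())
    scores = {}
    for word, a in target.items():
        b = rest[word]
        # Keep only words the target says at a higher rate than everyone else.
        if a / c > b / d:
            scores[word] = log_likelihood(a, b, c, d)
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Hypothetical toy tokens, NOT the real South Park data.
randy = "what stan yeah stan lorde what ok".split()
others = "what yeah ok huh what yeah ok what".split()

print(Counter(randy).most_common(3))   # raw frequency, dominated by common words
print(top_distinctive(randy, others))  # words overused relative to the rest
```

Raw counts reward words everybody says, while the score only goes up when a character's rate departs from the rate in the rest of the corpus.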

wodenokoto · on Feb 11, 2016

It's the likelihood part he is bitching about, not the inverse frequency.

ppod · on Feb 10, 2016

I think the idea is that what we are really trying to measure is something unobservable, like the underlying nature of the character or the writers' tendencies to give characters certain ways of speaking. We can say that Stan uses a word at a certain rate, corrected for that word's base rate in the corpus, and compare this with the rate for another character. If that difference in rate is very small, it's true that we still know for certain that the difference is absolutely true for this corpus, but it may not reflect any substantive difference between the characters.

If this is the view taken, then the population is all of the text that might have been generated by the data-generating process of the scripts -- things like the writers' mental models of the characters. In this view the actual scripts are just a sample from all of the scripts that could have been written while keeping the variable of interest (the characters' character) constant.
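One rough way to act on that view is to treat each line of dialogue as a draw from the process and bootstrap the difference in rates. This is only a sketch under that assumption (lines resampled independently, whitespace tokenization), and `rate`, `bootstrap_rate_diff`, and the toy lines are hypothetical names and data, not anything from the original analysis.

```python
import random


def rate(lines, word):
    """Share of all tokens in `lines` that equal `word`."""
    tokens = [tok for line in lines for tok in line.split()]
    return tokens.count(word) / len(tokens) if tokens else 0.0


def bootstrap_rate_diff(lines_a, lines_b, word, n_boot=2000, seed=0):
    """Approximate 95% percentile interval for rate_a - rate_b.

    Lines are resampled with replacement, as if the scripts were one
    sample out of many that could have been written.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        sample_a = rng.choices(lines_a, k=len(lines_a))
        sample_b = rng.choices(lines_b, k=len(lines_b))
        diffs.append(rate(sample_a, word) - rate(sample_b, word))
    diffs.sort()
    low = diffs[int(0.025 * n_boot)]
    high = diffs[int(0.975 * n_boot) - 1]
    return low, high


# Hypothetical toy lines, NOT taken from the real scripts.
lines_a = ["son come here", "listen son", "son it is fine"]
lines_b = ["oh my god", "that is not fair", "what did he say"]

low, high = bootstrap_rate_diff(lines_a, lines_b, "son")
print(low, high)
```

If the interval comfortably excludes zero, the difference is less likely to be an accident of which scripts happened to get written. If it straddles zero, the difference is still true of this corpus but says little about the characters.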