
As I said, I could not read the whole thing. As I was skimming, I noticed the tests, tried to load the main page, and I was disconnected.

Once again, "words that are most unique" to a character is a parameter that can easily be counted from the set of ALL words with no sampling uncertainty because, yes, we have the population.



I wouldn't say easily. Keep in mind that to check whether a word is "unique" to one character, it has to be compared against every other character as well.

For example, the Top 5 Unique Words for Randy Marsh per the analysis are:

stan, stanley, lorde, shelly, son

I downloaded the dataset and quickly calculated the Top 5 Most Frequently Said Words for Randy from the entire population. Those are:

what, stan, yeah, ok, huh

All characters on the show are saying those words (except "stan"). That's why log-likelihood/tf-idf is used on a per-character basis.
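To make the difference concrete, here is a minimal sketch of raw frequency versus a tf-idf-style score. The lines below are made-up toy data, not the real dataset, and the function names are just for illustration; the original analysis may weight things differently.

```python
import math
from collections import Counter

# Toy, made-up lines (NOT the real dataset), only to show the mechanics.
lines = {
    "randy": "what stan yeah ok stan what stanley".split(),
    "kyle": "what yeah ok dude what cartman".split(),
    "cartman": "what yeah ok mom what kyle".split(),
}

# Word counts per character.
counts = {who: Counter(words) for who, words in lines.items()}


def top_raw(who, n=3):
    """Most frequently said words for one character, ignoring everyone else."""
    return [word for word, _ in counts[who].most_common(n)]


def tf_idf(who):
    """Term frequency for one character times log inverse 'character frequency'."""
    n_chars = len(counts)
    total = sum(counts[who].values())
    scores = {}
    for word, c in counts[who].items():
        # Number of characters who say this word at least once.
        df = sum(1 for other in counts.values() if word in other)
        scores[word] = (c / total) * math.log(n_chars / df)
    return scores


def top_unique(who, n=2):
    """Highest-scoring words; ties are broken alphabetically."""
    scores = tf_idf(who)
    return sorted(scores, key=lambda w: (-scores[w], w))[:n]


print(top_raw("randy"))
print(top_unique("randy"))
```

Words that every character says get a score of zero here, which is why "what" tops the raw list but drops out of the unique list.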


It's the likelihood part he is bitching about, not the inverse frequency.


I think the idea is that what we are really trying to measure is something unobservable, like the underlying nature of the character or the writers' tendencies to give characters certain ways of speaking. We can say that Stan uses a word at a certain rate, corrected for that word's base rate in the corpus, and compare this with the rate for another character. If that difference in rate is very small, we still know for certain that the difference holds for this corpus, but it may not reflect any substantive difference between the characters.

If this is the view taken, then the population is all of the text that might have been generated by the data-generating process of the scripts -- things like the writers' mental models of the characters. In this view the actual scripts are just a sample from all of the scripts that could have been written while keeping the variable of interest (the characters' character) constant.
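Under that "scripts as a sample" view, a log-likelihood statistic is one way to ask whether a rate difference is bigger than chance would produce. Below is a rough sketch of a commonly used two-term G2 for a single word, comparing one character against everyone else. The counts in the usage lines are invented, and I am not claiming this is the exact formula the analysis used.

```python
import math


def g2(word_in_char, total_char, word_in_rest, total_rest):
    """Two-term log-likelihood (G2) for one word: one character vs. the rest.

    Expected counts assume the word is used at the same rate by both groups.
    """
    word_total = word_in_char + word_in_rest
    grand_total = total_char + total_rest
    expected_char = total_char * word_total / grand_total
    expected_rest = total_rest * word_total / grand_total

    def term(observed, expected):
        # By convention 0 * log(0) is treated as 0.
        return observed * math.log(observed / expected) if observed > 0 else 0.0

    return 2.0 * (term(word_in_char, expected_char) + term(word_in_rest, expected_rest))


# Invented counts: same rate, slightly different rate, very different rate.
print(g2(10, 100, 20, 200))
print(g2(11, 100, 20, 200))
print(g2(30, 100, 10, 200))
```

A tiny rate difference gives a G2 near zero even though the difference is "true" for the corpus, which is the distinction being drawn above.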



