KL Divergence: April 2010

Tuesday, April 20, 2010

Princely programming hotness

Although I'd promised myself radio silence on here until I get some real work done, I'm going to have to make an exception for a super short post. Prince_i has posted a web app so that people (you!) can put in your own parameters and get a probability that you find someone better than your current prince/princess. His is also simulation based, but it is not quite the same model that I used in my first post (where I admittedly did totally yoink some of his ideas). You should probably go check it out.

Seriously, is there anything on earth sexier than a man who does math and programs!? Oh baby, oh baby, I love those greek letters and linux jokes! (I wish I were kidding...)

Monday, April 19, 2010

The Chinese birth calendar is total bunk

The other day someone (thanks, Dick!) posted my blog about celebrity deaths to hacker news, and someone else (thanks, hoelle!) actually commented on it! Hooray!! Pretty much made my... day / month / year. So, in an effort to encourage such interaction and engagement with my little bloggy, I’m going to respond directly and promptly to the suggestion made in the comments at Hacker News at the expense of any progress on my dissertation today. Really hoping my advisor isn’t on to this...

Hoelle said:

I wonder if that will convince my wife. Probably not. Her stats superstitions drive me crazy. Ever heard of the Chinese birth calendar? For example: http://www.webwomb.com/chinesechart.htm. 90%+ accuracy should be an easy claim to bust. Unfortunately for me it’s been right for our kids 2 out of 2 times. Why are stats always so hard to sell over anecdotal experience?

OK, great. This seems easy enough to test. What follows is my first episode of
MathBusters.

(btw, Jamie and Adam, if you are out there, can I please pretty please be the MythBuster’s statistician? You can even use me as Buster II if you want, as long as I get to do math-fun while being blown to smithereens.)

I downloaded data from the website for the Centers for Disease Control and Prevention. Specifically, the data set on births from 2006 in the US territories because it was both recent and smallish. I wrote some quick pythony-goodness to clean that up so I could move it directly over to R– my one true love.

I only consider births in which all of the necessary fields (sex of baby, date of mother’s last menstrual cycle, age of mother at time of birth) are complete, which leaves me with a sample size of 50,079 birth records to play with. Fun!

Ready for the results? The Chinese birth calendar was correct with 49.70 % accuracy on this dataset. With this many observations, the only point of a hypothesis test will be to have one more darn example of a hypothesis test for proportions on the internet. I say the more fun statistics floating around the better, so...

Let’s start by being lenient and test the hypothesis in the classical way that the
Chinese birth calendar is up to anything but complete random chance. We'll even give it credit if it can do a good job at predicting the opposite! At least if we know that it will be useful for something.

That is, the null hypothesis is that the probability of the Chinese birth calendar being correct is p₀ = .5. Relying upon asymptotic normality (I’d say that 50,079 is pretty darn close to infinity), the fact that I still remember this stuff after four years of grad school, and wikipedia (it does not lie!), we have a z statistic of -1.32, which falls in about the 9th percentile of the standard normal distribution, implying a two-sided p-value of 0.187. To use normal stats lingo, we have to fail to reject the hypothesis that the Chinese birth calendar is anything but a complete load of baloney. My poor Chinese granny probably just rolled over in her grave. Maybe that wasn't normal stats lingo. Oops.

Again, for the sake of more fun stats floating around somewhere, what about testing the hypothesis that p ≥ .9, as is claimed on the website? Well, in the classical hypothesis testing framework, I think that would either require integration or a likelihood ratio test, to which I am morally opposed. So, as a shout-out to my Bayesian homies, I’ll just slap a conjugate prior on p (a beta(1,1)= uniform). This results in a posterior distribution for p, p | data ~ beta(24891, 25187), which implies that the posterior probability that p ≥ .90 is about nill. Yep, zero. No fucking chance does that predict births with greater than 90% accuracy. So, there we go, I’m going to go ahead and call this one busted, Jamie.

(Drats, there I go dreaming again...)

Sunday, April 18, 2010

Do celebrities die in threes?

Check out purple line on the left in the plot above, which shows the arrangement throughout the year of the dates of death of some of the celebrities who died in 2009. It kind of looks like the deaths are bunched together. Then look to the right. Those are randomly generated death dates, which, because human brains like to see patterns, also look like there is some clustering.

It seems like every time two celebrities die, there is speculation about who will be the third, as though two celebrity deaths necessarily means a third is on its way. Although it would be nice to wait to post this until this old superstition gets dug up again when two celebrities die in close time proximity, it will certainly happen again and probably soon.

An ideal time to have posted this would have been in June of last year when Michael Jackson and Farrah Fawcett died on the same day, the 25th, and the internets were a-twitter with talk of this old wives' tail. Depending on how you count it, this supposed death troika was rounded out by Ed McMahon (the 23rd) or Billy Mays (the 28th). Although lots of other people have posted about the invalidity of this superstition, I have yet to see any plots depicting the statistical insignificance of this event. And, as I learned in 2nd grade when I failed to actually show that I had indeed mentally carried that 1, the policy is no work, no credit. So, here we go for full points, please...

In order to do any sort of testing, we have to define what it means to "die in threes." Seriously, what does that mean? It isn't enough that they die in clusters of any size, in which case, I would probably be talking about self-exciting processes... yes, I did just throw that in so I could say "self-exciting processes." No, the superstition is specifically that they die in threes.

What I propose as a definition of this is that for any three deaths to count as a triple, the time from the first death to the last death in this set must be less than or equal to the time elapsed from the last event prior to the triple, and it must also be less than or equal to the time until the next death succeeding the triple. The three deaths have to be separated in time from the other deaths.

For example, let's consider the {Ed, Michael, Farrah} candidate triple, in which case the time from the first (Ed) to the last (Michael and Farrah.. I'm only counting this down to a resolution of one day) is two days. In order for this to count as a triple, no celebrities could have died within one day of either end of this triple-- there must not have been any celebrity deaths on the 22nd or the 26th. In order for the {Michael, Farrah, Billy} candidate triple to be a triple by this definition, no other celebrities would have died from the 23rd until the 30th.

One other piece that needs defining is who counts as a celebrity. I used this website (and I really do apologize for the lovely anus ad at the top of that), so that I could not be accused of cherry-picking my list of celebrities. You could still make that claim because I removed a few people I did not consider celebrities: children and criminals, for example. I just didn't feel right including a child. And yes, I hand transcribed all of the celebrity deaths in 2009-- that is truly a labor of love.

I calculated that there were 28 triples by my above definition in this data set of the 157 celebrity deaths of 2009. I then randomly generated 10,000 sets of 157 death dates, where the dates are randomly selected (with replacement, of course) over the course of the entire year, and I calculated the number of triples in each of these completely random data sets.

This histogram of the number of triples from each of the randomly generated death dates shows (1) a remarkably normal shape and (2) that 28 triples is a totally reasonable number to have seen if celebrities die at random days in the year. The number of triples last year (the pink line) falls in about the 70th percentile of what we would expect under completely random death dates-- far from anything anyone would consider statistical significance.

One might argue that many of the people on that list are not celebrities. I actually don't know who most of them are. So, I re-ran this simulation, using only the deaths I had heard of. (This should, sadly, sync up pretty nicely with the list of deaths reported on perezhilton, as that is one of my few sources of "news".) A similar histogram to that shown above appears below for the analogous simulation with 23 deaths. Again, nothing spectacularly exciting is going on. We would expect to see this number of clusters under complete randomness.

So, there you have it. You can be the judge, but as far as I'm concerned, I'm convinced. There isn't significant evidence that celebrities die in threes.

Tuesday, April 13, 2010

Someday I hope I can be this badass

While looking for an intuitive explanation of why check loss is the appropriate loss function for quantile regression, I happened upon this gem.

For those of you without easy access to academic journals, I reproduce for you a few of the choicest excerpts.

The abstract:

"This article discusses (1) our research to provide a framework for almost all of statistical methods for simple data, (2) need to plan the future of the “Science of Statistics” in order to compete for leadership in the practice of the “Statistics of Science”, (3) grand unifying ideas of the Science of Statistics, (4) an elegant rigorous proof when quantile function minimizes check loss function which is the basis of quantile regression, and (5) exact and approximate confidence quantiles (confidence interval endpoint functions) for parameters p and logodds(p) given a sample of a 0-1 variable."

Another nugget of grandiosity:

"This article reports progress in my ambitious 1,000 – chapter research program, whose goal is to provide a framework for statistical methods for simple data, and integrate:

(1) frequentist and Bayesian methods; (2) nonparametric and parametric methods; (3) continuous and discrete data analysis; (4) functional and algorithmic (numerical analysis based) data analysis."

And my favorite part, defining a function to be called "pain":

"We propose “pain” as a name of a penalty function whose minimization is equivalent to the minimization of an objective function."

Sounds pretty painful to me! Unfortunately, he did not follow up by defining any variables as "ass", which would have been a truly bold move.

So, there you have it. This dude is such a badass the editors let him get on his soapbox for the first several pages about the future of statistics and some grand unifying goals. I can only hope that once I reach the point where I've proven myself enough that I don't have to give a crap what anybody thinks, I choose to exercise this right by publishing math papers with rambling prefaces and creative function names. Unfortunately, I suspect I'll instead be the crazy lady on the corner yelling shockingly foul insults at two generations from now's version of the hipster and farting at totally inappropriate times.

Sunday, April 11, 2010

Seasonal Insults

Some people from this project came to Duke this week, touting the greatness of their new method for approximate Bayesian inference and their sweet new R package. Though they didn't actually tell us how it works (apparently it is based upon some sort of advanced computational wizardry), I was excited to try out this new toy. If you browse the website, a lot of the examples are either time series (time serial?) or spatial. Since I deal with spatial data all the effing time, I decided to find some time series data and give INLA a whirl.

So, I present to you the schmuck dataset!

This is the relative search volume of the word "schmuck" weekly from some time in 2004 until yesterday. You could get it yourself off of Google Trends (thanks, Google! ), or you could just get a version here, from which I've already removed all of the extraneous junk.

Look at this beauty!! Something is definitely happening consistently during the 2nd week of December that makes people reaaaally want to search for the word schmuck.

Back to the statistics... using INLA, I tried fitting a latent AR(1) term to this even though that is clearly not the right model. No dice. I tried adding a seasonal component. Still no. It keeps shooting me an error message without much of an explanation. Something about a singular matrix. What matrix, I don't know. So, that's it in a nutshell. Although I was super pumped to take INLA for a test drive, this sort of knocked the wind out of my sails. This is not to say it doesn't work, just that I can't get it to work on my new favorite data set.

So, I leave you with one more seasonal insult, losers. Below you will find the relative search volume for the term "loser."

What is it about the cold months that makes people really interested in schmucks and losers? 10 units of pride to anyone who can give me a plausible explanation!

Wednesday, April 7, 2010

A princess story part II

As it's been pointed out to me, girlfriend's got some issues. I'm making the poor princess a little bit too needy, jumping from one prince right on to the next. She needs time to really focus on herself, time to figure out what SHE likes, time to just enjoy being single. She's no sitting around waiting for Prince Charming, delicate flower of a Snow White.... no, she's got the sass of Princess Jasmine.

In terms of modeling assumptions, the slight tweak required to accommodate our dear princess's independent spirit (or desire to never have to utter the words "I don't know who my babydaddy is."-- thanks, L, for that.) is the addition of a component that determines the wait time between relationships. In the current incarnation of this simulation, the princes arrive back to back, and she is constantly collecting data. Exhausting!

Alternatively, let's consider a model in which there is a period of latency between princes. At the end of each prince's tenure, let's now suppose that the princess waits some exponentially distributed amount of time with mean independent of her mean relationship duration.

How does this change the best strategy?

As expected, she should auto-reject fewer princes on average if she's going to wait a long time between them. Makes sense. Keep in mind that this is under the assumption that she is not collecting data on anyone during the waiting period.

Here's another plot of the best strategy for the number of princes to automatically ditch (instead of do or marry?) for my settings in this experiment, but under conditions where the princess isn't quite so needy. The x-axis shows the average number of days between relationships, and the hot pink region, shows strategies that came within 90% this time of the best one.

I didn't allow for simultaneous data collection on multiple princes (a negative wait time?), but that could be an interesting extension. If you stay tuned, I've got an even more interesting tweak in the works that I'm pretty sure is going to imply that prince_i is still training data.

But first, Bayesian spatial quantile regression awaits...

Tuesday, April 6, 2010

A princess story

Once upon a time, a princess was sequentially presented with N suitors. She had the option to reject each one as he came, but she didn't know the quality of the successive suitors. Once a suitor was rejected, she could not change her mind and return to one she had previously scorned. Thank goodness these rules weren't in effect for Princess Jasmine, or else there would have been no happily ever after for poor Prince Ali.

Wikipedia, as usual, has a very thorough discussion of this problem (listed as the secretary problem, if you want to look it up). However, as I fancy myself a princess, and this has to do with me (of course), I will retain the princess language.

So, here's the deal. It would be extremely beneficial to both me and one man(space) friend to know if we have sampled the pool enough (hehe) to be out of the training data stage. So, we are going to use the princess game to decide what to do when he leaves for the summer and I leave for...ever. But, because we are both huge nerds both would know the answer under the traditional assumptions (and what fun is that?!), let's make this a little more exciting...

The princess game assumes the suitors arrive in a random order / are randomly selected / the king gets to pick who you date. I'm pretty sure that's not how it works-- if it had been up to my dad, my pool would have been a random selection of "nice Asian boys from Berkeley." OK, so let's move forward to a time/place when women get to pick their own boyfriends, and assume that we pick who we date based upon each candidate boyfriend's attributes and how important those attributes are to us at the time. (We can see the whole pool's attributes, but we only know how useful each person was to us after we date them.)

But ahhh! In high school, back when this whole game started, Lance Bass was pretty much my ideal man. (Don't judge! My bff had already called JT.) We all know how well that would have worked out for me if I'd gotten my wish. Thank goodness I've gotten to update my understanding of which man-attributes I like since then. Turns out shy and effeminate isn't quite as appealing to me as I once thought... Strange.

The original framing of the problem assumes that the princess is able to determine the actual quality of the suitors as they come. I want to allow for learning over time about which qualities the princess finds appealing. She'll be picking her next boyfriend based upon her current beliefs about which qualities she likes. (For fellow nerds, she'll rate the remaining princes in the pool based on their posterior expected value to her given all of the princes she's already sampled.)

Lastly, let's ground ourselves in reality. As much as I'd like to keep playing until I find the perfect person, there is only finite time in which to play this game. While long-term relationships use up a lot of your allotted game time, they also allow you to learn a lot more about the attributes you appreciate in a person.

More formally,

Soooooo, because dissertations don't write themselves, I'm just going to run a simulation rather than do what it seems like everyone else has done (yes! gah! I did a mini lit review..) and derive things.

Feel free to skip to the bottom now; you won't hurt my feelings. But, for the brave...

I will include parameters that dictate:

The noise with which the princess observes her utility for a prince after knowing him for only one day.
The average duration of a relationship. (This will be modeled with the exponential distribution, though really a think a mixture distribution with a point mass at 1 day would be pretty appropriate for most of my friends. Not me, of course. I'm a lady.)
The number of attributes that go into a princess's weighting of how much she likes a guy. According to the partner in crime in this project, there should only be two attributes... jerk! jk
Your time limit for picking a mate. I'm setting this to 12 years.
How sure you are about your initial guess at the importance of different attributes, and how far away this is from the truth.
What percentile of awesome does the guy you end up with have to be in to make you happy. If you get someone who is top 10 out of 100, is that good enough? (I'm setting this to be top 5%. No soulmates here!! Why 5%? Ask whomever made it the magic number in hypothesis testing, I don't know)

But, before we get to my decision re: the manfriend, let's look at how one's strategy should change over different values of one of the parameters just to get some intuition about how this simulation is working...

On the left there you see the best strategy for the number of boyfriends you sample and automatically reject before starting to try to find the best one. The blue lines are strategies that came within 80% of the best one. This ranges over the average duration (in days) of the relationship down on the x-axis. The moral of the story: people with lots of short relati

onships should wait longer to start looking seriously than those with a few long term ones. However, the difference between the best strategies isn't that big. (15ish versus 5ish). That being said, ignore the actual numbers... This all depends on the other parameters of the simulation, which I haven't told you for this example.

And now, ta-daaaa!! The results! To get my final sample of best strategies, I averaged over my

beliefs about what all of the parameters of this simulation would be for me. I'll post this code to my website, so if you're interested in how I did this averaging, you can look for yourself. The histogram here shows the optimal strategy over 1000 simulations. It looks like my best bet on average is to wait about 10 princes and then take the best one that comes after that.

Uh oh... there have already been ten... better start running, you know who you are... ;)

Disclaimer: this doesn't take into account the fact that the good ones might get taken . And it assumes that you can't go back to an old one. Both of those aren't quite right, but this was about as much work as I am willing to put in on a procrastination project!

KL Divergence