Monday, December 5, 2011

I'm baaaa-aaaack.

You probably didn't realize I was gone. That's ok. Just pretend like you missed me.

Anyway, I'm fresh off of a quarter-life crisis induced year of international wanderings. In case you are wondering (again, just pretend), after my postdoc in Brazil, I stayed for a while and did whatever you do on the beach (absolutely nothing) for a few months. Then off to Germany for some climbing, followed by Colombia and Ecuador, and then up to California. Back down to Peru to do the Inca Trail with some buddies, and then a roadtrip around Europe in a rented car. (Sentence fragments aren't bad if you're blogging. Promise.) 

So, in the interest of undoing some of the brain atrophy I've experienced over the last year, expect to see a new post every once in a while.

And in other news, I am moving my blog to my vanity site,

Sunday, April 3, 2011

Brasil ranks 31st out of 44 in English proficiency

A few months ago, I did a post about my guess that someone whose first language is widely spoken would be less likely to speak English than someone whose first language is relatively obscure. It looks like I've been outdone.

English First has done a study that assesses the English proficiency of adults in various countries. From this, they have put together an English proficiency index and made some pretty nifty maps and plots.

The English First folks also investigated the same phenomenon that I did in my post. Clearly they have a much bigger budget (greater than $0) for doing these sorts of things, and they didn't just cull their data from Wikipedia, so I tend to go with what they say. Good thing their results support my own-- again, that people whose first language is shared by many are less likely to speak English. However, the relationship they found was "weak." See below.


If you're upset by the fact that the relationship here appears to be in the opposite direction of that which I found earlier, don't be. I was looking at the negative log of the number of native speakers. Why I transformed the data like that, I don't actually remember, but rest assured that this is showing roughly the same thing. Of course, this isn't exactly the same thing, the most obvious reason being that they are looking at "English proficiency", whereas I was looking at the "percent of English speakers."

They also compare English proficiency to various other variables they believe should be related, such as the value of exports per capita, the average number of years of schooling, and gross national income per capita. All of these had a stronger relationship with English proficiency than the native-speakers variable did.

One last mildly interesting nugget of information, mentioned in the Brazilian article that pointed me to the English First study and website, is that all of the BRIC countries fall right in line: China, India, Brazil, and Russia took the 29th, 30th, 31st, and 32nd spots respectively. The article also pointed out that, although Brazil did not do so well worldwide in this ranking, at least it beat Venezuela and Chile!

Sunday, March 27, 2011

The Anne Hathaway Effect

I recently stumbled upon this article in the Huffington Post which claims that every time Anne Hathaway gets a lot of Internet attention (for releasing a movie, hosting the Oscars, or what have you), the stock price for Berkshire Hathaway shoots up. The author, Dan Mirvish, justifies the plausibility of this by saying that "My guess is that all those automated, robotic trading programming are picking up the same chatter on the internet about "Hathaway" as the IMDb's StarMeter, and they're applying it to the stock market." 

The data they use to support the claim are:
Oct. 3, 2008 - Rachel Getting Married opens: BRK.A up .44%
Jan. 5, 2009 - Bride Wars opens: BRK.A up 2.61%
Feb. 8, 2010 - Valentine's Day opens: BRK.A up 1.01%
March 5, 2010 - Alice in Wonderland opens: BRK.A up .74%
Nov. 24, 2010 - Love and Other Drugs opens: BRK.A up 1.62%
Nov. 29, 2010 - Anne announced as co-host of the Oscars: BRK.A up .25%

I think the first commenter put it well when s/he said 
Nah, just kidding. Here's what they really said:
This is junk statistics if I've ever seen it. There may be something to the automated trading idea, but these data are proof of nothing. How about the hundreds of other times Ms. Hathaway was in the news and the stock didn't rise so dramatically? How volatile is this stock normally? Are these percentage increases anything out of the ordinary?
Exasperated, I decided to do a quick test. I downloaded the BRK.A data from Jan. 1, 2008 to Mar. 18, 2011 from YAHOO Finance and did a trivial analysis of it in Matlab. Just looking at the difference between open and close prices, the stock was up 0.25% or more 308 times over this period. The stock was up 2.61% or more 47 times over this period. Those two percentages are the lowest and highest in Mr. Mirvish's "data."
As a scientist and math lover I'm disappointed to see this story making the rounds with so little skepticism. It's a statement for the level of understanding of statistics and probability by the general public.
Looks like I'm not the only mathbuster out there. 

My first complaint about this (and backing up commenter number 1) is that, as someone who does not follow stocks at all, I have no idea whether a .74% increase in BRK.A is anything notable. Having downloaded the stock prices since 2008 from Google Finance, I can tell you that it isn't. When Rachel Getting Married opened, the .44% increase was in the 68th percentile of changes in price... including negative changes. It was only in the 32nd percentile of positive changes. Even the biggest change of 2.61% is only in the 92nd percentile overall. Certainly not a tail event. Getting to the point, it's not like every time Anne Hathaway gets naked with Jake Gyllenhaal, the stockholders all go out and buy themselves a brand new G6. It's a pretty normal fluctuation.
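The distinction between a change's percentile overall and its percentile among positive changes only is easy to mix up, so here is a minimal sketch of the calculation in Python (the daily changes below are made-up stand-ins, not the real BRK.A series):

```python
def percentile_rank(values, x):
    """Percentage of observations less than or equal to x."""
    return 100.0 * sum(1 for v in values if v <= x) / len(values)

# Hypothetical daily percent changes standing in for the real BRK.A data.
changes = [-1.2, -0.8, -0.3, -0.1, 0.2, 0.44, 0.9, 1.5]

overall = percentile_rank(changes, 0.44)
positive_only = percentile_rank([c for c in changes if c > 0], 0.44)
```

The same change sits much lower in the positive-only ranking than in the overall ranking, which is exactly why "68th percentile overall" shrinks to "32nd percentile of positive changes" in the real data.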

Over the period from 2008 to yesterday, the stock increased about 47% of the time. Since we are apparently completely disregarding the magnitude of the change, the probability of getting all positive changes when randomly selecting 6 dates out of the 828 trading days is quite small (0.47^6, or about 1%). But what would be the chances of looking at, say, 10 different dates and finding that 6 or more of them are positive? If we ignore the issue of replacement (which shouldn't matter much since the sample size is 828 and we are only sampling 10), the probability of getting exactly 6 is about 18%, and the probability of getting 6 or more is about 31%.
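These numbers are straightforward binomial calculations; a quick sketch in Python (rather than the Matlab used elsewhere in this post) to reproduce them:

```python
from math import comb

def binom_pmf(n, k, p):
    """Probability of exactly k successes in n independent trials,
    each succeeding with probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

p_up = 0.47  # observed fraction of up days, 2008 to yesterday

p_all_six = p_up**6  # all 6 cherry-picked dates positive: about 1%
p_exactly_6 = binom_pmf(10, 6, p_up)  # about 18%
p_six_or_more = sum(binom_pmf(10, k, p_up) for k in range(6, 11))  # about 31%
```

Treating the 10 sampled days as independent coin flips with probability 0.47 is the "ignore replacement" approximation described above.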

Given that the hypothesis is that the stock price is getting this little upward nudge because of Internet chatter, I checked out Google Trends to find other likely dates that the stock should increase under this hypothesis. Luckily, Google even shows you what the major news stories are on some of the major peaks, so it is easy to figure out the date.

Google Trends for Anne Hathaway
The top line is search volume and the bottom is news volume. They pick out many of the same spikes.

Two big peaks we see here that haven't already been accounted for in the original post are peak B, "Anne Hathaway Proclaims Love For 'Family Guy,' 'Aqua Teen,' Fulfills Nerd Vision Of Idealized Woman," on February 23, 2009, and peak C, "Anne Hathaway spends spare time studying physics," on February 2, 2010. On these two dates, BRK.A saw a 1.82% and .11% decrease respectively. Further, on June 20, 2008, when the Los Angeles Times posted a story called "Anne Hathaway versus Jessica Alba," resulting in the very visible spike in 2008 (I guess everyone likes a good ladyfight), BRK.A experienced a -.79% change. That same day also happened to be the opening day of Get Smart. And if we go back just a little bit further to December 9, 2005, the day that Brokeback Mountain had its major opening in the US, BRK.A dropped .07%. In fact, the sample correlation between Anne Hathaway's Internet search traffic and the price of BRK.A from 2008 to yesterday was just .01-- basically uncorrelated.**
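For reference, the sample correlation quoted there is just the usual Pearson coefficient between the two daily series; a minimal Python sketch (the numbers below are toy stand-ins, not the real search-volume or price data):

```python
import math

def pearson_r(xs, ys):
    """Sample (Pearson) correlation between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Toy stand-ins for daily search volume and BRK.A price:
search = [1.0, 3.0, 2.0, 5.0, 4.0]
price = [10.2, 9.8, 10.5, 10.1, 9.9]
r = pearson_r(search, price)
```

A value near zero, like the .01 found for the real series, says the two move together no more than chance would suggest.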

Given all of this, I'm really hoping that Dan Mirvish didn't run out and buy up a bunch of BRK.A hoping that his post would force the price up a bit. :)

**This, of course, does not rule out the case that the fancy trading algorithms only act based on spikes in search volume, not normal activity, but just sayin'... 

Wednesday, March 16, 2011

Text me where the buildings are, and I'll tell you where the building damage is.

Back in October 2010, Patrick Meier posted an article called How Crowdsourced Data Can Predict Crisis Impact: Findings from Empirical Study on Haiti on his blog, iRevolution. It might be worth your time to go skim that really quickly if you want to get the biggest bang for your buck as you continue reading this... go ahead, I'll wait.

If you did your homework, you already know that in his blog post, he recaps some pretty interesting results from a  team at the European Commission's Joint Research Center (JRC). The researchers who did this study were very awesome and sent me the original paper along with some hints as to how they did their analysis. If you want the paper, which appears in Conference Proceedings from the 2nd International Workshop on Validation of Geo-Information Products for Crisis Management, you'll have to track down the proceedings. Alternatively, you can watch the presentation video.

Meier wrote that the JRC team used the SMS reports mapped on the Ushahidi-Haiti platform "to show that this crowdsourced data can help predict the spatial distribution of structural damage in Port-au-Prince". The SMS messages they use were collected starting just four days after the disaster and were sent by Haitians with their "location and urgent needs." Through the magic of spatial statistics, these researchers show that they are able to predict the locations of building damage using the SMS data. They point out that in the event of an emergency such as the Port-au-Prince earthquake, this sort of prediction would be very useful because it is cheap and real-time. You don't need a small army of "some 600 experts from 23 different countries" and the World Bank to assess detailed satellite imagery to pinpoint the damaged buildings. All you'd really need is a much smaller sample of damaged buildings with which to correlate the SMS data, and voila! As you get more SMS data, you would be able to predict where more building damage is (read: people needing help are).

Let's start by taking a look at some of the figures from the paper that support this claim. Figure 1 (in this blog; Figures 4 and 5 in the paper) shows a derivative of Ripley's K-function, which essentially determines whether same-type events (top row) or different-type events (bottom row) can be said to cluster together at various distances. Remember that this paper's main idea is to show that building damage is clustered near SMS messages. One type of event is an SMS message, and the other type is a highly damaged building, as judged by the previously mentioned "experts". The data are the locations of each of these types of events across a 9km x 9km square that comprises the city of Port-au-Prince. The horizontal axis, across which this L function is calculated, represents the distance between the locations of events. The green lines are 80% confidence intervals. In a nutshell, if the black line (the calculated L statistic) falls above the green line at any point, then we are to think that within this radius around any given event, events of the same type (top row) or different type (bottom row) are more likely to occur than they would be at random. So, for example, if we look in the bottom right plot of Figure 1, we find that for radii between about 1000m and 3000m from any SMS message, we are likely to find a higher-than-average number of damaged buildings. Hence the usefulness of the SMS messages in this situation.
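For the curious, the cross-type K statistic underneath these plots is conceptually simple. Here is a naive Python sketch with no edge correction (which real implementations, such as R's spatstat, do include) and made-up points on a unit square standing in for the real SMS and damage locations:

```python
import math

def cross_k(pts_a, pts_b, r, area):
    """Naive cross-type Ripley K at radius r: count of (a, b) pairs
    within distance r, scaled by area and the two point counts."""
    close = sum(1 for (xa, ya) in pts_a for (xb, yb) in pts_b
                if math.hypot(xa - xb, ya - yb) <= r)
    return area * close / (len(pts_a) * len(pts_b))

def cross_l(pts_a, pts_b, r, area):
    """Variance-stabilised L transform; L(r) - r above the confidence
    envelope suggests attraction between the two event types at scale r."""
    return math.sqrt(cross_k(pts_a, pts_b, r, area) / math.pi) - r

# Made-up example: "SMS" points sitting on top of "damage" points should
# show attraction at small r; damage in the far corner should not.
sms = [(0.20 + 0.01 * i, 0.20) for i in range(5)]
damage_near = [(0.21 + 0.01 * i, 0.21) for i in range(5)]
damage_far = [(0.80 + 0.01 * i, 0.80) for i in range(5)]
```

In the real analysis this statistic is compared against an envelope built from simulated random point patterns, which is what the green 80% lines represent.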

Figure 1: L statistic from original paper

But, let's think about this for a second. Does it really make sense that this would be the case for a radius of 2km but not 500m? That is, would it really make sense to believe that people are texting for help 2km away from major building damage but not right near the site? Sure, I guess I could buy that. I suppose it could be the case that people very close to the damaged buildings are either dead or incapacitated and thus unable to send SMS messages. I wouldn't expect this to be the case up to a kilometer away from the most damaged buildings, but I'll go with it for now. Secondly, how useful is it to know that there are likely to be damaged buildings within a 2km radius of any text? If we assume that we don't already have a good idea of where buildings are without the text messages, my high school geometry tells me that this 2km radius implies an area of about 12 and a half square kilometers in which we blindly search to find the expected extra building damage. Even subtracting off that inner radius, where there is not likely to be extra damage, we're still left with almost 10 square kilometers. Again, I'll go with it. Maybe the information from all of the text messages combined gives more practically useful information.
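Those back-of-the-envelope search areas are just circle arithmetic; a quick check in Python:

```python
import math

disk_2km = math.pi * 2.0**2    # full 2 km search radius: ~12.6 sq km
disk_1km = math.pi * 1.0**2    # inner 1 km, where no extra damage is expected
annulus = disk_2km - disk_1km  # what's left to search: ~9.4 sq km
```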

The most convincing graphic from this paper (labeled as Figure 7 from their paper, and Figure 2 in my blog) is that which shows the observed density of building damage next to the predicted building damage density given SMS messages.  Yep, I agree that this passes the eyeball test. It does look like SMS messages are doing a pretty good job of sniffing out building damage.

Figure 2: Predicted and observed building damage density from original paper.
Alright, now let's take a closer look. I also got a hold of the larger data sources used in this analysis. Because the paper does not list the exact boundaries they used to define Port-au-Prince in their data set, I tried to recreate their data set based on the number of events they reported to have included in the analysis, guessing at the boundaries of their plots by finding landmarks on a map. After many hours of trying to find a subset of these larger datasets that would match the SMS and building damage data sets used in the above analysis perfectly, I emerged with something that is hopefully sufficiently similar.

First, because I will be doing some statistics and thus no one will trust me (thanks a lot, Mark Twain), I reproduce the above plots using my datasets. Although it looks like I cut off a little bit of space over on the right when trying to match their dataset, for all intents and purposes, I think I've got the same thing. They've got 1645 SMS messages, and I've got 1651. They use 33,800 damaged building locations, while I use 33,153. Although the plots that I have reproduced (Figures 3 and 4) are not *exactly* the same as those presented in the paper (above), I think they are similar enough to conclude that I am doing the same thing they are, given that the datasets are slightly different and some of these plots require tuning parameters. I'm satisfied.

Figure 3: My reproduction of the L statistic plots that appear in the original paper using my dataset.

Figure 4: (left) Fitted conditional density of building damage given SMS messages. (right) Observed density of building damage. Both of these plots were produced from my datasets and are intended as reproductions of the plots in the original paper.

My first main question upon reading this paper was whether these text messages were specifically picking out damaged buildings or whether they were simply finding areas of high building density. After all, people send the text messages, and people do tend to be in areas with lots of buildings. I re-ran the same analysis with a random sample of 1000 buildings (as opposed to the previous plots, which used a random sample of 1000 damaged buildings). Proceeding with their 80% confidence interval convention, I find very similar results. For radii of about 1.5-3km, SMS message locations correlate with building locations, not just damaged building locations. Further, according to the infallible eyeball test, it seems that the SMS data is doing a good job of finding all of these buildings. (Figures 5 and 6)

Figure 5: L statistics for SMS messages and a random sample of all buildings.
Figure 6: (left) Fitted conditional density of buildings given SMS messages. (right) Observed density of all buildings. 

So, what's going on here? My initial reaction was "Blimey! These text messages are just picking out buildings, not damaged buildings!  Damaged buildings can only occur where there is a building, and because text messages correlate with buildings themselves, the correlation between text messages and damaged buildings is merely an artifact!"  After some quiet introspection,  I realized that I may have jumped the gun.  Because we only used the trusty eyeball test, we haven't looked at whether text messages do a better job of picking out the specifically damaged buildings than they do any building at all.

For my next trick, I run a Poisson regression. Following the original paper, I bin the data into a 30 by 30 grid, counting up the number of total buildings, damaged buildings, and SMS messages sent in each grid square. A quick diagnostic plot of the total counts versus the damaged counts indicates that there is a pretty good linear relationship between the two-- the number of damaged buildings in any square is approximately a constant times the total number of buildings in that square. Although I am hoping with all of my might that my PhD advisor does not read this and find out that I did not use a formal (Bayesian!) spatial model to handle this clearly spatial data, I simply ran a few Poisson regressions to see if the SMS data really is adding anything beyond what we already know from the building counts. In my experience, incorporating a spatial model in the regression would only serve to reduce the significance of the covariates anyway. I fit the model

Damaged Buildings ~ Poisson( exp{ b0 + b1 * SMS + log(Total Buildings + 1) } ). (Model 1)

This model includes the log of one plus the total number of buildings as an offset. Adding one simply serves to eliminate the computational problem of taking the log of zero. As discussed in the Wikipedia article on offsets, this is often used to control for a baseline, in this case the total number of buildings in a square. The results of this regression are

glm(formula = damcounts ~ offset(log(allcounts + 1)) + textcounts,
    family = poisson(link = "log"))
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-16.002   -3.074   -0.646    1.324   21.507
              Estimate Std. Error  z value Pr(>|z|)  
(Intercept) -1.5669207  0.0061123 -256.353   <2e-16 ***
textcounts  -0.0024470  0.0009817   -2.493   0.0127 *
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
    Null deviance: 23336  on 899  degrees of freedom
Residual deviance: 23329  on 898  degrees of freedom
AIC: 26256
For those of us not used to reading R output, look at the number to the far right of "textcounts". While the coefficient on the number of text messages is significant, the sign is in the opposite direction from what was expected! Having text messages in any grid square results in a prediction of fewer damaged buildings! Could it be that before sending text messages, the people sending them moved away from the damaged buildings for safety reasons?
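For readers who prefer to see the machinery behind that R call, Model 1 can also be fit by hand. Below is a rough pure-Python sketch of a Poisson regression with an offset, fit by iteratively reweighted least squares; the grid-square counts are made up, and the damaged counts are generated from known coefficients so the fit has a known right answer (in practice you would just use R's glm, as above):

```python
import math

def poisson_glm_offset(x, y, offset, iters=50):
    """Fit y ~ Poisson(exp(b0 + b1 * x + offset)) by iteratively
    reweighted least squares; returns (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        eta = [b0 + b1 * xi + oi for xi, oi in zip(x, offset)]
        mu = [math.exp(e) for e in eta]
        # working response, with the (fixed) offset removed
        z = [e - oi + (yi - mi) / mi
             for e, oi, yi, mi in zip(eta, offset, y, mu)]
        # weighted least squares of z on (1, x) with weights mu:
        # solve the 2x2 normal equations directly
        s0 = sum(mu)
        s1 = sum(m * xi for m, xi in zip(mu, x))
        s2 = sum(m * xi * xi for m, xi in zip(mu, x))
        t0 = sum(m * zi for m, zi in zip(mu, z))
        t1 = sum(m * xi * zi for m, xi, zi in zip(mu, x, z))
        det = s0 * s2 - s1 * s1
        b0 = (s2 * t0 - s1 * t1) / det
        b1 = (s0 * t1 - s1 * t0) / det
    return b0, b1

# Made-up grid-square data: SMS counts, total-building counts, and the
# offset log(total + 1) used in the post.
sms = [0.0, 2.0, 1.0, 5.0, 3.0]
total = [10.0, 50.0, 20.0, 80.0, 40.0]
offset = [math.log(t + 1.0) for t in total]
# damaged counts generated from b0 = -1.5, b1 = 0.1, so the fit should
# recover those values (illustration only, not the real Haiti data)
damaged = [math.exp(-1.5 + 0.1 * s + o) for s, o in zip(sms, offset)]

b0, b1 = poisson_glm_offset(sms, damaged, offset)
```

Because the offset's coefficient is pinned at one rather than estimated, the model asks how the damage rate per building varies with the SMS count, which is exactly the question at issue.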

Next, I suspect that areas of high building density have a higher percentage of damaged buildings than areas of low building density. Imagine that in a dense area, one building falling could cause damage to others, whereas in a less dense area, this would be less likely to happen. To attempt to control for this, I ran another regression that includes an additional covariate: the total number of buildings in the square. That is,

Damaged Buildings ~ Poisson( exp{ b0 + b1 * SMS + b2 * Total Buildings + log(Total Buildings + 1) } ). (Model 2)

The results from Model 2 show that the number of text messages is not significant at the magical 0.05 level.

glm(formula = damcounts ~ allcounts + offset(log(allcounts +
    1)) + textcounts, family = poisson(link = "log"))
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-16.2803   -2.6896   -0.5842    1.3627   19.3989
              Estimate Std. Error z value Pr(>|z|)  
(Intercept) -1.768e+00  1.159e-02 -152.58   <2e-16 ***
allcounts    3.794e-04  1.792e-05   21.18   <2e-16 ***
textcounts  -1.851e-03  1.006e-03   -1.84   0.0657 .
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 23336  on 899  degrees of freedom
Residual deviance: 22889  on 897  degrees of freedom
AIC: 25817

Lastly, and I won't show the output this time: if we ignore the offset completely and regress the number of damaged buildings on the total number of buildings, the square root of the total number of buildings, and the number of text messages, we find that the coefficient on the number of text messages has a p-value of .41-- far from significant, even at the 80% level. The rationale for the square root term was simply that some exploratory data analysis suggested it might be a good predictor of the number of damaged buildings. From a geometric point of view, if the streets within a square are themselves arranged in a grid, the square root would be approximately the average number of buildings per street in that square and could serve as a rough proxy for density.
For the non-statisticians in the crowd, what this means is that given just the number of  buildings in a square, the number of text messages sent from within that square  is not an important factor in determining the number of damaged buildings! So, although text messages may be useful in identifying locations with buildings, if you already know where the buildings are, the text messages are not particularly useful (in this particular case) for figuring out how many of those buildings are damaged. Assuming that a crisis response team could more quickly access maps of building density than even the SMS data, ignoring the SMS data could lead to an even faster and cheaper response in this case.

At this point, if you are paying careful attention, you may think that I've missed the point. We did already show that for small radii, text messages are not correlated with building damage. The roughly 0.15km scale within each grid box is well below the distances at which the original analysis found any relationship between text messages and building damage. We already knew that, but I think this is a more formal way of making the point that building locations alone may be enough to find damaged buildings.

To conclude, one of the main advantages presented in the blog post was how much time and money using SMS messages to find damaged buildings could save. Crowdsourced data may have its uses, but for finding damaged buildings for the case in Haiti, I’d like to propose an even cheaper alternative: a few statisticians, a map, and some coffee.

**** Data obtained from UNITAR/UNOSAT.