Monday, May 4, 2015

The best counties to raise high achieving, low earning children

This topic was suggested by James Johndrow and this blog is co-written with him. 

Today I came across this beautiful, interactive map on the New York Times that tells you how much financial advantage is gained by spending your childhood in each county in the US. This map goes so far as to report expected gains (or losses) by each county at each of several parental income quantiles. Always a sucker for beautiful, interactive maps, I dove right in and had a look at the county where I grew up:  Placer County, CA.

From this map, kids who grow up in households with income at the 25th percentile should, on average, expect to earn about $2,640 more than kids from similarly earning families in an "average" town. (Yes! High air fist pump for Placer County and income mobility and all of that.) If you toggle the percentile button to 75% to look at the same statistic for kids from higher earning families, the picture changes entirely. At age 26, kids from families with earnings in the 75th percentile earn about $2,500 less than if they'd grown up in an "average" town. 

This got me thinking... what was I doing when I was 26? Well, I had just finished my PhD the year before. I had taken a postdoc in Rio de Janeiro, Brazil and, thoroughly burned out, I had promptly quit. So, at age 26 I was earning approximately zero actual dollars per year but was gaining invaluable life experience (or something). Sorry, Placer County. I think I brought that average down. What was my best friend doing? She was still in medical school. Other friends were just finishing law school. 

My guess as to what is going on at the higher percentiles is that a non-negligible percentage of children from high earning families are still in some form of higher education at age 26. One way to see if this is a potential culprit is to actually read the paper. I did that (it was painful) and I don't think they are controlling for enrollment in higher education in these estimates. One thing the authors did note is that children from certain areas tend to marry later in life, and as a result family income at 26 will be lower. So that may also be part of what's going on here. But the "still in school" hypothesis was more interesting to me, so naturally I started digging into details.

A fun way to investigate this is to propose several places that have an "academic culture"-- places where the higher income people are likely to be professors or physicians. Going out on a limb, if it's true that those people's kids are more likely to still be in school at age 26, we should see negative effects at the higher quantiles in those places. We looked at Durham Co, NC (Duke University), Orange Co, NC (UNC Chapel Hill), Tompkins Co, NY (Cornell), and Montgomery Co, VA (Virginia Tech). These are places that we have personal experience with and know fit those criteria. In each case, we found that this map estimates a big economic disadvantage at age 26 to having grown up in these places if your household is in the upper quantiles. The same was true of Washtenaw Co, MI (University of Michigan), Hampshire Co, MA (UMass, Amherst, Hampshire, Smith, Mount Holyoke), Missoula Co, MT (University of Montana), Santa Cruz Co, CA (UC Santa Cruz), and Los Alamos Co, NM (Los Alamos National Labs). 

Another interesting example is Westchester Co, NY. Going entirely on stereotypes here, I'd expect that many of the high earners there work in finance, a field that historically hasn't required graduate school. And, if we again assume that apples don't fall far from trees and kids there are likely to go into the same field, that would explain the huge economic advantage at age 26 for those who grew up in high income families in Westchester Co. They're earning a lot and they aren't still in school. Here's a screenshot of the map for the New York area:

Another interesting thing you can see on this same map is how "bad" Manhattan apparently is for children in families even in the top one percent of income (I think about $500,000/year but don't quote me on that). Maybe children of wealthy families who grew up in New York are more likely to be in grad school or traveling or something at 26. But that isn't necessarily "bad," which is how the graphic depicts it.

Similar economic disadvantage apparently exists for well-off kids in some other places that, while not dominated by colleges, have a sort of academic or artsy reputation. Here I'm thinking of Travis Co, TX (UT but also "keep Austin weird"), Marin Co, CA (hippies and artists and well-to-do anti-vaxxers), and, just across the bay, San Francisco Co, CA (need I label this place other than "awesome"?). And a final, bizarre example is Hamilton Co, IN, which as far as I can tell doesn't fit the "artsy" or "nerdy" bucket and was apparently named "America's best place to raise a family" by Forbes.com in 2008 (I got this from Wikipedia), yet is apparently "very bad" for children in upper income families. By the way, you can check out a bunch of screenshots of the maps for these places below.  

These examples are consistent with our hypothesis that, in some cases, the decreased "expected income" is a function of the fact that many people who end up earning higher amounts of money are still in school at age 26. So, while I don't think this map is technically wrong, I think it is misleading. To a reader who doesn't think carefully about it, it implies that the more academically focused counties cause kids who grow up there to have lower earnings generally. While there are rogue wunderkind techies who make their millions by the time they are 26 without higher education, they are certainly the exception rather than the rule. Many people who are going to end up at the upper end of the income distribution are still in school at age 26... much like the senior investigators of this project likely were, given that they hold PhDs. 


Thursday, June 12, 2014

Hurricanes with Feminine Names are Deadlier?

I've been seeing a lot of internet chatter lately about a recent study published in the Proceedings of the National Academy of Sciences that concludes that hurricanes with female names kill more people essentially because people bring their preconceptions about females to the table when evaluating the danger associated with the hurricane. Read: people tend to think of hurricanes with female names as wussier than those with male names and make evacuation and safety-related decisions accordingly.

I've also been seeing a lot of blowback and objections to this study, in many cases on the grounds that naming conventions have changed over the years. From 1952 to 1978, all storms were given female names (e.g., Barbara, Florence, Carol, etc.) obviously because hurricanes are very interested in keeping their fingernails well-manicured and enjoy watching Say Yes to the Dress with their girlfriends while drinking skinny margaritas, thus conforming more to female than male stereotypes. I'll just go with that... From 1979 on, they began alternating female and male names (e.g., Gloria, Juan, Kate). Pre-1952, I don't even know what they were doing (e.g. Easy, King, Able)... adjectives? improper nouns?

Anyway, in this study, the researchers had volunteers rate the perceived masculinity/femininity of each name. This masculinity/femininity index (MasFem in their dataset, which you too can download here!) was then thrown into a negative binomial regression with some other shit relating to how strong the hurricane was, and out pops some significant coefficients on the MasFem index and its interactions with how strong the storm is.  Declare victory!! Publish paper!!!!!!!!!!!!

But, hold it there, cowboy/cowgirl. What if the effects we are seeing are due to the fact that earlier storms both tended to have female names and also tended to kill more people, even after accounting for their severity, due to the fact that people in the 60s and 70s were dirty hippies, too stoned from their marijuana cigarettes to take cover? Or the technology to issue early warnings wasn't as good. Whatever.

One quick and dirty way to address that concern would be to throw some additional variables into their model that allow for different effects of storm severity pre- and post-1979. I just made an extra indicator variable for whether we were talking about the "early" or "late" hurricanes and re-ran the model, now including my early indicator, its interaction with the two severity variables, and everything else that was in the authors' original model 4 in Table S2.
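In R, this is just a couple of lines on top of a negative binomial fit. A minimal sketch, assuming dat is the authors' dataset loaded as a data frame (the column names below match their Table S2 variables, plus a Year column):

    library(MASS)  # glm.nb for the negative binomial regression

    # "early" = 1 for storms named under the all-female convention (pre-1979)
    dat$early <- as.numeric(dat$Year < 1979)

    # the authors' model 4 (Table S2), plus the early indicator and its
    # interactions with the two severity variables
    fit <- glm.nb(alldeaths ~ early + ZMinPressure_A + ZNDAM + ZMasFem +
                    early:ZMinPressure_A + early:ZNDAM +
                    ZMinPressure_A:ZMasFem + ZNDAM:ZMasFem,
                  data = dat)
    summary(fit)

We get this...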

                         Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)               2.32571     0.16101   14.445   < 2e-16 ***
early                     0.23382     0.26461    0.884  0.376876
ZMinPressure_A           -0.75225     0.2143    -3.51   0.000448 ***
ZNDAM                     0.25203     0.16283    1.548  0.121663
ZMasFem                   0.06258     0.1319     0.474  0.6352
early:ZMinPressure_A      0.34418     0.34483    0.998  0.318231
early:ZNDAM               1.31048     0.32861    3.99   6.66e-05 ***
ZMinPressure_A:ZMasFem    0.25096     0.17374    1.444  0.148602
ZNDAM:ZMasFem             0.28025     0.15485    1.81   0.070321 .

Yeah, so the "significance" of the MasFem index almost entirely disappears under this scenario. Only a ghost of it remains in its interaction with a variable representing the damage caused by the hurricane-- one of those severity-related variables I keep mentioning.

This analysis is pretty unsatisfying, though. If you look at the model, it assumes that pre-1979, the effect of wind speed and damage on the number of deaths caused by a hurricane is different from the effect post-1979. There's nothing magical about 1979 that would make us believe that the relationship between wind speed and the number of deaths should suddenly change. The only thing that changed in that year was the naming convention. I'll note that the authors mention that they tried including a linear trend in time in their model but it wasn't significant. But there are a couple of problems with this. First, the time trend should have been interacted with the severity variables. Second, who says the trend is linear? At that point we get into polynomial trends and so forth, and we're starting to run a little low on degrees of freedom.

Instead, let's approach this in an unconventional but more satisfying way that respects the real process as we understand it. The authors are trying to test the hypothesis that, all else being equal, storms with feminine names kill more people than storms with masculine names. Critics point out that the masculinity/femininity of the name is highly correlated with time, and that the time period during which the storm occurred probably does matter. The quick and dirty model with the "early" variable gives some indication that this is true. Also, I'm not showing it here for brevity, but the linear time trend interacted with one of the severity variables was highly significant when I added that into the model instead of the "early" indicator.

If it were indeed the case that the explanatory power of the femininity index is driven entirely by the time period's naming convention, then the estimates obtained for the authors' model using the real data should be roughly equivalent to estimates obtained using the same model applied to a dataset in which the MasFem indices are simulated according to the naming convention of the time but otherwise random.

I simulate alternate versions of history in which each hurricane in the dataset is randomly assigned a new name, the only caveat being that the name it is assigned must be from the correct epoch, i.e. a hurricane from 1965 may be reassigned the name of a hurricane from 1975 but not the name of a hurricane from 1985, as 1965 is pre-1979 and 1985 is post-1979. This creates a new dataset in which MasFem is conditionally independent of the number of deaths given the epoch. Thus, in each of these alternate histories, any detected relationship between MasFem and the number of deaths (a significant non-zero coefficient on MasFem) is only present because the death toll is related to the epoch of the hurricane and the femininity of the name is also related to the epoch of the hurricane.

This creates a simulated null distribution against which we can test whether the estimated coefficients from the real data are likely different from what we would expect if the death toll were conditionally independent of MasFem given the epoch and the associated naming convention of the time in which the hurricane took place. If you're familiar with graphical model representations of dependence, we are re-sampling new datasets from a model in which the epoch points to both the femininity of the name and the death toll, with no direct edge between those two.
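Here's a minimal sketch of one way to run this simulation in R. It assumes dat and the variable names from above, and that fit_real (a made-up name) holds the model 4 specification fit to the real data:

    library(MASS)
    set.seed(1)

    n_sims <- 1000
    keep   <- c("ZMasFem", "ZMinPressure_A:ZMasFem", "ZNDAM:ZMasFem")
    coefs  <- matrix(NA, n_sims, length(keep), dimnames = list(NULL, keep))

    pre <- dat$Year < 1979
    for (s in 1:n_sims) {
      sim <- dat
      # shuffle the femininity scores within epoch only, so that MasFem is
      # conditionally independent of the death toll given the epoch
      sim$ZMasFem[pre]  <- sample(sim$ZMasFem[pre])
      sim$ZMasFem[!pre] <- sample(sim$ZMasFem[!pre])
      fit_s <- glm.nb(alldeaths ~ ZMinPressure_A * ZMasFem + ZNDAM * ZMasFem,
                      data = sim)
      coefs[s, ] <- coef(fit_s)[keep]
    }

    # empirical p-values: how often a simulated coefficient is at least as
    # large as the one estimated from the real data
    colMeans(sweep(coefs, 2, coef(fit_real)[keep], `>=`))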


From these simulations, we can calculate a kind of empirical p-value that tells us whether the effect the authors find using the original model would be surprisingly large relative to what we should expect if the relationship between how deadly a hurricane is and how feminine its name is were controlled entirely by the epoch in which the hurricane occurs. The distributions of the three coefficients that form the basis for the conclusion that name femininity is related to the number of deaths are shown below.


Each histogram shows the empirical distribution of the estimated coefficients under the assumption that the relationship between deadliness and femininity only exists because both are related to the time period of the hurricane. The yellow dot shows the estimated coefficient in the real data. Here, we see that the coefficient on femininity is pretty similar to what we would expect under our null hypothesis-- in fact, in 35% of the simulated datasets, the estimated effect of femininity was larger than that discovered in the original paper. That is, if it's true that deadliness is conditionally independent of femininity given epoch, we would expect to estimate an even stronger relationship about 35% of the time. The other two coefficients, those on femininity interacted with minimum pressure and femininity interacted with total damage, show slightly more promising effects. In each of these cases, the estimated effect in the real data is in roughly the 88th percentile. Again, not especially convincing.

Taking all of this on its own, I'd happily conclude that we shouldn't start giving menacing names to hurricanes-- like DeathScourgeMonster-- as I've seen suggested around the internet. This analysis, however, is only one piece of evidence among a larger body of evidence presented in the paper. The other studies they present in the paper seem more sound and do support the possibility that people underestimate the danger associated with female-named hurricanes. Taken all together, I'd actually be more modest than most of the statistical witch hunters I've run across on the topic and say that I think the jury is still out on this one. They may be on to something.... maybe.

-- This analysis brought to you by the letter J, for James Johndrow, who helped write this post.

Thursday, June 13, 2013

Getting set up to call C from R (and Rstudio) on my Mac

I've recently begun to consider the possibility of using C to speed up some of my R code. This is a big step.

Now, don't get me wrong. I'm not turning into one of those people who complain about how horrendously slow and clunky R is because, for the most part, I think they're probably just not using R well. I am the queen of the {l,s,t,}apply. I have oil paintings of the phrase "Avoid loops at all cost" all about my abode. But sometimes, even if you are an apply-ninja, try as you might, you just can't avoid a loop. (Damn you, MCMC!)

So, I embarked upon the journey to set up my computer to compile C in such a way that I can call that code in R. It seems like most tutorials assume that you already use C and compilers and whatnot. So, for those of us who are using this as a gateway to more hardcore programming rather than vice versa (i.e. this is your first time venturing into the world of compilers, command line prompts, etc.), (1) congratulations; and (2) the following is a detailed set of steps for setting up your computer so that C and Rstudio can play nice together, all communicated without the typical condescension you are met with in R forums in response to basic questions (to which I say ...).

First, here are some relevant details about my computer:

MacBook Pro

Processor  2.4 GHz Intel Core i7 
Memory  8 GB 1333 MHz DDR3
Software  OS X 10.8 (12A269)

Rstudio (type version into Rstudio to see this)
platform       x86_64-apple-darwin9.8.0     
arch           x86_64                       
os             darwin9.8.0                  
system         x86_64, darwin9.8.0          
status                                      
major          2                            
minor          15.2                         
year           2012                         
month          10                           
day            26                           
svn rev        61015                        
language       R                            
version.string R version 2.15.2 (2012-10-26) (Yeah, yeah, I'll update...)
nickname       Trick or Treat 

Now that that's out of the way, on to the main event!


 Install Xcode AND (and this is important) install the gcc/LLVM compiler. (follow this link)

If you do not complete the second step, you will probably get errors like the following (even if you are using the inline package):

    /Library/Frameworks/R.framework/Resources/include/R.h:32:18: error: math.h: No such file or directory

    /Library/Frameworks/R.framework/Resources/include/R.h:29:20: error: stdlib.h: No such file or directory

    /Library/Frameworks/R.framework/Resources/include/R.h:30:73: error: stdio.h: No such file or directory

At this point, if you have some C code ready to go, you could compile it and run it in R (not R64).
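If you don't yet have C code handy, here's a minimal end-to-end sketch you can paste into the R console (foo.c and dbl_vec are made-up names for illustration; this assumes R is on your path so that R CMD works, and that you're in a writable directory):

    # write a tiny C function to a file
    writeLines(c(
      "#include <R.h>",
      "void dbl_vec(double *x, int *n) {",
      "  int i;",
      "  for (i = 0; i < *n; i++) x[i] = 2 * x[i];",
      "}"
    ), "foo.c")

    # compile to a shared object (foo.so) and load it into R
    system("R CMD SHLIB foo.c")
    dyn.load("foo.so")

    # .C returns a list of the (possibly modified) arguments
    .C("dbl_vec", x = as.double(c(1, 2, 3)), n = as.integer(3))$x
    # [1] 2 4 6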
Rstudio uses the x86_64 architecture, so if you try to load your compiled .so file using Rstudio, you will probably get something like:

    no suitable image found.  Did find: /Users/you/your_compiled_file.so: mach-o, but wrong architecture

I found the solution to this problem here. In short, go to your terminal and paste in the following:

 vi /Library/Frameworks/R.framework/Resources/bin/R  

You should see a file that begins with
    #!/bin/sh
    # Shell wrapper for R executable.

    R_HOME_DIR=/Library/Frameworks/R.framework/Resources
Type ":46" (no quotation marks) and press enter. This will take you to the 46th line, where you will see something like
    # Since this script can be called recursively, we allow R_ARCH to
    # be overridden from the environment.
    # This script is shared by parallel installs, so nothing in it should
    # depend on the sub-architecture except the default here.
    : ${R_ARCH = `arch`}
Press "i" to insert and change  this to
    # Since this script can be called recursively, we allow R_ARCH to
    # be overridden from the environment.
    # This script is shared by parallel installs, so nothing in it should
    # depend on the sub-architecture except the default here.
    : ${R_ARCH = /x86_64}
Save and quit by pressing Esc, then typing ":x", then pressing Enter.

So, now you're basically ready to go! I'll let the experts take it from here... you can follow any of the many tutorials that are around. For example, this one.  
Try adding #include <R.h> or #include <stdio.h> to the top of your .c file (foo.c in the above tutorial) to make sure everything's up and running!
Hopefully you're now all set up to write some awesome code for R!



Thursday, January 17, 2013

How many eligible universities for the Google US/Canada Fellowship?

Google has offered this fellowship the last few years to badasses in lots of Googly fields. Thing is, only "eligible schools" are allowed to nominate two students... and this list of eligible schools is apparently super top secret. Wikipedia doesn't even know, so the information probably doesn't exist. 

Why do I care, you might wonder. To get a sense of the chances that someone gets picked, given they've made it past their university's nomination stage, duh. Imagine my frustration at not being able to find the necessary denominator!

All I can find is a list of past fellows and their institutions. This seems like the perfect opportunity to whip out some stats magic (and the most magical of stats methods, at that) to make a guess. So, here's the plan. I'm going to look at the universities that got picked each year (if more than one student from the same university is picked in a given year, that university just gets counted once) to make this estimate.  

Capture-recapture. Multiple systems estimation. Two names for one of the more surprisingly cool uses of a simple glm. The idea is that if you have several lists of a finite group of items, by looking at the overlaps among the lists, you can estimate the total number of items. In this case, items are eligible institutions. In other cases, items are fish in ponds.
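To make this concrete, here's a toy sketch in R of the log-linear version with three lists. The counts here are made up for illustration, not the actual fellowship data; each row is a pattern of which years a university was picked, and the model predicts the count for the never-picked pattern:

    # capture histories across three lists (e.g. three fellowship years);
    # n = number of universities observed with each history (made-up numbers)
    hist_dat <- data.frame(
      y1 = c(1, 1, 1, 1, 0, 0, 0),
      y2 = c(1, 1, 0, 0, 1, 1, 0),
      y3 = c(1, 0, 1, 0, 1, 0, 1),
      n  = c(5, 3, 2, 4, 2, 3, 4)
    )

    # independence log-linear model: a simple Poisson glm on the counts
    fit <- glm(n ~ y1 + y2 + y3, family = poisson, data = hist_dat)

    # predicted count for the unobserved (0,0,0) cell: eligible schools
    # that were never picked
    n00 <- predict(fit, newdata = data.frame(y1 = 0, y2 = 0, y3 = 0),
                   type = "response")
    sum(hist_dat$n) + n00   # estimated total number of eligible schools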

As stats methods are wont to do, this one has some built-in assumptions: 
(1) the list of eligible schools doesn't change throughout the years 
(2) each eligible university has the same chance of being picked 
(3) picking a university in one year doesn't affect the chances of being picked in the other years 
(4) the universities have unique names and can be identified as the same (or different) from one year to the next

Most of these assumptions probably aren't true. I expect they've added schools to the list over the years, and it seems as though some schools have a better chance of being picked each year than others (Stanford got picked every year, but Purdue only got picked once). (3) is reasonable-- maybe they don't like to pick the same places twice in a row... or maybe the good people come in streaks? No clue here. (4) Check. 

But if they're approximately true, maybe that's good enough. So, let's just run with it, ignoring the possible modifications that could be made to remedy the likely infractions....

Data. Code. doot doot doot doot doot doot doot..... Results...

I think there are probably between 28 and 38 eligible universities with a point estimate of 31. That's anywhere between 2 and 12 more than the 26 that have already been picked. 

Seems like the chances are pretty decent once you get to that stage. Aaaaaaaand I'm satisfied. Back to real work. 

Who am I kidding? Back to watching old episodes of One Tree Hill... 






Monday, December 5, 2011

I'm baaaa-aaaack.

You probably didn't realize I was gone. That's ok. Just pretend like you missed me.

Anyway, I'm fresh off of a quarter-life crisis induced year of international wanderings. In case you are wondering (again, just pretend), after my postdoc in Brazil, I stayed for a while and did whatever you do on the beach (absolutely nothing) for a few months. Then off to Germany for some climbing, followed by Colombia and Ecuador, and then up to California. Back down to Peru to do the Inca Trail with some buddies, and then a roadtrip around Europe in a rented car. (Sentence fragments aren't bad if you're blogging. Promise.) 

So, in the interest of undoing some of the brain atrophy I've experienced over the last year, expect to see a new post every once in a while.

And in other news, I am moving my blog to my vanity site, www.kristianlum.com/KLdivergence

Sunday, April 3, 2011

Brasil ranks 31st out of 44 in English proficiency

A few months ago, I did a post about my guess that someone whose first language is widely spoken would be less likely to speak English than someone whose first language is relatively obscure. It looks like I've been outdone.

English First has done a study that assesses the English proficiency of adults in various countries. From this, they have put together an English proficiency index and made some pretty nifty maps and plots.

The English First folks also investigated the same phenomenon that I did in my post. Clearly they have a much bigger budget (greater than $0) for doing these sorts of things, and they didn't just cull their data from Wikipedia, so I tend to go with what they say. Good thing their results support my own-- again, that people whose first language is shared by many are less likely to speak English. However, the relationship they found was "weak." See below.

EF EPI

If you're upset by the fact that the relationship here appears to be in the opposite direction of that which I found earlier, don't be. I was looking at the negative log of the number of native speakers. Why I transformed the data like that, I don't actually remember, but rest assured that this is showing roughly the same thing. Of course, this isn't exactly the same thing, the most obvious reason being that they are looking at "English proficiency", whereas I was looking at the "percent of English speakers."

They also compare English proficiency to various other variables they believe should be related, such as the value of exports per capita, the average number of years of schooling, and gross national income per capita. All of these had a stronger relationship to English proficiency than did the native-speakers variable.

One last mildly interesting nugget of information, which was mentioned in the Brazilian article that pointed me to the English First study and website, is that all of the BRIC countries fall right in line. China, India, Brazil, and Russia took the 29th, 30th, 31st, and 32nd spots respectively. The article also pointed out that, although Brazil did not do so well in this worldwide ranking, at least it beat Venezuela and Chile!


Sunday, March 27, 2011

The Anne Hathaway Effect

I recently stumbled upon this article in the Huffington Post which claims that every time Anne Hathaway gets a lot of Internet attention (for releasing a movie, hosting the Oscars, or what have you), the stock price for Berkshire Hathaway shoots up. The author, Dan Mirvish, justifies the plausibility of this by saying that "My guess is that all those automated, robotic trading programming are picking up the same chatter on the internet about "Hathaway" as the IMDb's StarMeter, and they're applying it to the stock market." 


The data they use to support the claim:
Oct. 3, 2008 - Rachel Getting Married opens: BRK.A up .44%
Jan. 5, 2009 - Bride Wars opens: BRK.A up 2.61%
Feb. 8, 2010 - Valentine's Day opens: BRK.A up 1.01%
March 5, 2010 - Alice in Wonderland opens: BRK.A up .74%
Nov. 24, 2010 - Love and Other Drugs opens: BRK.A up 1.62%
Nov. 29, 2010 - Anne announced as co-host of the Oscars: BRK.A up .25%


I think the first commenter put it well when s/he said 
"First!"
Nah, just kidding. Here's what they really said:
This is junk statistics if I've ever seen it. There may be something to the automated trading idea, but these data are proof of nothing. How about the hundreds of other times Ms. Hathaway was in the news and the stock didn't rise so dramatically? How volatile is this stock normally? Are these percentage increases anything out of the ordinary?
Exasperated, I decided to do a quick test. I downloaded the BRK.A data from Jan. 1, 2008 to Mar. 18, 2011 from YAHOO Finance and did a trivial analysis of it in Matlab. Just looking at the difference between open and close prices, the stock was up 0.25% or more 308 times over this period. The stock was up 2.61% or more 47 times over this period. Those two percentages are the lowest and highest in Mr. Mirvish's "data."
As a scientist and math lover I'm disappointed to see this story making the rounds with so little skepticism. It's a statement for the level of understanding of statistics and probability by the general public.
Looks like I'm not the only mathbuster out there. 


My first complaint about this (and backing up commenter number 1) is that, as someone who does not follow stocks at all, I have no idea if a .74% increase in BRK.A is anything notable. Having downloaded the stock prices since 2008 from Google Finance, I can tell you that it isn't. When Rachel Getting Married opened, the .44% increase was in the 68th percentile of changes in price... including negative changes. It was only in the 32nd percentile of positive changes. Even the biggest change of 2.61% is only in the 92nd percentile overall. Certainly not a tail event. Getting to the point, it's not like every time Anne Hathaway gets naked with Jake Gyllenhaal, the stock holders all go out and buy themselves a brand new G6. It's a pretty normal fluctuation. 
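For the curious, the percentile check is just a few lines in R. A sketch, assuming prices is a data frame of daily BRK.A closing prices in chronological order (the column name is made up):

    # daily percent changes, close to close
    chg <- 100 * diff(prices$Close) / head(prices$Close, -1)

    mean(chg <= 0.44)           # percentile among all daily changes (~68th)
    mean(chg[chg > 0] <= 0.44)  # percentile among positive changes (~32nd)
    mean(chg <= 2.61)           # even the biggest "Hathaway" move (~92nd)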

Over the period from 2008 to yesterday, the stock increased about 47% of the time. Since we are apparently completely disregarding the magnitude of the change, the probability of getting all positive changes when randomly selecting 6 dates out of the 828 trading days is quite small (.47^6, or about 1%). But what would be the chances of looking at, say, 10 different dates and finding that 6 or more of them are positive?? If we ignore the issue of replacement (which shouldn't be horribly important since the sample size is 828 and we are only sampling 10), the probability of getting exactly 6 is about 18%, and the probability of getting 6 or more is about 31%. 
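Those numbers are easy to verify in R, where dbinom is the binomial probability mass function:

    p_up <- 0.47                               # chance a given day closes up
    p_up^6                                     # all 6 of 6 up: ~0.01
    dbinom(6, size = 10, prob = p_up)          # exactly 6 of 10: ~0.18
    sum(dbinom(6:10, size = 10, prob = p_up))  # 6 or more of 10: ~0.31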


Given that the hypothesis is that the stock price is getting this little upward nudge because of Internet chatter, I checked out Google Trends to find other likely dates that the stock should increase under this hypothesis. Luckily, Google even shows you what the major news stories are on some of the major peaks, so it is easy to figure out the date.

Google Trends for Anne Hathaway
The top line is search volume and the bottom is news volume. They pick out many of the same spikes.

Two big peaks we see on here that haven't already been accounted for in the original post are B, "Anne Hathaway Proclaims Love For 'Family Guy,' 'Aqua Teen,' Fulfills Nerd Vision Of Idealized Woman," on February 23, 2009, and C, "Anne Hathaway spends spare time studying physics," on February 2, 2010. On these two dates, BRK.A saw a 1.82% and a .11% decrease, respectively. Further, on June 20, 2008, when the Los Angeles Times posted a story called "Anne Hathaway versus Jessica Alba" (resulting in the very visible spike in 2008-- I guess everyone likes a good ladyfight) and Get Smart opened, BRK.A fell .79%. And if we go back just a little bit further to December 9, 2005, the day that Brokeback Mountain had its major opening in the US, BRK.A dropped .07%. In fact, the sample correlation between Anne Hathaway's Internet search traffic and the price of BRK.A from 2008 to yesterday was just .01-- basically uncorrelated.** 


Given all of this, I'm really hoping that Dan Mirvish didn't run out and buy up a bunch of BRK.A hoping that his post would force the price up a bit. :) 


**This, of course, does not rule out the case that the fancy trading algorithms only act based on spikes in search volume, not normal activity, but just sayin'...