Thursday, June 13, 2013

Getting set up to call C from R (and RStudio) on my Mac

I've recently begun to consider the possibility of using C to speed up some of my R code. This is a big step.

 Now, don't get me wrong. I'm not turning into one of the people who complain about how horrendously slow and clunky R is because for the most part, I think they're probably just not using R well. I am the queen of the {l,s,t, }apply. I have oil paintings of the phrase "Avoid loops at all cost" all about my abode. But, sometimes, even if you are an apply-ninja, try as you might, you just can't avoid a loop. (Damn you, MCMC!)

So, I embarked upon the journey to set up my computer to compile C in such a way that I can call that code in R. It seems like most tutorials assume that you already use C and compilers and whatnot. So, for those of us who are using this as a gateway to more hardcore programming rather than vice versa (i.e. this is your first time venturing into the world of compilers, command line prompts, etc.), (1) congratulations; and (2) the following is a detailed set of steps for setting up your computer so that C and RStudio can play nice together, all communicated without the typical condescension you are met with in R forum responses to basic questions (to which I say ...).

First, here are some relevant details about my computer:

MacBook Pro

Processor  2.4 GHz Intel Core i7 
Memory  8 GB 1333 MHz DDR3
Software  OS X 10.8 (12A269)

RStudio (type version into RStudio to see this)
platform       x86_64-apple-darwin9.8.0     
arch           x86_64                       
os             darwin9.8.0                  
system         x86_64, darwin9.8.0          
status                                      
major          2                            
minor          15.2                         
year           2012                         
month          10                           
day            26                           
svn rev        61015                        
language       R                            
version.string R version 2.15.2 (2012-10-26) (Yeah, yeah, I'll update...)
nickname       Trick or Treat 

Now that that's out of the way, on to the main event!


Install Xcode AND (and this is important) install the gcc/LLVM compiler (follow this link).

If you do not complete the second step, you will probably get errors like the following (even if you are using the inline package):

    /Library/Frameworks/R.framework/Resources/include/R.h:32:18: error: math.h: No such file or directory

    /Library/Frameworks/R.framework/Resources/include/R.h:29:20: error: stdlib.h: No such file or directory

    /Library/Frameworks/R.framework/Resources/include/R.h:30:73: error: stdio.h: No such file or directory

At this point, if you have some C code ready to go, you could compile it and run it in R (not R64).
RStudio uses the x86_64 architecture, so if you try to load your compiled .so file using RStudio, you will probably get something like:

    no suitable image found.  Did find: /Users/you/your_compiled_file.so: mach-o, but wrong architecture
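
If "wrong architecture" is a new idea, it just means the .so was built for a different kind of processor mode (32-bit vs. 64-bit) than the program trying to load it. Here is a minimal sketch of a way to see which one your compiler is targeting; the function names are made up for illustration, and this only tells you about the compiler, not about R itself.

```c
#include <limits.h>
#include <stdio.h>

/* Number of bits in a pointer for the target this file was compiled for:
   typically 32 on i386 and 64 on x86_64. */
int pointer_bits(void)
{
    return (int)(sizeof(void *) * CHAR_BIT);
}

/* Hypothetical helper: print the result so it can be checked by hand. */
void report_pointer_bits(void)
{
    printf("compiled for a %d-bit target\n", pointer_bits());
}
```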

I found the solution to this problem here. In short, go to your terminal and paste in the following:

    vi /Library/Frameworks/R.framework/Resources/bin/R

You should see a file that begins with
    #!/bin/sh
    # Shell wrapper for R executable.

    R_HOME_DIR=/Library/Frameworks/R.framework/Resources
Type ":46" (no quotation marks) and press enter. This will take you to the 46th line, where you will see something like
    # Since this script can be called recursively, we allow R_ARCH to
    # be overridden from the environment.
    # This script is shared by parallel installs, so nothing in it should
    # depend on the sub-architecture except the default here.
    : ${R_ARCH=`arch`}
Press "i" to enter insert mode and change this to
    # Since this script can be called recursively, we allow R_ARCH to
    # be overridden from the environment.
    # This script is shared by parallel installs, so nothing in it should
    # depend on the sub-architecture except the default here.
    : ${R_ARCH=/x86_64}
(Make sure there are no spaces around the "=" sign; the shell rejects the line otherwise.)
Save and quit by pressing Esc, then typing ":x" and pressing Enter.

So, now you're basically ready to go! I'll let the experts take it from here... you can follow any of the many tutorials that are around. For example, this one.  
Try adding #include <R.h> or #include <stdio.h> to the top of your .c file (foo.c in the above tutorial) to make sure everything's up and running!
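
If you don't have any C code lying around, here is a minimal sketch of the kind of function you can call from R with .C(). It is my own toy example (running_sum is a made-up name, not from any tutorial): a cumulative sum, where each value depends on the one before it. The shape is the important part: .C() expects a function that returns void and takes all of its arguments as pointers.

```c
/* Not needed by this function; included only to check that the
   compiler can find the standard headers. */
#include <stdio.h>

/* Cumulative sum of the first *n values of x, written into out.
   Everything is a pointer and the return type is void, which is the
   shape R's .C() interface expects. */
void running_sum(double *x, int *n, double *out)
{
    int i;
    double total = 0.0;

    for (i = 0; i < *n; i++) {
        total += x[i];   /* each step depends on the previous one */
        out[i] = total;
    }
}
```

Assuming you save this as running_sum.c (a hypothetical file name), running R CMD SHLIB running_sum.c in the terminal should build running_sum.so. From R, dyn.load("running_sum.so") followed by .C("running_sum", as.double(c(1, 2, 3)), as.integer(3), out = double(3))$out should then give back 1 3 6.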
Hopefully you're now all set up to write some awesome code for R!



Thursday, January 17, 2013

How many eligible universities for the Google US/Canada Fellowship?

Google has offered this fellowship the last few years to badasses in lots of Googly fields. Thing is, only "eligible schools" are allowed to nominate two students... and this list of eligible schools is apparently super top secret. Wikipedia doesn't even know, so the information probably doesn't exist. 

Why do I care, you might wonder. To get a sense of the chances that someone gets picked, given they've made it past their university's nomination stage, duh. Imagine my frustration at not being able to find the necessary denominator!

All I can find is a list of past fellows and their institutions. This seems like the perfect opportunity to whip out some stats magic (and the most magical of stats methods, at that) to make a guess. So, here's the plan. I'm going to look at the universities that got picked each year (if more than one student from the same university is picked in a given year, that university just gets counted once) to make this estimate.  

Capture-recapture. Multiple systems estimation. Two names for one of the more surprisingly cool uses of a simple glm. The idea is that if you have several lists of a finite group of items, by looking at the overlaps among the lists, you can estimate the total number of items. In this case, items are eligible institutions. In other cases, items are fish in ponds.
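
The overlap idea is easiest to see with just two lists. If list 1 has n1 items, list 2 has n2 items, and m items show up on both, then the share of list 2 that was "recaptured" (m / n2) should be roughly the share of the whole population that list 1 covered (n1 / N), which gives N ≈ n1 * n2 / m. Here is a minimal sketch of that two-list version (the classic Lincoln-Petersen estimator), not the glm fit over all the years that I actually used; the function name is made up for illustration.

```c
/* Two-list Lincoln-Petersen estimate of the total population size.
   n1: number of items on the first list
   n2: number of items on the second list
   m:  number of items that appear on both lists
   Returns -1.0 when there is no overlap, since the estimate is
   undefined in that case. */
double lincoln_petersen(int n1, int n2, int m)
{
    if (m <= 0) {
        return -1.0;
    }
    return ((double)n1 * (double)n2) / (double)m;
}
```

With made-up counts (not the fellowship data) of 12 schools one year, 10 the next, and 4 appearing in both, lincoln_petersen(12, 10, 4) gives 30.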

As stats methods are wont to have, there are some built in assumptions: 
(1) the list of eligible schools doesn't change throughout the years 
(2) each eligible university has the same chance of being picked 
(3) picking a university in one year doesn't affect the chances of being picked in the other years
(4) the universities have unique names and can be identified as the same (or different) from one year to the next

Most of these assumptions probably aren't true. (1) I expect they've added schools to the list over the years. (2) It seems as though some schools have a better chance of being picked each year than others (Stanford got picked every year, but Purdue only got picked once). (3) might be reasonable, but maybe they don't like to pick the same places twice in a row... or maybe the good people come in streaks? No clue here. (4) Check.

But if they're approximately true, maybe that's good enough. So, let's just run with it, ignoring the possible modifications that could be made to remedy the likely infractions....

Data. Code. doot doot doot doot doot doot doot..... Results...

I think there are probably between 28 and 38 eligible universities with a point estimate of 31. That's anywhere between 2 and 12 more than the 26 that have already been picked. 

Seems like the chances are pretty decent once you get to that stage. Aaaaaaaand I'm satisfied. Back to real work. 

Who am I kidding? Back to watching old episodes of One Tree Hill...