The main goal of this GSoC project is to optimize the speed of the iregnet package so that it is as fast as the glmnet package.
iregnet is the first R package to support:
· general interval output data (including left and interval censoring; not just observed and right-censored data typical of survival analysis),
· elastic net (L1 + L2) regularization, and
· a fast glmnet-like coordinate descent solver.
The iregnet package is already useful for making predictions in data sets with possibly censored observations. After this GSoC project, its model fitting code will be even faster.
Before coding, I used Google's gperftools to profile the iregnet code, with pprof as the analysis tool. The profiles showed that most of the time can be saved by using BLAS routines instead of for loops for matrix and vector computations. All the profiler results and analysis can be found here.
The coding plan is to rewrite the slow parts. The optimization is divided into two aspects.
Computational Process
Five parts of the computational process can be optimized:
1) Use vector arithmetic in the coordinate descent algorithm
2) Skip coordinate j if it has converged
3) Only calculate the updated part in the coefficient loop during coordinate descent
4) Save the result of X
5) Rewrite the reweighting step for each distribution and censoring type
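To make some of these ideas concrete, here is a minimal sketch of cyclic coordinate descent for an elastic net penalized least squares problem. It is not the iregnet implementation: the function name elastic_net_cd, its parameters, and the column-wise storage of X are hypothetical, and it handles only uncensored Gaussian data. It uses vector operations (std::inner_product) in place of hand-written accumulation, caches the squared column norms of X once (one possible reading of item 4, which is an assumption on my part), and skips the residual update whenever a coefficient did not change.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

using Vec = std::vector<double>;

// Soft-thresholding operator: sign(z) * max(|z| - g, 0).
double soft_threshold(double z, double g) {
    if (z > g) return z - g;
    if (z < -g) return z + g;
    return 0.0;
}

// Hypothetical helper (not part of iregnet): cyclic coordinate descent for
//   (1/2n) * ||y - X b||^2 + lambda * (alpha * ||b||_1 + (1 - alpha)/2 * ||b||_2^2)
// X is passed as p columns, each of length n.
Vec elastic_net_cd(const std::vector<Vec>& cols, const Vec& y,
                   double lambda, double alpha,
                   int max_iter = 1000, double tol = 1e-10) {
    const std::size_t n = y.size();
    const std::size_t p = cols.size();
    Vec beta(p, 0.0);
    Vec resid(y);        // residual y - X * beta; beta starts at zero
    Vec col_sq(p, 0.0);  // cached (1/n) * x_j . x_j, computed once

    for (std::size_t j = 0; j < p; ++j) {
        assert(cols[j].size() == n);
        col_sq[j] = std::inner_product(cols[j].begin(), cols[j].end(),
                                       cols[j].begin(), 0.0) / n;
    }

    for (int iter = 0; iter < max_iter; ++iter) {
        double max_change = 0.0;
        for (std::size_t j = 0; j < p; ++j) {
            if (col_sq[j] == 0.0) continue;  // all-zero column: nothing to fit
            // (1/n) * x_j . r, using the residual maintained across updates
            const double grad = std::inner_product(cols[j].begin(), cols[j].end(),
                                                   resid.begin(), 0.0) / n;
            const double z = grad + col_sq[j] * beta[j];
            const double updated = soft_threshold(z, lambda * alpha) /
                                   (col_sq[j] + lambda * (1.0 - alpha));
            const double delta = updated - beta[j];
            if (delta == 0.0) continue;  // coefficient unchanged: skip the O(n) update
            for (std::size_t i = 0; i < n; ++i) resid[i] -= delta * cols[j][i];
            beta[j] = updated;
            max_change = std::max(max_change, std::fabs(delta));
        }
        if (max_change < tol) break;  // no coefficient moved: converged
    }
    return beta;
}
```

Keeping the residual up to date incrementally means each coordinate update costs O(n) rather than recomputing X * beta from scratch, and coefficients that stay at zero cost only one inner product per sweep.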
C++ linear algebra library
Eigen and Armadillo (with OpenBLAS) are both good choices for the C++ linear algebra library. Both perform well on matrix computations, and thanks to the Rcpp team, both can be used from R through RcppEigen and RcppArmadillo. After studying the source code, I found that most of the slow parts can be optimized with matrix computation. After a speed test between Eigen and Armadillo, we chose Armadillo as our C++ linear algebra library.
We can optimize iregnet in the ways mentioned above. More specific information about the coding plans and methods can be found in this link.
After the coding period of GSoC, iregnet is now faster than the previous version. I wrote a benchmark to compare the new and old versions of iregnet. The benchmark is divided into two parts: scaling across rows and scaling across columns. Each part covers two different distributions and four kinds of censoring types.
For example:
The following figure shows the test result for the gaussian distribution with the "none" censoring type. In this test, we use microbenchmark to measure the speed of glmnet, the new iregnet, and the old iregnet. After the speed test, I used ggplot to draw the benchmark figure.
As the figure shows, the new iregnet is faster than the previous version and closer to glmnet than before. More benchmark results are in this blog, which contains all the tests for the different censoring types, the different distributions, and both rows and columns.
Entire Benchmark link: http://rovervan.com/post/gsoc/iregnet-benchmark
During this year's winter vacation, I searched for C++ GSoC projects on GitHub and eventually found the projects of the R organization. I learned from R's GSoC home page that students should complete as many qualification tests as they can before contacting a mentor. I chose the project that interested me most and spent more time on its qualification tests.
After I finished all the qualification tests, I contacted my mentor Toby. Thanks to Toby's suggestions, I found a way to optimize the project with BLAS. That was a good direction, and I tried different BLAS implementations before I started writing my application.
When writing the application for the iregnet project, I already had many profiler results for iregnet. I had also found different ways to optimize iregnet and wrote them up as coding plans in my application. Each coding plan included specific profiler results to make sure the plan would work well. Toby gave me a lot of advice after I sent him my application, which was really helpful. In the end, I was fortunate to be selected for GSoC.
This summer, I started the project by importing RcppArmadillo into iregnet. Next, I rewrote the main procedure in iregnet_fit.cpp. After that, I spent time rewriting every distribution and its four kinds of censoring types. In the end, I wrote a specific benchmark to show the speed improvements in iregnet.
During GSoC, I had clear interactions with my mentors every week via Skype, GitHub, and email. Through frequent communication, Toby and Anuj gave me helpful guidance on my coding project.
After I identified several issues in Rperform, Toby suggested that I communicate with the Rperform developers to resolve them. So I opened my first GitHub issue in the Rperform repo, hoping to help them fix the problem together.
After finishing the main optimization plan, I started to write the benchmark for the new (optimized) iregnet. Toby showed me many ways to make the benchmark more specific, and I spent a little more time polishing it.
In the end, I learned how to optimize matrix computation in C++, use microbenchmark to measure the performance of R packages, and create specific figures with ggplot. In addition, I learned more about coding skills and the open source workflow through this project. Thanks to my mentors Toby and Anuj. I believe that I made great progress this summer, and I really appreciate their help during GSoC.