Saturday 2 April 2016

Best ways to tackle big data in R

Big data sets containing millions of records can be processed using R. Even though R provides a few packages that support big data, extra effort is needed. MapReduce-style algorithms can be written in R for analysing such data (refer: The Art of R Programming by Norman Matloff, No Starch Press).
Keep in mind that intermediate objects created during the analysis can grow considerably larger than the original data set.
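
To make the MapReduce idea concrete, here is a toy sketch in base R: Map applies a summary function to each chunk of data, and Reduce combines the partial results. The chunks list here is made up purely for illustration.

    # toy map-reduce in base R over three made-up chunks of data
    chunks <- list(1:100, 101:200, 201:300)
    mapped <- Map(sum, chunks)     # map step: summarise each chunk
    Reduce(`+`, mapped)            # reduce step: combine, gives 45150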

I have listed the best ways to handle big data in R below.

  • Divide and conquer 
Large data sets can be divided into smaller subsets, and each subset worked on separately. Processing these subsets in parallel yields faster solutions and lets you try different strategies, as sketched below.
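
As a minimal sketch of this idea, the snippet below uses base R's parallel package to split a large vector into chunks, compute partial sums on separate workers, and combine them. The vector x is just a stand-in for a real data set.

    library(parallel)

    x <- runif(1e7)                           # stand-in for a large data set
    chunks <- split(x, cut(seq_along(x), 4))  # divide into 4 subsets

    cl <- makeCluster(4)                      # 4 parallel workers
    partial <- parLapply(cl, chunks, sum)     # conquer each subset separately
    stopCluster(cl)

    total <- Reduce(`+`, partial)             # combine the partial results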

  • Memory and hardware
Every data object in R is stored in memory. Therefore, for better performance, machines should have a large amount of RAM; a 64-bit build of R can address up to 8 TB of memory, so 64-bit machines are best suited to working with R. Another approach is to use packages such as "ff" and "ffbase", which keep data on disk rather than in memory, as sketched below. Revolution R's ScaleR also provides a variety of external-memory algorithms for analysing data.
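
A minimal sketch with ff and ffbase, assuming a hypothetical CSV file sales.csv with an amount column:

    library(ff)
    library(ffbase)

    # read.csv.ffdf streams the file into an on-disk ffdf object,
    # so the full data set never has to fit in RAM
    big <- read.csv.ffdf(file = "sales.csv", header = TRUE)

    # ffbase adds chunk-wise methods for ff vectors, e.g. mean()
    mean(big$amount)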

  • Incorporating programming languages like Java or C++
Sometimes, components of an R program can be rewritten in compiled languages such as Java or C++ for better performance. rJava bridges R and Java and is regarded as a connection package (refer: Advanced R by Hadley Wickham); a minimal example follows this paragraph. Renjin is an open-source project consisting of an alternative R interpreter that runs inside the JVM. Oracle R Distribution similarly ships the R interpreter together with a variety of optimised mathematical functions and libraries.
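
For illustration, here is a minimal rJava sketch; nothing project-specific is assumed beyond a working Java installation:

    library(rJava)
    .jinit()                                   # start the JVM

    # create a java.lang.String object and call its length() method
    s <- .jnew("java/lang/String", "big data in R")
    .jcall(s, "I", "length")                   # "I" = returns a Java int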