Data processing with Spark in R & Python

I recently gave a talk on data processing with Apache Spark using R and Python. tl;dr – the slides and presentation can be accessed below:

As noted in my previous post, Spark has become the de facto standard for big data applications and has been adopted quickly across the industry. See the blog post on Cloudera’s One Platform initiative by co-founder Mike Olson for their commitment to Spark.

In data science, R has seen rapid adoption, not only because it is open source and free (unlike costly SAS), but also because of the huge number of statistical and graphical packages available for it. The most popular ones are, of course, those from Hadley Wickham (dplyr, ggplot2, reshape2, tidyr and more). Python, on the other hand, has seen rapid adoption among developers and engineers because it is useful both for scripting big data tasks and for data analysis, with the help of packages like pandas, scikit-learn, NumPy, SciPy and matplotlib, as well as the popular IPython and, later, Jupyter notebooks.

There are numerous posts strewn across the net picking fights between R and Python. However, it is quite usual for a big data and data science shop to have developers and data scientists who use either or both of these tools. Spark makes it easy for both communities to leverage the power of Hadoop and distributed processing systems through its own APIs, such as DataFrames, which can be used in a polyglot fashion. Therefore, it is essential for any data enthusiast to learn how data processing in Spark can be done using R or Python.

Now on Amazon – download the BIguru blog app!

The BIguru BI Blog app is now available on the Amazon Appstore!

To find the app, go to the Amazon Appstore and search for “BIguru BI Blog”.
To download and install it, you’ll need to follow the instructions for your Android smartphone, i.e. you’ll need to “enable unknown sources” as outlined by Amazon.

BIguru BI Blog app

Once you’ve downloaded and installed the app by accepting the defaults (your smartphone’s anti-virus should scan it after installation), you’re free to get updates on new posts from this blog!

The app is powered by the Como App Maker – which makes it simple to create HTML5 apps for all types of smartphones.

Enjoy your app!

Change the location Google Desktop Search indexes your data

Desktop search has become an important component of our everyday work. With the explosion in the amount of information, it is only natural that users and enterprises move towards enabling desktop (and enterprise) search, subject of course to appropriate security and access controls. BI vendors have moved into this newly opened business space, which seems to be one of the most promising. While Business Objects announced support for the Google Search Appliance and Google Desktop back in 2006, their most important recent announcement has been the launch of the Business Objects Explorer (formerly known as Polestar) product. More about that in a later post…

Google Desktop Search is one of the most widely used desktop search tools, so one would expect it to have an intelligent installer as well. Unfortunately, it lets you choose neither the installation directory nor the location of the search index. It installs on your system drive, and the Options settings provide no way to change this. This can be frustrating if your system drive is short of space, as the Google Desktop search index soon grows and can hog a lot of it (up to 2 GB). Here is a tip on how to get around this issue by moving the Google Desktop search index off the default system drive, without having to rebuild the index.

1. Right click and exit Google Desktop.


2. Open Windows Explorer and navigate to C:\Documents and Settings\<username>\Local Settings\Application Data\Google\<google desktop search>


Note: If you’re unable to see “Local Settings” (it’s a hidden folder), change your folder options from Tools – Folder Options – View – Show hidden files and folders.

3. Move the <google desktop search> folder to a different drive, e.g. D:\Google Desktop\<google desktop search>

4. Open the Windows Registry Editor: go to Start – Run, type regedit and hit Enter.

5. Navigate to HKEY_CURRENT_USER\Software\Google\Google Desktop.

6. Select the “data_dir” value in the right pane and double-click it to change its data to the new location of the <google desktop search> folder.


7. Exit the registry editor.

8. Restart Google Desktop Search.
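
For readers who would rather script steps 3 to 7, here is a minimal Python sketch (Windows only). It is an illustration under assumptions rather than a tested tool: the post does not say how data_dir is stored, so the sketch assumes a plain string (REG_SZ) value, and the function names are made up for this example. Exit Google Desktop first and restart it afterwards, exactly as in steps 1 and 8.

```python
import ntpath
import os
import shutil
import sys

# Registry location and value name taken from steps 5 and 6 (under HKEY_CURRENT_USER).
REG_KEY_PATH = r"Software\Google\Google Desktop"
REG_VALUE_NAME = "data_dir"


def new_index_location(old_index_dir, target_parent):
    """Return the path the index folder will have once moved under target_parent."""
    # ntpath applies Windows path rules regardless of the platform running the code.
    folder_name = ntpath.basename(ntpath.normpath(old_index_dir))
    return ntpath.join(target_parent, folder_name)


def relocate_index(old_index_dir, target_parent):
    """Move the index folder (step 3) and update data_dir (steps 4 to 7)."""
    if sys.platform != "win32":
        raise OSError("This sketch only works on Windows.")
    import winreg  # Windows-only standard library module

    new_dir = new_index_location(old_index_dir, target_parent)
    os.makedirs(target_parent, exist_ok=True)
    shutil.move(old_index_dir, new_dir)

    # Assumption: data_dir is stored as a plain string (REG_SZ).
    with winreg.OpenKey(winreg.HKEY_CURRENT_USER, REG_KEY_PATH, 0,
                        winreg.KEY_SET_VALUE) as key:
        winreg.SetValueEx(key, REG_VALUE_NAME, 0, winreg.REG_SZ, new_dir)
    return new_dir
```

To use it, call relocate_index with the full path of your own <google desktop search> folder and the new parent folder, e.g. r"D:\Google Desktop". It is worth exporting the registry key from regedit beforehand so the change can be undone.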

Hello world!

This blog will feature a series of posts about trends and recent happenings in the BI space, along with some theory on data warehousing and BI, covering classical as well as newer approaches. It will also include tutorials on technologies and implementations I have worked with, as well as useful links.