Spark is the Future of Data Science

At Alpine Data Labs we spend a lot of time working on tools for Data Scientists and Analysts. Traditionally, people doing data exploration have used tools very different from those used by production-facing engineers, making it difficult to propagate insights from Data Scientists into production. One of the more exciting trends we’ve seen is the increased adoption of Spark by both Data Science and Data Engineering teams.

When Data Science and Data Engineering teams are able to use tools built on a common engine, collaboration becomes much easier. Data Scientists can access the production data already in place for Data Engineering workflows. Common workflows also reduce differences between analytics and production environments, which can make a huge difference when deploying recommendation models and similar systems.

Traditional tools for Data Analysts often required sampling data down to a single machine. Solutions built on Spark, such as Alpine, allow Data Scientists to perform complex analyses and train models on full data sets. As time has shown again and again, more data tends to outperform more complex models. With the old approach, sampling the data could quickly become complex once multiple data sources and relations were involved, and taking hand-rolled models built in R and deploying them to production could be a long and involved process. As Data Scientists are increasingly able to work with entire data sets, the separate step of retraining a similar model on the full corpus disappears, and work can proceed directly to the evaluation phase.
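
To make that concrete, here is a minimal sketch of training a model on a full data set with Spark’s DataFrame-based MLlib API. The input path and column names are hypothetical, and we assume the features have already been assembled into a vector column.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.SparkSession

object FullDataTraining {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("full-data-training").getOrCreate()

    // Read the complete data set from distributed storage -- no sampling
    // down to a single machine. Path and column names are hypothetical.
    val training = spark.read.parquet("hdfs:///data/events/features.parquet")

    // MLlib estimators train in parallel across the cluster, so the model
    // sees every record rather than a sample.
    val lr = new LogisticRegression()
      .setLabelCol("label")
      .setFeaturesCol("features")
    val model = lr.fit(training)

    println(s"Trained on ${training.count()} rows")
    spark.stop()
  }
}
```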

Data Engineering tools have often had difficulty making the jump to exploratory analysis, since waiting multiple hours to answer an ad-hoc question isn’t really feasible. Thankfully, Spark’s common core has always been much faster than many traditional systems, and the improvements made for Spark Streaming to reduce task overhead have benefited all users.
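
One simple pattern that makes interactive exploration practical is caching the working set in cluster memory and then iterating on ad-hoc questions against it. A rough sketch from a spark-shell session (table and column names are made up):

```scala
// Cache the working set once; the first action materializes it in memory.
val events = spark.read.parquet("hdfs:///data/events").cache()

// Subsequent ad-hoc queries hit the in-memory copy and return quickly,
// rather than re-reading the full data set from disk each time.
events.groupBy("country").count().show()
events.filter($"status" === "error").groupBy("service").count().show()
```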

Spark SQL is another important component in helping Spark, and Spark-powered tools, expand their reach in Data Science. SQL has remained one of the most popular programming languages and is supported by a wide variety of backends. With Spark SQL, initial exploratory work can be done in SQL and the results easily integrated into traditional pipelines. The integration between Spark SQL and Spark’s Machine Learning library is continuously improving, making the combination very powerful.
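
As a rough illustration of that handoff, the sketch below (runnable from spark-shell) explores a table in SQL and feeds the result directly into a spark.ml feature transformer. The table and column names are invented for the example.

```scala
import org.apache.spark.ml.feature.VectorAssembler

// Expose the data set to SQL. Table and column names are illustrative.
spark.read.parquet("hdfs:///data/purchases").createOrReplaceTempView("purchases")

// Exploratory work happens in plain SQL...
val perUser = spark.sql("""
  SELECT user_id, COUNT(*) AS purchases, SUM(amount) AS total_spend
  FROM purchases
  GROUP BY user_id
""")

// ...and the result is an ordinary DataFrame, so it plugs straight into
// the Machine Learning library without an export/re-import step.
val features = new VectorAssembler()
  .setInputCols(Array("purchases", "total_spend"))
  .setOutputCol("features")
  .transform(perUser)

features.show(5)
```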

While Spark is not a programming language in the traditional sense, it can be useful to think of it in a similar manner. R has become one of the de facto languages for statistical code, in part because of the large package base available for it. The recent merge of SparkR into the mainline Spark code base continues the trend, started by PySpark, of expanding the reach of Spark to the languages and tools users are already familiar with. In addition to supporting a wide language base, spark-packages are a promising start to bringing together libraries and useful tools for Spark developers, in the same spirit as CRAN, CPAN, and PyPI. Integrating spark-packages support into the tooling around Spark (like spark-submit, as shown below) makes it even easier for users to access libraries for specialized use cases that may not be part of Spark Core.
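
For example, a package and its dependencies can be pulled straight from Maven at launch time with spark-submit’s --packages flag, e.g. `spark-submit --packages com.databricks:spark-csv_2.11:1.5.0 my-app.jar` (the jar name is hypothetical; spark-csv is used purely as an illustration, and later Spark releases build CSV support in). Once the package is on the classpath, it is available through the standard APIs:

```scala
// With spark-csv loaded via --packages, the external data source is
// available through the ordinary DataFrame reader API. Path is made up.
val df = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("hdfs:///data/raw.csv")
```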

Beyond simply solving many individual problems well, Spark provides a common platform for Data Scientists and Data Engineers to work together on innovative data-driven solutions. The combination of a fast and powerful core execution engine with additional tools for exploratory analysis, such as Spark SQL, has enormous potential. We’re really excited to see where Spark is heading, and we are committed to its success. Feel free to watch our video from the Spark Summit below, or reach out to us for more information.