Analytics Blog

Large-Scale Machine Learning with Apache Spark

We are excited to host this Thursday’s SF Machine Learning Meetup Group.

This meetup is part of a series of meetings dedicated to machine learning in Spark. As you might remember, Alpine Data Labs was one of the first advanced analytics platforms to announce its commitment to Spark and to partner with Databricks back in March.

What is Spark?

If you are new to Spark, you should know that it is a new cluster computing engine that is rapidly gaining popularity. It was designed both to make traditional MapReduce programming easier and to support new types of applications, with one of the earliest focus areas being machine learning.
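To make that concrete, the map-and-reduce pattern Spark streamlines can be sketched in plain Python. This is an illustrative stand-in, not Spark code: in Spark the same steps would be distributed RDD transformations (`flatMap` followed by `reduceByKey`) running across a cluster.

```python
from collections import Counter

def word_count(lines):
    # "Map" phase: split each line into individual words.
    words = (word for line in lines for word in line.split())
    # "Reduce" phase: sum the counts per word, which is what
    # Spark's reduceByKey does across partitions.
    return Counter(words)

counts = word_count([
    "spark makes mapreduce easier",
    "spark scales machine learning",
])
print(counts["spark"])  # 2
```

On a single machine this is trivial; the point of Spark is that the same two logical steps scale to datasets spread over many nodes.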

The movement around Spark has already gathered over 150 contributors in the past year and it is one of the most active open source projects in big data – surpassing even Hadoop MapReduce!

Last Thursday, Xiangrui Meng, engineer at Databricks, gave a great Spark 101 talk and explored MLlib, Spark’s built-in machine learning library (special thanks to our hosts for this meetup). Xiangrui showed how to use Spark to process raw data with libraries in Java, Scala, or Python and extract features for machine learning. He also reviewed MLlib and showed how to run scalable versions of popular algorithms. If you didn’t get a chance to attend, don’t worry, we’ve got the recording down below.
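As a rough illustration of the “raw data to features” step Xiangrui described, here is a minimal term-frequency featurizer in plain Python. This is a hypothetical sketch for intuition only; MLlib provides distributed equivalents of this kind of feature extraction.

```python
def term_frequencies(doc, vocabulary):
    # Map a raw document to a vector of term counts
    # over a fixed vocabulary, a simple text feature.
    tokens = doc.lower().split()
    return [tokens.count(term) for term in vocabulary]

vocab = ["spark", "learning", "data"]
features = term_frequencies(
    "Spark makes machine learning on big data easier", vocab
)
print(features)  # [1, 1, 1]
```

Once documents are vectors like this, they can be fed into any of the scalable algorithms MLlib ships.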

This week, Sandy Ryza, Data Scientist and Engineer at Cloudera, will talk about unsupervised learning on Spark. DB Tsai, one of Alpine Data Labs’ machine learning engineers, will also discuss implementing multinomial logistic regression on Spark using an L-BFGS optimizer.
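For readers unfamiliar with the model DB Tsai will cover: multinomial logistic regression scores each class with a linear function and turns the scores into probabilities with a softmax, and L-BFGS is then used to minimize the negative log-likelihood. The prediction side can be sketched in a few lines of plain Python (illustrative only, and not the Spark implementation being presented):

```python
import math

def softmax(scores):
    # Subtract the max score for numerical stability
    # before exponentiating and normalizing.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(weights, x):
    # weights holds one weight vector per class; the output
    # is a probability distribution over the classes.
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]
    return softmax(scores)

probs = predict([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], [2.0, 1.0])
print(sum(probs))  # probabilities sum to 1
```

Training means finding the weights that maximize the likelihood of the labeled data; an optimizer like L-BFGS converges in far fewer passes over the data than plain gradient descent, which matters when each pass is a distributed Spark job.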

We’re looking forward to seeing you all this Thursday!

Watch the full recording:

If you don’t already have Alpine, sign up here to get started!

Be sure to subscribe to this blog to receive alerts for new posts in this series. You can subscribe at the top right of this page or add this blog to your Feedly or RSS reader.