Spark is a new technology that sits on top of the Hadoop Distributed File System (HDFS) and is characterized as “a fast and general engine for large-scale data processing.” Spark has three key features that make it the most interesting up-and-coming technology to rock the big data world since Apache Hadoop in 2005.
1. For iterative analyses such as logistic regression, random forests, and other advanced algorithms, Spark has demonstrated up to a 100x increase in speed, and it scales to hundreds of millions of rows.
2. Spark has native support for the latest and greatest programming languages: Java, Scala, and, of course, Python.
3. Spark is general, with platform compatibility in both directions: it integrates nicely with SQL engines (Shark), machine learning (MLlib), and streaming (Spark Streaming), and it runs on Hadoop’s new YARN cluster manager without requiring any new software to be installed on the cluster.
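To see why iterative algorithms benefit so much, consider the shape of a logistic regression training loop: the same dataset is scanned on every pass. The toy sketch below (pure Python, not Alpine’s or Spark’s actual implementation; the data and hyperparameters are made up for illustration) shows that repeated-scan structure. On disk-based MapReduce each pass means another full read from HDFS, while Spark caches the dataset in memory across passes, which is where the large speedups come from.

```python
# Toy gradient-descent loop for logistic regression in pure Python,
# illustrating the iterative access pattern Spark accelerates.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(points, iterations=100, lr=0.5):
    """points: list of (features, label) pairs with label in {0, 1}."""
    w = [0.0] * len(points[0][0])
    for _ in range(iterations):          # iterative: same data every pass
        grad = [0.0] * len(w)
        for x, y in points:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for j, xj in enumerate(x):
                grad[j] += (p - y) * xj
        # Average-gradient descent step.
        w = [wi - lr * g / len(points) for wi, g in zip(w, grad)]
    return w

# Tiny hypothetical dataset: label is 1 when the feature is positive.
data = [([1.0], 1), ([2.0], 1), ([-1.0], 0), ([-2.0], 0)]
weights = train(data)
```

In Spark, the inner scan becomes a distributed operation over an in-memory dataset, so only the first iteration pays the cost of reading from HDFS.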
At Alpine, we have made it dead simple to get started with Spark by including the technology in our latest build out of the box. No additional software or hardware is required to leverage our extensive list of operators for data transformation, exploration, and building advanced analytic models. We use Hadoop YARN (Hadoop NextGen) to launch Spark jobs without any pre-installation of Spark or modification of cluster configuration, giving our customers seamless integration between our Spark implementation and their Hadoop stack. For example, at last month’s GigaOM conference we analyzed 50 million rows of account data in 50 seconds on a 20-node cluster.
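For readers curious what launching Spark on an existing YARN cluster looks like in general, the sketch below shows a standard `spark-submit` invocation (this is the stock Spark tooling, not Alpine’s product; the application jar, class name, and HDFS path are hypothetical). In YARN mode, the Spark runtime is shipped to the cluster with the job, so nothing needs to be pre-installed on the worker nodes.

```shell
# Submit a Spark application to an existing Hadoop YARN cluster.
# The jar, class, and input path below are placeholders.
spark-submit \
  --master yarn \
  --num-executors 20 \
  --class com.example.AccountAnalysis \
  account-analysis.jar hdfs:///data/accounts
```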
As a Spark-certified company, Alpine Data Labs will be at the Summit. We’d love to see you there!