While enterprises have traditionally deployed Hadoop clusters in their own data centers, a growing number are creating clusters in the cloud. Cloud providers such as AWS and GCP make it almost effortless to spin up and tear down Hadoop clusters on demand, which offers a cost-effective approach to big data systems. However, the analytics solutions these vendors currently offer are extremely limited and may not even extend to the use of Hadoop clusters.
Chorus can be readily deployed to cloud environments and supports not only the typical Hadoop distributions but also AWS Elastic MapReduce (EMR). Chorus also easily leverages data residing in Redshift or MySQL instances, sourcing and syncing data to and from S3.
Deploying, configuring and maintaining a bare-metal Hadoop cluster can be a time-consuming effort. In contrast, cloud Hadoop deployments let you create a multi-node Hadoop cluster at the click of a button.
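For readers who prefer scripting to clicking, the same on-demand cluster can be requested programmatically. The following is a minimal sketch, assuming AWS EMR and the third-party boto3 library; it is not part of Chorus. The cluster name, instance type, node count, release label, region and the default EMR role names are all illustrative assumptions, so adjust them to your own account.

```python
def build_emr_request(name, core_nodes=2, instance_type="m5.xlarge",
                      release_label="emr-6.15.0"):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    All default values here are illustrative assumptions.
    """
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        # Install Hadoop and Spark, the engines named in this post.
        "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "InstanceType": instance_type, "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "InstanceType": instance_type, "InstanceCount": core_nodes},
            ],
            # Keep the cluster running so an analytics tool can attach to it.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Default EMR role names; assumed to exist in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(request, region="us-east-1"):
    """Spin up the cluster and return its ID (requires AWS credentials)."""
    import boto3  # imported here so the request builder works without it
    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]


def terminate_cluster(cluster_id, region="us-east-1"):
    """Tear the cluster down again once it is no longer needed."""
    import boto3
    emr = boto3.client("emr", region_name=region)
    emr.terminate_job_flows(JobFlowIds=[cluster_id])
```

Calling `launch_cluster(build_emr_request("chorus-demo"))` would start a three-node cluster (one master, two core nodes) under these assumptions, and `terminate_cluster` stops the billing when the work is done.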
Once such an instance has been created, deploying Chorus is similarly efficient:
- Create a small container
- Download Chorus from S3 to the container
- Launch the installer (and hit yes a couple of times 🙂 )
- Log into Chorus via your web browser
- Point Chorus at the ResourceManager of your Hadoop cluster, and instruct Chorus to autoconfigure itself for that cluster.
- Start building high-performance analytical workflows using the Chorus visual workflow editor, and running them on your Hadoop cluster using Spark and/or MapReduce.
Start to finish, this process takes as little as 10 minutes. It’s also just a few clicks to add the Redshift data source: data can be moved back and forth between Redshift and EMR, ETL can be performed on the Redshift data in situ, and models trained on Hadoop can be used to score data residing in Redshift (or seamlessly deployed to a customer’s cloud scoring engines using either PMML or PFA). All within minutes of creating the instance and without having to write a single line of code!
With the recent introduction of Spark autotuning in Chorus 6.1, Chorus has a detailed understanding of the resource requirements of every analysis that data scientists run through it. As a result, Chorus can determine the optimal cluster size required to support the aggregate load. Future releases will provide the functionality to scale the cluster up and down by dynamically adding nodes when required and pausing idle nodes when not. This means that Chorus will be able to minimize the cost associated with running the cluster (indeed, customers can already integrate cluster control into their individual flows using the Chorus extensibility SDK).
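To make the sizing idea concrete, here is a rough sketch of how an aggregate load could be translated into a node count. This is not Chorus's autotuning algorithm, which the post does not describe; `JobDemand` and `nodes_required` are hypothetical names, and the calculation deliberately ignores scheduler overhead and the fact that an executor cannot be split across nodes.

```python
import math
from dataclasses import dataclass


@dataclass
class JobDemand:
    """Resources one running analysis asks for (hypothetical structure)."""
    executors: int
    memory_gb_per_executor: float
    cores_per_executor: int


def nodes_required(jobs, node_memory_gb, node_cores, min_nodes=1):
    """Estimate how many worker nodes cover the combined demand of all jobs.

    Simplifying assumption: memory and cores are pooled across the cluster,
    so the result is a lower bound rather than an exact packing.
    """
    total_memory = sum(j.executors * j.memory_gb_per_executor for j in jobs)
    total_cores = sum(j.executors * j.cores_per_executor for j in jobs)
    by_memory = math.ceil(total_memory / node_memory_gb)
    by_cores = math.ceil(total_cores / node_cores)
    # The scarcer resource decides the cluster size.
    return max(min_nodes, by_memory, by_cores)


# Two concurrent workflows: 72 GB and 18 cores in total.
load = [JobDemand(4, 8, 2), JobDemand(10, 4, 1)]
print(nodes_required(load, node_memory_gb=64, node_cores=16))
```

Re-running such an estimate as workflows start and finish is what would let a controller add nodes under load and release them when the cluster goes idle.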
At Alpine, we’re excited to see more of our customers use Chorus to deploy machine learning in the cloud. Stay tuned for future posts detailing other ways we’re transforming the traditional enterprise data science workflow.