Analytics Blog

Integrating R and the JVM Platform – Alpine Data Labs’ R Execute Operator

In today’s multi-core, concurrent world, with data flowing through applications at previously unseen rates, it is critical to make the best use of existing hardware. In the past, blocking, synchronous applications were adequate for the tasks for which they were designed. However, with thousands (or even millions) of users and huge web applications, it is necessary to decouple the request for a computation from its response, so that the calling thread is free to do other work while the service it called completes its task. Furthermore, a distributed application needs to be resilient, because the more servers you have, the more likely it is that at least one of them is down at any given time. The application should be responsive, meaning that it should be event-driven instead of blocking and waiting for a computation to finish. It should be elastic, scaling up and down with the current load. For further details, see the Reactive Manifesto.
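
To make the idea of decoupling a request from its response concrete, here is a minimal sketch using only Python's standard asyncio library. It is not Akka and not Alpine's code; the names slow_service and handle_requests are made up for illustration, and the sleep stands in for a call to some other service. All three requests are issued before any response arrives, so a single thread is never stuck waiting on one of them.

```python
import asyncio

events = []  # records the order in which requests start and finish


async def slow_service(name, delay):
    # Hypothetical stand-in for a call to another service.
    events.append(f"start {name}")
    await asyncio.sleep(delay)  # yields control instead of blocking the thread
    events.append(f"end {name}")
    return f"{name} done"


async def handle_requests():
    # Issue every request first, then collect the responses as they complete.
    tasks = [asyncio.create_task(slow_service(f"job-{i}", 0.1)) for i in range(3)]
    return await asyncio.gather(*tasks)


results = asyncio.run(handle_requests())
```

After the run, the first three entries of events are all "start" entries: every request was sent before the first response came back, which is what a blocking, synchronous version could not do.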

Reactive programming is a phenomenal idea, but it’s not always achievable “all the way down” in practice. In the real world, one rarely writes entire platforms from scratch, and even then one often needs to integrate with third-party applications that are blocking, stateful, and seem to violate nearly every reactive principle. Despite this, reactive and non-reactive code bases can co-exist, and one can get the best of both worlds by using a good integration technology such as Akka.

Akka is a leading reactive programming toolkit for the Java Virtual Machine (JVM). It is written in Scala, with APIs for both Scala and Java. It’s been shown to scale up to processing 50 million messages per second on a single node. Akka provides a location-independent model of computation, so the programming model for local multi-threading and remoting is the same. Akka also provides the “let it crash” fault tolerance model, borrowed from the Erlang programming language, where it was used successfully for years to power Ericsson’s telephone switching systems. Akka implements the Actor Model, introduced in 1973 by Carl Hewitt, Peter Bishop and Richard Steiger. It is ironic that an old but great idea took so many years to find commercial usage (starting with Erlang in the late 1980s), but the same could be said of other great ideas such as functional programming, which is only now gaining wide adoption even though its roots date back to one of the founding fathers of computer science (Alonzo Church) and to the second-oldest high-level programming language still in widespread use (Lisp, which appeared in 1958). In addition to being a great technology for local concurrency, remoting, clustering and streaming, Akka is open source (Apache 2.0 license) and has optional commercial support provided by Typesafe, Inc. The project was founded by Jonas Bonér, Typesafe’s CTO. Given its open-source nature, however, it has received contributions from many people, both Typesafe employees and the community at large.

Akka serves the use case of integrating Alpine’s backend with R perfectly. It decouples the two applications via message passing, allowing proprietary and GPL code bases to co-exist without licensing issues. It provides the necessary scalability, letting the Alpine user execute many R scripts at the same time. It provides location transparency, letting R workers execute on a different machine from the one on which the Alpine server is running, which opens up the possibility of commanding an entire R cluster. It provides fault tolerance, bringing up new R processes should any of them fail.
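
The shape of that integration can be sketched in a few lines. The example below is only an illustration built from Python's standard threading and queue modules; it is not Akka's API and not the R Execute Operator's actual implementation. The Supervisor class and the run_script function are hypothetical, and run_script merely simulates an R process (no R is invoked), failing on one particular script so that the restart path is exercised.

```python
import queue
import threading


def run_script(script):
    # Hypothetical stand-in for handing a script to an external R process.
    if script == "stop('boom')":
        raise RuntimeError("R process died")
    return f"ok: {script}"


class Supervisor:
    def __init__(self, pool_size):
        self.jobs = queue.Queue()    # mailbox of (job_id, script) messages
        self.events = queue.Queue()  # workers report results and crashes here
        self.restarts = 0
        for _ in range(pool_size):
            self._spawn()

    def _spawn(self):
        threading.Thread(target=self._work, daemon=True).start()

    def _work(self):
        # Each worker handles one message at a time.
        while True:
            job_id, script = self.jobs.get()
            try:
                self.events.put(("done", job_id, run_script(script)))
            except Exception as err:
                # "Let it crash": report the failure and end this worker.
                self.events.put(("crashed", job_id, str(err)))
                return

    def run_all(self, scripts):
        for job_id, script in enumerate(scripts):
            self.jobs.put((job_id, script))
        results = {}
        while len(results) < len(scripts):
            kind, job_id, payload = self.events.get()
            if kind == "crashed":
                self.restarts += 1
                self._spawn()  # bring up a fresh worker to replace the dead one
                results[job_id] = f"failed: {payload}"
            else:
                results[job_id] = payload
        return results


supervisor = Supervisor(pool_size=2)
results = supervisor.run_all(["mean(1:10)", "stop('boom')", "sd(c(1, 2, 3))"])
```

The caller and the workers only ever exchange messages through queues, so neither side calls the other's code directly, and a failing script takes down one worker rather than the whole pool. A real actor system adds what this sketch leaves out, such as remoting and supervision hierarchies.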

You will find below the slides and recording of a presentation I gave on that topic at the San Francisco Scala Meetup on September 10th, 2014.