Analytics Blog

Testing with the Spark Testing Base

Testing in the world of Spark has often involved a lot of hand-rolled artisanal code, which frankly is a good way to ensure that developers write as few tests as possible. I’ve been doing some work with Spark Testing Base (also available on Spark Packages) to try and make testing Spark jobs as easy as “normal” software (and remove excuses for not writing tests).

A simple Scala Spark Testing Base test can look like:

class SampleRDDTest extends FunSuite with SharedSparkContext {
  test("really simple transformation") {
    val input = List("hi", "hi cloudera", "bye")
    val expected = List(List("hi"), List("hi", "cloudera"), List("bye"))
    assert(SampleRDD.tokenize(sc.parallelize(input)).collect().toList === expected)
  }
}
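The post doesn’t show SampleRDD.tokenize itself. Assuming it simply splits each line on whitespace (which is consistent with the expected output above), the core logic is a one-liner applied to the RDD with map. A minimal non-Spark sketch of that assumption:

```scala
// Hypothetical core of SampleRDD.tokenize: split each line on spaces.
// In the real job this would be applied to the RDD as rdd.map(tokenizeLine).
def tokenizeLine(line: String): List[String] =
  line.split(" ").toList

println(tokenizeLine("hi cloudera")) // List(hi, cloudera)
```

The names here (tokenizeLine) are illustrative, not part of the library.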

(Note that Spark Testing Base is available in Scala and Java, with some early support for Python by Juliet Hougland, driven in part by a desire for simpler Python tests in Sparkling Pandas.)

This admittedly only saves us from setting up and stopping a SparkContext (all of about 10 lines of difference from the artisanal hand-rolled version), so if you have a fine collection of vintage records, feel free to keep crafting your Spark Core tests by hand (but remember to clear “spark.driver.port” between context stops and starts).
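For comparison, the artisanal hand-rolled version looks roughly like the following sketch (assuming ScalaTest’s BeforeAndAfterAll trait; this is an illustration of the boilerplate, not the library’s exact internals — note the “spark.driver.port” clearing mentioned above):

```scala
class HandRolledRDDTest extends FunSuite with BeforeAndAfterAll {
  @transient private var _sc: SparkContext = _
  def sc: SparkContext = _sc

  override def beforeAll() {
    val conf = new SparkConf().setMaster("local[2]").setAppName("test")
    _sc = new SparkContext(conf)
    super.beforeAll()
  }

  override def afterAll() {
    _sc.stop()
    _sc = null
    // Avoid port-binding trouble between context stops and starts.
    System.clearProperty("spark.driver.port")
    super.afterAll()
  }

  // ... actual tests using sc go here ...
}
```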

While the boilerplate for testing a “regular” Spark program can easily fit in the span of just a single slide, the amount of boilerplate for testing with Spark DataFrames or Streaming resembles that of a naive Map/Reduce word count program.

Testing Spark Streaming by hand involves overcoming a number of hurdles. We need to figure out when our tests are done (and if you look at some tests you can see people simply waiting a fixed number of seconds, which is just sad), we need to get our data into Spark Streaming (which, as of 1.4.1, has become trickier*), and we need to collect our results. Instead of dealing with these problems, we can just specify our expected input and output and use Spark Testing Base’s testOperation function to test our hypothetical tokenize function:

test("really simple transformation") {
  val input = List(List("hi"), List("hi holden"), List("bye"))
  val expected = List(List("hi"), List("hi", "holden"), List("bye"))
  testOperation[String, String](input, tokenize _, expected, useSet = true)
}

Which is kind of neat, since testOperation handles all of the tricky bits: feeding the data in, creating a fake clock, and collecting the results.
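One flag worth calling out is useSet = true, which (as I understand it) compares each output batch as a set rather than an ordered list, so element order within a batch doesn’t matter. The comparison is roughly:

```scala
// Sketch of the useSet = true comparison: two batch lists match if each
// pair of batches contains the same elements, regardless of order.
val expected = List(List("hi"), List("hi", "holden"), List("bye"))
val actual   = List(List("hi"), List("holden", "hi"), List("bye"))
val batchesMatch = expected.zip(actual).forall {
  case (e, a) => e.toSet == a.toSet
}
println(batchesMatch) // true
```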

Another really exciting library focused on testing Spark programs is sscheck, which integrates ScalaCheck with Spark. They do some really interesting work, and we’re currently discussing with them how to include some of their ideas in Spark Testing Base (which also supports ScalaCheck but does not yet generate DStreams).

Using tools like ScalaCheck is great since we don’t have to keep reusing the same example strings (although it leaves far fewer places to hide easter eggs), and they can also generate slightly more pathological cases for us. In the previous examples, under the hood each RDD is generated with a simple call to parallelize, which distributes our data evenly amongst the partitions. One entire class of errors shows up when our data isn’t distributed in these ways, and remembering to always test for that can be a bother. Instead, using Spark Testing Base’s ScalaCheck support, we can specify a test like:

test("map should not change number of elements") {
  forAll(RDDGenerator.genRDD[String](sc)) {
    rdd => rdd.map(_.length).count() == rdd.count()
  }
}
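The property itself — that mapping over a collection never changes its size — is easy to sanity-check without Spark or ScalaCheck. A plain-Scala version of the same check, over a few hand-picked samples instead of generated RDDs:

```scala
// The same "map preserves count" property on ordinary collections.
val samples = List(List("hi", "bye"), List.empty[String], List("a", "b", "c"))
val holds = samples.forall(xs => xs.map(_.length).size == xs.size)
println(holds) // true
```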

If you are passionate about making reliable Spark jobs, or at least not averse to the idea, I will be talking about Spark testing and job validation at Strata NY 2015 in my talk “Effective testing of Spark programs and jobs” from 4:35pm–5:15pm on Wednesday, 09/30/2015. If you enjoy*** filling out online surveys about software testing, I’d also love your input on how you currently test your Spark programs.

If you want to keep up with my future work, pictures of cats, or general silliness, you can follow me on Twitter, or for more useful code bits just keep an eye on my GitHub. Many thanks to Alpine for giving me time to host discussions on Spark testing, as well as delicious iced coffee. As always, my writing does not necessarily represent the views of my employer(s) past, present, or future**.

* If you need to use any sort of stateful operations, like window operations or updateStateByKey, the traditional approach of using queueStream for tests doesn’t work anymore.
** I mean _maybe_ but I don’t have a time machine.
*** Or at least don’t detest