Goal
Modify range operator so that it returns only the multiples of the step
spark.range(start = 1, end = 10, step = 3, numPartitions = 2).show
The actual range would return {1, 4, 7}
The modified range should return {3, 6, 9}
Approach
Apache Spark's parser generates a abstract syntax tree. The tree nodes are the operators of a query. The AST is passed through various phases of query compilation i.e. analysis, optimizer, planner and execution. In this exercise, I inject a planner rule that replaces a logical plan node with a physical plan node.
There are multiple ways in which we change the Spark's range function but in this project I will inject a new Spark Plan for the range operator. Steps to achieve this:
-
Add a PlannerExtension that injects our strategy
-
Our strategy, ModifiedRangeExecStrategy, replaces the logical range operator with physical plan node ModifiedRangeExec.
-
ModifiedRangeExec is a copy of RangeExec in spark except that I have changed the start number of the range. Since RangeExec is a physical plan node, it does the code generation. But let's keep that exercise for some other project.
Try it out
$ bin/spark-shell --jars <PATH>/planner-0.1.0-SNAPSHOT.jar --conf spark.sql.extensions=dev.plugins.planner.PlannerExtension
scala> spark.range(start = 1, end = 10, step = 3, numPartitions = 2).show
+---+
| id|
+---+
| 3|
| 6|
| 9|
+---+
scala> spark.range(start = 2, end = 10, step = 3, numPartitions = 2).show
+---+
| id|
+---+
| 3|
| 6|
| 9|
+---+