Goal

Support "blob" keyword in Apache Spark SQL. If someone specifies a "blob" keyword, it should be converted that to Array[Byte] internally.

Approach

The Apache Spark SQL parser is based on ANTLR. Here is a wonderful tutorial on how ANTLR works. Spark implements ANTLR visitor functions that build the tree nodes as the parse tree is walked.

Spark allows injecting a custom parser in place of the default one. To support the blob type, the following changes are made:

  • Added ParserSessionExtension, which is used to inject our custom parser (see the sketch after this list).

  • Injected SparkSqlParserWithDataTypeExtension, which replaces the default SparkSqlParser. This class swaps the default abstract syntax tree builder for AstBuilderWithDataTypeExtension.

  • AstBuilderWithDataTypeExtension overrides visitPrimitiveDataType and adds support for the blob type (also sketched below).
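
For concreteness, here is a minimal sketch of the first two pieces. It is a sketch, not the exact implementation: it assumes Spark 3.1+, where SparkSessionExtensions.injectParser hands the builder a (SparkSession, ParserInterface) pair and SparkSqlParser takes no constructor arguments (older releases pass a SQLConf instead); everything beyond Spark's own API is taken from the class names used above.

package dev.plugins.parser

import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
import org.apache.spark.sql.catalyst.parser.ParserInterface
import org.apache.spark.sql.execution.SparkSqlParser

// Registered via --conf spark.sql.extensions=dev.plugins.parser.ParserSessionExtension.
// Spark invokes apply() while building the session, letting us swap in our parser.
class ParserSessionExtension extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    extensions.injectParser { (session: SparkSession, delegate: ParserInterface) =>
      new SparkSqlParserWithDataTypeExtension(session, delegate)
    }
  }
}

// Reuses SparkSqlParser wholesale and replaces only its AST builder, so every
// statement parses as before except that "blob" is now understood. The session
// and delegate parameters are accepted only to match injectParser's builder shape.
class SparkSqlParserWithDataTypeExtension(
    session: SparkSession,
    delegate: ParserInterface) extends SparkSqlParser {
  override val astBuilder = new AstBuilderWithDataTypeExtension
}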

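The AST builder itself might look like the following sketch. It assumes a Spark version whose primitiveDataType grammar rule still matches a bare identifier (the 2.x through 3.1 era of SqlBase.g4), so ctx.identifier yields the raw type name and "blob" can be intercepted before the stock builder rejects it as an unknown type.

package dev.plugins.parser

import java.util.Locale

import org.apache.spark.sql.catalyst.parser.SqlBaseParser.PrimitiveDataTypeContext
import org.apache.spark.sql.execution.SparkSqlAstBuilder
import org.apache.spark.sql.types.{ArrayType, ByteType, DataType}

// Only visitPrimitiveDataType is overridden; every other visitor method is
// inherited unchanged from Spark's builder.
class AstBuilderWithDataTypeExtension extends SparkSqlAstBuilder {
  override def visitPrimitiveDataType(ctx: PrimitiveDataTypeContext): DataType = {
    if (ctx.identifier.getText.toLowerCase(Locale.ROOT) == "blob") {
      // Map "blob" to array<byte>, i.e. Array[Byte] internally.
      ArrayType(ByteType)
    } else {
      // Not "blob": defer to Spark's default primitive-type handling.
      super.visitPrimitiveDataType(ctx)
    }
  }
}

With this in place, the blob column in the session below surfaces as ArrayType(ByteType), which is exactly what the schema printed by res3 shows.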
Try it out

$ bin/spark-shell --jars <PATH>/parser-0.1.0-SNAPSHOT.jar --conf spark.sql.extensions=dev.plugins.parser.ParserSessionExtension 

scala> spark.sql("create table tblWithBlob (mybytearray blob, id int) using parquet options (path '/tmp/p')")

scala> spark.sql("Insert Into tblWithBlob values (array(1,2,3), 1)")

scala> spark.sql("select * from  tblwithblob").schema
res3: org.apache.spark.sql.types.StructType = StructType(StructField(mybytearray,ArrayType(ByteType,true),true), StructField(id,IntegerType,true))

scala> spark.sql("select * from  tblwithblob").show 
+-----------+---+
|mybytearray| id|
+-----------+---+
|  [1, 2, 3]|  1|
+-----------+---+

GitHub

Here