Goal
Support "blob" keyword in Apache Spark SQL. If someone specifies a "blob" keyword, it should be converted that to Array[Byte] internally.
Approach
The Apache Spark SQL parser is based on ANTLR. Here is a wonderful tutorial on how ANTLR works. Spark defines visitor functions that build tree nodes while walking the ANTLR parse tree.
Spark allows injecting a custom parser in place of the default one. To support blob types, the following changes are made:
- Added ParserSessionExtension, which is used to inject our parser.
- Injected SparkSqlParserWithDataTypeExtension, which replaces the default SparkSqlParser. This class swaps the default abstract syntax tree builder for AstBuilderWithDataTypeExtension.
- AstBuilderWithDataTypeExtension overrides visitPrimitiveDataType and adds support for the blob type (see the sketch after this list).
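The sketch below shows how the three classes could fit together. It is a minimal reconstruction, not the actual contents of parser-0.1.0-SNAPSHOT.jar, and it assumes Spark 3.1, where SparkSqlParser and SparkSqlAstBuilder have no-arg constructors and the generated PrimitiveDataTypeContext exposes the type name via ctx.identifier; other releases differ in these details. Note that no grammar change is needed: Spark's grammar already parses an unknown type name as a plain identifier, so intercepting visitPrimitiveDataType is enough.

package dev.plugins.parser

import java.util.Locale

import org.apache.spark.sql.SparkSessionExtensions
import org.apache.spark.sql.catalyst.parser.SqlBaseParser.PrimitiveDataTypeContext
import org.apache.spark.sql.execution.{SparkSqlAstBuilder, SparkSqlParser}
import org.apache.spark.sql.types.{ArrayType, ByteType, DataType}

// Entry point named by spark.sql.extensions; swaps in our parser for the session.
class ParserSessionExtension extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit =
    extensions.injectParser((_, _) => new SparkSqlParserWithDataTypeExtension)
}

// Identical to the stock SparkSqlParser except for the AST builder it uses.
class SparkSqlParserWithDataTypeExtension extends SparkSqlParser {
  override val astBuilder = new AstBuilderWithDataTypeExtension
}

// "blob" maps to ArrayType(ByteType); every other type name is handled by Spark.
class AstBuilderWithDataTypeExtension extends SparkSqlAstBuilder {
  override def visitPrimitiveDataType(ctx: PrimitiveDataTypeContext): DataType = {
    if (ctx.identifier.getText.toLowerCase(Locale.ROOT) == "blob") {
      ArrayType(ByteType)
    } else {
      super.visitPrimitiveDataType(ctx)
    }
  }
}

Because any other type name falls through to super.visitPrimitiveDataType, existing SQL keeps working unchanged; only the new blob keyword takes the custom path.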
Try it out
$ bin/spark-shell --jars <PATH>/parser-0.1.0-SNAPSHOT.jar --conf spark.sql.extensions=dev.plugins.parser.ParserSessionExtension
scala> spark.sql("create table tblWithBlob (mybytearray blob, id int) using parquet options (path '/tmp/p')")
scala> spark.sql("Insert Into tblWithBlob values (array(1,2,3), 1)")
scala> spark.sql("select * from tblwithblob").schema
res3: org.apache.spark.sql.types.StructType = StructType(StructField(mybytearray,ArrayType(ByteType,true),true), StructField(id,IntegerType,true))
scala> spark.sql("select * from tblwithblob").show
+-----------+---+
|mybytearray| id|
+-----------+---+
| [1, 2, 3]| 1|
+-----------+---+