Creating custom SQL UDFs on Spark (Hive) and Databricks with Gradle and Java

Intro

Recently, I was looking for a good working example of how to create a custom Spark SQL User-Defined Function (UDF) in Java and build it using a modern tool like Gradle.

Usually, my go-to choice for UDFs is Python, but this time the functions had to be callable from plain SQL.

I didn’t manage to find a quick starter on Google, so I’ll share one here.

Gradle config

Tested with Gradle version 6.8.


plugins {
    // to bundle a jar with all dependencies we will be using shadow plugin
    id 'com.github.johnrengelman.shadow' version '6.1.0'
    id 'java'
}

group 'com.tasubo.dev'
version '1.1-SNAPSHOT'

repositories {
    mavenCentral()
}

dependencies {
    // these dependencies are (or should be) available on the cluster itself so no
    // need to bundle these
    compileOnly 'org.apache.hadoop:hadoop-client:3.3.0'
    compileOnly 'org.apache.hive:hive-exec:3.1.2'

    // some external lib that we need to include in the JAR
    implementation group: 'com.neovisionaries', name: 'nv-i18n', version: '1.28' 

    testImplementation 'org.junit.jupiter:junit-jupiter-api:5.7.0'
    
    // hive-exec is needed to make the tests run
    testImplementation 'org.apache.hive:hive-exec:3.1.2'
    testRuntimeOnly 'org.junit.jupiter:junit-jupiter-engine:5.7.0'
}

test {
    useJUnitPlatform()
}
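A note on naming: the shadow JAR we build later is named after the Gradle project, so this assumes a settings.gradle along these lines (the project name here is an assumption to match the JAR name shown further down):

// settings.gradle (assumed) - the shadow JAR name is derived from this
rootProject.name = 'spark-sql-udfs-starter'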

Java UDF

A specific (simple) UDF example that works with a single data type. Note that this older UDF base class is deprecated in recent Hive versions in favor of GenericUDF (shown next), but it remains the simplest starting point.

package com.tasubo.dev;

import com.neovisionaries.i18n.CountryCode;
import org.apache.hadoop.hive.ql.exec.UDF;

public class StarterUdf extends UDF {
    public String evaluate(String code) {
        if (code == null) {
            return null; // let SQL NULLs propagate instead of failing the query
        }
        // getByAlpha2Code returns null for unknown codes
        CountryCode countryCode = CountryCode.getByAlpha2Code(code);
        return countryCode == null ? null : countryCode.getAlpha3();
    }
}

Java Generic UDF

Using GenericUDF, it’s possible to make a UDF handle multiple different types, process complex (nested) types, and so on.

Showing a full-blown example is outside the scope of this post, but here is a copy-pastable skeleton of the generic version.

package com.tasubo.dev;

import com.neovisionaries.i18n.CountryCode;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class StarterGenericUdf extends GenericUDF {

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        if (arguments.length != 1) {
            throw new UDFArgumentLengthException("expects exactly one argument");
        }
        // declare that the function returns a plain Java String
        return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        Object value = arguments[0].get();
        if (value == null) {
            return null; // propagate SQL NULLs
        }
        CountryCode countryCode = CountryCode.getByAlpha2Code(value.toString());
        return countryCode == null ? null : countryCode.getAlpha3();
    }

    @Override
    public String getDisplayString(String[] children) {
        return getStandardDisplayString("starter_generic_udf", children);
    }
}

Tests

A few simple tests are in order too:

import com.tasubo.dev.StarterUdf;
import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.*;

class SampleTest {
    StarterUdf fun = new StarterUdf();

    @Test
    public void two_letter_code_should_be_converted_to_iso_code() {
        String outcome = fun.evaluate("CA");
        assertEquals("CAN", outcome);
    }

}
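The generic version takes a bit more ceremony to test, since the call has to go through initialize and evaluate(DeferredObject[]) the same way Hive drives it at query time. A minimal sketch (the test class and method names here are my own):

import com.tasubo.dev.StarterGenericUdf;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredJavaObject;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredObject;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.assertEquals;

class GenericSampleTest {

    @Test
    void generic_udf_converts_two_letter_code() throws HiveException {
        StarterGenericUdf fun = new StarterGenericUdf();
        // mimic what Hive does at query-planning time: one string argument
        fun.initialize(new ObjectInspector[]{
                PrimitiveObjectInspectorFactory.javaStringObjectInspector});

        Object outcome = fun.evaluate(
                new DeferredObject[]{new DeferredJavaObject("CA")});

        assertEquals("CAN", outcome);
    }
}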

Using in SQL

To build the final JAR we use the Shadow plugin:

./gradlew shadowJar
ls build/libs
spark-sql-udfs-starter-1.1-SNAPSHOT-all.jar
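One way to get the JAR onto DBFS is the Databricks CLI (a sketch, assuming the CLI is configured; the destination path is up to you). If you upload through the FileStore web UI instead, it may replace dots and dashes in the file name with underscores, which is why the path in the SQL below differs from the JAR name above:

databricks fs cp build/libs/spark-sql-udfs-starter-1.1-SNAPSHOT-all.jar \
    dbfs:/FileStore/spark_sql_udfs_starter_1_1_SNAPSHOT_all.jar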

Finally, we can upload the JAR to the FileStore (on DBFS) and run the following in Spark SQL (Databricks) to create a function:

%sql
CREATE OR REPLACE FUNCTION convert_country_code AS 'com.tasubo.dev.StarterUdf'
    USING JAR 'dbfs:/FileStore/spark_sql_udfs_starter_1_1_SNAPSHOT_all.jar';
-- REFRESH FUNCTION convert_country_code;

SELECT convert_country_code("CA") AS iso;
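To confirm that the function is registered and points at the right class, Spark SQL can also describe it:

DESCRIBE FUNCTION EXTENDED convert_country_code;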

Code

The complete example can be found on GitHub.