h_spark.PySparkUDFGraphAdapter

This is an experimental GraphAdapter, so its API may change. That said, the code is stable, and you should feel comfortable taking it for a spin - let us know how it goes, and what the rough edges are if you find any. If you are using it, we'd love feedback on how to improve it or graduate it.

class hamilton.plugins.h_spark.PySparkUDFGraphAdapter

UDF graph adapter for PySpark.

This graph adapter enables one to write Hamilton functions that can be executed as UDFs in PySpark.

Core to this is the mapping of function arguments to Spark columns available in the passed-in dataframe.

This adapter currently supports:

  • Regular UDFs, which are executed in a row-based fashion.

  • A single variant of Pandas UDFs: func(series+) -> series.

  • Regular Hamilton functions, which will execute on the Spark driver side.

DISCLAIMER – this class is experimental, so signature changes are a possibility!
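For illustration, here is a minimal sketch of how one might wire this adapter up. The function and column names (spend, signups, spend_per_signup, spend_doubled) are hypothetical, and the pattern of mapping each source column name to the input dataframe follows Hamilton's pyspark_udfs example; details may vary across Hamilton versions.

   import pandas as pd
   from pyspark.sql import SparkSession

   from hamilton import driver, htypes
   from hamilton.plugins import h_spark

   # Transform functions would normally live in their own module; they are
   # defined inline here (and passed in via __main__) for brevity.

   def spend_per_signup(
       spend: pd.Series, signups: pd.Series
   ) -> htypes.column[pd.Series, float]:
       """Pandas UDF variant -- func(series+) -> series, applied per batch of rows."""
       return spend / signups

   def spend_doubled(spend: float) -> float:
       """Regular UDF -- executed in a row-based fashion."""
       return spend * 2.0

   if __name__ == "__main__":
       spark = SparkSession.builder.master("local[1]").getOrCreate()
       df = spark.createDataFrame(
           pd.DataFrame({"spend": [10.0, 20.0, 30.0], "signups": [1.0, 4.0, 5.0]})
       )

       import __main__  # the module containing the functions above

       dr = driver.Driver({}, __main__, adapter=h_spark.PySparkUDFGraphAdapter())

       # Map each source column name to the dataframe itself; the adapter
       # resolves function arguments against the dataframe's columns.
       inputs = {col: df for col in df.columns}
       result = dr.execute(["spend_per_signup", "spend_doubled"], inputs=inputs)
       result.show()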

__init__()
build_result(**outputs: Dict[str, Any]) -> DataFrame

Builds the result and brings it back to this running process.

Parameters:

outputs – the dictionary of output name -> computed value.

Returns:

The PySpark DataFrame containing the requested outputs.

static check_input_type(node_type: Type, input_value: Any) -> bool

If the input is a PySpark DataFrame, skip the check; otherwise delegate to the default type check.

static check_node_type_equivalence(node_type: Type, input_type: Type) -> bool

Checks for the htype.column annotation and, if present, compares the underlying types.
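For context, a minimal sketch of what such an annotation looks like (the function and column names are hypothetical; htypes.column lives in hamilton.htypes):

   import pandas as pd
   from hamilton import htypes

   # The htypes.column[pd.Series, float] annotation marks this as a Pandas UDF
   # producing a float column; a plain scalar return annotation (e.g. float)
   # would instead be treated as a row-based UDF.
   def signups_doubled(signups: pd.Series) -> htypes.column[pd.Series, float]:
       return signups * 2.0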

execute_node(node: Node, kwargs: Dict[str, Any]) -> Any

Given a node to execute, process it and apply a UDF if applicable.

Parameters:
  • node – the node we’re processing.

  • kwargs – the inputs to the function.

Returns:

The result of the function.