Apache Spark: A comparative overview of UDF, pandas UDF and Arrow-optimized UDF
What are UDFs and why do they exist?
UDF stands for User Defined Function. These functions can be written in Python, R, Java, or Scala, enabling custom and complex data manipulation beyond Spark's built-in functions.
They are designed to make the Spark framework more accessible and flexible for a wider range of users, encouraging broader adoption across a variety of data processing tasks.
1. Standard UDF
UDFs in Apache Spark can be written using various programming languages such as Python, Scala, Java, and R. Each language offers a distinct implementation approach with differences in performance and functionality.
When a query runs, the UDF is serialized on the driver node and shipped to the executor nodes, regardless of the programming language used.
1.1 Java and Scala UDF
These UDFs run within the Java Virtual Machine (JVM), so they avoid the serialization overhead that Python UDFs incur and carry only minor performance penalties. However, they remain opaque to the Catalyst optimizer: they cannot capitalize on Spark's advanced optimization features such as Whole-Stage Code Generation and the Just-In-Time (JIT) compilation benefits it enables, which are designed to enhance performance.