Split and Explode

Spark split column / Spark explode

This section explains the splitting a data from a single column to multiple columns and flattens the row into multiple columns.

Split a column:

The below example splits a column called ‘email‘ based on ‘@‘ and creates a new column called ‘username‘.

 
df.withColumn('username', split(df['email'], '@')[0]).show() 

Output:

+---+------+-----------+-----------------+--------+
| id|  name|    country|            email|username|
+---+------+-----------+-----------------+--------+
|100|  Rick|Netherlands| rickgr@gmail.com|  rickgr|
|101| Jason|        Aus| json.j@gmail.com|  json.j|
|102|Maggie|        Usa|   mg@hotmail.com|      mg|
|104|Eugine|    Denmark|     eu@gmail.com|      eu|
|105| Jacob|        Usa|jacob@hotmail.com|   jacob|
|110|  null|        Aus|   john@gmail.com|    john|
|112| Negan|        Ind|  Negan@gmail.com|   Negan|
+---+------+-----------+-----------------+--------+
Explode a column:

Let’s see the below data to understand about explode in spark.

[{"id":100,"name":"Rick","country":"Netherlands","carbrand":["ford","audi"]},
{"id":101,"name":"Jason","country":"Aus","carbrand":["ford","bugatti"]},
{"id":102,"name":"Maggie","country":"Usa","carbrand":["subaru","honda"]},
{"id":104,"name":"Eugine","country":"Denmark","carbrand":["kia","honda"]},
{"id":105,"name":"Jacob","country":"Usa","carbrand":["ford","audi"]}] 

The dataset is all about employees and the car-brand owned by employees. An employee can have more than one car. Let’s use this data set to flatten the car brand column.

 
json_df = spark.read.json('data_cars.json',multiLine=True)
json_df.show()

Output:

+---------------+-----------+---+------+
|       carbrand|    country| id|  name|
+---------------+-----------+---+------+
|   [ford, audi]|Netherlands|100|  Rick|
|[ford, bugatti]|        Aus|101| Jason|
|[subaru, honda]|        Usa|102|Maggie|
|   [kia, honda]|    Denmark|104|Eugine|
|   [ford, audi]|        Usa|105| Jacob|
+---------------+-----------+---+------+

explode function will flatten the list like below.

 
json_df.select(explode("carbrand").alias("brands")).show() 

Output:

+-------+
|      e|
+-------+
|   ford|
|   audi|
|   ford|
|bugatti|
| subaru|
|  honda|
|    kia|
|  honda|
|   ford|
|   audi|
+-------+
 
from pyspark.sql import functions as F
json_df.withColumn("brands", F.explode("carbrand")).show()

Output:

+---------------+-----------+---+------+-------+
|       carbrand|    country| id|  name| brands|
+---------------+-----------+---+------+-------+
|   [ford, audi]|Netherlands|100|  Rick|   ford|
|   [ford, audi]|Netherlands|100|  Rick|   audi|
|[ford, bugatti]|        Aus|101| Jason|   ford|
|[ford, bugatti]|        Aus|101| Jason|bugatti|
|[subaru, honda]|        Usa|102|Maggie| subaru|
|[subaru, honda]|        Usa|102|Maggie|  honda|
|   [kia, honda]|    Denmark|104|Eugine|    kia|
|   [kia, honda]|    Denmark|104|Eugine|  honda|
|   [ford, audi]|        Usa|105| Jacob|   ford|
|   [ford, audi]|        Usa|105| Jacob|   audi|
+---------------+-----------+---+------+-------+

Here in the above example, the data in the ‘carbrand‘ column is flattened into multiple rows.

Leave a Reply