Split and Explode
Spark split column / Spark explode
This section explains how to split the value of a single column into multiple columns, and how to use explode to flatten an array column into multiple rows.
Split a column:
The example below splits the ‘email’ column on ‘@’ and creates a new column called ‘username’ from the first part.
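The DataFrame df isn’t created in this section; here is a minimal setup sketch that rebuilds the sample data from the output shown below (the SparkSession setup is an assumption):
from pyspark.sql import SparkSession
from pyspark.sql.functions import split

spark = SparkSession.builder.appName("split-explode").getOrCreate()

# Sample rows taken from the output table below
data = [(100, 'Rick', 'Netherlands', 'rickgr@gmail.com'),
        (101, 'Jason', 'Aus', 'json.j@gmail.com'),
        (102, 'Maggie', 'Usa', 'mg@hotmail.com'),
        (104, 'Eugine', 'Denmark', 'eu@gmail.com'),
        (105, 'Jacob', 'Usa', 'jacob@hotmail.com'),
        (110, None, 'Aus', 'john@gmail.com'),
        (112, 'Negan', 'Ind', 'Negan@gmail.com')]
df = spark.createDataFrame(data, ['id', 'name', 'country', 'email'])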
df.withColumn('username', split(df['email'], '@')[0]).show()
Output:
+---+------+-----------+-----------------+--------+
| id| name| country| email|username|
+---+------+-----------+-----------------+--------+
|100| Rick|Netherlands| rickgr@gmail.com| rickgr|
|101| Jason| Aus| json.j@gmail.com| json.j|
|102|Maggie| Usa| mg@hotmail.com| mg|
|104|Eugine| Denmark| eu@gmail.com| eu|
|105| Jacob| Usa|jacob@hotmail.com| jacob|
|110| null| Aus| john@gmail.com| john|
|112| Negan| Ind| Negan@gmail.com| Negan|
+---+------+-----------+-----------------+--------+
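Since split returns an array column, you can keep more than one part as separate columns. A short sketch using the same df (the ‘domain’ column name is just an illustration):
from pyspark.sql.functions import split

parts = split(df['email'], '@')
df.withColumn('username', parts[0]) \
  .withColumn('domain', parts[1]) \
  .show()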
Explode a column:
Let’s look at the data below to understand how explode works in Spark.
[{"id":100,"name":"Rick","country":"Netherlands","carbrand":["ford","audi"]},
{"id":101,"name":"Jason","country":"Aus","carbrand":["ford","bugatti"]},
{"id":102,"name":"Maggie","country":"Usa","carbrand":["subaru","honda"]},
{"id":104,"name":"Eugine","country":"Denmark","carbrand":["kia","honda"]},
{"id":105,"name":"Jacob","country":"Usa","carbrand":["ford","audi"]}]
The dataset describes employees and the car brands they own; an employee can own more than one car. Let’s use this dataset to flatten the ‘carbrand’ column.
json_df = spark.read.json('data_cars.json', multiLine=True)
json_df.show()
Output:
+---------------+-----------+---+------+
| carbrand| country| id| name|
+---------------+-----------+---+------+
| [ford, audi]|Netherlands|100| Rick|
|[ford, bugatti]| Aus|101| Jason|
|[subaru, honda]| Usa|102|Maggie|
| [kia, honda]| Denmark|104|Eugine|
| [ford, audi]| Usa|105| Jacob|
+---------------+-----------+---+------+
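Before exploding, it is worth confirming that ‘carbrand’ was read as an array; a quick check:
json_df.printSchema()
# expect 'carbrand' to show up as an array of strings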
The explode function flattens the array into one row per element, as shown below.
from pyspark.sql.functions import explode

json_df.select(explode("carbrand").alias("brands")).show()
Output:
+-------+
| brands|
+-------+
| ford|
| audi|
| ford|
|bugatti|
| subaru|
| honda|
| kia|
| honda|
| ford|
| audi|
+-------+
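You can also select other columns next to the exploded one; a small sketch:
json_df.select("name", explode("carbrand").alias("brands")).show()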
To keep all of the original columns alongside the exploded values, use explode inside withColumn:
from pyspark.sql import functions as F

json_df.withColumn("brands", F.explode("carbrand")).show()
Output:
+---------------+-----------+---+------+-------+
| carbrand| country| id| name| brands|
+---------------+-----------+---+------+-------+
| [ford, audi]|Netherlands|100| Rick| ford|
| [ford, audi]|Netherlands|100| Rick| audi|
|[ford, bugatti]| Aus|101| Jason| ford|
|[ford, bugatti]| Aus|101| Jason|bugatti|
|[subaru, honda]| Usa|102|Maggie| subaru|
|[subaru, honda]| Usa|102|Maggie| honda|
| [kia, honda]| Denmark|104|Eugine| kia|
| [kia, honda]| Denmark|104|Eugine| honda|
| [ford, audi]| Usa|105| Jacob| ford|
| [ford, audi]| Usa|105| Jacob| audi|
+---------------+-----------+---+------+-------+
In the example above, each element of the ‘carbrand’ array is flattened into its own row.
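If you no longer need the original array column after exploding, you can drop it; a short sketch (reusing the F alias imported above):
json_df.withColumn("brands", F.explode("carbrand")).drop("carbrand").show()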