Handling Missing Values in PySpark

Handling missing values is one of the most critical parts of data analysis in PySpark. It is very common to encounter null values in a dataset, and many operations cannot be performed on null values.

In this blog, we will discuss handling missing values in a PySpark DataFrame. You can use the filter() method to find ‘NA’ or ‘null’ values in a DataFrame.
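For reference, the examples below assume a DataFrame like the following. This is a sketch that reconstructs the sample data from the outputs shown later; the column names id, name and country come from those outputs.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("handling-missing-values").getOrCreate()

# Sample data matching the outputs shown below; None marks the missing name
data = [
    (100, "Rick", "Netherlands"),
    (101, "Jason", "Aus"),
    (102, "Maggie", "Usa"),
    (104, "Eugine", "Denmark"),
    (105, "Jacob", "Usa"),
    (110, None, "Aus"),
    (112, "Negan", "Ind"),
]
df = spark.createDataFrame(data, ["id", "name", "country"])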

Verify null values in the DataFrame:

The first step is to identify records with null values. To verify null values of a specific column, use the following syntax, where ‘name’ is a column in the DataFrame.

 
df.filter(df.name.isNull()).show() 

Output:

+---+----+-------+
| id|name|country|
+---+----+-------+
|110|null|    Aus|
+---+----+-------+
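If you want a quick overview of how many nulls each column contains, you can build a per-column count with when() and count(). This is a small sketch, not part of the original example:

from pyspark.sql import functions as F

# Count null values per column; count() ignores the nulls produced by when()
df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
).show()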

You can also use the Spark SQL functions module to filter null records:

 
from pyspark.sql import functions as F
df.where(F.isnull(F.col("name"))).show()
Replace null values:

You can replace all null values with a specified value. This ensures that every null value is replaced by the input data, which is useful when you do not want to lose rows because of a few null fields.

 
df.na.fill('xxx').show()
or
df.fillna('xxx').show()

Output:

+---+------+-----------+
| id|  name|    country|
+---+------+-----------+
|100|  Rick|Netherlands|
|101| Jason|        Aus|
|102|Maggie|        Usa|
|104|Eugine|    Denmark|
|105| Jacob|        Usa|
|110|   xxx|        Aus|
|112| Negan|        Ind|
+---+------+-----------+
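fillna() also accepts a dictionary, so different columns can get different replacement values. A short sketch, assuming you want a separate placeholder per column:

# Fill each column with its own placeholder value
df.fillna({'name': 'unknown', 'country': 'n/a'}).show()
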
Drop all null rows:

Sometimes you might need to drop all rows with null values, as these rows do not add any value to the analysis. To do this, use df.dropna().

 
df.dropna().show()

Output:

+---+------+-----------+
| id|  name|    country|
+---+------+-----------+
|100|  Rick|Netherlands|
|101| Jason|        Aus|
|102|Maggie|        Usa|
|104|Eugine|    Denmark|
|105| Jacob|        Usa|
|112| Negan|        Ind|
+---+------+-----------+
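dropna() also takes parameters that give you finer control. For example, subset restricts the check to specific columns, and how='all' drops a row only when every column is null. A quick sketch:

# Drop rows only when the 'name' column is null
df.dropna(subset=['name']).show()

# Drop rows only when all columns are null
df.dropna(how='all').show()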

In this blog, you learned about handling missing values in PySpark.
