Basic Dataframe Example


In this article, we dive into the details of the PySpark DataFrame.

Load a CSV file and create a PySpark DataFrame:

 
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('spark-sql').master('local').getOrCreate()
df = spark.read.csv('data.csv', header=True, inferSchema=True)
print(type(df))

Create a DataFrame using Row objects:

 
from pyspark.sql import Row

emp = Row("id", "name", "country")
emp1 = emp(100, "Rick", "Netherlands")
emp2 = emp(101, "Jason", "Aus")
emp_df = spark.createDataFrame([emp1, emp2])
emp_df.show()

Output:

 +---+-----+-----------+
 | id| name|    country|
 +---+-----+-----------+
 |100| Rick|Netherlands|
 |101|Jason|        Aus|
 +---+-----+-----------+ 

A Row is an ordered collection of fields that can be accessed by name or by position.

Verify the schema of the DataFrame:

 
df.printSchema()

Output:

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- country: string (nullable = true)

Check the first few records:

print(df.head(2))

Output:


[Row(id=100, name='Rick', country='Netherlands'), Row(id=101, name='Jason', country='Aus')] 

select: Select one or more columns.

 df.select('name').show()

Output:

+------+
|  name|
+------+
|  Rick|
| Jason|
|Maggie|
|Eugine|
| Jacob|
+------+
 df.select('name','country').show()

Output:

+------+-----------+
|  name|    country|
+------+-----------+
|  Rick|Netherlands|
| Jason|        Aus|
|Maggie|        Usa|
|Eugine|    Denmark|
| Jacob|        Usa|
+------+-----------+

distinct: Find the distinct values of a column.

 df.select('country').distinct().show()

Output:

+-----------+
|    country|
+-----------+
|        Aus|
|    Denmark|
|        Ind|
|        Usa|
|Netherlands|
+-----------+

filter and where: The filter() and where() functions filter rows based on a condition; where() is an alias for filter().

 
from pyspark.sql.functions import col

df.where(col('id') > 102).show()
# or, equivalently:
df.filter(col('id') > 102).show()

Output:

+---+------+-------+
| id|  name|country|
+---+------+-------+
|104|Eugine|Denmark|
|105| Jacob|    Usa|
|110|  null|    Aus|
|112| Negan|    Ind|
+---+------+-------+

groupBy: Group rows by a column and apply an aggregate, such as count().

 
df.groupBy('country').count().show()

Output:

+-----------+-----+
|    country|count|
+-----------+-----+
|        Aus|    1|
|    Denmark|    1|
|        Usa|    2|
|Netherlands|    1|
+-----------+-----+

In this blog, you have seen how to create a DataFrame in PySpark and the basics of working with it. In the next blog, we will discuss handling 'NA' or empty values in a DataFrame.
