Basic Dataframe Example
Pyspark Dataframe / Pyspark filter
In this article, we dive into the details of the PySpark DataFrame.
Load a CSV file and create a PySpark DataFrame:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('spark-sql').master("local").getOrCreate()
df = spark.read.csv('data.csv', header='true', inferSchema='true')
print(type(df))
Create a DataFrame from Row objects:
from pyspark.sql import Row

emp = Row("id", "name", "country")
emp1 = emp(100, "Rick", "Netherlands")
emp2 = emp(101, "Jason", "Aus")
emp_df = spark.createDataFrame([emp1, emp2])
emp_df.show()
Output:
+---+-----+-----------+
| id| name| country|
+---+-----+-----------+
|100| Rick|Netherlands|
|101|Jason| Aus|
+---+-----+-----------+
A Row is an ordered collection of fields that can be accessed by name or by position.
Verify the schema of the DataFrame:
df.printSchema()
Output:
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- country: string (nullable = true)
Check the first few records:
print(df.head(2))
Output:
[Row(id=100, name='Rick', country='Netherlands'), Row(id=101, name='Jason', country='Aus')]
select: Select one or more columns.
df.select('name').show()
Output:
+------+
| name|
+------+
| Rick|
| Jason|
|Maggie|
|Eugine|
| Jacob|
+------+
df.select('name','country').show()
Output:
+------+-----------+
| name| country|
+------+-----------+
| Rick|Netherlands|
| Jason| Aus|
|Maggie| Usa|
|Eugine| Denmark|
| Jacob| Usa|
+------+-----------+
distinct: Find the distinct values of a column.
df.select('country').distinct().show()
Output:
+-----------+
| country|
+-----------+
| Aus|
| Denmark|
| Ind|
| Usa|
|Netherlands|
+-----------+
filter and where: The filter() and where() functions filter rows based on a condition; where() is simply an alias for filter().
from pyspark.sql.functions import col

df.where(col('id') > 102).show()
# or equivalently
df.filter(col('id') > 102).show()
Output:
+---+------+-------+
| id| name|country|
+---+------+-------+
|104|Eugine|Denmark|
|105| Jacob| Usa|
|110| null| Aus|
|112| Negan| Ind|
+---+------+-------+
groupBy: Group rows by a column and aggregate, e.g. count the rows in each group.
df.groupBy('country').count().show()
Output:
+-----------+-----+
| country|count|
+-----------+-----+
| Aus| 1|
| Denmark| 1|
| Usa| 2|
|Netherlands| 1|
+-----------+-----+
In this blog, you have seen how to create a DataFrame in PySpark and the basics of working with it. In the next blog, we will discuss handling ‘NA’ or empty values in a DataFrame. To know more about PySpark, follow this link.