RDD vs DataFrame vs Dataset

This post covers the differences between RDD, DataFrame, and Dataset in Apache Spark.

RDD stands for Resilient Distributed Dataset: an immutable, fault-tolerant collection of records that is partitioned across the nodes of a cluster. The RDD is Spark's fundamental data abstraction. An RDD has no schema, which makes it a good fit for unstructured data.
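As a minimal sketch of the RDD properties described above, the following word count runs on a local Spark session. The object name `RddSketch`, the helper `wordCounts`, and the sample sentences are made up for illustration; the local master setting is an assumption suitable only for experimenting.

```scala
import org.apache.spark.sql.SparkSession

object RddSketch {
  // Hypothetical helper: counts words in a handful of unstructured text lines.
  def wordCounts(spark: SparkSession, lines: Seq[String]): Map[String, Int] = {
    // Distribute the lines over 2 partitions; no schema is attached to the data.
    val rdd = spark.sparkContext.parallelize(lines, numSlices = 2)

    // Each transformation returns a new RDD; `rdd` itself is never modified.
    val words = rdd.flatMap(line => line.split(" "))
    val counts = words.map(word => (word, 1)).reduceByKey(_ + _)

    // collect() brings the results back to the driver.
    counts.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed setup, not a cluster config).
    val spark = SparkSession.builder().appName("rdd-sketch").master("local[*]").getOrCreate()
    wordCounts(spark, Seq("spark makes rdds", "rdds are immutable")).foreach(println)
    spark.stop()
  }
}
```

Because the data is just a collection of strings, every step has to be spelled out with functions such as `flatMap` and `reduceByKey`; Spark knows nothing about the shape of the records.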

A DataFrame is a distributed collection of data organized into rows and named columns. Like an RDD, a DataFrame is immutable, but unlike an RDD it always carries a schema. The schema is what allows operations such as filter, select, and aggregation to be expressed on columns. The DataFrame is a higher-level abstraction that lets users impose structure on distributed data and query it with SQL.
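The sketch below shows the same idea in code: a small DataFrame with a schema, a filter plus aggregation through the DataFrame API, and a SQL query over a temporary view. The column names, the sample rows, the view name `employees`, and the helper names are all hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, col}

object DataFrameSketch {
  // Hypothetical sample data: (name, dept, salary).
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("Ann", "eng", 120.0), ("Bob", "eng", 100.0), ("Cid", "ops", 90.0))
      .toDF("name", "dept", "salary")
  }

  // Filter on a column, then aggregate per department.
  def avgSalaryByDept(df: DataFrame): DataFrame =
    df.filter(col("salary") > 95.0)
      .groupBy("dept")
      .agg(avg("salary").as("avg_salary"))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed setup).
    val spark = SparkSession.builder().appName("df-sketch").master("local[*]").getOrCreate()
    val df = sample(spark)

    df.printSchema()            // the schema travels with the data
    avgSalaryByDept(df).show()

    // The same data can be queried with SQL through a temporary view.
    df.createOrReplaceTempView("employees")
    spark.sql("SELECT name FROM employees WHERE dept = 'eng'").show()

    spark.stop()
  }
}
```

Note that columns are referred to by string names here, so nothing checks them until the query is analyzed at runtime.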

A Dataset is a distributed collection of data that extends the DataFrame API. It is strongly typed: the data is mapped to a schema, and the user has to specify a class when defining a Dataset. Unlike a DataFrame, a Dataset provides compile-time type safety. The Dataset API was introduced in Spark 1.6, and since Spark 2.0 a DataFrame in Scala is simply a Dataset of Row objects.
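To make "the user has to specify a class" concrete, here is a minimal sketch in which a case class supplies the schema for a Dataset. The `Person` class, the sample values, and the helper names are hypothetical.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type. It is defined at the top level so that
// Spark can derive an encoder for it.
case class Person(name: String, age: Int)

object DatasetSketch {
  def people(spark: SparkSession): Dataset[Person] = {
    import spark.implicits._   // brings the encoder for Person into scope
    Seq(Person("Ann", 34), Person("Bob", 17)).toDS()
  }

  // The lambda works on Person objects, so `age` is checked by the compiler.
  def adults(ds: Dataset[Person]): Dataset[Person] =
    ds.filter(p => p.age >= 18)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed setup).
    val spark = SparkSession.builder().appName("ds-sketch").master("local[*]").getOrCreate()
    val ds = people(spark)

    ds.printSchema()   // schema derived from the fields of Person
    adults(ds).collect().foreach(p => println(p.name))

    spark.stop()
  }
}
```

The Dataset still has columns and a schema like a DataFrame, but each element is a `Person`, so ordinary Scala code can work with its fields directly.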

Both DataFrame and Dataset are built on top of RDDs, and both let the user benefit from Spark's optimized execution when dealing with structured data. Because a DataFrame is "untyped", a mistake such as a misspelled column name shows up only as a runtime error, whereas a Dataset is strongly typed, so the same mistake is caught at compile time.
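The difference between a runtime error and a compile-time error can be sketched as follows. The `Reading` class, the deliberately misspelled name `valeu`, and the helper names are hypothetical. The sketch assumes a classic local Spark session, where a query is analyzed as soon as it is built.

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}
import scala.util.{Failure, Try}

// Hypothetical record type used only for this illustration.
case class Reading(sensor: String, value: Double)

object TypeSafetySketch {
  // DataFrame: the misspelled column name compiles, and fails only at runtime.
  def misspelledColumnFailsAtRuntime(spark: SparkSession): Boolean = {
    import spark.implicits._
    val df = Seq(Reading("a", 1.0)).toDF()
    // "valeu" is a deliberate typo; to the compiler it is just a string.
    Try(df.select("valeu")) match {
      case Failure(_: AnalysisException) => true
      case _                             => false
    }
  }

  // Dataset: field access goes through the Reading class.
  def typedValues(spark: SparkSession): Seq[Double] = {
    import spark.implicits._
    val ds = Seq(Reading("a", 1.0)).toDS()
    // Writing ds.map(_.valeu) here would not compile,
    // because Reading has no member named valeu.
    ds.map(_.value).collect().toSeq
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed setup).
    val spark = SparkSession.builder().appName("safety-sketch").master("local[*]").getOrCreate()
    println(misspelledColumnFailsAtRuntime(spark))
    println(typedValues(spark))
    spark.stop()
  }
}
```

In the DataFrame case the typo survives compilation and is reported by Spark as an `AnalysisException`; in the Dataset case the equivalent typo never gets past the Scala compiler.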

The Dataset API is available in Scala and Java. Python does not support the Dataset API, but because of Python's dynamic nature many of its benefits are already available there; for example, a field of a row can be accessed by name, as in row.columnName.
