PySpark Introduction and Creating a DataFrame

In this article, we will discuss PySpark and how to create a DataFrame in PySpark.

Introduction to Big Data

Big Data is one of the trending technologies in today's world; modern applications cannot function without data.

Huge volumes of data are generated every day, and this data has to be processed. Many technologies have emerged to process it, but what matters is processing data efficiently while using as few resources as possible. One of the best options the Big Data ecosystem provides is Spark.

Spark provides processing APIs in several programming languages, including Java, Python, and Scala, and is used to store and process data efficiently. Now we will discuss Spark in Python.

Python exposes Spark through a module known as PySpark.

Let's discuss PySpark.
To use PySpark, we first have to install it, which we can do with the pip command.

Syntax:

pip install pyspark

Now PySpark is ready to use. Let's see how to import PySpark and use it.

Step-1: Import the PySpark module.
We can do this with the import statement.
Syntax:

import pyspark
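To confirm the installation, we can print the module's version; pyspark exposes the standard __version__ attribute:

import pyspark

# print the installed version to confirm the module is available
print(pyspark.__version__)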

Step-2: Create the Spark app
Every PySpark program runs inside a SparkSession, so we create a session and give the app a name through it.
We can create the app with the getOrCreate() method.
Syntax:

from pyspark.sql import SparkSession
app = SparkSession.builder.appName('app_name').getOrCreate()
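As a side note, getOrCreate() returns any session that already exists instead of starting a second one, so it is safe to call repeatedly. A quick sketch (variable names are ours):

from pyspark.sql import SparkSession

app = SparkSession.builder.appName('app_name').getOrCreate()
# a second builder call returns the same session instead of creating a new one
same_app = SparkSession.builder.appName('other_name').getOrCreate()
print(app is same_app)  # True
print(app.version)      # the Spark version backing this session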

We are ready with PySpark! Let's create the DataFrame.
Before creating the DataFrame, we should know what a DataFrame is.
A DataFrame is a data structure that stores data in rows and columns. We can create one from a list of dictionaries, where each dictionary represents one row: its keys become the column names and its values become that row's values.

In PySpark, we can create the DataFrame with the createDataFrame() method.
Syntax:

app.createDataFrame(data)
where, app is the SparkSession created above and data is the list of dictionaries.
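createDataFrame() is not limited to dictionaries; it also accepts, for example, a list of tuples together with a list of column names. A minimal sketch (data and column names are illustrative):

# rows as tuples, with column names supplied separately
rows = [(112, 'onions'), (113, 'potato')]
small_df = app.createDataFrame(rows, ['food_id', 'item'])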

If we want to display the DataFrame, we use the show() method, which prints the DataFrame in tabular format.
Example:
Python program to create a DataFrame from grocery data

import pyspark
from pyspark.sql import SparkSession

# create the app name GKINDEX
app = SparkSession.builder.appName('GKINDEX').getOrCreate()

# create grocery data with 5 items with 4 attributes
grocery_data = [{'food_id':112,'item':'onions','cost':234.89,'quantity':4},
               {'food_id':113,'item':'potato','cost':17.39,'quantity':1},
               {'food_id':102,'item':'grains','cost':4234.9,'quantity':84},
               {'food_id':98,'item':'shampoo/soap','cost':1234.89,'quantity':94},
               {'food_id':56,'item':'oil','cost':134.00,'quantity':10}]

# creating a dataframe from the grocery_data
input_dataframe = app.createDataFrame(grocery_data)

# display the dataframe
input_dataframe.show()
Output:
+-------+-------+------------+--------+
|   cost|food_id|        item|quantity|
+-------+-------+------------+--------+
| 234.89|    112|      onions|       4|
|  17.39|    113|      potato|       1|
| 4234.9|    102|      grains|      84|
|1234.89|     98|shampoo/soap|      94|
|  134.0|     56|         oil|      10|
+-------+-------+------------+--------+
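show() also accepts optional parameters: the number of rows to print and a truncate flag for long cell values. For example (values illustrative):

# print only the first 2 rows, without truncating long strings
input_dataframe.show(2, truncate=False)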

If we want to display the DataFrame in row format, we use the collect() method, which returns a list of Row objects.
Syntax:

dataframe.collect()
Example:
Display a PySpark DataFrame in Row format
import pyspark
from pyspark.sql import SparkSession

# create the app name GKINDEX
app = SparkSession.builder.appName('GKINDEX').getOrCreate()

# create grocery data with 5 items with 4 attributes
grocery_data = [{'food_id':112,'item':'onions','cost':234.89,'quantity':4},
               {'food_id':113,'item':'potato','cost':17.39,'quantity':1},
               {'food_id':102,'item':'grains','cost':4234.9,'quantity':84},
               {'food_id':98,'item':'shampoo/soap','cost':1234.89,'quantity':94},
               {'food_id':56,'item':'oil','cost':134.00,'quantity':10}]

# creating a dataframe from the grocery_data
input_dataframe = app.createDataFrame(grocery_data)

# display the dataframe as a list of Row objects
print(input_dataframe.collect())
Output:
[Row(cost=234.89, food_id=112, item='onions', quantity=4),
 Row(cost=17.39, food_id=113, item='potato', quantity=1),
 Row(cost=4234.9, food_id=102, item='grains', quantity=84),
 Row(cost=1234.89, food_id=98, item='shampoo/soap', quantity=94),
 Row(cost=134.0, food_id=56, item='oil', quantity=10)]
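Keep in mind that collect() brings every row to the driver, which can exhaust memory on large DataFrames; the take() method fetches only the first n rows as the same Row objects. A small sketch:

# fetch only the first 2 rows instead of the whole DataFrame
print(input_dataframe.take(2))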
If we want to get the top rows, we can use the head() method, passing the number of rows to return as the parameter.

Syntax:

dataframe.head(n)
where, n is the number of rows.
Similarly, if we want to get the last rows, we can use the tail() method, again passing the number of rows as the parameter.
Syntax:
dataframe.tail(n)
where, n is the number of rows.
Example:
Display the top and last rows
import pyspark
from pyspark.sql import SparkSession

# create the app name GKINDEX
app = SparkSession.builder.appName('GKINDEX').getOrCreate()

# create grocery data with 5 items with 4 attributes
grocery_data = [{'food_id':112,'item':'onions','cost':234.89,'quantity':4},
               {'food_id':113,'item':'potato','cost':17.39,'quantity':1},
               {'food_id':102,'item':'grains','cost':4234.9,'quantity':84},
               {'food_id':98,'item':'shampoo/soap','cost':1234.89,'quantity':94},
               {'food_id':56,'item':'oil','cost':134.00,'quantity':10}]

# creating a dataframe from the grocery_data
input_dataframe = app.createDataFrame(grocery_data)

# display 4 rows from the top
print(input_dataframe.head(4))

# display the last 2 rows
print(input_dataframe.tail(2))
Output:
[Row(cost=234.89, food_id=112, item='onions', quantity=4), Row(cost=17.39, food_id=113, item='potato', quantity=1), Row(cost=4234.9, food_id=102, item='grains', quantity=84), Row(cost=1234.89, food_id=98, item='shampoo/soap', quantity=94)]
[Row(cost=1234.89, food_id=98, item='shampoo/soap', quantity=94), Row(cost=134.0, food_id=56, item='oil', quantity=10)]
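As a closing note, head() called without an argument returns a single Row rather than a list, which is the same as calling first():

# head() with no argument returns one Row object, like first()
print(input_dataframe.head())
print(input_dataframe.first())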