In this article, we will discuss PySpark and how to create a DataFrame in PySpark.
Big data is one of the trending technologies in today's world, and we cannot get by without data.
Every day, huge amounts of data are generated, and we have to process them. Many technologies have come along to process data, but processing it efficiently, using as few resources as possible, is what matters. One of the best options in the big data ecosystem is Spark.
Spark provides processing APIs in different programming languages such as Java, Python, and Scala, and it is used to store and process data efficiently. Now we will discuss Spark in Python.
Python provides access to Spark through a module known as PySpark.
Let's discuss PySpark.
If we want to use PySpark, we first have to install it. We can do this with the pip command.
Syntax:
pip install pyspark
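As a quick optional check (just a small sketch, assuming the install finished without errors), importing the module and printing its version confirms that PySpark is available:

# optional sanity check: prints the installed PySpark version
import pyspark
print(pyspark.__version__)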
Now PySpark is ready to use. Let's see how to import PySpark and use it.
Step-1: Import the PySpark module.
We can do this by using the import statement.
Syntax:
import pyspark
Step-2: Create Spark App
We have to create a Spark app with a name through SparkSession, so we build a SparkSession and use it to create the app. We can create the app by using the getOrCreate() method.
Syntax:
from pyspark.sql import SparkSession
app = SparkSession.builder.appName('app_name').getOrCreate()
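As a minimal sketch (the app name 'demo_app' below is only a placeholder), we can build the session and print the Spark version to confirm it is working; calling getOrCreate() again later returns this same session instead of creating a new one:

from pyspark.sql import SparkSession

# build the session (or reuse an already running one)
app = SparkSession.builder.appName('demo_app').getOrCreate()

# the session exposes the running Spark version
print(app.version)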
We are ready with PySpark! Let's create the DataFrame.
Before creating the DataFrame, we should know what a DataFrame is.
A DataFrame is a data structure that stores data in rows and columns.
We can create a DataFrame from a list of dictionaries, where each dictionary represents one row: the keys become the column names and the values become the values in that row.
In PySpark, we can create the DataFrame by using the createDataFrame() method.
Syntax:
app.createDataFrame(data), where app is the SparkSession created above and data is the list of dictionaries.
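As an illustrative sketch (the fruit data below is made up), each dictionary in the list becomes one row and its keys become the column names:

# hypothetical two-row dataset: keys become columns, each dict is one row
sample_data = [{'name': 'apple', 'price': 30},
               {'name': 'mango', 'price': 50}]

# 'app' is the SparkSession created earlier
sample_dataframe = app.createDataFrame(sample_data)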
If we want to display the DataFrame, we use the show() method. This displays the DataFrame in tabular format.
Example:
Python program to create a DataFrame from the grocery data:

import pyspark
from pyspark.sql import SparkSession

# create the app with the name GKINDEX
app = SparkSession.builder.appName('GKINDEX').getOrCreate()

# create grocery data with 5 items, each with 4 attributes
grocery_data = [{'food_id': 112, 'item': 'onions', 'cost': 234.89, 'quantity': 4},
                {'food_id': 113, 'item': 'potato', 'cost': 17.39, 'quantity': 1},
                {'food_id': 102, 'item': 'grains', 'cost': 4234.9, 'quantity': 84},
                {'food_id': 98, 'item': 'shampoo/soap', 'cost': 1234.89, 'quantity': 94},
                {'food_id': 56, 'item': 'oil', 'cost': 134.00, 'quantity': 10}]

# creating a DataFrame from the grocery_data
input_dataframe = app.createDataFrame(grocery_data)

# display the dataframe
input_dataframe.show()

Output:
+-------+-------+------------+--------+
|   cost|food_id|        item|quantity|
+-------+-------+------------+--------+
| 234.89|    112|      onions|       4|
|  17.39|    113|      potato|       1|
| 4234.9|    102|      grains|      84|
|1234.89|     98|shampoo/soap|      94|
|  134.0|     56|         oil|      10|
+-------+-------+------------+--------+
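As a side note, show() also accepts optional parameters; for example, the sketch below (using the same input_dataframe) prints only the first two rows and keeps long values untruncated:

# show only the first 2 rows without truncating long column values;
# by default show() prints up to 20 rows and truncates long strings
input_dataframe.show(2, truncate=False)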
If we want to display the DataFrame as rows, we use the collect() method, which returns a list of Row objects.
Syntax:
dataframe.collect()
Example:

import pyspark
from pyspark.sql import SparkSession

# create the app with the name GKINDEX
app = SparkSession.builder.appName('GKINDEX').getOrCreate()

# create grocery data with 5 items, each with 4 attributes
grocery_data = [{'food_id': 112, 'item': 'onions', 'cost': 234.89, 'quantity': 4},
                {'food_id': 113, 'item': 'potato', 'cost': 17.39, 'quantity': 1},
                {'food_id': 102, 'item': 'grains', 'cost': 4234.9, 'quantity': 84},
                {'food_id': 98, 'item': 'shampoo/soap', 'cost': 1234.89, 'quantity': 94},
                {'food_id': 56, 'item': 'oil', 'cost': 134.00, 'quantity': 10}]

# creating a DataFrame from the grocery_data
input_dataframe = app.createDataFrame(grocery_data)

# display the dataframe as a list of rows
input_dataframe.collect()

Output:
[Row(cost=234.89, food_id=112, item='onions', quantity=4),
 Row(cost=17.39, food_id=113, item='potato', quantity=1),
 Row(cost=4234.9, food_id=102, item='grains', quantity=84),
 Row(cost=1234.89, food_id=98, item='shampoo/soap', quantity=94),
 Row(cost=134.0, food_id=56, item='oil', quantity=10)]

If we want to get the top rows, we can use the head() method. We have to specify the number of rows to be returned as the parameter.
Syntax:
dataframe.head(n), where n is the number of rows.
Similarly, if we want to get the last rows, we can use the tail() method. We again have to specify the number of rows as the parameter.
Syntax:
dataframe.tail(n), where n is the number of rows.
Example:

import pyspark
from pyspark.sql import SparkSession

# create the app with the name GKINDEX
app = SparkSession.builder.appName('GKINDEX').getOrCreate()

# create grocery data with 5 items, each with 4 attributes
grocery_data = [{'food_id': 112, 'item': 'onions', 'cost': 234.89, 'quantity': 4},
                {'food_id': 113, 'item': 'potato', 'cost': 17.39, 'quantity': 1},
                {'food_id': 102, 'item': 'grains', 'cost': 4234.9, 'quantity': 84},
                {'food_id': 98, 'item': 'shampoo/soap', 'cost': 1234.89, 'quantity': 94},
                {'food_id': 56, 'item': 'oil', 'cost': 134.00, 'quantity': 10}]

# creating a DataFrame from the grocery_data
input_dataframe = app.createDataFrame(grocery_data)

# display 4 rows from the top
print(input_dataframe.head(4))

# display the last 2 rows
print(input_dataframe.tail(2))

Output:
[Row(cost=234.89, food_id=112, item='onions', quantity=4),
 Row(cost=17.39, food_id=113, item='potato', quantity=1),
 Row(cost=4234.9, food_id=102, item='grains', quantity=84),
 Row(cost=1234.89, food_id=98, item='shampoo/soap', quantity=94)]
[Row(cost=1234.89, food_id=98, item='shampoo/soap', quantity=94),
 Row(cost=134.0, food_id=56, item='oil', quantity=10)]
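As a usage note, collect(), head(), and tail() all return Row objects, whose fields can be read by attribute or by key; a minimal sketch with the same input_dataframe:

# loop over the collected rows and read fields by attribute or by key
for row in input_dataframe.collect():
    print(row.item, row['cost'])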