In this article, we will show how the avg() function works in PySpark. avg() is an aggregate function used to get the average value of one or more DataFrame columns. There are three ways to get the average value; let's go through them one by one.

First, let's create a DataFrame for demonstration.

Step 1: Creating the DataFrame.
import pyspark
from pyspark.sql import SparkSession

# create the SparkSession with the app name GKINDEX
app = SparkSession.builder.appName('GKINDEX').getOrCreate()

# grocery data: 10 rows, each with 4 attributes
grocery_data = [{'food_id': 112, 'item': 'onions', 'cost': 234.89, 'quantity': 4},
                {'food_id': 113, 'item': 'potato', 'cost': 17.39, 'quantity': 1},
                {'food_id': 102, 'item': 'grains', 'cost': 4234.9, 'quantity': 84},
                {'food_id': 98, 'item': 'shampoo/soap', 'cost': 10.89, 'quantity': 2},
                {'food_id': 98, 'item': 'shampoo/soap', 'cost': 100.89, 'quantity': 20},
                {'food_id': 98, 'item': 'shampoo/soap', 'cost': 1234.89, 'quantity': 94},
                {'food_id': 113, 'item': 'potato', 'cost': 170.39, 'quantity': 10},
                {'food_id': 113, 'item': 'potato', 'cost': 34.39, 'quantity': 2},
                {'food_id': 102, 'item': 'grains', 'cost': 1000.9, 'quantity': 24},
                {'food_id': 56, 'item': 'oil', 'cost': 134.00, 'quantity': 10}]

# creating a DataFrame from grocery_data
input_dataframe = app.createDataFrame(grocery_data)

# display
input_dataframe.show()

Output:
+-------+-------+------------+--------+
|   cost|food_id|        item|quantity|
+-------+-------+------------+--------+
| 234.89|    112|      onions|       4|
|  17.39|    113|      potato|       1|
| 4234.9|    102|      grains|      84|
|  10.89|     98|shampoo/soap|       2|
| 100.89|     98|shampoo/soap|      20|
|1234.89|     98|shampoo/soap|      94|
| 170.39|    113|      potato|      10|
|  34.39|    113|      potato|       2|
| 1000.9|    102|      grains|      24|
|  134.0|     56|         oil|      10|
+-------+-------+------------+--------+
Method 1: Using select() with avg()

Syntax: dataframe.select(avg('column1'), ............, avg('column n'))

where,
1. dataframe is the input PySpark DataFrame
2. column is the name of the column whose average value is returned
import pyspark
from pyspark.sql import SparkSession

# import the avg function
from pyspark.sql.functions import avg

# create the SparkSession with the app name GKINDEX
app = SparkSession.builder.appName('GKINDEX').getOrCreate()

# grocery data: 10 rows, each with 4 attributes
grocery_data = [{'food_id': 112, 'item': 'onions', 'cost': 234.89, 'quantity': 4},
                {'food_id': 113, 'item': 'potato', 'cost': 17.39, 'quantity': 1},
                {'food_id': 102, 'item': 'grains', 'cost': 4234.9, 'quantity': 84},
                {'food_id': 98, 'item': 'shampoo/soap', 'cost': 10.89, 'quantity': 2},
                {'food_id': 98, 'item': 'shampoo/soap', 'cost': 100.89, 'quantity': 20},
                {'food_id': 98, 'item': 'shampoo/soap', 'cost': 1234.89, 'quantity': 94},
                {'food_id': 113, 'item': 'potato', 'cost': 170.39, 'quantity': 10},
                {'food_id': 113, 'item': 'potato', 'cost': 34.39, 'quantity': 2},
                {'food_id': 102, 'item': 'grains', 'cost': 1000.9, 'quantity': 24},
                {'food_id': 56, 'item': 'oil', 'cost': 134.00, 'quantity': 10}]

# creating a DataFrame from grocery_data
input_dataframe = app.createDataFrame(grocery_data)

# get the average of the cost and quantity columns
input_dataframe.select(avg('cost'), avg('quantity')).show()

Output:
+-----------------+-------------+
|        avg(cost)|avg(quantity)|
+-----------------+-------------+
|717.3530000000001|         25.1|
+-----------------+-------------+
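As a quick sanity check outside Spark (this is plain Python, not part of the PySpark API), the same two averages can be recomputed directly from the raw grocery_data list:

```python
from statistics import mean

# the same grocery data used in the examples above
grocery_data = [{'food_id': 112, 'item': 'onions', 'cost': 234.89, 'quantity': 4},
                {'food_id': 113, 'item': 'potato', 'cost': 17.39, 'quantity': 1},
                {'food_id': 102, 'item': 'grains', 'cost': 4234.9, 'quantity': 84},
                {'food_id': 98, 'item': 'shampoo/soap', 'cost': 10.89, 'quantity': 2},
                {'food_id': 98, 'item': 'shampoo/soap', 'cost': 100.89, 'quantity': 20},
                {'food_id': 98, 'item': 'shampoo/soap', 'cost': 1234.89, 'quantity': 94},
                {'food_id': 113, 'item': 'potato', 'cost': 170.39, 'quantity': 10},
                {'food_id': 113, 'item': 'potato', 'cost': 34.39, 'quantity': 2},
                {'food_id': 102, 'item': 'grains', 'cost': 1000.9, 'quantity': 24},
                {'food_id': 56, 'item': 'oil', 'cost': 134.00, 'quantity': 10}]

# average = sum of values / number of rows, same as Spark's avg()
avg_cost = mean(row['cost'] for row in grocery_data)
avg_quantity = mean(row['quantity'] for row in grocery_data)
print(avg_cost, avg_quantity)  # matches avg(cost) and avg(quantity) above
```

The small trailing digits in Spark's avg(cost) (717.3530000000001) are ordinary floating-point rounding, not an error.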
Method 2: Using agg()

Syntax: dataframe.agg({'column1': 'avg', ......, 'column n': 'avg'})

where,
1. dataframe is the input PySpark DataFrame
2. column is the name of the column whose average value is returned
import pyspark
from pyspark.sql import SparkSession

# create the SparkSession with the app name GKINDEX
app = SparkSession.builder.appName('GKINDEX').getOrCreate()

# grocery data: 10 rows, each with 4 attributes
grocery_data = [{'food_id': 112, 'item': 'onions', 'cost': 234.89, 'quantity': 4},
                {'food_id': 113, 'item': 'potato', 'cost': 17.39, 'quantity': 1},
                {'food_id': 102, 'item': 'grains', 'cost': 4234.9, 'quantity': 84},
                {'food_id': 98, 'item': 'shampoo/soap', 'cost': 10.89, 'quantity': 2},
                {'food_id': 98, 'item': 'shampoo/soap', 'cost': 100.89, 'quantity': 20},
                {'food_id': 98, 'item': 'shampoo/soap', 'cost': 1234.89, 'quantity': 94},
                {'food_id': 113, 'item': 'potato', 'cost': 170.39, 'quantity': 10},
                {'food_id': 113, 'item': 'potato', 'cost': 34.39, 'quantity': 2},
                {'food_id': 102, 'item': 'grains', 'cost': 1000.9, 'quantity': 24},
                {'food_id': 56, 'item': 'oil', 'cost': 134.00, 'quantity': 10}]

# creating a DataFrame from grocery_data
input_dataframe = app.createDataFrame(grocery_data)

# get the average of the cost and quantity columns
input_dataframe.agg({'cost': 'avg', 'quantity': 'avg'}).show()

Output:
+-----------------+-------------+
|        avg(cost)|avg(quantity)|
+-----------------+-------------+
|717.3530000000001|         25.1|
+-----------------+-------------+
Method 3: Using groupBy() with avg()

Syntax: dataframe.groupBy('group_column').avg('column')

where,
1. dataframe is the input PySpark DataFrame
2. group_column is the column whose values the rows are grouped by
3. column is the name of the column whose average value is returned for each group
import pyspark
from pyspark.sql import SparkSession

# create the SparkSession with the app name GKINDEX
app = SparkSession.builder.appName('GKINDEX').getOrCreate()

# grocery data: 10 rows, each with 4 attributes
grocery_data = [{'food_id': 112, 'item': 'onions', 'cost': 234.89, 'quantity': 4},
                {'food_id': 113, 'item': 'potato', 'cost': 17.39, 'quantity': 1},
                {'food_id': 102, 'item': 'grains', 'cost': 4234.9, 'quantity': 84},
                {'food_id': 98, 'item': 'shampoo/soap', 'cost': 10.89, 'quantity': 2},
                {'food_id': 98, 'item': 'shampoo/soap', 'cost': 100.89, 'quantity': 20},
                {'food_id': 98, 'item': 'shampoo/soap', 'cost': 1234.89, 'quantity': 94},
                {'food_id': 113, 'item': 'potato', 'cost': 170.39, 'quantity': 10},
                {'food_id': 113, 'item': 'potato', 'cost': 34.39, 'quantity': 2},
                {'food_id': 102, 'item': 'grains', 'cost': 1000.9, 'quantity': 24},
                {'food_id': 56, 'item': 'oil', 'cost': 134.00, 'quantity': 10}]

# creating a DataFrame from grocery_data
input_dataframe = app.createDataFrame(grocery_data)

# get the average of the cost column grouped by item
input_dataframe.groupBy('item').avg('cost').show()

Output:
+------------+------------------+
|        item|         avg(cost)|
+------------+------------------+
|      grains|2617.8999999999996|
|      onions|            234.89|
|      potato| 74.05666666666666|
|shampoo/soap|448.89000000000004|
|         oil|             134.0|
+------------+------------------+
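To see what groupBy('item').avg('cost') is doing, the same per-item averages can be reproduced in plain Python (again just an illustration, not the PySpark API) by collecting each item's costs into a bucket and averaging each bucket:

```python
from collections import defaultdict
from statistics import mean

# the same grocery data used in the examples above
grocery_data = [{'food_id': 112, 'item': 'onions', 'cost': 234.89, 'quantity': 4},
                {'food_id': 113, 'item': 'potato', 'cost': 17.39, 'quantity': 1},
                {'food_id': 102, 'item': 'grains', 'cost': 4234.9, 'quantity': 84},
                {'food_id': 98, 'item': 'shampoo/soap', 'cost': 10.89, 'quantity': 2},
                {'food_id': 98, 'item': 'shampoo/soap', 'cost': 100.89, 'quantity': 20},
                {'food_id': 98, 'item': 'shampoo/soap', 'cost': 1234.89, 'quantity': 94},
                {'food_id': 113, 'item': 'potato', 'cost': 170.39, 'quantity': 10},
                {'food_id': 113, 'item': 'potato', 'cost': 34.39, 'quantity': 2},
                {'food_id': 102, 'item': 'grains', 'cost': 1000.9, 'quantity': 24},
                {'food_id': 56, 'item': 'oil', 'cost': 134.00, 'quantity': 10}]

# group: collect every cost under its item key
costs_by_item = defaultdict(list)
for row in grocery_data:
    costs_by_item[row['item']].append(row['cost'])

# aggregate: average each group's costs
avg_cost_by_item = {item: mean(costs) for item, costs in costs_by_item.items()}
for item, value in avg_cost_by_item.items():
    print(item, value)  # matches the avg(cost) column above
```

For example, potato appears three times (17.39, 170.39, 34.39), so its group average is 222.17 / 3 ≈ 74.057, matching the table.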