PySpark - sum() function

In this post, we will discuss the sum() function in PySpark. sum() is an aggregate function that returns the total of the values in one or more DataFrame columns. We can get the sum in three ways; let us look at each one.

Let us create a DataFrame for demonstration.

import pyspark
from pyspark.sql import SparkSession
# create the SparkSession with app name GKINDEX
app = SparkSession.builder.appName('GKINDEX').getOrCreate()
# grocery data: 10 rows covering 5 items, each with 4 attributes
grocery_data = [{'food_id':112,'item':'onions','cost':234.89,'quantity':4},
                {'food_id':113,'item':'potato','cost':17.39,'quantity':1},
                {'food_id':102,'item':'grains','cost':4234.9,'quantity':84},
                {'food_id':98,'item':'shampoo/soap','cost':10.89,'quantity':2},
                {'food_id':98,'item':'shampoo/soap','cost':100.89,'quantity':20},
                {'food_id':98,'item':'shampoo/soap','cost':1234.89,'quantity':94},
                {'food_id':113,'item':'potato','cost':170.39,'quantity':10},
                {'food_id':113,'item':'potato','cost':34.39,'quantity':2},
                {'food_id':102,'item':'grains','cost':1000.9,'quantity':24},
                {'food_id':56,'item':'oil','cost':134.00,'quantity':10}]
# create a DataFrame from the grocery_data
input_dataframe = app.createDataFrame(grocery_data)
# display the DataFrame
input_dataframe.show()

Output:

+-------+-------+------------+--------+
|   cost|food_id|        item|quantity|
+-------+-------+------------+--------+
| 234.89|    112|      onions|       4|
|  17.39|    113|      potato|       1|
| 4234.9|    102|      grains|      84|
|  10.89|     98|shampoo/soap|       2|
| 100.89|     98|shampoo/soap|      20|
|1234.89|     98|shampoo/soap|      94|
| 170.39|    113|      potato|      10|
|  34.39|    113|      potato|       2|
| 1000.9|    102|      grains|      24|
|  134.0|     56|         oil|      10|
+-------+-------+------------+--------+
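
Since the DataFrame is built from a list of dictionaries, Spark infers the column types. Before aggregating, it is worth confirming that cost and quantity came out as numeric types; a quick check:

# inspect the inferred schema (cost should be a double,
# food_id and quantity longs, item a string)
input_dataframe.printSchema()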

sum() : Using select() method

The select() method is used to compute the sum over DataFrame columns. It can take single or multiple columns at a time, with sum() expressions as its parameters. Note that sum must be imported from pyspark.sql.functions.

Syntax:

dataframe.select(sum('column1'), ..., sum('column n'))

where,

  1. dataframe is the input PySpark DataFrame
  2. column is the name of the column whose values are totaled

Example: In this example, we will use the sum() function on the cost and quantity columns.

import pyspark
from pyspark.sql import SparkSession
# import the sum function
from pyspark.sql.functions import sum
# create the SparkSession with app name GKINDEX
app = SparkSession.builder.appName('GKINDEX').getOrCreate()
# grocery data: 10 rows covering 5 items, each with 4 attributes
grocery_data = [{'food_id':112,'item':'onions','cost':234.89,'quantity':4},
                {'food_id':113,'item':'potato','cost':17.39,'quantity':1},
                {'food_id':102,'item':'grains','cost':4234.9,'quantity':84},
                {'food_id':98,'item':'shampoo/soap','cost':10.89,'quantity':2},
                {'food_id':98,'item':'shampoo/soap','cost':100.89,'quantity':20},
                {'food_id':98,'item':'shampoo/soap','cost':1234.89,'quantity':94},
                {'food_id':113,'item':'potato','cost':170.39,'quantity':10},
                {'food_id':113,'item':'potato','cost':34.39,'quantity':2},
                {'food_id':102,'item':'grains','cost':1000.9,'quantity':24},
                {'food_id':56,'item':'oil','cost':134.00,'quantity':10}]
# create a DataFrame from the grocery_data
input_dataframe = app.createDataFrame(grocery_data)
# get the sum of the cost and quantity columns
input_dataframe.select(sum('cost'), sum('quantity')).show()

Output:

+-----------------+-------------+
|        sum(cost)|sum(quantity)|
+-----------------+-------------+
|7173.530000000001|          251|
+-----------------+-------------+
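
The default output column names (sum(cost), sum(quantity)) and the long floating-point tail are often not what you want to display. A minimal sketch on the same input_dataframe that aliases the results and rounds the cost total (total_cost and total_quantity are illustrative names, not part of the original example):

from pyspark.sql.functions import sum, round
# alias the aggregates and round the floating-point cost total to 2 decimals
input_dataframe.select(round(sum('cost'), 2).alias('total_cost'),
                       sum('quantity').alias('total_quantity')).show()

If you need the total as a plain Python value rather than a DataFrame, you can collect it, e.g. input_dataframe.select(sum('cost')).collect()[0][0].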

sum() : Using agg() method

agg() stands for aggregation and can also be used to get the sum of DataFrame columns. It takes a dictionary as a parameter, in which each key is a column name in the DataFrame and the corresponding value is the aggregate function name, here 'sum'. We can specify multiple columns to apply the aggregate function to.

Syntax:

dataframe.agg({'column1': 'sum', ..., 'column n': 'sum'})

where,

  1. dataframe is the input PySpark DataFrame
  2. column is the name of the column whose values are totaled

Example: In this example, we will use the sum() function on the cost and quantity columns.

import pyspark
from pyspark.sql import SparkSession
# create the SparkSession with app name GKINDEX
app = SparkSession.builder.appName('GKINDEX').getOrCreate()
# grocery data: 10 rows covering 5 items, each with 4 attributes
grocery_data = [{'food_id':112,'item':'onions','cost':234.89,'quantity':4},
                {'food_id':113,'item':'potato','cost':17.39,'quantity':1},
                {'food_id':102,'item':'grains','cost':4234.9,'quantity':84},
                {'food_id':98,'item':'shampoo/soap','cost':10.89,'quantity':2},
                {'food_id':98,'item':'shampoo/soap','cost':100.89,'quantity':20},
                {'food_id':98,'item':'shampoo/soap','cost':1234.89,'quantity':94},
                {'food_id':113,'item':'potato','cost':170.39,'quantity':10},
                {'food_id':113,'item':'potato','cost':34.39,'quantity':2},
                {'food_id':102,'item':'grains','cost':1000.9,'quantity':24},
                {'food_id':56,'item':'oil','cost':134.00,'quantity':10}]
# create a DataFrame from the grocery_data
input_dataframe = app.createDataFrame(grocery_data)
# get the sum of the cost and quantity columns
input_dataframe.agg({'cost': 'sum', 'quantity': 'sum'}).show()

Output:

+-----------------+-------------+
|        sum(cost)|sum(quantity)|
+-----------------+-------------+
|7173.530000000001|          251|
+-----------------+-------------+
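
Note that the dictionary form applies at most one aggregate per column and gives no control over the output column names. If you need an alias, agg() also accepts Column expressions; a sketch using the same data (total_cost and total_quantity are illustrative names):

from pyspark.sql.functions import sum
# pass aliased sum() columns to agg() instead of a dictionary
input_dataframe.agg(sum('cost').alias('total_cost'),
                    sum('quantity').alias('total_quantity')).show()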

sum() : Using groupBy() with sum()

If we want the sum for each group of values, we must use the groupBy() function. It groups the rows that share the same value in a column and returns the sum for each group.

Syntax:

dataframe.groupBy('group_column').sum('column')

where,

  1. dataframe is the input PySpark DataFrame
  2. group_column is the column whose values are grouped
  3. column is the column whose sum is computed for each group

Example: Python program to get the sum of the cost column grouped by the item column.

import pyspark
from pyspark.sql import SparkSession
# create the SparkSession with app name GKINDEX
app = SparkSession.builder.appName('GKINDEX').getOrCreate()
# grocery data: 10 rows covering 5 items, each with 4 attributes
grocery_data = [{'food_id':112,'item':'onions','cost':234.89,'quantity':4},
                {'food_id':113,'item':'potato','cost':17.39,'quantity':1},
                {'food_id':102,'item':'grains','cost':4234.9,'quantity':84},
                {'food_id':98,'item':'shampoo/soap','cost':10.89,'quantity':2},
                {'food_id':98,'item':'shampoo/soap','cost':100.89,'quantity':20},
                {'food_id':98,'item':'shampoo/soap','cost':1234.89,'quantity':94},
                {'food_id':113,'item':'potato','cost':170.39,'quantity':10},
                {'food_id':113,'item':'potato','cost':34.39,'quantity':2},
                {'food_id':102,'item':'grains','cost':1000.9,'quantity':24},
                {'food_id':56,'item':'oil','cost':134.00,'quantity':10}]
# create a DataFrame from the grocery_data
input_dataframe = app.createDataFrame(grocery_data)
# get the sum of the cost column grouped by item
input_dataframe.groupBy('item').sum('cost').show()

Output:

+------------+------------------+
|        item|         sum(cost)|
+------------+------------------+
|      grains| 5235.799999999999|
|      onions|            234.89|
|      potato|222.16999999999996|
|shampoo/soap|           1346.67|
|         oil|             134.0|
+------------+------------------+
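
groupBy() can also total several columns in one pass, and combining it with agg() lets you rename the per-group sums. A minimal sketch on the same DataFrame (total_cost is an illustrative name):

from pyspark.sql.functions import sum
# sum both cost and quantity for each item group
input_dataframe.groupBy('item').sum('cost', 'quantity').show()
# or use agg() to control the output column name
input_dataframe.groupBy('item').agg(sum('cost').alias('total_cost')).show()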