PySpark - min() function

In this post, we will discuss the min() function in PySpark.
min() is an aggregate function that returns the minimum value of one or more dataframe columns. We can get the minimum value in three ways. Let's look at them one by one.
Let's create the dataframe for demonstration.
from pyspark.sql import SparkSession

# create a SparkSession with the app name GKINDEX
app = SparkSession.builder.appName('GKINDEX').getOrCreate()

# create grocery data with 10 rows and 4 attributes
grocery_data =[{'food_id':112,'item':'onions','cost':234.89,'quantity':4},
               {'food_id':113,'item':'potato','cost':17.39,'quantity':1},
               {'food_id':102,'item':'grains','cost':4234.9,'quantity':84},
               {'food_id':98,'item':'shampoo/soap','cost':10.89,'quantity':2},
               {'food_id':98,'item':'shampoo/soap','cost':100.89,'quantity':20},
               {'food_id':98,'item':'shampoo/soap','cost':1234.89,'quantity':94},
               {'food_id':113,'item':'potato','cost':170.39,'quantity':10},
               {'food_id':113,'item':'potato','cost':34.39,'quantity':2},
               {'food_id':102,'item':'grains','cost':1000.9,'quantity':24},
               {'food_id':56,'item':'oil','cost':134.00,'quantity':10}]

# create a dataframe from grocery_data
input_dataframe = app.createDataFrame(grocery_data)

# display the dataframe
input_dataframe.show()
Output:
+-------+-------+------------+--------+
|   cost|food_id|        item|quantity|
+-------+-------+------------+--------+
| 234.89|    112|      onions|       4|
|  17.39|    113|      potato|       1|
| 4234.9|    102|      grains|      84|
|  10.89|     98|shampoo/soap|       2|
| 100.89|     98|shampoo/soap|      20|
|1234.89|     98|shampoo/soap|      94|
| 170.39|    113|      potato|      10|
|  34.39|    113|      potato|       2|
| 1000.9|    102|      grains|      24|
|  134.0|     56|         oil|      10|
+-------+-------+------------+--------+

Method - 1 : Using select() method

The select() method can be used to return the minimum value of dataframe columns. It takes one or more min() calls as parameters, one per column.
To use it, we have to import the min function from pyspark.sql.functions.
Syntax:
dataframe.select(min('column1'),............,min('column n'))
where,
1. dataframe is the input PySpark DataFrame
2. column is the name of the column whose minimum value is returned
Example:
In this example, we will use the min function on the cost and quantity columns.
from pyspark.sql import SparkSession

# import the min function
# (note: this shadows Python's built-in min() in this script)
from pyspark.sql.functions import min

# create a SparkSession with the app name GKINDEX
app = SparkSession.builder.appName('GKINDEX').getOrCreate()

# create grocery data with 10 rows and 4 attributes
grocery_data =[{'food_id':112,'item':'onions','cost':234.89,'quantity':4},
               {'food_id':113,'item':'potato','cost':17.39,'quantity':1},
               {'food_id':102,'item':'grains','cost':4234.9,'quantity':84},
               {'food_id':98,'item':'shampoo/soap','cost':10.89,'quantity':2},
               {'food_id':98,'item':'shampoo/soap','cost':100.89,'quantity':20},
               {'food_id':98,'item':'shampoo/soap','cost':1234.89,'quantity':94},
               {'food_id':113,'item':'potato','cost':170.39,'quantity':10},
               {'food_id':113,'item':'potato','cost':34.39,'quantity':2},
               {'food_id':102,'item':'grains','cost':1000.9,'quantity':24},
               {'food_id':56,'item':'oil','cost':134.00,'quantity':10}]

# create a dataframe from grocery_data
input_dataframe = app.createDataFrame(grocery_data)

# get the minimum of the cost and quantity columns
input_dataframe.select(min('cost'), min('quantity')).show()
Output:
+---------+-------------+
|min(cost)|min(quantity)|
+---------+-------------+
|    10.89|            1|
+---------+-------------+
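Note that each min() is computed independently per column, so the two values in the result row need not come from the same input row. As a rough cross-check, the sketch below recomputes both minimums in plain Python, without Spark. The names rows, min_cost, min_quantity, cheapest and smallest are illustrative and are not part of the PySpark API.

```python
import builtins

# Plain-Python cross-check (no Spark needed) on the same grocery data.
rows = [
    {'food_id': 112, 'item': 'onions', 'cost': 234.89, 'quantity': 4},
    {'food_id': 113, 'item': 'potato', 'cost': 17.39, 'quantity': 1},
    {'food_id': 102, 'item': 'grains', 'cost': 4234.9, 'quantity': 84},
    {'food_id': 98, 'item': 'shampoo/soap', 'cost': 10.89, 'quantity': 2},
    {'food_id': 98, 'item': 'shampoo/soap', 'cost': 100.89, 'quantity': 20},
    {'food_id': 98, 'item': 'shampoo/soap', 'cost': 1234.89, 'quantity': 94},
    {'food_id': 113, 'item': 'potato', 'cost': 170.39, 'quantity': 10},
    {'food_id': 113, 'item': 'potato', 'cost': 34.39, 'quantity': 2},
    {'food_id': 102, 'item': 'grains', 'cost': 1000.9, 'quantity': 24},
    {'food_id': 56, 'item': 'oil', 'cost': 134.00, 'quantity': 10},
]

# builtins.min is used explicitly because importing min from
# pyspark.sql.functions would shadow the built-in name in the same script.
min_cost = builtins.min(row['cost'] for row in rows)
min_quantity = builtins.min(row['quantity'] for row in rows)

# Find which items hold each minimum: they belong to different rows.
cheapest = [row['item'] for row in rows if row['cost'] == min_cost]
smallest = [row['item'] for row in rows if row['quantity'] == min_quantity]

print(min_cost, min_quantity)
print(cheapest, smallest)
```

Here the minimum cost belongs to a shampoo/soap row while the minimum quantity belongs to a potato row, so the single result row from select() is a summary rather than an actual record.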

Method - 2 : Using agg() method

agg() stands for aggregation and can be used to return the minimum value of dataframe columns. It takes a dictionary as a parameter, in which each key is a column name in the dataframe and each value is the name of the aggregate function, here 'min'. We can specify multiple columns to apply the aggregate function to.

Syntax:
dataframe.agg({'column1': 'min',......,'column n':'min'})
where,
1. dataframe is the input PySpark DataFrame
2. column is the name of the column whose minimum value is returned
Example:
In this example, we will get the minimum of the cost and quantity columns using agg().
from pyspark.sql import SparkSession

# create a SparkSession with the app name GKINDEX
app = SparkSession.builder.appName('GKINDEX').getOrCreate()

# create grocery data with 10 rows and 4 attributes
grocery_data =[{'food_id':112,'item':'onions','cost':234.89,'quantity':4},
               {'food_id':113,'item':'potato','cost':17.39,'quantity':1},
               {'food_id':102,'item':'grains','cost':4234.9,'quantity':84},
               {'food_id':98,'item':'shampoo/soap','cost':10.89,'quantity':2},
               {'food_id':98,'item':'shampoo/soap','cost':100.89,'quantity':20},
               {'food_id':98,'item':'shampoo/soap','cost':1234.89,'quantity':94},
               {'food_id':113,'item':'potato','cost':170.39,'quantity':10},
               {'food_id':113,'item':'potato','cost':34.39,'quantity':2},
               {'food_id':102,'item':'grains','cost':1000.9,'quantity':24},
               {'food_id':56,'item':'oil','cost':134.00,'quantity':10}]

# create a dataframe from grocery_data
input_dataframe = app.createDataFrame(grocery_data)

# get the minimum of the cost and quantity columns
input_dataframe.agg({'cost': 'min', 'quantity': 'min'}).show()
Output:
+---------+-------------+
|min(cost)|min(quantity)|
+---------+-------------+
|    10.89|            1|
+---------+-------------+
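One limitation of the dictionary form follows from Python itself: a dict keeps only one value per key, so it cannot request two different aggregates on the same column. The small sketch below shows this without Spark. The name spec is illustrative, and 'max' is used only as a second aggregate name for contrast.

```python
# A dict literal with a repeated key keeps only the last value,
# so the 'min' request for 'cost' is silently dropped.
spec = {'cost': 'min', 'cost': 'max', 'quantity': 'min'}

# Only one entry for 'cost' survives.
print(spec)
```

When several aggregates on one column are needed, agg() also accepts column expressions such as those from pyspark.sql.functions, which avoids this restriction.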

Method - 3 : Using groupBy() with min()

If we want the minimum value within each group, we have to use the groupBy() function.
It groups rows that share the same value in a column and returns the minimum value for each group.
Syntax:
dataframe.groupBy('group_column').min('column')
where,
1. dataframe is the input dataframe
2. group_column is the column whose values define the groups
3. column is the name of the column whose minimum value is returned for each group
Example:
In this example, we will get the minimum cost for each item by grouping on the item column.
from pyspark.sql import SparkSession

# create a SparkSession with the app name GKINDEX
app = SparkSession.builder.appName('GKINDEX').getOrCreate()

# create grocery data with 10 rows and 4 attributes
grocery_data =[{'food_id':112,'item':'onions','cost':234.89,'quantity':4},
               {'food_id':113,'item':'potato','cost':17.39,'quantity':1},
               {'food_id':102,'item':'grains','cost':4234.9,'quantity':84},
               {'food_id':98,'item':'shampoo/soap','cost':10.89,'quantity':2},
               {'food_id':98,'item':'shampoo/soap','cost':100.89,'quantity':20},
               {'food_id':98,'item':'shampoo/soap','cost':1234.89,'quantity':94},
               {'food_id':113,'item':'potato','cost':170.39,'quantity':10},
               {'food_id':113,'item':'potato','cost':34.39,'quantity':2},
               {'food_id':102,'item':'grains','cost':1000.9,'quantity':24},
               {'food_id':56,'item':'oil','cost':134.00,'quantity':10}]

# create a dataframe from grocery_data
input_dataframe = app.createDataFrame(grocery_data)

# get the minimum of the cost column grouped by item
input_dataframe.groupBy('item').min('cost').show()
Output:
+------------+---------+
|        item|min(cost)|
+------------+---------+
|      grains|   1000.9|
|      onions|   234.89|
|      potato|    17.39|
|shampoo/soap|    10.89|
|         oil|    134.0|
+------------+---------+
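To make the grouping step concrete, here is a minimal plain-Python sketch of what groupBy('item').min('cost') computes for the same data. It does not use Spark, and the names rows and min_cost_by_item are illustrative only.

```python
# (item, cost) pairs taken from the grocery data used above.
rows = [
    ('onions', 234.89),
    ('potato', 17.39),
    ('grains', 4234.9),
    ('shampoo/soap', 10.89),
    ('shampoo/soap', 100.89),
    ('shampoo/soap', 1234.89),
    ('potato', 170.39),
    ('potato', 34.39),
    ('grains', 1000.9),
    ('oil', 134.00),
]

# Keep the smallest cost seen so far for each item.
min_cost_by_item = {}
for item, cost in rows:
    if item not in min_cost_by_item or cost < min_cost_by_item[item]:
        min_cost_by_item[item] = cost

# Print in sorted order so the result is stable across runs.
for item, cost in sorted(min_cost_by_item.items()):
    print(item, cost)
```

Spark does not guarantee the order of grouped rows, so the order in the output table above may differ between runs. If a stable order is needed in PySpark, an orderBy('item') can be added before show().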