Let's create the dataframe for demonstration.
Example:
from pyspark.sql import SparkSession

# create the app name GKINDEX
app = SparkSession.builder.appName('GKINDEX').getOrCreate()

# create grocery data: 10 rows covering 5 items, with 4 attributes each
grocery_data = [
    {'food_id': 112, 'item': 'onions', 'cost': 234.89, 'quantity': 4},
    {'food_id': 113, 'item': 'potato', 'cost': 17.39, 'quantity': 1},
    {'food_id': 102, 'item': 'grains', 'cost': 4234.9, 'quantity': 84},
    {'food_id': 98, 'item': 'shampoo/soap', 'cost': 10.89, 'quantity': 2},
    {'food_id': 98, 'item': 'shampoo/soap', 'cost': 100.89, 'quantity': 20},
    {'food_id': 98, 'item': 'shampoo/soap', 'cost': 1234.89, 'quantity': 94},
    {'food_id': 113, 'item': 'potato', 'cost': 170.39, 'quantity': 10},
    {'food_id': 113, 'item': 'potato', 'cost': 34.39, 'quantity': 2},
    {'food_id': 102, 'item': 'grains', 'cost': 1000.9, 'quantity': 24},
    {'food_id': 56, 'item': 'oil', 'cost': 134.00, 'quantity': 10},
]

# create a dataframe from the grocery_data
input_dataframe = app.createDataFrame(grocery_data)

# display the dataframe
input_dataframe.show()
Output:
+-------+-------+------------+--------+
|   cost|food_id|        item|quantity|
+-------+-------+------------+--------+
| 234.89|    112|      onions|       4|
|  17.39|    113|      potato|       1|
| 4234.9|    102|      grains|      84|
|  10.89|     98|shampoo/soap|       2|
| 100.89|     98|shampoo/soap|      20|
|1234.89|     98|shampoo/soap|      94|
| 170.39|    113|      potato|      10|
|  34.39|    113|      potato|       2|
| 1000.9|    102|      grains|      24|
|  134.0|     56|         oil|      10|
+-------+-------+------------+--------+
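The maximum of one or more columns can be returned with select() and the max() function from pyspark.sql.functions.
Syntax:
dataframe.select(max('column1'), ..., max('column n'))
where,
1. dataframe is the input PySpark DataFrame
2. column1 ... column n are the names of the columns whose maximum values are returned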
Example:
from pyspark.sql import SparkSession

# import the max function (note: this shadows Python's built-in max)
from pyspark.sql.functions import max

# create the app name GKINDEX
app = SparkSession.builder.appName('GKINDEX').getOrCreate()

# create grocery data: 10 rows covering 5 items, with 4 attributes each
grocery_data = [
    {'food_id': 112, 'item': 'onions', 'cost': 234.89, 'quantity': 4},
    {'food_id': 113, 'item': 'potato', 'cost': 17.39, 'quantity': 1},
    {'food_id': 102, 'item': 'grains', 'cost': 4234.9, 'quantity': 84},
    {'food_id': 98, 'item': 'shampoo/soap', 'cost': 10.89, 'quantity': 2},
    {'food_id': 98, 'item': 'shampoo/soap', 'cost': 100.89, 'quantity': 20},
    {'food_id': 98, 'item': 'shampoo/soap', 'cost': 1234.89, 'quantity': 94},
    {'food_id': 113, 'item': 'potato', 'cost': 170.39, 'quantity': 10},
    {'food_id': 113, 'item': 'potato', 'cost': 34.39, 'quantity': 2},
    {'food_id': 102, 'item': 'grains', 'cost': 1000.9, 'quantity': 24},
    {'food_id': 56, 'item': 'oil', 'cost': 134.00, 'quantity': 10},
]

# create a dataframe from the grocery_data
input_dataframe = app.createDataFrame(grocery_data)

# get the maximum of the cost and quantity columns
input_dataframe.select(max('cost'), max('quantity')).show()
Output:
+---------+-------------+
|max(cost)|max(quantity)|
+---------+-------------+
|   4234.9|           94|
+---------+-------------+
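The same result can be obtained with agg(), which takes a dictionary that maps each column name to the aggregate function name 'max'.
Syntax:
dataframe.agg({'column1': 'max', ..., 'column n': 'max'})
where,
1. dataframe is the input PySpark DataFrame
2. column1 ... column n are the names of the columns whose maximum values are returned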
Example:
from pyspark.sql import SparkSession

# create the app name GKINDEX
app = SparkSession.builder.appName('GKINDEX').getOrCreate()

# create grocery data: 10 rows covering 5 items, with 4 attributes each
grocery_data = [
    {'food_id': 112, 'item': 'onions', 'cost': 234.89, 'quantity': 4},
    {'food_id': 113, 'item': 'potato', 'cost': 17.39, 'quantity': 1},
    {'food_id': 102, 'item': 'grains', 'cost': 4234.9, 'quantity': 84},
    {'food_id': 98, 'item': 'shampoo/soap', 'cost': 10.89, 'quantity': 2},
    {'food_id': 98, 'item': 'shampoo/soap', 'cost': 100.89, 'quantity': 20},
    {'food_id': 98, 'item': 'shampoo/soap', 'cost': 1234.89, 'quantity': 94},
    {'food_id': 113, 'item': 'potato', 'cost': 170.39, 'quantity': 10},
    {'food_id': 113, 'item': 'potato', 'cost': 34.39, 'quantity': 2},
    {'food_id': 102, 'item': 'grains', 'cost': 1000.9, 'quantity': 24},
    {'food_id': 56, 'item': 'oil', 'cost': 134.00, 'quantity': 10},
]

# create a dataframe from the grocery_data
input_dataframe = app.createDataFrame(grocery_data)

# get the maximum of the cost and quantity columns
input_dataframe.agg({'cost': 'max', 'quantity': 'max'}).show()
Output:
+---------+-------------+
|max(cost)|max(quantity)|
+---------+-------------+
|   4234.9|           94|
+---------+-------------+
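To get the maximum within each group, group the rows with groupBy() and then call max() on the grouped data.
Syntax:
dataframe.groupBy('group_column').max('column')
where,
1. dataframe is the input PySpark DataFrame
2. group_column is the column whose values define the groups
3. column is the name of the column whose maximum value is returned for each group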
Example:
from pyspark.sql import SparkSession

# create the app name GKINDEX
app = SparkSession.builder.appName('GKINDEX').getOrCreate()

# create grocery data: 10 rows covering 5 items, with 4 attributes each
grocery_data = [
    {'food_id': 112, 'item': 'onions', 'cost': 234.89, 'quantity': 4},
    {'food_id': 113, 'item': 'potato', 'cost': 17.39, 'quantity': 1},
    {'food_id': 102, 'item': 'grains', 'cost': 4234.9, 'quantity': 84},
    {'food_id': 98, 'item': 'shampoo/soap', 'cost': 10.89, 'quantity': 2},
    {'food_id': 98, 'item': 'shampoo/soap', 'cost': 100.89, 'quantity': 20},
    {'food_id': 98, 'item': 'shampoo/soap', 'cost': 1234.89, 'quantity': 94},
    {'food_id': 113, 'item': 'potato', 'cost': 170.39, 'quantity': 10},
    {'food_id': 113, 'item': 'potato', 'cost': 34.39, 'quantity': 2},
    {'food_id': 102, 'item': 'grains', 'cost': 1000.9, 'quantity': 24},
    {'food_id': 56, 'item': 'oil', 'cost': 134.00, 'quantity': 10},
]

# create a dataframe from the grocery_data
input_dataframe = app.createDataFrame(grocery_data)

# get the maximum of the cost column grouped by item
input_dataframe.groupBy('item').max('cost').show()
Output:
+------------+---------+
|        item|max(cost)|
+------------+---------+
|      grains|   4234.9|
|      onions|   234.89|
|      potato|   170.39|
|shampoo/soap|  1234.89|
|         oil|    134.0|
+------------+---------+