In this article, we will discuss the groupBy() function in PySpark using Python.
Let's create the DataFrame for the demonstration:
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of student data
data = [["1", "sravan", "IT", 45000],
        ["2", "ojaswi", "CS", 85000],
        ["3", "rohith", "CS", 41000],
        ["4", "sridevi", "IT", 56000],
        ["5", "bobby", "ECE", 45000],
        ["6", "gayatri", "ECE", 49000],
        ["7", "gnanesh", "CS", 45000],
        ["8", "bhanu", "Mech", 21000]]

# specify column names
columns = ['ID', 'NAME', 'DEPT', 'FEE']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# display
dataframe.show()
Output:
In PySpark, groupBy() is used to collect identical data into groups on the PySpark DataFrame and perform aggregate functions on the grouped data.
The available aggregate operations include:
- count(): returns the row count for each group.
dataframe.groupBy('column_name_group').count()
- mean(): returns the mean of the values for each group.
dataframe.groupBy('column_name_group').mean('column_name')
- max(): returns the maximum of the values for each group.
dataframe.groupBy('column_name_group').max('column_name')
- min(): returns the minimum of the values for each group.
dataframe.groupBy('column_name_group').min('column_name')
- sum(): returns the sum of the values for each group.
dataframe.groupBy('column_name_group').sum('column_name')
- avg(): returns the average of the values for each group.
dataframe.groupBy('column_name_group').avg('column_name')
We have to chain one of these aggregate functions onto groupBy() when using the method.
Syntax: dataframe.groupBy('column_name_group').aggregate_operation('column_name')
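Note that groupBy() on its own returns a pyspark.sql.GroupedData object rather than a DataFrame; no computation happens until one of the aggregate functions above is applied to it. A minimal sketch, reusing the dataframe created earlier:
Python3
# groupBy() alone returns a GroupedData object; nothing is computed yet
grouped = dataframe.groupBy('DEPT')
print(type(grouped))  # <class 'pyspark.sql.group.GroupedData'>

# count() takes no column argument; the other aggregates take the name
# of the numeric column to aggregate
grouped.count().show()
grouped.sum('FEE').show()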
Example 1: groupBy() with sum()
Group by DEPT and aggregate FEE with sum().
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of student data
data = [["1", "sravan", "IT", 45000],
        ["2", "ojaswi", "CS", 85000],
        ["3", "rohith", "CS", 41000],
        ["4", "sridevi", "IT", 56000],
        ["5", "bobby", "ECE", 45000],
        ["6", "gayatri", "ECE", 49000],
        ["7", "gnanesh", "CS", 45000],
        ["8", "bhanu", "Mech", 21000]]

# specify column names
columns = ['ID', 'NAME', 'DEPT', 'FEE']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# Groupby with DEPT along FEE with sum()
dataframe.groupBy('DEPT').sum('FEE').show()
Output:
Example 2: groupBy() with min()
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of student data
data = [["1", "sravan", "IT", 45000],
        ["2", "ojaswi", "CS", 85000],
        ["3", "rohith", "CS", 41000],
        ["4", "sridevi", "IT", 56000],
        ["5", "bobby", "ECE", 45000],
        ["6", "gayatri", "ECE", 49000],
        ["7", "gnanesh", "CS", 45000],
        ["8", "bhanu", "Mech", 21000]]

# specify column names
columns = ['ID', 'NAME', 'DEPT', 'FEE']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# Groupby with DEPT along FEE with min()
dataframe.groupBy('DEPT').min('FEE').show()
Output:
Example 3: groupBy() with max()
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of student data
data = [["1", "sravan", "IT", 45000],
        ["2", "ojaswi", "CS", 85000],
        ["3", "rohith", "CS", 41000],
        ["4", "sridevi", "IT", 56000],
        ["5", "bobby", "ECE", 45000],
        ["6", "gayatri", "ECE", 49000],
        ["7", "gnanesh", "CS", 45000],
        ["8", "bhanu", "Mech", 21000]]

# specify column names
columns = ['ID', 'NAME', 'DEPT', 'FEE']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# Groupby with DEPT along FEE with max()
dataframe.groupBy('DEPT').max('FEE').show()
Output:
Example 4: groupBy() with avg()
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of student data
data = [["1", "sravan", "IT", 45000],
        ["2", "ojaswi", "CS", 85000],
        ["3", "rohith", "CS", 41000],
        ["4", "sridevi", "IT", 56000],
        ["5", "bobby", "ECE", 45000],
        ["6", "gayatri", "ECE", 49000],
        ["7", "gnanesh", "CS", 45000],
        ["8", "bhanu", "Mech", 21000]]

# specify column names
columns = ['ID', 'NAME', 'DEPT', 'FEE']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# Groupby with DEPT along FEE with avg()
dataframe.groupBy('DEPT').avg('FEE').show()
Output:
Example 5: groupBy() with count()
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of student data
data = [["1", "sravan", "IT", 45000],
        ["2", "ojaswi", "CS", 85000],
        ["3", "rohith", "CS", 41000],
        ["4", "sridevi", "IT", 56000],
        ["5", "bobby", "ECE", 45000],
        ["6", "gayatri", "ECE", 49000],
        ["7", "gnanesh", "CS", 45000],
        ["8", "bhanu", "Mech", 21000]]

# specify column names
columns = ['ID', 'NAME', 'DEPT', 'FEE']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# Groupby with DEPT with count()
dataframe.groupBy('DEPT').count().show()
Output:
Example 6: groupBy() with mean()
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of student data
data = [["1", "sravan", "IT", 45000],
        ["2", "ojaswi", "CS", 85000],
        ["3", "rohith", "CS", 41000],
        ["4", "sridevi", "IT", 56000],
        ["5", "bobby", "ECE", 45000],
        ["6", "gayatri", "ECE", 49000],
        ["7", "gnanesh", "CS", 45000],
        ["8", "bhanu", "Mech", 21000]]

# specify column names
columns = ['ID', 'NAME', 'DEPT', 'FEE']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# Groupby with DEPT with mean()
dataframe.groupBy('DEPT').mean('FEE').show()
Output:
Applying groupBy() to multiple columns
Here we are going to use groupBy() on multiple columns.
Syntax: dataframe.groupBy('column_name_group1', 'column_name_group2', …, 'column_name_group n').aggregate_operation('column_name')
Example 1: groupBy() with mean() on DEPT and NAME
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of student data
data = [["1", "sravan", "IT", 45000],
        ["2", "ojaswi", "CS", 85000],
        ["3", "rohith", "CS", 41000],
        ["4", "sridevi", "IT", 56000],
        ["5", "bobby", "ECE", 45000],
        ["6", "gayatri", "ECE", 49000],
        ["7", "gnanesh", "CS", 45000],
        ["8", "bhanu", "Mech", 21000]]

# specify column names
columns = ['ID', 'NAME', 'DEPT', 'FEE']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# Groupby with DEPT and NAME with mean()
dataframe.groupBy('DEPT', 'NAME').mean('FEE').show()
Output:
We can also group by and aggregate on multiple columns at a time using the following syntax:
dataframe.groupBy("group_column").agg(max("column_name"), sum("column_name"), min("column_name"), mean("column_name"), count("column_name")).show()
We have to import these aggregate functions from the pyspark.sql.functions module.
Example:
Python3
# importing module
import pyspark

# import the sum, max, min, avg, count and mean functions
from pyspark.sql.functions import sum, max, min, avg, count, mean

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of student data
data = [["1", "sravan", "IT", 45000],
        ["2", "ojaswi", "CS", 85000],
        ["3", "rohith", "CS", 41000],
        ["4", "sridevi", "IT", 56000],
        ["5", "bobby", "ECE", 45000],
        ["6", "gayatri", "ECE", 49000],
        ["7", "gnanesh", "CS", 45000],
        ["8", "bhanu", "Mech", 21000]]

# specify column names
columns = ['ID', 'NAME', 'DEPT', 'FEE']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# Groupby with DEPT and aggregate with max(), sum(), min(), mean() and count()
dataframe.groupBy("DEPT").agg(max("FEE"), sum("FEE"),
                              min("FEE"), mean("FEE"),
                              count("FEE")).show()
Output:
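The columns produced by agg() get default names such as max(FEE) and sum(FEE). A minimal sketch (not part of the original example; the alias names here are arbitrary) showing how they can be renamed with alias(), plus the dictionary shorthand that agg() also accepts:
Python3
# import the aggregate functions
from pyspark.sql.functions import sum, max, avg, count

# rename the aggregate output columns with alias()
dataframe.groupBy("DEPT").agg(
    max("FEE").alias("max_fee"),        # alias names are illustrative
    sum("FEE").alias("total_fee"),
    avg("FEE").alias("avg_fee"),
    count("FEE").alias("num_students")
).show()

# agg() also accepts a dict mapping a column name to an aggregate name
dataframe.groupBy("DEPT").agg({"FEE": "sum"}).show()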