MapReduce Program: Weather Data Analysis for Finding Hot and Cold Days

Here, we will write a MapReduce program that analyzes weather datasets, as a way to understand the MapReduce data-processing programming model. Weather sensors around the world collect weather information as a large volume of log data. This weather data is semi-structured and record-oriented.
The data is stored in a line-oriented ASCII format, where each line represents a single record. Each line has many fields, such as longitude, latitude, daily maximum and minimum temperature, and daily average temperature. For simplicity, we will focus on the main element, the temperature. We will use data from the National Centers for Environmental Information (NCEI), which holds a large amount of historical weather data that we can use for our analysis.
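
To make the programming model concrete before involving Hadoop, here is a minimal plain-Java sketch of the three phases (map, shuffle, reduce) running in memory. It is only an illustration under stated assumptions: the class MiniMapReduce, the Reading helper type, the shortened key strings and the sample readings are all hypothetical, and the thresholds (30 and 15) are the ones used by the mapper later in this article.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MiniMapReduce {

    // One parsed weather record (hypothetical helper type).
    static class Reading {
        final String date;
        final float tMax;
        final float tMin;

        Reading(String date, float tMax, float tMin) {
            this.date = date;
            this.tMax = tMax;
            this.tMin = tMin;
        }
    }

    // Map: one record in, zero or more (key, value) pairs out.
    static List<String[]> map(Reading r) {
        List<String[]> pairs = new ArrayList<>();
        if (r.tMax > 30.0f) {
            pairs.add(new String[] {"Hot day " + r.date, String.valueOf(r.tMax)});
        }
        if (r.tMin < 15.0f) {
            pairs.add(new String[] {"Cold day " + r.date, String.valueOf(r.tMin)});
        }
        return pairs;
    }

    // Shuffle: group the values by key; the TreeMap keeps keys sorted.
    static Map<String, List<String>> shuffle(List<String[]> pairs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] pair : pairs) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return grouped;
    }

    // Reduce: one key plus all its values in, one output line out.
    static String reduce(String key, List<String> values) {
        return key + "\t" + values.get(0);
    }

    // Runs the three phases in sequence over an in-memory input.
    static List<String> run(List<Reading> input) {
        List<String[]> pairs = new ArrayList<>();
        for (Reading r : input) {
            pairs.addAll(map(r));
        }
        List<String> output = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : shuffle(pairs).entrySet()) {
            output.add(reduce(e.getKey(), e.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        // Hypothetical readings, not taken from the real dataset.
        List<Reading> input = Arrays.asList(
                new Reading("20200101", -18.8f, -28.9f),
                new Reading("20200515", 20.0f, 16.0f),
                new Reading("20200701", 31.2f, 16.0f));
        for (String line : run(input)) {
            System.out.println(line);
        }
    }
}
```

In a real job, Hadoop performs the shuffle itself and distributes the map and reduce calls across the cluster; only the map and reduce functions are written by hand.
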
Problem statement:

Analyze the weather data of Fairbanks, Alaska, to find hot and cold days using Hadoop MapReduce.

Step 1:

You can download the dataset from the NCEI website, which provides data for various cities in different years. Pick a year and select any one of the data text files to analyze. In my case, I selected the dataset CRND0103-2020-AK_Fairbanks_11_NE.txt to analyze hot and cold days in Fairbanks, Alaska.
Information about the fields in the data is given in the README.txt file available on the NCEI website.

Step 2:

Below is a sample of our dataset, where column 6 and column 7 show the maximum and minimum temperature, respectively.

[Figure: minimum and maximum temperature fields in the dataset]
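
Because each record is a fixed-width line, the fields can be cut out by character position. The sketch below uses the same offsets as the mapper in Step 3 (6 to 14 for the date, 39 to 45 for the maximum, 47 to 53 for the minimum). The class FixedWidthRecord and the synthetic line are hypothetical: the line is blank everywhere except those three fields and its values are made up for illustration.

```java
import java.util.Arrays;

class FixedWidthRecord {

    // Character offsets taken from the mapper in this article.
    static final int DATE_START = 6, DATE_END = 14;
    static final int MAX_START = 39, MAX_END = 45;
    static final int MIN_START = 47, MIN_END = 53;

    // Cuts the field in [start, end) out of the line and trims the padding.
    static String field(String line, int start, int end) {
        return line.substring(start, end).trim();
    }

    // Builds a synthetic 60-character line containing only the three fields.
    static String syntheticLine(String date, String tMax, String tMin) {
        char[] chars = new char[60];
        Arrays.fill(chars, ' ');
        place(chars, date, DATE_END);
        place(chars, tMax, MAX_END);
        place(chars, tMin, MIN_END);
        return new String(chars);
    }

    // Right-aligns the value so that it ends just before index 'end'.
    static void place(char[] target, String value, int end) {
        value.getChars(0, value.length(), target, end - value.length());
    }

    public static void main(String[] args) {
        // Made-up values, not a real record from the dataset.
        String line = syntheticLine("20200101", "-18.8", "-28.9");
        String date = field(line, DATE_START, DATE_END);
        float tMax = Float.parseFloat(field(line, MAX_START, MAX_END));
        float tMin = Float.parseFloat(field(line, MIN_START, MIN_END));
        System.out.println(date + " max=" + tMax + " min=" + tMin);
    }
}
```

If the file layout in your README.txt differs, only the six offset constants need to change.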

Step 3:

Create a project in Eclipse with the following steps:

First open Eclipse -> then select File -> New -> Java Project -> name it MyProject -> then select "Use an execution environment" -> choose JavaSE-1.8, then Next -> Finish.

[Figure: creating the Java project]

In this project, create a Java class named MyMaxMin -> then click Finish.

[Figure: creating the Java class]

Copy the source code below into the MyMaxMin Java class.

JAVA

// importing Libraries
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.conf.Configuration;
 
public class MyMaxMin {
 
     
    // Mapper
     
    /*MaxTemperatureMapper class is static
     * and extends Mapper abstract class
     * having four Hadoop generics type
     * LongWritable, Text, Text, Text.
    */
     
     
    public static class MaxTemperatureMapper extends
            Mapper<LongWritable, Text, Text, Text> {
         
        // Sentinel that marks a missing or
        // inconsistent reading in the dataset
        public static final int MISSING = 9999;

        /**
        * Reads one record (line) as Text. The date is
        * taken from characters 6-14, temp_max from
        * characters 39-45 and temp_min from characters
        * 47-53. Records with temp_max > 30 or
        * temp_min < 15 are passed to the reducer.
        */
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {

            // Convert the single row (record)
            // to a String named line
            String line = value.toString();
             
            // Check for the empty line
            if (!(line.length() == 0)) {
                 
                // from character 6 to 14 we have
                // the date in our dataset
                String date = line.substring(6, 14);
 
                // similarly we have taken the maximum
                // temperature from 39 to 45 characters
                float temp_Max = Float.parseFloat(line.substring(39, 45).trim());
                 
                // similarly we have taken the minimum
                // temperature from 47 to 53 characters
                 
                float temp_Min = Float.parseFloat(line.substring(47, 53).trim());
 
                // if maximum temperature is
                // greater than 30, it is a hot day
                if (temp_Max > 30.0) {
                     
                    // Hot day
                    context.write(new Text("The Day is Hot Day :" + date),
                                         new Text(String.valueOf(temp_Max)));
                }
 
                // if the minimum temperature is
                // less than 15, it is a cold day
                if (temp_Min < 15) {
                     
                    // Cold day
                    context.write(new Text("The Day is Cold Day :" + date),
                            new Text(String.valueOf(temp_Min)));
                }
            }
        }
 
    }
 
// Reducer
     
    /*MaxTemperatureReducer class is static
      and extends Reducer abstract class
      having four Hadoop generics type
      Text, Text, Text, Text.
    */
     
    public static class MaxTemperatureReducer extends
            Reducer<Text, Text, Text, Text> {
 
        /**
        * Receives a key together with all the
        * values the mappers emitted for it and
        * writes the first value for that key
        * to the final output.
        */
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {

            // The parameter must be Iterable (not Iterator);
            // otherwise this method does not override
            // Reducer.reduce and is never called
            Iterator<Text> it = values.iterator();
            if (it.hasNext()) {
                String temperature = it.next().toString();
                context.write(key, new Text(temperature));
            }
        }
 
    }
 
 
 
    /**
    * @method main
    * This method is used for setting
    * all the configuration properties.
    * It acts as a driver for map-reduce
    * code.
    */
     
    public static void main(String[] args) throws Exception {
 
        // reads the default configuration of the
        // cluster from the configuration XML files
        Configuration conf = new Configuration();
         
        // Initializing the job with the
        // default configuration of the cluster    
        Job job = Job.getInstance(conf, "weather example");
         
        // Assigning the driver class name
        job.setJarByClass(MyMaxMin.class);
 
        // Key type coming out of mapper
        job.setMapOutputKeyClass(Text.class);
         
        // value type coming out of mapper
        job.setMapOutputValueClass(Text.class);
 
        // Defining the mapper class name
        job.setMapperClass(MaxTemperatureMapper.class);
         
        // Defining the reducer class name
        job.setReducerClass(MaxTemperatureReducer.class);
 
        // Defining input Format class which is
        // responsible to parse the dataset
        // into a key value pair
        job.setInputFormatClass(TextInputFormat.class);
         
        // Defining output Format class which is
        // responsible for writing the key-value
        // pairs to the output files
        job.setOutputFormatClass(TextOutputFormat.class);
 
        // setting the second argument
        // as a path in a path variable
        Path OutputPath = new Path(args[1]);
 
        // Configuring the input path
        // from the filesystem into the job
        FileInputFormat.addInputPath(job, new Path(args[0]));
 
        // Configuring the output path from
        // the filesystem into the job
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
 
        // deleting the output path (recursively)
        // from HDFS so that we don't have
        // to delete it by hand before each run
        OutputPath.getFileSystem(conf).delete(OutputPath, true);
 
        // wait for the job to finish; exit with
        // code 0 on success and 1 on failure
        System.exit(job.waitForCompletion(true) ? 0 : 1);
 
    }
}
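
The mapper above declares a MISSING constant but never checks it, so a record whose temperature field holds the sentinel would be classified like any other reading. The following is a minimal sketch of a guard, not part of the original program. It assumes that a missing reading is encoded as a value with magnitude 9999; the sign used in the file is not stated in this article, so the check compares the absolute value. Confirm the actual encoding in README.txt before relying on it. The class name MissingValueFilter is hypothetical.

```java
class MissingValueFilter {

    // Magnitude of the sentinel, as declared in the mapper.
    static final float MISSING = 9999.0f;

    // Assumption: any reading at or beyond the sentinel magnitude is missing.
    static boolean isMissing(float reading) {
        return Math.abs(reading) >= MISSING;
    }

    // Hot day: a valid maximum temperature above 30.
    static boolean isHot(float tMax) {
        return !isMissing(tMax) && tMax > 30.0f;
    }

    // Cold day: a valid minimum temperature below 15.
    static boolean isCold(float tMin) {
        return !isMissing(tMin) && tMin < 15.0f;
    }

    public static void main(String[] args) {
        // Made-up readings for illustration.
        System.out.println(isCold(-28.9f));   // valid reading below 15
        System.out.println(isCold(-9999.0f)); // sentinel, rejected by the guard
        System.out.println(isHot(31.2f));     // valid reading above 30
    }
}
```

In the mapper, the same two checks would wrap the existing context.write calls.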

Now we need to add external jars for the packages we have imported. Download the Hadoop Common and Hadoop MapReduce Core jar packages that match your Hadoop version.
You can check your Hadoop version with:

hadoop version

[Figure: checking the Hadoop version]

Now we add these external jars to MyProject. Right-click MyProject -> select Build Path -> click Configure Build Path, select Add External JARs…, add the jars from your download location, then click -> Apply and Close.

[Figure: adding the external jar files to the project]

Now export the project as a jar file. Right-click MyProject, choose Export…, go to Java -> JAR file, click -> Next, choose your export destination, then click -> Next.
Set the Main class to MyMaxMin by clicking -> Browse, then click -> Finish -> OK.

[Figure: exporting MyProject as a jar file]

[Figure: selecting the main class]

Step 4:

Start the Hadoop daemons:

start-dfs.sh

start-yarn.sh

Step 5:

Move your dataset to Hadoop HDFS.
Syntax:

hdfs dfs -put /file_path /destination

In the command below, / denotes the root directory of our HDFS.

hdfs dfs -put /home/dikshant/Downloads/CRND0103-2020-AK_Fairbanks_11_NE.txt /

Check that the file has been copied to HDFS.

hdfs dfs -ls /

[Figure: copying the dataset to HDFS]

Step 6:

Now run your jar file with the command below; the job writes its output to the MyOutput directory.
Syntax:

hadoop jar /jar_file_location /dataset_location_in_HDFS /output-file_name

Command:

hadoop jar /home/dikshant/Documents/Project.jar /CRND0103-2020-AK_Fairbanks_11_NE.txt /MyOutput

[Figure: running the jar file for the analysis]
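
The two paths after the jar name reach the driver as args[0] (input) and args[1] (output). Since the driver deletes the output path before the job starts, a mistyped command could remove data, so it may be worth checking the arguments first. The sketch below shows one possible check; the class DriverArgs, its validate method and the messages are hypothetical and not part of the original program.

```java
class DriverArgs {

    // Returns an error message, or null when the arguments look usable.
    static String validate(String[] args) {
        if (args.length != 2) {
            return "Usage: hadoop jar <jar_file> <input_path> <output_path>";
        }
        // The driver deletes the output path, so it must not be the input.
        if (args[0].equals(args[1])) {
            return "Input and output paths must be different";
        }
        return null;
    }

    public static void main(String[] args) {
        String error = validate(args);
        if (error != null) {
            System.err.println(error);
            System.exit(2);
        }
        System.out.println("input=" + args[0] + " output=" + args[1]);
    }
}
```

In the MyMaxMin driver, a call like this would go at the top of main, before the Configuration is created.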

Step 7:

Now go to localhost:50070/ (the NameNode web UI; on Hadoop 3.x the default port is 9870). Under Utilities, select Browse the file system and download part-r-00000 from the /MyOutput directory to see the result.

[Figure: HDFS file browser, view 1]

[Figure: HDFS file browser, view 2]
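
Once part-r-00000 is on the local disk it can also be summarised programmatically. With TextOutputFormat, each line holds the key and the value separated by a tab, and the keys are the strings written by the mapper. The sketch below counts hot and cold days from such lines; the class OutputSummary and the sample lines are hypothetical, and reading the real file (for example with Files.readAllLines) is left out so the example runs on its own.

```java
import java.util.Arrays;
import java.util.List;

class OutputSummary {

    // Returns {hotDays, coldDays} for lines shaped like "key<TAB>value".
    static int[] count(List<String> lines) {
        int hot = 0;
        int cold = 0;
        for (String line : lines) {
            String[] parts = line.split("\t", 2);
            if (parts.length < 2) {
                continue; // skip blank or malformed lines
            }
            if (parts[0].startsWith("The Day is Hot Day")) {
                hot++;
            } else if (parts[0].startsWith("The Day is Cold Day")) {
                cold++;
            }
        }
        return new int[] {hot, cold};
    }

    public static void main(String[] args) {
        // Made-up lines in the shape the job writes, not real results.
        List<String> lines = Arrays.asList(
                "The Day is Cold Day :20200101\t-28.9",
                "The Day is Cold Day :20200102\t-30.1",
                "The Day is Hot Day :20200701\t31.2");
        int[] counts = count(lines);
        System.out.println("hot=" + counts[0] + " cold=" + counts[1]);
    }
}
```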

Step 8:

View the result in the downloaded file.

[Figure: top 10 results obtained]

In the image above, you can see the top 10 results, which show cold days. The date in each key is in yyyymmdd format. For example, 20200101 means

year = 2020
month = 01
day = 01
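
The same breakdown can be done in code. The standard java.time formatter BASIC_ISO_DATE parses exactly this yyyymmdd layout; the class name OutputDate below is hypothetical and the snippet is only a small illustration.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

class OutputDate {

    // Parses a date written as yyyymmdd, such as 20200101.
    static LocalDate parse(String yyyymmdd) {
        return LocalDate.parse(yyyymmdd, DateTimeFormatter.BASIC_ISO_DATE);
    }

    public static void main(String[] args) {
        LocalDate date = parse("20200101");
        System.out.println("year = " + date.getYear());
        System.out.println("month = " + date.getMonthValue());
        System.out.println("day = " + date.getDayOfMonth());
    }
}
```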

Machine-translated post

Article written by dikshantmalidev and translated by Barcelona Geeks. The original can be accessed here. Licence: CC BY-SA