Clasificación externa – Barcelona Geeks

Clasificación externa es un término para una clase de algoritmos de clasificación que pueden manejar grandes cantidades de datos. La clasificación externa es necesaria cuando los datos que se clasifican no caben en la memoria principal de un dispositivo informático (generalmente RAM) y, en cambio, deben residir en la memoria externa más lenta (generalmente un disco duro). La clasificación externa suele utilizar una estrategia híbrida de clasificación y fusión. En la fase de clasificación, los fragmentos de datos lo suficientemente pequeños como para caber en la memoria principal se leen, clasifican y escriben en un archivo temporal. En la fase de fusión, los subarchivos ordenados se combinan en un solo archivo más grande.
Un ejemplo de clasificación externa es el algoritmo de clasificación de combinación externa, que clasifica los fragmentos que caben en la RAM y luego fusiona los fragmentos ordenados. Primero dividimos el archivo en corridasde modo que el tamaño de una ejecución sea lo suficientemente pequeño como para caber en la memoria principal. Luego ordene cada ejecución en la memoria principal utilizando el algoritmo de clasificación de clasificación por combinación. Finalmente, fusione las ejecuciones resultantes en ejecuciones cada vez más grandes, hasta que se ordene el archivo.

Cuando hacemos una clasificación externa:
1. Cuando los datos no clasificados son demasiado grandes para realizar la clasificación en la memoria interna de la computadora, usamos la clasificación externa.
2. En la clasificación externa usamos el dispositivo secundario. en un dispositivo de almacenamiento secundario, usamos la array de discos de cinta.
3. cuando los datos son grandes, como en la ordenación combinada y la ordenación rápida.
4. Clasificación rápida: mejor tiempo de ejecución promedio.
5. Merge Sort: Mejor tiempo en el peor de los casos.
6. Para realizar la operación de ordenación y combinación, únase a los datos.
7. Para realizar el pedido por la consulta.
8. Para seleccionar elemento duplicado.
9. Donde necesitamos recibir una gran cantidad de información del usuario.

Ejemplo:

Ordenar por fusión
clasificación de cintas
Clasificación polifásica
raíz externa
Fusión externa

Requisito previo para el algoritmo/código:
MergeSort : se usa para ordenar ejecuciones individuales (una ejecución es parte de un archivo que es lo suficientemente pequeño como para caber en la memoria principal)
Merge K Sorted Arrays : se usa para fusionar ejecuciones ordenadas.
A continuación se muestran los pasos utilizados en la implementación de C++.
Entradas:

input_file  : Name of input file. input.txt
output_file : Name of output file, output.txt
run_size : Size of a run (can fit in RAM)
num_ways : Number of runs to be merged

Solución:
La idea es muy simple, no se pueden ordenar todos los elementos a la vez ya que el tamaño es muy grande. Entonces, los datos se dividen en fragmentos y luego se ordenan mediante la clasificación por combinación. Luego, los datos ordenados se vuelcan en archivos. Como una cantidad tan grande de datos no se puede manejar por completo. Ahora, después de clasificar los trozos individuales. Ordene toda la array utilizando la idea de fusionar k arrays ordenadas.
Algoritmo:

Lea input_file de tal manera que, como máximo, los elementos ‘run_size’ se lean a la vez. Haga lo siguiente para cada lectura de ejecución en una array.
Ordene la ejecución usando MergeSort .
Almacene la array ordenada en un archivo. Digamos ‘i’ para i-ésimo archivo.
Combine los archivos ordenados utilizando el enfoque discutido fusionar k arrays ordenadas

A continuación se muestra la implementación en C++ de los pasos anteriores.

CPP

// C++ program to implement
// external sorting using
// merge sort
#include <bits/stdc++.h>
using namespace std;
  
struct MinHeapNode {
    // The element to be stored
    int element;
  
    // index of the array from which
    // the element is taken
    int i;
};
  
// Prototype of a utility function
// to swap two min heap nodes
void swap(MinHeapNode* x, MinHeapNode* y);
  
// A class for Min Heap
class MinHeap {
    // pointer to array of elements in heap
    MinHeapNode* harr;
  
    // size of min heap
    int heap_size;
  
public:
    // Constructor: creates a min
    // heap of given size
    MinHeap(MinHeapNode a[], int size);
  
    // to heapify a subtree with
    // root at given index
    void MinHeapify(int);
  
    // to get index of left child
    // of node at index i
    int left(int i) { return (2 * i + 1); }
  
    // to get index of right child
    // of node at index i
    int right(int i) { return (2 * i + 2); }
  
    // to get the root
    MinHeapNode getMin() { return harr[0]; }
  
    // to replace root with new node
    // x and heapify() new root
    void replaceMin(MinHeapNode x)
    {
        harr[0] = x;
        MinHeapify(0);
    }
};
  
// Constructor: Builds a heap from
// a given array a[] of given size
MinHeap::MinHeap(MinHeapNode a[], int size)
{
    heap_size = size;
    harr = a; // store address of array
    int i = (heap_size - 1) / 2;
    while (i >= 0) {
        MinHeapify(i);
        i--;
    }
}
  
// A recursive method to heapify
// a subtree with root
// at given index. This method
// assumes that the
// subtrees are already heapified
void MinHeap::MinHeapify(int i)
{
    int l = left(i);
    int r = right(i);
    int smallest = i;
    if (l < heap_size && harr[l].element < harr[i].element)
        smallest = l;
    if (r < heap_size && harr[r].element < harr[smallest].element)
        smallest = r;
    if (smallest != i) {
        swap(&harr[i], &harr[smallest]);
        MinHeapify(smallest);
    }
}
  
// A utility function to swap two elements
void swap(MinHeapNode* x, MinHeapNode* y)
{
    MinHeapNode temp = *x;
    *x = *y;
    *y = temp;
}
  
// Merges two subarrays of arr[].
// First subarray is arr[l..m]
// Second subarray is arr[m+1..r]
void merge(int arr[], int l, int m, int r)
{
    int i, j, k;
    int n1 = m - l + 1;
    int n2 = r - m;
  
    /* create temp arrays */
    int L[n1], R[n2];
  
    /* Copy data to temp arrays L[] and R[] */
    for (i = 0; i < n1; i++)
        L[i] = arr[l + i];
    for (j = 0; j < n2; j++)
        R[j] = arr[m + 1 + j];
  
    /* Merge the temp arrays back into arr[l..r]*/
    // Initial index of first subarray
    i = 0;
  
    // Initial index of second subarray
    j = 0;
  
    // Initial index of merged subarray
    k = l;
    while (i < n1 && j < n2) {
        if (L[i] <= R[j])
            arr[k++] = L[i++];
        else
            arr[k++] = R[j++];
    }
  
    /* Copy the remaining elements of L[],
        if there are any */
    while (i < n1)
        arr[k++] = L[i++];
  
    /* Copy the remaining elements of R[],
        if there are any */
    while (j < n2)
        arr[k++] = R[j++];
}
  
/* l is for left index and r is right index of the
   sub-array of arr to be sorted */
void mergeSort(int arr[], int l, int r)
{
    if (l < r) {
        // Same as (l+r)/2, but avoids overflow for
        // large l and h
        int m = l + (r - l) / 2;
  
        // Sort first and second halves
        mergeSort(arr, l, m);
        mergeSort(arr, m + 1, r);
  
        merge(arr, l, m, r);
    }
}
  
FILE* openFile(char* fileName, char* mode)
{
    FILE* fp = fopen(fileName, mode);
    if (fp == NULL) {
        perror("Error while opening the file.\n");
        exit(EXIT_FAILURE);
    }
    return fp;
}
  
// Merges k sorted files. Names of files are assumed
// to be 1, 2, 3, ... k
void mergeFiles(char* output_file, int n, int k)
{
    FILE* in[k];
    for (int i = 0; i < k; i++) {
        char fileName[2];
  
        // convert i to string
        snprintf(fileName, sizeof(fileName),
                 "%d", i);
  
        // Open output files in read mode.
        in[i] = openFile(fileName, "r");
    }
  
    // FINAL OUTPUT FILE
    FILE* out = openFile(output_file, "w");
  
    // Create a min heap with k heap
    // nodes. Every heap node
    // has first element of scratch
    // output file
    MinHeapNode* harr = new MinHeapNode[k];
    int i;
    for (i = 0; i < k; i++) {
        // break if no output file is empty and
        // index i will be no. of input files
        if (fscanf(in[i], "%d ", &harr[i].element) != 1)
            break;
  
        // Index of scratch output file
        harr[i].i = i;
    }
    // Create the heap
    MinHeap hp(harr, i);
  
    int count = 0;
  
    // Now one by one get the
    // minimum element from min
    // heap and replace it with
    // next element.
    // run till all filled input
    // files reach EOF
    while (count != i) {
        // Get the minimum element
        // and store it in output file
        MinHeapNode root = hp.getMin();
        fprintf(out, "%d ", root.element);
  
        // Find the next element that
        // will replace current
        // root of heap. The next element
        // belongs to same
        // input file as the current min element.
        if (fscanf(in[root.i], "%d ",
                   &root.element)
            != 1) {
            root.element = INT_MAX;
            count++;
        }
  
        // Replace root with next
        // element of input file
        hp.replaceMin(root);
    }
  
    // close input and output files
    for (int i = 0; i < k; i++)
        fclose(in[i]);
  
    fclose(out);
}
  
// Using a merge-sort algorithm,
// create the initial runs
// and divide them evenly among
// the output files
void createInitialRuns(
    char* input_file, int run_size,
    int num_ways)
{
    // For big input file
    FILE* in = openFile(input_file, "r");
  
    // output scratch files
    FILE* out[num_ways];
    char fileName[2];
    for (int i = 0; i < num_ways; i++) {
        // convert i to string
        snprintf(fileName, sizeof(fileName),
                 "%d", i);
  
        // Open output files in write mode.
        out[i] = openFile(fileName, "w");
    }
  
    // allocate a dynamic array large enough
    // to accommodate runs of size run_size
    int* arr = (int*)malloc(
        run_size * sizeof(int));
  
    bool more_input = true;
    int next_output_file = 0;
  
    int i;
    while (more_input) {
        // write run_size elements
        // into arr from input file
        for (i = 0; i < run_size; i++) {
            if (fscanf(in, "%d ", &arr[i]) != 1) {
                more_input = false;
                break;
            }
        }
  
        // sort array using merge sort
        mergeSort(arr, 0, i - 1);
  
        // write the records to the
        // appropriate scratch output file
        // can't assume that the loop
        // runs to run_size
        // since the last run's length
        // may be less than run_size
        for (int j = 0; j < i; j++)
            fprintf(out[next_output_file],
                    "%d ", arr[j]);
  
        next_output_file++;
    }
  
    // close input and output files
    for (int i = 0; i < num_ways; i++)
        fclose(out[i]);
  
    fclose(in);
}
  
// For sorting data stored on disk
void externalSort(
    char* input_file, char* output_file,
    int num_ways, int run_size)
{
    // read the input file,
    // create the initial runs,
    // and assign the runs to
    // the scratch output files
    createInitialRuns(input_file,
                      run_size, num_ways);
  
    // Merge the runs using
    // the K-way merging
    mergeFiles(output_file, run_size, num_ways);
}
  
// Driver program to test above
int main()
{
    // No. of Partitions of input file.
    int num_ways = 10;
  
    // The size of each partition
    int run_size = 1000;
  
    char input_file[] = "input.txt";
    char output_file[] = "output.txt";
  
    FILE* in = openFile(input_file, "w");
  
    srand(time(NULL));
  
    // generate input
    for (int i = 0; i < num_ways * run_size; i++)
        fprintf(in, "%d ", rand());
  
    fclose(in);
  
    externalSort(input_file, output_file, num_ways,
                 run_size);
  
    return 0;
}

Análisis de Complejidad:

Complejidad temporal: O(n * log n).
El tiempo necesario para la ordenación por combinación es O (ejecuciones * run_size * log run_size), que es igual a O (n log run_size). Para fusionar las arrays ordenadas, la complejidad del tiempo es O (n * ejecuciones de registro). Por lo tanto, la complejidad de tiempo general es O(n * log run_size + n * log runs). Dado que log run_size + log runs = log run_size*runs = log n, la complejidad del tiempo del resultado será O(n * log n).
Espacio auxiliar: O(run_size).
run_size es el espacio necesario para almacenar la array.

Nota: este código no funcionará en el compilador en línea, ya que requiere permisos de creación de archivos. Cuando se ejecuta la máquina local, produce un archivo de entrada de muestra «input.txt» con 10000 números aleatorios. Ordena los números y coloca los números ordenados en un archivo «output.txt». También genera archivos con los nombres 1, 2, .. para almacenar ejecuciones ordenadas.
Referencias:

Este artículo es una contribución de Aditya Goel . Escriba comentarios si encuentra algo incorrecto o si desea compartir más información sobre el tema tratado anteriormente.

Publicación traducida automáticamente

Artículo escrito por GeeksforGeeks-1 y traducido por Barcelona Geeks. The original can be accessed here. Licence: CCBY-SA

CPP

Deja una respuesta Cancelar la respuesta