Contando k-mers a través de Suffix Array

Requisito previo: array de sufijos. ¿Qué son los k-mers? El término k-mer s generalmente se refiere a todas las posibles substrings de longitud k que están contenidas en una string. Contar todos los k-mers en las lecturas de secuenciación de ADN/ARN es el paso preliminar de muchas aplicaciones bioinformáticas. ¿Qué es una array de sufijos? Una array de sufijos es una array ordenada de todos los sufijos de una string. Es una estructura de datos utilizada, entre otros, en índices de texto completo, algoritmos de compresión de datos. Puede encontrar más información aquí. Problema: Nos dan una string str y un entero k. Tenemos que encontrar todos los pares (substr, i) tales que substr sea una substring de longitud k de str que ocurra exactamente i veces. Pasos involucrados en el enfoque:Tomemos la palabra “banana$” como ejemplo. Paso 1: Calcule la array de sufijos del texto dado.

          6     $   
          5     a$
          3     ana$
          1     anana$
          0     banana$
          4     na$                    
          2     nana$

Paso 2: iterar a través de la array de sufijos manteniendo “curr_count” . 1. Si la longitud del sufijo actual es menor que k, omita la iteración. Es decir, si k = 2 , la iteración se omitirá cuando el sufijo actual sea $ . 2. Si el sufijo actual comienza con la misma longitud – k substring que el sufijo anterior, entonces incremente curr_count. Por ejemplo, durante la cuarta iteración, el sufijo actual “anana$” comienza con la misma substring de longitud k “an”como el sufijo anterior «ana$» comenzaba. Entonces, incrementaremos curr_count en este caso. 3. Si la condición 2 no se cumple, entonces si la longitud del sufijo anterior es igual a k, entonces es un par válido y lo generaremos junto con su conteo actual; de lo contrario, omitiremos esa iteración.

                 curr_count  Valid Pair
 6     $           1                     
 5     a$          1
 3     ana$        1         (a$, 1)
 1     anana$      1
 0     banana$     2         (an, 2)
 4     na$         1         (ba, 1)               
 2     nana$       1         (na, 2)

Ejemplos:

Input : banana$ // Input text
Output : (a$, 1) // k- mers
         (an, 2)
         (ba, 1)
         (na, 2)
 
Input : geeksforgeeks
Output : (ee, 2) 
         (ek, 2)
         (fo, 1)
         (ge, 2)
         (ks, 2)
         (or, 1)
         (sf, 1)

El siguiente es el código C para el enfoque explicado anteriormente:

C

// C program to solve K-mers counting problem
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
 
// Structure to store data of a rotation
struct rotation {
    int index;
    char* suffix;
};
 
// Compares the rotations and
// sorts the rotations alphabetically
int cmpfunc(const void* x, const void* y)
{
    struct rotation* rx = (struct rotation*)x;
    struct rotation* ry = (struct rotation*)y;
    return strcmp(rx->suffix, ry->suffix);
}
 
// Takes input_text and its length as arguments
// and returns the corresponding suffix array
char** computeSuffixArray(char* input_text,
                               int len_text)
{
    int i;
 
    // Array of structures to store rotations
    // and their indexes
    struct rotation suff[len_text];
 
    // Structure is needed to maintain old
    // indexes of rotations after sorting them
    for (i = 0; i < len_text; i++) {
        suff[i].index = i;
        suff[i].suffix = (input_text + i);
    }
 
    // Sorts rotations using comparison function
    // defined above
    qsort(suff, len_text, sizeof(struct rotation), cmpfunc);
 
    // Stores the suffixes of sorted rotations
    char** suffix_arr =
       (char**)malloc(len_text * sizeof(char*));
 
    for (i = 0; i < len_text; i++) {
        suffix_arr[i] =
        (char*)malloc((len_text + 1) * sizeof(char));
        strcpy(suffix_arr[i], suff[i].suffix);
    }
 
    // Returns the computed suffix array
    return suffix_arr;
}
 
// Takes suffix array, its size and valid length as
// arguments and outputs the valid pairs of k - mers
void findValidPairs(char** suffix_arr, int n, int k)
{
    int curr_count = 1, i;
    char* prev_suff = (char*)malloc(n * sizeof(char));
 
    // Iterates over the suffix array,
    // keeping a current count
    for (i = 0; i < n; i++) {
 
        // Skipping the current suffix
        // if it has length < valid length
        if (strlen(suffix_arr[i]) < k) {
 
            if (i != 0 && strlen(prev_suff) == k) {
                printf("(%s, %d)\n", prev_suff, curr_count);
                curr_count = 1;}
 
            strcpy(prev_suff, suffix_arr[i]);
            continue;
        }
 
        // Incrementing the curr_count if first
        // k chars of prev_suff and current suffix
        // are same
        if (!(memcmp(prev_suff, suffix_arr[i], k))) {
            curr_count++;
        }
        else {
 
            // Pair is valid when i!=0 (as there is
            // no prev_suff for i = 0) and when strlen
            // of prev_suff is k
            if (i != 0 && strlen(prev_suff) == k) {
                printf("(%s, %d)\n", prev_suff, curr_count);
                curr_count = 1;
                memcpy(prev_suff, suffix_arr[i], k);
                prev_suff[k] = '\0';
            }
            else {
                memcpy(prev_suff, suffix_arr[i], k);
                prev_suff[k] = '\0';
                continue;
            }
        }
 
        // Modifying prev_suff[i] to current suffix
        memcpy(prev_suff, suffix_arr[i], k);
        prev_suff[k] = '\0';
    }
 
    // Printing the last valid pair
    printf("(%s, %d)\n", prev_suff, curr_count);
}
 
// Driver program to test functions above
int main()
{
    char input_text[] = "geeksforgeeks";
    int k = 2;
    int len_text = strlen(input_text);
 
    // Computes the suffix array of our text
    printf("Input Text: %s\n", input_text);
    char** suffix_arr =
      computeSuffixArray(input_text, len_text);
 
    // Finds and outputs all valid pairs
    printf("k-mers: \n");
    findValidPairs(suffix_arr, len_text, k);
 
    return 0;
}

Producción:

Input Text: banana$ 
k-mers: 
(a$, 1)
(an, 2)
(ba, 1)
(na, 2)

Complejidad de tiempo: O(s*len_text*log(len_text)), asumiendo que s es la longitud del sufijo más largo. Fuentes: 1. Suffix Array Wikipedia 2. Suffix Array CMU

Publicación traducida automáticamente

Artículo escrito por AnureetKaur y traducido por Barcelona Geeks. The original can be accessed here. Licence: CCBY-SA

C

Deja una respuesta Cancelar la respuesta