Create an inverted index for a file using Python

An inverted index is an index data structure that stores a mapping from content, such as words or numbers, to its locations in a document or a set of documents. In simple terms, it is a hashmap-like data structure that directs you from a word to the documents or web pages that contain it.
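To make the idea concrete, here is a minimal sketch of an inverted index held in a plain Python dictionary. The collection `docs` and its document IDs are made up for illustration and are not part of the original article.

```python
# Hypothetical mini-collection: document ID -> text
docs = {
    "doc1": "python is fun",
    "doc2": "python is fast",
    "doc3": "indexing is fun",
}

# Inverted index: word -> list of document IDs that contain the word
index = {}
for doc_id, text in docs.items():
    # set() so a word repeated inside one document is recorded once
    for word in set(text.split()):
        index.setdefault(word, []).append(doc_id)

# Keep every list of document IDs in a predictable order
for postings in index.values():
    postings.sort()

print(index["python"])  # ['doc1', 'doc2']
print(index["fun"])     # ['doc1', 'doc3']
```

Looking up a word is then a single dictionary access, instead of a scan over every document.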

Creating the inverted index

We will build a word-level inverted index: for each word, it returns the list of lines in which that word appears. The index is a dictionary whose keys are the words present in the file and whose values are lists of the line numbers on which each word occurs. To create a file in a Jupyter notebook, use the cell magic:

%%writefile file.txt
This is the first word.
This is the second text, Hello! How are you?
This is the third, this is it now.

This creates a file named file.txt with the content shown above.
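Outside a notebook the %%writefile magic is not available. As a rough equivalent, the same file can be written with plain Python. This sketch assumes the last line has no trailing newline, which matches the output shown later in this article.

```python
from pathlib import Path

# Same three lines as in the %%writefile cell. There is no newline
# after the last line (an assumption based on the article's output).
content = (
    "This is the first word.\n"
    "This is the second text, Hello! How are you?\n"
    "This is the third, this is it now."
)

# Write the text to file.txt in the current working directory
Path("file.txt").write_text(content, encoding="utf8")
```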

To read the file:

Python3

# open the file and read its whole content
file = open('file.txt', encoding='utf8')
read = file.read()
# move the cursor back to the start so the file can be read again
file.seek(0)
read

# count the lines by counting newline characters
# (assumes the last line has no trailing newline)
line = 1
for char in read:
    if char == '\n':
        line += 1
print("Number of lines in file is:", line)

# store each line as an element of a list
array = []
for i in range(line):
    array.append(file.readline())
file.close()

array

Output:

Number of lines in file is: 3
['This is the first word.\n',
'This is the second text, Hello! How are you?\n',
'This is the third, this is it now.']

Functions used:

open: Opens the file and returns a file object.
read: Reads the whole content of the file as a single string.
seek(0): Moves the cursor back to the beginning of the file.
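The three calls above can also be combined with a with-statement, which closes the file automatically. The following is a self-contained sketch rather than the article's own code: it writes a small made-up sample to a temporary file first, and the names `sample`, `path`, `text` and `lines` are illustrative.

```python
import os
import tempfile

sample = "first line\nsecond line\nthird line"

# Write the sample to a temporary file so the example is self-contained
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf8") as f:
        f.write(sample)

    # The with-statement closes the file when the block ends
    with open(path, encoding="utf8") as f:
        text = f.read()        # whole file as one string
        f.seek(0)              # move the cursor back to the start
        lines = f.readlines()  # list of lines, newline characters kept

print(len(lines))  # 3
print(lines)       # ['first line\n', 'second line\n', 'third line']
```

Using readlines() avoids counting newline characters by hand, and it works whether or not the file ends with a trailing newline.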

Remove punctuation:

Python3

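# raw string, so the backslash is taken literally
punc = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''
# replace every punctuation character with a space
for ele in punc:
    read = read.replace(ele, " ")

read

# lowercase everything to maintain uniformity
read = read.lower()
read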

Output:

'this is the first word \nthis is the second text  hello  how are you \nthis is the third  this is it now '
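As an alternative sketch, the same clean-up can be done in one pass with `str.translate` and the standard library's `string.punctuation`. Note that `string.punctuation` is not identical to the `punc` string used above (it also contains characters such as `+`, `=` and `|`), so this is a variation and not a drop-in copy of the article's step.

```python
import string

text = "This is the second text, Hello! How are you?"

# Translation table that maps every ASCII punctuation character to a space
table = str.maketrans(string.punctuation, " " * len(string.punctuation))

# Replace punctuation, then lowercase for uniformity
cleaned = text.translate(table).lower()

print(cleaned.split())
# ['this', 'is', 'the', 'second', 'text', 'hello', 'how', 'are', 'you']
```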

Clean the data by removing stop words:

Stop words are very common words (such as "the", "is" or "are") that carry little meaning on their own and can safely be ignored without sacrificing the meaning of the sentence.

Python3

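import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
# word_tokenize needs the Punkt tokenizer models;
# newer NLTK releases may ask for 'punkt_tab' instead
nltk.download('punkt')

# split the cleaned text into word tokens
text_tokens = word_tokenize(read)

# build the English stop-word set once instead of
# reloading every language's list for each token
stop_words = set(stopwords.words('english'))
tokens_without_sw = [
    word for word in text_tokens if word not in stop_words]

print(tokens_without_sw)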

Output:

['first', 'word', 'second', 'text', 'hello', 'third']
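If NLTK is not installed, the filtering step itself can be illustrated with a hand-written stop-word set. The set `STOP_WORDS` below is a small assumption made for this example only; NLTK's English list is much longer.

```python
# Hypothetical, hand-picked stop words for this example only
STOP_WORDS = {"this", "is", "the", "how", "are", "you", "it", "now"}

# The sample text after punctuation removal and lowercasing
text = (
    "this is the first word "
    "this is the second text hello how are you "
    "this is the third this is it now"
)

seen = set()
kept = []
for word in text.split():
    # keep a word only if it is not a stop word and not seen before
    if word not in STOP_WORDS and word not in seen:
        seen.add(word)
        kept.append(word)

print(kept)  # ['first', 'word', 'second', 'text', 'hello', 'third']
```

Storing the stop words in a set keeps each membership test constant-time on average.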

Create the inverted index:

Python3

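# avoid the name 'dict', which would shadow the built-in type
inverted_index = {}

for i in range(line):
    # tokenize the line so that only whole words match (a substring
    # test would also find e.g. 'text' inside 'context')
    words_in_line = set(word_tokenize(array[i].lower()))
    for item in tokens_without_sw:
        if item in words_in_line:
            postings = inverted_index.setdefault(item, [])
            # record each line number only once per word
            if i + 1 not in postings:
                postings.append(i + 1)

inverted_index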

Output:

{'first': [1],
'word': [1],
'second': [2], 
'text': [2], 
'hello': [2], 
'third': [3]}
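The whole pipeline can also be condensed into one function. The following is a sketch under stated assumptions, not the article's method: `build_inverted_index` is a hypothetical helper, words are extracted with a simple regular expression instead of NLTK, and the stop-word set is a small hand-written one.

```python
import re
from collections import defaultdict

def build_inverted_index(lines, stop_words=frozenset()):
    """Map each word to the sorted list of 1-based line numbers containing it."""
    index = defaultdict(set)
    for line_no, line_text in enumerate(lines, start=1):
        # lowercase, then take runs of letters and digits as words
        for word in re.findall(r"[a-z0-9]+", line_text.lower()):
            if word not in stop_words:
                index[word].add(line_no)
    return {word: sorted(nums) for word, nums in index.items()}

lines = [
    "This is the first word.",
    "This is the second text, Hello! How are you?",
    "This is the third, this is it now.",
]
# Small illustrative stop-word set (an assumption, not NLTK's list)
stop_words = {"this", "is", "the", "how", "are", "you", "it", "now"}

index = build_inverted_index(lines, stop_words)
print(index["hello"])           # [2]
print(index.get("python", []))  # []
```

Collecting line numbers in a set avoids duplicate entries when a word occurs several times on the same line, and `index.get(word, [])` gives an empty result for words that are not in the file.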


Article written by romy421kumari and translated by Barcelona Geeks. The original can be accessed here. Licence: CC BY-SA