Given a data set, we want to find the k most frequent words in it.
A solution to this problem has already been presented in Find the k most frequent words from a file. In Python, however, the problem can be solved very efficiently with the help of a high-performance standard module.
To do this, we will use collections, a module that provides specialized container data types. From this module we will use the Counter class.
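As a quick illustration of what Counter does, here is a minimal sketch on a small hand-made list (the list and the variable name word_counts are illustrative only):

```python
from collections import Counter

# Counter is a dict subclass that maps each hashable item to its count.
word_counts = Counter(["a", "b", "a", "c", "a", "b"])

print(word_counts["a"])            # 3
print(word_counts.most_common(2))  # [('a', 3), ('b', 2)]
```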
Examples:
Input : "John is the son of John second. Second son of John second is William second."
Output : [('John', 3), ('second', 3), ('is', 2), ('son', 2)]
Explanation :
1. The string is converted into a list like this (punctuation removed):
   ['John', 'is', 'the', 'son', 'of', 'John', 'second', 'Second', 'son', 'of', 'John', 'second', 'is', 'William', 'second']
2. most_common(4) then returns the four most frequent words and their counts as tuples. Counting is case-sensitive, so 'Second' is counted separately from 'second', and words with equal counts appear in the order they were first encountered.

Input : "geeks for geeks is for geeks. By geeks and for the geeks."
Output : [('geeks', 5), ('for', 3)]
Explanation : most_common(2) returns the two most frequent words and their counts.
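The examples above count words without their trailing full stops, whereas a bare split() would yield tokens such as 'geeks.'. A minimal sketch of one way to get there, assuming that stripping leading and trailing punctuation from each token is enough for the data at hand (the variable names are illustrative):

```python
import string
from collections import Counter

text = "geeks for geeks is for geeks. By geeks and for the geeks."

# Strip leading/trailing punctuation from each token so that
# "geeks." and "geeks" are counted as the same word.
words = [token.strip(string.punctuation) for token in text.split()]

print(Counter(words).most_common(2))  # [('geeks', 5), ('for', 3)]
```

Punctuation inside a token, such as the apostrophe in "don't", is left untouched by this approach.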
Approach:
- Import the Counter class from the collections module.
- Split the string using split(), which returns a list of words.
- Pass that list to the Counter constructor to create a Counter instance.
- Call most_common(k) on the instance; it returns a list of the k most frequent words together with their counts.
Below is the Python implementation of the above approach:
```python
# Python program to find the k most frequent words
# in a data set
from collections import Counter

# Adjacent string literals inside parentheses are concatenated;
# each piece ends with a space so that words do not run together.
data_set = (
    "Welcome to the world of Geeks "
    "This portal has been created to provide well written well "
    "thought and well explained solutions for selected questions "
    "If you like Geeks for Geeks and would like to contribute "
    "here is your chance You can write article and mail your article "
    "to contribute at geeksforgeeks org See your article appearing on "
    "the Geeks for Geeks main page and help thousands of other Geeks. "
)

# split() returns a list of all the words in the string
split_it = data_set.split()

# Pass the split_it list to the Counter class. The instance gets
# its own name so that it does not shadow the class itself.
word_counts = Counter(split_it)

# most_common(k) returns the k most frequently encountered
# values and their respective counts.
most_occur = word_counts.most_common(4)

print(most_occur)
```
Output:
[('Geeks', 5), ('to', 4), ('and', 4), ('well', 3)]
Several words occur three times here ('well', 'for', 'your', 'article'). In Python 3.7 and later, words with equal counts are returned in the order they were first encountered, so 'well' takes the fourth place.
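When the data set lives in a file, the same approach can be applied line by line with Counter.update(), so the whole text never has to be held in memory at once. This is a minimal sketch, assuming a plain-text UTF-8 file; the function name most_frequent_words and the file name in the usage comment are hypothetical:

```python
from collections import Counter


def most_frequent_words(path, k):
    """Return the k most frequent whitespace-separated words in a text file."""
    counts = Counter()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            # update() adds this line's word counts to the running totals.
            counts.update(line.split())
    return counts.most_common(k)


# Usage (hypothetical file name):
# print(most_frequent_words("data_set.txt", 4))
```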