Python: Lemmatization Approaches with Examples

The following is a step-by-step guide to exploring several lemmatization approaches in Python, with examples and code implementations. It is highly recommended that you follow the given order unless you already understand the topic, in which case you can jump to any of the approaches listed below.

What is Lemmatization?
In contrast to stemming, lemmatization is much more powerful. It goes beyond word reduction and considers a language's full vocabulary to apply a morphological analysis to words, aiming to remove only inflectional endings and return the base or dictionary form of a word, which is known as the lemma.

For clarity, look at the following examples:

Original Word ---> Root Word (lemma)      Feature

   meeting    --->   meet                (core-word extraction)
   was        --->    be                 (tense conversion to present tense)
   mice       --->   mouse               (plural to singular)

TIP: Always convert your text to lowercase before performing any NLP task, including lemmatization.
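To make the difference between stemming and lemmatization concrete, here is a minimal pure-Python sketch. It is only an illustration: `LEMMA_DICT`, `naive_stem` and `toy_lemmatize` are hypothetical names invented for this example, and the three-entry dictionary stands in for the full vocabulary a real lemmatizer consults.

```python
# Toy illustration only: a real lemmatizer consults a full vocabulary.
# LEMMA_DICT is a hypothetical mini-dictionary built from the table above.
LEMMA_DICT = {
    "meeting": "meet",
    "was": "be",
    "mice": "mouse",
}


def naive_stem(word):
    """Crude suffix stripping, standing in for a stemmer."""
    for suffix in ("ing", "es", "s"):
        # Only strip when a reasonably long stem remains.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Dictionary lookup after lowercasing; unknown words pass through."""
    key = word.lower()
    return LEMMA_DICT.get(key, key)


for w in ["Meeting", "was", "Mice"]:
    print(w, "->", naive_stem(w.lower()), "|", toy_lemmatize(w))
# Meeting -> meet | meet
# was -> was | be
# Mice -> mice | mouse
```

Suffix stripping happens to work for 'meeting', but only the dictionary lookup can map 'was' to 'be' or 'mice' to 'mouse'. The lookup also lowercases its input first, which is the point of the tip above.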

Various approaches to lemmatization:
We will go over 9 different approaches to perform lemmatization, along with multiple examples and code implementations.

WordNet
WordNet (with POS tag)
TextBlob
TextBlob (with POS tag)
spaCy
TreeTagger
Pattern
Gensim
Stanford CoreNLP

1. Wordnet Lemmatizer
Wordnet is a publicly available lexical database of over 200 languages that provides semantic relationships between its words. It is one of the earliest and most commonly used lemmatization techniques.

It is available in the nltk library in Python.
Wordnet links words into semantic relations (e.g. synonyms).
It groups synonyms in the form of synsets.
- synsets: a group of data elements that are semantically equivalent.
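The idea of a synset can be sketched in a few lines of plain Python. This is not WordNet's API: `TOY_SYNSETS` and `synonyms` are hypothetical names, and the data is a made-up toy sample, far smaller and simpler than the real database.

```python
# Hypothetical toy data: real WordNet synsets are far richer than this.
TOY_SYNSETS = [
    frozenset({"car", "auto", "automobile"}),
    frozenset({"big", "large"}),
    frozenset({"car", "railcar"}),  # a word may belong to several synsets
]


def synonyms(word):
    """Collect every word that shares at least one synset with `word`."""
    found = set()
    for synset in TOY_SYNSETS:
        if word in synset:
            found |= synset
    found.discard(word)
    return sorted(found)


print(synonyms("car"))    # ['auto', 'automobile', 'railcar']
print(synonyms("large"))  # ['big']
```

Because one word can sit in several synsets, looking up its synonyms means taking the union of every group it belongs to.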

How to use:

Download the nltk package: in your anaconda prompt or terminal, type:
pip install nltk
Download Wordnet from nltk: in your Python console, run:
import nltk
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

Code:

Python3

import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
 
# Create WordNetLemmatizer object
wnl = WordNetLemmatizer()
 
# single word lemmatization examples
list1 = ['kites', 'babies', 'dogs', 'flying', 'smiling',
         'driving', 'died', 'tried', 'feet']
for words in list1:
    print(words + " ---> " + wnl.lemmatize(words))
     
#> kites ---> kite
#> babies ---> baby
#> dogs ---> dog
#> flying ---> flying
#> smiling ---> smiling
#> driving ---> driving
#> died ---> died
#> tried ---> tried
#> feet ---> foot

Code:

Python3

# sentence lemmatization examples
string = 'the cat is sitting with the bats on the striped mat under many flying geese'
 
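# Converting String into tokens
# (word_tokenize needs the 'punkt' tokenizer data; newer NLTK releases may ask for 'punkt_tab')
nltk.download('punkt')
list2 = nltk.word_tokenize(string)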
print(list2)
#> ['the', 'cat', 'is', 'sitting', 'with', 'the', 'bats', 'on',
#   'the', 'striped', 'mat', 'under', 'many', 'flying', 'geese']
 
lemmatized_string = ' '.join([wnl.lemmatize(words) for words in list2])
 
print(lemmatized_string)  
#> the cat is sitting with the bat on the striped mat under many flying goose

2. Wordnet Lemmatizer (with POS tag)
In the above approach, we observed that Wordnet's results were not up to the mark. Words like 'sitting', 'flying', etc. remained the same after lemmatization. This is because lemmatize() treats every word as a noun unless told otherwise, so these words were looked up as nouns rather than verbs. To overcome this, we use POS (Part of Speech) tags.
We attach a tag to each word that defines its type (verb, noun, adjective, etc.).
For example,
Word + Type (POS tag) ---> Lemmatized Word
driving + verb 'v' ---> drive
dogs + noun 'n' ---> dog
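The effect of the POS tag on the lookup can be sketched without any library. The example below is a toy under stated assumptions: `TOY_LEXICON`, `penn_to_wordnet` and `toy_lemmatize` are hypothetical names, and the four-entry lexicon only mimics how a (word, POS) pair selects a lemma.

```python
# Hypothetical mini-lexicon keyed by (word, POS); real lemmatizers use WordNet.
TOY_LEXICON = {
    ("saw", "v"): "see",
    ("saw", "n"): "saw",
    ("driving", "v"): "drive",
    ("dogs", "n"): "dog",
}


def penn_to_wordnet(penn_tag):
    """Map a Penn Treebank tag to a one-letter WordNet-style POS, or None."""
    return {"J": "a", "V": "v", "N": "n", "R": "r"}.get(penn_tag[:1])


def toy_lemmatize(word, penn_tag):
    """Look the word up under its POS, defaulting to noun like lemmatize()."""
    pos = penn_to_wordnet(penn_tag) or "n"
    return TOY_LEXICON.get((word, pos), word)


print(toy_lemmatize("saw", "VBD"))      # see
print(toy_lemmatize("saw", "NN"))       # saw
print(toy_lemmatize("driving", "VBG"))  # drive
print(toy_lemmatize("driving", "NN"))   # driving
```

The same surface form gives a different lemma depending on the tag, which is why a verb looked up as a noun comes back unchanged.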

Code:

Python3

# WORDNET LEMMATIZER (with appropriate pos tags)
 
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import wordnet
 
lemmatizer = WordNetLemmatizer()
 
# Define function to lemmatize each word with its POS tag
 
# POS_TAGGER_FUNCTION : TYPE 1
def pos_tagger(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:         
        return None
 
sentence = 'the cat is sitting with the bats on the striped mat under many flying geese'
 
# tokenize the sentence and find the POS tag for each token
pos_tagged = nltk.pos_tag(nltk.word_tokenize(sentence)) 
 
print(pos_tagged)
#>[('the', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ('sitting', 'VBG'), ('with', 'IN'),
# ('the', 'DT'), ('bats', 'NNS'), ('on', 'IN'), ('the', 'DT'), ('striped', 'JJ'),
# ('mat', 'NN'), ('under', 'IN'), ('many', 'JJ'), ('flying', 'VBG'), ('geese', 'JJ')]
 
# As you may have noticed, the above pos tags are a little confusing.
 
# we use our own pos_tagger function to make things simpler to understand.
wordnet_tagged = list(map(lambda x: (x[0], pos_tagger(x[1])), pos_tagged))
print(wordnet_tagged)
#>[('the', None), ('cat', 'n'), ('is', 'v'), ('sitting', 'v'), ('with', None),
# ('the', None), ('bats', 'n'), ('on', None), ('the', None), ('striped', 'a'),
# ('mat', 'n'), ('under', None), ('many', 'a'), ('flying', 'v'), ('geese', 'a')]
 
lemmatized_sentence = []
for word, tag in wordnet_tagged:
    if tag is None:
        # if there is no available tag, append the token as is
        lemmatized_sentence.append(word)
    else:       
        # else use the tag to lemmatize the token
        lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
lemmatized_sentence = " ".join(lemmatized_sentence)
 
print(lemmatized_sentence)
#> the cat be sit with the bat on the striped mat under many fly geese

3. TextBlob
TextBlob is a Python library used for processing textual data. It provides a simple API to access its methods and perform basic NLP tasks.

Download the TextBlob package: in your anaconda prompt or terminal, type:
pip install textblob

Code:

Python3

from textblob import TextBlob, Word
 
my_word = 'cats'
 
# create a Word object
w = Word(my_word)
 
print(w.lemmatize())
#> cat
 
sentence = 'the bats saw the cats with stripes hanging upside down by their feet.'
 
s = TextBlob(sentence)
lemmatized_sentence = " ".join([w.lemmatize() for w in s.words])
 
print(lemmatized_sentence)
#> the bat saw the cat with stripe hanging upside down by their foot

4. TextBlob (with POS tag)
As with the Wordnet approach without appropriate POS tags, we observe the same limitations here. So, we use one of the most powerful aspects of the TextBlob module, 'Part of Speech' tagging, to overcome this problem.

Code:

Python3

from textblob import TextBlob
 
# Define function to lemmatize each word with its POS tag
 
# POS_TAGGER_FUNCTION : TYPE 2
def pos_tagger(sentence):
    sent = TextBlob(sentence)
    tag_dict = {"J": 'a', "N": 'n', "V": 'v', "R": 'r'}
    words_tags = [(w, tag_dict.get(pos[0], 'n')) for w, pos in sent.tags]   
    lemma_list = [wd.lemmatize(tag) for wd, tag in words_tags]
    return lemma_list
 
# Lemmatize
sentence = "the bats saw the cats with stripes hanging upside down by their feet"
lemma_list = pos_tagger(sentence)
lemmatized_sentence = " ".join(lemma_list)
print(lemmatized_sentence)
#> the bat saw the cat with stripe hang upside down by their foot
 
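# For comparison: lemmatizing without POS tags leaves 'hanging' unchanged
t_blob = TextBlob(sentence)
lemmatized_sentence = " ".join([w.lemmatize() for w in t_blob.words])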
print(lemmatized_sentence)
#> the bat saw the cat with stripe hanging upside down by their foot

The tag abbreviations used above (NN, VBG, JJ, and so on) come from the Penn Treebank tag set; its tag table lists every abbreviation with its meaning.
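As a rough reading aid, the sketch below maps a handful of Penn Treebank tags to their meanings. It is a small, non-exhaustive subset chosen for the sentences in this article, and `PENN_TAGS` and `describe` are hypothetical names introduced only for illustration.

```python
# A small, non-exhaustive subset of the Penn Treebank tag set.
PENN_TAGS = {
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "JJS": "adjective, superlative",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "PRP$": "possessive pronoun",
    "RB": "adverb",
    "VBD": "verb, past tense",
    "VBG": "verb, gerund or present participle",
    "VBZ": "verb, 3rd person singular present",
}


def describe(tagged_tokens):
    """Attach a human-readable meaning to each (word, tag) pair."""
    return [(word, tag, PENN_TAGS.get(tag, "unknown tag"))
            for word, tag in tagged_tokens]


sample = [("bats", "NNS"), ("hanging", "VBG"), ("their", "PRP$")]
for word, tag, meaning in describe(sample):
    print(f"{word:8} {tag:5} {meaning}")
```

Only the first letter of each tag matters to the lemmatizers above: N, V, J and R map to noun, verb, adjective and adverb.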

5. spaCy
spaCy is an open-source Python library that parses and "understands" large volumes of text. Separate models are available that cater to specific languages (English, French, German, etc.).

How to use:
Download the spaCy package:
(a) Open your anaconda prompt or terminal as administrator and run the command:
pip install -U spacy
(b) Now, open your anaconda prompt or terminal normally and run the command:
python -m spacy download en_core_web_sm

If successful, you should see a message like the following (the exact wording depends on your spaCy version):

    Linking successful
    C:\Anaconda3\envs\spacyenv\lib\site-packages\en_core_web_sm -->
    C:\Anaconda3\envs\spacyenv\lib\site-packages\spacy\data\en

You can now load the model via spacy.load(), as the code below does.

Code:

Python3

import spacy
nlp = spacy.load('en_core_web_sm')
 
# Create a Doc object
doc = nlp(u'the bats saw the cats with best stripes hanging upside down by their feet')
 
# Create list of tokens from given string
tokens = []
for token in doc:
    tokens.append(token)
 
print(tokens)
#> [the, bats, saw, the, cats, with, best, stripes, hanging, upside, down, by, their, feet]
 
lemmatized_sentence = " ".join([token.lemma_ for token in doc])
 
print(lemmatized_sentence)
#> the bat see the cat with good stripe hang upside down by -PRON- foot

In the above code, we observed that this approach was more powerful than our previous approaches because:

Even pronouns were detected (identified by -PRON-). This placeholder is specific to spaCy 2.x; spaCy 3.x no longer emits it.
Even 'best' was changed to 'good'.

6. TreeTagger
TreeTagger is a tool for annotating text with part-of-speech and lemma information. TreeTagger has been successfully used to tag over 25 languages and is adaptable to other languages if a manually tagged training corpus is available.

Word	POS	Lemma
the	DT	the
TreeTagger	NP	TreeTagger
is	VBZ	be
easy	JJ	easy
to	TO	to
use	VB	use
.	SENT	.
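Each row of the table above is a tab-separated triple of word, POS tag and lemma. Assuming the tagger hands back lines in that shape, a minimal sketch for turning them into structured records might look like this (`TaggedToken` and `parse_tagger_line` are hypothetical names, not part of TreeTagger):

```python
from typing import NamedTuple


class TaggedToken(NamedTuple):
    word: str
    pos: str
    lemma: str


def parse_tagger_line(line):
    """Split one 'word<TAB>pos<TAB>lemma' line into a TaggedToken."""
    word, pos, lemma = line.rstrip("\n").split("\t")
    return TaggedToken(word, pos, lemma)


# Sample lines copied from the table above.
sample = [
    "the\tDT\tthe",
    "TreeTagger\tNP\tTreeTagger",
    "is\tVBZ\tbe",
    "easy\tJJ\teasy",
    "to\tTO\tto",
    "use\tVB\tuse",
    ".\tSENT\t.",
]

tokens = [parse_tagger_line(line) for line in sample]
print(" ".join(t.lemma for t in tokens))
# the TreeTagger be easy to use .
```

Named fields keep later code readable, since `t.lemma` is clearer than indexing into a split string.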

How to use:
1. Download the TreeTagger wrapper package: in your anaconda prompt or terminal, type:
pip install treetaggerwrapper
2. Download the TreeTagger software: go to the TreeTagger website and download the software for your OS.
(Installation steps are given on the website.)

Code:

Python3

# 6. TREETAGGER LEMMATIZER
import pandas as pd
import treetaggerwrapper as tt
 
# Raw string so the backslashes in the Windows path are not read as escape sequences
t_tagger = tt.TreeTagger(TAGLANG='en', TAGDIR=r'C:\Windows\TreeTagger')
 
pos_tags = t_tagger.tag_text("the bats saw the cats with best stripes hanging upside down by their feet")
 
original = []
lemmas = []
tags = []
for t in pos_tags:
    original.append(t.split('\t')[0])
    tags.append(t.split('\t')[1])
    lemmas.append(t.split('\t')[-1])
 
Results = pd.DataFrame({'Original': original, 'Lemma': lemmas, 'Tags': tags})
print(Results)
 
#>      Original  Lemma Tags
# 0       the     the   DT
# 1      bats     bat  NNS
# 2       saw     see  VVD
# 3       the     the   DT
# 4      cats     cat  NNS
# 5      with    with   IN
# 6      best    good  JJS
# 7   stripes  stripe  NNS
# 8   hanging    hang  VVG
# 9    upside  upside   RB
# 10     down    down   RB
# 11       by      by   IN
# 12    their   their  PP$
# 13     feet    foot  NNS

7. Pattern
Pattern is a Python package commonly used for web mining, natural language processing, machine learning, and network analysis. It has many useful NLP capabilities. It also contains a special feature which we will discuss below.

How to use:
Download the Pattern package: in your anaconda prompt or terminal, type:
pip install pattern

Code:

Python3

# PATTERN LEMMATIZER
import pattern
from pattern.en import lemma, lexeme
from pattern.en import parse
 
sentence = "the bats saw the cats with best stripes hanging upside down by their feet"
 
lemmatized_sentence = " ".join([lemma(word) for word in sentence.split()])
 
print(lemmatized_sentence)
#> the bat see the cat with best stripe hang upside down by their feet
 
# Special Feature : lexeme() lists the inflected forms of each word, treating it as a verb
all_lemmas_for_each_word = [lexeme(wd) for wd in sentence.split()]
print(all_lemmas_for_each_word)
 
#> [['the', 'thes', 'thing', 'thed'],
#   ['bat', 'bats', 'batting', 'batted'],
#   ['see', 'sees', 'seeing', 'saw', 'seen'],
#   ['the', 'thes', 'thing', 'thed'],
#   ['cat', 'cats', 'catting', 'catted'],
#   ['with', 'withs', 'withing', 'withed'],
#   ['best', 'bests', 'besting', 'bested'],
#   ['stripe', 'stripes', 'striping', 'striped'],
#   ['hang', 'hangs', 'hanging', 'hung'],
#   ['upside', 'upsides', 'upsiding', 'upsided'],
#   ['down', 'downs', 'downing', 'downed'],
#   ['by', 'bies', 'bying', 'bied'],
#   ['their', 'theirs', 'theiring', 'theired'],
#   ['feet', 'feets', 'feeting', 'feeted']]

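NOTE: if the above code raises an error saying 'generator raised StopIteration', just run it again. It should work after 3-4 tries. The error appears when Pattern runs on Python 3.7 or later, where a StopIteration escaping a generator is turned into a RuntimeError.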
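Instead of re-running the script by hand, the retry can be automated. The following is a minimal sketch, not part of Pattern: `call_with_retries` is a hypothetical helper, and `flaky_lemma` is a stand-in that fails twice before succeeding so the example runs without Pattern installed.

```python
def call_with_retries(func, *args, attempts=4):
    """Call func(*args), retrying when a RuntimeError is raised.

    Assumes attempts >= 1. Python 3.7+ reports a StopIteration that
    escapes a generator as a RuntimeError, so that is what we catch.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return func(*args)
        except RuntimeError as err:
            last_error = err
    raise last_error


# Stand-in for a flaky call: fails on the first two calls, then works.
calls = {"n": 0}


def flaky_lemma(word):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("generator raised StopIteration")
    return word.rstrip("s")


print(call_with_retries(flaky_lemma, "bats"))  # bat
```

With Pattern installed, the same helper could wrap the real call, for example `call_with_retries(lemma, word)`.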

8. Gensim
Gensim is designed to handle large text collections using data streaming. Its lemmatization facilities are based on the Pattern package we installed above.

The gensim.utils.lemmatize() function can be used to perform lemmatization. It lives in gensim's utils module. It was removed in Gensim 4.0, so the code below needs a Gensim 3.x release.
We can use this Pattern-based lemmatizer to extract UTF-8 encoded tokens in their base form (lemma).
It only considers nouns, verbs, adjectives, and adverbs by default (all other lemmas are discarded).
For example

Word          --->  Lemmatized Word 
are/is/being  --->  be
saw           --->  see

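How to use:
1. Download the Pattern package: in your anaconda prompt or terminal, type:
pip install pattern
2. Download the Gensim package (a release below 4.0, since lemmatize() is not available in later ones): open your anaconda prompt or terminal as administrator and type:
pip install "gensim<4.0"
OR
conda install -c conda-forge "gensim<4.0"

Code: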

Python3

from gensim.utils import lemmatize
 
sentence = "the bats saw the cats with best stripes hanging upside down by their feet"
 
lemmatized_sentence = [word.decode('utf-8').split('.')[0] for word in lemmatize(sentence)]
 
print(lemmatized_sentence)
#> ['bat / NN', 'see / VB', 'cat / NN', 'best / JJ',
#   'stripe / NN', 'hang / VB', 'upside / RB', 'foot / NN']

NOTE: if the above code raises an error saying 'generator raised StopIteration', just run it again. It should work after 3-4 tries.

In the above code, as you may have noticed, the gensim lemmatizer ignores words like 'the', 'with', 'by', as they do not fall into the 4 lemma categories mentioned above (noun/verb/adjective/adverb).
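The filtering described above can be imitated in plain Python. This sketch assumes the lemmatizer returns byte strings of the form b'lemma/TAG'; `split_lemma_tag`, `keep_content_words` and `KEPT_PREFIXES` are hypothetical names, and the code only mirrors the behaviour rather than reproducing gensim's implementation.

```python
# Tag prefixes for the four kept categories: noun, verb, adjective, adverb.
KEPT_PREFIXES = ("NN", "VB", "JJ", "RB")


def split_lemma_tag(item):
    """Turn a byte string such as b'bat/NN' into ('bat', 'NN')."""
    lemma, _, tag = item.decode("utf-8").rpartition("/")
    return lemma, tag


def keep_content_words(tagged):
    """Drop every (lemma, tag) pair outside the four kept categories."""
    return [(w, t) for w, t in tagged if t.startswith(KEPT_PREFIXES)]


raw = [b"bat/NN", b"see/VB", b"cat/NN"]
print([split_lemma_tag(r) for r in raw])
# [('bat', 'NN'), ('see', 'VB'), ('cat', 'NN')]

tagged = [("the", "DT"), ("bat", "NNS"), ("see", "VBD"),
          ("with", "IN"), ("good", "JJS"), ("down", "RB")]
print(keep_content_words(tagged))
# [('bat', 'NNS'), ('see', 'VBD'), ('good', 'JJS'), ('down', 'RB')]
```

Determiners (DT) and prepositions (IN) fall outside the four prefixes, which is why 'the' and 'with' disappear from the result.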

9. Stanford CoreNLP
CoreNLP enables users to derive linguistic annotations for text, including token and sentence boundaries, parts of speech, named entities, numeric and time values, dependency and constituency parses, sentiment, quote attributions, and relations.

CoreNLP is your one-stop shop for natural language processing in Java!
CoreNLP currently supports 6 languages: Arabic, Chinese, English, French, German, and Spanish.

How to use:
1. Get Java 8: download Java 8 (as per your OS) and install it.

2. Get the Stanford CoreNLP package:
    2.1) Download Stanford CoreNLP and unzip it.
    2.2) Open a terminal.

    (a) Go to the directory where you extracted the above file by running
    cd C:\Users\...\stanford-corenlp-4.1.0 in the terminal.

    (b) Then start your Stanford CoreNLP server by executing the following command in the terminal:
    java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -annotators "tokenize, ssplit, pos, lemma, parse, sentiment" -port 9000 -timeout 30000
    (leave your terminal open as long as you use this lemmatizer)

3. Download the stanfordcorenlp Python package: open your anaconda prompt or terminal and type:
pip install stanfordcorenlp

Code:

Python3

from stanfordcorenlp import StanfordCoreNLP
import json
 
# Connect to the CoreNLP server we just started
nlp = StanfordCoreNLP('http://localhost', port = 9000, timeout = 30000)
 
# Define properties needed to get lemma
props = {'annotators': 'pos, lemma', 'pipelineLanguage': 'en', 'outputFormat': 'json'}
 
 
sentence = "the bats saw the cats with best stripes hanging upside down by their feet"
parsed_str = nlp.annotate(sentence, properties = props)
print(parsed_str)
 
#> "sentences": [{"index": 0,
#  "tokens": [
#        {
#          "index": 1,
#          "word": "the",
#          "originalText": "the",
#          "lemma": "the",           <--------------- LEMMA
#          "characterOffsetBegin": 0,
#          "characterOffsetEnd": 3,
#          "pos": "DT",
#          "before": "",
#          "after": " "
#        },
#        {
#          "index": 2,
#          "word": "bats",
#          "originalText": "bats",
#          "lemma": "bat",           <--------------- LEMMA
#          "characterOffsetBegin": 4,
#          "characterOffsetEnd": 8,
#          "pos": "NNS",
#          "before": " ",
#          "after": " "
#        },
#        {
#          "index": 3,
#          "word": "saw",
#          "originalText": "saw",
#          "lemma": "see",           <--------------- LEMMA
#          "characterOffsetBegin": 9,
#          "characterOffsetEnd": 12,
#          "pos": "VBD",
#          "before": " ",
#          "after": " "
#        },
#        {
#          "index": 4,
#          "word": "the",
#          "originalText": "the",
#          "lemma": "the",          <--------------- LEMMA
#          "characterOffsetBegin": 13,
#          "characterOffsetEnd": 16,
#          "pos": "DT",
#          "before": " ",
#          "after": " "
#        },
#        {
#          "index": 5,
#          "word": "cats",
#          "originalText": "cats",
#          "lemma": "cat",          <--------------- LEMMA
#          "characterOffsetBegin": 17,
#          "characterOffsetEnd": 21,
#          "pos": "NNS",
#          "before": " ",
#          "after": " "
#        },
#        {
#          "index": 6,
#          "word": "with",
#          "originalText": "with",
#          "lemma": "with",          <--------------- LEMMA
#          "characterOffsetBegin": 22,
#          "characterOffsetEnd": 26,
#          "pos": "IN",
#          "before": " ",
#          "after": " "
#        },
#        {
#          "index": 7,
#          "word": "best",
#          "originalText": "best",
#          "lemma": "best",          <--------------- LEMMA
#          "characterOffsetBegin": 27,
#          "characterOffsetEnd": 31,
#          "pos": "JJS",
#          "before": " ",
#          "after": " "
#        },
#        {
#          "index": 8,
#          "word": "stripes",
#          "originalText": "stripes",
#          "lemma": "stripe",          <--------------- LEMMA
#          "characterOffsetBegin": 32,
#          "characterOffsetEnd": 39,
#          "pos": "NNS",
#          "before": " ",
#          "after": " "
#        },
#        {
#          "index": 9,
#          "word": "hanging",
#          "originalText": "hanging",
#          "lemma": "hang",          <--------------- LEMMA
#          "characterOffsetBegin": 40,
#          "characterOffsetEnd": 47,
#          "pos": "VBG",
#          "before": " ",
#          "after": " "
#        },
#        {
#          "index": 10,
#          "word": "upside",
#          "originalText": "upside",
#          "lemma": "upside",          <--------------- LEMMA
#          "characterOffsetBegin": 48,
#          "characterOffsetEnd": 54,
#          "pos": "RB",
#          "before": " ",
#          "after": " "
#        },
#        {
#          "index": 11,
#          "word": "down",
#          "originalText": "down",
#          "lemma": "down",          <--------------- LEMMA
#          "characterOffsetBegin": 55,
#          "characterOffsetEnd": 59,
#          "pos": "RB",
#          "before": " ",
#          "after": " "
#        },
#        {
#          "index": 12,
#          "word": "by",
#          "originalText": "by",
#          "lemma": "by",          <--------------- LEMMA
#          "characterOffsetBegin": 60,
#          "characterOffsetEnd": 62,
#          "pos": "IN",
#          "before": " ",
#          "after": " "
#        },
#        {
#          "index": 13,
#          "word": "their",
#          "originalText": "their",
#          "lemma": "they",          <--------------- LEMMA
#          "characterOffsetBegin": 63,
#          "characterOffsetEnd": 68,
#          "pos": "PRP$",
#          "before": " ",
#          "after": " "
#        },
#        {
#          "index": 14,
#          "word": "feet",
#          "originalText": "feet",
#          "lemma": "foot",          <--------------- LEMMA
#          "characterOffsetBegin": 69,
#          "characterOffsetEnd": 73,
#          "pos": "NNS",
#          "before": " ",
#          "after": ""
#        }
#      ]
#    }
#  ]

Code:

Python3

# To get the lemmatized sentence as output
 
# ** RUN THE ABOVE SCRIPT FIRST **
 
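# nlp.annotate() returns a JSON string, so parse it into a dict first
parsed_dict = json.loads(parsed_str)

lemma_list = []
for item in parsed_dict['sentences'][0]['tokens']: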
    for key, value in item.items():
        if key == 'lemma':
            lemma_list.append(value)
         
print(lemma_list)
#> ['the', 'bat', 'see', 'the', 'cat', 'with', 'best', 'stripe', 'hang', 'upside', 'down', 'by', 'they', 'foot']
 
lemmatized_sentence = " ".join(lemma_list)
print(lemmatized_sentence)
#> the bat see the cat with best stripe hang upside down by they foot
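To try the extraction step without a running server, the sketch below works on a shortened, hand-written response in the same JSON shape as the output shown earlier. `sample_response` and `extract_lemmas` are hypothetical names, and the sample keeps only a few of the fields.

```python
import json

# Shortened, hand-written sample in the shape of the server's JSON output.
sample_response = """
{"sentences": [{"index": 0, "tokens": [
  {"index": 1, "word": "the", "lemma": "the", "pos": "DT"},
  {"index": 2, "word": "bats", "lemma": "bat", "pos": "NNS"},
  {"index": 3, "word": "saw", "lemma": "see", "pos": "VBD"}
]}]}
"""


def extract_lemmas(response_text):
    """Return the lemma of every token in every sentence, in order."""
    data = json.loads(response_text)
    return [token["lemma"]
            for sentence in data["sentences"]
            for token in sentence["tokens"]]


print(" ".join(extract_lemmas(sample_response)))
# the bat see
```

Unlike the loop above, which reads only the first sentence, the comprehension walks every sentence in the response, so it also covers multi-sentence input.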

Conclusion:
These are the various lemmatization approaches that you can refer to while working on an NLP project. The choice of lemmatization approach depends solely on the project requirements. Each approach has its own set of pros and cons. Lemmatization is essential for critical projects where sentence structure matters, such as language applications.

Machine-translated post

Article written by prakharr0y and translated by Barcelona Geeks. The original can be accessed here. Licence: CC BY-SA
