Extract the HTML code of a given tag and its parent using BeautifulSoup

In this article, we discuss how to extract the HTML code of a given tag and of its parent using BeautifulSoup.

Required modules

First, we need to install all of these modules on our computer.

BeautifulSoup: our main module. It parses HTML and XML documents and provides methods for searching and navigating the parsed tree.

pip install bs4

lxml: a helper library that BeautifulSoup can use as its parser when processing web pages in Python.

pip install lxml

requests: makes sending HTTP requests straightforward. We use it to download the page.

pip install requests

Scraping a sample website

We import the BeautifulSoup and requests modules. We declare a header and add a user agent to it. This makes it less likely that the target website treats our program's traffic as spam and ends up blocking it.

Python3

# importing the modules
from bs4 import BeautifulSoup
import requests

# URL to be scraped
URL = "https://en.wikipedia.org/wiki/Machine_learning"

# header with a user agent (example value, adjust as needed)
HEADERS = {"User-Agent": "Mozilla/5.0"}

# getting the contents of the website and parsing them
webpage = requests.get(URL, headers=HEADERS)
soup = BeautifulSoup(webpage.content, "lxml")
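The download step can fail (network errors, a 4xx/5xx response), so it may help to wrap it in a small function. The sketch below is only one way to do it: the name fetch_soup, the User-Agent string, and the 10-second timeout are illustrative assumptions, and it uses Python's built-in "html.parser" so that it runs even where lxml is not installed.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical User-Agent string; replace it with one that identifies your script
HEADERS = {"User-Agent": "example-scraper/0.1"}


def fetch_soup(url, timeout=10):
    """Download url and return the parsed page, raising on HTTP errors."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    # "html.parser" ships with Python; pass "lxml" instead if it is installed
    return BeautifulSoup(response.content, "html.parser")


if __name__ == "__main__":
    try:
        soup = fetch_soup("https://en.wikipedia.org/wiki/Machine_learning")
        print(soup.title)
    except requests.RequestException as error:
        # covers connection errors, timeouts, and HTTP error statuses
        print(f"Request failed: {error}")
```

Checking the status before parsing avoids silently scraping an error page instead of the article.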

Now, to target the element you want information about, right-click it and choose Inspect. Then, in the inspector window, look for an HTML attribute that is unique to that element. Most of the time this is the element's id.

Here, to extract the HTML of the page title, we can simply use the title's id.

Python3

# getting the h1 with id as firstHeading and printing it
title = soup.find("h1", attrs={"id": 'firstHeading'})
print(title)
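To try the lookup without touching the network, the same call can be run on a small inline document. The markup below is a made-up stand-in for the real page, and the extra lookups (the id keyword argument and the CSS selector) are alternatives offered for comparison, not part of the original program.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the heading of the article page
html = """
<html><body>
  <h1 id="firstHeading">Machine learning</h1>
  <p id="intro">Some introductory text.</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Three equivalent lookups for the element whose id is "firstHeading"
by_attrs = soup.find("h1", attrs={"id": "firstHeading"})
by_keyword = soup.find(id="firstHeading")
by_selector = soup.select_one("h1#firstHeading")

print(by_attrs)  # <h1 id="firstHeading">Machine learning</h1>

# find() returns None when nothing matches, so check before using the result
missing = soup.find("h1", attrs={"id": "doesNotExist"})
print(missing is None)  # True
```

Because find() returns None on a miss, a page whose markup has changed shows up as an AttributeError on the next line unless the result is checked first.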

To extract the content of that tag, we can use the .get_text() method. The implementation is as follows:

Python3

# getting the text/content inside the h1 tag we
# parsed on the previous line
cont = title.get_text()
print(cont)
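When a tag contains nested tags or line breaks, .get_text() returns the surrounding whitespace as well. A minimal sketch on made-up markup, showing the separator and strip arguments that .get_text() accepts:

```python
from bs4 import BeautifulSoup

# Hypothetical heading with nested tags and extra whitespace
html = '<h1 id="firstHeading">\n  <span>Machine</span> <span>learning</span>\n</h1>'
soup = BeautifulSoup(html, "html.parser")
title = soup.find("h1", attrs={"id": "firstHeading"})

# Default call: all text nodes concatenated, whitespace included
raw = title.get_text()
print(repr(raw))

# strip=True trims each text piece and drops empty ones;
# the first argument is the separator placed between the pieces
clean = title.get_text(" ", strip=True)
print(clean)  # Machine learning
```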

Now, to extract the HTML of the parent of a given element, let us take as an example a span that has the ID "Machine_learning_approaches".

We look up the span and then move up to its parent, whose HTML we print.

Python3

# getting the HTML of the parent of the span tag;
# .parent is an attribute, while .parent() would call find_all() on the parent
parent = soup.find("span",
                   attrs={"id": 'Machine_learning_approaches'}).parent
print(parent)
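The difference between reading the parent attribute and calling it is easy to miss, so here is a small offline sketch. The markup and the id "Approaches" are invented for illustration; the behaviour shown (calling a tag is shorthand for find_all(), and find_parent() searches upward) is standard BeautifulSoup behaviour.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates a section heading
html = """
<div id="content">
  <h2><span id="Approaches">Approaches</span></h2>
  <p>Text after the heading.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
span = soup.find("span", attrs={"id": "Approaches"})

# .parent is an attribute: it returns the single enclosing Tag
parent = span.parent
print(parent.name)  # h2
print(parent)       # <h2><span id="Approaches">Approaches</span></h2>

# Calling a tag is shorthand for find_all(), so parent() returns a list
# of the tags found inside the parent, not the parent itself
descendants = span.parent()
print(len(descendants))  # 1

# find_parent() walks upward until it finds a matching ancestor
container = span.find_parent("div")
print(container["id"])  # content
```

If the goal is a specific ancestor rather than the immediate parent, find_parent() is usually more robust against small changes in the nesting.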

The complete program follows:

Python3

# importing the modules
from bs4 import BeautifulSoup
import requests

# URL to be scraped
URL = "https://en.wikipedia.org/wiki/Machine_learning"

# header with a user agent (example value, adjust as needed)
HEADERS = {"User-Agent": "Mozilla/5.0"}

# getting the contents of the website and parsing them
webpage = requests.get(URL, headers=HEADERS)
soup = BeautifulSoup(webpage.content, "lxml")

# getting the h1 with id as firstHeading and printing it
title = soup.find("h1", attrs={"id": 'firstHeading'})
print(title)

# getting the text/content inside the h1 tag we
# parsed on the previous line
cont = title.get_text()
print(cont)

# getting the HTML of the parent of the span tag;
# .parent is an attribute, while .parent() would call find_all() on the parent
parent = soup.find("span",
                   attrs={"id": 'Machine_learning_approaches'}).parent
print(parent)

Machine-translated publication

Article written by saikatsahana91 and translated by Barcelona Geeks. The original can be accessed here. Licence: CC BY-SA
