How to parse a local HTML file in Python?

Prerequisites: BeautifulSoup

Parsing means splitting a file or input into pieces of information/data that can be stored for later use. Sometimes we need data from an existing file stored on our computer; in such cases, parsing can be used. Parsing covers several techniques for extracting data from a file. The sections below cover modifying the file, removing something from the file, printing data, traversing the file's data with the recursiveChildGenerator method, finding the children of tags, scraping a web link to extract useful information, and so on.
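
As a minimal sketch of the basic workflow described above (read a file from disk, hand its text to BeautifulSoup, then query the result), the following example writes a small page to a temporary folder and parses it. The sample markup, the helper name parse_local_html and the choice of the built-in html.parser are assumptions made for illustration; they are not taken from the original examples, which use lxml and an index.html file.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Hypothetical sample page, used only for this sketch
SAMPLE_HTML = "<html><head><title>Demo</title></head><body><h1>Hello</h1></body></html>"


def parse_local_html(path):
    """Read an HTML file from disk and return a BeautifulSoup object."""
    with open(path, "r", encoding="utf-8") as html_file:
        return BeautifulSoup(html_file.read(), "html.parser")


# Create a throwaway index.html so the example is self-contained
with tempfile.TemporaryDirectory() as folder:
    sample_path = Path(folder) / "index.html"
    sample_path.write_text(SAMPLE_HTML, encoding="utf-8")
    soup = parse_local_html(sample_path)

page_title = soup.title.string
first_heading = soup.h1.get_text()
print(page_title, first_heading)
```

Opening the file in a with block closes the file handle as soon as its contents have been read.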

Modifying the file

The prettify method reformats the HTML code fetched from https://festive-knuth-1279a2.netlify.app/ so that it looks better. Prettify renders the code in a standard indented form, like the formatting used in VS Code.

Example:

Python3

# Importing BeautifulSoup class from the bs4 module
from bs4 import BeautifulSoup
  
# Importing the HTTP library
import requests as req
  
# Requesting for the website
Web = req.get('https://festive-knuth-1279a2.netlify.app/')
  
# Creating a BeautifulSoup object and specifying the parser
S = BeautifulSoup(Web.text, 'lxml')
  
# Using the prettify method
print(S.prettify())
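
The same idea can be tried without a network connection. The sketch below runs prettify on a short inline string; the markup and the use of html.parser are illustrative assumptions rather than part of the original example.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line markup, used only for this sketch
html = "<ul><li>One</li><li>Two</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# prettify returns a string with each tag and each piece of text
# on its own line, indented according to nesting depth
pretty = soup.prettify()
print(pretty)
```

Note that prettify returns a new string and leaves the parsed tree itself unchanged.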

Removing a tag

A tag can be removed with the decompose method. In this example, the select_one method uses a CSS selector to pick the second li element, decompose removes it, and prettify then prints the resulting HTML code of the index.html file.

Example:

File used: index.html

Python3

# Importing BeautifulSoup class from the bs4 module
from bs4 import BeautifulSoup
  
# Opening the html file and reading its contents
with open("index.html", "r") as HTMLFile:
    index = HTMLFile.read()
  
# Creating a BeautifulSoup object and specifying the parser
S = BeautifulSoup(index, 'lxml')
  
# Using the select-one method to find the second element from the li tag
Tag = S.select_one('li:nth-of-type(2)')
  
# Using the decompose method
Tag.decompose()
  
# Using the prettify method to modify the code
print(S.body.prettify())
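
A self-contained sketch of the same removal on an inline list follows. The markup is hypothetical and html.parser is assumed as the parser. Because select_one returns None when nothing matches, the sketch checks the result before calling decompose.

```python
from bs4 import BeautifulSoup

# Hypothetical list, used only for this sketch
html = "<ul><li>First</li><li>Second</li><li>Third</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# select_one returns the first match, or None if there is no match
second_item = soup.select_one("li:nth-of-type(2)")
if second_item is not None:
    # decompose removes the tag and its contents from the tree
    second_item.decompose()

remaining = [li.get_text() for li in soup.find_all("li")]
print(remaining)  # ['First', 'Third']
```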

Finding tags

Tags can be accessed directly as attributes of the parsed object and printed with print(). Each attribute returns the first tag with that name.

Example:

Python3

# Importing BeautifulSoup class from the bs4 module
from bs4 import BeautifulSoup
  
# Opening the html file and reading its contents
with open("index.html", "r") as HTMLFile:
    index = HTMLFile.read()
  
# Creating a BeautifulSoup object and specifying the parser
Parse = BeautifulSoup(index, 'lxml')
  
# Printing html code of some tags
print(Parse.head)
print(Parse.h1)
print(Parse.h2)
print(Parse.h3)
print(Parse.li)
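
The sketch below illustrates two details of attribute-style access on a hypothetical inline snippet (html.parser is assumed): only the first matching tag is returned, and a tag that does not exist yields None rather than an error.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two li tags and no h3 tag
html = "<h1>Title</h1><ul><li>One</li><li>Two</li></ul>"
soup = BeautifulSoup(html, "html.parser")

first_li = soup.li   # first li tag only
missing = soup.h3    # no h3 in the markup

print(first_li)  # <li>One</li>
print(missing)   # None
```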

Traversing tags

The recursiveChildGenerator method is used to traverse tags: it recursively visits every node nested inside the tags of the file. In current Beautiful Soup versions it is kept as a legacy alias of the descendants attribute.

Example:

Python3

# Importing BeautifulSoup class from the bs4 module
from bs4 import BeautifulSoup
  
# Opening the html file and reading its contents
with open("index.html", "r") as HTMLFile:
    index = HTMLFile.read()
  
# Creating a BeautifulSoup object and specifying the parser
S = BeautifulSoup(index, 'lxml')
  
# Using the recursiveChildGenerator method to traverse the html file
for TraverseTags in S.recursiveChildGenerator():
    # Skipping text nodes, which have no tag name
    if TraverseTags.name:
        # Printing the names of the tags
        print(TraverseTags.name)
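
To make the recursive nature of the traversal visible, here is a hand-written sketch that walks the tree itself and records each tag name with its nesting depth. The outline function and the markup are hypothetical, and html.parser is assumed; this is an illustration of the idea, not how the library implements it.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def outline(tag, depth=0):
    """Return the tag names under `tag`, indented by nesting depth."""
    lines = ["  " * depth + tag.name]
    for child in tag.children:
        # Text nodes are not Tag instances, so they are skipped
        if isinstance(child, Tag):
            lines.extend(outline(child, depth + 1))
    return lines


# Hypothetical markup, used only for this sketch
html = "<div><p>Hi <b>there</b></p><span>x</span></div>"
soup = BeautifulSoup(html, "html.parser")

tree = outline(soup.div)
print("\n".join(tree))
```

The tags come out in document order: div, then p, then b, then span.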

Parsing the name and text attributes of tags

The name attribute of a tag gives its name, and the text attribute gives its text. The example prints both, together with the HTML code of the ul tag from the file.

Example:

Python3

# Importing BeautifulSoup class from the bs4 module
from bs4 import BeautifulSoup
  
# Opening the html file and reading its contents
with open("index.html", "r") as HTMLFile:
    index = HTMLFile.read()
  
# Creating a BeautifulSoup object and specifying the parser
S = BeautifulSoup(index, 'lxml')
  
# Printing the Code, name, and text of a tag
print(f'HTML: {S.ul}, name: {S.ul.name}, text: {S.ul.text}')
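
One thing worth seeing on a small input is that the text attribute concatenates the text of all nested tags with no separator. The sketch below uses hypothetical markup and html.parser, and also shows get_text with a separator as one possible way to keep the items apart.

```python
from bs4 import BeautifulSoup

# Hypothetical list, used only for this sketch
html = "<ul><li>One</li><li>Two</li></ul>"
soup = BeautifulSoup(html, "html.parser")
ul = soup.ul

tag_name = ul.name                          # 'ul'
joined_text = ul.text                       # 'OneTwo'
spaced_text = ul.get_text(" ", strip=True)  # 'One Two'

print(tag_name, joined_text, spaced_text)
```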

Finding children of a tag

The children attribute is used to get the direct children of a tag. Besides tags, it also yields the whitespace text between them, so the example adds the condition e.name is not None to print only the tag names from the file.

Example:

Python3

# Importing BeautifulSoup class from the bs4 module
from bs4 import BeautifulSoup
  
# Opening the html file and reading its contents
with open("index.html", "r") as HTMLFile:
    index = HTMLFile.read()
  
# Creating a BeautifulSoup object and specifying the parser
S = BeautifulSoup(index, 'lxml')
  
# Providing the source
Attr = S.html
  
# Using the Children attribute to get the children of a tag
# Only contain tag names and not the spaces
Attr_Tag = [e.name for e in Attr.children if e.name is not None]
  
# Printing the children
print(Attr_Tag)
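
The effect of the filter can be seen by comparing the raw and filtered lists. This sketch uses a hypothetical four-line document and assumes html.parser, which does not add any tags that are missing from the input.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with line breaks between the tags
html = "<html>\n<head></head>\n<body></body>\n</html>"
soup = BeautifulSoup(html, "html.parser")

# Text nodes (here, the line breaks) report None as their name
raw_names = [e.name for e in soup.html.children]

# Keeping only real tags
tag_names = [e.name for e in soup.html.children if e.name is not None]

print(raw_names)
print(tag_names)  # ['head', 'body']
```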

Finding descendants at all levels of a tag:

The descendants attribute is used to get all descendants (children at every level) of a tag from the file.

Example:

Python3

# Importing BeautifulSoup class from the bs4 module
from bs4 import BeautifulSoup
  
# Opening the html file and reading its contents
with open("index.html", "r") as HTMLFile:
    index = HTMLFile.read()
  
# Creating a BeautifulSoup object and specifying the parser
S = BeautifulSoup(index, 'lxml')
  
# Providing the source
Des = S.body
  
# Using the descendants attribute
Attr_Tag = [e.name for e in Des.descendants if e.name is not None]
  
# Printing the descendants
print(Attr_Tag)
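
The difference between direct children and descendants is easiest to see side by side. The following sketch runs both on the same hypothetical markup, again assuming html.parser.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two levels of nesting under div
html = "<body><div><p>Text <b>bold</b></p></div><span>End</span></body>"
soup = BeautifulSoup(html, "html.parser")
body = soup.body

# children: only the tags directly inside body
direct = [e.name for e in body.children if e.name is not None]

# descendants: every tag inside body, at any depth, in document order
nested = [e.name for e in body.descendants if e.name is not None]

print(direct)  # ['div', 'span']
print(nested)  # ['div', 'p', 'b', 'span']
```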

Finding all elements of tags

Using find_all():

The find_all method is used to find every p tag in the file; the example prints the name and text of each one.

Example:

Python3

# Importing BeautifulSoup class from the bs4 module
from bs4 import BeautifulSoup
  
# Opening the html file and reading its contents
with open("index.html", "r") as HTMLFile:
    index = HTMLFile.read()
  
# Creating a BeautifulSoup object and specifying the parser
S = BeautifulSoup(index, 'lxml')
  
# Using the find_all method to find all elements of a tag
for tag in S.find_all('p'):
    # Printing the name and text of each p tag
    print(f'{tag.name}: {tag.text}')
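
find_all also accepts filters beyond a single tag name. The sketch below shows three variants on hypothetical markup (html.parser assumed): a plain tag name, a class filter through the class_ keyword, and a list of names combined with limit.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this sketch
html = '<p class="intro">A</p><p>B</p><h2>C</h2><p class="intro">D</p>'
soup = BeautifulSoup(html, "html.parser")

# Every p tag
all_p = [t.text for t in soup.find_all("p")]

# Only p tags whose class attribute contains "intro"
intro = [t.text for t in soup.find_all("p", class_="intro")]

# h2 or p tags in document order, stopping after three matches
mixed = [t.name for t in soup.find_all(["h2", "p"], limit=3)]

print(all_p)  # ['A', 'B', 'D']
print(intro)  # ['A', 'D']
print(mixed)  # ['p', 'p', 'h2']
```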

CSS selectors to find elements:

The select method takes a CSS selector and returns a list of the matching tags; here it finds the second li element in the file.

Example:

Python3

# Importing BeautifulSoup class from the bs4 module
from bs4 import BeautifulSoup
  
# Opening the html file and reading its contents
with open("index.html", "r") as HTMLFile:
    index = HTMLFile.read()
  
# Creating a BeautifulSoup object and specifying the parser
S = BeautifulSoup(index, 'lxml')
  
# Using the select method
# Prints the second element from the li tag
print(S.select('li:nth-of-type(2)'))
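
Because nth-of-type counts position within each parent, the selector can match more than one tag when the document holds several lists. The sketch below uses hypothetical markup with two lists (html.parser assumed) and contrasts select, which returns every match, with select_one, which returns only the first match or None.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two separate lists
html = "<ul><li>A1</li><li>A2</li></ul><ol><li>B1</li><li>B2</li></ol>"
soup = BeautifulSoup(html, "html.parser")

# select: the second li of every parent
all_matches = [t.text for t in soup.select("li:nth-of-type(2)")]

# select_one: just the first of those matches
first_match = soup.select_one("li:nth-of-type(2)")

# select_one with a selector that matches nothing
no_match = soup.select_one("li:nth-of-type(9)")

print(all_matches)       # ['A2', 'B2']
print(first_match.text)  # A2
print(no_match)          # None
```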

Automatically translated post

Article written by ayushraghuwanshi80 and translated by Barcelona Geeks. The original can be accessed here. Licence: CC BY-SA
