Python 3 BeautifulSoup4 Library Script to Strip or Remove HTML Tags From HTML File or Raw HTML Using lxml Library Full Project For Beginners

Welcome folks! In this post we will remove the HTML tags from a raw HTML string and from an HTML file using the BeautifulSoup4 library with the lxml parser in Python 3. The full source code of the application is shown below.

 

 

Get Started

 

 

To get started, install the following libraries using the pip command, as shown below.

 

pip install beautifulsoup4

 

pip install lxml
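Before moving on, it can help to confirm that both libraries are importable from the Python interpreter you plan to use. The snippet below is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of either library.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found by the import system."""
    return importlib.util.find_spec(module_name) is not None


# bs4 is the import name of BeautifulSoup4; lxml is the parser used in this post.
for name in ("bs4", "lxml"):
    status = "installed" if is_installed(name) else "missing"
    print(name, status)
```

If either line reports "missing", run the pip commands above again with the same interpreter you use to run app.py.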

 

After installing these libraries, create an app.py file and copy and paste the following code.

 

app.py

 

First, we will remove the HTML tags from a raw HTML string. The full source code of the example is given below.

 

from bs4 import BeautifulSoup

raw_html = """

<!DOCTYPE html>
<html>
  <head>
    <title>Currency Converter in Javascript</title>
    <link
      rel="stylesheet"
      href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css"
    />
  </head>
  <body>
  <p>hello this is some html</p>
  <h1>My name is Gautam</h1>
</body>
  <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.4.1/jquery.min.js"></script>
</html>

"""
# Parse the markup with the lxml parser and keep only the text, without the tags
cleantext = BeautifulSoup(raw_html, "lxml").text

print(cleantext)

 

 

Now run the Python script by typing the command below.

 

python app.py

 

 

As you can see, all the HTML tags are removed and only the plain text (including the text of the title tag) is printed in the command line.
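The printed text usually contains blank lines left over from the original markup, and depending on the BeautifulSoup version, the contents of inline script or style tags may also show up. The following is a minimal sketch of one way to get cleaner output, not the only approach. The function name `strip_tags` is hypothetical, and the default parser here is Python's built-in "html.parser" so the sketch runs even without lxml; pass "lxml" to match the rest of this post.

```python
from bs4 import BeautifulSoup


def strip_tags(html, parser="html.parser"):
    """Return the visible text of an HTML string, one text fragment per line."""
    soup = BeautifulSoup(html, parser)
    # Drop script and style elements so their contents never reach the output.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # strip=True trims each fragment and skips fragments that are only whitespace.
    return soup.get_text(separator="\n", strip=True)


sample = "<p>hello</p><script>alert(1)</script><h1>Hi</h1>"
print(strip_tags(sample))  # prints "hello" and "Hi" on separate lines
```

Calling `strip_tags(raw_html, "lxml")` on the string from app.py should work the same way, with each piece of text on its own line.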

 

Now, in example 2, we will strip the HTML tags from an HTML file and save the result to a new output text file.

 

app.py

 

from bs4 import BeautifulSoup

# Read the HTML from test.html, which must exist in the same folder as app.py.
# The raw_html string from example 1 is not needed here.
# utf-8 is assumed; change the encoding if your file uses a different one.
with open("test.html", "r", encoding="utf-8") as f:
    cleantext = BeautifulSoup(f.read(), "lxml").text

print(cleantext)

# Save the extracted text; the with block closes the file so the data is written to disk.
with open("output.txt", "w", encoding="utf-8") as output:
    output.write(cleantext)

 

 

After running this script, an output.txt file is created in the same folder, and it contains only the text extracted from the HTML file.
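If you need to convert more than one file, the same steps can be wrapped in a small reusable function. This is a sketch under a few assumptions: the function name `html_file_to_text` is hypothetical, the files are assumed to be UTF-8 encoded, and the default parser is the built-in "html.parser" so that lxml is optional.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def html_file_to_text(src, dst, parser="html.parser", encoding="utf-8"):
    """Read HTML from src, strip the tags, write the text to dst, and return it."""
    html = Path(src).read_text(encoding=encoding)
    text = BeautifulSoup(html, parser).get_text()
    # write_text opens, writes, and closes the file in one call.
    Path(dst).write_text(text, encoding=encoding)
    return text
```

For the files used in this post, the call would be html_file_to_text("test.html", "output.txt", parser="lxml").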

 

 
