Welcome folks! In this post we will remove the HTML tags from an HTML file or from a raw HTML string using the beautifulsoup4 library in Python. The full source code of the application is shown below.
Get Started
To get started, install the following libraries using pip (the bs4 package on PyPI is a thin wrapper that installs beautifulsoup4):
pip install bs4
pip install lxml
After installing these libraries, create an app.py file and paste in the following code.
app.py
First, we will remove the HTML tags from a raw HTML string. The source code of the example is given below.
from bs4 import BeautifulSoup

# The raw HTML string to clean
raw_html = """
<!DOCTYPE html>
<html>
<head>
    <title>Currency Converter in Javascript</title>
    <link
      rel="stylesheet"
      href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css"
    />
</head>
<body>
    <p>hello this is some html</p>
    <h1>My name is Gautam</h1>
</body>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.4.1/jquery.min.js"></script>
</html>
"""

# Parse the HTML with the lxml parser and keep only the text nodes
cleantext = BeautifulSoup(raw_html, "lxml").text
print(cleantext)
Now run the Python script with the following command:
python app.py
As you can see, all the HTML tags were removed and only the plain text is printed to the command line.
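The .text attribute joins the text nodes exactly as they appear, so the output keeps the blank lines and indentation of the original markup. Depending on the Beautiful Soup version, the contents of inline script or style elements may also end up in the result. Here is a minimal sketch of one way to tidy this up. The sample HTML and the function name html_to_text are made up for illustration, and the built-in html.parser is used only so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample input, not taken from the examples in this post.
SAMPLE_HTML = """
<html>
  <head>
    <title>Demo page</title>
    <style>p { color: red; }</style>
  </head>
  <body>
    <p>hello this is some html</p>
    <script>console.log("not visible text");</script>
    <h1>My name is Gautam</h1>
  </body>
</html>
"""


def html_to_text(html, parser="html.parser"):
    """Return the readable text of an HTML string, one fragment per line."""
    soup = BeautifulSoup(html, parser)
    # Remove elements whose contents are code rather than readable text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # separator puts each text fragment on its own line;
    # strip=True trims whitespace and skips whitespace-only fragments.
    return soup.get_text(separator="\n", strip=True)


if __name__ == "__main__":
    print(html_to_text(SAMPLE_HTML))
```

Passing "lxml" as the second argument to html_to_text switches back to the parser used in the rest of this post.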
In the second example, we will strip the HTML tags from an HTML file and save the result to a new output text file.
app.py
from bs4 import BeautifulSoup

# Read the HTML file; test.html is expected to sit next to app.py
# (encoding="utf-8" is an assumption, adjust it to match your file)
with open("test.html", "r", encoding="utf-8") as f:
    cleantext = BeautifulSoup(f.read(), "lxml").text

print(cleantext)

# Save the extracted text; the with block closes the file for us
with open("output.txt", "w", encoding="utf-8") as output:
    output.write(cleantext)
After running this script, an output.txt file is created that contains only the plain text.
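If you need to convert more than one file, the file-based example can be wrapped in a small reusable function. The sketch below is one possible shape, not the only one: the name strip_tags_from_file, the UTF-8 default, and the use of html.parser are assumptions made for illustration. The demo at the bottom writes its own sample file into a temporary folder so it can run on its own.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup


def strip_tags_from_file(src, dst, encoding="utf-8", parser="html.parser"):
    """Read HTML from src, write its plain text to dst, and return the text."""
    html = Path(src).read_text(encoding=encoding)
    text = BeautifulSoup(html, parser).get_text()
    Path(dst).write_text(text, encoding=encoding)
    return text


if __name__ == "__main__":
    # Self-contained demo: create a sample HTML file, convert it, show the result.
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "test.html"
        dst = Path(tmp) / "output.txt"
        src.write_text(
            "<p>hello this is some html</p>\n<h1>My name is Gautam</h1>\n",
            encoding="utf-8",
        )
        strip_tags_from_file(src, dst)
        print(dst.read_text(encoding="utf-8"))
```

With real files, a call such as strip_tags_from_file("test.html", "output.txt", parser="lxml") mirrors the second example.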