HumanClassification (Part 3) – Parsing the page
In this phase, we planned to parse the given page to obtain its essential data. For this purpose, we first categorized all the pages we were likely to deal with. They fall into 6 categories, each presented below together with the essential data to extract (see the sketch after the list):
1. E-commerce Website
* Title/H1
* Short description
* Brand
* Price
* Category
2. Corporate Website
* HomePageTitle/H1
* Short description
3. Blog
* Title/H1
* Short description
* Category
4. News Website
* Title/H1
* Short description
* Category
5. Social Networking Website
* Title/H1
* Short description
* Channels the user reads or follows
6. Educational Website
* Title/H1
* Short description
* Category
* Price
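As a quick reference, this mapping between categories and their essential fields can be kept in a simple lookup structure. The sketch below is our own illustration (the dictionary name and field keys are assumptions, not part of the project's code):

```python
# Hypothetical lookup table: the essential fields to extract per page category
ESSENTIAL_FIELDS = {
    'e-commerce': ['title/h1', 'short_description', 'brand', 'price', 'category'],
    'corporate': ['homepage_title/h1', 'short_description'],
    'blog': ['title/h1', 'short_description', 'category'],
    'news': ['title/h1', 'short_description', 'category'],
    'social_networking': ['title/h1', 'short_description', 'followed_channels'],
    'educational': ['title/h1', 'short_description', 'category', 'price'],
}
```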
Initial Step
Fetching the whole targeted page and parsing it with [`BeautifulSoup`](https://pypi.org/project/beautifulsoup4/).
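A minimal sketch of this step, assuming the page is fetched with [`requests`](https://pypi.org/project/requests/) (the URL is a placeholder):

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL -- replace with the page to classify
url = 'https://example.com'
response = requests.get(url, timeout=10)
response.raise_for_status()

# Parse the full HTML document; `soup` is used in all the snippets below
soup = BeautifulSoup(response.text, 'html.parser')
```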
For extracting the data, we implemented two approaches. The first one was inaccurate, while the other worked much better!
First approach:
First of all, we turned to the [`nltk`](https://www.nltk.org/data.html) library. This tool gives us the words on the page along with their frequencies. So we designed a function that collects the words under a given `HTML` tag and weights their counts with a per-tag coefficient, so that the most important tags yield the most accurate data.
```python
from collections import Counter

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Requires nltk.download('punkt') and nltk.download('stopwords')
STOP_WORDS = set(stopwords.words('english'))

def pars_based_on_parameter(soup, parameter: str, coefficient: int):
    # Extract text from every tag matching the given name (e.g. 'h1', 'p')
    text = ' '.join([element.get_text() for element in soup.find_all(parameter)])
    # Tokenize the text
    words = word_tokenize(text)
    # Convert to lower case
    words = [word.lower() for word in words]
    # Remove stop words and non-alphabetic tokens
    keywords = [word for word in words if word.isalpha() and word not in STOP_WORDS]
    # Count the frequency of each keyword
    keyword_freq = Counter(keywords)
    # Weight every count by the tag's coefficient
    for key in keyword_freq.keys():
        keyword_freq[key] *= coefficient
    return keyword_freq
```
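To combine several tags, the function can be called once per tag and the weighted counters merged. The tag list and coefficient values below are illustrative assumptions, not the project's actual tuning:

```python
# Illustrative weights: more prominent tags get higher coefficients
TAG_COEFFICIENTS = {'title': 5, 'h1': 4, 'h2': 3, 'p': 1}

total_freq = Counter()
for tag, coefficient in TAG_COEFFICIENTS.items():
    total_freq += pars_based_on_parameter(soup, tag, coefficient)

# The highest-weighted words are the page's candidate keywords
print(total_freq.most_common(10))
```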
As I mentioned, this approach was inaccurate, and that is what brought us face to face with `ld+json`!
What’s `ld+json`?
JSON-LD (JavaScript Object Notation for Linked Data) offers a simpler means to create machine-readable data from websites to promote search results.
In simpler terms, it delivers more easily indexable content to search crawlers like Googlebot.
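For example, a product page might embed a block like the one below inside a `<script type="application/ld+json">` tag. This snippet is a minimal illustration (not taken from a real site), parsed here with the standard `json` module:

```python
import json

# A typical ld+json payload as found on a product page (illustrative)
sample_ld_json = '''
{
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Example Widget",
    "description": "A short description of the product.",
    "brand": {"@type": "Brand", "name": "ExampleBrand"},
    "offers": {"@type": "Offer", "price": "19.99", "priceCurrency": "USD"}
}
'''

data = json.loads(sample_ld_json)
print(data['name'], data['offers']['price'])  # Example Widget 19.99
```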
**Now, let's get to it …**
Second (Final) Approach:
As mentioned above, most of the vital data is presented in `ld+json`, so we fetch and purify it. However, it's worth mentioning that the data in the `title`, `h1`, and `meta` tags is considered as well.
```python
import json

# Collect the page title and all <h1> headings
titles_h1s = {
    'title': soup.title.string if soup.title else "No title found",
    'h1_tags': [h1.get_text().strip() for h1 in soup.find_all('h1')]
}

outs = {}
# Keep <meta> tags that carry both a property and a content attribute
for meta in soup.find_all('meta'):
    try:
        outs[meta['property']] = meta['content']
    except KeyError:
        print('Not Related!')

# Parse every ld+json script on the page
lds = soup.find_all("script", {"type": "application/ld+json"})
outs['ld+json'] = [json.loads(ld.get_text()) for ld in lds]
```
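From here, the essential fields of, say, an e-commerce page can be read off the parsed entries. The sketch below assumes a schema.org `Product` object like the one shown earlier; the key names come from that vocabulary:

```python
# Pull Product fields out of the parsed ld+json entries, if any
for entry in outs['ld+json']:
    if isinstance(entry, dict) and entry.get('@type') == 'Product':
        brand = entry.get('brand')
        offers = entry.get('offers')
        print({
            'title': entry.get('name'),
            'short_description': entry.get('description'),
            # brand may be a plain string or a nested Brand object
            'brand': brand.get('name') if isinstance(brand, dict) else brand,
            'price': offers.get('price') if isinstance(offers, dict) else None,
        })
```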