HumanClassification (Part 3) – Parsing the page
In this phase, we parse the given page to extract some essential data. To that end, we first categorized the kinds of pages we are likely to deal with. They fall into six categories, listed below together with the essential data for each one (a data-structure sketch follows the list):
1. E-commerce Website
* Title/H1
* Short description
* Brand
* Price
* Category
2. Corporate Website
* HomePageTitle/H1
* Short description
3. Blog
* Title/H1
* Short description
* Category
4. News Website
* Title/H1
* Short description
* Category
5. Social Networking Website
* Title/H1
* Short description
* Channel that the user reads or follows
6. Educational Website
* Title/H1
* Short description
* Category
* Price
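To keep these categories consistent in code, one option is a small mapping from each category to its essential fields. This is a minimal sketch; the name `PAGE_CATEGORIES` and the field labels are our own illustrative choices, derived from the list above:

```python
# Hypothetical mapping of page categories to the essential data we extract.
PAGE_CATEGORIES = {
    'e-commerce': ['title/h1', 'short_description', 'brand', 'price', 'category'],
    'corporate': ['homepage_title/h1', 'short_description'],
    'blog': ['title/h1', 'short_description', 'category'],
    'news': ['title/h1', 'short_description', 'category'],
    'social_networking': ['title/h1', 'short_description', 'followed_channel'],
    'educational': ['title/h1', 'short_description', 'category', 'price'],
}
```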
Initial Step
Fetch the entire target page and load it into [`BeautifulSoup`](https://pypi.org/project/beautifulsoup4/), as sketched below.
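A minimal sketch of that step, assuming the page is fetched over HTTP with [`requests`](https://pypi.org/project/requests/) (the fetching library is our assumption; only `BeautifulSoup` is named above):

```python
import requests
from bs4 import BeautifulSoup

def fetch_page(url: str) -> BeautifulSoup:
    # Download the raw HTML (requests is an assumption; any HTTP client works)
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    # Parse the whole page with Python's built-in HTML parser
    return BeautifulSoup(response.text, 'html.parser')

soup = fetch_page('https://example.com')
```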
For extracting data, we implemented two approaches. The first one turned out to be inaccurate, while the second worked much better!
First approach:
First of all, we used the [`nltk`](https://www.nltk.org/data.html) library. With it, we tokenize the page text and rank words by frequency. We then designed a function that extracts words per `HTML` tag and weights each tag's frequencies with a different coefficient, so the most informative tags contribute the most.
```python
from collections import Counter

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

STOP_WORDS = set(stopwords.words('english'))

def pars_based_on_parameter(soup, parameter: str, coefficient: int):
    # Extract text from every occurrence of the given HTML tag
    text = ' '.join(element.get_text() for element in soup.find_all(parameter))
    # Tokenize the text
    words = word_tokenize(text)
    # Convert to lower case
    words = [word.lower() for word in words]
    # Remove stop words and non-alphabetic tokens
    keywords = [word for word in words if word.isalpha() and word not in STOP_WORDS]
    # Count keyword frequencies
    keyword_freq = Counter(keywords)
    # Weight each frequency by the tag's coefficient
    for key in keyword_freq:
        keyword_freq[key] *= coefficient
    return keyword_freq
```
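For example, the function can be called once per tag and the weighted counts merged. The coefficient values here are hypothetical, not the ones we used:

```python
# Weight headline tags more heavily than body text (coefficients are hypothetical)
weights = {'h1': 5, 'h2': 3, 'p': 1}

combined = Counter()
for tag, coefficient in weights.items():
    combined += pars_based_on_parameter(soup, tag, coefficient)

print(combined.most_common(10))  # top 10 weighted keywords
```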
As I mentioned, this approach was inaccurate, and that is what led us to `ld+json`!
What’s `ld+json`?
JSON-LD (JavaScript Object Notation for Linked Data) offers a simple way to publish machine-readable data on websites, which helps improve search results.
In simpler terms, it delivers more easily indexable content to search crawlers like Googlebot.
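To make this concrete, here is an illustrative `ld+json` payload of schema.org's `Product` type, the kind of markup an e-commerce page might embed, parsed with Python's standard `json` module (the product values are made up):

```python
import json

# Illustrative JSON-LD payload, as it might appear inside a
# <script type="application/ld+json"> tag (values are made up)
sample_ld_json = '''
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Example Headphones",
  "brand": {"@type": "Brand", "name": "ExampleBrand"},
  "description": "Wireless over-ear headphones.",
  "offers": {"@type": "Offer", "price": "99.99", "priceCurrency": "USD"}
}
'''

data = json.loads(sample_ld_json)
print(data['name'], data['offers']['price'])  # Example Headphones 99.99
```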
**Now, let's get to it…**
Second (Final) Approach:
As mentioned above, most of the vital data is presented in `ld+json`, so we extract and clean it. However, it's worth mentioning that the data in the `title`, `h1`, and `meta` tags is still taken into account.
```python
# Collect the page title and all <h1> headings
titles_h1s = {
    'title': soup.title.string if soup.title else "No title found",
    'h1_tags': [h1.get_text().strip() for h1 in soup.find_all('h1')],
}
```
```python
# Collect <meta> tags that carry property/content pairs (e.g. Open Graph)
metas = soup.find_all('meta')
outs = {}
for meta in metas:
    try:
        outs[meta['property']] = meta['content']
    except KeyError:
        print('Not Related!')
```
```python
import json

# Parse every embedded JSON-LD block on the page
lds = soup.find_all("script", {"type": "application/ld+json"})
outs['ld+json'] = [json.loads(ld.get_text()) for ld in lds]
```
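Not every page embeds valid JSON, so in practice the parse is worth guarding. A hedged sketch of the same loop with error handling:

```python
import json

outs['ld+json'] = []
for ld in soup.find_all("script", {"type": "application/ld+json"}):
    try:
        outs['ld+json'].append(json.loads(ld.get_text()))
    except json.JSONDecodeError:
        # Skip script blocks whose contents are not valid JSON
        pass
```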