HumanClassification (Part 3) – Parsing the page
In this phase, we planned to parse the given page to obtain its essential data. For this purpose, we first categorized all the pages we were likely to deal with. They fall into 6 categories, each presented below together with the essential data to extract (see the sketch after the list):
1. E-commerce Website
* Title/H1
* Short description
* Brand
* Price
* Category
2. Corporate Website
* HomePageTitle/H1
* Short description
3. Blog
* Title/H1
* Short description
* Category
4. News Website
* Title/H1
* Short description
* Category
5. Social Networking Website
* Title/H1
* Short description
* Channels the user reads or follows
6. Educational Website
* Title/H1
* Short description
* Category
* Price
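As a quick reference, this mapping between categories and their essential fields can be kept in a simple lookup structure. The sketch below is our own illustration (the dictionary name and field keys are assumptions, not part of the project's code):

```python
# Hypothetical lookup table: the essential fields to extract per page category
ESSENTIAL_FIELDS = {
    'e-commerce': ['title/h1', 'short_description', 'brand', 'price', 'category'],
    'corporate': ['homepage_title/h1', 'short_description'],
    'blog': ['title/h1', 'short_description', 'category'],
    'news': ['title/h1', 'short_description', 'category'],
    'social_networking': ['title/h1', 'short_description', 'followed_channels'],
    'educational': ['title/h1', 'short_description', 'category', 'price'],
}
```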
Initial Step
Fetching the whole targeted page and parsing it with [`BeautifulSoup`](https://pypi.org/project/beautifulsoup4/).
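A minimal sketch of this step, assuming the page is fetched with [`requests`](https://pypi.org/project/requests/) (the URL is a placeholder):

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL -- replace with the page to classify
url = 'https://example.com'
response = requests.get(url, timeout=10)
response.raise_for_status()

# Parse the full HTML document; `soup` is used in all the snippets below
soup = BeautifulSoup(response.text, 'html.parser')
```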
For extracting the data, we implemented two approaches. The first one was inaccurate, while the other worked much better!
First approach:
First of all, we turned to the [`nltk`](https://www.nltk.org/data.html) library. This tool gives us the words on the page along with their frequencies. So we designed a function that collects the words under a given `HTML` tag and weights their counts with a per-tag coefficient, so that the most important tags yield the most accurate data.
```python
from collections import Counter

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Requires nltk.download('punkt') and nltk.download('stopwords')
STOP_WORDS = set(stopwords.words('english'))

def pars_based_on_parameter(soup, parameter: str, coefficient: int):
    # Extract text from every tag matching the given name (e.g. 'h1', 'p')
    text = ' '.join([element.get_text() for element in soup.find_all(parameter)])
    # Tokenize the text
    words = word_tokenize(text)
    # Convert to lower case
    words = [word.lower() for word in words]
    # Remove stop words and non-alphabetic tokens
    keywords = [word for word in words if word.isalpha() and word not in STOP_WORDS]
    # Count the frequency of each keyword
    keyword_freq = Counter(keywords)
    # Weight every count by the tag's coefficient
    for key in keyword_freq.keys():
        keyword_freq[key] *= coefficient
    return keyword_freq
```
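To combine several tags, the function can be called once per tag and the weighted counters merged. The tag list and coefficient values below are illustrative assumptions, not the project's actual tuning:

```python
# Illustrative weights: more prominent tags get higher coefficients
TAG_COEFFICIENTS = {'title': 5, 'h1': 4, 'h2': 3, 'p': 1}

total_freq = Counter()
for tag, coefficient in TAG_COEFFICIENTS.items():
    total_freq += pars_based_on_parameter(soup, tag, coefficient)

# The highest-weighted words are the page's candidate keywords
print(total_freq.most_common(10))
```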
As I mentioned, this approach was inaccurate, and that is what brought us face to face with `ld+json`!
What’s `ld+json`?
JSON-LD (JavaScript Object Notation for Linked Data) offers a simpler means to create machine-readable data from websites to promote search results.
In simpler terms, it delivers more easily indexable content to search crawlers like Googlebot.
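For example, a product page might embed a block like the one below inside a `<script type="application/ld+json">` tag. This snippet is a minimal illustration (not taken from a real site), parsed here with the standard `json` module:

```python
import json

# A typical ld+json payload as found on a product page (illustrative)
sample_ld_json = '''
{
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Example Widget",
    "description": "A short description of the product.",
    "brand": {"@type": "Brand", "name": "ExampleBrand"},
    "offers": {"@type": "Offer", "price": "19.99", "priceCurrency": "USD"}
}
'''

data = json.loads(sample_ld_json)
print(data['name'], data['offers']['price'])  # Example Widget 19.99
```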
**Now, let's get to it …**
Second (Final) Approach:
As mentioned above, most of the vital data is presented in `ld+json`, so we fetch and purify it. However, it's worth mentioning that the data in the `title`, `h1`, and `meta` tags is considered as well.
```python
import json

# Collect the page title and all <h1> headings
titles_h1s = {
    'title': soup.title.string if soup.title else "No title found",
    'h1_tags': [h1.get_text().strip() for h1 in soup.find_all('h1')]
}

outs = {}
# Keep <meta> tags that carry both a property and a content attribute
for meta in soup.find_all('meta'):
    try:
        outs[meta['property']] = meta['content']
    except KeyError:
        print('Not Related!')

# Parse every ld+json script on the page
lds = soup.find_all("script", {"type": "application/ld+json"})
outs['ld+json'] = [json.loads(ld.get_text()) for ld in lds]
```
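From here, the essential fields of, say, an e-commerce page can be read off the parsed entries. The sketch below assumes a schema.org `Product` object like the one shown earlier; the key names come from that vocabulary:

```python
# Pull Product fields out of the parsed ld+json entries, if any
for entry in outs['ld+json']:
    if isinstance(entry, dict) and entry.get('@type') == 'Product':
        brand = entry.get('brand')
        offers = entry.get('offers')
        print({
            'title': entry.get('name'),
            'short_description': entry.get('description'),
            # brand may be a plain string or a nested Brand object
            'brand': brand.get('name') if isinstance(brand, dict) else brand,
            'price': offers.get('price') if isinstance(offers, dict) else None,
        })
```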