I used ChatGPT to analyze more than 500 articles from the acclaimed “PUNCH” newspaper
A month ago I wrote a piece on how to use ChatGPT to code easily. Check it out: Chat GPT and Coding: How to Code Easily with Chat GPT in 2023 (codeant.org)
I decided to take my own advice and used ChatGPT to help me analyze more than 500 articles from the PUNCH website: https://punchng.com/
Natural Language Processing (NLP) is used everywhere, from chatbots to your favourite search engines (Bing, Google, etc.). It is what computers use to understand human (or natural) language, and I wanted to explore how it could help businesses in the media industry.
The Punch newspaper is the most widely read newspaper in Nigeria and I personally remember reading their Sunday issues when I was young because they would add a very short comic. My dad would buy the newspaper as we were going home from church and I would collect it, remove the page with the comic and then return the rest to him. The newspaper was founded in 1970 by James Aboderin and Sam Amuka.
Thanks to the DSN and DataCamp Scholarship, I took an advanced course, ‘Feature Engineering for NLP in Python’, on the DataCamp platform, and it gave me the knowledge I needed to analyze their data.
I spent almost a month collecting 585 links from their online platform. I scraped at least 70 articles for each of the following categories:
- Politics
- News
- Metro Plus
- Sports
- Business
- Editorial and
- HealthWise
I wanted to find out the following:
- Who is the average person that can read articles on the PUNCH website? (Readability Index)
- How subjective are the articles? (Integrity of the News Source)
- Which category has the most positive emotion? (Positive Polarity) and finally
- Which category of articles has the lowest number of words? (User Attention)
Who is the average person that can read articles from the PUNCH website? (Readability Index)
Sadly, not everybody has the ability to read and write. According to data from the World Bank, the literacy rate of adults in Nigeria (2018) is 62%. This means that only 62% of Nigerian adults can read and write. It is important for everyone to have access to news to stay up to date on events happening around them both at home and abroad. As a newspaper, your goal is to ensure that everybody from CEOs with post-graduate degrees to market women who might only have a secondary school education can read and understand the information that you are putting out.
# AVERAGE FOG INDEX
average_fog_index = data['FOG INDEX'].mean()
print(average_fog_index)
[6.325771756892892]
My analysis showed that the average fog index, which is simply a measure of reading level, is 6.3. This corresponds to a 6th-grade reading level. The average age of a child in the 6th grade is 12, which means that people aged 12 and above can read an average article on the PUNCH website.
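For readers who want to see what goes into a number like this, here is a minimal sketch of the standard Gunning fog formula: 0.4 × (average words per sentence + percentage of complex words). It is an illustration, not necessarily the code behind the figure above. The helper names `count_syllables` and `gunning_fog` are made up for this example, and the syllable counter is a rough vowel-group heuristic, so scores will differ somewhat from those of a dedicated readability library.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowels (assumption, not exact)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def gunning_fog(text):
    """Gunning fog index = 0.4 * (words per sentence + percentage of complex words)."""
    # Split on sentence-ending punctuation and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    # "Complex" here means three or more estimated syllables.
    complex_words = [w for w in words if count_syllables(w) >= 3]
    avg_sentence_length = len(words) / len(sentences)
    percent_complex = 100 * len(complex_words) / len(words)
    return 0.4 * (avg_sentence_length + percent_complex)


print(gunning_fog("The cat sat. The dog ran."))
```

Short sentences made of short words give a low score, while long sentences packed with multi-syllable words push the score up.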
How subjective are the articles? (Integrity of the News Source)
My next question was about the subjectivity of the articles. It is important that an international newspaper puts out unbiased information. Unlike personal blogs or websites, where people give their own perspectives and opinions, newspapers are meant to tell you how things are as objectively as possible, without adding salt or sugar.
# AVERAGE SUBJECTIVITY OF THE ARTICLES
average_subjectivity = data['SUBJECTIVITY SCORE'].mean()
print(average_subjectivity)
[0.08911143353352624]
Subjectivity is measured on a scale of 0–1, where 0 stands for objectivity (i.e. little to no bias or personal opinion) and 1 stands for subjectivity (i.e. high bias or personal opinion). As you can see, the average subjectivity is about 0.09, which is very close to zero. This means that the articles are, on average, very objective.
Which category has the most positive emotion? (Positive Polarity)
An article talking about a huge natural disaster where people died will have a negative polarity because the topic brings about negative or sad emotions. On the other hand, if an article was talking about how a couple that has been married for 10 years without children finally gave birth to twins, the article will have a positive polarity.
If you’re wondering how it’s calculated, here you go:
polarity_score = (positive_score - negative_score) / ((positive_score + negative_score) + 0.000001)
where:
positive_score = number of positive words in the article
negative_score = number of negative words in the article
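The formula above can be sketched in a few lines of Python. The two word sets below are tiny, hypothetical lexicons invented for this example; a real analysis would need a full list of positive and negative words. The function name `polarity_score` is also just an illustrative choice.

```python
import re

# Hypothetical mini-lexicons for illustration only (assumption, not the real word lists).
POSITIVE_WORDS = {"win", "growth", "success", "happy"}
NEGATIVE_WORDS = {"death", "crime", "loss", "disaster"}


def polarity_score(text):
    """Apply (positive - negative) / ((positive + negative) + 0.000001) to one article."""
    words = re.findall(r"[a-z']+", text.lower())
    positive_score = sum(1 for w in words if w in POSITIVE_WORDS)
    negative_score = sum(1 for w in words if w in NEGATIVE_WORDS)
    # The small constant avoids division by zero when no lexicon words are found.
    return (positive_score - negative_score) / ((positive_score + negative_score) + 0.000001)


print(polarity_score("The team celebrated a happy win"))
print(polarity_score("Crime and disaster and death, but one success"))
```

The score lands between -1 and 1: close to 1 when only positive words appear, close to -1 when only negative words appear, and 0 when neither kind is found.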
# WHICH CATEGORY HAS THE MOST POSITIVE EMOTION?
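# numeric_only=True skips text columns, which recent pandas versions refuse to average
tags_df = data.groupby('TAGS').mean(numeric_only=True)
tags_df.sort_values(by='POLARITY SCORE', ascending=False)['POLARITY SCORE']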
"""
TAGS
Sport 0.369845
Business 0.365503
Politics 0.157121
News -0.094879
General Health -0.121839
Editorial -0.363592
Metro Plus -0.485386
Name: POLARITY SCORE, dtype: float64
"""
Although Sport and Business articles are close, Sport articles have the highest positive emotion, while Metro Plus articles have the lowest polarity score at -0.485. This is not surprising, since Metro Plus articles cover natural disasters and criminal activities.
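If you want to try the group-and-sort step yourself, here is a small sketch on a made-up table. The rows and the `TITLE` column are invented for illustration and are not the scraped PUNCH data. Only the column names `TAGS`, `POLARITY SCORE` and `WORD COUNT` are taken from the snippets in this post.

```python
import pandas as pd

# Made-up rows for illustration; these are not the scraped PUNCH data.
data = pd.DataFrame({
    "TAGS": ["Sport", "Sport", "Metro Plus", "Metro Plus", "Business"],
    "TITLE": ["a", "b", "c", "d", "e"],  # hypothetical text column
    "POLARITY SCORE": [0.5, 0.3, -0.6, -0.4, 0.2],
    "WORD COUNT": [200, 220, 180, 210, 240],
})

# Average every numeric column per category, ignoring text columns.
tags_df = data.groupby("TAGS").mean(numeric_only=True)

# Rank categories from most positive to most negative.
ranking = tags_df.sort_values(by="POLARITY SCORE", ascending=False)["POLARITY SCORE"]
print(ranking)
```

The same `tags_df` can then be sorted by any other averaged column, such as `WORD COUNT`.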
Which category of articles has the lowest number of words? (User Attention)
Nobody likes reading a lengthy article. I’m sure you’ve clicked on a link to an article before, scrolled a little and got tired of reading. User retention is a key success metric in the media industry, from YouTube videos to blog posts. It is one of the reasons TikTok is so popular: its short videos easily retain the attention of users. I wanted to see which category of articles has the lowest number of words.
# WHICH CATEGORY HAS THE LOWEST NUMBER OF WORDS?
tags_df.sort_values(by='WORD COUNT')['WORD COUNT']
"""
Metro Plus 204.180723
Sport 206.812500
News 209.575000
Business 230.762500
Politics 348.620000
General Health 379.187500
Editorial 666.268293
Name: WORD COUNT, dtype: float64
"""
Metro Plus articles have the lowest average number of words, at about 204 per article. This may be because they are summaries of incidents rather than detailed write-ups like Editorial articles, which have the highest average at about 666 words.
What more can I do?
If I had access to the website’s user engagement data, I would be able to see which patterns encourage high user engagement and which do not, and then suggest changes to improve engagement on the platform.
It is important to note, however, that user engagement is linked not only to the articles but also to the user interface of the website. Is the website easy to navigate, or do people get lost trying to read the next article?
References
Literacy rate, adult total (% of people ages 15 and above) — Nigeria | Data (worldbank.org)
How did I use ChatGPT to help me with this?
Let me show you:
The code provided to me by ChatGPT was really helpful for my final code. Check out my GitHub repo for the data I scraped and the full analysis. Also check the similarities between my code and ChatGPT’s code: