Twitter data analysis - Assignment #2 for Computational Social Media course (Fulbright University Vietnam)
This is a Python program I created to statistically analyze a dataset of 10,000 Tweets (from Twitter) for the Computational Social Media course (CS205) at Fulbright University Vietnam in the Fall term, academic year 2021-2022.
The program uses the pandas, numpy, and math libraries in Python, alongside key concepts such as string manipulation, lists, dictionaries, and loops.
- Open and read the Tweets data from the 'twitter_covid_fuv_2021.xlsx' file.
- Process the dataset and return the following descriptive statistics, as required by the assignment:
  - Percentage of tweets that contain URLs.
  - Percentage of tweets that are (or contain) retweets.
  - Percentage of tweets that contain vaccination hashtags/keywords (%pfizer, %moderna, %astrazeneca, %janssen, %verocell).
  - Distribution of languages declared in the tweet metadata (%EN, %FR, ...).
  - Table of the 30 most frequent hashtags in the following format: [rank, hashtag, frequency]. Example: [1, #coronavirus, 2500].
  - Percentage of tweets directly generated by all the 20 media accounts together. Example: 3% of tweets were produced by the 20 media accounts altogether.
  - Percentage of tweets directly generated by the 20 NGOs/gov. accounts. Example: 5% of tweets were produced by the 20 NGOs/government accounts.
  - Percentage of tweets generated by all the 20 media accounts that appear as retweets.
  - Percentage of tweets generated by all the 20 NGOs/gov. accounts that appear as retweets.
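The statistics listed above could be computed roughly as in the sketch below. It is an illustration under stated assumptions, not the program's actual code: the column names (`text`, `lang`, `screen_name`), the account names, the "RT @" retweet convention, and the toy rows are all hypothetical, and the real spreadsheet may be organized differently. It uses only lists, dictionaries, and loops so it runs without the Excel file; with pandas, the rows could presumably be loaded as shown in the first comment.

```python
import re
from collections import Counter

# With pandas installed, rows could be loaded from the spreadsheet like this:
#   rows = pandas.read_excel("twitter_covid_fuv_2021.xlsx").to_dict("records")

# Hypothetical column names; the real headers may differ.
TEXT, LANG, USER = "text", "lang", "screen_name"

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")


def pct(rows, predicate):
    """Percentage of rows for which predicate(row) is true."""
    if not rows:
        return 0.0
    return 100.0 * sum(1 for r in rows if predicate(r)) / len(rows)


def has_url(row):
    return bool(URL_RE.search(row[TEXT]))


def is_retweet(row):
    # Assumption: a tweet is (or contains) a retweet if "RT @" appears in it.
    return "RT @" in row[TEXT]


def mentions(keyword):
    """Predicate: the tweet text contains the keyword, ignoring case."""
    return lambda row: keyword in row[TEXT].lower()


def by_accounts(accounts):
    """Predicate: the tweet was posted directly by one of the accounts."""
    wanted = {a.lower() for a in accounts}
    return lambda row: row[USER].lower() in wanted


def retweets_of(accounts):
    """Predicate (assumed reading): the tweet is a retweet of one of the accounts."""
    wanted = {a.lower() for a in accounts}

    def check(row):
        m = re.match(r"RT @(\w+)", row[TEXT])
        return bool(m) and m.group(1).lower() in wanted

    return check


def language_distribution(rows):
    """Map each declared language code to its percentage of all tweets."""
    counts = Counter(r[LANG] for r in rows)
    return {lang: 100.0 * n / len(rows) for lang, n in counts.items()}


def top_hashtags(rows, n=30):
    """Return [rank, hashtag, frequency] rows for the n most frequent hashtags."""
    counts = Counter(
        tag.lower() for r in rows for tag in HASHTAG_RE.findall(r[TEXT])
    )
    return [
        [rank, tag, freq]
        for rank, (tag, freq) in enumerate(counts.most_common(n), start=1)
    ]


# Toy data, invented for illustration only.
rows = [
    {TEXT: "RT @who: Get vaccinated #COVID19 https://example.org", LANG: "en", USER: "alice"},
    {TEXT: "Pfizer dose done #covid19 #vaccine", LANG: "en", USER: "example_media"},
    {TEXT: "Le vaccin Moderna est disponible #vaccine", LANG: "fr", USER: "bob"},
    {TEXT: "Stay safe everyone #covid19", LANG: "en", USER: "example_ngo"},
]

print("URLs:", pct(rows, has_url))
print("Retweets:", pct(rows, is_retweet))
print("Pfizer:", pct(rows, mentions("pfizer")))
print("Languages:", language_distribution(rows))
print("Top hashtags:", top_hashtags(rows))
print("By media accounts:", pct(rows, by_accounts(["example_media"])))
print("Retweets of listed accounts:", pct(rows, retweets_of(["WHO"])))
```

Passing predicates to a single `pct` helper keeps each statistic to one line; the lists of 20 media and 20 NGO/government accounts would replace the placeholder account names used here.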
The Twitter data in the Excel file was provided by our course instructor for the educational purposes of this course only. I am grateful to our instructor, Thay Phan Thanh Trung, who kindly gave us access to the dataset as well as clear instructions and requirements for this analysis assignment.
#python #university #computerscience #dataanalysis #learning #pythonproject