xjessiex/AI_Internship_Evaluation

Automated Insights Data Science Intern Evaluation

Project Status: [Completed]

Project Intro/Objective

The dataset consists of about 42,000 movie summaries scraped from Wikipedia. Following the given guidelines, I use natural language processing (NLP) tools in R to conduct univariate and multivariate exploration of the dataset, with the following goals:

  • find the most produced and the most profitable movie genres
  • identify common characteristics in movie summaries
  • compare the word usage in the top five produced genres with Zipf’s law
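
The Zipf's law comparison in the last goal can be illustrated with a small sketch. Zipf's law says that a word's frequency is roughly inversely proportional to its frequency rank, so a plot of log(count) against log(rank) should be close to a straight line with slope near -1. The code below is only an illustration in Python, not the R code used in this project; the function names `zipf_table` and `zipf_slope` and the simple regex tokenizer are assumptions made for this example.

```python
import math
import re
from collections import Counter


def zipf_table(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    # Assumed tokenizer: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Sort by descending count, breaking ties alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ranked, start=1)]


def zipf_slope(table):
    """Least-squares slope of log(count) against log(rank)."""
    if len(table) < 2:
        raise ValueError("need at least two distinct words to fit a slope")
    xs = [math.log(rank) for rank, _, _ in table]
    ys = [math.log(count) for _, _, count in table]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

Calling `zipf_slope(zipf_table(summaries_text))` on the concatenated summaries of one genre (where `summaries_text` is a hypothetical string) gives a single number to compare across genres: the closer the slope is to -1, the closer the word usage follows Zipf's law.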

Methods Used

  • Text mining packages: tm, tidytext
  • Data visualization: ggplot2
  • Natural language processing libraries: spaCy (accessed through cleanNLP), topicmodels
  • Data manipulation: tidyverse

Tools

  • R version 3.5.1
  • Python3 (backend)

Warning

The RDS files that are saved and loaded in the code file may not work because of GitHub's push size limit.

Featured Deliverables

Contact
