The dataset consists of about 42,000 movie summaries scraped from Wikipedia. Following the given guidelines, I use natural language processing (NLP) tools in R to conduct univariate and multivariate exploration of the dataset, summarized below:
- find the most produced and the most profitable movie genres
- identify common characteristics in movie summaries
- compare word usage in the top five most produced genres with Zipf’s law
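The Zipf’s-law comparison above can be sketched with tidytext: tokenize the summaries, rank words by frequency within each genre, and check whether log frequency falls roughly linearly in log rank. This is a minimal illustration on a toy corpus, not the project’s actual code; the column names (`genre`, `text`) are assumptions.

```r
# Minimal sketch of a Zipf's-law check, using a toy corpus in place of
# the real movie summaries (column names are illustrative).
library(dplyr)
library(tidytext)

summaries <- tibble(
  genre = c("Drama", "Drama", "Comedy"),
  text  = c("a quiet family drama unfolds",
            "the family returns home again",
            "a comedy of errors at home")
)

# Tokenize, count words per genre, then rank by frequency within genre
word_freq <- summaries %>%
  unnest_tokens(word, text) %>%
  count(genre, word, sort = TRUE) %>%
  group_by(genre) %>%
  mutate(rank = row_number(),
         term_frequency = n / sum(n)) %>%
  ungroup()

# Under Zipf's law, log10(term_frequency) is roughly linear in
# log10(rank) with a slope near -1; a simple fit checks how closely
# the corpus follows it.
fit <- lm(log10(term_frequency) ~ log10(rank), data = word_freq)
coef(fit)
```

On the real dataset the same pipeline runs per genre, and plotting `term_frequency` against `rank` on log–log axes with ggplot2 makes the comparison visual.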
- Text mining packages: tm, tidytext
- Data visualization: ggplot2
- Natural language processing: spaCy (wrapped in cleanNLP), topicmodels
- Data manipulation: tidyverse
- R version 3.5.1
- Python 3 (backend for spaCy)
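Since cleanNLP wraps the Python spaCy library, the annotation backend has to be initialized before any text is processed. A minimal sketch of that setup, assuming spaCy and an English model are installed in the Python 3 environment (the model name `en_core_web_sm` is an assumption):

```r
# Sketch of initializing the spaCy backend via cleanNLP; requires a
# Python 3 installation with spaCy and the named model available.
library(cleanNLP)

cnlp_init_spacy(model_name = "en_core_web_sm")

# Annotate a single illustrative summary; the result holds token-level
# annotations (lemma, part of speech, etc.)
anno <- cnlp_annotate("A young farm boy joins a rebellion against an empire.")
head(anno$token)
```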
The RDS files saved and loaded by the code may not be available in the repository due to GitHub's push size limit.
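One way to keep the code runnable when a committed RDS file is missing is a simple cache pattern: load the file if it exists, otherwise recompute and save it. A minimal sketch (the file name and placeholder computation are illustrative):

```r
# Cache pattern around saveRDS/readRDS: reuse the saved object when
# present, otherwise recompute it and write the cache.
path <- "word_counts.rds"
if (file.exists(path)) {
  word_counts <- readRDS(path)
} else {
  word_counts <- table(c("drama", "drama", "comedy"))  # placeholder computation
  saveRDS(word_counts, path)
}
```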
- HTML: copy the summary link into this GitHub HTML reader
- PDF (LaTeX equations fail to display): Project Summary
- Xiaoxuan Yang: [xy77@duke.edu]