Summarize kanji frequency and difficulty from Japanese newspaper headlines

morrev/japan-news-scraper

japan-news-scraper

A simple study aid that displays daily headlines, the most frequent kanji, and the most difficult kanji (based on grade level and stroke count). Difficulty is defined as:

difficulty = (grade_weight) * (grade level) + (1 - grade_weight) * (stroke count)

By default, grade_weight is set to 1, so stroke count is not considered.
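As an illustration of the formula above, here is a minimal sketch. The function name `kanji_difficulty` and its parameter names are hypothetical and are not taken from scraper.py; only the weighting itself comes from the definition above.

```python
def kanji_difficulty(grade_level, stroke_count, grade_weight=1.0):
    """Weighted difficulty: grade_weight*grade + (1 - grade_weight)*strokes.

    Hypothetical helper for illustration; scraper.py may organize this differently.
    """
    return grade_weight * grade_level + (1.0 - grade_weight) * stroke_count


# Example: a kanji taught at grade level 8 with 16 strokes
grade_only = kanji_difficulty(8, 16)        # default weight of 1 ignores strokes
balanced = kanji_difficulty(8, 16, 0.5)     # equal weight on grade and strokes
print(grade_only, balanced)                 # prints: 8.0 12.0
```

With a weight of 0.5, stroke count pulls the score of stroke-heavy kanji upward even when their grade level is the same.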

Running

Running scraper.py without arguments appends the latest kanji frequencies (extracted from the Nikkei 225 homepage) to summary/summary.csv:

python scraper.py
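For readers curious how kanji frequencies can be tallied from headline text, the following is a rough sketch and not the code in scraper.py. It assumes that "kanji" means characters in the CJK Unified Ideographs block (U+4E00 to U+9FFF); the function name `kanji_frequencies` and the sample headlines are made up for illustration.

```python
import re
from collections import Counter

# Assumption: treat the CJK Unified Ideographs block as "kanji".
# Hiragana, katakana, digits, and punctuation are ignored.
KANJI_RE = re.compile(r"[\u4e00-\u9fff]")


def kanji_frequencies(headlines):
    """Count how often each kanji appears across a list of headline strings."""
    counts = Counter()
    for headline in headlines:
        counts.update(KANJI_RE.findall(headline))
    return counts


# Made-up sample headlines, not real scraped data
sample = ["日経平均が上昇", "日本株は続伸"]
freq = kanji_frequencies(sample)
print(freq["日"])  # prints: 2
```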

Arguments:

  • --d displays the scraping output (otherwise, the script scrapes silently and appends to the CSV)
  • --w 0.5 sets the grade weight to 0.5
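The two flags above could be wired up with the standard library's `argparse` roughly as follows. This is a sketch under assumptions: the attribute names, help strings, and the `build_parser` function are hypothetical, and the default of 1.0 for `--w` is inferred from the stated default grade weight.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented --d and --w flags."""
    parser = argparse.ArgumentParser(
        description="Summarize kanji frequency and difficulty from headlines"
    )
    # --d is a switch: present means display output, absent means run silently
    parser.add_argument("--d", action="store_true",
                        help="display the scraping output")
    # --w takes a float; 1.0 means difficulty depends on grade level only
    parser.add_argument("--w", type=float, default=1.0,
                        help="grade weight used in the difficulty formula")
    return parser


# Parse an explicit argument list so the example runs without a command line
args = build_parser().parse_args(["--d", "--w", "0.5"])
print(args.d, args.w)  # prints: True 0.5
```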

For example, to display top kanji on the Nikkei homepage by difficulty, with equal weights for kanji grade level and stroke count:

python scraper.py --d --w 0.5

Current functionality

Example

Setup/Dependencies

Requires "kanjidic2.xml" (data on kanji grade level and stroke count) at the relative path "inputs/kanjidic2.xml". The XML file is available as a gzip-compressed download at http://www.edrdg.org/kanjidic/kanjidic2.xml.gz and must be decompressed before use.
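To show what the grade and stroke data look like, here is a minimal sketch of reading them with `xml.etree.ElementTree`. It relies on the KANJIDIC2 layout, in which each `character` element holds a `literal` and a `misc` element with an optional `grade` and one or more `stroke_count` entries. The function name `load_kanji_info` is hypothetical, and the embedded XML is an abbreviated sample in that layout, not the full file.

```python
import io
import xml.etree.ElementTree as ET


def load_kanji_info(source):
    """Map each kanji to (grade or None, stroke count).

    `source` is a path or a file object, e.g. "inputs/kanjidic2.xml".
    Hypothetical helper; scraper.py may read the file differently.
    """
    info = {}
    for character in ET.parse(source).getroot().iter("character"):
        literal = character.findtext("literal")
        grade = character.findtext("misc/grade")            # missing for some kanji
        strokes = character.findtext("misc/stroke_count")   # first entry only
        if literal is None or strokes is None:
            continue
        info[literal] = (int(grade) if grade else None, int(strokes))
    return info


# Abbreviated sample following the kanjidic2.xml structure
SAMPLE = """<kanjidic2>
  <character><literal>亜</literal>
    <misc><grade>8</grade><stroke_count>7</stroke_count></misc></character>
  <character><literal>唖</literal>
    <misc><stroke_count>10</stroke_count></misc></character>
</kanjidic2>"""

info = load_kanji_info(io.BytesIO(SAMPLE.encode("utf-8")))
print(info["亜"])  # prints: (8, 7)
```

Kanji without a `grade` entry come back with `None` here; how such kanji should be scored is a design choice this sketch leaves open.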

Attribution/Credits

Future Considerations
