Skip to content

okkymabruri/news-watch

Repository files navigation

news-watch: Indonesia's top news websites scraper

PyPI version Build Status PyPI Downloads

news-watch is a Python package that scrapes structured news data from Indonesia's top news websites, offering keyword and date filtering queries for targeted research

Installation

You can install newswatch via pip:

pip install news-watch

Usage

To run the scraper from the command line:

newswatch -k <keywords> -sd <start_date> -s [<scrapers>] -of <output_format> --silent

Command-Line Arguments

--keywords, -k: Required. A comma-separated list of keywords to scrape (e.g., -k "ojk,bank,npl").

--start_date, -sd: Required. The start date for scraping in YYYY-MM-DD format (e.g., -sd 2023-01-01).

--scrapers, -s: Optional. A comma-separated list of scrapers to use (e.g., -s "kompas,viva"). If not provided, all scrapers will be used by default.

--output_format, -of: Optional. Specify the output format (currently support csv, xlsx).

--silent, -S: Optional. Run the scraper without printing output to the console.

--list_scrapers: Optional. List supported scrapers.

Examples

Scrape articles related to "ihsg" from January 1st, 2025:

newswatch --keywords ihsg --start_date 2025-01-01

Scrape articles for multiple keywords (ihsg, bank, keuangan) and disable logging:

newswatch -k "ihsg,bank,keuangan" -sd 2025-01-01 --silent

List supported scrapers:

newswatch --list_scrapers

Scrape articles for specific news website (bisnisindonesia and detik) with excel output format and disable logging:

newswatch -k "ihsg" -s "bisnisindonesia,detik" --output_format xlsx -S

Run on Google Colab

You can run news-watch on Google Colab Open In Colab

Output

The scraped articles are saved as a CSV or XLSX file in the current working directory with the format news-watch-{keywords}-YYYYMMDD_HH.

The output file contains the following columns:

  • title
  • publish_date
  • author
  • content
  • keyword
  • category
  • source
  • link

Supported Websites

Note:

  • Running Kontan.co.id and Jawapos on the cloud currently leads to errors due to Cloudflare restrictions.
  • Limitation: Kontan.co.id scraper can process a maximum of 50 pages.

Contributing

Contributions are welcome! If you'd like to add support for more websites or improve the existing code, please open an issue or submit a pull request.

Running Tests

To run the test:

pytest tests/

License

This project is licensed under the MIT - see the LICENSE file for details.

Citation

If you use this software, please use the following BibTex entry:

@software{mabruri_newswatch,
  author       = {Okky Mabruri},
  title        = {news-watch},
  version      = {0.2.0},
  year         = {2025},
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.14908390},
  url          = {https://doi.org/10.5281/zenodo.14908390}
}

Available on Zenodo: DOI

Related Work