news-watch: Indonesia's top news websites scraper

news-watch is a Python package that scrapes structured news data from Indonesia's top news websites, with keyword and date filters for targeted research.

Installation

You can install news-watch via pip:

pip install news-watch

Usage

To run the scraper from the command line:

newswatch -k <keywords> -sd <start_date> -s [<scrapers>] -of <output_format> --silent

Command-Line Arguments

--keywords, -k: Required. A comma-separated list of keywords to scrape (e.g., -k "ojk,bank,npl").

--start_date, -sd: Required. The start date for scraping in YYYY-MM-DD format (e.g., -sd 2023-01-01).

--scrapers, -s: Optional. A comma-separated list of scrapers to use (e.g., -s "kompas,viva"). If not provided, all scrapers will be used by default.

--output_format, -of: Optional. The output format; csv and xlsx are currently supported.

--silent, -S: Optional. Run the scraper without printing output to the console.

--list_scrapers: Optional. List supported scrapers.
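The flags above can also be assembled from Python when the scraper is driven by a script. The following is a minimal sketch, not part of news-watch itself: the helper names build_newswatch_command and run_newswatch are hypothetical, and the sketch assumes the newswatch executable is on the PATH. It uses only the flags documented in this section.

```python
import subprocess
from datetime import datetime
from typing import List, Optional

# Output formats listed in this README.
SUPPORTED_FORMATS = ("csv", "xlsx")


def build_newswatch_command(
    keywords: List[str],
    start_date: str,
    scrapers: Optional[List[str]] = None,
    output_format: Optional[str] = None,
    silent: bool = False,
) -> List[str]:
    """Assemble a newswatch argument list (hypothetical helper)."""
    if not keywords:
        raise ValueError("at least one keyword is required")
    # Raises ValueError if start_date cannot be parsed as YYYY-MM-DD.
    datetime.strptime(start_date, "%Y-%m-%d")
    if output_format is not None and output_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported output format: {output_format}")

    # Required flags: comma-separated keywords and the start date.
    command = ["newswatch", "-k", ",".join(keywords), "-sd", start_date]
    if scrapers:
        command += ["-s", ",".join(scrapers)]
    if output_format:
        command += ["-of", output_format]
    if silent:
        command.append("--silent")
    return command


def run_newswatch(command: List[str]) -> None:
    """Run the assembled command; assumes newswatch is installed."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(build_newswatch_command(["ojk", "bank", "npl"], "2023-01-01", silent=True))
```

Passing the arguments as a list avoids shell quoting issues with the comma-separated values.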

Examples

Scrape articles related to "ihsg" from January 1st, 2025:

newswatch --keywords ihsg --start_date 2025-01-01

Scrape articles for multiple keywords (ihsg, bank, keuangan) and disable logging:

newswatch -k "ihsg,bank,keuangan" -sd 2025-01-01 --silent

List supported scrapers:

newswatch --list_scrapers

Scrape articles from specific news websites (bisnisindonesia and detik), save the output as an Excel file, and disable logging:

newswatch -k "ihsg" -sd 2025-01-01 -s "bisnisindonesia,detik" --output_format xlsx -S

Run on Google Colab

You can run news-watch on Google Colab.

Output

The scraped articles are saved as a CSV or XLSX file in the current working directory, named using the pattern news-watch-{keywords}-YYYYMMDD_HH.
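Because the file name includes a timestamp, a script usually has to search for the output rather than hard-code its name. The sketch below is one way to do that, assuming the files carry a .csv or .xlsx extension after the news-watch- prefix. The helper find_newswatch_outputs is hypothetical and not part of the package.

```python
from pathlib import Path
from typing import List


def find_newswatch_outputs(directory: str = ".") -> List[Path]:
    """Return news-watch output files in `directory`, newest first."""
    folder = Path(directory)
    matches = [
        path
        for extension in ("csv", "xlsx")  # assumed file extensions
        for path in folder.glob(f"news-watch-*.{extension}")
    ]
    # Sort by modification time so the most recent run comes first.
    return sorted(matches, key=lambda path: path.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    for output_file in find_newswatch_outputs():
        print(output_file.name)
```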

The output file contains the following columns:

title
publish_date
author
content
keyword
category
source
link
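The columns above make the CSV output easy to load with the standard library. The following sketch assumes the CSV file has a header row using exactly these column names and is UTF-8 encoded. The helpers read_articles and count_by_source are hypothetical and only illustrate a typical first check on the data.

```python
import csv
from collections import Counter
from typing import Dict, List, TextIO

# Column names as listed in this README.
EXPECTED_COLUMNS = [
    "title", "publish_date", "author", "content",
    "keyword", "category", "source", "link",
]


def read_articles(handle: TextIO) -> List[Dict[str, str]]:
    """Read rows from an open CSV handle and verify the expected columns."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [column for column in EXPECTED_COLUMNS if column not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


def count_by_source(rows: List[Dict[str, str]]) -> Dict[str, int]:
    """Count how many articles came from each source."""
    return dict(Counter(row["source"] for row in rows))
```

To use it, open an output file with open(path, newline="", encoding="utf-8") and pass the handle to read_articles.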

Supported Websites

Note:

Running Kontan.co.id and Jawapos on the cloud currently leads to errors due to Cloudflare restrictions.

Limitation: The Kontan.co.id scraper can process a maximum of 50 pages.
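When running on a cloud host, one workaround is to pass an explicit scraper list that leaves out the two affected sites. The sketch below assumes their scraper identifiers are "kontan" and "jawapos"; these names are a guess and should be confirmed with newswatch --list_scrapers. The helper scrapers_for_cloud is hypothetical.

```python
from typing import Iterable, List

# Assumed identifiers for the Kontan.co.id and Jawapos scrapers;
# confirm the real names with `newswatch --list_scrapers`.
CLOUD_BLOCKED = {"kontan", "jawapos"}


def scrapers_for_cloud(available: Iterable[str]) -> List[str]:
    """Drop scrapers known to fail behind Cloudflare, keeping the original order."""
    return [name for name in available if name.lower() not in CLOUD_BLOCKED]


if __name__ == "__main__":
    selected = scrapers_for_cloud(["kompas", "kontan", "detik", "jawapos", "viva"])
    # The comma-separated result can be passed to the -s flag.
    print(",".join(selected))
```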

Contributing

Contributions are welcome! If you'd like to add support for more websites or improve the existing code, please open an issue or submit a pull request.

Running Tests

To run the tests:

pytest tests/

License

This project is licensed under the MIT License. See the LICENSE file for details.

Citation

If you use this software, please cite it using the following BibTeX entry:

@software{mabruri_newswatch,
  author       = {Okky Mabruri},
  title        = {news-watch},
  version      = {0.2.0},
  year         = {2025},
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.14908390},
  url          = {https://doi.org/10.5281/zenodo.14908390}
}

Available on Zenodo: https://doi.org/10.5281/zenodo.14908390
