news-watch is a Python package that scrapes structured news data from Indonesia's top news websites, with keyword and date filters for targeted research.
You can install news-watch via pip:
pip install news-watch
To run the scraper from the command line:
newswatch -k <keywords> -sd <start_date> -s [<scrapers>] -of <output_format> --silent
Command-Line Arguments
--keywords, -k: Required. A comma-separated list of keywords to scrape (e.g., -k "ojk,bank,npl").
--start_date, -sd: Required. The start date for scraping in YYYY-MM-DD format (e.g., -sd 2023-01-01).
--scrapers, -s: Optional. A comma-separated list of scrapers to use (e.g., -s "kompas,viva"). If not provided, all scrapers are used.
--output_format, -of: Optional. The output format (currently supports csv and xlsx).
--silent, -S: Optional. Run the scraper without printing output to the console.
--list_scrapers: Optional. List the supported scrapers.
Scrape articles related to "ihsg" from January 1st, 2025:
newswatch --keywords ihsg --start_date 2025-01-01
Scrape articles for multiple keywords (ihsg, bank, keuangan) and disable logging:
newswatch -k "ihsg,bank,keuangan" -sd 2025-01-01 --silent
List supported scrapers:
newswatch --list_scrapers
Scrape articles from specific news websites (bisnisindonesia and detik), save the output in Excel (xlsx) format, and disable logging:
newswatch -k "ihsg" -s "bisnisindonesia,detik" --output_format xlsx -S
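The invocations above can also be assembled from a Python script, for example when scheduling repeated runs. The following is a minimal sketch that only uses the command-line flags documented in this README; the helper name build_newswatch_command is hypothetical and is not part of news-watch itself.

```python
import shlex
from datetime import datetime


def build_newswatch_command(keywords, start_date, scrapers=None,
                            output_format=None, silent=False):
    """Return the argument list for a `newswatch` command-line call.

    Hypothetical helper: it only mirrors the flags documented in the README.
    """
    if not keywords:
        raise ValueError("at least one keyword is required")
    # Raises ValueError if the value does not parse as year-month-day.
    datetime.strptime(start_date, "%Y-%m-%d")

    cmd = ["newswatch", "--keywords", ",".join(keywords),
           "--start_date", start_date]
    if scrapers:
        cmd += ["--scrapers", ",".join(scrapers)]
    if output_format is not None:
        if output_format not in ("csv", "xlsx"):
            raise ValueError("output_format must be 'csv' or 'xlsx'")
        cmd += ["--output_format", output_format]
    if silent:
        cmd.append("--silent")
    return cmd


cmd = build_newswatch_command(["ojk", "bank", "npl"], "2023-01-01",
                              scrapers=["kompas", "viva"],
                              output_format="csv", silent=True)
# Show the equivalent shell command without running it.
print(shlex.join(cmd))
```

To actually run the scraper, pass the list to subprocess.run(cmd, check=True); building a list rather than a single string avoids shell-quoting problems with keywords that contain spaces.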
You can run news-watch on Google Colab.
The scraped articles are saved as a CSV or XLSX file in the current working directory with the format news-watch-{keywords}-YYYYMMDD_HH.
The output file contains the following columns:
title
publish_date
author
content
keyword
category
source
link
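As an illustration of working with these columns, here is a minimal sketch that reads a CSV export with the Python standard library and counts articles per source. It assumes the file is UTF-8 encoded, has a header row with the column names listed above, and ends in .csv; the function names are hypothetical.

```python
import csv
from collections import Counter
from pathlib import Path

# Column names as listed in the README.
EXPECTED_COLUMNS = ["title", "publish_date", "author", "content",
                    "keyword", "category", "source", "link"]


def load_articles(path):
    """Read a news-watch CSV export and return its rows as dictionaries."""
    # Assumption: the export is UTF-8 encoded with a header row.
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
        return list(reader)


def count_by_source(rows):
    """Count how many articles each news source contributed."""
    return Counter(row["source"] for row in rows)


if __name__ == "__main__":
    # Assumption: exports in the working directory match this pattern.
    for path in sorted(Path.cwd().glob("news-watch-*.csv")):
        rows = load_articles(path)
        print(path.name, len(rows), dict(count_by_source(rows)))
```

The column check fails early with a clear message if a file does not have the expected layout, which is useful when mixing exports from different runs.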
- Bisnis Indonesia
- Bloomberg Technoz
- CNBC Indonesia
- Detik.com
- Jawapos
- Katadata.co.id
- Kompas.com
- Kontan.co.id
- Metrotvnews.com
- Tempo.co
- Viva.co.id
Note:
- Running Kontan.co.id and Jawapos on the cloud currently leads to errors due to Cloudflare restrictions.
- Limitation: The Kontan.co.id scraper can process a maximum of 50 pages.
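When running in a cloud environment, one way to work around the Cloudflare note above is to pass an explicit scraper list that leaves out the affected sites. The sketch below builds such a value for the -s option. The identifiers "kontan" and "jawapos" are assumptions; confirm the real names with newswatch --list_scrapers. The function name is hypothetical.

```python
# Assumed identifiers; confirm the real ones with `newswatch --list_scrapers`.
CLOUD_BLOCKED = frozenset({"kontan", "jawapos"})


def cloud_safe_scrapers(available, blocked=CLOUD_BLOCKED):
    """Return a comma-separated value for `-s` without the blocked scrapers.

    The original order of `available` is preserved.
    """
    kept = [name for name in available if name.lower() not in blocked]
    if not kept:
        raise ValueError("no scrapers left after filtering")
    return ",".join(kept)


# Example input; the real list should come from `newswatch --list_scrapers`.
available = ["bisnisindonesia", "detik", "jawapos", "kompas", "kontan", "viva"]
print(cloud_safe_scrapers(available))
```

The resulting string can be passed directly as the argument to -s.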
Contributions are welcome! If you'd like to add support for more websites or improve the existing code, please open an issue or submit a pull request.
To run the tests:
pytest tests/
This project is licensed under the MIT License; see the LICENSE file for details.
If you use this software, please cite it with the following BibTeX entry:
@software{mabruri_newswatch,
author = {Okky Mabruri},
title = {news-watch},
version = {0.2.0},
year = {2025},
publisher = {Zenodo},
doi = {10.5281/zenodo.14908390},
url = {https://doi.org/10.5281/zenodo.14908390}
}