We automatically pull daily news data from major national news sites (ABC, CBS, CNN, LA Times, NBC, NPR, NYT, Politico, ProPublica, USA Today, and WaPo) using GitHub Actions workflows. For the latest version, see the respective JSON files.
As of March 2025, we have about 700k unique URLs.
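To tally URLs across the dumps, something like the following works. This is a sketch only: the per-outlet JSON filenames and their record layout are assumptions here, so adjust to match the actual files in the repo.

```python
# Sketch: count unique URLs across the per-outlet JSON files.
# Filenames and record layout are assumptions; adapt to the repo's actual JSON structure.
import json
from pathlib import Path

all_urls = set()
for path in sorted(Path(".").glob("*.json")):
    records = json.loads(path.read_text())
    # Assume each record is either a bare URL string or a dict with a "url" key.
    urls = {r if isinstance(r, str) else r["url"] for r in records}
    print(f"{path.name}: {len(urls):,} URLs")
    all_urls |= urls

print(f"Total unique URLs: {len(all_urls):,}")
```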
- The script for aggregating the URLs and the March 2025 dump of URLs (.zip)
- The script for downloading the article text, parsing some features with newspaper3k (e.g., publication date, authors), and putting it in a DB is here. The script checks the local DB before incrementally processing new data (see the sketch after this list).
- The June 2023 full-text dump is here: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/ZNAKK6
- The March 2025 dump (minus the exceptions listed below) is in the same place.
- Newspaper3k can't parse USA Today, Politico, and ABC URLs, so I use a custom Google search to dig up the URLs and get the data. The script is here.
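Here is a minimal sketch of the download-and-parse step described above. It is not the actual linked script: the table and column names mirror the cbs.db schema shown further below, and error handling, rate limiting, and the per-outlet exceptions are omitted.

```python
# Sketch: fetch one article with newspaper3k and upsert it into the SQLite DB.
# Table/column names follow the cbs.db schema shown below; the real script differs in detail.
from datetime import date
from urllib.parse import urlparse

from newspaper import Article
from sqlite_utils import Database

db = Database("../cbs.db")
table = db["../cbs_stories"]  # the table name really does include the "../"

# Incremental processing: skip URLs that are already in the local DB.
seen = set()
if table.exists():
    seen = {r["url"] for r in db.query("SELECT url FROM [../cbs_stories]")}

def ingest(url: str) -> None:
    if url in seen:
        return
    article = Article(url)
    article.download()
    article.parse()
    table.upsert(
        {
            "url": url,
            "source": "cbs",
            "publish_date": article.publish_date.isoformat() if article.publish_date else None,
            "title": article.title,
            "authors": ", ".join(article.authors),
            "text": article.text,
            "extraction_date": date.today().isoformat(),
            "domain": urlparse(url).netloc,
        },
        pk="url",
    )
```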
To explore the DB, here is some example code (from a Jupyter notebook):
from itertools import islice
import pandas as pd
from sqlite_utils import Database

db_file = "../cbs.db"
table_name = "../cbs_stories"  # yup! it has the ../
db = Database(db_file)

print("Tables:", db.table_names())
Tables: ['../cbs_stories']

schema = db[table_name].schema
print("Schema:\n")
print(schema)
Schema:
CREATE TABLE [../cbs_stories] (
   [url] TEXT PRIMARY KEY,
   [source] TEXT,
   [publish_date] TEXT,
   [title] TEXT,
   [authors] TEXT,
   [text] TEXT,
   [extraction_date] TEXT,
   [domain] TEXT
)
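Since this is plain SQLite, you can also run SQL against the table directly instead of iterating rows. For example, a rough per-year count (a sketch: it assumes publish_date is stored as an ISO-style string, which the TEXT column suggests but doesn't guarantee):

```python
# Sketch: query the table with SQL (note the table name really does include "../").
from sqlite_utils import Database

db = Database("../cbs.db")
for row in db.query(
    """
    SELECT substr(publish_date, 1, 4) AS year, count(*) AS n
    FROM [../cbs_stories]
    GROUP BY year
    ORDER BY year
    """
):
    print(row["year"], row["n"])
```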
# Preview the first five rows
for row in islice(db[table_name].rows, 5):
    print(f"URL: {row['url']}")
    print(f"Title: {row['title']}")
    print(f"Date: {row['publish_date']}")
    print(f"Text preview: {row['text'][:100]}...\n")
# Option 1: Convert all data to a DataFrame
df = pd.DataFrame(list(db[table_name].rows))
# Option 2: If the table is very large, you might want to limit rows
# df = pd.DataFrame(list(islice(db[table_name].rows, 1000))) # first 1000 rows
# Print info about the DataFrame
print(f"DataFrame shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(df.head())
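From here, typical exploration might look something like the sketch below. It assumes the publish_date strings parse with pandas (which won't hold for every row) and that authors is a single comma-separated text field.

```python
# Sketch: quick exploration of the DataFrame built above.
# publish_date is stored as TEXT, so coerce to datetimes; rows that don't parse become NaT.
df["publish_date"] = pd.to_datetime(df["publish_date"], errors="coerce", utc=True)
dated = df.dropna(subset=["publish_date"])

# Articles per year
print(dated["publish_date"].dt.year.value_counts().sort_index())

# Most frequent bylines (authors is stored as a single text field)
print(df["authors"].value_counts().head(10))
```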