We automatically pull daily news data from major national news sites (ABC, CBS, CNN, LA Times, NBC, NPR, NYT, Politico, ProPublica, USA Today, and WaPo) using GitHub Actions workflows. For the latest version, see the respective JSON files.
As of March 2025, we have about 700k unique URLs.
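To tally URLs across the dumps, something like the following works. This is a sketch only: the per-outlet JSON filenames and their record layout are assumptions here, so adjust to match the actual files in the repo.

```python
# Sketch: count unique URLs across the per-outlet JSON files.
# Filenames and record layout are assumptions; adapt to the repo's actual JSON structure.
import json
from pathlib import Path

all_urls = set()
for path in sorted(Path(".").glob("*.json")):
    records = json.loads(path.read_text())
    # Assume each record is either a bare URL string or a dict with a "url" key.
    urls = {r if isinstance(r, str) else r["url"] for r in records}
    print(f"{path.name}: {len(urls):,} URLs")
    all_urls |= urls

print(f"Total unique URLs: {len(all_urls):,}")
```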
- The script for aggregating the URLs and the March 2025 dump of URLs (.zip)
- The script for downloading the article text, parsing some features with newspaper3k (e.g., publication date, authors), and putting it in a DB is here. The script checks the local DB before incrementally processing new data (see the sketch after this list).
- The June 2023 full-text dump is here: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/ZNAKK6
- The March 2025 dump (minus the exceptions listed below) is in the same place.
- Newspaper3k can't parse USA Today, Politico, and ABC URLs, so I use a custom Google search to dig up the URLs and get the data. The script is here.
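Here is a minimal sketch of the download-and-parse step described above. It is not the actual linked script: the table and column names mirror the cbs.db schema shown further below, and error handling, rate limiting, and the per-outlet exceptions are omitted.

```python
# Sketch: fetch one article with newspaper3k and upsert it into the SQLite DB.
# Table/column names follow the cbs.db schema shown below; the real script differs in detail.
from datetime import date
from urllib.parse import urlparse

from newspaper import Article
from sqlite_utils import Database

db = Database("../cbs.db")
table = db["../cbs_stories"]  # the table name really does include the "../"

# Incremental processing: skip URLs that are already in the local DB.
seen = set()
if table.exists():
    seen = {r["url"] for r in db.query("SELECT url FROM [../cbs_stories]")}

def ingest(url: str) -> None:
    if url in seen:
        return
    article = Article(url)
    article.download()
    article.parse()
    table.upsert(
        {
            "url": url,
            "source": "cbs",
            "publish_date": article.publish_date.isoformat() if article.publish_date else None,
            "title": article.title,
            "authors": ", ".join(article.authors),
            "text": article.text,
            "extraction_date": date.today().isoformat(),
            "domain": urlparse(url).netloc,
        },
        pk="url",
    )
```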
To explore the DB, here is some example code (from a Jupyter notebook):
from itertools import islice
import pandas as pd
from sqlite_utils import Database

db_file = "../cbs.db"
table_name = "../cbs_stories"  # yup! it has the ../
db = Database(db_file)

print("Tables:", db.table_names())
Tables: ['../cbs_stories']

schema = db[table_name].schema
print("Schema:\n")
print(schema)
Schema:
CREATE TABLE [../cbs_stories] (
   [url] TEXT PRIMARY KEY,
   [source] TEXT,
   [publish_date] TEXT,
   [title] TEXT,
   [authors] TEXT,
   [text] TEXT,
   [extraction_date] TEXT,
   [domain] TEXT
)
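Since this is plain SQLite, you can also run SQL against the table directly instead of iterating rows. For example, a rough per-year count (a sketch: it assumes publish_date is stored as an ISO-style string, which the TEXT column suggests but doesn't guarantee):

```python
# Sketch: query the table with SQL (note the table name really does include "../").
from sqlite_utils import Database

db = Database("../cbs.db")
for row in db.query(
    """
    SELECT substr(publish_date, 1, 4) AS year, count(*) AS n
    FROM [../cbs_stories]
    GROUP BY year
    ORDER BY year
    """
):
    print(row["year"], row["n"])
```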
# Preview the first five rows
for row in islice(db[table_name].rows, 5):
    print(f"URL: {row['url']}")
    print(f"Title: {row['title']}")
    print(f"Date: {row['publish_date']}")
    print(f"Text preview: {row['text'][:100]}...\n")
# Option 1: Convert all data to a DataFrame
df = pd.DataFrame(list(db[table_name].rows))
# Option 2: If the table is very large, you might want to limit rows
# df = pd.DataFrame(list(islice(db[table_name].rows, 1000))) # first 1000 rows
# Print info about the DataFrame
print(f"DataFrame shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(df.head())
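From here, typical exploration might look something like the sketch below. It assumes the publish_date strings parse with pandas (which won't hold for every row) and that authors is a single comma-separated text field.

```python
# Sketch: quick exploration of the DataFrame built above.
# publish_date is stored as TEXT, so coerce to datetimes; rows that don't parse become NaT.
df["publish_date"] = pd.to_datetime(df["publish_date"], errors="coerce", utc=True)
dated = df.dropna(subset=["publish_date"])

# Articles per year
print(dated["publish_date"].dt.year.value_counts().sort_index())

# Most frequent bylines (authors is stored as a single text field)
print(df["authors"].value_counts().head(10))
```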