Top News! URLs from News Feeds of Major National News Sites (2022-)

We automatically pull daily news data from the feeds of major national news sites: ABC, CBS, CNN, LA Times, NBC, NPR, NYT, Politico, ProPublica, USA Today, and WaPo, using GitHub Workflows. For the latest data, please take a look at the respective JSON files.
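Under the hood, this is scheduled feed scraping. As a rough, hypothetical sketch of that step (not the repository's actual workflow script), using feedparser with an example feed URL and output file:

from pathlib import Path
import json

import feedparser

# Hypothetical example feed and output file; the real workflow covers
# all the sites listed above and writes per-site JSON files.
FEED_URL = "https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml"
OUT_FILE = Path("nyt.json")

# Load previously collected URLs so each daily run only adds new ones.
seen = set(json.loads(OUT_FILE.read_text())) if OUT_FILE.exists() else set()

feed = feedparser.parse(FEED_URL)
for entry in feed.entries:
    seen.add(entry.link)

OUT_FILE.write_text(json.dumps(sorted(seen), indent=2))
print(f"{len(seen)} unique URLs stored")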

As of March 2025, we have about 700k unique URLs.

Other Scripts + Data

  1. The script for aggregating the URLs and a March 2025 dump of the URLs (.zip)

  2. The script for downloading the article text, parsing features with newspaper3k (e.g., publication date, authors), and storing everything in a DB is here. The script checks the local DB first and only processes new data incrementally (see the sketch after this list).

  3. Newspaper3k can't parse USA Today, Politico, and ABC URLs. We use a custom Google search to dig up the URLs and get the data. The script is here.
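For item 2, the core of the download-and-parse step looks roughly like the following. This is a sketch, not the repository's script: the DB path, table name, and source label mirror the cbs.db example below, and the incremental check is an assumption based on the description above.

import datetime
from urllib.parse import urlparse

from newspaper import Article
from sqlite_utils import Database

db = Database("../cbs.db")
table = db["../cbs_stories"]

def process(url, source="cbs"):
    # Incremental processing: skip URLs already stored in the local DB.
    if table.exists() and list(table.rows_where("url = ?", [url])):
        return
    # Download and parse the article with newspaper3k.
    article = Article(url)
    article.download()
    article.parse()
    # Upsert one row keyed on the URL, matching the schema shown below.
    table.upsert(
        {
            "url": url,
            "source": source,
            "publish_date": str(article.publish_date),
            "title": article.title,
            "authors": ", ".join(article.authors),
            "text": article.text,
            "extraction_date": datetime.date.today().isoformat(),
            "domain": urlparse(url).netloc,
        },
        pk="url",
    )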

Getting Started: Exploring the Data

To explore a DB, here is some starter code (also available as a Jupyter NB):

from sqlite_utils import Database
from itertools import islice

db = Database("../cbs.db")
print("Tables:", db.table_names())
# Output: Tables: ['../cbs_stories']

Table Schema

db_file = "../cbs.db"
table_name = "../cbs_stories" # yup! it has the ../

db = Database(db_file)

schema = db[table_name].schema
print("Schema:\n")
print(schema)
Schema:

CREATE TABLE [../cbs_stories] (
   [url] TEXT PRIMARY KEY,
   [source] TEXT,
   [publish_date] TEXT,
   [title] TEXT,
   [authors] TEXT,
   [text] TEXT,
   [extraction_date] TEXT,
   [domain] TEXT
)

Sample Rows

for row in islice(db[table_name].rows, 5):
    print(f"URL: {row['url']}")
    print(f"Title: {row['title']}")
    print(f"Date: {row['publish_date']}")
    print(f"Text preview: {row['text'][:100]}...\n")

Exporting to Pandas

import pandas as pd

# Option 1: Convert all data to a DataFrame
df = pd.DataFrame(list(db[table_name].rows))

# Option 2: If the table is very large, you might want to limit rows
# df = pd.DataFrame(list(islice(db[table_name].rows, 1000)))  # first 1000 rows

# Print info about the DataFrame
print(f"DataFrame shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(df.head())
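Since publish_date is stored as TEXT, you will usually want to parse it before doing any time-series work. A small follow-on sketch, assuming the DataFrame built above:

# Parse the TEXT publish_date column into datetimes; unparseable values
# (e.g., "None") become NaT instead of raising.
df["publish_date"] = pd.to_datetime(df["publish_date"], errors="coerce")

# Example: stories per month (NaT values are dropped by value_counts)
print(df["publish_date"].dt.to_period("M").value_counts().sort_index().tail())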
