NOAA Fisheries’ Marine Recreational Information Program (MRIP) conducts annual recreational saltwater fishing surveys at the national level to estimate total recreational catch. This data is used to assess and maintain sustainable fish stocks. Survey data is available from 1981 to 2023.
In this project, survey data will be extracted from an NOAA website and loaded into a data warehouse. The data will then be transformed to make it ready for reporting and analytics. A web app will be used to interact with the transformed data and generate insights.
An end-to-end data product will be built, spanning the extract, load, and transform (ELT) of raw data through to dynamic, interactive visualizations in a web application. The high-level data flow, along with the technologies used, can be seen below:
- Go to the directory where the repo will be cloned
cd <directory>
- Clone repo to directory
git clone https://github.com/lopezj1/noaa_eda.git
- Switch to project directory
cd noaa_eda
- Create the nginx_proxy_manager_default network if it does not exist (this network is needed in production)
docker network inspect nginx_proxy_manager_default >/dev/null 2>&1 || docker network create nginx_proxy_manager_default
- Run docker compose to spin up the containers
docker compose up -d
- Visit the Prefect dashboard at http://localhost:4200
- Wait 1-2 minutes for Prefect Agent to start and Deployments to be created.
- Quick run the ingest flow from Deployments
- The default year range is 2018-2023 to keep the loading time shorter (~5 minutes).
- Quick run the dbt flow from Deployments
- Running all models will take about 5 minutes.
- If running Docker Engine on WSL, you may need to allocate more memory in .wslconfig
- Visit the Streamlit app at http://localhost:8501
- Visit the dbt docs at http://localhost:8080
Survey data is stored at:
https://www.st.nmfs.noaa.gov/st1/recreational/MRIP_Survey_Data/CSV/.
Data is stored as csv files inside zip folders cataloged by year and wave (if multiple surveys were taken that year). The Python script ingest_noaa.py handles the extract and load (EL) of the data. The EL pipeline consists of the following general steps:
- GET request to retrieve folders named by year and wave
- Unzip folders to extract csv files
- Copy csv files to the /tmp folder in the main project directory
- INSERT pandas dataframes into appropriate tables within a persistent DuckDB schema named raw
- To be memory efficient, each individual csv file is processed with the help of the dlt Python library; a sketch of this step is shown below.
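Below is a minimal sketch of this EL step, assuming dlt's DuckDB destination; the helper function, file handling, and table names are illustrative and not the exact contents of ingest_noaa.py:

```python
# Illustrative sketch of the extract-and-load (EL) step; names are assumptions,
# not the project's actual ingest_noaa.py code.
import io
import zipfile

import dlt
import pandas as pd
import requests

BASE_URL = "https://www.st.nmfs.noaa.gov/st1/recreational/MRIP_Survey_Data/CSV/"

def load_zip(zip_name: str) -> None:
    """Download one year/wave zip folder and load its csv files into DuckDB."""
    resp = requests.get(BASE_URL + zip_name, timeout=120)
    resp.raise_for_status()

    # dlt pipeline writing into a persistent DuckDB schema named raw
    pipeline = dlt.pipeline(
        pipeline_name="noaa_mrip",
        destination="duckdb",
        dataset_name="raw",
    )

    with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
        for csv_name in zf.namelist():
            if not csv_name.endswith(".csv"):
                continue
            # process one csv at a time to keep memory usage bounded
            df = pd.read_csv(zf.open(csv_name), low_memory=False)
            table = csv_name.rsplit("/", 1)[-1].removesuffix(".csv")
            pipeline.run(df, table_name=table)
```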
After loading the source data into DuckDB, dbt data models were created to transform the raw data into feature-rich tables in a separate schema named analytics. Documentation for this dbt project can be found at http://localhost:8080.
The ELT pipeline was orchestrated using Prefect. This allows monitoring of tasks and flows within the pipeline, as well as customization of the input year ranges to process. The Prefect dashboard can be accessed at http://localhost:4200.
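For illustration, here is a hedged sketch of how the flow and its year-range parameters might look in Prefect; the task bodies are placeholders rather than the project's actual flow code:

```python
# Hypothetical Prefect flow; task bodies are placeholders, not the real pipeline.
from prefect import flow, task

@task
def ingest_year(year: int) -> None:
    """Extract and load one year of survey data (see the EL sketch above)."""
    ...

@task
def run_dbt_models() -> None:
    """Run the dbt models that build the analytics schema."""
    ...

@flow(name="noaa-elt")
def noaa_elt(start_year: int = 2018, end_year: int = 2023) -> None:
    # The year range is a flow parameter, so it can be customized per run.
    for year in range(start_year, end_year + 1):
        ingest_year(year)
    run_dbt_models()

if __name__ == "__main__":
    noaa_elt()
```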
A web app was built using Streamlit to allow for self-serve analytics. The web app can be accessed at http://localhost:8501.
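As a rough idea of what a self-serve view could look like, the snippet below assumes a hypothetical analytics.catch_summary table in the DuckDB file; the actual database path, model, and column names may differ:

```python
# Minimal Streamlit sketch; database path, table, and columns are assumptions.
import duckdb
import streamlit as st

st.title("MRIP Recreational Catch Explorer")

con = duckdb.connect("noaa.duckdb", read_only=True)

year = st.slider("Survey year", min_value=2018, max_value=2023, value=2023)
df = con.execute(
    "SELECT * FROM analytics.catch_summary WHERE year = ?", [year]
).df()

st.dataframe(df)
```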
Future work could bring in tidal, weather, and lunar data via APIs in conjunction with this survey data to create predictive ML models of catch success rate.
- About NOAA MRIP
- NOAA MRIP Survey Data
- dbt
- Prefect
- dlt
- duckdb
- streamlit