Practical financial data science examples applying statistics, time series analysis, graph analytics, backtesting, machine learning, natural language processing, neural networks and LLMs

terence-lim/financial-data-science-notebooks

FINANCIAL DATA SCIENCE

As financial markets produce vast volumes of structured and unstructured data, the ability to extract insights and develop predictive models has become increasingly important. Financial Data Science Python Notebooks provide a practical guide for analysts, researchers, and data scientists looking to apply Python and its broad ecosystem of libraries, tools, frameworks, and community resources to financial analysis, econometrics, and machine learning.

Designed to support financial data science workflows, the companion FinDS Python package demonstrates how to use SQL databases and data stores such as Redis and MongoDB to manage and access large datasets, including:

  • Core financial databases such as CRSP, Compustat, IBES, and TAQ

  • Public economic data APIs from sources like FRED and the Bureau of Economic Analysis (BEA)

  • Structured and unstructured data from academic and research websites
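As a rough illustration of the database-backed access pattern described above, here is a minimal sketch using Python's built-in `sqlite3` module. It is not the FinDS API: the `prices` table, its columns, the helper names `build_price_store` and `closes_between`, and the sample tickers and values are all hypothetical.

```python
import sqlite3

# Hypothetical schema and sample values, for illustration only (not the FinDS API).
def build_price_store(rows):
    """Load (ticker, date, close) tuples into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE prices (ticker TEXT, date TEXT, close REAL, "
        "PRIMARY KEY (ticker, date))"
    )
    conn.executemany("INSERT INTO prices VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn

def closes_between(conn, ticker, start, end):
    """Return (date, close) pairs for one ticker within an ISO date range."""
    cur = conn.execute(
        "SELECT date, close FROM prices "
        "WHERE ticker = ? AND date BETWEEN ? AND ? ORDER BY date",
        (ticker, start, end),
    )
    return cur.fetchall()

sample_rows = [
    ("AAA", "2025-01-02", 100.0),
    ("AAA", "2025-01-03", 101.5),
    ("BBB", "2025-01-02", 50.0),
]
conn = build_price_store(sample_rows)
print(closes_between(conn, "AAA", "2025-01-01", "2025-01-31"))
```

Storing dates as ISO-formatted strings keeps the range filter correct under plain string comparison. A production store for CRSP-scale data would more likely sit on a server-based SQL engine.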

In addition to data access, it provides practical examples and templates for applying:

  • Financial econometrics and time series modeling

  • Graph analytics, event studies, and backtesting strategies

  • Machine learning for predictive analytics

  • Natural language processing (NLP) to extract insights from financial text

  • Neural networks and large language models (LLMs) for advanced decision-making
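To make the backtesting item above concrete, the following is a toy sketch of a one-period momentum rule: go long after an up move and short after a down move. The function names and the price series are invented for illustration, and the rule is far simpler than the strategies the notebooks cover.

```python
def simple_returns(prices):
    """Period-over-period simple returns from a list of prices."""
    return [p1 / p0 - 1.0 for p0, p1 in zip(prices, prices[1:])]

def backtest_momentum(prices):
    """Return per-period strategy returns for a sign-of-last-return rule.

    The position for each period is set from the previous period's return
    only, so the rule uses no look-ahead information.
    """
    rets = simple_returns(prices)
    strategy = []
    for prev, nxt in zip(rets, rets[1:]):
        position = (prev > 0) - (prev < 0)  # +1 long, -1 short, 0 flat
        strategy.append(position * nxt)
    return strategy

def cumulative_return(returns):
    """Compound a list of simple returns into one total return."""
    total = 1.0
    for r in returns:
        total *= 1.0 + r
    return total - 1.0

# Hypothetical price path: two up moves followed by two down moves.
prices = [100.0, 110.0, 121.0, 108.9, 98.01]
strategy_returns = backtest_momentum(prices)
print(strategy_returns)
print(cumulative_return(strategy_returns))
```

The sketch ignores transaction costs and position sizing, both of which matter in a realistic backtest.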

March 2025: Updated with data through early 2025 and incorporated the latest LLMs: Microsoft Phi-4-multimodal (February 2025), Google Gemma-3-12B (March 2025), DeepSeek-R1-14B (January 2025), Meta Llama-3.1-8B (July 2024), and OpenAI GPT-4o-mini (July 2024).

Topics

| Notebook | Financial | Data | Science |
|---|---|---|---|
| 1.1_stock_prices | Stock price properties | CRSP stocks | Statistical moments |
| 1.2_jegadeesh_titman | Price momentum | CRSP stocks | Hypothesis testing, Newey-West estimator |
| 1.3_fama_french | Value and size | CRSP stocks, Compustat | Linear regression |
| 1.4_fama_macbeth | CAPM | Fama-French | Non-linear regression, Quadratic optimization |
| 1.5_contrarian_trading | Mean reversion, Implementation shortfall | CRSP stocks | Structural breaks |
| 1.6_quant_factors | Factor investing, Backtesting | CRSP stocks, Compustat, IBES | Cluster analysis |
| 1.7_event_study | Event studies | S&P key developments | Multiple testing, Fourier transforms |
| 2.1_economic_indicators | Economic data revisions, Employment payrolls | ALFRED | Outlier detection |
| 2.2_regression_diagnostics | Consumer and producer prices | FRED | Linear regression diagnostics |
| 2.3_time_series | Industrial production and inflation | FRED | Time series analysis |
| 2.4_approximate_factors | Approximate factor models | FRED-MD | Unit root test, EM Algorithm |
| 2.5_economic_states | State space models | FRED-MD | Gaussian mixture, hidden Markov models |
| 3.1_term_structure | Interest rates | FRED yield curve | Low-rank approximation |
| 3.2_bond_returns | Bond risk factors | FRED bond returns | Principal component analysis |
| 3.3_options_pricing | Binomial tree, Black-Scholes-Merton | simulated | Monte Carlo simulations |
| 3.4_value_at_risk | Value-at-risk | FRED crypto-currencies | Conditional volatility |
| 3.5_covariance_matrix | Portfolio risk | Fama-French industries | Covariance matrix estimation |
| 3.6_market_microstructure | Market liquidity | TAQ tick data | High frequency volatility |
| 3.7_event_risk | Earnings expectations | IBES | Poisson regression, generalized linear model |
| 4.1_network_graphs | Supply chain | Compustat principal customers | Network graphs |
| 4.2_community_detection | Industry taxonomy | Hoberg-Phillips | Community detection |
| 4.3_graph_centrality | Input-output uses | Bureau of Economic Analysis | Graph centrality |
| 4.4_link_prediction | Product markets | Hoberg-Phillips | Link prediction |
| 4.5_spatial_regression | Earnings surprises | IBES, Hoberg-Phillips | Spatial regression |
| 5.1_fomc_topics | FOMC meetings | Federal Reserve | Topic modeling |
| 5.2_management_sentiment | Management discussions | SEC EDGAR, Loughran-McDonald | Sentiment analysis |
| 5.3_business_textual | Business descriptions | SEC EDGAR | Part-of-speech, Density-based clustering |
| 6.1_classification_models | Industry classification | SEC EDGAR | Classification |
| 6.2_regression_models | Macroeconomic forecasts | FRED-MD | Regression |
| 6.3_deep_learning | Industry classification | SEC EDGAR | Neural networks, word embeddings |
| 6.4_convolutional_net | Macroeconomic forecasts | FRED-MD | Convolutional neural nets, vector autoregression |
| 6.5_recurrent_net | Macroeconomic forecasts | FRED-MD | Recurrent neural nets, dynamic factor models |
| 6.6_reinforcement_learning | Retirement spending | SBBI | Reinforcement learning |
| 6.7_language_modeling | Fedspeak | Federal Reserve | Language modeling, Transformers |
| 7.1_large_language_models | Market risk disclosures | SEC EDGAR | Text summarization |
| 7.2_llm_finetuning | Industry classification | SEC EDGAR | LLM fine-tuning |
| 7.3_llm_prompting | Financial news sentiment | Kaggle | Prompt engineering |
| 7.4_llm_agents | Corporate philanthropy | MVCP textbook | Multi-agents, chatbots, retrieval-augmented generation |
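
As one example of the techniques listed for 3.3_options_pricing, the sketch below prices a European call with the Black-Scholes-Merton closed form and checks it against a Monte Carlo estimate on simulated terminal prices. It is an independent illustration rather than code from the notebook. The function names, parameter values, path count, and seed are assumptions.

```python
import math
import random

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bsm_call(spot, strike, rate, vol, maturity):
    """Black-Scholes-Merton price of a European call (no dividends)."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol ** 2) * maturity) / (
        vol * math.sqrt(maturity)
    )
    d2 = d1 - vol * math.sqrt(maturity)
    return spot * norm_cdf(d1) - strike * math.exp(-rate * maturity) * norm_cdf(d2)

def mc_call(spot, strike, rate, vol, maturity, n_paths=100_000, seed=0):
    """Monte Carlo call price under risk-neutral geometric Brownian motion."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    drift = (rate - 0.5 * vol ** 2) * maturity
    diffusion = vol * math.sqrt(maturity)
    total_payoff = 0.0
    for _ in range(n_paths):
        terminal = spot * math.exp(drift + diffusion * rng.gauss(0.0, 1.0))
        total_payoff += max(terminal - strike, 0.0)
    # Discount the average payoff back to today.
    return math.exp(-rate * maturity) * total_payoff / n_paths

# Hypothetical inputs: at-the-money call, 5% rate, 20% volatility, one year.
closed_form = bsm_call(100.0, 100.0, 0.05, 0.2, 1.0)
simulated = mc_call(100.0, 100.0, 0.05, 0.2, 1.0)
print(round(closed_form, 4), round(simulated, 4))
```

With these inputs the closed form gives about 10.45. The Monte Carlo estimate should land close to it, with sampling error shrinking roughly as one over the square root of the number of paths.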

Documentation

GitHub repos

Contact

https://terence-lim.github.io