Digital Alchemy: Molecular Absorption Wavelength Prediction

A Random Forest regression model for predicting absorption wavelengths corresponding to a maximum in the spectrum of molecules using their structural information (SMILES representation).

Project Overview

This project implements a machine learning pipeline to predict molecular absorption wavelength corresponding to a maximum in the spectrum using Random Forest regression. It processes molecular spectral data, extracts features using Morgan fingerprints, and trains a model to predict the primary absorption wavelength.

File Structure

.
├── config.py                # Configuration parameters and settings
├── data/
│   ├── processed/         # Processed datasets directory
│   ├── raw/               # Raw data directory
│   │   └── PhotochemCAD/  # Raw spectral data
├── notebooks/
│   └── rf_model_pipeline.ipynb  # Model training and evaluation notebook
└── src/
    ├── data_processing.py     # Data processing and feature engineering
    ├── model_training.py      # Model training and hyperparameter optimization
    └── utils.py               # Utility functions and helpers

Data Processing Pipeline

1. Data Processing (`src/data_processing.py`)

Processes raw absorption spectra from .abs.txt files
Extracts primary absorption peak
Retrieves SMILES representations from PubChem using batch processing
Implements parallel processing for file handling
Performs outlier detection using IQR method for wavelength and absorption values
Creates processed molecular datasets with:
- Molecule identifiers (CAS, Name)
- SMILES representation
- Primary absorption wavelength
- Outlier flags for wavelength and absorption values

2. Model Training (`src/model_training.py`)

Converts SMILES to canonical form
Generates 1024-bit Morgan fingerprints
Extracts primary absorption peak
Implements stratified train/test splitting by molecule (CAS number)
- Ensures balanced distribution of outliers between splits
- Prevents data leakage across splits

Model Evaluation (`notebooks/rf_model_pipeline.ipynb`)

Features

Morgan fingerprints (1024 bits)
Primary wavelength target

Model Architecture

Random Forest Regressor with optimized hyperparameters:
- n_estimators: 134
- max_depth: 78
- max_features: 0.3007
- min_samples_split: 2
- bootstrap: True

Hyperparameter Optimization

The model supports multiple optimization strategies:

Default Mode: Uses pre-optimized parameters for stable performance
Broad Search: RandomizedSearchCV for extensive parameter space exploration
Advanced Optimization: Optuna integration for sophisticated hyperparameter tuning

Performance Metrics

Mean Squared Error (MSE)
R-squared (R²)
Visualization of predictions vs. actuals
Error distribution analysis

Usage

Dataset Setup

Before running the data processing script, you need to download and set up the PhotochemCAD dataset from https://photochemcad.com/download:

Download the PhotochemCAD dataset
After extraction, place the dataset in the data/raw directory as 'PhotochemCAD' folder
Ensure the 'PhotochemCAD' folder contains the 'Common Compounds' folder which includes .abs.txt files

Data Processing and Model Training

Via jupyter notebook:

# Run the whole pipeline
jupyter notebook notebooks/rf_model_pipeline.ipynb

Alternatively:

Data Processing:

# Process raw spectral data and create molecular dataset
python -m src.data_processing

Train Model:

# Train and evaluate the Random Forest model
python -m src.model_training

Optimization Methods

To use different optimization strategies:

# For stable use (pre-optimized parameters)
model = train_and_tune_rf(X_train, y_train, optimization_method='final')

# For broad parameter space exploration
model = train_and_tune_rf(X_train, y_train, optimization_method='broad_search')

# For Optuna optimization
model = train_and_tune_rf(X_train, y_train, optimization_method='optuna')

Dependencies

Python 3.x
pandas
numpy
scikit-learn
RDKit
matplotlib
seaborn
scikit-optimize
PubChemPy
optuna
Jupyter Notebook

License

See the LICENSE file for details.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Digital Alchemy: Molecular Absorption Wavelength Prediction

Project Overview

File Structure

Data Processing Pipeline

1. Data Processing (`src/data_processing.py`)

2. Model Training (`src/model_training.py`)

Model Evaluation (`notebooks/rf_model_pipeline.ipynb`)

Features

Model Architecture

Hyperparameter Optimization

Performance Metrics

Usage

Dataset Setup

Data Processing and Model Training

Optimization Methods

Dependencies

License

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
data		data
notebooks		notebooks
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
config.py		config.py
requirements.txt		requirements.txt

License

madahian/abs_spectra

Folders and files

Latest commit

History

Repository files navigation

Digital Alchemy: Molecular Absorption Wavelength Prediction

Project Overview

File Structure

Data Processing Pipeline

1. Data Processing (src/data_processing.py)

2. Model Training (src/model_training.py)

Model Evaluation (notebooks/rf_model_pipeline.ipynb)

Features

Model Architecture

Hyperparameter Optimization

Performance Metrics

Usage

Dataset Setup

Data Processing and Model Training

Optimization Methods

Dependencies

License

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

1. Data Processing (`src/data_processing.py`)

2. Model Training (`src/model_training.py`)

Model Evaluation (`notebooks/rf_model_pipeline.ipynb`)

Packages