mjdoom16/COF_Database_Project


COF Database project

The data and its context can be found in this paper: https://pubs.acs.org/doi/epdf/10.1021/acs.chemmater.8b01425?ref=article_openPDF

-- Project Status: [On-hold]

Project Intro/Objective

This project explores a dataset of covalent organic frameworks (COFs) and attempts to predict their methane uptake capacity with a regression model.

Project Description

First, I visualised the data to understand its trends and outliers. This included:

  • a report of the min and max of each variable, and of all categorical variables
  • boxplots of continuous values
  • histograms of discrete values
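
The summary report could look something like the sketch below. It is not the project's code: the toy table, its column names (apart from bond_type, which the README mentions) and the bond-type labels are invented for illustration, and the min/max report is assumed to cover the numeric columns.

```python
import pandas as pd


def summarise(df: pd.DataFrame) -> dict:
    """Report min/max of numeric columns and the levels of categorical ones."""
    numeric = df.select_dtypes(include="number")
    categorical = df.select_dtypes(exclude="number")
    return {
        # One row per numeric column, with "min" and "max" columns.
        "min_max": numeric.agg(["min", "max"]).T,
        # Sorted list of distinct values for every non-numeric column.
        "categories": {col: sorted(categorical[col].unique()) for col in categorical},
    }


# Toy stand-in for the COF table; names and values are made up.
df = pd.DataFrame({
    "supercell_volume": [1200.0, 3400.0, 560.0, 9100.0],
    "n_atoms": [120, 340, 64, 880],
    "bond_type": ["imine", "boroxine", "imine", "triazine"],
})

report = summarise(df)
print(report["min_max"])
print(report["categories"])
```

With matplotlib installed, pandas' own `df.boxplot()` and `df.hist()` are one way to produce the boxplots and histograms listed above.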

Next, sns.relplot() was used to plot each predictor against the response (y = AbsMU_high_P_[molec/unit_cell]), with points color-coded by bond type.
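
A minimal sketch of such a plot, assuming the predictors are first melted into long form so that each one gets its own panel. The data is synthetic, the two predictor names and the bond-type labels are placeholders, and the original notebook may have laid the figure out differently.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display

import numpy as np
import pandas as pd
import seaborn as sns

TARGET = "AbsMU_high_P_[molec/unit_cell]"

# Synthetic stand-in for the COF table (hypothetical predictors and labels).
rng = np.random.default_rng(0)
n = 60
df = pd.DataFrame({
    "supercell_volume": rng.uniform(500, 10000, n),
    "density": rng.uniform(0.1, 1.5, n),
    "bond_type": rng.choice(["A", "B", "C"], n),
})
df[TARGET] = 0.01 * df["supercell_volume"] + rng.normal(0, 5, n)

# Long form: one row per (sample, predictor) pair.
long_df = df.melt(
    id_vars=[TARGET, "bond_type"], var_name="predictor", value_name="value"
)

# One scatter panel per predictor, colored by bond type.
g = sns.relplot(
    data=long_df,
    x="value",
    y=TARGET,
    hue="bond_type",
    col="predictor",
    col_wrap=2,
    facet_kws={"sharex": False},  # predictors live on very different scales
)
```

The returned FacetGrid can be written to disk with `g.savefig(...)`.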

The data was then split into X and y, and a random forest was used to rank feature importance by mean decrease in impurity. This was done to reduce the dimensionality from p = 1116. A threshold of 0.001 was used to choose the important features; supercell volume was the most important.
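
The selection step can be sketched as follows. This is an illustration on synthetic data, not the project's code: the feature names other than supercell volume are placeholders, and "a threshold of 0.001" is assumed to mean keeping features whose importance is at least 0.001.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

THRESHOLD = 0.001  # importance cut-off quoted in the README

# Synthetic data: only the first column actually drives the response.
rng = np.random.default_rng(0)
n_samples = 300
names = ["supercell_volume"] + [f"feature_{i}" for i in range(1, 20)]
X = pd.DataFrame(rng.normal(size=(n_samples, len(names))), columns=names)
y = 5.0 * X["supercell_volume"] + rng.normal(scale=0.1, size=n_samples)

forest = RandomForestRegressor(random_state=0).fit(X, y)

# feature_importances_ is the impurity-based (mean decrease in impurity) score.
importances = pd.Series(forest.feature_importances_, index=X.columns)
ranked = importances.sort_values(ascending=False)

# Assumption: "threshold" means importance >= THRESHOLD.
selected = ranked[ranked >= THRESHOLD].index.tolist()
X_reduced = X[selected]

print(ranked.head())
print(f"kept {len(selected)} of {X.shape[1]} features")
```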

Several algorithms were then compared. Each was trialled with default parameters under RepeatedKFold cross-validation (n_splits = 5, n_repeats = 10). The algorithms, all from sklearn unless noted, were:

  • Linear Regression
  • Decision Tree
  • ensemble methods:
    • Random Forest
    • AdaBoost
    • Bagging
    • GradientBoosting
    • XGBoost (from the XGBoost library)
  • SVR
  • KNeighbors
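
A comparison loop along these lines might look like the sketch below. It runs on synthetic data and is not the project's notebook. XGBoost is left out so that the example depends only on scikit-learn, and random_state is fixed on the stochastic models purely for reproducibility (otherwise the parameters are the defaults).

```python
import numpy as np
from sklearn.ensemble import (
    AdaBoostRegressor,
    BaggingRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_validate
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem with a linear signal (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=120)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(random_state=0),
    "AdaBoost": AdaBoostRegressor(random_state=0),
    "Bagging": BaggingRegressor(random_state=0),
    "GradientBoosting": GradientBoostingRegressor(random_state=0),
    "SVR": SVR(),
    "KNeighbors": KNeighborsRegressor(),
}

# Same splitter settings as quoted above: 5 folds, repeated 10 times.
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)

results = {}
for name, model in models.items():
    scores = cross_validate(
        model, X, y, cv=cv,
        scoring="neg_root_mean_squared_error",
        return_train_score=True,
    )
    # scikit-learn reports negated errors, so flip the sign back.
    results[name] = {
        "train_rmse": -scores["train_score"].mean(),
        "val_rmse": -scores["test_score"].mean(),
    }

for name, r in sorted(results.items(), key=lambda kv: kv[1]["val_rmse"]):
    print(f"{name:18s} train RMSE {r['train_rmse']:.3f}  validation RMSE {r['val_rmse']:.3f}")
```

On this toy problem the linear model wins because the signal is linear by construction; that says nothing about the ranking on the COF data.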

Evaluation of each algorithm included:

  • Metrics (average train and validation scores are printed)
    • mean_absolute_error
    • mean_squared_error
    • root_mean_squared_error
  • Plots
    • Learning curves (scoring = RMSE)
    • Prediction plots of simulated data against predicted data
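
The metrics and learning-curve data listed above can be computed roughly as follows. This is a sketch on synthetic data: RMSE is taken as the square root of the MSE so that it works across scikit-learn versions, and the forest size and train_sizes grid are arbitrary choices made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import learning_curve


def regression_metrics(y_true, y_pred) -> dict:
    """Return MAE, MSE and RMSE for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "mae": mean_absolute_error(y_true, y_pred),
        "mse": mse,
        "rmse": float(np.sqrt(mse)),
    }


print(regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))

# Synthetic data for the learning curve (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=0.2, size=150)

sizes, train_scores, val_scores = learning_curve(
    RandomForestRegressor(n_estimators=50, random_state=0),
    X, y,
    cv=5,
    scoring="neg_root_mean_squared_error",
    train_sizes=np.linspace(0.2, 1.0, 5),
)

# Average over folds and undo scikit-learn's sign convention.
train_rmse = -train_scores.mean(axis=1)
val_rmse = -val_scores.mean(axis=1)

for n_train, tr, va in zip(sizes, train_rmse, val_rmse):
    print(f"n={n_train:3d}  train RMSE {tr:.3f}  validation RMSE {va:.3f}")
```

Plotting train_rmse and val_rmse against sizes gives the learning curve; a prediction plot is a scatter of predicted values against the simulated ones.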

Random Forest performed best, so it was selected. Its hyperparameters were tuned with Optuna, and the tuned model was evaluated on a held-out test set.

final model

Future Work

  • Do a more in-depth search with classification:
    • multinomial classification of qualitative values
      • bond_type (K=5)
      • parent network (K=309)
    • evaluation of 2D and 3D COFs
    • unsupervised learning
      • clustering

Objective

  • Curate a large dataset
  • Train an ML algorithm to predict the target property
  • Select the optimal algorithm for the material representation
  • Validate the algorithm
  • Develop an assessment protocol informed by the construction of the model
