ViCom CollabMap

ViCom CollabMap is a data science project designed to visually present existing collaborations within the ViCom research group and uncover potential new connections. This project processes researcher data, extracts expertise, and generates potential collaborations, all presented in an interactive web-based map.

Sometimes you want to know who else is quietly exploring eye-tracking or diving deep into Bayesian stats, without turning into Sherlock Holmes. ViCom CollabMap is here to save the day, letting you see where collaborations already exist and where they might blossom if only you knew the right people.

Tools and Techniques Used

Python: Core language for all data processing.
Jupyter Notebooks: For step-by-step data extraction and cleaning.
APIs:
- OpenAlex API for fetching publication data.
- OpenAI API for extracting expertise and keywords using NLP.
Streamlit: To build the interactive web app.
Folium: For creating dynamic, interactive maps.
Pandas & NumPy: For data manipulation and analysis.
Fuzzy Matching (via FuzzyWuzzy): To handle similar keywords and standardize terms.

(P.S. This entire project was built in just three weeks. So please forgive the occasional quirk or feature that’s still “under construction.”)

Overview

This project was completed over the span of 3 weeks, primarily focusing on data processing and a bit of web-based visualization. It aims to:

Collect researcher info and publication data (via OpenAlex).
Generate meaningful keywords with the help of OpenAI.
Identify potential synergies between researchers.
Show existing and potential collaborations on a lovely interactive map, courtesy of Streamlit + Folium.

Project Workflow

Data Gathering
Researchers’ details (names, affiliations, bios, and geolocation) were scraped from the ViCom website. Then, using OpenAlex, 1,264 recent publications were fetched.
Data Preprocessing & Cleaning
Publications were cleaned for duplicates, missing titles, etc., and merged with geolocation data for each researcher.
Expertise Extraction
OpenAI was then employed to read through abstracts and produce relevant themes and expert keywords. Who doesn’t want a robot telling us our life’s work in bullet points?
Similarity Computation
I tally up how many keywords each pair of researchers shares—like counting how many matching socks you’ve got in the laundry.
Web App Visualization
Finally, I created a Streamlit + Folium web app to:
- Show a map of where each researcher is located.
- Draw lines for existing collaborations (blue/green lines).
- Visualize potential collaborations via purple arcs.

Detailed Notebook Workflow

02.Extract_Publication_data.ipynb

Input: A CSV of researchers (names, affiliations).
Process:
- Queries OpenAlex for each name.
- Fetches publication data (title, abstract, authors, year) from the last 5 years.
- Reconstructs abstracts from OpenAlex’s “inverted index” (like puzzle-solving with words).
Output: A CSV (03.ViCom_Publications_OpenAlex.csv) with all the extracted data.
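
The abstract reconstruction step can be sketched as follows. This is a minimal illustration, assuming the inverted index arrives as a mapping from each word to the list of positions where it occurs; the function name `rebuild_abstract` and the sample data are hypothetical and not taken from the notebook.

```python
def rebuild_abstract(inverted_index):
    """Turn a {word: [positions]} inverted index back into plain text."""
    if not inverted_index:
        return ""  # some works have no abstract at all
    positioned = []
    for word, positions in inverted_index.items():
        for pos in positions:
            positioned.append((pos, word))
    positioned.sort()  # order words by their position in the abstract
    return " ".join(word for _, word in positioned)


# Hypothetical sample; a repeated word simply lists several positions.
example_index = {"the": [0, 2], "cat": [1], "dog": [3]}
abstract = rebuild_abstract(example_index)
print(abstract)
```

Works without an abstract come back as an empty string, which the cleaning notebook can then treat as a missing value.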

03.Clean_Publication_data.ipynb

Input: The raw file from the previous notebook.
Process:
- Fills in missing columns (title, abstract).
- Removes duplicates, combines each publication’s fields into a single “Text” field, and merges the result with researcher bios.
Output: A cleaned CSV (03.ViCom_Publications_OpenAlex_Cleaned.csv) and a processed file (06. Processed_Researcher_Data.csv).
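
A rough sketch of the cleaning logic, written with plain dictionaries so it runs without extra dependencies (the notebook itself presumably does this with Pandas). The field names `researcher`, `title`, and `abstract`, the function name, and the sample rows are all assumptions made for illustration.

```python
from collections import defaultdict


def build_researcher_text(publications, bios):
    """Fill missing fields, drop duplicate publications, and build one text per researcher."""
    seen = set()
    texts = defaultdict(list)
    for pub in publications:
        # Treat missing titles/abstracts as empty strings.
        title = (pub.get("title") or "").strip()
        abstract = (pub.get("abstract") or "").strip()
        key = (pub["researcher"], title.lower(), abstract.lower())
        if key in seen:
            continue  # duplicate publication for this researcher
        seen.add(key)
        text = " ".join(part for part in (title, abstract) if part)
        if text:
            texts[pub["researcher"]].append(text)
    # Prepend the bio (if any) to the combined publication text.
    return {
        name: " ".join([bios.get(name, "")] + parts).strip()
        for name, parts in texts.items()
    }


# Hypothetical sample data.
publications = [
    {"researcher": "A", "title": "Gesture study", "abstract": None},
    {"researcher": "A", "title": "Gesture study", "abstract": None},
    {"researcher": "A", "title": "Eye tracking", "abstract": "We track eyes."},
]
bios = {"A": "Works on gesture."}
combined = build_researcher_text(publications, bios)
print(combined["A"])
```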

04.Data_processing_with_OpenAI.ipynb

Input: The merged file of abstracts + bios.
Process:
- Splits each researcher’s text into smaller chunks (avoiding token-limit drama).
- Calls OpenAI to extract keywords: “Themes” vs. “Expertise.”
- Combines them back into a single row per researcher.
Output: A CSV (08.researchers_with_themes_expertise_openai.csv) listing each researcher’s themes and expertise.
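
The chunking step might look roughly like this. It is a sketch under assumptions: `MAX_CHARS_PER_CHUNK` is the setting named in this README, but the value below and the helper name `split_into_chunks` are made up for the example, and the actual notebook may split text differently.

```python
MAX_CHARS_PER_CHUNK = 3000  # assumed value; tune it to your model's context limit


def split_into_chunks(text, max_chars=MAX_CHARS_PER_CHUNK):
    """Split text on whitespace into chunks of at most max_chars characters.

    A single word longer than max_chars is kept whole in its own chunk.
    """
    chunks, current, length = [], [], 0
    for word in text.split():
        # +1 accounts for the joining space when the chunk already has words.
        extra = len(word) + (1 if current else 0)
        if current and length + extra > max_chars:
            chunks.append(" ".join(current))
            current, length = [], 0
            extra = len(word)
        current.append(word)
        length += extra
    if current:
        chunks.append(" ".join(current))
    return chunks


chunks = split_into_chunks("alpha beta gamma delta", max_chars=11)
print(chunks)
```

Each chunk would then be sent to OpenAI separately, and the returned keywords merged back into one row per researcher.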

05.Clean_OpenAI_data.ipynb

Input: The raw OpenAI keyword data.
Process:
- Applies a synonym map so that variant forms of a term (for example, variants of “sign language”) collapse into one standard keyword.
- Uses fuzzy matching to remove near-duplicates.
- Drops generic or irrelevant terms (“education,” “research focus,” etc.).
Output: A tidy CSV (08.researchers_with_themes_expertise_cleaned.csv) with standardized keywords.
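
The three cleaning steps above can be illustrated with a small sketch. Note the swap: it uses `difflib.SequenceMatcher` from the standard library as a stand-in for FuzzyWuzzy so that it runs without extra installs. The synonym map, the 0.9 threshold, and the function name are hypothetical; the generic terms are the two examples mentioned above.

```python
from difflib import SequenceMatcher

# Hypothetical synonym map: variant -> standard form.
SYNONYMS = {"sign languages": "sign language", "eyetracking": "eye-tracking"}
# Generic terms to drop (the two examples from this README).
GENERIC_TERMS = {"education", "research focus"}


def standardize_keywords(keywords, threshold=0.9):
    """Lowercase, map synonyms, drop generic terms, and skip near-duplicates."""
    kept = []
    for raw in keywords:
        term = raw.strip().lower()
        term = SYNONYMS.get(term, term)
        if not term or term in GENERIC_TERMS:
            continue
        # Skip the term if it is very similar to one we already kept.
        if any(SequenceMatcher(None, term, k).ratio() >= threshold for k in kept):
            continue
        kept.append(term)
    return kept


cleaned = standardize_keywords(
    ["Sign languages", "sign language", "Education",
     "lexical semantics", "lexical semantic", "EEG"]
)
print(cleaned)
```

The threshold is a judgment call: set it too low and distinct terms get merged, too high and near-duplicates slip through.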

06.Similarity.ipynb

Input: Cleaned expertise data.
Process:
- For each pair of researchers, counts overlapping keywords (Themes + Expertise).
- Saves the synergy count and the list of matching keywords (like “EEG,” “lexical semantics,” etc.).
Output: A CSV (09.potential_collaborations.csv) listing synergy scores for every possible pair.
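
As a minimal sketch of the pairwise count, assuming each researcher's Themes and Expertise have already been merged into one keyword list. The researcher names, the column names in the output rows, and the function name are invented for the example.

```python
from itertools import combinations


def synergy_scores(keywords_by_researcher):
    """Count shared keywords for every unordered pair of researchers."""
    rows = []
    for a, b in combinations(sorted(keywords_by_researcher), 2):
        shared = sorted(set(keywords_by_researcher[a]) & set(keywords_by_researcher[b]))
        rows.append({
            "researcher_a": a,
            "researcher_b": b,
            "synergy": len(shared),
            "shared_keywords": shared,
        })
    return rows


# Hypothetical researchers and keywords.
scores = synergy_scores({
    "Ana": ["eeg", "lexical semantics", "gesture"],
    "Ben": ["eeg", "gesture", "prosody"],
    "Cleo": ["bayesian statistics"],
})
for row in scores:
    print(row)
```

With n researchers this produces n(n-1)/2 rows, so the output grows quadratically with the size of the group.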

app.py

Input: Multiple CSVs (researcher info, collaborations, synergy scores).
Process:
- Streamlit app with optional password protection.
- Builds an interactive Folium map with researcher markers, lines for existing collaborations, and arcs for potential ones.
- Lets you filter by project type or synergy threshold.
Output: Shiny interactive map and tables of who’s working (or should be working) with whom.

How to Use Each Notebook

02.Extract_Publication_data.ipynb
- Update the file paths to point at your researcher CSV.
- Make sure your internet is up—OpenAlex can’t be reached by carrier pigeon.
- Run all cells to get a CSV of each researcher’s publications.
03.Clean_Publication_data.ipynb
- Point it to the CSV from the previous step.
- Run all cells to produce cleaned data and a combined “Text” field.
04.Data_processing_with_OpenAI.ipynb
- Insert your OpenAI API key (replace "xxx").
- Adjust MAX_CHARS_PER_CHUNK if your texts are huge.
- Run all cells; wait for GPT to parse your data.
05.Clean_OpenAI_data.ipynb
- Load the new CSV from the OpenAI pipeline.
- Click “Run” and watch it unify synonyms and remove fluff.
06.Similarity.ipynb
- Uses the final “cleaned” data.
- Run it to get synergy scores for each pair of researchers.
app.py
- Check you have Streamlit and Folium installed.
- Add .streamlit/secrets.toml with your chosen password.
- Run streamlit run app.py, follow the link printed in your terminal, and enjoy the map!

Error Handling

API Rate Limits

If OpenAI or OpenAlex complains about too many requests, add a small time delay in the loops (e.g., time.sleep(2)).
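
If a fixed delay is not enough, one option is to retry with a growing delay. The helper below is a sketch, not part of the notebooks: `call_with_retry`, its parameters, and the fake request are hypothetical, and in real code you would catch the specific rate-limit error raised by your client rather than every exception.

```python
import time


def call_with_retry(request_fn, max_attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call request_fn, waiting 2s, 4s, 8s, ... between failed attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request_fn()
        except Exception:  # assumption: narrow this to the client's rate-limit error
            if attempt == max_attempts:
                raise  # give up after the last attempt
            sleep(base_delay * 2 ** (attempt - 1))


# Demo: a fake request that fails twice, and a fake sleep that only records delays.
attempts = {"count": 0}
delays = []


def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("429 Too Many Requests")
    return "ok"


result = call_with_retry(flaky_request, sleep=delays.append)
print(result, delays)
```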

Missing Geolocation

Double-check the Latitude and Longitude columns in 01_participants_with_geo.csv. A blank lat/long means no map marker.
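
A quick way to list the affected rows, sketched with the standard `csv` module. The `Latitude` and `Longitude` column names come from the file described above; the `Name` column and the sample rows are assumptions.

```python
import csv
import io


def rows_missing_coordinates(csv_file):
    """Return the names of rows whose Latitude or Longitude is blank."""
    missing = []
    for row in csv.DictReader(csv_file):
        lat = (row.get("Latitude") or "").strip()
        lon = (row.get("Longitude") or "").strip()
        if not lat or not lon:
            missing.append(row.get("Name", ""))  # "Name" column is assumed
    return missing


# Hypothetical sample standing in for 01_participants_with_geo.csv.
sample = io.StringIO(
    "Name,Latitude,Longitude\n"
    "Ana,52.5,13.4\n"
    "Ben,,\n"
    "Cleo,48.1,\n"
)
missing_names = rows_missing_coordinates(sample)
print(missing_names)
```

To check the real file, pass an open file handle instead, for example `open("01_participants_with_geo.csv", newline="", encoding="utf-8")`.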

JSON Decoding Woes

Sometimes GPT’s output isn’t valid JSON. The OpenAI notebook (04.Data_processing_with_OpenAI.ipynb) tries to handle that, but you may need to re-run it or modify the prompt.
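
One defensive pattern is to pull out the outermost JSON object from the reply and fall back to `None` if parsing still fails. This is only a sketch: `parse_model_json` is a hypothetical helper, and the reply format with "Themes" and "Expertise" keys is an assumption about how the prompt is set up.

```python
import json


def parse_model_json(raw):
    """Best-effort extraction of a JSON object from a model reply; None on failure."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None  # no object-like span in the reply
    try:
        return json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None


# Hypothetical reply wrapped in chatty text and a code fence.
reply = 'Sure! ```json\n{"Themes": ["sign language"], "Expertise": ["EEG"]}\n```'
parsed = parse_model_json(reply)
print(parsed)
```

A `None` result is a signal to retry that chunk or log it for manual review instead of crashing the whole run.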

No Publications Found

Some researchers may genuinely not have entries in OpenAlex or have incomplete info. In those cases, the script logs “No publications found.”

Improvement Ideas

Smarter Keywords

Right now, I rely on GPT for keywords. You could integrate domain-specific dictionaries or trained classifiers for more accuracy.

Extended Publications

For older or specialized data, consider using other APIs (e.g., PubMed, CrossRef) to widen the publication range.

Add More Filters

Let users filter by “department,” “method,” or “region.” Not everyone enjoys endless scrolling.

Automated Pipelines

Use GitLab CI or GitHub Actions to refresh the data nightly, so you’re always up-to-date with the newest preprints.

Better UI

If you have a knack for design, feel free to refactor the interface for a more polished (or simpler) look.

Files in the Repository

02.Extract_Publication_data.ipynb
Pulls data from OpenAlex for each researcher.
03.Clean_Publication_data.ipynb
Fixes missing values, merges data fields, and aggregates text.
04.Data_processing_with_OpenAI.ipynb
Feeds abstracts to OpenAI for keyword extraction.
05.Clean_OpenAI_data.ipynb
Removes duplicates, merges synonyms, cleans up categories.
06.Similarity.ipynb
Calculates synergy scores for researcher pairs.
app.py
The interactive Streamlit app for exploring and filtering collaborations.

How to Run the Project Locally

Clone the Repo

```bash
git clone <repository_url>
cd ViCom
```

Create a Virtual Environment

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

Install Requirements

```bash
pip install -r requirements.txt
```

Set the App Password

Create .streamlit/secrets.toml containing:

```toml
password = "whatever_password_you_want"
```

Run the App

```bash
streamlit run app.py
```

Open Your Browser

Visit http://localhost:8501 and log in with your password.

Technologies Used

Python
Streamlit
Folium
OpenAI API
OpenAlex API
Pandas & NumPy

(Who doesn’t love a data pipeline with a dash of AI?)

Final Thoughts

This project was a solo sprint, so if you spot anything bizarre, please bear with me. ViCom CollabMap is here to help you discover your next big research partnership—or at least give you a fun excuse to reach out to that colleague you’ve been curious about. Enjoy!


monika-kotus/ViCom
