Vector Database Project

This project is a simple and efficient tool for creating and querying a vector database. It leverages pre-trained models to generate embeddings from PDF documents and performs similarity search using the Faiss library. The project supports adding new documents, querying for similar content, and managing the database.

Features

Supports two pre-trained embedding models:
1. NoInstruct-small-Embedding-v0 (smaller and faster, suitable for lightweight applications).
2. stella_en_400M_v5 (larger model that produces higher-quality embeddings).
Efficient text chunking: Text extracted from PDFs is split into overlapping chunks so that context is preserved across chunk boundaries.
Query support: Retrieve relevant document sections, ranked by similarity to the query.
Persistence: Save and load the vector index and metadata so the database can be reused across sessions.
Database management: Add documents, query content, clear the in-memory database, or delete saved data files.
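
To illustrate the overlapping chunking mentioned above, here is a minimal sketch. It is not the project's actual implementation: the function name `chunk_text`, the character-based splitting, and the default `chunk_size` and `overlap` values are all assumptions made for illustration.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters, so a sentence cut at a
    chunk boundary still appears with some context in the next chunk.
    (Hypothetical helper; names and defaults are assumptions.)
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


if __name__ == "__main__":
    sample = "abcdefghijklmnopqrstuvwxyz"
    for chunk in chunk_text(sample, chunk_size=10, overlap=3):
        print(chunk)
```

A larger overlap preserves more context at the cost of storing and embedding more redundant text, so the right value likely depends on the documents being indexed.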

Usage

Select Engine: Choose between the two supported embedding models.
Choose Action:
- Add a document to the database: Provide the path to a PDF file, and the project will extract and store its content in the database.
- Query the database: Input a query, and the tool will rank similar chunks of text from the database.
- Clear the database: Remove all stored data from memory.
- Save database to files: Persist the vector index and metadata to files for later use.
- Remove database files: Delete saved index and metadata files from storage.
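
The save and remove actions above could be sketched roughly as follows for the metadata side. This is an illustration under assumptions, not the project's code: the helper names, the JSON format, the record fields, and the `metadata.json` file name are all hypothetical. The vector index itself would be persisted separately (Faiss provides `faiss.write_index` and `faiss.read_index` for that), which is not shown here.

```python
import json
import tempfile
from pathlib import Path


def save_metadata(records: list[dict], path: Path) -> None:
    """Write chunk metadata to a JSON file, creating the folder if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8")


def load_metadata(path: Path) -> list[dict]:
    """Read chunk metadata back; return an empty list if nothing was saved."""
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))


def remove_database_files(*paths: Path) -> int:
    """Delete the given files if they exist and return how many were removed."""
    removed = 0
    for path in paths:
        if path.exists():
            path.unlink()
            removed += 1
    return removed


if __name__ == "__main__":
    # Use a temporary folder so the demo does not touch a real data/ directory.
    with tempfile.TemporaryDirectory() as tmp:
        metadata_path = Path(tmp) / "data" / "metadata.json"  # hypothetical file name
        save_metadata([{"source": "example.pdf", "chunk_id": 0, "text": "hello"}], metadata_path)
        print(load_metadata(metadata_path))
        print(remove_database_files(metadata_path))
```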

Example workflow

Add a PDF:
- Input the PDF file’s path.
- Text will be extracted, chunked, and embedded into the vector database.
Run a query:
- Input a natural language query.
- The tool retrieves and ranks similar text from stored documents.
Save or clear the database as needed.
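
The query step of this workflow can be pictured with the toy sketch below. It deliberately swaps in simpler techniques so that it runs with the standard library alone: a bag-of-words count vector stands in for the neural embedding models, and a brute-force cosine-similarity scan stands in for the Faiss index. All function names are hypothetical and none of this is the project's actual code.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: word counts (NOT one of the project's models)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(query: str, chunks: list[str], top_k: int = 3) -> list[tuple[float, str]]:
    """Score every stored chunk against the query and return the best top_k."""
    query_vec = toy_embed(query)
    scored = [(cosine_similarity(query_vec, toy_embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest similarity first
    return scored[:top_k]


if __name__ == "__main__":
    stored_chunks = [
        "faiss builds a vector index",
        "pdf text is split into chunks",
        "the cat sat on the mat",
    ]
    for score, chunk in rank_chunks("vector index", stored_chunks, top_k=2):
        print(f"{score:.3f}  {chunk}")
```

A dedicated index such as Faiss serves the same purpose as the linear scan here, but is designed to stay fast as the number of stored vectors grows.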

Setup

I recommend Python 3.12.7, the version used in development. Slightly older or newer versions of Python should also work.

To set up the project locally, follow these steps:

Clone the repository:

git clone <repository-url>
cd Vector_DB

Install Faiss: Installing Faiss is not easy. Start by following the official installation guide. If that guide is not enough to get Faiss working (it wasn't enough for me), here is the list of commands I had to run on macOS to install it. Feel free to adapt them:

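# Build dependencies: OpenMP runtime, SWIG (generates the Python bindings), gflags
brew install libomp
brew install swig
brew install gflags

# NumPy is needed to build the Python bindings
pip install numpy
git clone https://github.com/facebookresearch/faiss.git
cd faiss
git checkout fix_nightly_build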

cmake -B build \
    -DFAISS_ENABLE_GPU=OFF \
    -DPython_EXECUTABLE=$(which python3) \
    -DCMAKE_BUILD_TYPE=Release \
    -DOpenMP_CXX_FLAGS="-I$(brew --prefix libomp)/include" \
    -DOpenMP_CXX_LIB_NAMES="omp" \
    -DOpenMP_omp_LIBRARY="$(brew --prefix libomp)/lib/libomp.dylib" \
    -DSWIG_EXECUTABLE=$(which swig) \
    -DOpenMP_C_FLAGS="-I$(brew --prefix libomp)/include" \
    -DOpenMP_C_LIB_NAMES="omp" \
    -DCMAKE_PREFIX_PATH=$(brew --prefix gflags)

cd build
make -j7
cd ..
(cd build/faiss/python/ && python3 setup.py build)

pip install faiss-cpu

Install remaining libraries:

pip install -r requirements.txt

Run the app:

python src/main.py

Project Structure

Here’s a quick overview of the project structure:

Vector_DB/
├── data/              # Directory for saving database files
├── src/               # Source code
├── requirements.txt   # Required Python libraries
└── README.md          # Project description

License

This project is licensed under the MIT License. See the LICENSE file for more information.
