Skip to content

This project is a simple and efficient tool for creating and querying a vector database. It leverages pre-trained models to generate embeddings from PDF documents and performs similarity search using the Faiss library.

License

Notifications You must be signed in to change notification settings

noNScop/Vector_DB

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Vector Database Project

This project is a simple and efficient tool for creating and querying a vector database. It leverages pre-trained models to generate embeddings from PDF documents and performs similarity search using the Faiss library. The project supports adding new documents, querying for similar content, and managing the database.

Features

  • Supports two pre-trained embedding models:
    1. NoInstruct-small-Embedding-v0 (smaller and faster, suitable for lightweight applications).
    2. stella_en_400M_v5 (larger model for better embeddings).
  • Efficient text chunking: Text from PDFs is chunked for processing, ensuring overlapping to preserve context.
  • Query support: Retrieve relevant document sections with similarity ranking.
  • Persistence: Save and load vector database and metadata for reusability.
  • Database management: Add documents, query content, clear the database, or delete saved data files.

Usage

  1. Select Engine: Choose between the two supported embedding models.
  2. Choose Action:
    • Add a document to the database: Provide the path to a PDF file, and the project will extract and store its content in the database.
    • Query the database: Input a query, and the tool will rank similar chunks of text from the database.
    • Clear the database: Remove all stored data from the memory.
    • Save database to files: Persist the vector index and metadata to files for later use.
    • Remove database files: Delete saved index and metadata files from storage.

Example workflow

  1. Add a PDF:

    • Input the PDF file’s path.
    • Text will be extracted, chunked, and embedded into the vector database.
  2. Run a query:

    • Input a natural language query.
    • The tool retrieves and ranks similar text from stored documents.
  3. Save or clear the database as needed.

Setup

I recommend using Python 3.12.7 or a similar version. Newer or slightly older versions of Python should work fine, but the specific version used in development was 3.12.7.

To set up the project locally, follow these steps:

  1. Clone the repository:
git clone <repository-url>
cd Vector_DB
  1. Download Faiss: Installing faiss is not easy, you can start by folowing their installation guide, if this guide is not enough for you to get Faiss working (it wasn't enough for me), take a look at a list of commands I had to run on MacOS to install it, feel free to get inspired:
brew install libomp
brew install swig
brew install gflags

pip install numpy
git clone https://github.com/facebookresearch/faiss.git
cd faiss
git checkout fix_nightly_build

cmake -B build \
    -DFAISS_ENABLE_GPU=OFF \
    -DPython_EXECUTABLE=$(which python3) \
    -DCMAKE_BUILD_TYPE=Release \
    -DOpenMP_CXX_FLAGS="-I$(brew --prefix libomp)/include" \
    -DOpenMP_CXX_LIB_NAMES="omp" \
    -DOpenMP_omp_LIBRARY="$(brew --prefix libomp)/lib/libomp.dylib" \
    -DSWIG_EXECUTABLE=$(which swig) \
    -DOpenMP_C_FLAGS="-I$(brew --prefix libomp)/include" \
    -DOpenMP_C_LIB_NAMES="omp" \
    -DCMAKE_PREFIX_PATH=$(brew --prefix gflags)

cd build
make -j7
cd ..
(cd build/faiss/python/ ; python3 setup.py build)

pip install faiss-cpu
  1. Install remaining libraries:
pip install -r requirements.txt
  1. Run the app: To run the application, simply execute:

python src/main.py

Project Structure

Here’s a quick overview of the project structure:

Vector_DB/
├── data/              # Directory for saving database files
├── src/               # Source code
├── requirements.txt   # required python libraries
└── README.md          # Project description

License

This project is licensed under the MIT License. See the LICENSE file for more information.

About

This project is a simple and efficient tool for creating and querying a vector database. It leverages pre-trained models to generate embeddings from PDF documents and performs similarity search using the Faiss library.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages