Skip to content

Aashish1-1-1/betterpipeline

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Betterpipeline

Betterpipeline is a pipeline for webcrawler for identification of the site that are important to crawl, identification also means to eliminate scammy websites

Features

  • Scamwebsite check using dataset linear regression
  • Weighted crawl using word embedding(vector database)
  • Priority queue based on weight for efficient use of resource that important site get crawled first
  • Everything containerized

To run

git clone https://github.com/Aashish1-1-1/betterpipeline
sudo docker build -t betterpipeline .
sudo docker run -it betterpipeline

you can always change seed website(dockerfile) and depth(main.py) from dockerfile and main.py

About

pipelining

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published