Murkmaw

Murkmaw is a Rust-based, multithreaded web crawler for efficient link graph construction, image extraction, and customizable logging. Its modular architecture is designed to make future enhancements and customization straightforward.


Features

Multithreaded Web Crawler

  • Parallel Crawling: Utilizes multithreading for faster page scraping with configurable worker threads.
  • Link Graph Construction: Maintains a graph structure (LinkGraph) tracking parent-child associations and link references (see the sketch after this list).
  • Data Extraction: Retrieves links, images, and titles from web pages.
  • Customizable Crawling: Specify the maximum number of links and images to process.
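
A minimal sketch of what the link graph might look like. The LinkGraph name comes from the feature list above, but the fields and methods shown here are illustrative assumptions, not Murkmaw's actual API:

  use std::collections::HashMap;

  /// Illustrative link graph: each parent URL maps to the child URLs
  /// discovered on that page. The structure is an assumption for this sketch.
  #[derive(Default, Debug)]
  struct LinkGraph {
      edges: HashMap<String, Vec<String>>,
  }

  impl LinkGraph {
      /// Record that `child` was discovered while crawling `parent`.
      fn add_edge(&mut self, parent: &str, child: &str) {
          self.edges
              .entry(parent.to_string())
              .or_default()
              .push(child.to_string());
      }

      /// Total number of parent-child associations recorded so far.
      fn link_count(&self) -> usize {
          self.edges.values().map(Vec::len).sum()
      }
  }

  fn main() {
      let mut graph = LinkGraph::default();
      graph.add_edge("https://example.com", "https://example.com/about");
      println!("{} links recorded", graph.link_count());
  }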

Enhanced Logging

  • Progress Bars: Displays link discovery progress with a real-time progress bar.
  • Spinners: Visual feedback for different stages of image processing and serialization.
  • Customizable Output: Built using the indicatif and console crates (a minimal usage sketch follows below).
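
The snippet below is a minimal sketch of how indicatif can drive a progress bar and a spinner, in the spirit of the logging described above; it is not Murkmaw's actual logging code, and the counts and sleep durations are placeholders:

  use indicatif::ProgressBar;
  use std::{thread, time::Duration};

  fn main() {
      // Progress bar over a known number of links (placeholder count).
      let links_to_visit = 20;
      let bar = ProgressBar::new(links_to_visit);
      for _ in 0..links_to_visit {
          thread::sleep(Duration::from_millis(50)); // stand-in for crawling work
          bar.inc(1);
      }
      bar.finish_with_message("link discovery complete");

      // Spinner for a stage whose length is not known up front,
      // e.g. image processing or serialization.
      let spinner = ProgressBar::new_spinner();
      spinner.set_message("serializing image metadata...");
      for _ in 0..20 {
          thread::sleep(Duration::from_millis(50));
          spinner.tick();
      }
      spinner.finish_with_message("done");
  }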

Image Utilities

  • Metadata Handling: Converts extracted links into image metadata, including alt text and source URL.
  • Image Downloading: Saves images locally in a user-defined directory.
  • Image Database: Serializes image metadata into a JSON database.
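
A minimal sketch of how image metadata could be serialized into a JSON database with serde and serde_json; the ImageMetadata struct and its field names are assumptions based on the description above, not Murkmaw's actual types:

  use serde::Serialize;

  /// Illustrative metadata record; field names are assumptions for this sketch.
  #[derive(Serialize)]
  struct ImageMetadata {
      src_url: String,
      alt_text: String,
  }

  fn main() -> anyhow::Result<()> {
      let images = vec![ImageMetadata {
          src_url: "https://example.com/logo.png".into(),
          alt_text: "Example logo".into(),
      }];

      // Serialize the collected metadata into a JSON "database" file.
      let json = serde_json::to_string_pretty(&images)?;
      std::fs::write("images.json", json)?;
      Ok(())
  }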

Getting Started

Prerequisites

  • Rust (latest stable version)
  • Crates used in the project:
      • tokio (for asynchronous operations)
      • reqwest (for HTTP requests)
      • serde and serde_json (for serialization and JSON handling)
      • rayon (for multithreading)
      • indicatif and console (for logging and UI enhancements)
      • anyhow (for error handling)

Installation

Clone the repository:

  git clone https://github.com/IAmFarrokhnejad/Murkmaw.git
  cd Murkmaw

Build the project (Cargo fetches and compiles all dependencies):

  cargo build

Usage

Run the application with the following command:

cargo run --release -- --starting_url <URL> --max_links <N> --max_images <N> --n_worker_threads <N> --log_status <true/false> --img_save_dir <directory> --links_json <filename>

Command-Line Options

  • starting_url: The initial URL to crawl (required).
  • max_links: The maximum number of links to process (default: 100).
  • max_images: The maximum number of images to extract (default: 50).
  • n_worker_threads: Number of worker threads for parallel crawling (default: 4); see the sketch after this list.
  • log_status: Whether to enable logging (default: true).
  • img_save_dir: Directory to save downloaded images (default: ./images).
  • links_json: Filename for the JSON file storing the link graph (default: links.json).
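
As a rough illustration of how the --n_worker_threads option could size the crawler's thread pool, here is a minimal rayon sketch; the pool setup, URL list, and printed output are hypothetical and not Murkmaw's actual implementation:

  use rayon::prelude::*;

  fn main() -> anyhow::Result<()> {
      // Hypothetical values mirroring --n_worker_threads and --max_links.
      let n_worker_threads = 4;
      let urls: Vec<String> = (0..100)
          .map(|i| format!("https://example.com/page/{i}"))
          .collect();

      // Size a rayon pool from the worker-thread option and crawl in parallel.
      let pool = rayon::ThreadPoolBuilder::new()
          .num_threads(n_worker_threads)
          .build()?;

      pool.install(|| {
          urls.par_iter().for_each(|url| {
              // Placeholder for the real fetch-and-parse step.
              println!("crawling {url}");
          });
      });

      Ok(())
  }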

Contribution Guidelines

Contributions are welcome! Please follow these steps:

  1. Fork the repository.
  2. Create a new branch for your feature or bug fix.
  3. Submit a pull request with a clear description of your changes.

License

This project is licensed under the MIT License - see the LICENSE file for details.