Rust Web Crawler

A modern web crawler with a React frontend and a Rust backend. The project demonstrates core web crawling functionality with a user-friendly interface, asynchronous processing, and a RESTful API. The crawler can be configured to start from any URL and can limit the crawl to a specific domain or depth.

Features

  • Web Interface: Clean, responsive React frontend built with Material-UI components.
  • RESTful API: Backend API built with Axum web framework.
  • Asynchronous Crawling: Efficiently fetches multiple pages in parallel (a rough sketch follows this list).
  • Configurable Parameters: Control depth, page limits, and domain restriction through the UI.
  • Status Indicators: Color-coded status indicators for crawled pages.
  • Real-time Feedback: Server status monitoring and error handling.
  • URL Validation: Frontend and backend validation to ensure proper URLs.
  • Cross-Origin Support: CORS enabled for API communication.
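
As a rough illustration of the asynchronous crawling feature, the sketch below fetches a batch of URLs concurrently and records each HTTP status. This is a minimal sketch, not this repository's implementation: the tokio runtime is implied by Axum, but the reqwest and futures crates and all function names here are assumptions.

// Minimal sketch of concurrent fetching; reqwest and futures are assumed crates,
// not necessarily the ones this repository uses.
use futures::future::join_all;

async fn fetch_status(client: &reqwest::Client, url: &str) -> (String, Option<u16>) {
    // Fetch one page and keep its HTTP status if the request succeeds.
    let status = client.get(url).send().await.ok().map(|r| r.status().as_u16());
    (url.to_string(), status)
}

#[tokio::main]
async fn main() {
    let client = reqwest::Client::new();
    let urls = ["https://example.com", "https://example.org"];
    // Issue all requests at once and wait for every result.
    for (url, status) in join_all(urls.iter().map(|u| fetch_status(&client, u))).await {
        println!("{url} -> {status:?}");
    }
}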

Installation

Prerequisites

  • Rust toolchain (rustc and cargo), for example installed via rustup.
  • Node.js and npm, for the React frontend.

Setting Up the Backend

  1. Clone this repository:

    git clone https://github.com/zazabap/web_crawler.git
    cd web_crawler
  2. Build the Rust backend:

    cargo build --release

Setting Up the Frontend

  1. Navigate to the frontend directory:

    cd frontend
  2. Install dependencies:

    npm install

Usage

Running the Application

  1. Start the backend server:

    cargo run

    This will start the Rust backend server at http://localhost:8000 (you can verify it with a quick curl request, shown after this list).

  2. In a separate terminal, start the frontend development server:

    cd frontend
    npm run dev

    This will start the React frontend at http://localhost:5173 (or similar).

  3. Open your browser and navigate to the frontend URL.
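
Once both servers are running, you can also confirm the backend is reachable directly from a terminal by hitting the status endpoint documented below (the exact response body depends on the server):

curl http://localhost:8000/status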

Using the Web Interface

  1. The interface will automatically check if the backend server is running.
  2. Enter a URL to crawl (e.g., https://example.com).
  3. Adjust the crawl parameters:
    • Crawl Depth: How many links deep to crawl (1-10).
    • Max Pages: Maximum number of pages to crawl (10-500).
    • Stay on Same Domain: Toggle to restrict crawling to the starting domain.
  4. Click "Start Crawling".
  5. View the results below, with status codes color-coded:
    • Green: 200-level status (success)
    • Blue: 300-level status (redirect)
    • Yellow: 400-level status (client error)
    • Red: 500-level status (server error)

API Endpoints

  • GET /status: Check if the server is running and get version information.
  • POST /crawl: Start a crawl operation with the following JSON parameters (a curl example follows this list):
    {
      "start_url": "https://example.com",
      "depth_limit": 3,
      "max_pages": 100,
      "same_domain": true
    }
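
For example, a crawl matching the request body above can be started from the command line with curl (only the request is shown; the response format depends on the server implementation):

curl -X POST http://localhost:8000/crawl \
  -H "Content-Type: application/json" \
  -d '{"start_url": "https://example.com", "depth_limit": 3, "max_pages": 100, "same_domain": true}'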

Command Line Usage (Legacy)

The crawler can also be run from the command line:

cargo run -- --start-url "https://example.com" --depth-limit 3 --max-pages 100 --same-domain

Arguments:

  • --start-url: (Required) The starting URL for the crawler.
  • --depth-limit: (Optional) Maximum depth of the crawl. Defaults to 3.
  • --max-pages: (Optional) Maximum number of pages to crawl.
  • --same-domain: (Optional) Restrict crawling to the starting domain.
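
Since only --start-url is required, the other flags can be omitted for the simplest invocation:

cargo run -- --start-url "https://example.com"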

Storage and Visualization Tools

Data Storage

The crawler can save data to both SQLite and CSV formats when run from the command line.

Viewing Stored Data

  1. SQLite: Use DB Browser for SQLite or the SQLite CLI to view output.db (a CLI example follows this list).
  2. CSV: Open output.csv in a spreadsheet application.
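
For a quick look from a terminal, the sqlite3 CLI can list the tables and dump a few rows. The table name pages below is a placeholder; the actual schema is whatever the crawler defines:

sqlite3 output.db ".tables"
sqlite3 output.db ".schema"
sqlite3 output.db "SELECT * FROM pages LIMIT 10;"   # replace `pages` with a table listed above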

Terminal Visualization Tool

For command-line data visualization:

cargo run --bin visualize

Contributions

Contributions are welcome! Feel free to submit issues, feature requests, or pull requests to improve this project.

About

A web crawler written in Rust as a casual project. It collects public data that can be used to train LLMs as a service.
