A modern web crawler with a React frontend and Rust backend. This project demonstrates core web crawling functionality with a user-friendly interface, asynchronous processing, and a RESTful API. The crawler can be configured to start from any URL and is capable of limiting the crawl to a specific domain or depth.
- Web Interface: Clean, responsive React frontend built with Material-UI components.
- RESTful API: Backend API built with Axum web framework.
- Asynchronous Crawling: Efficiently fetches multiple pages in parallel (see the sketch after this list).
- Configurable Parameters: Control depth, page limits, and domain restriction through the UI.
- Status Indicators: Color-coded status indicators for crawled pages.
- Real-time Feedback: Server status monitoring and error handling.
- URL Validation: Frontend and backend validation to ensure proper URLs.
- Cross-Origin Support: CORS enabled for API communication.
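To make the asynchronous crawling feature concrete, here is a minimal sketch of fetching several pages in parallel. It assumes the tokio runtime and the reqwest HTTP client and is illustrative only; the project's actual crawl loop may be structured differently.

```rust
use reqwest::Client;

// Illustrative sketch: fetch several pages concurrently.
// Assumes the `tokio` runtime (with the "macros" and "rt-multi-thread"
// features) and the `reqwest` HTTP client; the real crawler's internals
// may differ.
#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let client = Client::new();
    let urls = ["https://example.com", "https://example.org"];

    // Spawn one task per URL so the requests run in parallel.
    let handles: Vec<_> = urls
        .iter()
        .map(|&url| {
            let client = client.clone();
            tokio::spawn(async move {
                let status = client.get(url).send().await?.status();
                Ok::<_, reqwest::Error>((url, status))
            })
        })
        .collect();

    // Collect results as each task finishes.
    for handle in handles {
        if let Ok(Ok((url, status))) = handle.await {
            println!("{url} -> {status}");
        }
    }
    Ok(())
}
```

Because each fetch runs in its own task, a slow host does not block the rest of the crawl.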
- Rust: Install Rust by following the instructions at rust-lang.org.
- Node.js: Install Node.js and npm from nodejs.org.
- Clone this repository:
  `git clone https://github.com/zazabap/web_crawler.git`
  `cd web_crawler`
- Build the Rust backend:
  `cargo build --release`
- Navigate to the frontend directory:
  `cd frontend`
- Install dependencies:
  `npm install`
- Start the backend server:
  `cargo run`
  This will start the Rust backend server at http://localhost:8000.
- In a separate terminal, start the frontend development server:
  `cd frontend`
  `npm run dev`
  This will start the React frontend at http://localhost:5173 (or similar).
- Open your browser and navigate to the frontend URL.
- The interface will automatically check if the backend server is running.
- Enter a URL to crawl (e.g., https://example.com).
- Adjust the crawl parameters:
  - Crawl Depth: How many links deep to crawl (1-10).
  - Max Pages: Maximum number of pages to crawl (10-500).
  - Stay on Same Domain: Toggle to restrict crawling to the starting domain.
- Click "Start Crawling".
- View the results below, with status codes color-coded:
  - Green: 200-level status (success)
  - Blue: 300-level status (redirect)
  - Yellow: 400-level status (client error)
  - Red: 500-level status (server error)
- GET /status: Check if the server is running and get version information.
- POST /crawl: Start a crawl operation with the following JSON parameters:
{ "start_url": "https://example.com", "depth_limit": 3, "max_pages": 100, "same_domain": true }
The crawler can also be run from the command line:
cargo run -- --start-url "https://example.com" --depth-limit 3 --max-pages 100 --same-domain
- `--start-url`: (Required) The starting URL for the crawler.
- `--depth-limit`: (Optional) Maximum depth of the crawl. Defaults to `3`.
- `--max-pages`: (Optional) Maximum number of pages to crawl.
- `--same-domain`: (Optional) Restrict crawling to the starting domain.
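As a sketch of how flags like these could be declared, the snippet below uses the clap crate's derive API. The flag names and the default of 3 mirror the list above; this is an assumed illustration, not the project's actual argument parser.

```rust
use clap::Parser;

/// Illustrative CLI definition mirroring the flags documented above.
/// Assumes the `clap` crate with its "derive" feature; the real binary
/// may parse arguments differently.
#[derive(Parser, Debug)]
struct Args {
    /// (Required) The starting URL for the crawler.
    #[arg(long)]
    start_url: String,

    /// (Optional) Maximum depth of the crawl.
    #[arg(long, default_value_t = 3)]
    depth_limit: u32,

    /// (Optional) Maximum number of pages to crawl.
    #[arg(long)]
    max_pages: Option<u32>,

    /// (Optional) Restrict crawling to the starting domain.
    #[arg(long)]
    same_domain: bool,
}

fn main() {
    let args = Args::parse();
    println!("{args:?}");
}
```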
The crawler can save data to both SQLite and CSV formats when run from the command line.
- SQLite: Use DB Browser for SQLite or the SQLite CLI to view `output.db`.
- CSV: Open `output.csv` in a spreadsheet application.
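The snippet below sketches how crawl results could be exported to those two files. It assumes the rusqlite and csv crates and a simple url/status schema; the actual schema, table, and column names used by the project are not documented here and may differ.

```rust
use csv::Writer;
use rusqlite::{params, Connection};

// Illustrative export of crawl results to output.db and output.csv.
// The (url, status) schema is an assumption made for this example.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let results = vec![("https://example.com", 200), ("https://example.com/missing", 404)];

    // SQLite: one row per crawled page.
    let conn = Connection::open("output.db")?;
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT NOT NULL, status INTEGER NOT NULL)",
        [],
    )?;
    for (url, status) in &results {
        conn.execute(
            "INSERT INTO pages (url, status) VALUES (?1, ?2)",
            params![url, status],
        )?;
    }

    // CSV: same rows with a header line.
    let mut writer = Writer::from_path("output.csv")?;
    writer.write_record(["url", "status"])?;
    for (url, status) in &results {
        writer.write_record([url.to_string(), status.to_string()])?;
    }
    writer.flush()?;
    Ok(())
}
```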
For command-line data visualization:
cargo run --bin visualize
Contributions are welcome! Feel free to submit issues, feature requests, or pull requests to improve this project.