Skip to content
/ qMiner Public

🕷️ Async Web Crawler with Licensing High-performance asynchronous web crawler with built-in licensing system, JavaScript rendering support, and RESTful API. Features Playwright integration for complete web page rendering and SQLite storage. ⚡️ Features: - Async crawling with aiohttp - Built-in licensing system - JavaScript rendering - RESTfulAPI

Notifications You must be signed in to change notification settings

rafsid/qMiner

Repository files navigation

Async Web Crawler with Licensing System

Overview

A high-performance asynchronous web crawler built with Python, featuring a licensing system, Playwright-based rendering, and a RESTful API interface. The crawler efficiently handles JavaScript-rendered content and manages both internal and external links while respecting crawl depth limits.

Features

  • Asynchronous crawling with aiohttp and Playwright
  • Built-in licensing system (subscription and one-time)
  • SQLite database for storing crawl results
  • RESTful API endpoints for control and monitoring
  • Docker support
  • Configurable crawl depth and URL limits
  • JavaScript rendering support
  • Detailed logging system

Requirements

  • Python 3.8+
  • Playwright
  • Flask
  • aiohttp
  • aiosqlite
  • Beautiful Soup 4
  • uvicorn
  • Other dependencies in requirements.txt

Installation

  1. Clone the repository:
git clone https://github.com/yourusername/async-web-crawler.git
cd async-web-crawler
  1. Create and activate virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  1. Install dependencies:
pip install -r requirements.txt
  1. Install Playwright browsers:
playwright install
  1. Set up environment variables:
cp .env.example .env

Edit .env with your settings:

DB_NAME=crawler.db
MAX_DEPTH=5
MAX_URLS=1000
SECRET_KEY=your_secret_key_here
ALLOWED_HOSTS=localhost,127.0.0.1
DEBUG=False

Usage

Running the Server

python app.py

Or with Docker:

docker build -t web-crawler .
docker run -p 5000:5000 web-crawler

API Endpoints

  1. Start a Crawl
POST /crawl
{
    "url": "https://example.com",
    "max_depth": 3,
    "max_urls": 100,
    "license_key": "your_license_key"
}
  1. Get Results
GET /results?license_key=your_license_key&page=1&per_page=20
  1. Create License
POST /license
{
    "key": "license_key",
    "type": "subscription"  # or "one-time"
}

Configuration Options

Variable Description Default
DB_NAME SQLite database name crawler.db
MAX_DEPTH Maximum crawl depth 5
MAX_URLS Maximum URLs to crawl 1000
SECRET_KEY Flask secret key your_secret_key_here
ALLOWED_HOSTS Allowed host list localhost,127.0.0.1
DEBUG Debug mode False

Database Schema

Crawls Table

CREATE TABLE crawls (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    url TEXT NOT NULL,
    depth INTEGER NOT NULL,
    internal_links TEXT,
    external_links TEXT,
    title TEXT,
    crawled_at TIMESTAMP
)

Licenses Table

CREATE TABLE licenses (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    key TEXT UNIQUE NOT NULL,
    type TEXT NOT NULL,
    expiration TIMESTAMP
)

License Management

Two types of licenses are supported:

  • One-time: Never expires
  • Subscription: 30-day validity

Error Handling

  • Comprehensive error logging
  • Graceful handling of network issues
  • Timeout management for slow responses
  • Invalid license handling

Security Features

  • License key validation
  • Configurable allowed hosts
  • Rate limiting (configurable)
  • Input validation

Contributing

Contributions are welcome! Please read our Contributing Guidelines for details.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Playwright team for the browser automation
  • Flask team for the web framework
  • Beautiful Soup team for HTML parsing

About

🕷️ Async Web Crawler with Licensing High-performance asynchronous web crawler with built-in licensing system, JavaScript rendering support, and RESTful API. Features Playwright integration for complete web page rendering and SQLite storage. ⚡️ Features: - Async crawling with aiohttp - Built-in licensing system - JavaScript rendering - RESTfulAPI

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published