Async Web Crawler with Licensing System

Overview

A high-performance asynchronous web crawler built with Python, featuring a licensing system, Playwright-based rendering, and a RESTful API interface. The crawler efficiently handles JavaScript-rendered content and manages both internal and external links while respecting crawl depth limits.
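As an illustration of the internal/external distinction mentioned above, here is a minimal sketch. It assumes "internal" means "same host as the page being crawled"; the project itself may use a different rule. The function name `split_links` is hypothetical.

```python
from urllib.parse import urljoin, urlparse


def split_links(page_url, hrefs):
    """Resolve hrefs against page_url and sort them into internal/external.

    Assumption: a link is internal when its host matches the page's host.
    Non-HTTP(S) links (mailto:, javascript:, ...) are skipped.
    """
    base_host = urlparse(page_url).netloc.lower()
    internal, external = [], []
    for href in hrefs:
        absolute = urljoin(page_url, href)
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue
        if parsed.netloc.lower() == base_host:
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external


internal, external = split_links(
    "https://example.com/docs/",
    ["/about", "guide.html", "https://other.org/x", "mailto:hi@example.com"],
)
```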

Features

Asynchronous crawling with aiohttp and Playwright
Built-in licensing system (subscription and one-time)
SQLite database for storing crawl results
RESTful API endpoints for control and monitoring
Docker support
Configurable crawl depth and URL limits
JavaScript rendering support
Detailed logging system
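The depth and URL limits listed above can be sketched as a breadth-first traversal. This is not the project's implementation: the `crawl` function and the `fetch_links` callback are hypothetical, pages are fetched one at a time for clarity (a real crawler would fetch concurrently), and an in-memory dictionary stands in for real HTTP requests.

```python
import asyncio
from collections import deque
from typing import Awaitable, Callable, Dict, List

# Hypothetical callback: given a URL, return the links found on that page.
FetchLinks = Callable[[str], Awaitable[List[str]]]


async def crawl(start_url: str, fetch_links: FetchLinks,
                max_depth: int = 5, max_urls: int = 1000) -> Dict[str, int]:
    """Breadth-first crawl; returns {url: depth} for every page visited."""
    visited: Dict[str, int] = {}
    queue = deque([(start_url, 0)])
    while queue and len(visited) < max_urls:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue
        visited[url] = depth
        for link in await fetch_links(url):
            if link not in visited:
                queue.append((link, depth + 1))
    return visited


# Tiny in-memory "site" standing in for real pages.
SITE = {"a": ["b", "c"], "b": ["d"], "c": [], "d": ["e"], "e": []}


async def fake_fetch(url: str) -> List[str]:
    return SITE.get(url, [])


result = asyncio.run(crawl("a", fake_fetch, max_depth=2, max_urls=10))
limited = asyncio.run(crawl("a", fake_fetch, max_depth=5, max_urls=2))
```

With `max_depth=2`, page "e" (depth 3) is never visited; with `max_urls=2`, the crawl stops after the first two pages.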

Requirements

Python 3.8+
Playwright
Flask
aiohttp
aiosqlite
Beautiful Soup 4
uvicorn
Other dependencies in requirements.txt

Installation

Clone the repository:

git clone https://github.com/yourusername/async-web-crawler.git
cd async-web-crawler

Create and activate a virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt

Install Playwright browsers:

playwright install

Set up environment variables:

cp .env.example .env

Edit .env with your settings:

DB_NAME=crawler.db
MAX_DEPTH=5
MAX_URLS=1000
SECRET_KEY=your_secret_key_here
ALLOWED_HOSTS=localhost,127.0.0.1
DEBUG=False
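For reference, the settings above could be read in Python roughly as follows. This is a sketch, not the project's loader: the `Settings` class and `load_settings` function are hypothetical, and it assumes the variables are already present in the process environment (a `.env` file is not loaded automatically by Python itself).

```python
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class Settings:
    db_name: str
    max_depth: int
    max_urls: int
    secret_key: str
    allowed_hosts: List[str]
    debug: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, using the documented defaults."""
    hosts = env.get("ALLOWED_HOSTS", "localhost,127.0.0.1")
    return Settings(
        db_name=env.get("DB_NAME", "crawler.db"),
        max_depth=int(env.get("MAX_DEPTH", "5")),
        max_urls=int(env.get("MAX_URLS", "1000")),
        # No usable default: a real deployment should always set SECRET_KEY.
        secret_key=env.get("SECRET_KEY", ""),
        allowed_hosts=[h.strip() for h in hosts.split(",") if h.strip()],
        debug=env.get("DEBUG", "False").strip().lower() in ("1", "true", "yes"),
    )


defaults = load_settings({})
custom = load_settings({"MAX_DEPTH": "3", "DEBUG": "true", "ALLOWED_HOSTS": "a.test, b.test"})
```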

Usage

Running the Server

python app.py

Or with Docker:

docker build -t web-crawler .
docker run -p 5000:5000 web-crawler

API Endpoints

Start a Crawl

POST /crawl
{
    "url": "https://example.com",
    "max_depth": 3,
    "max_urls": 100,
    "license_key": "your_license_key"
}
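A request like the one above could be built from Python with the standard library. This sketch assumes the server is reachable at http://localhost:5000 (the port used in the Docker command) and that the endpoint accepts a JSON body; the response format is not documented here, so the example only constructs the request. `build_crawl_request` is a hypothetical helper name.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # assumption: matches the Docker port mapping


def build_crawl_request(url, license_key, max_depth=3, max_urls=100,
                        base_url=BASE_URL):
    """Build (but do not send) a POST /crawl request with a JSON body."""
    payload = {
        "url": url,
        "max_depth": max_depth,
        "max_urls": max_urls,
        "license_key": license_key,
    }
    return urllib.request.Request(
        f"{base_url}/crawl",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_crawl_request("https://example.com", "your_license_key")

# To send it, with the server running:
# with urllib.request.urlopen(req, timeout=30) as resp:
#     print(resp.status, resp.read().decode("utf-8"))
```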

Get Results

GET /results?license_key=your_license_key&page=1&per_page=20
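When building this URL programmatically, it is safer to let a library encode the query string, since a license key may contain characters that need escaping. A minimal sketch, assuming the same base URL as the Docker example; `results_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def results_url(license_key, page=1, per_page=20,
                base_url="http://localhost:5000"):
    """Return the GET /results URL with a properly encoded query string."""
    query = urlencode({"license_key": license_key, "page": page, "per_page": per_page})
    return f"{base_url}/results?{query}"


first_page = results_url("your_license_key")
escaped = results_url("a b&c", page=2, per_page=50)
```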

Create License

POST /license
{
    "key": "license_key",
    "type": "subscription"
}

The "type" field accepts "subscription" or "one-time".

Configuration Options

Variable	Description	Default
DB_NAME	SQLite database name	crawler.db
MAX_DEPTH	Maximum crawl depth	5
MAX_URLS	Maximum URLs to crawl	1000
SECRET_KEY	Flask secret key	your_secret_key_here
ALLOWED_HOSTS	Allowed host list	localhost,127.0.0.1
DEBUG	Debug mode	False

Database Schema

Crawls Table

CREATE TABLE crawls (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    url TEXT NOT NULL,
    depth INTEGER NOT NULL,
    internal_links TEXT,
    external_links TEXT,
    title TEXT,
    crawled_at TIMESTAMP
)

Licenses Table

CREATE TABLE licenses (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    key TEXT UNIQUE NOT NULL,
    type TEXT NOT NULL,
    expiration TIMESTAMP
)
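The two tables can be exercised with Python's built-in sqlite3 module. The sketch below uses an in-memory database and makes two assumptions that the schema does not state: link lists are stored as JSON text, and timestamps are stored as ISO 8601 strings.

```python
import json
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS crawls (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    url TEXT NOT NULL,
    depth INTEGER NOT NULL,
    internal_links TEXT,
    external_links TEXT,
    title TEXT,
    crawled_at TIMESTAMP
);
CREATE TABLE IF NOT EXISTS licenses (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    key TEXT UNIQUE NOT NULL,
    type TEXT NOT NULL,
    expiration TIMESTAMP
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Assumption: link lists are serialized as JSON, timestamps as ISO 8601 text.
conn.execute(
    "INSERT INTO crawls (url, depth, internal_links, external_links, title, crawled_at) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    (
        "https://example.com",
        0,
        json.dumps(["https://example.com/about"]),
        json.dumps([]),
        "Example Domain",
        datetime.now(timezone.utc).isoformat(),
    ),
)
conn.commit()

row = conn.execute("SELECT url, depth, internal_links FROM crawls").fetchone()
links = json.loads(row[2])
```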

License Management

Two types of licenses are supported:

One-time: Never expires
Subscription: 30-day validity
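The two rules above might translate into code like the following. This is a sketch under stated assumptions: a one-time license is represented by an empty (None) expiration, and the 30 days are counted from the moment the license is issued. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

SUBSCRIPTION_DAYS = 30


def expiration_for(license_type: str, issued_at: datetime) -> Optional[datetime]:
    """Return the expiration time, or None for a license that never expires."""
    if license_type == "one-time":
        return None
    if license_type == "subscription":
        return issued_at + timedelta(days=SUBSCRIPTION_DAYS)
    raise ValueError(f"unknown license type: {license_type!r}")


def is_valid(expiration: Optional[datetime], now: datetime) -> bool:
    """A license is valid if it never expires or has not expired yet."""
    return expiration is None or now < expiration


issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
sub_expiry = expiration_for("subscription", issued)
one_time_expiry = expiration_for("one-time", issued)
```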

Error Handling

Comprehensive error logging
Graceful handling of network issues
Timeout management for slow responses
Invalid license handling
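As one way to picture the timeout handling listed above, the sketch below wraps a coroutine in asyncio.wait_for, logs a warning if it takes too long, and returns None instead of raising. The helper `with_timeout` and the two demo coroutines are hypothetical; the project may handle timeouts differently.

```python
import asyncio
import logging
from typing import Awaitable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger("crawler")


async def with_timeout(coro: Awaitable[T], seconds: float, label: str) -> Optional[T]:
    """Await coro; on timeout, log a warning and return None."""
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        logger.warning("Timed out after %.2fs: %s", seconds, label)
        return None


async def fast_page() -> str:
    await asyncio.sleep(0)
    return "<html>fast</html>"


async def slow_page() -> str:
    await asyncio.sleep(1.0)
    return "<html>slow</html>"


async def demo():
    fast = await with_timeout(fast_page(), 0.5, "fast page")
    slow = await with_timeout(slow_page(), 0.05, "slow page")
    return fast, slow


fast_result, slow_result = asyncio.run(demo())
```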

Security Features

License key validation
Configurable allowed hosts
Rate limiting (configurable)
Input validation
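To make the input validation item concrete, here is a sketch of a check for the POST /crawl payload. The rules are assumptions chosen for illustration (absolute http or https URL, limits between 1 and the configured defaults, non-empty license key) and `validate_crawl_payload` is a hypothetical name; the project's actual rules may differ.

```python
from urllib.parse import urlparse


def validate_crawl_payload(payload, max_depth_limit=5, max_urls_limit=1000):
    """Return a list of error messages; an empty list means the payload is acceptable."""
    errors = []

    url = payload.get("url")
    parsed = urlparse(url) if isinstance(url, str) else None
    if parsed is None or parsed.scheme not in ("http", "https") or not parsed.netloc:
        errors.append("url must be an absolute http(s) URL")

    for field, limit in (("max_depth", max_depth_limit), ("max_urls", max_urls_limit)):
        value = payload.get(field)
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, int) or isinstance(value, bool) or not 1 <= value <= limit:
            errors.append(f"{field} must be an integer between 1 and {limit}")

    key = payload.get("license_key")
    if not isinstance(key, str) or not key.strip():
        errors.append("license_key must be a non-empty string")

    return errors


ok = validate_crawl_payload(
    {"url": "https://example.com", "max_depth": 3, "max_urls": 100,
     "license_key": "your_license_key"}
)
bad = validate_crawl_payload(
    {"url": "ftp://example.com", "max_depth": 99, "max_urls": 10, "license_key": ""}
)
```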

Contributing

Contributions are welcome! Please read our Contributing Guidelines for details.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Playwright team for the browser automation
Flask team for the web framework
Beautiful Soup team for HTML parsing
