A high-performance asynchronous web crawler built with Python, featuring a licensing system, Playwright-based rendering, and a RESTful API interface. The crawler efficiently handles JavaScript-rendered content and manages both internal and external links while respecting crawl depth limits.
- Asynchronous crawling with aiohttp and Playwright
- Built-in licensing system (subscription and one-time)
- SQLite database for storing crawl results
- RESTful API endpoints for control and monitoring
- Docker support
- Configurable crawl depth and URL limits
- JavaScript rendering support
- Detailed logging system
- Python 3.8+
- Playwright
- Flask
- aiohttp
- aiosqlite
- Beautiful Soup 4
- uvicorn
- Other dependencies in requirements.txt
- Clone the repository:
git clone https://github.com/yourusername/async-web-crawler.git
cd async-web-crawler
- Create and activate virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Install Playwright browsers:
playwright install
- Set up environment variables:
cp .env.example .env
Edit .env
with your settings:
DB_NAME=crawler.db
MAX_DEPTH=5
MAX_URLS=1000
SECRET_KEY=your_secret_key_here
ALLOWED_HOSTS=localhost,127.0.0.1
DEBUG=False
python app.py
Or with Docker:
docker build -t web-crawler .
docker run -p 5000:5000 web-crawler
- Start a Crawl
POST /crawl
{
"url": "https://example.com",
"max_depth": 3,
"max_urls": 100,
"license_key": "your_license_key"
}
- Get Results
GET /results?license_key=your_license_key&page=1&per_page=20
- Create License
POST /license
{
"key": "license_key",
"type": "subscription" # or "one-time"
}
Variable | Description | Default |
---|---|---|
DB_NAME | SQLite database name | crawler.db |
MAX_DEPTH | Maximum crawl depth | 5 |
MAX_URLS | Maximum URLs to crawl | 1000 |
SECRET_KEY | Flask secret key | your_secret_key_here |
ALLOWED_HOSTS | Allowed host list | localhost,127.0.0.1 |
DEBUG | Debug mode | False |
CREATE TABLE crawls (
id INTEGER PRIMARY KEY AUTOINCREMENT,
url TEXT NOT NULL,
depth INTEGER NOT NULL,
internal_links TEXT,
external_links TEXT,
title TEXT,
crawled_at TIMESTAMP
)
CREATE TABLE licenses (
id INTEGER PRIMARY KEY AUTOINCREMENT,
key TEXT UNIQUE NOT NULL,
type TEXT NOT NULL,
expiration TIMESTAMP
)
Two types of licenses are supported:
- One-time: Never expires
- Subscription: 30-day validity
- Comprehensive error logging
- Graceful handling of network issues
- Timeout management for slow responses
- Invalid license handling
- License key validation
- Configurable allowed hosts
- Rate limiting (configurable)
- Input validation
Contributions are welcome! Please read our Contributing Guidelines for details.
This project is licensed under the MIT License - see the LICENSE file for details.
- Playwright team for the browser automation
- Flask team for the web framework
- Beautiful Soup team for HTML parsing