Bite-2-Byte is an innovative web scraping tool designed to crawl websites, extract Questions and Answers (Q&A) content, and format it into structured data suitable for AI model training, particularly for Large Language Models (LLMs). The name reflects the process of 'biting' through raw web information to convert it into 'bytes' of usable data.
- User-Friendly Interface: Interactive CLI for easy operation.
- Web Crawling: Automatically navigates through website links starting from a base URL.
- Q&A Extraction: Identifies and extracts question-answer pairs from web content.
- Data Formatting: Supports multiple output formats (JSONL, CSV, TXT) for AI training compatibility.
- Validation: Analyzes extracted data to ensure suitability for LLM training.
- Dependency Management: Automatically checks and installs required Python libraries.
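To make the Web Crawling feature more concrete, here is a minimal sketch of a breadth-first, same-site crawl using only the standard library. The names `LinkParser`, `same_site_links`, `crawl`, and the `max_pages` limit are hypothetical and chosen for illustration; the project's actual crawler may work differently. The `fetch` argument is any callable that maps a URL to HTML text, so the logic can be tried without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_site_links(base_url, page_url, html):
    """Return absolute, fragment-free http(s) links on the same host as base_url."""
    parser = LinkParser()
    parser.feed(html)
    host = urlparse(base_url).netloc
    links = []
    for href in parser.hrefs:
        # Resolve relative links against the current page and drop "#fragment".
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.netloc == host:
            if absolute not in links:
                links.append(absolute)
    return links


def crawl(base_url, fetch, max_pages=50):
    """Breadth-first crawl from base_url; returns a dict of URL -> HTML."""
    seen = {base_url}
    queue = deque([base_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html
        for link in same_site_links(base_url, url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Restricting the crawl to the base URL's host and capping the number of pages are assumptions made here to keep the sketch bounded; they are not documented behavior of Bite-2-Byte.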
The `bite.sh` script now includes OS detection to ensure compatibility across different systems:
- macOS: The script uses `pip` to install required Python packages from `requirements.txt`.
- Windows: It checks for `pip` availability. If `pip` is not found, it provides instructions for installing Python or using Chocolatey as an alternative package manager.
- Other OS: For unsupported systems, the script will prompt you to install dependencies manually.
To run the installation, simply execute `./bite.sh` in your terminal. The script will handle the rest based on your operating system.
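The branching described above can be sketched in Python as follows. This is only an illustration of the decision logic, not the contents of `bite.sh`: the function name `install_plan` and the returned messages are hypothetical, while `platform.system()` and `shutil.which()` are standard-library calls.

```python
import platform
import shutil


def install_plan(system, pip_available):
    """Map an OS name (as returned by platform.system()) to an install action."""
    if system == "Darwin":  # platform.system() reports macOS as "Darwin"
        return "pip install -r requirements.txt"
    if system == "Windows":
        if pip_available:
            return "pip install -r requirements.txt"
        return "install Python (or use Chocolatey), then rerun"
    # Any other system is treated as unsupported, as described above.
    return "install dependencies manually"


if __name__ == "__main__":
    # Report what would happen on the current machine.
    print(install_plan(platform.system(), shutil.which("pip") is not None))
```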
- Clone the Repository (or download the zip file):
git clone https://github.com/vinipx/bite-2-byte.git
cd bite-2-byte
- Install Dependencies:
Ensure you have Python 3.6+ and pip installed. Then run:
pip install -r requirements.txt
Alternatively, the CLI script will attempt to install missing libraries on startup.
Bite-2-Byte offers two ways to run the application:
- Interactive CLI Script (Recommended for ease of use):
./bite.sh
This script will guide you through providing the base URL and selecting a data format.
- Direct Python Execution:
python main.py --url https://example.com --format jsonl
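For readers curious how the `--url` and `--format` flags could be handled, here is a minimal `argparse` sketch. It is an assumption about how such a command line might be parsed, not a copy of `main.py`; the function name `build_parser`, the help strings, and making `--url` required are all illustrative choices. The `jsonl` default follows the output format list in this README.

```python
import argparse


def build_parser():
    """Build a parser for the two flags shown in the command above."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Crawl a site and export Q&A pairs.",
    )
    # Assumption: the base URL is mandatory when running without the CLI script.
    parser.add_argument("--url", required=True, help="Base URL to start crawling from")
    parser.add_argument(
        "--format",
        choices=["jsonl", "csv", "txt"],
        default="jsonl",
        help="Output format for the extracted data",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"Would crawl {args.url} and write {args.format} output")
```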
- Run the app using `./bite.sh`.
- Enter a base URL (e.g., `https://example.com`).
- Choose a data format (JSONL, CSV, or TXT).
- The app will crawl the site, extract Q&A data, validate it, and save the results.
Bite-2-Byte includes a validation step to ensure the extracted data is suitable for AI training:
- Checks for minimum length of questions (10 characters) and answers (20 characters).
- Verifies that questions end with a question mark.
- Requires at least 70% of extracted pairs to meet quality criteria.
- Provides detailed feedback if the data is deemed unsuitable.
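The rules above can be expressed as a short validation routine. The sketch below is one possible reading of them, not the project's own code: the function names, the feedback wording, measuring length after stripping whitespace, and treating exactly 70% as passing are all assumptions.

```python
MIN_QUESTION_LEN = 10   # characters, from the rules above
MIN_ANSWER_LEN = 20     # characters, from the rules above
MIN_PASS_RATIO = 0.70   # share of pairs that must meet every criterion


def pair_problems(question, answer):
    """Return a list of human-readable problems for one Q&A pair."""
    q, a = question.strip(), answer.strip()
    problems = []
    if len(q) < MIN_QUESTION_LEN:
        problems.append(f"question shorter than {MIN_QUESTION_LEN} characters")
    if not q.endswith("?"):
        problems.append("question does not end with a question mark")
    if len(a) < MIN_ANSWER_LEN:
        problems.append(f"answer shorter than {MIN_ANSWER_LEN} characters")
    return problems


def validate(pairs):
    """Return (is_suitable, pass_ratio, feedback) for a list of Q&A dicts."""
    feedback = []
    passed = 0
    for index, pair in enumerate(pairs):
        problems = pair_problems(pair["question"], pair["answer"])
        if problems:
            feedback.append(f"pair {index}: " + "; ".join(problems))
        else:
            passed += 1
    # An empty data set is treated as unsuitable (assumption).
    ratio = passed / len(pairs) if pairs else 0.0
    return ratio >= MIN_PASS_RATIO, ratio, feedback
```

Returning the per-pair feedback alongside the verdict keeps the "detailed feedback" requirement separate from the pass/fail decision, so a caller can print it only when the data is rejected.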
The extracted Q&A data will be saved in the specified format as `training_data.[format]`:
- JSONL (default): Each line is a JSON object with `question`, `answer`, and `source` fields.
- CSV: Structured table with columns for `question`, `answer`, and `source`.
- TXT: Human-readable format with Q&A pairs separated by new lines.
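As an illustration of the three output formats, the following sketch serializes a list of Q&A dictionaries with the standard library. The function names and the `Q:`/`A:` labels in the TXT layout are hypothetical; the README only specifies the field names, the file name pattern, and that TXT pairs are separated by new lines.

```python
import csv
import io
import json
import os

FIELDS = ("question", "answer", "source")


def to_jsonl(pairs):
    """One JSON object per line with question, answer, and source fields."""
    return "".join(
        json.dumps({key: pair[key] for key in FIELDS}, ensure_ascii=False) + "\n"
        for pair in pairs
    )


def to_csv(pairs):
    """A header row followed by one row per pair."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(pairs)
    return buffer.getvalue()


def to_txt(pairs):
    """Human-readable pairs separated by a blank line (layout is an assumption)."""
    return "\n".join(f"Q: {pair['question']}\nA: {pair['answer']}\n" for pair in pairs)


SERIALIZERS = {"jsonl": to_jsonl, "csv": to_csv, "txt": to_txt}


def save(pairs, fmt, directory="."):
    """Write training_data.<fmt> into directory and return the file path."""
    path = os.path.join(directory, f"training_data.{fmt}")
    # newline="" keeps the csv module's own line endings intact.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(SERIALIZERS[fmt](pairs))
    return path
```

For example, `save(pairs, "jsonl")` would write `training_data.jsonl` to the current directory under these assumptions.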
Contributions are welcome! If you'd like to improve Bite-2-Byte, please follow these steps:
- Fork the repository.
- Create a new branch (`git checkout -b feature/your-feature-name`).
- Make your changes and commit them (`git commit -am 'Add some feature'`).
- Push to the branch (`git push origin feature/your-feature-name`).
- Create a new Pull Request.
Please ensure your code adheres to the project's coding standards and includes appropriate documentation.
This project is licensed under the MIT License - see the LICENSE file for details.
For questions, suggestions, or issues, please open an issue on GitHub or contact the maintainers at vinipxf@gmail.com.
Built with ❤️ for AI enthusiasts and data scientists.