pdf2txt-multipage-extractor is a Python tool designed to extract text from PDF files, processing entire directories efficiently. Optimized with multiprocessing, it can handle thousands of PDFs, making it suitable for large-scale document conversion tasks.
- Batch Processing: Converts all PDFs in a specified folder to text files.
- Multiprocessing: Utilizes multiple CPU cores for faster processing.
- Encoding Handling: Manages various text encodings to ensure accurate text representation.
- Indentation Preservation: Retains the original indentation and structure from PDFs.
- Pdf_test/: Directory containing sample PDFs for testing purposes.
- Python2/: Contains the Python 2 version of the extraction script.
- Python3/: Contains the Python 3 version of the extraction script.
- README.md: This file, providing an overview of the project.
- Python 3.x (recommended)
- pdfminer.six: A PDF parsing library for text extraction.
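A quick way to confirm both requirements before running anything is sketched below. The helper name `check_environment` is hypothetical and not part of this repository; the only fact it relies on is that pdfminer.six installs an importable package named `pdfminer`.

```python
import sys
from importlib.util import find_spec


def check_environment():
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    # The README recommends Python 3.x.
    if sys.version_info < (3,):
        problems.append("Python 3.x is recommended")
    # pdfminer.six is imported as `pdfminer`.
    if find_spec("pdfminer") is None:
        problems.append("pdfminer.six is not installed")
    return problems


for problem in check_environment():
    print("warning:", problem)
```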
- Clone the Repository:
    git clone https://git.1-hub.cnLorysHamadache/pdf2txt-multipage-extractor.git
    cd pdf2txt-multipage-extractor/Python3
- Install Dependencies:
    pip install -r requirements.txt
- Prepare Your Directories:
  - Place all target PDF files into a source directory (e.g., input_pdfs/).
  - Ensure an output directory exists for the text files (e.g., output_txts/).
- Run the Extraction Script:
    python pdf_to_txt.py input_pdfs/ output_txts/
  Replace pdf_to_txt.py with the actual script name if different.
- Review Output:
  - Each PDF in input_pdfs/ will have a corresponding .txt file in output_txts/ containing the extracted text.
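For large batches, the review step can be spot-checked with a few lines of Python. The sketch below assumes each output file keeps the PDF's base name with a `.txt` extension, which this README implies but does not state; `missing_outputs` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def missing_outputs(src_dir, dst_dir):
    """Return names of PDFs in src_dir with no same-named .txt in dst_dir."""
    dst = Path(dst_dir)
    return sorted(
        pdf.name
        for pdf in Path(src_dir).glob("*.pdf")
        # Assumed naming: report.pdf -> report.txt
        if not (dst / (pdf.stem + ".txt")).is_file()
    )
```

Calling `missing_outputs("input_pdfs", "output_txts")` after a run returns an empty list when every PDF produced a text file, and otherwise lists the PDFs worth re-running or inspecting by hand.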
- Library Selection:
  - pdfminer.six was chosen for its reliability in text extraction across diverse PDF formats.
- Extraction Process:
  - Single PDF Handling: Developed a function to extract text, ensuring proper encoding and indentation.
  - Batch Processing: Implemented a loop to process all PDFs in the source directory.
  - Multiprocessing: Utilized Python's multiprocessing module to enhance processing speed by leveraging multiple CPU cores.
- Performance:
  - Tested on 10,031 PDFs, completing extraction in approximately 12 minutes under typical system load.
  - Optimization Attempts: Explored multithreading, but observed no significant performance gains, because Python's Global Interpreter Lock (GIL) prevents threads from running CPU-bound Python code in parallel.
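The three pieces described in these notes (a single-PDF function, a loop over the source directory, and a process pool) might fit together roughly as follows. This is a minimal sketch and not the repository's actual script: `convert_one`, `plan_jobs`, and `convert_directory` are hypothetical names, and the `.txt` naming and UTF-8 output are assumptions. The only library calls it relies on are `pdfminer.high_level.extract_text` and `multiprocessing.Pool`. Layout tuning for indentation (for example via pdfminer's `LAParams`) is left out.

```python
from multiprocessing import Pool
from pathlib import Path


def convert_one(job):
    """Extract text from one PDF and write it as UTF-8; return the output path."""
    pdf_path, txt_path = job
    # Imported inside the worker so the module loads where it is used.
    from pdfminer.high_level import extract_text

    text = extract_text(str(pdf_path))
    txt_path.write_text(text, encoding="utf-8")
    return txt_path


def plan_jobs(src_dir, dst_dir):
    """Pair every PDF in src_dir with a same-named .txt path in dst_dir."""
    dst = Path(dst_dir)
    return [
        (pdf, dst / (pdf.stem + ".txt"))
        for pdf in sorted(Path(src_dir).glob("*.pdf"))
    ]


def convert_directory(src_dir, dst_dir, workers=None):
    """Convert all PDFs using a pool of worker processes (default: one per core)."""
    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    jobs = plan_jobs(src_dir, dst_dir)
    with Pool(processes=workers) as pool:
        # Results arrive as workers finish; order does not matter here.
        return list(pool.imap_unordered(convert_one, jobs, chunksize=16))
```

To try it, call `convert_directory("input_pdfs", "output_txts")` from under an `if __name__ == "__main__":` guard, which `multiprocessing` requires on platforms that spawn new interpreters. Processes are used instead of threads because each worker has its own interpreter and its own GIL, which is consistent with the observation above that multithreading gave no speedup.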
This project is licensed under the MIT License. See the LICENSE file for details.
Contributions are welcome! Please fork the repository and submit a pull request with your enhancements.
For questions or suggestions, please open an issue in the repository.
Note: This project is provided "as-is" without warranty of any kind. Use at your own discretion.