This library was developed to automate accounts payable in a real company. It can be used as a standalone solution or as part of a larger workflow.
At this point it works only with machine-generated PDF files.
OCR functionality will be added in a later iteration.
Text extraction is handled by the pdfplumber library.
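As a rough illustration of what text extraction with pdfplumber looks like, the sketch below opens a PDF and joins the text of its pages. The function names `extract_raw_text` and `join_page_texts` are illustrative only; the library's own `PdfParser` may be implemented differently.

```python
from typing import Iterable, Optional


def join_page_texts(page_texts: Iterable[Optional[str]]) -> str:
    """Join per-page text, treating pages with no extractable text as empty."""
    return "\n".join(text or "" for text in page_texts)


def extract_raw_text(pdf_path: str) -> str:
    """Return the text of every page of a machine-generated PDF."""
    import pdfplumber  # third-party dependency, imported only when needed

    with pdfplumber.open(pdf_path) as pdf:
        # The generator is consumed while the file is still open.
        return join_page_texts(page.extract_text() for page in pdf.pages)
```

Scanned PDFs contain images rather than text, so a function like this would return little or nothing for them, which is why OCR is a separate feature.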
To use the project, install the dependencies listed in the requirements.txt file.
Dependencies can be installed by running the following command:
pip install -r requirements.txt
This library assumes that each vendor has a separate template.
A template is a collection of patterns (regular expressions) and other parameters stored in a JSON file. The patterns are used to extract data from the invoice.
A template JSON file can store any other parameters; the library is not restricted to a specific template format.
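Judging from the usage example in this README, each template entry is a pair: an indicator (`'regex'` or something else) followed by a value. The sketch below shows what such a template might look like and how its patterns could be applied with Python's `re` module. The field names, the patterns, the `'static'` indicator, and the `apply_template` helper are all made up for illustration and are not part of the library.

```python
import json
import re

# Hypothetical template: each key maps to [indicator, value].
TEMPLATE_JSON = r"""
{
    "vendor": ["static", "Vendor1"],
    "invoice_number": ["regex", "Invoice No\\.?\\s*(\\w+)"],
    "total": ["regex", "Total:\\s*([\\d.,]+)"]
}
"""


def apply_template(template: dict, text: str) -> dict:
    """Extract one value per template key from the invoice text."""
    result = {}
    for key, (indicator, value) in template.items():
        if indicator == "regex":
            match = re.search(value, text)
            # Use the first capture group; None when the pattern is not found.
            result[key] = match.group(1) if match else None
        else:
            # Non-regex entries are copied into the result unchanged.
            result[key] = value
    return result


template = json.loads(TEMPLATE_JSON)
sample_text = "Invoice No. A123\nTotal: 1,250.00"
parsed = apply_template(template, sample_text)
print(parsed)
```

Note that backslashes in regular expressions must be doubled inside JSON strings, which is easy to miss when writing templates by hand.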
- Creating a JSON file with the template.
- Adding the created template to the Template Mapping JSON file.
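The two steps above could look roughly like this. The sketch assumes that the Template Mapping file is a JSON object mapping a vendor name to the path of that vendor's template file. That layout, the `register_template` helper, and the file names are assumptions made for illustration, so check the sample folder for the actual format.

```python
import json
import tempfile
from pathlib import Path


def register_template(mapping_path: Path, vendor: str, template: dict,
                      template_path: Path) -> None:
    """Write a template file and record it in the mapping file."""
    # Step 1: create the JSON file with the template.
    template_path.write_text(json.dumps(template, indent=4), encoding="utf-8")

    # Step 2: add the template to the mapping file (assumed: vendor -> path).
    mapping = {}
    if mapping_path.exists():
        mapping = json.loads(mapping_path.read_text(encoding="utf-8"))
    mapping[vendor] = str(template_path)
    mapping_path.write_text(json.dumps(mapping, indent=4), encoding="utf-8")


# Demonstration in a temporary folder so no real files are touched.
workdir = Path(tempfile.mkdtemp())
register_template(
    workdir / "TemplateMapping.json",
    "Vendor1",
    {"currency": ["static", "USD"]},
    workdir / "vendor1.json",
)
mapping = json.loads((workdir / "TemplateMapping.json").read_text(encoding="utf-8"))
```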
To see an example of template handling, refer to the Usage section below and the sample folder of the repository.
Below is an example of using the invoice-parser library.
from invoiceparser.loader import Loader
from invoiceparser.pdfparser import PdfParser
from invoiceparser.invoice import Invoice
from invoiceparser.matcher import Matcher
from invoiceparser.templatehandler import TemplateHandler
from invoiceparser.datacollector import DataCollector
# if a GUI is used as the front end, the variables below will be passed from it
vendorname = 'Vendor1'
templatemapping = 'TemplateMapping.json'
folder_with_pdf = 'C:/Users/user/invoiceparser/src'
output_folder = 'C:/Users/user/invoiceparser/'
# loading pdf files
files = Loader.get_pdf_files(folder_with_pdf)
# loading the template
template = TemplateHandler.load_template(templatemapping, vendorname)
# creating repository for storing parsed data
parsed_data = DataCollector()
for f in files:
    inv = Invoice(template, PdfParser.extract_raw_text(f))
    invdata = {}
    # the loaded template contains a value indicator
    # that specifies whether the value is a regular expression or not
    for key, value in inv.template.items():
        if inv.template[key][0] == 'regex':
            invdata[key], inv.text = Matcher.find_match(inv.template[key][1], inv.text)
        else:
            invdata[key] = inv.template[key][1]
    # adding invoice data to repository
    parsed_data.add(invdata)
# viewing dataframe with parsed data
print(parsed_data.dataframe)
# save to Excel, if needed
parsed_data.save_to_excel(output_folder)
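In the example above, `Matcher.find_match` returns both the extracted value and an updated copy of the invoice text. The sketch below shows one way a function with that shape might behave, assuming it removes the matched fragment from the text so that later patterns cannot match it again. This is a guess at the behaviour, not the library's actual implementation.

```python
import re
from typing import Optional, Tuple


def find_match(pattern: str, text: str) -> Tuple[Optional[str], str]:
    """Return (first match or None, text with that match removed)."""
    match = re.search(pattern, text)
    if match is None:
        # Nothing found: leave the text untouched.
        return None, text
    # Prefer the first capture group if the pattern defines one.
    value = match.group(1) if match.groups() else match.group(0)
    remaining = text[:match.start()] + text[match.end():]
    return value, remaining


value, rest = find_match(r"Total:\s*([\d.,]+)", "Total: 99.50\nTotal: 10.00")
```

Under this assumption, consuming the matched fragment means a second pattern that looks similar (for example a subtotal and a total) would pick up the next occurrence instead of the same one.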