Skip to content

Python parser library for extracting data from machine generated pdf invoice

Notifications You must be signed in to change notification settings

svolodarskyi/invoice-parser

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

27 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Invoice Parser for Account Payable Automation

Table of Contents

Overview

This library was developed for a purpose of accounts payable automation in a real-life company. It can be used as a standalone solution or as a part of a workflow.

At this point it is working only with machine generated pdf files.

OCR functionality will added at the next iterations.

Text extraction is handled by PdfPlumber library.

Setup

To use the project you need to install the dependencies listed in the requirements.txt file. Dependencies can be installed by running the following command:

pip install -r requirements.txt

Templates

This library supposes that each vendor has a separate template.

Templates are the collection of patterns (regular extpessions) and other parameters stored in JSON file. Patterns are used for extracting data from the invoice.

Template JSON file can store any other parameters. The libary is not restricted to any specific format of the templates.

Steps for adding vendor template:

  1. Creating JSON file with a template.
  2. Adding created template to the Template Mapping JSON file.

To see the example of template handling, refer to the section Usage below and sample folder of the repository.

Usage

Below is an example of using the invoice-parser library.

from invoiceparser.loader import Loader
from invoiceparser.pdfparser import PdfParser
from invoiceparser.invoice import Invoice
from invoiceparser.matcher import Matcher
from invoiceparser.templatehandler import TemplateHandler
from invoiceparser.datacollector import DataCollector


# if GUI is used for the front-end, below variable will be passed from it
vendorname = 'Vendor1'
templatemapping = 'TemplateMapping.json'
folder_with_pdf = 'C:/Users/user/invoiceparser/src'
output_folder = 'C:/Users/user/invoiceparser/'

#loading pdf files
files = Loader.get_pdf_files(folder_with_pdf)

#loading the template
template = TemplateHandler.load_template(templatemapping, vendorname)

#creating repository for storing parsed data
parsed_data = DataCollector()

for f in files:

    inv = Invoice(template, PdfParser.extract_raw_text(f))

    invdata = {}

    # loaded template contains value indicator.
    # it specifies if value contains regular expression or not.
  
    for key, value in inv.template.items():
        if inv.template[key][0] == 'regex':
            invdata[key], inv.text = Matcher.find_match(inv.template[key][1], inv.text)
        else:
            invdata[key] = inv.template[key][1]
            
    # adding invoice data to repository
    parsed_data.add(invdata)

#viewing dataframe with parsed data
print(parsed_data.dataframe)

#save to excel, if needed
parsed_data.save_to_excel(output_folder)

About

Python parser library for extracting data from machine generated pdf invoice

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages