
A Python library for text normalization, specifically designed for Vietnamese and English text processing. This library provides comprehensive text normalization capabilities including handling of special characters, numbers, dates, and various text formats.


ducnt18121997/Viet-Text-Normalization


Text Normalization System - Regex Implementation

This is a Python implementation, based on regular expressions and rules, for converting written forms into their spoken (reading) forms. It was researched and developed by Dean Ng.
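To make the written-to-spoken idea concrete, here is a minimal sketch of a single regex rule that expands numeric dates in Vietnamese day/month/year order. The names `DATE_RE` and `expand_dates` are hypothetical and are not part of this repository's API. The sketch leaves the numbers as digits, on the assumption that a later rule would spell them out.

```python
import re

# Hypothetical rule for illustration only; not this repository's implementation.
# Matches dd/mm/yyyy, optionally preceded by the word "ngày" so it is not duplicated.
DATE_RE = re.compile(
    r"(?P<prefix>\bngày\s+)?\b(?P<d>\d{1,2})/(?P<m>\d{1,2})/(?P<y>\d{4})\b",
    re.IGNORECASE,
)


def _expand(match: re.Match) -> str:
    day, month = int(match["d"]), int(match["m"])
    # Leave implausible dates untouched instead of guessing.
    if not (1 <= day <= 31 and 1 <= month <= 12):
        return match.group(0)
    prefix = match["prefix"] or "ngày "
    return f"{prefix}{day} tháng {month} năm {match['y']}"


def expand_dates(text: str) -> str:
    """Rewrite dd/mm/yyyy dates as 'ngày D tháng M năm YYYY'."""
    return DATE_RE.sub(_expand, text)


if __name__ == "__main__":
    print(expand_dates("Họp vào 05/03/2024."))
```

A rule-based normalizer would chain many such patterns. The order of the rules matters, because an earlier rule can consume text that a later rule expects to see.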


Features

  • Vietnamese text normalization
  • Special character handling
  • Number and currency normalization
  • Date format normalization
  • Support for superscript and subscript characters
  • Complex text pattern recognition
  • Unit and currency handling
  • Roman numeral processing
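Two of the features above, superscript/subscript handling and Roman numeral processing, can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions, not the code in `cores/`. All names are hypothetical, and the output is plain digits that a later step would still need to verbalize. The two-letter minimum is an assumption made here so that the English pronoun "I" is not rewritten.

```python
import re

# Hypothetical helpers for illustration; not this repository's API.
SCRIPT_DIGITS = str.maketrans(
    "⁰¹²³⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉",
    "01234567890123456789",
)

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

# Candidate tokens: standalone runs of two or more Roman letters.
CANDIDATE_RE = re.compile(r"\b[IVXLCDM]{2,}\b")
# Strict shape of a well-formed Roman numeral (1 to 3999).
VALID_RE = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")


def flatten_scripts(text: str) -> str:
    """Map superscript and subscript digits to ordinary digits."""
    return text.translate(SCRIPT_DIGITS)


def roman_to_int(numeral: str) -> int:
    """Convert a well-formed Roman numeral to an integer."""
    total = 0
    for ch, nxt in zip(numeral, numeral[1:] + " "):
        value = ROMAN_VALUES[ch]
        # A smaller value before a larger one is subtracted (IV, IX, XL, ...).
        if nxt in ROMAN_VALUES and ROMAN_VALUES[nxt] > value:
            total -= value
        else:
            total += value
    return total


def replace_roman(text: str) -> str:
    """Replace well-formed Roman numerals with digits; leave other tokens alone."""
    def _sub(match: re.Match) -> str:
        token = match.group(0)
        if not VALID_RE.fullmatch(token):
            return token
        return str(roman_to_int(token))

    return CANDIDATE_RE.sub(_sub, text)


if __name__ == "__main__":
    print(replace_roman("thế kỷ XXI"))
    print(flatten_scripts("10 m² H₂O"))
```

A purely pattern-based rule like this still misreads uppercase words that happen to be valid numerals (for example "MIX"), so a real system needs surrounding context to decide.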

Installation

conda create --name venv python=3.8
conda activate venv
pip install -r requirements.txt

Set up a Java environment (required for VnCoreNLP):
  • Install a Java JDK (version 21 or compatible)
  • Set the JAVA_HOME environment variable to your JDK installation path
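Before loading the normalizer, it can help to confirm that JAVA_HOME points at a usable JDK. The helper below is a hypothetical sketch using only the standard library. It is not part of this repository, and it only checks that a `java` executable exists under `JAVA_HOME/bin`, not which version it is.

```python
import os
from typing import Mapping, Optional


def java_home_status(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "ok" or a short description of what is wrong with JAVA_HOME."""
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return "JAVA_HOME is not set"
    if not os.path.isdir(java_home):
        return f"JAVA_HOME is not a directory: {java_home}"
    # The launcher is named java.exe on Windows and java elsewhere.
    binary = "java.exe" if os.name == "nt" else "java"
    if not os.path.isfile(os.path.join(java_home, "bin", binary)):
        return f"no java executable under {os.path.join(java_home, 'bin')}"
    return "ok"


if __name__ == "__main__":
    print(java_home_status())
```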

Usage

from cores.normalizer import TextNormalizer

# Initialize the normalizer with VnCoreNLP model path
text_normalizer = TextNormalizer("./exps/vncorenlp/")

# Normalize text
text = "1. Những ngân hàng đang có lãi suất cho vay bình quân cao như Liên Việt, Bản Việt, Kiên Long với lãi suất từ 8,07 $ -  8,94$..."
normalized_text = text_normalizer(text)

Project Structure

text-normalization/
├── constants/         # Character sets and constants
├── cores/            # Core normalization logic
├── exps/             # Experiment configurations
├── logs/             # Log files
├── utils/            # Utility functions
├── example.py        # Usage examples
└── requirements.txt  # Project dependencies

Reference

NOTE: I may have forgotten some of the repositories I used as references for this project. If you notice one missing, please open an issue and I will add it. Thank you very much.

Cite

If you find this repository useful for your research, please cite:

@misc{deanng_2025,
    author = {Dean Nguyen},
    title = {Vietnamese Text Normalization},
    year = {2025},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/ducnt18121997/text-normalization}}
}
