GitHub - taishi-i/toiro at 0.0.4

Latest commit: taishi-i, "add v0.0.4", e66d997, Aug 16, 2020 (27 commits)

Name	Last commit message	Last commit date
.github/workflows	Create python-publish.yml	Aug 13, 2020
docker/cpu	fix build errors	Aug 13, 2020
docs	add v0.0.4	Aug 16, 2020
examples	fix 01_getting_started_ja.ipynb	Aug 16, 2020
test	add disable_tokenizers to test_tokenizers.py	Aug 16, 2020
toiro	add v0.0.4	Aug 16, 2020
.gitignore	Initial commit	Aug 13, 2020
.readthedocs.yml	add .readthedocs.yml	Aug 14, 2020
.travis.yml	fix .travis.yml	Aug 14, 2020
LICENSE	Initial commit	Aug 13, 2020
MANIFEST.in	Initial commit	Aug 13, 2020
README.md	add v0.0.4	Aug 16, 2020
setup.cfg	Initial commit	Aug 13, 2020
setup.py	add v0.0.4	Aug 16, 2020

toiro

Toiro is a tool for comparing Japanese tokenizers.

Compare the processing speed of tokenizers
Compare the words segmented by each tokenizer
Compare the performance of tokenizers by benchmarking application tasks (e.g., text classification)

It also provides useful functions for natural language processing in Japanese.

Data downloader for Japanese text corpora
Preprocessor of these corpora
Text classifier for Japanese text (e.g., SVM, BERT)

Installation

Python 3.6+ is required. You can install toiro with the following command. Janome is included in the default installation.

pip install toiro

Adding a tokenizer to toiro

If you want to add a tokenizer to toiro, please install it individually. This is an example of adding SudachiPy and nagisa to toiro.

pip install sudachipy sudachidict_core
pip install nagisa

If you want to install all the tokenizers at once, please use the following command.

pip install toiro[all_tokenizers]

Getting started

You can check which tokenizers are available in your Python environment.

from toiro import tokenizers

available_tokenizers = tokenizers.available_tokenizers()
print(available_tokenizers)

Toiro supports 9 different Japanese tokenizers. The output below is from an environment where SudachiPy and nagisa have been added to the default installation.

{'nagisa': {'is_available': True, 'version': '0.2.7'},
 'janome': {'is_available': True, 'version': '0.3.10'},
 'mecab-python3': {'is_available': False, 'version': False},
 'sudachipy': {'is_available': True, 'version': '0.4.9'},
 'spacy': {'is_available': False, 'version': False},
 'ginza': {'is_available': False, 'version': False},
 'kytea': {'is_available': False, 'version': False},
 'jumanpp': {'is_available': False, 'version': False},
 'sentencepiece': {'is_available': False, 'version': False}}
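
The dictionary above can be filtered to list only the tokenizers that are ready to use. The following is a minimal sketch, assuming the dictionary has the shape shown above. `status` is a hard-coded copy of part of that output so the snippet runs without toiro installed, and `usable_tokenizers` is a hypothetical helper, not part of toiro.

```python
# Hard-coded sample mirroring part of the output above. In practice this
# would be the value returned by tokenizers.available_tokenizers().
status = {
    "nagisa": {"is_available": True, "version": "0.2.7"},
    "janome": {"is_available": True, "version": "0.3.10"},
    "mecab-python3": {"is_available": False, "version": False},
    "sudachipy": {"is_available": True, "version": "0.4.9"},
}


def usable_tokenizers(status):
    """Return {name: version} for tokenizers flagged as available.

    Hypothetical helper for illustration; not part of toiro.
    """
    return {
        name: info["version"]
        for name, info in status.items()
        if info["is_available"]
    }


usable = usable_tokenizers(status)
print(sorted(usable))
```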

Download the livedoor news corpus and compare the processing speed of tokenizers.

from toiro import tokenizers
from toiro import datadownloader

# A list of available corpora in toiro
corpora = datadownloader.available_corpus()
print(corpora)
#=> ['livedoor_news_corpus', 'yahoo_movie_reviews', 'amazon_reviews']

# Download the livedoor news corpus and load it as pandas.DataFrame
corpus = corpora[0]
datadownloader.download_corpus(corpus)
train_df, dev_df, test_df = datadownloader.load_corpus(corpus)
texts = train_df[1]

# Compare the processing speed of tokenizers
report = tokenizers.compare(texts)
#=> [1/3] Tokenizer: janome
#=> 100%|███████████████████| 5900/5900 [00:07<00:00, 746.21it/s]
#=> [2/3] Tokenizer: nagisa
#=> 100%|███████████████████| 5900/5900 [00:15<00:00, 370.83it/s]
#=> [3/3] Tokenizer: sudachipy
#=> 100%|███████████████████| 5900/5900 [00:08<00:00, 696.68it/s]
print(report)
#=> {'execution_environment': {'python_version': '3.7.8.final.0 (64 bit)',
#=>   'arch': 'X86_64',
#=>   'brand_raw': 'Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz',
#=>   'count': 8},
#=>  'data': {'number_of_sentences': 5900, 'average_length': 37.69593220338983},
#=>  'janome': {'elapsed_time': 9.114670515060425},
#=>  'nagisa': {'elapsed_time': 15.873093605041504},
#=>  'sudachipy': {'elapsed_time': 9.05256724357605}}

# Compare the words segmented by each tokenizer
text = "都庁所在地は新宿区。"
tokenizers.print_words(text, delimiter="|")
#=>        janome: 都庁|所在地|は|新宿|区|。
#=>        nagisa: 都庁|所在|地|は|新宿|区|。
#=>     sudachipy: 都庁|所在地|は|新宿区|。
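
The elapsed times in the report are easier to compare once converted to throughput. The sketch below assumes the report has the shape printed above, with one entry per tokenizer holding an `elapsed_time` in seconds. `report` is a hard-coded copy of the relevant part of that output, and `sentences_per_second` is a hypothetical helper, not part of toiro.

```python
# Hard-coded sample copied from the report above (execution_environment omitted).
report = {
    "data": {"number_of_sentences": 5900, "average_length": 37.69593220338983},
    "janome": {"elapsed_time": 9.114670515060425},
    "nagisa": {"elapsed_time": 15.873093605041504},
    "sudachipy": {"elapsed_time": 9.05256724357605},
}


def sentences_per_second(report):
    """Return {tokenizer: sentences processed per second}.

    Hypothetical helper: entries without an 'elapsed_time' key
    (such as 'data') are skipped.
    """
    n_sentences = report["data"]["number_of_sentences"]
    return {
        name: n_sentences / entry["elapsed_time"]
        for name, entry in report.items()
        if "elapsed_time" in entry
    }


speeds = sentences_per_second(report)
# Print the tokenizers from fastest to slowest.
for name, speed in sorted(speeds.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>10}: {speed:.1f} sentences/s")
```

Timings depend on the machine and corpus, so the ranking from one run should be read as indicative only.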

Run toiro in Docker

You can use all tokenizers by running a Docker container from the image on Docker Hub.

docker run --rm -it taishii/toiro /bin/bash

How to run the Python interpreter in the Docker container

Run the Python interpreter.

root@cdd2ad2d7092:/workspace# python3

Compare the words segmented by each tokenizer

>>> from toiro import tokenizers
>>> text = "都庁所在地は新宿区。"
>>> tokenizers.print_words(text, delimiter="|")
mecab-python3: 都庁|所在地|は|新宿|区|。
       janome: 都庁|所在地|は|新宿|区|。
       nagisa: 都庁|所在|地|は|新宿|区|。
    sudachipy: 都庁|所在地|は|新宿区|。
        spacy: 都庁|所在|地|は|新宿|区|。
        ginza: 都庁|所在地|は|新宿区|。
        kytea: 都庁|所在|地|は|新宿|区|。
      jumanpp: 都庁|所在|地|は|新宿|区|。
sentencepiece: ▁|都|庁|所在地|は|新宿|区|。
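
With nine tokenizers, it can be hard to see at a glance which ones agree. The sketch below groups tokenizers that produce the same segmentation. It works on a hard-coded copy of the output above rather than calling toiro, and `group_by_segmentation` is a hypothetical helper, not part of toiro.

```python
from collections import defaultdict

# Hard-coded copy of the print_words output shown above.
segmentations = {
    "mecab-python3": "都庁|所在地|は|新宿|区|。",
    "janome": "都庁|所在地|は|新宿|区|。",
    "nagisa": "都庁|所在|地|は|新宿|区|。",
    "sudachipy": "都庁|所在地|は|新宿区|。",
    "spacy": "都庁|所在|地|は|新宿|区|。",
    "ginza": "都庁|所在地|は|新宿区|。",
    "kytea": "都庁|所在|地|は|新宿|区|。",
    "jumanpp": "都庁|所在|地|は|新宿|区|。",
    "sentencepiece": "▁|都|庁|所在地|は|新宿|区|。",
}


def group_by_segmentation(segmentations, delimiter="|"):
    """Map each distinct word sequence to the tokenizers that produced it.

    Hypothetical helper for illustration; not part of toiro.
    """
    groups = defaultdict(list)
    for name, line in segmentations.items():
        groups[tuple(line.split(delimiter))].append(name)
    return dict(groups)


groups = group_by_segmentation(segmentations)
for words, names in groups.items():
    print(f"{len(words)} words: {', '.join(names)}")
```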

Get more information about toiro

Tutorials

Tutorials in Japanese

01_getting_started_ja.ipynb
