Getting the indices of the words #70

eladbitton · 2021-08-31T11:00:08Z

I want to use the library to get the indices of each extracted word instead of the word strings themselves.

For example:
a list that tells me where each word starts and ends in the input string, e.g. [(0,2), (2,5), (6,8)]

Is there a way to do that?

titipata · 2021-08-31T11:09:32Z

Hi @eladbitton, short answer: the actual task of Deepcut is to predict whether each character is the beginning of a word. You can take the output of this line https://github.com/rkcosmos/deepcut/blob/master/deepcut/deepcut.py#L315 and derive the start and end of each word from it.

titipata · 2021-08-31T11:30:47Z

Here is how you can do it using deepcut:

import numpy as np
import deepcut
from deepcut import DeepcutTokenizer

text = "ฉันอยากกินข้าว"  # input text
x_char, x_type = deepcut.utils.create_feature_array(text)
y_predict = DeepcutTokenizer().model.predict([x_char, x_type])
y_predict = (y_predict.ravel() > 0.5).astype(int)  # 1 where a character starts a word
y_predict = y_predict[1:].tolist() + [1]  # shift left by one so 1 marks the last character of each word

position = [0] + np.where(y_predict)[0].tolist()  # 0, then the index of each word's last character
position = list(zip(position, position[1:]))  # pair consecutive positions

You will get the following output:

print(position)
>>> [(0, 2), (2, 6), (6, 9), (9, 13)]
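
Note that in the pairs above, each end value is the index of the word's last character and is reused as the next start, so `text[start:end]` does not return the word directly. If you would rather have half-open spans that slice cleanly, something like the following might work. It is only a sketch: `starts_to_spans` is a hypothetical helper, not part of deepcut, and the `flags` list is written by hand to stand in for the thresholded model output (before the shift), assuming the segmentation implied by the output above.

```python
import numpy as np


def starts_to_spans(is_start):
    """Turn per-character word-start flags into half-open (start, end) spans.

    Hypothetical helper, not part of deepcut.
    """
    flags = np.asarray(is_start).astype(bool)
    if flags.size == 0:
        return []
    flags = flags.copy()
    flags[0] = True  # the first character always opens a word
    starts = np.flatnonzero(flags).tolist()
    ends = starts[1:] + [int(flags.size)]  # each word ends where the next begins
    return list(zip(starts, ends))


text = "ฉันอยากกินข้าว"
# Hand-written stand-in for (y_predict.ravel() > 0.5).astype(int); an assumption.
flags = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]

spans = starts_to_spans(flags)
words = [text[start:end] for start, end in spans]
print(spans)
print(words)
```

With spans in this form, joining the sliced words gives back the original string, which is a quick way to check that no character was dropped or counted twice.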

eladbitton · 2021-08-31T11:38:47Z

Amazing, thank you!

titipata · 2021-08-31T11:48:36Z

@eladbitton sounds great. Let me know if it works for you!
