Getting the indices of the words #70

Open
eladbitton opened this issue Aug 31, 2021 · 4 comments

@eladbitton

eladbitton commented Aug 31, 2021

I want to use the library to get the indices of each extracted word instead of the word strings themselves.

For example:
a list that tells me where each word starts and ends in the input string, e.g. [(0,2), (2,5), (6,8)]

Is there a way to do that?

@titipata
Collaborator

Hi @eladbitton, short answer: Deepcut's actual task is to predict, for each character, whether it is the beginning of a word. You can take the output of this line, https://github.com/rkcosmos/deepcut/blob/master/deepcut/deepcut.py#L315, and derive the start and end of each word from it.
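
To illustrate the idea, here is a minimal sketch of turning per-character "starts a word" flags into (start, end) spans. It does not call Deepcut at all: the helper name `spans_from_start_flags` is hypothetical, and the `flags` list is a hand-written stand-in for a thresholded model prediction, not real model output.

```python
def spans_from_start_flags(flags):
    """Turn per-character word-start flags into half-open (start, end) spans.

    flags[i] is truthy when character i is assumed to begin a word.
    """
    if not flags:
        return []
    starts = [i for i, flag in enumerate(flags) if flag]
    # The first character always begins a word, even if the flag is missing.
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    # Each word ends where the next one starts; the last ends at the text end.
    ends = starts[1:] + [len(flags)]
    return list(zip(starts, ends))


# Hand-written flags for a 14-character text split into four words (assumed).
flags = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
spans = spans_from_start_flags(flags)
print(spans)  # [(0, 3), (3, 7), (7, 10), (10, 14)]
```

With half-open spans like these, text[start:end] gives the word directly.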

@titipata
Collaborator

titipata commented Aug 31, 2021

Here is how you can do it using deepcut:

import numpy as np
import deepcut
from deepcut import DeepcutTokenizer

text = "ฉันอยากกินข้าว"  # input text
x_char, x_type = deepcut.utils.create_feature_array(text)
y_predict = DeepcutTokenizer().model.predict([x_char, x_type])
y_predict = (y_predict.ravel() > 0.5).astype(int)
y_predict = y_predict[1:].tolist() + [1]  # shift left by one so each 1 marks the last character of a word

position = [0] + np.where(y_predict)[0].tolist()  # 0 plus the index of each word's last character
position = list(zip(position, position[1:]))  # pair consecutive positions

You will get the following output:

print(position)
>>> [(0, 2), (2, 6), (6, 9), (9, 13)]
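
Note that, apart from the leading 0, the numbers in this output are the indices of each word's last character, so the pairs cannot be used directly as slices. As a rough sketch, they can be converted to half-open spans that slice the text. The list `last_char_indices` is copied by hand from the output above rather than produced by the model, and the variable names are only illustrative.

```python
text = "ฉันอยากกินข้าว"  # same input text as above

# Last-character index of each word, taken from the printed output (assumed).
last_char_indices = [2, 6, 9, 13]

# A word ends one past its last character; the next word starts there.
boundaries = [0] + [i + 1 for i in last_char_indices]
spans = list(zip(boundaries, boundaries[1:]))
words = [text[start:end] for start, end in spans]

print(spans)  # [(0, 3), (3, 7), (7, 10), (10, 14)]
print(words)
```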

@eladbitton
Author

Amazing, thank you!

@titipata
Collaborator

@eladbitton sounds great. Let me know if it works for you!
