Commit 923cea7 ("first commit", committed Feb 11, 2020)
0 parents, 26 files changed, +2813 / -0 lines

‎LICENSE

+202 lines

                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "[]"
      replaced with your own identifying information. (Don't include
      the brackets!) The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright 2014 The Board of Trustees of The Leland Stanford Junior University

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.

‎README.md

+90 lines

# Docker environment for natural language processing

## Overview
Uses GloVe word vectors and SCDV sentence vectors to perform similar-word and similar-sentence search.

## Environment

- Host environment

| Component | Version |
| --- | --- |
| Ubuntu | 18.04 |
| Docker | 18.09.7 |
| Docker Compose | 1.17.1 |

- Docker images

| Component | Image |
| --- | --- |
| Jupyter Lab | jupyter/datascience-notebook |
| Elasticsearch | docker.elastic.co/elasticsearch/elasticsearch:7.5.0 |
| Kibana | docker.elastic.co/kibana/kibana:7.5.0 |

## Docker image overview
- Jupyter Lab
  - Sudachi and Ginza installed
  - Japanese font installed for matplotlib

- Elasticsearch
  - analysis-sudachi-elasticsearch plugin installed

## Setup
- Install Docker.
```
$ sudo apt update
$ sudo apt install docker docker-compose
```
- Run the initial setup.
```
$ ./init.sh
```

- Start the containers with docker-compose.
```
$ sudo docker-compose up
```

## Accessing each environment
- Jupyter Lab container
  http://[host]:8888

- Kibana
  http://[host]:5601

- Elasticsearch
  http://[host]:9200

## Usage
The work is basically done in Jupyter Lab.

### 1. Initial setup, data crawling, and word/sentence vectorization with GloVe and SCDV
1. Access the Jupyter Lab container.
2. Open nlp_book.ipynb.
3. Run the cells in order from the top.

The main processing steps are (one possible scripted ordering is sketched below):
- Data crawling
- Data preprocessing
- Registering the data in Elasticsearch
- Tokenizing the sentences with Elasticsearch
- Building word vectors with GloVe
- Building sentence vectors with SCDV and registering them in Elasticsearch
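
nlp_book.ipynb itself is not rendered in this diff, so the exact cell contents are unknown. The sketch below is only one plausible ordering of the scripts included in this commit, run from /home/jovyan/work inside the Jupyter container; the index names (anzen, accident) and the html/, json/, csv/, excel/, tokenized/ and vector/ paths are assumptions inferred from the other files, and the indices are assumed to have been created beforehand from es_anzen_schema.txt and es_accident_schema.txt.

```
# Hypothetical end-to-end driver; the real steps live in nlp_book.ipynb.
import subprocess

def run(cmd):
    # Echo and execute one pipeline step, stopping on the first failure.
    print("$", cmd)
    subprocess.run(cmd, shell=True, check=True)

run("mkdir -p html json csv excel tokenized vector")
# 1. Crawl the source pages and Excel files.
run("bash get_doc.sh")
# 2. Preprocess: HTML -> bulk JSON, Excel -> CSV.
run("python html_to_json.py ./html/ ./json/")
run("python excel_to_csv.py ./excel ./csv")
# 3. Register the documents in Elasticsearch.
run("bash load_anzen_bulk_es.sh")
run("python load_accident_es.py --host elasticsearch --index accident --input_dir ./csv")
# 4. Tokenize the sentences with the sudachi_analyzer defined in the index settings.
run("python es_anzen_tokenize.py --host elasticsearch --index anzen --output tokenized/anzen")
run("python es_accident_tokenize.py --host elasticsearch --index accident --output tokenized/accident")
run("python merge_csv.py --input_anzen tokenized/anzen.csv --input_accident tokenized/accident.csv --output_csv tokenized/merge.csv")
# 5. Train GloVe word vectors on the tokenized corpus (tokenized/all_tokens.txt).
run("cat tokenized/anzen.txt tokenized/accident.txt > tokenized/all_tokens.txt")
run("bash glove.sh")
# 6. Build SCDV sentence vectors and write them back to Elasticsearch.
run("python scdv_to_es.py --host elasticsearch --input_csv tokenized/merge.csv")
```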
### 2. Extracting similar words
Use the GloVe word vectors to extract words similar to a given word (a minimal sketch of the lookup is shown below).

1. Open Similarity.ipynb.
2. In the first cell, set the word to search for and how many results to return:
   - word : the word whose similar words should be extracted
   - top_k : how many of the top-ranked results to return
3. Run the cells.
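
Similarity.ipynb is not rendered in this diff. The following only illustrates how the lookup could be done with gensim and the word2vec-format vector file written by scdv.py (vector/gensim_glove_vectors.txt); the query word is just an example.

```
from gensim.models import KeyedVectors

# GloVe vectors converted to word2vec format by scdv.py (load_glove_vector).
glove_vectors = KeyedVectors.load_word2vec_format(
    "vector/gensim_glove_vectors.txt", binary=False)

word = "転倒"   # word to look up (example)
top_k = 10      # number of neighbours to return

# Cosine-similarity neighbours of the query word.
for token, score in glove_vectors.most_similar(word, topn=top_k):
    print(f"{token}\t{score:.3f}")
```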
### 3. LDA topic model
Classify the documents with an LDA topic model built from the tokenized words (a small sketch follows below).

1. Open LDA_topic_model.ipynb.
2. Run the cells from top to bottom. The cells:
   - train the LDA topic model,
   - visualize the characteristic words of each learned topic with WordCloud,
   - visualize the topic distribution with pyLDAvis.

‎docker-compose.yml

+44 lines

version: "3"

services:
  elasticsearch:
    build:
      context: ./elasticsearch
      dockerfile: dockerfile_elastic
    environment:
      - discovery.type=single-node
      - cluster.name=docker-cluster
      - bootstrap.memory_lock=true
      - xpack.security.enabled=false
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    ports:
      - 9200:9200
    volumes:
      - ./elasticsearch/es-data:/usr/share/elasticsearch/data
  kibana:
    image: docker.elastic.co/kibana/kibana:7.5.0
    ports:
      - 5601:5601
  jupyter:
    build:
      context: ./jupyter
      dockerfile: dockerfile_jupyter
    user: root
    environment:
      #NB_UID: 500
      #NB_GID: 100
      NB_UID: 1000
      NB_GID: 1000
      GRANT_SUDO: "yes"
      TZ: "Asia/Tokyo"
      JUPYTER_ENABLE_LAB: "yes"
    ports:
      - "8888:8888"
    volumes:
      - "./jupyter/data:/home/jovyan/work"
    privileged: true
    command: start.sh jupyter lab --NotebookApp.token='' --no-browser
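
Within the compose network the containers reach each other by service name, which is why the scripts under jupyter/data talk to elasticsearch:9200 rather than localhost. A minimal connectivity check from the Jupyter container could look like this (sketch only, using the pinned elasticsearch client):

```
from elasticsearch import Elasticsearch

# The service name from docker-compose.yml doubles as the host name.
es = Elasticsearch("elasticsearch:9200")
print(es.ping())                        # True once the cluster is up
print(es.cluster.health()["status"])    # e.g. "yellow" for a single-node cluster
```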

‎elasticsearch/dockerfile_elastic

+12 lines

ARG ELASTIC_VER=7.5.0
ARG ELASTIC_SUDACHI_VER=${ELASTIC_VER}-1.3.2
ARG SUDACHI_VER=0.3.2

FROM docker.elastic.co/elasticsearch/elasticsearch:${ELASTIC_VER}

COPY sudachi.json /usr/share/elasticsearch/config/sudachi/
COPY analysis-sudachi-elasticsearch7.5-1.3.2.zip /tmp/
COPY system_full.dic /usr/share/elasticsearch/config/sudachi/

RUN elasticsearch-plugin install file:///tmp/analysis-sudachi-elasticsearch7.5-1.3.2.zip && \
    rm /tmp/analysis-sudachi-elasticsearch7.5-1.3.2.zip
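
An optional check that the Sudachi analysis plugin really ended up inside the image, run from the Jupyter container with the pinned elasticsearch client (the host name elasticsearch comes from docker-compose.yml):

```
from elasticsearch import Elasticsearch

es = Elasticsearch("elasticsearch:9200")
# The cat API lists installed plugins; analysis-sudachi should appear in the output.
print(es.cat.plugins(format="json"))
```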

‎elasticsearch/sudachi.json

+25 lines

{
  "systemDict" : "system_full.dic",
  "inputTextPlugin" : [
    { "class" : "com.worksap.nlp.sudachi.DefaultInputTextPlugin" },
    { "class" : "com.worksap.nlp.sudachi.ProlongedSoundMarkInputTextPlugin",
      "prolongedSoundMarks": ["ー", "-", "⁓", "〜", "〰"],
      "replacementSymbol": "ー"}
  ],
  "oovProviderPlugin" : [
    { "class" : "com.worksap.nlp.sudachi.MeCabOovProviderPlugin" },
    { "class" : "com.worksap.nlp.sudachi.SimpleOovProviderPlugin",
      "oovPOS" : [ "補助記号", "一般", "*", "*", "*", "*" ],
      "leftId" : 5968,
      "rightId" : 5968,
      "cost" : 3857 }
  ],
  "pathRewritePlugin" : [
    { "class" : "com.worksap.nlp.sudachi.JoinNumericPlugin",
      "joinKanjiNumeric" : true },
    { "class" : "com.worksap.nlp.sudachi.JoinKatakanaOovPlugin",
      "oovPOS" : [ "名詞", "普通名詞", "一般", "*", "*", "*" ],
      "minLength" : 3
    }
  ]
}

‎init.sh

+11 lines

#!/bin/bash
cd ./elasticsearch
mkdir es-data

wget https://object-storage.tyo2.conoha.io/v1/nc_2520839e1f9641b08211a5c85243124a/sudachi/sudachi-dictionary-20200127-full.zip

unzip sudachi-dictionary-20200127-full.zip

mv sudachi-dictionary-20200127/system_full.dic .

wget https://github.com/WorksApplications/elasticsearch-sudachi/releases/download/v7.5.0-1.3.2/analysis-sudachi-elasticsearch7.5-1.3.2.zip

‎jupyter/data/Elasticsearch_sim_search.ipynb

+438 lines (large diff not rendered)

‎jupyter/data/LDA_topic_model.ipynb

+399 lines (large diff not rendered)

‎jupyter/data/Similarity.ipynb

+84 lines (large diff not rendered)

‎jupyter/data/es_accident_schema.txt

+77 lines

{
  "aliases" : {},
  "mappings":{
    "properties" : {
      "sentence" : {
        "type" : "text"
      },
      "category" : {
        "type" : "keyword"
      },
      "scdv_vector" : {
        "type" : "dense_vector",
        "dims" : 1000
      }
    }
  },
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "mode" : "search",
            "settings_path" : "/usr/share/elasticsearch/config/sudachi/sudachi.json",
            "resources_path" : "/usr/share/elasticsearch/config/sudachi/",
            "type" : "sudachi_tokenizer",
            "discard_punctuation" : "true"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "filter": [
              "sudachi_baseform",
              "lowercase",
              "my_posfilter",
              "my_stopfilter"
            ],
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        },
        "filter":{
          "my_posfilter":{
            "type":"sudachi_part_of_speech",
            "stoptags":[
              "接続詞","助動詞","助詞","記号","補助記号","名詞,数詞",
              "名詞,普通名詞,助数詞可能"
            ]
          },
          "my_stopfilter":{
            "type":"sudachi_ja_stop",
            "stopwords":[
              "は", "です", "する", "いる", "ため", "CM", "cm", "CM",
              "次", "名", "行う", "等", "者", "際", "こと", "ある",
              "この", "その", "そこ", "これ"
            ]
          }
        }
      }
    }
  }
}
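
The scdv_vector field above is a 1000-dimensional dense_vector, which is what enables the similar-sentence search mentioned in the README. Elasticsearch_sim_search.ipynb is not rendered in this diff, so the following is only a sketch of how such a query could look on Elasticsearch 7.5, using cosineSimilarity inside a script_score query; the index name accident is an assumption, and the query vector would come from scdv.py for the input sentence.

```
from elasticsearch import Elasticsearch

es = Elasticsearch("elasticsearch:9200")

def similar_sentences(query_vector, size=5):
    # Rank documents by cosine similarity between the stored SCDV vector and
    # the query vector (+1.0 keeps the score non-negative, as script_score requires).
    body = {
        "size": size,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": "cosineSimilarity(params.query_vector, doc['scdv_vector']) + 1.0",
                    "params": {"query_vector": query_vector},
                },
            }
        },
    }
    return es.search(index="accident", body=body)["hits"]["hits"]
```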

‎jupyter/data/es_accident_tokenize.py

+95 lines

import elasticsearch
import json, argparse

class elasticsearchClient():
    def __init__(self, host, port, index):
        self.host = host
        self.port = port
        self.index = index
        self.client = elasticsearch.Elasticsearch(self.host + ":" + self.port)

    # Tokenize a sentence with the index's sudachi_analyzer.
    def tokenize(self, sentence):
        body_ = {"analyzer": "sudachi_analyzer", "text": sentence}
        json_tokens = self.client.indices.analyze(
            index = self.index, body=body_)

        tokens = [token['token'] for token in json_tokens['tokens']]
        return tokens

    def parse_data(self, items):
        results = []

        for item in items:
            index = json.dumps(item['_id'])
            category = json.dumps(
                item['_source']['category'],
                indent=2, ensure_ascii=False)
            sentence = json.dumps(
                item['_source']['sentence'],
                indent=2, ensure_ascii=False)

            tokens = self.tokenize(sentence)
            results.append((index, category, sentence, tokens))
        return results

    # Fetch all documents with the scroll API.
    def get_all_data(self, scroll_time, scroll_size):
        results = []

        data = self.client.search(
            index = self.index,
            scroll = scroll_time,
            size = scroll_size,
            body = {})
        sid = data['_scroll_id']
        scroll_size = len(data['hits']['hits'])

        results = self.parse_data(data['hits']['hits'])

        while scroll_size > 0:
            data = self.client.scroll(
                scroll_id = sid,
                scroll = scroll_time)

            sid = data['_scroll_id']
            scroll_size = len(data['hits']['hits'])
            scroll_results = self.parse_data(data['hits']['hits'])
            results.extend(scroll_results)

        return results

    def update(self, row_id, body):
        response = self.client.update(
            index = self.index,
            id = row_id,
            body = body)
        print(response)

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--host', type=str)
    parser.add_argument('--port', type=str, default='9200')
    parser.add_argument('--index', type=str)
    parser.add_argument('--output', type=str)
    parser.add_argument('--scroll_limit', type=str, default='1m')
    parser.add_argument('--scroll_size', type=int, default=100)

    return parser.parse_args()

def main(args):
    client = elasticsearchClient(args.host, args.port, args.index)
    results = client.get_all_data(args.scroll_limit, args.scroll_size)

    output_txt = args.output + '.txt'
    output_csv = args.output + '.csv'
    with open(output_csv, "w") as f_csv:
        with open(output_txt, "w") as f_txt:
            f_csv.writelines('ID,category,sentence,tokens\n')

            for result in results:
                tokens = " ".join(result[3])
                f_csv.writelines(result[0] + ',' + result[1] + ',' + result[2] + ',"' + tokens + '"\n')
                f_txt.writelines(tokens + '\n')

if __name__ == '__main__':
    main(parse_args())
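
A small usage sketch for the client above, run from the Jupyter container. The index name accident and the example sentence are assumptions, and the index must already exist with the sudachi_analyzer from es_accident_schema.txt.

```
from es_accident_tokenize import elasticsearchClient

# Tokenize a single sentence with the sudachi_analyzer of the "accident" index.
client = elasticsearchClient("elasticsearch", "9200", "accident")
print(client.tokenize("作業中に脚立から転落し、負傷した。"))

# Or dump every document and its tokens, as main() does:
#   python es_accident_tokenize.py --host elasticsearch --index accident --output tokenized/accident
```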

‎jupyter/data/es_anzen_schema.txt

+111 lines

{
  "aliases" : {},
  "mappings":{
    "properties" : {
      "title" : {
        "type" : "nested",
        "properties" : {
          "title_id" : {"type" : "keyword"},
          "text" : { "type" : "text" },
          "vector" : {
            "type" : "dense_vector",
            "dims" : 1000
          }
        }
      },
      "situation" : {
        "type" : "nested",
        "properties" : {
          "situation_id" : {"type" : "keyword"},
          "text" : { "type" : "text" },
          "vector" : {
            "type" : "dense_vector",
            "dims" : 1000
          }
        }
      },
      "cause" : {
        "type" : "nested",
        "properties" : {
          "cause_id" : {"type" : "keyword"},
          "text" : { "type" : "text" },
          "vector" : {
            "type" : "dense_vector",
            "dims" : 1000
          }
        }
      },
      "measures" : {
        "type" : "nested",
        "properties" : {
          "measures_id" : {"type" : "keyword"},
          "text" : { "type" : "text" },
          "vector" : {
            "type" : "dense_vector",
            "dims" : 1000
          }
        }
      }
    }
  },
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "mode" : "search",
            "settings_path" : "/usr/share/elasticsearch/config/sudachi/sudachi.json",
            "resources_path" : "/usr/share/elasticsearch/config/sudachi/",
            "type" : "sudachi_tokenizer",
            "discard_punctuation" : "true"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "filter": [
              "sudachi_baseform",
              "lowercase",
              "my_posfilter",
              "my_stopfilter"
            ],
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        },
        "filter":{
          "my_posfilter":{
            "type":"sudachi_part_of_speech",
            "stoptags":[
              "接続詞","助動詞","助詞","記号","補助記号","名詞,数詞",
              "名詞,普通名詞,助数詞可能"
            ]
          },
          "my_stopfilter":{
            "type":"sudachi_ja_stop",
            "stopwords":[
              "は", "です", "する", "いる", "ため", "CM", "cm", "CM",
              "次", "名", "行う", "等", "者", "際", "こと", "ある",
              "この", "その", "そこ", "これ"
            ]
          }
        }
      }
    }
  }
}

‎jupyter/data/es_anzen_tokenize.py

+137 lines

# -*- coding: utf-8 -*-
import elasticsearch
import json, argparse

class elasticsearchClient():
    def __init__(self, host, port, index):
        self.host = host
        self.port = port
        self.index = index
        self.client = elasticsearch.Elasticsearch(self.host + ":" + self.port)

    # Tokenize a sentence with the index's sudachi_analyzer.
    def tokenize(self, sentence):
        body_ = {"analyzer": "sudachi_analyzer", "text": sentence}
        json_tokens = self.client.indices.analyze(
            index = self.index, body=body_)

        tokens = [token['token'] for token in json_tokens['tokens']]
        return tokens

    def parse_data(self, items):
        results = []

        for item in items:
            index = json.dumps(item['_id'])
            title = json.dumps(
                item['_source']['title']['text'],
                indent=2, ensure_ascii=False)
            title_id = json.dumps(
                item['_source']['title']['title_id'],
                indent=2, ensure_ascii=False)

            _cause = item['_source']['cause']
            cause = []
            cause_id = []
            for val in _cause:
                cause.append(json.dumps(val['text'], ensure_ascii=False))
                cause_id.append(json.dumps(val['cause_id'], ensure_ascii=False))

            situation = []
            situation_id = []
            _situation = item['_source']['situation']
            for val in _situation:
                situation.append(json.dumps(val['text'], ensure_ascii=False))
                situation_id.append(json.dumps(val['situation_id'], ensure_ascii=False))

            measures = []
            measures_id = []
            _measures = item['_source']['measures']
            for val in _measures:
                measures.append(json.dumps(val['text'], ensure_ascii=False))
                measures_id.append(json.dumps(val['measures_id'], ensure_ascii=False))

            title_tokens = self.tokenize(title)
            if len(title_tokens) > 0:
                results.append((index, "title", title_id, title, title_tokens))

            for (id, val) in zip(cause_id, cause):
                val_tokens = self.tokenize(val)
                if len(val_tokens) > 0:
                    results.append((index, "cause", id, val, val_tokens))

            for (id, val) in zip(situation_id, situation):
                val_tokens = self.tokenize(val)
                if len(val_tokens) > 0:
                    results.append((index, "situation", id, val, val_tokens))

            for (id, val) in zip(measures_id, measures):
                val_tokens = self.tokenize(val)
                if len(val_tokens) > 0:
                    results.append((index, "measures", id, val, val_tokens))

        return results

    # Fetch all documents with the scroll API.
    def get_all_data(self, scroll_time, scroll_size):
        results = []

        data = self.client.search(
            index = self.index,
            scroll = scroll_time,
            size = scroll_size,
            body = {})
        sid = data['_scroll_id']
        scroll_size = len(data['hits']['hits'])

        results = self.parse_data(data['hits']['hits'])

        while scroll_size > 0:
            data = self.client.scroll(
                scroll_id = sid,
                scroll = scroll_time)

            sid = data['_scroll_id']
            scroll_size = len(data['hits']['hits'])
            scroll_results = self.parse_data(data['hits']['hits'])

            results.extend(scroll_results)

        return results

    def update(self, row_id, body):
        response = self.client.update(
            index = self.index,
            id = row_id,
            body = body)
        print(response)

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--host', type=str, default='localhost')
    parser.add_argument('--port', type=str, default='9200')
    parser.add_argument('--index', type=str)
    parser.add_argument('--output', type=str)
    parser.add_argument('--scroll_limit', type=str, default='1m')
    parser.add_argument('--scroll_size', type=int, default=100)

    return parser.parse_args()

def main(args):
    client = elasticsearchClient(args.host, args.port, args.index)
    results = client.get_all_data(args.scroll_limit, args.scroll_size)

    output_csv = args.output + '.csv'
    output_txt = args.output + '.txt'
    with open(output_csv, "w") as f_csv:
        with open(output_txt, "w") as f_txt:
            f_csv.writelines('ID,種別,文章ID,文章,分かち書き\n')

            for result in results:
                tokens = " ".join(result[4])
                f_csv.writelines(result[0] + ',' + '"' + result[1] + '",' + result[2] + ',' + result[3].strip() + ',"' + tokens + '"\n')
                f_txt.writelines(tokens + '\n')

if __name__ == '__main__':
    main(parse_args())

‎jupyter/data/excel_to_csv.py

+42 lines

import pandas as pd
import glob, sys

args = sys.argv
excel_dir = args[1]
output_csv_dir = args[2]

files = glob.glob(excel_dir + '/*.xls*')

for file in files:
    excel = pd.ExcelFile(file)

    print(file)

    sheet_names = excel.sheet_names
    for i, name in enumerate(sheet_names):
        if i > 0:
            continue

        csv_file = file.replace(excel_dir, output_csv_dir).replace(".xlsx", "").replace(".xls", "") + '_' + str(i) + '.csv'
        sheet_df = excel.parse(name, header=[0, 1])
        columns_val = sheet_df.columns.values

        col_names = []
        for col_vals in columns_val:
            # Header rows that were merged cells are joined into one column name with '_'.
            # The category names mix full-width and half-width parentheses, so normalize them to half-width.
            col_name = col_vals[0].replace('\n', '') + '_' + col_vals[1].replace('\n','')
            col_name = col_name.replace('（','(').replace('）',')')

            if 'Unnamed' in col_vals[1]:
                col_name = col_vals[0].replace('\n','')
            col_names.append(col_name)
        sheet_df.columns = col_names

        situation_col_name = '災害状況'
        if 'kikaisaigai' in file:
            situation_col_name = '災害発生状況'

        sheet_df[situation_col_name] = sheet_df[situation_col_name].replace('\r\n','', regex=True).replace('\r','', regex=True).replace('\n','', regex=True)

        sheet_df.to_csv(csv_file)

‎jupyter/data/get_doc.sh

+38 lines

#!/bin/bash

for ((i=1 ; i<1131 ; i++))
do
    num=${i}
    file=./html/anzen_${num}.html

    echo ${num}
    wget https://anzeninfo.mhlw.go.jp/anzen_pg/SAI_DET.aspx?joho_no=${num} -O ${file}
done

for ((i=100003 ; i<101583 ; i++))
do
    num=${i}
    file=./html/anzen_${num}.html

    echo ${num}
    wget https://anzeninfo.mhlw.go.jp/anzen_pg/SAI_DET.aspx?joho_no=${num} -O ${file}
done

for ((i=3 ; i<30 ; i++))
do
    num=${i}
    if [ ${i} -lt 10 ]; then
        num=0${i}
    fi

    file=sibou_db_h${num}.xlsx

    if [ ${i} -lt 27 ]; then
        file=sibou_db_h${num}.xls
    fi

    echo ${file}
    wget https://anzeninfo.mhlw.go.jp/anzen/sib_xls/${file} -P ./excel/
done

wget https://anzeninfo.mhlw.go.jp/anzen/sai/kikaisaigai_db28.xlsx -P ./excel/

‎jupyter/data/glove.sh

+42 lines

#!/bin/bash
set -e

# Makes programs, downloads sample data, trains a GloVe model, and then evaluates it.
# One optional argument can specify the language used for eval script: matlab, octave or [default] python

CORPUS=tokenized/all_tokens.txt
VOCAB_FILE=./vector/glove_vocab.txt
COOCCURRENCE_FILE=./vector/glove_cooccurrence.bin
COOCCURRENCE_SHUF_FILE=./vector/glove_cooccurrence_shuf.bin
BUILDDIR=GloVe/build
SAVE_FILE=./vector/glove_vectors
VERBOSE=2
MEMORY=4.0
VOCAB_MIN_COUNT=0
VECTOR_SIZE=50
MAX_ITER=50
WINDOW_SIZE=15
BINARY=2
NUM_THREADS=8
X_MAX=10

echo
echo "$ $BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE"
$BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE
echo "$ $BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE"
$BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE
echo "$ $BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE"
$BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE
echo "$ $BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE"
$BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE
if [ "$CORPUS" = 'text8' ]; then
   if [ "$1" = 'matlab' ]; then
       matlab -nodisplay -nodesktop -nojvm -nosplash < ./eval/matlab/read_and_evaluate.m 1>&2
   elif [ "$1" = 'octave' ]; then
       octave < ./eval/octave/read_and_evaluate_octave.m 1>&2
   else
       echo "$ python eval/python/evaluate.py"
       python eval/python/evaluate.py
   fi
fi

‎jupyter/data/html_to_json.py

+58 lines

import json, glob, os.path, sys
from bs4 import BeautifulSoup

args = sys.argv
html_dir = args[1]
json_dir = args[2]

html_files = glob.glob(html_dir + "*")

for html_file in html_files:
    file_size = os.path.getsize(html_file)
    if file_size < 10000:
        print("[ERROR] {}".format(html_file))
        continue

    print(html_file)
    json_file = html_file.replace(html_dir, json_dir).replace(".html", ".json")

    id = html_file.replace(html_dir + "anzen_","").replace(".html","")

    html = BeautifulSoup(open(html_file, encoding="cp932"), 'html.parser')

    for i in html.select("br"):
        i.replace_with("\n")

    title = html.find('table').find('h1').text.strip()
    title_id = id + '_t_0'

    _cause = html.find("img", alt="原因").find_parent().find_parent().find('td').text.strip().replace("\u3000", "").split("\n")
    cause = []

    for i, val in enumerate(_cause):
        val = val.strip().replace("\t","").replace("\n","")
        if len(val) > 0:
            cause.append('{"cause_id":"%s", "text":"%s"}' % (id + '_c_' + str(i), val))
    cause = ",".join(cause)

    situation = []
    _situation = html.find("img", alt="発生状況").find_parent().find_parent().find('td').text.strip().replace("\u3000", "").split("\n")
    for i, val in enumerate(_situation):
        val = val.strip().replace("\t","").replace("\n","")
        if len(val) > 0:
            situation.append('{"situation_id":"%s", "text":"%s"}' % (id + '_s_' + str(i), val))
    situation = ",".join(situation)

    _measures = html.find("img", alt="対策").find_parent().find_parent().find('td').text.strip().replace("\u3000", "").split("\n")
    measures = []
    for i, val in enumerate(_measures):
        val = val.strip().replace("\t","").replace("\n","")
        if len(val) > 0:
            measures.append('{"measures_id":"%s", "text":"%s"}' % (id + '_m_' + str(i), val))
    measures = ",".join(measures)

    json_data = '{"index":{"_index":"anzen","_id":"%s"}},\n{"title":{"title_id":"%s", "text":"%s"},"situation":[%s],"cause":[%s],"measures":[%s]}' % (id, title_id, title, situation, cause, measures)

    with open(json_file, "w") as jw:
        jw.writelines(json_data + "\n\n")

‎jupyter/data/load_accident_es.py

+36 lines

import pandas as pd
import argparse, glob
from elasticsearch import Elasticsearch

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--host', type=str)
    parser.add_argument('--port', type=str, default='9200')
    parser.add_argument('--index', type=str)
    parser.add_argument('--input_dir', type=str)

    return parser.parse_args()

def main(args):
    csv_files = glob.glob(args.input_dir + '/*.csv')

    for csv_file in csv_files:
        situation_col_name = '災害状況'
        if 'kikaisaigai' in csv_file:
            situation_col_name = '災害発生状況'

        df = pd.read_csv(csv_file, encoding='utf-8', header=0)
        sentences = df[situation_col_name]
        categories = df['業種(大分類)_分類名']

        es = Elasticsearch(host=args.host, port=args.port)

        for col, sentence in enumerate(sentences):
            json_data = '{"category":"%s","sentence":"%s"}' % (categories[col], sentence)

            print(json_data)
            es.index(index=args.index, doc_type="_doc", body=json_data)

if __name__ == '__main__':
    main(parse_args())

‎jupyter/data/load_anzen_bulk_es.sh

+6 lines

#!/bin/bash

for FILE in `find ./json/ -maxdepth 1 -type f`; do
    echo ${FILE}
    curl -X POST -H "Content-Type: application/json" "elasticsearch:9200/anzen/_bulk?pretty" --data-binary @${FILE}
done

‎jupyter/data/merge_csv.py

+31 lines

import pandas as pd
import argparse


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_anzen', type=str)
    parser.add_argument('--input_accident', type=str)
    parser.add_argument('--output_csv', type=str)

    return parser.parse_args()

def main(args):
    anzen_csv = args.input_anzen
    accident_csv = args.input_accident
    merge_csv = args.output_csv

    anzen_df = pd.read_csv(anzen_csv)
    accident_df = pd.read_csv(accident_csv)

    # For the accident data, keep only the ID and token columns and add index/sentence_id columns.
    new_accident_df = accident_df.drop('sentence', axis=1).assign(index = 'accident').assign(sentence_id = 0)

    new_anzen_df = anzen_df.rename(columns={'種別':'category'}).drop('文章', axis=1).assign(index = 'anzen')
    new_anzen_df = new_anzen_df.rename(columns={'文章ID':'sentence_id'}).rename(columns={'分かち書き':'tokens'})

    merge_df = pd.concat([new_anzen_df, new_accident_df], sort=False)
    merge_df.to_csv(merge_csv, encoding='utf_8')

if __name__ == '__main__':
    main(parse_args())

‎jupyter/data/nlp_book.ipynb

+439 lines (large diff not rendered)

‎jupyter/data/scdv.py

+275 lines

import argparse, pickle, time
import numpy as np
import pandas as pd
import lightgbm as lgb
from gensim.models import KeyedVectors
from tqdm import tqdm
from sklearn.mixture import GaussianMixture
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

class SparseCompositeDocumentVectors:
    def __init__(self, num_clusters, pname1, pname2):
        self.min_no = 0
        self.max_no = 0
        self.prob_wordvecs = {}

        #### Input files
        # GloVe word-vector file
        self.glove_word_vector_file = "vector/glove_vectors.txt"

        #### Output files
        # GloVe word vectors with the vocabulary size and vector size prepended (word2vec format)
        self.gensim_glove_word_vector_file = "vector/gensim_glove_vectors.txt"

        # Pickle files holding the GMM results
        self.pname1 = pname1
        self.pname2 = pname2

        #### Other parameters
        # Number of GMM clusters
        self.num_clusters = num_clusters

        # GloVe vector dimensionality
        self.num_features = 50

    def load_glove_vector(self):
        # Read the GloVe word-vector file and write a copy with the vocabulary size and
        # vector size prepended on the first line so that gensim can load it.
        vectors = pd.read_csv(self.glove_word_vector_file, delimiter=' ', index_col=0, header=None)

        vocab_count = vectors.shape[0]        # vocabulary size
        self.num_features = vectors.shape[1]  # vector dimensionality

        with open(self.glove_word_vector_file, 'r') as original, open(self.gensim_glove_word_vector_file, 'w') as transformed:
            transformed.write(f'{vocab_count} {self.num_features}\n')
            transformed.write(original.read())  # the remaining lines are copied unchanged

        # Load the GloVe word vectors
        self.glove_vectors = KeyedVectors.load_word2vec_format(self.gensim_glove_word_vector_file, binary=False)

    def cluster_GMM2(self):
        glove_vectors = self.glove_vectors.vectors

        # Initialize a GMM object and use it for clustering.
        gmm_model = GaussianMixture(n_components=self.num_clusters, covariance_type="tied", init_params='kmeans', max_iter=100)
        # Get cluster assignments.
        gmm_model.fit(glove_vectors)
        idx = gmm_model.predict(glove_vectors)
        print ("Clustering Done...")
        # Get probabilities of cluster assignments.
        idx_proba = gmm_model.predict_proba(glove_vectors)
        # Dump cluster assignments and probability of cluster assignments.
        pickle.dump(idx, open(self.pname1,"wb"))
        print ("Cluster Assignments Saved...")

        pickle.dump(idx_proba,open(self.pname2, "wb"))
        print ("Probabilities of Cluster Assignments Saved...")
        return (idx, idx_proba)

    def cluster_GMM(self):
        # Cluster the GloVe word vectors with a GMM.

        clf = GaussianMixture(
            n_components=self.num_clusters,
            covariance_type="tied",
            init_params="kmeans",
            max_iter=50
        )

        glove_vectors = self.glove_vectors.vectors
        # Get cluster assignments.
        clf.fit(glove_vectors)
        idx = clf.predict(glove_vectors)
        print("Clustering Done...")
        # Get probabilities of cluster assignments.
        idx_proba = clf.predict_proba(glove_vectors)
        # Dump cluster assignments and probability of cluster assignments.
        pickle.dump(idx, open(self.pname1, "wb"))
        print("Cluster Assignments Saved...")
        pickle.dump(idx_proba, open(self.pname2, "wb"))
        print("Probabilities of Cluster Assignments saved...")
        return (idx, idx_proba)

    def read_GMM(self):
        # Load a previously saved GMM clustering result.

        idx = pickle.load(open(self.pname1, "rb"))
        idx_proba = pickle.load(open(self.pname2, "rb"))
        print("Cluster Model Loaded...")
        return (idx, idx_proba)

    def get_idf_dict(self, corpus):
        # Compute IDF values.
        # corpus : list of tokenized (whitespace-joined) sentences

        # Count the words
        count_vectorizer = CountVectorizer()
        X_count = count_vectorizer.fit_transform(corpus)

        # TF-IDF implementation from scikit-learn
        tfidf_vectorizer = TfidfVectorizer(token_pattern="(?u)\\b\\w+\\b")
        X_tfidf = tfidf_vectorizer.fit_transform(corpus)

        feature_names = tfidf_vectorizer.get_feature_names()
        idf = tfidf_vectorizer.idf_

        word_idf_dict = {}
        for pair in zip(feature_names, idf):
            word_idf_dict[pair[0]] = pair[1]

        return feature_names, word_idf_dict

    def get_probability_word_vectors(self, corpus):
        """
        corpus: list of tokenized sentences
        """

        # Load the GloVe word vectors.
        self.load_glove_vector()

        # Per-word GMM cluster assignments and probabilities
        idx, idx_proba = self.cluster_GMM()

        # Index of the most probable cluster for each word
        word_centroid_map = dict(zip(self.glove_vectors.index2word, idx))
        # Probability of each word belonging to each cluster
        word_centroid_prob_map = dict(zip(self.glove_vectors.index2word, idx_proba))

        # Compute IDF values.
        featurenames, word_idf_dict = self.get_idf_dict(corpus)

        for word in word_centroid_map:
            self.prob_wordvecs[word] = np.zeros(self.num_clusters * self.num_features, dtype="float32")
            for index in range(self.num_clusters):
                try:
                    self.prob_wordvecs[word][index*self.num_features:(index+1)*self.num_features] = \
                        self.glove_vectors[word] * word_centroid_prob_map[word][index] * word_idf_dict[word]
                except KeyError:
                    continue
        self.word_centroid_map = word_centroid_map

    def create_cluster_vector_and_gwbowv(self, tokens, flag):
        # Build the SDV (Sparse Document Vector) for one document.

        if isinstance(tokens, str):
            # Tokens coming from the CSV are whitespace-joined strings, so split them back into words.
            tokens = tokens.split()

        bag_of_centroids = np.zeros(self.num_clusters * self.num_features, dtype="float32")
        for token in tokens:
            try:
                temp = self.word_centroid_map[token]
            except KeyError:
                continue
            bag_of_centroids += self.prob_wordvecs[token]
        norm = np.sqrt(np.einsum('...i,...i', bag_of_centroids, bag_of_centroids))
        if norm != 0:
            bag_of_centroids /= norm

        # Record the running min and max over the training vectors so they can be sparsified later.
        if flag:
            self.min_no += min(bag_of_centroids)
            self.max_no += max(bag_of_centroids)
        return bag_of_centroids

    def make_gwbowv(self, corpus, train=True):
        # Build the matrix of document vectors.
        # gwbowv holds the (dense) document vectors.
        gwbowv = np.zeros((len(corpus), self.num_clusters*self.num_features)).astype(np.float32)
        cnt = 0
        for tokens in tqdm(corpus):
            gwbowv[cnt] = self.create_cluster_vector_and_gwbowv(tokens, train)
            cnt += 1

        return gwbowv

    def dump_gwbowv(self, gwbowv, path="gwbowv_matrix.npy", percentage=0.04):
        # Sparsify the document vectors and save them.

        # Compute the sparsification threshold.
        min_no = self.min_no*1.0/gwbowv.shape[0]
        max_no = self.max_no*1.0/gwbowv.shape[0]
        print("Average min: ", min_no)
        print("Average max: ", max_no)
        thres = (abs(max_no) + abs(min_no))/2
        thres = thres * percentage

        # Zero out the components below the threshold to make the vectors sparse.
        temp = abs(gwbowv) < thres
        gwbowv[temp] = 0
        np.save(path, gwbowv)
        print("SDV created and dumped...")

    def load_matrix(self, name):
        return np.load(name)

def parse_args():
    parser = argparse.ArgumentParser(
        description="GloVe and SCDV parameter settings"
    )
    parser.add_argument('--csv_file', type=str)
    parser.add_argument(
        '--num_clusters', type=int, default=20
    )
    parser.add_argument(
        '--pname1', type=str, default="vector/gmm_cluster.pkl"
    )
    parser.add_argument(
        '--pname2', type=str, default="vector/gmm_prob_cluster.pkl"
    )

    return parser.parse_args()

def build_model(csv_file, num_clusters, gmm_pname1, gmm_pname2):
    df = pd.read_csv(csv_file)

    index = df['index']
    doc_id = df['ID']
    sentence_id = df['sentence_id']
    categories = df['category']
    tokens = df['tokens']

    vec = SparseCompositeDocumentVectors(num_clusters, gmm_pname1, gmm_pname2)
    # Compute the probability-weighted word vectors
    vec.get_probability_word_vectors(tokens)
    # Compute the SCDV vectors for the data
    gwbowv = vec.make_gwbowv(tokens)

    print("sentence_id len:{}, gwbowv len:{}".format(len(sentence_id), len(gwbowv)))

    return zip(index, doc_id, sentence_id, categories, gwbowv)

def main(args):
    df = pd.read_csv(args.csv_file)
    categories = df['category'].unique()
    NUM_TOPICS = len(categories)

    # Split into training and test data
    train_data, test_data, train_label, test_label, train_id, test_id = train_test_split(
        df['tokens'], df['category'], df['ID'],
        test_size=0.1, train_size=0.9, stratify=df['category'], shuffle=True)

    vec = SparseCompositeDocumentVectors(args.num_clusters, args.pname1, args.pname2)
    # Compute the probability-weighted word vectors
    vec.get_probability_word_vectors(train_data)
    # Compute SCDV vectors for the training data
    train_gwbowv = vec.make_gwbowv(train_data)
    # Compute SCDV vectors for the test data
    test_gwbowv = vec.make_gwbowv(test_data, False)

    print("train size:{} vector size:{}".format(len(train_gwbowv), len(train_gwbowv[0])))
    print("test size:{} vector size:{}".format(len(test_gwbowv), len(test_gwbowv[0])))

    print("Test start...")

    start = time.time()
    clf = lgb.LGBMClassifier(objective="multiclass")
    clf.fit(train_gwbowv, train_label)
    test_pred = clf.predict(test_gwbowv)

    # print(test_pred)

    print ("Report")
    print (classification_report(test_label, test_pred, digits=6))
    print ("Accuracy: ",clf.score(test_gwbowv, test_label))
    print ("Time taken:", time.time() - start, "\n")

if __name__ == "__main__":
    main(parse_args())
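
A short usage sketch for the class above, assuming the merged CSV produced by merge_csv.py and a vector/ directory containing the GloVe output of glove.sh; the file names mirror the defaults hard-coded in the class and in scdv_to_es.py.

```
import pandas as pd
from scdv import SparseCompositeDocumentVectors

df = pd.read_csv("tokenized/merge.csv")   # assumed output of merge_csv.py
corpus = df["tokens"].astype(str)

vec = SparseCompositeDocumentVectors(
    num_clusters=20,
    pname1="vector/gmm_cluster.pkl",
    pname2="vector/gmm_prob_cluster.pkl",
)
# Cluster the GloVe vectors with a GMM and weight them by cluster probability and IDF.
vec.get_probability_word_vectors(corpus)
# Build one SCDV vector per document, then sparsify and save the matrix.
gwbowv = vec.make_gwbowv(corpus)
vec.dump_gwbowv(gwbowv, path="vector/scdv_matrix.npy")
```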

‎jupyter/data/scdv_to_es.py

+60 lines

import elasticsearch
import argparse, scdv

class elasticsearchClient():
    def __init__(self, host, port):
        self.host = host
        self.port = port
        self.client = elasticsearch.Elasticsearch(self.host + ":" + self.port, timeout=30)

    def update(self, index, doc_id, body):
        response = self.client.update(
            index = index,
            id = doc_id,
            body = body)
        print(response)

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--host', type=str, default='localhost')
    parser.add_argument('--port', type=str, default='9200')
    parser.add_argument('--input_csv', type=str)

    return parser.parse_args()

def create_script(sentence_type, sentence_id, vector):
    # Build the painless update script that writes the SCDV vector into the
    # matching (nested) field of the anzen documents.
    script = ""

    if sentence_type == "title":
        script = '{"script":{"source":"ctx._source.title.vector = params.vector","lang":"painless","params":{"id":"' + sentence_id + '","vector":' + vector + '}}}'
    else:
        script = '{"script":{"source":"for (int i = 0; i < ctx._source.' + sentence_type + '.length; i++) {if(ctx._source.' + sentence_type + '[i].' + sentence_type + '_id == params.id) { ctx._source.' + sentence_type + '[i].vector = params.vector; break}}","lang":"painless","params":{"id":"' + sentence_id + '","vector":"' + vector + '"}}}'

    return script

def main(args):
    client = elasticsearchClient(args.host, args.port)

    scdv_vec = scdv.build_model(args.input_csv, 20, "gmm_cluster.pkl", "gmm_prob_cluster.pkl")

    for index, doc_id, sentence_id, category, vector in scdv_vec:
        if index == 'anzen':
            sentence_type = "title"

            if '_c_' in sentence_id:
                sentence_type = "cause"
            elif '_m_' in sentence_id:
                sentence_type = "measures"
            elif '_s_' in sentence_id:
                sentence_type = "situation"

            vector = str(vector.tolist())
            script = create_script(sentence_type, sentence_id, vector)
            script = script.replace('"[','[').replace(']"',']')

            client.update(index, doc_id, script)
        elif index == 'accident':
            client.update(index, doc_id, {'doc':{'scdv_vector':vector.tolist()}})

if __name__ == '__main__':
    main(parse_args())

‎jupyter/dockerfile_jupyter

+19 lines

FROM jupyter/datascience-notebook

USER root
COPY requirements.txt /tmp/
RUN apt-get update -y && apt-get install vim sudo -y && \
    python -m pip install --upgrade pip setuptools && \
    python -m pip install -r /tmp/requirements.txt --no-cache-dir && \
    python -m pip install https://object-storage.tyo2.conoha.io/v1/nc_2520839e1f9641b08211a5c85243124a/sudachi/SudachiDict_full-20200127.tar.gz && \
    sudachipy link -t full && \
    curl -L "https://ipafont.ipa.go.jp/IPAexfont/ipaexg00201.zip" > font.zip && \
    unzip font.zip && \
    cp ipaexg00201/ipaexg.ttf /usr/share/fonts/truetype/ipaexg.ttf && \
    echo "font.family : IPAexGothic" >> /opt/conda/lib/python3.7/site-packages/matplotlib/mpl-data/matplotlibrc && \
    rm -r ./.cache && \
    jupyter serverextension enable --py jupyterlab && \
    chown -R jovyan /opt/conda
COPY sudachi.json /opt/conda/lib/python3.7/site-packages/sudachipy/resources/
COPY sudachi.json /opt/conda/lib/python3.7/site-packages/ja_ginza_dict/sudachidict/
WORKDIR /home/jovyan/work

‎jupyter/requirements.txt

+15 lines

elasticsearch==7.0.4
gensim==3.8.0
lightgbm==2.3.0
matplotlib==3.1.1
numpy==1.17.4
pandas==0.25.3
pyLDAvis==2.1.2
scikit-learn==0.22
SudachiDict-full
SudachiPy==0.4.2
tqdm==4.40.2
wordcloud==1.6.0
xlrd==1.2.0
ginza
jupyterlab

‎jupyter/sudachi.json

+27 lines

{
  "systemDict": "/opt/conda/lib/python3.7/site-packages/sudachidict_full/resources/system.dic",
  "characterDefinitionFile" : "char.def",
  "inputTextPlugin" : [
    { "class" : "sudachipy.plugin.input_text.DefaultInputTextPlugin" },
    { "class" : "sudachipy.plugin.input_text.ProlongedSoundMarkInputTextPlugin",
      "prolongedSoundMarks": ["ー", "-", "⁓", "〜", "〰"],
      "replacementSymbol": "ー"}
  ],
  "oovProviderPlugin" : [
    { "class" : "sudachipy.plugin.oov.MeCabOovProviderPlugin",
      "charDef" : "char.def",
      "unkDef" : "unk.def" },
    { "class" : "sudachipy.plugin.oov.SimpleOovProviderPlugin",
      "oovPOS" : [ "補助記号", "一般", "*", "*", "*", "*" ],
      "leftId" : 5968,
      "rightId" : 5968,
      "cost" : 3857 }
  ],
  "pathRewritePlugin" : [
    { "class" : "sudachipy.plugin.path_rewrite.JoinNumericPlugin",
      "enableNormalize" : true },
    { "class" : "sudachipy.plugin.path_rewrite.JoinKatakanaOovPlugin",
      "oovPOS" : [ "名詞", "普通名詞", "一般", "*", "*", "*" ],
      "minLength": 3 }
  ]
}
