Simple Approach for a Large Language Model
- Locality-sensitive encoding for tokens.
- Locality-sensitive hashing (LSH) for documents.
- Standard clustering methods to reduce complexity.
- Hash each context with LSH, including the token positions inside sentences.
- Find the nearest token using the centroid of each token's example contexts.
- Alternatively, cluster each token's context space so that each token has many centroids.
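The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in `llm.py`: the names `feature_code`, `context_vector`, `signature`, `train_centroids` and `predict_next`, the constants `N_BITS` and `WINDOW`, the SimHash-style encoding, and the use of cosine distance are all hypothetical choices made for the example.

```python
import math
import random
from collections import defaultdict

N_BITS = 64  # signature length (hypothetical value)
WINDOW = 3   # number of preceding tokens used as context (hypothetical value)

def feature_code(feature, n_bits=N_BITS):
    """Deterministic pseudo-random +/-1 code for a feature string."""
    rng = random.Random(feature)  # seeding with a str is reproducible
    return [rng.choice((-1.0, 1.0)) for _ in range(n_bits)]

def context_vector(context):
    """Sum the codes of (offset, token) features, so position matters."""
    vec = [0.0] * N_BITS
    for offset, token in enumerate(reversed(context[-WINDOW:]), start=1):
        for i, c in enumerate(feature_code(f"{offset}|{token}")):
            vec[i] += c
    return vec

def signature(vec):
    """SimHash-style signature: one bit per dimension, taken from the sign."""
    return tuple(1 if x >= 0 else 0 for x in vec)

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)

def train_centroids(tokens):
    """One centroid per token: the mean vector of the contexts preceding it."""
    sums = {}
    counts = defaultdict(int)
    for i in range(1, len(tokens)):
        vec = context_vector(tokens[max(0, i - WINDOW):i])
        acc = sums.setdefault(tokens[i], [0.0] * N_BITS)
        for j, x in enumerate(vec):
            acc[j] += x
        counts[tokens[i]] += 1
    return {t: [x / counts[t] for x in acc] for t, acc in sums.items()}

def predict_next(context, centroids):
    """Return the token whose centroid is nearest to the current context."""
    vec = context_vector(context)
    return min(centroids, key=lambda t: cosine_distance(vec, centroids[t]))

tokens = "the cat sat on the mat".split()
centroids = train_centroids(tokens)
print(predict_next(["the", "cat"], centroids))
```

The many-centroids variant from the last bullet is not shown here; it would replace the single mean per token with several cluster centers (for example from k-means) and search over all of them.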
Run:
python llmv2.py
The script uses the Bible corpus by default:
$ time python llm.py
[nltk_data] Downloading package punkt to
[nltk_data] /Users/localappleuser/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[micro-llm] corpus has a total of 921363 tokens.
[micro-llm] corpus has a total of 14135 unique tokens.
[micro-llm] persisting token id maps ...
[micro-llm] splitting corpus into docs ...
[micro-llm] total number of docs is 1184
[micro-llm] indexing document subdocs ...
doc len = 0 (doc 0 / 1184)
doc len = 921 (doc 1 / 1184)
doc len = 708 (doc 2 / 1184)
..
..
doc len = 602 (doc 1179 / 1184)
doc len = 883 (doc 1180 / 1184)
doc len = 729 (doc 1181 / 1184)
doc len = 539 (doc 1182 / 1184)
doc len = 880 (doc 1183 / 1184)
[llm] generating new text from "And" seed ...
And
6906 0.0
5859 0.0
182 0.0
50 0.0
Benhadad
4 0.0
34 0.683772233983162
45 0.7113248654051871
334 0.7327387580875756
were
171 0.6150998205402495
62 0.6348516283298893
32 0.6518446880886043
9 0.8315696157866962
host
4 0.8333333333333334
16 1.0
9 1.0
291 1.0
my
22 0.7418011102528389
47 0.7508635604387801
13 0.7550510257216821
..
..
this
246 0.38065363873158886
89 0.38228669538420224
156 0.39715057042827306
16 0.39823178234089296
,
9 0.340924913132306
1178 0.3417517402137905
142 0.34200965765063074
317 0.34209556278088116
and
405 0.3188326648059895
147 0.31886861884972906
11 0.31897644683085113
365 0.31982487461976106
.
405 0.3130719634933531
147 0.3131082216040909
11 0.3132169614973971
365 0.3139893565456462
And Benhadad were host my was past the for of went Jerusalem people people of : to the overthrown moment recovered , over , pray 's against they against they LORD , . pray burn this , and . .
python3 llm.py 2124.52s user 60.65s system 216% cpu 16:48.72 total
The final text generated in this example from the seed "And" is:
And Benhadad were host my was past the for of went Jerusalem people people of : to the overthrown moment recovered , over , pray 's against they against they LORD , . pray burn this , and . .