
Simple Approach for a Large Language Model using MinHash

jio-gl/micro-llm
micro-llm

Simple Approach for a Large Language Model

Features

  1. Locality-sensitive encoding for tokens.
  2. Locality-sensitive hashing (MinHash) for documents.
  3. Standard clustering methods to reduce complexity.
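The document-level hashing in feature 2 can be sketched with a few lines of standard-library Python. This is a minimal illustration of MinHash, not the repo's actual API; the function names and parameters are assumptions.

```python
import random

def minhash_signature(tokens, num_hashes=64, seed=0):
    """Locality-sensitive signature for a token sequence.

    Each of num_hashes random hash functions keeps its minimum value
    over the token set, so similar documents agree on many slots.
    """
    rng = random.Random(seed)
    p = (1 << 61) - 1  # large Mersenne prime for h(x) = (a*x + b) % p
    params = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_hashes)]
    token_ids = {hash(t) & 0xFFFFFFFF for t in tokens}
    return [min((a * x + b) % p for x in token_ids) for a, b in params]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching slots approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

Note that Python's built-in `hash` is salted per process, so signatures are only comparable within one run; a fixed hash (e.g. from `hashlib`) would make them reproducible across runs.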

Algorithm

  1. LSH-hash the context, including each token's position inside the sentence.
  2. Find the nearest token using the centroid of each token's example contexts.
  3. Alternatively, cluster each token's context space to obtain several centroids per token.
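Step 2 above can be sketched as a nearest-centroid lookup. The data layout, names, and Euclidean distance here are illustrative assumptions; the repo's actual code may differ.

```python
def centroid(signatures):
    """Mean of a token's example-context signatures, slot by slot."""
    n = len(signatures)
    return [sum(sig[i] for sig in signatures) / n for i in range(len(signatures[0]))]

def nearest_token(context_sig, token_centroids):
    """Predict the next token: the one whose centroid signature is
    closest (Euclidean distance) to the hashed context signature."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(token_centroids, key=lambda t: dist(context_sig, token_centroids[t]))
```

Step 3 would replace the single `centroid` per token with several (e.g. from k-means over that token's contexts) and take the minimum distance over all of them.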

Usage

Run:

python llmv2.py

Uses the Bible corpus by default:

$ time python llm.py
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/localappleuser/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[micro-llm] corpus has a total of 921363 tokens.
[micro-llm] corpus has a total of 14135 unique tokens.
[micro-llm] persisting token id maps ...
[micro-llm] splitting corpus into docs ...
[micro-llm] total number of docs is 1184
[micro-llm] indexing document subdocs ...
doc len = 0  (doc 0 / 1184)
doc len = 921  (doc 1 / 1184)
doc len = 708  (doc 2 / 1184)
..
..
doc len = 602  (doc 1179 / 1184)
doc len = 883  (doc 1180 / 1184)
doc len = 729  (doc 1181 / 1184)
doc len = 539  (doc 1182 / 1184)
doc len = 880  (doc 1183 / 1184)
[llm] generating new text from "And" seed ...
And 
6906 0.0
5859 0.0
182 0.0
50 0.0
Benhadad
4 0.0
34 0.683772233983162
45 0.7113248654051871
334 0.7327387580875756
were
171 0.6150998205402495
62 0.6348516283298893
32 0.6518446880886043
9 0.8315696157866962
host
4 0.8333333333333334
16 1.0
9 1.0
291 1.0
my
22 0.7418011102528389
47 0.7508635604387801
13 0.7550510257216821
..
..
this
246 0.38065363873158886
89 0.38228669538420224
156 0.39715057042827306
16 0.39823178234089296
,
9 0.340924913132306
1178 0.3417517402137905
142 0.34200965765063074
317 0.34209556278088116
and
405 0.3188326648059895
147 0.31886861884972906
11 0.31897644683085113
365 0.31982487461976106
.
405 0.3130719634933531
147 0.3131082216040909
11 0.3132169614973971
365 0.3139893565456462
And Benhadad were host my was past the for of went Jerusalem people people of : to the overthrown moment recovered , over , pray 's against they against they LORD , . pray burn this , and . .
python3 llm.py  2124.52s user 60.65s system 216% cpu 16:48.72 total

The final text generated in this example, starting from the seed "And", is:

And Benhadad were host my was past the for of went Jerusalem people people of : to the overthrown moment recovered , over , pray 's against they against they LORD , . pray burn this , and . .
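The token counts printed in the log above can be reproduced in spirit with a single pass over the corpus; whitespace splitting is used here as a stand-in for the NLTK punkt tokenizer that llm.py actually downloads, so the exact numbers will differ.

```python
def corpus_stats(text):
    """Total and unique token counts, as reported by the
    [micro-llm] log lines (punkt tokenization approximated
    here by whitespace splitting)."""
    tokens = text.split()
    return len(tokens), len(set(tokens))
```

For example, `corpus_stats("in the beginning the")` returns `(4, 3)`: four tokens, three of them unique.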
