
Commit 94f4bf9: Initial commit
Committed Dec 20, 2015 (0 parents)

File tree

4 files changed: +54 lines, -0 lines


DSCapstone.Rproj (+13 lines)

Version: 1.0

RestoreWorkspace: Default
SaveWorkspace: Default
AlwaysSaveHistory: Default

EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 2
Encoding: UTF-8

RnwWeave: Sweave
LaTeX: pdfLaTeX

README.md (+4 lines)

# Data Science Capstone Project

This is my repo for work towards the capstone project. If you're already looking at GitHub,
this might actually be interesting for you. I'm using it mostly so that I can run large scripts
on a more powerful server.

analyser.R (+12 lines)

# Read a sample of the English news corpus and build a term-document matrix.
library("tm")
library("RWeka")

massive <- readLines("final/en_US/en_US.news.txt", n = 10000)
myCorpus <- Corpus(VectorSource(massive))

# Clean the corpus: lower-case, then strip numbers and punctuation.
# tolower returns a bare character vector, so wrap it in content_transformer();
# this keeps each document a PlainTextDocument without a separate conversion step.
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removeNumbers)
myCorpus <- tm_map(myCorpus, removePunctuation)
print("OK")

# Tokenizers: RWeka's NGramTokenizer for bigrams, WordTokenizer for unigrams.
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
UnigramTokenizer <- function(x) WordTokenizer(x)

tdm <- TermDocumentMatrix(myCorpus, control = list(tokenize = UnigramTokenizer))
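
A quick way to sanity-check the unigram matrix is to pull out the highest-frequency terms. The lines below are a sketch of that check, not part of this commit; they assume the tdm object built above and use tm's findFreqTerms() together with slam::row_sums() (slam is a dependency of tm), so the sparse matrix never has to be densified.

# Sketch (not in the original script): inspect the most frequent unigrams.
library("slam")
freqs <- sort(slam::row_sums(tdm), decreasing = TRUE)  # total count per term
head(freqs, 20)                                        # top 20 unigrams
findFreqTerms(tdm, lowfreq = 50)                       # terms seen at least 50 times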

predictor.R (+25 lines)

# Plan: create a "Stupid Backoff" predictor.
# Do this by getting n-grams up to 3, then searching on the last two words entered and returning
# the most likely continuation; if there are none, search on the last word entered and return the
# next one; if there are still none, return the most likely word overall (which is lame).

# When Stupid Backoff is functional, try a contextually-aware selector: choose which Stupid
# Backoff model to use by analysing which corpus the user is most likely employing. Do so by
# finding a selection of 1- or 2-grams which are characteristic of either tweets, blogs, or news.
# Then return the appropriate word from that model (or re-weight the models and return a word
# from the meta-model).

# The create*grams functions should be run once on the big server and the resulting data should
# then be loaded from disk. I might make them caching so that they work either way.

create3grams <- function(corp) {
}

create2grams <- function(corp) {
}

create1grams <- function(corp) {
}
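
The backoff cascade described in the plan could be sketched roughly as below. This is an illustrative sketch only, not code from this commit: it assumes the create*grams functions eventually return plain data frames with hypothetical columns w1, w2, w3 and count, and it omits the fixed 0.4 back-off discount of full Stupid Backoff, which does not change which word wins when only the single best candidate per level is returned.

# Illustrative sketch of the backoff lookup described above (not part of this commit).
# Assumes hypothetical n-gram tables of the form:
#   grams3: data.frame(w1, w2, w3, count)   trigrams
#   grams2: data.frame(w1, w2, count)       bigrams
#   grams1: data.frame(w1, count)           unigrams

predictNext <- function(words, grams3, grams2, grams1) {
  n <- length(words)

  # 1. Try the last two words entered against the trigram table.
  if (n >= 2) {
    hits <- grams3[grams3$w1 == words[n - 1] & grams3$w2 == words[n], ]
    if (nrow(hits) > 0) return(hits$w3[which.max(hits$count)])
  }

  # 2. Back off to the last word against the bigram table.
  if (n >= 1) {
    hits <- grams2[grams2$w1 == words[n], ]
    if (nrow(hits) > 0) return(hits$w2[which.max(hits$count)])
  }

  # 3. Final fallback: the most frequent word overall (the "lame" case).
  grams1$w1[which.max(grams1$count)]
}

# Example call with hypothetical tables:
# predictNext(c("thanks", "for"), grams3, grams2, grams1)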
