
Commit e4cba67

IMDB Movie Dataset for Hadoop Practice

Authored Feb 16, 2018
1 parent b0fb216 commit e4cba67

23 files changed, +802885 -0 lines changed
 

moviedata/README

+157
@@ -0,0 +1,157 @@
SUMMARY & USAGE LICENSE
=============================================

MovieLens data sets were collected by the GroupLens Research Project
at the University of Minnesota.

This data set consists of:
     * 100,000 ratings (1-5) from 943 users on 1682 movies.
     * Each user has rated at least 20 movies.
     * Simple demographic info for the users (age, gender, occupation, zip)

The data was collected through the MovieLens web site
(movielens.umn.edu) during the seven-month period from September 19th,
1997 through April 22nd, 1998. This data has been cleaned up - users
who had fewer than 20 ratings or did not have complete demographic
information were removed from this data set. Detailed descriptions of
the data files can be found at the end of this file.

Neither the University of Minnesota nor any of the researchers
involved can guarantee the correctness of the data, its suitability
for any particular purpose, or the validity of results based on the
use of the data set. The data set may be used for any research
purposes under the following conditions:

     * The user may not state or imply any endorsement from the
       University of Minnesota or the GroupLens Research Group.

     * The user must acknowledge the use of the data set in
       publications resulting from the use of the data set
       (see below for citation information).

     * The user may not redistribute the data without separate
       permission.

     * The user may not use this information for any commercial or
       revenue-bearing purposes without first obtaining permission
       from a faculty member of the GroupLens Research Project at the
       University of Minnesota.

If you have any further questions or comments, please contact GroupLens
<grouplens-info@cs.umn.edu>.

CITATION
==============================================

To acknowledge use of the dataset in publications, please cite the
following paper:

F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets:
History and Context. ACM Transactions on Interactive Intelligent
Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages.
DOI=http://dx.doi.org/10.1145/2827872


ACKNOWLEDGEMENTS
==============================================

Thanks to Al Borchers for cleaning up this data and writing the
accompanying scripts.

PUBLISHED WORK THAT HAS USED THIS DATASET
==============================================

Herlocker, J., Konstan, J., Borchers, A., Riedl, J. An Algorithmic
Framework for Performing Collaborative Filtering. Proceedings of the
1999 Conference on Research and Development in Information
Retrieval. Aug. 1999.

FURTHER INFORMATION ABOUT THE GROUPLENS RESEARCH PROJECT
==============================================

The GroupLens Research Project is a research group in the Department
of Computer Science and Engineering at the University of Minnesota.
Members of the GroupLens Research Project are involved in many
research projects related to the fields of information filtering,
collaborative filtering, and recommender systems. The project is led
by professors John Riedl and Joseph Konstan. The project began to
explore automated collaborative filtering in 1992, but is best known
for its worldwide trial of an automated collaborative filtering
system for Usenet news in 1996. The technology developed in the
Usenet trial formed the basis for Net Perceptions, Inc., which was
founded by members of GroupLens Research. Since then the project has
expanded its scope to research overall information filtering
solutions, integrating content-based methods as well as improving
current collaborative filtering technology.

Further information on the GroupLens Research project, including
research publications, can be found at the following web site:

        http://www.grouplens.org/

GroupLens Research currently operates a movie recommender based on
collaborative filtering:

        http://www.movielens.org/

DETAILED DESCRIPTIONS OF DATA FILES
==============================================

Here are brief descriptions of the data.

ml-data.tar.gz -- Compressed tar file.  To rebuild the u data files do this:
                    gunzip ml-data.tar.gz
                    tar xvf ml-data.tar
                    mku.sh

u.data     -- The full u data set, 100000 ratings by 943 users on 1682 items.
              Each user has rated at least 20 movies.  Users and items are
              numbered consecutively from 1.  The data is randomly
              ordered.  This is a tab separated list of
                  user id | item id | rating | timestamp.
              The timestamps are Unix seconds since 1/1/1970 UTC.
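
              As an illustration (not part of the distribution), a minimal
              Perl sketch that reads u.data from the current directory and
              prints each user's average rating; the field order is the one
              listed above:

                  #!/usr/bin/perl
                  use strict;
                  use warnings;

                  # Accumulate rating sums and counts per user from the
                  # tab separated u.data file.
                  my (%sum, %cnt);
                  open(my $fh, '<', 'u.data') or die "Cannot open u.data: $!\n";
                  while (<$fh>) {
                      chomp;
                      my ($user, $item, $rating, $timestamp) = split /\t/;
                      $sum{$user} += $rating;
                      $cnt{$user}++;
                  }
                  close($fh);

                  # Print each user's average rating, by numeric user id.
                  for my $user (sort { $a <=> $b } keys %cnt) {
                      printf "%d\t%.2f\n", $user, $sum{$user} / $cnt{$user};
                  }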

u.info     -- The number of users, items, and ratings in the u data set.

u.item     -- Information about the items (movies); this is a tab separated
              list of
                  movie id | movie title | release date | video release date |
                  IMDb URL | unknown | Action | Adventure | Animation |
                  Children's | Comedy | Crime | Documentary | Drama | Fantasy |
                  Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
                  Thriller | War | Western |
              The last 19 fields are the genres, a 1 indicates the movie
              is of that genre, a 0 indicates it is not; movies can be in
              several genres at once.
              The movie ids are the ones used in the u.data data set.
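
              As a sketch of decoding the genre flags (an illustration only,
              assuming the fields are separated by '|' as shown in the
              listing above; if your copy of u.item uses tabs instead,
              change the split pattern to /\t/):

                  #!/usr/bin/perl
                  use strict;
                  use warnings;

                  # Genre names in the order listed above (the last 19 fields).
                  my @genres = ('unknown', 'Action', 'Adventure', 'Animation',
                      "Children's", 'Comedy', 'Crime', 'Documentary', 'Drama',
                      'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery',
                      'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western');

                  # Assumption: '|' field separator (see note above).
                  open(my $fh, '<', 'u.item') or die "Cannot open u.item: $!\n";
                  while (<$fh>) {
                      chomp;
                      my @f = split /\|/;
                      my ($id, $title) = @f[0, 1];
                      my @flags = @f[-19 .. -1];          # the 19 genre flags
                      my @hits  = grep { $flags[$_] } 0 .. $#genres;
                      print "$id\t$title\t",
                            join(', ', map { $genres[$_] } @hits), "\n";
                  }
                  close($fh);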

u.genre    -- A list of the genres.

u.user     -- Demographic information about the users; this is a tab
              separated list of
                  user id | age | gender | occupation | zip code
              The user ids are the ones used in the u.data data set.

u.occupation -- A list of the occupations.

u1.base    -- The data sets u1.base and u1.test through u5.base and u5.test
u1.test       are 80%/20% splits of the u data into training and test data.
u2.base       Each of u1, ..., u5 has a disjoint test set; this is for
u2.test       5-fold cross validation (where you repeat your experiment
u3.base       with each training and test set and average the results).
u3.test       These data sets can be generated from u.data by mku.sh.
u4.base
u4.test
u5.base
u5.test

ua.base    -- The data sets ua.base, ua.test, ub.base, and ub.test
ua.test       split the u data into a training set and a test set with
ub.base       exactly 10 ratings per user in the test set.  The sets
ub.test       ua.test and ub.test are disjoint.  These data sets can
              be generated from u.data by mku.sh.
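
              As a usage sketch (an illustration, not part of the
              distribution; it assumes the fold files are in the current
              directory and tab separated like u.data): train on uN.base,
              evaluate on uN.test, and average over the five folds.  The toy
              baseline below predicts each fold's global mean training
              rating and reports the mean absolute error on the matching
              test set:

                  #!/usr/bin/perl
                  use strict;
                  use warnings;

                  for my $fold (1 .. 5) {
                      # Training pass: global mean rating over uN.base.
                      my ($sum, $n) = (0, 0);
                      open(my $base, '<', "u$fold.base")
                          or die "Cannot open u$fold.base: $!\n";
                      while (<$base>) {
                          my (undef, undef, $rating) = split /\t/;
                          $sum += $rating;
                          $n++;
                      }
                      close($base);
                      my $mean = $sum / $n;

                      # Test pass: mean absolute error of predicting that
                      # constant on uN.test.
                      my ($err, $m) = (0, 0);
                      open(my $test, '<', "u$fold.test")
                          or die "Cannot open u$fold.test: $!\n";
                      while (<$test>) {
                          my (undef, undef, $rating) = split /\t/;
                          $err += abs($rating - $mean);
                          $m++;
                      }
                      close($test);
                      printf "u%d: mean=%.3f  MAE=%.3f\n",
                             $fold, $mean, $err / $m;
                  }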

allbut.pl  -- The script that generates training and test sets where
              all but n of a user's ratings are in the training data.

mku.sh     -- A shell script to generate all the u data sets from u.data.

moviedata/allbut.pl

+34
@@ -0,0 +1,34 @@
#!/usr/local/bin/perl

# get args
if (@ARGV < 3) {
    print STDERR "Usage: $0 base_name start stop max_test [ratings ...]\n";
    exit 1;
}
$basename = shift;
$start = shift;
$stop = shift;
$maxtest = shift;

# open files
open( TESTFILE, ">$basename.test" ) or die "Cannot open $basename.test for writing\n";
open( BASEFILE, ">$basename.base" ) or die "Cannot open $basename.base for writing\n";

# init variables
$testcnt = 0;

# Split the rating lines read from stdin/ARGV: a user's ratings numbered
# $start through $stop (counting in input order) go to the test file, up to
# $maxtest test lines in total (unlimited if $maxtest <= 0); everything
# else goes to the base (training) file.
while (<>) {
    ($user) = split;
    if (! defined $ratingcnt{$user}) {
        $ratingcnt{$user} = 0;
    }
    ++$ratingcnt{$user};
    if (($testcnt < $maxtest || $maxtest <= 0)
        && $ratingcnt{$user} >= $start && $ratingcnt{$user} <= $stop) {
        ++$testcnt;
        print TESTFILE;
    }
    else {
        print BASEFILE;
    }
}

moviedata/mku.sh

+25
@@ -0,0 +1,25 @@
#!/bin/sh

# Remove the temp file and exit on hangup, interrupt, or terminate.
trap 'rm -f tmp.$$; exit 1' 1 2 15

# Build the five 80%/20% cross-validation splits: fold i takes ratings
# 20000*(i-1)+1 .. 20000*i of u.data as u$i.test and the remaining 80000
# as u$i.base, each sorted by user id and then item id.
# NOTE: the argument to -t is a literal tab character (u.data is tab
# separated).
for i in 1 2 3 4 5
do
        head -`expr $i \* 20000` u.data | tail -20000 > tmp.$$
        sort -t"	" -k 1,1n -k 2,2n tmp.$$ > u$i.test
        head -`expr \( $i - 1 \) \* 20000` u.data > tmp.$$
        tail -`expr \( 5 - $i \) \* 20000` u.data >> tmp.$$
        sort -t"	" -k 1,1n -k 2,2n tmp.$$ > u$i.base
done

# ua/ub splits: exactly 10 test ratings per user (a user's ratings 1-10 for
# ua, 11-20 for ub), generated by allbut.pl and then sorted in place.
allbut.pl ua 1 10 100000 u.data
sort -t"	" -k 1,1n -k 2,2n ua.base > tmp.$$
mv tmp.$$ ua.base
sort -t"	" -k 1,1n -k 2,2n ua.test > tmp.$$
mv tmp.$$ ua.test

allbut.pl ub 11 20 100000 u.data
sort -t"	" -k 1,1n -k 2,2n ub.base > tmp.$$
mv tmp.$$ ub.base
sort -t"	" -k 1,1n -k 2,2n ub.test > tmp.$$
mv tmp.$$ ub.test
