-
Notifications
You must be signed in to change notification settings - Fork 14
Importing MusicBrainz data into Neo4J (work in progress)
So you want to import the awesome MusicBrainz database into the awesome Neo4J graph database? Here's one way to do it.
MusicBrainz database has several "core entities" (see http://musicbrainz.org/statistics)
Core entities | as of 2013-08-30 |
---|---|
Artists | 780,259 |
Release Groups | 977,377 |
Releases | 1,185,344 |
Mediums | 1,316,108 |
Recordings | 12,624,038 |
Tracks | 15,146,240 |
Labels | 73,504 |
Works | 414,645 |
URLs | 1,833,144 |
Areas | 22,969 |
I personally use https://bitbucket.org/lalinsky/mbslave with Postgresl.
And for this write-up, I downloaded http://mirrors.dotsrc.org/MusicBrainz/data/fullexport/20130824-033012/mbdump.tar.bz2 (thus from the EU mirror, this is big, 1,5+GB). See http://musicbrainz.org/doc/MusicBrainz_Database/Download for details on how and where to fetch the data.
Follow the instuctions from mbslave's README and you should end up with these tables and sizes close to the following (I only imported mbdump.tar.bz2):
We're not interested in all tables but nearly everything is interesting.
When importing recordings and tracks, batch-import choked on some rows. I still don't know if the problem comes from Postgresl export (see below) or batch-import CSV reader.
I fixed the entries in the official MusicBrainz database, and the edits should be included in the latest dumps (starting from dump 20130907-001934).
If you still have problems, I suggest you perform these operations on the track_name table's name columns:
- change TABs to spaces
- change
\"
to"
UPDATE track_name SET name=translate(name, E'\\\\\\\\"', '"') WHERE name LIKE E'%\\\\\\\\"%';
UPDATE track_name SET name=translate(name, E'\t', ' ') WHERE name LIKE E'%\t%';
One way to bulk-insert data into Neo4J is to use Michael Hunger's batch-import https://github.com/jexp/batch-import
After having played with a single nodes.csv
(for nodes) and a single rels.csv
(for relations), I decided to try out a new feature of batch-import
: use multiple CSV files for nodes and relations.
MusicBrainz are deeply normalized so that for names for example is not duplicated. For example when you look at the SQL table definitions for artist
(in https://github.com/metabrainz/musicbrainz-server/blob/master/admin/sql/CreateTables.sql):
CREATE TABLE artist (
id SERIAL,
gid UUID NOT NULL,
name INTEGER NOT NULL, -- references artist_name.id
sort_name INTEGER NOT NULL, -- references artist_name.id
begin_date_year SMALLINT,
begin_date_month SMALLINT,
begin_date_day SMALLINT,
end_date_year SMALLINT,
end_date_month SMALLINT,
end_date_day SMALLINT,
type INTEGER, -- references artist_type.id
area INTEGER, -- references area.id
gender INTEGER, -- references gender.id
comment VARCHAR(255) NOT NULL DEFAULT '',
edits_pending INTEGER NOT NULL DEFAULT 0 CHECK (edits_pending >= 0),
last_updated TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
ended BOOLEAN NOT NULL DEFAULT FALSE
..
begin_area INTEGER, -- references area.id
end_area INTEGER -- references area.id
);
which is linked to the artist_name
table for name
and sortname
CREATE TABLE artist_name (
id SERIAL,
name VARCHAR NOT NULL
);
MusicBrainz schema also defines a few "simple views" (see https://github.com/metabrainz/musicbrainz-server/blob/master/admin/sql/CreateSimpleViews.sql) so that you can "resolve" the name links for example with something like this:
SELECT
a.id,
a.gid,
n.name,
a.begin_date_year,
a.end_date_year,
atype.name,
area.name AS area,
g.name AS gender,
a.comment,
a.ended
FROM artist a
JOIN artist_name n ON a.name=n.id
LEFT JOIN artist_type atype ON a.type=atype.id
LEFT JOIN area ON a.area=area.id
LEFT JOIN gender g ON a.gender=g.id
"That's all good but how do we get this data into Neo4J??" I hear you say.
Well it happens Postgresl can very well export CSV files for batch-import directly (TSV files actually, tab-delimited)
Suppose for each MusicBrainz entity we want to store its kind (artist, recording, label...), it's MBID (unique identifier), it's name, it's primary key in the database for reference and various properties depending on the entity, we can write an SQL query like this and feed it to psql
(or ./mbslave-psql.sh
which is easier):
COPY(
SELECT
'artist' AS kind,
a.id AS pk,
a.gid AS mbid,
n.name AS name,
a.begin_date_year,
a.end_date_year,
atype.name AS type,
area.name AS area,
g.name AS gender,
a.comment,
a.ended
FROM artist a
JOIN artist_name n ON a.name=n.id
LEFT JOIN artist_type atype ON a.type=atype.id
LEFT JOIN area ON a.area=area.id
LEFT JOIN gender g ON a.gender=g.id
)
TO stdout CSV HEADER DELIMITER E'\t';
(see http://maxdemarzi.com/2012/02/28/batch-importer-part-1/ and http://maxdemarzi.com/2012/02/28/batch-importer-part-2/ for details)
Redirect this to a new file and you have something like this:
kind pk mbid name begin_date_year end_date_year type area gender comment ended
artist 345225 c3d70436-faa2-4669-be5a-f8ba61f3ac29 Ben Christian "" f
artist 708456 1789287b-d124-47c6-80c4-adb0dfdec1e2 Sean Grissom "" f
artist 514886 a7f3c871-3ba3-40b1-ba58-d08b40312789 Uffe Andersen Person "" f
artist 99032 816688b7-6850-4af9-9c63-cebd0ff48546 Genet "" f
artist 689027 61406f75-53b9-4557-b39f-7cd5a5ad1dc7 Pentti Mikkonen Person "" f
...
A nice new feature of batch-import is automatic indexing by adding an index name to the property type: https://github.com/jexp/batch-import#automatic-indexing
Let's modify our SQL query to add these (again for the artist
entity):
COPY(
SELECT
'artist' AS "kind:string:mb",
a.id AS pk,
a.gid AS "mbid:string:mbid",
n.name AS "name:string:mb",
a.begin_date_year,
a.end_date_year,
atype.name AS "type:string:mb",
area.name AS area,
g.name AS gender,
a.comment,
a.ended
FROM artist a
JOIN artist_name n ON a.name=n.id
LEFT JOIN artist_type atype ON a.type=atype.id
LEFT JOIN area ON a.area=area.id
LEFT JOIN gender g ON a.gender=g.id
)
TO stdout CSV HEADER DELIMITER E'\t';
Here I use 2 indexes, "mb" and "mbid", as suggested by @peterneubauer in this issue: "mbid" is used to hold all MBIDs for the different entities, and "mb" is used for the other properties that you want to be fulltext-searchable.
And the output file is modifies like this:
kind:string:mb pk mbid:string:mbid name:string:mb begin_date_year end_date_year etype:string:mb area gender comment ended
artist 345225 c3d70436-faa2-4669-be5a-f8ba61f3ac29 Ben Christian "" f
artist 708456 1789287b-d124-47c6-80c4-adb0dfdec1e2 Sean Grissom "" f
artist 514886 a7f3c871-3ba3-40b1-ba58-d08b40312789 Uffe Andersen Person "" f
artist 99032 816688b7-6850-4af9-9c63-cebd0ff48546 Genet "" f
artist 689027 61406f75-53b9-4557-b39f-7cd5a5ad1dc7 Pentti Mikkonen Person "" f
...
I've prepared SQL queries for all MusicBrainz core entities at https://github.com/redapple/sql2graph/tree/multi_nodescsv/examples/musicbrainz/sql.autoindex
You can run all these queries with an script (also proposed by @peterneubauer) that you can find (and adapt to your paths) at https://github.com/redapple/sql2graph/blob/multi_nodescsv/examples/musicbrainz/export_mb_to_csv.autoindex.sh.default
So now you have CSV files (gzipped in the default export script) for nodes to feed to batch-import
. batch-import
assumes the node IDs you feed it with follow the order they are read in the input files.
So you must build the relations file carefully, remembering what nodeID will be associated with which entity entry in MusicBrainz.
sql2graph allows you to do that with the help of a configuration file and a schema file. I've prepared the files for you in https://github.com/redapple/sql2graph/blob/multi_nodescsv/examples/musicbrainz/:
- musicbrainz2neo4j.py: defines the entities properties and relations (and their properties) for virtually all core entities
- mb2neo-simple.conf.default: that you need to adapt to your file system, where you stored all your .sql.csv(.gz) files exported by Postgresl
Then, you run:
$ python musicbrainz2neo4j.py --config mb2neo-simple.conf
which will create the relation files for you.
You'll notice that sql2graph will create "nodes_...csv.gz" as symbolic links to the various exported SQL dumps since the nodes files don't need to be recreated. Though, they must follow the prefix you define in mb2neo-simple.conf (believe me, it's easier for the script :)
Finally, you can run batch-import (check the path to batch-import in mb2neo-simple.conf) with
$ /path/to/sql2graph/examples/musicbrainz$ python ../../run_batchimport.py --config mb2neo-simple.conf
Schema | Name | Type | Size |
---|---|---|---|
musicbrainz | annotation | table | 8192 bytes |
musicbrainz | application | table | 8192 bytes |
musicbrainz | area | table | 2320 kB |
musicbrainz | area_alias | table | 1176 kB |
musicbrainz | area_alias_type | table | 16 kB |
musicbrainz | area_annotation | table | 0 bytes |
musicbrainz | area_gid_redirect | table | 8192 bytes |
musicbrainz | area_type | table | 8192 bytes |
musicbrainz | artist | table | 71 MB |
musicbrainz | artist_alias | table | 8120 kB |
musicbrainz | artist_alias_type | table | 16 kB |
musicbrainz | artist_annotation | table | 0 bytes |
musicbrainz | artist_credit | table | 36 MB |
musicbrainz | artist_credit_name | table | 50 MB |
musicbrainz | artist_deletion | table | 8192 bytes |
musicbrainz | artist_gid_redirect | table | 1512 kB |
musicbrainz | artist_ipi | table | 576 kB |
musicbrainz | artist_isni | table | 144 kB |
musicbrainz | artist_meta | table | 0 bytes |
musicbrainz | artist_name | table | 80 MB |
musicbrainz | artist_rating_raw | table | 0 bytes |
musicbrainz | artist_tag | table | 0 bytes |
musicbrainz | artist_tag_raw | table | 0 bytes |
musicbrainz | artist_type | table | 8192 bytes |
musicbrainz | autoeditor_election | table | 0 bytes |
musicbrainz | autoeditor_election_vote | table | 0 bytes |
musicbrainz | cdtoc | table | 87 MB |
musicbrainz | cdtoc_raw | table | 8192 bytes |
musicbrainz | clientversion | table | 48 kB |
musicbrainz | country_area | table | 40 kB |
musicbrainz | edit | table | 8192 bytes |
musicbrainz | edit_area | table | 0 bytes |
musicbrainz | edit_artist | table | 0 bytes |
musicbrainz | edit_label | table | 0 bytes |
musicbrainz | edit_note | table | 8192 bytes |
musicbrainz | edit_recording | table | 0 bytes |
musicbrainz | edit_release | table | 0 bytes |
musicbrainz | edit_release_group | table | 0 bytes |
musicbrainz | edit_url | table | 0 bytes |
musicbrainz | edit_work | table | 0 bytes |
musicbrainz | editor | table | 8192 bytes |
musicbrainz | editor_collection | table | 8192 bytes |
musicbrainz | editor_collection_release | table | 0 bytes |
musicbrainz | editor_language | table | 0 bytes |
musicbrainz | editor_oauth_token | table | 8192 bytes |
musicbrainz | editor_preference | table | 0 bytes |
musicbrainz | editor_subscribe_artist | table | 0 bytes |
musicbrainz | editor_subscribe_artist_deleted | table | 0 bytes |
musicbrainz | editor_subscribe_collection | table | 0 bytes |
musicbrainz | editor_subscribe_editor | table | 0 bytes |
musicbrainz | editor_subscribe_label | table | 0 bytes |
musicbrainz | editor_subscribe_label_deleted | table | 0 bytes |
musicbrainz | editor_watch_artist | table | 0 bytes |
musicbrainz | editor_watch_preferences | table | 0 bytes |
musicbrainz | editor_watch_release_group_type | table | 0 bytes |
musicbrainz | editor_watch_release_status | table | 0 bytes |
musicbrainz | gender | table | 8192 bytes |
musicbrainz | iso_3166_1 | table | 40 kB |
musicbrainz | iso_3166_2 | table | 240 kB |
musicbrainz | iso_3166_3 | table | 8192 bytes |
musicbrainz | isrc | table | 22 MB |
musicbrainz | iswc | table | 3360 kB |
musicbrainz | l_area_area | table | 1304 kB |
musicbrainz | l_area_artist | table | 0 bytes |
musicbrainz | l_area_label | table | 0 bytes |
musicbrainz | l_area_recording | table | 0 bytes |
musicbrainz | l_area_release | table | 0 bytes |
musicbrainz | l_area_release_group | table | 0 bytes |
musicbrainz | l_area_url | table | 2608 kB |
musicbrainz | l_area_work | table | 40 kB |
musicbrainz | l_artist_artist | table | 10 MB |
musicbrainz | l_artist_label | table | 376 kB |
musicbrainz | l_artist_recording | table | 116 MB |
musicbrainz | l_artist_release | table | 19 MB |
musicbrainz | l_artist_release_group | table | 216 kB |
musicbrainz | l_artist_url | table | 40 MB |
musicbrainz | l_artist_work | table | 37 MB |
musicbrainz | l_label_label | table | 424 kB |
musicbrainz | l_label_recording | table | 464 kB |
musicbrainz | l_label_release | table | 96 kB |
musicbrainz | l_label_release_group | table | 0 bytes |
musicbrainz | l_label_url | table | 3480 kB |
musicbrainz | l_label_work | table | 1176 kB |
musicbrainz | l_recording_recording | table | 1968 kB |
musicbrainz | l_recording_release | table | 40 kB |
musicbrainz | l_recording_release_group | table | 0 bytes |
musicbrainz | l_recording_url | table | 856 kB |
musicbrainz | l_recording_work | table | 62 MB |
musicbrainz | l_release_group_release_group | table | 312 kB |
musicbrainz | l_release_group_url | table | 14 MB |
musicbrainz | l_release_group_work | table | 0 bytes |
musicbrainz | l_release_release | table | 928 kB |
musicbrainz | l_release_release_group | table | 0 bytes |
musicbrainz | l_release_url | table | 44 MB |
musicbrainz | l_release_work | table | 0 bytes |
musicbrainz | l_url_url | table | 0 bytes |
musicbrainz | l_url_work | table | 3008 kB |
musicbrainz | l_work_work | table | 3424 kB |
musicbrainz | label | table | 6872 kB |
musicbrainz | label_alias | table | 560 kB |
musicbrainz | label_alias_type | table | 16 kB |
musicbrainz | label_annotation | table | 0 bytes |
musicbrainz | label_deletion | table | 8192 bytes |
musicbrainz | label_gid_redirect | table | 184 kB |
musicbrainz | label_ipi | table | 64 kB |
musicbrainz | label_isni | table | 8192 bytes |
musicbrainz | label_meta | table | 0 bytes |
musicbrainz | label_name | table | 4344 kB |
musicbrainz | label_rating_raw | table | 0 bytes |
musicbrainz | label_tag | table | 0 bytes |
musicbrainz | label_tag_raw | table | 0 bytes |
musicbrainz | label_type | table | 8192 bytes |
musicbrainz | language | table | 448 kB |
musicbrainz | link | table | 5808 kB |
musicbrainz | link_attribute | table | 3704 kB |
musicbrainz | link_attribute_credit | table | 8192 bytes |
musicbrainz | link_attribute_type | table | 136 kB |
musicbrainz | link_creditable_attribute_type | table | 48 kB |
musicbrainz | link_type | table | 136 kB |
musicbrainz | link_type_attribute_type | table | 40 kB |
musicbrainz | medium | table | 84 MB |
musicbrainz | medium_cdtoc | table | 34 MB |
musicbrainz | medium_format | table | 8192 bytes |
musicbrainz | medium_index | table | 8192 bytes |
musicbrainz | puid | table | 397 MB |
musicbrainz | recording | table | 852 MB |
musicbrainz | recording_annotation | table | 0 bytes |
musicbrainz | recording_gid_redirect | table | 60 MB |
musicbrainz | recording_meta | table | 0 bytes |
musicbrainz | recording_puid | table | 297 MB |
musicbrainz | recording_rating_raw | table | 0 bytes |
musicbrainz | recording_tag | table | 0 bytes |
musicbrainz | recording_tag_raw | table | 0 bytes |
musicbrainz | release | table | 117 MB |
musicbrainz | release_annotation | table | 0 bytes |
musicbrainz | release_country | table | 40 MB |
musicbrainz | release_coverart | table | 0 bytes |
musicbrainz | release_gid_redirect | table | 6312 kB |
musicbrainz | release_group | table | 71 MB |
musicbrainz | release_group_annotation | table | 0 bytes |
musicbrainz | release_group_gid_redirect | table | 6472 kB |
musicbrainz | release_group_meta | table | 0 bytes |
musicbrainz | release_group_primary_type | table | 8192 bytes |
musicbrainz | release_group_rating_raw | table | 0 bytes |
musicbrainz | release_group_secondary_type | table | 16 kB |
musicbrainz | release_group_secondary_type_join | table | 11 MB |
musicbrainz | release_group_tag | table | 0 bytes |
musicbrainz | release_group_tag_raw | table | 0 bytes |
musicbrainz | release_label | table | 45 MB |
musicbrainz | release_meta | table | 0 bytes |
musicbrainz | release_name | table | 62 MB |
musicbrainz | release_packaging | table | 8192 bytes |
musicbrainz | release_raw | table | 8192 bytes |
musicbrainz | release_status | table | 8192 bytes |
musicbrainz | release_tag | table | 0 bytes |
musicbrainz | release_tag_raw | table | 0 bytes |
musicbrainz | release_unknown_country | table | 1712 kB |
musicbrainz | replication_control | table | 8192 bytes |
musicbrainz | script | table | 40 kB |
musicbrainz | script_language | table | 8192 bytes |
musicbrainz | tag | table | 0 bytes |
musicbrainz | tag_relation | table | 0 bytes |
musicbrainz | track | table | 1340 MB |
musicbrainz | track_gid_redirect | table | 0 bytes |
musicbrainz | track_name | table | 406 MB |
musicbrainz | track_raw | table | 8192 bytes |
musicbrainz | url | table | 184 MB |
musicbrainz | url_gid_redirect | table | 208 kB |
musicbrainz | vote | table | 0 bytes |
musicbrainz | work | table | 28 MB |
musicbrainz | work_alias | table | 3768 kB |
musicbrainz | work_alias_type | table | 16 kB |
musicbrainz | work_annotation | table | 0 bytes |
musicbrainz | work_attribute | table | 8192 bytes |
musicbrainz | work_attribute_type | table | 8192 bytes |
musicbrainz | work_attribute_type_allowed_value | table | 8192 bytes |
musicbrainz | work_gid_redirect | table | 6776 kB |
musicbrainz | work_meta | table | 0 bytes |
musicbrainz | work_name | table | 30 MB |
musicbrainz | work_rating_raw | table | 0 bytes |
musicbrainz | work_tag | table | 0 bytes |
musicbrainz | work_tag_raw | table | 0 bytes |
musicbrainz | work_type | table | 8192 bytes |