Importing MusicBrainz data into Neo4J (work in progress)

So you want to import the awesome MusicBrainz database into the awesome Neo4J graph database? Here's one way to do it.

MusicBrainz database has several "core entities" (see http://musicbrainz.org/statistics)

Core entities	as of 2013-08-30
Artists	780,259
Release Groups	977,377
Releases	1,185,344
Mediums	1,316,108
Recordings	12,624,038
Tracks	15,146,240
Labels	73,504
Works	414,645
URLs	1,833,144
Areas	22,969

Get a local copy of the MusicBrainz database

I personally use https://bitbucket.org/lalinsky/mbslave with Postgresl.

And for this write-up, I downloaded http://mirrors.dotsrc.org/MusicBrainz/data/fullexport/20130824-033012/mbdump.tar.bz2 (thus from the EU mirror, this is big, 1,5+GB). See http://musicbrainz.org/doc/MusicBrainz_Database/Download for details on how and where to fetch the data.

Follow the instuctions from mbslave's README and you should end up with these tables and sizes close to the following (I only imported mbdump.tar.bz2):

See below

We're not interested in all tables but nearly everything is interesting.

Fix some MusicBrainz entries

When importing recordings and tracks, batch-import choked on some rows. I still don't know if the problem comes from Postgresl export (see below) or batch-import CSV reader.

I fixed the entries in the official MusicBrainz database, and the edits should be included in the latest dumps (starting from dump 20130907-001934).

If you still have problems, I suggest you perform these operations on the track_name table's name columns:

change TABs to spaces
change \" to "

UPDATE track_name SET name=translate(name, E'\\\\\\\\"', '"') WHERE name LIKE E'%\\\\\\\\"%';
UPDATE track_name SET name=translate(name, E'\t', ' ') WHERE name LIKE E'%\t%';

batch-import and CSV files

One way to bulk-insert data into Neo4J is to use Michael Hunger's batch-import https://github.com/jexp/batch-import

After having played with a single nodes.csv (for nodes) and a single rels.csv (for relations), I decided to try out a new feature of batch-import: use multiple CSV files for nodes and relations.

MusicBrainz are deeply normalized so that for names for example is not duplicated. For example when you look at the SQL table definitions for artist (in https://github.com/metabrainz/musicbrainz-server/blob/master/admin/sql/CreateTables.sql):

CREATE TABLE artist (
    id                  SERIAL,
    gid                 UUID NOT NULL,
    name                INTEGER NOT NULL, -- references artist_name.id
    sort_name           INTEGER NOT NULL, -- references artist_name.id
    begin_date_year     SMALLINT,
    begin_date_month    SMALLINT,
    begin_date_day      SMALLINT,
    end_date_year       SMALLINT,
    end_date_month      SMALLINT,
    end_date_day        SMALLINT,
    type                INTEGER, -- references artist_type.id
    area                INTEGER, -- references area.id
    gender              INTEGER, -- references gender.id
    comment             VARCHAR(255) NOT NULL DEFAULT '',
    edits_pending       INTEGER NOT NULL DEFAULT 0 CHECK (edits_pending >= 0),
    last_updated        TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    ended               BOOLEAN NOT NULL DEFAULT FALSE
    ..
    begin_area          INTEGER, -- references area.id
    end_area            INTEGER -- references area.id
);

which is linked to the artist_name table for name and sortname

CREATE TABLE artist_name (
    id                  SERIAL,
    name                VARCHAR NOT NULL
);

MusicBrainz schema also defines a few "simple views" (see https://github.com/metabrainz/musicbrainz-server/blob/master/admin/sql/CreateSimpleViews.sql) so that you can "resolve" the name links for example with something like this:

    SELECT
        a.id,
        a.gid,
        n.name,
        a.begin_date_year,
        a.end_date_year,
        atype.name,
        area.name AS area,
        g.name AS gender,
        a.comment,
        a.ended
    FROM artist a
    JOIN artist_name n ON a.name=n.id
    LEFT JOIN artist_type atype ON a.type=atype.id
    LEFT JOIN area ON a.area=area.id
    LEFT JOIN gender g ON a.gender=g.id

"That's all good but how do we get this data into Neo4J??" I hear you say.

Well it happens Postgresl can very well export CSV files for batch-import directly (TSV files actually, tab-delimited)

Suppose for each MusicBrainz entity we want to store its kind (artist, recording, label...), it's MBID (unique identifier), it's name, it's primary key in the database for reference and various properties depending on the entity, we can write an SQL query like this and feed it to psql (or ./mbslave-psql.sh which is easier):

COPY(
    SELECT
        'artist' AS kind,
        a.id AS pk,
        a.gid AS mbid,
        n.name AS name,
        a.begin_date_year,
        a.end_date_year,
        atype.name AS type,
        area.name AS area,
        g.name AS gender,
        a.comment,
        a.ended
    FROM artist a
    JOIN artist_name n ON a.name=n.id
    LEFT JOIN artist_type atype ON a.type=atype.id
    LEFT JOIN area ON a.area=area.id
    LEFT JOIN gender g ON a.gender=g.id
)
TO stdout CSV HEADER DELIMITER E'\t';

(see http://maxdemarzi.com/2012/02/28/batch-importer-part-1/ and http://maxdemarzi.com/2012/02/28/batch-importer-part-2/ for details)

Redirect this to a new file and you have something like this:

kind	pk	mbid	name	begin_date_year	end_date_year	type	area	gender	comment	ended
artist	345225	c3d70436-faa2-4669-be5a-f8ba61f3ac29	Ben Christian						""	f
artist	708456	1789287b-d124-47c6-80c4-adb0dfdec1e2	Sean Grissom						""	f
artist	514886	a7f3c871-3ba3-40b1-ba58-d08b40312789	Uffe Andersen			Person			""	f
artist	99032	816688b7-6850-4af9-9c63-cebd0ff48546	Genet						""	f
artist	689027	61406f75-53b9-4557-b39f-7cd5a5ad1dc7	Pentti Mikkonen			Person			""	f
...

Adding autoindex fields and column types to CSV files

A nice new feature of batch-import is automatic indexing by adding an index name to the property type: https://github.com/jexp/batch-import#automatic-indexing

Let's modify our SQL query to add these (again for the artist entity):

COPY(
    SELECT
        'artist' AS "kind:string:mb",
        a.id AS pk,
        a.gid AS "mbid:string:mbid",
        n.name AS "name:string:mb",
        a.begin_date_year,
        a.end_date_year,
        atype.name AS "type:string:mb",
        area.name AS area,
        g.name AS gender,
        a.comment,
        a.ended
    FROM artist a
    JOIN artist_name n ON a.name=n.id
    LEFT JOIN artist_type atype ON a.type=atype.id
    LEFT JOIN area ON a.area=area.id
    LEFT JOIN gender g ON a.gender=g.id
)
TO stdout CSV HEADER DELIMITER E'\t';

Here I use 2 indexes, "mb" and "mbid", as suggested by @peterneubauer in this issue: "mbid" is used to hold all MBIDs for the different entities, and "mb" is used for the other properties that you want to be fulltext-searchable.

And the output file is modifies like this:

kind:string:mb	pk	mbid:string:mbid	name:string:mb	begin_date_year	end_date_year	etype:string:mb	area	gender	comment	ended
artist	345225	c3d70436-faa2-4669-be5a-f8ba61f3ac29	Ben Christian						""	f
artist	708456	1789287b-d124-47c6-80c4-adb0dfdec1e2	Sean Grissom						""	f
artist	514886	a7f3c871-3ba3-40b1-ba58-d08b40312789	Uffe Andersen			Person			""	f
artist	99032	816688b7-6850-4af9-9c63-cebd0ff48546	Genet						""	f
artist	689027	61406f75-53b9-4557-b39f-7cd5a5ad1dc7	Pentti Mikkonen			Person			""	f
...

I've prepared SQL queries for all MusicBrainz core entities at https://github.com/redapple/sql2graph/tree/multi_nodescsv/examples/musicbrainz/sql.autoindex

You can run all these queries with an script (also proposed by @peterneubauer) that you can find (and adapt to your paths) at https://github.com/redapple/sql2graph/blob/multi_nodescsv/examples/musicbrainz/export_mb_to_csv.autoindex.sh.default

So now you have CSV files (gzipped in the default export script) for nodes to feed to batch-import. batch-import assumes the node IDs you feed it with follow the order they are read in the input files.

So you must build the relations file carefully, remembering what nodeID will be associated with which entity entry in MusicBrainz.

sql2graph allows you to do that with the help of a configuration file and a schema file. I've prepared the files for you in https://github.com/redapple/sql2graph/blob/multi_nodescsv/examples/musicbrainz/:

musicbrainz2neo4j.py: defines the entities properties and relations (and their properties) for virtually all core entities
mb2neo-simple.conf.default: that you need to adapt to your file system, where you stored all your .sql.csv(.gz) files exported by Postgresl

Then, you run:

$ python musicbrainz2neo4j.py --config mb2neo-simple.conf

which will create the relation files for you.

You'll notice that sql2graph will create "nodes_...csv.gz" as symbolic links to the various exported SQL dumps since the nodes files don't need to be recreated. Though, they must follow the prefix you define in mb2neo-simple.conf (believe me, it's easier for the script :)

Finally, you can run batch-import (check the path to batch-import in mb2neo-simple.conf) with

$ /path/to/sql2graph/examples/musicbrainz$ python ../../run_batchimport.py --config mb2neo-simple.conf

Annex A - Size of MusicBrainz Postgresql tables

Schema	Name	Type	Size
musicbrainz	annotation	table	8192 bytes
musicbrainz	application	table	8192 bytes
musicbrainz	area	table	2320 kB
musicbrainz	area_alias	table	1176 kB
musicbrainz	area_alias_type	table	16 kB
musicbrainz	area_annotation	table	0 bytes
musicbrainz	area_gid_redirect	table	8192 bytes
musicbrainz	area_type	table	8192 bytes
musicbrainz	artist	table	71 MB
musicbrainz	artist_alias	table	8120 kB
musicbrainz	artist_alias_type	table	16 kB
musicbrainz	artist_annotation	table	0 bytes
musicbrainz	artist_credit	table	36 MB
musicbrainz	artist_credit_name	table	50 MB
musicbrainz	artist_deletion	table	8192 bytes
musicbrainz	artist_gid_redirect	table	1512 kB
musicbrainz	artist_ipi	table	576 kB
musicbrainz	artist_isni	table	144 kB
musicbrainz	artist_meta	table	0 bytes
musicbrainz	artist_name	table	80 MB
musicbrainz	artist_rating_raw	table	0 bytes
musicbrainz	artist_tag	table	0 bytes
musicbrainz	artist_tag_raw	table	0 bytes
musicbrainz	artist_type	table	8192 bytes
musicbrainz	autoeditor_election	table	0 bytes
musicbrainz	autoeditor_election_vote	table	0 bytes
musicbrainz	cdtoc	table	87 MB
musicbrainz	cdtoc_raw	table	8192 bytes
musicbrainz	clientversion	table	48 kB
musicbrainz	country_area	table	40 kB
musicbrainz	edit	table	8192 bytes
musicbrainz	edit_area	table	0 bytes
musicbrainz	edit_artist	table	0 bytes
musicbrainz	edit_label	table	0 bytes
musicbrainz	edit_note	table	8192 bytes
musicbrainz	edit_recording	table	0 bytes
musicbrainz	edit_release	table	0 bytes
musicbrainz	edit_release_group	table	0 bytes
musicbrainz	edit_url	table	0 bytes
musicbrainz	edit_work	table	0 bytes
musicbrainz	editor	table	8192 bytes
musicbrainz	editor_collection	table	8192 bytes
musicbrainz	editor_collection_release	table	0 bytes
musicbrainz	editor_language	table	0 bytes
musicbrainz	editor_oauth_token	table	8192 bytes
musicbrainz	editor_preference	table	0 bytes
musicbrainz	editor_subscribe_artist	table	0 bytes
musicbrainz	editor_subscribe_artist_deleted	table	0 bytes
musicbrainz	editor_subscribe_collection	table	0 bytes
musicbrainz	editor_subscribe_editor	table	0 bytes
musicbrainz	editor_subscribe_label	table	0 bytes
musicbrainz	editor_subscribe_label_deleted	table	0 bytes
musicbrainz	editor_watch_artist	table	0 bytes
musicbrainz	editor_watch_preferences	table	0 bytes
musicbrainz	editor_watch_release_group_type	table	0 bytes
musicbrainz	editor_watch_release_status	table	0 bytes
musicbrainz	gender	table	8192 bytes
musicbrainz	iso_3166_1	table	40 kB
musicbrainz	iso_3166_2	table	240 kB
musicbrainz	iso_3166_3	table	8192 bytes
musicbrainz	isrc	table	22 MB
musicbrainz	iswc	table	3360 kB
musicbrainz	l_area_area	table	1304 kB
musicbrainz	l_area_artist	table	0 bytes
musicbrainz	l_area_label	table	0 bytes
musicbrainz	l_area_recording	table	0 bytes
musicbrainz	l_area_release	table	0 bytes
musicbrainz	l_area_release_group	table	0 bytes
musicbrainz	l_area_url	table	2608 kB
musicbrainz	l_area_work	table	40 kB
musicbrainz	l_artist_artist	table	10 MB
musicbrainz	l_artist_label	table	376 kB
musicbrainz	l_artist_recording	table	116 MB
musicbrainz	l_artist_release	table	19 MB
musicbrainz	l_artist_release_group	table	216 kB
musicbrainz	l_artist_url	table	40 MB
musicbrainz	l_artist_work	table	37 MB
musicbrainz	l_label_label	table	424 kB
musicbrainz	l_label_recording	table	464 kB
musicbrainz	l_label_release	table	96 kB
musicbrainz	l_label_release_group	table	0 bytes
musicbrainz	l_label_url	table	3480 kB
musicbrainz	l_label_work	table	1176 kB
musicbrainz	l_recording_recording	table	1968 kB
musicbrainz	l_recording_release	table	40 kB
musicbrainz	l_recording_release_group	table	0 bytes
musicbrainz	l_recording_url	table	856 kB
musicbrainz	l_recording_work	table	62 MB
musicbrainz	l_release_group_release_group	table	312 kB
musicbrainz	l_release_group_url	table	14 MB
musicbrainz	l_release_group_work	table	0 bytes
musicbrainz	l_release_release	table	928 kB
musicbrainz	l_release_release_group	table	0 bytes
musicbrainz	l_release_url	table	44 MB
musicbrainz	l_release_work	table	0 bytes
musicbrainz	l_url_url	table	0 bytes
musicbrainz	l_url_work	table	3008 kB
musicbrainz	l_work_work	table	3424 kB
musicbrainz	label	table	6872 kB
musicbrainz	label_alias	table	560 kB
musicbrainz	label_alias_type	table	16 kB
musicbrainz	label_annotation	table	0 bytes
musicbrainz	label_deletion	table	8192 bytes
musicbrainz	label_gid_redirect	table	184 kB
musicbrainz	label_ipi	table	64 kB
musicbrainz	label_isni	table	8192 bytes
musicbrainz	label_meta	table	0 bytes
musicbrainz	label_name	table	4344 kB
musicbrainz	label_rating_raw	table	0 bytes
musicbrainz	label_tag	table	0 bytes
musicbrainz	label_tag_raw	table	0 bytes
musicbrainz	label_type	table	8192 bytes
musicbrainz	language	table	448 kB
musicbrainz	link	table	5808 kB
musicbrainz	link_attribute	table	3704 kB
musicbrainz	link_attribute_credit	table	8192 bytes
musicbrainz	link_attribute_type	table	136 kB
musicbrainz	link_creditable_attribute_type	table	48 kB
musicbrainz	link_type	table	136 kB
musicbrainz	link_type_attribute_type	table	40 kB
musicbrainz	medium	table	84 MB
musicbrainz	medium_cdtoc	table	34 MB
musicbrainz	medium_format	table	8192 bytes
musicbrainz	medium_index	table	8192 bytes
musicbrainz	puid	table	397 MB
musicbrainz	recording	table	852 MB
musicbrainz	recording_annotation	table	0 bytes
musicbrainz	recording_gid_redirect	table	60 MB
musicbrainz	recording_meta	table	0 bytes
musicbrainz	recording_puid	table	297 MB
musicbrainz	recording_rating_raw	table	0 bytes
musicbrainz	recording_tag	table	0 bytes
musicbrainz	recording_tag_raw	table	0 bytes
musicbrainz	release	table	117 MB
musicbrainz	release_annotation	table	0 bytes
musicbrainz	release_country	table	40 MB
musicbrainz	release_coverart	table	0 bytes
musicbrainz	release_gid_redirect	table	6312 kB
musicbrainz	release_group	table	71 MB
musicbrainz	release_group_annotation	table	0 bytes
musicbrainz	release_group_gid_redirect	table	6472 kB
musicbrainz	release_group_meta	table	0 bytes
musicbrainz	release_group_primary_type	table	8192 bytes
musicbrainz	release_group_rating_raw	table	0 bytes
musicbrainz	release_group_secondary_type	table	16 kB
musicbrainz	release_group_secondary_type_join	table	11 MB
musicbrainz	release_group_tag	table	0 bytes
musicbrainz	release_group_tag_raw	table	0 bytes
musicbrainz	release_label	table	45 MB
musicbrainz	release_meta	table	0 bytes
musicbrainz	release_name	table	62 MB
musicbrainz	release_packaging	table	8192 bytes
musicbrainz	release_raw	table	8192 bytes
musicbrainz	release_status	table	8192 bytes
musicbrainz	release_tag	table	0 bytes
musicbrainz	release_tag_raw	table	0 bytes
musicbrainz	release_unknown_country	table	1712 kB
musicbrainz	replication_control	table	8192 bytes
musicbrainz	script	table	40 kB
musicbrainz	script_language	table	8192 bytes
musicbrainz	tag	table	0 bytes
musicbrainz	tag_relation	table	0 bytes
musicbrainz	track	table	1340 MB
musicbrainz	track_gid_redirect	table	0 bytes
musicbrainz	track_name	table	406 MB
musicbrainz	track_raw	table	8192 bytes
musicbrainz	url	table	184 MB
musicbrainz	url_gid_redirect	table	208 kB
musicbrainz	vote	table	0 bytes
musicbrainz	work	table	28 MB
musicbrainz	work_alias	table	3768 kB
musicbrainz	work_alias_type	table	16 kB
musicbrainz	work_annotation	table	0 bytes
musicbrainz	work_attribute	table	8192 bytes
musicbrainz	work_attribute_type	table	8192 bytes
musicbrainz	work_attribute_type_allowed_value	table	8192 bytes
musicbrainz	work_gid_redirect	table	6776 kB
musicbrainz	work_meta	table	0 bytes
musicbrainz	work_name	table	30 MB
musicbrainz	work_rating_raw	table	0 bytes
musicbrainz	work_tag	table	0 bytes
musicbrainz	work_tag_raw	table	0 bytes
musicbrainz	work_type	table	8192 bytes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Importing MusicBrainz data into Neo4J (work in progress)

Get a local copy of the MusicBrainz database

Fix some MusicBrainz entries

batch-import and CSV files

Adding autoindex fields and column types to CSV files

Annex A - Size of MusicBrainz Postgresql tables

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally