Skip to content

Importing MusicBrainz data into Neo4J (work in progress)

Paul Tremberth edited this page Sep 14, 2013 · 15 revisions

So you want to import the awesome MusicBrainz database into the awesome Neo4J graph database? Here's one way to do it.

MusicBrainz database has several "core entities" (see http://musicbrainz.org/statistics)

Core entities as of 2013-08-30
Artists 780,259
Release Groups 977,377
Releases 1,185,344
Mediums 1,316,108
Recordings 12,624,038
Tracks 15,146,240
Labels 73,504
Works 414,645
URLs 1,833,144
Areas 22,969

Get a local copy of the MusicBrainz database

I personally use https://bitbucket.org/lalinsky/mbslave with Postgresl.

And for this write-up, I downloaded http://mirrors.dotsrc.org/MusicBrainz/data/fullexport/20130824-033012/mbdump.tar.bz2 (thus from the EU mirror, this is big, 1,5+GB). See http://musicbrainz.org/doc/MusicBrainz_Database/Download for details on how and where to fetch the data.

Follow the instuctions from mbslave's README and you should end up with these tables and sizes close to the following (I only imported mbdump.tar.bz2):

See below

We're not interested in all tables but nearly everything is interesting.

Fix some MusicBrainz entries

When importing recordings and tracks, batch-import choked on some rows. I still don't know if the problem comes from Postgresl export (see below) or batch-import CSV reader.

I fixed the entries in the official MusicBrainz database, and the edits should be included in the latest dumps (starting from dump 20130907-001934).

If you still have problems, I suggest you perform these operations on the track_name table's name columns:

  • change TABs to spaces
  • change \" to "
UPDATE track_name SET name=translate(name, E'\\\\\\\\"', '"') WHERE name LIKE E'%\\\\\\\\"%';
UPDATE track_name SET name=translate(name, E'\t', ' ') WHERE name LIKE E'%\t%';

batch-import and CSV files

One way to bulk-insert data into Neo4J is to use Michael Hunger's batch-import https://github.com/jexp/batch-import

After having played with a single nodes.csv (for nodes) and a single rels.csv (for relations), I decided to try out a new feature of batch-import: use multiple CSV files for nodes and relations.

MusicBrainz are deeply normalized so that for names for example is not duplicated. For example when you look at the SQL table definitions for artist (in https://github.com/metabrainz/musicbrainz-server/blob/master/admin/sql/CreateTables.sql):

CREATE TABLE artist (
    id                  SERIAL,
    gid                 UUID NOT NULL,
    name                INTEGER NOT NULL, -- references artist_name.id
    sort_name           INTEGER NOT NULL, -- references artist_name.id
    begin_date_year     SMALLINT,
    begin_date_month    SMALLINT,
    begin_date_day      SMALLINT,
    end_date_year       SMALLINT,
    end_date_month      SMALLINT,
    end_date_day        SMALLINT,
    type                INTEGER, -- references artist_type.id
    area                INTEGER, -- references area.id
    gender              INTEGER, -- references gender.id
    comment             VARCHAR(255) NOT NULL DEFAULT '',
    edits_pending       INTEGER NOT NULL DEFAULT 0 CHECK (edits_pending >= 0),
    last_updated        TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    ended               BOOLEAN NOT NULL DEFAULT FALSE
    ..
    begin_area          INTEGER, -- references area.id
    end_area            INTEGER -- references area.id
);

which is linked to the artist_name table for name and sortname

CREATE TABLE artist_name (
    id                  SERIAL,
    name                VARCHAR NOT NULL
);

MusicBrainz schema also defines a few "simple views" (see https://github.com/metabrainz/musicbrainz-server/blob/master/admin/sql/CreateSimpleViews.sql) so that you can "resolve" the name links for example with something like this:

    SELECT
        a.id,
        a.gid,
        n.name,
        a.begin_date_year,
        a.end_date_year,
        atype.name,
        area.name AS area,
        g.name AS gender,
        a.comment,
        a.ended
    FROM artist a
    JOIN artist_name n ON a.name=n.id
    LEFT JOIN artist_type atype ON a.type=atype.id
    LEFT JOIN area ON a.area=area.id
    LEFT JOIN gender g ON a.gender=g.id

"That's all good but how do we get this data into Neo4J??" I hear you say.

Well it happens Postgresl can very well export CSV files for batch-import directly (TSV files actually, tab-delimited)

Suppose for each MusicBrainz entity we want to store its kind (artist, recording, label...), it's MBID (unique identifier), it's name, it's primary key in the database for reference and various properties depending on the entity, we can write an SQL query like this and feed it to psql (or ./mbslave-psql.sh which is easier):

COPY(
    SELECT
        'artist' AS kind,
        a.id AS pk,
        a.gid AS mbid,
        n.name AS name,
        a.begin_date_year,
        a.end_date_year,
        atype.name AS type,
        area.name AS area,
        g.name AS gender,
        a.comment,
        a.ended
    FROM artist a
    JOIN artist_name n ON a.name=n.id
    LEFT JOIN artist_type atype ON a.type=atype.id
    LEFT JOIN area ON a.area=area.id
    LEFT JOIN gender g ON a.gender=g.id
)
TO stdout CSV HEADER DELIMITER E'\t';

(see http://maxdemarzi.com/2012/02/28/batch-importer-part-1/ and http://maxdemarzi.com/2012/02/28/batch-importer-part-2/ for details)

Redirect this to a new file and you have something like this:

kind	pk	mbid	name	begin_date_year	end_date_year	type	area	gender	comment	ended
artist	345225	c3d70436-faa2-4669-be5a-f8ba61f3ac29	Ben Christian						""	f
artist	708456	1789287b-d124-47c6-80c4-adb0dfdec1e2	Sean Grissom						""	f
artist	514886	a7f3c871-3ba3-40b1-ba58-d08b40312789	Uffe Andersen			Person			""	f
artist	99032	816688b7-6850-4af9-9c63-cebd0ff48546	Genet						""	f
artist	689027	61406f75-53b9-4557-b39f-7cd5a5ad1dc7	Pentti Mikkonen			Person			""	f
...

Adding autoindex fields and column types to CSV files

A nice new feature of batch-import is automatic indexing by adding an index name to the property type: https://github.com/jexp/batch-import#automatic-indexing

Let's modify our SQL query to add these (again for the artist entity):

COPY(
    SELECT
        'artist' AS "kind:string:mb",
        a.id AS pk,
        a.gid AS "mbid:string:mbid",
        n.name AS "name:string:mb",
        a.begin_date_year,
        a.end_date_year,
        atype.name AS "type:string:mb",
        area.name AS area,
        g.name AS gender,
        a.comment,
        a.ended
    FROM artist a
    JOIN artist_name n ON a.name=n.id
    LEFT JOIN artist_type atype ON a.type=atype.id
    LEFT JOIN area ON a.area=area.id
    LEFT JOIN gender g ON a.gender=g.id
)
TO stdout CSV HEADER DELIMITER E'\t';

Here I use 2 indexes, "mb" and "mbid", as suggested by @peterneubauer in this issue: "mbid" is used to hold all MBIDs for the different entities, and "mb" is used for the other properties that you want to be fulltext-searchable.

And the output file is modifies like this:

kind:string:mb	pk	mbid:string:mbid	name:string:mb	begin_date_year	end_date_year	etype:string:mb	area	gender	comment	ended
artist	345225	c3d70436-faa2-4669-be5a-f8ba61f3ac29	Ben Christian						""	f
artist	708456	1789287b-d124-47c6-80c4-adb0dfdec1e2	Sean Grissom						""	f
artist	514886	a7f3c871-3ba3-40b1-ba58-d08b40312789	Uffe Andersen			Person			""	f
artist	99032	816688b7-6850-4af9-9c63-cebd0ff48546	Genet						""	f
artist	689027	61406f75-53b9-4557-b39f-7cd5a5ad1dc7	Pentti Mikkonen			Person			""	f
...

I've prepared SQL queries for all MusicBrainz core entities at https://github.com/redapple/sql2graph/tree/multi_nodescsv/examples/musicbrainz/sql.autoindex

You can run all these queries with an script (also proposed by @peterneubauer) that you can find (and adapt to your paths) at https://github.com/redapple/sql2graph/blob/multi_nodescsv/examples/musicbrainz/export_mb_to_csv.autoindex.sh.default

So now you have CSV files (gzipped in the default export script) for nodes to feed to batch-import. batch-import assumes the node IDs you feed it with follow the order they are read in the input files.

So you must build the relations file carefully, remembering what nodeID will be associated with which entity entry in MusicBrainz.

sql2graph allows you to do that with the help of a configuration file and a schema file. I've prepared the files for you in https://github.com/redapple/sql2graph/blob/multi_nodescsv/examples/musicbrainz/:

  • musicbrainz2neo4j.py: defines the entities properties and relations (and their properties) for virtually all core entities
  • mb2neo-simple.conf.default: that you need to adapt to your file system, where you stored all your .sql.csv(.gz) files exported by Postgresl

Then, you run:

$ python musicbrainz2neo4j.py --config mb2neo-simple.conf

which will create the relation files for you.

You'll notice that sql2graph will create "nodes_...csv.gz" as symbolic links to the various exported SQL dumps since the nodes files don't need to be recreated. Though, they must follow the prefix you define in mb2neo-simple.conf (believe me, it's easier for the script :)

Finally, you can run batch-import (check the path to batch-import in mb2neo-simple.conf) with

$ /path/to/sql2graph/examples/musicbrainz$ python ../../run_batchimport.py --config mb2neo-simple.conf

Annex A - Size of MusicBrainz Postgresql tables

Schema Name Type Size
musicbrainz annotation table 8192 bytes
musicbrainz application table 8192 bytes
musicbrainz area table 2320 kB
musicbrainz area_alias table 1176 kB
musicbrainz area_alias_type table 16 kB
musicbrainz area_annotation table 0 bytes
musicbrainz area_gid_redirect table 8192 bytes
musicbrainz area_type table 8192 bytes
musicbrainz artist table 71 MB
musicbrainz artist_alias table 8120 kB
musicbrainz artist_alias_type table 16 kB
musicbrainz artist_annotation table 0 bytes
musicbrainz artist_credit table 36 MB
musicbrainz artist_credit_name table 50 MB
musicbrainz artist_deletion table 8192 bytes
musicbrainz artist_gid_redirect table 1512 kB
musicbrainz artist_ipi table 576 kB
musicbrainz artist_isni table 144 kB
musicbrainz artist_meta table 0 bytes
musicbrainz artist_name table 80 MB
musicbrainz artist_rating_raw table 0 bytes
musicbrainz artist_tag table 0 bytes
musicbrainz artist_tag_raw table 0 bytes
musicbrainz artist_type table 8192 bytes
musicbrainz autoeditor_election table 0 bytes
musicbrainz autoeditor_election_vote table 0 bytes
musicbrainz cdtoc table 87 MB
musicbrainz cdtoc_raw table 8192 bytes
musicbrainz clientversion table 48 kB
musicbrainz country_area table 40 kB
musicbrainz edit table 8192 bytes
musicbrainz edit_area table 0 bytes
musicbrainz edit_artist table 0 bytes
musicbrainz edit_label table 0 bytes
musicbrainz edit_note table 8192 bytes
musicbrainz edit_recording table 0 bytes
musicbrainz edit_release table 0 bytes
musicbrainz edit_release_group table 0 bytes
musicbrainz edit_url table 0 bytes
musicbrainz edit_work table 0 bytes
musicbrainz editor table 8192 bytes
musicbrainz editor_collection table 8192 bytes
musicbrainz editor_collection_release table 0 bytes
musicbrainz editor_language table 0 bytes
musicbrainz editor_oauth_token table 8192 bytes
musicbrainz editor_preference table 0 bytes
musicbrainz editor_subscribe_artist table 0 bytes
musicbrainz editor_subscribe_artist_deleted table 0 bytes
musicbrainz editor_subscribe_collection table 0 bytes
musicbrainz editor_subscribe_editor table 0 bytes
musicbrainz editor_subscribe_label table 0 bytes
musicbrainz editor_subscribe_label_deleted table 0 bytes
musicbrainz editor_watch_artist table 0 bytes
musicbrainz editor_watch_preferences table 0 bytes
musicbrainz editor_watch_release_group_type table 0 bytes
musicbrainz editor_watch_release_status table 0 bytes
musicbrainz gender table 8192 bytes
musicbrainz iso_3166_1 table 40 kB
musicbrainz iso_3166_2 table 240 kB
musicbrainz iso_3166_3 table 8192 bytes
musicbrainz isrc table 22 MB
musicbrainz iswc table 3360 kB
musicbrainz l_area_area table 1304 kB
musicbrainz l_area_artist table 0 bytes
musicbrainz l_area_label table 0 bytes
musicbrainz l_area_recording table 0 bytes
musicbrainz l_area_release table 0 bytes
musicbrainz l_area_release_group table 0 bytes
musicbrainz l_area_url table 2608 kB
musicbrainz l_area_work table 40 kB
musicbrainz l_artist_artist table 10 MB
musicbrainz l_artist_label table 376 kB
musicbrainz l_artist_recording table 116 MB
musicbrainz l_artist_release table 19 MB
musicbrainz l_artist_release_group table 216 kB
musicbrainz l_artist_url table 40 MB
musicbrainz l_artist_work table 37 MB
musicbrainz l_label_label table 424 kB
musicbrainz l_label_recording table 464 kB
musicbrainz l_label_release table 96 kB
musicbrainz l_label_release_group table 0 bytes
musicbrainz l_label_url table 3480 kB
musicbrainz l_label_work table 1176 kB
musicbrainz l_recording_recording table 1968 kB
musicbrainz l_recording_release table 40 kB
musicbrainz l_recording_release_group table 0 bytes
musicbrainz l_recording_url table 856 kB
musicbrainz l_recording_work table 62 MB
musicbrainz l_release_group_release_group table 312 kB
musicbrainz l_release_group_url table 14 MB
musicbrainz l_release_group_work table 0 bytes
musicbrainz l_release_release table 928 kB
musicbrainz l_release_release_group table 0 bytes
musicbrainz l_release_url table 44 MB
musicbrainz l_release_work table 0 bytes
musicbrainz l_url_url table 0 bytes
musicbrainz l_url_work table 3008 kB
musicbrainz l_work_work table 3424 kB
musicbrainz label table 6872 kB
musicbrainz label_alias table 560 kB
musicbrainz label_alias_type table 16 kB
musicbrainz label_annotation table 0 bytes
musicbrainz label_deletion table 8192 bytes
musicbrainz label_gid_redirect table 184 kB
musicbrainz label_ipi table 64 kB
musicbrainz label_isni table 8192 bytes
musicbrainz label_meta table 0 bytes
musicbrainz label_name table 4344 kB
musicbrainz label_rating_raw table 0 bytes
musicbrainz label_tag table 0 bytes
musicbrainz label_tag_raw table 0 bytes
musicbrainz label_type table 8192 bytes
musicbrainz language table 448 kB
musicbrainz link table 5808 kB
musicbrainz link_attribute table 3704 kB
musicbrainz link_attribute_credit table 8192 bytes
musicbrainz link_attribute_type table 136 kB
musicbrainz link_creditable_attribute_type table 48 kB
musicbrainz link_type table 136 kB
musicbrainz link_type_attribute_type table 40 kB
musicbrainz medium table 84 MB
musicbrainz medium_cdtoc table 34 MB
musicbrainz medium_format table 8192 bytes
musicbrainz medium_index table 8192 bytes
musicbrainz puid table 397 MB
musicbrainz recording table 852 MB
musicbrainz recording_annotation table 0 bytes
musicbrainz recording_gid_redirect table 60 MB
musicbrainz recording_meta table 0 bytes
musicbrainz recording_puid table 297 MB
musicbrainz recording_rating_raw table 0 bytes
musicbrainz recording_tag table 0 bytes
musicbrainz recording_tag_raw table 0 bytes
musicbrainz release table 117 MB
musicbrainz release_annotation table 0 bytes
musicbrainz release_country table 40 MB
musicbrainz release_coverart table 0 bytes
musicbrainz release_gid_redirect table 6312 kB
musicbrainz release_group table 71 MB
musicbrainz release_group_annotation table 0 bytes
musicbrainz release_group_gid_redirect table 6472 kB
musicbrainz release_group_meta table 0 bytes
musicbrainz release_group_primary_type table 8192 bytes
musicbrainz release_group_rating_raw table 0 bytes
musicbrainz release_group_secondary_type table 16 kB
musicbrainz release_group_secondary_type_join table 11 MB
musicbrainz release_group_tag table 0 bytes
musicbrainz release_group_tag_raw table 0 bytes
musicbrainz release_label table 45 MB
musicbrainz release_meta table 0 bytes
musicbrainz release_name table 62 MB
musicbrainz release_packaging table 8192 bytes
musicbrainz release_raw table 8192 bytes
musicbrainz release_status table 8192 bytes
musicbrainz release_tag table 0 bytes
musicbrainz release_tag_raw table 0 bytes
musicbrainz release_unknown_country table 1712 kB
musicbrainz replication_control table 8192 bytes
musicbrainz script table 40 kB
musicbrainz script_language table 8192 bytes
musicbrainz tag table 0 bytes
musicbrainz tag_relation table 0 bytes
musicbrainz track table 1340 MB
musicbrainz track_gid_redirect table 0 bytes
musicbrainz track_name table 406 MB
musicbrainz track_raw table 8192 bytes
musicbrainz url table 184 MB
musicbrainz url_gid_redirect table 208 kB
musicbrainz vote table 0 bytes
musicbrainz work table 28 MB
musicbrainz work_alias table 3768 kB
musicbrainz work_alias_type table 16 kB
musicbrainz work_annotation table 0 bytes
musicbrainz work_attribute table 8192 bytes
musicbrainz work_attribute_type table 8192 bytes
musicbrainz work_attribute_type_allowed_value table 8192 bytes
musicbrainz work_gid_redirect table 6776 kB
musicbrainz work_meta table 0 bytes
musicbrainz work_name table 30 MB
musicbrainz work_rating_raw table 0 bytes
musicbrainz work_tag table 0 bytes
musicbrainz work_tag_raw table 0 bytes
musicbrainz work_type table 8192 bytes