Skip to content

MusicBrainz batch import file using PostgreSQL alone

Paul Tremberth edited this page Sep 9, 2013 · 8 revisions

Inspired by http://maxdemarzi.com/2012/02/28/batch-importer-part-2/

For the impatient: the result of this experiment is at https://gist.github.com/redapple/0365a430a9f25e9acbea

Situation so far

So we have several entities that we can dump independently with

-- COPY (SELECT * FROM area) TO stdout CSV HEADER DELIMITER E'\t';
COPY(
    SELECT
        'area' AS "kind:string:mb",
        a.id AS pk,
        a.gid AS "mbid:string:mbid",
        a.name AS "name:string:mb",
        atype.name AS "etype:string:mb"
    FROM area a
    JOIN area_type atype ON a.type=atype.id
)
TO stdout CSV HEADER DELIMITER E'\t';

and

--COPY (SELECT * FROM area_alias) TO stdout CSV HEADER DELIMITER E'\t';
COPY (
    SELECT
        'area_alias' AS "kind:string:mb",
        aa.id AS pk,
        aa.area AS artist_fk,
        aa.name AS "name:string:mb",
        aa.locale,
        --aa.primary_for_locale,
        aat.name AS "etype:string:mb",
        aa.begin_date_year,
        aa.end_date_year
    FROM area_alias aa
    LEFT JOIN area_alias_type aat ON aa.type=aat.id
)
TO stdout CSV HEADER DELIMITER E'\t';

One file fits all

We now want to generate the data to have all entities in a single file.

Let's make a UNION query of these. They must have the same number of columns, so for example for the area_alias table, there's not MBID, so we simply add an "mbid" output column with NULL:

    SELECT
        'area' AS entity,
        a.id AS pk,
        a.gid AS "mbid",
        a.name AS "name",
        atype.name AS "type"
    FROM area a
    JOIN area_type atype ON a.type=atype.id

    UNION

    SELECT
        'area_alias' AS entity,
        aa.id AS pk,
        NULL AS "mbid",
        aa.name AS "name",
        aat.name AS "type"
    FROM area_alias aa
    LEFT JOIN area_alias_type aat ON aa.type=aat.id

This outputs something like

   entity   |  pk   |                 mbid                 |                              name                              |     type     
------------+-------+--------------------------------------+----------------------------------------------------------------+--------------
 area       |     1 | aa95182f-df0a-3ad6-8bfb-4b63482cd276 | Afghanistan                                                    | Country
 area       |     2 | 1c69b790-b46b-3e92-b6b4-93b4364badbc | Albania                                                        | Country
 area       |     3 | 28242750-534a-326b-8ed6-1b03dfb88cd0 | Algeria                                                        | Country
 area       |     4 | e228a3c1-53c0-3ec9-842b-ec1b2138e387 | American Samoa                                                 | Country
 area       |     5 | e01da61e-99a8-3c76-a27d-774c3f4982f0 | Andorra                                                        | Country
...

and further down in the results:

 area_alias | 11394 |                                      | Basilicate                                                     | Area name
 area_alias | 11395 |                                      | Basilicata                                                     | Area name
 area_alias | 11396 |                                      | バジリカータ州                                                 | Area name
 area_alias | 11397 |                                      | Basilicata                                                     | Area name
 area_alias | 11398 |                                      | Basilicata                                                     | Area name
 area_alias | 11399 |                                      | Rajon Cantemir                                                 | Area name
 area_alias | 11400 |                                      | Cantemir District                                              | Area name
 area_alias | 11401 |                                      | Distrito de Cantemir                                           | Area name
 area_alias | 11402 |                                      | distretto di Cantemir                                          | Area name
 area_alias | 11403 |                                      | Cantemir                                                       | Area name
...

If we add all the useful columns, we basically have a nodes file for batch-import (with the help appropriate COPY(...) TO <file> CSV HEADER DELIMITER E'\t'; command)

But ideally in the end, we want to generate a relationship file right?

Building the relationships CSV file

Max de Marzi suggests using the row_number() window function to generate an incrementing ID:

    SELECT
    row_number() OVER (ORDER BY pk) AS node_id,
    entity_union.entity,
    entity_union.pk as entity_pk
    FROM
    (
    SELECT
        'area' AS entity,
        a.id AS pk,
        a.gid AS "mbid",
        a.name AS "name",
        atype.name AS "type"
    FROM area a
    JOIN area_type atype ON a.type=atype.id

    UNION

    SELECT
        'area_alias' AS entity,
        aa.id AS pk,
        NULL AS "mbid",
        aa.name AS "name",
        aat.name AS "type"
    FROM area_alias aa
    LEFT JOIN area_alias_type aat ON aa.type=aat.id
    ) AS entity_union;

which outputs something like:

 node_id |   entity   | entity_pk 
---------+------------+-----------
       1 | area       |         1
       2 | area_alias |         1
       3 | area       |         2
       4 | area_alias |         2
       5 | area       |         3
       6 | area_alias |         3
       7 | area       |         4
       8 | area_alias |         4
       9 | area       |         5
      10 | area_alias |         5
      11 | area       |         6
      12 | area_alias |         6
      13 | area       |         7
      14 | area_alias |         7
      15 | area       |         8
      16 | area_alias |         8
      17 | area       |         9
      18 | area_alias |         9
      19 | area       |        10
      20 | area_alias |        10
...

We can store that in a temporary table to resolve relationships.

CREATE TEMPORARY TABLE entity_mapping
AS
(
    SELECT
    row_number() OVER (ORDER BY pk) AS node_id,
    entity_union.entity,
    entity_union.pk as entity_pk
    FROM
    (
    SELECT
        'area' AS entity,
        a.id AS pk,
        a.gid AS "mbid",
        a.name AS "name",
        atype.name AS "type"
    FROM area a
    JOIN area_type atype ON a.type=atype.id

    UNION

    SELECT
        'area_alias' AS entity,
        aa.id AS pk,
        NULL AS "mbid",
        aa.name AS "name",
        aat.name AS "type"
    FROM area_alias aa
    LEFT JOIN area_alias_type aat ON aa.type=aat.id
    )
    
    AS entity_union
);

CREATE UNIQUE INDEX idx_entpk ON entity_union (entity, entity_py);

And the relationships data can be generated with queries like this:

SELECT
    start_entity.node_id AS start,
    end_entity.node_id AS end,
    'is alias for' AS rel_type
FROM area_alias aa
LEFT JOIN area_alias_type aat ON aa.type=aat.id
JOIN entity_mapping start_entity
    ON (
        start_entity.entity_pk = aa.id
        AND
        start_entity.entity = 'area_alias'
    )
JOIN entity_mapping end_entity
    ON (
        end_entity.entity_pk = aa.type
        AND
        end_entity.entity = 'area'
    )
LIMIT 100;

which outputs

 start |  end  |   rel_type   
-------+-------+--------------
   147 |   145 | is alias for
    34 |    35 | is alias for
    96 |    95 | is alias for
    40 |    38 | is alias for
     7 |     8 | is alias for
    42 |    43 | is alias for
    12 |    10 | is alias for
     3 |     1 | is alias for
     5 |     6 | is alias for
    44 |    45 | is alias for
    98 |   100 | is alias for
    14 |    15 | is alias for
    49 |    47 | is alias for
   149 |   151 | is alias for
    51 |    53 | is alias for
    16 |    17 | is alias for
...

Generating the SQL export commands

Get https://github.com/redapple/sql2graph/tree/sqlautogen/experimental/mbslave-sqlautogen

and run

...sql2graph/experimental/mbslave-sqlautogen$ ./mbslave-sql2graph-export.py | ./mbslave-psql.py 

You can grab a pre-generated SQL script file that you can feed to psql or ./mblsave-psql.py at https://gist.github.com/redapple/0365a430a9f25e9acbea

Importing to neo4j

  • Use batch-import.
  • Make sure you set indexes in your batch.properties file
batch_import.node_index.mb=fulltext
batch_import.node_index.mbid=exact

Then run

$ java -server -Xmx2G -jar /path/to/batch-import/target/batch-import-jar-with-dependencies.jar \
     /path/to/neo4j-community-1.9.3/data/graph.db \
     /path/to/musicbrainz__nodes__full.csv /path/to/musicbrainz__rels__full.csv

Variation: use multiple files for nodes and relationships

For this you have to change ./mbslave-sql2graph-export.py, at the end, change to

print exporter.create_mapping_table_query(multiple=True)
print exporter.create_nodes_query(multiple=True)
print exporter.create_relationships_query(multiple=True)

(I should really make this an option)

Then, to feed the files to batch-import, nodes files need to be imported in the same order as they were generated (the Python script numbers them). So I used:

$ java -server -Xmx2G -jar /path/to/batch-import/target/batch-import-jar-with-dependencies.jar \
     /path/to/neo4j-community-1.9.3/data/graph.db \
    `ls -1 musicbrainz__nodes__full.????.csv | tr "\\n" ","` \
    `ls -1 musicbrainz__rels__full.????.csv | tr "\\n" ","`

TO DO

  • add properties to relationships
  • migrate to Neo4J 2.0