MusicBrainz batch import file using PostgreSQL alone

Inspired by http://maxdemarzi.com/2012/02/28/batch-importer-part-2/

For the impatient: the result of this experiment is at https://gist.github.com/redapple/0365a430a9f25e9acbea

Situation so far

So we have several entities that we can dump independently with

-- COPY (SELECT * FROM area) TO stdout CSV HEADER DELIMITER E'\t';
COPY(
    SELECT
        'area' AS "kind:string:mb",
        a.id AS pk,
        a.gid AS "mbid:string:mbid",
        a.name AS "name:string:mb",
        atype.name AS "etype:string:mb"
    FROM area a
    JOIN area_type atype ON a.type=atype.id
)
TO stdout CSV HEADER DELIMITER E'\t';

and

--COPY (SELECT * FROM area_alias) TO stdout CSV HEADER DELIMITER E'\t';
COPY (
    SELECT
        'area_alias' AS "kind:string:mb",
        aa.id AS pk,
        aa.area AS artist_fk,
        aa.name AS "name:string:mb",
        aa.locale,
        --aa.primary_for_locale,
        aat.name AS "etype:string:mb",
        aa.begin_date_year,
        aa.end_date_year
    FROM area_alias aa
    LEFT JOIN area_alias_type aat ON aa.type=aat.id
)
TO stdout CSV HEADER DELIMITER E'\t';

One file fits all

We now want to generate the data to have all entities in a single file.

Let's make a UNION query of these. They must have the same number of columns, so for example for the area_alias table, there's not MBID, so we simply add an "mbid" output column with NULL:

    SELECT
        'area' AS entity,
        a.id AS pk,
        a.gid AS "mbid",
        a.name AS "name",
        atype.name AS "type"
    FROM area a
    JOIN area_type atype ON a.type=atype.id

    UNION

    SELECT
        'area_alias' AS entity,
        aa.id AS pk,
        NULL AS "mbid",
        aa.name AS "name",
        aat.name AS "type"
    FROM area_alias aa
    LEFT JOIN area_alias_type aat ON aa.type=aat.id

This outputs something like

   entity   |  pk   |                 mbid                 |                              name                              |     type     
------------+-------+--------------------------------------+----------------------------------------------------------------+--------------
 area       |     1 | aa95182f-df0a-3ad6-8bfb-4b63482cd276 | Afghanistan                                                    | Country
 area       |     2 | 1c69b790-b46b-3e92-b6b4-93b4364badbc | Albania                                                        | Country
 area       |     3 | 28242750-534a-326b-8ed6-1b03dfb88cd0 | Algeria                                                        | Country
 area       |     4 | e228a3c1-53c0-3ec9-842b-ec1b2138e387 | American Samoa                                                 | Country
 area       |     5 | e01da61e-99a8-3c76-a27d-774c3f4982f0 | Andorra                                                        | Country
...

and further down in the results:

 area_alias | 11394 |                                      | Basilicate                                                     | Area name
 area_alias | 11395 |                                      | Basilicata                                                     | Area name
 area_alias | 11396 |                                      | バジリカータ州                                                 | Area name
 area_alias | 11397 |                                      | Basilicata                                                     | Area name
 area_alias | 11398 |                                      | Basilicata                                                     | Area name
 area_alias | 11399 |                                      | Rajon Cantemir                                                 | Area name
 area_alias | 11400 |                                      | Cantemir District                                              | Area name
 area_alias | 11401 |                                      | Distrito de Cantemir                                           | Area name
 area_alias | 11402 |                                      | distretto di Cantemir                                          | Area name
 area_alias | 11403 |                                      | Cantemir                                                       | Area name
...

If we add all the useful columns, we basically have a nodes file for batch-import (with the help appropriate COPY(...) TO <file> CSV HEADER DELIMITER E'\t'; command)

But ideally in the end, we want to generate a relationship file right?

Building the relationships CSV file

Max de Marzi suggests using the row_number() window function to generate an incrementing ID:

    SELECT
    row_number() OVER (ORDER BY pk) AS node_id,
    entity_union.entity,
    entity_union.pk as entity_pk
    FROM
    (
    SELECT
        'area' AS entity,
        a.id AS pk,
        a.gid AS "mbid",
        a.name AS "name",
        atype.name AS "type"
    FROM area a
    JOIN area_type atype ON a.type=atype.id

    UNION

    SELECT
        'area_alias' AS entity,
        aa.id AS pk,
        NULL AS "mbid",
        aa.name AS "name",
        aat.name AS "type"
    FROM area_alias aa
    LEFT JOIN area_alias_type aat ON aa.type=aat.id
    ) AS entity_union;

which outputs something like:

 node_id |   entity   | entity_pk 
---------+------------+-----------
       1 | area       |         1
       2 | area_alias |         1
       3 | area       |         2
       4 | area_alias |         2
       5 | area       |         3
       6 | area_alias |         3
       7 | area       |         4
       8 | area_alias |         4
       9 | area       |         5
      10 | area_alias |         5
      11 | area       |         6
      12 | area_alias |         6
      13 | area       |         7
      14 | area_alias |         7
      15 | area       |         8
      16 | area_alias |         8
      17 | area       |         9
      18 | area_alias |         9
      19 | area       |        10
      20 | area_alias |        10
...

We can store that in a temporary table to resolve relationships.

CREATE TEMPORARY TABLE entity_mapping
AS
(
    SELECT
    row_number() OVER (ORDER BY pk) AS node_id,
    entity_union.entity,
    entity_union.pk as entity_pk
    FROM
    (
    SELECT
        'area' AS entity,
        a.id AS pk,
        a.gid AS "mbid",
        a.name AS "name",
        atype.name AS "type"
    FROM area a
    JOIN area_type atype ON a.type=atype.id

    UNION

    SELECT
        'area_alias' AS entity,
        aa.id AS pk,
        NULL AS "mbid",
        aa.name AS "name",
        aat.name AS "type"
    FROM area_alias aa
    LEFT JOIN area_alias_type aat ON aa.type=aat.id
    )
    
    AS entity_union
);

CREATE UNIQUE INDEX idx_entpk ON entity_union (entity, entity_py);

And the relationships data can be generated with queries like this:

SELECT
    start_entity.node_id AS start,
    end_entity.node_id AS end,
    'is alias for' AS rel_type
FROM area_alias aa
LEFT JOIN area_alias_type aat ON aa.type=aat.id
JOIN entity_mapping start_entity
    ON (
        start_entity.entity_pk = aa.id
        AND
        start_entity.entity = 'area_alias'
    )
JOIN entity_mapping end_entity
    ON (
        end_entity.entity_pk = aa.type
        AND
        end_entity.entity = 'area'
    )
LIMIT 100;

which outputs

 start |  end  |   rel_type   
-------+-------+--------------
   147 |   145 | is alias for
    34 |    35 | is alias for
    96 |    95 | is alias for
    40 |    38 | is alias for
     7 |     8 | is alias for
    42 |    43 | is alias for
    12 |    10 | is alias for
     3 |     1 | is alias for
     5 |     6 | is alias for
    44 |    45 | is alias for
    98 |   100 | is alias for
    14 |    15 | is alias for
    49 |    47 | is alias for
   149 |   151 | is alias for
    51 |    53 | is alias for
    16 |    17 | is alias for
...

Generating the SQL export commands

Get https://github.com/redapple/sql2graph/tree/sqlautogen/experimental/mbslave-sqlautogen

and run

...sql2graph/experimental/mbslave-sqlautogen$ ./mbslave-sql2graph-export.py | ./mbslave-psql.py

You can grab a pre-generated SQL script file that you can feed to psql or ./mblsave-psql.py at https://gist.github.com/redapple/0365a430a9f25e9acbea

Importing to neo4j

Use batch-import.
Make sure you set indexes in your batch.properties file

batch_import.node_index.mb=fulltext
batch_import.node_index.mbid=exact

Then run

$ java -server -Xmx2G -jar /path/to/batch-import/target/batch-import-jar-with-dependencies.jar \
     /path/to/neo4j-community-1.9.3/data/graph.db \
     /path/to/musicbrainz__nodes__full.csv /path/to/musicbrainz__rels__full.csv

Variation: use multiple files for nodes and relationships

For this you have to change ./mbslave-sql2graph-export.py, at the end, change to

print exporter.create_mapping_table_query(multiple=True)
print exporter.create_nodes_query(multiple=True)
print exporter.create_relationships_query(multiple=True)

(I should really make this an option)

Then, to feed the files to batch-import, nodes files need to be imported in the same order as they were generated (the Python script numbers them). So I used:

$ java -server -Xmx2G -jar /path/to/batch-import/target/batch-import-jar-with-dependencies.jar \
     /path/to/neo4j-community-1.9.3/data/graph.db \
    `ls -1 musicbrainz__nodes__full.????.csv | tr "\\n" ","` \
    `ls -1 musicbrainz__rels__full.????.csv | tr "\\n" ","`

TO DO

~~add properties to relationships~~
migrate to Neo4J 2.0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

MusicBrainz batch import file using PostgreSQL alone

Situation so far

One file fits all

Building the relationships CSV file

Generating the SQL export commands

Importing to neo4j

Variation: use multiple files for nodes and relationships

TO DO

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally