-
Notifications
You must be signed in to change notification settings - Fork 15
MusicBrainz batch import file using PostgreSQL alone
Inspired by http://maxdemarzi.com/2012/02/28/batch-importer-part-2/
For the impatient: the result of this experiment is at https://gist.github.com/redapple/0365a430a9f25e9acbea
So we have several entities that we can dump independently with
-- COPY (SELECT * FROM area) TO stdout CSV HEADER DELIMITER E'\t';
COPY(
SELECT
'area' AS "kind:string:mb",
a.id AS pk,
a.gid AS "mbid:string:mbid",
a.name AS "name:string:mb",
atype.name AS "etype:string:mb"
FROM area a
JOIN area_type atype ON a.type=atype.id
)
TO stdout CSV HEADER DELIMITER E'\t';
and
--COPY (SELECT * FROM area_alias) TO stdout CSV HEADER DELIMITER E'\t';
COPY (
SELECT
'area_alias' AS "kind:string:mb",
aa.id AS pk,
aa.area AS artist_fk,
aa.name AS "name:string:mb",
aa.locale,
--aa.primary_for_locale,
aat.name AS "etype:string:mb",
aa.begin_date_year,
aa.end_date_year
FROM area_alias aa
LEFT JOIN area_alias_type aat ON aa.type=aat.id
)
TO stdout CSV HEADER DELIMITER E'\t';
We now want to generate the data to have all entities in a single file.
Let's make a UNION
query of these. They must have the same number of columns, so for example for the area_alias
table, there's not MBID, so we simply add an "mbid" output column with NULL:
SELECT
'area' AS entity,
a.id AS pk,
a.gid AS "mbid",
a.name AS "name",
atype.name AS "type"
FROM area a
JOIN area_type atype ON a.type=atype.id
UNION
SELECT
'area_alias' AS entity,
aa.id AS pk,
NULL AS "mbid",
aa.name AS "name",
aat.name AS "type"
FROM area_alias aa
LEFT JOIN area_alias_type aat ON aa.type=aat.id
This outputs something like
entity | pk | mbid | name | type
------------+-------+--------------------------------------+----------------------------------------------------------------+--------------
area | 1 | aa95182f-df0a-3ad6-8bfb-4b63482cd276 | Afghanistan | Country
area | 2 | 1c69b790-b46b-3e92-b6b4-93b4364badbc | Albania | Country
area | 3 | 28242750-534a-326b-8ed6-1b03dfb88cd0 | Algeria | Country
area | 4 | e228a3c1-53c0-3ec9-842b-ec1b2138e387 | American Samoa | Country
area | 5 | e01da61e-99a8-3c76-a27d-774c3f4982f0 | Andorra | Country
...
and further down in the results:
area_alias | 11394 | | Basilicate | Area name
area_alias | 11395 | | Basilicata | Area name
area_alias | 11396 | | バジリカータ州 | Area name
area_alias | 11397 | | Basilicata | Area name
area_alias | 11398 | | Basilicata | Area name
area_alias | 11399 | | Rajon Cantemir | Area name
area_alias | 11400 | | Cantemir District | Area name
area_alias | 11401 | | Distrito de Cantemir | Area name
area_alias | 11402 | | distretto di Cantemir | Area name
area_alias | 11403 | | Cantemir | Area name
...
If we add all the useful columns, we basically have a nodes file for batch-import (with the help appropriate COPY(...) TO <file> CSV HEADER DELIMITER E'\t';
command)
But ideally in the end, we want to generate a relationship file right?
Max de Marzi suggests using the row_number() window function to generate an incrementing ID:
SELECT
row_number() OVER (ORDER BY pk) AS node_id,
entity_union.entity,
entity_union.pk as entity_pk
FROM
(
SELECT
'area' AS entity,
a.id AS pk,
a.gid AS "mbid",
a.name AS "name",
atype.name AS "type"
FROM area a
JOIN area_type atype ON a.type=atype.id
UNION
SELECT
'area_alias' AS entity,
aa.id AS pk,
NULL AS "mbid",
aa.name AS "name",
aat.name AS "type"
FROM area_alias aa
LEFT JOIN area_alias_type aat ON aa.type=aat.id
) AS entity_union;
which outputs something like:
node_id | entity | entity_pk
---------+------------+-----------
1 | area | 1
2 | area_alias | 1
3 | area | 2
4 | area_alias | 2
5 | area | 3
6 | area_alias | 3
7 | area | 4
8 | area_alias | 4
9 | area | 5
10 | area_alias | 5
11 | area | 6
12 | area_alias | 6
13 | area | 7
14 | area_alias | 7
15 | area | 8
16 | area_alias | 8
17 | area | 9
18 | area_alias | 9
19 | area | 10
20 | area_alias | 10
...
We can store that in a temporary table to resolve relationships.
CREATE TEMPORARY TABLE entity_mapping
AS
(
SELECT
row_number() OVER (ORDER BY pk) AS node_id,
entity_union.entity,
entity_union.pk as entity_pk
FROM
(
SELECT
'area' AS entity,
a.id AS pk,
a.gid AS "mbid",
a.name AS "name",
atype.name AS "type"
FROM area a
JOIN area_type atype ON a.type=atype.id
UNION
SELECT
'area_alias' AS entity,
aa.id AS pk,
NULL AS "mbid",
aa.name AS "name",
aat.name AS "type"
FROM area_alias aa
LEFT JOIN area_alias_type aat ON aa.type=aat.id
)
AS entity_union
);
CREATE UNIQUE INDEX idx_entpk ON entity_union (entity, entity_py);
And the relationships data can be generated with queries like this:
SELECT
start_entity.node_id AS start,
end_entity.node_id AS end,
'is alias for' AS rel_type
FROM area_alias aa
LEFT JOIN area_alias_type aat ON aa.type=aat.id
JOIN entity_mapping start_entity
ON (
start_entity.entity_pk = aa.id
AND
start_entity.entity = 'area_alias'
)
JOIN entity_mapping end_entity
ON (
end_entity.entity_pk = aa.type
AND
end_entity.entity = 'area'
)
LIMIT 100;
which outputs
start | end | rel_type
-------+-------+--------------
147 | 145 | is alias for
34 | 35 | is alias for
96 | 95 | is alias for
40 | 38 | is alias for
7 | 8 | is alias for
42 | 43 | is alias for
12 | 10 | is alias for
3 | 1 | is alias for
5 | 6 | is alias for
44 | 45 | is alias for
98 | 100 | is alias for
14 | 15 | is alias for
49 | 47 | is alias for
149 | 151 | is alias for
51 | 53 | is alias for
16 | 17 | is alias for
...
Get https://github.com/redapple/sql2graph/tree/sqlautogen/experimental/mbslave-sqlautogen
and run
...sql2graph/experimental/mbslave-sqlautogen$ ./mbslave-sql2graph-export.py | ./mbslave-psql.py
You can grab a pre-generated SQL script file that you can feed to psql or ./mblsave-psql.py
at https://gist.github.com/redapple/0365a430a9f25e9acbea
- Use batch-import.
- Make sure you set indexes in your
batch.properties
file
batch_import.node_index.mb=fulltext
batch_import.node_index.mbid=exact
Then run
$ java -server -Xmx2G -jar /path/to/batch-import/target/batch-import-jar-with-dependencies.jar \
/path/to/neo4j-community-1.9.3/data/graph.db \
/path/to/musicbrainz__nodes__full.csv /path/to/musicbrainz__rels__full.csv
For this you have to change ./mbslave-sql2graph-export.py
, at the end, change to
print exporter.create_mapping_table_query(multiple=True)
print exporter.create_nodes_query(multiple=True)
print exporter.create_relationships_query(multiple=True)
(I should really make this an option)
Then, to feed the files to batch-import, nodes files need to be imported in the same order as they were generated (the Python script numbers them). So I used:
$ java -server -Xmx2G -jar /path/to/batch-import/target/batch-import-jar-with-dependencies.jar \
/path/to/neo4j-community-1.9.3/data/graph.db \
`ls -1 musicbrainz__nodes__full.????.csv | tr "\\n" ","` \
`ls -1 musicbrainz__rels__full.????.csv | tr "\\n" ","`
add properties to relationships- migrate to Neo4J 2.0