Error Creating Relationships on a Large Data Set

 3 years ago
source link: https://www.codesd.com/item/error-creating-relationships-on-a-large-data-set.html


My question is similar to the one linked here:

Creating unique node and relationship NEO4J over huge dataset

I have 2 tables, Entity (Entities.txt) and Relationships (EntitiesRelationships_Updated.txt), which look like the samples below. Both files are inside the import folder of the Neo4j database. I am trying to load them with the LOAD CSV command and then create relationships.

In the Entity table below, a PARENTID of 0 means that the ENT_ID does not have a parent. If PARENTID is populated, the entity has a parent. For example, ENT_ID = 3 is the parent of ENT_ID = 4, and ENT_ID = 1 is the parent of ENT_ID = 2.

**Entity Table**
ENT_ID  Name  PARENTID
1      ABC     0
2      DEF     1
3      GHI     0
4      JKG     3

**Relationship Table**
RID ENT_IDPARENT ENT_IDCHILD
 1     1        2
 2     3        5

The Entity table has 2 million records and the Relationship table has about 400K lines.

Each RID has a particular tag associated with it. For example, RID = 1 means the relation is A FATHER_OF B, and RID = 2 means the relation is A MOTHER_OF B. There are 20 such RIDs in total.

Both of these are in txt format.

My first step is to load the entity table. I used the following script:

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///Entities.txt" AS Entity FIELDTERMINATOR '|'
CREATE (n:Entity{ENT_ID: toInt(Entity.ENT_ID),NAME: Entity.NAME,PARENTID: toInt(Entity.PARENTID)})

This query works fine. It takes about 10 minutes to load 2.8 million records. My next step is to index the records:

CREATE INDEX ON :Entity(PARENTID)
CREATE INDEX ON :Entity(ENT_ID)

This query runs fine as well. After this, I tried creating the relationships from the Relationship table, using a query similar to the one in the link above:

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///EntitiesRelationships_Updated.txt" AS Rships FIELDTERMINATOR '|'
MATCH (n:A {ENT_IDPARENT : Rships.ENT_IDPARENT})
with Entity, n
MATCH (m:B {ENT_IDCHILD : Rships.ENT_IDCHILD})
with m,n
MERGE (n)-[r:RELATION_OF]->(m);

When I run this, the query keeps running for about an hour and then stops at a particular size (2.2 GB in my case). I based this query on the link above. It includes the edit from the solution below and still does not work.

I have one more query, shown below (also based on the link above). I run it because I want to create relationships based on the Entity table:

PROFILE
MATCH(Entity)
MATCH (a:Entity {ENT_ID : Entity.ENT_ID})
WITH Entity, a
MATCH (b:Entity {PARENTID : Entity.PARENTID})
WITH a,b
MERGE (a)-[r:PARENT_OF]->(b)

When I run this query, I get a Java heap space error. Unfortunately, I have not been able to find a solution for either problem.

Could you please advise if I am doing something wrong?


This query allows you to take advantage of your :Entity(ENT_ID) index:

MATCH (child:Entity)
WHERE child.PARENTID > 0
WITH child.PARENTID AS pid, child
MATCH (parent:Entity {ENT_ID : pid})
MERGE (parent)-[:PARENT_OF]->(child);

Cypher does not use indices when the property value comes from another node. To get around that, the above query uses a WITH clause to represent child.PARENTID as a variable (pid). The time complexity of this query should be O(N). Your original query has a complexity of O(N * N).
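
To make the complexity difference concrete, here is a rough plain-Python model of the two query shapes. It is not Neo4j code and says nothing about how the Cypher planner works internally. The dict index_by_ent_id merely stands in for the :Entity(ENT_ID) index, the rows are the four sample entities from the question, and the function names are made up for this sketch.

```python
# Plain-Python model of the two query shapes; this is not Neo4j code.
# The rows are the sample Entity table from the question.
entities = [
    {"ENT_ID": 1, "NAME": "ABC", "PARENTID": 0},
    {"ENT_ID": 2, "NAME": "DEF", "PARENTID": 1},
    {"ENT_ID": 3, "NAME": "GHI", "PARENTID": 0},
    {"ENT_ID": 4, "NAME": "JKG", "PARENTID": 3},
]

def parent_pairs_by_scan(rows):
    """Compare every row with every other row: N * N comparisons."""
    pairs, comparisons = [], 0
    for child in rows:
        for parent in rows:
            comparisons += 1
            if child["PARENTID"] > 0 and parent["ENT_ID"] == child["PARENTID"]:
                pairs.append((parent["ENT_ID"], child["ENT_ID"]))
    return pairs, comparisons

def parent_pairs_by_index(rows):
    """One keyed lookup per child that has a parent: at most N lookups."""
    # Hypothetical stand-in for the :Entity(ENT_ID) index.
    index_by_ent_id = {row["ENT_ID"]: row for row in rows}
    pairs, lookups = [], 0
    for child in rows:
        if child["PARENTID"] > 0:
            lookups += 1
            parent = index_by_ent_id.get(child["PARENTID"])
            if parent is not None:
                pairs.append((parent["ENT_ID"], child["ENT_ID"]))
    return pairs, lookups

scan_pairs, scan_cost = parent_pairs_by_scan(entities)
index_pairs, index_cost = parent_pairs_by_index(entities)
print(scan_pairs, scan_cost)    # [(1, 2), (3, 4)] 16
print(index_pairs, index_cost)  # [(1, 2), (3, 4)] 2
```

Both functions find the same (parent, child) pairs. With 4 rows the scan does 16 comparisons against 2 lookups. With 2 million rows the scan would need on the order of 4 * 10^12 comparisons, which is why the indexed form matters at this scale.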

[EDITED]

If the above query takes too long or encounters errors that might be related to running out of memory, try this variant, which creates 1000 new relationships at a time. You can change 1000 to any number that is workable for you.

MATCH (child:Entity)
WHERE child.PARENTID > 0 AND NOT ()-[:PARENT_OF]->(child)
WITH child.PARENTID AS pid, child
LIMIT 1000
MATCH (parent:Entity {ENT_ID : pid})
CREATE (parent)-[:PARENT_OF]->(child)
RETURN COUNT(*);

The WHERE clause filters out child nodes that already have a parent relationship. And the MERGE operation has been changed to a simpler CREATE operation, since we have already ascertained that the relationship does not yet exist. The query returns a count of the number of relationships created. If the result is less than 1000, then all parent relationships have been created.
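
The stopping rule described above can be modelled in plain Python as follows. This is only a sketch, not Neo4j code: create_batch and run_until_done are hypothetical names, and the model assumes that every PARENTID refers to an existing entity. A child whose parent is missing would never receive a relationship, so it would be picked up again by every batch and could make the count drop below the limit too early.

```python
# Plain-Python model of the batched variant; this is not Neo4j code.
def create_batch(parent_of, has_parent, limit):
    """Link up to `limit` children that lack a PARENT_OF relationship.

    parent_of:  dict of child ENT_ID -> PARENTID (entries with PARENTID > 0)
    has_parent: set of child ENT_IDs already linked; updated in place
    Returns the number of relationships created in this batch.
    """
    pending = [child for child in parent_of if child not in has_parent]
    batch = pending[:limit]
    has_parent.update(batch)
    return len(batch)

def run_until_done(parent_of, limit):
    """Re-run the batch until it creates fewer than `limit` relationships."""
    has_parent, batch_counts = set(), []
    while True:
        created = create_batch(parent_of, has_parent, limit)
        batch_counts.append(created)
        if created < limit:
            return batch_counts

# 2,500 hypothetical children, all pointing at parent ENT_ID 1.
children = {ent_id: 1 for ent_id in range(2, 2502)}
batch_counts = run_until_done(children, limit=1000)
print(batch_counts)  # [1000, 1000, 500]
```

The final batch returns 500, which is less than the limit of 1000, so the loop stops after three runs.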

Finally, to automate the repeated queries, you can install the APOC plugin on the Neo4j server and use the apoc.periodic.commit procedure, which repeatedly invokes a query until it returns 0. In this example, I use a limit parameter of 10000:

CALL apoc.periodic.commit(
 "MATCH (child:Entity)
  WHERE child.PARENTID > 0 AND NOT ()-[:PARENT_OF]->(child)
  WITH child.PARENTID AS pid, child
  LIMIT {limit}
  MATCH (parent:Entity {ENT_ID : pid})
  CREATE (parent)-[:PARENT_OF]->(child)
  RETURN COUNT(*);",
{limit: 10000});

