Thursday 15 May 2014

database - More efficient ways to save a large object graph and append to it in Python


I am crawling the PubMed database of research papers and running into a problem because my network is getting too large. Here's how my data structure works:

  class Network(object):
      def __init__(self):
          self.__authors = {}  # each key is an author name, each value is an Author object
          self.__papers = {}   # each key is a PubMed ID, each value is a Paper object

  class Author(object):
      def __init__(self, name='', paperIDs=[]):
          self.__name = name
          self.__paperIDs = set(paperIDs)
          self.coAuthors = {}  # dict with the co-author's name as key and the number of co-authored papers as value

  class Paper(object):
      def __init__(self, title='', paperID='', abstract='', date='', keywords=[], citedByIDs=[], authorNames=[]):
          self.__title = title
          self.__paperID = paperID
          self.__abstract = abstract
          self.__date = date
          self.__keywords = keywords
          self.__citedByIDs = citedByIDs
          self.__authorNames = authorNames
          self.__importance = 0  # this is managed by networkx.pagerank
          self.__citedBy = []    # Paper objects
          self.__doesCite = []   # Paper objects
          self.__authors = []    # Author objects

Currently I pass the network as obj and pickle the entire network:

  def saveGraph(self, obj, filename):
      with open(filename, 'wb') as outf:
          pickle.dump(obj, outf)

The problem now is that the pickle file is getting very large, which means that saving and loading it takes a long time; and once it becomes too large, say 20 GB, it will no longer fit into memory.
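One quick mitigation, not from the original post, is to pickle with the highest binary protocol and compress the stream with gzip; this usually shrinks the file considerably, though the whole graph must still fit in memory. A minimal sketch (function names are my own):

```python
import gzip
import pickle

def saveGraphCompressed(obj, filename):
    # The highest pickle protocol is a compact binary format;
    # gzip compression shrinks the repetitive object data further.
    with gzip.open(filename, 'wb') as outf:
        pickle.dump(obj, outf, protocol=pickle.HIGHEST_PROTOCOL)

def loadGraphCompressed(filename):
    with gzip.open(filename, 'rb') as inf:
        return pickle.load(inf)
```

This only makes saving and loading cheaper; it does not help with incremental updates or partial loading.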

My first and most important priority is to keep crawling for more information: I crawl the papers that cite each paper and collect the papers written by each author. To add a paper this way I must check whether it is already present in the network; if it is, I just add a citation link, otherwise I create a new Paper. I want to back up quite often during crawling, but saving such a big pickle file takes too long.
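The check-then-link step described above can be sketched with plain dicts; this is my own illustration (the function and field names are hypothetical, not from the post), using sets so that repeated crawls of the same citation are idempotent:

```python
def add_citation(papers, citing_id, cited_id):
    """Record that citing_id cites cited_id, creating stub papers as needed.

    papers maps a PubMed ID to a dict standing in for a Paper object.
    """
    for pid in (citing_id, cited_id):
        if pid not in papers:  # paper not yet in the network: create a stub
            papers[pid] = {'doesCite': set(), 'citedBy': set()}
    papers[citing_id]['doesCite'].add(cited_id)
    papers[cited_id]['citedBy'].add(citing_id)

papers = {}
add_citation(papers, '101', '202')
add_citation(papers, '101', '202')  # duplicate links are absorbed by the sets
```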

Is there another way to store the data? A more efficient way to pickle my objects? Is there perhaps a way to update my saved database incrementally every time something changes? And is it possible to load only a part of the objects into memory?
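One partial answer to the incremental-update and partial-loading questions (my sketch, not from the answer below) is the standard-library shelve module: it stores pickled objects keyed by string in an on-disk database, so individual papers can be written and read one at a time instead of re-pickling the whole graph. The file name and fields here are illustrative:

```python
import shelve

# Store each paper under its PubMed ID in an on-disk shelf.
with shelve.open('papers.db') as db:
    db['24001'] = {'title': 'Example paper', 'citedByIDs': []}
    # Update a single entry without rewriting the whole file:
    paper = db['24001']
    paper['citedByIDs'].append('24002')
    db['24001'] = paper  # write back the modified entry

# Later, load only the entries you need:
with shelve.open('papers.db') as db:
    print(db['24001']['citedByIDs'])  # prints ['24002']
```

Note that a shelf only persists a mutation when you assign the value back to its key (or open the shelf with writeback=True, which costs memory).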

I suggest writing tools to pump the data into a graph database, for example:

  • Neo4j

  • Titan:

    Gremlin is a graph traversal language that allows you to query the graph regardless of the underlying storage technology.

    If you need a cheap server to practice on, then I recommend firing up an instance on Amazon EC2. You can start the server, do your work, then shut it down.
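As a sketch of "writing tools to pump the data into a graph database": Neo4j's bulk importer consumes CSV files whose headers use the :ID, :LABEL, :START_ID, :END_ID and :TYPE conventions. The function below is my own illustration of exporting the paper citation graph in that shape; the file names, labels, and relationship type are assumptions, not part of the original answer:

```python
import csv

def export_for_neo4j(papers, citations,
                     nodes_path='papers.csv', edges_path='cites.csv'):
    """Write node and relationship CSVs in the header style used by
    Neo4j's bulk import tool (hypothetical field choices)."""
    with open(nodes_path, 'w', newline='') as f:
        w = csv.writer(f)
        w.writerow(['paperID:ID', 'title', ':LABEL'])
        for pid, title in papers:
            w.writerow([pid, title, 'Paper'])
    with open(edges_path, 'w', newline='') as f:
        w = csv.writer(f)
        w.writerow([':START_ID', ':END_ID', ':TYPE'])
        for citing, cited in citations:
            w.writerow([citing, cited, 'CITES'])

export_for_neo4j([('101', 'First paper'), ('202', 'Second paper')],
                 [('101', '202')])
```

Once imported, citation queries (who cites whom, co-authorship, PageRank-style analyses) run inside the database without loading the whole graph into Python's memory.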
