Saturday 15 June 2013

c - Parsing simple HTML into a tree


I would like to ask what the best way is to parse simple HTML like this into a DOM node tree:

example tree

Here are some constraints I am facing:

  • The HTML contains only tags, with no attributes, and I have to ignore whitespace
  • Tags such as <p>, <h1>, <a>, etc.
  • I cannot use libraries

    I was thinking about regex, but I have never tried it. Any thoughts?

    Every node in the tree has this structure:

      typedef struct tag {
          struct tag *parent;
          struct tag *nextSibling;
          struct tag *previousSibling;
          struct tag *firstChild;
          struct tag *lastChild;
          char name[NAME_LEN];  /* array size lost in the source; NAME_LEN is a placeholder */
          char *text;
      } Node;
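    To make the pointer bookkeeping in that structure concrete, here is a minimal sketch of how a node could be created and linked in as the last child of a parent. The helper names `node_new` and `node_append` and the capacity `NAME_LEN` (set to 16 here) are assumptions made for illustration; they do not come from the question.

```c
#include <stdlib.h>
#include <string.h>

#define NAME_LEN 16  /* assumed capacity; the size used in the question is not known */

typedef struct tag {
    struct tag *parent;
    struct tag *nextSibling;
    struct tag *previousSibling;
    struct tag *firstChild;
    struct tag *lastChild;
    char name[NAME_LEN];
    char *text;
} Node;

/* Allocate a zeroed node and copy in its tag name (truncated to fit). */
Node *node_new(const char *name)
{
    Node *n = calloc(1, sizeof *n);
    if (n == NULL)
        return NULL;
    /* calloc zeroed the buffer, so the last byte remains a terminator. */
    strncpy(n->name, name, NAME_LEN - 1);
    return n;
}

/* Link child as the last child of parent, keeping sibling pointers in sync. */
void node_append(Node *parent, Node *child)
{
    child->parent = parent;
    child->previousSibling = parent->lastChild;
    child->nextSibling = NULL;
    if (parent->lastChild != NULL)
        parent->lastChild->nextSibling = child;
    else
        parent->firstChild = child;  /* first child of this parent */
    parent->lastChild = child;
}
```

    Keeping both `firstChild` and `lastChild` lets a parser append children in constant time while still allowing iteration from the front.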

    I know that it is not in C, but this presentation may give you some ideas on how to deal with this problem efficiently.

    I have also written a very simple parser example in JavaScript (not in C either, but hopefully you know JS as well) based on your initial requirements. This means it will not parse any attributes, will not handle auto-closing tags, and will skip many other things that should be handled according to the HTML spec. It produces a parse tree in this format:

      {cn: [{tag: 'html', cn: [{tag: 'body', cn: [{tag: 'h1', cn: ['test']}, 'some text', ...]}]}]}

    Here is the code:

    Note: whitespace is not ignored and will be captured in the text nodes.

      var html = '<html><body><h1>test</h1>some text<div><p>text</p></div></body></html>';

      // Character-by-character state machine: each state function consumes one
      // token and returns the state function to use for the next token.
      // All state is local to the call, so the parser can be reused.
      function parseHTML(html) {
          var nodesStack = [],
              i = 0,
              len = html.length,
              stateFn = parseText,
              parseTree = {cn: []},
              alphaNumRx = /\w/,
              currentNode = parseTree,
              text = '',
              tag = '',
              newNode;

          // Entered right after '<': decide between a closing and an opening tag.
          function parseTag(token) {
              if (token === '/') {
                  return parseCloseTag;
              }
              i--; // backtrack to the first tag character
              return parseOpenTag;
          }

          function parseCloseTag(token) {
              if (token === '>') {
                  if (currentNode.tag !== tag) {
                      throw new Error('Wrong closed tag at char ' + i);
                  }
                  tag = '';
                  nodesStack.pop();
                  currentNode = currentNode.parentNode;
                  return parseText;
              }
              assertValidTagNameChar(token);
              tag += token;
              return parseCloseTag;
          }

          function parseOpenTag(token) {
              if (token === '>') {
                  currentNode.cn.push(newNode = {tag: tag, parentNode: currentNode, cn: []});
                  nodesStack.push(currentNode = newNode);
                  tag = '';
                  return parseText;
              }
              assertValidTagNameChar(token);
              tag += token;
              return parseOpenTag;
          }

          function parseText(token) {
              if (token === '<') {
                  if (text) {
                      currentNode.cn.push(text);
                      text = '';
                  }
                  return parseTag;
              }
              text += token;
              return parseText;
          }

          function assertValidTagNameChar(c) {
              if (!alphaNumRx.test(c)) {
                  throw new Error('Invalid tag name char at ' + i);
              }
          }

          for (; i < len; i++) {
              stateFn = stateFn(html[i]);
          }

          // Flush any text that follows the last tag.
          if (text) {
              currentNode.cn.push(text);
          }

          if ((currentNode = nodesStack.pop())) {
              throw new Error('Unbalanced tags: ' + currentNode.tag + ' is never closed.');
          }

          return parseTree;
      }

      console.log(parseHTML(html));
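    Since the question asks for C, the same idea can be sketched against the question's node structure. This is only a rough port under stated assumptions, not the answer's own code: `parse_html`, `add_child`, `free_tree` and `NAME_LEN` are hypothetical names, text is stored as a child node with an empty `name` (a convention chosen here), tag names are limited to ASCII letters and digits, and whitespace-only text is skipped because the question asks for whitespace to be ignored.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define NAME_LEN 16  /* assumed capacity for a tag name */

typedef struct tag {
    struct tag *parent, *nextSibling, *previousSibling, *firstChild, *lastChild;
    char name[NAME_LEN];  /* empty for text nodes (convention of this sketch) */
    char *text;           /* NULL for element nodes */
} Node;

/* Allocate a node named name[0..len) and append it as parent's last child. */
static Node *add_child(Node *parent, const char *name, size_t len)
{
    Node *n = calloc(1, sizeof *n);
    if (n == NULL) return NULL;
    if (len >= NAME_LEN) len = NAME_LEN - 1;
    memcpy(n->name, name, len);
    n->parent = parent;
    n->previousSibling = parent->lastChild;
    if (parent->lastChild) parent->lastChild->nextSibling = n;
    else parent->firstChild = n;
    parent->lastChild = n;
    return n;
}

/* Free a node and everything below it. */
void free_tree(Node *n)
{
    if (n == NULL) return;
    for (Node *c = n->firstChild; c != NULL; ) {
        Node *next = c->nextSibling;
        free_tree(c);
        c = next;
    }
    free(n->text);
    free(n);
}

/* Returns a synthetic root whose children are the top-level nodes, or NULL
   on a malformed tag, a mismatched or unclosed tag, or allocation failure. */
Node *parse_html(const char *s)
{
    Node *root = calloc(1, sizeof *root);
    Node *cur = root;  /* element currently being filled */
    if (root == NULL) return NULL;

    while (*s) {
        if (*s == '<') {
            int closing = (s[1] == '/');
            const char *name = s + 1 + closing;
            const char *end = name;
            while (isalnum((unsigned char)*end)) end++;
            size_t len = (size_t)(end - name);
            if (len == 0 || len >= NAME_LEN || *end != '>') goto fail;
            if (closing) {
                /* The closing tag must match the innermost open element. */
                if (cur == root || strlen(cur->name) != len ||
                    strncmp(cur->name, name, len) != 0) goto fail;
                cur = cur->parent;
            } else {
                cur = add_child(cur, name, len);
                if (cur == NULL) goto fail;
            }
            s = end + 1;
        } else {
            const char *end = s;
            int blank = 1;
            while (*end && *end != '<') {
                if (!isspace((unsigned char)*end)) blank = 0;
                end++;
            }
            if (!blank) {  /* whitespace-only runs are dropped */
                size_t n = (size_t)(end - s);
                Node *t = add_child(cur, "", 0);
                if (t == NULL) goto fail;
                t->text = malloc(n + 1);
                if (t->text == NULL) goto fail;
                memcpy(t->text, s, n);
                t->text[n] = '\0';
            }
            s = end;
        }
    }
    if (cur != root) goto fail;  /* some element was never closed */
    return root;

fail:
    free_tree(root);
    return NULL;
}
```

    Because every node keeps a `parent` pointer, the sketch needs no separate stack: closing a tag simply moves `cur` back up to its parent. The caller owns the returned tree and releases it with `free_tree`.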
