Thursday 15 January 2015

Python regex to strip HTML a tags without href attribute


I'm getting a string that has been cleaned with lxml's Cleaner, so all the links are now in the form <a rel="nofollow">Content</a>. I would like to strip out all the links that do not have an href attribute, keeping their text, e.g.

  <a rel="nofollow">Link to be removed</a>

should become

  Link to be removed

The same for:

  <a>Other link to be removed</a>

should become:

  Other link to be removed

In short, this applies to every link with a missing href attribute. It doesn't have to be a regex, but since lxml returns a clean markup structure, it should be possible. What I need is the source string stripped of these non-functional a tags.

Use the drop_tag method.

  import lxml.html

  root = lxml.html.fromstring('<div>Test <a rel="nofollow">Link to be <b>removed</b></a>. <a href="#">link</a></div>')
  # 'a[not(@href)]' selects <a> children of root that have no href attribute;
  # use './/a[not(@href)]' if the links can be nested deeper.
  for a in root.xpath('a[not(@href)]'):
      a.drop_tag()

  assert lxml.html.tostring(root, encoding='unicode') == '<div>Test Link to be <b>removed</b>. <a href="#">link</a></div>'

.drop_tag(): Drops the tag, but keeps its children and text.
