Thursday 15 May 2014

java - How to check if a PDF document contains an image


I am reading text from PDF documents using the iText library. However, some PDF documents may have images embedded in them in addition to the text.

I am wondering if there is any way, through iText or any other means, to determine whether a PDF document contains an image.

An accurate and 100% reliable check requires a PDF library. However, you can read the PDF as text and process it that way, which can be a fairly reliable check. You first have to check that the file really is a PDF, that is, that it starts with the PDF header,

    %PDF-...

then scan for the /XObject tag. If you get a hit, you need to check backwards and forwards in the stream for the << and >> delimiters of the dictionary, so that you can pull out the full XObject dictionary. Dictionaries can be nested inside << and >>, so you may want to check back for the 'obj' and 'stream' entries instead. Either way you will end up with something that looks like this,

    << /Type /XObject /Subtype /Image ... >>

You need to check that there is a /Subtype entry and an /Image entry separated by some white space. If you get that hit, then you have an image.
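The scan described above can be sketched in plain Java, without any PDF library. This is a simplified illustration, not a definitive implementation: the class and method names are hypothetical, it only checks for the header at the very start of the file, and it skips the step of walking back to the dictionary delimiters, looking only for /Subtype followed by /Image anywhere in the raw bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper: a raw text scan, not a real PDF parser.
final class PdfImageScan {

    // "/Subtype" followed by "/Image", with optional white space in between.
    // The lookahead rejects longer names that merely start with "Image".
    private static final Pattern IMAGE_SUBTYPE =
            Pattern.compile("/Subtype\\s*/Image(?![A-Za-z0-9])");

    // Assumption: the header sits at offset 0 of the file.
    static boolean looksLikePdf(byte[] data) {
        byte[] header = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data.length < header.length) {
            return false;
        }
        for (int i = 0; i < header.length; i++) {
            if (data[i] != header[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns true if the raw bytes contain an image XObject dictionary entry.
    static boolean containsImageXObject(byte[] data) {
        if (!looksLikePdf(data)) {
            throw new IllegalArgumentException("Not a PDF: missing %PDF- header");
        }
        // ISO-8859-1 maps every byte to exactly one char, so binary data is preserved.
        String text = new String(data, StandardCharsets.ISO_8859_1);
        return IMAGE_SUBTYPE.matcher(text).find();
    }
}
```

A caller could read the file with `java.nio.file.Files.readAllBytes` and pass the result to `containsImageXObject`. Reading the whole file into memory keeps the sketch short; a large document would need a streaming scan instead.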

So what are the limitations of this approach?

It is possible to embed an image in the document but never use it. That will result in a false positive, though I think it is quite unlikely. It would be very inefficient, and only a really perverse producer would do it.

Images can be embedded directly in page content streams (inline images), as Hugo pointed out above. These will result in false negatives. They are very unusual though. This is one of the bits of the spec which was never a good idea, and it is not widely used. If your documents come from a single producer (as often happens), it will become apparent very quickly whether that producer does this or not, though I think it would be very unusual. At a guess, I cannot imagine this construct appearing in more than 1% of PDFs in the wild.
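To illustrate what the text scan misses: an inline image sits inside a content stream between the BI, ID and EI operators, and content streams are normally compressed, so a raw scan never sees those operators. The sketch below assumes the content stream has already been decoded to text by some other means (that step is not shown), and the class and method names are hypothetical. It is a heuristic, since the same letters inside a text string would also match.

```java
import java.util.regex.Pattern;

// Hypothetical helper that works on an already decoded content stream.
final class InlineImageCheck {

    // BI begins an inline image and ID begins its data.
    // Both must appear as standalone, white-space-delimited tokens.
    private static final Pattern INLINE_IMAGE =
            Pattern.compile("(?:^|\\s)BI\\s.*?\\sID\\s", Pattern.DOTALL);

    static boolean containsInlineImage(String decodedContentStream) {
        return INLINE_IMAGE.matcher(decodedContentStream).find();
    }
}
```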

It is possible to embed these XObject tags in object streams instead of in ordinary objects. But I think you can pretty much discount this. While it is legal, it would be completely bizarre, and I do not think you will ever see it.

Doing it correctly involves scanning and parsing all the content streams in the PDF. That is what we do in ABCPDF (which I work on), but it is a lot of work and takes a lot of processing power. It can take several seconds on a large document.

Think about whether 99% reliability is going to be good enough :-)
