Tuesday 15 June 2010

java - An "Empty" Character Extracted from a PDF


I recently tried to use PDFBox to extract text from a PDF file. This works well for most PDFs, but for one PDF (which I am unfortunately not allowed to share), none of the periods at the ends of sentences are extracted. Instead, I get phrases like the following:

  ... what will this happen ... it will be important later ...   

It looks as though there is just a space where the period should be, but that is not the case (at least on Mac OS X). If you copy the text into a text editor and move the text cursor through the phrase, there is an "empty character" immediately after the last letter of the word preceding the gap. To see this:

  • Place the cursor just before the last letter of the word preceding the gap and press the right arrow key. The cursor moves one position to the right.
  • Press the right arrow key again. The cursor stays where it is.
  • Press the right arrow key one more time. The cursor moves to the other side of the space.
  • Continuing to press the right arrow key behaves as expected.

    It appears that PDFBox has extracted some kind of "empty character" in place of the period. I have tried to replace it in a few different ways, but I have had no luck:

      String oldText = text;
      text = text.replace('\u0000', '.'); // Unicode null
      text = text.replace('\0', '.');     // C-style null
      System.out.println(oldText.equals(text)); // prints true, so nothing was replaced
      // Also tried text.replace(null, '.'), but that does not compile

    What is this "empty character", and how can I replace it?

    Edit: It has been suggested that the character might be something like \uFEFF, but trying to replace that did not work either.

    After finding that the character was neither \uFEFF nor \u0000, the two Unicode values suggested by other Stack Overflow users, I decided to run a test to find out what the character really was. Using code to print its Unicode value, I discovered that the mysterious character was \u0008, which is the backspace control character. Why PDFBox extracted it from the PDF, I do not know, but text = text.replace('\u0008', '.') now replaces it with the missing period.
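The test code itself is not shown in the post. The following is only a minimal sketch of how such a check might look, assuming the extracted text is already held in a Java String. The class name ControlCharInspector and the methods findControlChars and restorePeriods are hypothetical names chosen for illustration; they are not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: reports unexpected control
// characters in already-extracted text and swaps backspace for a period.
class ControlCharInspector {

    // Returns one entry per control character found, e.g. "index 1: U+0008".
    static List<String> findControlChars(String text) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Line breaks and tabs are normal in extracted text, so skip them.
            boolean expected = c == '\n' || c == '\r' || c == '\t';
            if (Character.isISOControl(c) && !expected) {
                found.add("index " + i + ": U+" + String.format("%04X", (int) c));
            }
        }
        return found;
    }

    // Replaces every backspace character (U+0008) with a period,
    // mirroring the fix described in the post.
    static String restorePeriods(String text) {
        return text.replace('\u0008', '.');
    }

    public static void main(String[] args) {
        // Made-up sample with a backspace where the period should be.
        String sample = "it will be important later\u0008 Another sentence";
        for (String entry : findControlChars(sample)) {
            System.out.println(entry);
        }
        System.out.println(restorePeriods(sample));
    }
}
```

Character.isISOControl only covers the ranges U+0000 to U+001F and U+007F to U+009F, so a character such as \uFEFF would not be reported by this sketch. Blindly replacing every backspace with a period is also an assumption that happens to fit this one PDF and may not hold for others.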
