Friday 15 July 2011

python - How can I merge together sentence objects?


I created a sentence tokenizer that breaks a paragraph down into sentences, words and characters; each of these is its own data type. But the sentence system is a two-stage system, because things like ' . . . ' throw it off, since it goes one character at a time, whereas it works fine if it is '...' with no spaces.

The output comes out more fragmented than it should be, but if I can do some secondary processing on it, it will work perfectly. So this is where my question comes in: I am not sure how to write a system that joins each sentence that does not end in sentence-ending punctuation to the sentence that follows it, without losing anything along the way.

Here is what the output looks like now, and what I need it to look like:

Some sentences that are spliced...

and there is a continuation.

This cannot be confused by the U.S.A.

In that

last sentence...

an abbreviation ended the sentence!

So sentence objects that do not end in a sentence-ending delimiter, i.e. '.', '?' or '!', need to be attached to the next sentence, and so on until there is a sentence with a real ending delimiter. The other thing that makes this difficult is that ' . . . ' counts as a continuation, not an end of sentence, so it needs to be joined as well.

This is how it should look:

Some sentences that are spliced... and there is a continuation.

This cannot be confused by the U.S.A.

In that last sentence... an abbreviation ended the sentence!

Here is the code with which I was working:

  last = []
  merge = []
  for s in stream:
      if last:
          old = last.pop()
          if '.' not in old.as_utf8 and '?' not in old.as_utf8 and '!' not in old.as_utf8:
              new = old + s
              merge.append(new)
          else:
              merge.append(s)
      last.append(s)

So there are some problems with this method ...

  1. It only joins one sentence to another; it does not keep going when 2 or 3 fragments need to be joined.

  2. It drops the first sentence if there is no punctuation in it.

  3. It does not deal with ' . . . ' as a continuation. I know I did not do anything for this, and that is because I am not completely sure how to approach the problem when a sentence can end with an abbreviation: I could count how many '.' are in the sentence, but that would be thrown off by 'U.S.A.', because it counts as 3 periods.
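One possible way to tell a continuation from a real ending, without counting periods, is to look only at how the fragment ends. The sketch below is an illustration, not part of the original post: the names `ELLIPSIS_RE` and `ends_sentence` are made up here, and it assumes fragments are available as plain strings.

```python
import re

# Three or more dots at the end of the text, with optional whitespace
# between them, so both '...' and ' . . . ' are recognised.
ELLIPSIS_RE = re.compile(r"\.(?:\s*\.){2,}\s*$")


def ends_sentence(text):
    """Return True if text ends with '.', '?' or '!' but not with an ellipsis."""
    stripped = text.rstrip()
    if not stripped:
        return False
    if ELLIPSIS_RE.search(stripped):
        return False  # an ellipsis is treated as a continuation
    return stripped[-1] in ".?!"


print(ends_sentence("last sentence . . ."))                    # False
print(ends_sentence("This cannot be confused by the U.S.A."))  # True
```

Because only the tail of the string is inspected, the periods inside 'U.S.A.' are never counted. The check cannot tell whether a fragment that ends in an abbreviation is really finished; that would need something extra, such as a list of known abbreviations.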

    I have written an __add__ method for the sentence class, so you can write sentence + sentence, which works as a way of joining one to another.
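The question does not show the sentence class itself, so the following is only a guess at its general shape. The class name `Sentence` and the single-space join are assumptions; the `as_utf8` attribute name is borrowed from the loop above.

```python
class Sentence(object):
    """Hypothetical stand-in for the sentence type described in the question."""

    def __init__(self, text):
        # Attribute name taken from the question's code; storing a plain
        # string here is an assumption.
        self.as_utf8 = text

    def __add__(self, other):
        if not isinstance(other, Sentence):
            return NotImplemented
        # Join the two fragments with exactly one space between them.
        return Sentence(self.as_utf8.rstrip() + " " + other.as_utf8.lstrip())

    def __repr__(self):
        return "Sentence(%r)" % self.as_utf8


merged = Sentence("In that") + Sentence("last sentence...")
print(merged.as_utf8)  # In that last sentence...
```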

    Any help will be highly appreciated. Let me know if any of this is unclear, and I will do my best to elaborate.

    OK, here's some working code. Is this roughly what you want? I am not happy with it yet, it looks a bit ugly, but I want to know whether this is the right direction.

      words = '''Some sentences that are spliced... and there is a continuation. This cannot be confused by the U.S.A. In that last sentence... an abbreviation ended the sentence!'''.split()

      def format_sentence(words):
          output = []
          for word in words:
              if word.endswith('...') or not word.endswith('.'):
                  # Ellipsis or ordinary word: the sentence continues.
                  output.append(word)
                  output.append(' ')
              elif word.endswith('.'):
                  # A single trailing period ends the sentence.
                  output.append(word)
                  output.append('\n')
              else:
                  raise ValueError('Unexpected result with word: %r' % word)
          return ''.join(output)

      print(format_sentence(words))

    Output:

      Some sentences that are spliced... and there is a continuation.
      This cannot be confused by the U.S.A.
      In that last sentence... an abbreviation ended the sentence!
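The answer works word by word on a plain string. The same idea can be applied to whole fragments with a buffer that keeps growing until a fragment really ends a sentence, which addresses the first two problems listed in the question. This is a sketch only: `merge_fragments`, `answer_style_end` and the `join` parameter are names invented here, and the end test extends the answer's `endswith` check to '?' and '!'.

```python
def merge_fragments(stream, is_complete, join):
    """Merge consecutive fragments until one ends a sentence.

    stream      -- iterable of fragments (plain strings in this demo)
    is_complete -- callable: True when a merged fragment ends a sentence
    join        -- callable combining two fragments into one
    """
    merged = []
    pending = None  # fragments collected so far for the current sentence
    for s in stream:
        pending = s if pending is None else join(pending, s)
        if is_complete(pending):
            merged.append(pending)
            pending = None
    if pending is not None:
        # Keep a trailing fragment even if it never reached a terminator.
        merged.append(pending)
    return merged


def answer_style_end(text):
    # Like the answer's test, '...' continues; '.', '?' or '!' ends.
    return text.endswith(('.', '?', '!')) and not text.endswith('...')


fragments = [
    "Some sentences that are spliced...",
    "and there is a continuation.",
    "This cannot be confused by the U.S.A.",
    "In that",
    "last sentence...",
    "an abbreviation ended the sentence!",
]
result = merge_fragments(fragments, answer_style_end, lambda a, b: a + " " + b)
for sentence in result:
    print(sentence)
```

With sentence objects instead of strings, `join` could simply be `lambda a, b: a + b` so that the existing `+` support is used, and `is_complete` would look at the object's text, for example `lambda s: answer_style_end(s.as_utf8)`.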
