Tuesday 15 April 2014

regex - Name matching in the Congressional Record


I am trying to come up with a regular expression that identifies the specific naming conventions used in the Congressional Record.

The speaker's name always comes before the speech. For example, here's an excerpt:

Mr. DORANEN of California. Mr. President, I was going to yield to my friend, but I have a problem. The Intelligence Committee is organizing

Can the gentleman get by with 15 minutes and see?

Mr. RITTER. If the gentleman can only give me 6 minutes.

Mr. DORANEN of California. Can the gentleman make it in 4?

Mr. President, I yield to the gentleman from Pennsylvania [Mr. de la CRUZ].

Mr. de la CRUZ. blah blah blah

Ms. MacCORAM of Washington.

The naming convention used in the Congressional Record begins with a title (Mr., Mrs., or Ms.), followed by the last name in all caps. In some cases, the last name is followed by "of [State]" (Mr. DORANEN of California).

In other words, the regular expression must match a string that meets the following criteria:

  1. Starts with a title (Mr., Mrs., or Ms.) at the beginning of the string.
  2. (rarely) Continues with some lowercase words (the 'de la CRUZ' example).
  3. Contains a name in all caps (or mostly caps, as in the MacCORAM example).
  4. (in some cases) Continues with 'of [State Name]'.
  5. Ends in a period.
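The criteria above can be sketched as one verbose pattern with a piece per criterion. This is only an illustration in Python: the names `SPEAKER` and `find_speaker` are hypothetical, the `Mac`/`Mc` prefix handling is an assumption about what "mostly caps" means, and the state part assumes capitalized words such as "California" or "New York".

```python
import re

# One sub-pattern per criterion; all names here are illustrative only.
SPEAKER = re.compile(
    r"""
    ^(?P<title>Mr|Mrs|Ms)\.        # 1. title at the start of the string
    \s+
    (?P<name>
        (?:[a-z]+\s+)*             # 2. optional lowercase words ("de la")
        (?:Mac|Mc)?[A-Z]{2,}       # 3. surname in caps, with an assumed Mac/Mc prefix
    )
    (?:\s+of\s+(?P<state>[A-Z][a-z]+(?:\s[A-Z][a-z]+)*))?   # 4. optional "of [State Name]"
    \.                             # 5. closing period
    """,
    re.VERBOSE,
)


def find_speaker(line):
    """Return (title, name, state) if the line opens with a speaker label, else None."""
    m = SPEAKER.match(line)
    if m is None:
        return None
    return m.group("title"), m.group("name"), m.group("state")
```

Because the surname must be a run of capitals, a line such as "Mr. President, I yield ..." is not treated as a speaker label.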

    The first one is easily done: ^(Mr\.|Mrs\.|Ms\.)

    But the rest has me stuck.

    How about:

      ^((?:Mr\.|Mrs\.|Ms\.)[^.]*[A-Z]{2,})(?:(?: of )([^.]*)){0,1}\.
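To try the suggestion out, here is a small Python sketch. It assumes one reading of the pattern: a title alternation of Mr./Mrs./Ms., a run of two or more capitals written as `[A-Z]{2,}`, and a state introduced by " of ". The helper name `extract` is hypothetical, and the sample lines follow the convention described in the question.

```python
import re

# Assumed reading of the suggested pattern: title, anything up to a run of
# two or more capitals, an optional " of <state>", then a period.
PATTERN = re.compile(
    r"^((?:Mr\.|Mrs\.|Ms\.)[^.]*[A-Z]{2,})(?:(?: of )([^.]*))?\."
)


def extract(line):
    """Return (speaker, state) for a line that opens with a speaker label, else None."""
    m = PATTERN.match(line)
    if m is None:
        return None
    return m.group(1), m.group(2)


samples = [
    "Mr. RITTER. If the gentleman can only give me 6 minutes.",
    "Mr. de la CRUZ. blah blah blah",
    "Mr. DORANEN of California. Can the gentleman make it in 4?",
    "Mr. President, I yield to the gentleman from Pennsylvania [Mr. de la CRUZ].",
]

for line in samples:
    print(extract(line))
```

Group 1 holds the title and name, and group 2 holds the state when one is present. One caveat of this reading: because `[^.]*` accepts anything up to the first period, a sentence such as "Mr. President, I spoke with the FBI." would also match, so the result may need a further check on the captured name.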

