Wednesday 15 July 2015

huffman code - compression algorithm for street names wanted -


I have millions of street names and compression algorithms I'm not sure which algorithm will be the best fit. In most street names, they have general substrings, such as "road", "path", ...

The set of all road names is fixed and will not change dynamically < P> At first I was thinking about Huffman coding, but only one letter code, so this would not give great performance, so I thought of generating a trilogy and calculating the most common substring. Then I can code some sort of code to get the word back, and pressing these codes using something like Huffman coding, I'm not sure that it is not making it more complicated than that.

Does anyone know a compression technique that makes sense in my case?

Edit1

I am used like this: I have a limited storage size phone device. This phone needs to have all the street names of all roads in a particular country. Now every street object has some value and one of them is the name of the road in the form of a string. It takes up more space and I would like to reduce it because the name is quite similar, that means ending at "... road" or "... way", I thought that a specific compression algorithm applied to this scenario May be worth doing.

A simple Gzip has brought about a compression of 50%, I think it is possible to get more than that.

Edit 2

AB M. The solution of Pedersen is actually giving very good performance results. Here are some code (written in C #):

  Private induced item [_] _ items; Public Zero Compressions (string [] strings) {array; strings; _Items = new indexed item [string.label]; String end string = string Empty; For (int i = 0; i  J & amp; Previous string [J] == Strings [I] [J]) {J ++; } _Items [i] = New Induced Entom () {prefix = J, suffix = strings [I]. Asbestring (j)}; Last string = wired [i]; }} Private start indexed item {public byte prefix; Public string suffix; }   

Even after compression, I am sending it through DeflateStream, which results in approximately 30%

Full compression happens to answer

Depending on your data set, You can start by giving order, and then you can represent the name of each street in the form of the name of the previous street + 'different part'.

An example of some similar street names:

  Copying the name of the previous road is in Hex. Road Name Rest Original Original VVV Oriz Size New Size Broadwalk 0 Broadwalk 9 10 Broadwater 7 Tier 8 Broadwater Access A Access 17 Broadwater Bluff B Bluff 16 Broadwater Branch C Farm 17 6 ​​Broadwater Bridge Dee Dies 17 Broadwater Cametery B Semitic 19 9 Broadwater Creek Sea Reich 16 5 Broadwater Point B Point 16 6 Broadwater Private Sea Weight 14 Broadway A211 Broadway 7 W8 Broadway And Union 8 And Union 18 11 Broadway A Article 9 Partitions 19 10 Broadway Avenue 9 Places 15 6 --- 220 93 You will need a series of names to be able to get the real names, but if you fulfill Record each N in the way you can customize it according to your needs.  

Only add 5-6 bit per letter to it together, and maybe replace some normal objects, you should see 50% with BPIP.

No comments:

Post a Comment