PDA

View Full Version : PicBasic text compress routine



MegaADY
- 24th November 2005, 12:30
Hello ! I am working on a project in which I need to store some result aquired in a serial EEPROM. The problem is that I need a lot of space in the memory, and I don't want to use cascaded eeproms or flash cards. So I thought about a text compressing routine. Can anyone help me with this ? Thank you

Kamikaze47
- 25th November 2005, 05:52
Ive got a feeling that the pbp code required to compress and de-compress text would be awfully long. I hope you have a lot of program space left on your PIC.

MegaADY
- 25th November 2005, 07:51
I know, that's why I use only 1k of 8k program memory.

tdavis80
- 25th November 2005, 14:21
any chance that any of the data would contain 'often used' words?

for example: If it was storing text that seemed to have lots of DayOf MonthOf , YearOf info, you could parse the data so those words are stored as a single byte. If you used bytes > 127 as 'compressed' words, you could compress about 127 words.

Tuesday, November 1, 2005 = 25 bytes
[128], [129] 1, [130] = 8 bytes

The above example assumes that all of the received text would have an ascii value less than 128 so that you could use the high half of a byte for compressed characters.

peudoCode:

if ascii of character < 128 then character is normal
if ascii of character > 127 then
table_position = character - 127 ' example: 128 - 127 = 1
lookup table_position to find uncompessed word
endif

does this make sense?

MegaADY
- 25th November 2005, 14:45
My whole character set is : 0123456789;=? And I think the most repeated character is 0 . So I thought about this scheme:

0 - 0
00 - c
000 - k
0000 - M
and so on .. do you have any ideea ?

Luciano
- 25th November 2005, 15:18
Hi,

- What is the possible range of the values?

Example:

0.00 (First possible value)
0.01
0.99
...
...
199.99 (Last possible value).


- How many stored values do you need?

- What is the available space in the EEPROM? (bytes)


Luciano

MegaADY
- 25th November 2005, 15:29
I need to store strings of 40 characters like
938437437=002;3482?0000284=332

Theese are not values ... just 40 chars strings
I have 512k eeprom
and I need to store as much as possible

Luciano
- 25th November 2005, 16:41
Hi,

Your characters are in the range from ASCII 48 to ASCII 63.
You can put two characters in one byte.

The characters : > < are not used.
You can use them to implement your idea.

Use : for 00
Use < for 000
Use > for 0000

Best regards,

Luciano

http://img8.picsplace.to/img8/1/encoding.jpg

keithdoxey
- 25th November 2005, 22:12
My whole character set is : 0123456789;=? And I think the most repeated character is 0 . So I thought about this scheme:

0 - 0
00 - c
000 - k
0000 - M
and so on .. do you have any ideea ?

Very difficult to say without seeing more examples of your data.

The method described by Luciano will give you a guaranteed 2:1 compression irrespective of what the data is. If you need to compress the data more than that and have instances where characters (other than 0 ) repeat for 2 or more consecutive places then how about this for an idea.

As you say, you only have a character set of 13 characters and Luciano has already pointed out that you could fit two of your characters into a single byte.

My idea is similar but one half of the byte contains the code for the character and the other half of the byte the number of counts for that character eg

nnnncccc where n = number of characters and c = character

08700900200 (11 bytes) would be 10 18 17 20 19 20 12 20 = 8 bytes = 72%

however

22377777773354500000661666 (26 byte) would be
22 13 77 23 15 14 15 50 26 11 36 = 11 bytes = 42%

Obviously you cant have a count greater than 15 but if you had say twenty five "Zeros" then that would compress as F0 A0 = 25 chars into two bytes = 8%

As I say, compression depends on the data, but that is true for any compression routine. Worst case, your 40 characters take 40 bytes, best case your 40 characters are all identical eg
40 x 2 and would compress to three bytes
F2 F2 A2 = 15 x 2 + 15 x 2 + 10 x 2 = 40 x 2

You need to analyse your data for repeating patterns to see if that would help.

Easy to encode.
Get first character and nnnn = 1
while next character is the same and nnnn is < 15 increase nnnn
store it and get the next character and nnnn = 1.

To decode
For x = 1 to nnnn
character = cccc

Thinking a bit further, this method could also be used for a larger number of characters eg A to Z (upper case only) is 26 characters so could be stored in 5 bits allowing the remaining 3 bits to be used as a counter meaning that upto 7 consecutive letters could be compressed to a single byte.

Thanks for makng me think about this.... its given me an idea for something I am already working on :)

Regards

tdavis80
- 26th November 2005, 04:54
As I read your example another idea comes to mind.

Are the strings predictable? Are the non-numerics always in the same place?
If yes, ignore them and only store the numbers.

for the numbers, use packed decimal format. It is VERY easy to encode/decode with picbasic. Packed decimal stores 2 numbers per byte by using a nibble per digit.

Example:

14332569 = 8 bytes
Store the hex as: 14 33 25 69 = 4 bytes

You can encode/decode by simply shifting the data on the byte.

If you 40 chars were 36 numbers plus 4 alphas, the result would be 18 bytes which you could decode and reinsert the alphas in the appropriate places.

If the alphas could appear anywhere but are always 1 of a small set of possibilities your still in luck. The packed decimal uses 0-9 but still leaves you a-f for your own use (6 special characters).

another example:

Lets assume ";" = A "=" = B

Then 12254324;3654=6554 (18 bytes) would encode as:

hex= 12 25 43 24 A3 65 4B 65 54 (9 bytes)

are we getting closer?

ps: I just read your comment about your entire character set. This is good news because it means you could definately use the packed decimal format for 2:1 encoding with low code overhead.

Luciano
- 26th November 2005, 09:51
Is your data the TRACK 2 of a credit card?

See this link and read about TRACK 2:
http://www.gae.ucm.es/~padilla/extrawork/tracks.html

Also in this forum you have started another thread where
you are asking how to send out data over a DC power line.
Is that the power line of a credit card reader?

Thread about "Serial comms on VCC line"
http://www.picbasic.co.uk/forum/showthread.php?t=2806


I hope I am wrong.

Luciano

MegaADY
- 28th November 2005, 07:56
Thank you guys for helping me with the compression stuff. This week I will start to analyze the data to see what is the propper method of compression.
Track2 ? No, it's just a meteo station,whit a lot of home made modules, made by someone who was fired and left no documentation of what he's done, and I have to dig into his work, and improove it . :(
Thank you all !