OT but maybe interesting - compression schemes?

I was thinking last night that a way to compress digital files without loss is to number long strings of ones and zeros. In BCD only 16 ones or zeros are required to describe a string of 1024 ones and zeros. This is because (I think) 2 to the 9th is the number of possible combinations of a string of ones and zeros 1024 digits long. So a number from 0 to 1023 could be assigned to each of the possible combinations. Then these numbers can be used to uncompress the file. This scheme is so simple that there must be ways of compressing files even further without loss. Are there any folks here who can explain any of these to a layman? Thanks, Eric R Snow, Who could use a little compressing around the middle.

Reply to
Eric R Snow

You've just described perhaps the oldest algorithm for compressing images. So old and so good that virtually all fax machines use it.

google run-length encoding

Reply to
Jim Stewart

Eric, your thinking is sound but you're about 20 years behind the state of the art. I don't mind if you want to try to do better than a bunch of C. Sci. PhD's from Berkeley, but personally I suggest you may be wasting your time.

GWE

Reply to
Grant Erwin

Me, I don't care if a guy's twenty years behind the times or not, if he figures out a good way to do something he's doing good.

John

Reply to
JohnM

There are 2^9 (that is, 2 to the 9th) distinct strings of 9 ones and zeroes. There are 2^1024 strings of 1024 such. Proof: Let b_i be the ith bit of k bits. There are 2 ways to choose it, independently of every other bit. So there are 2*2*2*...*2 (that is, 2^k) ways to choose a string of k bits.

I don't know what you mean by "In BCD only 16 ones or zeroes are required to describe a string of 1024 ones and zeroes." If you mean strings of only ones, or only zeroes, you only need 1 bit; you can have 0 stand for 1024 0's, and 1 for 1024 1's.

RLE [run length encoding, as mentioned by Jim Stewart] basically stores one bit or character and a run length in place of each run. For bit-mapped-graphics files, there is often good compression. But on average, RLE takes at least k bits to represent a random k-bit string. See the explanation of "Lossless data compression must always make some files longer" in the Wikipedia article (formatting link).
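
Just to make the idea concrete, here's a quick run-length sketch in Python (the function names and the (bit, count) format are my own toy illustration, not how a real fax codec stores things): each run of identical bits becomes a (bit, count) pair, so a mostly-blank line shrinks a lot while alternating bits don't shrink at all.

# Toy run-length encoder/decoder, purely for illustration.
def rle_encode(bits):
    """Turn a string like '0000111' into [('0', 4), ('1', 3)]."""
    runs = []
    i = 0
    while i < len(bits):
        j = i
        while j < len(bits) and bits[j] == bits[i]:
            j += 1
        runs.append((bits[i], j - i))
        i = j
    return runs

def rle_decode(runs):
    """Inverse of rle_encode."""
    return "".join(bit * count for bit, count in runs)

blank_line = "0" * 1000 + "1" * 24   # long runs, like white space on a fax page
noisy_line = "01" * 512              # no runs at all
for data in (blank_line, noisy_line):
    runs = rle_encode(data)
    assert rle_decode(runs) == data  # lossless round trip
    print(len(data), "bits ->", len(runs), "runs")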

OT OT note: I see there is apparently an IBM patent with 11 months left to run for LZW, the formerly-Sperry-patented compression method commonly used for GIF compression, per (formatting link).

-jiw

Reply to
James Waldby

In the world of lossless file-compression theory, think of absolutely every file as just a number. The shortest file is zero bytes long, represented as zero; the next shortest is a single 0 bit, represented by one; then a single 1 bit, represented as 2, and so on. Even a multi-gigabyte file is simply a number with a shit load of numerals in it. Compression is basically a rearrangement of this set of numbers. There is no actual compression going on.

A simple compression algorithm that's in most books on the subject can take "aaaaa" and replace it with "a(5)", but a string like "(((" would have to be flagged somehow (typically with "\(\(\(") or it would confuse the decompression algorithm. Some stuff gets shorter, some gets longer, the net result is zero, and you get lossless compression. Your compression algorithm tries to skip that step, with no trade-off, so by definition it won't compress anything.

Getting specific, 1024 binary bits can represent 2^1024 unique numbers. Just 2^32 is already 4,294,967,296, way more than the 1023 values you were planning on using to hold all 2^1024 combinations.

A popular simple compression method that I can't remember the name of actually builds a binary tree. Stuff that occurs frequently is near the trunk, and stuff that's rare is near the tips. The file gets compressed into a string of ones and zeros describing how to traverse the tree to reach the various nodes. Frequent items, since they're near the base, need only a very short description to be reached, so you frequently make short trips through the tree. Rare items need a long description, so you make the occasional long trip.

Data with high repetition of a small number of items, like a text document, compresses very well this way. Data with lots of different items that occur less frequently (like what you see in sound files) compresses poorly or even grows. And that's the trade-off. Text files get smaller, but binary files get larger. What you have to do is pick and choose compression methods that best match the data you're trying to describe.
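
To put rough numbers on the "a(5)" example above, here's a minimal Python sketch of that kind of scheme (the exact escaping format is my own guess, just for illustration): runs of repeated characters shrink, but parentheses and backslashes have to be escaped and come out longer.

# Toy "aaaaa" -> "a(5)" scheme; "(", ")" and "\" are escaped so the
# decoder isn't confused. Details invented purely for illustration.
def squeeze(text):
    out = []
    i = 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1
        run = j - i
        ch = text[i]
        if ch in "()\\":
            ch = "\\" + ch                 # escape the special characters
        out.append(ch + "(%d)" % run if run > 1 else ch)
        i = j
    return "".join(out)

print(squeeze("aaaaa"))   # a(5)   -- 5 characters become 4
print(squeeze("((("))     # \((3)  -- 3 characters become 5
print(squeeze("abc"))     # abc    -- unchanged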

Reply to
B.B.

which depends on the entropy of the data.

Information theory is a fascinating subject in which I make no pretense of any expertise.

I was amazed at how basic information theory made the following puzzle trivial:

The king had 12 coins, one of which was counterfeit -- either heavy or light compared to genuine coins. The wizard had a simple balance. His task was to determine which of the 12 coins was counterfeit, and whether it was heavy or light, with three experiments using his simple balance. How did he do it?

Solving this problem is a trivial exercise using the concept of information entropy.
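
For anyone who wants to see the counting behind that (this is just the standard information-theory argument, not necessarily the exact solution Don's class used): there are 12 x 2 = 24 possible answers (which coin, and whether it's heavy or light), and each weighing has three outcomes (left heavy, right heavy, balanced), so three weighings can distinguish at most 3^3 = 27 cases, just enough. In bits:

from math import log2

outcomes = 12 * 2               # which coin is fake, and heavy vs. light
info_needed = log2(outcomes)    # about 4.58 bits
per_weighing = log2(3)          # a 3-outcome weighing yields about 1.58 bits
print(info_needed, 3 * per_weighing)   # ~4.58 <= ~4.75, so 3 weighings are not ruled out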

When first presented as a homework problem, only one very determined student found a solution by beating it into submission. When presented again three weeks later after some study of info theory, nearly every student was able to solve it easily.

Reply to
Don Foreman

a 16 bit BCD*) is a 4 digit decimal number. This can only have a range from 0...999, not 0..1024. It also has not 1024 bits, but 1000 different _values_. To represent them in a more compact binary form (BCD is _not_ the most compact), it requires 10 bits.

*) BCD is binary coded decimal. Every digit requires 4 bits.

Nick

Reply to
Nick Müller

Oh, Sh** Correct: a 16 bit BCD*) is a 4 digit decimal number. This can have a range from 0...9999, not 0..1024. It also has not 1024 bits, but 10000 different _values_. To represent them in a more compact binary form (BCD is _not_ the most compact), it requires 14 bits.

*) BCD is Binary Coded Decimal. Every (decimal) digit requires 4 bits.
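
A small Python sketch to see the numbers (the packing function is my own, only to illustrate the bit counts): 16 bits of BCD name the 10000 values 0..9999, while 14 plain binary bits can name 16384 values, which is why binary is the more compact form.

# Toy BCD packer: 4 bits per decimal digit, so 4 digits take 16 bits.
def to_bcd(n, digits=4):
    assert 0 <= n < 10 ** digits
    return "".join(format(int(d), "04b") for d in ("%0*d" % (digits, n)))

print(to_bcd(9999))        # 1001100110011001 : 16 bits of BCD
print(format(9999, "b"))   # 10011100001111   : the same value in 14 plain binary bits
print(2 ** 14)             # 16384 values fit in 14 bits, more than BCD's 10000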

Nick

Reply to
Nick Müller
[big snip]

It's called Huffman Coding, and I think it was invented in the 1950s. It is in my textbooks from the late 1960s.
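
For the curious, here's a minimal Huffman-style sketch in Python (my own toy version, just to show the idea B.B. described): build the tree bottom-up from symbol counts, then read each symbol's code off the path from the root. Frequent symbols end up near the root and get short codes.

import heapq
from collections import Counter

def huffman_codes(text):
    """Build a toy Huffman code table: frequent symbols get short bit strings."""
    counts = Counter(text)
    # Heap entries are (count, tiebreak, tree); a tree is a symbol or a (left, right) pair.
    heap = [(c, i, sym) for i, (sym, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)
        c2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (c1 + c2, tiebreak, (left, right)))
        tiebreak += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):     # internal node: go left for 0, right for 1
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                           # leaf: record this symbol's code
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

text = "this is just a small sample of plain english text, see?"
codes = huffman_codes(text)
coded_bits = sum(len(codes[ch]) for ch in text)
# Ignores the cost of storing the code table itself; it's only a sketch.
print(8 * len(text), "bits as plain 8-bit characters, about", coded_bits, "bits Huffman-coded")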

There has been a lot of work over the decades on compression schemes, and it would be hard to come up with a new scheme at this point. But not impossible. In any event, some serious library research is in order.

Joe Gwinn

Reply to
Joseph Gwinn

Grant, I'm not trying to re-invent the wheel. It just occurred to me a couple nights ago that the method mentioned is so simple that it must have been thought of before, and I am curious about other methods. I sure as hell don't have the math ability to come up with some compression scheme that is better than what already exists. Eric

Reply to
Eric R Snow

Thanks to Nick and all the others who have responded. Late at night, letting my mind wander sometimes brings to my attention something I've never really thought about. Then I get curious. It sure is great to have this well of informative people accessible. ERS

Reply to
Eric R Snow

Reply to
william_b_noble
