64个字符一共需要2的6次方,即6位2进制来存储,
6/8 = x/100 解出x=75,所以压缩率至少能保证在75%
压缩的时候采用多种压缩方法
采用压缩率最高的算法
可行的压缩算法有:Huffman算法、字典算法--目前我就能想出这两种
Huffman算法:这个就不用说了吧....
字典算法:把整篇文章中出现几率最高的几对"词组"用剩下的尚未使用的其他ascii码进行表示,这样就实现了以少搏多
以上就是我的看法了~~~谁有更好的,说出来参考参考
淘宝杜琨
你指你用了这个教程的算法吧?你是这篇文章的原创作者?
http://bbs.ggv.com.cn/wqxbbs/read.php?tid=64499
既然你觉得汇编那么难实现都能实现出来,那用C语言或者你熟悉的java应该很简单吧?
来试试啊!
[此贴子已经被作者于2007-2-28 15:59:32编辑过]
[此贴子已经被作者于2007-2-28 16:00:55编辑过]
LZW Compression
by Bill Lucier (Blucier@ersys.edmonton.ab.ca, b.lucier1 on Genie)
LZW is perhaps the most widely used form of data compression today. It
is simple to implement and achieves very decent compression at a fairly
quick pace. LZW is used in PKZIP (shrink),PKARC (crunch), gifs,V.42bis
and unix's compress. This article will attempt to explain how the
compression works with a short example and 6502 source code in Buddy
format.
Originally named lz78, it was invented by Jacob Ziv and Abraham Lempel
in 1978 , it was later modified by Terry Welch to its present format.
The patent for the LZW compression method is presently held by Unisys.
LZW compresses data by taking phrases and compressing them into codes.
The size of the codes could vary from 9 bits to 16 bits. Although for
this implementation we will be using only 12 bits. As byte are read in
from a file they are added to a dictionary. For a 12-bit implementation
a dictionary will be 4k (2^12=4096) . Each entry in the dictionary
requires five bytes, which will be built in the form of a tree. It is
not a binary tree because each node may have more than two offsprings.
In fact, because our dictionary can hold up to 4096 different codes it
is possible to have one node with 3800 children nodes, although this is
not likely to happen. The five bytes that make up our tree will be:
The parent code: Each node has one and only one parent node. When the parent
code is less then 255 it is the end of a phrase. The codes
0-255 do not actually exist in the tree. The following
values do not appear either as they have special meaning:
256 : End of Stream-This marks the end of one compressed file
257 : This tells the decompressor to increase the number
of bits its reading by one.
258 : Wipe out dictionary
The code value : This is the code that will be sent to the compressor.
The character : The value contained at this node. It have a value of 0-255.
Initially we send out codes that are 9 bits long, this will cover the values
0-511. Once we have reached 511, we will need to increase the number of
bits to write by 1. This will give room for code numbers 512-1023, or
(2^10)-1. At this point we must ensure that the decompressor knows how
bits to read in at once so a code number 257 is sent to indicate that
the number of bits to be read is to be bumped up by one. The size of the
dictionary is finite so at some point we do have to be concerned with
what we will do when it does fill up. We could stop compiling new
phrases and just compress with the ones that are already in the
dictionary. This is not a very good choice, files tend to change
frequently (eg. program files as they change from code to data) so
sticking with the same dictionary will actually increase the size of the
file or at best, give poor compression. Another choice is to wipe the
dictionary out and start building new codes and phrases, or wipe out
some of the dictionary leaving behind only the newer codes and phrases.
For the sake of simplicity this program will just wipe out the
dictionary when it becomes full.
To illustrate how LZW works a small phrase will be compressed : heher.
To start the first two characters would be read in. The H would be
treated as the parent code and E becomes the character code. By means of
a hashing routine (the hashing routine will be explained more fully in
the source code) the location where HE should be is located. Since we
have just begun there will be nothing there,so the phrase will be added
to the dictionary. The codes 0-258 are already taken so we start using
259 as our first code. The binary tree would look something like this:
node # 72 - H
|
node #3200 259 - E
The node # for E is an arbitrary one. The compressor may not choose
that location, 3200 is used strictly for demonstration purposes. So at
node #3200 the values would be:
Parent code - 72
code value - 259
character - E
The node #72 is not actually used. As soon as a value less than 255 is
found it is assumed to be the actual value. We can't compress this yet
so the value 72 is sent to the output file(remember that it is sent in 9
bits). The E then becomes the parent code and a new character code ( H )
is read in. After again searching the dictionary the phrase EH is not
found. It is added to the dictionary as code number 260. Then we send
the E to the disk and H becomes the new parent code and the next E
becomes the new character code. After searching the dictionary we find
that we can compress HE into the code 259,we want to compress as much as
possible into one code so we make 259 the parent code. There may be a
longer string then HE that can be compressed. The R is read in as the
new character code. The dictionary is searched for the a 259 followed a
R, since it is not found it is added to the dictioary and it looks like
this:
node #72 - H node #69 - E
| |
node #3200 259 - E node #1600 260 - H
|
node #1262 261 - R
Then the value 259 is sent to the output file (to represent the HE) and
since that is the EOF the R is sent as well,as well as a 256 to indicate
the EOF has been reached.
Decompression is extremely simple. As long as the decompressor maintains
the dictionary as the compressor did, there will be no problems,except
for one problem that can be handled as an exceptional case. All of the
little details of increasing the number of bits to read, and when to
flush the dictionary are taken care of by the compressor. So if the
dictionary was increased to 8k, the compressor would have to be set up
to handle a larger dictionary, but the decompressor only does as the
compressed file tells it to and will work with any size dictionary. The
only problem would be that a larger dictionary will creep into the ram
under the rom or possibly even use all available memory, but assuming
that the ram is available the decompressor will not change. The
decompressor would start out reading 9 bits at a time, and starts it
free code at 259 as the compressor did. To use the above input from the
compressor as an example, the output was:
72 - For the First H
69 - For the First E
259 - For the Compressed HE
82 - For the R
256 - Eof indicator
To begin decompressing, two values are needed. The H and E are read in,
(note they will both be 9 bits long). As they are both below 256 they
are at the end of the string and are sent straight to the output file.
The first free code is 259 so that is the value assigned to the phrase
HE. Note when decompressing there is no need for the hashing routine,
the codes are the absolute locations in the dictionary (i.e. If the
dictionary was considered to be an array then the entry number 259 would
be dictionary[259]), because of this, the code value is no longer
needed. So the decompressor would have an entry that looks like this:
Node # 259
Parent Code - H
Character - E
The decompressor will read in the next value (259). Because the node
number is at the end of the compressed string we will have to take the
code value and place it on a stack, and take them off in a
Last-in,First-out (LIFO) fashion. That is to say that the first
character to go on the stack (in this case the E) will be the last to
come off. The size of the stack is dependent on the size of the
dictionary, so for this implementation we need a stack that is 4k long.
After all the characters from the string have been placed on the stack
they are taken off and sent to the outputfile.
There is one small error that is possible with LZW because of the way
the compressor defines strings. Consider the compression dictionary that
has the following in it:
node # Code Parent character
Value code
------ ------ ------ ---------
65 65 n/a A
723 259 65 C
1262 260 259 U
2104 261 260 T
2506 262 261 E
Now if the compressor was to try to compress the string ACUTEACUTEA The
compressor will find a match for the first five characters 'ACUTE' and
will send a 262 to the output file. Then it will add the following entry
to the dictionary:
3099 263 262 A
Now it will try to compress the remaining characters, and it finds that
it can compress the entire string with the code 263, but notice that the
middle A, the one that was just added onto the end of the string 'ACUTE'
was never sent to the output file. The decompressor will not have the
code 263 defined in it's dictionary. The last code it will have defined
will be 262. This problem is easily remedied though, when the
decompressor does not have a code defined, it takes the first letter
from the last phrase defined and tacks it onto the end of the last
phrase. IE It takes the first letter (the A) from the phrase and adds it
on to the end as code #263.
This particular implementation is fairly slow because it reads a byte
and then writes one, it could be made much faster with some buffering.
It is also limited to compressing and decompressing one file at a time
and has no error checking capabilities. It is meant strictly to teach
LZW compression, not provide a full fledged compressor.
And now for the code:
SYS 4000 ; sys 999 on a 64
.DVO 9 ; or whatever drive used for output
.ORG 2500
.OBJ "LZW.ML"
TABLESIZE =5021
; THE TABLESIZE IS ACTUALLY 5021, ABOUT 20% LARGER THEN 4K. THIS GIVES
; THE HASHING ROUTINE SOME ROOM TO MOVE. IF THE TABLE WAS EXACTLY 4K
; THERE WOULD BE FREQUENT COLLISIONS WHERE DIFFERENT COMBINATIONS OF
; CHARACTERS WOULD HAVE THE SAME HASH ADDRESS. INCREASING THE TABLE SIZE
; REDUCES THE NUMBER OF COLLISIONS.
EOS =256 ; eos = End of stream This marks the end of file
FIRSTCODE =259
MAXCODE =4096
BUMPCODE =257 ; Whenever a 257 is encountered by the decompressor it
; increases the number of bits it reads by 1
FLUSHCODE =258
TABLEBASE =14336 ; The location that the dictionary is located at
DECODESTACK =9300 ; The location of the 4k LIFO stack
; ORG = DECOMPRESS FILE
; ORG + 3 = COMPRESS FILE
JMP EXPANDFILE
;********************************
; COMPRESSFILE
;********************************
COMPRESSFILE JSR INITDIC ; EMPTY THE DICTIONARY
LDA #128
STA BITMASK
LDA #0
STA RACK
JSR GETCHAR ; GET A CHAR FROM THE INPUT FILE
STA STRINGCODE ; INITIALIZE THE STRINGCODE (PARENT CODE)
LDA #0
STA STRINGCODE+1
NEXTCHAR JSR GETCHAR
STA CHARACTER
JSR FINDNODE ; FINDNODE CALCULATES THE HASHED LOCATION OF
LDA ($FE),Y ; THE STRINGCODE AND CHARACTER IN THE DICT.
INY ; AND SETS $FE/$FF POINTING TO IT. IF THE ENTRY
AND ($FE),Y ; HAS TWO 255 IN IT THEN IT IS EMPTY AND SHOULD
CMP #255 ; BE ADDED TO THE DICTIONARY.
BEQ ADDTODICT
LDA ($FE),Y ; IT HAS A DEFINED PHRASE. STORE THE CODE VALUE IN
STA STRINGCODE+1; THE PARENT CODE
DEY
LDA ($FE),Y
STA STRINGCODE
JMP EOF
ADDTODICT LDY #0
- LDA NEXTCODE,Y
STA ($FE),Y
INY
CPY #5
BNE -
INC NEXTCODE ; INCREASE THE NEXTCODE
BNE +
INC NEXTCODE+1
+ JSR OUTPUT
LDA NEXTCODE+1 ; CHECK IF NEXTCODE=4096 IF SO THEN FLUSH THE
CMP #>MAXCODE ; DICTIONARY AND START ANEW
BNE CHECKBUMP
LDA NEXTCODE
CMP #<MAXCODE
BNE CHECKBUMP
LDA #<FLUSHCODE ; SEND THE FLUSH CODE TO THE COMPRESSED FILE SO
STA STRINGCODE ; THE DECOMPRESSOR WILL KNOW TO FLUSH THE
LDA #>FLUSHCODE ; DICTIONARY
STA STRINGCODE+1
JSR OUTPUT
JSR INITDIC
JMP CHECKEOF
CHECKBUMP LDA NEXTBUMP+1
CMP NEXTCODE+1 ; CHECKBUMP CHECK TO SEE IF THE NEXTCODE HAS
BNE CHECKEOF ; REACHED THE MAXIMUM VALUE FOR THE CURRENT
LDA NEXTBUMP ; NUMBER OF BITS BEING OUTPUT.
CMP NEXTCODE ; FOR X BITS NEXTCODE HAS Y PHRASES
BNE CHECKEOF ; -------- -----------------------
LDA #>BUMPCODE ; 9 511
STA STRINGCODE+1 ; 10 1023
LDA #<BUMPCODE ; 11 2047
STA STRINGCODE ; 12 4095
JSR OUTPUT
INC CURRENTBITS
ASL NEXTBUMP
ROL NEXTBUMP+1
CHECKEOF LDA #0
STA STRINGCODE+1
LDA CHARACTER
STA STRINGCODE
EOF LDA 144
BNE DONE
JMP NEXTCHAR
DONE JSR OUTPUT
LDA #>EOS ; SEND A 256 TO INDICATE EOF
STA STRINGCODE+1
LDA #<EOS
STA STRINGCODE
JSR OUTPUT
LDA BITMASK
BEQ +
JSR $FFCC
LDX #3
JSR $FFC9
LDA RACK ; SEND WHAT BITS WEREN'T SEND WHEN OUTPUT
JSR $FFD2
+ JSR $FFCC
LDA #3
JSR $FFC3
LDA #2
JMP $FFC3
;**********************************
; INITDIC
; INITIALIZES THE DICTIONARY, SETS
; THE NUMBER OF BITS TO 9
;**********************************
INITDIC LDA #9
STA CURRENTBITS
LDA #>FIRSTCODE
STA NEXTCODE+1
LDA #<FIRSTCODE
STA NEXTCODE
LDA #>512
STA NEXTBUMP+1
LDA #<512
STA NEXTBUMP
LDA #<TABLEBASE
STA $FE
LDA #>TABLEBASE
STA $FF
LDA #<TABLESIZE
STA $FC
LDA #>TABLESIZE
STA $FD
- LDY #0
LDA #255 ; ALL THE CODE VALUES ARE INIT TO 255+256*255
STA ($FE),Y ; OR -1 IN TWO COMPLEMENT
INY
STA ($FE),Y
CLC
LDA #5 ; EACH ENTRY IN THE TABLE TAKES 5 BYTES
ADC $FE
STA $FE
BCC +
INC $FF
+ LDA $FC
BNE +
DEC $FD
+ DEC $FC
LDA $FD
ORA $FC
BNE -
RTS
;************************************
; GETCHAR
;************************************
GETCHAR JSR $FFCC
LDX #2
JSR $FFC6
JMP $FFCF
;************************************
; OUTPUT
;************************************
OUTPUT LDA #0 ; THE NUMBER OF BITS OUTPUT CAN BE OF A VARIABLE
STA MASK+1 ; LENGTH,SO THE BITS ARE ACCUMULATED TO A BYTE IS
LDA #1 ; FULL AND THEN IT IS SENT TO THE OUTPUT FILE
LDX CURRENTBITS
DEX
- ASL
ROL MASK+1
DEX
BNE -
STA MASK
MASKDONE LDA MASK
ORA MASK+1
BNE +
RTS
+ LDA MASK
AND STRINGCODE
STA 3
LDA MASK+1
AND STRINGCODE+1
ORA 3
BEQ NOBITON
LDA RACK
ORA BITMASK
STA RACK
NOBITON LSR BITMASK
LDA BITMASK
BNE +
JSR $FFCC
LDX #3
JSR $FFC9
LDA RACK
JSR $FFD2
LDA #0
STA RACK
LDA #128
STA BITMASK
+ LSR MASK+1
ROR MASK
JMP MASKDONE
;******************************
; FINDNODE
; THIS SEARCHES THE DICTIONARY TILL IT FINDS A PARENT NODE THAT MATCHES
; THE STRINGCODE AND A CHILD NODE THAT MATCHES THE CHARACTER OR A EMPTY
; NODE.
;*******************************
; THE HASHING FUNCTION - THE HASHING FUNCTION IS NEEDED BECAUSE
; THERE ARE 4096 X 4096 (16 MILLION) DIFFERENT COMBINATIONS OF
; CHARACTER AND STRINGCODE. BY MULTIPLYING THE CHARACTER AND STRINGCODE
; IN A FORMULA WE CAN DEVELOP OF METHOD OF FINDING THEM IN THE
; DICTIONARY. IF THE STRINGCODE AND CHARACTER IN THE DICTIONARY
; DON'T MATCH THE ONES WE ARE LOOKING FOR WE CALCULATE AN OFFSET
; AND SEARCH THE DICTIONARY FOR THE RIGHT MATCH OR A EMPTY
; SPACE IS FOUND. IF AN EMPTY SPACE IS FOUND THEN THAT CHARACTER AND
; STRINGCODE COMBINATION IS NOT IN THE DICTIONARY
FINDNODE LDA #0
STA INDEX+1
LDA CHARACTER ; HERE THE HASHING FUNCTION IS APPLIED TO THE
ASL ; CHARACTER AND THE STRING CODE. FOR THOSE WHO
ROL INDEX+1 ; CARE THE HASHING FORMULA IS:
EOR STRINGCODE ; (CHARACTER << 1) ^ STRINGCODE
STA INDEX ; FIND NODE WILL LOOP TILL IT FINDS A NODE
LDA INDEX+1 ; THAT HAS A EMPTY NODE OR MATCHES THE CURRENT
EOR STRINGCODE+1 ; PARENT CODE AND CHARACTER
STA INDEX+1
ORA INDEX
BNE +
LDX #1
STX OFFSET
DEX
STX OFFSET+1
JMP FOREVELOOP
+ SEC
LDA #<TABLESIZE
SBC INDEX
STA OFFSET
LDA #>TABLESIZE
SBC INDEX+1
STA OFFSET+1
FOREVELOOP JSR CALCULATE
LDY #0
LDA ($FE),Y
INY
AND ($FE),Y
CMP #255
BNE +
LDY #0
RTS
+ INY
- LDA ($FE),Y
CMP STRINGCODE-2,Y
BNE +
INY
CPY #5
BNE -
LDY #0
RTS
+ SEC
LDA INDEX
SBC OFFSET
STA INDEX
LDA INDEX+1
SBC OFFSET+1
STA INDEX+1
AND #128
BEQ FOREVELOOP
CLC
LDA #<TABLESIZE
ADC INDEX
STA INDEX
LDA #>TABLESIZE
ADC INDEX+1
STA INDEX+1
JMP FOREVELOOP
;***************************
; CALCULATE
; TAKES THE VALUE IN INDEX AND CALCULATES ITS LOCATION IN THE DICTIONARY
;****************************
CALCULATE LDA INDEX
STA $FE
LDA INDEX+1
STA $FF
ASL $FE
ROL $FF
ASL $FE
ROL $FF
CLC
LDA INDEX
ADC $FE
STA $FE
LDA INDEX+1
ADC $FF
STA $FF
CLC
LDA #<TABLEBASE
ADC $FE
STA $FE
LDA #>TABLEBASE
ADC $FF
STA $FF
LDY #0
RTS
;******************************
; DECODESTRING
;******************************
DECODESTRING TYA ; DECODESTRING PUTS THE STRING ON THE STACK
STA COUNT ; IN A LIFO FASHION.
LDX #>DECODESTACK
CLC
ADC #<DECODESTACK
STA $FC
STX $FD
LDA #0
STA COUNT+1
- LDA INDEX+1
BEQ +
JSR CALCULATE
LDY #4
LDA ($FE),Y
LDY #0
STA ($FC),Y
LDY #2
LDA ($FE),Y
STA INDEX
INY
LDA ($FE),Y
STA INDEX+1
JSR INFC
JMP -
+ LDY #0
LDA INDEX
STA ($FC),Y
INC COUNT
BNE +
INC COUNT+1
+ RTS
;******************************
; INPUT
;******************************
INPUT LDA #0 ; THE INPUT ROUTINES IS USED BY THE DECOMPRESSOR
STA MASK+1 ; TO READ IN THE VARIABLE LENGTH CODES
STA RETURN
STA RETURN+1
LDA #1
LDX CURRENTBITS
DEX
- ASL
ROL MASK+1
DEX
BNE -
STA MASK
- LDA MASK
ORA MASK+1
BEQ INPUTDONE
LDA BITMASK
BPL +
JSR GETCHAR
STA RACK
+ LDA RACK
AND BITMASK
BEQ +
LDA MASK
ORA RETURN
STA RETURN
LDA MASK+1
ORA RETURN+1
STA RETURN+1
+ LSR MASK+1
ROR MASK
LSR BITMASK
LDA BITMASK
BNE +
LDA #128
STA BITMASK
+ JMP -
INPUTDONE RTS
;*******************************
; EXPANDFILE
; WHERE THE DECOMPRESSION IS DONE
;*******************************
EXPANDFILE LDA #0
STA RACK
LDA #128
STA BITMASK
START JSR INITDIC
JSR INPUT
LDA RETURN+1
STA OLDCODE+1 ; Save the first character in OLDCODE
LDA RETURN
STA CHARACTER
STA OLDCODE
CMP #<EOS
BNE +
LDA RETURN+1 ; If return = EOS (256) then all done
CMP #>EOS
BNE +
JMP CLOSE
+ JSR $FFCC
LDX #3
JSR $FFC9
LDA OLDCODE ; Send oldcode to the output file
JSR $FFD2
NEXT JSR INPUT
LDA RETURN
STA NEWCODE
LDA RETURN+1
STA NEWCODE+1
CMP #1 ; All of the special codes Flushcode,BumpCode & EOS
BNE ++ ; Have 1 for a MSB.
LDA NEWCODE
CMP #<BUMPCODE
BNE +
INC CURRENTBITS
JMP NEXT
+ CMP #<FLUSHCODE
BEQ START
CMP #<EOS
BNE +
JMP CLOSE
+ SEC ; Here we compare the newcode just read in to the
LDA NEWCODE ; next code. If newcode is greater than it is a
SBC NEXTCODE ; ACUTEACUTEA situation and must be handle differently.
STA 3
LDA NEWCODE+1
SBC NEXTCODE+1
ORA 3
BCC +
LDA CHARACTER
STA DECODESTACK
LDA OLDCODE
STA INDEX
LDA OLDCODE+1
STA INDEX+1
LDY #1
BNE ++
+ LDA NEWCODE ; Point index to newcode spot in the dictionary
STA INDEX ; So DECODESTRING has a place to start
LDA NEWCODE+1
STA INDEX+1
LDY #0
+ JSR DECODESTRING
LDY #0
LDA ($FC),Y
STA CHARACTER
INC $FC
BNE +
INC $FD
+ JSR $FFCC
LDX #3
JSR $FFC9
L1 LDA COUNT+1 ; Count contains the number of characters on the stack
ORA COUNT
BEQ +
JSR DECFC
LDY #0
LDA ($FC),Y
JSR $FFD2
JMP L1
+ LDA NEXTCODE ; Calculate the spot in the dictionary for the string
STA INDEX ; that was just entered.
LDA NEXTCODE+1
STA INDEX+1
JSR CALCULATE
LDY #2 ; The last character read in is tacked onto the end
LDA OLDCODE ; of the string that was just taken off the stack
STA ($FE),Y ; The nextcode is then incremented to prepare for the
INY ; next entry.
LDA OLDCODE+1
STA ($FE),Y
INY
LDA CHARACTER
STA ($FE),Y
INC NEXTCODE
BNE +
INC NEXTCODE+1
+ LDA NEWCODE
STA OLDCODE
LDA NEWCODE+1
STA OLDCODE+1
JMP NEXT
CLOSE JSR $FFCC
LDA #2
JSR $FFC3
LDA #3
JMP $FFC3
DECFC LDA $FC
BNE +
DEC $FD
+ DEC $FC
LDA COUNT
BNE +
DEC COUNT+1
+ DEC COUNT
RTS
INFC INC $FC
BNE +
INC $FD
+ INC COUNT
BNE +
INC COUNT+1
+ RTS
NEXTCODE .WOR 0
STRINGCODE .WOR 0
CHARACTER .BYT 0
NEXTBUMP .WOR 0
CURRENTBITS .BYT 0
RACK .BYT 0
BITMASK .BYT 0
MASK .WOR 0
INDEX .WOR 0
OFFSET .WOR 0
RETURN .WOR 0
COUNT .WOR 0
NEWCODE .WOR 0
OLDCODE .WOR 0
TEST .BYT 0
TO DRIVE THE ML I WROTE THIS SMALL BASIC PROGRAM. NOTE THAT CHANNEL TWO IS
ALWAYS THE INPUT AND CHANNEL THREE IS ALWAYS THE OUTPUT. EX AND CO MAY BE
CHANGED TO SUIT WHATEVER LOCATIONS THE PROGRAM IS REASSEMBLED AT.
live41你现在让我有一种与天斗的感觉
与天斗,不可为,不可为....
你想达到60%以上的压缩率简值是与推翻牛顿三大定律一样一难.......