custom lz4 compression

Programming-related discussions about game research
spiritovod
Posts: 719
Joined: Sat Sep 28, 2019 7:00 pm

custom lz4 compression

Post by spiritovod »

Originally, this game used light obfuscation (a simple XOR, in particular) for compressed assets, but since the official release they've switched to something that looks like a modified implementation of LZ4.

Personally I'm not interested in the game, so the question would be: is it possible to deduce something from the compressed data instead of reversing the compression code? I took a quick look at how LZ4 blocks are formed (according to the official documentation), and it appears that some sequences in a block are malformed or handled differently (a different approach to overlapping matches?). For example, in the uasset block from the samples, the first 4 sequences seem fine, but in the 5th the suggested sequence size is bigger than the actual one, so the decompressor fails while reading the offset part of that sequence. I've also noticed that quickbms reports the actual position in the file where the decompressor fails in its error message, which is quite convenient.

Regardless, any thoughts would be appreciated, just for research purposes. Note that the sample filenames contain the suggested uncompressed size, and one of the samples can actually be decompressed (not sure if correctly or not).
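For anyone who wants to poke at the blocks themselves, here's a rough sketch of a raw LZ4 block walker in Python, based on my reading of the official block format documentation. It reports the sequence where decoding breaks, similar to the position quickbms prints. This is just a research sketch (the function name and error messages are mine), not a hardened decoder:

```python
# Minimal raw LZ4 block walker, following the official block format:
# each sequence is [token][literal-length ext][literals][offset LE16][match-length ext],
# and the last sequence carries literals only.
def lz4_walk(block):
    out = bytearray()
    pos = 0
    n = len(block)
    while pos < n:
        seq_start = pos
        token = block[pos]
        pos += 1
        # High nibble: literal length (15 means "read extension bytes")
        lit_len = token >> 4
        if lit_len == 15:
            while True:
                b = block[pos]; pos += 1
                lit_len += b
                if b != 255:
                    break
        if pos + lit_len > n:
            raise ValueError("literal run past end, sequence at %d" % seq_start)
        out += block[pos:pos + lit_len]
        pos += lit_len
        if pos == n:            # last sequence: literals only, no match
            break
        # Two-byte little-endian match offset
        offset = block[pos] | (block[pos + 1] << 8)
        pos += 2
        if offset == 0 or offset > len(out):
            raise ValueError("bad offset %d in sequence at %d" % (offset, seq_start))
        # Low nibble: match length - 4 (15 means "read extension bytes")
        match_len = (token & 15) + 4
        if token & 15 == 15:
            while True:
                b = block[pos]; pos += 1
                match_len += b
                if b != 255:
                    break
        # Copy byte-by-byte so overlapping matches work
        for _ in range(match_len):
            out.append(out[-offset])
    return bytes(out)
```

Running it over the samples and catching the ValueError should give the same failure positions you saw in quickbms.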
SleepyBell
Posts: 3
Joined: Thu Mar 10, 2022 10:05 pm

Re: custom lz4 compression

Post by SleepyBell »

I'm looking into this and my current hypothesis is that there's no alternative LZ4 compression handling logic. Instead, I believe that it is regular LZ4 compression altered after the fact by some kind of encryption or obfuscation. Perhaps something like the XOR that you mentioned the game used to use.

The obfuscation seems to alter some of the bytes, leaving most of the bytes in the clear.

When this obfuscation alters bytes that are compression "literals" or "offsets", a regular decompression routine will output wrong values but continue to decompress without noticing a problem.
When this obfuscation alters compression "tokens" or "lengths", a regular decompressor will likely get lost, hit invalid sequences, and maybe give up.
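To illustrate why a corrupted token derails the parser while a corrupted literal doesn't: the LZ4 token packs the literal length into its high nibble and (match length - 4) into its low nibble, so a single flipped token bit shifts every field that follows. A quick sketch (the helper name is mine):

```python
def token_fields(token):
    # High nibble: literal length; low nibble: match length - 4
    return token >> 4, (token & 0x0F) + 4

# A clean token: 5 literals, then an 8-byte match
print(token_fields(0x54))   # (5, 8)

# One flipped high bit and the parser now expects 13 literals,
# consuming 8 extra bytes and misreading everything after them
print(token_fields(0xD4))   # (13, 8)
```

A flipped literal byte, by contrast, only changes one output byte and leaves the sequence framing intact, which matches the two behaviors described above.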

So far, I haven't seen any obvious pattern for which bytes get altered or how they change. Sometimes there are 10 clear bytes in a row, then an altered one; sometimes the clear run is as short as 4 bytes. I haven't yet spotted a definite case of two altered bytes in a row.

I've been focusing on the asset file you mentioned because it's got plenty of ASCII strings, in which I can spot holes and guess what was there originally.
Here's a list of some offsets in that compressed asset file, my guess as to the clear byte that was there before encryption, and the byte I actually found there. (Numbers are in decimal, in CSV format, so you can import them into a spreadsheet or script for analysis.)

Code: Select all

compressed_pos,guessed_clear_value,found_obfuscated_value
60,14,50
124,105,138
135,108,251
146,87,165
167,117,107
213,85,38
220,47,48
228,116,10
234,50,34
242,110,16
308,111,26
342,105,131
387,108,218
428,108,206
457,105,204
471,99,49
477,67,101
529,111,193
571,101,219
579,109,182
638,108,211
766,104,41
777,100,86
891,97,118
900,101,0
907,101,115
1196,105,72



Notice that the byte value 108 seems to be represented in the encrypted stream as at least four different values: 251, 218, 206, 211.

On the other hand, I have yet to find any of the encrypted values reused.

Also notice that the byte value 101 encodes to 0 at offset 900, while 7 bytes later the same input value encodes to 115. What's the relationship between these numbers? I don't know.
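In case it saves anyone some spreadsheet work, here's a quick script that probes the table above for the simplest candidate relationships: a fixed XOR key, a fixed additive delta, or a position-dependent XOR. None of them comes out constant, which matches the observations above:

```python
# (compressed_pos, guessed_clear_value, found_obfuscated_value) from the table
rows = [
    (60, 14, 50), (124, 105, 138), (135, 108, 251), (146, 87, 165),
    (167, 117, 107), (213, 85, 38), (220, 47, 48), (228, 116, 10),
    (234, 50, 34), (242, 110, 16), (308, 111, 26), (342, 105, 131),
    (387, 108, 218), (428, 108, 206), (457, 105, 204), (471, 99, 49),
    (477, 67, 101), (529, 111, 193), (571, 101, 219), (579, 109, 182),
    (638, 108, 211), (766, 104, 41), (777, 100, 86), (891, 97, 118),
    (900, 101, 0), (907, 101, 115), (1196, 105, 72),
]

xors   = {g ^ f for _, g, f in rows}           # fixed XOR key?
deltas = {(f - g) & 0xFF for _, g, f in rows}  # fixed additive delta?
posxor = {g ^ f ^ (p & 0xFF) for p, g, f in rows}  # XOR with low byte of position?

# If any set had a single element, that transform would be the key.
print(len(xors), len(deltas), len(posxor))

# The "108 maps to four different values" observation, mechanically:
print(sorted({f for _, g, f in rows if g == 108}))
```

All three sets have many elements, so whatever the scheme is, it isn't one of these trivial cases; a stream cipher keyed per position (or per file) seems more likely.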


As for detecting an LZ4 file with these modifications, I suppose an algorithm could look for "unreasonable" offset values (ones that seek further back than the decompressed output extends). Theoretically, it could even try to get back on track by trying out some nearby possibilities.
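That heuristic could look something like this sketch: walk the sequences without actually decompressing, and flag any match offset that reaches further back than the bytes produced so far (the function name is mine, and the recovery pass is left unimplemented):

```python
# Flag "unreasonable" LZ4 match offsets: any offset that seeks further
# back than the output produced so far. Returns (sequence_start, offset)
# pairs; a recovery pass could retry nearby token/length values there.
def find_bad_offsets(block):
    bad, produced, pos, n = [], 0, 0, len(block)
    while pos < n:
        seq_start = pos
        token = block[pos]; pos += 1
        lit = token >> 4
        if lit == 15:                  # literal-length extension bytes
            while pos < n:
                b = block[pos]; pos += 1
                lit += b
                if b != 255:
                    break
        pos += lit
        produced += lit
        if pos >= n - 1:               # last sequence: literals only
            break
        offset = block[pos] | (block[pos + 1] << 8)
        pos += 2
        mlen = (token & 15) + 4
        if token & 15 == 15:           # match-length extension bytes
            while pos < n:
                b = block[pos]; pos += 1
                mlen += b
                if b != 255:
                    break
        if offset == 0 or offset > produced:
            bad.append((seq_start, offset))
        produced += mlen
    return bad
```

On a clean LZ4 block this returns an empty list; on the samples here it should light up at (or shortly after) the first altered token or length byte.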