Overview of game file formats and archives

aluigi
Site Admin
Posts: 12984
Joined: Wed Jul 30, 2014 9:32 pm

This is a paper I wrote in April 2013 but it was never released until now.
It offers just an introduction to, and an overview of, the formats that you see daily on this forum.
It also includes some statistics that I collected in 2013.
The text version of the document is available here:
http://aluigi.org/papers/game_formats_stats.txt

Every post is a section of the paper.

Introduction and formats

Games use a lot of different file types: textures, sounds, music, 3D
models, AI scripts, animation scripts, configuration files, images,
videos and so on.

Instead of leaving those files scattered in the game's folder,
developers prefer to store them in one or more archives for the
following main reasons:

  • performance
    accessing a single file (the whole archive) requires fewer resources
    than opening and closing every individual file; it results in
    shorter loading times and less memory and disk usage (no disk
    allocation unit alignment and no continuous opening of different
    files).
  • content protection
    these archives often contain encrypted content: game developers and
    publishers try to prevent its use for modding or personal use (for
    example listening to a soundtrack) and obviously its embedding in
    other commercial projects.
    In this case the adopted solutions range from the simple obfuscation
    of the content by XORing the data with a fixed byte or key to
    customized encryption algorithms.
  • saving space
    many archives use compression algorithms and other mechanisms to
    save disk space; this was quite common in the past and it's just as
    necessary nowadays, when games occupy gigabytes of space.

These solutions can be used alone or often combined, so it's not rare
to see an archive containing compressed and encrypted content.
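
As a minimal illustration of the combined case, the following Python
sketch compresses the data with zlib and then obfuscates it by XORing
it with a repeating key. The function names and the order of the two
layers are assumptions made for this example, they are not taken from
any specific game.

```python
import zlib
from itertools import cycle

def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR every byte of data with the repeating (non-empty) key.
    A 1-byte key is the 'fixed byte' case; applying the function twice
    with the same key restores the original data."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))

def pack(data: bytes, key: bytes) -> bytes:
    # Compress first, then obfuscate: obfuscated data compresses badly.
    return xor_bytes(zlib.compress(data), key)

def unpack(blob: bytes, key: bytes) -> bytes:
    # Reverse order: remove the XOR layer, then decompress.
    return zlib.decompress(xor_bytes(blob, key))
```

An extractor only needs unpack(); pack() is what a rebuilder or
reimporter would have to implement.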


When we face an archive or an encrypted/compressed file, our goal is
just to dump its content and later understand how to use the dumped
files: for example a 3D model in software like 3ds Max, an Ogg file in
a media player, or custom formats that must be converted to other
formats and so on.

The part of the procedure covered by this document is just the first
step, understanding a file format for extracting its content "as is".

Usually only the following parameters are necessary:

  • offset of the resource, the location of the file inside the archive
    (where it begins)
  • size of the resource
  • optional compressed/uncompressed size if the file has been shrunk
    with a compression algorithm
  • optional name of the resource, often the original name of the
    archived file

This information is usually stored in an index table, often called the
TOC (table of contents). In some games it may be encrypted to prevent
the correct dumping of the resources, while in other formats the
resources are stored sequentially, which removes the need for an offset
field for each file.

In the next examples I will use some words to identify some common
fields:
  • OFFSET location of the file in hexadecimal (0x22 = 34)
  • ZSIZE compressed size
  • SIZE normal and uncompressed size
  • FILES amount of files stored in the archive
  • NAME name of the stored file
  • FILE the content (data) of the stored file

Example of Index table:

        +-----------------+
        | FILES         2 |
        +-----------------+
      /-| OFFSET 00000022 |
      | +-----------------+
      | | SIZE         41 |
      | +-----------------+
      | | NAME   test.txt |
      | +-----------------+
    /-+-| OFFSET 0000004b |
    | | +-----------------+
    | | | SIZE         20 |
    | | +-----------------+
    | | | NAME   blah.dat |
    | | +-----------------+-------------------------+
    | \>| FILE 1                                    |
    |   +----------------------+--------------------+
    \-->| FILE 2               |
        +----------------------+
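
The diagram above can be sketched in Python as follows. The field
widths are an assumption (16-bit FILES, 32-bit OFFSET and SIZE, 8-byte
NAME, all little endian): they are not stated in the diagram, but they
happen to give a TOC of 0x22 bytes, which matches the first OFFSET.
All the function names are hypothetical.

```python
import struct

def build_example() -> bytes:
    """Build the archive of the diagram: FILES, then OFFSET/SIZE/NAME
    for each entry, then the data of the files."""
    files = [(b"test.txt", b"A" * 41), (b"blah.dat", b"B" * 20)]
    offset = 2 + len(files) * 16        # 2 + 2 * (4 + 4 + 8) = 0x22
    toc, data = struct.pack("<H", len(files)), b""
    for name, content in files:
        toc += struct.pack("<II8s", offset, len(content), name)
        data += content
        offset += len(content)
    return toc + data

def read_index_table(archive: bytes):
    """Return a list of (name, offset, size) tuples read from the TOC."""
    (count,) = struct.unpack_from("<H", archive, 0)
    entries = []
    for i in range(count):
        offset, size, name = struct.unpack_from("<II8s", archive, 2 + i * 16)
        entries.append((name.rstrip(b"\0").decode("ascii"), offset, size))
    return entries

def extract(archive: bytes, offset: int, size: int) -> bytes:
    """Dump one file "as is" using the values taken from the TOC."""
    return archive[offset:offset + size]
```

With this layout read_index_table(build_example()) gives test.txt at
0x22 (41 bytes) and blah.dat at 0x4b (20 bytes), like in the diagram.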



Example of sequential files:

        +-----------------+
        | SIZE         41 |
        +-----------------+
        | NAME   test.txt |
        +-----------------+-------------------------+
        | FILE 1                                    |
        +-----------------+-------------------------+
        | SIZE         20 |
        +-----------------+
        | NAME   blah.dat |
        +-----------------+----+
        | FILE 2               |
        +----------------------+
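
A possible reader for the sequential layout, again only a sketch: the
32-bit little endian SIZE, the 8-byte NAME and the function names are
assumptions chosen for the example.

```python
import struct

def build_sequential(files) -> bytes:
    """Pack (name, data) pairs as SIZE, NAME, FILE records."""
    return b"".join(
        struct.pack("<I8s", len(data), name.encode("ascii")) + data
        for name, data in files
    )

def read_sequential(archive: bytes):
    """Yield (name, data) pairs; there is no TOC and no OFFSET field,
    every record simply starts where the previous one ends."""
    pos = 0
    while pos < len(archive):
        size, name = struct.unpack_from("<I8s", archive, pos)
        pos += 12                       # skip SIZE (4) and NAME (8)
        yield name.rstrip(b"\0").decode("ascii"), archive[pos:pos + size]
        pos += size                     # next record follows the data
```

The price of this layout is that reaching the last file requires
walking through all the previous records.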



Example of sequential table followed by sequential files:

        +-----------------+
        | FILES         2 |
        +-----------------+
        | SIZE         41 |
        +-----------------+
        | NAME   test.txt |
        +-----------------+
        | SIZE         20 |
        +-----------------+
        | NAME   blah.dat |
        +-----------------+-------------------------+
        | FILE 1                                    |
        +----------------------+--------------------+
        | FILE 2               |
        +----------------------+
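
In this layout the offsets are not stored, so they have to be rebuilt
by summing the sizes. A minimal sketch, assuming a 16-bit FILES field,
32-bit SIZE and 8-byte NAME (little endian) and data starting right
after the table; the names are hypothetical.

```python
import struct

def build_table_then_files(files) -> bytes:
    """FILES, then SIZE/NAME for each entry, then the data in order."""
    table = struct.pack("<H", len(files))
    for name, data in files:
        table += struct.pack("<I8s", len(data), name.encode("ascii"))
    return table + b"".join(data for _, data in files)

def read_table_then_files(archive: bytes):
    """Return (name, offset, size) tuples with computed offsets."""
    (count,) = struct.unpack_from("<H", archive, 0)
    offset = 2 + count * 12             # data starts after the table
    entries = []
    for i in range(count):
        size, name = struct.unpack_from("<I8s", archive, 2 + i * 12)
        entries.append((name.rstrip(b"\0").decode("ascii"), offset, size))
        offset += size                  # running sum of the sizes
    return entries
```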



Some variants and customizations:

  • relative file offsets: the absolute base offset from which the
    relative file offsets are calculated is usually specified directly
    at the beginning of the archive, or calculated before or after
    reading the whole TOC:
    • before: this is possible only with fixed-size file entries, for
      example with filenames having a maximum length:
      BASE_OFF = offset_first_entry + (entries * sizeof(entry))
    • after: it's necessary to parse all the entries before knowing
      this offset
  • sector offset: quite common in PlayStation games, where the
    specified offsets must be multiplied by 2048 (the size of a disc
    sector)
  • TOC at the end: the TOC is often located at the beginning of the
    archive but some games put it at the end so that the archive can be
    updated with new content in the future; the usual methods are:
    • a header at the beginning giving the offset where the TOC is
      located
    • a few bytes of information at the end containing the TOC offset,
      or just the size of the TOC from which the offset can be derived
  • nested tree: usually the filenames already include the full path,
    like models\character\chara_1.mdl, but sometimes the whole directory
    tree is stored in the archive (folders and files) and must be
    parsed recursively
  • compressed TOC: sometimes the TOC itself is compressed
  • chunked files: see later
  • TOC in a separate file: usually called an "index file", a small file
    that contains all the information about the files archived in a
    "data file"; usually they share the same name with a different
    extension, for example: archive.idx and archive.dat
  • ZIP format: sometimes games use just a ZIP archive to contain their
    files; some games implement a custom version of the ZIP format, as
    happens with those that add a new compression algorithm (Forza
    Motorsport and Dark Sector), those that use some different fields,
    or those that don't use the classical "PK" magic values for the
    various sections of the ZIP archive.
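
Three of the variants above reduce to simple arithmetic, sketched here
in Python. The helper names are hypothetical, and the footer layout
used for the "TOC at the end" case (a 32-bit little endian TOC size in
the last 4 bytes, with the TOC stored just before it) is only one of
the possible arrangements.

```python
import struct

SECTOR_SIZE = 2048  # size of a disc sector, as mentioned above

def base_offset(offset_first_entry: int, entries: int, entry_size: int) -> int:
    """BASE_OFF for relative offsets when the entries have a fixed size."""
    return offset_first_entry + entries * entry_size

def sector_to_byte_offset(sector: int) -> int:
    """Convert an offset expressed in sectors to an offset in bytes."""
    return sector * SECTOR_SIZE

def toc_offset_from_footer(archive: bytes) -> int:
    """Locate a TOC stored at the end, assuming the last 4 bytes hold
    the TOC size and the TOC sits right before those 4 bytes."""
    (toc_size,) = struct.unpack_from("<I", archive, len(archive) - 4)
    return len(archive) - 4 - toc_size
```

For example base_offset(2, 2, 16) returns 34 (0x22), the first byte
after a 2-byte header followed by two 16-byte entries.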


There are even archives whose format is really complex because they
don't store the original files: they store them as "resources" ready
to be used directly by the game engine, so more steps are needed to
reach our goal.

If an archive uses a block cipher like AES or Blowfish there is also a
third size component to take into consideration: the block-aligned size
of the resource.
If this value is missing, usually it's calculated automatically, or the
game uses CipherFinal of OpenSSL or stream modes like CTR.

Example of stored file encrypted with a block cipher:

        +-----------------+
        | OFFSET 00000022 |
        +-----------------+
        | ZSIZE        41 |     compressed size
        +-----------------+
        | SIZE        180 |     uncompressed size
        +-----------------+
        | XSIZE        48 |     archive size (aligned)
        +-----------------+
        | NAME   test.txt |
        +-----------------+--------------------------------+
        | FILE 1 (compressed and encrypted)        PADDING |
        +--------------------------------------------------+
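
When XSIZE is missing it can usually be derived from ZSIZE by rounding
up to the block size of the cipher. A small sketch with hypothetical
names; the block sizes (16 bytes for AES, 8 for Blowfish) are the
standard ones for those ciphers.

```python
def aligned_size(size: int, block_size: int) -> int:
    """Round size up to the next multiple of block_size."""
    return (size + block_size - 1) // block_size * block_size

def strip_padding(decrypted: bytes, zsize: int) -> bytes:
    """After decrypting XSIZE bytes, keep only the ZSIZE real bytes."""
    return decrypted[:zsize]
```

With the values of the diagram, aligned_size(41, 16) is 48, so 48
bytes are read and decrypted, 41 are passed to the decompressor and
180 bytes come out.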



A solution that is often used to save space is dividing the archived
files into small parts called "chunks".
The advantage of this technique is that a chunk is compressed only if
its compressed size is smaller than the uncompressed one; the
disadvantage is that small chunks don't benefit from the most advanced
compression algorithms, because the dictionary/window never gets enough
data to be filled and used.
Usually the decompressed size of the chunks is not specified because
it's hardcoded in the game.
A compressed chunk size of zero, or equal to the decompressed chunk
size, means the chunk is stored "as-is" without compression.

Example of chunk based file:

        +-----------------+
        | OFFSET 00000022 |
        +-----------------+
        | SIZE        180 |
        +-----------------+
        | NAME   test.txt |
        +-----------------+
        | CHUNKS        3 |
        +-----------------+
        | CHUNK ZSIZE  30 |     * let's say CHUNK SIZE is 64
        +-----------------+
        | CHUNK ZSIZE  42 |
        +-----------------+
        | CHUNK ZSIZE  35 |
        +-----------------+--------------+
        | CHUNK 1                        |
        +--------------------------------+-----------+
        | CHUNK 2                                    |
        +-------------------------------------+------+
        | CHUNK 3                             |
        +-------------------------------------+
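
The chunk logic can be sketched as follows. zlib is used only as a
stand-in for whatever algorithm a game adopts, CHUNK_SIZE is the value
assumed in the diagram, and the function names are hypothetical.

```python
import zlib

CHUNK_SIZE = 64  # decompressed chunk size, normally hardcoded in the game

def pack_chunks(data: bytes):
    """Split data into chunks and compress a chunk only when it helps.
    Returns (list of CHUNK ZSIZE values, concatenated chunk data)."""
    zsizes, body = [], b""
    for pos in range(0, len(data), CHUNK_SIZE):
        raw = data[pos:pos + CHUNK_SIZE]
        packed = zlib.compress(raw)
        if len(packed) >= len(raw):
            packed = raw                # no gain: store the chunk as-is
        zsizes.append(len(packed))
        body += packed
    return zsizes, body

def unpack_chunks(zsizes, body: bytes, size: int) -> bytes:
    """Rebuild the file from its chunks; size is the SIZE field."""
    out, pos = b"", 0
    for zsize in zsizes:
        # The last chunk may be shorter than CHUNK_SIZE.
        expected = min(CHUNK_SIZE, size - len(out))
        if zsize == 0 or zsize == expected:
            out += body[pos:pos + expected]     # stored without compression
            pos += expected
        else:
            out += zlib.decompress(body[pos:pos + zsize])
            pos += zsize
    return out
```

A 180-byte file gives 3 chunks of 64, 64 and 52 decompressed bytes,
like in the diagram.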

The modding perspective: rebuilders


Usually the purposes of obtaining a resource from an archive are the
following:

  • using the resource
    a typical example is the music of a game to listen to on your own
    computer, or images to use as wallpaper
  • modding the same game
    editing the extracted content and reinjecting it into the archive,
    or just rebuilding the whole archive from scratch
  • using the resource obtained from one game in a different game
    reinjecting the resource into the archive of the other game, or
    rebuilding that archive

In the last two cases the user needs a way to force the game to load a
non-archived file, to rebuild the archive, or to reinject the file into
the original archive:

  • Usage of non-archived resources
    in some cases it's possible to use the extracted resources in the
    game by default, because the developers left this feature enabled
    for debugging or because the archives were meant only to improve
    loading performance.
    In other cases it's necessary to activate a specific option from
    a configuration file or the command line (like in Need for Speed
    Shift), while in other situations there is no way to force the game
    to read the extracted files.
  • Archive rebuilding
    this is the best solution but unfortunately it's also the most
    expensive, because extracting a file is completely different from
    rebuilding the whole archive.
    To rebuild an archive it's necessary to know "all" the fields used
    in the TOC, and it's not possible to ignore most of them as we did
    with extraction; additionally, creating a rebuilder requires more
    effort and programming work than writing an extractor.
  • Reinjecting/Reimporting
    this is the way that requires the least effort and in most cases
    it can even be implemented automatically, just like I do in my
    QuickBMS tool, which allows an extraction script to be used in
    reimport mode too without any change.
    The downsides of this method are:

    • no CRC/checksum/hash recalculation if one is used in the archive;
      some workarounds exist, like automatically recalculating and
      overwriting the CRC field, but this is not possible if the
      algorithm is not a common one; some games ignore the mismatching
      CRC, others will reject the edited file

    • in the past there was a limitation on the size of the new files,
      which has been bypassed by a new reimport method (reimport2), but
      some archives are still incompatible if they use sequential
      offsets

    • in case of custom encryption and compression algorithms it's
      possible that no code exists to re-encrypt or re-compress the
      data (this applies to the rebuilding solution too)

    • in some cases it's possible that the new version of the archive
      is not fully compatible with the game, maybe because the game
      checks the hash of the archive before using it, or for some other
      reason

    Anyway it's worth noting that the benefits of this solution are
    enormous for both the writer of the script and the modder, and many
    mods, cheats and customizations have been created in this way.
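
To give an idea of what reinjection involves, here is a minimal sketch
of an in-place reimport with the CRC workaround. It is not the QuickBMS
implementation: the entry layout (OFFSET, SIZE and a standard CRC32,
all 32-bit little endian) and every name are assumptions made for the
example, and it keeps the classic limitation that the new file cannot
be bigger than the original one.

```python
import struct
import zlib

ENTRY = struct.Struct("<III")  # hypothetical TOC entry: OFFSET, SIZE, CRC32

def build_one_file_archive(data: bytes) -> bytearray:
    """A tiny archive: one TOC entry followed by the data of the file."""
    return bytearray(ENTRY.pack(ENTRY.size, len(data), zlib.crc32(data)) + data)

def reinject(archive: bytearray, entry_pos: int, new_data: bytes) -> None:
    """Overwrite a stored file in place and refresh its TOC entry."""
    offset, size, _ = ENTRY.unpack_from(archive, entry_pos)
    if len(new_data) > size:
        raise ValueError("the new file is bigger than the original one")
    archive[offset:offset + len(new_data)] = new_data
    # Update SIZE and recalculate the checksum so the game accepts the file.
    ENTRY.pack_into(archive, entry_pos, offset, len(new_data),
                    zlib.crc32(new_data))

def read_entry(archive: bytearray, entry_pos: int = 0):
    """Return (data, crc_ok) for the entry located at entry_pos."""
    offset, size, crc = ENTRY.unpack_from(archive, entry_pos)
    data = bytes(archive[offset:offset + size])
    return data, zlib.crc32(data) == crc
```

The archive keeps its length and the other entries are untouched, which
is why this approach needs so little knowledge of the format.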


If the archive uses asymmetric cryptography and/or a digital signature
it's not possible to perform rebuilding or reimporting due to the lack
of the private key. An example is the GameGuard files.
In these cases the only solution is modifying the game executable to
remove the signature check, or to make it use a private/public key
pair generated by us.

All the material evaluated for creating this document comes from my
personal research available on my personal website.

The main source is the set of scripts for my QuickBMS program,
started in 2009: http://quickbms.aluigi.org

The secondary source is the stand-alone tools available in my Research
page: http://papers.aluigi.org

The last source is my collection of archive passwords:
http://aluigi.org/papers.htm#info

The scripts and tools selected for the statistics are those that work
on the files of the games, so any tool related to the encryption of
network data, to the decryption of user-generated content (savegames)
or to non-game related stuff has not been included.

Evaluated scripts:
about 810 (this document was originally created in 2013); these
scripts are too many to be listed here.
They cover many types of games from big and small vendors, on any
platform like Xbox, Xbox 360, PC, PS3, PS2, PSP, Wii and others.
They even cover multiple versions of the same file format.
So it's possible to find the script for Crysis 2 alongside scripts
for games whose names I had never heard.

Evaluated tools:
rfactorgmdec, rfactordec, wtcced, hldlldec, halomus, rdbigext,
scfdec, umodext, unxwb, uniginex, mmviewer_dumper, osrwdec,
molebox2ext, sdgundamext, tdudec, partydec, ttarchext, asurauncmp,
ssaext, canhelpaczip, sgpdec, uodemoext, egoxext, cauldronext,
bsrdec, motorm4xdec, pyroblazerext, worldshiftext, ssnam67ext,
msmixext, xsoext, ysext, orkdec, ps2ext, vitalext, hedwadext,
borpak, ccftfext, fsbext, nexusext, tnt2zip, cbfext, virtdec,
unvirt, zanzapak, gguardfile, rtwsndext, manext, lin2ed.

Note that many scripts/tools work on multiple games and in some cases
two or more scripts may overlap (different scripts but same game), so
for these statistics I counted just the scripts/tools and not each
single game they cover, because it's hard if not impossible to know
which games are covered by a specific engine or whether a file format
is used in other games.

Note also that some scripts use more than one algorithm, that's why the
sum of the entries is bigger than the number of scripts and tools that
have been evaluated.

All the information was collected on 13 Apr 2013 through manual and
automatic checking of each source.

If you are interested in other external sources (to which I contribute
too) take a look at the ZenHAX forum: https://zenhax.com

Regarding the results shown below, please note that they have been
obtained automatically by running a program over all the scripts
available on my website, so some results may be redundant (for example
an algorithm used multiple times in the same script, or two versions of
the same script) and some information may be missing (some scripts are
difficult to parse automatically).
So PLEASE do not take these results too seriously.

Results: Encryption and Obfuscation


+-----------------------------------------------------------+---------+
| no encryption                                             |     676 |
+-----------------------------------------------------------+---------+
| XOR with one byte                                         |      44 |
+-----------------------------------------------------------+---------+
| XOR with key (multiple bytes)                             |      53 |
+-----------------------------------------------------------+---------+
| rotate (add/sub) with one byte                            |       4 |
+-----------------------------------------------------------+---------+
| rotate (add/sub) with key (multiple bytes)                |      12 |
+-----------------------------------------------------------+---------+
| AES                                                       |      18 |
+-----------------------------------------------------------+---------+
| Blowfish                                                  |      10 |
+-----------------------------------------------------------+---------+
| DES/3DES                                                  |       3 |
+-----------------------------------------------------------+---------+
| charset / substitution table                              |       3 |
+-----------------------------------------------------------+---------+
| incremental XOR                                           |       9 |
+-----------------------------------------------------------+---------+
| RC4                                                       |      12 |
+-----------------------------------------------------------+---------+
| TEA/XTEA/XXTEA                                            |       4 |
+-----------------------------------------------------------+---------+
| custom encryption / obfuscation                           |      48 |
+-----------------------------------------------------------+---------+

+-----------------------------------------------------------+---------+
| password protected archives (mainly ZIP, RAR and FSB)     |      53 |
+-----------------------------------------------------------+---------+



Results: Compression


+-----------------------------------------------------------+---------+
| no compression                                            |     500 |
+-----------------------------------------------------------+---------+
| zlib                                                      |     188 |
+-----------------------------------------------------------+---------+
| LZO                                                       |      20 |
+-----------------------------------------------------------+---------+
| deflate                                                   |      36 |
+-----------------------------------------------------------+---------+
| LZMA                                                      |      20 |
+-----------------------------------------------------------+---------+
| Microsoft XMem (LZX)                                      |      27 |
+-----------------------------------------------------------+---------+
| LZSS                                                      |      13 |
+-----------------------------------------------------------+---------+
| gzip                                                      |      10 |
+-----------------------------------------------------------+---------+
| bzip2                                                     |       9 |
+-----------------------------------------------------------+---------+
| custom / proprietary / less known                         |      41 |
+-----------------------------------------------------------+---------+



Results: Structure


Sorry, not available yet.

+-----------------------------------------------------------+---------+
| Index table                                               |       ? |
+-----------------------------------------------------------+---------+
| Sequential files                                          |       ? |
+-----------------------------------------------------------+---------+
| Chunks                                                    |       ? |
+-----------------------------------------------------------+---------+

Notes and information


During the reverse engineering of these file formats some interesting
things have been noticed.

In some cases the target platform makes the difference due to possible
in-hardware optimizations or the endianness of the CPU.
For example, on Xbox 360 it's quite common to see the Microsoft LZX
algorithm (XMemCompress) used in place of the zlib used for the same
games on other platforms, and it's also common to see the archives
packed in big endian instead of the little endian of the PC versions.
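
In practice this means that the same extractor often works on both
versions if the byte order is a parameter. A tiny sketch with a
hypothetical helper name:

```python
import struct

def read_u32(data: bytes, pos: int, big_endian: bool) -> int:
    """Read a 32-bit unsigned field in the requested byte order."""
    return struct.unpack_from(">I" if big_endian else "<I", data, pos)[0]
```

Reading a field with the wrong byte order usually gives an absurdly
large value (0x22 becomes 0x22000000), which is a quick way to notice
which one an archive uses.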

Another interesting point is about the versions of the file formats,
because some of them (like the MAS one for the ISI Gmotor engine) have
existed for several years and have been used in many games, with the
result of creating many versions that differ greatly from each other.
This is caused not only by the enhancement of the format over the years
but mainly by the desire of different developers to customize the
format they adopted.

Games like those developed by Simbin use common archives (like the MAS
one mentioned above) with an additional layer of encryption that has
been updated game after game, trying to make life harder for the
maintainers of the decryption tools.
This is valid also for the Telltale Games archives, in which these
continuous changes lasted several years over various versions.

In other cases a more complex and custom encryption algorithm has been
added after the developers became aware of the existence of tools for
decrypting and extracting the content of the archives; a recent
example is Farming Simulator 2013 1.4 beta.

The most common compression algorithms are zlib and deflate; note that
zlib is just a deflate stream with a header and a checksum (Adler-32),
so basically they are the same thing.
This algorithm is used in a great many games and it's also the easiest
to identify, because the whole job can be performed with programs like
offzip that scan the entire archive finding the zlib data (the checksum
helps to avoid false positives) and returning the offset plus the
compressed and uncompressed sizes, which can be used to identify the
index table in the archive.
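
The idea behind this kind of scanner can be sketched in a few lines of
Python. This is not the offzip code, only a slow and simplified
illustration; scan_zlib and min_size are hypothetical names.

```python
import zlib

def scan_zlib(data: bytes, min_size: int = 1):
    """Return (offset, zsize, size) for every zlib stream found in data."""
    found, pos = [], 0
    while pos < len(data) - 1:
        cmf, flg = data[pos], data[pos + 1]
        # Quick check of the 2-byte zlib header: method 8 (deflate) and
        # a 16-bit header value that is a multiple of 31.
        if (cmf & 0x0F) == 8 and ((cmf << 8) | flg) % 31 == 0:
            d = zlib.decompressobj()
            try:
                out = d.decompress(data[pos:])
            except zlib.error:
                out = None              # false positive, keep scanning
            # d.eof is set only when the stream ended and its checksum
            # matched, which filters out almost all the false positives.
            if out is not None and d.eof and len(out) >= min_size:
                zsize = len(data) - pos - len(d.unused_data)
                found.append((pos, zsize, len(out)))
                pos += zsize
                continue
        pos += 1
    return found
```

Searching the archive for the returned offsets and sizes (in both byte
orders) is then a practical way to locate the TOC and understand its
fields.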

On the encryption and obfuscation side the most used is without doubt
the classical and simple XOR solution, followed by the custom and
proprietary solutions that range from simple obfuscation to the
customization of known algorithms and even the implementation of
algorithms never seen online.

There are many password-protected archives but they rely on known file
formats like ZIP, RAR and FMOD FSB, so I have preferred to keep them
out of the final considerations.
Why do developers opt for this solution? Because libraries are already
available to handle these known archives, and a simple password is
enough to try to keep modders out.

When a researcher encounters a custom encryption or compression
algorithm there are usually the following ways to solve the puzzle:

  • try to reverse engineer the compiled algorithm into a higher level
    language like C or others
  • use a binary-to-C/pseudo-code converter like IDA Pro or REC and then
    fix the resulting code (it may be a painful process)
  • dump the whole function and fix it where necessary; depending on the
    interest in the game and the complexity of the algorithm, this is
    usually a very good compromise
  • if you are very lucky the game uses an external DLL that can be
    used to perform the same tasks from any custom tool

As already said, remember that this document is based ONLY on the work
publicly available on my website, so it doesn't cover game extractors
written by other people or the QuickBMS scripts written by users in
the community (whom I personally thank for their feedback and support).

Feel free to provide any feedback, your comments and your personal experience with file formats and archives.
ExtractResponseUnit
Posts: 12
Joined: Tue Sep 08, 2020 3:31 pm

Re: Overview of game file formats and archives

Post by ExtractResponseUnit »

It's an elaborate documentation of the analysis, thank you for publishing it.
Incredibly helpful for the older generation among us, like me.