Deceitful Zip
29 Sep 2019
Tags:
compression
cryptography
file formats
lookup magic
visualization
What appeared to be a regular zip file could not be successfully extracted. Each extracted file would be empty or contain junk bytes. The file hierarchy could be read, and none of those files were password protected. Could there be some actual corruption in the zip, or was something else going on?
Analysis
Various extractors complained about corrupted data, or crashed in mysterious ways, such as `jar xvf` failing with `java.io.IOException: Push back buffer is full`.
Our zip was an instance of a `xod`, which is a web-optimized xps used by the web viewer PDFTron. An xps is functionally similar to a pdf: it is organized as a hierarchy of pages (xml) and detached resources referenced by those pages (fonts and images). The `xod` makes some changes to the hierarchy, but the underlying implementation uses the same xml elements as an xps.
We can grab a `xod` example and list its files with `unzip -l` (to improve readability, the following output was formatted with `tree`):
```
.
├── Annots.xfdf
├── [Content_Types].xml
├── Document
│   ├── DocProps
│   │   └── core.xml
│   └── FixedDocument.fdoc
├── FixedDocumentSequence.fdseq
├── Fonts
│   ├── 0a362aa2-30ce-bf6f-e547-af1200000000.odttf
...
│   └── 0a362aa2-30ce-bf6f-e547-af1200000009.odttf
├── Images
│   ├── 1.jpg
│   └── 3.jpg
├── Pages
│   ├── 1.xaml
│   ├── 2.xaml
│   └── _rels
│       ├── 1.xaml.rels
│       └── 2.xaml.rels
├── _rels
├── Struct
│   ├── 1.xml
│   ├── 2.xml
│   └── font.xml
└── Thumbs
    ├── 1.jpg
    └── 2.jpg
```
This hierarchy matches the one present in the invalid zip.
Hypothesis: It seems all files are present, so maybe our issue is in the metadata.
Attempting to fix it with `zip -F` or `zip -FF` didn’t work (the latter recreates the central directory listing, so the central directory can be ruled out as the source of the issue). Therefore, a manual approach was needed.
To explore this metadata and check if all its values were valid, we used `kaitai_struct`, in particular its Web IDE.
In addition to the IDE, a parser can be generated, so that general-purpose scripting can be done on a zip file:
```
git clone --recursive https://github.com/kaitai-io/kaitai_struct.git
./kaitai-struct-compiler-0.8/bin/kaitai-struct-compiler \
    --target python \
    ./kaitai_struct/formats/archive/zip.ksy
```
Each compressed file is represented by a `PkSection` field, with its corresponding metadata contained in `PkSection/LocalFile/LocalFileHeader` and its compressed data contained in `PkSection/LocalFile/body`.
To iterate through all `PkSection` fields in our scripts, the generated parser was modified to keep track of each section’s starting address:
```diff
@@ -253,6 +253,7 @@
             self._read()

         def _read(self):
+            self.global_pos = self._io.pos()
             self.magic = self._io.ensure_fixed_contents(b"\x50\x4B")
             self.section_type = self._io.read_u2le()
             _on = self.section_type
```
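With that patch in place, scripting over the archive becomes straightforward. A minimal sketch, assuming the generated module is saved as `zip.py` and follows Kaitai’s usual snake_case naming (`0x0403` is the local-file section type from the zip signature `PK\x03\x04`):

```python
from zip import Zip  # parser generated by kaitai-struct-compiler

archive = Zip.from_file("example.xod")
for section in archive.sections:
    if section.section_type == 0x0403:  # local file section
        header = section.body.header
        print(hex(section.global_pos), header.file_name,
              header.compression_method, header.compressed_size)
```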
We can aggregate the files under observation by file type. Text files (e.g. xml) are always compressed (`compressionMethod = DEFLATED`), while image files (e.g. jpg) are left uncompressed (`compressionMethod = NONE`). A jpg can be identified by the string `JFIF` in the byte sequence `FF D8 FF E0 ?? ?? 4A 46 49 46`. Since image data is already encoded with lossy compression, these files aren’t compressed again inside the zip: that would waste CPU resources for diminishing returns.
As a result, the magic bytes of a jpg file are preserved inside the body field. Even if this file format didn’t have magic bytes, we could still observe other artifacts, such as runs of bytes with the same value.
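As an illustration, a check along these lines (reusing the assumed parser names from the sketch above) would confirm that stored image bodies keep their magic bytes:

```python
JPG_MAGIC = b"\xFF\xD8\xFF\xE0"

for section in archive.sections:
    if section.section_type != 0x0403:
        continue
    body = section.body.body
    # Stored (uncompressed) jpg entries should expose their raw magic bytes.
    is_jpg = body[:4] == JPG_MAGIC and body[6:10] == b"JFIF"
    print(section.body.header.file_name, "jpg" if is_jpg else "other")
```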
When comparing fields of this example zip with those of the invalid zip, some differences become apparent:
- Image files don’t have magic bytes. Instead, the body seems to have a random distribution of byte values, similar to compressed text files.
- CRC values are zero for all files. This checksum is used to verify that the integrity of the decompressed data is preserved after extraction from the zip. While this is an optional check done by decompressors, a compressor will calculate these values by default. We can confirm the check isn’t required by taking a valid zip, patching the CRC value to `0`, then running `jar xvf`, which successfully extracts the file with the following warning: `java.util.zip.ZipException: invalid entry CRC (expected 0x0 but got 0xd0d30aae)`.
- The compressed data doesn’t match the specification of the `DEFLATE` algorithm. The header format describes the first 3 bits of a compressed stream as:
  ```
  First bit: Last-block-in-stream marker:
    1: this is the last block in the stream.
    0: there are more blocks to process after this one.
  Second and third bits: Encoding method used for this block type:
    00: a stored/raw/literal section, between 0 and 65,535 bytes in length.
    01: a static Huffman compressed block, using a pre-agreed Huffman tree.
    10: a compressed block complete with the Huffman table supplied.
    11: reserved, don't use.
  ```
Therefore, it would be unexpected if the bit sequences `110` or `111` were present. We grabbed a valid `xod` with a larger file count, similar to the one in the invalid zip file, to count and compare the first 3 bits of each `PkSection/LocalFile/body` inside each zip:
(Chart: first 3 bits of `DEFLATE`-compressed files in a valid vs. an invalid zip.)
For the valid zip, the compressed bits follow the protocol by never matching the reserved method `11`; uncompressed bits `111` match the first magic byte of jpg files, while `001` matches the first magic byte of png files.
For the invalid zip, we do get unexpected sequences, proving that the compressed data isn’t just a `DEFLATE` stream.
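A sketch of the tally, again reusing the assumed parser names from above (DEFLATE packs its header bits starting at the least significant bit of the first byte):

```python
from collections import Counter

def header_bits(path):
    counts = Counter()
    for section in Zip.from_file(path).sections:
        if section.section_type == 0x0403 and section.body.body:
            b = section.body.body[0]
            # For a genuine DEFLATE stream: BFINAL = b & 1, BTYPE = (b >> 1) & 0b11.
            counts[(b & 1, (b >> 1) & 0b11)] += 1
    return counts

print(header_bits("valid.xod"))    # BTYPE == 0b11 should never appear
print(header_bits("invalid.xod"))  # ...but here it does
```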
Reversing the body encoding
Hypothesis: The value reported in `PkSection/LocalFile/LocalFileHeader/compressedSize` doesn’t match the actual body length.
If this were the case, `kaitai_struct` would error out while parsing the file. In addition, this can easily be checked by subtracting the start address of each body from the address of the next `PkSection`’s magic bytes.
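A sketch of that cross-check, using the `global_pos` field patched into the generated parser earlier. The 30-byte fixed header size comes from the zip specification; the length attribute names are assumptions and may differ in the generated parser:

```python
local_files = [s for s in archive.sections if s.section_type == 0x0403]
for cur, nxt in zip(local_files, local_files[1:]):
    header = cur.body.header
    # 4 bytes of signature + 26 bytes of fixed header fields = 30.
    body_start = cur.global_pos + 30 + header.file_name_len + header.extra_len
    actual_len = nxt.global_pos - body_start
    assert actual_len == header.compressed_size, header.file_name
```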
Hypothesis: Another compression method is actually being used, but it was overwritten with `DEFLATE` and `NONE`.
These methods can be ruled out by bruteforcing through all possible values (a sketch of the loop follows this list), with the following steps:
- copy the bytes of a `PkSection` to a new file (skip the central directory, since it’s optional for decompressing);
- set the field `PkSection/LocalFile/LocalFileHeader/compressionMethod` to a value in the range `0-19` or `95-98`;
- extract the new file.
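A minimal sketch of that loop, patching the two-byte `compressionMethod` field at its spec-defined offset (the extraction tool and file names here are placeholders):

```python
import subprocess

with open("section.bin", "rb") as f:  # a single PkSection copied out beforehand
    data = bytearray(f.read())

for method in [*range(0, 20), *range(95, 99)]:
    data[8:10] = method.to_bytes(2, "little")  # compressionMethod sits at offset 8
    with open("patched.zip", "wb") as f:
        f.write(data)
    result = subprocess.run(["7z", "x", "-y", "patched.zip"],
                            capture_output=True)
    print(method, "ok" if result.returncode == 0 else "failed")
```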
Hypothesis: There is password protection, but the metadata that specifies this feature was cleared.
Marking a file in a zip as password protected is as simple as setting the fields `PkSection/LocalFile/LocalFileHeader/flags` and `PkSection/CentralDirEntry/flags` to the value `1`.
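For instance, the bit can be set in the first local file header with a patch like this (offset 6 is the `flags` field per the zip spec; the central directory entry would need the same treatment at its own offset):

```python
data = bytearray(open("test.zip", "rb").read())
# Set bit 0 (encryption) of the general purpose flags field.
flags = int.from_bytes(data[6:8], "little") | 1
data[6:8] = flags.to_bytes(2, "little")
open("protected.zip", "wb").write(data)
```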
We still need a password. The invalid zip is used by a closed-source application. After decompiling it, finding the hardcoded password was just a matter of running a proximity search for the keywords `xod` and `password`:

```
grep -A 10 -B 10 --color=always -rin 'xod' . | \
    tee >(grep -i 'password')
```
However, simply using the password didn’t result in a successful extraction.
Hypothesis: Compressed data is an encrypted stream.
According to the PDFTron docs, AES encryption can be applied to a `xod` [1]. In our application, we can find the call to the mentioned web worker constructor (which is how we found out that PDFTron was being used, along with the keywords `DecryptWorker.js`, `window.forge`, and `aes`).
The SDK is available for download with `npx @pdftron/webviewer-downloader`.
The decryption web worker lies within the suggestively named file `webviewer/public/lib/core/DecryptWorker.js`.
We now have all the pieces to decrypt our files: encryption method (AES), password and filenames (both are used to build the AES key), and the source code for decryption. It’s just a matter of getting it to run.
After decrypting the files, those that were compressed with `DEFLATE` still needed to be “inflated”.
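Purely as an illustration of the decrypt-then-inflate pipeline: the real key derivation is whatever `DecryptWorker.js` implements, so the SHA-256(password + filename) key and AES-CBC mode below are placeholder assumptions, not PDFTron’s actual scheme.

```python
import hashlib
import zlib

from Crypto.Cipher import AES  # pycryptodome

def decrypt_body(body: bytes, password: str, filename: str) -> bytes:
    # Placeholder KDF: the real one must be lifted from DecryptWorker.js.
    key = hashlib.sha256((password + filename).encode()).digest()
    iv, ciphertext = body[:16], body[16:]
    return AES.new(key, AES.MODE_CBC, iv).decrypt(ciphertext)

def inflate(data: bytes) -> bytes:
    # Zip bodies hold raw DEFLATE streams (no zlib header), hence wbits=-15.
    return zlib.decompress(data, wbits=-15)
```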
With the files decrypted and decompressed, we can put them under the same filesystem hierarchy as described in the original zip, then create a new zip with that directory’s contents. The result is a valid, unencrypted `xod`.
Source code
Available in a git repository.
1. Despite the zip file format supporting AES encryption with compression method 99, these `xod` files do not have such a method set.