
How to parse an NTFS $MFT file in Python

Short answer: Use analyzeMFT for pure-Python parsing when you can install it, libmft when you want a typed object model, or shell out to omerbenamram/mft when you care about speed. Pure-Python parsing is ~10–50× slower than the Rust crate but is fine for one-off scripts.

What you are reading

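The NTFS Master File Table is a sequence of fixed-size records, almost always 1,024 bytes each (the volume's boot sector defines the actual size, and some 4K-sector volumes use 4,096). To parse it from Python you only need to: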

  1. Open the $MFT file (or read it from a disk image).
  2. Step through it 1,024 bytes at a time.
  3. Apply the fixup array to each record. See the record anatomy for the byte-level layout.
  4. Walk the attribute stream inside each record.

The libraries below handle all four steps. Most analysts only fall back to raw struct.unpack when a library does not expose a field they need.
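
For orientation, here is a minimal sketch of steps 2 and 3 using only the standard library. It assumes 1,024-byte records and 512-byte sectors, and it relies on the standard FILE record header layout (update sequence array offset at byte 4, entry count at byte 6). The names apply_fixup and iter_records are illustrative and do not come from any of the libraries below. Treat it as a reading aid, not a replacement for a real parser.

```python
import struct

RECORD_SIZE = 1024  # assumption: the common MFT record size
SECTOR_SIZE = 512   # assumption: fixups are applied per 512-byte sector


def apply_fixup(raw):
    """Return a copy of one FILE record with the update sequence array applied."""
    rec = bytearray(raw)
    # Header: offset of the update sequence array, then its entry count.
    usa_offset, usa_count = struct.unpack_from("<HH", rec, 4)
    if usa_offset + 2 * usa_count > len(rec):
        raise ValueError("update sequence array runs past the record")
    # Entry 0 is the update sequence number stamped at the end of each sector.
    usn = bytes(rec[usa_offset:usa_offset + 2])
    for i in range(1, usa_count):
        end = i * SECTOR_SIZE
        if end > len(rec):
            break
        if bytes(rec[end - 2:end]) != usn:
            raise ValueError("fixup mismatch: torn write or corrupt record")
        saved = usa_offset + 2 * i
        # Restore the two original bytes that the stamp replaced on disk.
        rec[end - 2:end] = rec[saved:saved + 2]
    return rec


def iter_records(stream, record_size=RECORD_SIZE):
    """Yield (record_number, fixed_record) for every slot that holds a FILE record."""
    index = 0
    while True:
        raw = stream.read(record_size)
        if len(raw) < record_size:
            return
        if raw[:4] == b"FILE":  # skip zeroed or otherwise unused slots
            yield index, apply_fixup(raw)
        index += 1
```

Opening the file with open("path/to/$MFT", "rb") and passing the handle to iter_records covers step 1; step 4, walking the attributes, starts at the 16-bit offset stored at byte 0x14 of each fixed-up record.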

Option 1: analyzeMFT

analyzeMFT is the classic pure-Python MFT parser, originally by David Kovar and still maintained. CLI-first, but importable.

# pip install analyzeMFT
from analyzeMFT.mft_analyzer import MFTAnalyzer

analyzer = MFTAnalyzer(mft_file="path/to/$MFT", output_file="out.csv")
analyzer.analyze()

The CSV it produces has one row per record with timestamps from both $STANDARD_INFORMATION and $FILE_NAME. Good enough for spreadsheet-driven triage.

When to use: small $MFT files, ad-hoc scripts, no native dependencies allowed.

Limits: slow on multi-gigabyte inputs (single-threaded pure Python), and the object model is geared toward CSV emission rather than programmatic walks.

Option 2: libmft (typed object model)

If you want to query records as Python objects, libmft exposes a typed model close to the on-disk structure.

# pip install libmft
from libmft.api import MFT

with open("path/to/$MFT", "rb") as f:
    mft = MFT(f)
    for entry in mft:
        if not entry.is_deleted():
            continue  # skip live entries; this loop lists deleted records only
        name = entry.get_full_path()
        si = entry.get_attributes(0x10)[0]  # $STANDARD_INFORMATION
        print(name, si.created, si.modified)

libmft resolves parent references so you can ask each entry for its full path without writing the traversal yourself. It also handles $ATTRIBUTE_LIST extension records transparently — something analyzeMFT's CSV layer hides from you.

When to use: you want to write logic that walks records, filters by attribute, and emits a custom shape.

Option 3: shell out to a Rust parser

When the $MFT is large (~1 GB+) or you are batching across many disks, the fastest practical option is to shell out from Python to a native parser and read its JSON.

import json
import subprocess

# omerbenamram/mft — `cargo install mft` or download a release binary
proc = subprocess.run(
    ["mft_dump", "-o", "json", "path/to/$MFT"],
    capture_output=True, check=True,
)
for line in proc.stdout.splitlines():
    record = json.loads(line)
    if record["header"]["flags"] & 0x1 == 0:  # IN_USE clear → deleted
        print(record["entry"], record["file_name"]["name"])

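mft_dump emits JSON Lines, one record per line, so its output can be consumed incrementally. Note that subprocess.run with capture_output=True, as used above, buffers the entire output in memory before the loop starts; for large inputs, read the process's stdout line by line through subprocess.Popen instead. Compared with analyzeMFT on the same input, the Rust parser is typically 10–50× faster and uses a tenth of the memory.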
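
A minimal sketch of a streaming variant built on subprocess.Popen, which yields each record as the child process prints it rather than buffering the full output the way subprocess.run does. iter_json_lines is a hypothetical helper name, and it assumes only that the command writes one JSON document per line to stdout.

```python
import json
import subprocess


def iter_json_lines(cmd):
    """Run cmd and yield one parsed JSON object per non-empty stdout line."""
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            line = line.strip()
            if line:
                yield json.loads(line)
    # Leaving the with-block waits for the child, so returncode is set here.
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)
```

With the same arguments as above, the loop becomes: for record in iter_json_lines(["mft_dump", "-o", "json", "path/to/$MFT"]): ... and memory use stays roughly constant regardless of the size of the $MFT.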

When to use: production pipelines, large inputs, or anywhere parsing time matters.

Reading $MFT straight from a disk image

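If you have a raw .dd or .E01 image rather than an extracted $MFT file, use pytsk3 (Python bindings for The Sleuth Kit) to open $MFT on the volume and read its bytes. The snippet below opens a raw image; .E01 files generally need pyewf (the libewf bindings) wrapped in a pytsk3.Img_Info subclass, unless your Sleuth Kit build includes libewf support: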

import pytsk3

img = pytsk3.Img_Info("disk.dd")
fs = pytsk3.FS_Info(img, offset=0)  # byte offset of the NTFS partition; 0 only for a bare volume image
mft_file = fs.open_meta(inode=0)    # $MFT is always inode 0
size = mft_file.info.meta.size
data = mft_file.read_random(0, size)
# data now contains $MFT; feed it to libmft or write to disk

This is the cleanest approach when the volume is encrypted at the partition level but mounted via a decryptor that gives you a raw image.
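
One caveat about the snippet above: read_random(0, size) pulls the whole $MFT into memory in a single call, which gets uncomfortable on multi-gigabyte tables. A hedged sketch of a chunked copy follows. dump_tsk_file is a hypothetical helper, and all it assumes about its first argument is a read_random(offset, length) method returning bytes, which the pytsk3 file object above provides.

```python
def dump_tsk_file(tsk_file, size, out, chunk_size=1024 * 1024):
    """Copy size bytes from tsk_file to the writable binary stream out.

    Returns the number of bytes actually copied.
    """
    offset = 0
    while offset < size:
        want = min(chunk_size, size - offset)
        chunk = tsk_file.read_random(offset, want)
        if not chunk:
            break  # short read: stop instead of looping forever
        out.write(chunk)
        offset += len(chunk)
    return offset


# Usage with the objects from the snippet above (not executed here):
#   with open("MFT.bin", "wb") as out:
#       dump_tsk_file(mft_file, mft_file.info.meta.size, out)
```

The resulting file can then be handed to any of the three parsers described earlier.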

Common pitfalls

  • Forgetting the fixup array. Reading raw 1,024-byte chunks without applying the update sequence array (USA) leaves placeholder bytes in the last two bytes of each 512-byte sector, at offsets 510–511 and 1022–1023 of every record. Every library above does this for you; only roll your own parser if you understand the fixup mechanism (see the record anatomy post).
  • Treating record number as identity. Record numbers are reused. The 64-bit file reference (48-bit record number plus 16-bit sequence number) is the identifier that does not collide.
  • Confusing the two timestamp sets. Every record carries timestamps in both $STANDARD_INFORMATION (updated frequently) and $FILE_NAME (mostly stable). For timestomping detection, you need both — see the four MFT timestamps.
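
The file-reference pitfall is easy to see in code. A minimal sketch, assuming the reference has already been read as a little-endian unsigned 64-bit integer; split_file_reference is a hypothetical helper name, not a function from any of the libraries above.

```python
import struct


def split_file_reference(ref):
    """Split a 64-bit NTFS file reference into (record_number, sequence_number)."""
    record_number = ref & 0xFFFFFFFFFFFF  # low 48 bits
    sequence_number = ref >> 48           # high 16 bits
    return record_number, sequence_number


# Example: decode the 8 raw bytes of a reference as stored on disk.
raw = struct.pack("<Q", (5 << 48) | 1234)
(ref,) = struct.unpack("<Q", raw)
record_number, sequence_number = split_file_reference(ref)
```

Two references with the same record number but different sequence numbers point at different files that occupied the same slot at different times, so keying a lookup table on the pair is the safer choice.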

When to skip Python entirely

For one-off interactive analysis without any installation, drop the $MFT onto the browser parser on this site. It runs the same omerbenamram/mft crate compiled to WebAssembly, filters and searches client-side, and exports CSV — no Python required.

External resources