MIME / File Type Sniffer
Identify a file's real type by its magic bytes, not its extension.
Overview
The MIME / file type sniffer identifies a file's real format by reading its leading bytes — the "magic number" — rather than trusting the extension. A .jpg that is really a PDF, a .zip that is really a .docx, or a renamed binary masquerading as a text file all show their true colours.
Security responders triaging uploads, content-moderation engineers, and developers building file-upload validation reach for this when the extension is unreliable or absent entirely. Long-tail searches that lead here include "detect file type from bytes", "identify file by magic number", and "MIME type sniffer online".
How it works
Most binary formats start with a fixed signature at offset 0 (or a known offset). The sniffer carries a table of common signatures: 25 50 44 46 for PDF, 89 50 4E 47 0D 0A 1A 0A for PNG, FF D8 FF for JPEG, 50 4B 03 04 for ZIP (which also covers .docx, .xlsx, .epub, .jar), 1F 8B for gzip, 37 7A BC AF 27 1C for 7-Zip, 52 61 72 21 for RAR, 49 44 33 for MP3 with ID3 tag, and many more.
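The table lookup described above can be sketched in a few lines of Python. This is a minimal illustration rather than the tool's actual implementation: the `SIGNATURES` table and `sniff` function are hypothetical names, the MIME strings are the commonly used labels for each format, and every signature is assumed to sit at offset 0.

```python
from typing import Optional

# Signatures quoted in the text, all matched at offset 0.
# The MIME labels are the commonly used ones, not necessarily what the tool prints.
SIGNATURES = [
    (bytes.fromhex("89504E470D0A1A0A"), "image/png"),
    (bytes.fromhex("377ABCAF271C"), "application/x-7z-compressed"),
    (bytes.fromhex("25504446"), "application/pdf"),
    (bytes.fromhex("504B0304"), "application/zip"),
    (bytes.fromhex("52617221"), "application/vnd.rar"),
    (bytes.fromhex("FFD8FF"), "image/jpeg"),
    (bytes.fromhex("494433"), "audio/mpeg"),  # MP3 that starts with an ID3 tag
    (bytes.fromhex("1F8B"), "application/gzip"),
]


def sniff(head: bytes) -> Optional[str]:
    """Return a MIME label for the leading bytes, or None if nothing matches."""
    for magic, mime in SIGNATURES:
        if head.startswith(magic):
            return mime
    return None
```

Calling `sniff(open(path, "rb").read(16))` is enough for this table, since the longest signature here is eight bytes.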
For ZIP-based formats, the sniffer also looks at the member filenames so it can distinguish a generic ZIP from a docx (contains word/document.xml), an xlsx (contains xl/workbook.xml), or an epub (its first member is an uncompressed file named mimetype, so that name appears at byte offset 30). For text inputs, the encoding is detected via the BOM (EF BB BF for UTF-8, FF FE for UTF-16 LE, FE FF for UTF-16 BE) and content heuristics.
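One way to approximate the ZIP refinement step is with Python's standard `zipfile` module. The sketch below is an assumption about how such a check could look, not the tool's code: `refine_zip` is a hypothetical name, the marker filenames come from the text, and the EPUB check relies on the first member being stored uncompressed with no extra field, so that its name starts at offset 30 and its content at offset 38.

```python
import io
import zipfile

DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
XLSX = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
EPUB = "application/epub+zip"


def refine_zip(data: bytes) -> str:
    """Narrow a ZIP container down to docx, xlsx, or epub where possible."""
    # EPUB: the 30-byte local header is followed by the name "mimetype"
    # and then the uncompressed content "application/epub+zip".
    if data[30:38] == b"mimetype" and data[38:58] == EPUB.encode("ascii"):
        return EPUB
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            names = set(zf.namelist())
    except zipfile.BadZipFile:
        # The header looked like ZIP but the directory is unreadable;
        # keep the generic label rather than guess.
        return "application/zip"
    if "word/document.xml" in names:
        return DOCX
    if "xl/workbook.xml" in names:
        return XLSX
    return "application/zip"
```

Reading the whole file into memory keeps the sketch short; a production version would more likely pass a file object to `zipfile.ZipFile` directly.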
Examples
- Catch a .png upload that is actually a renamed executable.
- Determine whether a .zip is a real archive or actually a Microsoft Office document.
- Confirm a downloaded binary.dat is a gzip stream by recognising 1F 8B.
- Identify an EPUB by its embedded mimetype member rather than by extension.
FAQ
Is the extension ignored entirely?
For detection, yes: the bytes are the source of truth. The reported type is what the file actually is; the original extension is shown only for comparison.
Can it be fooled?
Yes. Magic-byte sniffing identifies the container format, not everything inside it. A malicious file can put a valid PNG header in front of arbitrary trailing data, and some files (polyglots) are deliberately built to be valid as several types at once. Treat the result as a strong hint, not a security guarantee.
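To make the trailing-data case concrete, the sketch below walks a PNG's chunk list and counts whatever follows the IEND chunk. It is an illustration only: `png_trailing_bytes` is a hypothetical helper, it does not verify chunk CRCs, and it does nothing to catch polyglots that hide a payload inside valid chunks.

```python
import struct

PNG_SIGNATURE = bytes.fromhex("89504E470D0A1A0A")


def png_trailing_bytes(data: bytes) -> int:
    """Return how many bytes follow the IEND chunk of a PNG.

    Raises ValueError if the signature is missing, a chunk is truncated,
    or no IEND chunk is found. CRCs are not checked.
    """
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("missing PNG signature")
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        # Each chunk: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        end = pos + 8 + length + 4
        if end > len(data):
            raise ValueError("truncated chunk")
        pos = end
        if chunk_type == b"IEND":
            return len(data) - pos
    raise ValueError("no IEND chunk found")
```

A non-zero result is not proof of malice, since some tools append metadata after IEND, but it is a reasonable signal to flag an upload for a closer look.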
Why does my Office document show as ZIP?
Modern Office formats (.docx, .xlsx, .pptx) are ZIP containers. The sniffer normally drills in to identify the specific Office type; if it reports plain ZIP instead, the expected marker entries were not found, which usually means the inner layout is non-standard.
Does it detect text encodings?
Yes — UTF-8 BOM, UTF-16 LE/BE BOM, and plain ASCII are detected. UTF-8 without a BOM is reported when the first few hundred bytes successfully decode as valid UTF-8.
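The detection order described here might look roughly like the following. Treat it as a sketch under stated assumptions: `sniff_text_encoding` and its 512-byte default sample are hypothetical stand-ins for "the first few hundred bytes", and the ASCII test only checks that every byte is below 0x80, so it does not screen out control characters.

```python
import codecs
from typing import Optional


def sniff_text_encoding(data: bytes, sample_size: int = 512) -> Optional[str]:
    """Guess a text encoding from a BOM, then from a sample of the content."""
    if data.startswith(codecs.BOM_UTF8):        # EF BB BF
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):    # FE FF
        return "utf-16-be"
    sample = data[:sample_size]
    if sample.isascii():
        return "ascii"
    try:
        # An incremental decoder with final=False tolerates a multi-byte
        # character that was cut in half by the sample boundary.
        codecs.getincrementaldecoder("utf-8")().decode(sample, final=False)
    except UnicodeDecodeError:
        return None
    return "utf-8"
```

Checking BOMs before content matters because BOM bytes such as FF FE are not valid UTF-8 and would otherwise cause the input to be rejected as non-text.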
Can it detect file versions, like PDF 1.4 vs 2.0?
The PDF header includes the version (%PDF-1.7), which is reported when present. Other formats vary in how clearly the version is encoded in the header.
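Pulling the version out of a PDF header is a one-line pattern match. The helper below is a hypothetical sketch (`pdf_version` is not a name from the tool) and assumes the header sits at offset 0 in the form %PDF-M.N.

```python
import re
from typing import Optional

# Matches headers such as %PDF-1.7 or %PDF-2.0 at the very start of the data.
_PDF_HEADER = re.compile(rb"%PDF-(\d\.\d)")


def pdf_version(head: bytes) -> Optional[str]:
    """Return the version string from a PDF header, or None if absent."""
    match = _PDF_HEADER.match(head)
    if match is None:
        return None
    return match.group(1).decode("ascii")
```

For example, `pdf_version(b"%PDF-1.4\n")` returns `"1.4"`, while data that does not start with a PDF header returns `None`.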