This could then lead to further analysis of each file to see whether it is indeed well-formed or whether it is lying about its magic bytes.
We would end up with a dataset linking each file to a qualitative description similar to the following (a sketch of how such a dataset might be built comes after the examples):
- PNG image data, 500 x 320, 8-bit/color RGBA, non-interlaced
- ASCII text
- gzip compressed data, was "foo.tar", last modified: Sat Jun 25 21:49:32 2016, from Unix
- Zip archive data, at least v1.0 to extract
- MPEG ADTS, layer III, v1, 128 kbps, 44.1 kHz, JntStereo
- FLAC audio bitstream data, 16 bit, stereo, 44.1 kHz, 7997976 samples
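
Building such a dataset could be as simple as shelling out to the file(1) utility for every candidate. Here is a minimal sketch in Python; the decoded/ directory and the dataset mapping are placeholders for whatever layout the collection ends up with:

    import subprocess
    from pathlib import Path

    def describe(path: Path) -> str:
        # `file --brief` prints only the description, without the
        # leading filename, which is exactly the qualitative label
        # we want to store.
        out = subprocess.run(
            ["file", "--brief", str(path)],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()

    # Hypothetical location of the decoded data.
    dataset = {p: describe(p)
               for p in Path("decoded/").iterdir() if p.is_file()}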
From there, a curator would be able to say, "I'll take a look at all of the foo-type data and try to add further categorization."
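
A curator's first pass could then be a plain filter on the description prefix, e.g. pulling every PNG out of the hypothetical dataset mapping above:

    png_files = [path for path, desc in dataset.items()
                 if desc.startswith("PNG image data")]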
Eventually, when a structure for managing the collection emerges, we could etch a daily index to aid discovery of the new data.
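
Whatever form the etched index ultimately takes, its content could start as a per-day grouping of paths by description. A rough sketch, again reusing the hypothetical dataset mapping and assuming an arbitrary JSON layout; the etching/publication step itself is left out, since its mechanics are not settled:

    import json
    from collections import defaultdict
    from datetime import date

    # Group file paths under their qualitative description.
    index = defaultdict(list)
    for path, desc in dataset.items():
        index[desc].append(str(path))

    # One index document per day.
    with open(f"index-{date.today().isoformat()}.json", "w") as fh:
        json.dump(index, fh, indent=2)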
Obviously, this would apply only to data that can be decoded, i.e. data that is either stored in the clear or whose key has been made public.
-Franco