Scans can’t be trusted as Xerox machines switch numbers around.
Photocopiers exist to produce close enough replicas of original documents. Traditionally, they just spit out the result onto paper. Most copiers these days can operate as (generally rather large) scanners, generating PDFs, TIFFs, or other electronic representations. But some Xerox copiers have recently been found to produce scans that, well, aren’t that close to the originals at all. The copiers are producing documents that look superficially similar to the originals but switch around numbers apparently at random.
German computer scientist David Kriesel wrote about the problem last week [see my post, Xerox scanners/photocopiers randomly alter numbers in scanned documents (original report)]. He scanned some construction plans with a Xerox WorkCentre 7535 and noticed that the photocopier was resizing the rooms in his floorplans. One room annotated as being 21.11 square meters (roughly 227 square feet) got shrunk to 14.13 square meters (152 sq. ft.). So too was a room that should have been 17.42 square meters (187.5 sq. ft.). In both cases, the photocopier was taking the numbers from a third room (one that really should be 14.13 square meters) and using them for the other two rooms.
Further investigation revealed that this was not an isolated incident. A table of prices also came out wrong: a price of €65.40 ($86.71) became €85.40 ($113.22).
Kriesel currently speculates that this is an artifact of the compression being used. The scanned output can generally use any of several compression formats. One of these formats is called JBIG2. It’s designed for black-and-white (bitonal) images. The algorithm recognizes text-like portions of the image, which it then breaks down into a series of symbols (typically one symbol per character). The compressed image contains a sequence of symbols along with a dictionary to look up the shape each symbol represents.
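To make the symbol-and-dictionary idea concrete, here is a minimal sketch in Python. It is a toy model, not the actual JBIG2 bitstream: the glyph bitmaps and the function names `encode` and `decode` are made up for illustration, and a real encoder would also record where each symbol sits on the page.

```python
# Toy symbol-dictionary coding. Glyphs are tiny bitmaps written as tuples of
# strings. All names here are hypothetical; this is not the JBIG2 format.

def encode(glyphs):
    """Store each distinct bitmap once and refer to it by index."""
    dictionary = []   # unique bitmaps, in order of first appearance
    placements = []   # one dictionary index per glyph on the page
    for bitmap in glyphs:
        if bitmap not in dictionary:
            dictionary.append(bitmap)
        placements.append(dictionary.index(bitmap))
    return dictionary, placements


def decode(dictionary, placements):
    """Rebuild the page by looking each index up in the dictionary."""
    return [dictionary[i] for i in placements]


ONE = (".#.", "##.", ".#.", ".#.", "###")
FOUR = ("#.#", "#.#", "###", "..#", "..#")

page = [ONE, FOUR, ONE, ONE]            # the glyphs of "1411"
dictionary, placements = encode(page)   # two bitmaps stored, four references
restored = decode(dictionary, placements)
```

Because this sketch only reuses a dictionary entry when the bitmap matches exactly, decoding gives back the original page. The saving comes from storing the shape of a repeated character once rather than every time it appears.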
It appears that the Xerox machines are somehow mixing up their symbols, perhaps judging two different characters to be so similar that they should be represented by the same symbol even when they should not.
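A rough sketch of how that kind of mix-up could happen, again in Python: the encoder below reuses an existing dictionary entry whenever a new glyph differs from it by no more than `max_diff` pixels. The pixel-count comparison, the `max_diff` threshold, the 3-by-5 digit bitmaps, and the function names are all assumptions chosen for illustration; nothing here is known to match what the Xerox firmware actually does.

```python
# Illustrative only: a symbol matcher with an adjustable tolerance.
# Bitmaps, names, and the distance measure are hypothetical.

def pixel_diff(a, b):
    """Count the pixels that differ between two same-sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def encode_lossy(glyphs, max_diff):
    """Reuse the first dictionary entry within max_diff pixels of each glyph."""
    dictionary, placements = [], []
    for bitmap in glyphs:
        for index, known in enumerate(dictionary):
            if pixel_diff(bitmap, known) <= max_diff:
                placements.append(index)   # "close enough": reuse old symbol
                break
        else:
            dictionary.append(bitmap)      # no match: store a new symbol
            placements.append(len(dictionary) - 1)
    return dictionary, placements


def render(dictionary, placements, names):
    """Turn the decoded bitmaps back into readable characters."""
    return "".join(names[dictionary[i]] for i in placements)


EIGHT = ("###", "#.#", "###", "#.#", "###")
SIX = ("###", "#..", "###", "#.#", "###")     # one pixel away from EIGHT
FIVE = ("###", "#..", "###", "..#", "###")    # two pixels away from EIGHT
NAMES = {EIGHT: "8", SIX: "6", FIVE: "5"}

page = [EIGHT, SIX, FIVE]                     # the glyphs of "865"
strict = render(*encode_lossy(page, 0), NAMES)
loose = render(*encode_lossy(page, 1), NAMES)
```

With `max_diff` set to 0 the page survives intact. With a tolerance of a single pixel, the 6 is judged to be the same symbol as the 8 already in the dictionary, so the decoded page contains a crisp, perfectly legible, and wrong digit.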
Kriesel says he has received reports that, in addition to the two models of WorkCentre that he’s had issues with, other Xerox copiers show the same problem. The faulty scans can apparently be avoided by ensuring that JBIG2 compression isn’t used. Xerox has been notified of the errors but has not produced a fix yet.