What are Sheets?

A sheet is a representation of a continuous partial function between two spaces, each of which is locally like either (a) n-dimensional Euclidean space or (b) an n-dimensional integer grid. The function values do not follow a simple formula and, thus, are represented using a table of values at a finite set of sample locations. The function is represented to finite precision: values are stored only at a finitely dense set of locations, and each value only to a fixed precision. Each sheet contains homogeneous data: for example, all values have the same type and the same precision. This allows packed storage and optimization of compiled code.

Traditionally, computer vision algorithms have stored image-like data in 8-bit or 16-bit integer arrays. Much of the complexity and non-generality of vision code is due to the fact that the programmer must hand-code the transformation between the user-level representation of the data and the actual, optimized array format. For example:

• user-level: floating point stereo disparities, each disparity a vector with the first component in the range [-100.0,100.0] and the second in the range [-5,5], represented to the nearest tenth of a unit
• actual array format: an array of 16-bit signed integers plus an array of 8-bit unsigned integers, with -32768 used to fill locations for which no disparity is available (e.g. due to occlusion)
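The kind of hand-coded conversion this example implies can be sketched as follows. This is a minimal illustration, not Envision's implementation; the names `pack_disparity`, `unpack_disparity`, `MISSING`, and `SCALE` are hypothetical, with the units, range, and sentinel value taken from the example above.

```python
MISSING = -32768   # sentinel for "no disparity available" (from the example)
SCALE = 10         # store to the nearest tenth of a unit

def pack_disparity(d):
    """Encode a disparity in [-100.0, 100.0] (or None) as a 16-bit value."""
    if d is None:
        return MISSING
    return round(d * SCALE)

def unpack_disparity(raw):
    """Decode a stored 16-bit value back to a float, or None if missing."""
    if raw == MISSING:
        return None
    return raw / SCALE

# e.g. pack_disparity(42.37) -> 424, and unpack_disparity(424) -> 42.4
```

Every access and every store must remember to apply this conversion; forgetting it in even one place is exactly the class of bug described above.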

Sheets provide a layer of abstraction which insulates the programmer from such details. Specifically, the sheet representation provides support for:

• Partiality (missing values)
• Circularity (e.g. angular values, log-polar maps)
• Multi-dimensional output values (vector-valued images)
• Precision and range of values (bit-depth of the storage array)
• Scaling and offset of domain coordinates
• Interpolation and the distinction between continuous and discrete spaces
• Restrictions (sub-images sharing storage with the full image)

Distinctions which rarely affect the structure of computer vision algorithms are handled automatically, so that a single, simple low-level vision function can be applied to data of several types. For example, one can write a median filter or Canny edge finder which works regardless of the precision of the image values, whether they are signed or unsigned, whether the image is tubular (e.g. a log-polar representation), and whether data is unavailable for certain locations in the domain (e.g. stereo disparities). This greatly increases the generality and portability of the algorithm.

Partiality

The ideal domain of many computer vision functions is an infinite space. For example, an ideal perspective image has domain R². Obviously, we don't have values (even sampled values) for this entire space. We can still regard our image as a function with domain R², so long as we allow the function to be partial.

It is common for computer vision code to assume that data is available for a rectangular region of the image and, thus, that the partial function on R² can be represented as a total function on a rectangular array. However, this line of reasoning fails in circumstances such as the following:

• 1) When an image is rotated, its sides are no longer aligned with the coordinate axes.
• 2) When radial lens distortion is corrected, the output image is not square: its sides are warped inwards.
• 3) Stereo disparity data may be missing in the middle of an image due to occlusion; motion/texture orientations may be missing in regions with uniform intensities.
• 4) A user may select only part of an image, e.g. only the skin areas, for training an algorithm or to focus the attention of a later stage of processing.
• 5) When a panoramic image is created by pasting together images taken from several camera positions, the composite image may have holes of diverse shape.

In such cases, the function must clearly be represented as partial and we must include a mechanism for marking certain locations as having no value. Once this machinery is in place, it is simplest to regard all flat 2D images as partial functions on R². Cases (1) and (2) make it obvious that there is no strong distinction between locations outside the storage array and locations inside the storage array whose values are missing. In Envision, a special "missing" value is returned when a numerical value is unavailable, regardless of the reason why the value is unavailable.
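The "one missing value, whatever the reason" idea can be sketched as below. The class `PartialImage` and its `ref` method are hypothetical names invented for this illustration, not Envision's actual API.

```python
MISSING = object()   # a single marker for any unavailable value

class PartialImage:
    """A partial function on a 2D grid, stored as a rectangular array."""
    def __init__(self, rows):
        self.rows = rows   # list of lists; None marks a hole inside storage

    def ref(self, x, y):
        # Locations outside the storage array and holes inside it look
        # identical to the caller: both simply return MISSING.
        if not (0 <= y < len(self.rows) and 0 <= x < len(self.rows[0])):
            return MISSING
        v = self.rows[y][x]
        return MISSING if v is None else v

img = PartialImage([[1, 2],
                    [None, 4]])
# img.ref(0, 0) -> 1
# img.ref(0, 1) and img.ref(5, 5) both -> MISSING
```

Note that the caller never learns whether a value is missing because it fell off the edge of storage or because it was marked absent, which is precisely the point made by cases (1) and (2).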

When the codomain is multi-dimensional (e.g. a field of 2D motion vectors), the value at each location must either be entirely present or entirely missing. This is a consequence of our homogeneity assumption. The coordinate axes exist for computational convenience and are arbitrary. The stored data should be covariant under rotations of the output coordinate system.

When only 1D data is available for a 2D output value, it is typical that the 1D data lies along a vector not aligned with the coordinate axes. For example, in motion computations, the missing output component is likely to lie perpendicular to the image gradient direction. Such cases cannot be handled by allowing individual coordinates of points to be missing. Rather, they require a totally different representation of the data.

Failure to handle non-rectangular images or internal missing values is an extremely common limitation of computer vision algorithms. It discourages researchers from taking algorithms developed for 8-bit gray-scale images (e.g. edge finders) and using them on other types of data (e.g. stereo disparities).

Scaled and offset coordinates in the domain

In C and Scheme, array indices are always non-negative and start at zero. Camera images, however, have a natural notion of center (the intersection of the camera's optical axis with the image plane). This center varies from image to image, but almost never lies at the corner. Therefore, many computer vision algorithms are cluttered by addition and subtraction of the coordinates of the image center.

In addition, the unit of measure appropriate at a high level may not match the spacing of the sample locations (pixels). For example, many stereo matching algorithms initially match a subsampled version of the input images. The result is then used to guide matching of the images at full resolution. Conceptually, the image has a fixed size and, thus, the distance between two features should be the same at the two scales. The pixelwise distances, however, are different.

Envision allows the user to specify the location of the image center and the scaling required to convert pixel distances into units meaningful in the application. Conversion between user locations and raw pixel locations is then done automatically.
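In one dimension, the conversion amounts to a single affine map each way. A minimal sketch, with hypothetical function names (not Envision's API):

```python
def user_to_pixel(u, center, scale):
    """Map a user-space coordinate to a raw pixel index.

    center: pixel index of the user-space origin (e.g. the optical center)
    scale:  user units per pixel (e.g. 2.0 for a half-resolution image)
    """
    return u / scale + center

def pixel_to_user(p, center, scale):
    """Inverse map: raw pixel index back to user-space coordinate."""
    return (p - center) * scale

# A half-resolution image with its center at pixel 64:
# user location 10.0 lands at pixel 69.0, and converts back exactly.
center, scale = 64, 2.0
```

With this map hidden inside the sheet, the same user-space distance means the same thing at both resolutions of the stereo example above.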

Range and precision limits in the codomain

Essentially all numbers used in computer vision algorithms have limited precision, due to a combination of quantization and contamination with random noise. Therefore, bulk data is usually stored as integers, but converted to floating point for numerical operations, using a magic constant. Magic constants limit generality and portability; mistakes (e.g. forgetting to apply the multiplier on access or storage) are a frequent source of bugs.

In Envision, it appears to the programmer that each sheet contains floating point numbers. The precision of these numbers is immediately evident, and enforced when numbers are stored. The conversion of these numbers to integers is performed automatically.

Digitized camera images contain values in a limited (and known) range, usually [0,255]. The values produced by each stage in a low-level computer vision algorithm also lie in a fixed range, due to the restricted range of the input and the fact that ubiquitous random noise limits the precision of all numbers. These range restrictions make it possible to store images in memory, and in disk files, using only 1-2 bytes per pixel.

True floating point numbers, i.e. real numbers with an extremely wide range of magnitudes, are essentially non-existent. When they appear to be present, this is usually a sign of poor numerical handling, e.g. inverting a small number. The exceptions to this principle seem to occur in high-level code, not in code handling sheet data.

In Envision, values stored into a sheet are automatically clipped to the declared range of the sheet. That is, higher and lower values are automatically changed to values at the extreme ends of the declared range. This is the behavior favored by most computer vision algorithms. Of course, if a specific algorithm requires a different behavior (e.g. an error break), it is simple for code to include an explicit test for underflow and overflow.

The range of values does not need to fit the standard signed or unsigned, 8-bit or 16-bit ranges provided by C. For example, a program producing angular output (e.g. edge orientations) may specify that values lie in the range [0.0,360.0) and are represented to the nearest tenth of a degree. Envision enforces this declared range.
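The store-time behavior described above, clipping to the declared range and then quantizing to the declared precision, can be sketched as a single hypothetical helper (not Envision's API):

```python
def store(value, lo, hi, step):
    """Simulate storing into a sheet declared with range [lo, hi]
    and precision step: clip, then quantize to the nearest step."""
    clipped = min(max(value, lo), hi)     # out-of-range values clip, not wrap
    return round(clipped / step) * step

# Intensities declared as [0.0, 255.0] to the nearest whole unit:
# store(300.0, 0.0, 255.0, 1.0) -> 255.0   (clipped, not wrapped to 44)
# store(-7.0,  0.0, 255.0, 1.0) -> 0.0
# Values declared to the nearest tenth:
# store(12.34, 0.0, 255.0, 0.1) -> 12.3
```

The clipping branch is what distinguishes this behavior from the modular unsigned arithmetic of C arrays criticized below.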

Failure to handle overflow and underflow is a common bug in computer vision algorithms. The algorithm may be tested on images for which overflow does not occur (e.g. the highest intensity value is only 200). Later on, when it is applied to other sorts of images, it may crash without warning.

Or, more likely and far worse, because images are stored in unsigned arrays and unsigned numbers in C are modular, the code may simply generate wrong numbers. A few wrong numbers may be ignored as random or be blamed on another cause. The true problem may not become apparent until someone happens to run an image in which many values over/underflow.
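The silent wraparound is easy to demonstrate. The sketch below imitates C's unsigned 8-bit arithmetic (the helper name is invented for this illustration):

```python
def add_uint8(a, b):
    """C-style unsigned 8-bit addition: results wrap modulo 256."""
    return (a + b) % 256

# Brightening a pixel of intensity 200 by 100 does not clip to 255;
# it silently becomes 44, a plausible-looking but wrong intensity.
# add_uint8(200, 100) -> 44
```

Because 44 is a perfectly legal intensity, nothing crashes; the error surfaces only as mysteriously wrong output, exactly as described above.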

Circularity

The possible domains for computer vision functions obviously include R and R². Possible codomains clearly include R, R², and R³. All of these spaces are flat (linear).

A few applications, however, require either the domain or the codomain to have a more interesting topological shape.

• The codomain values may be angles, which form a circular space.
• Contours (boundaries of regions in images) are often circular and, thus, naturally represented as a sheet with a circular domain.
• Log-polar image representations require a tubular domain.
• Fourier transforms treat the domain of a function as circular (in 1D) or toroidal (in 2D).

Circles, tubes, and tori are manifolds, i.e. they look locally like R or R². Furthermore, they are easy to represent with rectangular arrays: simply change the boundary conditions at the edges.
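The boundary-condition change is just modular index arithmetic on the circular axes. A minimal sketch (the helper name is hypothetical):

```python
def torus_ref(rows, x, y):
    """Index a 2D array with both axes treated as circular,
    i.e. a torus: indices wrap around at the edges."""
    h, w = len(rows), len(rows[0])
    return rows[y % h][x % w]   # Python's % always yields a non-negative index

grid = [[1, 2, 3],
        [4, 5, 6]]
# torus_ref(grid, 3, 0)   wraps past the right edge back to column 0 -> 1
# torus_ref(grid, -1, -1) wraps to the opposite corner -> 6
```

A tube is the same idea applied to only one axis; a circle is the 1D case.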

The current implementation supports the following possibilities:

• domain: R, R², circle, tube, and torus
• codomain: R, R², R³, and circle

The dimension of the codomain can be limited to 3 (or perhaps 4) because of the homogeneity assumption built into the definition of sheets. A single sheet is supposed to contain only data of a single type, and the coordinates of a multi-dimensional output value are also supposed to share a common type. For example, a sheet with 2D codomain might contain 2D motion vectors, or 2D color values. However, it should not contain a heterogeneous mixture of values, such as two color components and a list of texture properties. A function that produces a list of heterogeneous values at each point should be represented using a list of several sheets.

The change required for circular boundary conditions is conceptually simple and easy to add (once) to the image coprocessor. By contrast, coding it individually for each algorithm is a nuisance, it clutters the code, and it is a source of errors.

Interpolation

Many of the functions used in low-level vision conceptually take continuous values as input and produce continuous values as output. However, values can only be stored at finitely-spaced sample locations. Values at other locations must be obtained by interpolation. Many simple operations, such as rotating or resizing images, require interpolated values.

Interpolation is a nuisance to code and almost invariably coded poorly. Typical practice seems to be to use bilinear interpolation, despite the fact that it shaves significant amounts off peaks and troughs. This is particularly annoying because peaks and troughs are often the focus of attention in vision algorithms, and because the problem may appear trivial when the operation is applied once but become worse if the operation (e.g. rotation) is applied repeatedly.

Finally, implementing interpolation requires considerable care when there may be missing values. The required work is almost never done. Thus, algorithms requiring interpolation either do not work on images with missing values, or work poorly.

In Envision, sheets with continuous domains automatically support references at non-integer locations. These values are generated using a second-order polynomial interpolation algorithm which handles any pattern of missing values. [ADD REFERENCE TO PAGE WITH DETAILS]
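To show the shape of the problem (though not Envision's polynomial algorithm), here is a much simpler missing-aware scheme in 1D: linear interpolation that falls back to the available neighbor when the other is missing. All names are invented for this illustration.

```python
def interp1(samples, x):
    """Interpolate a value at non-integer location x.

    samples: list of values, with None marking missing samples.
    Falls back to the available neighbor when one side is missing,
    and returns None when both are missing.
    """
    i = int(x)
    w = x - i                          # weight of the right neighbor
    left, right = samples[i], samples[i + 1]
    if left is None and right is None:
        return None                    # no data at all: result is missing
    if left is None:
        return right                   # degrade to the available neighbor
    if right is None:
        return left
    return (1 - w) * left + w * right

# interp1([10.0, 20.0], 0.25) -> 12.5
# interp1([None, 20.0], 0.25) -> 20.0   (degrades gracefully)
```

Even this toy version shows why the work is rarely done by hand: every neighbor combination needs its own case, and the number of cases grows quickly in 2D.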

Continuous vs. discrete spaces

Exact integers and inexact real numbers have very different numerical properties and must be handled in different ways. Or, said another way, continuous and discrete spaces are very different topologically. These differences have practical consequences. For example, when the domain is conceptually discrete, it is an error to ask for values from non-integer locations.

Some bulk data in computer vision is best regarded as a function with discrete domain. Typical examples include large tables (e.g. databases of object models) and long lists of points (e.g. image features, edge locations). Long lists of points, triangles, or rectangles are common in computer graphics.

Some functions used in computer vision also have a discrete codomain. Examples include histograms, bit-encoded or binary edge maps, and compressed (e.g. run-length encoded) images. Values retrieved from these sheets are integers and should be processed using exact arithmetic.

Envision supports three types of sheets:

• manifolds: continuous domain and codomain
• real grids: discrete domain, continuous codomain
• integer grids: discrete domain and codomain

Sheets with continuous domain and discrete codomain are not supported. Theoretically, the functions they would represent cannot be continuous. In practical terms, specifying the domain to be continuous implies that it supports interpolation of values for non-integer locations. However, discrete values cannot be interpolated except by, essentially, treating them as if they were continuous.

Focus Areas

Associated with each sheet is a "focus area." This is a rectangular sub-area of the sheet's storage array. Envision includes a scan primitive that allows an operation to be repeated at every location in a sheet, without the user having to explicitly write the double loops. The focus area is used to set the parameters for this loop. Similarly, display operations consider only the contents of the focus area.

Subsections of 2D images are created to focus attention on part of the image, or to divide the image into pieces for parallel processing. Curves (e.g. the boundary of an image region), represented as 1D sheets, are frequently divided into sections by curve segmentation algorithms. For example, a curve may be divided at sharp corners.

Focus areas are useful because they allow a sub-section of a sheet to be extracted without allocating new storage. The new sheet appears to be a subsection for most practical purposes, but it shares storage with its parent. Because images are so large, storage is precious in computer vision algorithms. Sheets which share storage arrays are said to belong to the same "storage group." (Similar shared storage can also be created by moving the origin of a sheet or rescaling its coordinate system.)

The focus area is used to set up the loop used in scanning. However, when scanning through all points in a sheet's focus area, an algorithm looking at a location inside the focus area can retrieve values from nearby locations outside the focus area. This is extremely useful because most low-level computer vision operations require data from a local neighborhood, not just at the point of interest. Depriving them of context outside the focus area typically results in less accurate, and sometimes entirely wrong, output values.
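The distinction between "loop over the focus area" and "read from the full storage array" can be sketched with a 3-point average over a 1D sheet. The function name and loop structure are invented for this illustration; Envision's scan primitive hides the loop entirely.

```python
def smooth_focus(data, lo, hi):
    """Average each location in the focus area [lo, hi) with its
    neighbors, reading neighbors from the full storage array even
    when they lie outside the focus area."""
    out = []
    for i in range(lo, hi):                         # loop bounds: focus area
        window = data[max(0, i - 1): i + 2]         # reads: full array
        out.append(sum(window) / len(window))
    return out

data = [0, 3, 6, 9, 12]
# Focus on indices 1..3; the values at 0 and 4 still supply context:
# smooth_focus(data, 1, 4) -> [3.0, 6.0, 9.0]
```

If the reads were also clipped to the focus area, the results at indices 1 and 3 would be computed from truncated neighborhoods and come out wrong, which is the failure mode the paragraph above warns against.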