


Squaring the Circle: Validation Without Ground Truth



Introduction

     Insight is a software toolkit being developed by a consortium of academic and industrial research institutions sponsored by the National Library of Medicine and its partner institutes and agencies. An essential component of this toolkit is the development of a validation methodology for various segmentation techniques.
      By applying Quality Function Deployment (QFD), the subcommittee of the consortium has established the basic design principles of such a validation suite ([1]). Among the 13 most important requirements that the QFD Analysis revealed were the following features:
  3. Statistical Foundation
  4. Ground Truth
  5. Quantitative evaluation
     Yoo et al. ([1]) summarized: "The intractable issues of Ground Truth were a high priority in the committee's QFD analysis". This statement is fundamentally correct in the context of clinical imaging, even if one were to incorporate elements of physical measurement in future data acquisitions such as the VHP data sets. As the previous work of several researchers has shown, synthetic data and phantoms are of limited usefulness.
     We present an outline of the principles for establishing a multi-level validation suite that will allow for a quantitative assessment of the accuracy of segmentation techniques. The core of this suite, and its novelty, consists in taking advantage of the mutual information of anatomical knowledge. We believe that the suite outlined herein will be adopted by the algorithm developers' community as the gold standard against which a variety of segmentation algorithms can be consistently compared.
 

1. The 8 Karat Gold Standard

     The work of Guttmann et al. ([2]) focussed on the reproducibility and comparison of various segmentation techniques within useful limits but, in the absence of a ground truth, it did not address their correctness. In other words, it made possible the relative comparison of different techniques against each other, but it failed to answer the question of how well the computer-generated model of a particular structure, normal or pathologic, reflects the real structure. Based on this work, we proceeded to build the first module of a validation suite, consisting of an arbitrary number of individual labelmaps of a given structure from the VHF data. The segmentation is performed by means of an interactive technique by trained specialists who are familiar with the anatomy of the region as well as with gross and cross-sectional anatomy. Each segmentation is "as good as it gets", based on clinical and anatomical judgement that has been tested and is currently in use in clinical, image-guided neurosurgery procedures as well as in several federally funded research projects. Each individual label set, as well as a statistical mean of all the labels, can serve as a rough yet dependable (8 karat) standard against which automatic segmentation techniques can be compared. Figure 1 illustrates the process, and Figure 2 presents a magnified and simplified example. An algorithm will have passed the test if its result falls within the limits of any of the individual labelmaps, or within the "average" thereof. A detailed description of this suite will be published upon completion, in the near future. Admittedly, this is a relativistic method which falls short of scientific expectations. Nevertheless, once completed, it will represent valuable progress, since at present no common standard exists and the performance of each algorithm is assessed in the style of a clinical judgement, with no means to compare several algorithms in a uniform, if not objective, manner.
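     The pass criterion described above can be illustrated with a minimal sketch. The abstract does not define "within the limits" or the "statistical mean" precisely, so the reading used here is an assumption: a candidate passes if it contains every voxel that all experts agree on and stays inside the union of all expert labelmaps, and the "mean" is taken as a simple majority vote. The function names and the representation of a labelmap as a set of (x, y, z) voxel indices are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

Voxel = Tuple[int, int, int]  # (x, y, z) index of a labeled voxel


def limits(expert_labelmaps: List[Iterable[Voxel]]) -> Tuple[Set[Voxel], Set[Voxel]]:
    """Return (core, envelope): voxels chosen by every expert / by at least one."""
    sets = [set(m) for m in expert_labelmaps]
    if not sets:
        raise ValueError("at least one expert labelmap is required")
    return set.intersection(*sets), set.union(*sets)


def majority_label(expert_labelmaps: List[Iterable[Voxel]]) -> Set[Voxel]:
    """Assumed reading of the 'statistical mean': voxels chosen by more than half."""
    sets = [set(m) for m in expert_labelmaps]
    votes = Counter(v for s in sets for v in s)
    return {v for v, n in votes.items() if 2 * n > len(sets)}


def passes_8_karat(candidate: Iterable[Voxel],
                   expert_labelmaps: List[Iterable[Voxel]]) -> bool:
    """Assumed criterion: core is a subset of candidate, candidate of envelope."""
    core, envelope = limits(expert_labelmaps)
    cand = set(candidate)
    return core <= cand <= envelope
```

     With three expert labelmaps of a one-dimensional toy structure, a candidate that covers the voxels shared by all experts and adds only voxels marked by at least one expert passes, while a candidate that strays outside every expert's labelmap fails.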

2. The 14 Karat Gold Standard - Mutual anatomical information

     In current practice, a given anatomical structure is segmented based on its image properties as they appear in the data set, such as color, edges and shape, with or without the use of preexisting knowledge of such properties. The structure is treated in an anatomical vacuum, so to speak. While this holds where the structure (e.g. viscera, blood vessel or nerve) appears surrounded by relatively homogeneous tissue, such as connective tissue, almost any given structure intersects other significant and well-defined structures, and these neighborhoods are anatomically significant. A validation method that takes these critical points into account can exclude not only "improbable" boundaries but also "impossible" ones. In other words, the boundary of a nerve may be hard to establish with certainty where the nerve is surrounded by connective tissue, but it becomes clear-cut where the nerve meets a blood vessel, which has its own precise boundaries. Hence, the boundaries of the nerve can be defined as a multitude of points, some of which are "probable", others being "certain".
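     The distinction between "probable" and "certain" boundary points can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: structures are represented as sets of (x, y, z) voxel indices, adjacency is taken as 6-connectivity (shared faces), and a boundary voxel counts as "certain" when at least one of its outside face neighbors belongs to a well-defined neighboring structure. The name classify_boundary is hypothetical.

```python
from typing import Iterable, Set, Tuple

Voxel = Tuple[int, int, int]

# Offsets of the six face-adjacent neighbors of a voxel (an assumed adjacency).
FACE_OFFSETS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def classify_boundary(structure: Iterable[Voxel],
                      well_defined_neighbors: Iterable[Voxel]
                      ) -> Tuple[Set[Voxel], Set[Voxel]]:
    """Split the boundary voxels of `structure` into (certain, probable)."""
    inside = set(structure)
    anchors = set(well_defined_neighbors)
    certain: Set[Voxel] = set()
    probable: Set[Voxel] = set()
    for (x, y, z) in inside:
        adjacent = [(x + dx, y + dy, z + dz) for dx, dy, dz in FACE_OFFSETS]
        outside = [v for v in adjacent if v not in inside]
        if not outside:
            continue  # interior voxel, not part of the boundary
        if any(v in anchors for v in outside):
            certain.add((x, y, z))   # touches a structure with precise boundaries
        else:
            probable.add((x, y, z))  # borders only undifferentiated tissue
    return certain, probable
```

     For a toy nerve made of three voxels in a row with an artery voxel at one end, only the nerve voxel touching the artery is classified as "certain"; the other two remain "probable".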
     In the example shown in Figure 3, the C3 nerve on the right side (arrow) is neighbored by the vertebral artery and by the second and third cervical vertebrae. Thus, the relative indeterminacy of its boundaries can be significantly reduced if one takes into account these anatomical neighborhoods of the nerve (Figure 4 A through D). In this way, the cloud of points from the 8 karat suite, within which any segmentation should "reasonably" be found, can be improved on by adding a number of points which the segmentation may not include without failing the test.
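     The idea of points that a segmentation "may not include" can be expressed as a simple exclusion test. The sketch below assumes that labelmaps are sets of (x, y, z) voxel indices, that the cloud of reasonable points is given as an envelope set, and that each neighboring structure (for example the vertebral artery) is supplied as its own labelmap; the function names and dictionary keys are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Voxel = Tuple[int, int, int]


def neighbor_violations(candidate: Iterable[Voxel],
                        neighbors: Dict[str, Iterable[Voxel]]
                        ) -> Dict[str, Set[Voxel]]:
    """Return, per neighboring structure, the candidate voxels that intrude on it."""
    cand = set(candidate)
    violations: Dict[str, Set[Voxel]] = {}
    for name, voxels in neighbors.items():
        overlap = cand & set(voxels)
        if overlap:
            violations[name] = overlap
    return violations


def passes_14_karat(candidate: Iterable[Voxel],
                    envelope: Iterable[Voxel],
                    neighbors: Dict[str, Iterable[Voxel]]) -> bool:
    """Assumed criterion: stay inside the envelope and touch no forbidden voxel."""
    cand = set(candidate)
    return cand <= set(envelope) and not neighbor_violations(cand, neighbors)
```

     Reporting the intruding voxels per structure, rather than a bare pass or fail, makes it possible to see which anatomical neighbor a failing algorithm has leaked into.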

3. The 18 Karat Gold Standard

     The space between the structures in Figure 4 still leaves room for uncertainty and has us wishing for an unattainable solid gold standard. The VHF data set has an isotropic resolution of 0.33 mm/voxel. On the other hand, the segmentation algorithms that we want to validate are typically developed for radiologic data, mostly magnetic resonance imaging (MRI), with a resolution of 1-1.5 mm/pixel and a slice thickness of 2-4 mm. This translates into a loss of resolution by a factor of up to about 5 in plane (Figure 5) and up to about 12 in the z-direction (Figure 6). Hence, we can greatly increase the relative accuracy of our 14 karat gold standard by creating it on the full resolution data set and applying it to a reduced data set approximating the image properties of the clinical MRI images.
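     The resolution factors quoted above follow from dividing the coarse voxel size by the fine one, taken at the coarse end of each range (1.5 mm in plane, 4 mm slice thickness). The sketch below reproduces that arithmetic and shows one possible way to transfer a full-resolution labelmap to the reduced grid. The block-occupancy rule with a 0.5 threshold is an assumption made for illustration, as are the function names and the representation of a labelmap as a set of (x, y, z) voxel indices.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

Voxel = Tuple[int, int, int]


def loss_factor(fine_mm: float, coarse_mm: float) -> float:
    """Ratio of coarse to fine voxel size along one axis."""
    return coarse_mm / fine_mm


def downsample_labelmap(voxels: Iterable[Voxel],
                        in_plane: int = 5,
                        through_plane: int = 12,
                        min_fraction: float = 0.5) -> Set[Voxel]:
    """Map fine voxels onto a coarse grid.

    A coarse voxel is labeled when at least `min_fraction` of the fine voxels
    in its block are labeled (an assumed partial-volume rule).
    """
    counts = Counter(
        (x // in_plane, y // in_plane, z // through_plane)
        for (x, y, z) in set(voxels)
    )
    block_size = in_plane * in_plane * through_plane
    return {c for c, n in counts.items() if n / block_size >= min_fraction}


# 0.33 mm cryosection voxels versus 1.5 mm pixels and 4 mm slices.
in_plane_factor = round(loss_factor(0.33, 1.5))
through_plane_factor = round(loss_factor(0.33, 4.0))
```

     With the default factors, one coarse voxel summarizes 5 x 5 x 12 = 300 fine voxels, which gives a sense of how much boundary detail in the full-resolution standard is below the resolving power of the data on which the algorithms are run.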

4. The 24 Karat Gold Standard

     If the image properties of a synthetic MRI data set generated from the VHF cryosections can adequately simulate those of a real data set, it follows that the segmentation of the high-resolution color cryosections can yield information of a higher order of magnitude than the data used for validation. Hence, it can be used as a gold standard for algorithms designed to perform on radiologic data.
 

Conclusions

     Although the prospect of a ground truth for validating anatomical segmentation remains elusive, it is possible to develop a multi-level suite for validation of segmentation algorithms.
     A sufficient number of structures of different levels of difficulty can constitute a Gold Standard against which the performance of different segmentation algorithms can be reliably compared.



Office of High Performance Computing and Communications
Lister Hill National Center for Biomedical Communications
U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894
National Institutes of Health
Department of Health & Human Services
Last updated: 2 July 2001