| Title | Using sparse parametrization of deformation fields as means to classification |
| Publication Type | thesis |
| School or College | College of Engineering |
| Department | Computing |
| Author | Tirpankar, Nishith |
| Date | 2013-05 |
| Description | Large Deformation Diffeomorphic Metric Mapping is a powerful technique which has been used to quantify variations in anatomical structures in medical images. It allows us to compare images within and across populations of classes using the underlying deformation field that maps each image to the representative image of its class. The deformation field can be described by a low-dimensional control point parameterization. We investigate the potential of this low-dimensional parameterization for classification and study the effect of the underlying classifier parameters on classification accuracy. |
| Type | Text |
| Publisher | University of Utah |
| Subject | Atlas Estimation; Classification; Diffeomorphic Deformation; LDDMM; Optimization; Registration |
| Dissertation Institution | University of Utah |
| Dissertation Name | Master of Science |
| Language | eng |
| Rights Management | Copyright © Nishith Tirpankar 2013 |
| Format | application/pdf |
| Format Medium | application/pdf |
| Format Extent | 1,766,062 bytes |
| ARK | ark:/87278/s6sf3b1v |
| DOI | https://doi.org/10.26053/0H-DS2Q-0HG0 |
| Setname | ir_etd |
| ID | 195922 |
| OCR Text | USING SPARSE PARAMETRIZATION OF DEFORMATION FIELDS AS MEANS TO CLASSIFICATION by Nishith Tirpankar

A thesis submitted to the faculty of The University of Utah in partial fulfillment of the requirements for the degree of Master of Science in Computing. School of Computing, The University of Utah, May 2013. Copyright © Nishith Tirpankar 2013. All Rights Reserved.

The University of Utah Graduate School. STATEMENT OF THESIS APPROVAL. The thesis of Nishith Tirpankar has been approved by the following supervisory committee members: Guido Gerig, Chair, 11/28/2012 Date Approved; Sarang Joshi, Member, 11/28/2012 Date Approved; Tom Fletcher, Member, 11/28/2012 Date Approved; Stanley Durrleman, Member, 12/10/2012 Date Approved; and by Alan Davis, Chair of the School of Computing, and by Donna M. White, Interim Dean of The Graduate School.

ABSTRACT

Large Deformation Diffeomorphic Metric Mapping is a powerful technique which has been used to quantify variations in anatomical structures in medical images. It allows us to compare images within and across populations of classes using the underlying deformation field that maps each image to the representative image of its class. The deformation field can be described by a low-dimensional control point parameterization. We investigate the potential of this low-dimensional parameterization for classification and study the effect of the underlying classifier parameters on classification accuracy.

To Nikhil, Anshul, Mark, and Purba for being such a great support.

CONTENTS

ABSTRACT .......... iii
LIST OF FIGURES .......... vii

CHAPTERS

1. INTRODUCTION .......... 1
1.1 The problem of statistics on high-dimensional image data ....
...... 1
1.2 Velocity field as a feature .......... 1
1.2.1 Datasets used .......... 2

2. IMAGE REGISTRATION FRAMEWORK .......... 6
2.1 Registering I_src with I_tar .......... 6
2.2 Results and conclusion .......... 8

3. ATLAS ESTIMATION .......... 10
3.1 Derivation of the atlas formation process .......... 10
3.1.1 Atlas formation using iterative averaging .......... 11
3.1.2 Results .......... 11
3.2 Computing an optimal set of landmarks .......... 13

4. BINARY CLASSIFICATION .......... 19
4.1 Classification criteria .......... 19
4.2 Receiver operating characteristics plots .......... 21
4.3 Effect of varying σ and r .......... 22
4.4 Effect of varying the dimensionality of the deformation descriptor .......... 25
4.5 Effect of varying the number of training examples used in atlas formation .......... 30

5.
MULTICLASS CLASSIFICATION .......... 36
5.1 Multiclass classification extensions .......... 36
5.2 Confusion matrix .......... 37
5.3 Multiclass classification using optimally situated control points .......... 37
5.4 Using a higher density of control points .......... 39
5.5 Using the gradient as a feature .......... 41

6. CONCLUSION AND FUTURE WORK .......... 43
6.1 Image registration and atlas formation .......... 43
6.2 Binary classifier .......... 44
6.3 Multiclass classification .......... 44
6.4 Future work .......... 45

REFERENCES .......... 47

LIST OF FIGURES

1.1 Synthetic snowman dataset. .......... 3
1.2 Sample images from ZIP code digits training dataset. .......... 4
1.3 Sample images from ZIP code digits test dataset. .......... 5
2.1 Results of deformation. Top Left: Source image of digit 6. Top Right: Target image of digit 6. Bottom Left: Deformed image with 256 momenta vectors overlain on the control points. Bottom Right: The value of the objective against the iterations of the gradient descent. ....
9
3.1 Results of atlas formation. Top Left: Template images in atlas formed using averaging for σ = 1. Top Right: Template images in atlas formed using averaging for σ = 3. Bottom Left: Template images in atlas formed using splatting for σ = 1. Bottom Right: Template images in atlas formed using splatting for σ = 3. .......... 12
3.2 Variance of norm over atlas. Top to bottom, left to right: L2 norm of variance of α_i for the interpolating kernel width σ = 0.5, 1, 2, 3, and 4. The variance tends to be more distributed over the entire image domain as we increase the value of σ. .......... 15
3.3 Peaks of variance norm. Top to bottom, left to right: Variance peaks for the interpolating kernel width σ = 0.5, 1, 2, 3, and 4. The peaks seem to hug the contours from the outside. .......... 16
3.4 Union of variance peaks across classes. Top to bottom, left to right: Peaks found as a union of the peaks from the previous steps. .......... 18
4.1 ROC curve for classification between digits 1 and 3. .......... 23
4.2 Magnified version of the plot, magnified for FPR between 0 and 0.1 and TPR between 0.9 and 1. .......... 24
4.3 ROC curves of classification between digits 1 and 3 for σ = 2. Left: ROC curve. Right: Magnified version of the same plot, magnified for FPR between 0 and 0.1 and TPR between 0.9 and 1. .......... 24
4.4 ROC curve for classification between digits 1 and 3 for r = 0.01. Left: ROC curve. Right: Magnified version of the same plot, magnified for FPR between 0 and 0.1 and TPR between 0.9 and 1. .......... 26
4.5 ROC curve for classification between digits 2 and 5. ....
...... 26
4.6 Magnified version of the same plot, magnified for FPR between 0 and 0.1 and TPR between 0.9 and 1. .......... 27
4.7 ROC curve for classification between digits 1 and 3 along with the area under the curve, denoted AUC. .......... 28
4.8 ROC curves for the binary classifier between digits 1 and 3 for different dimensionality of the classifier. .......... 29
4.9 Effect of changing the number of control points on the area under the ROC curve. .......... 31
4.10 Effect of changing the dimensionality of the classifier. Top: ROC plots for digits 2 and 5 with changing dimensionality of the classifier. Bottom: Effect of changing the number of control points on the area under the ROC curve. .......... 32
4.11 Changing the number of training examples for the binary classifier between digits 1 and 3. .......... 33
4.12 Area under ROC curve as the number of training samples is changed. .......... 34
5.1 Confusion matrices plotted for multiclass classification with 25 control points using σ = 3, r = 0.25, with gradient descent on the control point positions, using different classification criteria. Top Left: Data matching criterion used for classification gives excellent results; average error rate is 0.12. Top Right: Magnitude of the momenta vectors when used for classification gives an average error rate of 0.38. Most digits tend to get confused with the digits 1, 6, 7, and 9. Bottom: The Mahalanobis distance does not perform very well, with an average error of 0.49. Most digits tend to get confused with the digit 8 as well as with digits 3, 4, and 5. ....
...... 38
5.2 Confusion matrices plotted for multiclass classification with an 8×8 grid of 64 control points using σ = 3, r = 0.1, using different classification criteria. Top Left: Data matching criterion has an average error rate of 0.1221. Top Right: Magnitude of the momenta vectors for classification gives an average error rate of 0.3671. Confusion with the digits 1, 6, 7, and 9 occurs frequently. Bottom: The Mahalanobis distance has an average error of 0.4936. Most digits tend to get confused with the digits 4 and 8 as well as with 3 and 5. .......... 40
5.3 Confusion matrices plotted for multiclass classification using the gradient of the images as the image feature with an 8×8 grid of 64 control points using σ = 3, r = 0.1, using different classification criteria. Left: Data matching criterion has an average error rate of 0.49. Middle: Magnitude of the momenta vectors for classification gives an average error rate of 0.62. Right: The Mahalanobis distance has an average error of 0.5826. .......... 42
6.1 Error rate for various classification methods using the ZIP code digits database. Data taken from [9]. .......... 46

CHAPTER 1 INTRODUCTION

1.1 The problem of statistics on high-dimensional image data

Image data are intrinsically high dimensional. In a dataset of such images, the possible variability is on the order of the size of the images. In many finite databases, however, the underlying variability can be described with a much lower dimensionality. The variability is constrained by the characteristics of the dataset itself. Medical image datasets tend to have smooth spatial variations, characteristic of the fact that, locally, pixels do not move independently. Handwritten image datasets have variations only along the contours of the writing. Surveillance datasets have variations along principal paths of travel, restricted to certain parts of the image.
In order to perform statistical analysis on such data, it is important to parametrize the variability in the dataset efficiently, reducing the dimensionality while maintaining the desired information. Understanding the constraints posed by the variational characteristics of the data helps in this parametrization. To perform statistics, particularly binary and multiclass classification, we will use a registration framework that maintains the information needed for the task while reducing the size of the descriptor.

1.2 Velocity field as a feature

Landmark-matching-based image registration between two images using the technique of large deformation diffeomorphic metric matching [8] finds a smooth velocity field that warps one image onto the other by minimizing a smoothness constraint and the L2 norm between them. The smoothness constraint is characteristic of the dataset we have chosen. This underlying velocity field is representative of the variation between the two images. The velocity field is described completely by the momenta vectors at landmark positions; the variation required to warp one image to another can thus be encoded in the momenta vectors at the landmarks. Since the velocity field gives us a measure of the change required to warp one image to another, it can be used as a feature to measure variation between images.

Please note that we will be using the small deformation approximation to the large deformation framework. This implies that the estimated deformations may not be diffeomorphic. This is done in order to simplify the process and reduce simulation times for gradient descent. The small deformation framework can easily be swapped for the large deformation diffeomorphic model.

1.2.1 Datasets used

We have mainly worked with two datasets. The first is a set of synthetic snowman data. Some examples of the dataset are shown in Figure 1.1. This dataset consists of only 4 images and served as a good working dataset for testing the methods initially.
The second dataset, which has been used for most of our work, is the ZIP code digits handwritten database taken from [6]. It consists of two repositories, one for training and one for testing. The dataset consists of normalized handwritten digits automatically scanned from envelopes by the U.S. Postal Service. The original scanned digits are binary and of different sizes and orientations; they have been deslanted and size normalized, resulting in 16×16 gray-scale images. There are 7291 images in the training dataset and 2007 images in the test dataset. Table 1.1 shows the digits per class in each of the datasets. Each line in the data file consists of the digit id (0-9) followed by 256 gray-scale values. Each gray-scale value lies between -1 and 1. The dataset is available at [4].

Figure 1.2 shows a few images of the digits from the training dataset. As can be seen, the digits are size normalized to fit the 16×16 boundaries and are also deslanted and centered. The training dataset is considerably large and hence can be assumed to contain most of the variation that people introduce while writing digits. It is certainly an exhaustive database to train from. In order to test the classifier, an additional test database has been provided. As seen in Figure 1.3, the images from this dataset are considerably difficult to classify. Consider the image of the handwritten 2: the loop at the base makes it considerably different from a handwritten 2 without the loop. The digit 4 looks similar to a 9. In fact, as a tip from the dataset providers, the test dataset is notoriously difficult, and an error rate of 2.5% is excellent.

Figure 1.1. Synthetic snowman dataset.

Table 1.1. Distribution of ZIP code digits.
Digit:  0     1     2    3    4    5    6    7    8    9    Total
Train:  1194  1005  731  658  652  556  664  645  542  644  7291
Test:   359   264   198  166  200  160  170  147  166  177  2007

Figure 1.2. Sample images from ZIP code digits training dataset.

Figure 1.3. Sample images from ZIP code digits test dataset.
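The file layout described above (one sample per line: a digit label followed by 256 gray values) is simple to parse. The following is a minimal illustrative sketch, not the thesis code; the function name `load_zip_digits` is our own.

```python
import numpy as np

def load_zip_digits(path_or_file):
    """Load a USPS ZIP-code digits file: one sample per line,
    a digit label (0-9) followed by 256 gray values in [-1, 1]."""
    raw = np.loadtxt(path_or_file)
    raw = np.atleast_2d(raw)               # handle a single-line file
    labels = raw[:, 0].astype(int)         # first column is the digit id
    images = raw[:, 1:].reshape(-1, 16, 16)  # 256 values -> 16x16 image
    return labels, images
```

`np.loadtxt` accepts either a filename or a file-like object, so the same function works on the raw repository files or on an in-memory buffer.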
CHAPTER 2 IMAGE REGISTRATION FRAMEWORK

In this chapter, we derive the framework used to register two images. Since the handwritten digits tend to have smooth variations along the contours, we need to place smoothness constraints on the velocity field that registers images. Large deformation diffeomorphic metric matching is a technique which has been used to estimate a smooth velocity field that registers two images [1]. The success of this technique lies in the estimation of a deformation field with certain smoothness properties. Also, the control point parametrization gives us the feature for comparing deformations.

2.1 Registering I_src with I_tar

Let φ be an intensity-preserving deformation field that maps each point in the source image domain to the target image domain. Let I_src and I_tar be continuous functions in the source and target domains. Thus, the objective of registering the source image with the target is minimizing the L2 norm between these images:

A(y) = ||I_src ∘ φ⁻¹ − I_tar||²  (2.1)
     = Σ_{k=1}^{M} (I_src(φ⁻¹(y_k)) − I_tar(y_k))²  (2.2)

Let c = {c_1, ..., c_N} be a finite set of control points. The deformation field is parametrized by momenta vectors at the control points, α = {α_1, ..., α_N}. The velocity field, being continuous, can be found at any point x in the source image domain by using a Gaussian interpolating kernel:

v(x) = Σ_{i=1}^{N} K(x, c_i) α_i  (2.3)

where

K(x, y) = exp(−|x − y|² / σ²)  (2.4)

The transform in this small deformation setting can be seen as φ(x) = x + v(x). It should be noted that the inverse of the field is approximated as φ⁻¹(y_k) = y_k − v(y_k). The regularity term can be defined as the kinetic energy of the deformation field, which makes sure that the field is regularized:

||v||² = Σ_{i=1}^{N} Σ_{j=1}^{N} α_i^T K(c_i, c_j) α_j  (2.5)

Now we can write the objective function that we minimize in order to match the source image to the target image:

E(c, α) = ||I_src ∘ φ⁻¹ − I_tar||² + γ ||v||²  (2.6)
        = A(y) + γ ||v||²  (2.7)

γ is the trade-off between the image fidelity term and the regularization term.
The higher its value, the smoother the velocity field. We perform an unconstrained line search using the gradient descent algorithm [11] on this objective to get the optimal values of the momenta vectors as well as the control points. The gradient with respect to the momenta vectors can be written as:

(1/2) ∇_{α_i} E = −Σ_{k=1}^{M} K(c_i, y_k) (I_src(y_k − v(y_k)) − I_tar(y_k)) ∇I_src(y_k − v(y_k)) + γ Σ_{j=1}^{N} K(c_i, c_j) α_j

Although we will not be using the gradient update for finding the optimal control point positions, since we need a common basis for comparison, we mention the gradient of the objective with respect to the control point positions:

(1/2) ∇_{c_i} E = (2/σ²) Σ_{k=1}^{M} K(c_i, y_k) (I_src(y_k − v(y_k)) − I_tar(y_k)) (∇I_src(y_k − v(y_k))^T α_i) (c_i − y_k) − (2γ/σ²) Σ_{j=1}^{N} K(c_i, c_j) (α_i^T α_j) (c_i − c_j)

The gradient descent uses a convergence criterion, the breaking ratio. The breaking ratio is defined as the ratio of the drop in objective function value (referred to as the energy) in the current iteration to the drop in energy from the start of the gradient descent process. If E_0 is the value of the objective function at the beginning of gradient descent and E_n is the value at the nth iteration, then the breaking ratio at the nth iteration is defined as:

Br_n = (E_{n−1} − E_n) / (E_0 − E_n)

The gradient descent is terminated when the breaking ratio falls below a predefined threshold Br_th, i.e., when Br_n ≤ Br_th.

2.2 Results and conclusion

Figure 2.1 shows the result of deforming the image of a digit 6 to match another image of a handwritten 6. The two images have been taken from the training dataset. The image on the top left is the source image, which we map to the target image on the top right. The registration is performed using a dense grid of 256 control points distributed regularly over the image. The objective energy keeps decreasing, and the rate of decrease is small after about 8 iterations.
The total energy is the weighted sum of the L2 difference between the deformed and target images and the regularity term governing the smoothness of the velocity field. The gradient descent stops after 21 iterations, having satisfied the convergence criterion. The gradient descent is stable with the given parameter settings. The control parameters that affect the result of the gradient descent are the number of control points N, the width of the Gaussian kernel σ, and the regularity trade-off γ. The control parameters that affect the rate of convergence are the breaking ratio threshold Br_th and the step size of the gradient descent.

Figure 2.1. Results of deformation. Top Left: Source image of digit 6. Top Right: Target image of digit 6. Bottom Left: Deformed image with 256 momenta vectors overlain on the control points. Bottom Right: The value of the objective against the iterations of the gradient descent.

CHAPTER 3 ATLAS ESTIMATION

In order to perform population analysis for tasks such as classification, we need to obtain a representative template of a class. This can be obtained by jointly optimizing an objective function, discussed here, across the entire population of a class. We use the optimization framework discussed in [3]. The representative template image is the mean given by the L2 norm on the space of the images of the class; this mean is called a template. The atlas is a set comprising the template and a collection of deformation vectors that register the template with each of the images of the class in the dataset. In this step, we jointly optimize the template image and the deformation momenta that map each image in the class to the template to get an optimum atlas. We are also motivated to reduce the number of control points in order to reduce the size of the feature descriptor. A technique to compute an optimal set of landmarks is also discussed here.
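The building blocks of the registration energy in Chapter 2 — the Gaussian kernel (2.4), the interpolated velocity field (2.3), and the regularity term (2.5) — which the atlas objective below reuses per subject, can be sketched in a few lines of NumPy. This is an illustrative sketch under assumed 2-D array shapes, not the thesis implementation.

```python
import numpy as np

def gaussian_kernel(x, y, sigma):
    """K(x, y) = exp(-|x - y|^2 / sigma^2), eq. (2.4)."""
    return np.exp(-np.sum((x - y) ** 2, axis=-1) / sigma ** 2)

def velocity(x, controls, momenta, sigma):
    """v(x) = sum_i K(x, c_i) alpha_i, eq. (2.3).
    x: (M, 2) query points; controls, momenta: (N, 2) arrays."""
    K = gaussian_kernel(x[:, None, :], controls[None, :, :], sigma)  # (M, N)
    return K @ momenta                                               # (M, 2)

def regularity(controls, momenta, sigma):
    """||v||^2 = sum_ij alpha_i^T K(c_i, c_j) alpha_j, eq. (2.5)."""
    K = gaussian_kernel(controls[:, None, :], controls[None, :, :], sigma)
    return float(np.sum(K * (momenta @ momenta.T)))
```

With a single control point carrying momentum α, the velocity at the control point itself is exactly α (since K(c, c) = 1), and the regularity term reduces to |α|².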
3.1 Derivation of the atlas formation process

If I_0 is the template image of the class, c the set of control point vectors, and α_s the set of deformation vectors that register the template with image I_s of the class, then the objective function that we minimize in order to find the optimal template can be written as:

E(I_0, c, α_1, ..., α_{N_s}) = Σ_{s=1}^{N_s} { A_s(y) + γ ||v_s||² }  (3.1)

where

A_s(y) = ||I_s ∘ φ_s⁻¹ − I_0||²  (3.2)
v_s = velocity field registering I_s with the template image I_0  (3.3)

The gradient of the objective with respect to the momenta is:

∇_{α_s} E = ∇_{α_s} E_s  (3.4)

where N_s is the total number of images of class l.  (3.5)

Also, the gradient of the objective with respect to the template image I_0, which we use to get a better estimate of the template, is the sum of the gradients ∇_{I_0} A_s over the N_s images. It can be shown to be equal to the sum of the splatted versions of the residual images:

∇_{I_0} E = ∇_{I_0} A(y(0))  (3.6)
          = Σ_{s=1}^{N_s} splat of (I_0 ∘ φ_s⁻¹ − I_s) into the template domain  (3.7)

We perform a gradient descent on the momenta vectors and the template image simultaneously in order to get the optimal template image and optimal deformation momenta vectors. This is a straightforward method of performing gradient descent on the given objective to find the template image. Note that we have not mentioned the update to the control point positions since we will not use it hereafter.

3.1.1 Atlas formation using iterative averaging

Another method to update the atlas is iterative averaging. Here we take the objective as defined in 3.1 but do not compute the gradient with respect to the template image. The objective is not a function of the template image and hence is defined as in 3.8:

E(c, α_1, ..., α_{N_s}) = Σ_{s=1}^{N_s} { A_s(y) + γ ||v_s||² }  (3.8)

where

A_s(y) = ||I_0 ∘ φ_s⁻¹ − I_s||²  (3.9)
v_s = velocity field registering I_s with the template image I_0  (3.10)

In order to compute the template image, we start with an initial estimate of the template image I_0 at iteration 0.
Next, we register all the images in the class with this estimate of the class template. The average of the deformed images is the new estimate of the template, as given in 3.11:

I_0 = (1/N_s) Σ_{s=1}^{N_s} I_s ∘ φ_s  (3.11)

This process is repeated until the drop in the objective defined in 3.8 falls below the predefined breaking ratio threshold.

3.1.2 Results

We have run the atlas formation procedure using both splatting and averaging. Figure 3.1 shows the results of running atlas formation with both techniques for two values of the kernel width σ. As can be seen from Figure 3.1, the averaging technique tends to shrink the template compared to the actual images in the dataset. This behavior is possibly due to the regularization term reducing the magnitude of the momenta vectors, leading to an update of the template that is smaller than it should be. Due to the shrunken template images, the result of the averaging technique was not appealing, since the templates are not characteristic means of the images. The result of splatting is better and has been prescribed in [3] as well. The only downside of splatting is that the template images can have negative values. This is due to large gradient steps which tend to decrease the overall objective but result in negative values for some pixels. This behavior was remedied by an adaptive change in the step size, a form of gradient descent line search described in [11].

Figure 3.1. Results of atlas formation. Top Left: Template images in atlas formed using averaging for σ = 1. Top Right: Template images in atlas formed using averaging for σ = 3. Bottom Left: Template images in atlas formed using splatting for σ = 1. Bottom Right: Template images in atlas formed using splatting for σ = 3.

For further discussion, let us denote α^l as the set of all the momenta vectors that deform the mean image of class l to the images of class l in the dataset.
Let I^l_i denote image i of class l in the dataset, μ^l denote the mean image of class l, and φ_α be the deformation field parametrized by the momenta vectors α. Thus, we can define:

α^l = { α_i | μ^l(φ_{α_i}(x)) ≈ I^l_i(x) }  (3.12)

Note that the deformed mean image μ^l ∘ φ_{α_i} is not exactly equal to the target image I^l_i, since the registration process does not match the two exactly.

3.2 Computing an optimal set of landmarks

We will be classifying images based upon the deformation required to match the template image of each class with the test image. We can compare deformation fields using the momenta vectors that parametrize them. In order to compare the momenta vectors, they need to be defined at the same set of control points. Thus, we cannot move the control points in any step of the entire process. We can always place the control points on a regularly spaced grid. However, since we cannot move the control points, there are two issues we face. The first is that we need a reasonably dense distribution of control points in order to capture the variations in the data; if we increase the number of control points to increase this density, the size of the feature vector goes up. The second is that we need to capture any deformation possible from any template source image to any image in the database. Capturing a deformation implies that any region of the image domain that would require a deformation to match some source image in the dataset or atlas to some other image in the dataset needs a control point in that region. Since we are interested only in deformations within images of a single class and not outside, we need to find all the possible deformations that can occur between the template (which we can approximate with the mean image) and all the images of a class in the dataset.
To find this, we place control points at all the grid locations in the image and register the mean image of a class with each image of the class in the dataset. The variance of the momenta vectors tells us which control points tend to have the most varying momenta; such points are valid candidates for being control points. We find such high-variance points for each class in the entire dataset and take a union of all such sets to get the final set of control points.

The process is as follows. Let us denote α^l as the set of all the momenta vectors that deform the mean image of class l to the images of class l in the dataset, assuming that we have a control point at each grid element, or pixel, of the image. Let I^l_i denote image i of class l in the dataset, μ^l denote the mean image of class l, and φ_α the deformation field parametrized by the momenta vectors α. Thus, we can define the set α^l as in 3.13:

α^l = { α_i | μ^l(φ_{α_i}(x)) ≈ I^l_i(x) }  (3.13)

The deformed mean image μ^l ∘ φ_{α_i} is not exactly equal to the target image I^l_i, since the registration process does not match the two exactly. The L2 norm of the variance of the momenta vectors, defined for each grid point (pixel) over the set α^l, can be obtained as:

||σ²_l||(x) = ||E[(α_i − E[α_i])²]||(x)  (3.14)

where E acts on all the momenta vectors α_i for class l.  (3.15)

This value is defined for each pixel position x over the image domain for each class l. The images in Figure 3.2 show the L2 norm of variance for different values of the kernel width σ used for the interpolation kernel K(x, y) defined in 2.4. As can be seen from the first image in Figure 3.2, the smaller kernel tends to give us a better judgment of which pixels have high variance. Intuitively, the variance should lie on the boundary of the main contour of any handwritten digit, which is what we see for smaller values of σ.
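The per-pixel variance-norm map of eq. 3.14, together with a simple 8-neighbor local-maximum test for selecting candidate control points, can be sketched as follows. This is an illustrative sketch under assumed array shapes (momenta stored as a (num_images, H, W, 2) stack), not the thesis code.

```python
import numpy as np

def variance_norm_map(momenta_stack):
    """momenta_stack: (num_images, H, W, 2) momenta at every pixel.
    Returns the (H, W) L2 norm of the per-pixel momenta variance (eq. 3.14)."""
    var = momenta_stack.var(axis=0)        # (H, W, 2) component-wise variance
    return np.linalg.norm(var, axis=-1)    # (H, W) norm over the two components

def detect_peaks(vmap):
    """Boolean (H, W) mask: pixels strictly greater than all 8 neighbors."""
    H, W = vmap.shape
    padded = np.pad(vmap, 1, constant_values=-np.inf)
    neighbors = np.stack([padded[1 + dy:H + 1 + dy, 1 + dx:W + 1 + dx]
                          for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                          if (dy, dx) != (0, 0)])
    return vmap > neighbors.max(axis=0)
```

The final control-point set would then be the union (logical OR) of the peak masks computed per class, matching the union step described in the text.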
To find the optimum positions of the control points, we perform a form of discrete peak detection that tells us which pixels have the largest variation of deformation vectors for each class. In this process, for each grid location, we check whether it has a value greater than all its neighbors; if so, it is a valid peak. Figure 3.3 shows the result of the peak detection operation.

Figure 3.2. Variance of norm over atlas. Top to bottom, left to right: L2 norm of variance of α_i for the interpolating kernel width σ = 0.5, 1, 2, 3, and 4. The variance tends to be more distributed over the entire image domain as we increase the value of σ.

Figure 3.3. Peaks of variance norm. Top to bottom, left to right: Variance peaks for the interpolating kernel width σ = 0.5, 1, 2, 3, and 4. The peaks seem to hug the contours from the outside.

As can be seen in Figure 3.3, the number of potential candidates for control points decreases as the kernel width increases. For σ = 0.5, we have the highest number of control points. We have repeated the peak-finding procedure on the sum of the L2 norm of variance images for each class to get the final set of control points, which is a union of the peaks found in the earlier step. The results of performing this step are shown in Figure 3.4. As can be seen, the peaks found with lower values of σ tend to be better distributed, and their number is larger.

Figure 3.4. Union of variance peaks across classes. Top to bottom, left to right: Peaks found as a union of the peaks from the previous steps.

CHAPTER 4 BINARY CLASSIFICATION

To verify the performance of momenta as the feature vector, we will construct a binary classifier. It should be noted that in order to compare velocity fields mapping the template images of different classes to the test image, the control points need to be at the same locations. In this chapter, we will discuss the binary classifier used to distinguish between two classes.
We will discuss the various classification criteria, the receiver operating characteristics used to compare binary classifier performance, the effect of varying the classifier and feature parameters, and the effect of changing the size of the training dataset.

4.1 Classification criteria

We have experimented with three classification criteria. Each criterion uses a metric that defines a distance from the decision boundary. The metrics, and the criteria they imply, are as follows:

1. Mahalanobis distance: We assume that the training data for class l is the deformation field that registers the template of class l with each of the subject images in the training dataset of class l. Thus, the training data for class l consist of a set A_l = {α_i}_l as described in equation 3.13. To classify the test image, we register the template of class l with the test image using the technique described in Chapter 2, giving us the deformation momenta α_test^l. Let S denote the covariance matrix of the set of deformation momenta vectors of class l in the set A_l, defined in equation 4.1:

S = E[(A − E(A))(A − E(A))^T]    (4.1)

where

A = (α_1, α_2, …, α_n)^T    (4.2)

stacks one momenta vector per row. Then the Mahalanobis distance of the test deformation α_test^l from the mean deformation ᾱ_l is given by:

M_l(α_test^l) = sqrt( (α_test^l − ᾱ_l)^T S_l^{−1} (α_test^l − ᾱ_l) )    (4.3)

The Mahalanobis distance tells us how many standard deviations the test deformation is from the mean deformation of the class; the closer the test deformation is to the mean deformation of a class, the more likely it is to belong to that class. The Mahalanobis distance is a normalized metric: it is normalized by the variance of the training set A of each class, as seen in 4.1. Thus, the classification criterion we use is the smallest Mahalanobis distance, which can be written as in equation 4.4:

ŷ = argmin_{l ∈ {1,2}} M_l(α_test^l)    (4.4)

The Mahalanobis distance accounts for the correlations within a dataset, as described in [10].
It is invariant to scale and is often referred to as a normalized Euclidean distance.

2. Magnitude of momentum: If a test image belongs to a class, the deformation that maps the template image of that class to it should be small. This implies that the L2 norm of the momenta characterizing it should be small compared to the deformation required to map the template image of any other class to it. Thus, with the notation of criterion 1 above, the classification criterion can be written as in equation 4.5:

ŷ = argmin_{l ∈ {1,2}} ‖α_test^l‖_mag    (4.5)

where

‖α_test^l‖_mag = sqrt( Σ_i ‖α_i‖_{L2} )    (4.6)

This is written assuming a binary classifier between the two classes l being compared.

3. Using the data matching term: The data matching criterion uses the accuracy of the registration as a means to classification. If the template of a class registers accurately with the test image, the test image belongs to that class. Let the deformation obtained by registering the template of class l with the test image be a function of the momenta, φ(α_test^l). Applying this deformation to the template gives an image closely matched to the test image. Based on how well the two images match, we propose the classification criterion in equation 4.7:

ŷ = argmin_{l ∈ {1,2}} ‖I_template^l ∘ φ(α_test^l) − I_test‖_{L2}    (4.7)

This criterion is based on the quality of the registration: if the deformation found results in the template image closely matching the test image, the test image belongs to the class. To measure how closely the test image matches the deformed template image, we take the L2 norm of the difference between the two, as in equation 4.7. This is the criterion that has yielded the best results, as can be seen in the following sections.
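The three criteria can be sketched concretely with flattened momenta vectors as plain numpy arrays. This is our own illustration, not the thesis implementation: the function names are hypothetical, and the small ridge added to the covariance (to keep it invertible when there are fewer samples than dimensions) is our assumption.

```python
import numpy as np

def mahalanobis(alpha_test, A):
    """Criterion 1 (eqs. 4.1-4.3): distance of a test momenta vector from
    a class whose training matrix A has one flattened momenta vector per
    row. A tiny ridge keeps S invertible when rows < columns."""
    mean = A.mean(axis=0)
    S = np.cov(A, rowvar=False) + 1e-6 * np.eye(A.shape[1])
    diff = alpha_test - mean
    return float(np.sqrt(diff @ np.linalg.solve(S, diff)))

def momentum_magnitude(alpha_test):
    """Criterion 2 (eq. 4.6): sqrt of the summed L2 norms of the
    per-control-point momenta (pairs of vector components)."""
    per_point = np.linalg.norm(alpha_test.reshape(-1, 2), axis=1)
    return float(np.sqrt(per_point.sum()))

def data_matching(deformed_template, test_image):
    """Criterion 3 (eq. 4.7): L2 residual of deformed template vs. test."""
    return float(np.linalg.norm(deformed_template - test_image))

# Classify by the smallest criterion value across classes (eq. 4.4).
rng = np.random.default_rng(1)
A1 = rng.normal(0.0, 1.0, size=(30, 4))   # momenta of class-1 registrations
A2 = rng.normal(5.0, 1.0, size=(30, 4))   # momenta of class-2 registrations
x = np.zeros(4)                           # test momenta near class 1's mean
label = int(np.argmin([mahalanobis(x, A1), mahalanobis(x, A2)]))
```

The same argmin pattern applies to all three criteria; only the distance function changes.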
4.2 Receiver operating characteristic plots

The performance of binary classifiers can be visualized and measured using receiver operating characteristic plots, referred to as ROC curves from here onwards. They can be used for the selection of the internal parameters of classifiers. In our case, we will vary the parameters σ, γ_r, the number of control points, and the number of training samples to find their effect on the classifier. The ROC curve is plotted by changing the classification threshold between the classification distances of the binary classifier. The traditional binary classifier can be written as in 4.8:

ŷ = 1  if d(α_test^1) ≤ d(α_test^2)    (4.8)
  = 2  if d(α_test^1) > d(α_test^2)    (4.9)

Here, d is any of the distance metrics discussed in section 4.1. In order to plot the performance of the classifier, we introduce a threshold term τ, which changes the classifier equation to 4.10:

ŷ = 1  if d(α_test^1) − τ ≤ d(α_test^2)    (4.10)
  = 2  if d(α_test^1) − τ > d(α_test^2)    (4.11)

As can be seen in the above equation, varying the threshold shifts the linear classifier boundary. When testing the classifier, plotting the true positive rate (TPR) against the false positive rate (FPR) for various values of τ gives the ROC curve, which quantifies the quality of the classifier. Finally, the classification criterion can be given by 4.12:

ŷ = 1  if d(α_test^1) − d(α_test^2) − τ ≤ 0    (4.12)
  = 2  if d(α_test^1) − d(α_test^2) − τ > 0    (4.13)

ROC graphs are two-dimensional graphs in which the TP rate is plotted on the Y axis and the FP rate on the X axis. Each point on the ROC graph represents the performance of a single classifier. If we vary τ in the above equations, we get an ROC curve that tells us the performance of the classifier for a given set of parameters. The graph denotes the relative trade-off between true positives and false positives.
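A minimal sketch of how such a curve is traced, assuming per-sample distances d1, d2 to the two class templates and true labels in {1, 2}; the thresholded rule follows equation 4.12 with class 1 taken as the positive class (the function name and toy data are illustrative):

```python
import numpy as np

def roc_points(d1, d2, labels, taus):
    """Sweep tau in  y = 1 if d1 - d2 - tau <= 0  and record (FPR, TPR)."""
    d1, d2, labels = map(np.asarray, (d1, d2, labels))
    score = d1 - d2                            # decision score per sample
    n_pos = max(int(np.sum(labels == 1)), 1)   # class 1 is positive
    n_neg = max(int(np.sum(labels == 2)), 1)
    fpr, tpr = [], []
    for tau in taus:
        pred_pos = score - tau <= 0
        tpr.append(int(np.sum(pred_pos & (labels == 1))) / n_pos)
        fpr.append(int(np.sum(pred_pos & (labels == 2))) / n_neg)
    return np.array(fpr), np.array(tpr)

# Perfectly separable toy distances: class-1 samples are much closer to
# template 1 than to template 2, and vice versa.
d1 = [0.1, 0.2, 0.9, 0.8]
d2 = [0.9, 0.8, 0.1, 0.2]
y = [1, 1, 2, 2]
fpr, tpr = roc_points(d1, d2, y, taus=np.linspace(-1.0, 1.0, 5))
```

For separable data the sweep traces the ideal curve: (0, 0) at a very negative τ, through (0, 1), and up to (1, 1) at a very positive τ.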
The closer the curve to the top left corner, and the larger the area under the ROC curve, the better the performance of the classifier. Attractive properties of ROC graphs include their insensitivity to class skew and the ease of comparing classifiers via the area under the curve as well as ROC curve averaging [5]. We use the algorithms for computing the area under the curve and the average of the curve discussed in [5].

The binary classifier using the metrics discussed in section 4.1 has been implemented for classification between two pairs of digits. The first binary classifier is between the digits 1 and 3, while the second set of results concerns the binary classifier between the digits 2 and 5. The main motive of this experiment is to decide the optimal values of σ and γ_r to be used for classification. The effects of varying the number of training samples and the dimensionality of the deformation descriptor, which is the number of control points, are also discussed. We will use ROC plots to measure and compare the performance of the classifiers. Note that all of the following tests use a regular grid distribution of control points and the Mahalanobis distance metric from equation 4.4 for classification.

4.3 Effect of varying σ and γ_r

Let us plot the ROC curves for the binary classifier between digits 1 and 3 for σ = {1, 2, 3, 4, 5} and γ_r = {0.001, 0.01, 0.1, 0.5, 0.9}. Figures 4.1 and 4.2 show the ROC plots for the classifier between digits 1 and 3 varying σ and γ_r. This plot makes it difficult to compare the various curves and find a good operating point. Instead, we can plot a separate family of curves, one per value of γ_r, for each value of σ, in order to find which value of σ is ideal. This can be done by finding the value of σ for which the 2D ROC curves have low variance. Although a variance metric could be used for this, a visual inspection of the ROC curves gives a good idea of the correct value of σ to select.
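The area-under-the-curve comparison from [5] reduces to integrating the swept operating points. A trapezoidal-rule sketch follows; appending the (0, 0) and (1, 1) endpoints so a partial sweep still spans the unit FPR interval is our simplification, not Fawcett's exact algorithm:

```python
import numpy as np

def roc_auc(fpr, tpr):
    """Trapezoidal area under an ROC curve given its operating points."""
    f = np.concatenate([[0.0], np.asarray(fpr, float), [1.0]])
    t = np.concatenate([[0.0], np.asarray(tpr, float), [1.0]])
    order = np.argsort(f, kind="stable")   # integrate left to right
    f, t = f[order], t[order]
    return float(np.sum(0.5 * (t[1:] + t[:-1]) * np.diff(f)))

perfect = roc_auc([0.0], [1.0])   # single operating point at (0, 1)
chance = roc_auc([0.5], [0.5])    # a point on the diagonal
```

A perfect classifier gives area 1.0 and a chance-level one gives 0.5, which is the scale on which the AUC values reported below should be read.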
Similarly, we can find a good value of γ_r. The curves that led us to the value σ = 2 are shown in Figure 4.3. Although the curves for the other values of σ are not shown here, it can be seen that the variance of the five curves in Figure 4.3 is small; it is smaller than the variance of the curves for the other values of σ.

Figure 4.1. ROC curves for classification between digits 1 and 3.

Figure 4.2. Magnified version of the plot in Figure 4.1, magnified to FPR between 0 and 0.1 and TPR between 0.9 and 1.

Figure 4.3. ROC curves for classification between digits 1 and 3 for σ = 2. Left: ROC curve. Right: Magnified version of the same plot, magnified to FPR between 0 and 0.1 and TPR between 0.9 and 1.

This led to the conclusion that σ = 2 is a good operating point for the binary classifier. We use the same technique of visual inspection to find the value of γ_r. Plotting the ROC curves over γ_r for different values of σ and comparing the variance tells us that γ_r = 0.01 has the lowest variance; Figure 4.4 shows the ROC curves used to reach this conclusion. Thus, using Figures 4.3 and 4.4, we can conclude that the binary classifier between digits 1 and 3 has an ideal operating point at σ = 2 and γ_r = 0.01.

For the digits database, we wanted to test the hypothesis that the same kernel width and regularity penalty do not give similar results for binary classifiers between other pairs of classes. This was done by building a binary classifier between the digits 2 and 5 and repeating the above tests. Figures 4.5 and 4.6 show the ROC curves plotted for various values of σ and γ_r for the binary classifier between digits 2 and 5. A preliminary inspection of the results tells us that the binary classifier between digits 2 and 5 does not perform as well as the classifier between digits 1 and 3. Plotting over the values of σ and γ_r as before, we concluded that σ = 3 and γ_r = 0.1 is a good operating point for this binary classifier.
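The "low variance across γ_r" test used above for picking σ can also be made quantitative. A possible sketch (our own formulation, not the thesis procedure): resample each ROC curve of a family onto a common FPR grid and take the mean pointwise variance of the TPR values, preferring the σ with the smallest spread.

```python
import numpy as np

def curve_spread(curves, grid=None):
    """Mean pointwise variance of a family of ROC curves.

    curves: list of (fpr, tpr) pairs with monotone FPR, one curve per
    gamma_r value. A small result means the classifier is insensitive
    to gamma_r at that kernel width.
    """
    if grid is None:
        grid = np.linspace(0.0, 1.0, 50)
    resampled = np.stack([
        np.interp(grid, np.asarray(f, float), np.asarray(t, float))
        for f, t in curves
    ])
    return float(resampled.var(axis=0).mean())

identical = [([0, 1], [0, 1]), ([0, 1], [0, 1])]        # same curve twice
spread_out = [([0, 1], [0, 1]), ([0, 1], [0.5, 1.0])]   # curves disagree
```

Ranking the candidate σ values by this single number replaces the visual inspection with a reproducible criterion.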
Hence, the same values of kernel width and regularity penalty do not give good classification results across class pairs. In order to verify our results for the binary classifier between digits 1 and 3, we compute the area under the curve; the larger the area under the curve, the higher the accuracy. Figure 4.7 shows the plot for all the values of σ and γ_r along with the area under the curve. As we can see, σ = 2, γ_r = 0.01 has AUC = 0.98729, which is high and hence reinforces our results.

We can conclude that the optimal σ corresponds to the mean width of the curves of the handwritten digits. Hence, the optimal σ for the classifier between digits 1 and 3 has the lower value of 2, while the classifier between digits 2 and 5 has an optimal value of 3. This is because the images of the digit 2 in the dataset consistently have a small loop on the lower left, which increases the average stroke width of the digit.

4.4 Effect of varying the dimensionality of the deformation descriptor

To test the effect of the dimensionality of the shape descriptor on classification, we plot the ROC curves with the same setup as before but with a varying number of control points. We form the atlas of the two digits using 4 × 4 = 16, 5 × 5 = 25, up to 8 × 8 = 64 control points and verify the classification error using the ROC graphs. Figure 4.8 shows the resulting plot, which tells us that there is an optimal density

Figure 4.4. ROC curves for classification between digits 1 and 3 for γ_r = 0.01. Left: ROC curve. Right: Magnified version of the same plot, magnified to FPR between 0 and 0.1 and TPR between 0.9 and 1.

Figure 4.5. ROC curves for classification between digits 2 and 5.

Figure 4.6. Magnified version of the plot in Figure 4.5, magnified to FPR between 0 and 0.1 and TPR between 0.9 and 1.

Figure 4.7. ROC curves for classification between digits 1 and 3 along with the area under the curve, denoted by AUC.

Figure 4.8.
ROC curves for the binary classifier between digits 1 and 3 for different dimensionalities of the classifier.

of control points that gives us good classification results. If we increase or decrease this density of control points, the error rate increases. This density of control points is directly related to the kernel width σ and the regularity γ_r. This can be inferred from the fact that we used a regular distribution of points on a 5 × 5 grid of 25 control points to find the optimal values σ = 2 and γ_r = 0.01, and that the optimal number of control points is close to 25, as can be seen in Figure 4.9. The graph shows the trend of the number of control points against the area under the ROC curve; it tells us that the area under the ROC curve is highest for the 6 × 6 regular grid of 36 control points. To confirm the relation of the density of control points with σ and γ_r, we would have to find the optimal values of σ and γ_r for a different density of control points, which we do not attempt here.

In order to see whether the results are consistent, consider the ROC curves plotted for the binary classifier between digits 2 and 5. We see a slightly different trend for the area under the ROC curve in the bottom plot of Figure 4.10, but there is still an optimal number of control points, which is 25 according to this graph; the data point at 36 control points is an outlier in this case. This leads us to the conclusion that the classifier accuracy is high for an optimal number of control points. If we increase or decrease the size of the feature vector, which is the deformation descriptor, the classifier accuracy decreases. Hence, in order to obtain good classifier accuracy, we need to optimize the control point density, i.e., the number of control points and their placement in the domain.
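The regular grids swept here (4 × 4 up to 8 × 8) are straightforward to generate. A sketch, assuming the control points sit at evenly spaced interior positions of an H × W image (the exact placement used in the thesis experiments may differ):

```python
import numpy as np

def regular_grid(n, image_shape):
    """(y, x) positions of an n x n regular grid of control points,
    excluding the image border."""
    H, W = image_shape
    ys = np.linspace(0.0, H - 1, n + 2)[1:-1]   # drop the two border rows
    xs = np.linspace(0.0, W - 1, n + 2)[1:-1]   # drop the two border cols
    return np.array([(y, x) for y in ys for x in xs])

pts = regular_grid(5, (28, 28))   # the 5 x 5 = 25 point grid
```

Sweeping n from 4 to 8 and re-running the ROC analysis at each density reproduces the kind of dimensionality study described above.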
4.5 Effect of varying the number of training examples used in atlas formation

We want to find the effect of the sparsity of the deformation descriptor on the classification rate as the number of training samples changes. To test this, let us look at the effect of changing the number of training examples for a fixed number of control points. As before, we test on a regular grid of 5 × 5 = 25 control points. We change the number of training examples in this setting for σ = 2 and γ_r = 0.01 and observe the effect on the ROC curves, as shown in Figure 4.11. From Figure 4.11, the effect of the number of training examples on classifier accuracy is not entirely clear. To clarify this, we plot the area under each ROC curve against the number of training examples; Figure 4.12 shows this plot, which tells us that as the number of training examples increases, the classifier gives better results. No clearer relationship can be identified beyond the fact that the classification

Figure 4.9. Effect of changing the number of control points on the area under the ROC curve.

Figure 4.10. Effect of changing the dimensionality of the classifier. Top: ROC plots for digits 2 and 5 with changing classifier dimensionality. Bottom: Effect of changing the number of control points on the area under the ROC curve.

Figure 4.11. Changing the number of training examples for the binary classifier between digits 1 and 3.

Figure 4.12. Area under the ROC curve as the number of training samples is changed.

error tends to decrease with an increase in training samples. This is because we do not know the underlying distribution of the training data.

CHAPTER 5

MULTICLASS CLASSIFICATION

The handwritten digits dataset has 10 classes. In view of the results obtained for binary classification and the need to distinguish between 10 classes, we were encouraged to move on to multiclass classification using simple extensions of the binary classification criteria discussed in section 4.1.
Note that these multiclass classifiers follow the OVA (one-versus-all) paradigm discussed in [7], since we compare the similarity criterion of the test image with each class and then find the class to which it is closest. Let us first discuss the extensions of the classification criteria.

5.1 Multiclass classification extensions

The three classification criteria discussed for binary classifiers in section 4.1 are extended to differentiate between multiple classes as follows:

1. Mahalanobis distance: In binary classification, we restricted ourselves to comparing the distance between two classes. Extending the comparison of equation 4.4 to all 10 classes gives the classification criterion in equation 5.1:

ŷ = argmin_{l ∈ {1,2,…,10}} M_l(α_test^l)    (5.1)

The notion behind this criterion is that the closer the deformation is to the mean deformation of a certain class, the more likely it is to belong to that class. The notion of distance used for binary classification has simply been extended to multiclass classification. Note that this criterion is simple and cannot give good results if the clusters overlap.

2. Magnitude of the momenta vectors: The magnitude criterion of equation 4.5 is extended in equation 5.2 to accommodate multiple classes:

ŷ = argmin_{l ∈ {1,2,…,10}} ‖α_test^l‖_mag    (5.2)

where

‖α_test^l‖_mag = sqrt( Σ_i ‖α_i‖_{L2} )    (5.3)

3. Data matching term: By far, this criterion provides the best classification rates. It compares the L2 norm of the difference between the test image and the deformed template image; the lower the difference, the more likely the test image belongs to the class. The classification can be described as given in equation 5.4.
ŷ = argmin_{l ∈ {1,2,…,10}} ‖I_template^l ∘ φ(α_test^l) − I_test‖_{L2}    (5.4)

As can be seen, this criterion depends on the quality of the registration; it is a direct extension of the rule given in equation 4.7.

5.2 Confusion matrix

The confusion matrix is used to organize and visualize multiclass classifier accuracy. The columns of the matrix represent the class predicted by the classifier, while the rows represent the actual class of the instance. The confusion matrix is a square matrix of dimensionality l × l, where l is the total number of classes. The entry x_ij in the ith row and jth column represents the number, or proportion, of samples of actual class i that have been predicted as class j. Each multiclass classifier with a specific parameter setting has one confusion matrix per test dataset. The confusion matrix makes it easy to see whether the system confuses two classes by looking at the row corresponding to the actual class. When a dataset is unbalanced, the raw error rate of a classifier is not representative of its true performance; this is where the confusion matrix helps. Several accuracy measures have been derived from the confusion matrix to measure the performance of multiclass classifiers, as discussed in [12]. Although many measures have been proposed, we use the simple measure of the average error rate across the classes to measure the performance of the classifier.

5.3 Multiclass classification using optimally situated control points

First, let us plot the confusion matrices for multiclass classification with a set of 25 optimally situated control points selected using the method described in section 3.2. The three confusion matrices, one per classification criterion, are displayed in Figure 5.1. The confusion matrices have been plotted using the entire training data in the

Figure 5.1.
Confusion matrices plotted for multiclass classification with 25 control points, using σ = 3, γ_r = 0.25, with gradient descent on the control point positions, for the different classification criteria. Top left: the data matching criterion used for classification gives excellent results, with an average error rate of 0.12. Top right: the magnitude of the momenta vectors, when used for classification, gives an average error rate of 0.38; most digits tend to get confused with the digits 1, 6, 7, and 9. Bottom: the Mahalanobis distance does not perform very well, with an average error of 0.49; most digits tend to get confused with the digit 8 as well as with the digits 3, 4, and 5.

training phase, i.e., the atlas formation, and the entire test data. A cursory inspection of the three plots tells us that the data matching criterion of equation 5.4 gives the best classification results, with an average error rate of 12%. The Mahalanobis distance is not a good classification criterion, with an average error rate of almost 50% for the current configuration of parameters. This is because the test dataset is challenging and the deformations from multiple templates tend to look very similar to their respective atlas deformations. Hence, it is more useful to compare the actual deformed image with the test image, as is done by the data matching criterion. These results led to the belief that the low error rate was a result of the deformation descriptor being very low dimensional. The next section analyzes the effect of increasing the dimensionality of the deformation descriptor.

5.4 Using a higher density of control points

In order to analyze the effect of increasing the deformation descriptor dimensionality, we increase the density of control points. The previous section used 25 regularly distributed control points. Let us look at the confusion matrices plotted for the three classification criteria using a grid of 8 × 8 = 64 control points.
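The confusion matrices and the average error rate used throughout this chapter can be sketched as follows (row = actual class, column = predicted class, as in section 5.2; the function names are our own):

```python
import numpy as np

def confusion_matrix(actual, predicted, n_classes=10):
    """Counts: entry (i, j) = samples of actual class i predicted as j."""
    C = np.zeros((n_classes, n_classes), dtype=int)
    for a, p in zip(actual, predicted):
        C[a, p] += 1
    return C

def average_error_rate(C):
    """Mean of the per-class error rates read off the matrix rows."""
    totals = np.maximum(C.sum(axis=1), 1)      # avoid division by zero
    per_class_error = 1.0 - np.diag(C) / totals
    return float(per_class_error.mean())

# Two-class toy run: class 0 has one of two samples misclassified,
# class 1 is classified perfectly.
C = confusion_matrix([0, 0, 1, 1], [0, 1, 1, 1], n_classes=2)
err = average_error_rate(C)
```

Averaging the per-class rates rather than the pooled error keeps the measure meaningful when the test set is unbalanced, which is the motivation given above.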
In order to reduce the running time, we used a slightly smaller training set for the construction of the atlas. The results are shown in Figure 5.2. As can be seen, the average classification error has not changed appreciably with the increased control point density. The error using the data matching criterion went from 12% to 12.21%; it is almost stable. The magnitude-of-momenta error decreased from 38% to 36.71%; although there is an improvement due to the increased resolution of the deformation descriptor, it is not appreciably large. The Mahalanobis distance error changed from 49% to 49.36%, which is not significant.

This set of experiments leads to the conclusion that the data matching criterion yields the best results if we wish to use the velocity field, as described in section 1.2, obtained with the registration technique described in Chapters 2 and 3. To validate our results, we compare them with the benchmarks discussed in detail in [9]; the dataset used is the ZIP code digits database. From the comparative results given in that paper, we see that data matching gives results comparable to those of a simple linear classifier. Hence, this is not a good classifier to use all by itself; it should be used in conjunction with other classifiers, or with higher-dimensional images, to make the advantages of the system apparent.

Figure 5.2. Confusion matrices plotted for multiclass classification with an 8 × 8 grid of 64 control points, using σ = 3 and γ_r = 0.1, for the different classification criteria. Top left: the data matching criterion has an average error rate of 0.1221. Top right: the magnitude of the momenta vectors gives an average error rate of 0.3671; confusion with the digits 1, 6, 7, and 9 occurs frequently. Bottom: the Mahalanobis distance has an average error of 0.4936; most digits tend to get confused with the digits 4 and 8 as well as with 3 and 5.
The error rates obtained in the above experiments are high in comparison with the results discussed in the LeCun paper [9]. These two sets of results tell us that, in order to use the momenta vectors for classification with the same classification criterion, we need to use a different set of features instead of the image data directly. Otherwise, we may have to optimize all pairs of AVA (all-versus-all) classifiers and build a decision-tree-based multiclass classifier. Hence, we explore the use of one such feature in the following section.

5.5 Using the gradient as a feature

One simple attempt at using a different set of features was to use the gradient of the images instead. In this case, we formed the atlas using the gradients of the subject images, and for classification we likewise used the gradients of the images rather than the images themselves. Thus, in both the atlas formation process and the classification process, we have:

I_s ↦ ‖∇I_s‖_{L2}
I_test ↦ ‖∇I_test‖_{L2}

i.e., we use the gradient magnitude of each image instead of the image itself. Using this feature, Figure 5.3 shows how the multiclass classifier performs. The average error rates of 49%, 62%, and 58.26% obtained in these tests tell us that the gradient by itself is not a good feature to use with this method.

Figure 5.3. Confusion matrices plotted for multiclass classification using the gradient of the images as the image feature, with an 8 × 8 grid of 64 control points, using σ = 3 and γ_r = 0.1, for the different classification criteria. Left: the data matching criterion has an average error rate of 0.49. Middle: the magnitude of the momenta vectors gives an average error rate of 0.62. Right: the Mahalanobis distance has an average error of 0.5826.

CHAPTER 6

CONCLUSION AND FUTURE WORK

The method of using the velocity field obtained from the landmark-matching registration framework is successful for performing statistics on high-dimensional data.
The low-dimensional descriptor at control points improves the results of the classification process. Let us discuss the results obtained along with their implications.

6.1 Image registration and atlas formation

The image registration framework is based on a gradient descent over an objective that balances smoothness and image match. The gradient descent is stable and is influenced largely by the width of the Gaussian kernel. The optimal kernel width is equal to the width of the contours in the image; this allows the control points to influence the motion of the contours. The contours, or level set boundaries, are the sections of the image that influence the correspondence or match between images. The time required by the gradient descent to converge increases as the difference between the two images increases, but this is balanced by the breaking-ratio stopping criterion: if the gradient descent is no longer able to influence the objective function value much, it is terminated.

Atlas formation using iterative averaging shrinks the template images, due to the regularity term, as discussed in section 3.1.2. Splatting results in better atlas reconstruction. Also, precomputing an optimal set of landmarks to be used for atlas construction results in better classification results, and using the variance in the atlas formation process as discussed in section 3.2 also helps reduce the dimensionality and improve the classifier accuracy. The process of atlas construction is time consuming, but it can be done off-line before the actual classification.

The registration process has a time complexity of O(NM), where N is the number of control points and M is the size of the image in terms of the number of pixels in it. With the relevant parameter settings and an adaptive step length, the number of iterations required for convergence is small and roughly constant.
Thus, assuming a constant number of control points, registration has a time complexity linear in the size of the image. Hence, it is computationally efficient.

6.2 Binary classifier

The performance of the binary classifiers was tested using ROC curves. We find that the performance of the classifiers is optimal for a particular setting of the kernel width σ and the regularity trade-off γ_r. This parameter setting depends on the average width of the contours of the data, which can be inferred from the fact that the optimal σ is 2 for the binary classifier between digits 1 and 3, while the optimal σ is 3 for the classifier between digits 2 and 5. This is verified not only by the variance of the ROC curves but also by the area under the curve, and it tells us that the kernel width should be set so that it is optimal for the data under consideration.

We have also analyzed the effect of varying the dimensionality of the deformation descriptor. It can be seen from section 4.4 and Figure 4.9 that the classifier is accurate for an optimal dimensionality of the deformation descriptor. This extends the results from [3] and supports the claim that the deformation descriptor has an optimal dimensionality: a descriptor with low dimensionality cannot capture all modes of variation of the data, while a high-dimensional descriptor adds noise artifacts and can bias the classifier towards a certain class.

We also wanted to analyze the effect of the number of training examples on the classifier output. As seen in Figure 4.12, the accuracy of the classifier tends to increase with an increase in the number of training samples. This is the output we would expect, but it assumes that the test and training samples have been drawn from the same distribution. The trend is somewhat unclear, likely due to our lack of knowledge of the underlying distribution.
6.3 Multiclass classification

The performance of the multiclass classifiers is measured using confusion matrices and average error rates across the classes. The first set of experiments was performed using an optimally situated set of 25 control points. In the case of the binary classifiers, the Mahalanobis distance criterion yields good results; however, for the multiclass extensions of the three metrics, it is seen in Figure 5.1 that the data matching criterion yields the best results, with an error rate of 12%. This is because, as the number of classes increases, the deformations from the test image to the multiple class templates look similar; the Mahalanobis distance is therefore not a good metric to use when the number of classes increases. In our case, this is also due to the small degree of variation in the digits dataset. Increasing the density of control points does not have an appreciable effect on the classifier accuracy; as seen in section 5.4, the average error rate does not improve much.

Figure 6.1 shows an excerpt from [9] that details the error rates obtained by various techniques on the ZIP code database we have used. Comparing with the output of our classifier, we see that we get results similar to those of a simple linear classifier. Thus, we obtain similar accuracy with far fewer parameters in our system than the linear classifier discussed in [9], which uses 7850 free parameters. We have therefore achieved better run times by reducing the dimensionality of the descriptor while keeping the error rate low.

6.4 Future work

The objective of this effort was to quantify the utility of deformation fields as features for tasks such as classification. The technique of using the parametrized optimal deformation field, found by a gradient descent on an objective that enforces data matching and smoothness of deformation, works well with the given data.
In order to improve the accuracy of the classifier, we would have to incorporate the results of multiple classifiers. This can be done by a weighted sum of the results, using ensemble techniques such as bagging and boosting. Classifiers using structural information, such as those performing skeletonization of the curves, have been shown to yield good results on the digits database [2]. Combining the results of multiple such low-cost classifiers can be more efficient than using expensive classifiers based on neural networks [9].

The true potential of this technique lies in the classification of medical imaging data [3]. Anatomical structures have variations that are locally consistent; thus, in a normalized dataset, the voxels do not move independently. This fact has been incorporated into the objective function as a smoothness constraint. Another advantage of this technique is the massive reduction in the dimensionality of the feature descriptor; such an improvement in computational cost is not possible with expensive techniques such as neural networks. Thus, the next step is to use the technique to answer clinical questions using medical data.

Figure 6.1. Error rates for various classification methods using the ZIP code digits database. Data taken from [9].

REFERENCES

[1] M. F. Beg, M. I. Miller, A. Trouvé, and L. Younes, Computing large deformation metric mappings via geodesic flows of diffeomorphisms, International Journal of Computer Vision, 61 (2005), pp. 139-157. 10.1023/B:VISI.0000043755.93987.aa.

[2] S. Behnke, M. Pfister, and R. Rojas, Recognition of handwritten digits using structural information, in Proceedings of the International Conference on Neural Networks, vol. 3, IEEE, 1997, pp. 1391-1396.

[3] S. Durrleman, M. Prastawa, G. Gerig, and S. Joshi, Optimal data-driven sparse parameterization of diffeomorphisms for population analysis, (2011), pp. 123-134.

[4] L. C. et al., ZIP code normalized handwritten digits, http://www-stat.stanford.edu/~tibs/ElemStatLearn/, 1990.
[Online; accessed 19-July-2008].

[5] T. Fawcett, ROC graphs: Notes and practical considerations for researchers, (2004).

[6] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, Springer Series in Statistics, Springer, 2009.

[7] H. Daumé III, A Course in Machine Learning, 2012.

[8] S. Joshi and M. Miller, Landmark matching via large deformation diffeomorphisms, IEEE Transactions on Image Processing, 9 (2000), pp. 1357-1370.

[9] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, 86 (1998), pp. 2278-2324.

[10] P. C. Mahalanobis, On the generalised distance in statistics, Proceedings of the National Institute of Sciences of India, 2 (1936), pp. 49-55.

[11] J. Nocedal and S. Wright, Numerical Optimization, Springer Series in Operations Research, Springer, 1999.

[12] S. V. Stehman, Selecting and interpreting measures of thematic classification accuracy, Remote Sensing of Environment, 62 (1997), pp. 77-89.



