Astron. Astrophys. 341, 371-384 (1999)

3. Spectral classification

For the classification of the galaxies on the basis of their spectrum, we use a two-step scheme in which we first describe a spectrum in terms of its most significant Principal Components (PCs), and then use a trained Artificial Neural Network (ANN) to classify the galaxy on the basis of those components. In this section, we summarize the methods that we used and the details of their implementation as far as those are required for an appreciation of the results.

Several authors have applied PCA (either by itself or in combination with an ANN) to the problem of determining the stellar or galaxy type from a spectrum. Deeming (1963), in classifying stellar spectra, found a very good correlation between the first, most important PC and the stellar type. Francis et al. (1992) carried out a study of a large sample of QSO spectra, and developed a classification scheme based on the first three PCs. Von Hippel et al. (1994) used an ANN to classify stellar spectra and concluded that they could recover the stellar type to within 1.7 spectral subtypes. Sodré Jr. & Cuevas (1994) showed that the spectroscopic parameters extracted from the spectra of galaxies, like the amplitude of the 4000 Å break or of the CN band, correlate well with Hubble type. Zaritsky et al. (1995) decomposed galaxy spectra into an old
stellar component, a young stellar component and various emission-line
spectra. They classified the galaxies by comparing the relative
weights of the components with those of galaxies of known
morphological type and found that the spectral classification agreed
with the morphological classification to within one type (e.g. E to S0
or Sa to Sb) for

Folkes et al. (1996) combined PCA and ANN to classify galaxy spectra. Their purpose was to investigate galaxy classification from spectra to be obtained in the 2dF Galaxy Redshift Survey. They generated artificial spectra and obtained a success rate of more than 90% in recovering the galaxy type from the spectrum.

Lahav et al. (1996) used ESO-LV galaxies (Lauberts & Valentijn 1989) and grouped them in three ways. From a PCA applied to 13 galaxy parameters they found that different morphological types occupy distinct regions in the plane defined by the two most important PCs. They also used an ANN with the 13 galaxy parameters as input and concluded that with a single output node, there is a strong correlation between the galaxy type indicated by the ANN output and the input type. Using two output nodes, one for early- and one for late-type galaxies, the overall success rate was 90%, which decreased to 64% if 5 output nodes were used, viz. E, S0, Sa+Sb, Sc+Sd and I.

In the last few years several more applications of PCA, combined PCA and ANN analysis, or ANN analysis were published. In the field of stellar classification see e.g. Bailer-Jones (1997), Ibata & Irwin (1997), Weaver & Torres-Dodgen (1997) and Singh et al. (1998), and in the area of galaxy classification e.g. Galaz & de Lapparent (1998) and Bromley et al. (1998).

3.1. Principal component analysis

Principal Component Analysis (PCA) is a technique developed for data compression as well as data analysis. Because measured parameters, such as spectral fluxes, may be correlated, it is of interest to determine the minimum number of independent variables that can describe the larger number of correlated observed parameters. A full description of PCA can be found in e.g. Kendall & Stuart (1968) and Folkes et al. (1996).
In our analysis, the PCA is an important first step as it reduces the number of parameters that describe the galaxy spectrum, while it recovers essentially all significant information and reduces the noise.

3.1.1. Preparing the data

Before we could apply PCA to the ENACS galaxy spectra some preparations were necessary. First, all spectra were inspected and a few spectra with strong discontinuities or other non-physical features were discarded. Secondly, sky lines were removed by linear interpolation. Thirdly, the spectra were shifted back to zero-redshift and corrected for the response functions of the OPTOPUS instrument (spectrograph and CCD detector). Fourthly, a maximum common (zero-redshift) wavelength range had to be established for as large a subset of the galaxy sample as possible.

Using all galaxies in the ENACS survey, this common wavelength
range would be rather small because background galaxies have redshifts
up to

For some field galaxies the zero-redshift spectrum did not fully
cover the wavelength range [

Finally, the spectra were normalized to unit integral. For the normalization we interpolated the spectrum in the regions of 20 Å centered on the emission lines because a strong emission line may result in a continuum which is too low.

Leaving out all spectra that were observed in September 1992 and
rejecting all galaxies for which more than 20 pixels had to be added
at either one or both ends to fill the spectral range
[

3.1.2. Determining the principal components

After the preparation described in Sect. 3.1.1 the resampled
spectrum of each galaxy defines a j-dimensional vector
However, there are cases in which the different dispersions or the relative strengths of the inputs are important (see e.g. Folkes et al. 1996). For example, in our PCA the components of the spectral vectors that contain the principal emission lines will have a larger variance, as these lines may or may not be present, and it may be important to retain this information. We did the PCA with and without normalization by the standard deviation, and obtained better results with normalization than without.

The PCA solves for the weights

In Fig. 1 we show two examples of spectra and their PCA reconstruction, based on the first 15 PCs. These examples illustrate that the spectra can be reconstructed quite well with only a limited number of PCs. There may be some indication that the spectrum of an elliptical is easier to represent with 15 PCs than that of a spiral, but the difference is slight.
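The decomposition and reconstruction described in this subsection can be illustrated with a minimal NumPy sketch. It is not the code used for the ENACS spectra: the function names (`fit_pca`, `project`, `reconstruct`), the use of a singular value decomposition, and the guard for constant pixels are all assumptions made for illustration. The `standardize` flag switches the per-pixel normalization by the standard deviation on or off.

```python
import numpy as np

def fit_pca(spectra, n_components=15, standardize=True):
    """Fit a PCA to an (n_galaxies, n_pixels) array of resampled spectra.

    Returns the mean spectrum, the per-pixel scale and the first
    n_components principal components (one per row).
    """
    mean = spectra.mean(axis=0)
    # Optional normalization of every pixel by its standard deviation.
    if standardize:
        scale = spectra.std(axis=0)
        scale = np.where(scale > 0, scale, 1.0)  # guard for constant pixels
    else:
        scale = np.ones(spectra.shape[1])
    centred = (spectra - mean) / scale
    # Rows of vt are orthonormal eigenvectors of the covariance matrix of
    # the centred data, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return mean, scale, vt[:n_components]

def project(spectra, mean, scale, components):
    """Coefficients of each spectrum on the retained PCs."""
    return ((spectra - mean) / scale) @ components.T

def reconstruct(coeffs, mean, scale, components):
    """Approximate spectra rebuilt from the retained PCs only."""
    return (coeffs @ components) * scale + mean

if __name__ == "__main__":
    # Synthetic stand-in data (not ENACS spectra): 200 "spectra", 50 pixels.
    rng = np.random.default_rng(1)
    fake = rng.normal(size=(200, 50))
    mean, scale, pcs = fit_pca(fake, n_components=15)
    approx = reconstruct(project(fake, mean, scale, pcs), mean, scale, pcs)
    print("r.m.s. reconstruction residual:", np.sqrt(np.mean((fake - approx) ** 2)))
```

Comparing a spectrum with its reconstruction for increasing `n_components` is one way to judge how many PCs are needed before the residual is dominated by noise.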
3.1.3. Physical meaning of the PCs

In Fig. 2 we show the average spectrum of the 3798 galaxies in the
data set, together with the weights for the 3 most significant PCs,
i.e.,
Note that we expressed the PCs in terms of the spectral data, which
provides an immediate physical meaning of the weights
3.2. Artificial neural networks

The first 15 PCs derived for each spectrum are used as input for an Artificial Neural Network (ANN). The ANN determines the optimum way of combining the PCs in order to obtain a single number which maps, with maximum discriminating power, onto the desired quantity which, in our case, is morphological type.

ANN's are frequently used to recognize patterns in input data. An array of parameters is presented as input to the ANN, which must have been trained to recognize the desired patterns. The ANN then yields the class of object for which the input array is most characteristic. The classification is objective: the ANN is true to the training it received, and repeatable.

An ANN uses weights to translate the input data into one or more parameters which can be compared with the corresponding parameters for the training set in order to estimate the class of an object. The weights in the ANN are determined by an iterative least-squares minimization using a back-propagation algorithm. In each iteration step, the current values of the weights are updated according to the difference between the supplied output type and the calculated output type. For a full description of ANN's, the reader is referred to e.g. Hertz et al. (1991), Kröse & van der Smagt (1993) and Folkes et al. (1996).

3.2.1. Training the ANN and tuning its parameters

We trained the ANN by using the spectra of 150 of the 270 galaxies in our sample of 3798 for which Dressler (1980, hereafter D80) gives a morphological type. The median redshift of the clusters studied by Dressler is about 0.04, which is significantly smaller than the median redshift of the ENACS sample of about 0.07. The 10 clusters in common between D80 and ENACS have redshifts between 0.04 and 0.07. The training set contains approximately equal numbers of galaxies in each of the three morphological classes that we attempted to `resolve', viz. E, S0 and S+I.
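As an illustration of how a training set with equal numbers per class might be drawn, the sketch below selects a fixed number of galaxies from each morphological class at random and leaves the rest for testing. The function `balanced_split`, the random selection, and the class counts in the usage example are assumptions; the text does not state how the 150 training galaxies were actually chosen.

```python
import numpy as np

def balanced_split(labels, n_train_per_class, seed=0):
    """Return (train_idx, test_idx) with equal numbers per class in training."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        if len(idx) < n_train_per_class:
            raise ValueError(f"class {cls!r} has only {len(idx)} members")
        # Random draw without replacement within this class.
        train.extend(rng.choice(idx, size=n_train_per_class, replace=False))
    train = np.sort(np.array(train))
    # Everything that is not in the training set is kept for testing.
    test = np.setdiff1d(np.arange(len(labels)), train)
    return train, test

if __name__ == "__main__":
    # Made-up class counts summing to 270, for illustration only.
    labels = np.array(["E"] * 70 + ["S0"] * 110 + ["S+I"] * 90)
    train_idx, test_idx = balanced_split(labels, n_train_per_class=50)
    print(len(train_idx), "training galaxies,", len(test_idx), "test galaxies")
```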
The complexity of an ANN depends strongly on the number of inputs per galaxy and on the number of hidden nodes, layers, outputs, and connections. Therefore, one is well advised to use as few of these as possible (de Villiers & Barnard 1993), as long as the discriminating power of the ANN is not affected. For that reason, only the 15 most significant PCs of the galaxy spectra were presented to the ANN, rather than all 371 original spectral fluxes. By using only the 15 most significant PCs, we also reduce the noise considerably, as the latter is mostly contained in the higher-order PCs. We used only one hidden layer, which contains 5 nodes. This makes the backpropagation network much more rapid to train (see e.g. de Villiers & Barnard 1993). Only one output node was used, with output values in the interval [0,1]. Some authors define a separate output node for each of the morphological types that can be assigned to the galaxies (e.g. Storrie-Lombardi et al. 1992). The output node which has the highest `activity' then determines the galaxy morphology. However, because galaxies are thought to form a continuous instead of a discrete sequence of different morphologies (e.g. Naim et al. 1995), we have chosen to describe the sequence with only one output node with a continuous range of output values. A schematic diagram of the ANN we used is shown in Fig. 3.
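The architecture described above (15 inputs, one hidden layer of 5 nodes, a single output in [0,1]) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the sigmoid activation, the mean squared-error cost, full-batch gradient descent, the learning rate, and all function names are choices made here, not details taken from the network actually used.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_network(n_in=15, n_hidden=5, seed=0):
    """Small random weights for a n_in -> n_hidden -> 1 network."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.1, (n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, (n_hidden, 1)),
        "b2": np.zeros(1),
    }

def forward(net, x):
    """Return hidden activations and the single output in [0, 1]."""
    hidden = sigmoid(x @ net["W1"] + net["b1"])
    output = sigmoid(hidden @ net["W2"] + net["b2"])
    return hidden, output[:, 0]

def cost(net, x, target):
    """Half the mean squared difference between output and supplied type."""
    return 0.5 * np.mean((forward(net, x)[1] - target) ** 2)

def train_step(net, x, target, learning_rate=0.3):
    """One back-propagation update of all weights (modifies net in place)."""
    hidden, out = forward(net, x)
    n = x.shape[0]
    # Error signal at the output node, propagated back to the hidden layer.
    delta_out = ((out - target) * out * (1.0 - out))[:, None] / n
    delta_hidden = (delta_out @ net["W2"].T) * hidden * (1.0 - hidden)
    net["W2"] -= learning_rate * hidden.T @ delta_out
    net["b2"] -= learning_rate * delta_out.sum(axis=0)
    net["W1"] -= learning_rate * x.T @ delta_hidden
    net["b1"] -= learning_rate * delta_hidden.sum(axis=0)

if __name__ == "__main__":
    # Synthetic stand-in for 150 galaxies described by 15 PC coefficients.
    rng = np.random.default_rng(2)
    x = rng.normal(size=(150, 15))
    target = (x[:, 0] > 0).astype(float)  # hypothetical target values
    net = init_network()
    print("cost before training:", cost(net, x, target))
    for _ in range(2000):
        train_step(net, x, target)
    print("cost after training: ", cost(net, x, target))
```

With a single continuous output, a morphological sequence is represented by a position along the interval [0,1] rather than by the most active of several output nodes.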
When training the ANN, it is essential to stop the iterative minimization at the right moment. One option is to stop when the total error between calculated output types and supplied output types of the training set, the so-called `cost function', drops below a certain value, or changes little between successive iteration steps. However, this may result in `over-fitting', i.e., one may interpret the statistical fluctuations in the training set as global characteristics. Another option is to minimize the cost function as calculated for the test set (Lahav et al. 1996). Because the ANN is not trained on this set, the cost function will usually have a true minimum at a certain iteration step, and increase after that. In our case the results are almost identical for both options.

As we want to extend the analysis in Paper III to early- and late-type galaxies, we are primarily interested in a two-class classification. Therefore, we trained the ANN for a pure early-/late-type division which allows a separation of the heterogeneous class of non-ELG into early- and late-type galaxies. An additional reason for taking ellipticals and S0's together in one class was given by Lahav et al. (1996), who found that 76% of all early-type galaxies were correctly classified by their ANN, but that only 66% of the S0's were classified correctly. They suggested that this may be an indication that the S0's form a `transition class' in the Hubble sequence. Sodré & Cuevas (1997) found that the first, most significant PCs of ellipticals and S0's are very similar, so it is hard to distinguish between them on the basis of their spectrum.

We have also trained the ANN for a three-class division into E, S0 and S+I. We defined the output values of the ANN for these three classes to be 1/6, 1/2 and 5/6. In principle, we could have defined different output values for these three categories which would have resulted in different weights in the trained ANN.
However, we find that an ANN with output values of 0, 1/2 and 1 gives classification results that are essentially identical to those obtained with the output values 1/6, 1/2 and 5/6. Galaxies are assigned the morphological classification for which the difference between their ANN output parameter and the output parameter defined for the class is smallest.

After running the three-class ANN we also sum the E and S0 categories to produce the equivalent of the early-type category in the two-class ANN. We find that there are no significant differences between the results of a true two-class ANN and a semi two-class ANN obtained by combining the E and S0 classes in a three-class ANN. Below we will describe the results for the latter.

3.2.2. Testing the ANN

In addition to training the ANN, we tested it with a test set consisting of the remaining 120 galaxies with morphology from D80. The results for this test set, in terms of the success in classifying the galaxies correctly, are valid for the entire data set of ENACS galaxies for which no morphological classification is available. However, for the latter we do have information on the presence or absence of detectable emission lines in the ENACS spectrum, and we will use that to refine the determination of the ANN output parameter which best separates early- and late-type galaxies.

3.2.3. Optimizing the classification results

Our goals are to optimize the success rate of the classification, to obtain the observed fraction of late-type galaxies among the ELG (viz. 86%, see Paper III), and to obtain the correct fractions of E, S0 and S+I galaxies used to train and test the ANN. The only freedom one has to achieve these goals, after tuning all parameters as described in Sect. 3.2.1, is to set the output ranges within which a galaxy will be classified as E, S0 or S+I. A priori, the most logical choice for the output ranges is [0,1/3], [1/3,2/3], and [2/3,1] for E, S0 and S+I, respectively.
However, we find that the ranges [0.00,0.34], [0.34,0.59] and [0.59,1.00] produce a fraction of early-type galaxies among the ELG that is more consistent with observations than that found with the a priori choice. In the following we will therefore use the ranges [0.00,0.34], [0.34,0.59] and [0.59,1.00]. Note, however, that the success rates for the two sets of output ranges differ by at most a few percent.

3.3. Possible causes of misclassification

There are a number of factors that determine the performance of the classification algorithm. First, the representation of the spectra by the first 15 PCs is not perfect. However, the error one makes if one only uses the first 15 PCs is probably quite small (see Fig. 1), while the results are not expected to depend much on the exact number of PCs used, as long as this number is large enough (see also Sect. 4.2). On the other hand, it is likely that the correspondence between the characteristics of the spectrum (as quantified in the first PCs) and the morphology is not entirely one-to-one. For instance, the spectrum of a late-type galaxy is likely to depend on the location in the galaxy; viz., if the aperture of the spectrograph covered only the central region of such a galaxy, a significant contribution of a bulge may well create an apparent inconsistency between morphological and spectral classification.

Secondly, morphological classification is not easy. For example, it is likely that some S0 galaxies, especially those seen face-on, are classified as ellipticals, on the basis of the image only (this is less likely if the brightness profile is used as well). On the other hand, edge-on S0 galaxies and spirals are not always easy to classify correctly. Naim et al. (1994) showed that there is indeed some ambiguity in classifying galaxies solely on the basis of the morphology of the images. They found 6 experts willing to classify a set of 831 galaxy images.
The results show that both types of disagreement mentioned above do indeed occur, as well as differences in the verdict of whether a spiral galaxy is of early or late type. The r.m.s. differences between verdicts of experts ranged from 1.3 to 2.1 Revised Hubble types. This is as large as the r.m.s. dispersion between the mean classification of the 6 experts on the one hand and the results of an ANN analysis on the other (Naim et al. 1995).

Even an expert is not always totally consistent. Using the 40 galaxies in Dressler's catalogue that were classified twice, once in the cluster DC 0326-53 and once in DC 0329-52, this effect can be quantified. A comparison between both classifications of D80 is given in Table 1. In the three-class system, 8 out of the 40 galaxies have an inconsistent classification, and at least 4 of the 40 classifications (or 10%) are thus incorrect. In addition, it cannot be excluded that galaxies for which both independent classifications are identical have nevertheless been classified incorrectly; so, the 10% of misclassifications is a lower limit. If one takes E and S0 galaxies together to obtain an early- vs. late-type classification, the number of misclassifications is at least 5%. Note, however, that the other way to read this number is that D80 is close to 95% consistent: an impressive achievement, as will be confirmed by anybody who has done morphological classification.

Table 1. Distribution of morphological type for the galaxies that Dressler (1980) classified twice, in the cluster DC 0326-53 as well as in DC 0329-52. The last column and the bottom line of each half of the table indicate the fraction of galaxies that is classified into the same class twice.

Thirdly, as mentioned before, the spectral difference between E and S0 galaxies probably is not very large (see Lahav et al. 1996 and Sodré & Cuevas 1997).

Fourthly, Zaritsky et al. (1995) found that for 51 of the 304 galaxies in their sample (i.e.
for 17%) the spectral typing is not consistent with the morphological classification to within one morphological type (E, S0, Sa, Sb, Sc and Irr). In 36 cases there is a discrepancy between morphological and spectral classification that transgresses the early-/late-type galaxy boundary. It is noteworthy that there are 16 cases of early-type morphology with late-type spectrum (mostly on the basis of emission lines), and 20 cases of late-type morphology with early-type spectrum. That is, the effects of misclassification appear to be more or less symmetric, and can therefore be considered as sources of random errors, like the effects mentioned above.

For several hundred of the ENACS galaxies that we studied in this
paper and for which we obtained a spectral classification, we also
have obtained CCD images which yield a morphological classification.
Provisional results indicate that for our spectral classification
method, the misclassification probably is not symmetric between
early-type morphology/late-type spectrum confusion and vice versa
(Thomas & Katgert, in preparation). Only 10% of the E and E/S0
galaxies in our sample have a

3.4. Dependence of results on algorithm parameters

A number of choices were made and parameters were chosen in our classification algorithm. The first one is the number of PCs that is used in the ANN. In principle, this number is important, as using too few components to describe the spectrum will make the fits to the original spectra less accurate. The ANN will then have more difficulty classifying the galaxies. If one uses too many components, one may model the spectra too precisely and use PCs which are too noisy. In Sect. 4.2 we will check how the results depend on the number of PCs.

If one does not normalize the spectral fluxes to unit variance (see Sect. 3.1.2), the relative strengths of the spectral points are retained. This may be important (FLM96) and it emphasizes certain emission or absorption lines. If we do not normalize to unit variance, the success rates are lower than when all pixels are normalized to unit variance. Sodré & Cuevas (1997) also mention that the spread in the first two PCs is larger if the input parameters are all normalized to unit variance.

The exact number of nodes in the hidden layer of the ANN is of minor importance. Using 5 or 7 nodes gives essentially the same results, but using only 3 nodes produces results that are slightly worse. The exact values of the learning parameters and the weight-decay term of the ANN (see e.g. Kröse & van der Smagt 1993) are not important either, as long as they are sufficiently small, i.e., 0.001-0.01. The number of cycles through the input list of training galaxies that is needed, however, does depend on the values of the learning parameters.

The number of galaxies used to train the ANN is not critical, as long as it is sufficiently large, say 150. However, the r.m.s. values around the average classification result (see Sect. 4.2) may depend on the number of galaxies in each of the morphological classes. The number of galaxies with which we have chosen to train the ANN, viz.
150, is a compromise between having a sufficient number of galaxies to train the ANN, and having enough galaxies left to test the performance of the ANN. It is important, however, that all three main morphological types are represented in the training set with roughly equal numbers. If one morphology is overrepresented with respect to the other types in the training set, there will be a positive bias for that particular type. So, in principle, the composition of the training set should closely mimic the composition of the sample to be classified in order to have minimum bias.

© European Southern Observatory (ESO) 1999

Online publication: December 4, 1998