
Astron. Astrophys. 341, 371-384 (1999)


3. Spectral classification

For the classification of the galaxies on the basis of their spectrum, we use a two-step scheme in which we first describe a spectrum in terms of its most significant Principal Components (PCs), and then use a trained Artificial Neural Network to classify the galaxy on the basis of those components. In this section, we summarize the methods that we used and the details of their implementation as far as those are required for an appreciation of the results.

Several authors applied PCA (either by itself or in combination with an ANN) to the problem of trying to determine from a spectrum the stellar or galaxy type. Deeming (1963), in classifying stellar spectra, found a very good correlation between the first, most important PC and the stellar type. Francis et al. (1992) carried out a study of a large sample of QSO spectra, and developed a classification scheme based on the first three PCs. Von Hippel et al. (1994) used an ANN to classify stellar spectra and concluded that they could recover the stellar type to within 1.7 spectral subtypes. Sodré Jr. & Cuevas (1994) showed that the spectroscopic parameters extracted from the spectra of galaxies, like the amplitude of the 4000 Å break or of the CN band, correlate well with Hubble type.

Zaritsky et al. (1995) decomposed galaxy spectra into an old stellar component, a young stellar component and various emission-line spectra. They classified the galaxies by comparing the relative weights of the components with those of galaxies of known morphological type and found that the spectral classification agreed with the morphological classification to within one type (e.g. E to S0 or Sa to Sb) for [FORMULA] of the galaxies. Connolly et al. (1995) decomposed each spectrum into eigenspectra and found that the distribution of spectral types can be well described by the first two eigenspectrum coefficients.

Folkes et al. (1996) combined PCA and ANN to classify galaxy spectra. Their purpose was to investigate galaxy classification from spectra to be obtained in the 2dF Galaxy Redshift Survey. They generated artificial spectra and obtained a success rate of more than 90% in recovering the galaxy type from the spectrum.

Lahav et al. (1996) used ESO-LV galaxies (Lauberts & Valentijn 1989) and grouped them in three ways. From a PCA applied to 13 galaxy parameters they found that different morphological types occupy distinct regions in the plane defined by the two most important PCs. They also used an ANN with the 13 galaxy parameters as input and concluded that with a single output node, there is a strong correlation between the galaxy type indicated by the ANN-output and the input type. Using two output nodes, one for early- and one for late-type galaxies, the overall success rate was 90%, which decreased to 64% if 5 output nodes were used, viz. E, S0, Sa+Sb, Sc+Sd and I.

In the last few years several more applications of PCA, of combined PCA and ANN analysis, or of ANN analysis alone were published. In the field of stellar classification see e.g. Bailer-Jones (1997), Ibata & Irwin (1997), Weaver & Torres-Dodgen (1997) and Singh et al. (1998), and in the area of galaxy classification e.g. Galaz & de Lapparent (1998) and Bromley et al. (1998).

3.1. Principal component analysis

Principal Component Analysis (PCA) is a technique developed for data compression as well as data analysis. As measured parameters, such as spectral fluxes, may be correlated, it is of interest to determine the minimum number of independent variables that can describe the larger number of correlated observed parameters. A full description of PCA can be found in e.g. Kendall & Stuart (1968) and Folkes et al. (1996). In our analysis, the PCA is an important first step as it reduces the number of parameters that describe the galaxy spectrum, while it retains essentially all significant information and reduces the noise.

3.1.1. Preparing the data

Before we could apply PCA to the ENACS galaxy spectra some preparations were necessary. First, all spectra were inspected and a few spectra with strong discontinuities or other non-physical features were discarded. Secondly, sky lines were removed by linear interpolation. Thirdly, the spectra were shifted back to zero-redshift and corrected for the response functions of the OPTOPUS instrument (spectrograph and CCD detector). Fourthly, a maximum common (zero-redshift) wavelength range had to be established for as large a subset of the galaxy sample as possible.

Using all galaxies in the ENACS survey, this common wavelength range would be rather small because background galaxies have redshifts up to [FORMULA]. We have chosen to use the zero-redshift wavelength range from [FORMULA] Å to [FORMULA] 5014 Å. This range includes all 4 major emission lines (see Sect. 2) and provides at least 7 Å continuum beyond the [OII] 3727 Å and [OIII] 5007 Å lines. All spectra were resampled in the range [[FORMULA]] with [FORMULA] Å, which yields 371 spectral fluxes.

For some field galaxies the zero-redshift spectrum did not fully cover the wavelength range [[FORMULA]]. When the wavelength coverage of a galaxy spectrum fell short by more than 70 Å (i.e., 20 pixels) from the [[FORMULA]] interval, the galaxy was removed from the sample. When the galaxy spectrum fell short by less than 70 Å, it was extrapolated by a second-order polynomial either down to [FORMULA] or up to [FORMULA], or both. This extrapolation does not introduce major errors in the fluxes at the edges of the spectrum.

Finally, the spectra were normalized to unit integral. For the normalization we interpolated the spectrum in the regions of 20 Å centered on the emission lines because a strong emission line may result in a continuum which is too low.
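The shift to zero redshift, the resampling onto a common grid and the normalization to unit integral can be sketched as follows. This is a minimal illustration in Python with NumPy, not the code used for the ENACS spectra: the function names are hypothetical, the common grid is supplied by the caller, linear interpolation is an assumed resampling scheme, and the interpolation across emission lines and the polynomial extrapolation described above are omitted.

```python
import numpy as np

def to_rest_frame(wavelength_obs, z):
    """Shift observed wavelengths back to zero redshift."""
    return np.asarray(wavelength_obs, dtype=float) / (1.0 + z)

def resample_and_normalize(wavelength_rest, flux, grid):
    """Resample a spectrum onto a common grid and scale it to unit integral.

    Assumes the spectrum fully covers the grid; no extrapolation is attempted.
    """
    wavelength_rest = np.asarray(wavelength_rest, dtype=float)
    grid = np.asarray(grid, dtype=float)
    if wavelength_rest[0] > grid[0] or wavelength_rest[-1] < grid[-1]:
        raise ValueError("spectrum does not cover the common wavelength range")
    # Linear interpolation onto the common zero-redshift grid (an assumption).
    resampled = np.interp(grid, wavelength_rest, flux)
    # Trapezoidal integral of the resampled spectrum over the grid.
    integral = np.sum(0.5 * (resampled[1:] + resampled[:-1]) * np.diff(grid))
    return resampled / integral
```

A spectrum that does not cover the requested grid raises an error here, whereas in the text such spectra are either extrapolated or removed from the sample.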

Leaving out all spectra that were observed in September 1992 and rejecting all galaxies for which more than 20 pixels had to be added at either one or both ends to fill the spectral range [[FORMULA]], we retained 3798 galaxies for the PCA. For 270 of these, a morphological classification is available from D80.

3.1.2. Determining the principal components

After the preparation described in Sect. 3.1.1 the resampled spectrum of each galaxy defines a 371-dimensional vector [FORMULA], whose j-th component represents the flux in pixel j of the spectrum, with j = 1-371. From each component [FORMULA] we subtract the mean over all galaxies, [FORMULA], to centre the j-th parameter on zero (remember that we normalized all spectra to the same integral of 1.0). The values [FORMULA] can be used in two different ways in the PCA. Firstly, [FORMULA] may be normalized by its standard deviation [FORMULA]. In that case, each of the components of the spectral vectors has unit variance for the set of spectra used. This method is sometimes recommended as it puts each input parameter on a similar scale. In this way, one may construct vectors from components that are not related, such as e.g. mass and size.

However, there are cases in which the different dispersions or the relative strengths of the inputs are important (see e.g. Folkes et al. 1996). For example, in our PCA the components of the spectral vectors that contain the principal emission lines will have a larger variance, as these may or may not be present, and it may be important to retain this information. We did the PCA with and without normalization by the standard deviation, and obtained better results with normalization than without.

The PCA solves for the weights [FORMULA] that define the 371 PCs [FORMULA] which follow from the spectral fluxes by the relation: [FORMULA]. The PCs are thus linear combinations of the normalized spectral fluxes and form an orthogonal basis. The first PC, [FORMULA], contains most of the variance between the spectra and describes the most characteristic difference between the spectra. The last PC contains least of the variance and will be most affected by noise. In practice, we have restricted the ANN analysis to the 15 most significant PCs (see Sect. 4.2).
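The determination of the weights and the PCs can be illustrated with a short NumPy sketch. It assumes the spectra are stored as an array with one row per galaxy, and it uses a singular value decomposition as one standard numerical route to the PCs; the text does not state how the PCA was solved in practice, and the function name and the `scale_to_unit_variance` switch are hypothetical.

```python
import numpy as np

def principal_components(spectra, scale_to_unit_variance=True):
    """Return (weights, pcs, variances) for an (n_galaxies, n_pixels) array.

    Row i of `weights` holds the weights defining the i-th PC; column i of
    `pcs` holds the value of the i-th PC for every galaxy.
    """
    x = np.asarray(spectra, dtype=float)
    x = x - x.mean(axis=0)                      # centre each pixel on zero
    if scale_to_unit_variance:
        std = x.std(axis=0, ddof=1)
        x = x / np.where(std > 0, std, 1.0)     # optional unit-variance scaling
    # Rows of vt are orthonormal weight vectors, ordered by decreasing variance.
    _, s, vt = np.linalg.svd(x, full_matrices=False)
    variances = s**2 / (x.shape[0] - 1)
    pcs = x @ vt.T                              # PCs as linear combinations of fluxes
    return vt, pcs, variances
```

Keeping only the leading columns of `pcs`, for instance `pcs[:, :15]`, gives the reduced description that is passed on to the ANN.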

In Fig. 1 we show two examples of spectra and their PCA reconstruction, based on the first 15 PCs. These examples illustrate that the spectra can be reconstructed quite well with only a limited number of PCs. There may be some indication that the spectrum of an elliptical is easier to represent with 15 PCs than that of a spiral, but the difference is slight.

[FIGURE] Fig. 1a and b. Comparison of two ENACS spectra with the reconstruction of the same spectra from the first 15 Principal Components. The solid line is the observed galaxy spectrum, the dashed line is the reconstructed spectrum. The dotted line indicates the difference between the observed and the reconstructed spectrum. a  Elliptical galaxy, b  Spiral galaxy.

3.1.3. Physical meaning of the PCs

In Fig. 2 we show the average spectrum of the 3798 galaxies in the data set, together with the weights for the 3 most significant PCs, i.e., [FORMULA], [FORMULA] and [FORMULA]. The weights for the first PC, [FORMULA], indicate that [FORMULA] represents the colour of the galaxy, as it measures the flux in the interval [FORMULA] Å minus the flux in the interval [FORMULA] Å. The wavelength dependence of [FORMULA] is very similar to that in Sodré & Cuevas (1997, their Fig. 5). The second PC, with weights [FORMULA], apparently measures the curvature of the spectrum, i.e. the flux between [FORMULA] Å and [FORMULA] Å, minus the flux below and above these wavelengths. The weights for the third PC, [FORMULA], have a signature just redwards of the 4000 Å break, and [FORMULA] thus seems to be sensitive to the strength of this break. Sodré & Cuevas (1994) noted that the 4000 Å break correlates well with Hubble-type. The third PC also appears to weigh the G-band at [FORMULA] Å. It gets progressively more difficult to understand in detail the physical meaning of the higher order PCs, but they gauge the various less conspicuous features in the spectrum, such as the many absorption and emission lines.

[FIGURE] Fig. 2. Average galaxy spectrum, and the weights [FORMULA] for the first 3 Principal Components (i=1-3), calculated for the entire data set of 3798 galaxies. See text for more details.

Note that we expressed the PCs in terms of the spectral data, which provides an immediate physical meaning of the weights [FORMULA]. In an alternative but fully equivalent representation, the spectral data can be approximated by a weighted sum of eigenspectra; see e.g. Connolly et al. (1995) or Galaz & de Lapparent (1998).
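The eigenspectrum representation can be sketched in the same way: a spectrum is approximated by the mean spectrum plus a weighted sum of the leading eigenspectra. The sketch below is an illustration only; it omits the unit-variance scaling for brevity, and the function name and argument names are hypothetical.

```python
import numpy as np

def reconstruct(spectra, n_keep):
    """Approximate each spectrum from its first `n_keep` principal components."""
    x = np.asarray(spectra, dtype=float)
    mean = x.mean(axis=0)
    centred = x - mean
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    basis = vt[:n_keep]            # leading eigenspectra (orthonormal rows)
    coeffs = centred @ basis.T     # PC values of every galaxy
    return mean + coeffs @ basis   # weighted sum of eigenspectra
```

The residual between a spectrum and its reconstruction cannot increase when more components are kept, and it vanishes when all components are used.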

3.2. Artificial neural networks

The first 15 PCs derived for each spectrum are used as input for an Artificial Neural Network (ANN). The ANN determines the optimum way of combining the PCs in order to obtain a single number which maps, with maximum discriminating power, onto the desired quantity which, in our case, is morphological type. ANN's are frequently used to recognize patterns in input data. An array of parameters is presented as input to the ANN, which must have been trained to recognize the desired patterns. The ANN then yields the class of object for which the input array is most characteristic. The classification is objective: the ANN is true to the training it received, and repeatable.

An ANN uses weights to translate the input data into one or more parameters which can be compared with the corresponding parameters for the training set in order to estimate the class of an object. The weights in the ANN are determined by an iterative least-squares minimization using a back-propagation algorithm. In each iteration step, the current values of the weights are updated according to the difference between the supplied output type and the calculated output type. For a full description of ANN's, the reader is referred to e.g. Hertz et al. (1991), Kröse & van der Smagt (1993) and Folkes et al. (1996).
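A back-propagation update of this kind can be sketched for a small network with one hidden layer and a single output node. This is a minimal illustration and not the network code used here: the sigmoid activation, the weight initialization, the batch update and the class name `TinyANN` are assumptions, and the weight-decay term is left out.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class TinyANN:
    """One hidden layer, one output node in (0, 1), least-squares cost."""

    def __init__(self, n_in=15, n_hidden=5, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.5, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.w2 = rng.normal(0.0, 0.5, n_hidden)
        self.b2 = 0.0

    def forward(self, x):
        hidden = sigmoid(x @ self.w1 + self.b1)
        return hidden, sigmoid(hidden @ self.w2 + self.b2)

    def cost(self, x, target):
        return 0.5 * np.sum((self.forward(x)[1] - target) ** 2)

    def gradients(self, x, target):
        hidden, out = self.forward(x)
        # Error at the output node, propagated back to the hidden layer.
        d_out = (out - target) * out * (1.0 - out)
        d_hid = np.outer(d_out, self.w2) * hidden * (1.0 - hidden)
        return {"w1": x.T @ d_hid, "b1": d_hid.sum(axis=0),
                "w2": hidden.T @ d_out, "b2": d_out.sum()}

    def train_step(self, x, target, rate=0.01):
        g = self.gradients(x, target)
        self.w1 -= rate * g["w1"]
        self.b1 -= rate * g["b1"]
        self.w2 -= rate * g["w2"]
        self.b2 -= rate * g["b2"]
```

Each call to `train_step` moves the weights against the gradient of the summed squared difference between the calculated and the supplied output types.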

3.2.1. Training the ANN and tuning its parameters

We trained the ANN by using the spectra of 150 of the 270 galaxies in our sample of 3798 for which Dressler (1980) gives a morphological type. The median redshift of the clusters studied by Dressler is about 0.04, which is significantly smaller than the median redshift of the ENACS sample of about 0.07. The 10 clusters in common between D80 and ENACS have redshifts between 0.04 and 0.07. The training set contains approximately equal numbers of galaxies in each of the three morphological classes that we attempted to `resolve', viz. E, S0 and S+I.
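Drawing a training set with equal numbers per morphological class, and keeping the remainder as a test set, can be sketched as below. The random draw, the function name and the seed are assumptions; the text does not describe how the 150 training galaxies were selected.

```python
import numpy as np

def balanced_split(labels, n_train, seed=0):
    """Return (train_idx, test_idx) with equal class counts in the training set."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    per_class = n_train // len(classes)
    rng = np.random.default_rng(seed)
    train = []
    for c in classes:
        members = np.flatnonzero(labels == c)
        if len(members) < per_class:
            raise ValueError(f"class {c!r} has only {len(members)} members")
        # Draw the same number of galaxies from every class, without repeats.
        train.extend(rng.choice(members, size=per_class, replace=False))
    train_idx = np.sort(np.array(train))
    test_idx = np.setdiff1d(np.arange(len(labels)), train_idx)
    return train_idx, test_idx
```

With 270 classified galaxies and `n_train=150`, this yields 50 training galaxies per class and leaves 120 galaxies for testing.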

The complexity of an ANN depends strongly on the number of inputs per galaxy and on the number of hidden nodes, layers, outputs, and connections. Therefore, one is well advised to use as few of these as possible (de Villiers & Barnard 1993), as long as the discriminating power of the ANN is not affected. For that reason, only the 15 most significant PCs of the galaxy spectra were presented to the ANN, rather than all 371 original spectral fluxes. By using only the 15 most significant PCs, we also reduce the noise considerably, as the latter is mostly contained in the higher-order PCs.

We used only one hidden layer, which contains 5 nodes. This makes the backpropagation network much more rapid to train (see e.g. de Villiers & Barnard 1993).

Only one output node was used, with output values in the interval [0,1]. Some authors define a separate output node for each of the morphological types that can be assigned to the galaxies (e.g. Storrie-Lombardi et al. 1992). The output node which has the highest `activity' then determines the galaxy morphology. However, because galaxies are thought to form a continuous instead of a discrete sequence of different morphologies (e.g. Naim et al. 1995), we have chosen to describe the sequence with only one output node with a continuous range of output values. A schematic diagram of the ANN we used is shown in Fig. 3.

[FIGURE] Fig. 3. Schematic diagram of the Artificial Neural Network that we used. The network determines the galaxy type (`output node'), from the Principal Components describing the galaxy spectrum. Each node in a given layer is connected to all nodes in the adjacent layers by weight vectors.

When training the ANN, it is essential to stop the iterative minimization at the right moment. One option is to stop when the total error between calculated output types and supplied output types of the training set, the so-called `cost function', drops below a certain value, or changes little between successive iteration steps. However, this may result in `over-fitting', i.e., one may interpret the statistical fluctuations in the training set as global characteristics. Another option is to minimize the cost function as calculated for the test set (Lahav et al. 1996). Because the ANN is not trained on this set, the cost function will usually have a true minimum at a certain iteration step, and increase after that. In our case the results are almost identical for both options.
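The second stopping option can be sketched as a generic loop that monitors the cost function of the test set. The callbacks `step`, `test_cost` and `on_improve`, and the `patience` parameter (the number of iterations without improvement after which training stops), are hypothetical names introduced for this illustration.

```python
def train_with_early_stopping(step, test_cost, max_iter=1000, patience=20,
                              on_improve=None):
    """Run `step()` repeatedly and return (best_iteration, best_test_cost).

    `step` performs one training iteration, `test_cost` returns the current
    cost on the test set, and `on_improve` (optional) is called whenever the
    test cost reaches a new minimum, e.g. to save a copy of the weights.
    """
    best_cost = float("inf")
    best_iter = 0
    for i in range(1, max_iter + 1):
        step()
        cost = test_cost()
        if cost < best_cost:
            best_cost, best_iter = cost, i
            if on_improve is not None:
                on_improve()
        elif i - best_iter >= patience:
            break  # the test cost has stopped improving
    return best_iter, best_cost
```

The loop itself does not restore the weights of the best iteration; a caller would save them through `on_improve`.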

As we want to extend the analysis in Paper III to early- and late-type galaxies, we are primarily interested in a two-class classification. Therefore, we trained the ANN for a pure early-/late-type division which allows a separation of the heterogeneous class of non-ELG into early- and late-type galaxies. An additional reason for taking ellipticals and S0's together in one class was given by Lahav et al. (1996), who found that 76% of all early-type galaxies were correctly classified by their ANN, but that of the S0's only 66% were classified correctly. They suggested that this may be an indication that the S0's form a `transition class' in the Hubble sequence. Sodré & Cuevas (1997) found that the first, most significant PCs of ellipticals and S0's are very similar, so it is hard to distinguish between them on the basis of their spectrum.

We have also trained the ANN for a three-class division into E, S0 and S+I. We defined the output values of the ANN for these three classes to be 1/6, 1/2 and 5/6. In principle, we could have defined different output values for these three categories which would have resulted in different weights in the trained ANN. However, we find that an ANN with output values of 0, 1/2 and 1 gives classification results that are essentially identical to those obtained with the output values 1/6, 1/2 and 5/6. Galaxies are assigned the morphological classification for which the difference between their ANN output parameter and the output parameter defined for the class is smallest. After running the three-class ANN we also sum the E and S0 categories to produce the equivalent of the early-type category in the two-class ANN. We find that there are no significant differences between the results of a true two-class ANN and a semi two-class ANN obtained by combining the E and S0 classes in a three-class ANN. Below we will describe the results for the latter.

3.2.2. Testing the ANN

In addition to training the ANN, we tested it with a test set consisting of the remaining 120 galaxies with morphology from D80. The results for this test set, in terms of the success in classifying the galaxies correctly, are valid for the entire data set of ENACS galaxies for which no morphological classification is available. However, for the latter we do have information on the presence or absence of detectable emission lines in the ENACS spectrum, and we will use that to refine the determination of the ANN output parameter which best separates early- and late-type galaxies.

3.2.3. Optimizing the classification results

Our goals are to optimize the success rate of the classification, to obtain the observed fraction of late-type galaxies among the ELG (viz. 86%, see Paper III), and to obtain the correct fractions of E, S0 and S+I galaxies used to train and test the ANN. The only freedom one has to achieve these goals, after tuning all parameters as described in Sect. 3.2.1, is to set the output ranges within which a galaxy will be classified as E, S0 or S+I. A priori, the most logical choice for the output ranges is [0,1/3], [1/3,2/3], and [2/3,1] for E, S0 and S+I, respectively. However, we find that the ranges [0.00,0.34], [0.34,0.59] and [0.59,1.00] produce a fraction of early-type galaxies among the ELG that is more consistent with observations than that found with the a priori choice. In the following we will therefore use the ranges [0.00,0.34], [0.34,0.59] and [0.59,1.00]. Note, however, that the success rates for the two sets of output ranges differ by at most a few percent.
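Assigning a class from the ANN output parameter then amounts to locating the output within the chosen ranges, as in the sketch below. The names are hypothetical, and the treatment of a value that falls exactly on a boundary (assigned to the upper range) is an assumption, since the quoted ranges share their end points.

```python
import numpy as np

CLASSES = np.array(["E", "S0", "S+I"])
A_PRIORI = (1 / 3, 2 / 3)   # boundaries of the a priori ranges
TUNED = (0.34, 0.59)        # boundaries of the adopted ranges

def classify(outputs, boundaries=TUNED):
    """Map ANN output values in [0, 1] to E, S0 or S+I."""
    # np.digitize returns 0, 1 or 2 depending on the range the value falls in.
    return CLASSES[np.digitize(outputs, boundaries)]

def is_early(labels):
    """Combine E and S0 into a single early-type class."""
    return np.isin(labels, ["E", "S0"])
```

The a priori boundaries reproduce the rule of assigning the class whose defined output value (1/6, 1/2 or 5/6) lies closest to the ANN output; with the adopted boundaries, outputs between 0.59 and 2/3 are assigned to S+I instead of S0.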

3.3. Possible causes of misclassification

There are a number of factors that determine the performance of the classification algorithm.

First, the representation of the spectra by the first 15 PCs is not perfect. However, the error one makes if one only uses the first 15 PCs is probably quite small (see Fig. 1), while the results are not expected to depend much on the exact number of PCs used, as long as this number is large enough (see also Sect. 4.2). On the other hand, it is likely that the correspondence between the characteristics of the spectrum (as quantified in the first PCs) and the morphology is not entirely one-to-one. For instance, the spectrum of a late-type galaxy is likely to depend on the location in the galaxy; viz., if the aperture of the spectrograph covered only the central region of such a galaxy, a significant contribution of a bulge may well create an apparent inconsistency between morphological and spectral classification.

Secondly, morphological classification is not easy. For example, it is likely that some S0 galaxies, especially those seen face-on, are classified as ellipticals, on the basis of the image only (this is less likely if the brightness profile is used as well). On the other hand, edge-on S0 galaxies and spirals are not always easy to classify correctly. Naim et al. (1994) showed that there is indeed some ambiguity in classifying galaxies solely on the basis of the morphology of the images. They found 6 experts willing to classify a set of 831 galaxy images. The results show that both types of disagreement mentioned above do indeed occur, as well as differences in the verdict of whether a spiral galaxy is of early or late type. The r.m.s. differences between verdicts of experts ranged from 1.3 to 2.1 Revised Hubble types. This is as large as the r.m.s. dispersion between the mean classification of the 6 experts on the one hand and the results of an ANN analysis on the other (Naim et al. 1995).

Even an expert is not always totally consistent. Using the 40 galaxies in Dressler's catalogue that were classified twice, once in the cluster DC 0326-53 and once in DC 0329-52, this effect can be quantified. A comparison between both classifications of D80 is given in Table 1. In the three-class system, 8 out of the 40 galaxies have an inconsistent classification. As at least one of the two classifications of each of these galaxies must be wrong, at least 8 of the 80 individual classifications (equivalent to 4 out of 40, or 10%) are thus incorrect. In addition, it cannot be excluded that galaxies for which both independent classifications are identical have nevertheless been classified incorrectly; so, the 10% of misclassifications is a lower limit. If one takes E and S0 galaxies together to obtain an early- vs. late-type classification, the number of misclassifications is at least 5%. Note, however, that the other way to read this number is that D80 is close to 95% consistent: an impressive achievement, as will be confirmed by anybody who has done morphological classification.


Table 1. Distribution of morphological type for the galaxies that Dressler (1980) classified twice, in the cluster DC 0326-53 as well as in DC 0329-52. The last column and the bottom line of each half of the table indicate the fraction of galaxies that is classified into the same class twice.

Thirdly, as mentioned before, the spectral difference between E and S0 galaxies probably is not very large (see Lahav et al. (1996) and Sodré & Cuevas (1997)).

Fourthly, Zaritsky et al. (1995) found that for 51 of the 304 galaxies in their sample (i.e. for 17%) the spectral typing is not consistent with the morphological classification to within one morphological type (E, S0, Sa, Sb, Sc and Irr). In 36 cases there is a discrepancy between morphological and spectral classification that transgresses the early-/late-type galaxy boundary. It is noteworthy that there are 16 cases of early-type morphology with late-type spectrum (mostly on the basis of emission lines), and 20 cases of late-type morphology with early-type spectrum. I.e. the effects of misclassification appear to be more or less symmetric, and can therefore be considered as sources of random errors, like the effects mentioned above.

For several hundred of the ENACS galaxies that we studied in this paper and for which we obtained a spectral classification, we also have obtained CCD images which yield a morphological classification. Provisional results indicate that for our spectral classification method, the misclassification probably is not symmetric between early-type morphology/late-type spectrum confusion and vice versa (Thomas & Katgert, in preparation). Only 10% of the E and E/S0 galaxies in our sample have a [FORMULA]50% probability that their spectrum is late-type. For the S0 galaxies this fraction increases to about 20%. However, about 30% of the spiral galaxies have a spectrum that has a [FORMULA]50% chance of being indicative of an early-type galaxy. This suggests that the chance of a spectral misclassification of an early-type galaxy is considerably smaller than that of a spectral misclassification of a late-type galaxy. Presumably, the fact that the ENACS spectra sample only the central few kpc of the galaxies amplifies the influence of the bulges on the spectral classification of late-type galaxies.

3.4. Dependence of results on algorithm parameters

A number of choices were made and parameters were chosen in our classification algorithm. The first one is the number of PCs that is used in the ANN. In principle, this number is important, as using too few components to describe the spectrum will make the fits to the original spectra less accurate. The ANN will then have more difficulty classifying the galaxies. If one uses too many components, one may model the spectra too precisely and use PCs which are too noisy. In Sect. 4.2 we will check how the results depend on the number of PCs.

If one does not normalize the spectral fluxes to unit variance (see Sect. 3.1.2), the relative strengths of the spectral points are retained. This may be important (Folkes et al. 1996) and it emphasizes certain emission or absorption lines. If we do not normalize to unit variance, the success rates are smaller than when all pixels are normalized to unit variance. Sodré & Cuevas (1997) also mention that the spread in the first two PCs is larger if the input parameters are all normalized to unit variance.

The exact number of nodes in the hidden layer of the ANN is of minor importance. Using 5 or 7 nodes gives essentially the same results, but using only 3 nodes produces results that are slightly worse. The exact values of the learning parameters and the weight-decay term of the ANN (see e.g. Kröse & van der Smagt 1993) are not important either, as long as they are sufficiently small, i.e., 0.001-0.01. The number of cycles through the input list of training galaxies that is needed, however, does depend on the values of the learning parameters.

The number of galaxies used to train the ANN is not critical, as long as it is sufficiently large, say 150. However, the r.m.s. values around the average classification result (see Sect. 4.2) may depend on the number of galaxies in each of the morphological classes. The number of galaxies with which we have chosen to train the ANN, viz. 150, is a compromise between having a sufficient number of galaxies to train the ANN, and having enough galaxies left to test the performance of the ANN. It is important, however, that all three main morphological types are represented in the training set with roughly equal numbers. If one morphology is overrepresented with respect to the other types in the training set, there will be a positive bias for that particular type. So, in principle, the composition of the training set should closely mimic the composition of the sample to be classified in order to have minimum bias.


© European Southern Observatory (ESO) 1999

Online publication: December 4, 1998