Original document: http://star.arm.ac.uk/PhD-Thesis/files/cwr/CWR_phd.pdf
On the Automatic Analysis of Stellar Spectra
A thesis submitted for the degree of Doctor of Philosophy

by

Christopher Winter, B.Eng.

Armagh Observatory Armagh, Northern Ireland & Faculty of Science and Agriculture Department of Pure and Applied Physics The Queen's University of Belfast Belfast, Northern Ireland

March 2006





"Quia non erit impossibile apud Deum omne verbum"



To Stacey

"Qui invenit mulierem invenit bonum et hauriet iucunditatem a Domino"



Acknowledgements
I would like to acknowledge and thank my supervisor, C.S. Jeffery, for his sound advice and direction over the course of this project, and the staff and students of the Armagh Observatory for their helpful support and assistance. I am very grateful to J.S. Drilling, E.M. Green, and A. Ahmad, all of whom supplied spectroscopic data that was used in this project. In addition, my thanks go to C.A.L. Bailer-Jones for the use of his neural network code, STATNET.

This work was carried out as part of the CosmoGrid project, funded under the Programme for Research in Third Level Institutions (PRTLI) administered by the Irish Higher Education Authority under the National Development Plan and with partial support from the European Regional Development Fund.

This work also uses data from the Sloan Digital Sky Survey (SDSS) data archive. Funding for the creation and distribution of the SDSS Archive has been provided by the Alfred P. Sloan Foundation, the Participating Institutions, the National Aeronautics and Space Administration, the National Science Foundation, the U.S. Department of Energy, the Japanese Monbukagakusho, and the Max Planck Society. The SDSS Web site is http://www.sdss.org/. The SDSS is managed by the Astrophysical Research Consortium (ARC) for the Participating Institutions. The Participating Institutions are The University of Chicago, Fermilab, the Institute for Advanced Study, the Japan Participation Group, The Johns Hopkins University, the Korean Scientist Group, Los Alamos National Laboratory, the Max-Planck-Institute for Astronomy (MPIA), the Max-Planck-Institute for Astrophysics (MPA), New Mexico State University, University of Pittsburgh, University of Portsmouth, Princeton University, the United States Naval Observatory, and the University of Washington.

Chris Winter March, 2006




Abstract
This project investigates the problem of automatically searching for and analysing astronomical spectra from large data sets. The three core problems of (1) spectral classification, (2) physical parameterisation, and (3) searching are examined, and a generalisable set of tools is established based on the techniques of artificial neural networks (ANNs), χ² minimisation, and principal components analysis (PCA). These tools are then applied to the archives of the Sloan Digital Sky Survey (SDSS) to automatically search for and analyse the spectra of hot subdwarf stars.

Spectral classification is tackled by the versatile statistical machine learning method of ANNs. An ANN is trained to classify hot subdwarf spectra onto the classification system defined by Drilling et al. (2006), obtaining global errors (rms) of 2 subtypes for spectral type, 1 subclass for luminosity class, and 4 subclasses for the helium class. These errors are in line with accuracies achieved by human classifiers.

Physical parameters are obtained by fitting observations to grids of theoretical models using a χ² minimisation procedure. A new methodology has been developed for managing and indexing large grids of theoretical models in the χ² minimisation code, SFIT. Concepts from the field of computational geometry are used to remove several limitations from this code, and pave the way for its use in a distributed parallel computing environment.

Searching for the spectra of a particular type of object in large, unknown data sets is accomplished using the multivariate statistical technique, PCA. The mechanics of this tool are outlined, and its use demonstrated by searching for hot subdwarf spectra in the SDSS. This solution provides a means to reduce unknown data sets to quantities suitable for visual inspection.

282 spectra of hot subdwarf candidates are obtained from the SDSS and analysed. The results evidence several unexplained phenomena of extended horizontal branch stars, namely: 1) the existence of the second horizontal branch gap of Newell (1973); 2) two sdB nHe-Teff sequences; and 3) a clustering of hot, helium-rich stars at Teff ≈ 44,000 K, log g = 5.7. These findings pose important questions for stellar evolution theory in the realms of the extended horizontal branch.




Contents
Acknowledgements
Abstract
List of Tables
List of Figures

1 Introduction
  1.1 Astronomical Data Mining
  1.2 Large Data Sets And Their Sources
  1.3 Astronomical Spectra
    1.3.1 Types Of Objects And Their Spectra
    1.3.2 Automatic Methods of Analysis
  1.4 Hot Subdwarf Stars
    1.4.1 Spectroscopy
    1.4.2 Stellar Evolution
    1.4.3 Why Study Them?
    1.4.4 Why Search For Them In The SDSS?
  1.5 Summary

2 Classification - Artificial Neural Networks
  2.1 Classifying Hot Subdwarfs
    2.1.1 The Training Sample
    2.1.2 Methodology
    2.1.3 Results
  2.2 Physical Parameters
    2.2.1 Methodology
    2.2.2 Results
  2.3 Summary

3 Parameterisation - χ² Fitting
  3.1 Analysing Stellar Spectra
  3.2 SFIT
    3.2.1 Limitations of SFIT
    3.2.2 Proposal to Remove SFIT's Limitations
  3.3 Tetrahedralisation: Interpolation and Indexing
    3.3.1 Simplex Interpolation
    3.3.2 Grid Index - Delaunay Triangulation
    3.3.3 Navigating the Index - Point Location
  3.4 Testing the Modifications
  3.5 Summary

4 Filtering - Principal Components Analysis
  4.1 Constructing A PCA-Based Filter
    4.1.1 Mathematics of PCA
    4.1.2 Building A Hot Subdwarf Filter
  4.2 Searching the SDSS for Hot Subdwarfs
  4.3 Summary

5 Application I - SDSS Hot Subdwarfs
  5.1 Search Criteria And Data Sets
  5.2 PCA Filtering
  5.3 Analysis
  5.4 Results
    5.4.1 Parameterisation
    5.4.2 Classification
    5.4.3 Radial Velocities
  5.5 Sources of Error
  5.6 Analysis of PCA Filter Efficiency
  5.7 Summary

6 Application II - Other Data Sets
  6.1 2MASS-Selected Sample
  6.2 SDSS sdB-He Stars of Harris et al. (2003)
  6.3 Ahmad & Jeffery (2003) He-sdBs
  6.4 Summary

7 Conclusions And Future Work

Bibliography

Appendices
A Results for 192 Drilling et al. (2006) Hot Subdwarfs
B Results for 282 SDSS DR3 Hot Subdwarf Candidates
C Results for 83 2MASS-Selected Hot Subdwarf Candidates
D The Armagh Observatory Cluster
  D.1 Hardware Configuration
  D.2 Software Configuration
  D.3 MPICH 1.2.4 RPM Spec File
E LTE-CODES
  E.1 Directory Layout
  E.2 Build System Organisation
  E.3 Installation Instructions


List of Tables
2.1 Results of the leave-one-out procedure as applied to a committee of five 901:10:3 ANNs, for 150, 300, 500, 700 and 1000 training iterations.
2.2 As Table 2.1, but for the committee of five 901:5:5:3 ANNs.
2.3 Results of parameterising the 60 calibration stars.
2.4 A comparison between ANNs and χ² minimisation for parameterising the 133 unparameterised stars.
3.1 Details of the model grid used in the comparison.
3.2 Initial parameters used for the Amoeba and Levenberg-Marquardt optimisation routines. The step sizes used for Amoeba are also given.
3.3 Results of BD+10 2179 analysis with the unmodified version of SFIT.
3.4 Results of BD+10 2179 analysis with the modified version of SFIT.
3.5 The model grid used to obtain physical parameters of the set of test models.
3.6 RMS comparison of parameterisation results from each interpolation method with the original parameters of each model. Also given is the RMS difference between the methods, and a comparison between the results in the region of parameter space for which both schemes seem to give their best results (see Figures 3.6 and 3.7).
5.1 Summary of data quantities obtained from the SDSS DR3.
5.2 The model grid used to obtain physical parameters from the SDSS hot subdwarf candidates.
6.1 Parameters of the two calibration stars as obtained by χ²-fitting to NLTE (Green et al., 2006) and LTE (Armagh) model atmospheres. Formal errors are given in parentheses.
6.2 Classification results for the sdB-He stars of Harris et al. (2003).
6.3 Classification results for the Ahmad & Jeffery (2003) He-sdBs.
A.1 Parameterisation Results for 192 Drilling et al. (2006) Hot Subdwarfs
B.1 Results for 282 SDSS Hot Subdwarf Candidates
C.1 Results for 83 2MASS-Selected Hot Subdwarf Candidates


List of Figures
1.1 A stellar spectrum (top), and a galaxy spectrum (bottom). (Taken from the SDSS)
1.2 Example of a quasar (top) and carbon star (bottom) spectrum. (Taken from the SDSS)
1.3 The emission spectrum of the Orion nebula (M42).
1.4 Examples from each hot subdwarf spectrographic subgroup. Classifications listed are those from Drilling et al. (2006).
1.5 Schematic temperature-luminosity diagrams showing: a) the positions of stars belonging to the main stellar groups; b) the normal sequence of stellar evolution experienced by a star of a few solar masses; c) possible evolution of an sdB star in a binary system. (Diagram courtesy of C.S. Jeffery).
2.1 The training sample shows clustering in certain regions of the classification space. For clarity, points have been offset by small random shifts in both coordinates.
2.2 Results of the leave-one-out procedure for both ANN architectures at the near-optimal training time of 300 iterations for the 901:10:3 architecture (left column), and 500 iterations for the 901:5:5:3 architecture (right column). Also plotted is the best-fit linear least squares line.
2.3 Parameterisations of the 60 calibration stars. Results from each method have been combined onto each plot. ANN results are indicated by blue crosses, and χ² minimiser results by red pluses.
2.4 Parameterisations of the 133 unparameterised stars using the ANNs and χ² minimiser. Also shown is the best-fit linear least squares line.
3.1 Example of a k-D tree in two dimensions. On the left is the representation of how the k-D tree on the right splits up the x, y plane. (Adapted from Moore 1991.)
3.2 A 1-simplex is a line segment. A 2-simplex is a triangle. A 3-simplex is a tetrahedron.
3.3 In two dimensions, the Delaunay triangulation guarantees that no other points lie in the circumcircle of any simplex.
3.4 The line segment, L, is constructed using the centroid of the starting tetrahedron, T, and the interpolation point, p. The tetrahedra visited on the walk-through are coloured grey.
3.5 Parameterisation results from the linear interpolation in tables method. Clearly visible are anomalous results arising from a suspected defect in the method's implementation.
3.6 Parameterisation results from the linear interpolation in tables method. Axes have been restricted to give a view of the grid boundaries described in Table 3.5.
3.7 Parameterisation results from the simplex-based interpolation scheme. In contrast with Figures 3.5 and 3.6, the simplex-based scheme clearly restricts the optimisers to the grid boundaries.
4.1 Principal component analysis. u1 is the first principal component and the axis onto which the projected positions of the data have their maximum sum. u2 is the second principal component, and u1 · u2 = 0.
4.2 Mean spectrum of the Drilling et al. (2006) sample.
4.3 First five PCs of the Drilling et al. (2006) sample.
4.4 Second five PCs of the Drilling et al. (2006) sample.
4.5 Cumulative variance of the first ten PCs of the Drilling et al. (2006) sample.
4.6 Illustration of projecting hot subdwarf spectra onto the first four PCs of the Drilling et al. (2006) standards.
4.7 Histogram of reconstruction errors from the SDSS data sample.
4.8 Spectra in first three reconstruction error histogram bins (R 3.0).
4.9 Spectra in first three reconstruction error histogram bins (R 3.0).
4.10 Sample of spectra from the eighth error bin (R 3.0).
4.11 Sample of spectra from the fourteenth error bin (R 4.5).
4.12 Sample of high S/N DA white dwarfs from the 22nd - 24th error bins (R 6.4 - 7.1).
4.13 Sample of spectra from the fifty-third error bin (R > 15.0).
5.1 Histogram of reconstruction errors for the colour-colour selected SDSS sample.
5.2 Parameterisation results of the 282 SDSS hot subdwarf candidates. The helium main sequence of Paczyński (1971), and post-EHB evolutionary tracks of Dorman et al. (1993) are also plotted.
5.3 Four example fits from the 282 SDSS hot subdwarfs. The classification and physical parameters (Teff (K), log g, log(nHe/nH)) obtained for each star are printed in the lower corners of each plot.
5.4 The results of applying a kernel density estimate analysis to the data from Figure 5.2. The low-density region at Teff ≈ 22,500 K is prominent, along with another possible low-density region at Teff ≈ 41,000 K.
5.5 Classification results of the 282 SDSS hot subdwarf candidates. Points have been given small random offsets in each axis for clarity.
5.6 A comparison of the ANN classifications of the 282 SDSS hot subdwarf candidates (left-most plots) with all the stars classified by Drilling et al. (2006) (right-most plots). Points have been given small random offsets in each axis for clarity.
5.7 A calibration of the ANN classifications onto the Drilling et al. (2006) system using the 282 SDSS hot subdwarf candidates.
5.8 The distribution of SDSS-derived redshifts of the 282 hot subdwarf candidates.
5.9 Examples of white dwarf and BHB contaminants. A - BHB star with deep Balmer lines. B - DA white dwarf with strong, broad Balmer lines due to high surface gravity. C - DB white dwarf. D - Uncertain (some evidence of weak carbon absorption, so possibly a DQ white dwarf).
5.10 This gray-shaded region of the log g-Teff plane represents an area of good probability that the stars within it are subdwarfs.
5.11 TP rates (red) and FP rates (blue) of the PCA filter as a function of the reconstruction error threshold, R. The green curve is the difference between the TP and FP rates.
5.12 A closer examination of the TP and FP rates. The peak in the green TP-FP curve occurs at R ≈ 7.0 and signifies the optimum value for R in the SDSS sample.
6.1 SFIT physical parameters for 2MASS-selected sample. The helium main sequence of Paczyński (1971), and post-EHB evolutionary tracks of Dorman et al. (1993) are also plotted.
6.2 ANN classification for 2MASS-selected sample. Points have been given small random offsets in each axis for clarity.
6.3 The stars assigned late-A and early-F spectral types by the neural network.
6.4 Comparison of ANN classifications with those of Drilling et al. (2006) for the 17 He-sdBs of Ahmad & Jeffery (2003). Points have been given small random offsets in each axis for clarity. Also plotted is the best fit least squares regression line with error bars showing the RMS of the residuals.
7.1 Schematic diagram showing how the work of this thesis fits in with the wider system envisaged by Jeffery (2003).




Chapter 1

Introduction
The spectroscopy of light from astronomical objects is one of the most important methods for understanding the physics at work in the universe. Many fundamental parameters of those objects can be determined by analysing their spectrum, including temperature, chemical composition, motion, and other clues about their origin and evolution. Advances in information technology over the past 35 years, and their subsequent influence on observational methods, have allowed spectroscopic studies of unprecedented numbers of objects to be carried out over a short period of time. Modern astronomy is now about dealing with very large quantities of data, and the problems associated with its management and analysis.

This project develops a collection of tools to assist astronomers in data mining large sets of astronomical spectra. The tools are general in nature, and can be used to search for and automatically study the spectra of potentially any type of astronomical object. Together, the tools form a semi-automatic pipeline allowing a fast progression from large quantities of unknown spectra to useful scientific results.

In the past, studies of automatic methods of spectral analysis have mainly centred around the problem of object classification. This makes sense from the point of view of a survey mission because it is desirable to know what types of objects have been observed, with particular interest being paid to those objects not falling into any known category. However, the individual astronomer, studying a particular type of object, is not always interested in large-scale classification. He needs a way to search exclusively for samples in a data set which are most like his object of interest. Once located, those samples are likely to exist in large enough numbers to require further automatic assistance in their analysis. The techniques needed to help solve this problem already exist in the field, but they have not yet been brought together and adapted to form any sort of useful, coherent system. As such, scientific insights contained in large data sets remain mostly untapped. The work in this project represents what seems to be the first attempt at rectifying this issue. Three major algorithms are employed to construct a general data mining tool set.

1. Principal Components Analysis is applied in a supervised classification role to create a filter that can help search for a specific type of object in an unknown data set.
2. Artificial Neural Networks have been shown to be a robust and versatile tool for many tasks in astronomy. They are used here to provide spectral classifications.
3. χ² minimisation is used to derive physical parameters for spectra by fitting them to grids of theoretical models.
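The PCA-based filter named in the first item can be illustrated with a minimal sketch. It is not the implementation used in this work: the function names, the use of an RMS residual as the reconstruction error, the synthetic data, and the threshold value are all assumptions made for illustration. The idea shown is that spectra resembling the reference set are reconstructed well from a few principal components, while unrelated spectra are not.

```python
import numpy as np


def fit_pca(train, n_components):
    """Return the mean spectrum and the leading principal components.

    `train` is an array of shape (n_spectra, n_pixels) holding the
    reference spectra of the object type being searched for.
    """
    mean = train.mean(axis=0)
    # Rows of `vt` are orthonormal principal components, ordered by variance.
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    return mean, vt[:n_components]


def reconstruction_error(spectra, mean, components):
    """RMS residual after projecting each spectrum onto the components."""
    centred = spectra - mean
    reconstructed = (centred @ components.T) @ components
    return np.sqrt(np.mean((centred - reconstructed) ** 2, axis=1))


def pca_filter(spectra, mean, components, threshold):
    """Boolean mask: True where a spectrum is reconstructed well enough."""
    return reconstruction_error(spectra, mean, components) <= threshold


# Synthetic demonstration (assumed data): reference "spectra" that all lie
# in a two-dimensional subspace of a 50-pixel space.
rng = np.random.default_rng(0)
basis = rng.normal(size=(2, 50))
reference = rng.normal(size=(30, 2)) @ basis + 5.0
mean, components = fit_pca(reference, n_components=2)

similar = rng.normal(size=(1, 2)) @ basis + 5.0   # same family as the references
unrelated = rng.normal(size=(1, 50)) * 3.0 + 5.0  # does not share their structure
mask = pca_filter(np.vstack([similar, unrelated]), mean, components, threshold=0.1)
print(mask)
```

In practice the threshold would have to be tuned against a sample of known objects, since it trades completeness against contamination.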

Additional minor tools to facilitate data processing, management, and visualisation are also prototyped. Furthermore, a new and original methodology has been developed to extend the functionality of the χ² minimisation code, SFIT, used at the Armagh Observatory.



The code is modified using concepts from the field of computational geometry to allow the use of arbitrarily large, three-dimensional grids of theoretical models. This removes several severe limitations from the program, and prepares it for further modification to permit its use in a distributed computational environment.

The specific outcome of this project is a set of general tools which can be used to study the spectra of any astronomical object, and a "real-world" demonstration of these tools through their application to search for and analyse the spectra of hot subdwarf stars from the archives of the Sloan Digital Sky Survey. The results evidence several unexplained phenomena of extended horizontal branch stars that pose important questions for the theory of stellar evolution.

The work undertaken in this project is a step towards the larger computational framework of Jeffery (2003) which outlines a wider system incorporating the management of atomic data, dynamic generation and storage of grids of theoretical models, parameter space visualisation, and automated analysis. The use of distributed computational resources, such as the Grid, is also envisaged.
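The χ² fitting of observations to a grid of theoretical models, introduced above, can be sketched in its simplest form as a search for the grid point with the lowest χ². This is only an illustration under stated assumptions: it is not SFIT, it does not interpolate between grid points or use an optimiser, and the function names and the `toy_line_profile` stand-in for a model atmosphere are hypothetical.

```python
import numpy as np


def chi_squared(observed, model, sigma):
    """Sum of squared, noise-weighted residuals between data and model."""
    return float(np.sum(((observed - model) / sigma) ** 2))


def best_grid_model(observed, sigma, grid):
    """Return the parameters of the grid model with the lowest chi-squared.

    `grid` maps a parameter tuple, e.g. (teff, logg), to a model flux array
    sampled on the same wavelengths as `observed`.
    """
    scores = {params: chi_squared(observed, flux, sigma)
              for params, flux in grid.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]


def toy_line_profile(wave, teff, logg):
    """Hypothetical stand-in for a model spectrum: one Gaussian line whose
    depth scales with teff and whose width scales with logg. Not physical."""
    depth = teff / 1.0e5
    return 1.0 - depth * np.exp(-0.5 * ((wave - 4471.0) / logg) ** 2)


# A small assumed grid of nine toy models.
wave = np.linspace(4400.0, 4540.0, 200)
grid = {(teff, logg): toy_line_profile(wave, teff, logg)
        for teff in (20000.0, 30000.0, 40000.0)
        for logg in (5.0, 5.5, 6.0)}

# A noisy "observation" generated from one of the grid models.
sigma = 0.002
rng = np.random.default_rng(1)
observed = toy_line_profile(wave, 30000.0, 5.5) + rng.normal(0.0, sigma, wave.size)

best_params, best_chi2 = best_grid_model(observed, sigma, grid)
print(best_params, round(best_chi2, 1))
```

A real fit would interpolate between neighbouring models so that the minimum can fall between grid points, which is where the indexing of large grids becomes important.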

1.1 Astronomical Data Mining

The term "data mining" refers to the use of a broad set of techniques and algorithms for extracting useful patterns and models from very large data sets. Typically, the goal is to discover either something hitherto unknown about a phenomenon that only becomes apparent when it is studied en masse, or else a new phenomenon that only becomes apparent when observations are gathered in large enough quantities over a sufficiently wide range. Traditionally, in astronomy, much effort was invested in gathering observations of one particular object, such as a star, in an attempt to understand that object in detail. Given the universality of physics, the insights gained are usually applicable to other objects of the same type, allowing a wider understanding to be achieved.

However, advances in technology, such as large-area mosaic CCDs and multi-object fibre-fed spectrographs, mean that modern telescopes can be made to gather observations of thousands of objects in a single night. This opens up the possibility of discovering new facts about particular objects by studying their properties in large numbers, and also the possibility of discovering completely new objects. Unfortunately, this abundance of data brings with it a set of new problems. Managing all of the information requires knowledge of data formats, storage mechanisms, and techniques for indexing, searching, and analysing it all. Indeed, modern astronomy is fast becoming a cross-disciplinary endeavour, providing a rich area for exploring many aspects of computer science and statistics in the context of real-world applications.

Data Types

The nature of astronomical data means that it is inherently heterogeneous in both format and content, with observations now being gathered over all regions of the electromagnetic spectrum. Broadly speaking, astronomical data can be classified into five domains.

· Imaging data are the fundamental component of astronomical observations, capturing a two-dimensional picture of the universe within a narrow wavelength region at a particular point in time.
· Catalogues of objects are constructed by analysing imaging data, and recording many different parameters about each object such as brightness and colour, morphological information, and coordinates.
· Spectroscopy provides detailed physical quantification of objects including temperature, chemical composition, and kinematical information.
· Studies of objects in the time-domain provide valuable insight into the nature of the universe by identifying moving objects, variable sources (e.g., pulsating stars), or transient objects such as supernovae and gamma-ray bursts.
· Finally, theoretical simulations of astronomical objects are an important source of data. Comparing theoretical models with observational data is the central mechanism in understanding how these objects formed and have evolved.

Each of these data domains carries its own particular problems to be solved in a data management and mining context. Imaging data and catalogue construction require robust, automatic techniques to identify sources distinct from background-level noise, then differentiate between different types of objects (e.g., stars, galaxies, and comets), and finally index these data to allow fast searching based on spatial criteria.

Spectroscopy and time-domain data require more involved algorithms for the automated reduction and calibration of observations, algorithms which often have to be tailored for a specific instrument and telescope setup. The automatic analysis of spectroscopic data typically seeks to classify an object onto a predefined categorical system by somehow comparing the object with the set of standards which define the system. The physics of an object which are manifest in its spectrum are determined by computing accurate theoretical models and comparing them with the observations. Any results then need to be stored and indexed with the observations in a manner that allows for further re-analysis as improved observations and theoretical models become available.

Numerical simulations to generate theoretical models are always in need of powerful and plentiful computational resources to allow more detail and precision to be attained. As models will always have a shorter shelf-life than observations, appropriate meta-data needs to be recorded and stored with the models so a historical record can be kept as the underlying physics improves. This meta-data is also needed to help automate the parameterisation of observations by providing a means to explore grids of models, and ascertain when new models need to be generated to cover a required part of the parameter space.



1.2 Large Data Sets And Their Sources

Three main sources contribute to large observational data sets in astronomy, namely, those generated by specific surveys, general-purpose observatories, and space missions. In recent years, Virtual Observatory projects have been investigating ways to combine the various databases generated by these sources, mapping out the computational infrastructures and tools needed to explore large data volumes.

Specific Surveys

Digital sky surveys generate very large quantities of homogeneous data over multiple wavelengths. As such, they are the main drivers behind the study of data mining methods in astronomy. The Digitized Palomar Observatory Sky Survey1 (DPOSS; Djorgovski et al., 1998) is a digital survey of the entire Northern sky in three visible-light bands, based on the photographic sky atlas, POSS-II,