Querying the ALMA archive��

Finding DelayCals of a minimum duration and listing their sources��

This is a moderately complicated query that demonstrates most of the features of the query language that a casual user is likely to want to use:

There is a join between the ‘scan intent’ table and ‘source’ table illustrating how multiple tables with different information can be analysed together
There are multiple WHERE conditions illustrating sub-selection of specific rows
There a number of XPath expressions which extract data from the XML-based fields

Here is the query:

SELECT TScan.timestamp, TScan.cont, TSource.sn, intent
FROM (SELECT archive_uid,
             extract(aa.xml,
                     '//row[1]//scanIntent/text()').getStringVal() as intent,
             extract(aa.xml,
                     '//row[1]//startTime/text()').getStringVal() as stime,
             extract(aa.xml,
                     '//row[1]//endTime/text()').getStringVal() as etime,
             extract(aa.xml,
                     '//ContainerEntity[1]/@entityId').getStringVal() as cont,
             timestamp
      FROM ALMA.xml_scantable_entities aa) Tscan,
      (SELECT timestamp,
              archive_uid,
              extract(ab.xml,
                      '/SourceTable//sourceName/text()').getStringVal() as sn,
              extract(ab.xml,
                      '//ContainerEntity[1]/@entityId').getStringVal() as cont
              FROM ALMA.xml_sourcetable_entities ab) Tsource
WHERE TSource.timestamp > '20-jun-10' AND
      TScan.timestamp > '20-jun-10' AND
      Tscan.intent like '%%CALIBRATE_DELAY%%' AND
      Tscan.intent like '%%CALIBRATE_PHASE%%' AND
      (etime-stime) > 600e9 AND
      Tscan.cont=Tsource.cont;

The first row of the query:

SELECT TScan.timestamp, TScan.cont, TSource.sn, intent

specifies which information we actually want to see from the tables being analysed (documentation). Note that when more than one table is being considered in the query has the same column name, it is necessary to specify the name of the table as well as the column name. This is done with the TableName.ColumnName syntax which is seen above for the timestamp, cont and sn columns.

The next part of the query is the FROM clause which specifies which tables we are considering in this query. In this case these source tables are not tables that already exist in the ALMA data model but rather they are themselves generated by a sub-query. The first of these sub-queries has the following form:

(SELECT archive_uid,
        extract(aa.xml,
                '//row[1]//scanIntent/text()').getStringVal() as intent,
        extract(aa.xml,
                '//row[1]//startTime/text()').getStringVal() as stime,
        extract(aa.xml,
                '//row[1]//endTime/text()').getStringVal() as etime,
        extract(aa.xml,
                '//ContainerEntity[1]/@entityId').getStringVal() as cont,
        timestamp
 FROM ALMA.xml_scantable_entities aa) Tscan,

The body of the query is in parenthesis which signifies that it is a sub-query. Following the parenthesis is the symbol “Tscan” – this means we have created a name for this subquery with this value. Within the parenthesis the sub-query has the same structure as the top-level query:

The SELECT statement specifies which information (“columns”) we want to retrieve
The FROM statement specifies from which table to extract information
By placing the symbol after the table name in the FROM statement we create another name for it, in this case we create a name aa for the table ALMA.xml_scantable_entities

What makes this sub-query complicated is of course the extract expressions. These expressions are used parse semi-structured information contained in XML fields. The first argument is the column (aa.xml which in this case is ALMA.xml_scantable_entities.xml) to parse and the second is the XPath expression (explained later) which specifies which information to extract. After the extract statement there is a statement “as” and a name – this form gives names to the columns created using the extract statements.

There are two sub-queries in this top-level query:

The first is based on information in table ALMA.xml_scantable_entities, creates a temporary table named “Tscan” and contains information on what the intent of each scan in the archive is
The second is based on table ALMA.xml_sourcetable_entities, creates a temporary table named “Tsource” and contains information about the source observed in each scan

The last part of the query is the WHERE section:

WHERE TSource.timestamp > '20-jun-10' AND
      TScan.timestamp > '20-jun-10' AND
      Tscan.intent like '%%CALIBRATE_DELAY%%' AND
      Tscan.intent like '%%CALIBRATE_PHASE%%' AND
      (etime-stime) > 600e9 AND
      Tscan.cont=Tsource.cont;

This section has two purposes:

To filter out some information – in this case we are interested only in observations:
- After 20 June 2010
- With minimum scan length of 600 seconds (note that times recorded in the database are in units of nanoseconds)
- Only the observations which were tagged both CALIBRATE_DELAY and CALIBRATE_PHASE
To join the two source tables that we use together. In this case the join is created using the UID of the ContainerElement property in the XML fields

Understanding Joins��

In the present case this is a very simple concept: the information we require is in two tables (ALMA.xml_scantable_entities and ALMA.xml_sourcetable_entities) and we want to join rows in these two tables that belong to the same observation.

We implement this join by:

Finding the observation uid for each element of the XML column in each of the tables using XPath expressions to look for the ContainerElement element
Inserting a WHERE statement Tscan.cont=Tsource.cont which creates the join

Understanding XPath expressions��

Because of incorrect design of the ALMA archive, even very simple queries require the use of XPath language – basically the underlying reason is that even well structured data are stored in semi-structured XML and not in the tables of the database.

The XPath query language is defined here: http://www.w3.org/TR/xpath/. It is a language for extracting a (re-ordered) sub-set of an XML document. The word document is used in the technical sense – in the case of ALMA the document is each XML field.

An XPath expression is applied to an XML field in the database using the extract function (doc) in Oracle.

To write practical queries the following steps need to be done:

Inspect the XML field of a table you think might be interesting to see what information it contains. To do this inspection, just save the a few rows of the table and look at them in an editor.
Write an XPath expression which extract a single element from the XML field that contains the information required. Extracting multiple information at once works poorly if at all.
Use this XPath in an extract function to retrieve the information and for essentially a pseudo-column of the table

Here is an example of a dump from the “SCANTABLE” table:

"ARCHIVE_UID","TIMESTAMP","XML","SCHEMAUID","OWNER","DELETED","READPERMISSIONS","WRITEPERMISSIONS","HIDDEN","DIRTY","VIRTUAL"
"uid://X02/X11b/X1",05-DEC-09 23.36.53.435000000,<?xml version=""1.0"" encoding=""ISO-8859-1""?> <ScanTable xmlns:xsi=""http://www.w3.org/2001/XMLSchema-instance"" xmlns=""http://Alma/XASDM/ScanTable"" > <Entity entityId=""uid://X02/X11b/X1"" entityIdEncrypted=""na"" entityTypeName=""ScanTable"" schemaVersion=""1"" documentVersion=""1""/><ContainerEntity entityId=""uid://X02/X101/X1"" entityIdEncrypted=""na"" entityTypeName=""ASDM"" schemaVersion=""1"" documentVersion=""1""/><row> <scanNumber> 1 </scanNumber> <startTime> 4766772926114000000 </startTime> <endTime> 4766772973994000000 </endTime> <numIntent> 1 </numIntent> <numSubScan> 2 </numSubScan> <scanIntent> 1 1 CALIBRATE_POINTING</scanIntent><calDataType> 1 1 TOTAL_POWER</calDataType><calibrationOnLine> 1 1 true  </calibrationOnLine> <calibrationFunction> 1 1 UNSPECIFIED</calibrationFunction><calibrationSet> 1 1 NONE</calibrationSet><calPattern> 1 1 CROSS_SCAN</calPattern><numField> 1 </numField> <fieldName> 1 1 ""Jupiter""  </fieldName> <sourceName> Jupiter </sourceName> <flagRow> false </flagRow> <execBlockId> ExecBlock_0 </execBlockId> </row> </ScanTable>,"uid://A002/X14/X5e","user",0,"","",0,0,0

And here is just the XML reformatted to be easier to view on-screen:

<?xml version="1.0" encoding="ISO-8859-1"?>
<ScanTable xmlns:xsi=""http://www.w3.org/2001/XMLSchema-instance"" xmlns="http://Alma/XASDM/ScanTable" >
<Entity entityId="uid://X02/X11b/X1" entityIdEncrypted="na"
     entityTypeName="ScanTable" schemaVersion="1"
     documentVersion="1"/>
  <ContainerEntity entityId="uid://X02/X101/X1"
                entityIdEncrypted="na" entityTypeName="ASDM"
                schemaVersion="1"
                documentVersion="1"/>
  <row><scanNumber> 1 </scanNumber>
    <startTime>4766772926114000000 </startTime>
    <endTime>4766772973994000000 </endTime>
    <numIntent> 1 </numIntent>
    <numSubScan>2 </numSubScan>
    <scanIntent> 1 1 CALIBRATE_POINTING</scanIntent>
    <calDataType> 1 1 TOTAL_POWER</calDataType>
    <calibrationOnLine> 1 1 true </calibrationOnLine>
    <calibrationFunction> 1 1 UNSPECIFIED</calibrationFunction>
    <calibrationSet> 1 1 NONE</calibrationSet>
    <calPattern> 1 1 CROSS_SCAN</calPattern>
    <numField> 1 </numField> <fieldName> 1 1 "Jupiter"</fieldName>
    <sourceName> Jupiter </sourceName>
    <flagRow>false </flagRow>
    <execBlockId> ExecBlock_0 </execBlockId>
  </row>
</ScanTable>

It is easy to identify from this print-out the important information:

The attribute entityId of the ContainerEntity element is the uid of the overall observation – we use this to join tables together
Elements startTime and endTime contain the start and ending times of each scan in this observation
Element scanIntent describes the purpose of this scan

The XPath expressions given below illustrate most typical use:

//row[1]//scanIntent/text() means get the text of the scanIntent element of the first row element in the field
//ContainerEntity[1]/@entityId means get the value of attribute entityId of the first ContainerEntity[1] in the field

Querying the ALMA archive��

Finding DelayCals of a minimum duration and listing their sources��

Understanding Joins��

Understanding XPath expressions��

Table Of Contents

Previous topic

Next topic

Navigation

Querying the ALMA archive��

Finding DelayCals of a minimum duration and listing their sources��

Understanding Joins��

Understanding XPath expressions��

Table Of Contents

Previous topic

Next topic

Quick search

Navigation