Understanding the schema


What is the difference between a "schema", a "schema language", and a "data model"?

The database schema is like a blueprint that describes the layout of the data contained in the database: what kinds of fields are present and how they are organized. For example, it could state that one kind of experiment will be called a binding_assay, and will contain a field for the contributor's name, which will be a string of characters, another field which will be a number, and so on. The schema is stored in a file that is read by the database software, so minor changes like, say, adding a new field are relatively easy from a programming perspective, because they can be made in this file without changing the programs at all. (Though of course, all the existing data would need to be changed to conform to the new schema file, which could be a rather tedious job.)

The language for specifying the schema is just a set of grammatical rules for describing schemas in general. That is, what exactly should the schema file say in order to tell the database software that there will be a table called binding_assay with a contributor field, and so on? Using the blueprint analogy, the schema language is like the symbols architects use to indicate room dimensions, windows and doors, brick vs. stucco, etc. Changes to the schema language are made in the actual database software, but are usually easy to do as long as they are only changing how things are said, not what can be said. For example, adding semicolons at the end of each line, or using Number in place of Integer, or requiring Begin Table before each new table description, are all trivial tasks. The schema file would need to be changed to conform to the new language, but the data could be left alone.

But what happens when you want to change the very nature of what can be specified in a schema file? That means a change to the data model used by the database software. Unlike the schema, the data model is not isolated in a separate file for easy changing, but rather is hard-coded right in the software itself, so changes to it can have a profound impact on the database programs. The data model cannot be separated from the software because it provides the framework that determines what the software must be able to handle--the programs must have special routines to process each possibility.

What kind of data model does the Globin Gene Server use?

Originally, the prototype version of our database of experimental results used a relational model. This is a very popular data model where the data are stored in tables with rows and columns, and each entry in a table is a simple value like a string of characters or a number. There are usually several tables, and they are linked together by common fields, so the database software can match up the data from various tables as needed.

However, we found that our data is so complex that in order to adhere to a strictly relational model, we would need an awful lot of tables linked together in confusing ways. For example, a DNA transfer experiment can use a construct with several segments of DNA, each of which can be of several types and may include an arbitrary number of mutations, and moreover the order of the segments is important. In addition, there can be several results measured for each construct, some with very complicated units.

So, the new version of the database has been revised to use a nested data model, similar to those described in papers by Roth et al. (1) and by Pistor and Traunmueller (2). In this model, the table entries are not always simple numbers or strings; they can be subtables instead. In that case, each row in the table contains an entire "miniature" table for that field. The subtable has its own rows and columns, and it can have all the features of a main table, including subtables of its own. This model produces a schema that is still very complex (because our data is complex), but we feel it gives a more natural and intuitive representation of the data than we would obtain using the purely relational model.

So how do I read the new schema?

The schema file is now in two sections. The first part, Type Definition Macros, is a sort of prologue. It does not actually declare any tables; it just sets up definitions for common groups of fields that tend to occur frequently. The second part, Specifications for Data Tables, is where the actual table declarations appear. This may seem complicated at first, but it allows the table declarations in the second part to be much more concise.

The basic underlying syntax element of the schema language looks like this:

	name : type  constraints
where name is the name of the object being described (typically a table or field), type describes the kind of values the object can contain, and constraints is an optional list of more specific requirements for those values. Generally the name and constraints are relatively straightforward, but the type can become quite complex because that's where the nesting occurs.
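
For instance, here is what a declaration of a single simple field might look like (the Key constraint attached to it is purely for illustration, and is described below):

	contributor: STRING  KEY

This says there is a field named contributor, that its values are strings of characters, and that those values can serve as unique identifiers for the rows.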

To make it a bit easier for human readers to distinguish what's what, keywords that are specific to the schema language syntax are capitalized, while names and values are usually written in mixed or lower case. Type definition macros are sort of in-between; they are capitalized because they act more like types than like names.

There may also be comments in the schema file. When the schema software encounters a # sign, it ignores the rest of that line. This feature can be used to add notations for humans to read, or to prevent the computer from trying to process features that are not implemented yet.
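
For example, in a line like this one (the comment text is invented for illustration):

	GenBankID: STRING    # the fragment's GenBank accession number

the software sees only the declaration of GenBankID as a String; everything after the # sign is just a note for human readers.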

Data types

Our model has two general classes of data types: simple types, which consist of a single value, and composite types, which are built up from several parts. The simple types currently supported are String, Integer, and Real; the composite types are Record, Set_Of, List_Of, and Variant.

Simple types

String values consist of words, or more precisely, a sequence of text characters. If the string contains any spaces, carriage returns, or punctuation other than underscores, it must be enclosed in single quotes, 'like this'. Using the quotes never hurts, so when in doubt, put them in. (In fact, when you are writing queries for Genie, you must use them around all string values.)

Integer values are positive or negative numbers without any decimal point (never use quotes).

Real values are positive or negative numbers that do have a decimal point (never use quotes for them either).
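
For example, here are a few values of each simple type (the values themselves are made up):

	cap_site        a String (no quotes needed)
	'cap site +2'   a String (quoted: it contains a space and punctuation)
	-135            an Integer (no decimal point)
	0.5             a Real (decimal point required)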

Records

A Record type is simply a group of subfields that are kept together because they have some logical association with each other. For example, there are many places in the schema that refer to the location of a fragment of DNA in the beta-like globin gene cluster. In each case, we need to know the origin from which the position is measured (e.g. the cap site of the human G-gamma-globin gene), the start offset from that origin (i.e., the 5' end of the fragment), and the stop offset (the 3' end). Conceptually, these three fields work together to describe the location, so they are grouped into one composite field, like this:

	region: RECORD (
	     origin: STRING
	     start:  INTEGER
	     stop:   INTEGER
	)
Here all the fields between the parentheses are part of region's type. (Actually, this is a simplified example; the real schema makes this into a type definition macro called Region_Descriptor because it is used so often, and also adds a few constraints, which will be discussed below.)
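
For the curious, here is a sketch of what that macro might look like in the first section of the schema file (the exact definition syntax shown here is a guess, and the real macro's constraints are omitted):

	REGION_DESCRIPTOR: RECORD (
	     origin: STRING
	     start:  INTEGER
	     stop:   INTEGER
	)

Once such a definition is in place, later tables can simply say region: REGION_DESCRIPTOR instead of repeating the three fields, as the examples below do.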

Sets

A Set_Of type is similar to a record, except that the group of fields can be repeated any number of times, once for each row. For example, a fragment of DNA can have an arbitrary number of mutations, so they are stored in a set:

	mutation: SET_OF {
	     region: REGION_DESCRIPTOR
	     type: STRING
	     description: STRING
	}
This is saying that there can be any number of mutations, each of which has a region, type, and description. (Here type happens to be the name of a field containing the type of mutation, such as substitution or deletion; it does not mean a schema data type.) Notice that the data type for region is the Region_Descriptor macro discussed above; i.e., it is really a Record containing an origin, start, and stop.

Since a set appears in multiple copies with the same group of fields in each, it is just like a table with rows and columns. In fact, that's exactly how tables are defined in the schema:

	binding_assay: SET_OF {
	     |
	     |
	     |
	}
The fact that a field within a table can itself be a table illustrates the nesting which is characteristic of this data model.
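
To make the nesting explicit, here is a simplified sketch that drops the mutation set from the previous example into a table (the table name and its other field are hypothetical):

	DNA_sample: SET_OF {
	     name: STRING
	     mutation: SET_OF {
	          region: REGION_DESCRIPTOR
	          type: STRING
	          description: STRING
	     }
	}

Each row of DNA_sample carries its own miniature mutation table, and that subtable can have any number of rows.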

Lists

A List_Of type is just like a set, except that the order of the rows is significant. For example, in a DNA transfer experiment the order of segments within a construct affects the experimental results, so we use a list to store them:

	construct: SET_OF {
	     construct_segment: LIST_OF {
	          segment: SEGMENT_DESCRIPTOR
	     }
	}
This piece of the schema is saying that there can be a set of constructs for a single experiment (though in fact there is usually just one), and that each of them consists of an ordered list of construct_segments. At first glance each construct_segment appears to contain just one field, segment. However, the type of segment is a type definition macro, and upon looking up Segment_Descriptor in the first section of the schema, we find that it is actually a Record with some rather complicated fields inside it.

This example illustrates something else worth mentioning. Our convention is to name a set or list according to an individual, singular row in the table it represents. That is, the set in this example is named construct (singular, not plural) because each of the rows in it is a construct. Each construct in turn happens to contain nothing but a list, though there could have been other fields. The list is named construct_segment (singular) because each of its rows is a construct segment. The various fields describing each segment have been consolidated into a single logical unit via the Segment_Descriptor macro, both for conceptual convenience and to avoid repeating all the details when segments are used in another table. Similarly, in an earlier example we saw that a set of mutations is named mutation (singular). This convention can make the schema seem a bit confusing at first until you get used to it, but we use it because it makes the data files much less confusing.

Variants

A Variant type is like a record that comes in different "flavors". Each flavor has a different group of subfields associated with it, and a particular value for the variant will use one group or another, but never more than one. The value includes a variant tag that indicates which flavor is being used. For example, there are three kinds of DNA fragments in our database: regions from the beta-like globin gene cluster, regions from elsewhere, and oligos. Here is roughly what the schema looks like:

	DNA_fragment: VARIANT [
	     <beta_g_region>     =>  region: REGION_DESCRIPTOR
	                             mutation: SET_OF {
	                                  region: REGION_DESCRIPTOR
	                                  type: STRING
	                                  description: STRING
	                             }
	     <non_beta_g_region> =>  species: SPECIES_NAME
	                             description: STRING
	                             abbrev: STRING
	                             GenBankID: STRING
	                             length_basepairs: INTEGER
	     <oligo>             =>  sequence: STRING
	]
The three items in angle brackets <> are the variant tags that identify the possible flavors for this variant record. That is, a particular fragment can use the fields for beta_g_region (region and mutation) or the fields for non_beta_g_region (species ... length_basepairs), or the fields for oligo (just sequence). The appropriate tag will be stored with the value to identify which flavor was used.

Sometimes a variant tag will show the keyword Empty instead of a list of fields. This means no fields are necessary for this particular flavor--the tag alone is enough to describe it.
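
For instance, a hypothetical variant using Empty might look like this:

	strand: VARIANT [
	     <forward> =>  EMPTY
	     <reverse> =>  EMPTY
	]

Here the stored tag (forward or reverse) says everything there is to say, so neither flavor needs any fields of its own.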

Constraints

Each kind of data type discussed above can be followed by zero or more additional requirements that further describe the values it can take. Some of the following constraints are not supported yet, but will be in some future version of the database; they are noted below and appear in the schema preceded by # signs so the software will ignore them for the time being.

The Optional constraint is really more of an "un-constraint"; it indicates that it is OK for a field's value to be Null (i.e., missing). Without this, the software will insist that every row must have a value for this field.

The Key constraint means that this field can serve as a unique identifier for distinguishing the rows. The software will insist that every row must have a value for this field, and that they must all be different. Key fields are never Optional.

The Foreign_Key constraint means that this field can be used as a link to look up a corresponding row in another table by matching it with the other table's Key field. The software will insist that the other table must actually have a row with the matching key value (unless this field is also Optional and its value is Null). [This feature is not implemented yet.]

The Choice_Of constraint limits a String field to a short list of possible values. The software will insist that no other values are used (except possibly Null, if the field is also Optional).

The Regexp constraint limits a String field's value to match the given pattern according to the rules of regular expression matching (unless the field is also Optional and its value is Null). [This feature is not implemented yet.]

The Restrict constraint allows a variety of logical expressions to be used as restrictions on a field's value. For example, the software could be made to insist that a year must be greater than 1900, or that a region's stop value must be greater than its start value (unless the field is also Optional and its value is Null). [This feature is not implemented yet.]
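
To see how constraints look in practice, here is a sketch combining several of them (the field names and exact punctuation are illustrative, and the placement of the commented-out Restrict line, hidden from the software with a # sign as noted above, is a guess):

	contributor: STRING  KEY
	year: INTEGER  OPTIONAL
	     # RESTRICT (year > 1900)
	type: STRING  CHOICE_OF ('substitution' 'deletion')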


References:

(1) Mark A. Roth, Henry F. Korth and Don S. Batory, SQL/NF: A Query Language for non-1NF Relational Databases, Information Systems 12(1): 99-114 (1987)

(2) P. Pistor and R. Traunmueller, A Database Language for Sets, Lists and Tables, Information Systems 11(4): 323-336 (1986)


6/96