Genie is a query language for the Database of Experimental Results. It resembles a well-known database language called SQL which is popular for relational databases; however Genie has been enhanced to work with our nested data model in a manner similar to the languages described in papers by Roth et al.(1) and Pistor et al.(2).
Genie is still under construction; some common database operations such as projection and join have not been implemented yet, but it is already useful.
Here is a simple example to illustrate the general idea:
Ex. 1 QUERY reference_info WHERE year >= 1990This query is asking for the literature citations of all the papers in the database which were published during or after 1990. The first line identifies which table is to be searched, and the second is specifying the criteria that each item must meet in order to be selected. In this case, only records which have a year value greater than or equal to 1990 will appear in the result.
As you can see, it is essential to have a copy of the current schema handy in order to know what fields are available for each table. Please read Understanding the schema if you have not already done so; it explains what the schema is for and how to interpret it. The schema is expected to change from time to time in response to the needs of the user community, so be sure to use the current version.
fieldname op valuewhere op can be any of the comparison operators =, !=, <>, >, >=, <, <= (both != and <> mean not equal to, so you can use whichever you prefer.) Value can be a literal constant (remember to put single quotes around it, if it's a String) or even another field name. There are also two special values, INFINITY and NULL, which specifies a missing or blank value.
Integer and Real values will be compared numerically, and in addition to the simple comparisons listed above, you can also build arithmetic expressions with them using +, -, *, and /, as well as () for grouping.
String values are compared alphabetically (like in a dictionary), except that all blanks and underscore characters are ignored, and there is no distinction between upper and lower case. But sometimes you just want to search for a particular word, or part of a word, without trying to match the whole value exactly; this is especially true for long text descriptions. Genie deals with this situation by providing a special function called CONTAINS, which determines whether or not the first string you mention contains the second as a substring. To use it, write something like this:
CONTAINS (string1, string2)where string1 and string2 can be either field names or literal constants (remember quotes!). Again there is no distinction between upper and lower case, but this time all punctuation is ignored in addition to blanks and underscores. That is, only letters and numerals are used in the search.
region.originThere may also be occasions when you are dealing with the Record as a whole (e.g. see the section below on Overlaps and Covers), and you need to build a literal constant for the whole Record. This is accomplished using square brackets [] to represent the Record grouping with the field values listed inside, in the same order as they appear in the schema, separated by commas. For example, a literal DNA region could be written as
['humhbb_orig', 8658, 8677]
DNA_fragment IS beta_g_regionQuotes are not needed around the flavor because it is not a literal constant.
Currently there are two operators that work on Sets and Lists: EXISTS, which determines whether or not the following Set or List has any rows in it, and COUNT, which finds out exactly how many rows there are. Whenever you see in the schema that a field you want to specify lies inside a Set or List, you need to open a new nesting level by writing something like
EXISTS ( QUERY setname WHERE ... )Between the parentheses, the subquery is in its own little world, where setname is taking the role usually played by a top-level table. You might think that to refer to the subfields inside setname you would need to write setname.subfield the way you do for records. However, this is not correct because the QUERY line takes care of establishing the setname for the whole subquery, the same way it does for top-level tables. Thus it's enough to write just subfield.
OVERLAPS (region1, region2)where region1 and region2 can be either field names or literal constants.
This query retrieves all binding assay experiments contributed by Ahmed ElSherbini that included methylation interference assays with K562 extract using a probe from HUMHBB, where a protection effect was observed.
Ex. 2 # This is a comment. QUERY binding_assay WHERE CONTAINS (contributor, 'sherbini') AND assay IS methylation_interference AND assay.probe IS beta_g_region AND assay.probe.region.origin = 'humhbb_orig' AND assay.source IS nuclear_extract AND assay.source.cell = 'K562' AND EXISTS ( QUERY assay.regional_effect WHERE effect = 'protection' )
Here is a query to retrieve all conserved sequences from an area corresponding to the human HS2 region.
Ex. 3 QUERY conserved WHERE OVERLAPS (region, ['humhbb_orig', 8486, 8860])
This one finds all the conserved sequences in the database that are at least 25 bases long.
Ex. 4 QUERY conserved WHERE region.stop - region.start + 1 >= 25
This example retrieves all binding assay experiments where the probe (for gelshifts) or one of the regional effects (for non-gelshifts) overlapped the AP1 region (corresponding to 8658-8677 in the human sequence).
Ex. 5 QUERY binding_assay WHERE ( assay IS gelshift AND assay.probe IS beta_g_region AND OVERLAPS (assay.probe.region, ['humhbb_orig', 8658, 8677]) ) OR ( assay IS NOT gelshift AND EXISTS ( QUERY assay.regional_effect WHERE OVERLAPS (region, ['humhbb_orig', 8658, 8677]) ) )
This query retrieves all DNA transfer experiments whose construct included a wild-type segment corresponding to part of the HS4 region in humans.
Ex. 6 QUERY DNA_transfer_experiment WHERE EXISTS ( QUERY construct WHERE EXISTS ( QUERY construct_segment WHERE segment.DNA_fragment IS beta_g_region AND NOT EXISTS ( QUERY segment.DNA_fragment.mutation ) AND OVERLAPS (segment.DNA_fragment.region, ['humhbb_orig', 951, 1234]) ) )
This one retrieves all DNA transfer experiments whose construct contained an analog of the entire human HS2 core, but used a non-globin reporter gene.
Ex. 7 QUERY DNA_transfer_experiment WHERE EXISTS ( QUERY construct WHERE EXISTS ( QUERY construct_segment WHERE segment.DNA_fragment IS beta_g_region AND COVERS (segment.DNA_fragment.region, ['humhbb_orig', 8486, 8860]) ) AND EXISTS ( QUERY construct_segment WHERE segment.DNA_fragment IS NOT beta_g_region AND EXISTS ( QUERY segment.feature WHERE feature_element IS reporter ) ) )
The following query retrieves all DNA transfer experiments whose construct contained a segment with a beta-globin region that overlapped the AP1 region (corresponding to 8658-8677 in the human sequence) and was completely contained in the region corresponding to 7750-9230 in the human sequence, but no other segments overlapped the LCR (corresponding to anything < 15,000 in the human sequence). In addition, one of the segments in this construct must have featured the epsilon-globin gene as a promoter.
Ex. 8 QUERY DNA_transfer_experiment WHERE EXISTS ( QUERY construct WHERE EXISTS ( QUERY construct_segment WHERE segment.DNA_fragment IS beta_g_region AND OVERLAPS (segment.DNA_fragment.region, ['humhbb_orig', 8658, 8677]) AND COVERS (['humhbb_orig', 7750, 9230], segment.DNA_fragment.region) ) AND COUNT ( QUERY construct_segment WHERE segment.DNA_fragment IS beta_g_region AND OVERLAPS (segment.DNA_fragment.region, ['humhbb_orig', -INFINITY, 15000]) ) < 2 AND EXISTS ( QUERY construct_segment WHERE EXISTS ( QUERY segment.feature WHERE feature_element IS promoter AND feature_element.gene = 'epsilon_globin' ) ) )
Here is one that retrieves all DNA transfer experiments using a wild-type construct (i.e., no mutations, and all segments were from the beta-like globin cluster except possibly reporters), with some segment overlapping the HUMHBB region from -2686 to 10410, where experiments were performed measuring expression levels in transgenic mice.
Ex. 9 QUERY DNA_transfer_experiment WHERE EXISTS ( QUERY construct WHERE NOT EXISTS ( #no beta_g_regions with mutations QUERY construct_segment WHERE segment.DNA_fragment IS beta_g_region AND EXISTS (segment.DNA_fragment.mutation) ) AND NOT EXISTS ( #no non_beta_g_regions or oligos, except reporters QUERY construct_segment WHERE segment.DNA_fragment IS NOT beta_g_region AND NOT EXISTS ( QUERY segment.feature WHERE feature_element IS reporter ) ) AND EXISTS ( #some segment overlaps the given region QUERY construct_segment WHERE segment.DNA_fragment IS beta_g_region AND OVERLAPS (segment.DNA_fragment.region, ['humhbb_orig', -2686, 10410]) ) ) AND EXISTS ( QUERY experiment WHERE type_of_assay IS transgenic_mouse AND EXISTS ( QUERY type_of_assay.result WHERE measurement IS expr_level ) )
Your query output will come in several sections. The first line will report how many rows were selected by your query; this does not count the nested subtables, just the top level rows. E.g., how many DNA transfer experiments or binding assay experiments matched your criteria?
The next part will display the selected rows using the same language that Genie uses for data input. This language uses statements of the form
fieldname = valuefor simple data types, and nested groups of the form
BEGIN fieldname | | | END fieldnamefor Records and for each row of a Set or List. Variants also include the tag to identify which "flavor" was used:
BEGIN fieldname AS tag | | | END fieldname
The data language is relatively simple to understand compared to the schema or query languages, but it can be very tedious to read, especially if there are a lot of records. To make it easier to see what you've got, the Globin Gene Server provides a facility for drawing your output in a graphical format. Currently it can handle either DNA transfer experiments or binding assays .
Finally, the last part is a schema for the selected records, which (in the current version of Genie) is just a restatement of the table you were selecting from. This can be used by other programs that read and further process your output, like the drawing programs for example.
We hope that you will find this Database of Experimental Results to be a useful and powerful tool. Of course, it is only as good as the data that is put into it, and so we encourage you to share your own experimental results by sending them to us via electronic mail for inclusion in the database.
Questions, suggestions, and comments are welcome; please contact Cathy Riemer.