Note: This paper was prepared in November 1992 for presenation at the ISMM conference on knowledge and information management (CIKM) in Baltimore, MD. It references a the July 1992 SQL3 base document. It has not been updated since that time. Object SQL: Language Extensions for Object Data Management Leonard J. Gallagher Information Systems Engineering Division National Institute of Standards and Technology Gaithersburg, MD 20899 Abstract Database Language SQL is enjoying success as an effective International Standard for the definition and management of relationally structured data. The first SQL standard, in 1986, provided basic language constructs for defining and manipulating tables of data; a revision in 1989 added language extensions for referential integrity and generalized integrity constraints; and the most recent revision in 1992 provides new facilities for schema manipulation and data administration, as well as substantial enhancements for data definition and data manipulation. A companion standard for Remote Database Access (RDA), also completed in 1992, provides the basic services and protocols for SQL interoperability in a client/server environment. National and international SQL standardization committees are now focusing on development of future extensions for meeting the stated requirements of managing complex objects in engineering and multimedia environments. These extensions include object identifiers, abstract data types, inheritance hierarchies, and all of the other features normally associated with object data management. This paper presents the object management features contained in the draft specification for the next SQL standard, discusses problems and alternatives, and addresses planned future standardization activities. 1 Introduction Data management requirements for many scientific and engineering applications often exceed the capabilities of existing database management systems. Such applications often require a logically integrated database of diverse data (e.g. documents, graphics, alphanumeric records, complex objects, images, voice, video) stored in geographically separated data banks under the management and control of heterogeneous data management systems. An over-riding requirement is that these various data managers be able to communicate with each other and provide shared access to data and data operations and methods under appropriate security, integrity, and access control mechanisms. Database Language SQL [8] and its distributed processing counterpart, Remote Database Access (RDA) [10], are important International Standards that are able to address a significant portion of the above data management requirements. SQL is particularly appropriate for the definition and management of data that is structured into repeated occurrences having common data structure definitions. SQL provides a high-level query and update language for set-at-a-time retrieval and update operations, as well as required database management functions for schema and view definition, integrity constraints, schema manipulation, and access control. SQL provides a data manipulation language that is mathematically sound and based on a first-order predicate calculus. SQL is self-describing in the sense that all schema information is queryable through a set of catalog tables. Features of the most recent SQL standard are discussed in [4], [5], and [12]. Early in 1991, technical committees for SQL standardization, operating under the procedures of the American National Standards Institute (ANSI) and the International Organization for Standardization (ISO), committed to enhancing SQL into a computationally complete language for the definition and management of persistent, complex objects. This includes the specification of abstract data types, object identifiers, methods, inheritance, polymorphism, encapsulation, and all of the other facilities normally associated with object data management. Preliminary specifications for these facilities are contained in the most recent SQL3 Working Draft [9]. This article focuses on the new object-oriented facilities proposed for inclusion in SQL3. In particular, Section 2 describes the preliminary status of abstract data type (ADT) facilities and object identifiers. Section 3 discusses newly added features for compound SQL statements, exception handling, and flow of control. Section 4 addresses parameterized data types, including generator types such as LIST, SET, and ARRAY. Section 5 discusses stored procedures, Section 6 describes other supporting features, and Section 7 draws some conclusions. 2 Object Definition Earlier versions of the SQL3 Working Draft specified object-oriented facilities such as User-Defined data types, Assertions, and Triggers to support the object notions of abstraction and encapsulation. They also included the generalization and specialization of tables in a table hierarchy that supported multiple inheritance. In each case, these object concepts were only partially supported without fully enforcing them as a discipline. There was no notion of object identifiers and no notion of class hierarchies for user-defined data types. The newest SQL3 Working Draft [9] now addresses the requirement for true "objects" and "object identifiers" in SQL and also specifies supporting features such as encapsulation, subtypes, inheritance, and polymorphism. The following subsections briefly describe the current status of these features. Of course, as with any draft, the specifications are subject to revision and improvement as they evolve over the next two or three years before final standardization. 2.1 Abstract data type (ADT) The abstract data type facility provides the capability to define and manage persistent data type definitions, including structures and operations on those structures. A new ADT can be "constructed" from any existing data type, including previously defined abstract data types, known to the current SQL environment. An abbreviated version of the current syntax is as follows: CREATE TYPE [ ] [ ] [ ] The are discussed in Section 2.2 and the is discussed in Section 2.5 below. The specifies the attributes, operations, and other methods applicable to the ADT definition. Attributes contain an ; otherwise, an attribute definition has the same syntax as a regular relational column definition. An is specified as either PUBLIC, PRIVATE, or PROTECTED. Public components form the interface of the ADT and are visible to all authorized users of the ADT. Private components are totally encapsulated, and are visible only within the definition of the ADT that contains them. Protected components are partially encapsulated, being visible both within their own ADT and within the definitions of all subtypes of that ADT. Operations on an ADT may include EQUALS and LESS THAN. The EQUALS clause identifies a function that specifies conditions under which two instances of the defined data type are considered to be equal. The LESS THAN clause identifies a function that specifies a comparison ordering over data type instances. The definitions of equality and ordering, taken together, specify the semantics to be used in the SQL comparison predicate when applied to ADTs. If the LESS THAN definition is not specified, then the ordering tests of the comparison predicate must return an unknown value. Other methods associated with ADTs include FUNCTION definitions. FUNCTIONs operate on one or more ADT instances and return either Boolean, if the result is to be used as a truth value in a Boolean predicate, or a single value of a defined data type, if the result is to be used as a value specification. Functions may be SQL functions, completely defined in an SQL schema definition, or they may be external function calls to functions defined in standard programming languages (see 2.4). Special "constructor" and "destructor" functions are defined to make or remove instances of an abstract data type. At the present time constructor and destructor functions are invoked implicitly through Insert or Delete operations on a table (see 2.3). Special methods, identified as CAST functions, specify how an ADT may be mapped to other existing data types. For example, an IMAGE ADT may be mapped to a BIT STRING representation. With the ability to include CAST specifications in any ADT definition, a data type definer can define mappings to specific external representations. In this way the internal representation may be kept PRIVATE and not directly accessible, thereby allowing efficient implementation. Abstract data type definitions and SQL or external function declarations are treated just like any other SQL objects as far as access control and other usage privileges are concerned. Access control is independent of encapsulation since encapsulation defines the structure of what is possible to see and access control determines who can see that structure. All names are qualified by the schema name of the containing schema and by the catalog name of the containing catalog. Each such object is "owned" by the authorization identifier of the schema in which it is defined and all privileges to use the object, to "see" its representation, or to modify its definition must be explicitly granted by the object owner. Privileges on existing ADTs may be GRANTed and REVOKEd and new ADTs or function declarations may be ADDed or DROPed from the schema as part of the SQL schema manipulation language. 2.2 Object identifiers Object identity is that aspect of an object that never changes and that distinguishes the object from all other objects. Ideally, an object's identity is independent of its name, structure, or location. Object identity is therefore a unique identification of an object that is independent of the state of that object, and which persists over time. The identity of an object persists even after the object no longer exists (e.g. like a timestamp), so that it may never be confused with the identity of any other object. Other objects can use the identity of an object as a unique way of referencing it. The clause in an ADT definition allows several alternatives for object identifier (OID) specification: WITH OID VISIBLE, WITH OID NOT VISIBLE, or WITHOUT OID If WITH OID is specified, then an OID value is generated when the object is created to give that object an immutable identity. The OID can be referenced by an "object reference" in constraints, queries, and other ADT definitions. If NOT VISIBLE is specified, then the OID value may not be passed as a parameter to functions or stored in a host language variable. If WITHOUT OID is specified, then the ADT does not have an object identifier; instead, each instance represents itself just like values of primitive data types do. There is a continuing debate in the SQL standardization committees as to whether SQL should support all three of the above options, or if every new ADT definition should be assumed to carry a unique object identifier. The outcome of this debate will not affect the functionality of the new language, but it may influence its appearance and style. PUBLIC components of an ADT are accessible to authorized users through a special "attribute reference" operator (i.e. .). If the ADT has WITH OID specified, then the will be to an object identifier of a specific ADT; if the ADT has WITHOUT OID specified, then the will be to the ADT value itself. The attribute reference identifies a specific component of the ADT instance and permits the user to read or modify its value. 2.3 Object management Object ADTs are subject to special "constructor" and "destructor" functions that either create a new instance of the ADT and make it part of the database or remove an instance of an ADT from the database. Since SQL is a "table-based" language, SQL designers have to address issues concerning whether or not SQL object instances may exist outside of table occurrences. If SQL objects are allowed to exist outside of tables, then new syntax to manipulate them and new structures to hold collections of them must become part of the language. Although these issues are still subject to debate and modification, the current status is to require that all object manipulation be achieved through table operations. SQL3 currently allows specification of a "tabular" shell over an ADT class. In this way, constructor and destructor functions are automatically invoked when rows are inserted or deleted from the table, and the table itself is the collection of all object occurrences. SQL query and update statements may then be applied to the table without the need for any special language enhancements. If a requirement surfaces later to allow objects to exist and be managed independently of tables, then it can be handled as an upward compatible language enhancement. In order to accommodate this implicit invocation of constructor and destructor functions, minor enhancements are needed to the syntax and semantics of the CREATE TABLE and INSERT statements (see [3]). The statement "CREATE TABLE T OF " creates a tabular envelope around the abstract data type specified by and all attributes of the abstract data type become columns of the new table T. Object instances may then be managed by the usual SQL Select, Update, and Delete statements. The statement "INSERT INTO T ALIAS ", using the new ALIAS option with an Insert statement, returns the object identifier of any new object created by the Insert statement to the variable. Syntax rules prohibit specification of an alias when the underlying ADT is specified WITHOUT OID. 2.4 Methods and functions An abstract data type includes not only a collection of attributes but also the methods that define the behavior of the ADT. Some of the methods associated with an ADT might be realized by means of data that is stored in the database, while others might be realized as executable code (functions). An implementation of an ADT is the stored data together with the data structures and code that implement the behavior of the ADT. As seen in Section 2.1, methods may be encapsulated within the ADT definition. Specific methods for determining equality, and ordering when appropriate, are then usable in regular SQL comparison predicates. As we have seen, other methods can be defined as special operators on ADTs or as predicates that return truth values. A single value returned from a function call can be used any place in the SQL language that a single value is allowed. A truth value returned from a function call can be used as one of the terms in a boolean predicate. Functions may be either SQL functions or external functions. SQL functions are defined completely in SQL, whereas external functions have only their interface definition specified in SQL, with the content of the function written in some programming language (e.g. Ada, C, or eventually C++). An SQL function may be "defined", or an external function may be "declared", as an independent schema element, as part of an ADT definition, or as part of a module definition. The syntax of an SQL function is: [CONSTRUCTOR | ACTOR | DESTRUCTOR] FUNCTION RETURNS ; END FUNCTION Only constructor or destructor functions may create or destroy new ADT instances; they have already been discussed above. An actor function is any other function that reads or updates components of an ADT instance or accesses any other parameter declared in the . A parameter in the parameter list consists of a parameter name and an SQL data type. The RETURNS clause specifies the SQL data type of the result returned. Since all data types in an SQL function are SQL types that must accommodate Null values, it is not necessary to worry about "indicator" parameters to convey Nulls. The may be any SQL statement, including compound statements and control statements (see Section 3). To accommodate SQL functions, SQL3 specifies the following new SQL statements: - A NEW statement that allows creation of a new OBJECT ADT instance; it is only allowed in a CONSTRUCTOR function. - A DESTROY statement that destroys the existence of an OBJECT ADT instance; it is only allowed in a DESTRUCTOR function. - An ASSIGNMENT statement that allows the result of an SQL value expression to be assigned to a free standing local variable, a column, or an attribute of an ADT. - A CALL statement that allows invocation of an SQL procedure. - A RETURN statement that allows the result of an SQL value expression to be returned as the RETURNS value of the SQL function. The syntax of an external function declaration is: DECLARE EXTERNAL RETURNS [ CAST AS ] LANGUAGE The is a list of SQL data types. If a data type in the parameter list is supported in the programming language identified by the LANGUAGE clause, then the corresponding programming language routine has two parameters for that data type; the second parameter is the "indicator" parameter to convey Null values. If a data type in the parameter list is an ADT not supported in the programming language identified by the LANGUAGE clause, then the corresponding programming language routine has two parameters for the base type of each attribute in the ADT definition, recursively. Again, the second parameter in each case is an "indicator" parameter. The actual mapping from the in the external function declaration to the parameter list of the programming language routine can become quite complex, so techniques for better management of cross-language calls are under active consideration. The CAST AS clause is a convenience to allow "encapsulated" casting from a programming language data type to an SQL data type. For example, an SQL DATETIME data type that appears in the formal parameter list is automatically cast to its character string literal representation before it is passed to a programming language routine, e.g. FORTRAN. The CAST AS clause could also automatically cause the character string RESULT to be re-cast into an SQL DATETIME value. Since every SQL data type has a defined CAST operation to and from character string representations, it is possible to pass any SQL data type to any programming language that supports character strings. More sophisticated data type correspondence, especially to object programming languages such as C++, will likely be the subject of near-term considerations. 2.5 Subtypes and inheritance Specification of "UNDER " in the of an ADT definition (see 2.1) permits a new ADT to be defined as a subtype of an existing ADT. A type can have more than one subtype and more than one supertype. Thus a subtype is a specialized type of one or more supertypes and a supertype is a generalized type of one or more subtypes. A supertype shall not have itself as a proper subtype and a subtype family shall have exactly one maximal supertype. Inheritance is an abstraction mechanism that adds to the power of data abstraction by allowing classes of objects to be related hierarchically. Inheritance allows classes to share definitions with other classes, thereby supporting newer, more specialized, data definitions without losing the existing properties and operations of the superclass. Through inheritance, new types can be built under older, less specialized types rather than having to rewrite properties from scratch. Inheritance makes it possible to build a hierarchy of related ADTs, i.e. a "type hierarchy", that share the same interface, and possibly the same representation and implementations. As we move up in the inheritance hierarchy, types become more generalized; as we move down types become more specialized. These generalization/specialization capabilities allow more accurate and succinct modeling of applications. The SQL implementation of a type hierarchy (see [3] and [11]) requires that an instance of a subtype is also an instance of all of its supertypes. Every instance is associated with a "most specific type" that corresponds to the lowest subtype assigned to the instance. At any given time, an instance must have exactly one most specific type. Note that the most specific type of an instance need not correspond to a leaf type in the type hierarchy. For example, a type hierarchy might consist of a maximal supertype PERSON that has STUDENT and EMPLOYEE as two subtypes, and STUDENT may have two subtypes GRAD and UNDERGRAD. An instance in this hierarchy might be created with a most specific type of STUDENT, e.g. a special, non-degree student, even though STUDENT is not a leaf in the type hierarchy. As we saw above, every attribute definition in an ADT has an encapsulation level specified as either PUBLIC, PRIVATE, or PROTECTED. Public and protected components are visible to the definitions of all subtypes of that ADT, but private components are not. A subtype can define constructor, actor, and destructor functions just like any other ADT. All operations of the supertype are invocable from the subtype, so there is a high potential for name conflicts when the subtype defines more specialized operations. Name resolution rules, described in the following section, ameliorate this problem. Since an instance must have at most one "most specific type" associated with it, a given instance cannot have two sibling types simultaneously as its most specific type. For example, in the above student type hierarchy, an instance of PERSON may not be an instance of both STUDENT and EMPLOYEE simultaneously. However, real world examples require that we have some method for modeling a person as both STUDENT and EMPLOYEE, for this can certainly be the case. To handle these situations, SQL provides "multiple inheritance", i.e. a subtype can have more than one direct supertype. With multiple inheritance, we can define a new type STUDENT-EMP which is a subtype of both STUDENT and EMPLOYEE. A person who is both a student and an employee can be modeled as an instance of the STUDENT-EMP type. In this way an instance will satisfy the requirement to always have a "most specific type". Multiple inheritance could lead to ambiguous inheritance of components from its supertypes, so SQL provides some disambiguity rules as follows: - If an attribute in more than one supertype is inherited from a common supertype higher in the hierarchy, then only the one from the common supertype, is inherited. - If an attribute with the same name in each of the supertypes are not inherited from a common supertype higher in the hierarchy, then the type definition is invalid unless the type definer renames the inherited components to remove the name clash. These rules, and other related issues, are subject to improvement and evolution as the SQL ADT facility stabilizes over the next two or three years. 2.6 Polymorphic functions Polymorphism is the ability to invoke an operation on any of several different objects and have that object determine what to do at run- time. A polymorphic function is one that can be applied in the same way to a variety of data objects. Support for polymorphism involves technical decisions concerning early or late binding among objects and the procedures that invoke their methods. To help address some of these technical decisions, a number of techniques have evolved (see [1]), such as: Overloading. The ability to assign the same name to more than one function or procedure -- name resolution is then determined by a set of rules, thereby allowing a processor to distinguish among different functions of the same name by examining the "type" of the input data. Coercion. The ability to omit semantically needed type conversions -- we have already seen that SQL uses this technique in some of its parameter passing to external functions. Inclusion. The ability to manipulate objects of a subtype "as if" they were objects of a supertype -- possibly with a function of the same name that calls different routines. Generalizing. The ability to specify that a parameter should "take on" the type of some supertype during processing of a specific function call. The resolution rules to support polymorphism are derived from one basic concept, i.e. that for any particular function invocation, there must exist a single "best match" from the candidate functions that are "in scope". When a function call is executed, the unique "most specific type" of the various input parameters is used to help define the "best match" according to the following rules, many of which are derived from those used by C++: - Begin with the set of all functions that are "in scope" for a particular function call, i.e. those that are defined or declared with the calling function name in the statement, procedure, module, or schema associated with the function call. - For each argument, determine the set of functions that is a "best match" for that argument, then take the intersection of these sets. Unless this intersection has exactly one member, the call is illegal. That is, the function selected must be a "strictly better match" for at least one argument than every other possible function, but not necessarily the same argument for each function. - To decide which functions are the best match for each argument, agree: an "exact match" is better than one based on type coercion (i.e. CASTing), an implicit conversion to the "closest" supertype is better than SQL or user-defined type coercion, and an implicit SQL- defined CAST is better than an implicit user-defined CAST. 3 Control Structures Support for SQL functions is discussed in Section 2.4 above. We saw there that it was necessary to introduce several "control" statements into the SQL language, e.g. ASSIGNMENT, CALL, and RETURN. The obvious next step was to consider if more control statements and other "programming language" facilities should be added to SQL. In particular, SQL standardization committees decided that it was appropriate to specify SQL syntax and semantics for the following facilities (see [6]): - sequences of SQL statements in a procedure instead of the single SQL statement allowed in the current SQL standard, - flow of control statements, such as looping, branching, etc., - exception handling, so that when an exception is raised, the SQL function or procedure can resolve the issue internally, or propagate the exception to the next outermost exception handler, rather than always returning control to the main calling routine. These 3GL programming language facilities are valuable because they allow procedural encapsulation and they allow complete behavior to be specified within an ADT definition without the need to escape to a procedure written in some other language. Complex behavior can be made available to the host application program via a single call. This offers benefits in both cost and control. In SQL'92, all procedures are single SQL statements so multiple calls must be made to address complex problems. All temporary state and flow of control belong to the host language application, thereby adding complexity to the application program that could be encapsulated in the SQL procedure. The facilities in the following paragraphs are included in the current SQL3 Working Draft [9]. 3.1 Compound statement A compound statement is a statement that allows a collection of SQL statements to be grouped together into a "block". A compound statement may declare its own local variables and specify exception handling for an exception that occurs during execution of any statement in the group. Its syntax is as follows: [ : ] [ ] BEGIN [ ] [ ] END [ ] ::= EXCEPTION [ {WHEN THEN }... ] ::= | OTHER An is unique within a and may be declared with an exception declaration (see 3.2). 3.2 Exception handling An exception declaration establishes a one-to-one correspondence between an SQLSTATE error condition and a user-defined exception name. It's syntax is as follows: DECLARE EXCEPTION FOR SQLSTATE The exception handling mechanism under consideration for SQL3 is based very strongly on the mechanism defined in Ada. Each compound statement is assumed to have an exception handler; if one is not explicitly defined, then a default handler is provided by the system. When the execution of a statement results in an active exception condition, then the containing exception handler is immediately given control. Ultimately, the exception handler terminates with one of the following behaviors: - The compound statement terminates with the active exception condition still active, or - The compound statement terminates with a new active exception condition, or - The compound statement terminates successfully, as though no exception occurred, and there is no outstanding active exception condition. The exception handler may execute SIGNAL or RESIGNAL statements to identify a new exception name or to pass on the existing exception name. If an exception condition occurs in the exception handler itself, then the compound statement is terminated and that exception condition becomes the "active" exception condition. 3.3 Flow of control The following program flow of control statements are currently specified in the draft SQL3 document: - A CASE statement to allow selection of an execution path based on alternative choices. A is executed and, depending on the result, control is transferred to the appropriate block of statements. - An IF statement with THEN, ELSE, and ELSEIF alternatives to allow selection of an execution path based on the truth value of one or more conditions. - A LOOP statement, with a WHILE clause, to allow repeated execution of a block of SQL statements based on the continued true result of the in the WHILE clause. A LOOP statement is also allowed to have a statement label. - A LEAVE statement to provide a graceful exit from a block or loop statement. 4 Parameterized Types The ability to define abstract data types does not by itself provide the capability to define "parameterized" types. A parameterized type is really a "type family" with a new data type generated for each value of an input parameter. For example, an ADT definition for VECTOR(N) can be thought of as a family of data types, one for each positive integer value of N. We may think of a "parameter" as any value of a data type known to the SQL environment, e.g. an integer value in the example above. We may also think of a "parameter" as a reference to an existing data type, rather than as a value of that type. For example, we may wish to specify that VECTOR(N) is really a vector of integers, or reals, or decimals with fixed precision and scale. This can be achieved by passing a data type name to the ADT definition instead of just a data type value. The syntax for specifying a parameterized type in SQL3 is very similar to that for specifying a regular ADT, namely: CREATE TYPE TEMPLATE