Example-Driven Research

These days I am writing a book on ‘domain-specific language engineering’ for use in a master’s course at Delft University. The book is about the design and implementation of domain-specific languages, i.e. the definition of their syntax, static semantics, and code generators. But it also contains a dose of linguistic reflection, by studying the phenomenon of defining languages, and of defining languages for defining languages (which is called meta-modelling these days).

Writing chapters on syntax definition and modeling of languages, takes me back to my days as a PhD student at the University of Amsterdam. Our quarters were in the university building at theWatergraafsmeer, which was connected to the CWI building via a bridge. Since the ASF+SDF group of Paul Klint was divided over the two locations, meetings required a walk to the other end of the building. So, I would regularly wander to the CWI part of the building to chat.

[While the third application we learned to use in our Unix course in 1989 was talk, with which one could synchronously talk with someone else on the internet (the first application was probably csh and the second email), face-to-face meetings were still the primary mode of communication; as opposed to the use of IRC to talk to one’s officemate.]

Often I would look into Jan Heering’s office to say hi, and more often than not would end up spending the rest of the afternoon discussing research and meta-research.

One of the recurring topics in these conversations was the importance of examples. Jan was fascinated by the notion of ‘programming by example’, i.e. deriving a program from a bunch of examples of its expected behaviour, instead of a rigorous and complete definition for all cases. But the other use of examples was for validation, a word I didn’t learn until long after writing my thesis.

The culture of the day (and probably location?) was heavily influenced by mathematics and theoretical computer science. The game was the definition, preferably algebraically, of the artifacts of interest, and then, possibly proving interesting properties. The application minded would actually implement stuff. As a language engineer I was mostly interested in making languages with cool features. The motivation for these features was often highly abstract. The main test example driving much of the work on the ASF+SDF MetaEnvironment was creating an interactive environment for the Pico language (While with variable declarations). The idea being that once an environment for Pico was realized, creating one for a more realistic language would be a matter scaling up the Pico definition (mere engineering). Actually making a language (implementation) and using that to write programs would be a real test. To be fair, there were specifications of larger languages undertaken, such as ones of (mini-) ML [8] and Pascal [6]. As a student I had developed a specification of the syntax and static semantics of the object-oriented programming language Eiffel [16], but that was so big it was not usable at the Sun workstations we had at that time.

Time and again, Jan Heering would stress the importance of real examples to show the relevance of a technique and/or to discover the requirements for a design. While I thought it was a cool idea, I didn’t have examples. At least not to sell the design and implementation of SDF2, the syntax definition formalism that turned out to be the main contribution of my PhD thesis [20].

Design of SDF2

SDF2 is the successor to the syntax definition formalism SDF, originally designed by Heering, Hendriks, Klint, and Rekers [7], which was the basis for syntax definition in the MetaEnvironment [10]. SDF was originally designed to be declarative, high-level, and able to support the definition of real languages. To that end, its design was actually motivated by a careful investigation of the requirements of the syntax of programming languages.

The redesign and reimplementation of SDF fitted in the larger scheme of bootstrapping ASF+SDF, i.e. the definition and implementation of ASF+SDF in itself. The motivation for that project was in the first place driven by need; the LeLisp language and Centaur platform in which the first version of the ASF+SDF MetaEnvironment was implemented was on the verge of becoming extinct. But as a second system effort, it was also a chance to reconsider the design of the language and the implementation of ASF+SDF, taking the experience of the first version into account. Furthermore, bootstrapping is of course the ultimate test for a language for defining languages.

As a result the redesign of SDF that I undertook was driven mainly by internal motivations, trying to achieve elegance, efficiency, orthogonality, and simplicity, while trying to be largely backwards compatible with the existing SDF language. Thus the syntax of SDF2 was mostly the same as that of SDF, but the underlying abstract syntax was different; with three main differences.

First, the syntax of the language was organized into a kernel language that provided the essential expressive power of the language to simplify implementation and language extensions providing syntactic abstractions (sugar) to make reading and writing syntax definitions tractable. The normalization (desugaring) of full SDF2 to Kernel-SDF [17] was using rewrite rules in ASF+SDF. (This design pattern was inspired by the design of Haskell [12, 9] as a core and extend language, and since applied in Stratego [3] and WebDSL [22].)

Second, heuristics for disambiguation were replaced with a systematic approach to disambiguation using disambiguation filters [11]. While this part was supposed to be the core of my research, it didn’t get very far. However, I developed a scheme for applying disambiguation filters for associativity and priority during parser generation [18].

Third, triggered by the work of Salomon and Cormack [14], I redesigned SDF to fully integrate the definition of lexical syntax and concrete syntax. Instead of a separate lexical analysis stage and corresponding regular grammar specification of tokens, context-free grammar productions are used for the definition of lexical syntax (tokens) as well as the context-free syntax (sentence structure). What motivated this last change? Wisdom in parsing has it that a separate lexical analysis phase is indispensable for performance. However, the implementation of the first version of SDF used GLR parsing to support arbitrary context-free grammars. To make this useful, (local) ambiguity at the lexical level should be supported as well. As a result the architecture was quite complex, with a scanner producing multiple tokens, with potentially unsynchronized token boundaries. In this context, scannerless parsing produced a considerable simplification of the implementation. There was no need for a separate language of regular grammars, no generation of finite automata for lexical analysis, and no difficult interface between scanner and parser to support lexical ambiguities. Furthermore, lexical disambiguation heuristics such as longest match and prefer keywords were mostly right, but could also produce the wrong token list. Scannerless parsing forced a declarative definition of lexical disambiguation.

SGLR

Thus, scannerless parsing was motivated mostly by arguments pertaining to the implementation rather than the application of SDF. It turned out that the use of the Generalized-LR algorithm [15, 13] solves many of the issues of lexical disambiguation. Since the parsing context dictates the possible ‘tokens’, many ambiguities at the lexical level are automatically solved.

However, the approach does not solve all lexical ambiguities. Solving these ambiguities with post-parse filters is not feasible, since the parse forest explodes with too many ambiguities. Basic lexical ambiguities need to be solved during parsing. Salomon and Cormack [14] had introduced the notions of follow restrictions and reject rules to formulate longest match and prefer keywords disambiguation rules. Their approach to implementing these was to adapt the parser generator to include these concepts into the parse table. However, the use of arbitrary grammar symbols for follow restrictions and reject rules was not succesful.

In the SGLR algorithm follow restrictions do work out through a simplification of the grammar formalism. Instead of considering arbitrary non-terminals to exclude from following a (lexical) non-terminal, only a character class (or sequence of character classes) is used as follow restriction. Thus, follow restrictions can be implemented simply by restricting the follow set of a reduction, avoiding ambiguities in the parse table. Reject productions are more problematic. By carefully ordering the order in which reductions are considered, SGLR avoids applying reductions from states that could end up being rejected. The approach works in practice, but expression in the parse table would be more satisfactory. The paper I wrote about SGLR [19] was not accepted for publication. This was partly due to the prototypical nature of the implementation. Reviewers were skeptical of a parsing solution without separate scanner. However, the more problematic issue with the paper was that it did not provide any compelling motivation for the introduction of scannerless parsing. The paper provides technical considerations such as integrated uniform grammar formalism, disambiguation by context, and conservation of lexical structure. The examples used to illustrate these features are rather lame, e.g. subrange types vs floating point numbers in Pascal. Why introduce a new technique with worse performance, if it cannot be used for anything that could not be done with the old techniques?

Language Embedding

It turned out that there was a killer application for scannerless parsing, but I realized this only much later. In the paper meta-programming with concrete object syntax [21], a general architecture for embedding an object language into a metalanguage is described in response to a challenge by Tim Sheard. Of course, the technique was used in ASF+SDF, but somehow it did not seem like an example that provided compelling motivation for scannerless parsing. Application to other combinations of languages made it interesting for a wider audience. The approach relies crucially on the modularity of SDF and SGLR. Since meta-language and object-language typically have a different syntax, in particular a different lexical syntax, language composition is not an option with traditional parsing technogies (LL,LR) since these are not closed under composition. Due to disambiguation by context different lexical syntaxes can co-exist without problems in SDF. Meta-programming with concrete object syntax brought on a host of applications of language embedding, which I explored with Martin Bravenboer [1]. In MetaBorg [5] domain-specific languages abstracting over APIs are embedded in a general-purpose language. In StringBorg [2] so called ‘string languages’ are embedded to ensure that user input does not lead to injection attacks. AspectJ, an extension of Java, is a composition of several sub-languages with different lexical syntax, which we described with a declarative syntax definition in SDF [4] (where existing implementations had to rely on hacks).

Example-Driven Research

The lesson that I learned from the SGLR experience, was to take examples serious. Examples are crucial as witnesses to the anomalies of existing approaches and techniques. Addressing such anomalies provides compelling motivation for a new approach or technique; this part may be more important than a rigorous description of the solution. For example, the design of WebDSL, a domain-specific language for web application, was based on an inductive domain analysis process [22], i.e. starting with a study of existing best practices in web application development. The design of WebDSL solves a number of weaknesses in the state-of-the-art. In advising my PhD students I use the lesson I learned from Jan Heering: provide good examples.

There is another lesson in the SGLR experience, however. Based on the applications we developed at the time there was no clear motivation for scannerless parsing. From that perspective we should have never gone in that direction. SGLR was developed based on internal motivations such as elegance and simplicity. This turned to enable a host of applications unforeseen at the time of development. Thus, there should always be room for research that explores new techniques just because it seems cool, not necessarily useful.

References

[1] M. Bravenboer. Exercises in Free Syntax. Syntax Definition, Parsing, and Assimilation of Language Conglomerates. PhD thesis, Utrecht University, Utrecht, The Netherlands, January 2008.

[2] M. Bravenboer, E. Dolstra, and E. Visser. Preventing injection attacks with syntax embeddings. A host and guest language independent approach. In J. Lawall, editor, Generative Programming and Component Engineering (GPCE 2007), pages 3–12, New York, NY, USA, October 2007. ACM.

[3] M. Bravenboer, K. T. Kalleberg, R. Vermaas, and E. Visser. Stratego/XT 0.17. A language and toolset for program transformation. Science of Computer Programming, 72(1-2):52–70, June 2008. Special issue on experimental software and toolkits.

[4] M. Bravenboer, E. Tanter, and E. Visser. Declarative, formal, and extensible syntax definition for AspectJ. A case for scannerless generalized-lr parsing. InW. R. Cook, editor, Proceedings of the 21th ACM SIGPLAN Conference on Object-Oriented Programing, Systems, Languages, and Applications (OOPSLA 2006), pages 209–228, Portland, Oregon, USA, October 2006. ACM Press.

[5] M. Bravenboer and E. Visser. Concrete syntax for objects. Domain-specific language embedding and assimilation without restrictions. In D. C. Schmidt, editor, Proceedings of the 19th ACM SIGPLAN Conference on Object- Oriented Programing, Systems, Languages, and Applications (OOPSLA 2004), pages 365–383, Vancouver, Canada, October 2004. ACM Press.

[6] A. Van Deursen. The static semantics of pascal. In A. van Deursen, J. Heering, and P. Klint, editors, Language Prototyping. An Algebraic Specification Approach, volume 5 of AMAST Series in Computing, chapter 2, pages 31–52. World Scientific, Singapore, September 1996.

[7] J. Heering, P. R. H. Hendriks, P. Klint, and J. Rekers. The syntax definition formalism SDF – reference manual. SIGPLAN Notices, 24(11):43–75, 1989. [8] P. R. H. Hendriks. Typechecking Mini-ML. In J. Bergstra, J. Heering, and P. Klint, editors, Algebraic Specification, ACM Press Frontier Series, pages 299–337. The ACM Press in co-operation with Addison-Wesley, 1989. Chapter 7.

[9] P. Hudak, S. Peyton Jones, and P.Wadler, editors. Report on the Programming Language Haskell. A Non-strict, Purely Functional Language. (Version 1.2), volume 27. May 1992. ACM SIGPLAN Notices.

[10] P. Klint. A meta-environment for generating programming environments. ACM Transactions on Software Engineering and Methodology, 2(2):176– 201, April 1993.

[11] P. Klint and E. Visser. Using filters for the disambiguation of context-free grammars. In G. Pighizzini and P. San Pietro, editors, Proc. ASMICS Workshop on Parsing Theory, pages 1–20, Milano, Italy, October 1994. Tech. Rep.

126–1994, Dipartimento di Scienze dell’Informazione, Universit`a di Milano. [12] S. L. Peyton Jones. The Implementation of Functional Programming Languages. Prentice-Hall International Series in Computer Science. Prentice- Hall, Englewood Cliffs, New Jersey, 1987.

[13] J. Rekers. Parser Generation for Interactive Environments. PhD thesis, University of Amsterdam, 1992.

[14] D. J. Salomon and G. V. Cormack. Scannerless NSLR(1) parsing of programming languages. SIGPLAN Notices, 24(7):170–178, 1989.

[15] M. Tomita. Efficient Parsing for Natural Languages. A Fast Algorithm for Practical Systems. Kluwer Academic Publishers, 1985.

[16] E. Visser. Syntax and static semantics of Eiffel. A case study in algebraic specification techniques. Unpublished technical report, December 1992.

[17] E. Visser. A family of syntax definition formalisms. In M. G. J. van den Brand et al., editors, ASF+SDF 1995. A Workshop on Generating Tools from Algebraic Specifications, pages 89–126. Technical Report P9504, Programming Research Group, University of Amsterdam, May 1995.

[18] E. Visser. A case study in optimizing parsing schemata by disambiguation filters. In International Workshop on Parsing Technology (IWPT 1997), pages 210–224, Boston, USA, September 1997. Massachusetts Institute of Technology.

[19] E. Visser. Scannerless generalized-LR parsing. Technical Report P9707, Programming Research Group, University of Amsterdam, July 1997.

[20] E. Visser. Syntax Definition for Language Prototyping. PhD thesis, University of Amsterdam, September 1997.

[21] E. Visser. Meta-programming with concrete object syntax. In D. Batory, C. Consel, and W. Taha, editors, Generative Programming and Component Engineering (GPCE 2002), volume 2487 of Lecture Notes in Computer Science, pages 299–315, Pittsburgh, PA, USA, October 2002. Springer-Verlag.

[22] E. Visser. WebDSL: A case study in domain-specific language engineering. In R. L¨ammel, J. Visser, and J. Saraiva, editors, International Summer School on Generative and Transformational Techniques in Software Engineering (GTTSE 2007), volume 5235 of Lecture Notes in Computer Science, pages 291–373, Heidelberg, October 2008. Springer.