[<< | Prev | Index | Next | >>]

Thursday, October 21, 2004

Programming in Syn



Executive summary: What if the syntax and semantics of a programming language were specified in a library, rather than built into the language, and thus could be swapped or extended at will just like any other library?


October 13, 2004

What is Syn?

The short but incomplete answer is that Syn is a simple functional programming language who's only datatype is parse trees.(1)

But because Syn passes through un-recognized nodes as-is, Syn can also be viewed as a macro pre-processor.

Probably the best way to understand Syn is this: Syn is a macro pre-processor so powerful that you could write "macros" to process a high-level source language like C++ all the way down to a low-level target like assembly language if you really wanted to. (In fact, a language like C++ is a modest goal--Syn can do much much more than that. Read on...)

But this is not the same as saying Syn is just another language you can write compilers in--a better way to look at it is that you write libraries in Syn, and then when you write code in your source (higher-level) language, what you are really doing is just writing more Syn code which calls those libraries to generate the compiled version of your code. Follow? The relevant distinction is that when you're writing in your source language, you are still truly writing in Syn, with all of the compile-time power that confers, even if it happens to look completely different (i.e., even if it happens to look like C++).

How Does Syn Work?

Part of the trick to Syn is that Syn itself has no syntax whatsoever. You give Syn a parse tree, it evaluates it, and it gives you a parse tree back. That's it. So parsing is a separate stage entirely, which means the various source files in a Syn project don't all have to be of the same syntax. For instance, some of the files might look like C++, while others might be a seemingly different language entirely that's useful for programming Syn directly. But the parser combines each of these with their associated syntax description, and spits them all out as identically represented parse trees.

To make this more concrete, consider the built-in function syn_import. Syn_import takes a module name--not a file name--as a parameter, and imports that module's parse tree, not it's original source. So that means a file written in one syntax can import a file written in a completely different syntax if it wants to, and the two will play nicely together because at that point they are both parse trees.

In typical usage, this means that you can import a Syn file which completely defines the semantics of your language--which is only possible because the file you're importing, even though also written in Syn, was written to a different set of semantics than the one it is defining. The base level Syn files, in turn, are ultimately written to the semantics defined by Syn's relatively few built-in methods (basically simple arithmetic, conditionals, parse tree manipulation, and macros, and that's about it).

Where it gets interesting is this process can be repeated through multiple levels of abstraction. Synsh and Synge, which I'll describe next, are examples of this.

Synsh - A Concrete Example

Although Syn is a programming language in its own right, it's not really intended to be the target language. Typically, one would choose some efficient, optimizable target language like LLVM, or the kernel language of Oz. But in principle you can target just about anything: C, assembly, Basic, or, in the case of Synsh, shell scripts.

Now, writing a language that compiles down to shell scripts is a bit nutty, to be sure. But it makes for an easy example because the output is easy to read (and if you have a Unix machine, easy to run too), and so is a good proof-of-concept. It is also a proof of the power of the Syn approach--because I wrote Synsh from scratch in a little under three days.

Now before you assume that the resulting source language must be really ugly, here are some actual examples (remember, this language was created from scratch in three days, and these examples run and do what you would expect):

import "tcsh"

define int fact(int i)
{
	if i < 1 then
		return(1)
	else
		return(i*fact(i-1));
};

define void main()
{
	print("Fact(5) = " + fact(5));
};

When you feed this to Syn, it spits out a tcsh shell script which when run in turn outputs "Fact(5) = 120".

To give a more useful example:

import "tcsh"

define int listall(string dir, string indent)
{
	stringList l;
	int i;
	int s;

	l = filesIn(dir);
	s = 0;

	for (i=0; i<size(l); i=i+1) {
		if modifyTime(l[i]) > now() - days(2) then {
			print(indent+fileTail(l[i])+" (changed ");
			print(now() - modifyTime(l[i]));
			println(" ago.)");
		} else {
			println(indent+fileTail(l[i]));
		};
		if isDirectory(l[i]) then {
			s = s + listall(l[i], indent+"....");
		};
		s=s+1;
	};
	return(s);
};

define void main()
{
	println("Total of "+listall(".", ".")+" files.");
};

Here is the compiled script, and the output of that script.

It is worth mentioning at this point that Syn, in total, is composed of about 900 lines of generic C (not counting blank lines or comments) and 100 lines of misc libraries written in Syn. In turn, Synsh is about 250 lines of Syn code, the Tcsh module of Synsh is about 200 lines of Syn and 400 lines of Synsh. So, the above Synsh->Tcsh-script translations, starting with nothing but a C compiler and a parser with associated libraries, required a total of a mere 1,850 lines of code (some of which was in turn semi-automatically generated by the parser, but I digress). Thus showing not so much the power of Syn (which itself is quite minimalistic), but of the overall approach.

Back to how it works:

You'll note I made mention of C code, Syn code, and Synsh code above. Syn is written in C, but since Syn includes a macro primitive, Syn can be easily extended in Syn. So, for instance, where the built-in macro is properly just a syntactic macro, which tells Syn how to expand a given kind of parse node in the parse tree, it is possible to build a parse tree on the fly and then evaluate that too--which means it is possible to write a macro in Syn for defining generic functions with named parameters(2). The defun.syn module, which is written in Syn, provides this. It's a bit convoluted, but it only had to be done once.

To wit, here is a raw Syn program to compute and print the factorial of 5:

// syntax: syn3

macro syn_test
	if $_ < 1 then
		1
	else
		$_ * %test($_-1);

%print(%test(3+2));

The first line is a comment, but also something more: As I said, every file in a Syn project can have its own syntax. How you decide what files belong with what syntax is up to you, but for my purposes I have my make script examine the first line for a directive of the form "syntax: foo", and failing that to use the file extension (e.g., "source.foo"), and failing that to default to "syn". This way I can put the exact syntax revision (e.g., syn3) in the file itself, so if I ever update the syntax in some non-backward-compatible way, I can just rev the number (syn4), and can start using it for new files without breaking any of the old ones.

So, syn3 is the current syntax I use for writing Syn programs (follow the link to see the actual syntax specification)--but again Syn itself doesn't care since it just sees the resulting parse tree.

The second line defines a macro for syn_test parse nodes. If you look near the bottom of the syntax file, syn3, you will see that syn_test represents a construct of the form "%test(foo)" where foo is any valid syn3 expression. So, the macro above is saying "any time we see something in the source that looks like %test(foo), replace it with...". And the parameter, foo, is available (under syn3) in the macro variable "$_" which always represents the contents of the associated parse node.

Finally, the third line invokes syn_print, via the syn3 syntactic representation "%print(...)", which is a Syn built-in method to print to stderr. (The output goes to stderr by default since the normal output of Syn is the modified parse tree, and explicit printing like this is typically just used to report errors.)

Now the same thing again, using the defun.syn module mentioned before:

// syntax: syn3

import "defun";

defun fact(n)
	if $n < 1 then
		1
	else
		$n * fact($n-1);

%print(fact(3+2));

This program is functionally identical to the first, but it has named parameters, and is invoked via a generic function call syntax rather than the single-use %test() syntax. I.e., defun allows us to define new functions in our Syn programs without having to modify the syntax.

The way defun works is by constructing a macro definition for the function in question ("fact", in this case), and then evaluating it. I.e., when you actually call fact(), it is Syn's built-in macro expander that does the dirty work. (It's slightly more convoluted than that, actually, since fact() is syntactically syn_funcall(syn_func=fact), so syn_funcall itself is defined, in defun.syn, to put the function name together with its parameters and to evaluate it as if it were a direct syntactic construct... but I digress.)

I mention this--the way defun works--because it leads to the next leap: If you can write a macro to redefine syn3 functions in terms of built-in syn macros, what about writing a syn3 function to re-define your source-language functions in terms of syn3 functions?

Synsh takes this approach, where the synsh.syn library, written in syn3, implements two new kinds of macros (synsh_macro, and synsh_synmac) supported by the Synsh syntax (synsh.stx).

To wit:

synmac add add(%(left),%(right));

which is written in Synsh syntax (not syn3) says that constructs of the form "a+b" should get re-mapped to "add(a,b)". The "left" and "right" refer to the names of the children of the "add" parse node in the parse tree (see "add" in synsh.stx).

Further:

synmac for
{
	%(init);
	while %(cond) do {
		%(body);
		%(cont);
	};
};

defines C-style for loops of the form "for (a; b; c) d" in terms of while loops. To help tie the relationship between the syntax file and the Synsh file together, here is the definition of the for loop in the syntax file:

for = ( "for" spc* "("
	spc* <synsh_expr:init> spc* ";"
	spc* <synsh_expr:cond> spc* ";"
	spc* <synsh_expr:cont> spc* ")" spc* <synsh_expr:body> )

Note the macro variables, "%(init)" and so on, refer to the names given to the elements in the syntax file.

Now, similarly to the way in syn3 we have macros and functions (via defun), Synsh implements macros with named parameters as well:

macro void print(bool a)
	if %(a) then
		print("TRUE")
	else
		print("FALSE");

But note that these macros are typed (void, bool, etc.). The Synsh library, written in syn3, which implements Synsh macros, uses the types to create a unique name for the function based on the parameter types--once again ultimately submitting the resulting macro via Syn's same ol' built-in macro primitive. Similarly, when you use a function in Synsh, as in "print(1<2)", the types of the parameters are used to select the unique name ("print_bool" in this case), and to invoke the appropriate macro via the same ol' built-in syn macro expansion. All this new semantic behavior is implemented by Synsh in 250 lines of Syn code.

So, to recap: The Synsh syntax file (synsh.stx) plus the Synsh libraries (synsh.syn) defines a new statically typed macro language with operator overloading. But that's not all...

If Synsh is a new macro language, what are its primitives (macros of macros of macros of... what?)? Also tucked away in the Synsh source above is the support for the new primitives src:, dst:, and code:. These allow you specify, in the body of a Synsh macro, what back-end code to generate, and where the resulting value can be accessed. For example, consider these two Synsh macros (from tcsh.synsh, which implements a tcsh shell script back-end driver Synsh):

macro int add(int a, int b)
{
	code:@ %<iadd> = %(a) + %(b);;
	src:${%<iadd>};;
};

macro string add(string a, string b)
{
	code:set %<sadd>=%(a)%(b);;
	src:"$%<sadd>};;
};

These two macros define addition of integers (summation) and strings (concatenation) using tcsh shell script syntax. The "src:" line specifies where the result ends up (the name of the tcsh shell variable). The syntax "%(foo)", as before, refers to the macro parameter foo. The syntax "%<foo>" is new (although it appears in syn3 as "$$foo"): it means create a new temporary name (which will be something like "foo_5b3") and use it consistently throughout this invocation of the macro.

To see how these macros play out in practice, plug the expression "10 + 20 + foo" into Synsh/tcsh and see what comes out:

@ iadd_a0 = 10 + 20
set sadd_a1=${iadd_a0}"foo"

And, the "return value" of this expression, via the src: directive, is "$sadd_a1".

Note that most of the tcsh support is written in Synsh, not Syn, as the whole point of Synsh is to provide a shell-script-macro language which is easy to extend, and "extending" it begins as ground zero and builds up the basics like arithmetic and such. But due to time constraints (Synsh was really just a stunt, after all), Synsh doesn't provide a way to create new types--so I had to drop back to Syn to do that--but the flexibility to do that is all part of the power: tcshTypes.syn declares the common types, including an array type which takes another type as a parameter(!), defines the variable argv to refer to the array of strings passed as an argument to the shell script, specifies how integer and string constants get handled, and finally provides a standard header which is crammed onto all scripts (which is a little complicated at first glance, which I will get to in a moment).

Now, something to note here is that our main source language--the one we actually want to do our work in--is Synsh, and most of our code, even the low-level macros that generate shell-code directly, is written in it. But we always have this option to drop back one or more levels if we want to write something more meta. As my final example of this, tcshDefine.syn is an extension to the tcsh driver which supports proper function definitions rather than just macros. It does this by converting the function definition into some Synsh macros (which Synsh converts to Syn macros...) which do the right things... See for example tcsh-fact above, or dive into the tcshDefine source if you're brave.

So, in sum, to use the factorial function as an example, when you write "fact(n)" into your Synsh/tcsh source file:

Got it?

Now remember, all this happened in three days, and I didn't have to sell my soul to the devil--I just had to program in Syn.

Synge - What Synsh Wishes it Were

By now if you brain is still intact, your stomach might be a little queasy from all this shell script stuff. But remember, Synsh was just a stunt, a proof-of-concept.

Enter Synge.

Apply the same concepts in Synsh, but start one level lower, and take it one level further.

On the back end, rather than using shell scripts as the back end, use some sort of generic imperative language like LLVM. Now instead of spitting out clumsy shell scripts, the very same (Synsh-like) source code could be translated to a highly-optimizable assembly code. (Note that rather than having Syn write text files directly, as Synsh clumsily does, Syn would just output a parse tree representing the LLVM code, and a simple C++ program written with the LLVM libraries and the parse-tree libraries would load the parse tree into LLVM.)

On the front end, why stop at a statically-typed language like C++? The same concept could be applied yet a step further, with a third syntax which is the true final high-level language. For purposes of discussion, let's call this final language Synge, and imagine that the intermediary is still called Synsh (since it would be very similar in nature).

When you call a method in Synge, Syn may, for example, in turn call a function written in Synsh which is a method dispatcher that decides based on the types of the various parameters what actual Synge function body to invoke. But since all of this code, both the Synsh and the Synge code, ultimately maps to LLVM code, LLVM is now in a position to highly optimize things like method dispatches--such that, in principle, Synge may for example represent everything as objects (which is friendly to the programmer), while LLVM may, simply by collapsing the code path which is determinable at compile time, ultimately emit assembly code that, e.g., treats integers and other primitives efficiently (which is not normally the case in language where everything is an object...). Essentially, all run-time support for the language, contrary to the usual situation, is actually written in the language itself as far as the back end is aware (even though in truth it is written in layers with one supporting the next), which means the back end gets to optimize it all together.

Furthermore, and central to the point, the Synge programmer is not limited to the syntax nor dispatching methods given him by some language designer--rather, he can add his own at will. For instance, I once wrote in C a whole user-interface hierarchy, plus applications, based on a particular interface-like paradigm which isn't supported by any languages I'm aware of, and while the paradigm worked wonderfully, it was a royal pain to code in simply because I had to type a lot of redundant code to make up for the missing syntax and integral dispatcher. If I'd had Syn at the time, it would have been... a synsh!

And of course it's always nice to be able to make simpler extensions, too (things that could be written directly in Synge with something equivalent to Synsh's macro and synmac). List-comprehensions as in Python, for instance, which are super handy, would take a few minutes to add to Synge. And then when you use them, they would compile efficiently when possible and less efficiently but more generally when necessary--which is about ideal from a programmer's perspective.

Yeah, So What?

Bottom line is, Syn is a meta-programming language that is expandable both syntactically and semantically, with the two working hand-in-hand to let you express your concepts just how you want to.

I'll give one more example of a Syn application:

Since syntax is itself layerable, it is possible to have multiple syntaxes available even in a single file. Why would you want to do this? Imagine you're building a set of web pages, and you want to intermix a bunch of text with some images and some active forms and some mathematical equations and so on... You could use html, mathml, PHP, and all these other hacks and cobble something together, but if you need to make any systematic changes you're screwed (or at least very busy for a long time). Maybe there's some application out there that can manage all this for you, but chances are it's buggy, expensive, and is missing at least a few crucial features you really wished it had--and if tomorrow you decide you want to generate pdf files instead of html, you may be SOL because the app was html based and even though it has a pdf output option, it really misses the mark.

A better approach would be if you could write in something like latex, which lets you specify what, but not how--and then you can specify how separately. Except instead of being limited to latex's few concepts, you can add object types at will, along with syntax that makes them easy to type (and easy to read). So your math can look like math, and your images could be as simple as \image("foo.eps"), and your text can be free-flowing, and your active components just calls to code \like(this) and so on. And all of these things would be evaluated by Syn (Synge?) as constructors, effectively, building a code-and-data model of what the document is made of. And then at the end, you may ask it: go generate a suite of web pages representing yourself; or go generate a pdf document, or a set of them with an index, or... whatever. And when the postscript-image object is asked to put itself in an html document, it knows to first convert itself to a png; and the math may convert to pngs, or maybe mathml, or maybe both with a javascript switch according to browser support. Or maybe you compile the "document" via LLVM into an executable that generates itself on the fly, sending mathml or pngs at the time. And if you don't like the way something's formatted, you change a few lines of Synge code (which may look like html in that particular place, or maybe not), and presto, it's fixed everywhere -- the promise CSS never really delivered.

I haven't actually tried this with Syn yet, but I did it a while back with Python (I called it Puml--Python Used as a Markup Language), and it worked great. The only problem was the syntax was kind of ugly for this (I had to type ...""", image("foo.ps"), """... every time I wanted to drop out of text-entry mode and insert an image, for instance). And if it hadn't have been for Python's triple-quote feature (which let's you spew on huge blocks of text relatively unhindered) it would have been completely impractical. But it proved the concept, and it worked great, and if the syntax had been prettier, I'd be writing this document in Puml right now...

But I digress. This has all been a tangent from my main line of work which is mostly unrelated, so I'm going to get back to that now.

-Simon


(1) In that respect Syn is very similar to Lisp, except instead of Lists it has Trees--where a Lisp node has a first (car) and rest (cdr), a Syn node has zero or more named children. This is not a hugely relevant difference, since one can be easily mapped to the other, but it's just more convenient for Syn's intended purposes.

(2) Again, Lisp employs the same trick!



[<< | Prev | Index | Next | >>]


Simon Funk / simonfunk@gmail.com