May 2011: This page has retired. For Notes on making Higher Subleq Compiler see Low Cost Subleq Supercomputer

Notes on making Higher Subleq Compiler

Last update      Main compiler page

Notes on making Higher Subleq.

Step 1: Simple function calls

After playing with Subleq (one instruction computer language) I decided to write a compiler to Subleq from a higher language. I chose C++ as an implementation language. For simplicity I started with extremely simplified language, later it could be enhanced with more features like process control statements, arrays, pointer arithmetic, auto (stack) variables, function arguments and return values, and classes. I tried it to be as much as possible like C, but on the other hand bound to Subleq internal structure. The initial step was to write the compiler for such simple program as
void f()
	__out 50;

void main()
	__out 49;
	__out 10;
This is not trivial from Subleq point of view since it requires a stack model and Subleq does not have one. An artificial operator __out is added to make the program to produce a visible output. Later it will be removed as the language evolves with proper output functions. So, for now __out is just to manifest the execution. The constant argument is the ASCII code for the character: the program must then print 12 and new line character. The initial grammar for Hsq:
	list of functions, one must be main
	void id() block
	{ list of statements }
	expression ;
	__out const
id and const are as commonly understood
The process flow is the following:
  1. TEXT parsed into TOKENS
  2. TOKENS pre-processed into TOKENS (this can involve step 1 again for #include)
  3. TOKENS parsed into Internal Tree Representation (ITR)
  4. ITR compiled into Subleq code
void main()
  __out 50;
[void] [main] 
[(] [)] [{] 
[__out] [50] 
[;] [}]
Root :1
+- Function [main] :1
   +--Param type list :1
   +--Block :2
      +--Op IO [__out] :3
         +--Const [50] :3


Assemly (relative addressing)
	top:top top main
	c1 (-1)
	Z Z (-1)
. c1:50
. inc:-1 Z:0 dec:1 sp:-sp
Subleq (absolute addressing)
0 0 3 
9 -1 6 
11 11 -1 
-1 0 1 -13 


Text is a sequence of chars and Tokens is a sequence of Token objects. Each Token has its value, and position: file and line number. The parse function has to take into account a table of known long tokens, like += or >>=, because the preference must be given to longer token. We will enhance the table while adding more features to the language. The central point is the ITR. For the example program above it should look like the following:
+--Function [f] (1)
|  +--Block (2)
|     +--Statement (3)
|        +--Expression (3)
|           +--Out [50] (3)
+--Function [main] (6)
   +--Block (7)
      +--Statement (8)
      |  +--Expression (8)
      |     +--Out [49] (8)
      +--Statement (9)
      |  +--Expression (9)
      |     +--FuncCall [f] (9)
      +--Statement (10)
         +--Expression (10)
            +--Out [10] (10)
Here line numbers are printed in parentheses and grammar values in square brackets. This structure is easily traversed with recursive functions. Hence every node will be a class derived from the abstract class with the common functionality. The class might look like the following snippet from the program:
class ItrExpr: public ITR
	ItrExpr(const itrinfo &i): ITR(i) {}
	string typname() const { return "Expression"; }
	void exec();
	string compile();
	static parseinfo parse(kid i);
The element of the tree structure knows everything about how it is created (parse method), how it is executed or interpreted (exec method), and how it is compiled (compile method). The virtual function typname() used to produce the dump of the tree. The interpreter is a simple process starting at the Root, finding the main() function, executing all statements of the function. The compiler is a process of writing the output Subleq code. For the given program Subleq code is the following:
	0 0 sqmain
	c1 (-1)

	# return
	?+8; sp ?+4; ?+7; 0 ?+3; Z Z 0

	c2 (-1)

	# call f
	dec sp; ?+11; sp ?+7; ?+6; sp ?+2; 0
	?+6; sp ?+2; ?+2 0 ?+1; . ?+3; Z Z _f; inc sp

	c3 (-1)

	# return
	?+8; sp ?+4; ?+7; 0 ?+3; Z Z 0

	# call main
	dec sp; ?+11; sp ?+7; ?+6; sp ?+2; 0
	?+6; sp ?+2; ?+2 0 ?+1; . ?+3; Z Z _main; inc sp

	 0 0 (-1)

. c3:10 c2:49 c1:50

. inc:-1 Z:0 dec:1 sp:-sp
Calling a function and returning from the function using the stack seems confusing. But this is the price to pay for the simplicity of the final instruction language.

Step 2: Expressions

Now let us define the proper grammar for the language. The grammar below is taken from the C standard and greatly simplified.
1. program:=
	list of functions or declarations (one funcion must be main)
2. declaration:=
	type_name comma list of declarators ;
3. type_name:=
4. declarator:=
	id = ass_expr
	id [ ass_expr ]
5. function:=
	type_name id ( param_type_list ) block
	type_name id ( param_type_list ) ;
6. param_type_list:=
	empty or comma list of type_name idopt
7. block:=
	{ list of statements }
8. statement:=
	expression ;
9. keyword_statement:=
	for ( expressionopt ; expressionopt ; expressionopt )
	while ( expression ) statement
	if ( expression ) statement
	if ( expression ) statement else statement
	goto expression ;
	return expression;
10. labeled_statement:=
	new in current scopes id : statement
11. expression:=
	exression , ass_expr
12. ass_expr:= *
	unary_expr ass_operator ass_expr
13. ass_operator:=
	one of = -= += *= /= %=
14. equality_expr:=
	equality_expr == rel_expr
	equality_expr != rel_expr
15. rel_expr:=
	rel_expr < add_expr
	rel_expr > add_expr
	rel_expr <= add_expr
	rel_expr >= add_expr
16. add_expr:=
	add_expr + mul_expr
	add_expr - mul_expr
17. mul_expr:=
	mul_expr * unary_expr
	mul_expr / unary_expr
	mul_expr % unary_expr
18. unary_expr:=
	++ unary_expr
	-- unary_expr
	& unary_expr
	* unary_expr
	+ unary_expr
	- unary_expr
	! unary_expr
19. postfix_expr:=
	postfix_expr [ expression ]
	postfix_expr ( expr_list )
	postfix_expr --
	postfix_expr ++
20. expr_list:=
	empty or comma list of ass_expr
21. prim_expr:=
	known id
	( expression )
22. operator_io:=
	__out expression
	__in expression
id, keyword, string, and const are as commonly understood
The core of this grammar is the recursive definition of expressions. To analyse how expression have to be compiled into Subleq code the best way is to start with examples. Consider this
Since the expressions are recursive the compilation of each node must return the compiled code and the result - value of the expression:
t; b Z; Z t; Z;	# prepare temporary for the subtraction
c t;		# make subtraction
# at this point subtraction returns the above code
# and t temporary as the result
a Z; Z t; Z;
# now the code above evaluates the expression
# and the result is in t
Here Z is the zero register. Temporaries requested and allocated automatically for each expression, and then reused for any other expressions. Their names are t1, t2, t3, and so on. This seems quite simple. More complex cases arise when the values need to be modified in memory. For example:
This must be translated into
tl; k Z; Z tl; Z;
a tl:0;
Dereferencing makes the need for temporary to be inserted into the code (tl stands for temporary label). From this example we can see that there is no way to decouple the evaluation of dereferencing from the assignment. A few more examples:
tl; k Z; Z tl; Z;
a Z; Z tl:0; Z;

tl1; tl2;
k Z; Z tl1; Z tl2; Z;
tl1:0 tl2:0

tl1; tl2; tl3;
k Z; Z tl1; Z tl2; Z tl3; Z;
tl1:0 tl2:0; a Z; Z tl3:0; Z;
In the last case we need 3 (!) labels: 2 for clearing the variable and 1 for adding. Fortunately the number of cases involved L-values are limited:
Lval ++
Lval --
++ Lval
-- Lval
& Lval
Lval = expr
__in Lval
Lval can be either identifier, *(expr), or array[expr]. And array[expr] is equivalent to *(array+expr), which means that only 2 cases has to be considered: namely, identifier or *(expr).

Step 3: Function calls

Step 3.1: Simple calls

Consider this
void f(int a, int b);
The call to a function f must be translated into something like this
# 1 push b
# 2 push a
# 3 push ret_addr
# 4 goto f
# 5 ret_addr: pop3 sp
However, bear in mind that the arguments can be expressions and the call to a function can be a part of another expression - sub-expression, i.e. the code must properly handle more complicated cases like the following
int f(int a, int b)
	return f;
	int k;
	k(f(1,2),3);     // call through a variable
	k = f(1,2)(3,4);
# 1 push B
	# clearing the next cell in the stack
	# the line below is same as C: *(++sp)=0; [note that sp is negative]
	dec sp; t1; sp t1; t2; sp t2; t1:0 t2:0
	# C: *sp+=B;
	t3; sp t3; b Z; Z t3:0; Z

# 2 push A
	# the same trick with A
	dec sp; t4; sp t4; t5; sp t5; t4:0 t5:0
	t6; sp t6; a Z; Z t6:0; Z

# 3 push ret_addr
	dec sp; t7; sp t7; t8; sp t8; t7:0 t8:0
	t9; sp t9; t10 t9:0 goto_addr
	. t10: ret_addr
# 4 goto f
	goto_addr: Z Z f

# 5 ret_addr: pop x3 sp
	ret_addr: const(-3) sp
The code above does not handle yet return value nor expression instead of function name, like in the example above: call through a variable

Step 3.2: Real calls

Step 5 in the example above must be enhanced with return value extraction:
	t;			t    .   .
	t2			?+8  .   .
	dec sp			dec  sp  .
	sp t2			sp   ?+1 .
	t2:0 t 			0    t   .
	inc sp                  inc  sp  .
	const(-3) sp		cm3  sp  .
	# now t holds the result
Notation const(-3) sp is a short for
unique_name sp

Step 3.3: Inside function

Function code have to be wrapped with the following commands:
1. # push bp
2. # sp -> bp
3. # sp -= stack_size

   # ... function code

5. # bp -> sp
6. # pop bp
7. # return
	dec sp; ?+11; sp ?+7; ?+6; sp ?+2; 0
	?+6; sp ?+2; bp 0
	bp; sp bp
	stk_sz sp

	# ...

	sp; bp sp
	?+8; sp ?+4; bp; 0 bp; inc sp
	?+8; sp ?+4; ?+7; 0 ?+3; Z Z 0
stk_sz is a constant which is calculated for every function during parsing.
It turns out that it is not enough to save bp. A function call can happen inside an expression. In such case all temporaries of the expression have to be saved. A new function will be using the same temporary memory cells for its own needs. For the expression f()+g() the results of the calls may be stored in variables t1 and t2. If function g changes t1 where the result of function f is stored, the problem arises.
A good solution is to force every function push all temporaries it is using into a stack and to restore them upon exit. Consider the followig function:
int g()
	return k+1;
It is translated into:
	# save bp
	dec sp; ?+11; sp ?+7; ?+6; sp ?+2; 0
	?+6; sp ?+2; bp 0
	bp; sp bp

	# push t1
	dec sp; ?+11; sp ?+7; ?+6; sp ?+2; 0
	?+6; sp ?+2; t1 0
	# push t2
	dec sp; ?+11; sp ?+7; ?+6; sp ?+2; 0
	?+6; sp ?+2; t2 0

	# calculate addition
	t1; t2
	_k t1
	dec t1
	t1 t2
	# set the return value [negative]
	ax; t2 ax

	# pop t2
	?+8; sp ?+4; t2; 0 t2; inc sp
	# pop t1
	?+8; sp ?+4; t1; 0 t1; inc sp

	# restore bp
	sp; bp sp
	?+8; sp ?+4; bp; 0 bp; inc sp
	# exit
	?+8; sp ?+4; ?+7; 0 ?+3; Z Z 0

Step 3.4: Reuse of temporaries

Since all used temporaries in the function are pushed into the stack, it is worthwile to reduce the number of used temporaries. It is possible to do just by releasing any used temporary into a pool. Then later when a new temporary is requested, the pool is first checked and only if empty a new temporary is allocated.
Compare the following function:
int g()
	return 1+k[1];
Reuse of tempsNo reuse
        dec sp; ?+11; sp ?+7; ?+6; sp ?+2; 0
        ?+6; sp ?+2; bp 0
        bp; sp bp
        dec sp; ?+11; sp ?+7; ?+6; sp ?+2; 0
        ?+6; sp ?+2; t1 0
        dec sp; ?+11; sp ?+7; ?+6; sp ?+2; 0
        ?+6; sp ?+2; t2 0
        dec sp; ?+11; sp ?+7; ?+6; sp ?+2; 0
        ?+6; sp ?+2; t3 0

        t1; t2
        _k t1
        dec t1
        t1 t2
        t1; t3; ?+11; t2 Z; Z ?+4; Z; 0 t1; t1 t3
        t1; t2
        dec t1
        t3 t1
        t1 t2
        ax; t2 ax

        ?+8; sp ?+4; t3; 0 t3; inc sp
        ?+8; sp ?+4; t2; 0 t2; inc sp
        ?+8; sp ?+4; t1; 0 t1; inc sp
        sp; bp sp
        ?+8; sp ?+4; bp; 0 bp; inc sp
        ?+8; sp ?+4; ?+7; 0 ?+3; Z Z 0
        dec sp; ?+11; sp ?+7; ?+6; sp ?+2; 0
        ?+6; sp ?+2; bp 0
        bp; sp bp
        dec sp; ?+11; sp ?+7; ?+6; sp ?+2; 0
        ?+6; sp ?+2; t1 0
        dec sp; ?+11; sp ?+7; ?+6; sp ?+2; 0
        ?+6; sp ?+2; t2 0
        dec sp; ?+11; sp ?+7; ?+6; sp ?+2; 0
        ?+6; sp ?+2; t3 0
        dec sp; ?+11; sp ?+7; ?+6; sp ?+2; 0
        ?+6; sp ?+2; t4 0
        dec sp; ?+11; sp ?+7; ?+6; sp ?+2; 0
        ?+6; sp ?+2; t5 0
        dec sp; ?+11; sp ?+7; ?+6; sp ?+2; 0
        ?+6; sp ?+2; t6 0

        t1; t2
        _k t1
        dec t1
        t1 t2
        t3; t4; ?+11; t2 Z; Z ?+4; Z; 0 t3; t3 t4
        t5; t6
        dec t5
        t4 t5
        t5 t6
        ax; t6 ax

        ?+8; sp ?+4; t6; 0 t6; inc sp
        ?+8; sp ?+4; t5; 0 t5; inc sp
        ?+8; sp ?+4; t4; 0 t4; inc sp
        ?+8; sp ?+4; t3; 0 t3; inc sp
        ?+8; sp ?+4; t2; 0 t2; inc sp
        ?+8; sp ?+4; t1; 0 t1; inc sp
        sp; bp sp
        ?+8; sp ?+4; bp; 0 bp; inc sp
        ?+8; sp ?+4; ?+7; 0 ?+3; Z Z 0

Obviously the gain of reuse grows with the size of expression.

Step 4: Stack variables

Once bp is stored in the stack and sp is decremented, all local variables come to live. They can be accessed only indirectly because the compiler does not know their addresses. For example, function
int f(int x, int y)
	int a, b=3, c[3], d=5;
has 4 local variables with the stack size equal to 6. When this function is entered the stack has the following values:

... y[9] x[7] [ret_addr] [old_bp] a[?] b[3] c[?] c[?] c[?] d[5] ...
                             ^                               ^
                             bp                              sp
The compiler knows about the offset of each variable from bp:

Hence, in the code any reference to the local variable can be replaced with *(bp+offset) with the exception to the array c. The array has to be replaced with (bp+offset) because the name of the array does not refer to a variable, only referencing with [] does:
c[i] -> *(c+i) -> *((bp+3)+i)

Step 5: Multiplication

The only trivial multiplication in Subleq is the multiplication by 2: [t; a Z; a Z; Z t; Z]. To multiply 2 numbers one can use the formula
AB = (2A)*(B/2) + A*(B%2)
This is a simple recursive formula, but it requires integer and modular division. The division can be implemented as the following algorithm. Given two numbers A and B, we increase B by 2 until the next increase gives B greater then A. At the same time as increasing B, we increase a variable I by 2, which is initialized to 1. Now I holds the part of the result of division - the rest is to be calculated further using A-B and B. This can be done recursively accumulating all I's. At the last step when A<B, A is the modulo. This algorithm can be implemented as a short function in C. Upon the exit this function returns the integer division as the result and modulo division in the argument j.
int divMod(int a, int b, int * j)
         [20] 3  0  1
         20   6  1  2
         20 [12] 2 [4]
         20< 24  3  8

20-12=   [8]  3  0  1
         8   [6] 1 [2]
         8 < 12  2  4

8-6=     [2]  3  0  1

   if( a < b ) { *j=a; return 0; }

   int b1=b, i=1, bp, ip;

   bp = b1; ip = i;
   b1 *= 2; i *= 2;

   if( b1 > a ) 
      return ip+divMod(a-bp,b,j);

   goto next;

Since the multiplication operations (mult, idiv, and mod) require quite elaborate calculations, they are implemented as library functions. That is, each multiplication a*b is replaced with a call _mul(a,b), and later the library manager adds (if necessary) the implementation of the function. Hence any program using multiplication operations is increased in size by a constant value - the size of multiplication functions.

Step 6: Jumps

6.1 Bools

C style of Boolean expressions leads to long and elaborate code in Subleq. This is the result that every Boolean expression evaluates on basis equal or not equal to zero. Since this is unnatural to Subleq operation, a several steps required selecting true or false result. Another approach is to treat less or equal to zero as false and positive value as true. Then if-expression if(expr){...} will be just one instruction
Z t @  --+
{...}    |
<next> <-+
where t is the result of expression. However to remain fully compatible with C (for example, if(x+1){...} - implicit conversion to Boolean) we have to detect all cases where integer expression is used as Boolean. Fortunately there are only a few cases where we need this:
  • if(expr)
  • while(expr)
  • for(...,expr,...)
  • ! expr
  • expr1 && expr2
  • expr1 || expr2
The job can be done inside the parser, so the compiler would not have to care about Boolean or integer expression, just much simpler code.

However another problem appears with C compatibility, namely storing or passing Boolean as integer. For example, in the following cases:
  • passing an argument f(a>0)
  • returning from a function return(a>0);
  • assignment x=(a>0);
  • other arithmetic expression x=1+5*(a>0);
In these cases to make the behavior compatible with C the Boolean must be converted to C-style Boolean, that is negative result zeroed:
x Z @ -+
x x    |
Z   <--+

6.2 The Cascader

The following construction gives a nice framework of handling conditional jumps.
|             |
| +-- Z x @   |  x>0
| |   Z Z @ --|------->
| +-> x Z @ --|------->
|     Z       |  x==0
|     |       |
      | x<0
Note, that x does not change even it is the second operand; and Z is zero on any exit!