HIGHER SUBLEQNotes on making Higher Subleq CompilerLast update Main compiler pageNotes on making Higher Subleq.Step 1: Simple function callsAfter playing with Subleq (one instruction computer language) I decided to write a compiler to Subleq from a higher language. I chose C++ as an implementation language. For simplicity I started with extremely simplified language, later it could be enhanced with more features like process control statements, arrays, pointer arithmetic, auto (stack) variables, function arguments and return values, and classes. I tried it to be as much as possible like C, but on the other hand bound to Subleq internal structure. The initial step was to write the compiler for such simple program as
void f()
{
__out 50;
}
void main()
{
__out 49;
f();
__out 10;
}
This is not trivial from Subleq point of view since it
requires a stack model and Subleq does not have one.
An artificial operator __out is added to make the program
to produce a visible output. Later it will be removed as
the language evolves with proper output functions. So,
for now __out is just to manifest the execution. The
constant argument is the ASCII code for the character: the
program must then print 12 and new line character.
The initial grammar for Hsq:
program:=
list of functions, one must be main
function:=
void id() block
block:=
{ list of statements }
statement:=
expression ;
expression:=
function_call
operator_out
function_call:=
id()
operator_out:=
__out const
id and const are as commonly understood
The process flow is the following:
Text
void main()
{
__out 50;
}
Tokens
[void] [main]
[(] [)] [{]
[__out] [50]
[;] [}]
ITR
Root :1
+- Function [main] :1
+--Param type list :1
+--Block :2
+--Op IO [__out] :3
+--Const [50] :3
ITR
Interpreter Output 2
Assemly (relative addressing)
top:top top main main: c1 (-1) Z Z (-1) . c1:50 . inc:-1 Z:0 dec:1 sp:-sp
Subleq (absolute addressing)
0 0 3 9 -1 6 11 11 -1 50 -1 0 1 -13
Output
2
ROOT
|
+--Function [f] (1)
| +--Block (2)
| +--Statement (3)
| +--Expression (3)
| +--Out [50] (3)
|
+--Function [main] (6)
+--Block (7)
+--Statement (8)
| +--Expression (8)
| +--Out [49] (8)
+--Statement (9)
| +--Expression (9)
| +--FuncCall [f] (9)
+--Statement (10)
+--Expression (10)
+--Out [10] (10)
Here line numbers are printed in parentheses and grammar values
in square brackets. This structure is easily traversed with
recursive functions. Hence every node will be a class derived
from the abstract class with the common functionality. The
class might look like the following snippet from the program:
class ItrExpr: public ITR
{
public:
ItrExpr(const itrinfo &i): ITR(i) {}
string typname() const { return "Expression"; }
void exec();
string compile();
static parseinfo parse(kid i);
};
The element of the tree structure knows everything about how
it is created (parse method), how it is executed or interpreted
(exec method), and how it is compiled (compile method). The virtual
function typname() used to produce the dump of the tree.
The interpreter is a simple process starting at the Root, finding
the main() function, executing all statements of the function.
The compiler is a process of writing the output Subleq code.
For the given program Subleq code is the following:
0 0 sqmain _f: c1 (-1) # return ?+8; sp ?+4; ?+7; 0 ?+3; Z Z 0 _main: c2 (-1) # call f dec sp; ?+11; sp ?+7; ?+6; sp ?+2; 0 ?+6; sp ?+2; ?+2 0 ?+1; . ?+3; Z Z _f; inc sp c3 (-1) # return ?+8; sp ?+4; ?+7; 0 ?+3; Z Z 0 sqmain: # call main dec sp; ?+11; sp ?+7; ?+6; sp ?+2; 0 ?+6; sp ?+2; ?+2 0 ?+1; . ?+3; Z Z _main; inc sp 0 0 (-1) . c3:10 c2:49 c1:50 . inc:-1 Z:0 dec:1 sp:-spCalling a function and returning from the function using the stack seems confusing. But this is the price to pay for the simplicity of the final instruction language. Step 2: ExpressionsNow let us define the proper grammar for the language. The grammar below is taken from the C standard and greatly simplified.
token:=
keyword
id
const
string
punctuator
comment
1. program:=
list of functions or declarations (one funcion must be main)
2. declaration:=
type_name comma list of declarators ;
3. type_name:=
void
int
4. declarator:=
id
id = ass_expr
id [ ass_expr ]
5. function:=
type_name id ( param_type_list ) block
type_name id ( param_type_list ) ;
6. param_type_list:=
empty or comma list of type_name idopt
7. block:=
{ list of statements }
8. statement:=
;
block
keyword_statement
labeled_statement
expression ;
9. keyword_statement:=
for ( expressionopt ; expressionopt ; expressionopt )
while ( expression ) statement
if ( expression ) statement
if ( expression ) statement else statement
goto expression ;
continue;
break;
return;
return expression;
10. labeled_statement:=
new in current scopes id : statement
11. expression:=
ass_expr
exression , ass_expr
12. ass_expr:= *
equality_expr
unary_expr ass_operator ass_expr
13. ass_operator:=
one of = -= += *= /= %=
14. equality_expr:=
rel_expr
equality_expr == rel_expr
equality_expr != rel_expr
15. rel_expr:=
add_expr
rel_expr < add_expr
rel_expr > add_expr
rel_expr <= add_expr
rel_expr >= add_expr
16. add_expr:=
mul_expr
add_expr + mul_expr
add_expr - mul_expr
17. mul_expr:=
unary_expr
mul_expr * unary_expr
mul_expr / unary_expr
mul_expr % unary_expr
18. unary_expr:=
postfix_expr
++ unary_expr
-- unary_expr
& unary_expr
* unary_expr
+ unary_expr
- unary_expr
! unary_expr
19. postfix_expr:=
prim_expr
postfix_expr [ expression ]
postfix_expr ( expr_list )
postfix_expr --
postfix_expr ++
20. expr_list:=
empty or comma list of ass_expr
21. prim_expr:=
known id
const
string
( expression )
operator_io
22. operator_io:=
__out expression
__in expression
id, keyword, string, and const are as commonly understood
The core of this grammar is the recursive definition of expressions.
To analyse how expression have to be compiled into Subleq
code the best way is to start with examples. Consider this
a+(b-c)Since the expressions are recursive the compilation of each node must return the compiled code and the result - value of the expression: t; b Z; Z t; Z; # prepare temporary for the subtraction c t; # make subtraction # at this point subtraction returns the above code # and t temporary as the result a Z; Z t; Z; # now the code above evaluates the expression # and the result is in tHere Z is the zero register. Temporaries requested and allocated automatically for each expression, and then reused for any other expressions. Their names are t1, t2, t3, and so on. This seems quite simple. More complex cases arise when the values need to be modified in memory. For example: *k-=aThis must be translated into tl; k Z; Z tl; Z; a tl:0;Dereferencing makes the need for temporary to be inserted into the code (tl stands for temporary label). From this example we can see that there is no way to decouple the evaluation of dereferencing from the assignment. A few more examples: *k+=a tl; k Z; Z tl; Z; a Z; Z tl:0; Z; *k=0 tl1; tl2; k Z; Z tl1; Z tl2; Z; tl1:0 tl2:0 *k=a tl1; tl2; tl3; k Z; Z tl1; Z tl2; Z tl3; Z; tl1:0 tl2:0; a Z; Z tl3:0; Z;In the last case we need 3 (!) labels: 2 for clearing the variable and 1 for adding. Fortunately the number of cases involved L-values are limited: Lval ++ Lval -- ++ Lval -- Lval & Lval Lval = expr __in LvalLval can be either identifier, *(expr), or array[expr]. And array[expr] is equivalent to *(array+expr), which means that only 2 cases has to be considered: namely, identifier or *(expr). Step 3: Function callsStep 3.1: Simple callsConsider thisvoid f(int a, int b); ... f(a,b);The call to a function f must be translated into something like this # 1 push b # 2 push a # 3 push ret_addr # 4 goto f # 5 ret_addr: pop3 spHowever, bear in mind that the arguments can be expressions and the call to a function can be a part of another expression - sub-expression, i.e. the code must properly handle more complicated cases like the following
int f(int a, int b)
{
...
return f;
}
...
int k;
k=f;
k(f(1,2),3); // call through a variable
k = f(1,2)(3,4);
So,
# 1 push B # clearing the next cell in the stack # the line below is same as C: *(++sp)=0; [note that sp is negative] dec sp; t1; sp t1; t2; sp t2; t1:0 t2:0 # C: *sp+=B; t3; sp t3; b Z; Z t3:0; Z # 2 push A # the same trick with A dec sp; t4; sp t4; t5; sp t5; t4:0 t5:0 t6; sp t6; a Z; Z t6:0; Z # 3 push ret_addr dec sp; t7; sp t7; t8; sp t8; t7:0 t8:0 t9; sp t9; t10 t9:0 goto_addr . t10: ret_addr # 4 goto f goto_addr: Z Z f # 5 ret_addr: pop x3 sp ret_addr: const(-3) spThe code above does not handle yet return value nor expression instead of function name, like in the example above: call through a variable Step 3.2: Real callsStep 5 in the example above must be enhanced with return value extraction:ret_addr: t; t . . t2 ?+8 . . dec sp dec sp . sp t2 sp ?+1 . t2:0 t 0 t . inc sp inc sp . const(-3) sp cm3 sp . # now t holds the resultNotation const(-3) sp is a short for unique_name sp ... unique_name:-3 Step 3.3: Inside functionFunction code have to be wrapped with the following commands:1. # push bp 2. # sp -> bp 3. # sp -= stack_size # ... function code 5. # bp -> sp 6. # pop bp 7. # returni.e. dec sp; ?+11; sp ?+7; ?+6; sp ?+2; 0 ?+6; sp ?+2; bp 0 bp; sp bp stk_sz sp # ... sp; bp sp ?+8; sp ?+4; bp; 0 bp; inc sp ?+8; sp ?+4; ?+7; 0 ?+3; Z Z 0stk_sz is a constant which is calculated for every function during parsing. It turns out that it is not enough to save bp. A function call can happen inside an expression. In such case all temporaries of the expression have to be saved. A new function will be using the same temporary memory cells for its own needs. For the expression f()+g() the results of the calls may be stored in variables t1 and t2. If function g changes t1 where the result of function f is stored, the problem arises. A good solution is to force every function push all temporaries it is using into a stack and to restore them upon exit. Consider the followig function:
int g()
{
return k+1;
}
It is translated into:
_g: # save bp dec sp; ?+11; sp ?+7; ?+6; sp ?+2; 0 ?+6; sp ?+2; bp 0 bp; sp bp # push t1 dec sp; ?+11; sp ?+7; ?+6; sp ?+2; 0 ?+6; sp ?+2; t1 0 # push t2 dec sp; ?+11; sp ?+7; ?+6; sp ?+2; 0 ?+6; sp ?+2; t2 0 # calculate addition t1; t2 _k t1 dec t1 t1 t2 # set the return value [negative] ax; t2 ax # pop t2 ?+8; sp ?+4; t2; 0 t2; inc sp # pop t1 ?+8; sp ?+4; t1; 0 t1; inc sp # restore bp sp; bp sp ?+8; sp ?+4; bp; 0 bp; inc sp # exit ?+8; sp ?+4; ?+7; 0 ?+3; Z Z 0 Step 3.4: Reuse of temporariesSince all used temporaries in the function are pushed into the stack, it is worthwile to reduce the number of used temporaries. It is possible to do just by releasing any used temporary into a pool. Then later when a new temporary is requested, the pool is first checked and only if empty a new temporary is allocated.Compare the following function:
int g()
{
return 1+k[1];
}
Obviously the gain of reuse grows with the size of expression. Step 4: Stack variablesOnce bp is stored in the stack and sp is decremented, all local variables come to live. They can be accessed only indirectly because the compiler does not know their addresses. For example, function
int f(int x, int y)
{
int a, b=3, c[3], d=5;
...
}
f(7,9);
has 4 local variables with the stack size equal to 6. When this function is entered
the stack has the following values:
... y[9] x[7] [ret_addr] [old_bp] a[?] b[3] c[?] c[?] c[?] d[5] ...
^ ^
bp sp
The compiler knows about the offset of each variable from bp:
Hence, in the code any reference to the local variable can be replaced with *(bp+offset) with the exception to the array c. The array has to be replaced with (bp+offset) because the name of the array does not refer to a variable, only referencing with [] does: c[i] -> *(c+i) -> *((bp+3)+i) Step 5: MultiplicationThe only trivial multiplication in Subleq is the multiplication by 2: [t; a Z; a Z; Z t; Z]. To multiply 2 numbers one can use the formulaAB = (2A)*(B/2) + A*(B%2) This is a simple recursive formula, but it requires integer and modular division. The division can be implemented as the following algorithm. Given two numbers A and B, we increase B by 2 until the next increase gives B greater then A. At the same time as increasing B, we increase a variable I by 2, which is initialized to 1. Now I holds the part of the result of division - the rest is to be calculated further using A-B and B. This can be done recursively accumulating all I's. At the last step when A<B, A is the modulo. This algorithm can be implemented as a short function in C. Upon the exit this function returns the integer division as the result and modulo division in the argument j.
Since the multiplication operations (mult, idiv, and mod) require quite elaborate calculations, they are implemented as library functions. That is, each multiplication a*b is replaced with a call _mul(a,b), and later the library manager adds (if necessary) the implementation of the function. Hence any program using multiplication operations is increased in size by a constant value - the size of multiplication functions. Step 6: Jumps6.1 BoolsC style of Boolean expressions leads to long and elaborate code in Subleq. This is the result that every Boolean expression evaluates on basis equal or not equal to zero. Since this is unnatural to Subleq operation, a several steps required selecting true or false result. Another approach is to treat less or equal to zero as false and positive value as true. Then if-expression if(expr){...} will be just one instruction
Z t @ --+
{...} |
<next> <-+
where t is the result of expression. However to remain fully
compatible with C (for example, if(x+1){...} - implicit
conversion to Boolean) we have to detect all cases where integer expression
is used as Boolean. Fortunately there are only a few cases where we need
this:
However another problem appears with C compatibility, namely storing or passing Boolean as integer. For example, in the following cases:
6.2 The CascaderThe following construction gives a nice framework of handling conditional jumps.
|