HIGHER SUBLEQMay 2011: This page has retired. For Notes on making Higher Subleq Compiler see Low Cost Subleq SupercomputerNotes on making Higher Subleq CompilerLast update Main compiler pageNotes on making Higher Subleq.Step 1: Simple function callsAfter playing with Subleq (one instruction computer language) I decided to write a compiler to Subleq from a higher language. I chose C++ as an implementation language. For simplicity I started with extremely simplified language, later it could be enhanced with more features like process control statements, arrays, pointer arithmetic, auto (stack) variables, function arguments and return values, and classes. I tried it to be as much as possible like C, but on the other hand bound to Subleq internal structure. The initial step was to write the compiler for such simple program asvoid f() { __out 50; } void main() { __out 49; f(); __out 10; }This is not trivial from Subleq point of view since it requires a stack model and Subleq does not have one. An artificial operator __out is added to make the program to produce a visible output. Later it will be removed as the language evolves with proper output functions. So, for now __out is just to manifest the execution. The constant argument is the ASCII code for the character: the program must then print 12 and new line character. The initial grammar for Hsq: program:= list of functions, one must be main function:= void id() block block:= { list of statements } statement:= expression ; expression:= function_call operator_out function_call:= id() operator_out:= __out const id and const are as commonly understoodThe process flow is the following:
Text
void main() { __out 50; }
Tokens
[void] [main] [(] [)] [{] [__out] [50] [;] [}]
ITR
Root :1 + Function [main] :1 +Param type list :1 +Block :2 +Op IO [__out] :3 +Const [50] :3
ITR
Interpreter Output 2
Assemly (relative addressing)
top:top top main main: c1 (1) Z Z (1) . c1:50 . inc:1 Z:0 dec:1 sp:sp
Subleq (absolute addressing)
0 0 3 9 1 6 11 11 1 50 1 0 1 13
Output
2ROOT  +Function [f] (1)  +Block (2)  +Statement (3)  +Expression (3)  +Out [50] (3)  +Function [main] (6) +Block (7) +Statement (8)  +Expression (8)  +Out [49] (8) +Statement (9)  +Expression (9)  +FuncCall [f] (9) +Statement (10) +Expression (10) +Out [10] (10)Here line numbers are printed in parentheses and grammar values in square brackets. This structure is easily traversed with recursive functions. Hence every node will be a class derived from the abstract class with the common functionality. The class might look like the following snippet from the program: class ItrExpr: public ITR { public: ItrExpr(const itrinfo &i): ITR(i) {} string typname() const { return "Expression"; } void exec(); string compile(); static parseinfo parse(kid i); };The element of the tree structure knows everything about how it is created (parse method), how it is executed or interpreted (exec method), and how it is compiled (compile method). The virtual function typname() used to produce the dump of the tree. The interpreter is a simple process starting at the Root, finding the main() function, executing all statements of the function. The compiler is a process of writing the output Subleq code. For the given program Subleq code is the following: 0 0 sqmain _f: c1 (1) # return ?+8; sp ?+4; ?+7; 0 ?+3; Z Z 0 _main: c2 (1) # call f dec sp; ?+11; sp ?+7; ?+6; sp ?+2; 0 ?+6; sp ?+2; ?+2 0 ?+1; . ?+3; Z Z _f; inc sp c3 (1) # return ?+8; sp ?+4; ?+7; 0 ?+3; Z Z 0 sqmain: # call main dec sp; ?+11; sp ?+7; ?+6; sp ?+2; 0 ?+6; sp ?+2; ?+2 0 ?+1; . ?+3; Z Z _main; inc sp 0 0 (1) . c3:10 c2:49 c1:50 . inc:1 Z:0 dec:1 sp:spCalling a function and returning from the function using the stack seems confusing. But this is the price to pay for the simplicity of the final instruction language. Step 2: ExpressionsNow let us define the proper grammar for the language. The grammar below is taken from the C standard and greatly simplified.token:= keyword id const string punctuator comment 1. program:= list of functions or declarations (one funcion must be main) 2. declaration:= type_name comma list of declarators ; 3. type_name:= void int 4. declarator:= id id = ass_expr id [ ass_expr ] 5. function:= type_name id ( param_type_list ) block type_name id ( param_type_list ) ; 6. param_type_list:= empty or comma list of type_name id_{opt} 7. block:= { list of statements } 8. statement:= ; block keyword_statement labeled_statement expression ; 9. keyword_statement:= for ( expression_{opt} ; expression_{opt} ; expression_{opt} ) while ( expression ) statement if ( expression ) statement if ( expression ) statement else statement goto expression ; continue; break; return; return expression; 10. labeled_statement:= new in current scopes id : statement 11. expression:= ass_expr exression , ass_expr 12. ass_expr:= * equality_expr unary_expr ass_operator ass_expr 13. ass_operator:= one of = = += *= /= %= 14. equality_expr:= rel_expr equality_expr == rel_expr equality_expr != rel_expr 15. rel_expr:= add_expr rel_expr < add_expr rel_expr > add_expr rel_expr <= add_expr rel_expr >= add_expr 16. add_expr:= mul_expr add_expr + mul_expr add_expr  mul_expr 17. mul_expr:= unary_expr mul_expr * unary_expr mul_expr / unary_expr mul_expr % unary_expr 18. unary_expr:= postfix_expr ++ unary_expr  unary_expr & unary_expr * unary_expr + unary_expr  unary_expr ! unary_expr 19. postfix_expr:= prim_expr postfix_expr [ expression ] postfix_expr ( expr_list ) postfix_expr  postfix_expr ++ 20. expr_list:= empty or comma list of ass_expr 21. prim_expr:= known id const string ( expression ) operator_io 22. operator_io:= __out expression __in expression id, keyword, string, and const are as commonly understoodThe core of this grammar is the recursive definition of expressions. To analyse how expression have to be compiled into Subleq code the best way is to start with examples. Consider this a+(bc)Since the expressions are recursive the compilation of each node must return the compiled code and the result  value of the expression: t; b Z; Z t; Z; # prepare temporary for the subtraction c t; # make subtraction # at this point subtraction returns the above code # and t temporary as the result a Z; Z t; Z; # now the code above evaluates the expression # and the result is in tHere Z is the zero register. Temporaries requested and allocated automatically for each expression, and then reused for any other expressions. Their names are t1, t2, t3, and so on. This seems quite simple. More complex cases arise when the values need to be modified in memory. For example: *k=aThis must be translated into tl; k Z; Z tl; Z; a tl:0;Dereferencing makes the need for temporary to be inserted into the code (tl stands for temporary label). From this example we can see that there is no way to decouple the evaluation of dereferencing from the assignment. A few more examples: *k+=a tl; k Z; Z tl; Z; a Z; Z tl:0; Z; *k=0 tl1; tl2; k Z; Z tl1; Z tl2; Z; tl1:0 tl2:0 *k=a tl1; tl2; tl3; k Z; Z tl1; Z tl2; Z tl3; Z; tl1:0 tl2:0; a Z; Z tl3:0; Z;In the last case we need 3 (!) labels: 2 for clearing the variable and 1 for adding. Fortunately the number of cases involved Lvalues are limited: Lval ++ Lval  ++ Lval  Lval & Lval Lval = expr __in LvalLval can be either identifier, *(expr), or array[expr]. And array[expr] is equivalent to *(array+expr), which means that only 2 cases has to be considered: namely, identifier or *(expr). Step 3: Function callsStep 3.1: Simple callsConsider thisvoid f(int a, int b); ... f(a,b);The call to a function f must be translated into something like this # 1 push b # 2 push a # 3 push ret_addr # 4 goto f # 5 ret_addr: pop3 spHowever, bear in mind that the arguments can be expressions and the call to a function can be a part of another expression  subexpression, i.e. the code must properly handle more complicated cases like the following int f(int a, int b) { ... return f; } ... int k; k=f; k(f(1,2),3); // call through a variable k = f(1,2)(3,4);So, # 1 push B # clearing the next cell in the stack # the line below is same as C: *(++sp)=0; [note that sp is negative] dec sp; t1; sp t1; t2; sp t2; t1:0 t2:0 # C: *sp+=B; t3; sp t3; b Z; Z t3:0; Z # 2 push A # the same trick with A dec sp; t4; sp t4; t5; sp t5; t4:0 t5:0 t6; sp t6; a Z; Z t6:0; Z # 3 push ret_addr dec sp; t7; sp t7; t8; sp t8; t7:0 t8:0 t9; sp t9; t10 t9:0 goto_addr . t10: ret_addr # 4 goto f goto_addr: Z Z f # 5 ret_addr: pop x3 sp ret_addr: const(3) spThe code above does not handle yet return value nor expression instead of function name, like in the example above: call through a variable Step 3.2: Real callsStep 5 in the example above must be enhanced with return value extraction:ret_addr: t; t . . t2 ?+8 . . dec sp dec sp . sp t2 sp ?+1 . t2:0 t 0 t . inc sp inc sp . const(3) sp cm3 sp . # now t holds the resultNotation const(3) sp is a short for unique_name sp ... unique_name:3 Step 3.3: Inside functionFunction code have to be wrapped with the following commands:1. # push bp 2. # sp > bp 3. # sp = stack_size # ... function code 5. # bp > sp 6. # pop bp 7. # returni.e. dec sp; ?+11; sp ?+7; ?+6; sp ?+2; 0 ?+6; sp ?+2; bp 0 bp; sp bp stk_sz sp # ... sp; bp sp ?+8; sp ?+4; bp; 0 bp; inc sp ?+8; sp ?+4; ?+7; 0 ?+3; Z Z 0stk_sz is a constant which is calculated for every function during parsing. It turns out that it is not enough to save bp. A function call can happen inside an expression. In such case all temporaries of the expression have to be saved. A new function will be using the same temporary memory cells for its own needs. For the expression f()+g() the results of the calls may be stored in variables t1 and t2. If function g changes t1 where the result of function f is stored, the problem arises. A good solution is to force every function push all temporaries it is using into a stack and to restore them upon exit. Consider the followig function: int g() { return k+1; }It is translated into: _g: # save bp dec sp; ?+11; sp ?+7; ?+6; sp ?+2; 0 ?+6; sp ?+2; bp 0 bp; sp bp # push t1 dec sp; ?+11; sp ?+7; ?+6; sp ?+2; 0 ?+6; sp ?+2; t1 0 # push t2 dec sp; ?+11; sp ?+7; ?+6; sp ?+2; 0 ?+6; sp ?+2; t2 0 # calculate addition t1; t2 _k t1 dec t1 t1 t2 # set the return value [negative] ax; t2 ax # pop t2 ?+8; sp ?+4; t2; 0 t2; inc sp # pop t1 ?+8; sp ?+4; t1; 0 t1; inc sp # restore bp sp; bp sp ?+8; sp ?+4; bp; 0 bp; inc sp # exit ?+8; sp ?+4; ?+7; 0 ?+3; Z Z 0 Step 3.4: Reuse of temporariesSince all used temporaries in the function are pushed into the stack, it is worthwile to reduce the number of used temporaries. It is possible to do just by releasing any used temporary into a pool. Then later when a new temporary is requested, the pool is first checked and only if empty a new temporary is allocated.Compare the following function: int g() { return 1+k[1]; }
Obviously the gain of reuse grows with the size of expression. Step 4: Stack variablesOnce bp is stored in the stack and sp is decremented, all local variables come to live. They can be accessed only indirectly because the compiler does not know their addresses. For example, functionint f(int x, int y) { int a, b=3, c[3], d=5; ... } f(7,9);has 4 local variables with the stack size equal to 6. When this function is entered the stack has the following values: ... y[9] x[7] [ret_addr] [old_bp] a[?] b[3] c[?] c[?] c[?] d[5] ... ^ ^ bp spThe compiler knows about the offset of each variable from bp:
Hence, in the code any reference to the local variable can be replaced with *(bp+offset) with the exception to the array c. The array has to be replaced with (bp+offset) because the name of the array does not refer to a variable, only referencing with [] does: c[i] > *(c+i) > *((bp+3)+i) Step 5: MultiplicationThe only trivial multiplication in Subleq is the multiplication by 2: [t; a Z; a Z; Z t; Z]. To multiply 2 numbers one can use the formulaAB = (2A)*(B/2) + A*(B%2) This is a simple recursive formula, but it requires integer and modular division. The division can be implemented as the following algorithm. Given two numbers A and B, we increase B by 2 until the next increase gives B greater then A. At the same time as increasing B, we increase a variable I by 2, which is initialized to 1. Now I holds the part of the result of division  the rest is to be calculated further using AB and B. This can be done recursively accumulating all I's. At the last step when A<B, A is the modulo. This algorithm can be implemented as a short function in C. Upon the exit this function returns the integer division as the result and modulo division in the argument j.
Since the multiplication operations (mult, idiv, and mod) require quite elaborate calculations, they are implemented as library functions. That is, each multiplication a*b is replaced with a call _mul(a,b), and later the library manager adds (if necessary) the implementation of the function. Hence any program using multiplication operations is increased in size by a constant value  the size of multiplication functions. Step 6: Jumps6.1 BoolsC style of Boolean expressions leads to long and elaborate code in Subleq. This is the result that every Boolean expression evaluates on basis equal or not equal to zero. Since this is unnatural to Subleq operation, a several steps required selecting true or false result. Another approach is to treat less or equal to zero as false and positive value as true. Then ifexpression if(expr){...} will be just one instructionZ t @ + {...}  <next> <+where t is the result of expression. However to remain fully compatible with C (for example, if(x+1){...}  implicit conversion to Boolean) we have to detect all cases where integer expression is used as Boolean. Fortunately there are only a few cases where we need this:
However another problem appears with C compatibility, namely storing or passing Boolean as integer. For example, in the following cases:
6.2 The CascaderThe following construction gives a nice framework of handling conditional jumps.
