CS 451/651 Notes on Function Calls (Nov 18++, 2009)


Overview

Discussed what steps are involved in adding support for functions.

Software Engineering aspects:  illustrative of ways we can keep a large project under control.

Software Engineering aspects

We're faced here with a complex undertaking, with many different parts.  Certainly, some careful planning and design work will be needed, lest it deteriorate into a mass of confusion which, quite honestly, might never work properly.
It's a myth that student projects, beset by limitations on time and resources, are the only ones to actually fail.  Many large commercial projects have had to be abandoned, often after large investments of time and money.  And even those which are "successful" carry a continuing burden of defects (consider Microsoft Windows, for instance).

Some useful ideas: details should be apparent below.

Functions:  Design Discussion

To provide functions, we'll need a collection of different pieces (enumerated further below).
All of these will need some support in the parser (and perhaps even in the scanner, if we add additional keywords).  There will also have to be new Node types, with analyze() and emit() routines.  Additional ILOC instructions will have to be implemented.
We can start anywhere, and work outwards.  Sooner or later, we'll need a good understanding of what we expect to happen at runtime.  That's not surprising:  after all, it's the semantics (meaning) of the program that is our ultimate guide.

Simplifications:

--> we'll focus on creating a block of code which is the function, getting it defined, and then at runtime getting there and back.  We can populate the body with simple things like "print 2+2;" and thus avoid having to worry about how to implement scopes and local variables and all that complexity .. for a little while, anyway!  This will keep us busy enough for now.
--> we can add arguments later -- perhaps a good compromise will be to have just one argument, which we can pass in a register.
  We can also leave out some user-program error checking, and add it later.  Don't leave out consistency checking, though!  It's an important way to catch errors in your implementation.
--> if one function cannot call another, then there is a little less complication about remembering where we came from and how to get back.
 Not a lot is required to lift this restriction.  We save a little, initially, and every little bit can help.

Syntax

As you've noticed by now, syntax is more in the "necessary evil" category, and really up to the language designer's whim.  We can, potentially, invent our own syntax for function definitions ... certainly, we should be willing to put in variations if they're needed to get our job done easily.

Most of what we've been doing is essentially C/Java [subset] syntax.  Let's stick with that.

 If you'd like to explore other variations, feel free: just document what you're trying to do!  And bear in mind that you'll have to modify any standard test programs we've provided.

Clarification:  "Compile-time" and "Run-time" and ILOC

A program is a linear sequence of ILOC instructions.  Note that there are two, really distinct, points in time:  when you create the instructions (at "compile-time", via "emit()") ... and when you execute them (at "run-time", via "run()" and "execute()").  Make sure you're absolutely clear on the difference:  disaster will strike otherwise!

(This concept is not new.  You've already seen the distinction between what resides in a variable declaration [at compile time] -- information about names and types -- and what resides in memory [at runtime]:  values.  If you're "not clear on the concept", then you'll be confused by how we handle initialization clauses provided with the declaration.)

In the context of functions and function calls:
A small implementation detail.  The PC has the address (instruction number) of the next instruction which will be executed.  As soon as that instruction is fetched from i-memory, the PC is immediately incremented.  In the normal case, this will eventually lead to fetching (and executing) the following instruction.  But:
("real" computer hardware usually works this way, BTW, for more or less the same reasons.)
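The fetch-then-increment behavior can be sketched in a few lines.  This is a toy, not the course's ILOC classes:  the instruction encoding (a plain int per slot) and the class name MiniMachine are invented for illustration.  The point is just that the PC is bumped immediately after the fetch, so a branch instruction overwrites the already-incremented value.

```java
public class MiniMachine {
    // Toy instruction encoding (invented for this sketch):
    //   -2          HALT
    //   -1          an ordinary instruction (falls through to the next slot)
    //   n >= 0      BR n  (branch to instruction n)
    // Returns how many instructions were executed.
    public static int run(int[] imem) {
        int pc = 0, executed = 0;
        while (true) {
            int instr = imem[pc];   // fetch the instruction...
            pc++;                   // ...and increment the PC immediately
            executed++;
            if (instr == -2) break;         // HALT
            if (instr >= 0) pc = instr;     // BR: overwrite the incremented PC
            // instr == -1: ordinary case; the incremented PC stands
        }
        return executed;
    }
}
```

A CALL will exploit exactly this:  since the PC has already been incremented when CALL executes, saving "the current PC" saves the address of the instruction after the call -- which is precisely the return address we want.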


OK.

Here are some pieces involved in implementing functions, looking down from the top:

1. Syntax for a function definition/declaration.  Something like
    TYPE IDENT () { statementList }
which can be a Declaration in our classification of things.  Perhaps you have a non-terminal already which has {statementList} .. if so, you should be able to use that here, too, I suspect.

The type is the return type of the function.  We should eventually add "void" as a built-in type; then we can have functions which don't return anything.  (For our first implementation, we'll not worry about the return stuff.)  Also, we can add variants of this production -- later! -- to handle arguments.

Define a StmtNode subclass for FunctionDefinition and have the parser build an instance when this production is recognized.   (Important note:  when will that happen?  Yes, when the whole definition, including the body, has been seen.)

This class will have an analyze() and emit(), of course.
 Analyze() as usual does type checking.  We can probably skip it, as before, for the initial implementation, as we'll probably only do things like print expressions formed from literals.  Eventually, it should check the validity of the type (copy code from other decls) and analyze the statement list (presumably that's already written!)
 Emit is a different story.   See below.

2. Add a return statement, which has pretty simple grammar:
    RETURN  |   RETURN expr
(and you'll need to define the new terminal in the scanner, too).   Do we need both forms?  Probably not:  it violates our idea of keeping things simple.   Eventually we'll need the expression form.   (Perhaps it's not that much harder to do both at once:  one can imagine a "value" field, which might be null.  Just don't also get sucked into doing the semantic checking on the various new material for now.)
   And a new StatementNode subclass for return.  It needs to analyze the value subtree, if any:  easy.  What about type checking?  Aha, that is actually the hidden part of the iceberg.  The type of the return value must match the declaration of the current function.  How do we find the "current function" while we're processing its statements?  We'll leave that till a bit later on.  It won't hurt if we assume for now that the type is "ok":  there won't be a value at first, and later most of our stuff is just "int" anyway.
  Emit will put out a RETURN instruction.  It will be discussed shortly.

3.  Add a way to call functions.  Here we should consider the full picture briefly.  There are two kinds of function calls:  ones that return a value, and those that do not.  The value-return kind is part of an expression, since you can say
x = 2*fn(y)-3;
The other kind sits up at the level of a statement:
doSomething(x, y);
There are several ways to make this work.
A.  (Fairly simple)  Treat the expression case as the "normal" one, and so create a "function-call-expr".   This will eventually have to appear as a new option of "factor" in the grammar.  It will of course have an associated class (function-call-expr-node) in the AST.

 Then create a "function-call-statement" in the grammar and in the tree:  the [only] child of the function-call-statement is a function-call-expr.  By having both, the Java type-checking should be happier.  Since we've tended to separate "statements" and "expr-nodes" into two different groups, it's hard to now have something which can be both.  (otherwise, you could just say that a "function-call-expr" was one choice for "statement")

B.  C and Java solve this problem by allowing a "statement" to be just an "expr".  That also allows statements like
j++;
x+y;
(The first of these is probably more useful, but it's not the language's job to make sure you don't do silly things.)  Incidentally, in C and Java, an assignment is actually an expr too, which allows you to write
if ((x=y*3) < 10) ...
It probably wouldn't be very hard to make an "expr-statement" category. 

C.  In Fortran and some other languages, there's a keyword for the statement case, thus
call doSomething(x, y);
You're welcome to do that, though it doesn't seem to make anything much easier.


Emit

The CALL instruction is very similar to BR .. it holds a constant which identifies the next instruction to be executed.  In order to be able to get back, we provide CALL with another argument, a register into which it will deposit (at run-time, remember!) the current PC value.   The RETURN instruction needs to be given the same register, and will restore the old PC.
SIMPLIFICATION:  we'll not allow a function to call another function, yet.  This keeps us from having to save the return address value (which would otherwise be lost when the inner call occurs).

How do the caller and "callee" (the function) know which register to use for the return address?  Answer:  we'll establish a convention, and use a special register, the same one everywhere.   Let's call it "rRA" (Return Address);  we'll create it in Reg as part of the initialization.   (I'll distribute that code).   This is what normal compilers do.  rRA will be register 2.
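The runtime behavior of CALL and RETURN with the rRA convention can be demonstrated with a tiny interpreter.  This is a hedged sketch:  the {op, arg} encoding, the opcode numbers, and the class name CallDemo are all invented here (only the rRA = register 2 convention comes from the notes).

```java
public class CallDemo {
    // Convention from the notes: register 2 is rRA, the return address.
    static final int rRA = 2;

    // Toy encoding, each instruction is {op, arg}:
    //   {0, -}  HALT (run returns r0)
    //   {1, -}  ADD1: r0++   (stands in for "real work")
    //   {2, t}  CALL t: save the (already incremented) PC in rRA, jump to t
    //   {3, -}  RETURN: restore the old PC from rRA
    public static int run(int[][] imem) {
        int[] reg = new int[8];
        int pc = 0;
        while (true) {
            int[] ins = imem[pc];
            pc++;                           // incremented right after fetch
            switch (ins[0]) {
                case 0: return reg[0];          // HALT
                case 1: reg[0]++; break;        // ADD1
                case 2: reg[rRA] = pc;          // CALL: deposit return address,
                        pc = ins[1]; break;     //       then branch to target
                case 3: pc = reg[rRA]; break;   // RETURN: restore old PC
            }
        }
    }
}
```

Note why the simplification above matters:  a second CALL before the RETURN would overwrite rRA, losing the first return address -- which is exactly why a full implementation saves it in the stack frame.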

When we emit code for the function body, how to keep that from being accidentally encountered and executed as part of the outermost program sequence?  Simple solution:  insert a "BR" instruction before emitting the body, and then patch it up so it will branch around the function body.  This is a little bit of a kludge, but works -- and it's a nice, local, simple solution that doesn't need some kind of complex agreement among lots of parts of our system.  (Note that the logic here is fairly similar to the forward jumps we've been seeing in if-then-else and while constructs.) 
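The branch-around-and-patch idea looks like this in miniature.  The code buffer, the emitBR/patch helper names, and the "-1 means ordinary instruction" encoding are invented for this sketch; only nextLoc() echoes the real ILOC.nextLoc() mentioned below.

```java
import java.util.ArrayList;
import java.util.List;

public class Backpatch {
    // Toy code buffer: each slot holds a branch target, or -1 for an
    // ordinary (non-branch) instruction.
    final List<Integer> code = new ArrayList<>();

    int nextLoc() { return code.size(); }       // address of the next emit

    int emitBR() {                   // emit BR with a dummy target; remember slot
        code.add(-999);
        return code.size() - 1;
    }

    void emitOrdinary() { code.add(-1); }

    void patch(int slot, int target) { code.set(slot, target); }

    // Emitting a function definition: branch around the body so the main
    // sequence never falls into it.  Returns the body's entry address,
    // which the definition node can store for later CALLs.
    int emitFunction(int bodyLength) {
        int skip = emitBR();            // placeholder BR, target unknown yet
        int entry = nextLoc();          // first instruction of the body
        for (int i = 0; i < bodyLength; i++) emitOrdinary();
        patch(skip, nextLoc());         // now we know "just past the body"
        return entry;
    }
}
```

The patch step is the same trick as the forward branches in if-then-else and while:  emit now, fill in the target once it's known.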

Otherwise, emitting the function body is just like any other StatementList.  (Actually, that's not the whole story:  see part 2, below.)


We still need one very important piece of information, namely the address of the beginning of the function body's code.  Well, the emit() routine inside FunctionDefinitionNode knows (or, rather, discovers) that address when it goes to generate the code for the body.  (Clue:  ILOC.nextLoc().  Clue: see IfWhileStatement)  This can be stored as an attribute of the function definition/declaration.  As long as the declaration has been seen (emitted) before we try to generate (emit) code for the call, all should be well.
SIMPLIFICATION:  functions must be declared before they can be called.  (This is not too painful, and works for everything except what's called "mutual recursion"; it should be good enough for the likes of us.)
Question:  what is "mutual recursion"?  How do C and Java handle this?  (not the same way)



Part Two:  The Rest of the Story


1.  Prologue and Epilogue

For a full implementation, there's some bookkeeping required as a function gets underway.  This serves to save the return address and to create a new stack frame for the function, with links back to the previous one.  Typically, several instructions will be required, known as a "prologue".  Note that saving the return address in memory allows us now to call another function safely.

The reverse operation is required before the function actually returns.  Its stack frame needs to be removed and the previous state restored.  This is called the "epilogue".

--> more details on these can be found in the next lecture.

2.  How many returns?

In most languages, if control reaches the end of a function body, there's an implied "return" .. inserted by our friendly compiler.  But there's also a "return" statement.  Should the compiler add the implied return even if the programmer has an explicit one?

The answer is "yes".  The simple case isn't too bad -- a "return" as the last statement of the function.  But what if there's control flow?  Say we have
void fcn () { if (x < 0) return; }
Here it might look like the "last statement" is a return, but that's only on one path ...
So the safe solution is to always generate that implied return; and then let a general (optimizer) pass go through and notice places where there are extra branches, unreachable code, etc.

NOTE:  the implied return only works if there's no return value.  Otherwise an error will need to be reported.
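The "always append the implied return" rule is simple enough to show directly.  This is a sketch with invented names (string "instructions" and the emitFunctionBody helper); the real emit() would of course append an actual RETURN instruction to the ILOC stream.

```java
import java.util.ArrayList;
import java.util.List;

public class ImpliedReturn {
    // Emit the body, then unconditionally append RETURN -- even if the body
    // already ends with one on some path.  A later optimizer pass can remove
    // any instruction made unreachable by an earlier explicit return.
    public static List<String> emitFunctionBody(List<String> bodyCode) {
        List<String> out = new ArrayList<>(bodyCode);
        out.add("RETURN");          // the compiler's implied return
        return out;
    }
}
```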

3.  Arguments and return value

The return value is typically conveyed in a dedicated real register.  In our case, rVal is reserved for that.

To bridge between our world of infinite registers and the single real rVal, we add a copy ("i2i") instruction, which ensures that rVal is only used for a short time right around the call and return sequences.  At the call site, execution resumes right after the CALL instruction, so that is where the caller copies the result out of rVal.
If we limit ourselves to one argument, it can be passed in via rVal, with similar copies.  Or one can pass any number of arguments on the stack.  Both methods are used by real compilers.  More details will be provided later.
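A sketch of the instruction sequence a call-expression might emit under this convention.  The textual instruction format and the emitCall helper are invented for illustration; only the rVal/rRA conventions come from the notes.

```java
import java.util.ArrayList;
import java.util.List;

public class CallSiteEmit {
    // Emit a call whose result lands in virtual register resultReg.
    // The callee leaves its value in the dedicated rVal; the caller copies
    // it out immediately so rVal stays live only briefly.
    public static List<String> emitCall(int target, int resultReg) {
        List<String> code = new ArrayList<>();
        code.add("CALL " + target + ", rRA");   // jump, saving PC in rRA
        // Execution resumes here after the callee's RETURN:
        code.add("i2i rVal => r" + resultReg);  // grab the result at once
        return code;
    }
}
```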

4.  Name scopes

We'll need a separate nametable for each scope, with back links to the parent.  This will need a supporting mechanism in the CurrentFunction class, which is the route taken by anyone who wants to access the current nametable.

If you think about it for a little while, it's clear that the function's nametable needs to be de-selected at the end of the function.  That's probably easy:  it corresponds to a definite point in the grammar.  The new nametable needs to be created, of course, fairly early on.  (Not too early:  the function name itself is part of the outside scope, so the function can be called.)  How to get control at the proper time?  Answer:  a little clever "grammar programming".
  (stay tuned!)
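The per-scope nametable with a parent back link can be sketched as follows.  Class and method names here are illustrative, not the project's actual API; the entries are simplified to name-to-type strings.

```java
import java.util.HashMap;
import java.util.Map;

public class NameTable {
    // One table per scope; parent points to the enclosing scope
    // (null for the outermost, global scope).
    final NameTable parent;
    final Map<String, String> entries = new HashMap<>();

    NameTable(NameTable parent) { this.parent = parent; }

    // Declarations always go into the current (innermost) table.
    void declare(String name, String type) { entries.put(name, type); }

    // Lookup walks outward through the back links until a hit (or null).
    String lookup(String name) {
        for (NameTable t = this; t != null; t = t.parent) {
            String type = t.entries.get(name);
            if (type != null) return type;
        }
        return null;    // undeclared
    }
}
```

Note how this matches the timing described above:  the function's name is declared in the outer table, then a fresh table (with the outer one as parent) is selected for the body, and de-selecting at the end of the function is just a matter of switching back to the parent.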