A C# RECURSIVE DESCENT PARSER

September 21, 2017 C# Developer Updates 0

The majority of the player’s interaction with System Dot is through ‘hacking’ different objects in the environment and altering their source code, which is a language invented by me simply called syscode. This is made possible through a recursive descent parser defined in the Parser class customly written for the game. Each interactable object in the world has an EnemyTerminal script attached to it, which has a Parser object. This script allows the player to click on an object and display a terminal with the object’s syscode inside of it. The player can alter this code and hit a compile button attached to the terminal, which will cause a parse function within the Parser class to execute. This parsing will take the object terminal’s code as input and return a list of actions for the object to execute as long as all of the written code conforms to the context-free grammar found in the header comment of the full Parser class here.

The parser can handle the following programming principles:

  • Built-in commands such as System.body(Color.BLUE);
  • Commenting with “//”

  • Data types like integers, doubles, booleans, and strings (i.e. int x = 4;)

  • Object methods like substring and length of a string

  • Basic arithmetic including addition, subtraction, multiplication, division, modulus

  • Type casting for instance floats being an int

  • Type checking for example ints being set to floats returning an error

  • Boolean logic like AND (&&) and OR (||) operators between boolean expressions

  • Full conditionals with “if”, “else if”, and “else”

  • Nested conditionals

  • While loops

  • For loops
  • Error handling

The parser was designed to be fully modularized. Whenever a new specific action is needed to solve a puzzle in the game that has not already been defined, the action can be added to the parser with ease without having to write a whole new script. The core function in the Parser class is the getToken() method. The getToken() method reads in each character of a string called tokens and evaluates its token type. Token types are an enumerated type such as ID, DOT, LPAREN, RPAREN, and so on. For instance, if the following string was passed in: System.body(Color.BLUE); then getToken() will return SYSTEM as a token type and store “System” in the token global variable. Then, we could check if the token read was a System type and, if it was, verify that a period follows it by calling getToken() again and checking if token type is DOT.

The parser continues to call getToken() in a while-loop until the the string is eaten up (or we reach the end of string token type EOS), an error is found, or the code runs into an infinite loop. Then, a series of conditionals check which type was ate up to then perform the corresponding actions. This while-loop mimics how an actual compiler runs for a programming language like C by using types such as EOS. However, what makes my compiler different is that it immediately stops parsing when it reaches game-specific clauses like getting an error or running into an infinite loop. Below is the while loop with the corresponding getToken(). More insight on the statement !isInfinite.Contains(100) will be given later on.

 

VARIABLES AND TYPE CHECKING

There are also several global objects that manage type-checking. First, a variable structure has a name, value, and an enumerated type. For instance, a variable declared as int x = 4; will have a name of “x”, a value of “4”, and an enumerated type of “INT”. It is through these variables that a symbol table is created in the form of a dictionary data object, which is nice because retrieving data takes O(1) time. The symbol table uses the name of the variable as keys and the actual variable object as the value. The symbol table is important for (1) checking if a variable already exists and (2) checking if an already existing variable matches a new assignment. To the left is an example of how the symbol table is populated with the following code:

int x = 3;
double y = 2.5;
string z = “Hello world”;

In the assignment section of the parser, the symbol table can be used to check for any new assignments or reassignments. For instance, the following two lines of code will result in type errors and return an ERROR action:

x = 4.25;
z = true;

In order to ensure widening primitive type conversions, there is another structure identified called numberOrString, which has an associated enumerated type tag and the value. Unlike variables, numberOrString is primarily used to assign the number or string after the equal sign in an assignment statement for easy type checking. For instance, int x = 4; would store “4” with a tag of “INT” and a value of “4”. Therefore, when a variable is checked to see if it has an appropriate corresponding value, it is as simple as checking the variable type with the tag. Consequently, we can do simple widening conversion between a real number and an integer by checking if the variable type is “REAL” and the tag of the number is “INT”. The Parser code demonstrating this widening conversion is depicted below.

 

SCOPE CHECKING

Not only are there objects designed for type checking, but there are also global structures meant for scope checking. The statementTag is a List<StatementType> structure that acts like a stack. Whenever a token type of IF, WHILE, or FOR is read by getToken(), a corresponding HASIF, ISWHILE, or ISFOR StatementType will be added to the statementTag structure.

In the code above, the ParseCondition() function parses the conditional statement after the IF token to check whether or not we parse the contents of the if-statement. If we do, ParseCondition() returns true and an EXECUTEIF StatementType is added to the statementTag stack. These StatementTypes not only evaluate scopes, but also check for syntax errors. Let’s look at the code below to demonstrate this functionality. If an ELSE token is read in at any moment and we do not have a HASIF or EXECUTEIF StatementType at the top of the stack, then an error will be returned because a corresponding if-statement for the else-statement was never evaluated.

For each new assignment, the newAssignments integer list stores the number of new assignments of a certain scope. Whenever getToken() reads an LBRACE (left brace) or FOR token (we do this for the FOR token because of new assignments within the for condition), an integer is added to the list starting at 0. Then, whenever there is an INT, DOUBLE, STRING, or BOOLEAN token, that new integer in the list will be incremented, documenting how many new assignments have been made within that current scope. When we read in an RBRACE (right brace), that integer will be removed from the list while also removing all the values in the symbol table starting from the newly assigned values up to the newAssignment integer amount. This relationship between left and right braces define the scope of variables– the variable’s life starts at the left brace and dies at the right brace. In order to ensure correct syntax, the statementTag is always checked to see if it has elements inside its stack to confirm that the left and right braces were supposed to be there in the first place.

 

NESTED CONDITIONALS

To better understand how the stack functions in the parser, here is a demonstration of the stack for a nested if-condition statement.

If S1 is a true statement, then we would have the following statementTag stack after executing the first line,

Once again, the reason we add EXECUTEIF and ISEXECUTING to the stack is because now the parser knows it is traversing the body of the if-statement. Therefore, if it finds a rogue else-statement lurking within, it knows it cannot tie it to a HASIF because the top tag on the stack is ISEXECUTING. Therefore, we return an action of an ERROR and go on our way. However, if we find another if-statement, we simply add a HASIF statement tag to it as shown below.

After executing the second line,

HASIF is added because another if-statement is read, but EXECUTEIF or ISEXECUTING is not because S2 does not evaluate to true. Therefore, if an else-statement is read, I can tie it to that HASIF. We can see that occur when we execute lines 4-5,

The top HASIF has been removed because we found the corresponding else statement. Otherwise, it would be removed when reading line 6 if the else statement was not there. However, when we do read line 6, the stack looks like this,

We keep HASIF in the stack in case we read in an else-statement, which we do after line 7. That outer else-statement corresponds to the last statement tag in the stack, which then removes it and makes the stack empty again. In the end, we did not produce any errors and the parser continues to go through the rest of the code.

 

LOOPS

Loops are handled in a similar fashion as scope checking with a minor caveat. Whenever the getToken() function is called, the passed in string that is going to be parsed will have a character removed from the front end of the string. This makes looping difficult– how will we parse the string again if the loop condition does not fail? Instead of storing each instruction into a sophisticated data structure, I simply took the approach of saving the string in a string list called loopCode and then concatenating that string into the passed-in string if the loop does not terminate. Therefore, as we continue to parse, all the actions and values generated from that passed-in string will be performed again.

The loopCode object attains its contents after we call ReadBody(). This function is called whenever getToken() returns a WHILE or FOR token type. Even if the loop condition immediately evaluates to false, ReadBody() will still be called to act like a storage facility. The main method of attaining the string that will be stored is by parsing the passed-in code string until an equal number of left and right braces is reached. It is only then that we terminate the continuous getToken() calls. The concat string is used to keep track of the code following the “body” code. Therefore, once the body is stored in loopCode, that substring is appended to the front of the following substring after it to get the complete code string. Refer to the code below for how this is done.

Since there may be many nested loops, loopCode adds the string in the body to the latest index so that it can manage different stored body substrings. That is why we use loopCode.Count – 1 as an index; an empty string is already stored and we’re replacing it with the actual string segment.

Whenever the loop condition evaluates to true, the passed-in code will have the body code reattached and back in business to get parsed again. This process is done once the right brace is read, there are statement types inside the statementTag stack, and at the top of the stack there is a ISWHILE StatementType (which says we are at the end of the while-loop body). The following piece of code demonstrates that concatenation if the loop condition evaluates to true:

If the loop condition evaluates to false and terminates the loop body, then there is a condition that checks for it at the top of the outermost while-loop. keepParsing is set to false when the loop condition returns a false value. In those cases, we clear the condition and body code of the loop, remove ISWHILE from the statementTag stack, and continue to parse the code string. Here is how keepParsing is used:

There are cases when the user can write a loop that never terminates. For instance, if the player types while(true){ }, the parser will continuously getToken(), evaluate loop condition, store body into loopCode, reset the code, evaluate condition again, and so on infinitely. By typing that simple line of code, the player could crash the system and the game. To account for this, a precaution has been added to make sure an infinite loop is never entered, and if so, terminate it at the threshold limit. Right now, the threshold limit is 100 because there is not a reason for the player to choose an iteration level over 100 (and if they do, they will be notified). The threshold limit is stored in an integer list called isInfinite. The reason why it is an integer list is to take into account nested loops with their own threshold limits. Every time a loop is first read, 0 is added to the list. Every time loopCode is attached to the code string, that 0 is incremented. Then, if the list has a value that is greater than 100, terminate the loop that continuously calls getToken().

Additionally, having a threshold limit allows me to categorize an infinite loop as an action itself.

This can provide for some unique and imaginative gameplay elements. Imagine the player enters a factory with an assembly line machine continuously smashing the floor at a speed where the player cannot pass without getting killed. Upon hacking the machine, the player discovers the following code:

In this case, the red code is considered “legacy code” that the player cannot modify. Instead, they have to figure out how to alter the existing code without producing a syntax error to stop the machine from smashing. The player could change int x = 4; to int x = -1; so the while loop never executes. Regardless, in order to set up this scenario in the first place, we would have to recognize an infinite loop as an action and then have the actions inside the infinite loop repeat indefinitely. In this case, the machine would continuously smash the floor and increment the value of x.

 

Of course, I could not go through the entire Parser class (there are over 3000 lines of code), but I hope the portions highlighted gave some insight on how the parser, as a whole, works. I think my custom parser is what distinguishes System Dot from other virtual worlds/video games that teach programming. For instance, unlike CodeCombat or CodeAcademy, which asks the user for a specific segment of code from a list and will not execute unless that code is entered, System Dot gives players the freedom to write whatever they want and see their creations come to life. Furthermore, instead of writing the same lines of code over and over again to reinforce programming principles in other instructional programming software, the player in System Dot can write the same code in a variety of ways and see how those different ways affect the behavior of the object they are modifying. Once again, this amount of liberty is what I think will cause players to continue playing System Dot; they have no bounds on their creativity other than the locations of the game objects in the game world. In the end, our ultimate goal with my custom parser is to provide a canvas for students to flex their computational thinking skills through diverse coding problems.