- lex and yacc help you write programs that transform structured input
- Organization of a Lex program
Regular Expression to DFA
Compiler is a program which converts the given source program in high-level language into an equivalent machine language. While doing so, it detects errors and reports errors. This process is quite complex and it is divided into number of phases such as Lexical Analysis, Syntax and Semantic Analysis, Intermediate Code generation, Code Generation and Code Optimization.
The lexical analysis phase of compiler reads the source program, character by character and then groups these characters into tokens which are further passed to next phase, which is nothing but parsing or syntax or semantic analysis. After syntax and semantic analysis, Intermediate Code is generated which is followed by actual code generation.
Lexical Analyzer recognizes the tokens from series of characters. A “C” program consists of tokens such as Identifiers, Integers, Floating Point Numbers, Punctuation symbols, relational and logical and arithmetic operators, keywords and comments (to be removed). To identify these tokens, lexical analyzer needs the specification of each of these symbols. The set of words belonging to a particular token type is a regular language. Hence each of these token types can be specified using regular expressions. For example, consider the token Identifier. In most of the programming languages, an identifier is a word which begins with an alphabet (capital or small) followed by zero or more letters or digits (0..9). This can be defined by the regular expression
(letter) . ( letter | digit)* where
letter = A|B|C|……|Z| a| b |c |……|z
and digit = 0|1|2|….|9
One can specify all token types using regular expressions. These regular expressions are then converted to DFA’s in the form of DFA transition table. Lexical analyzer reads a character from a source program and based on the current state and current symbol read, makes a transition to some other state. When it reaches a final state of DFA, it groups the series of characters so far read and outputs the token found.
Formal definition of Regular expression
The class of regular expressions over ∑ is defined recursively as follows:
1. The letters φ and Є are regular expressions over ∑ .
2. Every letter ‘a’ c ∑is a regular expression over ∑ .
3 If ‘R1’ and ‘R are regular expressions over ∑, then so are ‘(R1|R2)’, ‘(R1.R2)’ and (R1)*
Where
‘|’ indicates alternative or parallel paths.
‘.’ Indicates concatenation
‘*’ indicates closure
4. The regular expressions are only those that are obtained using rules (1) and (2).
Formal definition of DFA:
The formal definition of finite automata is denoted by a tuple
( Q, ∑,d, qo, f)
Where
Q à Finite set of table
∑à finite input alphabet
qoà Initial state of FA,qo, qo Q
Fà set of final states, F c Q
dà Transition function called as state function mapping
Q * ∑ à Q
i.e. d= Q*∑à Q
A FA is called deterministic (DFA) if from every vertex of its transition graph, there is an unique input symbol which takes the vertex state to the required next state.
DFA is constructed directly from an augmented regular expression ( r )#. We begin by constructing a syntax tree T for ( r ) # and then we compute four functions Nullable, Firstpos, Lastpos and Followpos. The functions Nullable, Firstpos, Lastpos are defined on the nodes of a syntax tree and are used to compute Followpos which is defined on set of positions. We can short circuit the construction of NFA by building the DFA whose states correspond to the sets of positions in the tree. Positions, in particular , encode the information regarding when one position can follow another. Each symbol in an input string to a DFA can be matched by certain positions. An input symbol ‘c’ can only be matched by positions at which there is a ‘c’ but not every position with a ‘c’ can necessarily match a particular occurrences of ‘c’ in input stream.
The steps in algorithm are
Nullable, Firstpos, Lastpos, Followpos
Also if some leaf node is for έ, then it is also nullable.
Firstpos (Firstposition): At each node n of the syntax tree of a regular expression, we define a function firstpos(n) that gives the set of first positions that can match first symbol of a string generated by sub expression rooted at ‘n’.
Lastpos (lastposition) : At each node n of the syntax tree of a regular expression, we define a function lastpos(n) that gives the set of last positions that can match last symbol of a string generated by sub expression rooted at ‘n’.
To compute firstposition and last position, we need to know which nodes are the roots of sub expression that generate languages that include the empty string. Such nodes are Nullable. We define nullable(n) to be true if node ‘n’ is nullable , false otherwise.
Computation of Followpos : Followpos(i) tells us what positions can follow position i
in the syntax tree. This can be computed as follows.
1. if n is a ‘.’ (cat) Node, with a left child C1 and right child C2 and i is a position in the Lastpos(C1), then all positions in Firstpos(C2) are in Followpos(i)
2. if n is a * (closure) Node and i is a position in the Lastpos(n), then all positions in Firstpos(n) are Followpos(i)
6. Construct DFA from Follow Pos.
Note : Step 5 can be done during construction of tree, since you are building the tree from bottom to top, and when computations at some root of sub tree are to be done, information of sub tree is available. So no need to do any traversal.
Node Structure for Parse Tree
{
Leftchild and Rightchild : pointers to the node structure
Nullable : Boolean Type
Data : Character Type
Fistpos and Lastpos : set of integers
Pos : integer (this may or may not be part of tree node)
}
Stack : Stack is required to build the tree. This can be implemented either using link list (preferable) or as an array. Item or data that will be pushed into or popped out from stack is pointer to the node structure of a Parse Tree and not just a single character..
Computations of Firstpos and Lastpos.
Node n |
Nullable(n) |
Firstpos(n) |
Lastpos(n) |
|||||
N is a leaf labeled έ |
true
|
ø
|
ø
|
|||||
|
Nullable(c1) or
Nullable (c2) |
Firstpos (C1) U Firstpos(C2) |
Lastpos (C1) U Lasttpos(C2) |
|||||
N is a leaf labeled with position i |
false
|
{ i } |
{ i } |
|||||
|
Nullable(c1) and
Nullable (c2) |
If nullable (C1) then Firstpos (C1) U Firstpos(C2) else Firstpos(C1) |
If nullable (C2) then Lastpos (C1) U Lastpos(C2) else Lastpos(C2) |
|||||
|
true
|
Firstpos (C1) |
Lastpos (C1) |
Algorithm for construction of DFA transition table
Begin
Mark T
For each input symbol a do
Begin
Let U be the set of positions that are in Followpos(P) for some P in T,
such that the symbol at position P is a.
If U is not empty and is not in Dstates then add U as an unmarked
state to Dstates
Dtran [T,a] = U
End
End
Macro Processor
Implementation of TWO Pass assembler with hypothetical Instruction set Instruction set should include all types of assembly language statements such as Imperative, Declarative and Assembler Directive. While designing stress should be given on
To learn the basic translation process of assembly language to machine Language.
A language translator bridges an execution gap to machine language of computer system. An assembler is a language translator whose source language is assembly language.
Language processing activity consists of two phases, Analysis phase and synthesis phase. Analysis of source program consists of three components, Lexical rules, syntax rules and semantic rules. Lexical rules govern the formation of valid statements in source language. Semantic rules associate the formation meaning with valid statements of language. Synthesis phase is concerned with construction of target language statements, which have the same meaning as source language statements. This consists of memory allocation and code generation.
Analysis of source program statements may not be immediately followed by synthesis of equivalent target statements. This is due to forward references issue concerning memory requirements and organization of Language Processor (LP).
Forward reference of a program entity is a reference to the entity, which precedes its definition in the program. While processing a statement containing a forward reference, language processor does not posses all relevant information concerning referenced entity. This creates difficulties in synthesizing the equivalent target statements. This problem can be solved by postponing the generation of target code until more information concerning the entity is available. This also reduces memory requirements of LP and simplifies its organization. This leads to multi-pass model of language processing.
It is the processing of every statement in a source program or its equivalent representation to perform language-processing function.
There are three types of statements Imperative, Declarative, Assembly directives. An imperative statement indicates an action to be performed during the execution of assembled program. Each imperative statement usually translates into one machine instruction. Declarative statement e.g. DS reserves areas of memory and associates names with them. DC constructs memory word containing constants. Assembler directives instruct the assembler to perform certain actions during assembly of a program, e.g. START directive indicates that the first word of the target program generated by assembler should be placed at memory word with address
Tasks performed by the passes of two-pass assembler are as follows:
Synthesize the target code by processing the intermediate code generated during
It has the following fields
Index | Mnemonic | TYPE | OP-Code | Length | Link |
0 | ADD | IS | 01 | 01 | -1 |
1 | BC | IS | 07 | 01 | -1 |
2 | COMP | IS | 06 | 01 | -1 |
3 | DIV | IS | 08 | 01 | 5 |
4 | EQU | AD | 03 | – | 7 |
5 | DC | DL | 01 | – | 6 |
6 | DS | DL | 02 | – | -1 |
7 | END | AD | 05 | – | -1 |
Mnemonic : Such as ADD, END, DC
TYPE : IS for imperative, DL for declarative and AD for Assembler directive
OP- code : Operation code indicating the operation to be performed.
Length : Length of instruction required for Location Counter Processing
Hash table Implementation of MOT to minimize the search time required for searching the instruction.
Hash Function used is ASCII Value of the First letter of Mnemonic – 65. This helps in retrieving the op- code and other related information in minimum time. For Example the instruction starting with alphabet ‘A’ will be found at index location 0, ‘B’ at index 1, so on and so forth. If more instructions exist with same alphabet then the instruction is stored at empty location and the index of that instruction is stored in the link field. Thus instructions starting with alphabet ‘D’ will be stored at index locations 3,5,and 6. Those starting with E will be stored at 4 and 7 and the process continues.
SYMTB: The symbol table.
Fields are Symbol name, Address (LC Value). Initialize all values in the address fields to -1 and when symbol gets added when it appears in label field replace address value with current LC. The symbol if it used but not defined will have address value -1 which will be used for error detection.
Symbol | Address |
Loop | 204 |
Next | 214 |
Literal | Address |
= ‘5’ | |
= ‘1’ | |
=‘1’ |
Intermediate form used Variant 1 / Variant 2
Students are supposed to write the variant used by them.
case 1 : imperative statement
case 2: Declarative statement
case 3: Assembler Directive
Imperative statement case
If opcode >= 1 && opcode <=8 ( Instruction requires register operand)
a. Set type as IS, get opcode, get register code, and make entry into symbol or literal table as the case may be. In case of symbol, used as operand, LC field is not known so LC could be -1. Perform LC processing LC++. Updating of symbol table should consider error handling.
if opcode is 00 ( stop) :
Set all fields of Intermediate call as 00. LC++
else register operand not required ( Read and Print)
Same as case 1, only register code is not required, so set it to zero. Here again update the symbol table. LC++
On similar lines we can identify the cases for declarative and assembler directive statements based on opcode.
List of hypothetical instructions:
Instruction Assembly Remarks
Opcode mnemonic
00 STOP stop execution
01 ADD first operand modified condition code set
02 SUB first operand modified condition code set
03 MULT first operand modified condition code set
04 MOVER register memory
05 MOVEM memory register
06 COMP sets condition code
07 BC branch on condition code
08 DIV analogous to SUB
09 READ first operand is not used.
10 PRINT first operand is not used.
Errors
Forward reference(Symbol used but not defined): –
This error occurs when some symbol is used but it is not defined into the program.
Duplication of Symbol
This error occurs when some symbol is declared more than once in the program.
Mnemonic error
If there is invalid instruction then this error will occur.
Register error
If there is invalid register then this error will occur.
Operand error
This error will occur when there is an error in the operand field,
%{ # include # include "y.tab.h" # include %} %% [0-9]+(\.[0-9]+)? { yylval.dval=atof(yytext);return NUMBER;} [\t] ; /* ignore whitespace */ \n {return 0;} . return yytext[0]; %% int yywrap(void) { return 1; }
%{ # include # include %} %union { double dval; } %token NUMBER ; %left '-''+' %left '*''/' %right '^' %type expr %% result: '='expr |expr {printf("%f",$1);} ; expr: expr'+'expr {$$=$1+$3;} | expr'-'expr {$$=$1-$3;} | expr'*'expr {$$=$1*$3;} | expr'^'expr { $$=pow($1,$3);} | expr'/'expr { if($3==0) yyerror("devide by zero"); else $$=$1/$3;} | '-'expr {$$=-$2;} | '('expr')' { $$=$2;} | NUMBER {$$=$1;} ; %% void yyerror(char *s) { fprintf(stdout,"%s\n",s); } int main() { yyparse(); return 0; }