Implement Lexical Analyzer for the subset of C

Aim: Write a program in C to implement a lexical analyzer for a subset of C.


Objectives:

  1. To understand the basic principles of compilation.

  2. To study the lexical analysis phase of a compiler.


A compiler takes a source program as input and produces an equivalent sequence of machine instructions as output. This translation consists of a two-step processing of the source program.

  1. Analysis of source program.

  2. Synthesis of target program.

Analysis step:

It consists of three sub-steps:

  1. Lexical Analysis – determines the lexical constituents (tokens) of the source program.

  2. Syntax Analysis – determines the structure of the source string.

  3. Semantic Analysis – determines the meaning of the source string.

Synthesis Step:

It deals with memory allocation and code generation. The actions in the analysis phase are uniquely defined for a given language, but the actions in the synthesis step depend on the execution environment of the compiler.

e.g. – the operating system interface and target machine features such as the instruction set, addressing modes, etc.

Lexical analysis: The action of scanning the source program into proper syntactic classes is known as lexical analysis.

Task of Lexical Analysis:

  1. To scan the program into basic elements or tokens of the language.

  2. To build the Uniform symbol table (table of tokens).

  3. To build the symbol & literal table.

  4. To remove white spaces & comments.

  5. To detect errors such as invalid identifier or constant.
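One way to represent the syntactic classes produced by these tasks is an enumeration of token classes; a minimal sketch, with illustrative names not fixed by the text:

```c
/* Token classes for the C subset; the names are illustrative. */
enum token_class {
    TOK_KEYWORD,      /* int, if, while, ...            */
    TOK_IDENTIFIER,   /* user-defined names (task 1)    */
    TOK_CONSTANT,     /* integer constants (task 1)     */
    TOK_OPERATOR,     /* + - * / > >= ++ ...            */
    TOK_PUNCTUATION,  /* ; , ( ) { }                    */
    TOK_ERROR         /* invalid identifier or constant (task 5) */
};
```

Each uniform symbol produced by the scanner would carry one of these classes.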

Data structures / Databases required:

  1. Source program – The original source program, which the compiler scans as a string of characters.

  2. Terminal Table – A permanent database with an entry for each terminal symbol: arithmetic operators, keywords, and punctuation characters such as ';', ',', etc. Fields: Name of the symbol.

  3. Literal Table – This table is created during lexical analysis to describe all literals in the program.

Fields: Name of the literal.

  4. Identifier Table – Created during lexical analysis; describes all identifiers in the program.

Fields: Name of the identifier.

  5. Uniform Symbol Table – Created during lexical analysis to represent the program as a string of tokens rather than of individual characters. Each uniform symbol contains the identification of the table to which it belongs (IDN – identifier table, LIT – literal table, TRM – terminal symbol table) and the index of the entry within that table.

  6. Buffer – A one-buffer or two-buffer scheme is used to load the source program part by part, reducing disk I/O.

Keywords can be stored either in the terminal table or in a separate keyword table.
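The tables above can be sketched as C declarations. This is a minimal sketch: the field sizes and all type names are assumptions, and each table stores only the name field described in the text.

```c
/* Which table a uniform symbol points into. */
enum table_id { TRM, IDN, LIT };   /* terminal, identifier, literal */

/* A uniform symbol: the table identification plus an index into that table. */
struct uniform_symbol {
    enum table_id table;   /* TRM, IDN, or LIT            */
    int index;             /* entry number within that table */
};

/* Identifier and literal tables hold just the name, as described above;
   the 32-character limit is an assumption. */
struct identifier_entry { char name[32]; };
struct literal_entry    { char name[32]; };
```

With these declarations, the token stream for `count = 10;` might be represented as `{IDN, 0}, {TRM, k}, {LIT, 0}, {TRM, m}` for suitable terminal-table indices.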

Basic steps in algorithm

1. Initialize the line number to 1.

2. Read the source program line by line (the buffer holds one line).

3. For each line, separate the tokens:

i) Identifier / function name / keyword: follow the transition diagram to detect these, i.e. a letter followed by letters or digits. Search the keyword table; if the lexeme is found there it is a keyword, otherwise it is an identifier or function name.

ii) Integer constant: a digit followed by digits.

iii) All types of operators, such as >, >=, ++, +, –, etc.

iv) Remove comments of the type /* */.

v) Remove all white space.

4. Assign the current line number to the tokens and increment the line number.

5. Repeat steps 2-4 till end of file.
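The per-line scan of step 3 can be sketched in C as follows. This is a sketch, not a complete analyzer: the keyword list and the printed output format are assumptions, the symbol tables are not built, and comments that span multiple lines are not handled, since the buffer here is a single line.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Keyword table for the chosen C subset (the subset itself is an assumption). */
static const char *keywords[] = { "int", "char", "if", "else", "while", "for", "return" };

static int is_keyword(const char *s) {
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(s, keywords[i]) == 0)
            return 1;
    return 0;
}

/* Step 3: scan one line (the buffer), print each token with its class and
   line number, and return the number of tokens found. */
int scan_line(const char *line, int line_no) {
    int count = 0;
    const char *p = line;
    while (*p) {
        if (isspace((unsigned char)*p)) { p++; continue; }      /* step 3(v)  */
        if (p[0] == '/' && p[1] == '*') {                       /* step 3(iv) */
            const char *end = strstr(p + 2, "*/");              /* skip to close */
            p = end ? end + 2 : p + strlen(p);
            continue;
        }
        if (isalpha((unsigned char)*p) || *p == '_') {          /* step 3(i)  */
            char buf[64]; int n = 0;
            while ((isalnum((unsigned char)*p) || *p == '_') && n < 63)
                buf[n++] = *p++;
            buf[n] = '\0';
            printf("%d: %s <%s>\n", line_no, buf,
                   is_keyword(buf) ? "keyword" : "identifier");
        } else if (isdigit((unsigned char)*p)) {                /* step 3(ii) */
            char buf[64]; int n = 0;
            while (isdigit((unsigned char)*p) && n < 63)
                buf[n++] = *p++;
            buf[n] = '\0';
            printf("%d: %s <constant>\n", line_no, buf);
        } else {                                                /* step 3(iii) */
            char op[3] = { *p++, '\0', '\0' };
            if ((op[0] == '>' || op[0] == '<' || op[0] == '=' || op[0] == '!')
                && *p == '=')
                op[1] = *p++;                                   /* >=, <=, ==, != */
            else if ((op[0] == '+' || op[0] == '-') && *p == op[0])
                op[1] = *p++;                                   /* ++, -- */
            printf("%d: %s <operator>\n", line_no, op);
        }
        count++;
    }
    return count;
}
```

A driver would implement steps 1, 2, 4 and 5: read the file line by line with `fgets`, call `scan_line` on each buffer, and increment the line number until end of file.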