运用Yacc和Lex来DIY计算器

最新推荐文章于 2023-09-27 03:47:14 发布

skywuuuu

最新推荐文章于 2023-09-27 03:47:14 发布

阅读量2.1k

点赞数 1

分类专栏：编译原理文章标签： flex

本文链接：https://blog.csdn.net/skywuuu/article/details/112672144

版权

编译原理专栏收录该内容

4 篇文章

订阅专栏

本文详细介绍了如何使用Yacc和Lex在Ubuntu和Windows上构建一个简单的计算器项目，包括下载安装步骤、功能演示、代码解释以及遇到的问题。计算器支持三种输出格式，包括解释器、编译器和语法树版本，并实现了基本和高级运算，如加减乘除、阶乘、二进制转换、三角函数等。同时，文章还探讨了ANTLR这一更先进的语法分析生成器的优势。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

前言

来源于设计计算器的小组作业（Github链接：https://github.com/Skywuuuu/LexAndYACC），运用了Yacc和Lex来设计一个简单的计算器。文章内容为英文，但通俗易懂（用翻译软件翻译一下应该没问题，暂时不校对中文版了（偷懒））。感谢小伙伴们的帮助，以下排名不分先后：️️dzj，lxz，lyhan，skywuuu. 文章有点长，Introduction可跳过，Preparation部分讲的是如何在windows和Ubuntu下安装Flex和Bison，之后是展示我们设计的功能（Demonstration fo functions）和代码解释（Code Explanation），遇到的一些困难和我们编译器的不足也可以跳过，最后一个部分是介绍ANTLR的。以下，enjoy：

Introduction

To deepen our understandingofthe working procedures of the compiler and learn how to design it practically,we construct acalculatoras a project.We will discuss how to develop this simple compiler using two basic tools which are Lex (Lexical Analyzer Generator) and Yacc (Yet Another Compiler-Compiler) in the Ubuntu(Different operating systems will support different versions of software). Specifically, the details about the implementation of the calculator and how to use it correctlywill be elaboratedstep by step as a manual.In the preparation stage, userswould knowhow to download and install the compulsory packages inwindowsand Ubuntuandhow to compile the source filesand runthem.Three output formats supported by our calculator andthe basicand advancedoperationswill beintroduced in the demonstration of calculator functions.In the code explanation, detailed information about sixdifferent source files and difficulties we met will be discussed. Then, somelimitationsofour calculator will be mentioned.Lastly, we will introduce a more advanced generator, ANTLR, and discuss why it is better.

Preparation

To construct a calculator, flex and GNU bison are necessary and powerful tools to use.

Download and Installation

Windows

Flex

If the operating system of your computer is Windows, you should download and install flex. The steps are as below:

Clickthis linkand you will see this page.

Click Setup (withthered underline in the picture) and thenit will automatically jump to another page and the download will begin in five seconds. After that, you will get an executable file (file ends with .exe). You can find it under the default download directory (C:\Users\Your_name\Downloads).

Double clicks and the installation begins. You will see the window below.

Press Next button to continue until you see the window below. Just use the default path and click Next. Copy the address path below for later use is better.

Just choose Full installation, which means all of them are ticked. And click next.

Click next until you complete the flex setup and clickFinish to exit the installation.

Finally, you need to set the environmental variable.
1. Right click this PC and choose the Properties.
2. Choose Advanced system settings.
3. Click Environment variables.
4. Under the System variables, click New…
5. Give the variable a name and paste the path, for example, my path is D:\GnuWin32\bin. Do remember to include directory bin.
6. Finally, click OK.
Test whether flex is successfully installed into your computer.
1. Click windows.
2. Type cmd, the abbreviation of command,and open the command window.
3. Type flex -V to check the version of flex. If it shows your version, that means your installation succeeds.

Bison

In this part, we will introduce you the way to download and install bison.

Clickthis linkand you will see this page.

Click Setup (with red underline) and thenit will automatically jump to another page and the download will begin. After that, you will get an executable file (file ends with .exe). You can find it under the directory C:\Users\Your_name\Downloads.

Step 3 to step 8 are similar to flex, so we just skip the introduction.

Ubuntu

Under Ubuntu, the download and installation are quite simple. Type the command below:

If ubuntu asks you to choose yes or no, just type Y and enter to continue.

How to compile all of the files

Compile lex file
- flex LA.l
Compile yacc file
- yacc -dtv SA.y
Compile C file by g++ to get the object file. (The reason for using g++ is that the source files apply some c++ characteristics.)
- g++ -c c_file.c
- g++ -c lex.yy.c
- g++ -c y.tab.c
Compile all object files to generate an executable file
- g++ -o a.out y.tab.o lex.yy.o c_file.o
For convenience, we write a shell script including all of the commandsabove to compile the file and generate an executable file.The shell script file is demonstrated below:

For different output formats, we just need to run the corresponding shell script to compile the source files. (Interpreter: makefile_i.sh, Compiler: makefile_c.sh, Graph: makefile_g.sh)

Demonstration of functions

Three types of outputs

Before demonstrating the functions of this calculator, we introduceyou3 different versions of outputs, which are interpreterversion, compilerversion, and syntax treeversion.

Interpreterversion

Fortheinterpreter, actually, all of the shown output in the function demonstration part is in this type. The interpreter will execute the input code line by line, therefore, we can see the output of print function on the terminal.

Compilerversion

The compilerversioncan output the corresponding intermediate code of the input code.

Syntax tree / Graphversion

The last type of output isthesyntax tree or graph. It can demonstrate the syntax tree of the input code.

Print

The calculator can output the value of a variable through “print”.

Input:

Output:

Noticethat if we print a variable having not been assigned value, it will output 0.

Input:

Output:

Basic arithmetic operations

Addition

Input:

Output:

Subtraction

Input:

Output:

Multiplication

Input:

Output:

Division

Input:

Output:

Other operations

Function fac is the abbreviation of factorial, which means users can use this function to calculate the mathematical expression like 5! and get the answer. However, ifanegative number or floating number is given, an error will occur.

Input:

Output:

Function B2D is the abbreviation of transferring binary number to decimal number, which means users can use this function to transfer, for example, 10000B to 16. Remember to add capital B at the end ofthebinary string.

Input:

Output:

Function sin means sin operation is supported in our compiler. We keep four decimal places.

Input:

Output:

Input:

Output:

Input:

Output:

Input:

Output:

Function pow is the abbreviation of calculating the power of a number. You can directly use ‘^’ symbol to get the power of a number.

Input:

Output:

Input:

Output:

Logic Judgement

if-else

This calculator accepts if-else statement. It can support condition statements including >, <, !=, ==, >=, <=. The if-else statement supports 2 types of format. The first type has braces enclosing the statement. For example:

Input :

Output:

It also acceptsstatementswithout braces, but only 1 statement can be writteninside the if and else respectively.

Input:

Output:

Iteration

while

Input:

Output:

Code Explanation

In this section,we will explainsix source files, which are LA.l, SA.y, calc.h, interpreter.c, compiler.c, graph.c. Notice that when we add a new operator, not only the lex and yacc file should be modified, the interpreter.c, compiler.c, and graph.c should also be changed. And due to the limitation of the articlelength,we decide to explain the code structure based on interpreter.c, andthe way to addanew operator based oncompiler.c andgraph.c, because the logic of these three files issimilar.

LA.l

The basic technology of Lex is the FSA (finite-state automation), which consists of one beginning state and one or many ending states. The Lex implements the FSA through the regular expressions, which will be simulated to execute the similar procedures in FSA to match the input contents and then return the tokens or execute other commands set previously. The LA.l file in our project consists ofthree parts:definitions,rules, andsubroutineswhich are separated by %%.

Definitions

In the partof definition, the header files of C or C++should be included.Contents between%{ and }%will be written into the .C file entirely withoutanychange after the compilation steps.Therefore,we can use some build-in functions or variables conveniently.

Rules

Variables and functions

In the rules part, there are some default variables and functions in lex file we need to useas follows.

"yytext"is the pointer to the matched string.
“yylval” isthe semantic value associated with the token. It will be passed to the yacc file as the parameters to execute the next operation.Thus,when further operation needs the content of the matched strings, yylval will be needed.Specifically, in our file, the yylval.s represents the stringdata typeand the yylval.iValue stands for floating numbers because floating numbersare more suitable for most of the operations (For the integer, we just store them as xxx.000000).
yyerroris used to output the error information whenthe input doesn’t match the rules.
atof() andatoi() are used to convert the input values to the float, int format.

Regular Expressions and Actions

Decimal Number

Since we need the token and the detailed value of them for someoperations, we keep the value in the yylval and pass them to the yacc file as semantic values. The two ‘?’ in [0-9]+(.[0-9]+)?([eE][0-9]+)? improve the robustness of the input, which can handle the cases in both integer and float formats since ‘?’ means those two items can exist or not.

Relation Operators

Symbols****

We only need the operators’ name to pass to yacc file so that we can use themtocombine with different expressions. So we just return their names as tokens using ‘*’.

Binary Number

We first use the malloc() to allocate a memory with string type for the yylval.s, and we use the strcpy() that is declared in the to copy the string content from yytext to yylval.s. So the whole bit string can be passed to yacc file as semantic value with token ‘BIN’ .

Mathematical Functions

We just return the tokens and the yacc file will execute the corresponding functions we write accordingtothe token it receives. We just execute these three commands like basic operands, ‘+’, ‘-‘, ‘*’, ‘/’, and the onlydifferentthing is that we write the details of these three functions by ourselves.

Keywords

Identifiers

The identifiers startwith ‘’ or letters and just can be constructed by letter, ‘’ or numbers. And it will pass the token and semantic value to SA.y file.

Ignore Whitespaces

Error Reporting Function

When tokens don’t satisfy the rules, we call the yyerror function to output the error information.

Subroutines

Theyywrap( )functionwill be executed in this part. Itis used to determine whether the program has read all the input streams. If it’s called at the end of the last input command, it will return 1 so that system can receive and stop the program, otherwise it will return 0.

SA.y

Yacc is a useful program used toproduce the source code of the syntax analyzer by the language based on the gramma we rule. The grammar of the Yacc file is similar to the lex one, whichisdivided into three parts:definitions,rules**,andsubroutines**.

Definitions

Header filesand function prototypesare contained.Besides, thevariable"sym" is defined bymapin C++, so that thevariable namecan bea string withanuncertain lengthother than just one letter.

Keywords Definition

Each keyword begins with a “%” and is followed by a token.

The"%union"part is used to assign the different data types to the different symbols."iValue"belongsto float type variables,"sIndex"belongsto char type variable,“nPtr” is thepointer pointing to any type of variables and the"s"is the string variable.

As shown, it’s optional to include a that supports the member name in the %union part for the**%token**,%left,%right****,and%nonassoc, while the**%type**must include a .

In our file, the**%token**part decides the terminals in the grammar ofthecalculator, which include the FLOAT with token , the VARIABLE with token ~~, the BIN with token~~, and the WHILE, IF, PRINT, SIN, B2D, FAC.~~~~

The******%left, %right****,and****%nonassoc**identify tokens for theirassociation modes. For instance, “+ - * /” tokens is left-associative, thus defined in the %left.

The**%type**identifies nonterminals.

Rules

In this part, the source codes resemble the BNF grammar.The grammar rules containproductions on the left side andsemanticactions on the right side enclosedin {}.

In the parsing stage, we can use the values of the symbols storedin the value stacks: $$ means the attribute values of the non-terminalsonthe left side, $i means the attribute value of the ith symbol on the right side. These variableswillbe passed to the opr() function with the matched operators and the quantities of the operands. Besides, thecon(), str()andthe id() will be called to create the pointer with the value and the matched type.

Subroutines

The details of the functions are shown here. The con(), str(), id() are the functions used to return the target pointer with the variable type and value. And the opr() is specially used to return the pointer storingthe operands and the operator.

calc3.h

nodeEnum

The keyword “enum” is the abbreviation for enumeration. The default value of the first enumeration member is 0, and the value of subsequent enumeration members is incremented by 1 over the previous member, which means typeCon=0, typeId=1, typeOpr=2 and typeStr=4."nodeEnum"is the enumeration variable.

conNodeType

This is the definition of constant value. We useafloating number rather than integer because floating numbersis more precisethan integer.

strNodeType

The “strNodeType” is used to hold the constant string, for example, the binary string 1000B will be stored by strNodeType.

idNodeType

In this node, the name of identifiers can be stored here.

oprNodeType

Operator has its symbol and the number of operands. That’s why we define “oper” to store the ASCII code oftheoperator and “nops” to store the number of operands. The extra struct “nodeTypeTag” can be used at runtime to reserve all kinds of node we define.

nodeType

This struct can be regarded as a collection of all types that we introduce in front of it.

bool operator()(char const *a, char const *b) const

This is acomparison function passed to map so that map can compare char* correctly.

The variable “sym” means symbol table, it stores the names of identifiers during the execution oftheprogram. We can get the value of an identifier by its name, whichis the feature of map, one ofthedata structures. We add keyword "extern"in front of the map"sym"so that"sym"can be usedas a general variablein other files.

interpreter.c

The purpose of interpreter is to get the numerical answer of the input. It contains three functions. They are ex(), BtoD(), factorial(), respectively.

float ex(nodeType *p)

Function ex() carries the main execution of this file. It is recursive. And switch-case structure inside it makes the compiler be able to recognize different types of variables and do the function such as assignment, calculation, iteration, logic judgement,and demonstration correspondingly.

Here is the pseudocode of ex():

float BtoD (char* bin)

Function BtoD is in charge of transferring binary string into decimal number. It will keep doing the calculation until it encountersa capital B, which represents the end of the binary string. There are two reasons why we use “char*” rather than int type. On the one hand, “char*” can contain a very long sequence of string while “int” cannot. On the other hand, we need an identifier to differentiate the binary strings.

float factorial (float fac)

Function factorial can calculate the factorial of zero and positive integer (Here we use floating number to represent integer). Therefore, this function will check if the variable “fac” is a positive integer-like floating number. If the answer is yes, then the function of calculating the factorial will continue, otherwise, the corresponding error information will be returned.

graph.c

This file can output the syntax tree of the given code. It contains 3 main parts, which are the ex() function, exNode() function, and a bunch of drawing functions.When a new operator is added to the calculator, we just need to modify the exNode()function and left the remaindingjobto the drawing functions.

Adding new operators:

For example, whenthesin()function is addedto the calculator, we just need to add the following code inside the switch structure under the case ofp->type ==typeOpr:

Adding new data types:

At this calculator, we also addeda string datatype, which is a new type of node. Therefore, weshouldfirstly add a string node in the calc3.h file like other types of node. Assume its"enum"namedefined in the calc3.his"typeStr", then weshouldadd the following code inside the first switch structure, which tells the exNode() function what to do when encountering a string type node:

Also, weshouldadd the"typeStr"into the following if statement, which is used to draw the graph ofleaves in the parse tree:

compiler.c

Adding new data types:

When a new type of data is added to thecalculator, wealsoneed to update the compiler.c file. For example, we added a string data type to the compiler and declared it in the calc3.h file, whose"enum"name is"typeStr". We just need to add a new case like the following picture into the first switch structure:

Adding new operators:

Another case is that a new operator is added. For example, the factorial operator is added, and we should add the following code inside the switch structure under"typeOpr"case:

It will runtheex()function with the operands recursivelyfirstly,and print out"fac"on the terminal.

Difficulty

Duringthe development of our calculator, we faced the followingdifficulties,andwe solve them at the end.

SA.y file receives the wrong lexeme

We have a function B2D(), which can turn the binary number into decimal. It will receive a regular expression, “[01]+B”, that is, a series of 0 and 1 ending with “B”, and then return a token named"BIN", whose type is"char*". However, we foundthat the lexeme passed to the syntax analyzer is not our expectation. The wrong lexeme contains the semi-colon(😉 following behind, which is shown in the following picture:

By printing out the lexeme received bythesyntax analyzer, we found the bug is in the lexical analyzer. At that moment, the rule of this token was like this:

We guessed it is the problem of"yytext", since"yytext"is the pointer to the matched string and the situation is like the following graph:

And we write"yylval.s = yytext;", which means the following situation:

And the pointer"yylval.s"will be passed to token BIN. So here is the problem, the"BIN"will be the string stored in the memory it points to, which is"1000B);". Therefore, the solution is very simple, we just need to copy the string to another location and return the pointer to this location, which is the following modified code inLA.l file:

Actually, when we want to pass a string type token to the parser, we should handle the lexeme in this way.

Cannot find the variable in the symbol table.

The original calculator only supports single-character variable names, so we decidedto make it support the variable name like that of C language. Toachieve our goal, we had to change the symbol table into a map, whose key is the variable’s name and the value is thevariablevalue. After changing the symbol to a map and modified all the corresponding code, we found that when we tried to get the value of a variable from the symbol table, it will return0,but when we printedout all the key and value in the symbol table, it showed that the variable name and value had already been stored in the symbol table. Also, we found that the map will contain 2 same keys when we assigned values to a variable twice. After discussion, we finally found the reason. Because for"char*", it has the following characteristic:

For the following code:

It has the output:

Therefore, the map will treat thesame variable nameas different keys.

Our solution is tocreate acomparatorfor themap:

Then the problem was solved.

Actually, using std:: string might be a better choice, but we need to change the include part of the code, which will cause many strange problems of the compatibility of C and C++, so we gave up thisway.

Limitation of our compiler

The output can only be floating data type.

Print doesnot support the output of string.

The output results can only keep four decimals.

ANTLR - TheBetterGenerator

As introduced above, Lex and Yacc were very efficient tools to generate lexical analyzers and syntax analyzers, thus popular at that time. However, with the rapid development of computers and high-tech, their design based on old-fashioned principles cannot meet every requirement of users anymore. Therefore, a better and modern tool named ANTLR will be introduced as follow.

Has a More Portable Grammar

As known, Lex and Yacc support only the grammar containing actions as shown. Therefore, the code is hard to read and understand. Both the readability and productivity are low.

Unlike the Lex and Yacc, ANTLR allows reusing ofthe grammar with other languages since it supports generating syntax analyzers in multiple languages from the same grammar. There is no need to use actions when parsing JSON. Therefore, the grammar is more readable and more understandable as show

Can Support the Unicode

Since Lex can only accept 8-bit input, it has no direct support for Unicode patterns. However, ANTLR has full support for the full range of Unicode (Multi-byte character sets). For example, here is the lexer rule for a single Armenian character by creating a range from U+0530 to U+058F.

Apart from that, it also supports Unicode properties such as the language script.

Grammar Format is Easier (Support EBNF)

Yacc only allows to define rules in Backus-Naur Form (BNF) format as shown. It shows the program is a series of one or more statements. Since it doesn’t end until it findsthe last statement, it’s a complicated grammar compared to the one used in ANTL

However, ANTLR supports Extended Backus-Naur Form both in the lexical analyzer and syntax analyzer as shown, which is much easier to describe some languages.

Supports Context-Free Expressions in Lexer Rules

Lex supports regular expressions in lexer rules, while ANTLR supports context-free expressions. Since context-free expressions can match more rules and is more readable compared to regular expressions, plus ANTLR is a combination of lexer and parser, ANTLR is much more powerful than Lex obviously.

Easier to Debug

As known, Yacc supports two parsing algorithms which are LALR and GLR, while ANTLR uses LL algorithm. Since GLR belongs to bottom-up algorithms and LL follows the top-down pattern (from the root of the tree to the leaves) to debug, it’s obvious that LL algorithm makes debugging easier.

For instance, if the GLR algorithas shown m is applied and the input is 1 * 2, the error messagecalled shift/reduce conflicts will be sent to the user.

Conclusion

To summarize, the Lex and Yacc greatly simplified the task of writing a compiler.At the very beginning, we write a tutorial for the download and installation inthepreparation stage. Next, we demonstrate what we can do with our calculator and the difficulties we met during development. Then, the code of our compiler including LA.l, SA.y, calc3.h, interpreter.c, graph.c and compiler.c are explained in detail. We also list some ofthelimitation of our compiler. Finally, we briefly introduce ANTLR, a better generator than Lex and Yacc. We believe Lex and Yacc now can be fully understood by beginnersthrough this little project of building a calculator.

Reference

The installation, utilization, compilation and operation of a Lexical Analyzer in Window. Retrieved fromhttps://www.cnblogs.com/ASE265/p/12337922.html

Why you should not use (f)lex, yacc and bison. Retrieved fromhttps://tomassetti.me/why-you-should-not-use-flex-yacc-and-bison/