资源描述
Click to edit Master title style,Click to edit Master text styles,Second level,Third level,Fourth level,Fifth level,*,Lecture Note#2,:Ch2.Programming Language Syntax,Programming Language Pragmatics,Term 1,2023,Minyoung Kim,1,Programming LanguageSyntax Notations,2,Simple Syntax Notations,|,means“or”,e.g.digit,-,0|1|2|3|4|5|6|7|8|9,*(pronounced as“Kleene star”),indicates zero or more repetitions of the symbol to its left,e.g.,digit,-,0|1|2|3|4|5|6|7|8|9,non_zero_digit,-,1|2|3|4|5|6|7|8|9,natural_number,-,non_zero_digit digit*,Semantics,3,Digits in the previous are only symbols,ink blobs on paper or pixels on a screen,no meanings in and of themselves,Semantics can add meanings to symbols,The digits mean natural numbers from zero to nine,as defined by mathematicians,May represent colors,the days of the week in a decimal calendar,May have alternative semantics with the same syntax,The reasons of distinguishing between syntax and semantics,Different programming languages often provide features with similar semantics but very different syntax,Efficient and elegant algorithms to discover the syntactic structure is available,2.1 Specifying Syntax,4,Formal specification of syntax requires a set of rules,Tokens can be specified using just three kinds of rules(Regular expressions),A Character,Empty string,denoted by,Concatenation-two regular expressions concatenated,alternation(choice among a finite set of alternatives),separated by|,Kleene closure(concatenation of zero or more strings),Specifying Syntax,5,Syntax can be specified with 4 kinds of rules(Context-free Grammars),concatenation,alternation(choice among a finite set of alternatives),Kleene closure(repetition an arbitrary number of times),recursion(creation of a construct from simpler instances of the same construct),context-free language(CFL)is the string generated by the above rules,Tokens and Regular Expressions,6,Tokens are basic building blocks of programs,including keywords,identifiers,numbers,punctuation marks,and other symbols,Pascal has 64 kinds of tokens,21 symbols(,+,-,;,:=,.,etc.),35 keywords(begin,end,div,record,while,etc),integer literals(e.g.,137),real(floating-point)literals(6.022e23),character/string literals(e.g.,snerk),identifiers(MyVariable,YourType,maxint,readln,etc.,39 of which are predefined),two different kinds of comments(*),Tokens and Regular Expressions,7,C has 100 kinds of tokens,37 keywords(double,if,return,struct,etc.),identifiers(my_variable,your_type,sizeof,printf,etc.),integer(0765,0 x1f5,501),floating-point(6.022e23),character(x,0170),string literals(“snerk”,“say”hi”n”),54 punctuators(,+,-,*=,:,|,etc.),35 keywords(begin,end,div,record,while,etc),two different kinds of comments/*/,/,Preprocessor macros that build tokens from smaller pieces,Specifying Tokens Use Regular Expressions,8,A regular expression is one of the following:,A character,The empty string,denoted by,Two regular expressions concatenated,Two regular expressions separated by|(i.e.,or),A regular expression followed by the Kleene star(concatenation of zero or more strings),parenthesis are used to avoid ambiguity,Tokens and Regular Expressions,9,Numeric literals of Pascal can be generated with the following regular expression,digit,0|1|2|3|4|5|6|7|8|9,unsigned_integer,digit,digit,*,unsigned _number,unsigned_integer,(.,unsigned_integer,)|)(e|E),(+|-|),unsigned_integer,)|),E.g.,1 1.3,1e31.3e-31.3938e+3,1.1.3.4931.e3,Simple Hand-held Calculator Language,10,Syntax of numeric constants accepted by a simple hand-held calculators,number,integer|real,integer,digit digit*,real integer exponent|decimal(exponent|,),decimal digit*(.digit|digit.)digit*,exponent (e|E)(,+,|,-,|,)integer,digit 0|1|2|3|4|5|6|7|8|9,Tokens,11,Case-sensitiveness,sensitive:Modula-2/3,C/C+,Java,Keywords in C/C+are lowercase,whereas they are uppercase in Modula-2/3,not sensitive:Ada,Common Lisp,Fortran 90,Pascal,Modula-2 and Modula-3 require keywords and predefined identifiers to be written in uppercase,whereas C and its variants uses lowercase,Limitations in Identifiers,Only allow letters and digits:Modula-3 and Standard Pascal,Allow letters,digits,and underscores:C/C+,C#,Pascal(most implementations),A variety of additional characters:Lisp(even includes,-,),Some language impose limits on the maximum length of identifiers,Old Turbo C compiler,Tokens,12,Most modern languages considers a program as simply a sequence of tokens:,“White space”(blanks,tabs,carriage returns,and line and page feed characters)between tokens is usually ignored,Fortran prior to Fortran 90 use a fixed format,with 72 characters(the width of an IBM punch card)per line,different columns within the line reserved for different purposes,e.g.1-5 for labels,6 for continuation of lines,7-72 source code,Line breaks serve to separate statements in several other languages,Haskell,Occam,and SR,Some languages treats indentation specially,Haskell,Occam,Python,Tokens,13,Many modern languages,including C99,C+,Ada 95,Java,C#,and Fortran 2023 have explicit support for multibyte character sets,generally based on Unicode and ISO/IEC 10646,Most modern languages allow non-Latin characters to appear within comments and character strings,Some languages allow non-Latin characters in identifiers,Context-Free Grammars,14,Regular expressions cannot be used to specify,nested,constructs,E.g.the structure of an arithmetic expression,(Grammar Example 2.4),expr,id|number|,-,expr,|(,expr,),|,expr operator expr,op,+,|,-,|*|/,Each rule in a CFG is a,production,Symbols on the left-hand sides of the productions are,variables,Symbols that are to make up the strings derived from the grammar are terminals,cannot appear on the left-hand side of any production,variable,production,terminal,a 7 -a -4 (a),(12)a+4 -(3*6),Context-Free Grammars,15,The notation for context-free grammars(CFG)is sometimes called Backus-Naur Form(BNF),A CFG consists of,A set of terminals T,A set of non-terminals N,A start symbol S(a non-terminal),A set of productions,BNF doesnt allow using the vertical bar,Kleene star,and meta-level parentheses of regular expressions,EBNF(Extended BNF),Allows the extra operators including Kleene plus,Kleene plus(,+,):indicates one or more instances of the symbol or groups of symbols in front of it,digit*:0,digit+:1,Context-Free Grammars,16,BNF(Example),op,+|,-,|*|/(not allowed using|sign),can be rewritten as,op,+,op,-,op,*,op,/,id_list,id(,id)*,(not allowed using Kleene star),can be rewritten as,id_list,id,id_list,id_list,id,identifier ,identifier ,identifier,Derivation and Parse Trees,17,A context-free grammar shows how to generate syntactically valid strings of terminals:,Begins with the start symbol,Choose a production with the start symbol on the left-hand side,Replace the start symbol with the right-handed side of that production(choose a nonterminal A in the resulting string,choose a production P with A on its left-hand side,and replace A with the right-hand side of P.,Repeat this process until no nonterminal remain,Derivation and Parse Trees,18,to generate“slope*x+intercept”,(Right most derivation:a*x+b),Grammar Example 2.6,expr,=,expr op,expr,=,expr,op,id,=,expr,+id,=,expr op,expr,+id,=,expr,op,id +id,=,expr,*id +id,=,id *id +id,(slope)(x)(intercept),Derivations and Parse Trees,19,A,parse tree,can be used to represent a derivation grammar in grammar Example 2.4,Right-most derivation:(slope*x)+intercept,Fig.2.1,Left-most derivation:slope*(x+intercept),Fig.2.2,Alternative(undesirable)parse tree in this case,Derivations and Parse Trees,20,More than one tree exists implies that the grammar is,ambiguous,Structure is particularly important in arithmetic expressions,Need to use,productions to capture the,associativity,and,precedence,Associativity:,the operators in most language group left-to-right,Precedence:*and/in most language group are computed more,Derivations and Parse Trees,21,A better version of expression grammar,without ambiguity,Grammar Example 2.8,expr,term,|,expr add_op term,term,factor,|,term multi_op factor,factor,id|number|-,factor,|(,expr,),add_op,+|-,multi_op,*|/,Draw a Parse Tree for 10*4 3,with Left Associativity,22,Left most derivation in grammar Example 2.8,expression,=,term,=,term,multi_op factor,=,factor,multi_op factor,=,number,multi_op,factor,=,number*,factor,This is not processed,expression,term,|,expression,add_op term,term,factor,|,term multi_op factor,factor,identifier|number|-,factor,|(,expression,),add_op,+/-,multi_op,*|/,Draw a Parse Tree for 10*4 3,with Precedence,23,Right most derivation in grammar Example 2.8,expression,term,|,expression,add_op term,term,factor,|,term multi_op factor,factor,identifier|number|-,factor,|(,expression,),add_op,+/-,multi_op,*|/,expression,=,expression add_op,term,=expression add_op,factor,=,expression,add_op,number,=,expression,number,=,term multi_op,factor,number,=term,multi_op,number number,=,term,*number number,=,factor,*number number,=number*number number,(10)(4)(3),Thus,grammar 2.8,is clear,Draw a Parse Tree for 10*4 3,with Precedence,24,Parse tree for expression grammar for 10*4-3,Draw a Parse Tree for a+4*5,with Precedence,25,Left most derivation in grammar Example 2.8,expression,term,|,expression,add_op term,term,factor,|,term multi_op factor,factor,identifier|number|-,factor,|(,expression,),add_op,+/-,multi_op,*|/,expression,=,expression add_op,term,=expression add_op term multi_op,factor,=,expression add_op term,multi_op,number,=,expression add_op,term,*number,=,expression add_op,factor,*number,=expression,add_op,number*number,=,expression,+number*number,=,factor,+number*number,=,identifier,+number*number,(a)(4)(5),expression,=,term,=,term multi_op,factor,=term,multi_op,number,=,term,*number,!,Draw a Parse Tree for a+4*5,with Precedence,26,Parse tree for expression grammar for a+4*5,Draw a Parse Tree for 10 4 3,with Precedence,27,Derivation in grammar Example 2.8,expression,term,|,expression,add_op term,term,factor,|,term multi_op factor,factor,identifier|number|-,factor,|(,expression,),add_op,+/-,multi_op,*|/,expression,=,expression add_op,term,=expression add_op,factor,=,expression,add_op,number,=,expression,-number,=,expression add_op,term,-number,=expression,add_op,number-number,=,expression,-number-number,=,term,-number-number,=number-number-number,(10)(4)(3),terminal,!,Draw a Parse Tree for 10 4 3,with Precedence,28,Parse tree for expression grammar for 10 4 3,2.2 Recognizing Syntax:Scanners and Parsers,29,Recall scanner is responsible for,tokenizing source,removing comments,nested comment needs special treatments because scanners normally do not deal with recursive constructs,some language disallow them,others may put special code in the scanner to deal with it,(often)dealing with pragmas(i.e.,significant comments),saving text of identifiers,numbers,strings,saving source locations(file,line,column)for error messages,Recognizing Syntax:Scanners and Parsers,30,Parser in charge of the compilation process,calls the scanner to obtain the tokens of the input program,assembles the tokens together into a parse tree,passes the tree to the later phases of the compiler,which perform semantic analysis,code generation and improvement,Scanner and parser for a programming language are responsible for discovering the syntactic structure of a program,Scanning,31,Will create a simple“calculator language”with input,output,variables,and assignment extending the grammar shown in Example 2.4 and 2.8,Tokens for the simple calculator language,assign,:=,plus,+,minus,times,*,div,/,l-paren,(,r-paren,),id,letter,(,letter,|,digit,)*except for,read,and,write,number digit digit*,(.,digit,|,digit,.),digit*,comment,/*,(non-*,|,*non-,/,)*,/,|,/,(non-newline)*newline,Scanning,How do we build a scanner that recognizes the tokens of our calculator language?,build ad hoc scanner,Right shows the outline of an ad hoc scanner for the simple calculator language,32,Fig.2.5,Scanning,33,Deterministic finite automation for calculator language,Scanning,34,As a rule,the scanner returns the longest possible token in each invocation of the scanner,e.g.foobar is always foobar and never f or foo or foob,In language like C,3.14159 is a real number and never 3,.,and 14159,White space(blanks,tabs,carriage returns,comments)is generally ignored except to the extent that it separates tokens,e.g.foo bar is different from foobar,Scanning,35,Scanners tend to be built three ways,ad-hoc,manually constructed,generally yields the fastest,most compact code by doing lots of special-purpose things,semi-mechanical pure DFA(usually realized as nested case statements),either manually or using a,scanner generator,table-driven DFA,usually built with a,scanner generator,such as Lex,Scanning,36,This is a deterministic finite automaton(DFA),Lex,scangen,etc.build these things automatically from a set of regular expressions,Specifically,they construct a machine that accepts the language,id|int const|real const|comment|symbol|.,We run the machine over and over to get one token after another,Nearly universal rule:,always take the longest possible token from the input thus foobar is foobar and never f or foo or foob,more to the point,3.14159 is a real const and never 3,.,and 14159,Regular expressions generate a regular language;DFAs recognize it,Scanning,37,We can implement a scanner that explicitly captures the“circles-and-arrows”structure of a DFA in either of two main ways,Embeds the automaton in the control flow of the program using,goto,s or nested,case(switch),Writing a pure DFA as a set of nested case(switch in C/C+)statements is a surprisingly useful programming technique,though its often easier to use perl,awk,sed,Scanning,38,Note that the rule about longest-possible tokens means you return only when the next character cannot be used to continue the current token,the next character will generally need to be saved for the next token,In some cases,you may need to peek at more than one character of look-ahead in order to know whether to proceed,In Pascal,for example,when you have a 3 and you a see a dot,do you proceed(in hopes of getting 3.14)?or,do you stop(in fear of getting 3.5)?,Scanning Special Handling,39,Keywords in most languages(including Pascal),Look like identifiers,Specially reserved for a special purpose,Writing a finite automaton that distinguishes keywords and identifiers,but requires a lot of states,Thus,treated as“exceptions”to the rule for identifiers(before returning an identifier to the parser,the scanner looks it up in a hash table or trie to make sure it isnt really a keyword,Dot-dot problem in Pascal,Needs to distinguish 3.14,3.14,or 3.foo,If the scanner has seen a 3 and has a dot(.)coming up in the input,it needs to peek at the character beyond the dot to distinguish 3.14 and 3.14(valid syntax)or 3.foo(invalid syntax),Scanning,40,Table-driven DFA is what lex and scangen produce,Represents the automaton as a data structure:a two-dimensional table,A driver program uses the current state and input character to index into the table,Each entry in t
展开阅读全文