|
- <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
- <html>
- <!-- Copyright (C) 1987-2020 Free Software Foundation, Inc.
-
- Permission is granted to copy, distribute and/or modify this document
- under the terms of the GNU Free Documentation License, Version 1.3 or
- any later version published by the Free Software Foundation. A copy of
- the license is included in the
- section entitled "GNU Free Documentation License".
-
- This manual contains no Invariant Sections. The Front-Cover Texts are
- (a) (see below), and the Back-Cover Texts are (b) (see below).
-
- (a) The FSF's Front-Cover Text is:
-
- A GNU Manual
-
- (b) The FSF's Back-Cover Text is:
-
- You have freedom to copy and modify this GNU Manual, like GNU
- software. Copies published by the Free Software Foundation raise
- funds for GNU development. -->
- <!-- Created by GNU Texinfo 6.5, http://www.gnu.org/software/texinfo/ -->
- <head>
- <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
- <title>Tokenization (The C Preprocessor)</title>
-
- <meta name="description" content="Tokenization (The C Preprocessor)">
- <meta name="keywords" content="Tokenization (The C Preprocessor)">
- <meta name="resource-type" content="document">
- <meta name="distribution" content="global">
- <meta name="Generator" content="makeinfo">
- <link href="index.html#Top" rel="start" title="Top">
- <link href="Index-of-Directives.html#Index-of-Directives" rel="index" title="Index of Directives">
- <link href="index.html#SEC_Contents" rel="contents" title="Table of Contents">
- <link href="Overview.html#Overview" rel="up" title="Overview">
- <link href="The-preprocessing-language.html#The-preprocessing-language" rel="next" title="The preprocessing language">
- <link href="Initial-processing.html#Initial-processing" rel="prev" title="Initial processing">
- <style type="text/css">
- <!--
- a.summary-letter {text-decoration: none}
- blockquote.indentedblock {margin-right: 0em}
- blockquote.smallindentedblock {margin-right: 0em; font-size: smaller}
- blockquote.smallquotation {font-size: smaller}
- div.display {margin-left: 3.2em}
- div.example {margin-left: 3.2em}
- div.lisp {margin-left: 3.2em}
- div.smalldisplay {margin-left: 3.2em}
- div.smallexample {margin-left: 3.2em}
- div.smalllisp {margin-left: 3.2em}
- kbd {font-style: oblique}
- pre.display {font-family: inherit}
- pre.format {font-family: inherit}
- pre.menu-comment {font-family: serif}
- pre.menu-preformatted {font-family: serif}
- pre.smalldisplay {font-family: inherit; font-size: smaller}
- pre.smallexample {font-size: smaller}
- pre.smallformat {font-family: inherit; font-size: smaller}
- pre.smalllisp {font-size: smaller}
- span.nolinebreak {white-space: nowrap}
- span.roman {font-family: initial; font-weight: normal}
- span.sansserif {font-family: sans-serif; font-weight: normal}
- ul.no-bullet {list-style: none}
- -->
- </style>
-
-
- </head>
-
- <body lang="en">
- <a name="Tokenization"></a>
- <div class="header">
- <p>
- Next: <a href="The-preprocessing-language.html#The-preprocessing-language" accesskey="n" rel="next">The preprocessing language</a>, Previous: <a href="Initial-processing.html#Initial-processing" accesskey="p" rel="prev">Initial processing</a>, Up: <a href="Overview.html#Overview" accesskey="u" rel="up">Overview</a> [<a href="index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>][<a href="Index-of-Directives.html#Index-of-Directives" title="Index" rel="index">Index</a>]</p>
- </div>
- <hr>
- <a name="Tokenization-1"></a>
- <h3 class="section">1.3 Tokenization</h3>
-
- <a name="index-tokens"></a>
- <a name="index-preprocessing-tokens"></a>
- <p>After the textual transformations are finished, the input file is
- converted into a sequence of <em>preprocessing tokens</em>. These mostly
- correspond to the syntactic tokens used by the C compiler, but there are
- a few differences. White space separates tokens; it is not itself a
- token of any kind. Tokens do not have to be separated by white space,
- but it is often necessary to avoid ambiguities.
- </p>
- <p>When faced with a sequence of characters that has more than one possible
- tokenization, the preprocessor is greedy. It always makes each token,
- starting from the left, as big as possible before moving on to the next
- token. For instance, <code>a+++++b</code> is interpreted as
- <code>a ++ ++ + b<!-- /@w --></code>, not as <code>a ++ + ++ b<!-- /@w --></code>, even though the
- latter tokenization could be part of a valid C program and the former
- could not.
- </p>
- <p>Once the input file is broken into tokens, the token boundaries never
- change, except when the ‘<samp>##</samp>’ preprocessing operator is used to paste
- tokens together. See <a href="Concatenation.html#Concatenation">Concatenation</a>. For example,
- </p>
- <div class="smallexample">
- <pre class="smallexample">#define foo() bar
- foo()baz
- → bar baz
- <em>not</em>
- → barbaz
- </pre></div>
-
- <p>The compiler does not re-tokenize the preprocessor’s output. Each
- preprocessing token becomes one compiler token.
- </p>
- <a name="index-identifiers"></a>
- <p>Preprocessing tokens fall into five broad classes: identifiers,
- preprocessing numbers, string literals, punctuators, and other. An
- <em>identifier</em> is the same as an identifier in C: any sequence of
- letters, digits, or underscores, which begins with a letter or
- underscore. Keywords of C have no significance to the preprocessor;
- they are ordinary identifiers. You can define a macro whose name is a
- keyword, for instance. The only identifier which can be considered a
- preprocessing keyword is <code>defined</code>. See <a href="Defined.html#Defined">Defined</a>.
- </p>
- <p>This is mostly true of other languages which use the C preprocessor.
- However, a few of the keywords of C++ are significant even in the
- preprocessor. See <a href="C_002b_002b-Named-Operators.html#C_002b_002b-Named-Operators">C++ Named Operators</a>.
- </p>
- <p>In the 1999 C standard, identifiers may contain letters which are not
- part of the “basic source character set”, at the implementation’s
- discretion (such as accented Latin letters, Greek letters, or Chinese
- ideograms). This may be done with an extended character set, or the
- ‘<samp>\u</samp>’ and ‘<samp>\U</samp>’ escape sequences.
- </p>
- <p>As an extension, GCC treats ‘<samp>$</samp>’ as a letter. This is for
- compatibility with some systems, such as VMS, where ‘<samp>$</samp>’ is commonly
- used in system-defined function and object names. ‘<samp>$</samp>’ is not a
- letter in strictly conforming mode, or if you specify the <samp>-$</samp>
- option. See <a href="Invocation.html#Invocation">Invocation</a>.
- </p>
- <a name="index-numbers"></a>
- <a name="index-preprocessing-numbers"></a>
- <p>A <em>preprocessing number</em> has a rather bizarre definition. The
- category includes all the normal integer and floating point constants
- one expects of C, but also a number of other things one might not
- initially recognize as a number. Formally, preprocessing numbers begin
- with an optional period, a required decimal digit, and then continue
- with any sequence of letters, digits, underscores, periods, and
- exponents. Exponents are the two-character sequences ‘<samp>e+</samp>’,
- ‘<samp>e-</samp>’, ‘<samp>E+</samp>’, ‘<samp>E-</samp>’, ‘<samp>p+</samp>’, ‘<samp>p-</samp>’, ‘<samp>P+</samp>’, and
- ‘<samp>P-</samp>’. (The exponents that begin with ‘<samp>p</samp>’ or ‘<samp>P</samp>’ are
- used for hexadecimal floating-point constants.)
- </p>
- <p>The purpose of this unusual definition is to isolate the preprocessor
- from the full complexity of numeric constants. It does not have to
- distinguish between lexically valid and invalid floating-point numbers,
- which is complicated. The definition also permits you to split an
- identifier at any position and get exactly two tokens, which can then be
- pasted back together with the ‘<samp>##</samp>’ operator.
- </p>
- <p>It’s possible for preprocessing numbers to cause programs to be
- misinterpreted. For example, <code>0xE+12</code> is a single preprocessing
- number which does not translate to any valid numeric constant, and is
- therefore a syntax error. It does not mean <code>0xE + 12<!-- /@w --></code>, which
- is what you might have intended.
- </p>
- <a name="index-string-literals"></a>
- <a name="index-string-constants"></a>
- <a name="index-character-constants"></a>
- <a name="index-header-file-names"></a>
- <p><em>String literals</em> are string constants, character constants, and
- header file names (the argument of ‘<samp>#include</samp>’).<a name="DOCF2" href="#FOOT2"><sup>2</sup></a> String constants and character
- constants are straightforward: <tt>"…"</tt> or <tt>'…'</tt>. In
- either case embedded quotes should be escaped with a backslash:
- <tt>'\''</tt> is the character constant for ‘<samp>'</samp>’. There is no limit on
- the length of a character constant, but the value of a character
- constant that contains more than one character is
- implementation-defined. See <a href="Implementation-Details.html#Implementation-Details">Implementation Details</a>.
- </p>
- <p>Header file names either look like string constants, <tt>"…"</tt>, or are
- written with angle brackets instead, <tt>&lt;…&gt;</tt>. In either case,
- backslash is an ordinary character. There is no way to escape the
- closing quote or angle bracket. The preprocessor looks for the header
- file in different places depending on which form you use. See <a href="Include-Operation.html#Include-Operation">Include Operation</a>.
- </p>
- <p>No string literal may extend past the end of a line. You may use continued
- lines instead, or string constant concatenation.
- </p>
- <a name="index-punctuators"></a>
- <a name="index-digraphs"></a>
- <a name="index-alternative-tokens"></a>
- <p><em>Punctuators</em> are all the usual bits of punctuation which are
- meaningful to C and C++. All but three of the punctuation characters in
- ASCII are C punctuators. The exceptions are ‘<samp>@</samp>’, ‘<samp>$</samp>’, and
- ‘<samp>`</samp>’. In addition, all the two- and three-character operators are
- punctuators. There are also six <em>digraphs</em>, which the C++ standard
- calls <em>alternative tokens</em>, which are merely alternate ways to spell
- other punctuators. This is a second attempt to work around missing
- punctuation in obsolete systems. It has no negative side effects,
- unlike trigraphs, but does not cover as much ground. The digraphs and
- their corresponding normal punctuators are:
- </p>
- <div class="smallexample">
- <pre class="smallexample">Digraph:        &lt;%  %&gt;  &lt;:  :&gt;  %:  %:%:
- Punctuator:      {   }   [   ]   #    ##
- </pre></div>
-
- <a name="index-other-tokens"></a>
- <p>Any other single byte is considered “other” and passed on to the
- preprocessor’s output unchanged. The C compiler will almost certainly
- reject source code containing “other” tokens. In ASCII, the only
- “other” characters are ‘<samp>@</samp>’, ‘<samp>$</samp>’, ‘<samp>`</samp>’, and control
- characters other than NUL (all bits zero). (Note that ‘<samp>$</samp>’ is
- normally considered a letter.) All bytes with the high bit set
- (numeric range 0x80–0xFF) that were not successfully interpreted as
- part of an extended character in the input encoding are also “other”
- in the present implementation.
- </p>
- <p>NUL is a special case because of the high probability that its
- appearance is accidental, and because it may be invisible to the user
- (many terminals do not display NUL at all). Within comments, NULs are
- silently ignored, just as any other character would be. In running
- text, NUL is considered white space. For example, these two directives
- have the same meaning.
- </p>
- <div class="smallexample">
- <pre class="smallexample">#define X^@1
- #define X 1
- </pre></div>
-
- <p>(where ‘<samp>^@</samp>’ is ASCII NUL). Within string or character constants,
- NULs are preserved. In the latter two cases, that is, in running text and
- within string or character constants, the preprocessor emits a warning
- message.
- </p>
- <div class="footnote">
- <hr>
- <h4 class="footnotes-heading">Footnotes</h4>
-
- <h3><a name="FOOT2" href="#DOCF2">(2)</a></h3>
- <p>The C
- standard uses the term <em>string literal</em> to refer only to what we are
- calling <em>string constants</em>.</p>
- </div>
-
-
-
- </body>
- </html>
|