(gawk) String Functions

 
 8.1.3 String-Manipulation Functions
 -----------------------------------
 
 The functions in this minor node look at or change the text of one or
 more strings.  Optional parameters are enclosed in square
 brackets ([ ]).  Those functions that are specific to `gawk' are marked
 with a pound sign (`#'):
 

Menu

 
* Gory Details                More than you want to know about `\' and
                                 `&' with `sub', `gsub', and
                                 `gensub'.
 
 `asort(SOURCE [, DEST]) #'
      `asort' is a `gawk'-specific extension, returning the number of
      elements in the array SOURCE.  The contents of SOURCE are sorted
      using `gawk''s normal rules for comparing values (in particular,
      `IGNORECASE' affects the sorting) and the indices of the sorted
      values of SOURCE are replaced with sequential integers starting
      with one. If the optional array DEST is specified, then SOURCE is
      duplicated into DEST.  DEST is then sorted, leaving the indices of
      SOURCE unchanged.  For example, if the contents of `a' are as
      follows:
 
           a["last"] = "de"
           a["first"] = "sac"
           a["middle"] = "cul"
 
      A call to `asort':
 
           asort(a)
 
      results in the following contents of `a':
 
           a[1] = "cul"
           a[2] = "de"
           a[3] = "sac"
 
      The `asort' function is described in more detail in Array
      Sorting.  `asort' is a `gawk' extension; it is not available in
      compatibility mode (see Options).
 
 `asorti(SOURCE [, DEST]) #'
      `asorti' is a `gawk'-specific extension, returning the number of
      elements in the array SOURCE.  It works similarly to `asort';
      however, the _indices_ are sorted instead of the values.  As
      array indices are always strings, the comparison performed is
      always a string comparison.  (Here too, `IGNORECASE' affects the
      sorting.)
 
      The `asorti' function is described in more detail in Array
      Sorting.  It was added in `gawk' 3.1.2.  `asorti' is a `gawk'
      extension; it is not available in compatibility mode (see
      Options).
 
 `index(IN, FIND)'
      This searches the string IN for the first occurrence of the string
      FIND, and returns the position in characters where that occurrence
      begins in the string IN.  Consider the following example:
 
           $ awk 'BEGIN { print index("peanut", "an") }'
           -| 3
 
      If FIND is not found, `index' returns zero.  (Remember that string
      indices in `awk' start at one.)
 
 `length([STRING])'
      This returns the number of characters in STRING.  If STRING is a
      number, the length of the digit string representing that number is
      returned.  For example, `length("abcde")' is 5.  By contrast,
      `length(15 * 35)' works out to 3. In this example, 15 * 35 = 525,
      and 525 is then converted to the string `"525"', which has three
      characters.
 
      If no argument is supplied, `length' returns the length of `$0'.
 
           NOTE: In older versions of `awk', the `length' function could
           be called without any parentheses.  Doing so is marked as
           "deprecated" in the POSIX standard.  This means that while a
           program can do this, it is a feature that can eventually be
           removed from a future version of the standard.  Therefore,
           for programs to be maximally portable, always supply the
           parentheses.
 
      Beginning with `gawk' version 3.1.5, when supplied an array
      argument, the `length' function returns the number of elements in
      the array.  This is less useful than it might seem at first, as the
      array is not guaranteed to be indexed from one to the number of
      elements in it.  If `--lint' is provided on the command line
      (see Options), `gawk' warns that passing an array argument is
      not portable.  If `--posix' is supplied, using an array argument
      is a fatal error (see Arrays).
 
 `match(STRING, REGEXP [, ARRAY])'
      The `match' function searches STRING for the longest, leftmost
      substring matched by the regular expression, REGEXP.  It returns
      the character position, or "index", at which that substring begins
      (one, if it starts at the beginning of STRING).  If no match is
      found, it returns zero.
 
      The REGEXP argument may be either a regexp constant (`/.../') or a
      string constant ("...").  In the latter case, the string is
      treated as a regexp to be matched.  See Computed Regexps, for a
      discussion of the difference between the two forms, and the
      implications for writing your program correctly.
 
      The order of the first two arguments is backwards from most other
      string functions that work with regular expressions, such as `sub'
      and `gsub'.  It might help to remember that for `match', the order
      is the same as for the `~' operator: `STRING ~ REGEXP'.
 
      The `match' function sets the built-in variable `RSTART' to the
      index.  It also sets the built-in variable `RLENGTH' to the length
      in characters of the matched substring.  If no match is found,
      `RSTART' is set to zero, and `RLENGTH' to -1.
 
      For example:
 
           {
                  if ($1 == "FIND")
                    regex = $2
                  else {
                    where = match($0, regex)
                    if (where != 0)
                      print "Match of", regex, "found at",
                                where, "in", $0
                  }
           }
 
      This program looks for lines that match the regular expression
      stored in the variable `regex'.  This regular expression can be
      changed.  If the first word on a line is `FIND', `regex' is
      changed to be the second word on that line.  Therefore, if given:
 
           FIND ru+n
           My program runs
           but not very quickly
           FIND Melvin
           JF+KM
           This line is property of Reality Engineering Co.
           Melvin was here.
 
      `awk' prints:
 
           Match of ru+n found at 12 in My program runs
           Match of Melvin found at 1 in Melvin was here.
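
      The values of `RSTART' and `RLENGTH' set by `match' can be
      passed straight to `substr' to extract the matched text.  The
      following is only a sketch, with `s' and `out' as illustrative
      names:

```shell
out=$(awk 'BEGIN {
    s = "My program runs"
    if (match(s, /ru+n/))
        # RSTART and RLENGTH describe the matched substring
        print RSTART, RLENGTH, substr(s, RSTART, RLENGTH)
    match(s, /xyz/)            # no match this time
    print RSTART, RLENGTH
}')
printf '%s\n' "$out"
```

      The first line of output is `12 3 run'; the second is `0 -1',
      the values used to signal that nothing matched.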
 
      If ARRAY is present, it is cleared, and then the 0th element of
      ARRAY is set to the entire portion of STRING matched by REGEXP.
      If REGEXP contains parentheses, the integer-indexed elements of
      ARRAY are set to contain the portion of STRING matching the
      corresponding parenthesized subexpression.  For example:
 
           $ echo foooobazbarrrrr |
           > gawk '{ match($0, /(fo+).+(bar*)/, arr)
           >           print arr[1], arr[2] }'
           -| foooo barrrrr
 
      In addition, beginning with `gawk' 3.1.2, multidimensional
      subscripts are available providing the start index and length of
      each matched subexpression:
 
           $ echo foooobazbarrrrr |
           > gawk '{ match($0, /(fo+).+(bar*)/, arr)
           >           print arr[1], arr[2]
           >           print arr[1, "start"], arr[1, "length"]
           >           print arr[2, "start"], arr[2, "length"]
           > }'
           -| foooo barrrrr
           -| 1 5
           -| 9 7
 
      There may not be subscripts for the start and length of every
      parenthesized subexpression, since they may not all have matched
      text; thus they should be tested for with the `in' operator (see
      Reference to Elements).
 
      The ARRAY argument to `match' is a `gawk' extension.  In
      compatibility mode (see Options), using a third argument is a
      fatal error.
 
 `split(STRING, ARRAY [, FIELDSEP])'
      This function divides STRING into pieces separated by FIELDSEP and
      stores the pieces in ARRAY.  The first piece is stored in
      `ARRAY[1]', the second piece in `ARRAY[2]', and so forth.  The
      string value of the third argument, FIELDSEP, is a regexp
      describing where to split STRING (much as `FS' can be a regexp
      describing where to split input records).  If FIELDSEP is omitted,
      the value of `FS' is used.  `split' returns the number of elements
      created.
 
      The `split' function splits strings into pieces in a manner
      similar to the way input lines are split into fields.  For example:
 
           split("cul-de-sac", a, "-")
 
      splits the string `cul-de-sac' into three fields using `-' as the
      separator.  It sets the contents of the array `a' as follows:
 
           a[1] = "cul"
           a[2] = "de"
           a[3] = "sac"
 
      The value returned by this call to `split' is three.
 
      As with input field-splitting, when the value of FIELDSEP is
      `" "', leading and trailing whitespace is ignored, and the elements
      are separated by runs of whitespace.  Also as with input
      field-splitting, if FIELDSEP is the null string, each individual
      character in the string is split into its own array element.
      (This is a `gawk'-specific extension.)
 
      Note, however, that `RS' has no effect on the way `split' works.
      Even though `RS = ""' causes newline to also be an input field
      separator, this does not affect how `split' splits strings.
 
      Modern implementations of `awk', including `gawk', allow the third
      argument to be a regexp constant (`/abc/') as well as a string.
      (d.c.)  The POSIX standard allows this as well.  See Computed
      Regexps, for a discussion of the difference between using a
      string constant or a regexp constant, and the implications for
      writing your program correctly.
 
      Before splitting the string, `split' deletes any previously
      existing elements in the array ARRAY.
 
      If STRING is null, the array has no elements.  (So this is a
      portable way to delete an entire array with one statement.  See
      Delete.)
 
      If STRING does not match FIELDSEP at all (but is not null), ARRAY
      has one element only. The value of that element is the original
      STRING.
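
      The last three points can be checked together in a brief sketch
      (the array name `old' and the variable `out' are illustrative):

```shell
out=$(awk 'BEGIN {
    old["stale"] = 1
    # No ":" in the string: one element holding the whole string,
    # and the previously existing element is gone
    n = split("no-separator-here", old, ":")
    print n, old[1], ("stale" in old)
    # A null string leaves the array empty; split returns zero
    print split("", old)
}')
printf '%s\n' "$out"
```

      This prints `1 no-separator-here 0' and then `0'.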
 
 `sprintf(FORMAT, EXPRESSION1, ...)'
      This returns (without printing) the string that `printf' would
      have printed out with the same arguments (see Printf).  For
      example:
 
           pival = sprintf("pi = %.2f (approx.)", 22/7)
 
      assigns the string `"pi = 3.14 (approx.)"' to the variable `pival'.
 
 `strtonum(STR) #'
      Examines STR and returns its numeric value.  If STR begins with a
      leading `0', `strtonum' assumes that STR is an octal number.  If
      STR begins with a leading `0x' or `0X', `strtonum' assumes that
      STR is a hexadecimal number.  For example:
 
           $ echo 0x11 |
           > gawk '{ printf "%d\n", strtonum($1) }'
           -| 17
 
      Using the `strtonum' function is _not_ the same as adding zero to
      a string value; the automatic coercion of strings to numbers works
      only for decimal data, not for octal or hexadecimal.(1)
 
      Note also that `strtonum' uses the current locale's decimal point
      for recognizing numbers.
 
      `strtonum' is a `gawk' extension; it is not available in
      compatibility mode (see Options).
 
 `sub(REGEXP, REPLACEMENT [, TARGET])'
      The `sub' function alters the value of TARGET.  It searches this
      value, which is treated as a string, for the leftmost, longest
      substring matched by the regular expression REGEXP.  Then the
      entire string is changed by replacing the matched text with
      REPLACEMENT.  The modified string becomes the new value of TARGET.
 
      The REGEXP argument may be either a regexp constant (`/.../') or a
      string constant ("...").  In the latter case, the string is
      treated as a regexp to be matched.  See Computed Regexps, for a
      discussion of the difference between the two forms, and the
      implications for writing your program correctly.
 
      This function is peculiar because TARGET is not simply used to
      compute a value, and not just any expression will do--it must be a
      variable, field, or array element so that `sub' can store a
      modified value there.  If this argument is omitted, then the
      default is to use and alter `$0'.(2) For example:
 
           str = "water, water, everywhere"
           sub(/at/, "ith", str)
 
      sets `str' to `"wither, water, everywhere"', by replacing the
      leftmost longest occurrence of `at' with `ith'.
 
      The `sub' function returns the number of substitutions made (either
      one or zero).
 
      If the special character `&' appears in REPLACEMENT, it stands for
      the precise substring that was matched by REGEXP.  (If the regexp
      can match more than one string, then this precise substring may
      vary.)  For example:
 
           { sub(/candidate/, "& and his wife"); print }
 
      changes the first occurrence of `candidate' to `candidate and his
      wife' on each input line.  Here is another example:
 
           $ awk 'BEGIN {
           >         str = "daabaaa"
           >         sub(/a+/, "C&C", str)
           >         print str
           > }'
           -| dCaaCbaaa
 
      This shows how `&' can represent a nonconstant string and also
      illustrates the "leftmost, longest" rule in regexp matching (see
      Leftmost Longest).
 
      The effect of this special character (`&') can be turned off by
      putting a backslash before it in the string.  As usual, to insert
      one backslash in the string, you must write two backslashes.
      Therefore, write `\\&' in a string constant to include a literal
      `&' in the replacement.  For example, the following shows how to
      replace the first `|' on each line with an `&':
 
           { sub(/\|/, "\\&"); print }
 
      As mentioned, the third argument to `sub' must be a variable,
      field or array reference.  Some versions of `awk' allow the third
      argument to be an expression that is not an lvalue.  In such a
      case, `sub' still searches for the pattern and returns zero or
      one, but the result of the substitution (if any) is thrown away
      because there is no place to put it.  Such versions of `awk'
      accept expressions such as the following:
 
           sub(/USA/, "United States", "the USA and Canada")
 
      For historical compatibility, `gawk' accepts erroneous code, such
      as in the previous example. However, using any other nonchangeable
      object as the third parameter causes a fatal error and your program
      will not run.
 
      Finally, if the REGEXP is not a regexp constant, it is converted
      into a string, and then the value of that string is treated as the
      regexp to match.
 
 `gsub(REGEXP, REPLACEMENT [, TARGET])'
      This is similar to the `sub' function, except `gsub' replaces
      _all_ of the longest, leftmost, _nonoverlapping_ matching
      substrings it can find.  The `g' in `gsub' stands for "global,"
      which means replace everywhere.  For example:
 
           { gsub(/Britain/, "United Kingdom"); print }
 
      replaces all occurrences of the string `Britain' with `United
      Kingdom' for all input records.
 
      The `gsub' function returns the number of substitutions made.  If
      the variable to search and alter (TARGET) is omitted, then the
      entire input record (`$0') is used.  As in `sub', the characters
      `&' and `\' are special, and the third argument must be assignable.
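
      As a small sketch of the return value, reusing the string from
      the `sub' example (the variables `n' and `out' are illustrative):

```shell
out=$(awk 'BEGIN {
    str = "water, water, everywhere"
    n = gsub(/at/, "ith", str)    # replaces every match and counts them
    print n, str
}')
printf '%s\n' "$out"
```

      This prints `2 wither, wither, everywhere': both occurrences of
      `at' are replaced, and `gsub' returns two.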
 
 `gensub(REGEXP, REPLACEMENT, HOW [, TARGET]) #'
      `gensub' is a general substitution function.  Like `sub' and
      `gsub', it searches the target string TARGET for matches of the
      regular expression REGEXP.  Unlike `sub' and `gsub', the modified
      string is returned as the result of the function and the original
      target string is _not_ changed.  If HOW is a string beginning with
      `g' or `G', then it replaces all matches of REGEXP with
      REPLACEMENT.  Otherwise, HOW is treated as a number that indicates
      which match of REGEXP to replace. If no TARGET is supplied, `$0'
      is used.
 
      `gensub' provides an additional feature that is not available in
      `sub' or `gsub': the ability to specify components of a regexp in
      the replacement text.  This is done by using parentheses in the
      regexp to mark the components and then specifying `\N' in the
      replacement text, where N is a digit from 1 to 9.  For example:
 
           $ gawk '
           > BEGIN {
           >      a = "abc def"
           >      b = gensub(/(.+) (.+)/, "\\2 \\1", "g", a)
           >      print b
           > }'
           -| def abc
 
      As with `sub', you must type two backslashes in order to get one
      into the string.  In the replacement text, the sequence `\0'
      represents the entire matched text, as does the character `&'.
 
      The following example shows how you can use the third argument to
      control which match of the regexp should be changed:
 
           $ echo a b c a b c |
           > gawk '{ print gensub(/a/, "AA", 2) }'
           -| a b c AA b c
 
      In this case, `$0' is used as the default target string.  `gensub'
      returns the new string as its result, which is passed directly to
      `print' for printing.
 
      If the HOW argument is a string that does not begin with `g' or
      `G', or if it is a number that is less than or equal to zero, only
      one substitution is performed.  If HOW is zero, `gawk' issues a
      warning message.
 
      If REGEXP does not match TARGET, `gensub''s return value is the
      original unchanged value of TARGET.
 
      `gensub' is a `gawk' extension; it is not available in
      compatibility mode (see Options).
 
 `substr(STRING, START [, LENGTH])'
      This returns a LENGTH-character-long substring of STRING, starting
      at character number START.  The first character of a string is
      character number one.(3) For example, `substr("washington", 5, 3)'
      returns `"ing"'.
 
      If LENGTH is not present, this function returns the whole suffix of
      STRING that begins at character number START.  For example,
      `substr("washington", 5)' returns `"ington"'.  The whole suffix is
      also returned if LENGTH is greater than the number of characters
      remaining in the string, counting from character START.
 
      If START is less than one, `substr' treats it as if it was one.
      (POSIX doesn't specify what to do in this case: Unix `awk' acts
      this way, and therefore `gawk' does too.)  If START is greater
      than the number of characters in the string, `substr' returns the
      null string.  Similarly, if LENGTH is present but less than or
      equal to zero, the null string is returned.
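
      A few of these boundary cases are sketched here; the brackets
      only make empty results visible, and `s' and `out' are
      illustrative names:

```shell
out=$(awk 'BEGIN {
    s = "washington"
    print "[" substr(s, 5, 100) "]"   # LENGTH runs past the end: whole suffix
    print "[" substr(s, 11) "]"       # START beyond the string: null string
    print "[" substr(s, 5, 0) "]"     # LENGTH of zero: null string
}')
printf '%s\n' "$out"
```

      This prints `[ington]', `[]', and `[]'.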
 
      The string returned by `substr' _cannot_ be assigned.  Thus, it is
      a mistake to attempt to change a portion of a string, as shown in
      the following example:
 
           string = "abcdef"
           # try to get "abCDEf", won't work
           substr(string, 3, 3) = "CDE"
 
      It is also a mistake to use `substr' as the third argument of
      `sub' or `gsub':
 
           gsub(/xyz/, "pdq", substr($0, 5, 20))  # WRONG
 
      (Some commercial versions of `awk' do in fact let you use `substr'
      this way, but doing so is not portable.)
 
      If you need to replace bits and pieces of a string, combine
      `substr' with string concatenation, in the following manner:
 
           string = "abcdef"
           ...
           string = substr(string, 1, 2) "CDE" substr(string, 6)
 
 `tolower(STRING)'
      This returns a copy of STRING, with each uppercase character in
      the string replaced with its corresponding lowercase character.
      Nonalphabetic characters are left unchanged.  For example,
      `tolower("MiXeD cAsE 123")' returns `"mixed case 123"'.
 
 `toupper(STRING)'
      This returns a copy of STRING, with each lowercase character in
      the string replaced with its corresponding uppercase character.
      Nonalphabetic characters are left unchanged.  For example,
      `toupper("MiXeD cAsE 123")' returns `"MIXED CASE 123"'.
 
 ---------- Footnotes ----------
 
 (1) Unless you use the `--non-decimal-data' option, which isn't
 recommended.  See Nondecimal Data, for more information.
 
 (2) Note that this means that the record will first be regenerated
 using the value of `OFS' if any fields have been changed, and that the
 fields will be updated after the substitution, even if the operation is
 a "no-op" such as `sub(/^/, "")'.
 
 (3) This is different from C and C++, in which the first character is
 number zero.
 