GNU/Linux Tools for Manipulating Text

GNU/Linux has many commands that can filter text strings.

Note: If you are not familiar with the GNU/Linux command line interface, review the Conventions page before proceeding.

sort

The sort command sorts the lines in text files and sends the results to the standard output.

sort ex_file...

By default, the sorting rules are:

  • Lines starting with a number will appear before lines starting with a letter.
  • Lines starting with a letter that appears earlier in the alphabet will appear before lines starting with a letter that appears later in the alphabet.
  • Lines starting with a lowercase letter will appear before lines starting with the same letter in uppercase.

sort sorts lines lexicographically, comparing entire lines character by character: if the initial characters of two lines are equal, the first differing character determines the lines' relative order.

Some of sort's helpful options include:

-b, --ignore-leading-blanks
Ignore leading blanks in field contents (i.e., treat a run of spaces like a single space).
-d, --dictionary-order
Sort in dictionary order (i.e., only alphanumeric and space characters are taken into account).
--debug
Annotate the part of each line used to sort, and print warnings about questionable usage to the standard error.
-f, --ignore-case
Fold uppercase to lowercase letters.
This option is relevant if your system is set up for bytewise sorting. For example, to have the sort command sort characters based on their ASCII order, you can set the LC_ALL localization variable to C (e.g., export LC_ALL=C). Then, to have sort ignore case, you can use the -f option. Run man 7 ascii to see a visualization of the ASCII character table.
-h, --human-numeric-sort
Compare human readable numbers (e.g., 2K, 1G).
-i, --ignore-nonprinting
Consider only printable characters.
-k ex_keydef, --key=ex_keydef
Sort according to ex_keydef. ex_keydef is a key definition in the form of ex_field_start.ex_character_ex_option,ex_field_end.ex_character_ex_option.
ex_character (a character position within the field), ex_option (one or more single-letter ordering options, [bdfgiMhnRrV], which override the global ordering options for that key), and ex_field_end (a field number) are all optional. If ex_field_end is omitted, the key extends to the end of the line.
-n, --numeric-sort
Compare according to string numerical value, i.e., treat the key (field) value as a number (ignoring leading blanks) and sort numerically instead of lexicographically.
The whole field is considered as a single number instead of being evaluated character by character (i.e., 123 is treated as the single number one hundred twenty-three, not as the three digits 1, 2, and 3).
-o ex_file, --output=ex_file
Write results to ex_file instead of the standard output (the filename may match the original input filename).
-r, --reverse
Reverse the result of comparisons. Sort in reverse (descending) order.
-u, --unique
Output only the first of each run of equal lines, i.e., suppress duplicate lines in the output.
-t ex_delimiter, --field-separator=ex_delimiter
Specify field delimiter (e.g., -t :).

Lexicographical Sort

Assume a data.txt file:

$ cat 'data.txt'
a
b
c
1
2
3
A
B
C

Running sort on this file results in a lexicographical sort:

$ sort 'data.txt'
1
2
3
a
A
b
B
c
C

When we add the -f option, we see the same lexicographical sort:

$ sort -f 'data.txt'
1
2
3
a
A
b
B
c
C

Bytewise Sort

Bytewise sorting can be enabled by setting the LC_ALL localization variable to C:

$ export LC_ALL=C
$ sort 'data.txt'
1
2
3
A
B
C
a
b
c

This will result in a sorting based on the ASCII character table (i.e., numbers first, then uppercase letters, followed by lowercase letters).

After bytewise sorting is enabled, you can see the result of using the -f option to tell sort to ignore case:

$ sort -f 'data.txt'
1
2
3
A
a
B
b
C
c

Numeric Sort

By default, sort does not treat numbers as whole values; it compares them digit by digit, like any other string characters. For example, assume the following data.txt file:

$ cat 'data.txt'
2
22
12
1

Sorting this file without any additional options may not produce the result you would expect:

$ sort 'data.txt'
1
12
2
22

To have sort treat each of these lines' sorting fields as whole numeric values, add the -n option:

$ sort -n 'data.txt'
1
2
12
22

Sort by Key (Field)

By default, sort treats each run of spaces as the start of a new field (i.e., a Space is the default delimiter for the sort command), and the spaces themselves become part of the following field's value.

$ cat 'data.txt'
01,Beverly,Crusher,TNG 2324
02,Julian,Bashir,DS9 2341
03,Leonard,McCoy,TOS 2227
04,Katherine,Pulaski,TNG 2318
05,Beverly,Picard,TNG 2324
$ sort 'data.txt'
01,Beverly,Crusher,TNG 2324
02,Julian,Bashir,DS9 2341
03,Leonard,McCoy,TOS 2227
04,Katherine,Pulaski,TNG 2318
05,Beverly,Picard,TNG 2324

Above, sort sees each line in data.txt as having two fields, i.e., everything up until the first space, and the space and everything after it (e.g., 01,Beverly,Crusher,TNG and 2324 for the first line in data.txt).

To sort a line by a specific field, use the -k option. sort will operate from the field you specify to the end of the line. The following sorts data.txt's content by first names (i.e., the second field), where the delimiter has been specified as a comma (,):

$ sort -k 2 -t ',' 'data.txt'
01,Beverly,Crusher,TNG 2324
05,Beverly,Picard,TNG 2324
02,Julian,Bashir,DS9 2341
04,Katherine,Pulaski,TNG 2318
03,Leonard,McCoy,TOS 2227

To get a nice visualization of where sort begins and ends its sorting evaluation, add the --debug option:

$ sort --debug -k 2 -t ',' 'data.txt'
sort: using ‘en_US.UTF-8’ sorting rules
01,Beverly,Crusher,TNG 2324
   ________________________
___________________________
05,Beverly,Picard,TNG 2324
   _______________________
__________________________
02,Julian,Bashir,DS9 2341
   ______________________
_________________________
04,Katherine,Pulaski,TNG 2318
   __________________________
_____________________________
03,Leonard,McCoy,TOS 2227
   ______________________
_________________________

Above, we can see that sort begins its evaluation of each line at the second field (where fields are delimited by a comma) until the end of each line.

For lines that are equal based on the field(s) you specified, sort will go back to the beginning of the line to evaluate other fields to serve as a tie breaker. This is demonstrated in the example above by the second horizontal line underneath each line of text in data.txt.

To sort by a specific character in a specific field, append a . and the character position in that field to sort by to the sort field in question. For example, the following command sorts by the fifth character of the fourth field (i.e., by each character's birth year; note the addition of the -n option to ensure that the birth year values are treated as whole numbers and not individual numeric characters):

$ sort -k 4.5 -n -t ',' 'data.txt'
03,Leonard,McCoy,TOS 2227
04,Katherine,Pulaski,TNG 2318
01,Beverly,Crusher,TNG 2324
05,Beverly,Picard,TNG 2324
02,Julian,Bashir,DS9 2341

If you only want sort to evaluate a line from one specific point to another, you can do so by separating the start and end points with a ,. This example sorts each line by first name and last name (the --debug option is added to illustrate the new sort evaluation area):

$ sort --debug -k 2,3 -t ',' 'data.txt'
sort: using ‘en_US.UTF-8’ sorting rules
01,Beverly,Crusher,TNG 2324
   _______________
___________________________
05,Beverly,Picard,TNG 2324
   ______________
__________________________
02,Julian,Bashir,DS9 2341
   _____________
_________________________
04,Katherine,Pulaski,TNG 2318
   _________________
_____________________________
03,Leonard,McCoy,TOS 2227
   _____________
_________________________

As previously mentioned, for lines that are equal based on the field(s) you specify, sort will go back to the beginning of the line to evaluate other fields to serve as a tie breaker.

Multiple key fields can be specified with multiple -k options. The following sorts by character first name, followed by row number, using reverse order:

$ sort -k 2 -k 1 -r -t ',' 'data.txt'
03,Leonard,McCoy,TOS 2227
04,Katherine,Pulaski,TNG 2318
02,Julian,Bashir,DS9 2341
05,Beverly,Picard,TNG 2324
01,Beverly,Crusher,TNG 2324

uniq

The uniq command displays lines in a file after omitting any repeated contiguous lines (i.e., only the first instance of a repeated contiguous line is shown):

uniq ex_file

Unlike most commands, uniq can only accept one file as an argument. If a second argument is given, uniq will treat it as an output file, instead of sending its output to the standard output.

Since uniq only filters out duplicate lines when they are contiguous, it often makes sense to use uniq with the sort command:

sort ex_file | uniq

Alternatively, you could just use sort's -u option for the same effect:

sort -u ex_file...

Display Unique Lines

To only display unique lines in a file (i.e., those that are not repeated), use uniq's -u (--unique) option:

sort ex_file | uniq -u

Display Single Instance of Repeated Lines

To display a single instance of repeated lines in a file, use uniq's -d (--repeated) option:

sort ex_file | uniq -d

Display All Instances of Repeated Lines

To display all instances of repeated lines in a file, use uniq's -D option:

sort ex_file | uniq -D
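
The three uniq variants can be sketched on some made-up, pre-sorted letter data:

```shell
# Input is already sorted, so repeated lines are contiguous
printf 'a\nb\nb\nc\nc\nc\n' | uniq      # a b c      (first of each run)
printf 'a\nb\nb\nc\nc\nc\n' | uniq -u   # a          (never-repeated lines only)
printf 'a\nb\nb\nc\nc\nc\n' | uniq -d   # b c        (one copy per repeated run)
printf 'a\nb\nb\nc\nc\nc\n' | uniq -D   # b b c c c  (every copy of repeated lines)
```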

expand and unexpand

The expand command converts tabs to a set number of spaces:

expand ex_file...

The default number of spaces is set to 8, but you can set this with the -t (--tabs=ex_integer) option (e.g., -t 5).

To convert spaces in a file to a set number of tabs, use the unexpand command:

unexpand ex_file...
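
A minimal sketch of both commands (note that unexpand only converts leading blanks by default; the -a option, used here, converts all blank runs):

```shell
# A tab becomes spaces up to the next tab stop (every 4 columns here)
printf 'a\tb\n' | expand -t 4         # 'a   b' (three spaces)

# -a converts interior runs of spaces back to tabs, not just leading ones
printf 'a   b\n' | unexpand -a -t 4   # 'a<Tab>b'
```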

cut

The cut command extracts sections (columns) from each line of files:

cut ex_file...

Options are used to specify what kind of section to extract: characters (-c, --characters=ex_list), bytes (-b, --bytes=ex_list), or fields (-f, --fields=ex_list).

A column is equivalent to a character and a field is a collection of columns. The -c, -b, and -f options are mutually exclusive.

cut's output will always contain the columns in the same order as the input, not the order that is specified in the command. For example, the following command displays the second and fourth characters of each line in the data.txt file:

$ cat 'data.txt'
01,Beverly,Crusher,TNG 2324
02,Julian,Bashir,DS9 2341
03,Leonard,McCoy,TOS 2227
04,Katherine,Pulaski,TNG 2318
05,Beverly,Picard,TNG 2324
$ cut -c 4,2 'data.txt'
1B
2J
3L
4K
5B

Even though the fourth character was specified first in the cut command above, the specified columns still appear in the same order in the command output as they do in the data.txt file (i.e., character 2 and then character 4).

Cutting a Character Range

A hyphen (-) can be used to specify a character range. The following extracts just the row number of each Star Trek character in data.txt:

$ cut -c 1-2 'data.txt'
01
02
03
04
05

Cutting Specific Columns

Cutting specific columns and ranges can be used together. The following displays the row number and first three letters of each Star Trek character's first name:

$ cut -c 1-2,4-6 'data.txt'
01Bev
02Jul
03Leo
04Kat
05Bev

If character ranges ever overlap, every input character is still output at most once.
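
A quick sketch of this behavior with a made-up input string:

```shell
# The ranges 1-3 and 2-4 overlap, but each character is output only once
echo 'abcdef' | cut -c 1-3,2-4
# abcd
```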

Cutting Specific Fields and Changing the Field Delimiter

By default, cut's field delimiter is the Tab character. A different delimiter can be set with the -d (--delimiter=ex_delimiter) option. For example, the following sets a comma (,) as the field delimiter and cuts the first and last Star Trek character names (i.e., the second and third fields) from the data.txt file:

$ cut -d ',' -f 2,3 'data.txt'
Beverly,Crusher
Julian,Bashir
Leonard,McCoy
Katherine,Pulaski
Beverly,Picard

Cutting a Field Range

Like with characters, you can specify a field range with a hyphen (-). The following displays the last name and series/birth year of each Star Trek character:

$ cut -d ',' -f 3-4 'data.txt'
Crusher,TNG 2324
Bashir,DS9 2341
McCoy,TOS 2227
Pulaski,TNG 2318
Picard,TNG 2324

Setting an Output Delimiter

cut allows you to specify different input and output delimiters. The --output-delimiter option sets the output delimiter. The following example uses a comma (,) as the input delimiter and a space as the output delimiter:

$ cut -d ',' -f 2,3 --output-delimiter=' ' 'data.txt'
Beverly Crusher
Julian Bashir
Leonard McCoy
Katherine Pulaski
Beverly Picard

If a line in a file is missing the specified delimiter character, cut outputs the line in its entirety. You can suppress these lines by using cut's -s (--only-delimited) option.
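
A sketch of this behavior with made-up input, one line of which lacks the delimiter:

```shell
# Without -s, the delimiter-less line passes through whole
printf 'a,b\nnodelim\n' | cut -d ',' -f 1
# a
# nodelim

# With -s, only lines containing the delimiter are output
printf 'a,b\nnodelim\n' | cut -s -d ',' -f 1
# a
```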

tr

The tr command translates (or deletes) characters for each line of a file. It does not act on whole words. Unlike most commands, tr cannot take a file as an argument. It can only receive input via redirection or piping.

The following example changes all commas in data.txt to semicolons:

$ cat 'data.txt'
01,Beverly,Crusher,TNG 2324
02,Julian,Bashir,DS9 2341
03,Leonard,McCoy,TOS 2227
04,Katherine,Pulaski,TNG 2318
05,Beverly,Picard,TNG 2324
$ tr ',' ':' < 'data.txt'
01:Beverly:Crusher:TNG 2324
02:Julian:Bashir:DS9 2341
03:Leonard:McCoy:TOS 2227
04:Katherine:Pulaski:TNG 2318
05:Beverly:Picard:TNG 2324

Translating Character Classes and Ranges

Besides working with characters, tr can also accept character classes (e.g., [:upper:]) and ranges (e.g., a-z). The following translates all lowercase alphabetic characters in data.txt to uppercase alphabetic characters:

$ tr '[:lower:]' '[:upper:]' < 'data.txt'
01,BEVERLY,CRUSHER,TNG 2324
02,JULIAN,BASHIR,DS9 2341
03,LEONARD,MCCOY,TOS 2227
04,KATHERINE,PULASKI,TNG 2318
05,BEVERLY,PICARD,TNG 2324

Alternatively, you could do:

$ tr 'a-z' 'A-Z' < 'data.txt'
01,BEVERLY,CRUSHER,TNG 2324
02,JULIAN,BASHIR,DS9 2341
03,LEONARD,MCCOY,TOS 2227
04,KATHERINE,PULASKI,TNG 2318
05,BEVERLY,PICARD,TNG 2324

Squeezing Characters

The -s option can be used to squeeze (remove) redundant adjacent characters. For example, if an extra 0 was prepended to the beginning of each line of data.txt, it could be filtered out like so:

$ cat 'data.txt'
001,Beverly,Crusher,TNG 2324
002,Julian,Bashir,DS9 2341
003,Leonard,McCoy,TOS 2227
004,Katherine,Pulaski,TNG 2318
005,Beverly,Picard,TNG 2324
$ tr -s '0' < 'data.txt'
01,Beverly,Crusher,TNG 2324
02,Julian,Bashir,DS9 2341
03,Leonard,McCoy,TOS 2227
04,Katherine,Pulaski,TNG 2318
05,Beverly,Picard,TNG 2324

Deleting Characters

The -d option is used to filter out (delete) specified characters. The following filters out the 0 character from each line in data.txt.

$ cat 'data.txt'
01,Beverly,Crusher,TNG 2324
02,Julian,Bashir,DS9 2341
03,Leonard,McCoy,TOS 2227
04,Katherine,Pulaski,TNG 2318
05,Beverly,Picard,TNG 2324
$ tr -d '0' < 'data.txt'
1,Beverly,Crusher,TNG 2324
2,Julian,Bashir,DS9 2341
3,Leonard,McCoy,TOS 2227
4,Katherine,Pulaski,TNG 2318
5,Beverly,Picard,TNG 2324

Creating Complement Substitution Sets

tr can be used to create complement substitution sets (i.e., to filter out all characters in a line except for those that you specify) using the -c (--complement) option. The following example grabs the first line of data.txt and pipes it to tr, which filters out all non-alphabetic characters (except for the newline character, \n):

$ head -n 1 'data.txt' | tr -cd 'a-zA-Z\n'
BeverlyCrusherTNG

Remove All Non-Printable Characters

The tr command can be used to accomplish useful tasks that may be common when working with large text files. For example, it can remove all non-printable characters from a file:

tr -cd '[:print:]' < ex_file

Join All Lines Into a Single Line

Also, tr can join all lines in a file into a single line:

tr -s '\n' ' ' < ex_file

Above, tr replaces newlines with spaces, and the -s option squeezes any redundant spaces to a single space (the -s option only operates on the last specified set, i.e., ' ' in this example).

sed

The sed command is a stream editor. Stream editors perform basic text transformation on an input stream (e.g., a file or input from a pipeline). The basic syntax for sed is:

sed ex_script ex_file...

sed takes a number of instructions (i.e., ex_script), reads its input line by line, and applies the instructions to the input lines, where appropriate. Finally, the lines are output in a (possibly) modified form.

sed only reads its input file (if there is one) and never modifies it. If an input file is not provided, sed filters the contents of the standard input. By default, sed sends its output to the standard output.

Helpful sed options include:

-e ex_script, --expression=ex_script
Add ex_script to the commands to be executed. This option can be used multiple times in a command to specify multiple sed scripts.
-f ex_script_file, --file=ex_script_file
Add the contents of ex_script_file to the commands to be executed. This option can be used multiple times in a command to specify multiple files.
-i, --in-place
Edit files in place. Also, make a backup if an optional ex_suffix is supplied as an argument (e.g., -i ex_suffix, --in-place=ex_suffix).
-n, --quiet, --silent
Suppress automatic printing of pattern space. This option can be overridden with an explicit p (print) command.
For example, sed -ne '1,10p' ex_file is equivalent to the head command, i.e., only the first 10 lines of ex_file are sent to the standard output. However, sed -ne '1,10p' ex_file is not as efficient as the head command because every line of ex_file is read before sed generates its output.
-s, --separate
Consider files as separate, rather than as a single continuous long stream. Every input file starts with a line number of 1.
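
For example, a sketch of an in-place edit with a backup suffix (demo.txt is a hypothetical throwaway file; the -i.bak syntax shown is GNU sed's):

```shell
# Create a throwaway file, edit it in place, and keep a .bak backup
printf 'foo\n' > demo.txt
sed -i.bak -e 's/foo/bar/' demo.txt
cat demo.txt       # bar
cat demo.txt.bak   # foo
rm demo.txt demo.txt.bak
```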

How sed Works

sed maintains two data buffers:

  1. Active Pattern Space
  2. Auxiliary Hold Space

By default, both data buffers are empty.

sed performs the following cycle on each line of input:

  • First, sed reads one line from the input stream, removes any trailing newline, and places it in the pattern space.
  • Then, commands are executed. Each command can have an address associated with it. An address acts as a condition, and a command is only executed if that condition holds for the line currently in the pattern space.
  • When the end of the script is reached (unless the -n option is used), the contents of the pattern space are printed out to the output stream. The trailing newline is added back if it was removed.
  • The next cycle starts for the next input line.

Unless a command like D is used, the pattern space is deleted between two cycles (if the pattern space contains no newlines, the D command starts a normal new cycle; the d command deletes the pattern space and immediately starts a new cycle). On the other hand, the hold space keeps its data between cycles.

sed Programs

A sed program consists of one or more sed commands passed in by one or more of the -e and -f options, or the first non-option argument passed to the command if these options are not used. The sed script is the in-order concatenation of all the scripts and script files passed to sed.

Selecting Lines With sed

For sed, a number as a line address is considered a line number. The first input line is number 1, and the count continues from there, even if the input consists of several files.

Line addresses in a sed script can follow these forms:

ex_number
Specifying a line number will only match that line in the input. Unless the -i or -s options are specified, sed continuously counts lines across all input files (e.g., sed -e '3p' -n 'data.txt' suppresses all output lines of data.txt except for line 3, which sed is explicitly told to print).
ex_first_line~ex_step
A GNU extension of sed that matches every ex_step lines starting with ex_first_line. Essentially, ex_first_line is the starting point, and ex_step is the step (e.g., sed -e '1~2!d' 'data.txt' removes all lines from data.txt's sed output except for the odd-numbered lines).
$
Matches the last line of the last file of input, or the last line of each file when the -i or -s options are specified (e.g., sed -e '5,$p' -n 'data.txt' prints out the content of data.txt from line 5 to the end of the file).
/ex_regex/
Selects any line that matches the regular expression ex_regex (e.g., sed -e '/TNG/p' -n 'data.txt' prints out any lines in data.txt that have TNG in them). If ex_regex itself includes any / characters, each must be escaped with a \.
The empty regular expression // repeats the last regular expression match (the same is true if the empty regular expression is passed to the s command, i.e., the substitute command). Keep in mind, modifiers to regular expressions are evaluated when the regular expression is compiled. Therefore, it is invalid to specify them together with the empty regular expression.
\%ex_regex%
Matches the regular expression ex_regex, but allows you to use a different delimiter other than / (e.g., sed -e '\%DS9%!d' 'data.txt' removes all lines from data.txt's sed output except for those that have DS9 in them). This is useful if ex_regex itself contains /s. If ex_regex contains any delimiter characters, each must be escaped by a \.
The % character may be replaced by any other single character.
/ex_regex/I, \%ex_regex%I
A GNU extension that causes ex_regex to be matched in a case-insensitive manner.
/ex_regex/M, \%ex_regex%M
A GNU extension that causes ^ and $ to respectively match (in addition to the normal behavior) the empty string after a newline and the empty string before a newline. There are special character sequences (i.e., \`, \') that always match the beginning or end of the buffer. M is short for multi-line.

If no line addresses are provided to sed, then all lines are matched (e.g., sed -e '' 'data.txt' prints out all lines of data.txt). If one line address is given, then only lines matching that address are matched (e.g., sed -e '3!d' 'data.txt' removes all lines from data.txt's sed output except for line 3).

A line address range can be specified by providing two line addresses separated by a comma (,). A line address range matches lines starting from where the first address matches and continues until the second address matches (inclusively). For example, sed -e '2,4!d' 'data.txt' removes all lines from data.txt's sed output except for lines 2 through 4.

If the second address is a regular expression, then checking for the ending match will start with the line following the line that matched the first address. Hence, the expression could never match the first input line. You can get around this by using 0,/ex_regex/.

If the first address is a regular expression too, then once a line matching the second expression is found, sed continues looking for another line matching the first expression, as there might be another matching range of lines.

A range will always span at least two lines (except if the input stream ends).

If the second address is a number less than or equal to the line matching the first line address, then only the one line is matched.

GNU/Linux sed supports special two address forms, as well:

0,/ex_regex/
sed will try to match ex_regex in the first input line, too. That is, 0,/ex_regex/ is similar to 1,/ex_regex/, except that if ex_regex matches the first line of input, the 0,/ex_regex/ form will consider it to end the range, while the 1,/ex_regex/ form will match the beginning of its range and make the range span up to the second occurrence of the regular expression.
ex_address,+ex_number
Matches ex_address and the ex_number lines following ex_address.
ex_address,~ex_number
Matches ex_address and the lines following ex_address until the next line whose input line number is a multiple of ex_number.
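
Both forms can be sketched with seq as an input generator:

```shell
# Lines 3 through 3+2
seq 10 | sed -n '3,+2p'
# 3
# 4
# 5

# From line 3 until the next line whose number is a multiple of 4
seq 10 | sed -n '3,~4p'
# 3
# 4
```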

Appending the ! character to the end of a line address specification negates the sense of the match, i.e., if the ! character follows a line address range, then only lines that do not match the line address range will be selected (e.g., sed -e '2!p' -n 'data.txt' prints all lines of data.txt except for line 2). This also works for single line addresses and for the null (empty) address.

These are examples of range addressing:

1,/^$/
This selects all lines up to the first empty line. This can be useful to extract the header of an email message (which, by definition, always finishes with an empty line, but may not contain empty lines). Here, ^$ is a regular expression signifying an empty line. An example command to extract an email message header from a file is sed -e '1,/^$/p' -n ex_file.
You can use the ^$ regular expression to delete all empty lines in a file with sed '/^$/d', as well.
1,$
This describes all input lines (e.g., sed -e '1,$!d' 'data.txt' removes all lines from data.txt's sed output except for lines 1 through to the last line of the file, i.e., everything).
/^BEGIN/,/^END/
This describes all ranges of lines starting at one beginning with the text BEGIN up to one beginning with the text END (inclusively). Here, ^BEGIN and ^END are regular expression patterns (e.g., sed '/^BEGIN/,/^END/!d' ex_file removes all lines from ex_file's sed output except for all ranges of lines starting at one beginning with the text BEGIN up to one beginning with the text END (inclusively)).

sed and Regular Expressions

Certain regular expression characters need to be escaped with a backslash (\) when being used with sed:

  • \+
  • \?
  • \{i\}
  • \{i,j\}
  • \{i,\}
  • \(ex_regex\)
  • ex_regex\|ex_regex
  • \(\ex_digit\)

sed Commands

sed commands in a script or script file can be separated by semicolons (;) or newlines. However, some commands cannot be followed by semicolons as command separators and should be terminated with newlines, or be placed at the end of a script or script file. Commands can be preceded with optional non-significant whitespace characters.

A sed command consists of an optional address or address range, followed by a one character command name and any additional parameters. Line addresses determine which lines the command applies to.
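
For example, two semicolon-separated address/command pairs can print the first and last lines of the input:

```shell
# '1p' prints line 1; '$p' prints the last line; -n suppresses auto-printing
seq 5 | sed -n '1p;$p'
# 1
# 5
```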

These are some helpful sed commands:

a\ex_text
The append command. Queue the lines of text that follow this command (each but the last ending with a \, which are removed from the output) to be output at the end of the current cycle, or when the next input line is read. a is a one address command (e.g., sed -e '5a\06,The,Doctor,VYG 2371' 'data.txt' adds the 06,The,Doctor,VYG 2371 line after line 5 in data.txt).
Escape sequences in text are processed, so you should use \\ in text to print a single backslash.
If between the a and the newline there is something other than a whitespace-\ sequence, then the text of this line, starting at the first non-whitespace character after the a, is taken as the first line of the text block. (This enables a simplification in scripting a one-line add.) This extension also works with the i and c commands.
c\ex_text
The change command. Delete the lines matching the address or address-range, and output the lines of text which follow this command (each but the last ending with a \, which are removed from the output) in place of the last line (or in place of each line, if no addresses were specified), e.g., sed -e '5c\05,The,Doctor,VYG 2371' 'data.txt' replaces line 5 in data.txt with 05,The,Doctor,VYG 2371.
A new cycle is started after this command is done, since the pattern space will have been deleted.
d
The delete command. Delete the pattern space. Immediately start a new cycle.
This is useful if you want to suppress input lines so that sed does not output them. For example, sed -e '11,$d' ex_file suppresses lines from being sent to the standard output starting at line 11, essentially making it analogous to the head command, i.e., only the first 10 lines of ex_file are sent to the standard output.
i\ex_text
Immediately output the lines of text that follow this command (each but the last ending with a \, which are removed from the output). i is a one address command (e.g., sed -e '1i\Star Trek Doctors\n' 'data.txt' inserts Star Trek Doctors and a newline before line 1 of data.txt).
p
The print command. Print out the pattern space to the standard output. Usually, this command is only used in conjunction with the -n option.
n
The next command. If auto-print is enabled, print the pattern space. Then, replace the pattern space with the next line of input. If there is no more input, sed exits without processing any more commands. For example, sed -e 'n' 'data.txt' displays the contents of data.txt.
q
The quit command. Exit sed without processing any more commands or input (e.g., sed -e '10q' ex_file is an efficient alternative to the head command). This command only accepts a single address and the current pattern space is printed if the -n option is not used. An optional ex_exit_code may be supplied to this command as an argument.
y/ex_source_characters/ex_destination_characters/
Transliterate any characters in the pattern space that match any of the ex_source_characters with the corresponding character in ex_destination_characters.
Instances of the / (or whatever other character is used instead), \, or newlines can appear in the ex_source_characters or ex_destination_characters lists, provided that each instance is escaped by a \. The ex_source_characters and ex_destination_characters lists must contain the same number of characters (after de-escaping). The / characters may be uniformly replaced by any other single character within any given y command.
For example, sed -e 'y/, /|:/' 'data.txt' changes commas (,) to pipes (|) and spaces to colons (:) in data.txt.
{ ex_commands }
A group of commands. This is useful when you want a group of commands to be triggered by a single line address (or line address range) match.

In traditional sed syntax, sed expects a backslash (\) after the a, i, or c commands immediately preceding the end of line. If more than one line is to be inserted, each of these lines, except for the last line, must also be terminated with a backslash immediately preceding the end of line.

For example, you can insert a blank line after every input line that ends with a capital letter with sed -e '/[A-Z]$/a\\' ex_file. Here, the \ after a\, with no content to append before it, represents a blank line.

The a and i commands are one-address commands, i.e., they allow only addresses matching a single line, no ranges. However, there may be several input lines that match the one address given with a and i, and which will be processed. This one address may include regular expressions. With the c command, an address range implies that all of the range is to be replaced.

The y command makes sed replace single characters. It does not allow for ranges, like a-z, so it is not a good replacement for the tr command.
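
A quick comparison of y and tr on the same made-up input:

```shell
# y transliterates character by character and requires the sets spelled out
echo 'abc' | sed -e 'y/abc/xyz/'   # xyz
# tr accepts ranges, which y does not
echo 'abc' | tr 'a-c' 'x-z'        # xyz
```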

The s Command

The s (substitute) command is a powerful sed command that allows substitution of a regular expression by a character string whose composition may dynamically change. Its syntax is:

's/ex_regex/ex_replacement/ex_flags'

The s command tries to match the pattern space against ex_regex. If the match is successful, then the portion of the pattern space that was matched is replaced with ex_replacement.

The / character separator may be replaced by any other single character within any s command. You just need to be consistent and use the same separator character three times. Only the backslash (\) and newline characters are not allowed as separators. The separator character may only appear in ex_regex or ex_replacement if it is preceded by a \ character.

For regular expressions used as line addresses, the /s as separators are mandatory.

This example replaces the abbreviation TNG with the phrase The Next Generation in the data.txt file:

sed -e 's/\<TNG\>/The Next Generation/g' 'data.txt'

\<TNG\> is a regular expression that ensures that only the abbreviation TNG, and not words that contain the characters TNG, are targeted.

Note that \<, \>, ^, and $ correspond to empty strings. In particular, $ is not the actual newline character at the end of a line, but the empty string "" immediately preceding the newline character. Therefore, s/$/|/ inserts a | immediately before the end of line, instead of replacing the end of line with it (e.g., sed -e 's/$/|/' 'data.txt' substitutes the empty string at the end of each line in data.txt with a |).

ex_replacement can contain \ex_number references (where ex_number is a number from 1 through 9, inclusive), which refers to the portion of the match that is contained between the ex_numberth \( and its matching \).

For example, the following sed command can take a full HTTP or HTTPS URL as input and just return the protocol portion:

sed -e 's,^\(.*://\).*,\1,'
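A quick sketch of that command in use (the URLs are illustrative): \1 in the replacement refers back to the portion matched between \( and \), so only the protocol part of each line survives:

```shell
# Keep only the protocol portion of each URL.
printf 'https://example.com/page\nhttp://example.org\n' |
    sed -e 's,^\(.*://\).*,\1,'
```

Note how the comma is used as the separator here, since the regular expression itself contains / characters.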

ex_replacement may also contain unescaped & characters that reference the whole matched portion of the pattern space.

In addition, ex_replacement may include the following special GNU extension sequences:

\E
Stop case conversion started by \L or \U.
\L
Turn the replacement to lowercase until a \U or \E is found.
\l
Turn the next character to lowercase.
\U
Turn the replacement to uppercase until a \L or \E is found.
\u
Turn the next character to uppercase.

To include a literal \, &, or newline in ex_replacement, precede the desired \, &, or newline in ex_replacement with a \.
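A small sketch of the case-conversion sequences (assuming GNU sed; the sample names are illustrative), uppercasing the first word of each line:

```shell
# \U starts uppercasing the replacement, \E stops it;
# & stands for the whole matched portion (here, the first word).
printf 'beverly crusher\njulian bashir\n' | sed -e 's/[a-z]\+/\U&\E/'
```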

The s command can be followed by zero or more of the following flags (modifiers):

ex_number
Only replace the ex_numberth match of ex_regex.
e
Allows one to pipe input from a shell command into the pattern space. If a substitution was made, the command that is found in the pattern space is executed and the pattern space is replaced with its output. A trailing newline is suppressed. Results are undefined if the command to be executed contains a null character.
The e flag is a GNU sed extension.
g
Apply the replacement to all matches of ex_regex, not just the first.
g stands for global.
I or i
A GNU extension that makes sed match ex_regex in a case-insensitive manner.
M or m
A GNU extension that causes ^ and $ to match (in addition to normal behavior) the empty string after a newline and the empty string before a newline, respectively. There are special character sequences (i.e., \`, \') that always match the beginning or end of the buffer.
M stands for multi-line.
p
If the substitution was made, then print the new pattern space.
w ex_file
If the substitution was made, then write out the result to ex_file. Two special values of ex_file are supported on GNU/Linux: (1) /dev/stderr, which writes the result to the standard error, and (2) /dev/stdout, which writes the result to the standard output.
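As a minimal sketch of the ex_number flag (the input string is illustrative), only the second occurrence of the pattern on the line is replaced:

```shell
# The trailing 2 restricts the substitution to the second match.
echo 'one two two two' | sed -e 's/two/2/2'
```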

Normally, the s command replaces only the first match on every line. If the g (global) flag is appended to the s command, it replaces every occurrence of the search pattern on each line.

For example, the following sed command matches words ([A-Za-z]\+), then replaces them with quoted versions ("&"). Since the g flag is used, every echoed word is subject to replacement:

$ echo 'Every word quoted.' | sed -e 's/[A-Za-z]\+/"&"/g'
"Every" "word" "quoted".

Above, \+ in the regular expression means one or more occurrences of the preceding element, so [A-Za-z]\+ matches a run of one or more letters (i.e., a word). The ampersand (&) references the whole matched portion of the pattern space (i.e., each word matched by the [A-Za-z]\+ regular expression).

Another useful flag is p (print), which outputs the line after replacement, like the p command. The following example deletes the first word from all lines beginning with A (here, a word is a sequence of letters):

$ cat 'data.txt'
Oranges are acidic.
Apples are sweet.
Plums are juicy.
Apricots are dry.
Pineapples are sweet and juicy.
$ sed -e '/^[^A]/p; /^A/s/[A-Za-z]\+//p' -n 'data.txt'
Oranges are acidic.
 are sweet.
Plums are juicy.
 are dry.
Pineapples are sweet and juicy.

/^[^A]/p prints out the lines that do not begin with an A. /^A/s/[A-Za-z]\+//p matches all lines that start with an A, and then replaces the first word in that line with nothing. Next, the replacement line is printed.

awk

AWK is a programming language for text file processing. GNU/Linux distributions do not contain AT&T's original awk, but a compatible version called gawk (GNU awk; awk is likely a symbolic link to either /etc/alternatives/awk or /usr/bin/gawk on your GNU/Linux system).

awk has the following syntax:

awk ex_program ex_file...

Various kinds of formatted reports and structured data operations can be performed with awk. awk can:

  • Interpret text files consisting of records, which in turn consist of fields.
  • Store data in variables and arrays.
  • Perform arithmetic, logical, and string operations.
  • Evaluate loops and conditionals.
  • Define functions.
  • Post-process the output of commands.

awk reads text from its standard input or files named on the command line, usually line by line. Every line (record) is divided into fields and processed. The results are written to the standard output or a named file.

awk programs exist somewhere between shell scripts and programs in languages like Python. The main difference is awk's data-driven operation: the flow of control is largely determined by the input data, while typical programming languages are more geared toward functions.

Useful awk options include:

-e ex_program, --source ex_program
Use ex_program as AWK program source code.
-f ex_program, --file ex_program
Read the AWK program source from ex_program, instead of from the first command line argument. Multiple -f options may be used.
-F ex_field_separator, --field-separator ex_field_separator
Use ex_field_separator for the input field separator. By default, awk splits fields on whitespace (runs of Spaces and Tabs).
-v ex_var=ex_value, --assign ex_var=ex_value
Assign the value ex_value to the variable ex_var before the execution of the program begins.
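A minimal sketch of the -v option (the variable name and value are illustrative); the assignment takes effect before the program runs, so the variable is already set in the BEGIN block:

```shell
# greeting is available as soon as the program starts.
awk -v 'greeting=Hello' 'BEGIN {print greeting, "from awk"}'
```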

AWK Program Execution

An AWK program is a sequence of pattern-action statements and option-function definitions:

  • @include ex_file
  • @load ex_file
  • ex_pattern {ex_action_statements}
  • ex_function(ex_parameters) {ex_action_statements}

The action is executed on the text that matches the pattern. The entire AWK program is enclosed in single quotes ('').

awk reads the program source from the ex_program file(s) if specified, from arguments to -e, or from the first non-option argument on the command line. awk reads the program text as if all of the program files and command line source texts were concatenated.

Lines beginning with @include may be used to include other source files into your program.

awk executes AWK programs in the following order:

  1. All variable assignments specified via the -v option are performed.
  2. The program is compiled into an internal form.
  3. The code in the BEGIN block(s), if there are any, is executed. awk then proceeds to read each file named in the ARGV array (up to ARGV[ARGC]).
  4. If there are no files named on the command line, awk reads the standard input.
  5. If a file name on the command line has the form ex_var=ex_value, it is treated as a variable assignment. The variable ex_var will be assigned the value ex_value. This happens after any BEGIN block(s) have been run.
    • Command line variable assignment is most useful for dynamically assigning values to the variables AWK uses to control how input is broken into fields and records. Also, it is useful for controlling state if multiple passes are needed over a single data file.
  6. If the value of a particular element of ARGV is empty (""), awk skips over it.
  7. For each input file, if a BEGINFILE rule exists, awk executes the associated code before processing the contents of the file.
  8. Similarly, awk executes the code associated with ENDFILE after processing the file.
  9. For each record in the input, awk tests to see if it matches any pattern in the AWK program. For each pattern that the record matches, the associated action is executed. The patterns are tested in the order they occur in the program.
  10. After all the input is exhausted, awk executes the code in the END block(s) (if any).

In principle, an awk program works like a loop over the input records (usually lines) that is repeated until no input is left or the program is terminated. The control flow is largely given by the data. In most other languages, the main program is started once and functions (which may read input) influence the progress of the calculation.

In the simplest case, awk works like sed, i.e., you select lines and then apply commands to those lines. For example, the following command outputs all input lines containing at least three bs:

awk '/b.*b.*b/ {print}' ex_file

The braces ({}) may contain one or more commands that are applied to the lines matching the regular expression between the forward slashes (/). Commands are separated by a semicolon (;).

A sequence of AWK commands does not need to include a regular expression in front of it. A {ex_commands} without a regular expression will be applied to every input record.
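As a small sketch (the input lines are illustrative), a {} block with no address runs once for every input record; NR holds the number of the current record:

```shell
# Print each record prefixed with its record number.
printf 'alpha\nbeta\n' | awk '{print NR, $0}'
```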

AWK Scripts

Often, it is more convenient to put AWK scripts into their own files. You can execute these files using awk -f ex_script_file. Lines starting with a # are considered comments.

As with sed (e.g., #!/usr/bin/env -S sed -f), there is nothing wrong with directly executable awk scripts:

#!/usr/bin/env -S awk -f

/a.*b.*/ {print}

For every input line, awk checks which script lines match it, and all awk command sequences that match are executed.

Records and Fields

awk assumes that its input is structured, rather than being an arbitrary stream of bytes as sed assumes. Usually, every input line is considered a record and split into fields on white space (i.e., a field is a string surrounded by white space). Unlike programs like sort and cut, awk considers sequences of spaces one field separator, instead of seeing an empty field between two adjacent space characters.

By default, awk separates fields on whitespace. The input field separator can be changed using the -F ex_field_separator option. Output fields are separated by blanks.

The $ is awk's field access operator. You can refer to the individual fields of a record using the awk expressions $ex_number (e.g., $1 for field 1, $2 for field 2, etc.):

$ cat 'data.txt'
01,Beverly,Crusher,TNG 2324
02,Julian,Bashir,DS9 2341
03,Leonard,McCoy,TOS 2227
04,Katherine,Pulaski,TNG 2318
05,Beverly,Picard,TNG 2324
$ awk -F ',' '{print $2,$3}' 'data.txt'
Beverly Crusher
Julian Bashir
Leonard McCoy
Katherine Pulaski
Beverly Picard

The $0 field identifier returns the full input record (i.e., all fields for a record).

$ awk '{print $0}' 'data.txt'
01,Beverly,Crusher,TNG 2324
02,Julian,Bashir,DS9 2341
03,Leonard,McCoy,TOS 2227
04,Katherine,Pulaski,TNG 2318
05,Beverly,Picard,TNG 2324

The $NF field identifier returns the last field's value for a line, regardless of the number of fields in the current record.

$ awk -F ',' '{print $NF}' 'data.txt'
TNG 2324
DS9 2341
TOS 2227
TNG 2318
TNG 2324

Unlike cut, awk is able to change the order of columns in its output with respect to their order in the input.

$ awk -F ',' '{print $3,$2}' 'data.txt'
Crusher Beverly
Bashir Julian
McCoy Leonard
Pulaski Katherine
Picard Beverly

When outputting multiple fields with a comma (e.g., {print $3,$2}), awk prints a space between each field. If you do not place a comma in between each field, their values will be concatenated.
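A quick sketch of that difference (the input is illustrative): the first print uses a comma and gets a space between the fields, the second omits it and concatenates them:

```shell
# Comma inserts the output field separator; no comma concatenates.
echo 'Beverly Crusher' | awk '{print $1,$2; print $1 $2}'
```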

BEGIN and END

awk lets you specify command sequences to be executed at the beginning of a run, before data has been read, and at the end, after the last records have been read. This can be used for initialization or final results.

$ ls -l *.txt | awk '
    BEGIN {sum = 0}
    {sum = sum + $5}
    END {print sum}'
255

The above command adds the lengths/sizes of all files with the .txt extension in the current directory and outputs the results at the end. sum is a variable that contains the current total (the BEGIN rule is not strictly required since new variables are set to 0 when they are first used).

A BEGIN rule is executed once before any text processing starts. An END rule is executed after all processing has completed. You can have multiple BEGIN and END rules, and they will execute sequentially.

BEGIN and END rules have their own set of actions enclosed within their own set of curly braces ({}).

AWK Variables

Variables in AWK behave like shell variables, except that you can refer to their values without having to put a $ in front of the variable name. They may contain either strings or floating-point numbers. AWK variables are typeless and are interpreted as required:

$ awk 'BEGIN {a = "456def"; print 2*a; exit}'
912
$ awk 'BEGIN {a = "def"; print 2*a; exit}'
0

Variable names always start with a letter and may otherwise contain letters, digits, and the underscore (_).

You can include assignments to AWK variables on the command line, among the awk options and filename arguments. The only condition is that awk gets to see each assignment as a single argument (hence, there may be no spaces around the =).
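As a sketch of command line assignments (the /tmp/nums.txt path, the label variable, and its values are all illustrative): each assignment takes effect when awk reaches it in the argument list, which makes it easy to process the same file twice with different state:

```shell
# label changes between the two passes over the same file.
printf '5\n7\n' > /tmp/nums.txt
awk '{print label, $1}' label=first /tmp/nums.txt label=second /tmp/nums.txt
rm /tmp/nums.txt
```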

This is an improved version of the previous text file size summation example:

$ ls -l *.txt | awk '
    {
      sum += $5
      count++
    }
    END {print sum, "bytes in", count, "files"}'
255 bytes in 3 files

sum += $5 is equivalent to sum = sum + $5 and count++ is equal to count = count + 1 (here, count represents the number of files whose length/size was summed).

awk uses the NF variable to make available the number of fields found in a record (i.e., the number of columns/words in a line).

$ cat 'data.txt'
01,Beverly,Crusher,TNG 2324
02,Julian,Bashir,DS9 2341
03,Leonard,McCoy,TOS 2227
04,Katherine,Pulaski,TNG 2318
05,Beverly,Picard,TNG 2324
$ awk -F ',' '{print NF}' 'data.txt' | head -n 1
4

Besides NF, awk defines other system variables:

FS
The input field separator, which you set using the -F option (an assignment to FS within a BEGIN command accomplishes the same, as well).
RS
The input record separator (i.e., the character that marks the end of a record/line). This is usually the newline character, but you can choose something else. The special value "" makes an empty line the record separator, which makes it easy to process files that are block structured rather than line structured.

This is an example of a block structured file:

$ cat 'block.txt'
Beverly
Crusher
TNG
2324

Julian
Bashir
DS9
2341

For the above file, you can set FS and RS to the appropriate values to convert it to a line structured file. Instead of a default blank, the output field separator can be specified using the OFS variable:

$ awk 'BEGIN {FS = "\n"; RS = ""; OFS = ","} {print $1,$2,$NF}' 'block.txt'
Beverly,Crusher,2324
Julian,Bashir,2341

Above, for the block.txt file, we specify the field separator as a new line (\n), the record separator as an empty line (""), and the output field separator as a comma (,). Then, we print the first, second, and last fields ({print $1,$2,$NF}).

awk supports arrays, i.e., indexed groups of variables sharing a name. awk allows arrays to be indexed using arbitrary character strings, rather than just numbers, as well. Often, this is called an associative array.

#!/usr/bin/env -S awk -f
# Display users for each login shell

BEGIN {FS = ":"}
      {login_shells[$NF] = login_shells[$NF] $1 ", "}
END   {
          for (i in login_shells) {
              print i ": " login_shells[i]
          }
      }

The BEGIN rule sets the field separator, the middle rule collects the data, and the END rule outputs it. The for command introduces a loop in which the i variable is set to every index of the login_shells array in turn (the order is nondeterministic).

When the above AWK script is used on the /etc/passwd file, you will get a list of login shells together with their respective users:

$ shell_users.awk '/etc/passwd'
/bin/false: tss, speech-dispatcher, hplip, Debian-gdm,
/bin/bash: root, guest,
/usr/sbin/nologin: daemon, bin, sys, games, man, lp, mail, news, uucp, proxy, www-data, backup, list, irc, gnats, nobody, _apt, systemd-timesync, systemd-network, systemd-resolve, messagebus, dnsmasq, usbmux, rtkit, pulse, avahi, saned, colord, geoclue, systemd-coredump,
/bin/sync: sync,

AWK Expressions

awk expressions may contain, among others, the common basic arithmetic (+, -, *, and /) and comparison operators (<, <=, >, >=, ==, and !=). The ^ character can be used for exponentiation.

In awk, && is the logical AND operator (true is considered a non-zero value), || is the logical OR operator, and ! is the logical NOT operator.

For example, the following prints out the fourth field for lines in data.txt where the third field is either Bashir or McCoy:

$ cat 'data.txt'
01,Beverly,Crusher,TNG 2324
02,Julian,Bashir,DS9 2341
03,Leonard,McCoy,TOS 2227
04,Katherine,Pulaski,TNG 2318
05,Beverly,Picard,TNG 2324
$ awk -F ',' '$3 == "Bashir" || $3 == "McCoy" {print $4}' 'data.txt'
DS9 2341
TOS 2227

There are also test operators for regular expressions, ~ and !~, which you can use to check whether a string matches or does not match a regular expression, respectively:

$1 ~ /b.*b.*b/ {ex_statements}

The above command evaluates whether the first field contains at least three b characters.
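A minimal sketch of the ~ operator in use (the records are illustrative): only records whose first field starts with a b have their second field printed:

```shell
# ~ tests a string against a regular expression.
printf 'bob,1\nana,2\nbarb,3\n' | awk -F ',' '$1 ~ /^b/ {print $2}'
```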

Two strings (or variable values) can be concatenated by writing them next to each other (separated by a space):

$ cat 'data.txt'
01,Beverly,Crusher,TNG 2324
02,Julian,Bashir,DS9 2341
03,Leonard,McCoy,TOS 2227
04,Katherine,Pulaski,TNG 2318
05,Beverly,Picard,TNG 2324
$ awk -F ',' 'NR==1 {print $2 " knows best."}' 'data.txt'
Beverly knows best.

Above, we specify a , as the field delimiter with -F ','. NR==1 is used to tell awk that we want to target the first row of data.txt (specifically, NR is the number of input records awk has processed since the beginning of the program's execution). Then, we concatenate the value of line 1's second field (i.e., Beverly) with the string knows best. using print $2 " knows best.".

AWK expressions may refer to functions. Some functions are predefined, including the arithmetic functions:

int
Determine the integer part of a number.
log
Logarithm.
sqrt
Square root.
$ awk 'BEGIN {print sqrt(25)}'
5

awk's calculation abilities are roughly equivalent to those of a scientific calculator.

There are also string functions:

length
Determine the length of a string.
sub
Substitute strings. Corresponds to sed's s operator.
substr
Return arbitrary substrings.
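A short sketch of the three string functions together (the input string is illustrative):

```shell
echo 'Beverly Crusher' | awk '{
    print length($0)        # number of characters in the record
    sub(/Crusher/, "Picard")  # replace the first match in $0
    print $0
    print substr($0, 1, 7)  # 7 characters starting at position 1
}'
```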

You can define your own functions like so:

#!/usr/bin/env -S awk -f
# Multiply numbers by 4

function quad(n) {
    return 4*n
}

{print $1, quad($1)}

The above script reads a file of numbers (one per line) and outputs the original number and that number quadrupled.

$ cat 'numbers.txt'
3
2
1
$ quad.awk 'numbers.txt'
3 12
2 8
1 4

A function's body may consist of one or more awk commands. The return command is used to return a value as the function's result. The variables mentioned in a function's parameter list (in the above example, n) are passed to the function and are local to it (i.e., they may be changed, but the changes are invisible outside of the function). All other variables are global (i.e., they are visible everywhere within the awk program).

There is no provision in awk for declaring extra local variables within a function. However, you can work around this by adding extra parameters to the function's parameter list that callers simply omit; these parameters then behave as local variables.

This is a function that sorts the elements of an array A, indexed with the numbers 1 through N, in alphabetical order:

#!/usr/bin/env -S awk -f
# Sort array of Star Trek doctor names

function sort(A, N, i, j, temp_var) {
    # Insertion sort
    for (i = 2; i <= N; i++) {
        for (j = i; A[j-1] > A[j]; j--) {
            temp_var = A[j]; A[j] = A[j-1]; A[j-1] = temp_var
        }
    }
    return
}

BEGIN {
    doc_name[1] = "Katherine"; doc_name[2] = "Leonard"
    doc_name[3] = "Julian"; doc_name[4] = "Beverly"
    sort(doc_name, 4)
    for (i = 1; i <= 4; i++) {
        print i ": " doc_name[i]
    }
}

The for loop executes its first argument (i = 2) once. Then, it repeats the following:

  • Evaluate the second argument (the i <= N condition).
  • If the condition is true, execute the loop body (here, a second for loop).
  • Evaluate the third argument (i++).

This is repeated until the condition is false, i.e., until i is greater than N.

Note the output of the array's elements by means of a counting for loop (i.e., for (i = 1; i <= 4; i++) {). A for (i in a) loop would have produced the elements in a nondeterministic order (i.e., there would have been no point in sorting them first).

$ sort.awk
1: Beverly
2: Julian
3: Katherine
4: Leonard

AWK supports if conditional statements. Their syntax is modeled on the C language:

if (ex_condition) {
    ex_commands
}

An if else conditional statement is done like so:

if (ex_condition) {
    ex_commands
} else {
    ex_commands
}
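A minimal sketch of if else (the input lines and the length threshold are illustrative), classifying each input line by its length:

```shell
printf 'hi\nhello there\n' | awk '{
    if (length($0) > 5) {
        print $0, "(long)"
    } else {
        print $0, "(short)"
    }
}'
```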

The data[$0]++ == 0 expression is a common idiom. For example, it is used here in an awk script to remove duplicate lines from files:

#!/usr/bin/env -S awk -f
# Remove duplicate lines

{
    if (data[$0]++ == 0) {
        lines[++count] = $0
    }
}

END {
    for (i = 1; i <= count; i++) {
        print lines[i]
    }
}

Above, if (data[$0]++ == 0) is true if $0's value is seen for the first time. The ++count expression is equal to count++, except that it returns the value of count after it has been incremented (count++ returns the value before incrementing it). This ensures that the first line seen has an index of 1, even though we do not explicitly set count to 1.

$ cat 'dupes.txt'
1
1
2
2
3
3
$ compact.awk 'dupes.txt'
1
2
3

split

The split command splits a file into pieces.

split ex_file

By default, ex_file is broken into pieces of 1,000 lines each with a default prefix of x (e.g., xaa, xab, xac, etc.). The original file is left unchanged.

The -l ex_integer (--lines=ex_integer) option can be used to specify a different number of lines (records) per output file, and a different prefix can be specified as an additional argument after the input file name:

$ cat 'data.txt'
01,Beverly,Crusher,TNG 2324
02,Julian,Bashir,DS9 2341
03,Leonard,McCoy,TOS 2227
04,Katherine,Pulaski,TNG 2318
05,Beverly,Picard,TNG 2324
$ split -l 2 --verbose 'data.txt' 'star_trek_'
creating file 'star_trek_aa'
creating file 'star_trek_ab'
creating file 'star_trek_ac'
$ cat 'star_trek_aa'
01,Beverly,Crusher,TNG 2324
02,Julian,Bashir,DS9 2341
$ cat 'star_trek_ab'
03,Leonard,McCoy,TOS 2227
04,Katherine,Pulaski,TNG 2318
$ cat 'star_trek_ac'
05,Beverly,Picard,TNG 2324

Above, the --verbose option is used to print a diagnostic message before each output file is opened.

paste

The paste command is used to merge together the lines of files.

paste ex_file...

paste causes every line of its file arguments to make up one column of the output.

$ cat 'dates.txt'
2324
2341
2227
2318
2324
$ cat 'names.txt'
Beverly Crusher
Julian Bashir
Leonard McCoy
Katherine Pulaski
Beverly Picard
$ paste 'dates.txt' 'names.txt' > 'dates_names.txt'
$ cat 'dates_names.txt'
2324    Beverly Crusher
2341    Julian Bashir
2227    Leonard McCoy
2318    Katherine Pulaski
2324    Beverly Picard

The above command takes the birth dates in dates.txt and the names in names.txt and joins them together into a third file, dates_names.txt using redirection.

The -s (--serial) option causes paste to append data in series, rather than parallel, i.e., it pastes the value of each line in the original file into the new output file horizontally, instead of vertically.

$ paste -s 'dates.txt' 'names.txt' > 'dates_names_s.txt'
$ cat 'dates_names_s.txt'
2324    2341    2227    2318    2324
Beverly Crusher Julian Bashir   Leonard McCoy   Katherine Pulaski   Beverly Picard

By default, paste uses a Tab as the field delimiter, but you can specify one or more alternative delimiters with the -d ex_delimiters (--delimiters=ex_delimiters) option (similar to the cut command).

A delimiter can be any single character (e.g., a , or a |), and each delimiter in the list is used in turn. When the list has been exhausted, paste begins again at the first delimiter. You can specify just one delimiter, i.e., a list is not required.
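A small sketch of the -d option (the /tmp paths and file contents are illustrative), joining two columns with a comma instead of the default Tab:

```shell
printf '2324\n2341\n' > /tmp/dates.txt
printf 'Crusher\nBashir\n' > /tmp/names.txt
# Use , as the delimiter between the pasted columns.
paste -d ',' /tmp/dates.txt /tmp/names.txt
rm /tmp/dates.txt /tmp/names.txt
```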

join

The join command joins lines of two files together based on a common field.

join ex_file_1 ex_file_2

The default join field is the first field, but this can be changed with the -j ex_join_field option.

$ cat 'dates_names.txt'
2324    Beverly Crusher
2341    Julian Bashir
2227    Leonard McCoy
2318    Katherine Pulaski
2324    Beverly Picard
$ cat 'dates_series.txt'
2324 TNG
2341 DS9
2227 TOS
2318 TNG
2324 TNG
$ join 'dates_names.txt' 'dates_series.txt' > 'st_doctors.txt'
$ cat 'st_doctors.txt'
2324 Beverly Crusher TNG
2341 Julian Bashir DS9
2227 Leonard McCoy TOS
2318 Katherine Pulaski TNG
2324 Beverly Picard TNG

The above command takes the Star Trek doctor names in dates_names.txt and the Star Trek doctor series in dates_series.txt and joins them on a common field (the characters' birth dates) into a new file, st_doctors.txt, using redirection.

The default delimiter for join is a Space, but a different delimiter can be set with the -t ex_delimiter option (like the sort command).
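A minimal sketch of the -t option (the /tmp paths and records are illustrative), joining two comma-separated files on their first field. Note that join expects its inputs to be sorted on the join field:

```shell
printf '2324,Crusher\n2341,Bashir\n' > /tmp/a.csv
printf '2324,TNG\n2341,DS9\n' > /tmp/b.csv
# Join on the first field, using , as the field delimiter.
join -t ',' /tmp/a.csv /tmp/b.csv
rm /tmp/a.csv /tmp/b.csv
```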

head

head outputs the first part of files.

head ex_file...

By default, the first ten lines of a file are displayed. However, this can be changed with the -n ex_integer (--lines=ex_integer) option:

$ cat 'data.txt'
01,Beverly,Crusher,TNG 2324
02,Julian,Bashir,DS9 2341
03,Leonard,McCoy,TOS 2227
04,Katherine,Pulaski,TNG 2318
05,Beverly,Picard,TNG 2324
$ head -n 2 'data.txt'
01,Beverly,Crusher,TNG 2324
02,Julian,Bashir,DS9 2341

By supplying a negative value to the -n option, head will print everything except the last ex_integer lines:

$ head -n -2 'data.txt'
01,Beverly,Crusher,TNG 2324
02,Julian,Bashir,DS9 2341
03,Leonard,McCoy,TOS 2227

head can also display the first bytes of a file with the -c ex_integer (--bytes=ex_integer) option:

$ head -c 10 'data.txt'
01,Beverly

tail

tail outputs the last part of files.

tail ex_file...

By default, the last ten lines of a file are displayed. As with the head command, you can change the number of lines to display with the -n ex_integer (--lines=ex_integer) option:

$ cat 'data.txt'
01,Beverly,Crusher,TNG 2324
02,Julian,Bashir,DS9 2341
03,Leonard,McCoy,TOS 2227
04,Katherine,Pulaski,TNG 2318
05,Beverly,Picard,TNG 2324
$ tail -n 2 'data.txt'
04,Katherine,Pulaski,TNG 2318
05,Beverly,Picard,TNG 2324

To read from a specific line to the end of the file, you can use the +ex_integer syntax with the -n option:

$ tail -n +2 'data.txt'
02,Julian,Bashir,DS9 2341
03,Leonard,McCoy,TOS 2227
04,Katherine,Pulaski,TNG 2318
05,Beverly,Picard,TNG 2324

head and tail can be used together to target increasingly specific lines in a file. For example, the following outputs lines 2 through 3 of data.txt:

$ head -n 3 'data.txt' | tail -n 2
02,Julian,Bashir,DS9 2341
03,Leonard,McCoy,TOS 2227

tail's -f (--follow) option can be used to follow a file(s). This can be useful when you are troubleshooting an issue and want to closely follow a log as it is updated.

od

The od command can dump files in octal and other formats.

od ex_file...

For example, take the /usr/bin/python3.7 file:

$ file /usr/bin/python3.7
/usr/bin/python3.7: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[sha1]=d8d5d37d3e53baef1a47596ab71690406d8a272d, stripped

By default, od displays the file in octal form:

$ od '/usr/bin/python3.7' | head -n 5
0000000 042577 043114 000402 000001 000000 000000 000000 000000
0000020 000002 000076 000001 000000 051360 000135 000000 000000
0000040 000100 000000 000000 000000 023600 000112 000000 000000
0000060 000000 000000 000100 000070 000013 000100 000033 000032
0000100 000006 000000 000004 000000 000100 000000 000000 000000

The -t ex_output_format (--format=ex_output_format) option can be used to specify a different format specification. The accepted values are:

  • a Named characters, ignoring high-order bit
  • b Octal bytes
  • c Printable characters or backslash escapes
  • u2 Unsigned decimal 2-byte units
  • fF Floats
  • dI Decimal ints
  • dL Decimal longs
  • o2 Octal 2-byte units
  • d2 Decimal 2-byte units
  • x2 Hexadecimal 2-byte units
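As a quick sketch of the c format (the input string is illustrative), printable characters appear as themselves and control characters as backslash escapes:

```shell
# The trailing newline shows up as \n in the dump.
printf 'AB\n' | od -t 'c'
```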

This is the /usr/bin/python3.7 file in hexadecimal form:

$ od -t 'x2' '/usr/bin/python3.7' | head -n 5
0000000 457f 464c 0102 0001 0000 0000 0000 0000
0000020 0002 003e 0001 0000 52f0 005d 0000 0000
0000040 0040 0000 0000 0000 2780 004a 0000 0000
0000060 0000 0000 0040 0038 000b 0040 001b 001a
0000100 0006 0000 0004 0000 0040 0000 0000 0000

strings

The strings command prints strings of printable characters in files.

strings ex_file...

For example, /usr/bin/python3.7 is not a text file, but an ELF 64-bit LSB executable file. If you try to view its content, you are going to see mostly gibberish:

$ cat '/usr/bin/python3.7' | head -n 5
ELF>�R]@�'J@8
             @�@@@@@h��@�@@@

                            BB�#�#0%0e0e��������?�����X
ȍ
 ��?����@��@�@DDP�td��9��y��y$$Q�tdR�td��?����``/lib64/ld-linux-x86-64.so.2GNUGNU���}>S���GYj��@m�'-��،Q
@�"X"8��SA��P$@��F��`D!����"QHB�

           �Q���!��*QP

� x��@���!!
5B�B@�8XX �CH�� 2 � "tB� � @�@P`!��pC
                                     �  D�

However, the strings command will extract and display any strings in the file:

$ strings '/usr/bin/python3.7' | head -n 5
/lib64/ld-linux-x86-64.so.2
"X"8
 "tB
@P`!
2B B

nl

nl writes the lines of a file to the standard output, with line numbers added.

nl ex_file...

By default, nl only numbers lines with data, not blank lines. You can instruct nl to number all lines of a file with the -ba (--body-numbering=ex_style) option (a is the ex_style meaning number all lines).

ex_style is one of:

  • a Number all lines
  • t Number only non-empty lines
  • n Number no lines
  • pex_regex Number only lines that contain a match for the basic regular expression ex_regex
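A minimal sketch of the a style (the input lines are illustrative); with -ba, the blank line in the middle is numbered too:

```shell
# Number every line, including the empty one.
printf 'one\n\ntwo\n' | nl -ba
```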

iconv

iconv converts text from one character encoding to another.

iconv -f ex_from_encoding_value -t ex_to_encoding_value ex_file... -o ex_output_file

This command uses several options:

-f ex_from_encoding_value, --from-code=ex_from_encoding_value
Use ex_from_encoding_value for input characters.
-t ex_to_encoding_value, --to-code=ex_to_encoding_value
Use ex_to_encoding_value for output characters.
-o ex_output_file, --output=ex_output_file
Use ex_output_file for output.

For example, the following changes the text encoding of the utf.txt file:

$ cat 'utf.txt'
abc ß ? € à?ç
$ file -i 'utf.txt'
utf.txt: text/plain; charset=utf-8
$ iconv -f UTF-8 -t ASCII//TRANSLIT 'utf.txt' -o 'ascii.txt'
$ cat 'ascii.txt'
abc ss ? EUR a?c
$ file -i 'ascii.txt'
ascii.txt: text/plain; charset=us-ascii

Above, file's -i (--mime) option causes the command to output mime type strings rather than the more traditional human readable ones.

The //TRANSLIT part of the iconv command instructs iconv to transliterate characters being encoded when needed and possible.

printf

printf prints an argument according to a specific format. It is available as both an internal and external command.

printf ex_format ex_argument...

ex_format controls the output as in C printf.

Interpreted sequences are:

\"
Double quote
\\
Backslash
\a
Alert (BELL)
\b
Backspace
\c
Produce no further output
\e
Escape
\f
Form feed
\n
New line
\r
Carriage return
\t
Horizontal tab
\v
Vertical tab
\NNN
Byte with octal value NNN (1 to 3 digits)
\xHH
Byte with hexadecimal value HH (1 to 2 digits)
\uHHHH
Unicode (ISO/IEC 10646) character with hex value HHHH (4 digits)
\UHHHHHHHH
Unicode character with hex value HHHHHHHH (8 digits)

Interpreted sequences also include all C format specifications ending with one of diouxXfeEgGcs, with ex_argument converted to proper type first (variable widths are handled).

Format Specifiers

Some of the most commonly used format specifiers include:

%%
A single %.
%b
ex_argument as a string with '\' escapes interpreted, except that octal escapes are of the form \0 or \0NNN.
%d
ex_argument as a signed decimal integer.
%f
ex_argument as a floating point number.
%s
ex_argument as a string with '\' escapes not interpreted.
%q
ex_argument is printed in a format that can be reused as shell input, escaping non-printable characters with the proposed POSIX $'' syntax.
%x
ex_argument as an unsigned hexadecimal integer with lowercase digits (commonly combined with zero-padding, e.g., %08x).
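To illustrate, the numeric specifiers accept the usual C-style flags, field widths, and precisions:

```shell
# %d: signed decimal, %x: lowercase hexadecimal, %05.2f:
# floating point, 5 characters wide with 2 decimal places,
# zero-padded on the left.
printf '%d %x %05.2f\n' 255 255 3.14159
# 255 ff 03.14
```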

Details about the available formats are in the documentation of the C library function, which can be viewed by running man 3 printf.

Examples

Unlike echo, printf does not automatically append a newline to its output:

$ echo 'Hello, world!'
Hello, world!
$ printf 'Hello, world!'
Hello, world!$

To add newlines, supply printf with a format string containing the newline escape sequence. When more arguments are given than the format string consumes, the format string is reused until all arguments are exhausted:

$ printf '%s\n' 'Hello, world!'
Hello, world!
$ printf '%s\n' 'Hello, world!' 'Even more' 'new' 'lines.'
Hello, world!
Even more
new
lines.

fmt and pr

The fmt command is used for simple optimal text formatting.

fmt -w ex_width_number ex_file...

The -w ex_width_number (--width=ex_width_number) option sets the maximum line width (the default is 75 columns). An abbreviated form of this option is -ex_width_number.
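For example, to reflow a long line to a narrower width (a minimal sketch; the exact break points depend on fmt's optimal-fit algorithm):

```shell
# Reflow the input so that no output line exceeds 25 columns;
# fmt chooses the break points, preserving every word.
printf 'The quick brown fox jumps over the lazy dog\n' | fmt -w 25
```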

The pr command converts text files for printing.

pr --columns=ex_column_number ex_file...

The --columns=ex_column_number option tells pr to output ex_column_number columns, filling each column from top to bottom unless the -a (--across) option is used to fill rows across instead. An abbreviated form of this option is -ex_column_number.
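pr also accepts -t (--omit-header), which suppresses the page header and trailer; this is handy when piping pr's output to another command. A minimal sketch:

```shell
# Print four lines in two balanced columns, without the
# page header that pr adds by default.
printf '1\n2\n3\n4\n' | pr -2 -t
```

Because the columns are filled downward, 1 and 2 land in the first column while 3 and 4 land in the second.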

Together, these commands can be used to format a file for printing by setting line width and the number of columns:

$ cat 'fruit.txt'
Apple
Watermelon
Orange
Pear
Cherry
Strawberry
Nectarine
Grape
Mango
Blueberry
Pomegranate
Plum
Banana
Raspberry
Mandarin
Jackfruit
Papaya
Kiwi
Pineapple
Lime
Lemon
Apricot
Grapefruit
Melon
Coconut
Avocado
Peach
$ fmt -11 'fruit.txt' | pr -3


2021-01-16 14:40                                                  Page 1


Apple           Blueberry       Pineapple
Watermelon      Pomegranate     Lime Lemon
Orange          Plum            Apricot
Pear            Banana          Grapefruit
Cherry          Raspberry       Melon
Strawberry      Mandarin        Coconut
Nectarine       Jackfruit       Avocado
Grape           Papaya          Peach
Mango           Kiwi

Documentation

For more on the text manipulation commands discussed here, refer to the Linux User's Manual, either at the command prompt or online.
