GNU/Linux has many commands that can filter text strings.
Note: If you are not familiar with the GNU/Linux command line interface, review the Conventions page before proceeding.
sort
The sort command sorts the lines in text files and sends the results to the standard output.
sort ex_file...
By default, the sorting rules are:
- Lines starting with a number appear before lines starting with a letter.
- Lines starting with a letter that appears earlier in the alphabet appear before lines starting with a letter that appears later in the alphabet.
- Lines starting with a lowercase letter appear before lines starting with the same letter in uppercase.
sort sorts lines lexicographically, considering each line as a whole, i.e., if the initial characters of two lines are equal, the first differing character within the lines determines their relative order.
Some of sort's helpful options include:
-b, --ignore-leading-blanks
- Ignore leading blanks in field contents (i.e., leading blanks are not considered part of a sort key).
-d, --dictionary-order
- Sort in dictionary order (i.e., only alphanumeric and blank characters are taken into account).
--debug
- Annotate the part of each line used to sort, and warn about questionable usage on the standard error.
-f, --ignore-case
- Fold lowercase letters to uppercase (i.e., ignore case when comparing).
- This option is relevant if your system is set up for bytewise sorting. For example, to have the sort command sort characters based on their ASCII order, you can set the LC_ALL localization variable to C (e.g., export LC_ALL=C). Then, to have sort ignore case, you can use the -f option. Run man 7 ascii to see a visualization of the ASCII character table.
-h, --human-numeric-sort
- Compare human-readable numbers (e.g., 2K, 1G).
-i, --ignore-nonprinting
- Consider only printable characters.
-k ex_keydef, --key=ex_keydef
- Sort according to ex_keydef, a key definition in the form ex_field_start.ex_character_ex_option,ex_field_end.ex_character_ex_option. ex_character (a character position in the field), ex_option (one or more single-letter ordering options, [bdfgiMhnRrV], which override the global ordering options for that key), and ex_field_end (a field number) are optional. The default value of ex_field_end is the end of the line.
-n, --numeric-sort
- Compare according to string numerical value, i.e., consider the key (field) value as a number and sort according to its numeric value (leading blanks are ignored), sorting numerically instead of lexicographically.
- A whole field is considered a single number, instead of being evaluated character by character (i.e., 123 is treated as the single number one hundred twenty-three, instead of the three digits 1, 2, and 3).
-o ex_file, --output=ex_file
- Write results to ex_file instead of the standard output (the filename may match the original input filename).
-r, --reverse
- Reverse the result of comparisons, i.e., sort in reverse (descending) order.
-u, --unique
- Output only the first of a run of equal lines.
-t ex_delimiter, --field-separator=ex_delimiter
- Specify the field delimiter (e.g., -t :).
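The -o option deserves a quick demonstration, because redirecting output back to the input file (sort ex_file > ex_file) truncates the file before sort reads it, while -o is safe. A minimal sketch, using a hypothetical throwaway file:

```shell
# Sorting a file in place with -o; the output filename may match the input.
printf '3\n1\n2\n' > /tmp/sort_demo.txt
sort -n -o /tmp/sort_demo.txt /tmp/sort_demo.txt
cat /tmp/sort_demo.txt
```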
Lexicographical Sort
Assume a data.txt file:
$ cat 'data.txt'
a
b
c
1
2
3
A
B
C
Running sort on this file results in a lexicographical sort:
$ sort 'data.txt'
1
2
3
a
A
b
B
c
C
When we add the -f option, we see the same lexicographical sort:
$ sort -f 'data.txt'
1
2
3
a
A
b
B
c
C
Bytewise Sort
Bytewise sorting can be enabled by setting the LC_ALL localization variable to C:
$ export LC_ALL=C
$ sort 'data.txt'
1
2
3
A
B
C
a
b
c
This will result in a sorting based on the ASCII character table (i.e., numbers first, then uppercase letters, followed by lowercase letters).
After bytewise sorting is enabled, you can see the result of using the -f option to tell sort to ignore case:
$ sort -f 'data.txt'
1
2
3
A
a
B
b
C
c
Numeric Sort
By default, sort does not treat numbers as whole values, but as individual numeric characters. For example, assume the following data.txt file:
$ cat 'data.txt'
2
22
12
1
Sorting this file without any additional options does not produce what many would consider the expected result:
$ sort 'data.txt'
1
12
2
22
To have sort treat each of these lines' sorting fields as whole numeric values, add the -n option:
$ sort -n 'data.txt'
1
2
12
22
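The related -h option does the same for human-readable sizes, ranking suffixes like K, M, and G by magnitude. A small sketch, using a hypothetical file of sizes:

```shell
# sort -h orders human-readable numbers by magnitude: 2K < 100M < 1G.
# A plain sort (or sort -n) would not rank the suffixes correctly.
printf '1G\n2K\n100M\n' > /tmp/sizes_demo.txt
sort -h /tmp/sizes_demo.txt
```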
Sort by Key (Field)
A line's first space is considered a field separator (i.e., a space is the default delimiter for the sort command); any following spaces become part of the next field's value.
$ cat 'data.txt'
01,Beverly,Crusher,TNG 2324
02,Julian,Bashir,DS9 2341
03,Leonard,McCoy,TOS 2227
04,Katherine,Pulaski,TNG 2318
05,Beverly,Picard,TNG 2324
$ sort 'data.txt'
01,Beverly,Crusher,TNG 2324
02,Julian,Bashir,DS9 2341
03,Leonard,McCoy,TOS 2227
04,Katherine,Pulaski,TNG 2318
05,Beverly,Picard,TNG 2324
Above, sort sees each line in data.txt as having two fields, i.e., everything up until the first space, and the space and everything after it (e.g., 01,Beverly,Crusher,TNG and  2324 for the first line in data.txt).
To sort a line by a specific field, use the -k option. sort will operate from the field you specify to the end of the line. The following sorts data.txt's content by first name (i.e., the second field), where the delimiter has been specified as a comma (,):
$ sort -k 2 -t ',' 'data.txt'
01,Beverly,Crusher,TNG 2324
05,Beverly,Picard,TNG 2324
02,Julian,Bashir,DS9 2341
04,Katherine,Pulaski,TNG 2318
03,Leonard,McCoy,TOS 2227
To get a nice visualization of where sort begins and ends its sorting evaluation, add the --debug option:
$ sort --debug -k 2 -t ',' 'data.txt'
sort: using ‘en_US.UTF-8’ sorting rules
01,Beverly,Crusher,TNG 2324
________________________
___________________________
05,Beverly,Picard,TNG 2324
_______________________
__________________________
02,Julian,Bashir,DS9 2341
______________________
_________________________
04,Katherine,Pulaski,TNG 2318
__________________________
_____________________________
03,Leonard,McCoy,TOS 2227
______________________
_________________________
Above, we can see that sort begins its evaluation of each line at the second field (where fields are delimited by a comma) and continues until the end of each line.
For lines that are equal based on the field(s) you specified, sort goes back to the beginning of the line and evaluates the other fields as a tie breaker. This is demonstrated in the example above by the second horizontal line underneath each line of text in data.txt.
To sort by a specific character in a specific field, append a . and the character position within that field to the sort field in question. For example, the following command sorts by the fifth character of the fourth field (i.e., by each character's birth year; note the addition of the -n option to ensure that the birth year values are treated as whole numbers and not as individual numeric characters):
$ sort -k 4.5 -n -t ',' 'data.txt'
03,Leonard,McCoy,TOS 2227
04,Katherine,Pulaski,TNG 2318
01,Beverly,Crusher,TNG 2324
05,Beverly,Picard,TNG 2324
02,Julian,Bashir,DS9 2341
If you only want sort to evaluate a line from one specific point to another, separate the start and end fields with a , in the key definition. This example sorts each line by first name and last name (the --debug option is added to illustrate the new sort evaluation area):
$ sort --debug -k 2,3 -t ',' 'data.txt'
sort: using ‘en_US.UTF-8’ sorting rules
01,Beverly,Crusher,TNG 2324
_______________
___________________________
05,Beverly,Picard,TNG 2324
______________
__________________________
02,Julian,Bashir,DS9 2341
_____________
_________________________
04,Katherine,Pulaski,TNG 2318
_________________
_____________________________
03,Leonard,McCoy,TOS 2227
_____________
_________________________
As previously mentioned, for lines that are equal based on the field(s) you specify, sort goes back to the beginning of the line to evaluate other fields as a tie breaker.
Multiple key fields can be specified with multiple -k options. The following sorts by character first name, followed by row number, using reverse order:
$ sort -k 2 -k 1 -r -t ',' 'data.txt'
03,Leonard,McCoy,TOS 2227
04,Katherine,Pulaski,TNG 2318
02,Julian,Bashir,DS9 2341
05,Beverly,Picard,TNG 2324
01,Beverly,Crusher,TNG 2324
uniq
The uniq command displays lines in a file after omitting repeated contiguous lines (i.e., only the first instance of a repeated contiguous line is shown):
uniq ex_file
Unlike most commands, uniq can only accept one file as an argument. If a second argument is given, uniq treats it as an output file, instead of sending its output to the standard output.
Since uniq only filters out duplicate lines when they are contiguous, it often makes sense to use uniq with the sort command:
sort ex_file | uniq
Alternatively, you could just use sort's -u option for the same effect:
sort -u ex_file...
Display Unique Lines
To only display unique lines in a file (i.e., those that are not repeated), use uniq's -u (--unique) option:
sort ex_file | uniq -u
Display Single Instance of Repeated Lines
To display a single instance of each repeated line in a file, use uniq's -d (--repeated) option:
sort ex_file | uniq -d
Display All Instances of Repeated Lines
To display all instances of repeated lines in a file, use uniq's -D option:
sort ex_file | uniq -D
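The difference between -u and -d is easiest to see side by side. A minimal sketch, using a hypothetical file that is already sorted:

```shell
printf 'a\nb\nb\nc\n' > /tmp/uniq_demo.txt
uniq -u /tmp/uniq_demo.txt   # lines that are never repeated: a, c
uniq -d /tmp/uniq_demo.txt   # one copy of each repeated line: b
```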
expand and unexpand
The expand command converts tabs to a set number of spaces:
expand ex_file...
The default number of spaces is 8, but you can change this with the -t (--tabs=ex_integer) option (e.g., -t 5).
To convert runs of spaces in a file back to tabs, use the unexpand command (by default, only leading blanks are converted; use its -a option to convert all):
unexpand ex_file...
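A minimal sketch of expand with a custom tab stop, using a hypothetical one-line file:

```shell
# With -t 4, tab stops are every 4 columns; a tab after the single
# character 'a' is filled with spaces up to column 5.
printf 'a\tb\n' > /tmp/tab_demo.txt
expand -t 4 /tmp/tab_demo.txt
```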
cut
The cut command removes sections (columns) from each line of files:
cut ex_file...
Options are used to specify what kind of section it removes: characters (-c, --characters=ex_list), bytes (-b, --bytes=ex_list), or fields (-f, --fields=ex_list).
A column is equivalent to a character, and a field is a collection of columns. The -c, -b, and -f options are mutually exclusive.
cut's output will always contain the columns in the same order as the input, not the order specified in the command. For example, the following command displays the second and fourth characters of each line in the data.txt file:
$ cat 'data.txt'
01,Beverly,Crusher,TNG 2324
02,Julian,Bashir,DS9 2341
03,Leonard,McCoy,TOS 2227
04,Katherine,Pulaski,TNG 2318
05,Beverly,Picard,TNG 2324
$ cut -c 4,2 'data.txt'
1B
2J
3L
4K
5B
Even though the fourth character was specified first in the cut command above, the specified columns still appear in the command output in the same order as they do in the data.txt file (i.e., character 2 and then character 4).
Cutting a Character Range
A hyphen (-) can be used to specify a character range. The following extracts just the row number of each Star Trek character in data.txt:
$ cut -c 1-2 'data.txt'
01
02
03
04
05
Cutting Specific Columns
Cutting specific columns and ranges can be used together. The following displays the row number and first three letters of each Star Trek character's first name:
$ cut -c 1-2,4-6 'data.txt'
01Bev
02Jul
03Leo
04Kat
05Bev
If character ranges ever overlap, every input character is still output at most once.
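Overlap behavior can be sketched with a quick pipeline (hypothetical input):

```shell
# Ranges 1-3 and 2-5 overlap; cut outputs the union of columns 1-5,
# each character at most once, in input order.
printf 'abcdef\n' | cut -c 1-3,2-5
```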
Cutting Specific Fields and Changing the Field Delimiter
By default, cut's field delimiter is the Tab character. A different delimiter can be set with the -d (--delimiter=ex_delimiter) option. For example, the following sets a comma (,) as the field delimiter and cuts the first and last names of each Star Trek character (i.e., the second and third fields) from the data.txt file:
$ cut -d ',' -f 2,3 'data.txt'
Beverly,Crusher
Julian,Bashir
Leonard,McCoy
Katherine,Pulaski
Beverly,Picard
Cutting a Field Range
As with characters, you can specify a field range with a hyphen (-). The following displays the last name and series/birth year of each Star Trek character:
$ cut -d ',' -f 3-4 'data.txt'
Crusher,TNG 2324
Bashir,DS9 2341
McCoy,TOS 2227
Pulaski,TNG 2318
Picard,TNG 2324
Setting an Output Delimiter
cut allows you to specify different input and output delimiters. The --output-delimiter option sets the output delimiter. The following example uses a , as the input delimiter and a space as the output delimiter:
$ cut -d ',' -f 2,3 --output-delimiter=' ' 'data.txt'
Beverly Crusher
Julian Bashir
Leonard McCoy
Katherine Pulaski
Beverly Picard
If a line in a file is missing the specified delimiter character, cut outputs the line in its entirety. You can suppress these lines with cut's -s (--only-delimited) option.
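A minimal sketch of -s, using a hypothetical file in which one line lacks the delimiter:

```shell
printf 'a,b\nno delimiter here\n' > /tmp/cut_s_demo.txt
cut -d ',' -f 1 /tmp/cut_s_demo.txt      # the delimiter-less line passes through whole
cut -d ',' -f 1 -s /tmp/cut_s_demo.txt   # -s suppresses it
```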
tr
The tr command translates (or deletes) characters in each line of a file. It does not act on whole words. Unlike most commands, tr cannot take a file as an argument; it can only receive input via redirection or piping.
The following example changes all commas in data.txt to colons:
$ cat 'data.txt'
01,Beverly,Crusher,TNG 2324
02,Julian,Bashir,DS9 2341
03,Leonard,McCoy,TOS 2227
04,Katherine,Pulaski,TNG 2318
05,Beverly,Picard,TNG 2324
$ tr ',' ':' < 'data.txt'
01:Beverly:Crusher:TNG 2324
02:Julian:Bashir:DS9 2341
03:Leonard:McCoy:TOS 2227
04:Katherine:Pulaski:TNG 2318
05:Beverly:Picard:TNG 2324
Translating Character Classes and Ranges
Besides individual characters, tr can also accept character classes (e.g., [:upper:]) and ranges (e.g., a-z). The following translates all lowercase alphabetic characters in data.txt to uppercase:
$ tr '[:lower:]' '[:upper:]' < 'data.txt'
01,BEVERLY,CRUSHER,TNG 2324
02,JULIAN,BASHIR,DS9 2341
03,LEONARD,MCCOY,TOS 2227
04,KATHERINE,PULASKI,TNG 2318
05,BEVERLY,PICARD,TNG 2324
Alternatively, you could do:
$ tr 'a-z' 'A-Z' < 'data.txt'
01,BEVERLY,CRUSHER,TNG 2324
02,JULIAN,BASHIR,DS9 2341
03,LEONARD,MCCOY,TOS 2227
04,KATHERINE,PULASKI,TNG 2318
05,BEVERLY,PICARD,TNG 2324
Squeezing Characters
The -s option can be used to squeeze runs of repeated adjacent characters down to a single occurrence. For example, if an extra 0 was prepended to the beginning of each line of data.txt, it could be filtered out like so:
$ cat 'data.txt'
001,Beverly,Crusher,TNG 2324
002,Julian,Bashir,DS9 2341
003,Leonard,McCoy,TOS 2227
004,Katherine,Pulaski,TNG 2318
005,Beverly,Picard,TNG 2324
$ tr -s '0' < 'data.txt'
01,Beverly,Crusher,TNG 2324
02,Julian,Bashir,DS9 2341
03,Leonard,McCoy,TOS 2227
04,Katherine,Pulaski,TNG 2318
05,Beverly,Picard,TNG 2324
Deleting Characters
The -d option is used to filter out (delete) specified characters. The following filters out the 0 character from each line in data.txt:
$ cat 'data.txt'
01,Beverly,Crusher,TNG 2324
02,Julian,Bashir,DS9 2341
03,Leonard,McCoy,TOS 2227
04,Katherine,Pulaski,TNG 2318
05,Beverly,Picard,TNG 2324
$ tr -d '0' < 'data.txt'
1,Beverly,Crusher,TNG 2324
2,Julian,Bashir,DS9 2341
3,Leonard,McCoy,TOS 2227
4,Katherine,Pulaski,TNG 2318
5,Beverly,Picard,TNG 2324
Creating Complement Substitution Sets
tr can be used to create complement substitution sets (i.e., to filter out all characters in a line except those you specify) with the -c (--complement) option. The following example grabs the first line of data.txt and pipes it to tr, which deletes all non-alphabetic characters (except for the newline character, \n):
$ head -n 1 'data.txt' | tr -cd 'a-zA-Z\n'
BeverlyCrusherTNG
Remove All Non-Printable Characters
The tr command can accomplish useful tasks that are common when working with large text files. For example, it can remove all non-printable characters from a file:
tr -cd '[:print:]' < ex_file
Join All Lines Into a Single Line
Also, tr can join all lines in a file into a single line:
tr -s '\n' ' ' < ex_file
Above, tr replaces newlines with spaces, and the -s option squeezes any run of redundant spaces into a single space (the -s option only operates on the last specified set, i.e., ' ' in this example).
sed
The sed command is a stream editor. Stream editors perform basic text transformations on an input stream (e.g., a file or input from a pipeline). The basic syntax for sed is:
sed ex_script ex_file...
sed takes a number of instructions (i.e., ex_script), reads its input line by line, applies the instructions to the input lines where appropriate, and finally outputs the lines in a (possibly) modified form.
sed only reads its input file (if there is one) and never modifies it. If an input file is not provided, sed filters the contents of the standard input. By default, sed sends its output to the standard output.
Helpful sed options include:
-e ex_script, --expression=ex_script
- Add ex_script to the commands to be executed. This option can be used multiple times in a command to specify multiple sed scripts.
-f ex_script_file, --file=ex_script_file
- Add the contents of ex_script_file to the commands to be executed. This option can be used multiple times in a command to specify multiple files.
-i, --in-place
- Edit files in place. A backup is made if an optional ex_suffix is supplied as an argument (e.g., -iex_suffix, --in-place=ex_suffix; note that GNU sed does not allow a space between -i and the suffix).
-n, --quiet, --silent
- Suppress automatic printing of the pattern space. This option can be overridden with an explicit p (print) command.
- For example, sed -ne '1,10p' ex_file is equivalent to the head command, i.e., only the first 10 lines of ex_file are sent to the standard output. However, it is not as efficient as head, because every line of ex_file is read before sed finishes generating its output.
-s, --separate
- Consider files as separate, rather than as a single continuous long stream. Every input file starts with line number 1.
How sed Works
sed maintains two data buffers:
- Active Pattern Space
- Auxiliary Hold Space
By default, both data buffers are empty. sed performs the following cycle on each line of input:
- First, sed reads one line from the input stream, removes any trailing newline, and places it in the pattern space.
- Then, commands are executed. Each command can have an address associated with it. An address acts as a condition, and the command is only executed if the condition holds for the line in the pattern space.
- When the end of the script is reached (unless the -n option is used), the contents of the pattern space are printed out to the output stream, with the trailing newline added back if it was removed.
- The next cycle starts for the next input line.
Unless a command like D is used, the pattern space is deleted between two cycles (if the pattern space contains no newlines, the D command starts a normal new cycle; the d command deletes the pattern space and immediately starts a new cycle). On the other hand, the hold space keeps its data between cycles.
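The hold space can be seen in action with the classic idiom for printing input in reverse (a sketch, assuming GNU sed; h overwrites the hold space with the pattern space, and G appends the hold space to the pattern space):

```shell
# Reverse the lines of the input using the hold space (like tac):
# h  - copy the pattern space into the hold space
# G  - append the hold space to the pattern space (skipped on line 1)
# $p - on the last line, print the accumulated (reversed) result
printf '1\n2\n3\n' | sed -n '1!G;h;$p'
```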
sed Programs
A sed program consists of one or more sed commands, passed in by one or more of the -e and -f options, or by the first non-option argument if neither of these options is used. The sed script is the in-order concatenation of all the scripts and script files passed to sed.
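A quick sketch of that concatenation, with hypothetical input:

```shell
# Two -e scripts run in order on each line: the first capitalizes the h,
# the second then uppercases the o, yielding "HellO".
printf 'hello\n' | sed -e 's/h/H/' -e 's/o/O/'
```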
Selecting Lines With sed
For sed, a number used as a line address is treated as a line number. The first input line is number 1, and the count continues from there, even if the input consists of several files.
Line addresses in a sed script can take the following forms:
ex_number
- Specifying a line number will only match that line in the input. Unless the -i or -s options are specified, sed counts lines continuously across all input files (e.g., sed -e '3p' -n 'data.txt' suppresses all output lines of data.txt except for line 3, which sed is explicitly told to print).
ex_first_line~ex_step
- A GNU extension of sed that matches every ex_step-th line starting with line ex_first_line. Essentially, ex_first_line is the starting point and ex_step is the step (e.g., sed -e '1~2!d' 'data.txt' removes all lines from data.txt's sed output except for the odd-numbered lines).
$
- Matches the last line of the last input file, or the last line of each file when the -i or -s options are specified (e.g., sed -e '5,$p' -n 'data.txt' prints out the content of data.txt from line 5 to the end of the file).
/ex_regex/
- Selects any line that matches the regular expression ex_regex (e.g., sed -e '/TNG/p' -n 'data.txt' prints out any lines in data.txt that have TNG in them). If ex_regex itself includes any / characters, each must be escaped with a \.
- The empty regular expression // repeats the last regular expression match (the same is true if the empty regular expression is passed to the s command, i.e., the substitute command). Keep in mind, modifiers to regular expressions are evaluated when the regular expression is compiled, so it is invalid to specify them together with the empty regular expression.
\%ex_regex%
- Matches the regular expression ex_regex, but allows you to use a delimiter other than / (e.g., sed -e '\%DS9%!d' 'data.txt' removes all lines from data.txt's sed output except for those that have DS9 in them). This is useful if ex_regex itself contains /s. If ex_regex contains any delimiter characters, each must be escaped with a \.
- The % character may be replaced by any other single character.
/ex_regex/I, \%ex_regex%I
- A GNU extension that causes ex_regex to be matched in a case-insensitive manner.
/ex_regex/M, \%ex_regex%M
- A GNU extension that causes ^ and $ to respectively match (in addition to the normal behavior) the empty string after a newline and the empty string before a newline. There are special character sequences (i.e., \` and \') that always match the beginning or end of the buffer. M is short for multi-line.
If no line addresses are provided to sed, then all lines are matched (e.g., sed -e '' 'data.txt' prints out all lines of data.txt). If one line address is given, then only lines matching that address are matched (e.g., sed -e '3!d' 'data.txt' removes all lines from data.txt's sed output except for line 3).
A line address range can be specified by providing two line addresses separated by a comma (,). A line address range matches lines starting from where the first address matches, and continues until the second address matches (inclusively). For example, sed -e '2,4!d' 'data.txt' removes all lines from data.txt's sed output except for lines 2 through 4.
If the second address is a regular expression, then checking for the ending match starts with the line following the line that matched the first address; hence, the range can never consist of just the first input line. You can get around this by using the 0,/ex_regex/ form.
If the first address is a regular expression too, then once a line matching the second expression is found, sed continues looking for another line matching the first expression, as there might be another matching range of lines.
A range will always span at least two lines (except if the input stream ends).
If the second address is a number less than or equal to the number of the line matching the first address, then only that one line is matched.
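The difference the 0,/ex_regex/ form makes is easiest to see with a hypothetical file whose first line already matches the regular expression (a sketch, assuming GNU sed):

```shell
printf 'TNG\na\nb\nTNG\n' > /tmp/sed_range_demo.txt
# /TNG/,/TNG/ cannot end on its own starting line, so the range
# runs from line 1 to the next TNG on line 4 (all four lines print):
sed -n '/TNG/,/TNG/p' /tmp/sed_range_demo.txt
# 0,/TNG/ may end on the very first line, so only line 1 prints:
sed -n '0,/TNG/p' /tmp/sed_range_demo.txt
```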
GNU/Linux sed supports special two-address forms, as well:
0,/ex_regex/
- sed will try to match ex_regex in the first input line, too. That is, 0,/ex_regex/ is similar to 1,/ex_regex/, except that if the second address matches the very first line of input, the 0,/ex_regex/ form will consider it to end the range, while the 1,/ex_regex/ form will match the beginning of its range and make the range span up to the second occurrence of the regular expression.
ex_address,+ex_number
- Matches ex_address and the ex_number lines following ex_address.
ex_address,~ex_number
- Matches ex_address and the lines following ex_address until the next line whose input line number is a multiple of ex_number.
Appending the ! character to the end of a line address specification negates the sense of the match, i.e., if the ! character follows a line address range, then only lines that do not match the range will be selected (e.g., sed -e '2!p' -n 'data.txt' prints all lines of data.txt except for line 2). This also works for single line addresses and, as a GNU extension, for the null address.
These are examples of range addressing:
1,/^$/
- This selects all lines up to the first empty line. This can be useful to extract the header of an email message (which, by definition, always finishes with an empty line, but does not itself contain empty lines). Here, ^$ is a regular expression signifying an empty line. An example command to extract an email message header from a file is sed -e '1,/^$/p' -n ex_file.
- You can also use the ^$ regular expression to delete all empty lines in a file with sed '/^$/d'.
1,$
- This describes all input lines (e.g., sed -e '1,$!d' 'data.txt' removes all lines from data.txt's sed output except for lines 1 through the last line of the file, i.e., everything).
/^BEGIN/,/^END/
- This describes each range of lines starting at one beginning with the text BEGIN, up to one beginning with the text END (inclusively). Here, ^BEGIN and ^END are regular expression patterns (e.g., sed '/^BEGIN/,/^END/!d' ex_file removes all lines from ex_file's sed output except for those ranges).
sed and Regular Expressions
Certain regular expression constructs must be escaped with a backslash (\) when used with sed (which uses basic regular expressions by default):
\+
\?
\{i\}
\{i,j\}
\{i,\}
\(ex_regex\)
ex_regex\|ex_regex
\ex_digit (a back-reference to the text matched by the ex_digit-th \(...\) group)
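For example, escaped grouping and a back-reference from the list above can be combined to match a doubled string (a sketch with hypothetical input):

```shell
# \(ab\) captures a group; \1 requires the same text to appear again,
# so \(ab\)\1 matches "abab".
printf 'abab\n' | sed -e 's/\(ab\)\1/doubled/'
```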
sed Commands
sed commands in a script or script file can be separated by semicolons (;) or newlines. However, some commands cannot be followed by a semicolon as a command separator, and should be terminated with a newline or placed at the end of a script or script file. Commands can be preceded by optional, non-significant whitespace characters.
A sed command consists of an optional address or address range, followed by a one-character command name and any additional parameters. Line addresses determine which lines the command applies to.
These are some helpful sed commands:
a\ex_text
- The append command. Queue the lines of text that follow this command (each but the last ending with a \, which is removed from the output) to be output at the end of the current cycle, or when the next input line is read. a is a one-address command (e.g., sed -e '5a\06,The,Doctor,VYG 2371' 'data.txt' adds the line 06,The,Doctor,VYG 2371 after line 5 of data.txt).
- Escape sequences in the text are processed, so you should use \\ in the text to print a single backslash.
- If there is something other than a whitespace-\ sequence between the a and the newline, then the text of this line, starting at the first non-whitespace character after the a, is taken as the first line of the text block. (This enables a simplification in scripting a one-line add.) This extension also works with the i and c commands.
c\ex_text
- The change command. Delete the lines matching the address or address range, and output the lines of text which follow this command (each but the last ending with a \, which is removed from the output) in place of the last line (or in place of each line, if no addresses were specified), e.g., sed -e '5c\05,The,Doctor,VYG 2371' 'data.txt' replaces line 5 of data.txt with 05,The,Doctor,VYG 2371.
- A new cycle is started after this command is done, since the pattern space will have been deleted.
d
- The delete command. Delete the pattern space and immediately start a new cycle.
- This is useful if you want to suppress input lines so that sed does not output them. For example, sed -e '11,$d' ex_file suppresses lines starting at line 11, essentially making it analogous to the head command, i.e., only the first 10 lines of ex_file are sent to the standard output.
i\ex_text
- The insert command. Immediately output the lines of text that follow this command (each but the last ending with a \, which is removed from the output). i is a one-address command (e.g., sed -e '1i\Star Trek Doctors\n' 'data.txt' inserts Star Trek Doctors and a newline before line 1 of data.txt).
p
- The print command. Print the pattern space to the standard output. Usually, this command is only used in conjunction with the -n option.
n
- The next command. If auto-print is enabled, print the pattern space. Then, replace the pattern space with the next line of input. If there is no more input, sed exits without processing any more commands. For example, sed -e 'n' 'data.txt' simply displays the contents of data.txt.
q
- The quit command. Exit sed without processing any more commands or input (e.g., sed -e '10q' ex_file is an efficient alternative to the head command). This command only accepts a single address, and the current pattern space is printed if the -n option is not used. An optional ex_exit_code may be supplied to this command as an argument.
y/ex_source_characters/ex_destination_characters/
- The transliterate command. Transliterate any characters in the pattern space that match any of the ex_source_characters into the corresponding character in ex_destination_characters.
- Instances of the / (or whatever other character is used instead), \, or newlines can appear in the ex_source_characters or ex_destination_characters lists, provided that each instance is escaped with a \. The ex_source_characters and ex_destination_characters lists must contain the same number of characters (after de-escaping). The / characters may be uniformly replaced by any other single character within any given y command.
- For example, sed -e 'y/, /|:/' 'data.txt' changes ,s to |s and spaces to :s in data.txt.
{ ex_commands }
- A group of commands. This is useful when you want a group of commands to be triggered by a single line address (or line address range) match.
In traditional sed syntax, sed expects a backslash (\) after the a, i, or c commands, immediately preceding the end of the line. If more than one line is to be inserted, each of these lines, except for the last, must also be terminated with a backslash immediately preceding the end of the line.
For example, you can insert a blank line after every input line that ends with a capital letter with sed -e '/[A-Z]$/a\\' ex_file. Here, the \ after a\, with no content to append before it, represents a blank line.
The a and i commands are one-address commands, i.e., they allow only addresses matching a single line, not ranges. However, there may be several input lines that match the one address given with a and i, and each will be processed. This one address may include regular expressions. With the c command, an address range implies that the entire range is to be replaced.
The y command makes sed replace single characters. It does not allow for ranges, like a-z, so it is not a good replacement for the tr command.
The s Command
The s (substitute) command is a powerful sed command that allows substitution of text matching a regular expression with a character string whose composition may dynamically change. Its syntax is:
's/ex_regex/ex_replacement/ex_flags'
The s command tries to match the pattern space against ex_regex. If the match is successful, the portion of the pattern space that was matched is replaced with ex_replacement.
The / separator may be replaced by any other single character within any s command. You just need to be consistent and use the same separator character three times. Only spaces and newline characters are not allowed as separators. The / character (or whichever character replaces it) may only appear in ex_regex or ex_replacement if it is preceded by a \ character.
For regular expressions used as line addresses, the / separators are mandatory (unless the \%ex_regex% form described earlier is used).
This example replaces the abbreviation TNG with the phrase The Next Generation in the data.txt file:
sed -e 's/\<TNG\>/The Next Generation/g' 'data.txt'
\<TNG\> is a regular expression that ensures that only the abbreviation TNG, and not words that merely contain the characters TNG, is targeted.
Note that \<, \>, ^, and $ correspond to empty strings. In particular, $ is not the actual newline character at the end of a line, but the empty string "" immediately preceding the newline character. Therefore, s/$/|/ inserts a | immediately before the end of the line, instead of replacing the end of the line with it (e.g., sed -e 's/$/|/' 'data.txt' substitutes the empty string at the end of each line in data.txt with a |).
ex_replacement
can contain \ex_number
references (where ex_number
is a number from 1
through 9
, inclusive), which refers to the portion of the match that is contained between the ex_number
th \(
and its matching \)
.
For example, the following sed
command can take a full HTTP or HTTPS URL as input and just return the protocol portion:
sed -e 's,^\(.*://\).*,\1,'
ex_replacement
may also contain unescaped &
characters that reference the whole matched portion of the pattern space.
In addition, ex_replacement
may include the following special GNU extension sequences:
\E
- Stop case conversion started by
\L
or\U
. \L
- Turn the replacement to lowercase until a
\U
or\E
is found. \l
- Turn the next character to lowercase.
\U
- Turn the replacement to uppercase until a
\L
or\E
is found. \u
- Turn the next character to uppercase.
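A quick sketch of these GNU extensions in action (the input string is just an illustration; \w
, which matches a word character, is also a GNU extension):

```shell
# \U uppercases the whole replacement (& is the matched text)
echo 'beverly crusher' | sed -e 's/.*/\U&/'
# \u uppercases only the next character; applied per word via the g flag
echo 'beverly crusher' | sed -e 's/\w\+/\u&/g'
```

The first command prints BEVERLY CRUSHER; the second capitalizes each word, printing Beverly Crusher.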
To include a literal \
, &
, or newline in ex_replacement
, precede the desired \
, &
, or newline in ex_replacement
with a \
.
The s
command can be followed by zero or more of the following flags (modifiers):
ex_number
- Only replace the
ex_number
th match ofex_regex
. e
- Allows one to pipe input from a shell command into pattern space. If a substitution was made, the command that is found in pattern space is executed and pattern space is replaced with its output. A trailing newline is suppressed. Results are undefined if the command to be executed contains a null character.
- The
e
flag is a GNUsed
extension. g
- Apply the replacement to all matches of
ex_regex
, not just the first. g
stands for global.I
ori
- A GNU extension that makes
sed
matchex_regex
in a case-insensitive manner. M
orm
- A GNU extension that causes
^
and$
to match (in addition to normal behavior) the empty string after a newline and the empty string before a newline, respectively. There are special character sequences (i.e.,\`
,\'
) that always match the beginning or end of the buffer. M
stands for multi-line.p
- If the substitution was made, then print the new pattern space.
w ex_file
- If the substitution was made, then write out the result to
ex_file
. Two special values ofex_file
are supported on GNU/Linux: (1)/dev/stderr
, which writes the result to the standard error, and (2)/dev/stdout
, which writes the result to the standard output.
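For instance, the ex_number
flag can target one specific match (the input is illustrative; combining a number with g
is a GNU extension):

```shell
# Replace only the second match on the line
echo 'a a a' | sed -e 's/a/b/2'
# A number combined with g replaces from that match onward (GNU extension)
echo 'a a a' | sed -e 's/a/b/2g'
```

The first command prints a b a; the second prints a b b.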
Normally, the s
command replaces only the first match on every line. If the g
(global) flag is appended to the s
command, it replaces every occurrence of the search pattern on each line.
For example, the following sed
command matches words ([A-Za-z]\+
), then replaces them with quoted versions ("&"
). Since the g
flag is used, every echoed word is subject to replacement:
$ echo 'Every word quoted.' | sed -e 's/[A-Za-z]\+/"&"/g'
"Every" "word" "quoted".
Above, \+
in the regular expression matches one or more of the preceding bracket expression (so [A-Za-z]\+
matches a whole word), and the ampersand (&
) references the whole matched portion of the pattern space (i.e., the words that are matched by the [A-Za-z]\+
regular expression).
Another useful flag is p
(print), which outputs the line after replacement, like the p
command. The following example deletes the first word from all lines beginning with A
(here, a word is a sequence of letters):
$ cat 'data.txt'
Oranges are acidic.
Apples are sweet.
Plums are juicy.
Apricots are dry.
Pineapples are sweet and juicy.
$ sed -e '/^[^A]/p; /^A/s/[A-Za-z]\+//p' -n 'data.txt'
Oranges are acidic.
are sweet.
Plums are juicy.
are dry.
Pineapples are sweet and juicy.
/^[^A]/p
prints out the lines that do not begin with an A
. /^A/s/[A-Za-z]\+//p
matches all lines that start with an A
, and then replaces the first word in that line with nothing. Next, the replacement line is printed.
awk
AWK is a programming language for text file processing. GNU/Linux distributions do not contain AT&T's original awk
, but a compatible version called gawk
(GNU awk
; awk
is likely a symbolic link to either /etc/alternatives/awk
or /usr/bin/gawk
on your GNU/Linux system).
awk
has the following syntax:
awk ex_program ex_file...
Various kinds of formatted reports and structured data operations can be performed with awk
. awk
can:
- Interpret text files consisting of records, which in turn consist of fields.
- Store data in variables and arrays.
- Perform arithmetic, logical, and string operations.
- Evaluate loops and conditionals.
- Define functions.
- Post-process the output of commands.
awk
reads text from its standard input or files named on the command line, usually line by line. Every line (record) is divided into fields and processed. The results are written to the standard output or a named file.
awk
programs exist somewhere between shell scripts and programs in languages like Python. The main difference is that awk
is data-driven: its control flow follows the input records, while typical programming languages are organized around functions.
Useful awk
options include:
-e ex_program
,--source ex_program
- Use
ex_program
as AWK program source code. -f ex_program
,--file ex_program
- Read the AWK program source from
ex_program
, instead of from the first command line argument. Multiple-f
options may be used. -F ex_field_separator
,--field-separator ex_field_separator
- Use
ex_field_separator
for the input field separator. The default field separator forawk
is a Space. -v ex_var=ex_value
,--assign ex_var=ex_value
- Assign the value
ex_value
to the variableex_var
before the execution of the program begins.
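A quick sketch of the -v
option (the variable name greeting
is arbitrary):

```shell
# greeting is assigned before the BEGIN block runs; with only a
# BEGIN rule, awk exits without reading any input
awk -v greeting='Hello' 'BEGIN {print greeting ", awk"}'
```

This prints Hello, awk.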
AWK Program Execution
An AWK program is a sequence of pattern-action statements and option-function definitions:
@include ex_file
@load ex_file
ex_pattern {ex_action_statements}
ex_function(ex_parameters) {ex_action_statements}
The action is executed on the text that matches the pattern. When given on the command line, the entire AWK program is enclosed in single quotes (''
).
awk
reads the program source from the ex_program
file(s) if specified, from arguments to -e
, or from the first non-option argument on the command line. awk
reads the program text as if all of the program files and command line source texts were concatenated.
Lines beginning with @include
may be used to include other source files into your program.
awk
executes AWK programs in the following order:
- All variable assignments specified via the
-v
option are performed. - The program is compiled into an internal form.
- The code in the
BEGIN
block(s), if there are any, is executed and proceeds to read each file named in theARGV
array (up toARGV[ARGC]
). - If there are no files named on the command line,
awk
reads the standard input. - If a file name on the command line has the form
ex_var=ex_value
, it is treated as a variable assignment. The variableex_var
will be assigned the valueex_value
. This happens after anyBEGIN
block(s) have been run.- Command line variable assignment is most useful for dynamically assigning values to the variables AWK uses to control how input is broken into fields and records. Also, it is useful for controlling state if multiple passes are needed over a single data file.
- If the value of a particular element of
ARGV
is empty (""
),awk
skips over it. - For each input file, if a
BEGINFILE
rule exists,awk
executes the associated code before processing the contents of the file. - Similarly,
awk
executes the code associated withENDFILE
after processing the file. - For each record in the input,
awk
tests to see if it matches any pattern in the AWK program. For each pattern that the record matches, the associated action is executed. The patterns are tested in the order they occur in the program. - After all the input is exhausted,
awk
executes the code in theEND
block(s) (if any).
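The ex_var=ex_value
file operands described above can be sketched like this (the file names and the tag
variable are hypothetical, created only for this demonstration):

```shell
# Create two throwaway input files
dir=$(mktemp -d)
printf 'Crusher\n' > "$dir/f1"
printf 'Bashir\n'  > "$dir/f2"
# tag=... operands are variable assignments, applied between the files
awk '{print tag, $0}' tag=TNG "$dir/f1" tag=DS9 "$dir/f2"
rm -r "$dir"
```

The line from the first file is printed as TNG Crusher and the line from the second as DS9 Bashir, because each assignment takes effect just before the file that follows it is read.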
In principle, an awk
program works like a loop over the input records (usually lines) that is repeated until no input is left or the program is terminated. The control flow is largely given by the data. In most other languages, the main program is started once and functions (which may read input) influence the progress of the calculation.
In the simplest case, awk
works like sed
, i.e., you select lines and then apply commands to those lines. For example, the following command outputs all input lines containing at least three b
s:
awk '/b.*b.*b/ {print}' ex_file
The braces ({}
) may contain one or more commands that are applied to the lines matching the regular expression between the forward slashes (/
). Commands are separated by a semicolon (;
).
A sequence of AWK commands does not need to include a regular expression in front of it. A {ex_commands}
without a regular expression will be applied to every input record.
AWK Scripts
Often, it is more convenient to put AWK scripts into their own files. You can execute these files using awk -f ex_script_file
. Lines starting with a #
are considered comments.
As with sed
(e.g., #!/usr/bin/env -S sed -f
), there is nothing wrong with directly executable awk
scripts:
#!/usr/bin/env -S awk -f
/a.*b.*/ {print}
For every input line, awk
checks which script lines match it, and all awk
command sequences that match are executed.
Records and Fields
awk
assumes that its input is structured, not just a stream of bytes as sed
treats it. Usually, every input line is considered a record and split into fields on white space (i.e., a field is a string surrounded by white space). Unlike programs like sort
and cut
, awk
considers sequences of spaces one field separator, instead of seeing an empty field between two adjacent space characters.
The default field separator for awk
is white space. The input field separator can be changed using the -F ex_field_separator
option. Output fields are separated by blanks.
The $
is awk
's field access operator. You can refer to the individual fields of a record using the awk
expressions $ex_number
(e.g., $1
for field 1, $2
for field 2, etc.):
$ cat 'data.txt'
01,Beverly,Crusher,TNG 2324
02,Julian,Bashir,DS9 2341
03,Leonard,McCoy,TOS 2227
04,Katherine,Pulaski,TNG 2318
05,Beverly,Picard,TNG 2324
$ awk -F ',' '{print $2,$3}' 'data.txt'
Beverly Crusher
Julian Bashir
Leonard McCoy
Katherine Pulaski
Beverly Picard
The $0
field identifier returns the full input record (i.e., all fields for a record).
$ awk '{print $0}' 'data.txt'
01,Beverly,Crusher,TNG 2324
02,Julian,Bashir,DS9 2341
03,Leonard,McCoy,TOS 2227
04,Katherine,Pulaski,TNG 2318
05,Beverly,Picard,TNG 2324
The $NF
field identifier returns the last field's value for a line, regardless of the number of fields in the current record.
$ awk -F ',' '{print $NF}' 'data.txt'
TNG 2324
DS9 2341
TOS 2227
TNG 2318
TNG 2324
Unlike cut
, awk
is able to change the order of columns in its output with respect to their order in the input.
$ awk -F ',' '{print $3,$2}' 'data.txt'
Crusher Beverly
Bashir Julian
McCoy Leonard
Pulaski Katherine
Picard Beverly
When outputting multiple fields with a comma (e.g., {print $3,$2}
), awk
prints a space between each field. If you do not place a comma in between each field, their values will be concatenated.
BEGIN and END
awk
lets you specify command sequences to be executed at the beginning of a run, before data has been read, and at the end, after the last records have been read. This can be used for initialization or final results.
$ ls -l *.txt | awk '
BEGIN {sum = 0}
{sum = sum + $5}
END {print sum}'
255
The above command adds the lengths/sizes of all files with the .txt
extension in the current directory and outputs the results at the end. sum
is a variable that contains the current total (the BEGIN
rule is not strictly required since new variables are set to 0
when they are first used).
A BEGIN
rule is executed once before any text processing starts. An END
rule is executed after all processing has completed. You can have multiple BEGIN
and END
rules, and they will execute sequentially.
BEGIN
and END
rules have their own set of actions enclosed within their own set of curly braces ({}
).
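For example (with only BEGIN
rules, awk
never reads input):

```shell
# Multiple BEGIN rules execute sequentially, in program order
awk 'BEGIN {print "first"} BEGIN {print "second"}'
```

This prints first and then second.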
AWK Variables
Variables in AWK behave like shell variables, except that you can refer to their values without having to put a $
in front of the variable name. They may contain either strings or floating-point numbers. AWK variables are typeless and are interpreted as required:
$ awk 'BEGIN {a = "456def"; print 2*a; exit}'
912
$ awk 'BEGIN {a = "def"; print 2*a; exit}'
0
Variable names always start with a letter and may otherwise contain letters, digits, and the underscore (_
).
You can include assignments to AWK variables on the command line, among the awk
options and filename arguments. The only condition is that awk
gets to see each assignment as a single argument (hence, there may be no spaces around the =
).
This is an improved version of the previous text file size summation example:
$ ls -l *.txt | awk '
{
sum += $5
count++
}
END {print sum, "bytes in", count, "files"}'
255 bytes in 3 files
sum += $5
is equivalent to sum = sum + $5
and count++
is equal to count = count + 1
(here, count
represents the number of files whose length/size was summed).
awk
uses the NF
variable to make available the number of fields found in a record (i.e., the number of columns/words in a line).
$ cat 'data.txt'
01,Beverly,Crusher,TNG 2324
02,Julian,Bashir,DS9 2341
03,Leonard,McCoy,TOS 2227
04,Katherine,Pulaski,TNG 2318
05,Beverly,Picard,TNG 2324
$ awk -F ',' '{print NF}' 'data.txt' | head -n 1
4
Besides NF
, awk
defines other system variables:
FS
- The input field separator, which you set using the
-F
option (an assignment toFS
within aBEGIN
command accomplishes the same, as well). RS
- The input record separator (i.e., the character that marks the end of a record/line). This is usually the newline character, but you can choose something else.
""
- A special value for RS
that makes a blank line separate records. This makes it easy to process files that are block structured rather than line structured.
This is an example of a block structured file:
$ cat 'block.txt'
Beverly
Crusher
TNG
2324
Julian
Bashir
DS9
2341
For the above file, you can set FS
and RS
to the appropriate values to convert it to a line structured file. Instead of a default blank, the output field separator can be specified using the OFS
variable:
$ awk 'BEGIN {FS = "\n"; RS = ""; OFS = ","} {print $1,$2,$NF}' 'block.txt'
Beverly,Crusher,2324
Julian,Bashir,2341
Above, for the block.txt
file, we specify the field separator as a new line (\n
), the record separator as an empty line (""
), and the output field separator as a comma (,
). Then, we print the first, second, and last fields ({print $1,$2,$NF}
).
awk
supports arrays, i.e., indexed groups of variables sharing a name. awk
allows arrays to be indexed using arbitrary character strings, rather than just numbers, as well. Often, this is called an associative array.
#!/usr/bin/env -S awk -f
# Display users for each login shell
BEGIN {FS = ":"}
{login_shells[$NF] = login_shells[$NF] $1 ", "}
END {
for (i in login_shells) {
print i ": " login_shells[i]
}
}
The BEGIN
rule sets the field separator, the main rule collects the data, and the END
rule outputs it. The for
command introduces a loop in which the i
variable is set to every index of the login_shells
array in turn (the order is nondeterministic).
When the above AWK script is used on the /etc/passwd
file, you will get a list of login shells together with their respective users:
$ shell_users.awk '/etc/passwd'
/bin/false: tss, speech-dispatcher, hplip, Debian-gdm,
/bin/bash: root, guest,
/usr/sbin/nologin: daemon, bin, sys, games, man, lp, mail, news, uucp, proxy, www-data, backup, list, irc, gnats, nobody, _apt, systemd-timesync, systemd-network, systemd-resolve, messagebus, dnsmasq, usbmux, rtkit, pulse, avahi, saned, colord, geoclue, systemd-coredump,
/bin/sync: sync,
AWK Expressions
awk
expressions may contain, among others, the common basic arithmetic (+
, -
, *
, and /
) and comparison operators (<
, <=
, >
, >=
, ==
, and !=
). The ^
character can be used for exponentiation.
In awk
, &&
is the logical AND operator (true
is considered a non-zero value), ||
is the logical OR operator, and !
is the logical NOT operator.
For example, the following prints out the fourth field for lines in data.txt
where the third field is either Bashir
or McCoy
:
$ cat 'data.txt'
01,Beverly,Crusher,TNG 2324
02,Julian,Bashir,DS9 2341
03,Leonard,McCoy,TOS 2227
04,Katherine,Pulaski,TNG 2318
05,Beverly,Picard,TNG 2324
$ awk -F ',' '$3 == "Bashir" || $3 == "McCoy" {print $4}' 'data.txt'
DS9 2341
TOS 2227
There are also test operators for regular expressions, ~
and !~
, which you can use to check whether a string matches or does not match a regular expression, respectively:
$1 ~ /b.*b.*b/ {ex_statements}
The above command evaluates whether the first field contains at least three b
characters.
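A sketch of the ~
operator on illustrative input:

```shell
# Print the second field only for lines whose first field has three b's
printf 'bobbob match\nabc skip\n' | awk '$1 ~ /b.*b.*b/ {print $2}'
```

Only the first line's field matches the regular expression, so this prints match.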
Two strings (or variable values) can be concatenated by writing them next to each other (separated by a space):
$ cat 'data.txt'
01,Beverly,Crusher,TNG 2324
02,Julian,Bashir,DS9 2341
03,Leonard,McCoy,TOS 2227
04,Katherine,Pulaski,TNG 2318
05,Beverly,Picard,TNG 2324
$ awk -F ',' 'NR==1 {print $2 " knows best."}' 'data.txt'
Beverly knows best.
Above, we specify a ,
as the field delimiter with -F ','
. NR==1
is used to tell awk
that we want to target the first row of data.txt
(specifically, NR
is the number of input records awk
has processed since the beginning of the program's execution). Then, we concatenate the value of line 1's second field (i.e., Beverly
) with the string knows best.
using print $2 " knows best."
.
AWK expressions may refer to functions. Some functions are predefined, including the arithmetic functions:
int
- Determine the integer part of a number.
log
- Logarithm.
sqrt
- Square root.
$ awk 'BEGIN {print sqrt(25)}'
5
awk
's calculation abilities are roughly equivalent to those of a scientific calculator.
There are also string functions:
length
- Determine the length of a string.
sub
- Substitute strings. Corresponds to
sed
'ss
operator. substr
- Return arbitrary substrings.
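A minimal sketch of these string functions (the string value is just an illustration):

```shell
awk 'BEGIN {
    s = "Beverly"
    print length(s)        # number of characters: 7
    print substr(s, 1, 3)  # substring from position 1, length 3: Bev
    sub(/B/, "b", s)       # in-place substitution, like sed s
    print s                # beverly
}'
```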
You can define your own functions like so:
#!/usr/bin/env -S awk -f
# Multiply numbers by 4
function quad(n) {
return 4*n
}
{print $1, quad($1)}
The above script reads a file of numbers (one per line) and outputs the original number and that number quadrupled.
$ cat 'numbers.txt'
3
2
1
$ quad.awk 'numbers.txt'
3 12
2 8
1 4
A function's body may consist of one or more awk
commands. The return
command is used to return a value as the function's result. The variables mentioned in a function's parameter list (in the above example, n
) are passed to the function and are local to it (i.e., they may be changed, but the changes are invisible outside of the function). All other variables are global (i.e., they are visible everywhere within the awk
program).
There is no provision in awk
for defining extra local variables within a function. However, you can work around this by adding extra parameters to a function's parameter list that callers simply do not pass; those parameters then act as local variables.
This is a function that sorts, in alphabetical order, the elements of an array A
that uses numerical indices between 1
and N
:
#!/usr/bin/env -S awk -f
# Sort array of Star Trek doctor names
function sort(A, N, i, j, temp_var) {
# Insertion sort
for (i = 2; i <= N; i++) {
for (j = i; A[j-1] > A[j]; j--) {
temp_var = A[j]; A[j] = A[j-1]; A[j-1] = temp_var
}
}
return
}
BEGIN {
doc_name[1] = "Katherine"; doc_name[2] = "Leonard"
doc_name[3] = "Julian"; doc_name[4] = "Beverly"
sort(doc_name, 4)
for (i = 1; i <= 4; i++) {
print i ": " doc_name[i]
}
}
The for
loop executes its first argument (i = 2
). Then, it repeats the following:
- Evaluate the second argument (
i <= N
). - If this condition is true (i.e., i
is still less than or equal to N
, the highest numerical index in the array), execute the loop body (here, a second
for
loop). - Evaluate the third argument (
i++
).
This is repeated until the condition is false, i.e., until i
is greater than N
.
Note the output of the array's elements by means of a counting for
loop (i.e., for (i = 1; i <= 4; i++) {
). A for (i in a)
loop would have produced the elements in a nondeterministic order (i.e., there would have been no point in sorting them first).
$ sort.awk
1: Beverly
2: Julian
3: Katherine
4: Leonard
AWK supports if
conditional statements. Their syntax is modeled on the C language:
if (ex_condition) {
ex_commands
}
An if else
conditional statement is done like so:
if (ex_condition) {
ex_commands
} else {
ex_commands
}
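A short sketch using the syntax above on illustrative input:

```shell
# Classify each line's first field by comparing it numerically
printf '5\n42\n' | awk '{
    if ($1 > 9) {
        print $1, "two or more digits"
    } else {
        print $1, "single digit"
    }
}'
```

This prints 5 single digit, then 42 two or more digits.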
The data[$0]++ == 0
expression is a common idiom. For example, it is used here in an awk
script to remove duplicate lines from files:
#!/usr/bin/env -S awk -f
# Remove duplicate lines
{
if (data[$0]++ == 0) {
lines[++count] = $0
}
}
END {
for (i = 1; i <= count; i++) {
print lines[i]
}
}
Above, if (data[$0]++ == 0)
is true
if $0
's value is seen for the first time. The ++count
expression is equal to count++
, except that it returns the value of count
after it has been incremented (count++
returns the value before incrementing it). This ensures that the first line seen has an index of 1, even though we do not explicitly set count
to 1.
$ cat 'dupes.txt'
1
1
2
2
3
3
$ compact.awk 'dupes.txt'
1
2
3
split
The split
command splits a file into pieces.
split ex_file
By default, ex_file
is broken into pieces of 1,000 lines each with a default prefix of x
(e.g., xaa
, xab
, xac
, etc.). The original file is left unchanged.
The -l ex_integer
(--lines=ex_integer
) option can be used to specify a different number of lines (records) per output file and a different prefix can be used by appending it to your command:
$ cat 'data.txt'
01,Beverly,Crusher,TNG 2324
02,Julian,Bashir,DS9 2341
03,Leonard,McCoy,TOS 2227
04,Katherine,Pulaski,TNG 2318
05,Beverly,Picard,TNG 2324
$ split -l 2 --verbose 'data.txt' 'star_trek_'
creating file 'star_trek_aa'
creating file 'star_trek_ab'
creating file 'star_trek_ac'
$ cat 'star_trek_aa'
01,Beverly,Crusher,TNG 2324
02,Julian,Bashir,DS9 2341
$ cat 'star_trek_ab'
03,Leonard,McCoy,TOS 2227
04,Katherine,Pulaski,TNG 2318
$ cat 'star_trek_ac'
05,Beverly,Picard,TNG 2324
Above, the --verbose
option is used to print a diagnostic message before each output file is opened.
paste
The paste
command is used to merge together the lines of files.
paste ex_file...
paste
causes every line of its file arguments to make up one column of the output.
$ cat 'dates.txt'
2324
2341
2227
2318
2324
$ cat 'names.txt'
Beverly Crusher
Julian Bashir
Leonard McCoy
Katherine Pulaski
Beverly Picard
$ paste 'dates.txt' 'names.txt' > 'dates_names.txt'
$ cat 'dates_names.txt'
2324 Beverly Crusher
2341 Julian Bashir
2227 Leonard McCoy
2318 Katherine Pulaski
2324 Beverly Picard
The above command takes the birth dates in dates.txt
and the names in names.txt
and joins them together into a third file, dates_names.txt
using redirection.
The -s
(--serial
) option causes paste
to append data in series, rather than parallel, i.e., it pastes the value of each line in the original file into the new output file horizontally, instead of vertically.
$ paste -s 'dates.txt' 'names.txt' > 'dates_names_s.txt'
$ cat 'dates_names_s.txt'
2324 2341 2227 2318 2324
Beverly Crusher Julian Bashir Leonard McCoy Katherine Pulaski Beverly Picard
By default, paste
uses a Tab as the field delimiter, but you can specify one or more alternative delimiters with the -d ex_delimiters
(--delimiters=ex_delimiters
) option (similar to the cut
command).
Other delimiters can be, for example, a Space or a pipe (|
), and each delimiter is used in turn. When the list has been exhausted, paste
begins again at the first delimiter. You can specify just one delimiter, i.e., a list is not required.
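A sketch of delimiter cycling with three hypothetical input files:

```shell
# Create three throwaway single-column files
dir=$(mktemp -d)
printf 'a\nb\n' > "$dir/one"
printf 'c\nd\n' > "$dir/two"
printf 'e\nf\n' > "$dir/three"
# With three columns, the delimiters , and ; are used in turn
paste -d ',;' "$dir/one" "$dir/two" "$dir/three"
rm -r "$dir"
```

The first gap uses a comma and the second a semicolon, so the output lines are a,c;e and b,d;f.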
join
The join
command joins lines of two files together based on a common field.
join ex_file_1 ex_file_2
The default join
field is the first field, but this can be changed with the -j ex_join_field
option.
$ cat 'dates_names.txt'
2324 Beverly Crusher
2341 Julian Bashir
2227 Leonard McCoy
2318 Katherine Pulaski
2324 Beverly Picard
$ cat 'dates_series.txt'
2324 TNG
2341 DS9
2227 TOS
2318 TNG
2324 TNG
$ join 'dates_names.txt' 'dates_series.txt' > 'st_doctors.txt'
$ cat 'st_doctors.txt'
2324 Beverly Crusher TNG
2341 Julian Bashir DS9
2227 Leonard McCoy TOS
2318 Katherine Pulaski TNG
2324 Beverly Picard TNG
The above command takes the Star Trek doctor names in dates_names.txt
and the Star Trek doctor series in dates_series.txt
and joins them on a common field (the characters' birth dates) into a new file, st_doctors.txt
, using redirection.
The default delimiter for join
is a Space, but a different delimiter can be set with the -t ex_delimiter
option (like the sort
command).
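A sketch of -t
with hypothetical comma-separated files (note that join
expects both files to be sorted on the join field):

```shell
# Create two throwaway CSV files, already sorted on the first field
dir=$(mktemp -d)
printf '2227,Leonard\n2324,Beverly\n' > "$dir/names.csv"
printf '2227,TOS\n2324,TNG\n' > "$dir/series.csv"
# -t ',' makes join split fields on commas
join -t ',' "$dir/names.csv" "$dir/series.csv"
rm -r "$dir"
```

The output joins the matching rows as 2227,Leonard,TOS and 2324,Beverly,TNG.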
head
head
outputs the first part of files.
head ex_file...
By default, the first ten lines of a file are displayed. However, this can be changed with the -n ex_integer
(--lines=ex_integer
) option:
$ cat 'data.txt'
01,Beverly,Crusher,TNG 2324
02,Julian,Bashir,DS9 2341
03,Leonard,McCoy,TOS 2227
04,Katherine,Pulaski,TNG 2318
05,Beverly,Picard,TNG 2324
$ head -n 2 'data.txt'
01,Beverly,Crusher,TNG 2324
02,Julian,Bashir,DS9 2341
By supplying a negative value to the -n
option, head
will print everything except the last ex_integer
lines:
$ head -n -2 'data.txt'
01,Beverly,Crusher,TNG 2324
02,Julian,Bashir,DS9 2341
03,Leonard,McCoy,TOS 2227
head
can also display the first bytes of a file with the -c ex_integer
(--bytes=ex_integer
) option:
$ head -c 10 'data.txt'
01,Beverly
tail
tail
outputs the last part of files.
tail ex_file...
By default, the last ten lines of a file are displayed. As with the head
command, you can change the number of lines to display with the -n ex_integer
(--lines=ex_integer
) option:
$ cat 'data.txt'
01,Beverly,Crusher,TNG 2324
02,Julian,Bashir,DS9 2341
03,Leonard,McCoy,TOS 2227
04,Katherine,Pulaski,TNG 2318
05,Beverly,Picard,TNG 2324
$ tail -n 2 'data.txt'
04,Katherine,Pulaski,TNG 2318
05,Beverly,Picard,TNG 2324
To read from a specific line to the end of the file, you can use the +ex_integer
syntax with the -n
option:
$ tail -n +2 'data.txt'
02,Julian,Bashir,DS9 2341
03,Leonard,McCoy,TOS 2227
04,Katherine,Pulaski,TNG 2318
05,Beverly,Picard,TNG 2324
head
and tail
can be used together to target increasingly specific lines in a file. For example, the following outputs lines 2 through 3 of data.txt
:
$ head -n 3 'data.txt' | tail -n 2
02,Julian,Bashir,DS9 2341
03,Leonard,McCoy,TOS 2227
tail
's -f
(--follow
) option can be used to follow one or more files as they grow. This can be useful when you are troubleshooting an issue and want to closely follow a log as it is updated.
od
The od
command can dump files in octal and other formats.
od ex_file...
For example, take the /usr/bin/python3.7
file:
$ file /usr/bin/python3.7
/usr/bin/python3.7: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[sha1]=d8d5d37d3e53baef1a47596ab71690406d8a272d, stripped
By default, od
displays the file in octal form:
$ od '/usr/bin/python3.7' | head -n 5
0000000 042577 043114 000402 000001 000000 000000 000000 000000
0000020 000002 000076 000001 000000 051360 000135 000000 000000
0000040 000100 000000 000000 000000 023600 000112 000000 000000
0000060 000000 000000 000100 000070 000013 000100 000033 000032
0000100 000006 000000 000004 000000 000100 000000 000000 000000
The -t ex_output_format
(--format=ex_output_format
) option can be used to specify a different format specification. The accepted values are:
a
Named characters, ignoring high-order bitb
Octal bytesc
Printable characters or backslash escapesu2
Unsigned decimal 2-byte unitsfF
FloatsdI
Decimal intsdL
Decimal longso2
Octal 2-byte unitsd2
Decimal 2-byte unitsx2
Hexadecimal 2-byte units
This is the /usr/bin/python3.7
file in hexadecimal form:
$ od -t 'x2' '/usr/bin/python3.7' | head -n 5
0000000 457f 464c 0102 0001 0000 0000 0000 0000
0000020 0002 003e 0001 0000 52f0 005d 0000 0000
0000040 0040 0000 0000 0000 2780 004a 0000 0000
0000060 0000 0000 0040 0038 000b 0040 001b 001a
0000100 0006 0000 0004 0000 0040 0000 0000 0000
strings
The strings
command prints strings of printable characters in files.
strings ex_file...
For example, /usr/bin/python3.7
is not a text file, but an ELF 64-bit LSB executable file. If you try to view its content, you are going to see mostly gibberish:
$ cat '/usr/bin/python3.7' | head -n 5
ELF>�R]@�'J@8
@�@@@@@h��@�@@@
BB�#�#0%0e0e��������?�����X
ȍ
��?����@��@�@DDP�td��9��y��y$$Q�tdR�td��?����``/lib64/ld-linux-x86-64.so.2GNUGNU���}>S���GYj��@m�'-��،Q
@�"X"8��SA��P$@��F��`D!����"QHB�
�Q���!��*QP
� x��@���!!
5B�B@�8XX �CH�� 2 � "tB� � @�@P`!��pC
� D�
However, the strings
command will extract and display any strings in the file:
$ strings '/usr/bin/python3.7' | head -n 5
/lib64/ld-linux-x86-64.so.2
"X"8
"tB
@P`!
2B B
nl
nl
writes the lines of a file to the standard output, with line numbers added.
nl ex_file...
By default, nl
only numbers lines with data, not blank lines. You can instruct nl
to number all lines of a file with the -ba
(--body-numbering=ex_style
) option (a
is the ex_style
meaning number all lines).
ex_style
is one of:
a
Number all linest
Number only non-empty linesn
Number no linespex_regex
Number only lines that contain a match for the basic regular expressionex_regex
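A quick sketch contrasting the default t
style with the a
style (the input is illustrative):

```shell
# By default (style t), the blank line is printed but not numbered...
printf 'one\n\ntwo\n' | nl
# ...while -ba numbers every line, including the blank one
printf 'one\n\ntwo\n' | nl -ba
```

With the default, two receives line number 2; with -ba, it receives line number 3 because the blank line is counted too.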
iconv
iconv
converts text from one character encoding to another.
iconv -f ex_from_encoding_value -t ex_to_encoding_value ex_file... -o ex_output_file
This command uses several options:
-f ex_from_encoding_value
,--from-code=ex_from_encoding_value
- Use
ex_from_encoding_value
for input characters. -t ex_to_encoding_value
,--to-code=ex_to_encoding_value
- Use
ex_to_encoding_value
for output characters. -o ex_output_file
,--output=ex_output_file
- Use
ex_output_file
for output.
For example, the following changes the text encoding of the utf.txt
file:
$ cat 'utf.txt'
abc ß ? € à?ç
$ file -i 'utf.txt'
utf.txt: text/plain; charset=utf-8
$ iconv -f UTF-8 -t ASCII//TRANSLIT 'utf.txt' -o 'ascii.txt'
$ cat 'ascii.txt'
abc ss ? EUR a?c
$ file -i 'ascii.txt'
ascii.txt: text/plain; charset=us-ascii
Above, file
's -i
(--mime
) option causes the command to output mime type strings rather than the more traditional human readable ones.
The //TRANSLIT
part of the iconv
command instructs iconv
to transliterate characters being encoded when needed and possible.
printf
printf
prints an argument according to a specific format. It is available as both an internal and external command.
printf ex_format ex_argument...
ex_format
controls the output as in C printf
.
Interpreted sequences are:
\"
- Double quote
\\
- Backslash
\a
- Alert (BELL)
\b
- Backspace
\c
- Produce no further output
\e
- Escape
\f
- Form feed
\n
- New line
\r
- Carriage return
\t
- Horizontal tab
\v
- Vertical tab
\NNN
- Byte with octal value
NNN
(1 to 3 digits) \xHH
- Byte with hexadecimal value
HH
(1 to 2 digits) \uHHHH
- Unicode (ISO/IEC 10646) character with hex value
HHHH
(4 digits) \UHHHHHHHH
- Unicode character with hex value
HHHHHHHH
(8 digits)
Interpreted sequences also include all C format specifications ending with one of diouxXfeEgGcs
, with ex_argument
converted to proper type first (variable widths are handled).
Format Specifiers
Some of the most commonly used format specifiers include:
%%
- A single
%
. %b
ex_argument
as a string with'\'
escapes interpreted, except that octal escapes are of the form\0
or\0NNN
.%d
- An integer specifier for showing integral values.
%f
- Specifier for showing floating point values.
%s
ex_argument
as a string with'\'
escapes not interpreted.%q
ex_argument
is printed in a format that can be reused as shell input, escaping non-printable characters with the proposed POSIX$''
syntax.%x
- Specifier for output of lowercase hexadecimal values for integers and for padding the output.
Details about available formats are in the documentation of the C library function, which can be viewed by running man 3 printf
.
.
Examples
By default, printf
does not add newlines to the strings provided to it as arguments, like echo
does:
$ echo 'Hello, world!'
Hello, world!
$ printf 'Hello, world!'
Hello, world!$
To add newlines, printf
needs to be supplied a format string with the newline escape sequence:
$ printf '%s\n' 'Hello, world!'
Hello, world!
$ printf '%s\n' 'Hello, world!' 'Even more' 'new' 'lines.'
Hello, world!
Even more
new
lines.
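The format specifiers above can be combined in a single format string; a short sketch:

```shell
# %d decimal, %x lowercase hex, %.2f float with two decimal places
# (LC_ALL=C ensures a period is used as the decimal point)
LC_ALL=C printf '%d %x %.2f\n' 255 255 3.5
```

This prints 255 ff 3.50.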
fmt and pr
The fmt
command is used for simple optimal text formatting.
fmt -w ex_width_number ex_file...
The -w ex_integer
(--width=ex_integer
) option sets the maximum line width (the default is 75 columns). An abbreviated form of this option is -ex_width_number
.
The pr
command converts text files for printing.
pr --columns=ex_column_number ex_file...
The --columns=ex_column_number
option tells pr
to output ex_column_number
columns and print columns down, unless the -a
(--across
) option is used. An abbreviated form of this option is -ex_column_number
.
Together, these commands can be used to format a file for printing by setting line width and the number of columns:
$ cat 'fruit.txt'
Apple
Watermelon
Orange
Pear
Cherry
Strawberry
Nectarine
Grape
Mango
Blueberry
Pomegranate
Plum
Banana
Raspberry
Mandarin
Jackfruit
Papaya
Kiwi
Pineapple
Lime
Lemon
Apricot
Grapefruit
Melon
Coconut
Avocado
Peach
$ fmt -11 'fruit.txt' | pr -3
2021-01-16 14:40 Page 1
Apple Blueberry Pineapple
Watermelon Pomegranate Lime Lemon
Orange Plum Apricot
Pear Banana Grapefruit
Cherry Raspberry Melon
Strawberry Mandarin Coconut
Nectarine Jackfruit Avocado
Grape Papaya Peach
Mango Kiwi
Documentation
For more on the text manipulation commands discussed here, refer to the Linux User's Manual, either at the command prompt or online.