
2015-02-02

Linux Administration with sed and awk

Getting Started

Grep

Examples:
Match everything that starts with a lowercase character or underscore

grep '^[a-z_]'


Sed

Remove commented and blank lines from file as well as create a backup

sed -i.commented '/^#/d;/^$/d' /etc/ntp.conf


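Print lines with the 'p' command

sed -n 'p' /etc/passwd

The 'p' command prints the current line; with no address in front of it, it applies to every line.
The -n option suppresses sed's automatic printing of each line, so only lines printed explicitly by 'p' appear.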

Print the first five lines
sed -n ' 1,5 p' /etc/passwd

Print when the line starts with user
sed -n ' /^user/ p' /etc/passwd
The slashes '/' delimit the regular expression that selects the lines.
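A minimal sketch of how -n and 'p' interact, using a few made-up passwd-style lines (the sample data and variable names are hypothetical) so it can be run without touching /etc/passwd:

```shell
# Hypothetical sample data standing in for /etc/passwd.
sample='root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
user1:x:1000:1000::/home/user1:/bin/bash'

# Without -n every line is printed automatically and 'p' prints it again.
doubled=$(printf '%s\n' "$sample" | sed 'p' | wc -l)

# With -n only the lines selected by the address are printed.
first_two=$(printf '%s\n' "$sample" | sed -n '1,2 p')
users=$(printf '%s\n' "$sample" | sed -n '/^user/ p')

echo "lines printed without -n: $doubled"
echo "$first_two"
echo "$users"
```

Dropping -n is the quickest way to see why it matters: every input line then shows up twice.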

Awk

Awk has 3 main blocks
BEGIN Block
Main Block
END Block


awk 'BEGIN {} {MAIN(normally print data here)} END {}' filename


Using the 3 blocks in a simple example

awk ' BEGIN {print "hello" } { print } END { print NR }' filename

The above prints the line "hello", then prints every line of filename using the "print" command, and finally prints NR, the number of records (lines) read from the file.

You can add additional functions to any block. For example

print NR, $0

This will print the line number before each line of output.

Use awk's built-in toupper() function inside the main block to convert each line to uppercase

print toupper($0)

Which will print the entire line in upper case.
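Putting the three blocks together with the two print variants above, a small sketch on inline input (the two input words are made up for illustration):

```shell
# BEGIN runs once before input, the main block once per line, END once after input.
out=$(printf 'alpha\nbeta\n' | awk '
BEGIN { print "hello" }
      { print NR, toupper($0) }
END   { print NR }')

echo "$out"
```

This prints "hello", then each line numbered and uppercased, then the final record count.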

Regular Expressions

Boundaries

Boundary and whitespace escapes are shorthand sequences that stand for a class of characters or for a position in the text.

\s - match whitespace (space, tab)
\b - match a word boundary (the position between a word character and a non-word character such as a hyphen or space)

Generally the inverse match is written by inverting the case of the escape

\S - do NOT match whitespace (space, tab)
\B - do NOT match at a word boundary


Examples:

\ssystem

Will match "file system" or "file\tsystem"


\bsystem

Will match "file system" or "file-system"


\b is a word boundary and can be used as follows

\b[Cc]olou?r\b

In which we're looking for the word Colour in any of the following variations:
Color
Colour
color
colour
Where u? means that we can have 0 or 1 'u' characters
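A quick way to check which words the pattern accepts, assuming GNU grep (for \b and -o); the test sentence is made up:

```shell
text='Color colour Colouur discolor colours'

# -o prints each match on its own line; -E makes '?' a quantifier.
matches=$(printf '%s\n' "$text" | grep -oE '\b[Cc]olou?r\b')

echo "$matches"
```

Only "Color" and "colour" are printed: "Colouur" has too many u's, and "discolor" and "colours" fail the word-boundary checks.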

Anchors

^ - beginning or start of a string
Example:

'^root'

Means that the line must begin with root

$ - End of the string

4$

Means that 4 is matched at the end of the string

^$

Search for blank lines - the start of the line is immediately followed by the end of the line

Ranges

Are held within square brackets

Example:

'[A-Za-z]'

Matches any letter

You can do OR statements by just putting two ranges or matching patterns
Example:

'[a-z_]'

Matches a-z lowercase or underscore

Every range specified is basically an OR statement


^[a-z_]

Begins with an underscore or lowercase character

Ranges can also be negated using the ^ character

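Example: Match lines whose last character is anything other than 4

'[^4]$'

Kind of like grep -v '4$' but expressed inside the range instead. One difference: '[^4]$' needs a character to match, so it skips empty lines, while grep -v '4$' prints them.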
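The negated range and grep -v are close but not identical. A small sketch on made-up input that includes an empty line shows where they differ:

```shell
lines='run level 4
run level 5

ends in 44'

# Negated range: the last character must exist and must not be 4.
negated_count=$(printf '%s\n' "$lines" | grep -c '[^4]$')

# Inverted match: every line that does not end in 4, including the empty one.
inverted_count=$(printf '%s\n' "$lines" | grep -vc '4$')

echo "negated range: $negated_count, grep -v: $inverted_count"
```

The negated range needs an actual character to match, so the empty line is counted only by grep -v.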

Quantifiers

How many of the previous characters are allowed or required


u* - matches u zero or more times
u? - matches u zero times or once only
u+ - matches one or more occurrences of u
u{3} - matches exactly 3 occurrences of u


Example:
Filter out comment lines, even when the comment is indented with leading whitespace

grep -v '^\s*#' test

^ - means starts with
\s - means whitespace
* - means zero or more of the previous item
# - is the literal character to match
Basically, look for optional whitespace at the start of the line followed by a '#'. Because we used -v in grep, any line that matches is removed from the result.

Example:
Look for two words and ignore whether or not there are spaces between the words

grep 'start\s*end' test

Meaning we want the word start followed by the word end, with any amount of whitespace (zero or more characters) between them.

If we want at most one whitespace character between them, then we would have to use:

grep -E 'start\s?end' test

This needs -E (extended regular expressions) because '?' is only a quantifier in extended syntax. In basic syntax GNU grep spells it '\?'.

NOTE: egrep is deprecated and is just a wrapper for "grep -E"

Interval quantifiers such as {2} on a range also need extended regular expressions, or escaped braces \{2\} in basic syntax, like the following:

grep -E '[a-z]{2}[0-9]{1,2}' test
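The three quantifier patterns above can be compared side by side. This is a sketch assuming GNU grep (for \s), run against a made-up sample instead of the file "test":

```shell
sample='startend
start end
start  end
ab12
a1'

# Zero or more whitespace characters between the words.
any_ws=$(printf '%s\n' "$sample" | grep -c 'start\s*end')

# At most one whitespace character; '?' needs -E.
at_most_one=$(printf '%s\n' "$sample" | grep -cE 'start\s?end')

# Two lowercase letters followed by one or two digits.
interval=$(printf '%s\n' "$sample" | grep -E '[a-z]{2}[0-9]{1,2}')

echo "$any_ws $at_most_one $interval"
```

The line with two spaces is matched by \s* but not by \s?, and "a1" is rejected because {2} requires two letters.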


Fundamentals of sed

Sed statements are made up of multiple parts


sed $arguments '$range $command' $file


Example:

sed -n '/^root/ p' /etc/passwd

Which will print root's entry (any line beginning with "root") from /etc/passwd, where
-n - means suppress automatic printing
/^root/ - is the regex for begins with root
p - means print matches

Substitute Command

Format is as follows

sed ' [range] s/pattern/replacement/ ' /etc/passwd

Example:

sed ' /^gretchen/ s@/bin/bash@/bin/sh@ ' /etc/passwd

This says: only deal with lines that start with gretchen, and on those lines replace /bin/bash with /bin/sh. The '@' is used as the delimiter so the slashes in the paths do not need escaping.

Only the first match per line is replaced unless you use the 'g' flag, which replaces every match on the line.
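A short sketch of the address-plus-substitute form and of the effect of the 'g' flag, using a made-up passwd line and a throwaway string:

```shell
line='gretchen:x:1001:1001::/home/gretchen:/bin/bash'

# Address /^gretchen/ limits the substitution; '@' avoids escaping the slashes.
changed=$(printf '%s\n' "$line" | sed '/^gretchen/ s@/bin/bash@/bin/sh@')

# Without 'g' only the first match on the line is replaced.
first_only=$(printf 'a-b-c\n' | sed 's/-/+/')
every=$(printf 'a-b-c\n' | sed 's/-/+/g')

echo "$changed"
echo "$first_only $every"
```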

Indenting with sed

Find the lines you want to indent using the "nl" command

nl $filename



sed ' 6,9 s/^/    / ' $filename

The above will indent lines 6-9 by replacing the start of each line with 4 spaces.

In order to validate changes you can always use the 'p' flag with -n to print only the lines that were changed

sed -n ' 6,9 s/^/    /p ' $filename
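The same idea on a three-line sample (the input and the line range 2,3 are made up so the result is easy to eyeball):

```shell
# Indent lines 2-3 by four spaces; line 1 is left alone.
indented=$(printf 'one\ntwo\nthree\n' | sed '2,3 s/^/    /')

# With -n and the 'p' flag only the changed lines are shown.
changed_only=$(printf 'one\ntwo\nthree\n' | sed -n '2,3 s/^/    /p')

printf '%s\n' "$indented"
printf '%s\n' "$changed_only"
```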


Append Insert and Delete


Append will add a new line after the matching line
Append Example:

sed '/^server 3/ a server ntp.example.com' /etc/ntp.conf

The above looks for the line that starts with "server 3" and appends the contents "server ntp.example.com" after it finds that result.

Insert will insert a new line before the matching line
Insert Example:

sed ' /^server 0/ i server ntp.example.com ' /etc/ntp.conf

The above will add a new line of "server ntp.example.com" prior to the line that starts with "server 0"

Delete will delete lines matching a certain pattern from a file
Delete Example:

sed ' /^server\s[0-9]\.ubuntu/ d' /etc/ntp.conf

The above will delete the line that starts with server, followed by whitespace, followed by a number, followed by a dot, followed by ubuntu.
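To try append, insert and delete without editing a real /etc/ntp.conf, here is a sketch against a made-up three-line configuration. It assumes GNU sed, which accepts the one-line 'a text' and 'i text' forms and \s:

```shell
conf='server 0.ubuntu.pool.ntp.org
server 3.ubuntu.pool.ntp.org
server ntp.ubuntu.com'

# a: new line goes after the match; i: new line goes before the match.
appended=$(printf '%s\n' "$conf" | sed '/^server 3/ a server ntp.example.com')
inserted=$(printf '%s\n' "$conf" | sed '/^server 0/ i server ntp.example.com')

# d: drop every line matching the pattern.
deleted=$(printf '%s\n' "$conf" | sed '/^server\s[0-9]\.ubuntu/ d')

printf '%s\n---\n%s\n---\n%s\n' "$appended" "$inserted" "$deleted"
```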

Multiple sed expressions

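Method 1: Perform all on the command line with braces

sed ' {
/^server 0/ i server ntp.example.com
/^server\s[0-9]\.ubuntu/ d
} ' /etc/ntp.conf

The braces are optional here. To put this on one line, pass each command with its own -e option rather than joining them with ';', because the text of an 'i' or 'a' command runs to the end of the line and would swallow the ';' and everything after it.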

Method 2: Create a sed file and execute it separately

cat ntp.sed
/^server 0/ i server ntp.example.com
/^server\s[0-9]\.ubuntu/ d

sed -f ntp.sed /etc/ntp.conf
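Both methods can be compared on the same input. This sketch assumes GNU sed, uses a made-up two-line configuration, and writes the script to a temporary file rather than a fixed ntp.sed:

```shell
conf='server 0.ubuntu.pool.ntp.org
server 1.ubuntu.pool.ntp.org'

# One command per -e, so the text of 'i' ends where its -e argument ends.
via_e=$(printf '%s\n' "$conf" |
  sed -e '/^server 0/ i server ntp.example.com' -e '/^server\s[0-9]\.ubuntu/ d')

# The same two commands stored in a script file and run with -f.
script=$(mktemp)
printf '%s\n' '/^server 0/ i server ntp.example.com' '/^server\s[0-9]\.ubuntu/ d' > "$script"
via_f=$(printf '%s\n' "$conf" | sed -f "$script")
rm -f "$script"

echo "$via_e"
echo "$via_f"
```

Both runs leave only the inserted line, because the insert fires on the "server 0" line before that line is deleted.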


Sed over SSH

In order to run sed over ssh we can run ssh with some options to allow us to remotely execute sed commands
Example:

ssh -t user@server sudo sed -i.bak -f /tmp/ntp.sed /etc/ntp.conf

Where:
-t means ssh will allocate a TTY, which allows sudo to prompt for a password
/tmp/ntp.sed - must exist on the remote server (a good idea is to put it on an NFS share that all servers mount)

Substitution Grouping Using Sed

Example:

s/\([^,]*\)/\U\1/

This breaks down as follows:

's/' starts the substitution.

\([^,]*\) is a group (the parentheses are escaped) that matches any run of characters that are NOT commas, so it captures everything up to the first comma.

\U\1 is the replacement: the first group, converted to uppercase (\U is a GNU sed extension).

The use case for the above is:
- Uppercase the first field (or grouping) which represents the last name

so if you had a file that looked like

lastname,firstname,ssn


The result after running the above sed statement would look like

LASTNAME,firstname,ssn

This is because the group matches everything up to the first comma and then stops matching.
Then it replaces the first field with uppercase characters.

Meaning:
The first part of the s command is the match - here a grouping statement
The second part is the replacement string

So basically it's the same as it always has been; it's just that we're using grouping statements instead of direct matches and replacement.

Extending grouping

You can extend groups by adding even more grouping statements as follows

sed 's@\([^,]*\),\([^,]*\)@\U\1\L\2@' employees

This will create two groupings
1st grouping - match everything until the first comma
ex: kind of like how [^0-9]* matches any run of non-digit characters, however long.
2nd grouping - match again until the next comma

Grouping options:
1st grouping - Make uppercase with \U\1
2nd grouping - Make lowercase with \L\2

Overall format is:

sed 's/\(\),\(\),\(\)/\$operation\$field\$operation\$field/'


You can add additional characters in the substitute string by just adding characters between the substitute strings to format output.

Example:

sed 's@\([^,]*\),\([^,]*\)@\U\1,\L\2@' employees
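A sketch of both grouping forms on one made-up record (the record contents are hypothetical). It assumes GNU sed, since \U and \L are GNU extensions:

```shell
record='smith,JOHN,000-00-0000'

# One group: uppercase everything before the first comma.
upper_first=$(printf '%s\n' "$record" | sed 's/\([^,]*\)/\U\1/')

# Two groups: uppercase the first, lowercase the second, keep the comma between them.
both=$(printf '%s\n' "$record" | sed 's@\([^,]*\),\([^,]*\)@\U\1,\L\2@')

echo "$upper_first"
echo "$both"
```

Anything after the matched part of the line, here the third field, is passed through unchanged.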


Example of inserting a thousands separator into numbers:

s/\(^\|[^0-9.]\)\([0-9]\+\)\([0-9]\{3\}\)/\1\2,\3/g

A single pass inserts only the last comma in each number (1234567 becomes 1234,567).
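For numbers of a million or more, the substitution has to be repeated until it stops matching. One way to sketch that, assuming GNU sed, is a label and the 't' (branch if a substitution was made) command; the function name commify is made up for this example:

```shell
# Repeat the substitution until no further comma can be inserted.
commify() {
  sed -e ':a' -e 's/\(^\|[^0-9.]\)\([0-9]\+\)\([0-9]\{3\}\)/\1\2,\3/g' -e 'ta'
}

# One pass only places the last separator.
one_pass=$(printf '1234567\n' | sed 's/\(^\|[^0-9.]\)\([0-9]\+\)\([0-9]\{3\}\)/\1\2,\3/g')

# The loop finishes the job and leaves digits after a decimal point alone.
looped=$(printf '1234567 and 1234.5678\n' | commify)

echo "$one_pass"
echo "$looped"
```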


Executing Commands with sed

The concept is that whenever a line matches, you can run a command against it immediately with sed. The 'e' flag of the substitute command (a GNU sed extension) executes the resulting line as a shell command and replaces the line with the command's output.

If you have a text file that includes a list of files you can run a sed execute on each line of the file and perform an operation per line as follows


sed 's/^/ls -l /e' listoffiles.txt

Each line of the input file becomes an argument to the command.

The beginning of the line is substituted with the command to run.

You can tar up files in a file using

sed ' /^\// s/^/tar -rf catalog.tar /e' cat.list

The address /^\// restricts this to lines that begin with '/', i.e. absolute paths such as /etc/hosts


Remove the leftover files:

sed ' /^\// s/^/rm -f /e' cat.list
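Because every matching line is handed to the shell, it is worth trying the 'e' flag in a scratch directory first. A sketch assuming GNU sed, with made-up file names inside a temporary directory:

```shell
workdir=$(mktemp -d)
touch "$workdir/a.txt" "$workdir/b.txt"
printf '%s\n' "$workdir/a.txt" "$workdir/b.txt" > "$workdir/list"

# Each line becomes "basename <path>" and is replaced by that command's output.
names=$(sed 's/^/basename /e' "$workdir/list")

# Only lines that start with '/' are turned into "rm -f <path>" and executed.
sed -n '/^\// s/^/rm -f /e' "$workdir/list"
remaining=$(ls "$workdir")

rm -rf "$workdir"
echo "$names"
echo "left after rm: $remaining"
```

Only run this kind of command on lists you trust, since whatever is on each line ends up on a shell command line.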


Using sed in vim

You can run sed-style substitutions from vim's command line using

ESC
:2,10 s/^/    /

The above would indent lines 2-10 by 4 spaces.

Write a range of lines from one file to another

ESC
:4,10 w lines

will write lines 4-10 to a new file called lines

You can read in that file using 'r'

ESC
:r lines


Introduction to Awk

Example awk code file

BEGIN { FS=":"; print "Username" }
{ print $1 }
END { print "Total users = " NR }

The above shows the usage of the BEGIN block for a header and the END block for summary information, while the main block in the middle prints one field per input line.

The BEGIN block is also the place to set the field separator (FS) that awk uses to split each line into fields for the MAIN block.

This is how to run it. The BEGIN block runs once before any input is read and the END block runs once after the last line.

awk -f users.awk /etc/passwd


Example - Print where UID is greater than 499

BEGIN { FS=":"; print "Username" }
# where the third column (the UID) is greater than 499, print the username (column 1)
$3 > 499 { print $1 }
# NR counts every line read, not only the lines that matched
END { print "Total users = " NR }


Example - Counting the number of returned rows

BEGIN { FS=":"; print "Username" }
/^root/{ print $1 ; count++ }
END { print "Total users = " count }

The above matches any line starting with root, prints the first column (the username) and increments the count variable each time it matches.

The END block then prints the final value of the variable count, which is the number of matching lines (here, users whose name starts with root).
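The difference between a counter and NR is easiest to see when both are printed. A sketch combining the UID filter with a counter, on made-up passwd-style input:

```shell
passwd_sample='root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
alice:x:1000:1000::/home/alice:/bin/bash
bob:x:1001:1001::/home/bob:/bin/bash'

# count is incremented only for matching lines; NR counts every line read.
report=$(printf '%s\n' "$passwd_sample" | awk '
BEGIN { FS = ":"; print "Username" }
$3 > 499 { print $1; count++ }
END { print "Matched " count " of " NR " users" }')

echo "$report"
```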

Awk readability vs sed in some cases


Both of the below take in a comma separated file, uppercase the first item in the comma separated list, lowercase the second, and print the third. The only difference in output is the separator: awk joins the fields with OFS, which defaults to a space.

Though sed is fewer characters total, awk is arguably much more readable.

sed 's@\([^,]*\),\([^,]*\)@\U\1,\L\2@' employees

VS

awk -F "," '{ print toupper($1), tolower($2), $3 }' employees


NOTE: OFS sets the output field separator. Whenever you write print $1,$2 the comma is replaced with the value of OFS, which is a single space by default; set OFS="," to keep the output comma separated.
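A sketch showing the effect of OFS on the awk command above, using one made-up record instead of the employees file:

```shell
csv='smith,JOHN,100'

# Default OFS: the commas in print become single spaces.
default_ofs=$(printf '%s\n' "$csv" | awk -F "," '{ print toupper($1), tolower($2), $3 }')

# OFS set to a comma: the output stays comma separated, like the sed version.
comma_ofs=$(printf '%s\n' "$csv" | awk -F "," -v OFS="," '{ print toupper($1), tolower($2), $3 }')

echo "$default_ofs"
echo "$comma_ofs"
```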


Example - Filtering the lastlog file

The !( ) pattern in the main section says that you do NOT want to work with lines matching any of the expressions enclosed in it.

We then count the lines that remain (basically the inverse of those matched inside the !( ) ).

!(/Never logged in/ || /^Username/ || /^root/) {
    count++
    # printf pads each value to a fixed width (8, 2, 3 and 4 characters) so the columns
    # line up and long values do not push the columns out of alignment.
    # When a line has exactly 8 fields, print fields 1, 5, 4 and 8;
    # otherwise print fields 1, 6, 5 and 9.
    if ( NF == 8 )
        printf "%8s %2s %3s %4s\n", $1,$5,$4,$8
    else
        printf "%8s %2s %3s %4s\n", $1,$6,$5,$9
}
END {
    print "=============="
    print "Total Number of Users Processed: ", count
}


Displaying Records from Flat Files Using Awk


XML and awk

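Records are different than fields
Record = $0
Fields = $1, $2 ... $NF

We can make the record separator RS="\n\n" (a multi-character RS is supported by gawk and mawk; RS="" is the portable paragraph mode)

Which means that each record is no longer a single line: records are separated by two consecutive newlines (a blank line).

This makes it so that we can parse a whole multi-line block as one record in the MAIN block of awk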

Example - Searching in apache VirtualHost tag

BEGIN { RS="\n\n" }
$0 ~ search { print }

Where search is a variable holding the text to search for, which we pass in on the command line

awk -f xml.awk search=example vh.conf


Cleanup a file in sed

sed ' /^\s*$/d;/^<\/Virt/a \ ' virtualhost.conf

The above looks for any line that contains nothing but optional whitespace and deletes it (basically deleting all blank lines).

Then it looks for the closing VirtualHost tag and appends a blank line after it, effectively sanitizing the file so that we can parse it with a record separator of \n\n as described above.
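The clean-up and the record search can be tried end to end on a made-up two-block configuration. This sketch is a variation rather than the exact command above: it uses GNU sed's G command, which appends a newline plus the (empty) hold space and so yields a truly empty line, and it reads records in awk's portable paragraph mode (RS=""):

```shell
vhosts='<VirtualHost *:80>
ServerName www.example.com

DocumentRoot /var/www/example
</VirtualHost>
<VirtualHost *:80>
ServerName www.test.org
</VirtualHost>'

# Delete blank lines, then add one empty line after each closing tag.
cleaned=$(printf '%s\n' "$vhosts" | sed '/^\s*$/d; /^<\/Virt/G')

# Each blank-line-separated block is one record; print the block that matches.
found=$(printf '%s\n' "$cleaned" | awk 'BEGIN { RS = "" } $0 ~ search { print }' search=test)

printf '%s\n' "$found"
```

Only the second VirtualHost block is printed, as a single three-line record.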


You can change the field separator to select individual items from the XML grouping


BEGIN { FS="[><]"; RS="\n\n"; OFS=" " }
$0 ~ search { print $4 ": " $5, $8 ": " $9, $12 ": " $13 }

This makes it so that each field separator is either a > or a <, meaning that you can print individual fields rather than the entire XML record. We also set OFS so that we can space fields out on the line: each ',' in the print statement is replaced with the value of OFS, here a space.
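The field numbers look odd until you count what a [><] separator produces. A sketch on a single made-up XML line (element names are hypothetical):

```shell
xml='<product><name>drill</name><price>99</price></product>'

# Splitting on < or > gives: $2=product, $4=name, $5=drill, $8=price, $9=99.
# The fields between "><" pairs ($3, $7, ...) are empty.
fields=$(printf '%s\n' "$xml" | awk '
BEGIN { FS = "[><]"; OFS = " " }
{ print $4 ": " $5, $8 ": " $9 }')

echo "$fields"
```

Each element contributes four fields (tag name, value, closing tag, empty gap), which is why the tag/value pairs sit at $4/$5, $8/$9, $12/$13 and so on.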

Analyze Web Logs Using Awk

Look for all lines that have 404 in their status field

awk ' $9==404 {print$0} ' access.log


Summary Information of AWK

Count unique accesses by a client

BEGIN { FS=" "; print "Log access" }
# ip is an associative array: instead of a numeric index, each element is keyed by a name.
# Each unique value seen in $1 becomes a new key whose counter is incremented.
# Example: on hitting 192.168.0.2 the array gains the key ip["192.168.0.2"], which then counts how many times that address appears.
{ ip[$1]++ }
# Loop over every key in the array and print the key with the count accumulated by the main block.
END {
    for (i in ip)
        print i, " has accessed ", ip[i], " times."
}
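The counting pattern can be checked on a few made-up log lines. The order of a for-in loop over an awk array is not defined, so this sketch pipes the result through sort to get stable output:

```shell
log='192.168.0.2 - - "GET /"
192.168.0.3 - - "GET /a"
192.168.0.2 - - "GET /b"'

# One counter per distinct first field, printed as "address count".
counts=$(printf '%s\n' "$log" |
  awk '{ ip[$1]++ } END { for (i in ip) print i, ip[i] }' | sort)

echo "$counts"
```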


Find Most popular browser

BEGIN { FS=" "; print "Most Popular Browser" }
{ browser[$12]++ }
END {
    for (b in browser)
        if (max < browser[b]) { max = browser[b]; maxbrowser = b }
    print "Most access was from", maxbrowser, "and ", max, " times."
}








