Friday, March 18, 2016

Linux awk

Awk can handle most tasks that are essentially text processing.
An awk program follows the form:
pattern { action }
awk is line oriented. That is, the pattern specifies a test that is performed with each line read as input. If the condition is true, then the action is taken. The default pattern is something that matches every line. This is the blank or null pattern.
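As a minimal sketch of the pattern { action } form, the following (with hypothetical inline data) prints only lines whose first field exceeds 100, prefixed with "BIG:":

```shell
# pattern: $1 > 100   action: print "BIG:" and the whole line ($0)
awk '$1 > 100 { print "BIG:", $0 }' <<'EOF'
50 apples
200 oranges
150 pears
EOF
# prints:
# BIG: 200 oranges
# BIG: 150 pears
```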
A fuller awk program looks like this:
BEGIN { print "START" }
      { print         }
END   { print "STOP"  }
Example:
BEGIN { print "File\tOwner" }
      { print $8, "\t", $3 }
END   { print " - DONE -" }
Example awk_example1.awk
#!/bin/awk -f
BEGIN { print "File\tOwner" }
{ print $8, "\t", $3 }
END { print " - DONE -" }
In its simplest usage awk is meant for processing column-oriented text data, such as tables, presented to it on standard input. The variables $1, $2, and so forth are the contents of the first, second, etc. column of the current input line. For example, to print the second column of a file, you might use the following simple awk script:
awk < file '{ print $2 }'
This means "on every line, print the second field". 
By default awk splits input lines into fields based on whitespace, that is, spaces and tabs. You can change this by using the -F option to awk and supplying another character. For instance, to print the home directories of all users on the system, you might do
awk < /etc/passwd -F: '{ print $6 }'
since the password file has fields delimited by colons and the home directory is the 6th field. 
Awk is a weakly typed language; variables can be either strings or numbers, depending on how they're referenced. All numbers are floating-point. So to implement the fahrenheit-to-celsius calculator, you might write
awk '{ print ($1-32)*(5/9) }'
which will convert fahrenheit temperatures provided on standard input to celsius until it gets an end-of-file.  
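A quick sanity check of the converter, feeding it a single value:

```shell
# 212°F is the boiling point of water, so this should come out as 100°C
echo 212 | awk '{ print ($1-32)*(5/9) }'
# prints: 100
```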
echo 5 4 | awk '{ print $1 + $2 }' prints 9, while echo 5 4 | awk '{ print $1 $2 }' prints 54. Note that echo 5 4 | awk '{ print $1, $2 }' prints "5 4".
awk has some built-in variables that are automatically set; $1 and so on are examples of these. Other built-in variables useful for beginners are NF, which holds the number of fields in the current input line ($NF gives the last field), and $0, which holds the entire current input line.
You can make your own variables, with whatever names you like (except for reserved words in the awk language) just by using them. You do not have to declare variables. Variables that haven't been explicitly set to anything have the value "" as strings and 0 as numbers.
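A small sketch of those defaults in action: total and count below are never declared, yet the running sum works from the first line because each starts at 0 when used as a number.

```shell
# total and count start out as 0; each line adds $1 and bumps the count
printf '3\n4\n5\n' | awk '{ total += $1; count++ } END { print count, total }'
# prints: 3 12
```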
For example, the first script below prints the average of the numbers on each input line; the second prints the average of the first column across all lines:
awk '{ tot=0; for (i=1; i<=NF; i++) tot += $i; print tot/NF; }'
awk '{ tot += $1; n += 1; }  END { print tot/n; }'
Note the two different kinds of block statement: the second script's final block is prefixed with END, which means it runs once after all input has been processed.
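A quick check of the per-line averaging script above, run on inline sample data:

```shell
# average of 1,2,3 is 2; average of 4,5,6 is 5
printf '1 2 3\n4 5 6\n' | awk '{ tot=0; for (i=1; i<=NF; i++) tot += $i; print tot/NF }'
# prints: 2 then 5, one per line
```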
You can also supply regular expressions to match the whole line against:
awk ' /^test/ { print $2 }'
The block conditions BEGIN and END are special and are run before processing any input, and after processing all input, respectively. 
awk supports loop and conditional statements like in C, that is, for, while, do/while, if, and if/else.
awk '{ for (i=2; i<=NF; i++) printf "%s ", $i; printf "\n"; }'
Note the use of NF to iterate over all the fields and the use of printf to place newlines explicitly. 
Finding everything within the last 2 hours:
awk -vDate=`date -d'now-2 hours' +[%d/%b/%Y:%H:%M:%S` '$4 > Date {print Date, $0}' access_log
Note: date is stored in field 4
To find entries between 2 and 4 hours ago:
awk -vDate=`date -d'now-4 hours' +[%d/%b/%Y:%H:%M:%S` -vDate2=`date -d'now-2 hours' +[%d/%b/%Y:%H:%M:%S` '$4 > Date && $4 < Date2 {print Date, Date2, $4}' access_log
The following will show you the IPs of every user who requests the index page sorted by the number of hits:
awk -F'[ "]+' '$7 == "/" { ipcount[$1]++ }    END { for (i in ipcount) {        printf "%15s - %d\n", i, ipcount[i] } }' logfile.log
$7 is the requested URL. You can add whatever conditions you want at the beginning; replace $7 == "/" with whatever test you need.
If you replace the $1 in (ipcount[$1]++), then you can group the results by other criteria. Using $7 would show what pages were accessed and how often. Of course then you would want to change the condition at the beginning. The following would show what pages were accessed by a user from a specific IP:
awk -F'[ "]+' '$1 == "1.2.3.4" { pagecount[$7]++ }    END { for (i in pagecount) {        printf "%15s - %d\n", i, pagecount[i] } }' logfile.log
You can also pipe the output through sort to get the results in order, either as part of the shell pipeline, or from within the awk script itself (inside awk the command name must be a quoted string):
awk -F'[ "]+' '$7 == "/" { ipcount[$1]++ }    END { for (i in ipcount) {        printf "%15s - %d\n", i, ipcount[i] | "sort" } }' logfile.log
Example: removing duplicate lines from a text file without sorting:
awk '!x[$0]++' [text_file_name]
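The idiom works because x[$0]++ is a post-increment: it evaluates to the old count for that exact line, which is 0 (false) only the first time the line is seen. So !x[$0]++ is true only for first occurrences, and with no action given, awk's default action (print the line) runs:

```shell
# first 'a' -> x["a"] is 0, !0 is true, line printed, counter becomes 1
# later 'a' -> x["a"] is 1, !1 is false, line skipped
printf 'a\nb\na\nc\nb\n' | awk '!x[$0]++'
# prints: a, b, c (one per line, in first-seen order)
```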
There are only a few commands in AWK. The list and syntax follow:
  • if ( conditional ) statement [ else statement ]
  • while ( conditional ) statement
  • for ( expression ; conditional ; expression ) statement
  • for ( variable in array ) statement
  • break
  • continue
  • { [ statement ] ... }
  • variable = expression
  • print [ expression-list ] [ > expression ]
  • printf format [ , expression-list ] [ > expression ]
  • next
  • exit
Example:
#!/bin/awk -f
BEGIN {
    # Print the squares from 1 to 10 the first way
    i = 1;
    while (i <= 10) {
        printf "The square of %d is %d\n", i, i*i;
        i = i + 1;
    }
    # do it again, using more concise code
    for (i = 1; i <= 10; i++) {
        printf "The square of %d is %d\n", i, i*i;
    }
    # now end
    exit;
}
Built-in variables:
  • NF : number of fields in the current record
  • NR : number of the current record (line)
  • FS : input field separator, e.g. FS=":"
  • RS : input record separator, e.g. RS="\n"
  • ORS : output record separator, e.g. ORS="\r\n"
  • FILENAME : name of the current input file
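A minimal sketch combining several of these at once, with hypothetical colon-separated input: FS splits each record on ":", NR and NF report the record number and field count, $NF is the last field, and ORS makes each output record end with CRLF.

```shell
# read colon-separated records, emit CRLF-terminated "NR NF last-field" lines
printf 'a:b:c\nd:e\n' | awk 'BEGIN { FS=":"; ORS="\r\n" } { print NR, NF, $NF }'
```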
References:  
  • http://www.hcs.harvard.edu/~dholland/computers/awk.html
  • http://stackoverflow.com/questions/7706095/find-entries-in-log-file-within-timespan-eg-the-last-hour
  • http://www.grymoire.com/Unix/Awk.html
  • http://serverfault.com/questions/11028/do-you-have-any-useful-awk-and-grep-scripts-for-parsing-apache-logs 
  • http://stackoverflow.com/questions/11532157/unix-removing-duplicate-lines-without-sorting 
