Shell utilities

grep

grep                    search for a string

grep string file

(output of prog) | grep string | grep otherstring

-i    ignore case
-v    return all lines that do NOT match
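
A minimal sketch of the two options, run on a throwaway file. The file name and its contents are made up for illustration.

```shell
# Build a small sample file in a temporary directory (contents are illustrative).
tmpdir=$(mktemp -d)
printf 'Born in Dublin\nborn in Cork\nlived in Galway\n' > "$tmpdir/people.txt"

# Case-sensitive search: only the lower-case "born" line matches.
exact=$(grep born "$tmpdir/people.txt")

# -i ignores case, so both "Born" and "born" match.
anycase=$(grep -i born "$tmpdir/people.txt")

# -v inverts the match: lines that do NOT contain "born" in any case.
others=$(grep -iv born "$tmpdir/people.txt")

printf '%s\n---\n%s\n---\n%s\n' "$exact" "$anycase" "$others"
rm -r "$tmpdir"
```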

grep
man grep
grep is substring-oriented (it will match inside a word).
Possible alternative:
1. grep -w (match whole words only)
grep is line-oriented (it prints the whole line that matches).
This is not always what we want: the hit may be spread across multiple lines, so grep won't find it. Or conversely, an HTML file could be one single line, so grep prints the entire file.
Possible solution:
- Pre-process the file.
- If the hit is spread across multiple lines, use tr or sed to change all (or many) newlines to spaces.
- If the file is all on one line, use the same tools to introduce new lines at suitable points.
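
The first pre-processing idea can be sketched as follows, assuming a made-up two-line file in which the phrase "born in" straddles a line break:

```shell
tmpdir=$(mktemp -d)
# The phrase "born in" is split across two lines in this illustrative file.
printf 'He was born\nin Dublin.\n' > "$tmpdir/split.txt"

# Line-oriented grep finds nothing: no single line holds the whole phrase.
# (grep exits non-zero when there is no match, hence the "|| true".)
before=$(grep -c 'born in' "$tmpdir/split.txt" || true)

# Turn every newline into a space first, then search the joined text.
after=$(tr '\n' ' ' < "$tmpdir/split.txt" | grep -c 'born in')

echo "matches before joining: $before, after joining: $after"
rm -r "$tmpdir"
```

The price is that the joined text is now one long line, so grep prints all of it when it matches.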
Recursive grep:
- This exists on some (but not all) Unix/Linux.
- grep -r string .
- grep -r --include="*html" string .
- Might be able to leave out the .
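
A small sketch of recursive grep on a made-up directory tree. It assumes a grep that supports -r and --include (GNU and BSD grep do; they are not in POSIX).

```shell
tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/site/sub"
printf 'born in 1900\n' > "$tmpdir/site/index.html"
printf 'born in 1950\n' > "$tmpdir/site/sub/page.html"
printf 'born in 2000\n' > "$tmpdir/site/sub/notes.txt"

# -r descends into subdirectories; -l prints only the names of matching files.
all=$(cd "$tmpdir/site" && grep -rl born . | sort)

# --include limits the search to files whose names match the glob.
html_only=$(cd "$tmpdir/site" && grep -rl --include="*.html" born . | sort)

printf '%s\n---\n%s\n' "$all" "$html_only"
rm -r "$tmpdir"
```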

More complex solution to long-line problem:

Use egrep (equivalent to grep -E) to display from 0 to max 40 chars each side of the string:
egrep -o ".{0,40}string.{0,40}" file
. means any char
0 to max n of any char is .{0,n}
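
To see the effect on a small scale, here is a sketch with 5 characters of context instead of 40. The sample line is made up, and -o is assumed to be available (it is a GNU/BSD extension).

```shell
# One long line, as you might find in a minified HTML file (illustrative text).
line='aaaaaaaaaaaaaaaaaaaa born in Dublin bbbbbbbbbbbbbbbbbbbb'

# -E enables extended regular expressions; -o prints only the matched part.
# Show at most 5 characters either side of the hit.
snippet=$(printf '%s\n' "$line" | grep -E -o '.{0,5}born.{0,5}')
echo "$snippet"    # prints: aaaa born in D
```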

string matching / regular expressions

^                       start-of-line
$                       end-of-line
.                       any character

where "c" stands for the character:

c*			0 or more instances of c
cc*			1 or more instances of c
grep "  *"		1 or more spaces
.*                  any sequence of characters



where "c" has a special meaning, e.g. is $ or ., etc:

\c                  the character itself
grep "\."           the '.' character itself

recall the two forms of string:

grep '\$'	works (searches for the "$" char instead of end-of-line)
grep "\$"	fails (inside double quotes the shell removes the backslash, so grep sees a bare $, meaning end-of-line, and matches every line)
grep "\\$"	works (the shell turns \\ into \, so grep sees \$)
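
The escaping and quoting rules above can be checked with a short sketch. The sample file is made up; grep -c counts matching lines.

```shell
tmpdir=$(mktemp -d)
printf 'a b\nab\ncost: $5\nfile.html\nfilexhtml\n' > "$tmpdir/sample.txt"

# "  *" is a space followed by zero or more spaces: one or more spaces.
spaces=$(grep -c '  *' "$tmpdir/sample.txt")

# Single quotes pass \$ to grep untouched, so it matches a literal dollar sign.
dollar=$(grep -c '\$' "$tmpdir/sample.txt")

# In double quotes the shell removes the backslash, grep sees a bare $
# (end-of-line), and every line matches.
every=$(grep -c "\$" "$tmpdir/sample.txt")

# \. matches a literal dot; an unescaped . would also match "filexhtml".
dot=$(grep -c 'file\.html' "$tmpdir/sample.txt")

echo "$spaces $dollar $every $dot"    # prints: 2 1 5 1
rm -r "$tmpdir"
```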

Exercise: Go to share/testsuite.
Find all lines in web pages containing "born".
Find all lines containing the "$" symbol.
Find all lines containing "born" at start of line.
Find all lines containing start of line, then one space, then "born".
Find all lines containing start of line, then any number of spaces, then "born".
Regular expressions
- POSIX syntax
- POSIX character classes


In July 2019, a poorly written regular expression in a firewall rule took down Cloudflare, and with it a big chunk of the Internet.

cut

cut     extract columns or fields of text on command-line

To extract columns  30  to end of line of the ls listing:

  ls -l | cut -c30-

Note: ls -l outputs are not actually on predictable columns.
Longer/shorter userids give different column numbers. 
Try:

  ls -l /

How to use cut to parse grep output


In grep output, extract the 1st field, with delimiter ":"

 grep string *html | cut -f1 -d':' 

Extract the 2nd to end fields, with delimiter ":"

 grep string *html | cut -f2- -d':'

Q. Why "-f2-" ?
Why not "-f2" ?
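
One way to explore the question is to try both forms on a line whose text contains a second ":". The sample line below is made up to look like grep output.

```shell
# A line shaped like grep output: "filename:matching text", where the text
# itself contains another colon (illustrative data).
line='index.html:Born: 1900 in Dublin'

name=$(printf '%s\n' "$line" | cut -d':' -f1)     # index.html
second=$(printf '%s\n' "$line" | cut -d':' -f2)   # Born
rest=$(printf '%s\n' "$line" | cut -d':' -f2-)    # Born: 1900 in Dublin

printf '%s\n%s\n%s\n' "$name" "$second" "$rest"
```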

sed


sed     "stream editor" - find and replace text on command-line (and other things)

sed 's|oldstring|newstring|'    change first match on each line
sed 's|oldstring|newstring|g'   change all matches on each line

e.g. ls listing that highlights web pages:

 ls -l 	| sed "s|\.html| [Web page]|"

e.g. ls listing that changes how my username appears:

 ls -l 	| sed "s|$USER|ME|"
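
The difference the "g" flag makes can be seen in a short sketch on a made-up line with two matches:

```shell
line='www.a.com and www.b.com'

# Without g, only the first match on the line is changed.
first=$(printf '%s\n' "$line" | sed 's|www|WWW|')

# With g, every match on the line is changed.
all=$(printf '%s\n' "$line" | sed 's|www|WWW|g')

echo "$first"    # prints: WWW.a.com and www.b.com
echo "$all"      # prints: WWW.a.com and WWW.b.com
```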

sed
man sed

separator:

'|' is just my choice of a separator.
Other people like '/'
We can actually use any character as a separator (whatever comes first after "s").

Often I use sed with "-e"
- "-e" allows multiple substitutes in one command line.
- If you only do one substitute, you don't need it. It does nothing.
sed FAQ
- How to do case-insensitive matching (can be tricky)
sed examples
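
A sketch of the separator choice and of "-e", using a made-up path. It suggests why a separator other than '/' is handy when the pattern itself contains slashes.

```shell
path='/users/group/me/index.html'

# With "/" as the separator, every "/" in the pattern must be escaped.
slash=$(printf '%s\n' "$path" | sed 's/\/users\/group/~/')

# With "|" as the separator the same substitution needs no escaping.
bar=$(printf '%s\n' "$path" | sed 's|/users/group|~|')

# -e chains several substitutions; they are applied in order to each line.
both=$(printf '%s\n' "$path" | sed -e 's|/users/group|~|' -e 's|\.html$| [Web page]|')

echo "$slash"    # prints: ~/me/index.html
echo "$both"     # prints: ~/me/index [Web page]
```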

sed examples

To insert a new line

On DCU Linux, put a new line in front of every string "www":
```
sed 's|www|\nwww|g'
```
(recognises "\n")
On some other platforms, need to do this:
```
sed 's|www|\
www|g'
```

To put new lines in front of and after every HTML tag

On DCU Linux:
```
sed 's|<|\n<|g' |
sed 's|>|>\n|g'
```
On some other platforms:
```
sed 's|<|\
<|g' |
sed 's|>|>\
|g'
```
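
A sketch of what the tag-splitting pipeline produces, using the backslash-newline form (which should work on either kind of platform) on a made-up one-line HTML fragment. The final "grep ." is an addition here, to drop the empty lines that the two substitutions leave behind.

```shell
html='<p>born</p>'

# Put a newline before every "<" and after every ">", then drop empty lines.
split=$(printf '%s\n' "$html" | sed 's|<|\
<|g' | sed 's|>|>\
|g' | grep .)

printf '%s\n' "$split"
# prints three lines:
# <p>
# born
# </p>
```

With each tag and each piece of text on its own line, line-oriented tools such as grep can pick out the text alone.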

To substitute back in the pattern we matched


# \( ... \) to mark a pattern
# \1 to reference it later

# e.g. change:
# (start of line)file.html: ...
# to:
# <a href=file.html>file.html</a>: ...

# search for:
# ^\(.*\.html\):
# change to:
# <a href=\1>\1</a>:


grep -i "$1" *html |
	sed -e "s|^\(.*\.html\):| <a href=\1>\1</a>: |g"
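
To try the substitution without a directory of web pages, here is a sketch on one made-up line shaped like grep output. The href value is quoted here, which is a small variation on the pattern above.

```shell
line='index.html:He was born in Dublin'

# \( ... \) captures the filename; \1 pastes it back in, twice.
linked=$(printf '%s\n' "$line" | sed 's|^\(.*\.html\):|<a href="\1">\1</a>:|')

echo "$linked"
# prints: <a href="index.html">index.html</a>:He was born in Dublin
```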

tr

tr
man tr
character substitutions
can use character classes (numeric, alphanumeric, etc.)
To change spaces to new lines:
```
 cat file | tr ' ' '\n' 
```
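
A few more tr uses, sketched on a made-up string, including the character classes mentioned above:

```shell
text='Born in Dublin 1900'

# Map each lower-case letter to its upper-case equivalent.
upper=$(printf '%s\n' "$text" | tr '[:lower:]' '[:upper:]')

# One word per line, then count the non-empty lines.
words=$(printf '%s\n' "$text" | tr ' ' '\n' | grep -c .)

# -d deletes characters; -c complements the set: delete everything but digits.
digits=$(printf '%s\n' "$text" | tr -cd '[:digit:]')

echo "$upper"     # prints: BORN IN DUBLIN 1900
echo "$words"     # prints: 4
echo "$digits"    # prints: 1900
```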

How to convert Windows EOL to Unix EOL

To convert Windows text file format to Unix text file format:

tr -d '\015'

   # remove octal 15, the "carriage return" char "\r"
   # see ASCII chars
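
A sketch of the conversion on a made-up two-line file, checking the result by file size:

```shell
tmpdir=$(mktemp -d)
# Two lines with Windows line endings (carriage return + newline).
printf 'line one\r\nline two\r\n' > "$tmpdir/dos.txt"

# Delete every carriage return; the newlines stay.
tr -d '\015' < "$tmpdir/dos.txt" > "$tmpdir/unix.txt"

# The converted file is shorter by one byte per line.
dos_size=$(wc -c < "$tmpdir/dos.txt")
unix_size=$(wc -c < "$tmpdir/unix.txt")
echo "removed $((dos_size - unix_size)) carriage returns"    # prints: removed 2 carriage returns
rm -r "$tmpdir"
```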

To analyse a file to see non-printable or "bad" characters:
```
  cat -tve file 
```

"testsuite" has files with odd chars

My "testsuite" collection has some files with odd characters:

cd /users/tutors/mhumphrysdculab/share/testsuite

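# find all files with Windows EOL character:
# (grep's basic regular expressions have no octal escapes, so '\015' is not read as a carriage return;
#  with -P, where available, \r works)
grep -lP '\r' */*html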

# find all files with characters other than the basic 7 bit characters (00 to 7E hex):
LC_ALL=C  grep -P "[^\x00-\x7E]" */*html

# if you grep a file with non 7 bit chars you may get a warning like:
grep: file.html: binary file matches

awk

awk - a powerful pattern scanning and processing language

dirname, basename

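useful cutting and pasting with filenames

dirname	  strip the last component, leaving the directory part of a path

basename  strip the directory part, leaving the last component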


$ echo $HOME
/users/group/me

$ dirname $HOME
/users/group

$ basename $HOME
me

$ dirname `dirname $HOME`
/users
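
A sketch on a made-up path, including the optional second argument to basename, which strips a suffix:

```shell
path='/users/group/me/public_html/index.html'

dir=$(dirname "$path")             # /users/group/me/public_html
file=$(basename "$path")           # index.html
stem=$(basename "$path" .html)     # index

printf '%s\n%s\n%s\n' "$dir" "$file" "$stem"
```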

head and tail

head
Display the first 100 lines of the output:

grep string files | head -100

pipe can close early

If we do this:

 grep string files | head -100

When head has read its first 100 lines it exits, the pipe closes, and grep is terminated (by SIGPIPE on its next write).
This is as opposed to running the entire grep and then taking the first 100 lines.
To see this is true, run the program "yes" (which outputs an infinite number of lines) with head, and you will see it does stop:

 yes | head -20

tail
Display the last 30 lines of the logfile:

cat logfile | tail -30

date

date                         looks like: "Tue Feb 17 16:28:33 GMT 2009"
CURRENTDATE=`date`           remember backquotes
echo $CURRENTDATE             

date "+%b %e"                looks like: "Jan 21"  

date "+%b.%e.log"            can add things to the string (%e pads single-digit days with a space; use %d for a zero-padded day)

file=`date "+%b.%e.log"`    
echo $file

man date

Using date to get unique filename

Say a web server, in response to a client, needs to make a temporary file.
Use date to get a new filename that is unique to the current second (within a given day, since only the time of day is used):

timenow=`date +%H%M%S`
filename="/tmp/random.$timenow.txt"

Unique to second and nanosecond (%N is a GNU date extension):

date "+%H.%M.%S.%N"

Alternative ways of getting unique filename

Unique filename based on process ID:
```
filename=/tmp/random.$$.txt
```
Would be different for each process.
Random number generator in Shell:
```
echo $RANDOM  
```
This is a strange shell variable. It does not exist until you try to access it. Then it exists!
```
set | grep -i random
echo $RANDOM  
set | grep -i random
echo $RANDOM            
```
Random (or pseudo-random) number generator
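
The ideas above can be combined, and there is a further alternative that these notes do not cover: the mktemp utility, assumed here to be installed (it is common on Linux and BSD but not in POSIX). A sketch:

```shell
# Time of day plus process ID: differs across seconds and across processes,
# but is guessable and repeats from one day to the next.
stamp=$(date +%H%M%S)
name="/tmp/random.$stamp.$$.txt"
echo "$name"

# mktemp chooses an unused name and creates the (empty) file in one step,
# so two processes can never end up with the same file.
tmpfile=$(mktemp)
echo "hello" > "$tmpfile"
content=$(cat "$tmpfile")
rm -f "$tmpfile"
```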

tar

tar
Combine many files/dirs into one file for easy distribution.

For sending:

# bundle directory into one file
tar -cf dir.tar dir

# compress it
gzip dir.tar

# (can actually do the above two in one step)

When receiving:

# de-compress it 
gzip -d dir.tar.gz

# un-tar it
tar -xf dir.tar

# (can actually do the above two in one step)
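
The one-step forms mentioned above can be sketched as follows, assuming a tar that supports -z (gzip) and -C (change directory), as GNU and BSD tar do. The directory and file names are made up.

```shell
tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/src/dir" "$tmpdir/dest"
echo 'hello' > "$tmpdir/src/dir/file.txt"

# Sending: -c create, -z compress with gzip, -f name of the archive.
# -C changes directory first, so the archive holds the relative path "dir".
tar -czf "$tmpdir/dir.tar.gz" -C "$tmpdir/src" dir

# Receiving: -x extract, -z de-compress, -f name of the archive.
tar -xzf "$tmpdir/dir.tar.gz" -C "$tmpdir/dest"

restored=$(cat "$tmpdir/dest/dir/file.txt")
echo "$restored"    # prints: hello
rm -r "$tmpdir"
```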