Meeting-20100612

WPLUG will have a General User Meeting and presentation on Saturday, June 12th, 2010, starting at 11am at the Wilkins School Community Center.

Schedule for the Day

10:30am - Doors open, set up
11:00am - Business Meeting starts
11:30am - Featured Presentation
12:30pm - Meeting ends, everyone out. We are likely to go to D's 6pack or Square Cafe for lunch.

Speaker/Presentation

Vance Kochenderfer will be talking a bit about the UNIX text processing utilities such as grep, sed, awk, cat, wc, and the like.

However, you don't get to just sit on your butt and listen; this is an audience-participation event. What we're going to do is take a couple simple tasks, and then explore how you could accomplish them using various UNIX utilities.

The goal is not solely to find the standard, quickest, or simplest solution, but to try out as many different whacked-out options as we can. So don't stop thinking once you've got an answer, even if it's a good one - see what else you can come up with!

We'll talk over all the suggestions and how they work (or don't work), so hopefully we'll all learn something new.

Start thinking about these, and bring your ideas to the meeting:

EXERCISE ONE

You have a large text file. Some lines contain text; others are blank. Your goal is to figure out how many non-blank lines are in the file.

I can think of at least six ways of doing this. How about you?

Below are the various examples I came up with, and running times using an input file with 1.5 million blank lines and 3 million non-blank lines. You can generate these statistics by preceding the command with 'time -p'.
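If you want to reproduce the timings, a comparable input file could be generated with something like the sketch below. The helper name make_sample, the line contents, and the way blank lines are interleaved are all assumptions; the only thing taken from the description above is the 2:1 ratio of non-blank to blank lines.

```shell
# Hypothetical helper: write N groups of "two text lines, one blank line".
# The 2:1 ratio matches the description above; the line contents and the
# interleaving are assumptions, not the original test file.
make_sample() {
    awk -v n="$1" 'BEGIN {
        for (i = 1; i <= n; i++) {
            print "line " i " a"
            print "line " i " b"
            print ""
        }
    }'
}

# 1500000 groups would give 3 million non-blank and 1.5 million blank lines:
#   make_sample 1500000 > filename
#   time -p grep -c . filename

# Small sanity check: 3 groups contain 6 non-blank lines.
make_sample 3 | grep -c .
```

A different mix of line lengths will change the absolute times, so treat the figures as relative comparisons only.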

# First example from <http://www.vectorsite.net/tsawk_3.html>
awk 'NF != 0 { ++count } END { print count }' filename
3000000
real 3.67
user 3.55
sys 0.10
awk '/./ { ++count } END { print count }' filename
3000000
real 2.96
user 2.81
sys 0.12
grep -c . filename
3000000
real 0.94
user 0.84
sys 0.08
# sed is just a slower grep here.
sed -n -e '/./p' filename | wc -l
real 6.00
user 5.64
sys 0.14
3000000
# If you REALLY love sed, you can replace wc -l, too!
sed -n -e '/./p' filename | sed -n -e '$='
real 7.43
user 5.70
sys 0.19
3000000
tr -s '\012' < filename | wc -l
real 1.21
user 0.84
sys 0.13
3000000
# -b and -s are non-POSIX extensions to cat found on GNU and
# BSD systems.
cat -b -s filename | tail -n 2 | cut -f 1
real 1.20
user 0.63
sys 0.16
2999999
3000000
sh -c 'count=0
while read ln ; do
[ -n "$ln" ] && count=$(($count+1))
done
echo $count' < filename
3000000
real 240.14
user 214.12
sys 24.96
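# Caveat: Perl treats the string "0" as false, so a line containing
# only 0 would not be counted by this one.
perl -e 'while (<>) { chomp; if ($_) { ++$count } } ;
print "$count\n"' < filename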
3000000
real 7.22
user 7.00
sys 0.14
perl -e 'while (<>) { if (/./) { ++$count } } ;
print "$count\n"' < filename
3000000
real 8.93
user 8.78
sys 0.11
# This one displays a separate count of blank, non-blank, and
# total lines.
awk 'NF != 0 {++nonblank} NF == 0 {++blank}
END {print "Non-blank:",nonblank ; print "Blank:",blank ;
print "Total:",NR}' filename
Non-blank: 3000000
Blank: 1500000
Total: 4500000
real 5.42
user 5.28
sys 0.12
# Actually, we don't need a separate pattern and action to
# count blank lines; we can subtract from the total instead.
awk 'NF != 0 {++count}
END {print "Non-blank:",count ; print "Blank:",NR-count ;
print "Total:",NR}' filename
Non-blank: 3000000
Blank: 1500000
Total: 4500000
real 3.75
user 3.64
sys 0.09
# This does the same, but has to read the file three separate
# times.  On your system, it might be faster or slower than the
# one above, depending on whether CPU or I/O is the bottleneck.
sh -c 'printf "Non-blank: " ; grep -c . filename ;
printf "Blank: " ; grep -v -c . filename ;
printf "Total: " ; wc -l filename | cut -d " " -f 1'
Non-blank: 3000000
Blank: 1500000
Total: 4500000
real 2.25
user 1.91
sys 0.31
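
One thing worth checking on your own data is that these commands do not all agree on what "blank" means. With default field splitting, awk's NF is zero for a line holding only spaces or tabs, while 'grep -c .' counts such a line as non-blank. The sketch below shows the difference on a tiny input; the function names are made up for illustration.

```shell
# Sample input: two text lines, one empty line, one line holding only a space.
sample() {
    printf 'alpha\n\n \nbeta\n'
}

# Any character at all makes the line count.
count_grep() { grep -c . ; }

# The line needs at least one non-whitespace field to count.
count_awk() { awk 'NF != 0 { ++n } END { print n + 0 }' ; }

# grep again, but ignoring lines that hold only whitespace.
count_grep_nonspace() { grep -c '[^[:space:]]' ; }

sample | count_grep            # prints 3
sample | count_awk             # prints 2
sample | count_grep_nonspace   # prints 2
```

On the timing file used above all methods report the same count, so it presumably had no whitespace-only lines.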

EXERCISE TWO

Determine whether a given value is numeric (decimal).

Example numeric values:

 123       45.6789   -3.4567   -0        000123
 .01234    54321.    00000.    -0.987    -.987
 -0123.    012       0.0       .0        -.000

Example non-numeric values:

 hello     3f        3F        AB        0xAB
 0.0.      -0-       3.0E8     3.0e-08   .-0123
 1.23.4    5.678-    --98      -.        a space
 a tab

As a bonus, make your command also consider a value numeric if it starts with a + instead of a -.

I haven't thought about this one as much, and only have one solution so far. Maybe you can come up with something using bc or some other non-obvious method?

Answer: I wasn't able to find a way of doing this using bc or shell arithmetic as I thought I might. So the fallback was to use grep and build an appropriate regular expression (regex).

We ran out of time before I could explain the full regex, so here is the command in all its glory:

 egrep -q '^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$'

or, if you want to strictly conform to POSIX,

 grep -E '^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$' > /dev/null

We are using an extended regex, so we have to use egrep or the -E option to grep. Since we want to get a true/false value, we use the -q option or redirection to throw away the output. This way, we can just act based on grep's exit status.
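
Acting on the exit status might look like the following sketch. The function name is_numeric is an assumption made for illustration; the regex is the one given above.

```shell
# Hypothetical wrapper around the regex above.
# Exit status 0 means "numeric", non-zero means "not numeric".
is_numeric() {
    printf '%s\n' "$1" | grep -Eq '^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$'
}

value=-0123.
if is_numeric "$value"; then
    echo "$value is numeric"
else
    echo "$value is not numeric"
fi
```

Because grep works line by line, a value containing embedded newlines would be accepted if any one of its lines looks numeric, so this is only suitable for single-line values.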

Let's pick apart the regex to see what it does.

^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$
^
The ^ at the beginning means that our regex will only match starting at the beginning of the line. The anticipated use for this command is something like 'echo "$value" | grep blah...', so we know there won't be any extraneous stuff at the beginning. If you are getting the value from a file or as input from the user, you may need to strip away whitespace from the beginning, or alter the regex to account for that.
^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$
[-+]
An expression inside brackets matches a single character, as long as that character is one of those listed inside the brackets (a range of characters can be specified, as can special pre-defined character classes, but in this case we're not using either of those). So this would match either a - sign or a + sign. Note that inside a bracket expression, + has no special meaning.
^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$
[-+]?
The question mark means "match zero or one of the preceding character" (more precisely, of the preceding element, which here is the whole bracket expression). This makes it so that having a sign at the beginning of our number is optional, and also disallows multiple sign characters.
^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$
( ... | ... )
The parentheses are there for grouping what's inside as a single unit. The pipe symbolizes alternation - that is, this part of the regex will match either the expression appearing before the pipe symbol, or the expression appearing after it. We need to use alternation because while it is optional to have numbers before the decimal point (e.g., .123) or after the decimal point (e.g., 123.), it is not valid for both sets of numbers to be missing (e.g., just a .).
^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$
[0-9]
Another bracket expression, this matches any single numeral (that is, any character in the range 0 through 9).
^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$
[0-9]+
The + sign outside a bracket expression does have special meaning. It means "match one or more of the preceding character." So this will match one or more numerals, but not an empty string.
^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$
\.
The dot has the special meaning "match any character." If we want to literally match a period, we have to escape it with a backslash to remove its special meaning. Note that [.] would do the same thing, as dot has no special meaning inside a bracket expression.
^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$
\.?
Again, question mark means to match zero or one of the preceding character. This makes the decimal point optional.
^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$
[0-9]+\.?
This is the entire first expression of our alternation. It will match an integer of any length, optionally followed by a decimal point. So 0, 123, 123., 00000., 000123, and 123123123123123123123 would all match, but just a decimal point would not.
^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$
[0-9]*
We've seen [0-9] before, but the * is new. It is similar to +, but means "match zero or more of the preceding character." We use * instead of + because it's valid to have nothing in front of the decimal point (e.g., .123).
^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$
\.
This matches a decimal point again, but note there is no question mark. In this expression, one single decimal point is mandatory.
^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$
[0-9]+
Again, this matches one or more numerals. Having numbers after the decimal point is not optional here.
^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$
[0-9]*\.[0-9]+
This is the full second expression of our alternation. It will match any value that has digits after the decimal point, such as 0.123, 1.234, .123, 0098.6, 123.456, or 3.1415926535897932384626433832795028841971693993751 (but not exponent notation like 3.0E8).
^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$
([0-9]+\.?|[0-9]*\.[0-9]+)
This is the full non-sign part of our regex. As discussed above regarding alternation, it means "either an integer, optionally followed by a decimal point, or an optional set of numerals followed by a (mandatory) decimal point and one or more trailing numerals."
^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$
$
This is the counterpart to the ^ character, forcing a match at the end of the line. Putting the expression inside ^$ forbids any extraneous characters before or after our match.
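
One way to see the two sides of the alternation in isolation is to give grep each one as its own pattern. This was not covered at the meeting and is only a sketch: the function name is made up, and the patterns are rewritten as basic regexes, where \{0,1\} and \{1,\} stand in for ? and +.

```shell
# Branch 1: digits with an optional trailing point   (the [0-9]+\.? side).
# Branch 2: optional digits, a point, then digits    (the [0-9]*\.[0-9]+ side).
# A line is accepted if either pattern matches it.
is_numeric_bre() {
    printf '%s\n' "$1" | grep \
        -e '^[-+]\{0,1\}[0-9]\{1,\}\.\{0,1\}$' \
        -e '^[-+]\{0,1\}[0-9]*\.[0-9]\{1,\}$' > /dev/null
}

for value in 123. .5 . -.; do
    if is_numeric_bre "$value"; then
        printf '%s\tNumber\n' "$value"
    else
        printf '%s\tNot a number\n' "$value"
    fi
done
```

Removing either -e line and re-running shows which inputs each branch is responsible for.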

As we talked about, we could use this regex in awk like so:

 awk '/^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$/ {action-list}'

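where the action list would be executed for each numeric value in the input. Note that standard sed does NOT support the extended regexes understood by egrep or 'grep -E'. It only handles the basic regexes of standard grep, so we cannot use this regex as written with sed. (Some implementations offer an option that enables extended regexes, such as -r in GNU sed or -E in BSD sed.)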
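
As a concrete action list, the sketch below counts and adds up the numeric lines and skips everything else. The function name and the choice of summing are assumptions made for illustration only.

```shell
# Hypothetical example: count and sum the lines that pass the numeric test.
sum_numeric() {
    awk '/^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$/ { total += $0 ; ++n }
         END { print n + 0, "numeric values, sum", total + 0 }'
}

printf '%s\n' 1 2.5 hello -0.5 3.0E8 | sum_numeric
# prints: 3 numeric values, sum 3
```

Without the pattern, awk would happily treat 3.0E8 as a number and hello as zero, which is why the filter comes first.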

If you are having difficulty understanding any of this, try playing around with different input values and/or altering the regex to see what the different parts do. Ask questions on the main wplug mailing list if you get really stuck. Have fun!
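
In the spirit of trying other options, here is one more approach that uses no regex at all, only the shell's own case patterns. It was not presented at the meeting; the function name is made up, and it is a sketch that has only been reasoned through against the example values above.

```shell
# Hypothetical helper: pure-shell numeric test using case patterns.
is_numeric_case() {
    v=$1
    # Drop at most one leading sign.
    case $v in
        [-+]*) v=${v#?} ;;
    esac
    case $v in
        '' | .)     return 1 ;;   # nothing left, or a lone decimal point
        *[!0-9.]*)  return 1 ;;   # any character other than a digit or a point
        *.*.*)      return 1 ;;   # more than one decimal point
        *)          return 0 ;;
    esac
}

for value in 123 -.987 +0123. 3.0E8 -. 1.23.4; do
    if is_numeric_case "$value"; then
        printf '%s\tNumber\n' "$value"
    else
        printf '%s\tNot a number\n' "$value"
    fi
done
```

This avoids starting a grep process for every value, which matters if you are testing many values in a loop.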

Example output:

cat numbers
123
45.6789
-3.4567
-0
000123
.01234
54321.
00000.
-0.987
-.987
-0123.
012
0.0
.0
-.000

cat nonnumbers
hello
3f
3F
AB
0xAB
0.0.
-0-
3.0E8
3.0e-08
.-0123
1.23.4
5.678-
--98
.-
 
	

cat bonusnumbers
123
45.6789
+3.4567
+0
000123
.01234
54321.
00000.
+0.987
+.987
+0123.
012
0.0
.0
+.000

cat numbers | while read value ; do echo "$value" | \
grep -E '^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$' > /dev/null \
&& echo "Number" || echo "Not a number"; done \
| paste numbers -
123	Number
45.6789	Number
-3.4567	Number
-0	Number
000123	Number
.01234	Number
54321.	Number
00000.	Number
-0.987	Number
-.987	Number
-0123.	Number
012	Number
0.0	Number
.0	Number
-.000	Number

cat nonnumbers | while read value ; do echo "$value" | \
grep -E '^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$' > /dev/null \
&& echo "Number" || echo "Not a number"; done \
| paste nonnumbers -
hello	Not a number
3f	Not a number
3F	Not a number
AB	Not a number
0xAB	Not a number
0.0.	Not a number
-0-	Not a number
3.0E8	Not a number
3.0e-08	Not a number
.-0123	Not a number
1.23.4	Not a number
5.678-	Not a number
--98	Not a number
.-	Not a number
 	Not a number
		Not a number

cat bonusnumbers | while read value ; do echo "$value" | \
grep -E '^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$' > /dev/null \
&& echo "Number" || echo "Not a number"; done \
| paste bonusnumbers -
123	Number
45.6789	Number
+3.4567	Number
+0	Number
000123	Number
.01234	Number
54321.	Number
00000.	Number
+0.987	Number
+.987	Number
+0123.	Number
012	Number
0.0	Number
.0	Number
+.000	Number
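
Note that a plain 'read value' strips leading and trailing whitespace and interprets backslashes before grep ever sees the value. If you want each line tested exactly as it appears in the file, a variation along these lines could be used. The function name classify is an assumption, and printing the value alongside the verdict replaces the paste step.

```shell
# Hypothetical variation on the loops above.
# IFS= keeps leading/trailing whitespace; -r keeps backslashes literal.
classify() {
    while IFS= read -r value; do
        if printf '%s\n' "$value" | grep -Eq '^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$'; then
            printf '%s\t%s\n' "$value" "Number"
        else
            printf '%s\t%s\n' "$value" "Not a number"
        fi
    done
}

# ' 42' has a leading space, so it is reported as not a number here.
printf '%s\n' 123 -.987 ' 42' | classify
```

Usage with the files above would be 'classify < numbers'. It still runs one grep per line, so it is no faster than the original loops.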

Meeting Minutes

DRAFT

The regular monthly meeting of the Western Pennsylvania Linux Users Group was held on Saturday, June 12, 2010, at 11:06 AM, at the Wilkins School Community Center, the regular presiding officer being in the chair. In the absence of the regular secretary, Vance Kochenderfer was elected to serve as secretary pro tem. The minutes of the October 31, 2009 meeting were approved as read.

The Treasurer reported that there is $760.80 in the checking account, $66 cash on hand in the refreshment fund, and $40 received in dues yet to be deposited.

The meeting adjourned at 11:09 AM.

Vance Kochenderfer
Secretary pro tem

DRAFT

Meeting Staff

If you would like to volunteer to assist with this meeting, please add your name to one or more of the categories below.

Carpooling

  • Your name/location here