Meeting-20100612
WPLUG will have a General User Meeting and presentation on Saturday, June 12th, 2010, starting at 11am at the Wilkins School Community Center.
Schedule for the Day
10:30am - Doors open, set up
11:00am - Business Meeting starts
11:30am - Featured Presentation
12:30pm - Meeting ends, everyone out. We are likely to go to D's 6pack or Square Cafe for lunch.
Speaker/Presentation
Vance Kochenderfer will be talking a bit about the UNIX text processing utilities such as grep, sed, awk, cat, wc, and the like.
However, you don't get to just sit on your butt and listen; this is an audience-participation event. What we're going to do is take a couple of simple tasks, and then explore how you could accomplish them using various UNIX utilities.
The goal is not solely to find the standard, quickest, or simplest solution, but to try out as many different whacked-out options as we can. So don't stop thinking once you've got an answer, even if it's a good one - see what else you can come up with!
We'll talk over all the suggestions and how they work (or don't work), so hopefully we'll all learn something new.
Start thinking about these, and bring your ideas to the meeting:
EXERCISE ONE
You have a large text file. Some lines contain text; others are blank. Your goal is to figure out how many non-blank lines are in the file.
I can think of at least six ways of doing this. How about you?
Below are the various examples I came up with, and running times using an input file with 1.5 million blank lines and 3 million non-blank lines. You can generate these statistics by preceding the command with 'time -p'.
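The test file itself is not given here. A comparable one could be generated with something like the sketch below, which assumes a repeating pattern of two non-blank lines followed by one blank line (matching the 2:1 ratio quoted above). The function name make_testfile, the line text, and the use of mktemp are all made up for illustration.

```sh
# Hypothetical helper: write N groups of two non-blank lines plus one
# blank line to the named file. The real test file's contents are not
# known; only the 3,000,000 : 1,500,000 ratio is taken from the text.
make_testfile() {
    groups=$1
    out=$2
    awk -v n="$groups" 'BEGIN {
        for (i = 1; i <= n; i++) {
            print "line " i " has some text"
            print "so does this one"
            print ""
        }
    }' > "$out"
}

# Small demonstration; 1500000 groups would reach the sizes quoted above.
tmp=$(mktemp)
make_testfile 1000 "$tmp"
grep -c . "$tmp"       # number of non-blank lines
grep -c -v . "$tmp"    # number of blank lines
rm -f "$tmp"
```

Timings will of course depend on the machine and on how long the non-blank lines are, so do not expect to reproduce the exact figures.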
# First example from <http://www.vectorsite.net/tsawk_3.html>
awk 'NF != 0 { ++count } END { print count }' filename
3000000
real 3.67
user 3.55
sys 0.10

awk '/./ { ++count } END { print count }' filename
3000000
real 2.96
user 2.81
sys 0.12

time -p grep -c . filename
3000000
real 0.94
user 0.84
sys 0.08

# sed is just a slower grep here.
sed -n -e '/./p' filename | wc -l
real 6.00
user 5.64
sys 0.14
3000000

# If you REALLY love sed, you can replace wc -l, too!
sed -n -e '/./p' filename | sed -n -e '$='
real 7.43
user 5.70
sys 0.19
3000000

tr -s '\012' < filename | wc -l
real 1.21
user 0.84
sys 0.13
3000000

# -b and -s are non-POSIX extensions to cat found on GNU and
# BSD systems.
cat -b -s filename | tail -n 2 | cut -f 1
real 1.20
user 0.63
sys 0.16
2999999
3000000

sh -c 'count=0
while read ln ; do
  [ -n "$ln" ] && count=$(($count+1))
done
echo $count' < filename
3000000
real 240.14
user 214.12
sys 24.96

perl -e 'while (<>) { chomp; if ($_) { ++$count } } ; print "$count\n"' < filename
3000000
real 7.22
user 7.00
sys 0.14

perl -e 'while (<>) { if (/./) { ++$count } } ; print "$count\n"' < filename
3000000
real 8.93
user 8.78
sys 0.11

# This one displays a separate count of blank, non-blank, and
# total lines.
awk 'NF != 0 {++nonblank}
NF == 0 {++blank}
END {print "Non-blank:",nonblank ; print "Blank:",blank ; print "Total:",NR}' filename
Non-blank: 3000000
Blank: 1500000
Total: 4500000
real 5.42
user 5.28
sys 0.12

# Actually, we don't need a separate pattern and action to
# count blank lines; we can subtract from the total instead.
awk 'NF != 0 {++count}
END {print "Non-blank:",count ; print "Blank:",NR-count ; print "Total:",NR}' filename
Non-blank: 3000000
Blank: 1500000
Total: 4500000
real 3.75
user 3.64
sys 0.09

# This does the same, but has to read the file three separate
# times. On your system, might be faster or slower than the
# one above; depends on whether CPU or I/O is the bottleneck.
sh -c 'printf "Non-blank: " ; grep -c . filename ;
printf "Blank: " ; grep -v -c . filename ;
printf "Total: " ; wc -l filename | cut -d " " -f 1'
Non-blank: 3000000
Blank: 1500000
Total: 4500000
real 2.25
user 1.91
sys 0.31
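All of these commands report the same count on the test file, but they are not interchangeable on every input. The sketch below wraps three of them in functions (the function names and the sample file are invented for illustration) and runs them on a file that starts with a blank line and contains a line holding only a space. The extra tr -d ' ' is there only because some wc implementations pad their output.

```sh
# Hypothetical wrappers around three of the commands shown above.
count_grep()   { grep -c . "$1"; }
count_awk_nf() { awk 'NF != 0 { ++count } END { print count + 0 }' "$1"; }
count_tr()     { tr -s '\012' < "$1" | wc -l | tr -d ' '; }

# Made-up sample: a leading blank line, "foo", a blank line,
# a line holding one space, and "bar".
sample=$(mktemp)
printf '\nfoo\n\n \nbar\n' > "$sample"

echo "grep -c .    : $(count_grep "$sample")"    # 3: the space-only line counts
echo "awk NF != 0  : $(count_awk_nf "$sample")"  # 2: space-only line has no fields
echo "tr -s | wc -l: $(count_tr "$sample")"      # 4: one leading newline survives
rm -f "$sample"
```

Which answer is "right" depends on whether a line containing only whitespace should count as blank, so it is worth deciding that before picking a command.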
EXERCISE TWO
Determine whether a given value is numeric (decimal).
Example numeric values:
123
45.6789
-3.4567
-0
000123
.01234
54321.
00000.
-0.987
-.987
-0123.
012
0.0
.0
-.000
Example non-numeric values:
hello
3f
3F
AB
0xAB
0.0.
-0-
3.0E8
3.0e-08
.-0123
1.23.4
5.678-
--98
-.
(a space)
(a tab)
As a bonus, make your command also consider a value numeric if it starts with a + instead of a -.
I haven't thought about this one as much, and only have one solution so far. Maybe you can come up with something using bc or some other non-obvious method?
Answer: I wasn't able to find a way of doing this using bc or shell arithmetic as I thought I might. So the fallback was to use grep and build an appropriate regular expression (regex).
We ran out of time before I could explain the full regex, so here is the command in all its glory:
egrep -q '^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$'
or, if you want to strictly conform to POSIX,
grep -E '^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$' > /dev/null
We are using an extended regex, so we have to use egrep or the -E option to grep. Since we want to get a true/false value, we use the -q option or redirection to throw away the output. This way, we can just act based on grep's exit status.
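Acting on grep's exit status could look something like the sketch below. The function name is_numeric and the sample values in the loop are made up; printf is used instead of echo only because some echo implementations treat arguments such as -n as options.

```sh
# Hypothetical wrapper: succeeds (exit status 0) if the argument is numeric
# according to the regex above, fails (non-zero) otherwise.
is_numeric() {
    printf '%s\n' "$1" |
        grep -E '^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$' > /dev/null
}

# The exit status can be tested directly with if.
for value in 123 -.987 3.0E8 '' ; do
    if is_numeric "$value" ; then
        echo "'$value' is numeric"
    else
        echo "'$value' is not numeric"
    fi
done
```

The first two values should be reported as numeric and the last two as not numeric.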
Let's pick apart the regex to see what it does.
- ^
- The ^ at the beginning means that our regex will only match starting at the beginning of the line. The anticipated use for this command is something like 'echo "$value" | grep blah...', so we know there won't be any extraneous stuff at the beginning. If you are getting the value from a file or as input from the user, you may need to strip away whitespace from the beginning, or alter the regex to account for that.
- [-+]
- An expression inside brackets matches a single character, as long as that character is one of those listed inside the brackets (a range of characters can be specified, as can special pre-defined character classes, but in this case we're not using either of those). So this would match either a - sign or a + sign. Note that inside a bracket expression, + has no special meaning.
- [-+]?
- The question mark means "match zero or one of the preceding item," which here is the whole bracket expression. This makes it so that having a sign at the beginning of our number is optional, and also disallows multiple sign characters.
- ( ... | ... )
- The parentheses are there for grouping what's inside as a single unit. The pipe symbolizes alternation - that is, this part of the regex will match either the expression appearing before the pipe symbol, or the expression appearing after it. We need to use alternation because while it is optional to have numbers before the decimal point (e.g., .123) or after the decimal point (e.g., 123.), it is not valid for both sets of numbers to be missing (e.g., just a .).
- [0-9]
- Another bracket expression, this matches any single numeral (that is, any character in the range 0 through 9).
- [0-9]+
- The + sign outside a bracket expression does have special meaning. It means "match one or more of the preceding item," which here is the bracket expression [0-9]. So this will match one or more numerals, but not an empty string.
- \.
- The dot has the special meaning "match any character." If we want to literally match a period, we have to escape it with a backslash to remove its special meaning. Note that [.] would do the same thing, as dot has no special meaning inside a bracket expression.
- \.?
- Again, question mark means to match zero or one of the preceding character. This makes the decimal point optional.
- [0-9]+\.?
- This is the entire first expression of our alternation. It will match an integer of any length, optionally followed by a decimal point. So 0, 123, 123., 00000., 000123, and 123123123123123123123 would all match, but just a decimal point would not.
- [0-9]*
- We've seen [0-9] before, but the * is new. It is similar to +, but means "match zero or more of the preceding item." We use * instead of + because it's valid to have nothing in front of the decimal point (e.g., .123).
- \.
- This matches a decimal point again, but note there is no question mark. In this expression, one single decimal point is mandatory.
- [0-9]+
- Again, this matches one or more numerals. Having numbers after the decimal point is not optional here.
- [0-9]*\.[0-9]+
- This is the full second expression of our alternation. It will match any floating point value, such as 0.123, 1.234, .123, 0098.6, 123.456, or 3.1415926535897932384626433832795028841971693993751.
- ([0-9]+\.?|[0-9]*\.[0-9]+)
- This is the full non-sign part of our regex. As discussed above regarding alternation, it means "either an integer, optionally followed by a decimal point, or an optional set of numerals followed by a (mandatory) decimal point and one or more trailing numerals."
- $
- This is the counterpart to the ^ character, forcing a match at the end of the line. Putting the expression inside ^$ forbids any extraneous characters before or after our match.
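One way to see the two halves of the alternation at work is to test each one separately. The sketch below splits the regex into its two branches, each wrapped in the same optional sign and ^ $ anchors; the variable and function names are invented for illustration.

```sh
# The two alternatives from the full regex, each anchored on its own.
int_branch='^[-+]?[0-9]+\.?$'          # integer, optional trailing dot
frac_branch='^[-+]?[0-9]*\.[0-9]+$'    # optional digits, dot, required digits

# Hypothetical helper: report which branch (if any) accepts the value.
which_branch() {
    if printf '%s\n' "$1" | grep -E "$int_branch" > /dev/null ; then
        echo "integer branch"
    elif printf '%s\n' "$1" | grep -E "$frac_branch" > /dev/null ; then
        echo "fraction branch"
    else
        echo "no match"
    fi
}

for value in 123 123. .123 0098.6 . -. ; do
    printf '%s\t%s\n' "$value" "$(which_branch "$value")"
done
```

No value can satisfy both branches, since the first allows nothing after the decimal point and the second requires at least one numeral there. A lone . (with or without a sign) satisfies neither, which is exactly why the alternation is needed.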
As we talked about, we could use this regex in awk like so:
awk '/^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$/ {action-list}'
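where the action list would be executed for each numeric value in the input. Note that sed does NOT, by default, support the extended regexes understood by egrep or 'grep -E'. Unless your sed offers an option for extended regexes (such as -r in GNU sed or -E in BSD sed), it only handles the basic regexes of standard grep, so we cannot use this regex with sed as written.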
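As a concrete (made-up) action list, the sketch below labels every input line. The function name classify and the sample values are assumptions for illustration; the regex is the one from above.

```sh
# Hypothetical filter: label each input line as a number or not.
# "next" skips the second rule once the first one has matched.
classify() {
    awk '/^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$/ { print $0 ": number"; next }
         { print $0 ": not a number" }'
}

printf '%s\n' 123 hello -.987 3.0E8 54321. | classify
```

Here 123, -.987, and 54321. should be labeled as numbers, and hello and 3.0E8 as not.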
If you are having difficulty understanding any of this, try playing around with different input values and/or altering the regex to see what the different parts do. Ask questions on the main wplug mailing list if you get really stuck. Have fun!
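As one example of altering the regex, here is a sketch of a variant that tolerates whitespace before and after the value, as mentioned in the discussion of ^. It assumes the POSIX character class [[:space:]] is acceptable for "whitespace"; the function name is_numeric_ws is made up.

```sh
# Hypothetical variant: allow any amount of leading or trailing whitespace
# around the number, but still none inside it.
is_numeric_ws() {
    printf '%s\n' "$1" |
        grep -E '^[[:space:]]*[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)[[:space:]]*$' > /dev/null
}

for value in '  42' '42  ' ' -.5 ' '4 2' ' ' ; do
    if is_numeric_ws "$value" ; then
        echo "[$value] is numeric"
    else
        echo "[$value] is not numeric"
    fi
done
```

A value consisting only of whitespace is still rejected, because at least one numeral is required in either branch of the alternation.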
Example output:
cat numbers
123
45.6789
-3.4567
-0
000123
.01234
54321.
00000.
-0.987
-.987
-0123.
012
0.0
.0
-.000

cat nonnumbers
hello
3f
3F
AB
0xAB
0.0.
-0-
3.0E8
3.0e-08
.-0123
1.23.4
5.678-
--98
.-
 
	

cat bonusnumbers
123
45.6789
+3.4567
+0
000123
.01234
54321.
00000.
+0.987
+.987
+0123.
012
0.0
.0
+.000

cat numbers | while read value ; do echo "$value" | \
grep -E '^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$' > /dev/null \
&& echo "Number" || echo "Not a number"; done \
| paste numbers -
123	Number
45.6789	Number
-3.4567	Number
-0	Number
000123	Number
.01234	Number
54321.	Number
00000.	Number
-0.987	Number
-.987	Number
-0123.	Number
012	Number
0.0	Number
.0	Number
-.000	Number

cat nonnumbers | while read value ; do echo "$value" | \
grep -E '^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$' > /dev/null \
&& echo "Number" || echo "Not a number"; done \
| paste nonnumbers -
hello	Not a number
3f	Not a number
3F	Not a number
AB	Not a number
0xAB	Not a number
0.0.	Not a number
-0-	Not a number
3.0E8	Not a number
3.0e-08	Not a number
.-0123	Not a number
1.23.4	Not a number
5.678-	Not a number
--98	Not a number
.-	Not a number
 	Not a number
		Not a number

cat bonusnumbers | while read value ; do echo "$value" | \
grep -E '^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$' > /dev/null \
&& echo "Number" || echo "Not a number"; done \
| paste bonusnumbers -
123	Number
45.6789	Number
+3.4567	Number
+0	Number
000123	Number
.01234	Number
54321.	Number
00000.	Number
+0.987	Number
+.987	Number
+0123.	Number
012	Number
0.0	Number
.0	Number
+.000	Number
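For anyone still hunting for a non-grep method: one possibility, not something covered at the meeting, is to let the shell's own pattern matching in a case statement do the work. The sketch below is only an assumption about how that might look. The function name is_numeric_case is made up, and it has only been reasoned through against the example values listed earlier.

```sh
# Hypothetical regex-free check using only POSIX shell pattern matching.
is_numeric_case() {
    v=$1
    # Strip one optional leading sign.
    case $v in
        [-+]*) v=${v#?} ;;
    esac
    case $v in
        ''|.)      return 1 ;;  # nothing left, or only a decimal point
        *[!0-9.]*) return 1 ;;  # some character that is neither digit nor dot
        *.*.*)     return 1 ;;  # more than one decimal point
        *)         return 0 ;;  # digits with at most one dot
    esac
}

for value in 123 -.987 +0123. 3.0E8 1.23.4 --98 -. ; do
    if is_numeric_case "$value" ; then
        echo "$value: Number"
    else
        echo "$value: Not a number"
    fi
done
```

The first three values should come out as numbers and the rest as not. No external process is started for each value, which may matter inside a long loop.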
Meeting Minutes
DRAFT
The regular monthly meeting of the Western Pennsylvania Linux Users Group was held on Saturday, June 12, 2010, at 11:06 AM, at the Wilkins School Community Center, the regular presiding officer being in the chair. In the absence of the regular secretary, Vance Kochenderfer was elected to serve as secretary pro tem. The minutes of the October 31, 2009 meeting were approved as read.
The Treasurer reported that there is $760.80 in the checking account, $66 cash on hand in the refreshment fund, and $40 received in dues yet to be deposited.
The meeting adjourned at 11:09 AM.
Vance Kochenderfer
Secretary pro tem
DRAFT
Meeting Staff
If you would like to volunteer to assist with this meeting, please add your name to one or more of the categories below.
- Host: Your name here
- Co-Host: Your name here
- Donuts/Bagels: David Kraus
- Setup: David Kraus, Your name here
- Clean Up: David Kraus, Your name here
Carpooling
- Your name/location here