Meeting-20100612
Revision as of 03:53, 13 June 2010
WPLUG will have a General User Meeting and presentation on Saturday, June 12th, 2010, starting at 11am at the Wilkins School Community Center.
Schedule for the Day
10:30am - Doors open, set up
11:00am - Business Meeting starts
11:30am - Featured Presentation
12:30pm - Meeting ends, everyone out. We are likely to go to D's 6pack or Square Cafe for lunch.
Speaker/Presentation
Vance Kochenderfer will be talking a bit about the UNIX text processing utilities such as grep, sed, awk, cat, wc, and the like.
However, you don't get to just sit on your butt and listen; this is an audience-participation event. What we're going to do is take a couple of simple tasks, and then explore how you could accomplish them using various UNIX utilities.
The goal is not solely to find the standard, quickest, or simplest solution, but to try out as many different whacked-out options as we can. So don't stop thinking once you've got an answer, even if it's a good one - see what else you can come up with!
We'll talk over all the suggestions and how they work (or don't work), so hopefully we'll all learn something new.
Start thinking about these, and bring your ideas to the meeting:
EXERCISE ONE
You have a large text file. Some lines contain text; others are blank. Your goal is to figure out how many non-blank lines are in the file.
I can think of at least six ways of doing this; how about you?
Below are the various examples I came up with, and running times using an input file with 1.5 million blank lines and 3 million non-blank lines. You can generate these statistics by preceding the command with 'time -p'.
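To try the examples on your own machine you need a comparable input file. The file used for the timings is not provided, so the following is only a sketch of one way to build something similar. The function name make_testfile, the file location, and the pattern of two text lines per blank line are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not from the original page).
# make_testfile OUTFILE GROUPS writes GROUPS blank lines and 2*GROUPS text lines.
make_testfile() {
    awk -v n="$2" 'BEGIN {
        for (i = 1; i <= n; i++) {
            print "some text " i
            print ""
            print "more text " i
        }
    }' > "$1"
}

# Small demonstration file; the location is an arbitrary choice.
testfile="${TMPDIR:-/tmp}/blanktest.txt"
make_testfile "$testfile" 1000

# Sanity check: count the non-blank lines.
grep -c . "$testfile"
```

Calling make_testfile with 1500000 as the second argument should give a file with the same line counts as the one described above, although the line contents (and therefore the timings) will differ.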
# First example from <http://www.vectorsite.net/tsawk_3.html>
awk 'NF != 0 { ++count } END { print count }' filename
3000000
real 3.67
user 3.55
sys 0.10

awk '/./ { ++count } END { print count }' filename
3000000
real 2.96
user 2.81
sys 0.12

time -p grep -c . filename
3000000
real 0.94
user 0.84
sys 0.08

# sed is just a slower grep here.
sed -n -e '/./p' filename | wc -l
real 6.00
user 5.64
sys 0.14
3000000

# If you REALLY love sed, you can replace wc -l, too!
sed -n -e '/./p' filename | sed -n -e '$='
real 7.43
user 5.70
sys 0.19
3000000

tr -s '\012' < filename | wc -l
real 1.21
user 0.84
sys 0.13
3000000

# -b and -s are non-POSIX extensions to cat found on GNU and
# BSD systems.
cat -b -s filename | tail -n 2 | cut -f 1
real 1.20
user 0.63
sys 0.16
2999999
3000000

sh -c 'count=0
while read ln ; do
  [ -n "$ln" ] && count=$(($count+1))
done
echo $count' < filename
3000000
real 240.14
user 214.12
sys 24.96
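
# Caution: Perl treats the string "0" as false, so a line
# containing only 0 is not counted by this test.
perl -e 'while (<>) { chomp; if ($_) { ++$count } } ;
print "$count\n"' < filename
3000000
real 7.22
user 7.00
sys 0.14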

perl -e 'while (<>) { if (/./) { ++$count } } ;
print "$count\n"' < filename
3000000
real 8.93
user 8.78
sys 0.11

# This one displays a separate count of blank, non-blank, and
# total lines.
awk 'NF != 0 {++nonblank} NF == 0 {++blank}
END {print "Non-blank:",nonblank ; print "Blank:",blank ;
print "Total:",NR}' filename
Non-blank: 3000000
Blank: 1500000
Total: 4500000
real 5.42
user 5.28
sys 0.12

# Actually, we don't need a separate pattern and action to
# count blank lines; we can subtract from the total instead.
awk 'NF != 0 {++count}
END {print "Non-blank:",count ; print "Blank:",NR-count ;
print "Total:",NR}' filename
Non-blank: 3000000
Blank: 1500000
Total: 4500000
real 3.75
user 3.64
sys 0.09

# This does the same, but has to read the file three separate
# times. On your system, might be faster or slower than the
# one above; depends on whether CPU or I/O is the bottleneck.
sh -c 'printf "Non-blank: " ; grep -c . filename ;
printf "Blank: " ; grep -v -c . filename ;
printf "Total: " ; wc -l filename | cut -d " " -f 1'
Non-blank: 3000000
Blank: 1500000
Total: 4500000
real 2.25
user 1.91
sys 0.31
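One thing worth checking before picking a favorite: the commands above do not all agree on what "blank" means. With awk's default field splitting, a line holding only spaces or tabs has NF equal to 0, while the regular expression . matches any character, including a space. The sketch below shows the difference on a tiny made-up sample; the variable and function names are hypothetical and chosen only for this illustration.

```shell
#!/bin/sh
# Hypothetical sample: two text lines, one empty line, one line holding a single space.
sample=$(printf 'alpha\n\n \nbeta\n')

# Counts lines with at least one field (whitespace-only lines are treated as blank).
count_awk_nf() {
    printf '%s\n' "$sample" | awk 'NF != 0 { ++count } END { print count + 0 }'
}

# Counts lines with at least one character (whitespace-only lines are counted).
count_grep_dot() {
    printf '%s\n' "$sample" | grep -c .
}

# Counts lines with at least one non-whitespace character.
count_grep_nonspace() {
    printf '%s\n' "$sample" | grep -c '[^[:space:]]'
}

echo "awk NF != 0:           $(count_awk_nf)"
echo "grep -c .:             $(count_grep_dot)"
echo "grep -c [^[:space:]]:  $(count_grep_nonspace)"
```

On the timing file all methods reported 3000000, which suggests it had no whitespace-only lines; on real-world data the counts may differ.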
EXERCISE TWO
Determine whether a given value is numeric (decimal).
Example numeric values:
123
45.6789
-3.4567
-0
000123
.01234
54321.
00000.
-0.987
-.987
-0123.
012
0.0
.0
-.000
Example non-numeric values:
hello
3f
3F
AB
0xAB
0.0.
-0-
3.0E8
3.0e-08
.-0123
1.23.4
5.678-
--98
a space
a tab
As a bonus, make your command also consider a value numeric if it starts with a + instead of a -.
I haven't thought about this one as much, and only have one solution so far. Maybe you can come up with something using bc or some other non-obvious method?
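As a starting point for discussion, here is one possible approach using grep with an extended regular expression. It is not the solution mentioned above (which is not shown on this page), only a sketch under the assumption that "numeric" means an optional sign followed by digits with at most one decimal point and at least one digit somewhere. The function name is_numeric is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: exit status 0 if the argument looks like a plain decimal number.
# -E  extended regular expressions
# -x  the whole line must match
# -q  no output, only an exit status
is_numeric() {
    printf '%s\n' "$1" | grep -E -q -x -e '[+-]?([0-9]+\.?[0-9]*|\.[0-9]+)'
}

# Try a few of the example values from the lists above.
for v in 123 -.987 54321. +42 3.0E8 0xAB 1.23.4 ' '; do
    if is_numeric "$v"; then
        echo "numeric:     '$v'"
    else
        echo "not numeric: '$v'"
    fi
done
```

The pattern has two alternatives: digits optionally followed by a point and more digits, or a point that must be followed by digits. This way a lone "." or a lone "-" is rejected. The [+-]? at the front covers the bonus case of a leading +.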
Meeting Minutes
(TBA)
Meeting Staff
If you would like to volunteer to assist with this meeting, please add your name to one or more of the categories below.
- Host: Your name here
- Co-Host: Your name here
- Donuts/Bagels: David Kraus
- Setup: David Kraus, Your name here
- Clean Up: David Kraus, Your name here
Carpooling
- Your name/location here