Learning from phoebos' script (^Z issue 1 article 3)
Since joining ^C, I have also joined the #ctrl-c channel on tilde.chat, met loghead, and heard about the ^Z zine. I wanted to read it from the first issue on, and article 3 of the first issue is a kind of challenge. In fact it's a cool script written by phoebos, and you can find it here on his page. You can easily tell what the script does from its name, but the interesting part is understanding how it works.
The first line of the script is the famous shebang, used in *nix scripts to determine which interpreter runs the script when it is exec'd:
#!/bin/sh -e
So we know the script is supposed to be interpreted by /bin/sh, but what does the -e parameter mean? Usually sh stands for the Bourne shell, but it's very rare for the original Bourne shell to actually be used, so to find out what the parameter means in this context, I ssh'd into ctrl-c to check which shell sh points to:
giggles@ctrl-c:~$ ls -l /bin/sh
lrwxrwxrwx 1 root root 4 Mar 23  2022 /bin/sh -> dash
OK, so now we know that in our context /bin/sh is a symlink to dash, and we can check the dash manual to get the meaning of the parameter:
giggles@ctrl-c:~$ man dash
...
   Argument List Processing
     All of the single letter options that have a corresponding name can be used as an argument to the -o option.  The set -o name is provided next to the single letter option in the description below.  Specifying a dash “-” turns the option on, while using a plus “+” disables the option.  The following options can be set from the command line or with the set builtin
...
           -e errexit       If not interactive, exit immediately if any untested command fails.  The exit status of a command is considered to be explicitly tested if the command is used to control an if, elif, while, or until; or if the command is the left hand operand of an “&&” or “||” operator.
...
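To see errexit in action, here is a minimal sketch of my own (it is not part of the phoebos script). It assumes sh on your machine is a POSIX shell such as dash, and the variable names are made up for the demo:

```sh
# Without -e: the failing command is ignored and the script keeps going.
without_e=$(sh -c 'false; echo "still running"')
echo "without -e: [$without_e]"

# With -e: the shell exits at the first untested failure, so echo never runs.
# The "|| true" only keeps this demo alive if it is itself run under -e.
with_e=$(sh -e -c 'false; echo "still running"' || true)
echo "with -e:    [$with_e]"

# A failure on the left side of "||" counts as tested and does not exit.
tested=$(sh -e -c 'false || echo "failure was handled"; echo "done"')
echo "tested:     [$tested]"
```

The second case prints an empty pair of brackets, because the inner shell quit right after false.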
This parameter makes the script exit if something fails (a non-zero return status). That's a cool thing: I used bash/dash for years and never used this option before, so I learned something new even "before" the script content itself. Now we can proceed to the next script line. I skipped the comment line, so next we have:
exec > "$HOME/public_html/activity/index.html"
Oh God, it's a shame that I never cared enough to learn how exec works or how to use it properly, but at least I know that > redirects output. It can redirect other file descriptors too, but here, with no descriptor number given, it redirects standard output (stdout) to a file, using the $HOME variable to make the path more dynamic. There is a little problem here: by default the public_html directory in your home doesn't have an activity subdirectory. It's not a real problem, though, since we can just create it before running the script.
To understand what is happening, I chose to look at the dash man page again, and I found out that exec is a builtin documented under the Builtins section. I will reproduce the interesting part here:
   Builtins
     This section lists the builtin commands which are builtin because they need to perform some operation that can't be performed by a separate process.  In addition to these, there are several other commands that may be builtin for efficiency (e.g.  printf(1), echo(1), test(1), etc).
...
     exec [command arg ...]
            Unless command is omitted, the shell process is replaced with the specified program (which must be a real program, not a shell builtin or function).  Any redirections on the exec command are marked as permanent, so that they are not undone when the exec command finishes.
In the script no program is specified, so exec is only being used to set up the redirection; that is, from this line on, standard output will be the file $HOME/public_html/activity/index.html. That's a cool way to let a script generate or regenerate the contents of a file (e.g. a web page).
If you looked carefully at the script, you noticed that it generates a web page by sending the output of the following commands to the file specified earlier with exec. You can think of the rest of the script as in my diagram below:
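A small sketch to try this out safely. It is my own example, not the original script: it writes to a temporary file created with mktemp instead of the real index.html, and it runs exec inside a subshell so the redirection does not stick to the shell you are typing in:

```sh
# Hypothetical output file for the demo.
out=$(mktemp)

(
    # From here on, everything this subshell writes to stdout goes to $out.
    exec > "$out"
    printf '<html>\n'
    echo 'hello from the script'
    printf '</html>\n'
)

# Back in the parent shell stdout is untouched, so this is shown normally.
content=$(cat "$out")
rm -f "$out"
printf '%s\n' "$content"
```

None of the three commands inside the parentheses mentions the file, yet all their output ends up in it, which is exactly what the script relies on.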
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣠⡤⠶⠶⠚⠛⠛⠻⠿⢷⣶⡶⢾⣿⢿⣿⣷⣶⣶⣤⣤⣀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣤⠶⠛⠉⠀⡀⠴⣏⡧⠐⠚⢛⡛⠓⠺⣿⣮⡿⠦⢤⡀⠀⠈⣭⠉⠛⠻⠷⣦⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⣠⣶⠟⠉⣤⡄⠀⣠⡛⠃⢠⣦⡄⠀⠀⡛⠛⠀⠰⠿⠟⢧⡀⠀⠻⠦⠀⠀⠀⠶⠀⢠⣦⡙⢿⣶⡄⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⣠⡾⠋⠑⠀⠀⠉⠀⠐⠟⠁⢀⣤⡉⠀⠀⠸⢿⠂⠀⣶⡦⠀⠘⠇⠀⠀⠰⠆⠀⣠⣤⡀⠀⠉⠀⠀⠙⣿⣆⠀⠀⠀⠀⠀
⠀⠀⠀⢀⣼⡟⠁⠀⠀⠰⠀⠀⠀⠀⠀⠀⠀⠉⢁⣤⡀⠀⣀⠀⠀⢀⣠⡀⠀⠀⢰⣶⠀⠀⠀⠈⠉⠀⠀⠀⠼⠏⠀⠈⠻⣦⠀⠀⠀⠀
⠀⠀⢀⣿⠏⡀⠀⠀⠀⠀⠀⠀⠀⠘⠛⠀⠀⣀⡈⠉⠀⠀⠋⠁⠀⠈⠉⠁⠀⠀⠀⠁⠀⠰⠟⠀⠀⠀⠰⠆⠀⠀⠀⠄⠀⠘⢧⠀⠀⠀|-> printf (3x)
⠀⠀⣾⡟⢠⠇⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠁⠀⠀⠘⠛⠃⠀⠀⠀⢰⡆⠀⠀⠀⠀⠀⠀⠠⣤⠄⠀⢀⠀⠀⠀⠀⠀⠀⠈⣧⠀⠀
⠀⠀⣿⣇⠈⣧⡟⣆⢘⢦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠁⠀⠀⠀⠀⠀⠀⠈⠀⠀⠀⠀⠀⠀⠀⢹⡄⠀
⠀⠀⠘⣿⣦⣌⣁⠈⠚⠷⢽⣮⣦⣄⠀⢀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣼⠃⠀
⠀⢀⣤⠾⢋⠈⠉⠛⠳⠦⣤⣀⡉⠉⠉⠒⠛⠷⢶⡤⠤⠤⠤⠤⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣀⡀⠤⠴⠶⢾⣿⣅⠀⠀
⣴⣿⠿⢿⣿⠃⡀⠀⠀⠀⣀⠈⠉⠛⠒⠲⣦⣤⡤⠤⠤⠤⠤⠤⠤⠤⠤⠴⠶⠖⠲⠶⣶⣒⠛⠛⠋⠉⠉⠀⠀⠀⠀⠀⢀⣀⣈⣻⣿⣦
⠀⠀⠀⠚⠛⣻⡇⠀⣴⣏⣙⣷⠦⠶⠶⣟⠉⠀⠀⠀⠀⢠⡤⠤⣤⣀⠀⠀⠀⣤⢶⣄⣠⡭⠿⢶⣄⠀⢀⣤⠶⠶⢦⡤⠼⣿⡏⠉⠙⠻|
⠀⠀⠀⢠⣴⣿⣷⣶⣿⣤⣉⡙⠛⠒⠒⠛⠛⣿⡄⢀⣤⠾⠤⠤⠖⠛⢷⣤⣼⣿⠶⣭⣄⣀⣀⢀⣸⣾⣋⣀⣀⣤⡤⣶⣿⣿⣅⠀⠀⠀|
⠀⠀⠀⢿⣿⣿⣿⣿⣻⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⣦⣀⠀⠀⠀⠀⠀⠀⢀⣠⠴⢚⣽⣿⣿⣿⣿⣿⣿⡿⠻⠙⠋⠉⠘⣿⡄⠀⠀|-> find...| awk... 
⠀⠀⠀⣸⣿⣿⣿⣿⣿⣟⣿⣿⣿⡟⢻⣏⣹⢻⣿⢻⣻⣿⣿⣦⠀⣀⣀⡴⠞⣋⣤⣶⡿⣿⠋⠳⣞⢁⡠⠁⠀⠀⠀⠀⠀⢀⣿⡇⠀⠀|
⠀⠀⠀⠘⣿⣿⣿⣿⣿⣿⣿⣿⣖⢻⣿⣿⣿⣿⣿⣻⣿⣿⣿⡿⣧⣭⣵⠖⠋⢩⣽⢁⣀⠻⠀⢤⣤⠿⠤⠖⣦⣷⣂⣠⣴⣾⣟⠃⠀⠀|
⠀⠀⠀⣾⡟⢿⣄⣈⣽⡿⠿⠟⠛⠿⣿⣿⣿⣿⣿⣿⣹⣯⣿⣤⣾⣚⣿⣲⣿⣚⣧⣤⣿⣾⣷⣾⣿⣾⣟⣋⡉⠉⠉⣿⠀⠙⣿⡄⠀⠀
⠀⠀⠀⣿⡆⣆⠀⠀⠀⠀⠀⠀⠀⠀⠉⠙⢷⡄⠀⠉⠉⣉⡽⠛⠋⠙⠛⠿⣤⣤⠴⠟⠋⠁⠀⠀⠀⠀⠀⠈⠉⠛⠒⠛⠀⠀⢹⡇⠀⠀
⠀⠀⠀⢿⣷⢻⣿⣆⢠⡀⠀⡀⠀⠀⠀⠀⠀⠙⠳⠶⠞⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣸⡇⠀⠀|-> printf
⠀⠀⠀⠘⢿⣦⣍⡛⠦⣷⣀⠳⣀⠈⠳⣄⠀⠀⣀⢀⣀⣀⣀⣀⣀⣀⣀⠀⠀⢀⣀⣀⣀⣀⡠⠤⠀⠀⠀⠀⠀⠀⠀⢀⣠⣴⠟⠀⠀⠀
⠀⠀⠀⠀⠀⠉⠙⠛⠻⠷⣶⣶⣾⣽⣷⣦⣤⣀⣈⣉⣉⣁⣀⣀⣀⣤⣤⣤⣤⣤⣤⣤⣤⣤⣤⣶⡶⠶⠶⠶⠒⠚⠋⠉⠁⠀⠀⠀⠀⠀
I don't have much to say about the bread: it's plain HTML going to the file to set up the page formatting, so I will proceed directly to the filling, starting with the 'find' command. I have used 'find' lots of times before, but it's important to understand the parameters being used here, since it's almost impossible to fully understand what awk will do with the find output without knowing its format. Here is the command used in the script:
find /home/*/public_html -type f -name \*.html -ls 2>/dev/null
It searches inside the 'public_html' directory in every user's home, looking only for regular files (-type f) whose names end with .html. It also ignores errors by redirecting them to /dev/null. But what does -ls do? I don't know, so I will look in the find man page.
giggles@ctrl-c:~$ man find
...
   ACTIONS
...
       -ls    True; list current file in ls -dils format on standard output.  The block counts are of 1 KB blocks, unless the environment variable POSIXLY_CORRECT  is  set,  in  which  case 512-byte blocks are used.  See the UNUSUAL FILENAMES section for information about how unusual characters in filenames are handled.
Cool, now we "know" which format will be used. I personally have never used "ls -dils". We have two options here: just execute the command and figure out what the output means, or look at the documentation. We will do both, reading the man page and also running an example to understand the order of the output:
giggles@ctrl-c:~$ man ls
...
       -d, --directory
              list directories themselves, not their contents
       -i, --inode
              print the index number of each file
       -l     use a long listing format
       -s, --size
              print the allocated size of each file, in blocks
...
giggles@ctrl-c:~$ echo -n '123456' > testfile
giggles@ctrl-c:~$ ls -dils testfile
692883 4 -rw-rw-r-- 1 giggles giggles 6 Jun 24 16:28 testfile
I will not go into the details of all the columns, only the fields of interest for our main purpose, which is to understand the script. As you can see, column 5 is the username of the file owner and column 7 is the size of the file in bytes. You can test it with other files to make sure, if you want.
We finally reached the meat (oh yeah, it's awk). If you have never seen awk before, the command used in the script can look like black magic. I didn't understand what it was doing either, but I took some time to learn a bit more awk, because the purpose here is to learn from the script. I saw that awk programs all have this structure:
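If you want to check the column numbers without counting by eye, a rough sketch like this can do it. It is my own test, using a throwaway file from mktemp, and it assumes the owner name contains no spaces:

```sh
# Create a throwaway file with exactly 6 bytes in it.
tmpfile=$(mktemp)
printf '123456' > "$tmpfile"

# Let awk pick out column 5 (owner) and column 7 (size in bytes).
fields=$(ls -dils "$tmpfile" | awk '{ print $5, $7 }')
owner=${fields% *}   # everything before the last space
size=${fields#* }    # everything after the first space
echo "owner=$owner size=$size"

rm -f "$tmpfile"
```

The size printed should be 6, matching the six characters written into the file.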
BEGIN {awk commands}
/pattern/ {awk commands}
END {awk commands}
The BEGIN section is executed before any input is read, which is useful for initializing things. The section after the pattern is executed for every input line that matches the pattern, and the END section is executed after EOF. All the sections are optional; the phoebos script, for example, does not use BEGIN. The pattern is also optional, and if omitted the commands are executed for every input line. In the case of the script we don't have a pattern, and the command being used is:
{sizes[$5] += $7; counts[$5]++}
It looks kind of strange, but in awk variables don't need to be declared. Here phoebos is using two variables, 'sizes' and 'counts', both arrays indexed by $5. The short story is that the $NUMBER syntax denotes a specific field of the current input line. Remember that column 5 of our input lines is the username of the file owner, and column 7 is the file size. So these commands keep the number of .html files found per username in the 'counts' array, and the sum of their sizes in the 'sizes' array, which is also indexed by username.
After this comes the END section. As we now know, it runs after awk finishes reading the input, and the content of the web page itself is created here, as you can guess from the strings in the awk printf commands. This post has become longer than I was expecting, so I will not describe the END section, but for the curious it will not be hard to understand the rest. So this is left as an exercise for the reader, haha, like in math books! But that would not be fair unless I provided some resources, so here are some references...
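To watch the two arrays fill up, here is a little sketch of my own. The three input lines are invented sample data in the find -ls layout (they are not real output from ctrl-c), and the END block is just a plain dump I wrote for the demo, not the one from the phoebos script:

```sh
# Invented sample lines in "find -ls" layout: owner is field 5, size is field 7.
sample='692883 4 -rw-rw-r-- 1 giggles giggles 6 Jun 24 16:28 /home/giggles/public_html/a.html
692884 4 -rw-rw-r-- 1 giggles giggles 10 Jun 24 16:30 /home/giggles/public_html/b.html
692901 8 -rw-r--r-- 1 phoebos phoebos 4200 Jun 20 09:12 /home/phoebos/public_html/index.html'

# Same per-line action as the script; the END block here only prints
# "user count total" and is a stand-in for the real HTML-producing one.
summary=$(printf '%s\n' "$sample" | awk '
    { sizes[$5] += $7; counts[$5]++ }
    END { for (user in counts) print user, counts[user], sizes[user] }
' | sort)

printf '%s\n' "$summary"
# giggles 2 16
# phoebos 1 4200
```

The sort at the end is there because awk makes no promise about the order in which "for (user in counts)" visits the keys.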
[Refs]
AWK Tutorial
Gawk Manual: Closing Input and Output Redirections