With PowerShell, it's easy to script the Windows command line. PowerShell gives incredible power to those Windows power users who know how to handle it well. A few weeks ago, I wrote a small scriptlet that returns the IP address currently in use. That is a rather hard thing to do when all you have is a mouse and a dialog box.
Today Bash showed its incredible power again. A friend of mine needed a command to quickly count and sort the file extensions found within the current directory tree. He came up with the following, fairly straightforward command:
{syntaxhighlighter brush: bash;}
find . -type f | sed 's/.*\.//gi' | sort | uniq -c | sort -rn
{/syntaxhighlighter}
The problem, as he soon found out, had to do with pathnames containing several dots (.).
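In short, his command lists all the files within the current directory tree (find . -type f) and strips away everything up to and including the last dot sed finds on each line (the .* in the expression is greedy). It then sorts and counts the unique results with sort | uniq -c | sort -rn. When that last dot sits in a directory name instead of in the filename, what is left over is not an extension at all.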
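To see where the sed step goes wrong, it helps to feed it a few hand-written paths instead of real find output. This is only a sketch: the sample paths are made up, and strip_to_last_dot is a hypothetical helper name that does not appear in my friend's command.

```shell
#!/bin/sh
# Hypothetical helper: the sed step of the pipeline, isolated.
# `.*\.` is greedy, so it removes everything up to and including
# the LAST dot on the line. The `gi` flags of the original are left
# out here; they make no difference for this particular pattern.
strip_to_last_dot() {
    sed 's/.*\.//'
}

# Made-up sample paths (assumptions, for illustration only).
printf '%s\n' \
    './notes.txt' \
    './archive.tar.gz' \
    './conf.d/README' \
    './Makefile' | strip_to_last_dot
```

With these samples the first two lines come out as txt and gz, but ./conf.d/README turns into d/README and ./Makefile into /Makefile, because the last dot on those lines is not part of the filename.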
After a few moments of thought I came up with:
{syntaxhighlighter brush: bash;}
find . -type f -exec basename "{}" \; | rev | cut -d. -f1 | rev | sort | uniq -c | sort -rn
{/syntaxhighlighter}
It does the same trick as my friend's command, with a slight improvement: the find part only prints the basename of each path it finds. This way, we avoid the trouble caused by directory names with dots in them. That takes care of the dots in directory names.
The sed expression my pal wrote also seemed to go awry when there are multiple dots within a filename. To tackle this, I simply threw away the sed expression and replaced it with some minor string manipulation: reverse the filename, take the first part that borders a dot, and return it. This gives us an extension even when a filename contains multiple dots. To keep the file extensions readable we reverse them once more (rev | cut | rev), and then start the sorting and counting again (sort | uniq | sort).
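The reverse, cut, reverse trick is easy to try on plain strings. The sketch below wraps it in a hypothetical helper called last_field (my name for illustration, not part of the original command) and feeds it a few made-up filenames.

```shell
#!/bin/sh
# Hypothetical helper: return the text after the last dot by
# reversing each line, keeping the first dot-delimited field,
# and reversing the result back.
last_field() {
    rev | cut -d. -f1 | rev
}

# Made-up filenames (assumptions, for illustration only).
printf '%s\n' 'archive.tar.gz' 'photo.JPG' 'README' | last_field
```

This prints gz, JPG and README. Note the last one: cut passes a line without any delimiter through unchanged, so a file without an extension ends up being counted under its own name.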
By now it already starts to look readable, simple and slightly elegant, but I still felt a bit worried about the -exec basename part of the find expression. When you think about it, my little algorithm would also work properly with the full path of a file, at least for files that have an extension. So we can easily drop the basename part. This creates a statement such as:
{syntaxhighlighter brush: bash;}
find . -type f | rev | cut -d. -f1 | rev | sort | uniq -c | sort -rn
{/syntaxhighlighter}
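To try the whole pipeline without touching a real directory tree, it can be wrapped in a small function and pointed at a throwaway directory. This is a sketch under assumptions: count_extensions is a hypothetical name, the demo files are made up, and every demo file deliberately has an extension.

```shell
#!/bin/sh
# Hypothetical wrapper around the pipeline above.
# $1 is the directory to scan; it defaults to the current directory.
count_extensions() {
    find "${1:-.}" -type f | rev | cut -d. -f1 | rev | sort | uniq -c | sort -rn
}

# Throwaway demo tree with made-up file names.
demo=$(mktemp -d)
mkdir -p "$demo/sub"
touch "$demo/a.txt" "$demo/sub/b.txt" "$demo/sub/c.log"

count_extensions "$demo"

rm -rf "$demo"
```

The output lists txt with a count of 2 first, followed by log with a count of 1. The exact column padding depends on the uniq implementation.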
After all the rewriting it still does the job, in a simple and elegant manner. I wouldn't know how to walk a Windows filesystem as simply as this and come up with the count for each extension we find (even though the algorithm itself is portable).
My friend started to test the new command and started to complain: the command fails when filenames contain special characters such as ë. It seems rev can't handle multibyte characters through a pipe. The old command, with -exec basename, does something magical, for when we add the basename part it keeps working, even with filenames that contain an ë. So back to the drawing board again. The command
{syntaxhighlighter brush: bash;}
find . -type f -exec basename "{}" \; | rev | cut -d. -f1 | rev | sort | uniq -c | sort -rn
{/syntaxhighlighter}
feels awkward. I don't think there is a real need for the basename part; it only bloats and slows the find expression.
During dinner a new idea came up: almost all find implementations offer -execdir in addition to -exec. This option makes find chdir to the directory in which the file resides before it runs the -execdir command. That means "{}" only expands to the base filename (GNU find prefixes it with ./), without the complete path in front of it. So it is possible to write the same command in the following manner:
{syntaxhighlighter brush: bash;}
find . -type f -execdir sh -c 'echo "{}"' \; | rev | cut -d. -f1 | rev | sort | uniq -c | sort -rn
{/syntaxhighlighter}
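A small experiment shows what -execdir actually hands to the command. The sketch below is a variation, not the command above: it passes the name to sh as a positional argument instead of pasting "{}" into the script text, so that odd characters in a filename are not parsed as shell code. The helper name list_names_execdir and the demo files are made up, and it assumes a find that supports -execdir (GNU and BSD find do).

```shell
#!/bin/sh
# Hypothetical helper: print the name find passes under -execdir.
# The name arrives as "$1"; ${1##*/} drops the leading "./" that
# GNU find puts in front of it.
list_names_execdir() {
    find "$1" -type f -execdir sh -c 'printf "%s\n" "${1##*/}"' sh {} \;
}

# Throwaway demo tree with made-up names, including a dotted directory.
demo=$(mktemp -d)
mkdir -p "$demo/conf.d"
touch "$demo/conf.d/site.conf" "$demo/notes.txt"

list_names_execdir "$demo" | sort

rm -rf "$demo"
```

This prints notes.txt and site.conf, without any directory part. GNU find refuses to run -execdir when PATH contains the current directory or a relative entry, so the sketch may stop with an error on such a setup.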
Reading some bits of the bash manual again shows that we can rewrite the previous command in a way that makes it even more obscure:
{syntaxhighlighter brush: bash;}
find . -type f -execdir sh -c 'export file="{}"; echo ${file##*.}' \; | sort | uniq -ic | sort -rn
{/syntaxhighlighter}
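The obscure part is the ${file##*.} parameter expansion, which removes the longest prefix matching "*." from the variable. Taken out of the find command it looks like the sketch below, where ext_of is a hypothetical helper name and the filenames are made up.

```shell
#!/bin/sh
# Hypothetical helper: print the text after the last dot of its argument.
# ${name##*.} strips the longest prefix that matches the pattern "*.",
# which is everything up to and including the last dot.
ext_of() {
    name=$1
    printf '%s\n' "${name##*.}"
}

ext_of 'archive.tar.gz'
ext_of 'README'
```

The first call prints gz. The second prints README, because a value without a dot does not match the pattern and is left unchanged.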
In this article alone we wrote at least 5 ways to count and summarize the file extensions within the current directory tree. For the experienced bash user: the solutions without -exec(dir) perform at least 10 times faster than the solutions with -exec(dir), since the latter start a new process for every single file. While I think find . -type f | rev | cut -d. -f1 | rev | sort | uniq -ic | sort -rn is the most elegant solution (hehehe, came up with it myself), I still wonder how I could make the sed expression do what it has to. It is the shortest solution and performs a tad faster than the solution that reverses the strings.
Just before leaving my console I came up with a simple solution based on grep. It's the sixth in the list and seems to outperform the sed expression by a few nanoseconds :)
{syntaxhighlighter brush: bash;}
find . -type f | sed 's/.*\.//gi' | sort | uniq -ic | sort -rn
find . -type f -exec basename "{}" \; | rev | cut -d. -f1 | rev | sort | uniq -ic | sort -rn
find . -type f | rev | cut -d. -f1 | rev | sort | uniq -ic | sort -rn
find . -type f -execdir sh -c 'echo "{}"' \; | rev | cut -d. -f1 | rev | sort | uniq -ic | sort -rn
find . -type f -execdir sh -c 'export file="{}"; echo ${file##*.}' \; | sort | uniq -ic | sort -rn
find . -type f | grep -o "[^./]*$" | sort | uniq -ic | sort -rn
{/syntaxhighlighter}
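What sets the grep variant apart is that its pattern excludes both the dot and the slash, so a match can never reach back across a directory boundary. A sketch on made-up paths, with ext_with_grep as a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: the grep step of the sixth command, isolated.
# -o prints only the matched part; '[^./]*$' is the run of characters
# at the end of the line that contains neither a dot nor a slash.
ext_with_grep() {
    grep -o '[^./]*$'
}

# Made-up sample paths (assumptions, for illustration only).
printf '%s\n' \
    './archive.tar.gz' \
    './conf.d/README' \
    './Makefile' | ext_with_grep
```

This prints gz, README and Makefile. Files without an extension are still counted under their own name, but a dot in a directory name no longer leaks into the result.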
While developing this simple and elegant solution for counting and summarizing file extensions under the current path, we stumbled into several problems: