Command line tricks: Globbing basics

When working with files on the command line, you don't always want to write full file names. Maybe you are working on a group of files whose names follow a given pattern, like log files (e.g., log_011023.txt)? Or maybe you want to filter out files that don't fit a given pattern? Or maybe your use case is just picking out shell scripts that start with the letter P? In this article we will look at globbing, which can be thought of as wildcards we use in place of explicitly writing full file names. Read on to see several examples of how globs can be used to solve various command line tasks!

If it was not clear from the introduction: globs can be thought of as wildcards. They let us write patterns that match files instead of explicitly writing full file names. We will use bash here, so there might be minor differences in other Unix-style shells like fish or zsh. (I use zsh, so there is one note about that below!)

Glob patterns

There are a lot of different patterns we could cover, and many of them would be very esoteric. Instead, only the basic and most useful patterns will be covered here. If you are curious about patterns beyond these, you can skip to the further reading section for a suggested place to read more.

To make the explanations a bit more fun, they will be very example heavy.

Globs - when are they "executed"

If you are new to globbing, you might wonder who actually performs the matching. Every program can't possibly handle the patterns itself, so it is done by the shell, right? Yes! Before executing the programs in your scripts or on the command line, bash (or another shell variant) expands the glob patterns into the matching file names. To make it more clear, let us look at a quick example. To keep things simple, we will use this blog's code in the example.

Let's say we have the command

cat *.html

(to print contents of all html files directly to the command line).
That is exactly the same as running

cat 404.html about.html index.html privacypolicy.html

directly. Before cat is run, bash replaces the glob pattern with those exact file matches.

(Don't worry if the pattern seems a bit weird! It will be explained, with more examples, shortly.)
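If you want to see the expansion for yourself, here is a minimal sketch. It assumes a scratch directory with a few made-up file names (not the blog's real files), and uses printf to show exactly which arguments a program receives:

```bash
# Scratch directory with hypothetical file names
demo=$(mktemp -d)
cd "$demo" || exit 1
touch 404.html about.html index.html notes.txt

# printf receives one argument per matching file, already expanded by the shell
unquoted=$(printf '[%s]' *.html)
# Quoting the pattern stops the expansion, so printf sees the literal text
quoted=$(printf '[%s]' '*.html')

echo "$unquoted"   # [404.html][about.html][index.html]
echo "$quoted"     # [*.html]

# Clean up the scratch directory
cd - >/dev/null && rm -rf "$demo"
```

The program itself never sees the *, unless you quote it.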

Matching anything with `*`

If you read the info-box above, you will probably have noticed the * matching. What does it mean? It simply matches anything: zero or more characters. If you are used to regex (regular expressions), you might have seen * there. While globs and regular expressions have commonalities, there are also many differences: in a regex, * repeats the preceding element, while in a glob it is a complete pattern on its own. Keep that in mind if you are familiar with regex.

Listing information on all html files in directory

Let us use the above glob pattern for something useful, and list file information on all html files in the root directory:

$ ls -l *.html
-rw-r--r--  1 marie  staff   123 Jul  1  2021 404.html
-rw-r--r--@ 1 marie  staff  2381 Feb 12 18:18 about.html
-rw-r--r--@ 1 marie  staff  1740 Feb 12 18:40 index.html
-rw-r--r--  1 marie  staff   657 Oct 31  2022 privacypolicy.html

(You might not see the @ characters, as they mark macOS extended attributes. I always ignore them.)

As you can see, with a simple trick, we picked out only the file type we were interested in.

Listing all files that have an "l" in their extension and a "g" in their name

What do we want here? We want a filename where g appears somewhere, with an extension where l appears somewhere. Take a deep breath and remember that * matches 0 or more characters. This means we can use *g* and *l* for our patterns:

$ ls *g*.*l*
_config.yml
org_publish.el

You will notice this about using globs: it is mostly about knowing the basic rules and knowing your data.
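Two details of * are easy to miss, so here is a small sketch of them. The file names are made up for the example, and LC_ALL=C is only set to make the sort order predictable:

```bash
export LC_ALL=C   # assumption: byte-order sorting, so the output is reproducible
demo=$(mktemp -d)
cd "$demo" || exit 1
touch log.txt log_011023.txt blog.txt .log.txt

zero_or_more=$(echo log*.txt)   # "*" may also match nothing at all
anywhere=$(echo *g_*)           # "*" on both sides: "g_" somewhere in the name
visible=$(echo *.txt)           # by default "*" skips names starting with "."

echo "$zero_or_more"   # log.txt log_011023.txt
echo "$anywhere"       # log_011023.txt
echo "$visible"        # blog.txt log.txt log_011023.txt

cd - >/dev/null && rm -rf "$demo"
```

Note that .log.txt never shows up: hidden files have to be asked for explicitly, for example with .*.txt.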

Match a single character with `?`

The previous pattern matched zero or more characters, but sometimes it can be useful to match only one character. This is possible with ?.
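As a quick sketch of ? on its own (the three file names are invented for the example, and LC_ALL=C just fixes the sort order):

```bash
export LC_ALL=C   # assumption: byte-order sorting, so the output is reproducible
demo=$(mktemp -d)
cd "$demo" || exit 1
touch a.txt ab.txt abc.txt

one=$(echo ?.txt)             # exactly one character before .txt
two=$(echo ??.txt)            # exactly two characters
at_least_two=$(echo ??*.txt)  # two characters, then zero or more

echo "$one"            # a.txt
echo "$two"            # ab.txt
echo "$at_least_two"   # ab.txt abc.txt

cd - >/dev/null && rm -rf "$demo"
```

Each ? has to match a character, so ??.txt will never pick up a.txt.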

Mixing `?` and `*`: Getting all blog posts from the last 3 months of 2021

If you take a look at the html files on this blog, you will notice that they follow a pattern: year-month-day-title.html. We can use that to our advantage. The last 3 months of the year start with a 1 (i.e., 10, 11 and 12), so the pattern for those is simply 1?. After the month, we have a dash, the day, the title and the file extension. We don't really care what comes after the dash, so that is a good candidate for *. That leads us to the following pattern, and result:

$ ls _posts/2021-1?-*.html
_posts/2021-10-02-no_nonsense_command_line.html
_posts/2021-10-04-macosx_software.html
_posts/2021-10-08-emacs_packages_that_make_me_happy.html
_posts/2021-10-11-become_a_maven_ninja.html
_posts/2021-10-15-retro_programming_youtubers.html
_posts/2021-10-16-java_serviceloader.html
_posts/2021-10-20-springboot_easy_customizations.html
_posts/2021-10-26-javascript_the_good_parts.html
_posts/2021-11-03-kotlin_in_emacs.html
_posts/2021-11-10-jvm_scripting.html
_posts/2021-11-15-emacs_small_tips.html
_posts/2021-11-17-favorite_personal_finance_books.html
_posts/2021-11-20-emacs_package_highlight_try.html
_posts/2021-11-27-biographies_about_tech.html
_posts/2021-11-30-emacs_fun_useless.html
_posts/2021-12-01-coding_advent_calendars.html
_posts/2021-12-11-even_more_cli_tools_to_try.html
_posts/2021-12-16-reactive_whats_the_big_deal.html
_posts/2021-12-20-five_reasons_i_love_emacs.html

Creating e-book with pandoc

To give you some context: I usually prefer reading longer texts on my Kindle, but there are a lot of "books" out there that you read directly in your browser. A lot of those come in markdown or html format, so they need to be converted before I can use them on my Kindle. Fortunately, there is a tool called Pandoc that solves this issue! Some tweaking is sometimes needed, so we will use a fairly simple example here. Sheepolution has created a really awesome book for learning the Love2D Lua game framework, and has made the source files available on GitHub. We can use those Markdown files to create a really awesome epub file that can later be converted for use on our Kindle!

pandoc --resource-path=. -o sheepolution.epub --epub-cover-image=logo.png title.txt book/chapter?.md book/chapter1?.md book/chapter2?.md

The logo.png and title.txt files are ones I created: one is a simple cover, the other contains only YAML metadata. The metadata I use looks like this:

---
title: Sheepolutions How to Love
author: Sheepolution
rights:  MIT
language: en-US
---

First things first, why couldn't we just use the pattern book/chapter*.md to cover everything? Because the shell sorts glob matches alphabetically, not numerically, so the order would be like this:

book/chapter0.md
book/chapter1.md
book/chapter10.md
book/chapter11.md
book/chapter12.md
book/chapter13.md
book/chapter14.md
book/chapter15.md
book/chapter16.md
book/chapter17.md
book/chapter18.md
book/chapter19.md
book/chapter2.md
book/chapter20.md
book/chapter21.md
book/chapter22.md
book/chapter23.md
book/chapter24.md
book/chapter3.md
book/chapter4.md
book/chapter5.md
book/chapter6.md
book/chapter7.md
book/chapter8.md
book/chapter9.md

As you see, the order is all wrong!
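To compare the two orderings side by side, here is a small sketch. It assumes a scratch book directory with only a handful of empty chapter files, which is enough to show the effect:

```bash
export LC_ALL=C   # assumption: byte-order sorting, so the output is reproducible
demo=$(mktemp -d)
cd "$demo" || exit 1
mkdir book
for n in 0 1 2 9 10 11 20; do touch "book/chapter$n.md"; done

# One pattern: matches come back sorted as text, so 10 lands before 2
alphabetical=$(echo book/chapter*.md)
# Three patterns: each is sorted on its own and the groups keep their order
numeric=$(echo book/chapter?.md book/chapter1?.md book/chapter2?.md)

echo "$alphabetical"
echo "$numeric"

cd - >/dev/null && rm -rf "$demo"
```

Splitting the match into one pattern per number of digits is what keeps chapter2.md ahead of chapter10.md.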

If you are unfamiliar with scripting, the next parts might be a little confusing. Don't worry! They are only meant as improvements to the above examples. Look at them slowly, and look up some references if you think they look scary. If you are very new to the command line, you might just want to skip right to the next heading.

If you run the pandoc command above, you might notice that it complains about the image files not existing. Some minor tweaks are needed in the markdown files, as the image paths start at the root (e.g., /images/book/download_love.png). We need to remove the leading slashes. To do this task, we can use the find and sed commands to create a small script:

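# Assumes no whitespace in the file names: $chapters is split on whitespace
chapters=$(find book -name '*.md')
for chapter in $chapters; do
    # Strip the leading slash from image paths, writing to a temporary file
    cat "$chapter" | sed 's!/images/!images/!g' > tmp.md
    mv tmp.md "$chapter"
done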

Here we use find to recursively search the book directory for files ending in .md. The pattern given to find is quoted so that bash does not expand it; find processes the pattern itself. Next is a simple for loop that goes through each of the results from find. For each result, or chapter if you will, we use sed to replace /images/ with images/ (i.e., essentially removing the leading slash). For the curious, you will notice that we use ! as the delimiter for the replace pattern, which saves us from escaping the slashes. As several matches may occur on the same line, we match globally with g to replace all of them and not just the first. We write the results to a temporary file, and finally replace the original file with the temporary one. Redirecting the output straight back to the input file is risky: the shell truncates the file when it sets up the redirection, often before the file has been read, so you can end up with an empty file. That is why I have built a habit of using temporary files in cases like these…
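Since this is an article about globs, here is a sketch of the same fix where the for loop walks a glob directly instead of the output of find. It assumes the chapters sit directly inside book/ (no subdirectories), and it runs against a made-up chapter file in a scratch directory:

```bash
demo=$(mktemp -d)
cd "$demo" || exit 1
mkdir book
# Hypothetical chapter with a root-relative image path and a space in its name
printf '![Download](/images/book/download_love.png)\n' > 'book/chapter 1.md'

for chapter in book/*.md; do
    [ -e "$chapter" ] || continue   # the pattern stays literal when nothing matches
    # Write to a temporary file first, then replace the original
    sed 's!/images/!images/!g' "$chapter" > "$chapter.tmp" && mv "$chapter.tmp" "$chapter"
done

fixed=$(cat 'book/chapter 1.md')
echo "$fixed"   # ![Download](images/book/download_love.png)

cd - >/dev/null && rm -rf "$demo"
```

Each match arrives as a single word, so with the variable quoted, file names containing spaces are handled too.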

If you look closely at the example book, you will notice that it contains animated gifs. For our e-readers, it is a waste of space to keep them when only the first frame is shown. We can extract the first frame and only use that one to make the file size smaller. There is a neat tool for working with gif files, called gifsicle, which we can use here. The script is fairly similar to our last script:

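# Assumes no whitespace in the file names: $files is split on whitespace
files=$(find images -name '*.gif')
for file in $files; do
    echo "Fixing $file..."
    # Keep only the first frame (#0) of each animated gif
    gifsicle "$file" '#0' > tmp.gif && mv tmp.gif "$file"
done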

There are of course many more improvements we could have made. If you need an exercise, you can work on using ImageMagick's convert command to resize and/or compress the images :)

If you feel like the pipe and redirection characters | and > above are confusing, I suggest you read up on pipes. They are covered in my earlier beginner's guide to the command line article :)

Ranges and groups with `[]`

The group syntax, [group], where group is a group pattern, is a useful way to match ranges and limit possible matches. A group always matches exactly one character. Some notable patterns include:

[a-z] - matches all lower case letters from a to z. Likewise, [A-Z] will match the uppercase variants. (In some locales and older bash versions, ranges follow the locale's sort order and may also match letters of the other case.)
[0-9] - the digits from 0 to 9.
[abc] - a, b or c.
[-0-9] - the - character, and the digits from 0 to 9. To include the - character, it has to come either at the beginning or at the end.
[!A-Z] - NOT upper case A to Z. This not-operation, !, can be used with other groups as well. The important part is that it comes first inside the square brackets.
[[:space:]] - any whitespace character (space, tab, newline and so on).

NOTE! The not pattern above can give some weird error messages in zsh, as zsh also uses ! for history expansion. Both bash and zsh accept ^ in place of ! here, as in [^A-Z].
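To make the group patterns above a bit more concrete, here is a sketch that runs each of them against a handful of invented file names. LC_ALL=C is set so that the ranges and the sort order behave the same everywhere:

```bash
export LC_ALL=C   # assumption: plain byte order, so ranges do not mix upper and lower case
demo=$(mktemp -d)
cd "$demo" || exit 1
touch -- a1 b2 c3 D4 e- ez 'f g'

one_of=$(echo [abc]?)            # first character is a, b or c
upper=$(echo [A-Z][0-9])         # an uppercase letter followed by a digit
negated=$(echo [!a-c]?)          # first character is anything but a, b or c
dash=$(echo ?[-0-9])             # second character is a dash or a digit
space=$(echo ?[[:space:]]?)      # whitespace in the middle

echo "$one_of"    # a1 b2 c3
echo "$upper"     # D4
echo "$negated"   # D4 e- ez
echo "$dash"      # D4 a1 b2 c3 e-
echo "$space"     # f g

cd - >/dev/null && rm -rf "$demo"
```

Every group consumes exactly one character, which is why ? is used to cover the other character of each two-letter name.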

I often find groups to be of best use together with the previous patterns, so the examples will reflect that.

Files starting with a, b, c, d, e or f in blog source root

Let us for simplicity only match files that have an extension. That way we can use ls to list the files in a simple manner. (Remember that ls lists the content of the directory it gets as an argument, when the argument is a directory). This can be done fairly simply by combining groups and the match anything pattern. We can specify a, b, c, d, e and f with a direct group:

$ ls -1 [abcdef]*.*
about.html
ads.txt
create_tag_pages.sh
emacs_headless_publish.sh
favicon.ico

(the -1 argument prints one entry per line)

Listing every character like this is quite verbose, as the letters simply form an alphabetic range. Explicit lists like the one above are useful when the characters can't be expressed as a range, for example if you want to skip every other character. In this case, though, we can use the range [a-f] instead:

$ ls -1 [a-f]*.*
about.html
ads.txt
create_tag_pages.sh
emacs_headless_publish.sh
favicon.ico

Improving our "Blog posts from the last 3 months of 2021" example

In the previous version of this matcher, we matched any single character after the 1. As you know, the only valid month numbers that start with 1 are 10, 11 and 12. That means our previous version would also have matched invalid months (13, or even 1x) if such files existed. Let us limit it a bit using groups. Instead of matching anything, we only want to match 0, 1 and 2. This can easily be done with the range [0-2]:

$ ls _posts/2021-1[0-2]-*.html
_posts/2021-10-02-no_nonsense_command_line.html
_posts/2021-10-04-macosx_software.html
_posts/2021-10-08-emacs_packages_that_make_me_happy.html
_posts/2021-10-11-become_a_maven_ninja.html
_posts/2021-10-15-retro_programming_youtubers.html
_posts/2021-10-16-java_serviceloader.html
_posts/2021-10-20-springboot_easy_customizations.html
_posts/2021-10-26-javascript_the_good_parts.html
_posts/2021-11-03-kotlin_in_emacs.html
_posts/2021-11-10-jvm_scripting.html
_posts/2021-11-15-emacs_small_tips.html
_posts/2021-11-17-favorite_personal_finance_books.html
_posts/2021-11-20-emacs_package_highlight_try.html
_posts/2021-11-27-biographies_about_tech.html
_posts/2021-11-30-emacs_fun_useless.html
_posts/2021-12-01-coding_advent_calendars.html
_posts/2021-12-11-even_more_cli_tools_to_try.html
_posts/2021-12-16-reactive_whats_the_big_deal.html
_posts/2021-12-20-five_reasons_i_love_emacs.html

Blog posts NOT from 2022, 2023 or 2024

If we assume that all blog posts are from the 2020s, then this one is simply a match where the last digit of the year is not 2, 3 or 4. That leads us to 202[!234]. For a longer-living blog than mine we would want something more flexible. Maybe it is not even a blog, but a journal of sorts that goes far back (1900s?). For simplicity, we assume 4 digits with a maximum year of 9999. This means that future people using Unix on their space station could also use this pattern! (If every system is not Unix-inspired in the future, then all is lost.) A first attempt needs 2 different groups: one for the digits 0-9, and one for the digits to avoid, giving [0-9][0-9][0-9][!234]. Be aware that this also skips every other year ending in 2, 3 or 4, such as 2012. For this blog the decade-specific pattern is enough, so one possible solution is

$ ls _posts/202[!234]-*.html
_posts/2020-08-08-first_post.html
_posts/2020-08-27-kotlin_dsl.html
_posts/2020-08-30-cool_linux_clis.html
_posts/2020-09-07-career_boosting_books.html
_posts/2020-10-04-java_rethrow_log_exceptions.html
_posts/2020-10-20-browser-extension-recommendation.html
_posts/2021-03-23-programminglanguages2021.html
_posts/2021-07-28-summer_books_2021.html
_posts/2021-08-04-more_cli_tools.html
_posts/2021-09-13-recommended_emacs_packages.html
_posts/2021-09-18-rip_clive_sinclair.html
_posts/2021-09-22-essential_ayn_rand.html
_posts/2021-09-26-scifi_books_to_unwind.html
_posts/2021-10-02-no_nonsense_command_line.html
_posts/2021-10-04-macosx_software.html
_posts/2021-10-08-emacs_packages_that_make_me_happy.html
_posts/2021-10-11-become_a_maven_ninja.html
_posts/2021-10-15-retro_programming_youtubers.html
_posts/2021-10-16-java_serviceloader.html
_posts/2021-10-20-springboot_easy_customizations.html
_posts/2021-10-26-javascript_the_good_parts.html
_posts/2021-11-03-kotlin_in_emacs.html
_posts/2021-11-10-jvm_scripting.html
_posts/2021-11-15-emacs_small_tips.html
_posts/2021-11-17-favorite_personal_finance_books.html
_posts/2021-11-20-emacs_package_highlight_try.html
_posts/2021-11-27-biographies_about_tech.html
_posts/2021-11-30-emacs_fun_useless.html
_posts/2021-12-01-coding_advent_calendars.html
_posts/2021-12-11-even_more_cli_tools_to_try.html
_posts/2021-12-16-reactive_whats_the_big_deal.html
_posts/2021-12-20-five_reasons_i_love_emacs.html

We could also easily have done a positive match:

$ ls _posts/202[0156789]-*.html
_posts/2020-08-08-first_post.html
_posts/2020-08-27-kotlin_dsl.html
_posts/2020-08-30-cool_linux_clis.html
_posts/2020-09-07-career_boosting_books.html
_posts/2020-10-04-java_rethrow_log_exceptions.html
_posts/2020-10-20-browser-extension-recommendation.html
_posts/2021-03-23-programminglanguages2021.html
_posts/2021-07-28-summer_books_2021.html
_posts/2021-08-04-more_cli_tools.html
_posts/2021-09-13-recommended_emacs_packages.html
_posts/2021-09-18-rip_clive_sinclair.html
_posts/2021-09-22-essential_ayn_rand.html
_posts/2021-09-26-scifi_books_to_unwind.html
_posts/2021-10-02-no_nonsense_command_line.html
_posts/2021-10-04-macosx_software.html
_posts/2021-10-08-emacs_packages_that_make_me_happy.html
_posts/2021-10-11-become_a_maven_ninja.html
_posts/2021-10-15-retro_programming_youtubers.html
_posts/2021-10-16-java_serviceloader.html
_posts/2021-10-20-springboot_easy_customizations.html
_posts/2021-10-26-javascript_the_good_parts.html
_posts/2021-11-03-kotlin_in_emacs.html
_posts/2021-11-10-jvm_scripting.html
_posts/2021-11-15-emacs_small_tips.html
_posts/2021-11-17-favorite_personal_finance_books.html
_posts/2021-11-20-emacs_package_highlight_try.html
_posts/2021-11-27-biographies_about_tech.html
_posts/2021-11-30-emacs_fun_useless.html
_posts/2021-12-01-coding_advent_calendars.html
_posts/2021-12-11-even_more_cli_tools_to_try.html
_posts/2021-12-16-reactive_whats_the_big_deal.html
_posts/2021-12-20-five_reasons_i_love_emacs.html
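If the posts span several decades, a single basic glob has a hard time saying "any year except 2022, 2023 and 2024". One possible workaround, sketched here with invented post names, is to match every four-digit year with a glob and then drop the unwanted years in a case statement, which uses the same pattern syntax:

```bash
export LC_ALL=C   # assumption: byte-order sorting, so the output is reproducible
demo=$(mktemp -d)
cd "$demo" || exit 1
mkdir _posts
touch _posts/1999-01-01-old.html _posts/2012-05-05-mid.html \
      _posts/2022-02-02-skip.html _posts/2024-04-04-skip.html \
      _posts/2031-03-03-new.html

kept=""
for post in _posts/[0-9][0-9][0-9][0-9]-*.html; do
    case $post in
        _posts/202[234]-*) continue ;;   # skip exactly 2022, 2023 and 2024
    esac
    kept="${kept:+$kept }${post#_posts/}"   # collect the name without the directory
done

echo "$kept"   # 1999-01-01-old.html 2012-05-05-mid.html 2031-03-03-new.html

cd - >/dev/null && rm -rf "$demo"
```

It is more typing than a one-liner, but 2012 is kept while 2022 and 2024 are skipped.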

Multiple possible match types with `{}`

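Sometimes we have multiple possible patterns we want to match. bash has us covered here: we can list our options inside curly brackets, separated by commas. Strictly speaking this is brace expansion rather than globbing, as bash generates one word per option before any file matching takes place, but the two combine nicely. This is probably best explained with some examples.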

Matching files ending with either html, sh or el

We want to match anything ending with either html, sh or el. Using the anything matcher in combination with the multiple matcher then looks like the following:

$ ls -1 *.{html,el,sh}
404.html
about.html
create_tag_pages.sh
emacs_headless_publish.sh
index.html
org_publish.el
privacypolicy.html

We notice that two of our patterns end with l. As the inside of the multiple matcher is simply patterns, we can use that to our advantage with an inner pattern:

$ ls -1 *.{{htm,e}l,sh}
404.html
about.html
create_tag_pages.sh
emacs_headless_publish.sh
index.html
org_publish.el
privacypolicy.html

You can mix and match these inner patterns using all the previous patterns! Enjoy!
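It can help to look at what the curly brackets produce on their own, before any files are involved. A minimal sketch, using the made-up base name page:

```bash
# Brace expansion only generates words; no files need to exist for this
plain=$(echo page.{html,el,sh})
nested=$(echo page.{{htm,e}l,sh})
unsorted=$(echo {b,a}.txt)   # the words keep the order they were written in

echo "$plain"      # page.html page.el page.sh
echo "$nested"     # page.html page.el page.sh
echo "$unsorted"   # b.txt a.txt
```

The nested version produces exactly the same words as the plain one. The listings above come out sorted only because ls sorts its arguments.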

Blog posts from March or July, in 2021 or 2023

We could easily mix and match using the group pattern here, but for the sake of example we will only use the multiple matcher. We either want to match month 03 or 07, and year 2021 or 2023.

$ ls -1 _posts/{2021,2023}-{03,07}-*.html
_posts/2021-03-23-programminglanguages2021.html
_posts/2021-07-28-summer_books_2021.html
_posts/2023-03-04-kotlin_collections_stdlib.html
_posts/2023-03-04-kotlin_stdlib_overlooked.html
_posts/2023-03-06-kotlin_strings_stdlib.html
_posts/2023-03-09-org_mode_uses.html
_posts/2023-03-26-rust_awesome_videos_explain_concepts.html
_posts/2023-07-29-three_things_love_hate_mac.html
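With two sets of curly brackets, bash builds every combination, and each resulting pattern is then matched on its own. The sketch below assumes a scratch _posts directory with just two invented posts, so that some combinations match nothing:

```bash
demo=$(mktemp -d)
cd "$demo" || exit 1
mkdir _posts
touch _posts/2021-03-23-a.html _posts/2023-07-29-b.html

# Quoting the * shows the four patterns that brace expansion generates
patterns=$(echo _posts/{2021,2023}-{03,07}-'*'.html)
# Unquoted, each pattern is globbed; one without a match is left as it is
matched=$(echo _posts/{2021,2023}-{03,07}-*.html)
# With nullglob set, patterns without a match are dropped instead
shopt -s nullglob
only_matches=$(echo _posts/{2021,2023}-{03,07}-*.html)
shopt -u nullglob

echo "$patterns"
echo "$matched"
echo "$only_matches"

cd - >/dev/null && rm -rf "$demo"
```

This is why ls can complain about a missing file named like a pattern when one of the combinations has no posts.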

What happens if I use the above characters in file names (and want to match them)?

If we ~~are stupid and~~ use the matching characters in file names, then how do we match them? Fortunately, bash lets us escape these characters when matching! Let us use the anything matcher here as an example. We want to match a file named myfile*.txt. For this purpose, we could of course use the anything matcher, but it would match other files as well.

$ ls myfile*.txt
myfile*.txt
myfile2.txt

We didn't want to match myfile2.txt! What can we do? We can escape the * character using \*:

$ ls myfile\*.txt
myfile*.txt

Like other patterns, we could use this escaped one in conjunction with the others. Maybe we have many files where the last part of the name is a *?

$ ls *\*.txt
myfile*.txt
somefile*.txt
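Quoting works as an alternative to the backslash. Here is a small sketch comparing the variants in a scratch directory, with the same file names as in the listings above:

```bash
export LC_ALL=C   # assumption: byte-order sorting, so the output is reproducible
demo=$(mktemp -d)
cd "$demo" || exit 1
touch 'myfile*.txt' myfile2.txt 'somefile*.txt'

unescaped=$(echo myfile*.txt)    # the * is a wildcard
backslash=$(echo myfile\*.txt)   # the * is a literal character
single=$(echo 'myfile*.txt')     # quoting the whole word works too
partial=$(echo *'*'.txt)         # first * is a wildcard, the quoted one is literal

echo "$unescaped"   # myfile*.txt myfile2.txt
echo "$backslash"   # myfile*.txt
echo "$single"      # myfile*.txt
echo "$partial"     # myfile*.txt somefile*.txt

cd - >/dev/null && rm -rf "$demo"
```

Quoting only part of the word, as in the last pattern, lets you mix literal characters and wildcards freely.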

No way to match a given number of characters?

Sadly no, at least not to my knowledge. If you for example want to match 3 digits in succession, the easiest way is probably [0-9][0-9][0-9]. There is no repetition count like the {n} specifier you have in regular expressions.
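A short sketch of the repeated group in action, with invented log file names. The loop that builds the pattern is just a hypothetical convenience for longer repetitions, not a glob feature:

```bash
demo=$(mktemp -d)
cd "$demo" || exit 1
touch log_7.txt log_42.txt log_123.txt log_1234.txt log_abc.txt

# Exactly three digits: one group per digit
three=$(echo log_[0-9][0-9][0-9].txt)

# Build the same pattern in a loop; the unquoted variable is globbed after expansion
pattern=""
for i in 1 2 3; do pattern="${pattern}[0-9]"; done
built=$(echo log_${pattern}.txt)

echo "$three"   # log_123.txt
echo "$built"   # log_123.txt

cd - >/dev/null && rm -rf "$demo"
```

Neither log_42.txt nor log_1234.txt matches, as every group has to consume exactly one digit.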