Using sed (the non-interactive command line text editor)

The sed webpage describes it as a non-interactive command line text editor. Unlike with interactive editors, you do not spend time inside sed. Instead you invoke it with a set of command line parameters and it directly outputs or writes its result.

I have often used it in Vagrant provisioning scripts to tweak configuration files. You can also use it to extract information from text files (such as log files) and convert it into a different format (such as CSV). In this article I will demonstrate how you can use sed to do some of these things, to give you some insight into its utility.

If you want to follow along with this article you will need to have sed installed on your system. If you are running Linux, macOS or another Unix-like OS you likely already have it. If not, you can probably install it through your system’s package manager (or compile it from source). There is no official Windows version of sed, so Windows users will have to look to solutions such as WSL (the Windows Subsystem for Linux).

Modifying a configuration file

For this example I came up with an incredibly original configuration file. Save the content below as “settings.ini” to follow along with the commands in this section:

my-setting = my value
my-other-setting = my value

To replace values with sed you can use its s command. The command below, for example, replaces the string “my value” with “your value”:

➜  sed -e 's/my value/your value/' settings.ini
my-setting = your value
my-other-setting = your value

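Note that this command did not change the file itself. Instead sed modified the contents of the file and wrote the result to stdout. The -e switch only tells sed that the next argument is the script to run. If you would like to modify the file in place, add the -i switch (GNU sed accepts a bare -i, while the BSD sed that ships with macOS expects a backup suffix after it, which may be empty: -i '').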
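As a minimal sketch of in-place editing, the script below recreates settings.ini in a scratch directory and edits it with -i. The .bak suffix is my own choice for this example; attaching a suffix directly to -i keeps a backup copy and, as far as I know, works with both GNU sed and the BSD sed on macOS.

```shell
# Work in a throwaway directory so no real configuration is touched.
workdir=$(mktemp -d)

# Recreate the example file from this section.
printf 'my-setting = my value\nmy-other-setting = my value\n' > "$workdir/settings.ini"

# -i.bak edits the file in place and keeps the original as settings.ini.bak.
sed -i.bak -e 's/my value/your value/' "$workdir/settings.ini"

edited=$(cat "$workdir/settings.ini")
backup=$(cat "$workdir/settings.ini.bak")
printf '%s\n' "$edited"
```

If something goes wrong you can copy settings.ini.bak back over settings.ini, which is a comforting safety net in a provisioning script.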

The command used in the example was the search and replace command. Its parameters are separated by the “/” character, so if you need a “/” in one of the parameters you will have to escape it with a “\” character. The first parameter is the value to search for (my value) and the second is the replacement (your value). By default only the first match on each line is replaced; append the g flag (s/my value/your value/g) to replace every match.
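The s command also accepts delimiters other than “/”, which can save a lot of escaping when the values themselves contain slashes. A small sketch, using a made-up log-dir setting and made-up paths:

```shell
# Hypothetical setting whose value contains slashes.
line='log-dir = /var/log/old'

# With "/" as the delimiter every slash in the values must be escaped.
escaped=$(printf '%s\n' "$line" | sed -e 's/\/var\/log\/old/\/var\/log\/new/')

# With "|" as the delimiter the slashes can stay as they are.
piped=$(printf '%s\n' "$line" | sed -e 's|/var/log/old|/var/log/new|')

# Both commands print: log-dir = /var/log/new
printf '%s\n%s\n' "$escaped" "$piped"
```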

Note that “my value” got substituted with “your value” on both lines. A more realistic example of what I would use looks something like this:

➜  sed -e 's/^\(my-setting = \).*/\1your value/' settings.ini    
my-setting = your value
my-other-setting = my value

In the above command I used a regular expression as the search value to match lines that start with “my-setting = ” (with exactly one space on either side of the = sign) followed by any value. The start of the line is captured in a group, which is then referenced with “\1” in the replacement, followed by “your value”.
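If the spacing around the = sign varies in your files, one way to tolerate that is to swap the literal spaces for [[:space:]]*. A sketch with made-up input lines:

```shell
# Sample lines with inconsistent spacing around "=" (made up for this sketch).
input='my-setting=my value
my-setting   =   my value
my-other-setting = my value'

# [[:space:]]* matches zero or more whitespace characters. Because it sits
# inside the capture group, the original spacing is kept and only the value changes.
result=$(printf '%s\n' "$input" |
  sed -e 's/^\(my-setting[[:space:]]*=[[:space:]]*\).*/\1your value/')

printf '%s\n' "$result"
# my-setting=your value
# my-setting   =   your value
# my-other-setting = my value
```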

Using the earlier mentioned -i switch you can non-interactively edit configuration files in your provisioning scripts. Pretty neat if you ask me!

Extracting log data and transforming it to CSV

Let’s say you want to extract some data from your Apache logs and get it into a spreadsheet. In particular, you want to find out who is reading your WordPress posts.

For this section I used one of my own log files (removing and anonymizing some of its contents). If you want to follow along you will have to use one of your own. If you don’t have any, you can copy and paste from the next example in this section. If you then add a few lines of non-matching data in between, you can better see the matching and filtering effects of the commands.

The first step I want to take is to filter the log file’s contents so only the relevant log lines are shown. While you could do this with grep, I am going to use sed to demonstrate its capability to perform this task:

➜  sed -n '/GET \/[0-9]\+[^ ]\+ HTTP[^ ]\+ 200/p' without_brains_access.log
- - [20/Sep/2020:00:03:52 +0200] "GET /2020/08/23/using-sql-as-an-every-day-tool/ HTTP/1.0" 200 95663 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
- - [20/Sep/2020:01:04:59 +0200] "GET /2020/08/23/using-sql-as-an-every-day-tool/ HTTP/1.0" 200 95663 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
- - [20/Sep/2020:02:22:15 +0200] "GET /2020/09/13/consuming-a-rest-api-with-the-ruby-standard-library/ HTTP/1.1" 200 28349 "-" "Mozilla/5.0 (compatible; SomeBot; )"
- - [20/Sep/2020:07:20:04 +0200] "GET /2020/08/23/using-sql-as-an-every-day-tool/ HTTP/1.0" 200 95663 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
- - [20/Sep/2020:07:38:56 +0200] "GET /2020/08/23/using-sql-as-an-every-day-tool/ HTTP/1.0" 200 30850 "" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36"
- - [20/Sep/2020:11:10:54 +0200] "GET /2020/09/13/consuming-a-rest-api-with-the-ruby-standard-library/ HTTP/1.1" 200 28923 "-" "Mozilla/5.0 (compatible; SomeBot; )"
- - [20/Sep/2020:11:37:06 +0200] "GET /2020/08/26/vim-tip-visual-block-editing/ HTTP/1.1" 200 25267 "" "Mozilla/5.0 (compatible; Linux x86_64; SomeBot; )"
- - [20/Sep/2020:12:38:27 +0200] "GET /2020/08/26/vim-tip-visual-block-editing/ HTTP/1.1" 200 25304 "" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36"
- - [20/Sep/2020:16:45:13 +0200] "GET /2020/09/13/consuming-a-rest-api-with-the-ruby-standard-library/ HTTP/1.1" 200 27453 "" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0"
- - [20/Sep/2020:17:59:20 +0200] "GET /2020/08/23/using-sql-as-an-every-day-tool/ HTTP/1.0" 200 30850 "" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0 O/x.d3v0.12"
- - [20/Sep/2020:21:07:45 +0200] "GET /2020/08/26/vim-tip-visual-block-editing/ HTTP/1.1" 200 25694 "-" "Mozilla/5.0 (compatible; SomeBot; )"

The above sed command uses a regular expression to select the lines for GET requests on pages whose path starts with one or more digits (matching the format of a WordPress post URL) and that returned an HTTP status code 200 (OK). The -n switch silences sed: it will not output anything unless you tell it to, which is what the p command after the regular expression does.
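To see the filtering at work without a real log file, here is a small sketch with three made-up log lines (the 192.0.2.x addresses come from a range reserved for documentation, and the shortened lines are my own invention). It also runs the equivalent grep for comparison:

```shell
# Three made-up access log lines: a post, a stylesheet and a missing post.
log='192.0.2.1 - - [20/Sep/2020:00:03:52 +0200] "GET /2020/08/23/using-sql-as-an-every-day-tool/ HTTP/1.0" 200 95663
192.0.2.2 - - [20/Sep/2020:00:04:10 +0200] "GET /wp-content/style.css HTTP/1.1" 200 1200
192.0.2.3 - - [20/Sep/2020:00:05:01 +0200] "GET /2020/01/01/missing-post/ HTTP/1.1" 404 512'

# -n suppresses the automatic printing; p prints only the lines the pattern selects.
with_sed=$(printf '%s\n' "$log" | sed -n '/GET \/[0-9]\+[^ ]\+ HTTP[^ ]\+ 200/p')

# The same filter with grep, which needs no escaping of the slash.
with_grep=$(printf '%s\n' "$log" | grep 'GET /[0-9]\+[^ ]\+ HTTP[^ ]\+ 200')

# Only the first line survives: the stylesheet path does not start with
# digits and the missing post returned a 404.
printf '%s\n' "$with_sed"
```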

Instead of using the p command we can use the s command to do a search and replace like I demonstrated in the configuration file example:

➜  sed -n '/GET \/[0-9]\+[^ ]\+ HTTP[^ ]\+ 200/s/^\([^ ]\+\)[^[]\+\[\([^]]\+\)\].*GET \(\/[0-9]\+[^ ]\+\).*/\1,\2,\3/p' without_brains_access.log
,20/Sep/2020:00:03:52 +0200,/2020/08/23/using-sql-as-an-every-day-tool/
,20/Sep/2020:01:04:59 +0200,/2020/08/23/using-sql-as-an-every-day-tool/
,20/Sep/2020:02:22:15 +0200,/2020/09/13/consuming-a-rest-api-with-the-ruby-standard-library/
,20/Sep/2020:07:20:04 +0200,/2020/08/23/using-sql-as-an-every-day-tool/
,20/Sep/2020:07:38:56 +0200,/2020/08/23/using-sql-as-an-every-day-tool/
,20/Sep/2020:11:10:54 +0200,/2020/09/13/consuming-a-rest-api-with-the-ruby-standard-library/
,20/Sep/2020:11:37:06 +0200,/2020/08/26/vim-tip-visual-block-editing/
,20/Sep/2020:12:38:27 +0200,/2020/08/26/vim-tip-visual-block-editing/
,20/Sep/2020:16:45:13 +0200,/2020/09/13/consuming-a-rest-api-with-the-ruby-standard-library/
,20/Sep/2020:17:59:20 +0200,/2020/08/23/using-sql-as-an-every-day-tool/
,20/Sep/2020:21:07:45 +0200,/2020/08/26/vim-tip-visual-block-editing/

The above sed command uses the line selector written earlier to select the GET requests for WordPress posts. Instead of the p command it then uses the s command, capturing the segments of data we want in regex capture groups and replacing the entire line with the three capture groups separated by commas. Finally the p flag prints the changed line. Like the p command, this overrides silent mode, so only the lines we selected and modified are printed to stdout.

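The search value looks pretty complicated. This is mainly because sed uses basic regular expressions by default, in which the grouping parentheses and the + quantifier have to be escaped (\+ is also a GNU extension, so other sed implementations may not accept it). Let’s break it down: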

The search value starts with ^\([^ ]\+\). The ^ character anchors the match at the start of the line, \( starts the first capture group and \) ends it. Whatever is matched by the regular expression between these two becomes the value of the first capture group. The regular expression used within the capture group is [^ ]\+, which means one or more characters that are not a space. In the context of the log file this is the IP address that requested the page from Apache (in the examples here I have anonymized them; if you run this on your own Apache log file you will get real addresses, of course).

The next part of the search value is [^[]\+\[\([^]]\+\)\]. This first matches anything except a [ character using [^[]\+, then it matches the [ character using \[. This is followed by the start of the next capture group with \( like before, which matches anything that is not a ] character using [^]]\+. After closing the capture group with \) it matches the literal ] using \]. This catches the timestamp of the request in capture group two.

The final part of the search value is .*GET \(\/[0-9]\+[^ ]\+\).*, which catches the path of the requested page and matches the rest of the line. If you do not match the entire line, the unmatched remainder is left in place after your replace value (which is not what we are aiming for in this example). To break this part down like the previous parts: first we skip over any characters with .*, then we explicitly match GET followed by a space and start the next capture group, again using \(. The capture group will contain the result of \/[0-9]\+[^ ]\+, which matches a /, one or more digits and then anything that is not a space. This is the path of the requested page. The expression then ends with .*, which matches anything and thus consumes the rest of the line.

The replace value is \1,\2,\3, which outputs the values of capture groups 1, 2 and 3 separated by commas. When using sed you can reference a maximum of 9 capture groups. To get the value that matched your entire regular expression (which would be the complete original line in this example) use &; GNU sed also accepts \0 for this.
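Much of the escaping goes away if you tell sed to use extended regular expressions. The sketch below assumes a sed that supports the -E switch (GNU sed and the BSD sed on macOS both do, as far as I know) and runs against a single made-up log line with an address from the 192.0.2.0/24 documentation range:

```shell
# One made-up log line in the same format as the examples above.
line='192.0.2.1 - - [20/Sep/2020:00:03:52 +0200] "GET /2020/08/23/using-sql-as-an-every-day-tool/ HTTP/1.0" 200 95663 "-" "SomeBot"'

# With -E the group parentheses and the + quantifier are written unescaped.
# The "/" delimiter and a literal "[" outside a bracket expression still need a backslash.
csv=$(printf '%s\n' "$line" |
  sed -n -E '/GET \/[0-9]+[^ ]+ HTTP[^ ]+ 200/s/^([^ ]+)[^[]+\[([^]]+)\].*GET (\/[0-9]+[^ ]+).*/\1,\2,\3/p')

printf '%s\n' "$csv"
# 192.0.2.1,20/Sep/2020:00:03:52 +0200,/2020/08/23/using-sql-as-an-every-day-tool/
```

The three capture groups are the same as in the breakdown above, only the notation is lighter.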

To be able to import the data into a spreadsheet program of your choosing you can redirect the output of sed to create a new CSV file:

➜  sed -n '/GET \/[0-9]\+[^ ]\+ HTTP[^ ]\+ 200/s/^\([^ ]\+\)[^[]\+\[\([^]]\+\)\].*GET \(\/[0-9]\+[^ ]\+\).*/\1,\2,\3/p' without_brains_access.log > example.csv
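Spreadsheet programs tend to be friendlier when the first row names the columns. The sketch below writes a header before the sed output; the column names, the access.log file name and the two log lines are made up for illustration, and everything lives in a scratch directory.

```shell
# Made-up stand-in for the real log file, written to a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' \
  '192.0.2.1 - - [20/Sep/2020:00:03:52 +0200] "GET /2020/08/23/using-sql-as-an-every-day-tool/ HTTP/1.0" 200 95663' \
  '192.0.2.2 - - [20/Sep/2020:00:04:10 +0200] "GET /favicon.ico HTTP/1.1" 200 318' \
  > "$workdir/access.log"

# Group the header and the sed output so both are redirected into the same file.
{
  echo 'ip,timestamp,path'
  sed -n '/GET \/[0-9]\+[^ ]\+ HTTP[^ ]\+ 200/s/^\([^ ]\+\)[^[]\+\[\([^]]\+\)\].*GET \(\/[0-9]\+[^ ]\+\).*/\1,\2,\3/p' "$workdir/access.log"
} > "$workdir/example.csv"

csv=$(cat "$workdir/example.csv")
printf '%s\n' "$csv"
# ip,timestamp,path
# 192.0.2.1,20/Sep/2020:00:03:52 +0200,/2020/08/23/using-sql-as-an-every-day-tool/
```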

In conclusion

Sed is a pretty amazing tool to have, and what I have demonstrated in this article is only the tip of the iceberg. If you want to learn more about sed, take a look at its manual by running “info sed” (if you have the “info” command at your disposal). The latest version of the manual (which may not match the version of sed on your system) can be found online here.

If you have feedback on this article or have questions based on its contents then feel free to reach out to me on Twitter or through e-mail.
