BFilter: Content Filters

Advanced Configuration: Content Filters

BFilter allows you to apply regular expressions to page content. This can be used for things like removing portions of a page, altering scripts or injecting your own scripts. There are a couple of things that make BFilter's implementation of this feature unique:

Applying a regex doesn't cause buffering of the whole page.
Replacement expressions can contain JavaScript code.

There is a GUI editor for content filters.

In this document I am going to describe the filter file format instead of the GUI. I do this because you can actually build BFilter without the GUI. Besides, you should have no trouble applying the instructions to the GUI.

Filter Files

The location of filter files depends on the platform:
1. On Windows, they are typically in C:\Program Files\BFilter\conf\filters.
2. On a Mac, they are in /Library/Application Support/BFilter/filters.
3. On Unix / Linux, they are typically in /usr/local/etc/bfilter/filters. With the GUI version, user-specific filters are stored in $HOME/.bfilter/filters.

The files come in pairs:

`Filter Group Name`	Defines a group of filters.
`Filter Group Name.enabled`	Defines which filters are currently enabled.

Let's start with the second one. It contains the names of the enabled filters, one per line. Alternatively, it can contain *, which means all filters are enabled. The absence of this file is the same as having an empty file.
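
The rules for the .enabled file can be summed up in a few lines of JavaScript. This is only a sketch and not BFilter's code: isFilterEnabled is a hypothetical helper, and trimming whitespace around each name is my assumption.

```javascript
// Hypothetical helper, not part of BFilter.
// `enabledFileText` is the content of "Filter Group Name.enabled",
// or null if the file does not exist.
function isFilterEnabled(enabledFileText, filterName) {
  if (enabledFileText === null) return false; // a missing file behaves like an empty one
  const names = enabledFileText
    .split(/\r?\n/)
    .map((line) => line.trim()) // assumption: surrounding whitespace is ignored
    .filter((line) => line !== "");
  // "*" enables every filter in the group.
  return names.includes("*") || names.includes(filterName);
}

console.log(isFilterEnabled("*\n", "remove target=_blank"));
console.log(isFilterEnabled(null, "remove target=_blank"));
```
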
The other one defines a group of filters. It must not have an extension (the file name must not contain any dots). Its syntax is like this:

; comment1
[1st filter's name]
key=value
...
# comment2
[2nd filter's name]
key=value
...

This looks much like an .ini file. One difference is how we handle multiline values:

replace = <<END
multi
line
text
END

You can replace END with any other text. Note that comments are only allowed at the beginning of a line or after any number of tabs / spaces.
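
To make the format concrete, here is a toy parser written in JavaScript. It is a sketch of the rules described above, not BFilter's actual parser. The function name parseFilterFile, the trimming of keys and values, and the silent skipping of unrecognized lines are all my assumptions.

```javascript
// Toy parser for the filter file format; not BFilter's actual parser.
function parseFilterFile(text) {
  const filters = [];
  let current = null;
  const lines = text.split(/\r?\n/);
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i].trim();
    // Comments start with ; or # after optional tabs / spaces.
    if (line === "" || line.startsWith(";") || line.startsWith("#")) continue;
    const section = /^\[(.*)\]$/.exec(line);
    if (section) {
      current = { name: section[1], params: {} };
      filters.push(current);
      continue;
    }
    const eq = line.indexOf("=");
    if (eq === -1 || current === null) continue; // assumption: skip anything else
    const key = line.slice(0, eq).trim();
    let value = line.slice(eq + 1).trim();
    const heredoc = /^<<(\S+)$/.exec(value);
    if (heredoc) {
      // Collect raw lines until a line consisting of the terminator alone.
      const body = [];
      while (++i < lines.length && lines[i].trim() !== heredoc[1]) body.push(lines[i]);
      value = body.join("\n");
    }
    current.params[key] = value;
  }
  return filters;
}

const sample = "; comment1\n[first]\nreplace = <<END\nmulti\nline\nEND\norder = 1\n";
console.log(JSON.stringify(parseFilterFile(sample)));
```

Only the first = on a line separates the key from the value, so a value such as a regex containing = stays intact.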

A minimal filter would be something like this:

[remove target=_blank]
search = /(<a\s[^>]*)target\s*=\s*['"]?_blank['"]?([^>]*>)/
replace = $1$2
replacement_type = expression

Note that filter names are local to the file they are in, so they won't clash with names from other files.

Now let's enumerate all the possible parameters:

order
match_count_limit
url
content_type
search
replace
replacement_type
if_flag
set_flag
clear_flag

order = number
Controls the order in which filters are applied. Filters with lower values are applied first. Filters with negative order values are applied before the built-in html filter, others are applied after it. The default order value is 0. Note that the order in which filters appear in a file is also considered, so you may safely give the same order (or none at all) to all of your filters.
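
Here is how I would picture the ordering rule in JavaScript. This is a toy model: the filter objects and the applicationOrder function are hypothetical, and I am assuming that filters with equal order values simply keep their position in the file.

```javascript
// Toy model of filter ordering; not BFilter's implementation.
function applicationOrder(filters) {
  // Array.prototype.sort is stable (ES2019+), so equal orders keep file order.
  const sorted = [...filters].sort((a, b) => (a.order ?? 0) - (b.order ?? 0));
  return {
    beforeHtmlFilter: sorted.filter((f) => (f.order ?? 0) < 0).map((f) => f.name),
    afterHtmlFilter: sorted.filter((f) => (f.order ?? 0) >= 0).map((f) => f.name),
  };
}

const plan = applicationOrder([
  { name: "a" },            // no order given, defaults to 0
  { name: "b", order: -1 }, // runs before the built-in html filter
  { name: "c", order: 0 },
  { name: "d", order: 5 },
]);
console.log(JSON.stringify(plan));
```
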

match_count_limit = number
Allows you to limit the number of times the match/replace operation is performed. For example, you may want to insert your own script just before the first script on a page, but not before every script. In this case you set match_count_limit = 1. Without this parameter, the number of match/replace operations is unlimited.
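
The effect of the limit can be imitated in plain JavaScript. The helper replaceLimited below is hypothetical and only illustrates the idea; it says nothing about how BFilter implements the limit internally.

```javascript
// Sketch only: `replaceLimited` is a hypothetical helper, not a BFilter function.
// `regex` must carry the g flag so that every match is visited.
function replaceLimited(input, regex, replacement, limit = Infinity) {
  let count = 0;
  // Once the limit is reached, each further match is put back unchanged.
  return input.replace(regex, (match) => (count++ < limit ? replacement : match));
}

const page = "<script>a</script><script>b</script>";
const mine = "<script>/* mine */</script>";
// With a limit of 1, only the first <script> gets our script in front of it.
console.log(replaceLimited(page, /<script>/gi, mine + "<script>", 1));
```
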

url = glob
url = /regex/
Binds the filter to a specific URL. Glob is just a string with * and ? wildcards. Regular expressions are a complex topic, but you can find a good reference and tutorial here. Both the glob and the regex are anchored, which means they must cover the whole URL, not just a part of it. The default is not to bind the filter to a particular URL.
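
If you wonder what "anchored" means in practice, the following JavaScript sketch converts a glob into an equivalent anchored, case-insensitive regex. The function globToAnchoredRegExp is hypothetical and is not how BFilter necessarily does it; the example URLs are made up.

```javascript
// Hypothetical helper: turns a glob into an anchored, case-insensitive RegExp.
function globToAnchoredRegExp(glob) {
  // Escape regex metacharacters, but leave the glob wildcards * and ? alone.
  const escaped = glob.replace(/[.+^${}()|[\]\\]/g, "\\$&");
  const pattern = escaped.replace(/\*/g, ".*").replace(/\?/g, ".");
  // ^ and $ force the pattern to cover the whole string.
  return new RegExp("^" + pattern + "$", "i");
}

const re = globToAnchoredRegExp("http://example.com/*");
console.log(re.test("http://example.com/a/b"));      // whole URL is covered
console.log(re.test("see http://example.com/a/b"));  // extra text in front is not allowed
```
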

content_type = glob
content_type = /regex/
Apply a filter only to certain content types. Here is an example of a content type: text/html; charset=UTF-8. The default is to apply filters to html and xhtml content only. Note that you can't really trust the content type. Some sites serve javascript or even images as text/html. Use built-in flags to solve this problem.

search = glob
search = /regex/
Search for a specific pattern in the byte stream. As you see, it works on the byte level, not on the text level. This means you won't be able to match arbitrary text in your native language. English text should be fine though. Anyway, the main purpose of content filters is to alter html tags and inject your own scripts, which works just fine on the byte level.

This time, the glob and the regex are not anchored, unless you explicitly anchor the regex with ^ and $. The patterns are case insensitive. This also applies to url and content_type patterns.

A word of warning: be careful with greedy quantifiers. A pattern like this: /<!--.*-->/ will match everything starting from the first comment in the document and ending with the last one. Use /<!--.*?-->/ instead. This doesn't mean you should avoid greedy quantifiers altogether. Basically only this case: .* should be avoided. Also note that greedy quantifiers have better performance than lazy ones.
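
To see the difference between greedy and lazy quantifiers, here is a small JavaScript experiment that strips HTML comments from a made-up page. It uses JavaScript's own regex engine, where the s flag is needed for . to match newlines; whether BFilter's engine behaves the same way in that respect is not something I am claiming here.

```javascript
const page = "<p>a</p><!-- one --><p>b</p><!-- two --><p>c</p>";

// Greedy: .* runs to the last "-->" on the page, swallowing <p>b</p> as well.
const greedy = page.replace(/<!--.*-->/gs, "");

// Lazy: .*? stops at the nearest "-->", so each comment is removed separately.
const lazy = page.replace(/<!--.*?-->/gs, "");

console.log(greedy);
console.log(lazy);
```
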

replace = replacement_text
The replacement text. It can be a simple text, an expression with back-references, or JavaScript code. See the next parameter for more details.

replacement_type = text|expression|js

text
Default option. The matched text will be replaced with replacement_text.
expression
replacement_text will be interpreted as a regex format string. The format string syntax is described here.
js
replacement_text will be interpreted as the body of a JavaScript function. The function takes a variable number of arguments accessible through the arguments array, where arguments[0] contains the matched text, arguments[1] contains the first sub-match, and so on. The matched text will be replaced with the return value of the function. If the return value is null, the system will act as if no match was found in that position, and the search will resume from the next byte.
More information about JS replacements is available.
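
The js replacement type can be imitated in ordinary JavaScript, which is handy for trying out a replacement body before putting it into a filter. The sketch below is a toy re-implementation, not BFilter's code. The helper applyJsReplacement and the host names are hypothetical. One known difference: on a null return this toy continues after the end of the match, while BFilter resumes from the next byte.

```javascript
// Toy re-implementation of the "js" replacement type; not BFilter's code.
function applyJsReplacement(input, regex, body) {
  const fn = new Function(body); // the replacement text becomes the function body
  return input.replace(regex, (...args) => {
    // String.replace appends the offset and the input; keep match + sub-matches only.
    const offsetIndex = args.findIndex((a) => typeof a === "number");
    const result = fn(...args.slice(0, offsetIndex));
    // null: leave the text alone, as if nothing had matched here.
    return result === null ? args[0] : String(result);
  });
}

// Remove links whose href contains "ads.", keep everything else.
const body = 'if (arguments[1].indexOf("ads.") == -1) return null; return "";';
const html = '<a href="http://ads.example/x"><a href="http://example.org/">';
console.log(applyJsReplacement(html, /<a\s[^>]*href="([^"]*)"[^>]*>/gi, body));
```
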

if_flag = flag_name
set_flag = flag_name
clear_flag = flag_name
A flag serves as a condition allowing a filter to match. If a filter depends on a flag, it will only have a chance to match if that flag is set. You can set or clear a flag as a side effect of a successful match. This way you can make filters depend on other filters.

Flag names are local to the file containing them, and can't clash with flags from other files. Flag names are case sensitive.

There is one important detail you need to be aware of when using flags: by the time your filter matches something and sets a flag, all the data preceding the match has already been fed to the next filter. The next filter has seen the data but hasn't yet seen a flag you are setting, so if it was dependent on that flag, it would just let the data through without touching it.
All that may seem unnatural, but that's the consequence of not buffering the whole page before applying filters.

Here is an example that won't work:

# We want to insert a script at the beginning of a page,
# but only if the page has at least one script.
[locate a script]
search = /(?=<script[\s>])/
replace =
set_flag = script_found

[insert our script]
if_flag = script_found
search = /^/
replace = <<END
<script>
// ...
</script>
END

# It won't work because by the time the first filter sets
# the flag, the second one has already seen and let through
# the beginning of the page.

The problem can be solved by writing a regex that is anchored at the beginning of the page and accumulates all the data preceding the first script:

[locate a script]
search = /^(.*?<script[\s>])/
replace = $1
replacement_type = expression
set_flag = script_found

[insert our script]
search = /^/
replace = <<END
<script>
// ...
</script>
END
if_flag = script_found
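
If the timing issue is still unclear, the following JavaScript toy model may help. It imitates a chain of two filters where the first one forwards data to the second one as soon as it can. This is my simplified picture of the behavior described above, not BFilter's implementation. runChain, OUR_SCRIPT and the sample page are made up for the illustration.

```javascript
const OUR_SCRIPT = "<script>/* ours */</script>";

// accumulatePrefix = false models search = /(?=<script[\s>])/
// accumulatePrefix = true  models search = /^(.*?<script[\s>])/
function runChain(page, accumulatePrefix) {
  const flags = new Set();
  let atStart = true;

  // [insert our script]: acts only on the first data it receives,
  // and only if the flag is already set at that moment.
  function inserter(chunk) {
    const fire = atStart && flags.has("script_found");
    atStart = false;
    return fire ? OUR_SCRIPT + chunk : chunk;
  }

  // [locate a script]
  const idx = page.search(/<script[\s>]/i);
  if (idx === -1) return inserter(page); // no script: the flag is never set
  if (!accumulatePrefix && idx > 0) {
    const head = inserter(page.slice(0, idx)); // forwarded before the flag is set
    flags.add("script_found");
    return head + inserter(page.slice(idx));
  }
  // The match starts at the beginning, so nothing was forwarded early.
  flags.add("script_found");
  return inserter(page);
}

const page = "<html><script>x</script>";
console.log(runChain(page, false)); // page comes out unchanged
console.log(runChain(page, true));  // our script is inserted at the start
```
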

Hint: JavaScript replacements within the same filter group (same file) can share variables between each other, which provides a more flexible alternative to flags. Read on for more info.

Built-in Flags

There are 3 built-in flags that are set by BFilter itself:

_HTML_
_XHTML_
_HTML_OR_XHTML_

They exist because you can't really trust the Content-Type header. It's not uncommon to encounter a dynamically generated image with Content-Type: text/html. If you try to apply your html filter to such an image, you have a good chance of breaking it. This problem can be solved like this:

if_flag = _HTML_OR_XHTML_

Note that these built-in flags are guaranteed to be set at the very beginning of a page, or not set at all.

JavaScript Execution Context

As I already mentioned, a JavaScript replacement expression is basically a body of a function that is supposed to return the replacement text. But where does that function live and what other functions and objects live in there?
That function is really a method of a special context object. In a browser environment, the window object serves as a context. The functions you define become its methods, and the global variables become its properties. In our case, the context object represents the filter group (filters that are defined in the same file). This makes it possible to share variables between different filters. Of course, a separate context is created for each page being filtered, so sharing variables this way is quite safe.
First let us recall the syntax of global and local variables in JavaScript:

function f()
{
	a = 1; // global variable (context variable)
	var b = 2; // local variable (local to a function)
	b = 3; // local variable, because already declared
}

Here is an example of sharing variables between filters:

[count links]
search = /<a\s[^>]*href[\s=][^>]*>/
replace = <<END
if (typeof link_count == 'undefined') {
	link_count = 1;
} else {
	++link_count;
}
END
replacement_type = js

[popup a message box]
search = /$/
replace = <<END
if (typeof link_count == 'undefined') {
	link_count = 0;
}
var msg = 'Number of links: '+link_count;
return '<script>alert("'+msg+'")</script>';
END
replacement_type = js
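
You can try out this kind of variable sharing outside of BFilter. The sketch below uses Node's vm module as a stand-in for the per-page context object, so it is an approximation and not BFilter's actual engine. The helpers compileReplacement and countLinks are hypothetical, and the second body returns plain text instead of a script tag to keep the output short.

```javascript
const vm = require("vm");

// Hypothetical helper: compiles a replacement body into a function living in `context`.
function compileReplacement(body, context) {
  return vm.runInContext("(function () {\n" + body + "\n})", context);
}

function countLinks(page) {
  const context = vm.createContext({}); // a fresh context for each page
  const count = compileReplacement(`
    if (typeof link_count == 'undefined') {
      link_count = 1; // no var, so this becomes a context variable
    } else {
      ++link_count;
    }`, context);
  const report = compileReplacement(`
    if (typeof link_count == 'undefined') {
      link_count = 0;
    }
    return 'Number of links: ' + link_count;`, context);
  // Call the first body once per link, then the second body once at the end.
  for (const m of page.matchAll(/<a\s[^>]*href[\s=][^>]*>/gi)) count(m[0]);
  return report("");
}

console.log(countLinks('<a href="x">1</a> <a href="y">2</a> <a name="z">'));
```

Because each call to countLinks creates its own context, the counter from one page never leaks into the next one.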

If you are familiar with JavaScript, you are used to having certain objects at your disposal. These include window, document, navigator and so on. These are provided by the browser, not by the language itself. The language provides just a few functions and classes. These are: escape(), unescape(), Function, RegExp, Date, Math and a few others. That's what you'll have to rely on when writing JavaScript replacements. One extra function provided by BFilter is log(). It can be used to debug your filters.
Example:

[test]
search = /<a\s[^>]*>/
replace = <<END
log("Match: "+arguments[0]);
return arguments[0];
END
replacement_type = js

It outputs messages to BFilter's log, but only if the Filter Configuration window is open and the selected filter is from the same group as the one sending the message.