Advanced Configuration: Content Filters
BFilter allows you to apply regular expressions to page content. This can be
used for things like removing portions of a page, altering scripts or injecting
your own scripts. There are a couple of things that make BFilter's
implementation of this feature unique:
- Applying a regex doesn't cause buffering of the whole page.
- Replacement expressions can contain JavaScript code.
Filter Files
Filter files are located in:
1. On Windows, they are typically located in C:\Program Files\BFilter\conf\filters.
2. On a Mac, they are located in /Library/Application Support/BFilter/filters.
3. On Unix / Linux they are typically located in /usr/local/etc/bfilter/filters. In case of the GUI version, user-specific filters are stored in $HOME/.bfilter/filters.
The files come in pairs:
Filter Group Name | Defines a group of filters.
Filter Group Name.enabled | Defines which filters are currently enabled.
The first file defines a group of filters. It must not have an extension (the file name must not contain any dots). Its syntax is like this:
    ; comment1
    [1st filter's name]
    key=value
    ...
    # comment2
    [2nd filter's name]
    key=value
    ...
Values that span multiple lines are written like this:
    replace = <<END
    multi
    line
    text
    END
You can replace END with any other text. Note that comments are only allowed at the beginning of a line or after any number of tabs / spaces.
A minimal filter would be something like this:
    [remove target=_blank]
    search = /(<a\s[^>]*)target\s*=\s*['"]?_blank['"]?([^>]*>)/
    replace = $1$2
    replacement_type = expression
Now let's enumerate all the possible parameters:
- order
- match_count_limit
- url
- content_type
- search
- replace
- replacement_type
- if_flag
- set_flag
- clear_flag
order = number
Controls the order in which filters are applied. Filters with lower values are applied first. Filters with negative order values are applied before the built-in html filter, others are applied after it. The default order value is 0. Note that the order in which filters appear in a file is also considered, so you may safely give the same order (or none at all) to all of your filters.
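The ordering rule can be sketched in JavaScript. This is only an illustration of the description above, not BFilter's actual code: the function name `applicationOrder` and the `{name, order}` objects are hypothetical, and how filters from different files interleave is not covered here.

```javascript
// Hypothetical sketch: compute the order in which the filters of one file
// would be applied, following the rules described above.
function applicationOrder(filters) {
    var indexed = filters.map(function (f, i) {
        return {
            name: f.name,
            order: f.order === undefined ? 0 : f.order, // default order is 0
            position: i                                  // position in the file
        };
    });
    // Lower order first; ties are broken by position in the file.
    indexed.sort(function (a, b) {
        return (a.order - b.order) || (a.position - b.position);
    });
    var names = [];
    var builtInPlaced = false;
    indexed.forEach(function (f) {
        // Negative orders run before the built-in html filter, the rest after.
        if (!builtInPlaced && f.order >= 0) {
            names.push('(built-in html filter)');
            builtInPlaced = true;
        }
        names.push(f.name);
    });
    if (!builtInPlaced) {
        names.push('(built-in html filter)');
    }
    return names;
}

var sequence = applicationOrder([
    { name: 'c', order: 5 },
    { name: 'a' },
    { name: 'pre', order: -1 },
    { name: 'b' }
]);
// sequence: pre, (built-in html filter), a, b, c
```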
match_count_limit = number
Allows you to limit the number of times the match/replace operation is performed. For example, you may want to insert your own script just before the first script on a page, but not before every script. In this case, set match_count_limit = 1. If this parameter is absent, the number of match/replace operations is unlimited.
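The effect of the limit can be tried out in plain JavaScript. The helper `replaceWithLimit` below is a hypothetical emulation that works on a whole string, whereas BFilter works on a stream. It only shows what "stop replacing after N matches" means.

```javascript
// Hypothetical emulation of match_count_limit on an in-memory string.
function replaceWithLimit(input, pattern, replacer, limit) {
    // BFilter patterns are case insensitive, hence the 'i' flag.
    var regex = new RegExp(pattern.source, 'gi');
    var count = 0;
    return input.replace(regex, function (match) {
        if (typeof limit === 'number' && count >= limit) {
            return match; // limit reached: leave further matches untouched
        }
        count += 1;
        return replacer(match);
    });
}

var page = '<script>a()</script><p>text</p><script>b()</script>';
function insertBefore(match) {
    return '<script>mine()</script>' + match;
}

// With a limit of 1, only the first script gets ours inserted before it.
var limited = replaceWithLimit(page, /<script[\s>]/, insertBefore, 1);
// Without a limit, our script is inserted before every script.
var unlimited = replaceWithLimit(page, /<script[\s>]/, insertBefore);
```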
url = glob
url = /regex/
Binds the filter to a specific URL. Glob is just a string with * and ? wildcards. Regular expressions are a complex topic, but you can find a good reference and tutorial here. Both the glob and the regex are anchored, which means they must cover the whole URL, not just a part of it. The default is not to bind the filter to a particular URL.
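To see what anchoring means in practice, here is a small JavaScript sketch that turns a glob into an anchored, case-insensitive regex. The function `globToAnchoredRegex` is a hypothetical helper written for this illustration. BFilter's own glob matching is not shown in this document and may differ in detail.

```javascript
// Hypothetical helper: convert a glob with * and ? into an anchored regex.
function globToAnchoredRegex(glob) {
    var source = glob
        .replace(/[.+^${}()|[\]\\]/g, '\\$&') // escape regex metacharacters
        .replace(/\*/g, '.*')                   // * matches any run of characters
        .replace(/\?/g, '.');                   // ? matches a single character
    // ^ and $ force the pattern to cover the whole URL.
    return new RegExp('^' + source + '$', 'i');
}

var wholeSite = globToAnchoredRegex('http://example.com/*');
var bareHost = globToAnchoredRegex('example.com');
var anywhere = globToAnchoredRegex('*example.com*');

var a = wholeSite.test('http://example.com/index.html'); // covers the whole URL
var b = bareHost.test('http://example.com/');            // covers only a part of it
var c = anywhere.test('HTTP://EXAMPLE.COM/page');        // wildcards on both sides
```

A glob such as example.com therefore does not match http://example.com/ on its own. Surround it with * wildcards if you want a "contains" match.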
content_type = glob
content_type = /regex/
Apply a filter only to certain content types. Here is an example of a content type: text/html; charset=UTF-8. The default is to apply filters to html and xhtml content only. Note that you can't really trust the content type. Some sites serve javascript or even images as text/html. Use built-in flags to solve this problem.
search = glob
search = /regex/
Search for a specific pattern in the byte stream. As you see, it works on the byte level, not on the text level. This means you won't be able to match arbitrary text in your native language. English text should be fine though. Anyway, the main purpose of content filters is to alter html tags and inject your own scripts, which works just fine on the byte level.
This time, the glob and the regex are not anchored, unless you explicitly anchor the regex with ^ and $. The patterns are case insensitive. This also applies to url and content_type patterns.
A word of warning: be careful with greedy quantifiers. A pattern like /<!--.*-->/ will match everything starting from the first comment in the document and ending with the last one. Use /<!--.*?-->/ instead. But it doesn't mean you should not use greedy quantifiers at all. Basically only this case: .* should be avoided. Also note that greedy quantifiers have better performance than lazy ones.
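The difference is easy to reproduce with JavaScript's own regex engine, which treats greedy and lazy quantifiers the same way for this pattern. The sample markup is made up for the illustration and kept on one line, because in JavaScript a dot does not match a newline by default.

```javascript
var html = '<p>a</p><!-- first --><p>keep me</p><!-- second --><p>b</p>';

// Greedy: .* runs to the last --> and swallows the paragraph in between.
var greedy = html.replace(/<!--.*-->/g, '');

// Lazy: .*? stops at the nearest -->, so each comment is removed separately.
var lazy = html.replace(/<!--.*?-->/g, '');
```

Here `greedy` ends up as `<p>a</p><p>b</p>`, while `lazy` keeps the middle paragraph: `<p>a</p><p>keep me</p><p>b</p>`.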
replace = replacement_text
The replacement text. It can be a simple text, an expression with back-references, or JavaScript code. See the next parameter for more details.
replacement_type = text|expression|js
- text: Default option. The matched text will be replaced with replacement_text.
- expression: replacement_text will be interpreted as a regex format string, that is, an expression with back-references such as $1 and $2.
- js: replacement_text will be interpreted as a body of a JavaScript function. The function takes a variable number of arguments accessible through the arguments array, where arguments[0] contains the matched text, arguments[1] contains the first sub-match, and so on. The matched text will be replaced with the return value of the function. If the return value is null, the system will act as if no match was found in that position. The search would resume from the next byte.
More information about JS replacements is available in the JavaScript Execution Context section below.
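The calling convention of a js replacement can be emulated in an ordinary JavaScript environment, which is handy for testing a function body before putting it into a filter file. The helper `applyJsReplacement` is hypothetical and only approximates BFilter: it works on a whole string, and when the body returns null it simply keeps the matched text instead of resuming the search from the next byte.

```javascript
// Hypothetical test harness for the body of a js replacement.
function applyJsReplacement(input, pattern, body) {
    var fn = new Function(body);                   // body becomes a function
    var regex = new RegExp(pattern.source, 'gi');  // patterns are case insensitive
    return input.replace(regex, function () {
        // String.replace passes (match, sub-matches..., offset, whole string);
        // drop the last two so that arguments[0] is the match and
        // arguments[1], arguments[2], ... are the sub-matches.
        var args = Array.prototype.slice.call(arguments, 0, -2);
        var result = fn.apply(null, args);
        // Approximation of the null rule: leave the matched text as it was.
        return result === null ? args[0] : String(result);
    });
}

var tag = '<a href="x" target="_blank">';
var stripped = applyJsReplacement(
    tag,
    /(<a\s[^>]*)target="_blank"([^>]*>)/,
    'return arguments[1] + arguments[2];'
);
var untouched = applyJsReplacement(tag, /<a\s[^>]*>/, 'return null;');
```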
if_flag = flag_name
set_flag = flag_name
clear_flag = flag_name
A flag serves as a condition allowing a filter to match. If a filter depends on a flag, it will only have a chance to match if that flag is set. You can set or clear a flag as a side effect of a successful match. This way you can make filters depend on other filters.
Flag names are local to the file containing them, and can't clash with flags from other files. Flag names are case sensitive.
There is one important detail you need to be aware of when using flags: by the time your filter matches something and sets a flag, all the data preceding the match has already been fed to the next filter. The next filter has seen the data but hasn't yet seen a flag you are setting, so if it was dependent on that flag, it would just let the data through without touching it.
All that may seem unnatural, but that's the consequence of not buffering the whole page before applying filters.
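The timing issue can be modelled with a toy pipeline of two streaming filters. Everything in this sketch is an assumption made for illustration (the function names, the chunking, the `[OUR SCRIPT]` marker). It is not how BFilter is implemented, but it shows why a downstream filter misses a flag that is set too late.

```javascript
// Toy model: an upstream filter sets a flag, a downstream filter depends on it.
function runPipeline(chunks) {
    var flags = {};
    var output = '';
    var startSeen = false;

    // Downstream filter: wants to insert something at the very start of the
    // page, but only if the flag is already set when it sees the start.
    function insertFilter(data) {
        if (!startSeen) {
            startSeen = true;
            if (flags.script_found) {
                data = '[OUR SCRIPT]' + data;
            }
        }
        output += data;
    }

    // Upstream filter: looks for the first script tag and sets the flag.
    function locateFilter(chunk) {
        var pos = chunk.search(/<script[\s>]/i);
        if (pos === -1 || flags.script_found) {
            insertFilter(chunk); // nothing to do, pass the data on
            return;
        }
        insertFilter(chunk.slice(0, pos)); // data before the match goes first
        flags.script_found = true;         // only now is the flag set
        insertFilter(chunk.slice(pos));
    }

    chunks.forEach(locateFilter);
    return { output: output, flags: flags };
}

var result = runPipeline([
    '<html><head><title>t</title>',
    '<script>x()</script></head>'
]);
// The flag ends up set, yet nothing was inserted: the downstream filter had
// already let the beginning of the page through.
```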
Here is an example that won't work:
    # We want to insert a script at the beginning of a page,
    # but only if the page has at least one script.

    [locate a script]
    search = /(?=<script[\s>])/
    replace =
    set_flag = script_found

    [insert our script]
    if_flag = script_found
    search = /^/
    replace = <<END
    <script>
    // ...
    </script>
    END

    # It won't work because by the time the first filter sets
    # the flag, the second one has already seen and let through
    # the beginning of the page.
The problem can be solved by writing a regex that accumulates all the data preceding the first script, so that nothing is passed on to the next filter until the flag has been set:
    [locate a script]
    search = /^(.*?<script[\s>])/
    replace = $1
    replacement_type = expression
    set_flag = script_found

    [insert our script]
    search = /^/
    replace = <<END
    <script>
    // ...
    </script>
    END
    if_flag = script_found
Hint: JavaScript replacements within the same filter group (same file) can share variables between each other, which provides a more flexible alternative to flags. Read on for more info.
Built-in Flags
There are 3 built-in flags that are set by BFilter itself:
- _HTML_
- _XHTML_
- _HTML_OR_XHTML_
For example:
    if_flag = _HTML_OR_XHTML_
Note that these built-in flags are guaranteed to be set at the very beginning of a page, or not set at all.
JavaScript Execution Context
As I already mentioned, a JavaScript replacement expression is basically a body of a function that is supposed to return the replacement text. But where does that function live and what other functions and objects live in there?
That function is really a method of a special context object. In a browser environment, the window object serves as a context. The functions you define become its methods, and the global variables become its properties. In our case, the context object represents the filter group (filters that are defined in the same file). This makes it possible to share variables between different filters. Of course, a separate context is created for each page being filtered, so sharing variables this way is quite safe.
First let us recall the syntax of global and local variables in JavaScript:
    function f() {
        a = 1;      // global variable (context variable)
        var b = 2;  // local variable (local to a function)
        b = 3;      // local variable, because already declared
    }
Here is an example of sharing variables between filters:
    [count links]
    search = /<a\s[^>]*href[\s=][^>]*>/
    replace = <<END
    if (typeof link_count == 'undefined') {
        link_count = 1;
    } else {
        ++link_count;
    }
    END
    replacement_type = js

    [popup a message box]
    search = /$/
    replace = <<END
    if (typeof link_count == 'undefined') {
        link_count = 0;
    }
    var msg = 'Number of links: '+link_count;
    return '<script>alert("'+msg+'")</script>';
    END
    replacement_type = js
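The two function bodies above can be tried outside BFilter to check that the shared variable behaves as expected. In this sketch the global object of the JavaScript environment stands in for the filter group's context, which is an assumption of the emulation. In BFilter a separate context is created for each page. The sample page is made up.

```javascript
// Emulation only: the global object plays the role of the context object.
delete globalThis.link_count; // start with a fresh "context"

// Body of the [count links] filter.
var countLinks = new Function(
    "if (typeof link_count == 'undefined') { link_count = 1; }" +
    "else { ++link_count; }"
);

// Body of the [popup a message box] filter.
var popup = new Function(
    "if (typeof link_count == 'undefined') { link_count = 0; }" +
    "var msg = 'Number of links: ' + link_count;" +
    "return '<script>alert(\"' + msg + '\")</script>';"
);

var page = '<a href="/one">1</a> <a href="/two">2</a> <A HREF="/three">3</A>';

// Call the first body once per link tag, as the filter would do per match.
var links = page.match(/<a\s[^>]*href[\s=][^>]*>/gi) || [];
links.forEach(function (link) {
    countLinks(link);
});

// Call the second body once, as the filter would do at the end of the page.
var injected = popup('');
```

Because link_count is assigned without var, it lives in the context rather than in either function, so the second body sees the value accumulated by the first.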
If you are familiar with JavaScript, you are used to having certain objects at your disposal. These include window, document, navigator and so on. These are provided by the browser, not by the language itself. The language provides just a few functions and classes. These are: escape(), unescape(), Function, RegExp, Date, Math and a few others. That's what you'll have to rely on when writing JavaScript replacements. One extra function provided by BFilter is log(). It can be used to debug your filters.
Example:
    [test]
    search = /<a\s[^>]*>/
    replace = <<END
    log("Match: "+arguments[0]);
    return arguments[0];
    END
    replacement_type = js