Wordpress Versioning: Part 2
During a recent attempt at answering the Honeynet Log Mysteries Challenge, I wrote a series of reasoned analyses for the supplied Honeynet logging data. Unfortunately, teaching workloads stopped me from submitting any realistic challenge answer.
Inspired by the idea of applying the Scientific Method to Digital Forensics (see Casey2009 and Carrier2006) and using data visualisation (see Conti2007 and Marty2008), I set about attempting to apply the same principles to analysing the Log Mysteries data sets.
In Wordpress Versioning: Part 1 we showed how, by downloading candidate Wordpress plugins, we could compare each downloaded plugin against a series of observed URLs, and so test whether a given candidate plugin was unlikely to be installed. This article focuses on using probability measures to estimate:
- the version of Wordpress that is installed
- and the Wordpress plugins that are installed.
Wordpress and its plugins have their source code version controlled with Subversion (at least this is the case these days!). By checking out the entire source code tree for Wordpress and all its plugins, we can build database tables relating every Wordpress and plugin file (from the repository!) to its size, SHA1 hash, etc.
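As a rough illustration of the kind of table described above, the following Ruby sketch walks a checked-out working copy and records the size and SHA1 hash of every file. The names `FileRecord` and `index_tree` are hypothetical and chosen only for this example; they are not taken from the rake tasks used in this article, and a real implementation would write the records to database tables rather than return an array.

```ruby
require "digest/sha1"
require "find"

# Hypothetical record type for this sketch (not the article's actual schema).
FileRecord = Struct.new(:path, :bytes, :sha1)

# Walk a working copy rooted at `root` and return one record per regular file,
# skipping Subversion's own ".svn" metadata directories.
def index_tree(root)
  records = []
  Find.find(root) do |path|
    if File.directory?(path)
      Find.prune if File.basename(path) == ".svn"
      next
    end
    next unless File.file?(path)

    # Store the path relative to the root so it can be compared with URLs.
    relative = path.delete_prefix(root).sub(%r{\A/+}, "")
    records << FileRecord.new(relative, File.size(path), Digest::SHA1.file(path).hexdigest)
  end
  records.sort_by(&:path)
end
```

Pointing `index_tree` at a checked-out tag directory would then give the (path, size, hash) triples for that release.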
Mapping Subversion Repositories to Rails Models
When using script-based implementations to download large repositories, it is best to first check out the entire repository and then process the results offline. Thus, we first issue the commands:
    svn co http://core.svn.wordpress.org evidence/wordpress
    svn co http://svn.wp-plugins.org evidence/wp-plugins

(see also the rake tasks checkout:svn:wordpress and checkout:svn:wp-plugins in svn.rake).
Once we have successfully downloaded the Wordpress repositories, we can use the rake tasks build:svn:wordpress and build:svn:wp-plugins to process the downloaded code into an instance of the following class diagram:
Note: checking out all of the Wordpress (and plugin) source code repositories takes approximately 3 days and consumes around 70GB of disk space.
From Wordpress Versioning: Part 1 we can identify that two URLs are used to access the Wordpress application:
- GET /wp-includes/js/jquery/jquery.js with a response size of 57276 bytes; we'll refer to this as event $E_0 = \{ (\text{"/wp-includes/js/jquery/jquery.js"}, 57276) \}$;
- and GET /wp-includes/js/jquery/jquery.form.js with a response size of 8429 bytes; we'll refer to this as event $E_1 = \{ (\text{"/wp-includes/js/jquery/jquery.form.js"}, 8429) \}$.
To classify these events, we set up a naive Bayesian network in which:
- $File$ is a non-empty set of files (we assume that each file is a hash data structure with keys $size: File \rightarrow \mathbb{N}_0$ and $url: File \rightarrow \mathbb{P}(String)$);
- $x \in File$ is a random variable;
- $Version$ is a non-empty set of version tags;
- $W: Version \rightarrow \mathbb{P}(File)$ associates files with a specific tag release of Wordpress;
- $v \in Version$ is the tag release to be classified;
- $u \in String$ is an observed URL request;
- and $s \in \mathbb{N}_0$ is an observed response size (in bytes).
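In terms of these definitions, one way such a classifier could be written down is sketched here. This is an assumed formulation, not necessarily the exact network used for the results in this article: it assumes a prior $P(v)$ over tag releases, that each $W(v)$ is non-empty, that a file is drawn uniformly from $W(v)$, and that observations are conditionally independent given $v$ (the naive Bayes assumption).

$$
P(v \mid E_0, E_1) \;\propto\; P(v) \prod_{(u, s) \in E_0 \cup E_1} P\big((u, s) \mid v\big),
\qquad
P\big((u, s) \mid v\big) = \frac{\left|\{\, x \in W(v) : u \in url(x) \wedge size(x) = s \,\}\right|}{\left|W(v)\right|}
$$

Under this reading, a tag release that contains no file matching an observed URL and size receives a posterior probability of zero.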
With our naive Bayesian network setup, we can now use it to classify the (tag release) versions of Wordpress as follows:
From this we estimate the following probabilities, based on events $E_0$ and $E_1$ (all other tag releases have a probability of 0% and so are excluded from the table):
| Wordpress Tag Release | Probability |
|---|---|
| 2.8 | 9.99% |
| 2.8.1 | 9.99% |
| 2.8.2 | 9.99% |
| 2.8.3 | 9.99% |
| 2.8.4 | 9.99% |
| 2.8.5 | 9.99% |
| 2.8.6 | 9.99% |
| 2.9 | 10.03% |
| 2.9.1 | 10.03% |
| 2.9.2 | 10.03% |
Working to one decimal place, we are able to estimate (with equal likelihood) that Wordpress has a tag release within the range of releases listed in the table above (i.e. we're on a 2.8 or 2.9 tag release branch). The small probability variations here can be accounted for by differing tag release population sizes.
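To illustrate how release population sizes can shift the estimates even when every candidate release contains the observed files, here is a minimal naive Bayes sketch in Ruby. It is not the SamIam network used for the table above: the `RELEASES` data is made up, the `posterior` function is a hypothetical helper, and it assumes a uniform prior over releases with each observation's likelihood being the fraction of a release's files that match the observed URL and size.

```ruby
# Toy data, NOT taken from the Wordpress repository: each release maps to
# the [url, size] pairs of the files it ships.
RELEASES = {
  "1.0" => [["/a.js", 100], ["/b.js", 200], ["/c.js", 300]],
  "1.1" => [["/a.js", 100], ["/b.js", 200], ["/c.js", 300], ["/d.js", 400]],
  "2.0" => [["/a.js", 150], ["/b.js", 200]]
}.freeze

# Posterior over releases given observed [url, size] pairs.
# Assumes every release has at least one file and a uniform prior.
def posterior(releases, observations)
  prior = Rational(1, releases.length)
  scores = releases.transform_values do |files|
    observations.reduce(prior) do |acc, obs|
      # Likelihood: fraction of this release's files matching the observation.
      acc * Rational(files.count(obs), files.length)
    end
  end
  evidence = scores.values.sum
  return scores if evidence.zero? # no release explains the observations
  scores.transform_values { |score| score / evidence }
end

posterior(RELEASES, [["/a.js", 100], ["/b.js", 200]]).each do |release, p|
  puts format("%-4s %6.2f%%", release, (p * 100).to_f)
end
```

In this toy data, releases 1.0 and 1.1 both contain the two observed files, yet they receive different probabilities because they ship different numbers of files, while release 2.0 is ruled out by the size mismatch on /a.js.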
Wordpress Plugins: Tag Release Estimates
In a similar manner, we can also build naive Bayesian classifiers for determining which Wordpress plugins are installed, along with their respective tag release or trunk version numbers, as follows:
| Wordpress Plugin | Tag Release | Observations |
|---|---|---|
| Contact Form 7 | equal probability for each tag release in list 2.1, 2.1.1, 2.1.2, 2.2, 2.2.1 and 2.3 | estimate consistent with parameter ver=2.1.1 |
| Google Analyticator | equal probability for each tag release in list 6.0, 6.0.1, 6.0.2, 6.1 and 6.1.1 | estimate consistent with parameter ver=6.0.2 |
| Google Syntax Highlighter | 1.5.1 | this estimate only holds if we ignore the observation of shBrushBash.js with a size of 2810 bytes |
Note: searching for the file shBrushBash.js within the Wordpress plugin repository reveals no file with a size of 2810 bytes.
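A lookup of this kind can be sketched in Ruby as follows. The helper names `sizes_of` and `size_observed?` are hypothetical, and the sketch scans a checked-out directory tree directly rather than querying the database tables built earlier.

```ruby
require "find"

# Return the sorted, distinct sizes (in bytes) of every regular file
# named `basename` found anywhere under `root`.
def sizes_of(root, basename)
  sizes = []
  Find.find(root) do |path|
    sizes << File.size(path) if File.file?(path) && File.basename(path) == basename
  end
  sizes.uniq.sort
end

# True if some file named `basename` under `root` has exactly `size` bytes.
def size_observed?(root, basename, size)
  sizes_of(root, basename).include?(size)
end
```

For example, with the plugin repository checked out to evidence/wp-plugins as above, `size_observed?("evidence/wp-plugins", "shBrushBash.js", 2810)` would perform the check described in the note.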
In the final blog article of this series, we shall look at how the work of Florian Buchholz (e.g. see An Improved Clock Model for Translating Timestamps) can be used to measure logging event times relative to a suitable reference clock description.
References
- A. Darwiche, Modeling and Reasoning with Bayesian Networks, Cambridge University Press, 2009.
- F. Taroni, S. Bozza, A. Biedermann, P. Garbolino and C. Aitken, Data Analysis in Forensic Science: a Bayesian Decision Perspective, Wiley, 2010.
Tools Used
SamIam naive Bayesian classifiers used in this article:
- Wordpress naive Bayesian classifier
- Contact Form 7 naive Bayesian classifier (will be made available at a later date)
- Google Analyticator naive Bayesian classifier
- Google Syntax Highlighter naive Bayesian classifier.