FILTER CHARACTERISTICS
The main activities of the POESIA project in the
technical field are:
· software specification
· software development
· software assessment
The specification includes a detailed requirements study, the construction
of test bases, a software architecture study, and the identification
of existing software to be extended.
Development covers the creation of a library of filtering components,
and the extension of existing Internet-related open-source software
to use this library. Library components will provide a set of two-layered
(crude/elaborate) filtering functions covering multiple filtering modes
(images, natural language text, URLs, etc.). Adaptive decision-making
mechanisms will combine the output of these components to deliver a
final filtering decision. POESIA uses caching (extending the open-source
Squid cache) both for Internet content and for filtering scores. This
allows filtering costs to be shared across requests, and hence permits
the use of more expensive filtering techniques.
Communication mechanisms will be developed so that several POESIA systems
in the same area can communicate to share their cached contents and
scores.
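As an illustration only, the caching of filtering scores and the
combination of component outputs might be sketched as follows in Python.
The ScoreCache class, the combine function, the weighted mean and the 0.5
threshold are all hypothetical; POESIA itself extends Squid, whose
interfaces are not shown here.

```python
import hashlib
from typing import Callable, Dict


class ScoreCache:
    """Hypothetical in-memory stand-in for a cache of filtering scores."""

    def __init__(self) -> None:
        self._scores: Dict[str, float] = {}
        self.misses = 0  # number of times the expensive filter actually ran

    def score(self, content: bytes,
              expensive_filter: Callable[[bytes], float]) -> float:
        # Key the score on a digest of the content, so identical content
        # is only ever filtered once.
        key = hashlib.sha256(content).hexdigest()
        if key not in self._scores:
            self.misses += 1
            self._scores[key] = expensive_filter(content)
        return self._scores[key]


def combine(scores: Dict[str, float], weights: Dict[str, float],
            threshold: float = 0.5) -> bool:
    """Return True (reject) when the weighted mean score reaches the threshold.

    The weighted mean is an assumption made for this sketch; the source
    only says that component outputs are combined adaptively.
    """
    total = sum(weights[mode] for mode in scores)
    mean = sum(scores[mode] * weights[mode] for mode in scores) / total
    return mean >= threshold
```

Because the score is stored with the content digest, the cost of an
elaborate filter is paid once per distinct content rather than once per
request.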
Filtering will cover a range of modes, including image filtering, natural
language text filtering, URL, PICS and JavaScript filtering. URL pointers
can be static (simple links) or dynamic (generated by JavaScript code),
so analysis of JavaScript code is required to identify the URLs that
the code gives access to. For image-based filtering, a skin detector will
be implemented to identify pornographic images, exploiting a range of
learning and image processing methods. For filtering of natural language
text (for English, Italian, Spanish), POESIA will use natural language
processing techniques to provide more effective filtering than simple
word-based methods. This entails corpus collection and analysis, and
the adaptation of tools for shallow linguistic analysis.
IMAGE FILTERING
Many new web sites and pages emerge every day, and existing URLs
are also subject to change. It is therefore practically infeasible to
block web sites and pages by maintaining static black/white lists
manually. As one form of content-based filtering, image
filtering is a much more intelligent and dynamic way to filter the Internet.
Images convey rich information. Indeed, images are the
main media used to propagate harmful information such as pornography and
violence.
Pornographic Image filtering
Using the proportion of skin pixels to filter images is a first step,
and it filters out most pornographic images. However, non-pornographic
images such as portraits or group portraits are often filtered out as
well. To analyse the content of images further, other information
has to be extracted from the image. The shape information in the "skinness"
image derived from the original image is useful for detecting human
anatomy. Our approach provides:
· image database corpus (skin images, non-skin images, pornographic images),
· skin feature extraction and machine learning component,
· skin detection component,
· crude detection of pornographic images.
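The crude first step (proportion of skin pixels) can be illustrated by
the following Python sketch. It is not the POESIA skin detector, which is
to be learned from the image corpus: the fixed RGB rule used in is_skin is
a commonly cited heuristic taken here as a stand-in, and the 0.4 threshold
is an arbitrary assumption.

```python
from typing import Iterable, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each in 0..255


def is_skin(pixel: Pixel) -> bool:
    """Fixed RGB skin rule (an assumed stand-in for a learned detector)."""
    r, g, b = pixel
    return (r > 95 and g > 40 and b > 20
            and max(r, g, b) - min(r, g, b) > 15
            and abs(r - g) > 15 and r > g and r > b)


def skin_ratio(pixels: Iterable[Pixel]) -> float:
    """Proportion of pixels classified as skin (0.0 for an empty image)."""
    pixels = list(pixels)
    if not pixels:
        return 0.0
    return sum(is_skin(p) for p in pixels) / len(pixels)


def crude_reject(pixels: Iterable[Pixel], threshold: float = 0.4) -> bool:
    """Crude decision: reject when the skin proportion reaches the threshold."""
    return skin_ratio(pixels) >= threshold
```

A portrait can easily exceed such a threshold, which is exactly why the
shape of the "skinness" image is needed as further evidence.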
Symbol Image filtering
The problem is to compare web images with a user-defined set of image
symbols (for instance, the Nazi swastika). An Image Symbol Filtering
library will be developed. It is based on a similarity matching algorithm
between image symbols and website images.
Modules to be developed (as open-source free software) in POESIA are:
1. Image Invariant Descriptors Module
This module (to be developed in POESIA) receives an image (e.g. an example
of an image to reject) as input and produces a tuple of numbers, which
represent image descriptors.
2. Matching Algorithm Module
This module will receive two images as input and will produce a distance
between the two images.
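The interface of the two modules can be sketched in Python as below. The
choice of descriptor (a normalised grey-level histogram, which does not
change when the pixels are rotated, translated or mirrored) and of
distance (Euclidean) are assumptions made for illustration; the source
does not specify which invariant descriptors or matching algorithm are
used.

```python
import math
from typing import Sequence, Tuple


def descriptor(gray_pixels: Sequence[int], bins: int = 4) -> Tuple[float, ...]:
    """Image -> tuple of numbers: a normalised grey-level histogram.

    gray_pixels holds grey levels in 0..255, in any pixel order.
    """
    if not gray_pixels:
        raise ValueError("empty image")
    counts = [0] * bins
    for value in gray_pixels:
        counts[min(value * bins // 256, bins - 1)] += 1
    return tuple(c / len(gray_pixels) for c in counts)


def distance(image_a: Sequence[int], image_b: Sequence[int]) -> float:
    """Two images -> distance between their descriptors (Euclidean)."""
    return math.dist(descriptor(image_a), descriptor(image_b))
```

A web image would then be matched against each user-defined symbol and
flagged when the distance falls below some chosen bound.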
TEXT FILTERING
The POESIA project will develop an intelligent and effective approach
to filtering Web pages on the basis of their textual content, which extends
beyond existing simple word-based approaches through the use of various
natural language processing (NLP) techniques. Single word-based approaches
to text filtering work well when the texts to be divided (accepted/rejected)
are very different in character. In a real world application such as the
Web, however, the documents to be analysed will fall into a continuum,
and whilst identifying those at either end of the spectrum should be straightforward
using relatively simple methods, correct classification of documents that
lie close to either side of the rejection border will inevitably be more
problematic. Effective and accurate classification of documents that lie
in this "grey zone" is crucial if the adoption of a filtering regime is
not to undermine the real benefits that proper use of the Web provides.
For example, although we aim to enable the filtering of inappropriate
sexual content, this does not mean that all material having sexual content
should be filtered. There is a clear division of acceptability between
the content of a pornographic Web site and a sex education Web site, even
though there may be some significant overlap in vocabulary.
We anticipate that the correct classification of "grey zone" documents
(i.e. those which fall close to either side of the accept/reject border) can
be significantly enhanced by the use of NLP techniques, which can be exploited
to recognise linguistically significant multi-word expressions within
documents, and to recognise linguistically significant relations between
expressions. Such linguistically significant attributes, once identified,
will provide strong and reliable evidence to the classification process
in its determination of a document's status.
Although NLP methods should enhance the effectiveness of the text filtering
component, their use typically involves a significant computational expense
above that required for simple filtering techniques. Consequently, it
is intended that the text filtering component should be realised via a
two-stage architecture, which will allow the expenditure of computational
resources to be concentrated as is needed to achieve accurate filtering.
In particular, we envisage two text filtering agents, of increasing complexity,
which are as follows:
- A simple ('lite') filtering agent which makes only
light use of NLP techniques, and can rapidly process large text
volumes. This component should provide a clear accept/reject decision
on a large proportion of documents, and mark the remainder as
requiring the attention of the second agent.
- A sophisticated ('heavy') filtering agent which makes
heavier use of NLP resources and techniques to filter only those
documents that are left uncategorised by the first agent.
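The control flow of this two-stage architecture can be sketched as
follows in Python. The term-density score and the two thresholds (low,
high) are hypothetical placeholders for whatever evidence the 'lite' agent
actually uses; only the accept/reject/defer structure is taken from the
description above.

```python
from typing import Callable, Optional, Set


def lite_agent(text: str, reject_terms: Set[str],
               low: float = 0.02, high: float = 0.10) -> Optional[bool]:
    """Return True (reject), False (accept) or None (grey zone, defer)."""
    words = text.lower().split()
    if not words:
        return False
    density = sum(w in reject_terms for w in words) / len(words)
    if density >= high:
        return True   # clearly unacceptable
    if density <= low:
        return False  # clearly acceptable
    return None       # close to the border: leave to the heavy agent


def filter_text(text: str, reject_terms: Set[str],
                heavy_agent: Callable[[str], bool]) -> bool:
    """Run the cheap agent first; call the costly one only when needed."""
    decision = lite_agent(text, reject_terms)
    return heavy_agent(text) if decision is None else decision
```

The costly agent is thus invoked only for documents that the cheap agent
leaves uncategorised.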
A prerequisite to the application/adaptation
of NLP methods to the text filtering task is the creation of appropriate
domain corpora for each of the target languages, i.e. collections of relevant
texts, for use in both manual and statistical analysis, providing a basis
for all further tasks. Given such corpora, there is a range of NLP methods
that could be exploited to enhance the effectiveness of filtering, although
the text filtering components for each target language may employ only
a subset of the full range of possible methods (given the inevitable time/resource
limits of the project). Relevant NLP methods include the following:
- automatic extraction from the corpora of significant "terminology" (single
words, cue phrases, fixed multi-word expressions, frozen text patterns,
etc);
- construction of domain relevant thesauri/semantic lexicons;
- shallow linguistic analysis techniques, facilitating identification of variable
multi-word expressions and text patterns, including:
  - tokenisation,
  - morphological analysis and lemmatisation,
  - named entity recognition,
  - "chunking", i.e. segmenting a text into non recursive phrasal nuclei (e.g.
  'base' noun phrases),
  - identification of other (non-phrasal) collocations,
  - functional analysis, i.e. annotation of grammatical relations
  (such as subject, object etc.).
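Two of the simplest steps listed above, tokenisation and the recognition
of fixed multi-word expressions, can be sketched in Python as follows.
The regular expression and the lexicon of expressions are hypothetical;
real tools for shallow analysis (lemmatisation, chunking, functional
analysis) are considerably more involved.

```python
import re
from typing import List, Set, Tuple


def tokenise(text: str) -> List[str]:
    """Split text into lower-case word tokens (a deliberately crude rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def find_expressions(tokens: List[str],
                     lexicon: Set[Tuple[str, ...]]) -> List[Tuple[str, ...]]:
    """Return every multi-word expression of the lexicon found in tokens."""
    found = []
    max_len = max((len(e) for e in lexicon), default=0)
    for start in range(len(tokens)):
        # Try the longest candidates first; single words are not considered.
        for size in range(max_len, 1, -1):
            candidate = tuple(tokens[start:start + size])
            if len(candidate) == size and candidate in lexicon:
                found.append(candidate)
    return found
```

Recognising an expression such as "sex education" as a unit gives the
classifier evidence that the two words taken separately cannot provide.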
The results of such shallow
linguistic analysis will provide evidence that is used in the text filtering
decision process, which will be based on machine learning methods. In
the machine learning paradigm, a general inductive process automatically
builds a classifier by "learning" the characteristics of the categories
of interest from a set of previously classified items. There is a wide
variety of techniques available for the purpose, ranging from decision
trees and neural networks to example-based classifiers and classifier
committees. In most previous work applying such methods to the task
of text classification/filtering, documents are treated as if they were
simply an unstructured "bag" of words. Expanding the evidence available
to the decision method to include the results of domain-adapted shallow
linguistic processing should facilitate effective categorisation by
providing strong and reliable evidence of document content and character.
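As a minimal illustration of the machine learning paradigm, the Python
sketch below implements an example-based (nearest neighbour) classifier
whose evidence goes slightly beyond a bag of words by adding adjacent
word pairs. The feature set, the cosine similarity and the training
examples are assumptions chosen for brevity, not the method adopted by
the project.

```python
import math
from collections import Counter
from typing import List, Tuple


def features(text: str) -> Counter:
    """Bag of words enriched with adjacent word pairs."""
    words = text.lower().split()
    pairs = [" ".join(p) for p in zip(words, words[1:])]
    return Counter(words + pairs)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two feature counts (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def classify(text: str, examples: List[Tuple[str, str]]) -> str:
    """Return the label of the previously classified item most similar to text."""
    probe = features(text)
    best = max(examples, key=lambda ex: cosine(probe, features(ex[0])))
    return best[1]
```

The classifier is "learned" only in the sense that its behaviour is
determined entirely by the set of previously classified items.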
A final important issue is that the text filtering components produced
should not be fixed and static, but rather should be able to adapt to
the changing nature of the language used and to counter Web developers'
evolving tactics (which are increasingly aimed at evading filtering
software). Such adaptability is also relevant to the reapplication of
the approach to other domains. Consequently, the text filtering software
delivered by the project will be provided with appropriate learning
functionality.
JAVASCRIPT FILTERING
JavaScript analysis is useful to guess the dynamic
links available in a page, i.e. the HTML links computed through JavaScript
(in the user's browser).
Given that POESIA filters run in a box between the client browser (in
the classroom) and the origin server (far away on the Internet), they
must filter content before any actual user interaction. This precludes
interpreting JavaScript at run time in the POESIA JavaScript filter,
since most pages contain scripts merely to make them more dynamic. Of
course, in some particular situations the JavaScript analyzer has
to process code (when it does not depend upon external factors like
browser name or user clicks) as a usual JavaScript
interpreter would.
Even abstract interpretation techniques are inadequate for JavaScript
analysis in POESIA. Recall that abstract interpretation is a complex
technique which:
* abstracts the concrete values of the interpreted programs into an
abstract lattice (for a simple tutorial example, an abstraction of integer
values in a simple language with only integer scalar variables could
be the lattice of intervals or the lattice of finite unions of such
intervals).
* symbolically executes the interpreted program by computing in this
abstract lattice (this requires that each primitive of the language
is "abstracted" in the abstract interpreter by a computable
function which approximates the primitive; in the previous example,
arithmetic on intervals).
* shortens the interpretation of loops (at the expense of precision) by using
sophisticated narrowing and widening techniques.
* may give insignificant results (by returning the [top] value, which
abstracts any concrete value and so contains no useful information).
* is quite complex and costly to implement (because of interprocedural
analysis, cost of elementary lattice operations, etc.).
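The tutorial example mentioned above (integers abstracted as intervals)
can be made concrete with a small Python sketch. The Interval class and
the analysed loop are illustrative assumptions; only addition is
abstracted, and the loop condition is ignored.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Abstract value: every integer between lo and hi (bounds may be infinite)."""
    lo: float
    hi: float

    def join(self, other: "Interval") -> "Interval":
        # Least upper bound in the lattice of intervals.
        return Interval(min(self.lo, other.lo), max(self.hi, other.hi))

    def add(self, other: "Interval") -> "Interval":
        # Abstract counterpart of the concrete '+' primitive.
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def widen(self, newer: "Interval") -> "Interval":
        # Any bound that is still moving is pushed to infinity, which
        # forces termination at the expense of precision.
        lo = self.lo if newer.lo >= self.lo else -math.inf
        hi = self.hi if newer.hi <= self.hi else math.inf
        return Interval(lo, hi)


TOP = Interval(-math.inf, math.inf)  # abstracts any value: no information


def analyse_counter_loop() -> Interval:
    """Abstractly interpret 'i = 0; while (...) i = i + 1;'."""
    state = Interval(0, 0)
    while True:
        after_body = state.add(Interval(1, 1))
        new_state = state.widen(state.join(after_body))
        if new_state == state:  # fixed point reached
            return state
        state = new_state
```

Here the analysis stabilises after two iterations on the interval from 0
to infinity: correct, but it has lost the upper bound that a concrete
execution would respect.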
In abstract interpretation terminology, the envisioned lattice for POESIA
(taking into account the goal of finding dynamic links) was a sophisticated
abstraction of strings, either as string prefixes or as string regular
expressions. This lattice is already difficult to implement for all JavaScript
operators, and would also need to be extended for objects (such as the document)
and numbers. Several (actually used!) features of the JavaScript language
render such abstract interpretation based static analysis ineffective,
in particular:
* The functional character of JavaScript (in particular, the ability to
return a functional value, implemented as a closure, through the function
keyword) is difficult to analyze through abstract interpretation.
* The reflective character of JavaScript, which includes the eval primitive
and its metaprogramming ability, is not practically tractable by abstract
interpretation techniques.
* The object-prototype character of JavaScript is unusual for the abstract
interpretation community.
To be handled effectively, each of the above items would need a significant
amount of basic research (without any guarantee of usable results!),
which is outside the scope of this IAP2117 project. So a more pragmatic
approach to JavaScript analysis is required.
The few observations above suggest that Javascript analysis should be
driven by the actual examples found on the Web. The first strategy is
to focus on the http: substring, based upon frequent observations. Actually,
Javascript analysis should start as a sophisticated rule-based pattern
matcher. Continuous feedback from actual experimentation with current
Web content is much more important than what was initially supposed.
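A first rule-based matcher of this kind might look like the Python
sketch below. It is a hypothetical illustration with only two rules:
collect string literals, folding literals joined directly by '+', then
report any http: substring found in them. It does not handle escape
sequences, and a variable in the middle of a concatenation stops the
folding.

```python
import re
from typing import List

# A double- or single-quoted literal without escapes or line breaks.
LITERAL = re.compile(r'"([^"\\\n]*)"|\'([^\'\\\n]*)\'')
# Only a '+' (with optional spaces) between two literals.
PLUS = re.compile(r"\s*\+\s*")
# The http: substring and whatever follows it inside the literal.
URL = re.compile(r"http:[^\s\"'<>]+")


def dynamic_links(script: str) -> List[str]:
    """Guess the URLs a script may give access to, without executing it."""
    pieces: List[str] = []
    last_end = None
    for match in LITERAL.finditer(script):
        value = match.group(1) if match.group(1) is not None else match.group(2)
        if last_end is not None and PLUS.fullmatch(script, last_end, match.start()):
            pieces[-1] += value   # rule: "a" + "b" is treated as "ab"
        else:
            pieces.append(value)
        last_end = match.end()
    return [u.group(0) for piece in pieces for u in URL.finditer(piece)]
```

When a variable interrupts the concatenation, only the constant prefix
of the URL is recovered, which is where further rules derived from
actual Web content would have to be added.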