FILTER CHARACTERISTICS

The main activities of the POESIA project in the technical field are:

· software specification
· software development
· software assessment


The specification includes a detailed requirements study, the construction of test bases, a software architecture study, and the identification of existing software to be extended.

Development covers the creation of a library of filtering components, and the extension of existing Internet-related open-source software to use this library. Library components will provide a set of two-layered (crude/elaborate) filtering functions covering multiple filtering modes (e.g. images, natural language text, URLs, etc.). Adaptive decision-making mechanisms will combine the output of these components to deliver a final filtering decision. POESIA uses caching (extending the open-source Squid cache) both for Internet content and for filtering scores, so that filtering costs are shared across requests and more expensive filtering techniques become affordable.
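As a rough illustration of how cached scores and a combined decision could fit together, here is a minimal Python sketch. The names (`score_cache`, `cached_decision`), the weighted average and the 0.5 threshold are assumptions made for this example only; the text above does not specify how POESIA combines scores, and a plain dictionary stands in for the Squid-based cache.

```python
from typing import Callable, Dict

# Hypothetical in-memory stand-in for the cache of filtering scores.
score_cache: Dict[str, float] = {}


def cached_decision(url: str,
                    compute_scores: Callable[[str], Dict[str, float]],
                    weights: Dict[str, float],
                    threshold: float = 0.5) -> bool:
    """Return True if the page at `url` should be blocked.

    `compute_scores` runs the (expensive) per-mode filters and returns
    scores in [0, 1], e.g. {"image": 0.9, "text": 0.5}. It is only
    called on a cache miss, so its cost is shared by later requests.
    """
    if url not in score_cache:
        scores = compute_scores(url)
        total_weight = sum(weights[mode] for mode in scores)
        # Assumed combination rule: weighted average of the mode scores.
        score_cache[url] = sum(weights[mode] * s
                               for mode, s in scores.items()) / total_weight
    return score_cache[url] >= threshold
```

A second request for the same URL is answered from `score_cache` without running any filter again, which is the cost-sharing effect described above.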

Communication mechanisms will be developed so that several POESIA systems in the same area can communicate to share their cached contents and scores.

Filtering will cover a range of modes, including image filtering, natural language text filtering, URL, PICS and JavaScript filtering. URL pointers can be static (simple links) or dynamic (generated by JavaScript code), so analysis of JavaScript code is required to identify the URLs that the code gives access to. For image-based filtering, a skin detector will be implemented to identify pornographic images, exploiting a range of learning and image processing methods. For filtering of natural language text (for English, Italian and Spanish), POESIA will use natural language processing techniques to provide more effective filtering than simple word-based methods. This entails corpus collection and analysis, and the adaptation of tools for shallow linguistic analysis.



IMAGE FILTERING

Many new web sites and pages emerge every day, and the content behind existing URLs is also subject to change. It is therefore practically infeasible to block web sites and pages by means of static black/white lists that are maintained manually. As one form of content-based filtering, image filtering is a much more intelligent and dynamic way to filter the Internet. Images convey rich information. Indeed, images are the main medium used to propagate harmful content such as pornography and violence.

Pornographic Image filtering

Using the proportion of skin pixels to filter images is a first step. Most pornographic images can be filtered out this way. However, non-pornographic images such as portraits or group portraits are often filtered out as well. To make progress in analysing the content of images, other information has to be extracted from the image. Shape information from the "skinness" image derived from the original image is useful for detecting human anatomy. Our approach provides:

· image database corpus (skin images, non skin images, pornographic images),
· skin feature extraction and machine learning component,
· skin detection component,
· crude detection of pornographic images.
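The crude first step described above, the proportion of skin pixels, can be sketched as follows. This is only an illustration: the fixed RGB threshold rule used in `is_skin` is a rule of thumb commonly cited in the skin-detection literature and is an assumption here, whereas the text describes a detector learned from the image corpus. The function names and the 0.3 threshold are likewise hypothetical.

```python
from typing import Iterable, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each component in 0..255


def is_skin(pixel: Pixel) -> bool:
    """Assumed fixed-threshold RGB rule; a learned model would replace it."""
    r, g, b = pixel
    return (r > 95 and g > 40 and b > 20
            and max(r, g, b) - min(r, g, b) > 15
            and abs(r - g) > 15 and r > g and r > b)


def skin_ratio(pixels: Iterable[Pixel]) -> float:
    """Fraction of pixels classified as skin (0.0 for an empty image)."""
    pixels = list(pixels)
    if not pixels:
        return 0.0
    return sum(is_skin(p) for p in pixels) / len(pixels)


def crude_flag(pixels: Iterable[Pixel], threshold: float = 0.3) -> bool:
    """Flag an image whose skin proportion reaches the (assumed) threshold."""
    return skin_ratio(pixels) >= threshold
```

A close-up portrait would also exceed such a threshold, which is exactly the false-positive problem noted above and the reason shape information is needed in addition.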

Symbol Image filtering

The problem is to compare web images with a user-defined set of image symbols (for instance, the Nazi swastika). An Image Symbol Filtering library will be developed. It is based on a similarity matching algorithm between image symbols and website images.

Modules to be developed (as open-source free software) in POESIA are:

1. Image Invariant Descriptors Module

This module (to be developed in POESIA) will receive an image (e.g. an example of an image to reject) as input and will produce a tuple of numbers, which represent image descriptors.

2. Matching Algorithm Module

This module will receive two images as input and will produce a distance between the two images.
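To make the two module interfaces concrete, here is a small Python sketch: one function maps an image to a tuple of numbers, the other maps two images to a distance. The descriptor chosen here, a normalised grey-level histogram, is an assumption picked because it is simple and unchanged by flips and rotations; the text does not say which invariant descriptors POESIA uses. The names `descriptor` and `distance` are hypothetical.

```python
import math
from typing import List, Tuple

Image = List[List[int]]  # grayscale image as rows of values in 0..255


def descriptor(image: Image, bins: int = 8) -> Tuple[float, ...]:
    """Normalised grey-level histogram: invariant to flips and rotations."""
    counts = [0] * bins
    total = 0
    for row in image:
        for value in row:
            counts[min(value * bins // 256, bins - 1)] += 1
            total += 1
    if total == 0:
        return tuple(0.0 for _ in counts)
    return tuple(c / total for c in counts)


def distance(a: Image, b: Image, bins: int = 8) -> float:
    """Euclidean distance between the descriptors of two images."""
    da, db = descriptor(a, bins), descriptor(b, bins)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(da, db)))
```

A histogram discards all shape information, so a real symbol matcher would need richer descriptors; the sketch only shows the input/output contract of the two modules.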


TEXT FILTERING

The POESIA project will develop an intelligent and effective approach to filtering Web pages on the basis of their textual content, which extends beyond existing simple word-based approaches through the use of various natural language processing (NLP) techniques. Approaches based on single words work well for text filtering when the texts to be divided (accepted/rejected) are very different in character. In a real-world application such as the Web, however, the documents to be analysed will fall into a continuum, and whilst identifying those at either end of the spectrum should be straightforward using relatively simple methods, correct classification of documents that lie close to either side of the rejection border will inevitably be more problematic. Effective and accurate classification of documents that lie in this "grey zone" is crucial if the adoption of a filtering regime is not to undermine the real benefits that proper use of the Web provides. For example, although we aim to enable the filtering of inappropriate sexual content, this does not mean that all material having sexual content should be filtered. There is a clear division of acceptability between the content of a pornographic Web site and a sex education Web site, even though there may be some significant overlap in vocabulary.

We anticipate that the correct classification of "grey zone" documents (i.e. those which fall close to either side of the accept/reject border) can be significantly enhanced by the use of NLP techniques, which can be exploited to recognise linguistically significant multi-word expressions within documents, and to recognise linguistically significant relations between expressions. Such linguistically significant attributes, once identified, will provide strong and reliable evidence to the classification process in its determination of a document's status.

Although NLP methods should enhance the effectiveness of the text filtering component, their use typically involves a significant computational expense above that required for simple filtering techniques. Consequently, it is intended that the text filtering component should be realised via a two-stage architecture, which will allow the expenditure of computational resources to be concentrated as is needed to achieve accurate filtering. In particular, we envisage two text filtering agents, of increasing complexity, which are as follows:
    1. A simple ('lite') filtering agent which makes only light use of NLP techniques, and can rapidly process large text volumes. This component should provide a clear accept/reject decision on a large proportion of documents, and mark the remainder as requiring the attention of the second agent.

    2. A sophisticated ('heavy') filtering agent which makes heavier use of NLP resources and techniques to filter only those documents that are left uncategorised by the first agent.
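The two-stage arrangement above can be sketched in Python as follows. Everything specific in this sketch is an assumption made for illustration: the 'lite' agent is reduced to the rate of listed terms, the two thresholds are arbitrary, and the 'heavy' agent is passed in as an opaque function.

```python
from typing import Callable, Iterable, Optional


def lite_agent(text: str, blocked_terms: Iterable[str],
               accept_below: float = 0.01,
               reject_above: float = 0.10) -> Optional[bool]:
    """Return True (reject), False (accept) or None (left undecided)."""
    words = text.lower().split()
    if not words:
        return False
    blocked = set(blocked_terms)
    rate = sum(w in blocked for w in words) / len(words)
    if rate >= reject_above:
        return True
    if rate <= accept_below:
        return False
    return None  # "grey zone": defer to the heavy agent


def filter_text(text: str, blocked_terms: Iterable[str],
                heavy_agent: Callable[[str], bool]) -> bool:
    """Run the cheap agent first; call the costly one only when needed."""
    verdict = lite_agent(text, blocked_terms)
    return heavy_agent(text) if verdict is None else verdict
```

The gap between the two thresholds controls how many documents reach the heavy agent, and therefore how computational effort is spent.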

A prerequisite to the application/adaptation of NLP methods to the text filtering task is the creation of appropriate domain corpora for each of the target languages, i.e. collections of relevant texts, for use in both manual and statistical analysis, providing a basis for all further tasks. Given such corpora, there is a range of NLP methods that could be exploited to enhance the effectiveness of filtering, although the text filtering components for each target language may employ only a subset of the full range of possible methods (given the inevitable time/resource limits of the project). Relevant NLP methods include the following:

    1. automatic extraction from the corpora of significant "terminology" (single words, cue phrases, fixed multi-word expressions, frozen text patterns, etc);

    2. construction of domain relevant thesauri/semantic lexicons;

    3. shallow linguistic analysis techniques, facilitating identification of variable multi-word expressions and text patterns, including:

      • tokenisation,
      • morphological analysis and lemmatisation,
      • named entity recognition,
      • "chunking", i.e. segmenting a text into non-recursive phrasal nuclei (e.g. 'base' noun phrases),
      • identification of other (non-phrasal) collocations,
      • functional analysis, i.e. annotation of grammatical relations (such as subject, object etc.).

The results of such shallow linguistic analysis will provide evidence that is used in the text filtering decision process, which will be based on machine learning methods. In the machine learning paradigm, a general inductive process automatically builds a classifier by "learning" the characteristics of the categories of interest from a set of previously classified items. There is a wide variety of techniques available for the purpose, ranging from decision trees and neural networks to example-based classifiers and classifier committees. In most previous work applying such methods to the task of text classification/filtering, documents are treated as if they were simply an unstructured "bag" of words. Expanding the evidence available to the decision method to include the results of domain-adapted shallow linguistic processing should facilitate effective categorisation by providing strong and reliable evidence of document content and character.
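The difference between a plain "bag" of words and evidence enriched with multi-word expressions can be shown with a short Python sketch. The tokeniser, the `MWE:` feature prefix and the fixed list of expressions are assumptions for this example; in the approach described above, expressions would come from corpus analysis and shallow linguistic processing rather than from a hand-written list.

```python
import re
from collections import Counter
from typing import Iterable, List


def tokenise(text: str) -> List[str]:
    """Very rough tokeniser: lower-cased alphabetic words only."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def features(text: str, expressions: Iterable[str]) -> Counter:
    """Bag-of-words counts plus counts of known multi-word expressions.

    `expressions` must be lower-case with single spaces between words.
    """
    tokens = tokenise(text)
    feats = Counter(tokens)  # the unstructured "bag" of words
    joined = " ".join(tokens)
    for expr in expressions:
        n = len(re.findall(r"\b" + re.escape(expr) + r"\b", joined))
        if n:
            feats["MWE:" + expr] = n  # extra evidence for the classifier
    return feats
```

With the expression list `["sex education"]`, a page about sex education gains a feature that a pure bag of words cannot express, which is the kind of evidence meant to separate "grey zone" documents.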

A final important issue is that the text filtering components produced should not be fixed and static, but rather should be able to adapt to the changing nature of the language used and to counter Web developers' evolving tactics (which are increasingly aimed at getting past filtering software). Such adaptability is also relevant to the reapplication of the approach to other domains. Consequently, the text filtering software delivered by the project will be provided with appropriate learning functionality.



JAVASCRIPT FILTERING

JavaScript analysis is useful for inferring the dynamic links available in a page, i.e. the HTML links computed by JavaScript (in the user's browser).

Given that POESIA filters run in a box between the client browser (in the classroom) and the origin server (far away on the Internet), they must filter content before any actual user interaction. This precludes running a full JavaScript interpreter in the POESIA JavaScript filter, since most pages contain scripts merely to make the pages more dynamic. Of course, in some particular situations the JavaScript analyzer has to process code (when it does not depend upon external factors such as the browser name or user clicks) and do the equivalent of what a usual JavaScript interpreter does.

Even abstract interpretation techniques are inadequate for JavaScript analysis in POESIA. Recall that abstract interpretation is a complex technique which:

* abstracts the concrete values of the interpreted programs into an abstract lattice (for a simple tutorial example, an abstraction of integer values in a simple language with only integer scalar variables could be the lattice of intervals or the lattice of finite unions of such intervals).

* symbolically executes the interpreted program by computing in this abstract lattice (this requires that each primitive of the language is "abstracted" in the abstract interpreter by a computable function which approximates the primitive; in the previous example, arithmetic on intervals.)

* speeds up the interpretation of loops (at the expense of precision) by using sophisticated narrowing and widening techniques.

* may give insignificant results (by returning the [top] value, which abstracts every concrete value and so contains no useful information).

* is quite complex and costly to implement (because of interprocedural analysis, the cost of elementary lattice operations, etc.).
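The tutorial example mentioned in the list above, integer variables abstracted by intervals, can be sketched in a few lines of Python. This is a textbook-style illustration and not part of POESIA; the class `Interval`, the simple widening operator and the analysed loop are all assumptions chosen for the example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def add(self, other: "Interval") -> "Interval":
        """Abstract counterpart of integer addition."""
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def join(self, other: "Interval") -> "Interval":
        """Least upper bound of two intervals in the lattice."""
        return Interval(min(self.lo, other.lo), max(self.hi, other.hi))

    def widen(self, newer: "Interval") -> "Interval":
        """Widening: any bound that is still moving jumps to infinity."""
        lo = self.lo if newer.lo >= self.lo else -math.inf
        hi = self.hi if newer.hi <= self.hi else math.inf
        return Interval(lo, hi)


TOP = Interval(-math.inf, math.inf)  # abstracts every concrete value


def analyse_counter_loop() -> Interval:
    """Abstractly run `i = 0; while (unknown) i = i + 1` to a fixpoint."""
    state = Interval(0, 0)
    while True:
        after_body = state.add(Interval(1, 1))
        new_state = state.widen(state.join(after_body))
        if new_state == state:
            return state  # invariant for `i` at the loop head
        state = new_state
```

Here the analysis terminates after two iterations with the interval from 0 to infinity: widening guarantees termination, but the upper bound is lost, which illustrates the precision trade-off named in the list.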

In abstract interpretation terminology, the envisioned lattice for POESIA (taking into account the goal of finding dynamic links) was a sophisticated abstraction of strings, either as string prefixes or as string regular expressions. This lattice is already difficult to implement for all JavaScript operators, and it would also need to be extended to objects (such as the document) and numbers. Several (actually used!) features of the JavaScript language render such abstract-interpretation-based static analysis ineffective, in particular:

* The functional character of JavaScript (in particular, the ability to return a functional value, implemented as a closure, through the function keyword) is difficult to analyze through abstract interpretation.

* The reflective character of JavaScript, which includes the eval primitive and its metaprogramming ability, is not practically tractable by abstract interpretation techniques.

* The object-prototype character of JavaScript is unusual for the abstract interpretation community.

Each of the above items would require a significant amount of basic research to be used effectively (without any guarantee of usable results!), which is outside the scope of this IAP2117 project. So a more pragmatic approach to JavaScript analysis is required.

The few observations above suggest that JavaScript analysis should be driven by the actual examples found on the Web. The first strategy, based upon frequent observations, is to focus on the http: substring. In practice, JavaScript analysis should start as a sophisticated rule-based pattern matcher. Continuous feedback from actual experimentation with current Web content is much more important than was initially assumed.
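A first rule of such a pattern matcher might look like the Python sketch below. It is an assumption-laden illustration rather than the POESIA matcher: it only recognises quoted string literals joined by `+`, ignores escape sequences, and the names `STRING_LITERAL`, `CONCATENATION` and `extract_links` are hypothetical.

```python
import re
from typing import List

# A quoted JavaScript string literal (no handling of escaped quotes).
STRING_LITERAL = r"""(?:"[^"\n]*"|'[^'\n]*')"""
# One or more literals joined by `+`, e.g. "http://" + "example.org".
CONCATENATION = re.compile(
    STRING_LITERAL + r"(?:\s*\+\s*" + STRING_LITERAL + r")*")


def extract_links(script: str) -> List[str]:
    """Return the strings containing `http:` that the script builds."""
    links = []
    for match in CONCATENATION.finditer(script):
        parts = re.findall(STRING_LITERAL, match.group(0))
        text = "".join(p[1:-1] for p in parts)  # strip quotes, concatenate
        if "http:" in text:
            links.append(text[text.index("http:"):])
    return links
```

When a variable interrupts the concatenation, as in `"http://" + host + "/a"`, this rule only recovers the prefix `http://`. Further rules would be added as real pages reveal new patterns, in line with the feedback-driven approach described above.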
