Psiphon Final Report
Department of Computer Science
University Of Toronto
Patrick Smith
Jeffrey Jia
December 13th, 2004
Abstract
Psiphon is a user-friendly stand-alone proxy application designed to securely circumvent Internet
censorship. It is specifically intended for people living in countries where there is Internet
content filtering who have friends or relatives living in countries where there is not. Unlike other
circumvention technologies, Psiphon relies on multiple social networks of trust, rather than mass
publication of IPs or proxies, which can be easily intercepted and filtered by a determined state.
Unlike other proxy applications, Psiphon does not require installation of a web server or any other
programs.
Table of Contents
1. Introduction
1.1 Starting Point
1.2 Acknowledgements
2. Design
2.1 HTTPS Connection
2.2 Handle HTTPS Requests from Browsers
2.3 Getting Requested Resource
2.4 File Parsing
2.5 Graphical User Interface
2.5.1 Design and Layout
2.5.2 Internationalization and Localization
2.5.3 Remote User Interface
2.6 User Management
2.7 User Authentication
3. Implementation
3.1 HTTPS Connection
3.2 Handle HTTPS Requests from Browsers
3.3 Getting Requested Files
3.4 File Parsing
3.5 Graphical User Interface
3.5.1 Host Interface Internationalization and Localization
3.5.2 Remote User Interface
3.6 User Authentication
3.7 User Management
4. Discussion
4.1 Unicode
4.2 HTML Parsers
4.3 Interface Tools
4.4 Bugtrac
4.5 Secure Connections
5. References
1. Introduction
Censorship of the Internet is a serious problem in the world today. Countries such as Iran and
China filter the content of the Internet. Content such as women's rights pages would not be
accessible from an Iranian Internet connection. Setting up a proxy server is a solution that is
often used to get around the censorship, because a proxy allows users to view the Internet through
someone else's connection. If the host of the proxy does not have content filtering, then the
content the remote user is viewing will not be filtered either. However, in order to reach the
target audience within such countries, the addresses of these servers are mass emailed to email
addresses which are known to be from that country. This process of mass emailing a proxy
address means that many people who are involved in blocking Internet content will also receive
the address, causing the server to be quickly blocked as well. Psiphon was designed to help fix
this problem.
Our primary goal was to create a user-friendly proxy server tool. Most people do not have the
knowledge needed to create and run their own proxy server, so we needed to make Psiphon as
easy to use as possible. Any user with minimal computer experience should be able to set up
and run their own proxy server.
Internet censorship mostly occurs in countries where English is not the primary language, so
having Psiphon translated into other languages would help users be more comfortable with using
the program. Though host users will be located in countries without Internet censorship, if they
are setting up the proxy for a friend in another country, they likely speak the same language as
their friend in that country, and may be more comfortable with software that is in that language.
In order to prevent a determined state from intercepting and observing the content being viewed
across the proxy, a secure connection is required, such as an HTTPS connection which encrypts
the data using SSL. Encryption ensures that the state cannot view what information is being sent
to the remote user. Having a secure connection between the host user (who has installed Psiphon
on their computer) and the remote user is desired so that the Psiphon server itself is not blocked.
1.1 Starting Point
At the beginning of the project, there was a small amount of code previously written by the
Citizen Lab. The code was very helpful in illustrating some of the tools that we would be using.
For example, the provided code used M2Crypto to provide an HTTPS connection using SSL.
The code also had a class to get web pages given an address and to parse the returned HTML.
The interface used a deprecated interface kit but provided a good indication of what modules we
should look at for the project.
1.2 Acknowledgements
We would like to thank our supervisor, Dr. Gregory V. Wilson, for his advice and guidance
throughout the project. We thank our client and system administrator, Michelle Levesque, whose
help and involvement made the project go as smoothly as possible, and Nart Villeneuve of Citizen
Lab, whose input helped us understand how existing proxies work and who also provided much needed
interface and translation help. Internationalizing the interface would not have been possible
without the help of Kevin Altis (of PythonCard) and Robin Dunn (of wxPython). We would like to
thank the following for their translation work: Ana Smith, Renée Aviles, Rares Pateanu, Behdad
Esfahbod, and Ekaterina Trapeznikova. We would like to thank all of the 49x students for their
help and feedback.
2. Design
Prior to starting the project, the client decided that they would prefer Psiphon be implemented
using Python. Python is a good choice for a few reasons. First, Python is cross-platform: Python
code can be run under Windows, Unix/Linux, and Mac operating systems. Windows executables
of Python programs can also be created by using the py2exe module, so a Windows user can
install Psiphon without even having the Python language installed. Second, Python provides rich
libraries for Internet programming, both in its own standard library and as third-party libraries.
2.1 HTTPS Connection
Providing an HTTPS connection is the key to securely circumventing Internet censorship. We
planned to use the HTTPS protocol, which is based on SSL (Secure Sockets Layer), to provide
secure connections. SSL works on top of TCP at the transport layer, and can provide security to
any TCP-based application protocol such as HTTP. Based on the existing code and online
documentation we read, we decided that M2Crypto, a crypto and SSL toolkit for Python, would be
the best tool to implement an HTTPS connection.
Despite the rich library of modules and services that M2Crypto offers, only the
M2Crypto.SSL.SSLServer module can provide HTTPS connections. To implement the HTTPS
connection, an SSL context object was also necessary. Fortunately, M2Crypto also provides a
class, M2Crypto.SSL.Context, that can be used to create such an object. Therefore, an HTTPS
connection can be created easily with the third-party library M2Crypto.
2.2 Handle HTTPS Requests from Browsers
In order to provide a proxy service based on the SSL connection, we needed to design a way to
capture HTTPS requests sent from browsers, process them, and return a response. So, we needed
to implement an HTTP server.
The BaseHTTPServer module in Python's library provides classes to create a basic HTTP server.
This module defines two classes for implementing HTTP servers: HTTPServer and
BaseHTTPRequestHandler. We decided to use BaseHTTPRequestHandler as the superclass of the
request handler class because M2Crypto already provides an HTTPS server, so we were only in need
of a way to handle requests. The request handler class will analyze the HTTPS requests from
browsers and send a response back to the browser.
When a request is accepted, Psiphon determines whether the requested resource is located in
Psiphon or on another web server. If the resource is a local one (for example, the home page of
Psiphon, or the page for authentication), Psiphon must send it back directly. Otherwise, Psiphon
must extract the target URL from the request, and use this URL to create a request which asks
for a resource located on the remote server. For example, if the request from the browser is
https://PsiphonAddress, then the home page of Psiphon is sent to the browser directly; however,
if the request from the browser is https://PsiphonAddress/req?href=http://pyre.third-bit.com/psiphon,
then Psiphon must proxy the request http://pyre.third-bit.com/psiphon.
2.3 Getting Requested Resource
Since Psiphon is a proxy, most of the resources requested by the browser won't be provided by
Psiphon directly. Except in very few scenarios (for example, when the remote user requests to log
in to Psiphon), Psiphon must request the resources required by the remote user from other web
servers, and forward these resources to the remote user. To get the requested resource from the
web server, we decided to use the urllib2 module provided by Python's standard library.
When Psiphon proxies a request, there are two cases. If the response from the web server is an
HTML file, Psiphon will parse and modify all hyperlinks in the file (and remove the Javascript if
required), and will then forward the modified file to the browser. If, however, the response is
not an HTML file, Psiphon will simply forward the file unchanged.
2.4 File Parsing
Python's library provides a module, htmllib, which contains an HTMLParser class. This class
is a simple HTML parser, in the sense that it does not perform any action at each HTML tag. So
we will extend htmllib's HTMLParser by creating our own custom parser for Psiphon.
At each anchor tag, we will have to modify the href attribute so that the new link points back
to the Psiphon server, and indicates to Psiphon what the requested address is. For example,
given the link <a href="http://pyre.third-bit.com/psiphon">, if the Psiphon server's address is
https://127.0.0.1:443, the resulting link would be:
<a href="https://127.0.0.1:443/href=http://pyre.third-bit.com/psiphon">.
The same would have to be done to any tag that references a remote file, for example images and
stylesheets, and any other file which does not contain HTML tags. We decided to keep a list of
tags which reference files, as well as a way of knowing which attribute within each of those tags
references the file.
To keep track of which tags and attributes to change, we would keep global variables for each
type of tag and attribute. Then, we would compare the tag being parsed to the variables, and
change the necessary attribute. Any other tags we would just pass along unchanged.
Unfortunately, HTMLParser does not have any way to change the parsed HTML and return it.
However, it does have an overridable onClose function. We would use this function by storing
the resulting changed HTML in a string. At each of the overridden HTMLParser functions, we
would add to this string either the modified HTML, or the HTML code itself if no changes are
needed. No changes are necessary if the given tag does not reference a remote file. We would
then override onClose by simply returning the string. The returned string would then contain
the entire HTML page, but with each reference to a file linked back to Psiphon. The string would
then be passed to the server to be passed along to the remote user.
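The accumulate-and-return idea described above can be sketched as follows. This is only an illustration under stated assumptions: it uses Python 3's html.parser module rather than htmllib, it exposes the accumulated page through a hypothetical result() method instead of onClose, and the FILE_ATTRIBUTES table and the class name are invented for the example. Comments and declarations in the page are ignored by this sketch.

```python
from html.parser import HTMLParser

# Hypothetical table: tags that reference files, and the attribute holding the reference.
FILE_ATTRIBUTES = {"a": "href", "img": "src", "link": "href"}


class RewritingParser(HTMLParser):
    """Accumulates the page as text, pointing file references back at the proxy."""

    def __init__(self, proxy_address):
        # convert_charrefs=False keeps entities such as &amp; intact in the output.
        super().__init__(convert_charrefs=False)
        self.proxy_address = proxy_address
        self.output = []

    def handle_starttag(self, tag, attrs):
        target = FILE_ATTRIBUTES.get(tag)
        parts = [tag]
        for name, value in attrs:
            if value is None:  # attribute without a value, e.g. <input disabled>
                parts.append(name)
                continue
            if name == target:
                value = "%s/href=%s" % (self.proxy_address, value)
            parts.append('%s="%s"' % (name, value))
        self.output.append("<%s>" % " ".join(parts))

    def handle_endtag(self, tag):
        self.output.append("</%s>" % tag)

    def handle_data(self, data):
        self.output.append(data)

    def handle_entityref(self, name):
        self.output.append("&%s;" % name)

    def handle_charref(self, name):
        self.output.append("&#%s;" % name)

    def result(self):
        # The whole page, with each file reference linked back to the proxy.
        return "".join(self.output)
```

A caller would feed the fetched page to the parser, call close(), and send the string returned by result() to the remote user.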
2.5 Graphical User Interface
There are two kinds of user interfaces required for Psiphon. One is for the host user; the other
is for the remote users and must be placed at the top of each web page returned by Psiphon.
2.5.1 Design and Layout
Based on a meeting with the clients, the design in figure 2.5-1 was decided upon.
Figure 2.5-1 The proposed Psiphon interface design.
The design included the menu items 'File', 'Options' and 'Help', which would have items that
would be used to close the program, manage users and language preferences, and the like. To
control whether the server is accepting incoming connections or not, there would be two buttons,
one to start Psiphon (i.e. allow incoming requests), and another to stop it. The server
information, as well as a user name and password, were designed to be displayed as text fields,
and a "Send to a friend" button would allow the user to easily copy and paste all these items at
once. Finally, there would be a list area showing which users are connected and how much
bandwidth they have requested, as well as a 'Total bandwidth' status bar.
Python's standard library provides tools for graphical interfaces; however, there are also the
third-party kits wxPython and PythonCard. wxPython is an extension of the wxWidgets (C++) tool
kit, and allows for wxWidgets programming in Python. PythonCard further extends wxPython in an
attempt at making GUI design and creation easier.
We decided to use a mixture of wxPython and PythonCard to create the interface. With
wxPython alone, the code needed to do what we wanted would be complex. Yet with PythonCard
alone, we would be unable to have the layout that we wanted. Using a mixture, we hoped
to have the best of both modules. We would build a PythonCard resource file with all the
components that we need, and then use wxPython Panes to organize the layout.
Using the PythonCard resource file would allow us to easily change component attributes by
simply addressing the component. For example, self.components.userField.text would allow us to
access the text displayed in the "User Name" text field. This would be helpful both for updating
the interface information and for internationalizing and localizing the interface.
2.5.2 Internationalization and Localization
We decided that we should have the interface available in English, plus one additional language,
specifically, a language which uses a non-ASCII character set. By having the interface available
in a language that requires Unicode, we hoped to make sure that the interface would work with any
language. Since one of the team members speaks Chinese as a first language, we planned to have
the interface in both English and Simplified Chinese.
The language files would be Python files themselves. Each language file would have a
set of variables, one for each piece of text that would be displayed in the interface. A single
Python file would also be needed to keep track of the language names and the file associated with
each language. Keeping track of the languages this way would avoid scanning the directory for
language files. The name/language file pairs would be displayed to the user and the selected
language file would be imported. Since PythonCard would be used, we could simply assign each
component's text to the related value in the imported language file.
2.5.3 Remote User Interface
As with other proxy tools, remote users connected to the server need a simple way to interact
with it. During the client meeting with Citizen Lab, we were shown examples of proxies which
inserted a new search bar at the top of every HTML page they returned. Based on what the client
wanted, we decided to take the same approach. At the top of each HTML page, we planned to add our
own address bar. However, we also planned to have extra options as check boxes, such as
'Remove Scripts' and 'Remove Java'. Depending on which options were checked, Psiphon would
return a page that reflects what the user picked.
2.6 User Management
Since Psiphon relies on multiple social networks of trust, every remote user must be assigned a
unique user name and an associated password. The host user has the ability to create and delete
user accounts. All of the user information must be stored on the host user's end, and as a remote
user connects, their account information must be verified against the stored data. We decided to
store user information within a Python file. Since the connection between the host and remote
user is secure, the user information is safe while it is being verified. However, the accounts
are stored within a Python file which can be viewed and modified with any text editor on the host
user's computer.
2.7 User Authentication
When a remote user logs into Psiphon, the user name and password are sent to Psiphon through
an HTTPS request. When Psiphon receives this request, it extracts the user name and password
from the request, and compares the information with the data assigned by the host user. If the
user name and password match, the remote user can access Psiphon; otherwise the connection is
refused.
3. Implementation
Psiphon's implementation relied on Python's own standard library, third-party Python modules, as
well as our own modules to accomplish the desired goals.
3.1 HTTPS Connection
A multi-threaded approach was used for Psiphon to improve performance. A new thread is
created for each incoming request to the server. M2Crypto's
M2Crypto.SSL.SSLServer.ThreadingSSLServer class helped us to achieve this.
Figure 3.1-1 illustrates the hierarchy of Psiphon as a proxy server.
Figure 3.1-1 hierarchy of class ProxyServer
Inheriting from ThreadingSSLServer, class ProxyServer is very compact. Besides the members
inherited from the superclass, there are only five functions in this class, four of which were
helpful in creating the connection. Function serve_forever calls handle_request, inherited from
ThreadingSSLServer, and function start creates a thread to run serve_forever.
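The relationship between start, serve_forever, and handle_request can be sketched as follows. This is not the report's ProxyServer: as an assumption made so the example runs without M2Crypto, it subclasses Python 3's plain-HTTP ThreadingHTTPServer instead of ThreadingSSLServer, and the class name, the stop method, and the timeout value are hypothetical.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class ProxyServerSketch(ThreadingHTTPServer):
    """Plain-HTTP stand-in for a threaded SSL server; each request gets its own thread."""

    # handle_request() gives up after this many seconds, so the loop can recheck the flag.
    timeout = 0.2

    def __init__(self, address, handler_class):
        super().__init__(address, handler_class)
        self._stop_requested = threading.Event()

    def serve_forever(self):
        # Handle one request at a time until stop() is called.
        # (Use stop() rather than the inherited shutdown() with this override.)
        while not self._stop_requested.is_set():
            self.handle_request()

    def start(self):
        # Run the accept loop in a background thread so the caller is not blocked.
        thread = threading.Thread(target=self.serve_forever, daemon=True)
        thread.start()
        return thread

    def stop(self):
        self._stop_requested.set()
```

A caller would construct the server with an address and a request handler class, call start(), and later call stop() followed by server_close().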
3.2 Handle HTTPS Requests from Browsers
We implemented two types of HTTP commands: GET and POST. If the request is a POST
command, the data in the command is extracted. To handle either a GET or a POST command,
the method handle_request is called. This method returns a response including all the
elements of an HTTP(S) response, and is therefore the engine of Psiphon.
Since some requests do not need to be forwarded to other web servers, while most requests have to
be proxied, requests from the browser are classified as one of the following:
Case 1: The request is for a connection to Psiphon. If the request has the form:
https://Psiphon-Address
then Psiphon just returns its homepage, which is a login screen. The login page is shown in
figure 3.2-1.
Figure 3.2-1 Psiphon home page
Case 2: The request asks for authentication (i.e. to log in to Psiphon). If the user clicks the
login button in figure 3.2-1, then a request of the following form:
https://server:port/login?ID=user&pwd=123&mozilla=nothing&submit=Login
will be sent to Psiphon. Method matchIdPwd is called to extract the user name and password
from the request, and perform the authentication.
If the user name and password match, then the remote user interface will be sent to the browser,
and the remote user's ID will be saved in the cookie of this page.
If the user's ID and password don't match, Psiphon will refuse this login attempt and send back
the web page shown in figure 3.2-2.
Figure 3.2-2 Authentication failure screen
If a user attempts to log in while that user name is already in use, the attempt is rejected, and
the page shown in figure 3.2-3 is displayed.
Figure 3.2-3 User name is already in use.
Case 3: The user requests to log out. If the user clicks the logout button in the remote user
interface, then a request of the form:
https://Psiphon-Address/logout?href=logout
will be sent to Psiphon. Psiphon will send back the logout web page to the browser, which uses
Javascript to close the browser window after a few seconds. (In practice, Mozilla cannot close
the window, and IE displays a close prompt window.)
Case 4: The request is for a resource on a remote server. In this scenario, Psiphon needs to
forward the request to the other web server. There are two sub-cases in this situation.
Sub-case (1): The remote user enters a URL (for example: www.google.com) and clicks the
'connect' button in the remote user interface. A request of the form:
https://Psiphon-Address/req?href=www.google.com
will then be sent to Psiphon. Psiphon will extract the input URL (www.google.com), forward the
request to the corresponding web server, and accept the response resource.
Sub-case (2): The remote user clicks a hyperlink in a web page which has been parsed and
modified by Psiphon. In this situation, a request of the form:
https://Psiphon-Address/href=http://www.google.com/
will be sent to Psiphon. Psiphon extracts the hyperlink (http://www.google.com/), forwards the
request to the corresponding web server, and accepts the response resource.
For both sub-cases, if the response from the other web server is an HTML file, its headers will
contain a Content-Type field saying so (i.e. 'Content-Type:text/html'), and this HTML file
will be parsed to modify all hyperlinks and references to remote files.
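The four cases above can be summarized as a small dispatch function. This is a minimal sketch written for Python 3, not the report's handler code: the function name, the returned case labels, and the "unknown" fallback are assumptions made for illustration, and the function works only on the path portion of the request URL.

```python
from urllib.parse import parse_qs, urlsplit


def classify_request(path):
    """Return a (case, details) pair for the path part of a request sent to the proxy."""
    if path in ("", "/"):
        return "home", None  # Case 1: return the login page
    parts = urlsplit(path)
    if parts.path == "/login":
        # Case 2: pull the user name and password out of the query string.
        fields = parse_qs(parts.query)
        return "login", (fields.get("ID", [""])[0], fields.get("pwd", [""])[0])
    if parts.path == "/logout":
        return "logout", None  # Case 3
    # Case 4: everything after 'href=' is the address to fetch on the user's behalf.
    for prefix in ("/req?href=", "/href="):
        if path.startswith(prefix):
            return "proxy", path[len(prefix):]
    return "unknown", None
```

For example, the login request shown in Case 2 is classified as a login with the credentials ("user", "123"), and the link from sub-case (2) is classified as a proxied request for http://www.google.com/.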
3.3 Getting Requested Files
As a proxy server, Psiphon must grab resources from other web servers when the remote user
requests a resource not in Psiphon (i.e. not the login page). To get resources from other web
servers, Python's urllib2 module is used in class URLFetcher. URLFetcher's grab method wraps
the URL in an object of class urllib2.Request, then uses this object (and the parameter
data if it is a POST command) to grab the requested resource.
In class ProxyHandler, there is an attribute fetcher which is an object of class URLFetcher.
As mentioned in section 3.2, a proxied URL is extracted from the remote user's request. This
URL is sent to fetcher, which returns the response as a dictionary with the structure:
{'responsecode':200,'body':urlObj.read(),'headers':urlObj.info()}
The elements of the dictionary correspond to the response code returned from the server, the
body of the requested resource, and the headers of the page, respectively.
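A fetcher that returns a dictionary of this shape might look like the sketch below. It is an approximation only: it uses Python 3's urllib.request in place of urllib2, it is a free function rather than a method of URLFetcher, and it does no error handling (an HTTP error status raises an exception instead of being returned as a response code).

```python
import urllib.request


def grab(url, data=None):
    """Fetch url (as a POST when data is given) and return the response as a dictionary."""
    request = urllib.request.Request(url, data=data)
    with urllib.request.urlopen(request) as response:
        return {
            # HTTP responses carry a status code; other URL schemes may not have one.
            "responsecode": getattr(response, "status", None),
            "body": response.read(),
            "headers": response.info(),
        }
```

The caller can then inspect the headers, for example with result["headers"].get_content_type(), to decide whether the body is HTML that needs to be parsed.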
3.4 File Parsing
The resulting parser class created for Psiphon needed to be written using regular expressions,
rather than using any of Python's own parsers (see section 4.2). The regular expression looks for
any tag which needs to be changed. But rather than looking for the tag types (anchor, image,
etc.), the regular expression matches the attributes of these tags. For tags with any of the
attributes href, background, src, action, or url, the Psiphon server, port, and 'href=' are added
to the front of the attribute's value. This way, any file referenced by a tag will be requested
through Psiphon again, rather than requiring the remote user to download the file themselves from
the target server.
Using regular expressions to parse the HTML files was required because none of Python's provided
parsers were stable enough to handle malformed HTML properly. However, because the regular
expressions look for attributes rather than tags, there are tags which contain attributes such as
'background' which we do not want changed, and yet the regular expressions will change them.
Also, because of how the regular expressions were written, they do not take into account white
space between the attribute name and the equals sign. This will cause many tags to be ignored.
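The attribute-based rewriting can be sketched with a single regular expression. This is not the expression used in Psiphon, which the report does not show; the pattern, the function name, and the parameters are assumptions. Unlike the version described above, this sketch tolerates white space around the equals sign. It handles only quoted values, leaves out the 'url' case, and does not resolve relative addresses against the page's own URL.

```python
import re

# Attribute names taken from the text above; \s* tolerates spaces around '='.
ATTRIBUTE_PATTERN = re.compile(
    r"""\b(href|background|src|action)(\s*=\s*)(["'])(.*?)\3""",
    re.IGNORECASE | re.DOTALL,
)


def rewrite_references(html, server, port):
    """Prefix every matched attribute value with the proxy address and 'href='."""
    prefix = "https://%s:%s/href=" % (server, port)

    def replace(match):
        name, equals, quote, value = match.groups()
        # Keep the original spacing and quote style; only the value changes.
        return "%s%s%s%s%s%s" % (name, equals, quote, prefix, value, quote)

    return ATTRIBUTE_PATTERN.sub(replace, html)
```

Because the pattern matches attribute names wherever they occur, it shares the limitation noted above: an attribute such as 'background' is rewritten even where that is not wanted.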
3.5 Graphical User Interface
From the initial design (see Figure 2.5-1), we made an interface by mixing both wxPython and
PythonCard. Psiphon's interface is made from a PythonCard resource file. The resource file is
used to create the components and menu of Psiphon, and wxPython sizers are used to place the
components into the correct places once the program has started.
Figure 3.5-1 Psiphon's Host user interface.
The interface contains all of the elements that were requested in the meeting with the client. The
'Address' text field contains the IP address of the current server. The address is obtained by
Psiphon's server using our HostInfo module, which uses gethostbyname from Python's socket
module. The text fields which contain the server information are not editable. This
way the user cannot accidentally erase or change the data and lose the information until Psiphon
is restarted. The user can either copy and paste the information from the fields individually to
their friend, or they can click the 'Send to a friend' button to get the information displayed in a
single (uneditable) text area (figure 3.5-2).
Figure 3.5-2 Send to a friend dialog.
The displayed user name and password are taken from one of the user accounts stored in
the user management system (section 3.7). However, if no user accounts exist when Psiphon is
started, the host user is prompted to create one. Since the server requires a user name
and password to be used, the host user must create at least one user account before starting the
server.
The menu consists of four items: File, Options, Users, and Help. Though some menus only
contain one item, it was designed this way to allow additional menu items to be added as more
functionality is introduced to Psiphon. 'File' contains an 'Exit' item (figure 3.5-3), which closes
the connection and closes the program itself (as opposed to the 'Stop' button, which simply stops
the server).
Figure 3.5-3 File menu.
The 'Options' menu contains a 'Select Language' item which allows users to set the default
language of Psiphon (figure 3.5-4).
Figure 3.5-4 Options menu.
The 'Users' menu is used to add and delete users (figure 3.5-5).
Figure 3.5-5 Users menu.
Finally, the 'Help' menu (figure 3.5-6) consists of an 'About Psiphon' dialog, which contains a
summary of what Psiphon is, and provides links to related sites, such as Psiphon's
homepage.
Figure 3.5-6 Help menu.
3.5.1 Host Interface Internationalization and Localization
The primary language of Psiphon's (remote user) target audience is non-English. Since Psiphon
relies on the remote users having friends in a country where there is no Internet censorship, we
decided to translate Psiphon's interface into one additional Unicode language, and make
it easy to add additional languages.
Psiphon is currently available in the following nine languages: Norwegian, Persian (Iran),
Spanish, Italian, French, English, Russian, Romanian, and Chinese (Simplified). Each language
has its own language file. The language files are simply Python source files containing a
variable for each piece of interface text, whose value is the appropriate translation. Below is
an example of the English language file variables for the labels of the server information.
address="Address:"
user="User Name:"
passw="Password:"
...
Adding a new language requires someone to make a copy of an existing language file, change
each variable value to the new equivalent, then add an entry to the dictionary in language.py.
The keys of the dictionary are the names of the languages (in that language), and the value for
each key is the associated language file name.
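The lookup from language name to language file can be sketched as below. The module name english_demo, the LANGUAGES dictionary contents, and both function names are hypothetical; the demo writes a tiny English language file into a temporary directory so that the example can run on its own.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Stand-in for the dictionary in language.py: language name -> language file name.
# "english_demo" is a placeholder module name used only in this example.
LANGUAGES = {"English": "english_demo"}


def load_language(display_name):
    """Import and return the language file registered under display_name."""
    return importlib.import_module(LANGUAGES[display_name])


def localized_labels(language, names=("address", "user", "passw")):
    """Collect the text for each named variable in the imported language file."""
    return {name: getattr(language, name) for name in names}


# Demo setup: create a small English language file and make it importable.
demo_dir = tempfile.mkdtemp()
Path(demo_dir, "english_demo.py").write_text(
    'address = "Address:"\nuser = "User Name:"\npassw = "Password:"\n'
)
sys.path.insert(0, demo_dir)
importlib.invalidate_caches()

labels = localized_labels(load_language("English"))
```

With this arrangement, adding a language amounts to adding one file and one dictionary entry; the interface code that assigns label text does not change.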
When Psiphon is run for the first time, the language selection dialog is displayed (Figure 3.5-7).
Figure 3.5-7 Language Selection dialog.
The dialog is created by a PythonCard resource file which is created as the program starts, so any
changes in the dictionary from language.py will be reflected once the application is opened again.
Once an initial default language is selected, a Python source file defaultLanguage.py is created
containing a single variable whose value is the default language file name. After the initial
request to select a default language, the user is not prompted to select a language again, unless
they choose to change their default language from the 'Options' menu.
All dialog boxes are loaded from PythonCard resource files that are created as they are requested.
So if a user has the language set to English, and accesses the 'About Psiphon' dialog box, then
changes the language to Romanian and selects the 'About Psiphon' dialog again, all the
information will be in Romanian.
3.5.2 Remote User Interface
Before each altered HTML page is returned to the user from the server, our remote user interface
is added to the top of the page (Figure 3.5-8).
Figure 3.5-8 The remote user interface.
As with other proxies, the user must put the URL they want to go to inside the text field. If they
write the address in the browser's URL field, then they will be taken outside of the proxy. By
clicking the 'connect' button after entering the desired URL, the address is sent to Psiphon, and
the requested page is returned.
3.6 User Authentication
When a user sends a login request to Psiphon, matchIdPwd in class ProxyHandler is called
to extract the user name and password from the request, and perform the authentication (refer to
section 3.2, Handle HTTPS Requests from Browsers).
When a user tries to log in, the user's name and password are matched with the stored data on the
host's computer. If the user name and password match the stored record, then a userID is
returned. If the user name and password match but someone with that user name is already
logged in, then matchIdPwd returns a message that the name is already being used. Finally, if
the user name and password don't match, a message is returned saying that the login attempt
failed.
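The three outcomes described above can be sketched as follows. This is an illustration rather than the report's matchIdPwd: the function signature, the returned status strings, and the use of "y" to mark a logged-in user are assumptions; only the 'password' and 'online' field names come from the user record shown in section 3.7.

```python
# Hypothetical status values returned alongside the user ID.
LOGIN_OK, IN_USE, FAILED = "ok", "in use", "failed"


def match_id_pwd(users, user_id, password):
    """Check a login attempt against the stored accounts and return (status, user_id)."""
    record = users.get(user_id)
    if record is None or record["password"] != password:
        return FAILED, None  # unknown user or wrong password
    if record["online"] == "y":
        return IN_USE, None  # correct credentials, but the name is already logged in
    record["online"] = "y"  # mark the account as in use
    return LOGIN_OK, user_id
```

Returning a status together with the ID lets the request handler choose between the remote user interface and the failure pages shown in figures 3.2-2 and 3.2-3.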
3.7 User Management
User management is done in a way similar to the way the default language is stored. When
Psiphon is first opened, it tries to load the userdata.py Python source file. If it does not
exist, it prompts the host user to create a new user. The user information that is entered is
then displayed in Psiphon's text fields. The file which contains the users holds a dictionary of
dictionaries. Each key is the user name, and the value for each key is a dictionary containing
that user's specific information as follows:
userID:{password:pwd, bandwidth:xx, loginTime:dd-hh-m, IP:x.y.z.w, sentBytes:0, online:n}
Each key/value pair of each user's dictionary stores the information that is specific to that user.
Many of the elements are not yet implemented, but will be used to provide feedback to the host
user about who is connected to their server.
To manage checking user passwords and adding/deleting users, a separate Python file,
usermanagement.py, was created, containing functions used to manipulate the dictionary of
users. To add a user, the host user must open the 'Add a user' dialog and select a (unique) user
name and password.
Figure 3.7.1 Delete a user dialog.
To remove a user, the host user must open the 'Delete a user' dialog from the 'Users' menu. All
of the users in the userdata.py file are displayed in a list. The host user then selects the user they
want to remove and clicks 'OK'. If the removed user is the one whose information Psiphon is
currently displaying, a different user's information is displayed instead. However, if all users
are deleted, the host user is prompted to create a new user so that Psiphon can accept incoming
requests. All changes to the users dictionary are immediately reflected in userdata.py.
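A minimal sketch of how such a module could keep the dictionary and the file in step is shown below. The function names, the placeholder field values, and the choice to write the file as a single 'users = {...}' assignment are assumptions for illustration; the actual contents of usermanagement.py are not reproduced in this report.

```python
import ast
import os
import pprint


def new_user_record(password):
    """Build one user's dictionary; all values except the password are placeholders."""
    return {"password": password, "bandwidth": 0, "loginTime": "",
            "IP": "", "sentBytes": 0, "online": "n"}


def save_users(users, path):
    """Write the dictionary of dictionaries out as a Python source file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("users = " + pprint.pformat(users) + "\n")


def load_users(path):
    """Read the dictionary back; an absent file means no users exist yet."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as handle:
        source = handle.read()
    # Parse only the literal after the first '='; no code from the file is executed.
    return ast.literal_eval(source.split("=", 1)[1].strip())


def add_user(users, name, password, path):
    """Add a user under a unique name and persist the change immediately."""
    if name in users:
        raise ValueError("user name already exists: %s" % name)
    users[name] = new_user_record(password)
    save_users(users, path)


def delete_user(users, name, path):
    """Remove a user and persist the change immediately."""
    del users[name]
    save_users(users, path)
```

Saving inside add_user and delete_user, rather than on exit, is what keeps every change immediately reflected in the file. An empty result from load_users corresponds to the case where the host user must be prompted to create a new user.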
4. Discussion
.
Throughout the project, we were forced to change not only our design approach, but also our
understanding of the tools and problems in order to compensate for difficulties that arose.
4.1 Unicode
.
One of the more challenging parts of getting Psiphon up and running was the internationalization
and localization of the interface. The problem was Unicode. As we discovered¹, Unicode (or
rather, pretending it does not exist) is a big problem in software. In order to ensure that Psiphon
would work with all languages, we needed to get a language such as Chinese or Persian working.
Both use the UTF-8 encoding and fall outside the ASCII character set we are used to
dealing with. Chinese and Persian were two ideal languages to try because Chinese requires more
bytes per character, and Persian is a 'Right-to-left' language.
The first big problem was that none of the development tools we were using was able to handle
Unicode characters properly. During initial tests on the Persian translation file we were unable to
tell if the problem was with our data file, or with wxPython and PythonCard. After four to five
hours of working alongside Michelle Levesque, comparing how our different tools handled
Unicode, we determined that it was our development tools that were causing the problem.
Of Komodo, Cygwin, the Windows Command Prompt, and Idle, it appeared (at first) that Idle
was the best at handling Unicode. Komodo could properly display the data file when it was
opened, but its Python shell interpreter would print gibberish when it attempted to handle the
data. We determined it was a codec problem with Komodo's Python shell. We had moderate
success with Idle; however, after further tests with more Unicode, it seemed that Idle also had
problems dealing with a lot of Unicode.
We soon realized that the Unicode was not being displayed properly because a non-Unicode build
of wxPython was installed. Fixing that problem allowed the wxPython demo (which
contained a Unicode example) to run smoothly, but Psiphon's Persian translation was still causing
the program to crash.
¹ Spolsky 2003
The error was traced to a PythonCard bug, which used StringType rather than StringTypes.
We reported the bug to the PythonCard author, Kevin Altis, and he posted on
the PythonCard mailing list that the bug would be fixed in the next release.
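The difference between the two names can be illustrated with a short sketch. In Python 2, types.StringType was the byte-string type str, while types.StringTypes was the tuple (str, unicode). Those names no longer exist in current Python, so the example below defines stand-ins for them; the stand-ins and the two helper functions are assumptions for illustration and are not the PythonCard code itself.

```python
# Stand-ins for the Python 2 names (assumption: bytes plays the role of the
# Python 2 byte string, and str plays the role of the Python 2 unicode type).
StringType = bytes
StringTypes = (bytes, str)


def is_string_narrow(value):
    """A check against StringType alone accepts only byte strings."""
    return isinstance(value, StringType)


def is_string_wide(value):
    """A check against StringTypes accepts byte strings and Unicode text."""
    return isinstance(value, StringTypes)


label = "\u0633\u0644\u0627\u0645"  # a short Persian word as Unicode text

# The narrow check rejects the Unicode label; the wide check accepts it.
print(is_string_narrow(label), is_string_wide(label))
```

A type check of the narrow kind treats every Unicode label as if it were not a string at all, which is consistent with an interface that works in English but fails on a Persian translation.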
Despite the advances made in getting Unicode working, we were still getting errors when Psiphon
tried to display Unicode. Since we were sure that the Unicode build of wxPython wasn't the problem
(because the Unicode demo was working), we determined that the PythonCard resource file was
the problem. Migrating the interface away from PythonCard was strongly considered.
Robin Dunn of wxPython gave much-appreciated help in examining the resource file generating
code (where we determined the problem to be), and with his help we were able to get Psiphon
working with Unicode languages. However, we were unable to get 'Right-to-Left' display working
for Persian. Since we needed Psiphon to be multi-platform, we could not simply use the GTK
wxPython build. GTK itself is able to handle 'Right-to-Left' easily, but wxPython does not have
its own related options.
The Unicode problem forced us to understand Unicode, and why we should never write a
program which does not handle Unicode properly. We also gained experience in sorting out
problems in open source software. Tracking down a new bug in an existing tool gave more
insight into how the tool itself works. It also taught us how to better debug code, and to realize
that sometimes the bug may not be in the code you are writing, but rather in the code from a tool
that you are using. Overall, the time invested in fixing the Unicode problem was worth it for the
knowledge that was gained.
4.2 HTML Parsers
.
A key aspect of any proxy tool is that all references to remote files within an HTML page are
changed to point back to the proxy's address. We first decided to use Python's htmllib to
build a custom parser; however, we soon found that many tags were not being parsed properly.
When we tested the parser on a page like www.google.com, tags such as tables were ignored, and
the parser was not even recognizing scripts. The Python documentation showed us that htmllib's
HTMLParser was designed for parsing both HTML and XHTML. Despite being designed to
handle HTML, the parser still ignored many tags.
The design was then migrated to the HTMLParser module's HTMLParser class. HTMLParser is very
similar to htmllib's version, but does not require formatters, and is also a newer parser.
However, we saw that there were tags that this parser was also ignoring. It appeared as
though HTMLParser also had problems parsing malformed HTML.
Since we could not find a suitable parser that would handle all HTML, even poorly written
HTML, we had to write our own parser class, which used regular expressions. Rather than parse
each tag and then find the required attribute, we used a regular expression to find the
attributes directly.
Because the regular expressions match attributes and not tags, poorly written HTML does not
break the matching; however, matching attributes also leads to possible errors. For
example, we match the background attribute, but not all tags with a background attribute need
to be changed, so some pages will have modified tags when they should have been left alone.
Also, the current regular expressions do not take into account white space between the attribute
name and the equals sign, so attributes written that way will not be matched correctly.
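The attribute-matching approach can be sketched as follows. This is a minimal illustration under stated assumptions and not Psiphon's parser class: the proxy address PROXY_PREFIX, the list of attribute names in URL_ATTRS, and the function name rewrite_attributes are all hypothetical.

```python
import re
from urllib.parse import quote

# Hypothetical proxy address; the URL scheme Psiphon actually uses is not shown here.
PROXY_PREFIX = "https://proxy.example/fetch?url="

# Illustrative list of attributes whose values refer to remote files.
URL_ATTRS = ("href", "src", "action", "background")

# Matches name="value" or name='value'. No white space is allowed around '=',
# which mirrors the limitation described above.
ATTR_RE = re.compile(
    r"""\b(%s)=(["'])(.*?)\2""" % "|".join(URL_ATTRS),
    re.IGNORECASE | re.DOTALL,
)


def rewrite_attributes(html):
    """Point every matched attribute value back at the proxy address."""
    def _replace(match):
        name, quote_char, value = match.group(1), match.group(2), match.group(3)
        # Encode the original address so it travels as a single query value.
        proxied = PROXY_PREFIX + quote(value, safe="")
        return "%s=%s%s%s" % (name, quote_char, proxied, quote_char)
    return ATTR_RE.sub(_replace, html)
```

Because the pattern never looks at the enclosing tag, an unclosed tag such as <body background='bg.gif'<p> is still rewritten, while an attribute written as src = "a.png" is left untouched.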
Dealing with Python's two major file parsing modules gave us a better understanding of HTML
standards and how difficult it can be to handle both poorly and properly written HTML.
4.3 Interface Tools
.
Psiphon was developed using a mixture of wxPython and PythonCard. This mixture gave us an
appreciation for how to choose interface kits in the future. As outlined in the Unicode section
(4.1), we experienced problems with PythonCard and Unicode. PythonCard does make life easier
in many aspects of GUI creation, but at a cost.
PythonCard is not yet mature enough to properly implement an internationalized interface. Using
PythonCard required far too much time in sorting out bugs, and we still required wxPython code
to put together the layout. Though wxPython is more complicated to learn and code, had we
invested the time in the first place, it would have saved the time spent debugging the Unicode
errors later in the design.
The experience taught us to consider the maturity of a tool more closely before using it to
implement a project. On future projects, we would use wxPython, unless PythonCard's bugs are
fixed.
4.4 Bugtrac
.
Psiphon was the first project on which either of us had used a bug tracking system. We found it to be a
great way to communicate what had been done, and what still needed to be finished. We used the
system to store our schedule, and that allowed us to keep a record of the progress that we had
made. It also allowed us to make tickets for things that we were unable to finish during the
project, so that if another team were to work on Psiphon, they would have a starting point of what
to fix.
Having worked on this project using a bug tracking system, we would continue to use a similar
system on future projects, as we found it to be an asset.
4.5 Secure Connections
.
HTTPS was used to keep the communication between Psiphon and remote users secure. While
developing Psiphon, we did some research on HTTP, SSL, digital signatures,
certificate authorities, and HTTPS.
Since M2Crypto has implemented SSL already, we didn't study the algorithms used in SSL in
detail. We focused on how digital signatures and certificate authorities are used to implement
SSL. Through developing Psiphon, we came to understand the difference between HTTP and HTTPS,
and how a secure HTTP connection (i.e. an HTTPS connection) is created. We also learned that SSL
works on top of the transport layer, and can provide security to any TCP-based application
protocol, including HTTP. Unfortunately, we were unable to use OpenSSL to create a CA and
certificate for Psiphon, and had to use the out-of-date certificate provided with M2Crypto.
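The way SSL sits between TCP and HTTP can be sketched with Python's standard ssl module, which is swapped in here in place of M2Crypto. The function names and the certificate and key paths are hypothetical, and this is not the code Psiphon uses.

```python
import ssl
from http.server import HTTPServer, SimpleHTTPRequestHandler


def make_server_context(certfile, keyfile):
    """Build a server-side context from a certificate and its private key.

    Both paths are hypothetical placeholders supplied by the caller.
    """
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    context.load_cert_chain(certfile=certfile, keyfile=keyfile)
    return context


def make_https_server(address, certfile, keyfile):
    """Wrap a plain HTTP server's TCP socket so that it speaks HTTPS."""
    server = HTTPServer(address, SimpleHTTPRequestHandler)
    context = make_server_context(certfile, keyfile)
    # The HTTP handler is unchanged; only the underlying socket is secured.
    server.socket = context.wrap_socket(server.socket, server_side=True)
    return server


def make_client_context():
    """Client-side context that checks the server certificate against trusted CAs."""
    return ssl.create_default_context()
```

The HTTP layer is left as it is and only the socket beneath it is wrapped, which is why the same mechanism can secure other TCP-based protocols. A client context built this way would normally reject an expired certificate unless the user overrides the check.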
5. References
.
Garfinkel, S. & Spafford, G. (2002). Web Security, Privacy & Commerce, 2nd Edition. CA,
O'Reilly.
Kasplat (2004). PythonCard Home Page. Retrieved December 10, 2004, from
http://pythoncard.sourceforge.net/
Python Software Foundation. (2004). Python Programming Language. Retrieved December 10,
2004, from http://www.python.org/
Spolsky, J. (2003, October 8). Joel on Software. The absolute minimum every software
developer absolutely, positively must know about Unicode and character sets (no excuses!).
Retrieved December 10, 2004, from http://www.joelonsoftware.com/articles/Unicode.html/
wxPython (2004). wxPython. Retrieved December 10, 2004, from http://www.wxpython.org/