The followings plots have been generated with Scapy based on a method described here. It consists in sending 1000 tcp syn to Google/Live/Yahoo on port 80, waiting for answers and analyzing the values held by various interesting fields (ip id, tcp seq, tcp timestamp). The plots are drawn from those observations.
Please refer to the previous post for a more complete description (for Linux) of the fields used. You can also check out the code used here.
Let's investigate with what values and where are initialized the IP ID field and the TCP Seq and Timestamp fields into the Linux kernel (2.6.24). Those informations are useful for example for deducing if two TCP stacks are the sames (with the IP ID field) or in some cases for deducing the uptime of the machine (through the TCP TIMESTAMP).
When connect() is called from user space this call leads to the function tcp_v4_connect() located in the file tcp_ipv4.c. This function initiates the first part of the three way handshake and therefore has to choose a new and random sequence value TCP Seq. This value is randomly picked from random data hashed with MD4 message disgest algorithm by the function secure_tcp_sequence_number into random.c, see this link about current RNG implementation into the Kernel.
if (!tp->write_seq) tp->write_seq = secure_tcp_sequence_number(inet->saddr, inet->daddr, inet->sport, usin->sin_port);
This sequence number will then be increased by following the standart RFC 793.
The IP ID identifier is also initialized into tcp_v4_connect() through:
inet->id = tp->write_seq ^ jiffies;
(Update) It is important to mention that kernels 2.6.x (I haven't tested for 2.4.x) use a null IP ID during the TCP handshake and it is only after those exchanges that the random value described above is used.
Where the random data obtained previously for the sequence number is reused and xored with the variable jiffies. jiffies is a global counter variable increased at system's timer interrupt frequency.
The TCP Timestamp value is an option from the TCP protocol and will be appended to the options field only if the sysctl_tcp_timestamps (always with our example based on a TCP connect) is set to 1 in the function tcp_transmit_skb from tcp_output.c. This variable is accesible from user space through sysctl -w net.ipv4.tcp_timestamps={0,1}
if (sysctl_tcp_timestamps) {
tcp_header_size += TCPOLEN_TSTAMP_ALIGNED;
sysctl_flags |= SYSCTL_FLAG_TSTAMPS;
}
This option is a 10 bytes field whose a chunk of 4 bytes is the current time of the system (taken just before sending the packet) and another 4 bytes chunk is the timestamp reply value. The space of possible values is therefore modulo 2^32. This field is in our example assigned in the function tcp_connect from tcp_output.c.
TCP_SKB_CB(buff)->when = tcp_time_stamp;
Where tcp_time_stamp is based on the global variable jiffies and is defined into tcp.h. Obviously this value permits to deduce the uptime of the owner of this TCP stack (mod 2^32 of course).
#define tcp_time_stamp ((__u32)(jiffies))
Eventually, the options field is built in the function tcp_transmit_skb located into tcp_output.c and the packet can be sent.
} else {
tcp_build_and_update_options((__be32 *)(th + 1),
tp, tcb->when,
NULL);
TCP_ECN_send(sk, skb, tcp_header_size);
}
strace is a powerful tool tracing the systems calls and signals.
It can take a program as argument or be attached to a running process by its pid, for example:
sundae:~$ psall | grep firefox
ookoi 13196 1 13196 7 9 12:16 ? 00:43:17 /usr/lib/firefox/firefox-bin
[...]
sundae:~$ strace -p 13196
Process 13196 attached - interrupt to quit
gettimeofday({1182457421, 943651}, NULL) = 0
ioctl(3, FIONREAD, [0]) = 0
[...]
Second example: say you want to know where is written the configuration of xtensoftphone (a VoIP softphone) all you have to do is tracing the calls to the open() syscall and dig the output, i.e. the set of all the opened files by the process.
sundae:~/Desktop/tmp/xten-xlite$ strace -e trace=open ./xtensoftphone
open("/etc/ld.so.cache", O_RDONLY) = 3
[...]
open("/home/ookoi/.Xscrc", O_RDONLY) = 6
[...]
sundae:~/Desktop/tmp/xten-xlite$ head -n 5 /home/ookoi/.Xscrc
<root>
<branch id="SOFTWARE">
<branch id="XtenNetworksInc">
<branch id="X-Lite">
The right file is /home/ookoi/.Xscrc, the name is not very explicit per se, but the path is relative to the user's home directory, which is a common place for his settings.
The -c option can be used for displaying various statistics, the rows can be ordered with the option -S:
sundae:~$ strace -c -S time ls -la [...] % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 100.00 0.000863 9 99 99 getxattr 0.00 0.000000 0 21 read 0.00 0.000000 0 101 write 0.00 0.000000 0 64 24 open 0.00 0.000000 0 45 close 0.00 0.000000 0 1 execve 0.00 0.000000 0 15 15 access 0.00 0.000000 0 3 brk 0.00 0.000000 0 2 ioctl 0.00 0.000000 0 1 readlink 0.00 0.000000 0 13 munmap 0.00 0.000000 0 1 uname 0.00 0.000000 0 1 mprotect 0.00 0.000000 0 8 _llseek 0.00 0.000000 0 2 rt_sigaction 0.00 0.000000 0 1 rt_sigprocmask 0.00 0.000000 0 1 getrlimit 0.00 0.000000 0 59 mmap2 0.00 0.000000 0 100 lstat64 0.00 0.000000 0 42 fstat64 0.00 0.000000 0 2 getdents64 0.00 0.000000 0 17 fcntl64 0.00 0.000000 0 1 futex 0.00 0.000000 0 1 set_thread_area 0.00 0.000000 0 1 set_tid_address 0.00 0.000000 0 1 clock_gettime 0.00 0.000000 0 4 socket 0.00 0.000000 0 4 4 connect 0.00 0.000000 0 1 sendto ------ ----------- ----------- --------- --------- ---------------- 100.00 0.000863 612 142 total
ltrace is also an interesting tool intercepting the calls to the dynamics librairies (usually the files in .so). For example the calls to malloc, memcpy, ...
Update: just to mention that both strace and ltrace rely on ptrace().
A segfault is not always the end of the world, but at least this is a bug, and obviously as every program the CPython interpreter may contains errors.
This dummy script (dummy.py) takes a python source file, generate its bytecode, fuzz only one byte (a random byte is replaced by a random value) then write it back. You can now try to run it, and may experience a segmentation fault (if you're lucky).
Random example with listcomp.py and CPython 2.5:
sundae:~/Desktop/python-compiler$ ./dummy.py listcomp.py
[]
python fuzzed-listcomp-0.pyc
[]
python fuzzed-listcomp-1.pyc
RuntimeError: Bad code object in .pyc file
python fuzzed-listcomp-2.pyc
XXX lineno: 1, opcode: 234
Traceback (most recent call last):
File "/home/ookoi/Desktop/python-compiler/listcomp.py", line 1, in module
b = [a for a in range(3) if a > 2]
SystemError: unknown opcode
python fuzzed-listcomp-3.pyc
[]
python fuzzed-listcomp-4.pyc
RuntimeError: Bad code object in .pyc file
python fuzzed-listcomp-5.pyc
Segmentation fault (core dumped)
python fuzzed-listcomp-6.pyc
[]
python fuzzed-listcomp-7.pyc
[]
python fuzzed-listcomp-8.pyc
[]
python fuzzed-listcomp-9.pyc
RuntimeError: Bad code object in .pyc file
sundae:~/Desktop/python-compiler$
Among 10 executions we can observe a segmentation fault (fuzzed-listcomp-5.pyc).
Update: with exactly the same procedure, the Java bytecode of this program resulted on a total of zero segmentation fault and that over 1000 tests. This is certainly due to the Java bytecode verifier which performs a static analysis to ensure a certain amount of security on the loaded bytecode.
Update: it appears that this behavior isn't always reproducible, it is present under Feisty with Python 2.5.1 but not under Dapper with Python 2.4
With the inclusion of ctypes in Python 2.5, i think time has come to make pyinotify a pure python module (no more compilation needed, nor C files).
That means, removing the C code handling the inotify's syscalls and use instead ctypes making it load the libc and use its interface to inotify. The only constraint is to keep the maximum of portability. There's two aspects, first the module ctypes is not part of previous Python releases, so in this case there will be a dependance and the user will be required to install this module himself. Second, I don't know at wich point something like:
import ctypes
libc = ctypes.CDLL("libc.so.6")
is really portable, if it is then perfect!
I just tried a quick hack, and it seems to work well, but, i will make things proper (as soon as possible...).
By the way, another priority will be to merge the others files into a standalone file pyinotify.py, and so for simplififying the namespace, comprehension, installation, etc...
Note: lately i've been using scapy (see my previous post), and i read important parts of its code, and its design is really awesome, this is a great source of inspiration. Only few modules use metaclasses even less appropriately, but scapy does. There are plenty of details (use of design patterns, use of modules like logging, warning, ...) which make it a model of coding.
Update: I set up a dedicated page containing the latest developments about pyinotify.
With my group of project we are working these days on fuzzing.
It is an interesting way of finding vulnerabilities especially in network protocols or filesystems. There are different use cases, if you have the source code, you can obviously use traditionals and advanced mechanisms of software testing: unit testing, static analysis (abstract interpretation, constraint programming,...). In the case of binaries you have to disassemble it or apply test cases (fuzzing) by following the specifications (the protocol) if known. Even worst on remote systems on which you can not access systems logs nor binaries, the unique choice is to fuzz it.
To familiarize ourselves with fuzzing we choose a simple protocol DHCP and we try to find bugs in differents implementations. Unfortunaly at this time we haven't found such a case, our current targets are dhcpd on Linux, a Netgear router, and another propriatary router. Our script is written in Python and rely on Scapy to build layers, assemble, send packets and collect answers.
Be sure, i will be happy to update this post if we find something...
When speaking about cryptography, the first referenced book is the Bruce Schneier's book and in second the one of Doug Stinson, but i think there is another excellent book worth to be mentioned: Handbook of Applied Cryptography.
I appreciate this book because there are many examples, it introduces many algorithms and protocols. Moreover it is freely available by following the previous link. The only two negative points i found are: there isn't any full chapter on the elliptic curves (although one of its authors has written a dedicated book on EC), and the second is that it hasn't been edited since 2001, so it doesn't cover the last algorithms (for example: AKS on primality).
Few days ago, the Google Reader team has added a new functionally to their feed reader, it displays stats about your feeds.I love charts so i find it interesting. But i must say, that in fact it's not particularly useful, for example it doesn't provide you any recommandation on which new item you probably want to read. One such system would be based on the items you has read or not in the past. Anyway, here is the screenshot of my last month of stats.
Rusty russel has recently built an hypervisor named lhype on top of the Linux paravirt_ops interface (this is the common interface for all the current solutions like Xen or VMWare).I still haven't tried lhype but it seems easily usable as described by James Morris in this entry. Moreover it seems to be small and understandable. In January 2007, Rusty Russel will present his work at linux.conf.au, i hope some slides or an article will be available on the web.
update: lhype lguest has its dedicated website now, with a useful description.
What is the gap between reading and understanding something and making it yourself ?Sometimes not too much... but sometimes it's hugeee, i just experienced this
disagreement, i wanted to write a simple function (strlen) in assembly (x86), since i'am moderately able to understand the logic behind few isolated lines of assembly i thought it would be trivial... Obviously, i was too confident, ultimately i spent at least 2 hours to implement it ! i'am ashamed.
Yet another demonstration of Python's widespread use, a 125 lines script removing DRM protections of bought iTunes tracks (on Win32). As often with this kind ofwork, it mainly consists in reading, writing and copying memory. So what's better than a debugger for this purposes ? nothing.
I played with FUSE recently. And like everybody already said, it's a wonderfull promising app., once you understand the api you can imagine the huge power of fuse, only limited by your imagination. The only drawback when you start with it is the lack of documentation and worst the lack of advanced examples, if you want to understand how to make a real usable fs you must study another project based on fuse. There are some interesting projects like GmailFS, or a small project BeagleFS.
One thing you must understand is that FUSE is a virtual fs, i.e. for example you can't directly write to it like you would have written to a traditional fs, you must see it like an interface, for example you can have an ext3 partition with encrypted files and use a fuse fs for creating an interface to automatically read and manage your encrypted data like it were not encrypted. Or your operations could be made on a database or on a distant system like Gmail, or by using a particular protocol like RPC... Indeed, by constraining you to provide and adhere to the commons fs operations (open, write, read,...), you will automatically be able to use all the standart commands like ls, mkdir, your favorite text editor,...
The python binding works quite well, unfortunaly it also lacks of documentation, and you'll probably have to read the source code to understand some parts of the project. It builds an Object Oriented interface on top of FUSE, you can manipulate your files or directories through dedicated classes.
In order to learn the basis of FUSE i made a small fs, it is written in python, is available and introduced here.
I made the mistake 2 years ago to check the option "maintain my library organized" or something like that in the iTunes preferences and the result were to store on my computer the music files this way: /artist-x/album-y/file-z.mp3 which is quite desappointing since the common way to keep its music ordered is by album: /album-y/file-z.mp3, while the others informations are meta-data and don't requires to add a useless physical hierarchical level...
Then now, i'm not using anymore iTunes (since Amarok is a wonderfull app.) and i wanted to revert the mess introduced by iTunes on my files so i made this dummy script which reorder my files by albums.
I wanted to call a small script behind https and thought i had enough to slightly modify this recipe. But I was wrong, it is not that easy since pyOpenSSL doesn't implements the makefile() method, instead use a fileobject, which doesn't supports the dup2() call (in CGIHTTPRequestHandler). I must say, at this time i don't have investigated too much, i don't know the exact issue, it would be interesting to look at what really does OpenSSL (now in 2006, whereas the last release of pyOpenSSL was in 2004) and how the others OpenSSL bindings deal with this problem.
Meanwhile, in order to run and test my script (under Linux) i made one small (ugly) modification to CGIHTTPServer.py, in order to force the script to be executed like it were under another os than Unix or Windows and don't call dup2().
These days i'm trying to evaluate, the choice of technologies, i look around and seek what others developpers have choose as solutions, what projects worth to be used, now and in the future, which standarts are interesting... This is a hard work, given the number of different possibilities, and the global incertitudes, who can predict, that a given project would going forward in the right direction? In order to narrow this first set it is necessary to define his constraints, mine are:
- Use Python whenever it is possible, firstly because of the huge community and the huge amount of projects written in Python. RoR and Ruby are very popular but i'm feeling more comfortable with the Python community, which is less focused on web developement.
- Use a web framework, but not an heavy framework, moreover the previous constraint eleminates RoR. Currently, i want to look at web.py whose its strength is its simplicity, and to Django whose the popularity is increasing so much that maybe it is a killer framework (actually i can't say, i've just given a sneak peek).
- Ability to make a RESTful architecture.
- I don't want to write a lot of Javascript, but these days it is hard to ignore it, there is plenty of awesome developments. There are few interesting librairies which can be very interesting like dojo (a big project), MochiKit or prototype, the big constraint is that they must be able to be integrated in the framework.
- I'm also considering using XML (maybe under the Atom 1.0 dialect) serverside and processing these XML with XSLT transformations inside the client. Which can reduce the painful task to process XML in Javascript, but it would probably force me to use a javascript library like Sarissa, Freja or AJAXSLT, because many browsers still doesn't natively supports XSLT yet.
In a next post i want to develop my thoughts and give the projects i'm considering to use, what are my choices and why.
There are two disctints python front-ends, first what we call front-end is the step occurring on your source code from parsing until to final bytecode generation.
The first front-end, is probably the one you have in mind, it is the one called when you type python myfile.py, it is mainly written in C and has recently been refactored and modified, it is well described in this pep. Do notice that this is the one used when you call the builtin function compile().
The second front-end is the one accessible through the module compiler, it is a pure-python implementation, it is expected to have the same behavior than the first front-end, i.e. compiler.compile() is expected to produce strictly the same bytecode than compile() given the same source code. Read the documentation for the motivations behind having two front-end.
What i think is an interesting excerpt on this subject from Peter Van Roy and al. in this paper:
Adding both concurrency and state to the core results in the most expressive computation model. There are two basic approaches to program in it: message passing with active objects or atomic actions on shared state. Active objects are used in Erlang. Atomic actions are used in Java and other concurrent object-oriented languages. These two approaches have the same expressive power, but are appropriate for different classes of applications (multi-agent versus data-centered).
I could not have said it better :)
One has always to take the decision between developping its web pages with a complete framework or instead using useful independant modules. In the latter case, there's several interesting and useful python modules:
- Cheetah: powerfull templating engine.
- mod_python.
- Elementtree: XML processor (integrated in CPython 2.5).
- SQLObject: Object Relational Manager for providing an object interface to your database.
- fastcgi.
- formencode: form parameters validator.
As i already mentioned Code Coverage testing is very useful especially for dynamic languages, i made a simple recipe based on existing recipes which generate an html representation of the source code and highlights the unreached parts (statements). You can read this recipe here.
Behind this complicated term is hiding the next generation of C++.
In a short article Bjarn Stroustrup introduces what could be the C++ in 2008.
As expected, he doesn't plan a revolution, but a few longly awaited (however not by everybody) features should be added like optional Garbage Collection, few languages constructions mainly for simplifying generic programming. An effort will be made to add (standardize) useful librairies (like threading, regexp) which i think is the weakest point, the STL effort is a master piece, like the BOOST librairies, but not every librairies can become a standart without an unified community and an ostensible leadership. Maybe big corporations doesn't need this kind of support/infrastructure but small business/academics people care a lot this kind of thing.
This talk (slides) from Herb Sutter on concurrency is very enlightening, he starts from the common observation under which that "the free lunch" for traditional sequential applications (mostly client apps) is "over" since the micro-processor industry is now unable to produce faster processor and is now designing multi-core processors.
Thus it seems obvious that under this path only concurrent apps will be able to maintain their performance growth. But it is even worse than that, only N-concurrent apps would be able to do that. "N" concurrent apps by opposition to "K" concurrent apps are applications which scale well to N-core with latent concurrency, instead K-concurrency target exclusively K-core (doesn't scale above k cpus, can be penalized below k cpus).
Therefore we must be able to write N-concurrent-applications. Herb Sutter compares the more widespread mechanism which is actually Threads + Locks to the C language, and argues like Object Oriented Languages have emerged in the 90', concurrency needs its own revolution. Moreover, whereas some kind of application could freely ignore OOP, almost no one won't be able to withdraw concurrent programming.
His solution would be to pick selected abstractions idioms, then plug them into mainstream languages (like C++). It introduces what the C++ would be with the support for active objects (each active object can run on its own thread), with non-blocking statements, through messaging pieces of code runs asynchronuously, sync statements (wait() call), and what would be the implications for librairies, STL in particular, pros (performance, design), cons (small? code parts to rewrite)...
This recipe permits to retrieve the common tags (if any) associated with a given url, and that by consulting del.icio.us.
As far i know, del.icio.us doesn't provide an api for accessing this kind of service. The only related service is an api for accessing his (as user) common tags from his bookmarks.
Say we want for example get the tags associated to http://www.python.org/ (note that the trailing '/' is significant), we must first md5ized this url, which give the md5 string code 'c5b453422c02f37495b83d97018ff0c9', therefore we can now obtain the url of the 'python' page on del.icio.us at http://del.icio.us/url/c5b453422c02f37495b83d97018ff0c9.
Then it only remains to parse the content of this page to retrieve the interesting part: the common tags and theirs frequency classes. Putting all together give the following code and the following tag cloud for the python's url:
These days i'm looking at Erlang, what was first appealing to me is its native design for concurrency and real-time uses.
But i didn't know one important and interesting characteristic: its declarative syntax à la Prolog. It seems not to be the hazard, because its first implementation was written in Sicstus Prolog. Like Prolog does it supports the atom/Variable semantic coupled with pattern matching, functions are called procedures and may be compounded of clauses, the arity introduces ad-hoc polymorphism (overloading), it adds a guard mechanism and some explicit syntax for pattern matching. Unlike Prolog, it implements imperative features like explicit conditional statements, while (loop). Moreover, Erlang doesn't implements the backtracking mechanism, nor the cut operator. Lists and tuples are among its native types, lists are manipulated exactly as in Prolog. In summary, it seems to be a kind of Prolog (mainly its syntax) without the roots of Prolog in logic programming. A Prolog's fellow may wonder why to use a sub-set of Prolog ? Answer: because the purpose of Erlang is not to solve logic problems but concurrency issues and soft real-time constraints. I read the first part of the book: Concurrent Programming in Erlang which is a good introduction.
Obviously Erlang is well suited for server/distributed systems and should easily and safely communicate with others software components eventually written in other languages via RPC calls or dedicated interfaces like for example py_interface which seems to implements in Python an Erlang node (however i still not played with it).
Due to duck typing, the Observer Pattern can easily be implemented in Python. The following code follows the syntax used in the GoF's book:
class Subject(object):
'''
Minimal Subject class, with few shortcomings:
- Deleting this subject produce dangling references into
observers
- Triggers the update (notify call) after every state
changes.
- tick method blocks
'''
def __init__(self):
self._observers = [] # many observers can observe
self._time = None # current state
def attach(self, obs):
self._observers.append(obs) # attach new observer
def detach(self, obs):
self._observers.remove(obs) # detach an observer
def get_time(self):
return self._time
def tick(self):
import datetime, time
while True:
self._time = '%s' % datetime.datetime.now()
self.notify()
time.sleep(1)
def notify(self):
for obs in self._observers:
obs.update(self) # notify every observer
class Observer(object):
'''
Minimal observer class:
- Cannot observes more than one subject.
- Encapsulating minimal update semantics.
'''
def __init__(self, subject):
'''
keep a ref on Subject,
'''
self._subject = subject # keep a reference
self._subject.attach(self) # attach himself
def update(self, subject):
print 'Observer %d: %s' % (id(self),
subject.get_time())
if __name__ == '__main__':
s = Subject()
o1 = Observer(s)
o2 = Observer(s)
s.tick() # blocking call
You can donwload this example here.
I don't know if i dare to tell one thing... ok, i take the risk, i really don't like unittest. I care to extensively test my code, but probably because i don't know very well this module, i'm unable to find enough flexibility to match my needs.
And as often there is a poor documentation, this time i don't want to spend my time to guess the internals. This kind of machinery is perfect for important amount of code and heavy frameworks, but for relatively small modules/classes it seems to be disproportionated and simple testing strategy can be enough, for example, the reports in that case are useless and the initialization is undoubtedly more comfortable without all the constraints of unittest. By the way, i don't know if the test* methods are called sequentially or are dispatched on multiple threads, this is the black box effect.
Anyway, i wanted to add something else on the testing subject, i think an important kind of test as said here is to look at code coverage, to figure out, unused or at least unexecuted code which could be a symptom for dead code. Especially, dynamically typed languages whose static checking are limited defers errors detection until program execution. Thus this form of laziness added to rarely executed blocks can lead to easily keep hidden lexical/types errors. Code coverage may help in these errors patterns.
Let's investigate a simple locking policy to avoid dead locks and data races.
First it is worth to repeat one more time a trivial but often broken rule:
- Several threads can silmultaneously read a shared ressource (memory, file) without requiring any locking. This is the easy part.
- But as soon as at least a thread needs to write to this ressource, all the access even those which only reads must be protected with appropriate locks. This is the frustrating part, because even with only 1 write for 10 reads, it requires the same strategy. (note that some more advanced locking mechanisms exists like the shared lock, but still aren't implemented in the Python standart library).
Now that the 'big' rule is respected, you should follow a simple policy to avoid race conditions or dead locks:
- Because sometimes unexpected events happens like exceptions, modifying control flow of your program, it may lead to not release the lock you wanted to release. Thus, the following block is needed:
lock.acquire() # lock is an instance of threading.Lock try: [...] finally: lock.release()
Whatever it happens in the body, the finally clause will be executed and will release the lock. - Avoid to call functions or methods when your program holds a lock. Imagine the callee try to acquire the same lock than the caller: dead lock. I know you're smart enough to know that it is exactly the craft of RLock to deal with these situations, but i argue that it is still too dangerous, because like everything, functions or methods can be modified at a later time, a time where you don't remember the details, so you can broke the initial strategy and involontary introduce some bugs. Just don't call any functions or methods in that state.
This is the first post.
Since i bought dbzteam.com 2.99$ at Yahoo! i felt that a domain without a Blog could be seen like an heresy in 2006. So that is how this page is born. I plan to write about Programming Languages, Python, Computer Science in general. I set up PyBlosxom as weblog system, but i am actually too lazy to configure the comments plugin.








