Thumbnail API

Open source Website Thumbnail generator API



Introduction to the open source thumbnail API

As of September 2012, we dropped Firefox support and now use WebKit, so the instructions below are outdated (as is Firefox 4.0.1 itself). WebKit instructions will be placed here in the future. The WebKit patches for gnome-web-photo, which also enable Adobe Flash, are here and here!

All the parts should now be available. This page will be updated with more details.

Preconditions

These instructions show how to duplicate the software running on immediatenet.com on any Linux server. We run on Ubuntu 10.04 in particular.

HW specs

immediatenet.com runs on the cheapest dedicated servers out there. Indeed, only 2 gigabytes of memory with 2 cores is what we have. If we had 16 gigs / 8 cores, things would be a lot nicer!

Engine

We use the gnome-web-photo. Unfortunately, there's no support for Firefox based gnome-web-photo anymore. The gnome-web-photo uses the WebKit at the moment. There's nothing wrong with the WebKit, but getting it working on Ubuntu 10.04 is a challenge (it requires GTK3.0+ so you'll have the early bugs there as well). On 11.xx it should be less of a pain to get the gnome-web-photo working.

Problems

There cannot be too many clients drawing thumbnails at the same time. The number depends heavily on the memory available on the system. As a rule of thumb, never let the server start swapping.

The Xorg server occasionally "dies" (we are not sure whether this still happens with the latest update), roughly once every two months. When it does, all new requests hang, and even running "xhost" on the command line hangs.
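
A periodic job could detect this kind of hang by running a probe command under a time limit. The sketch below is an assumption and not part of the immediatenet.com setup: probe_ok is a hypothetical helper, and it relies on the coreutils timeout command, which gives a non-zero exit status when the limit is hit.

```shell
#!/bin/sh
# probe_ok SECONDS COMMAND [ARGS...]
# Hypothetical helper: succeed only if COMMAND exits with status 0
# within SECONDS. A hung command is killed by timeout(1) and counts
# as a failure.
probe_ok() {
    limit="$1"
    shift
    timeout "$limit" "$@" >/dev/null 2>&1
}

# Possible use from a cron job (DISPLAY must point at the X server):
#   DISPLAY=:0.0 probe_ok 10 xhost || echo "X server not responding" >&2
```

What to do on failure (for example restarting the X server) is left open here, since the page does not describe a recovery procedure.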

Bots

The system contains several bots that take care of cleaning old thumbnails from the system. They will be described in the future.
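
The bots themselves are not published on this page. As a rough idea of what a cleaning bot could look like, here is a minimal sketch that deletes cached thumbnails older than a given number of days. The function name clean_old_thumbs and the age limit are assumptions chosen for illustration.

```shell
#!/bin/sh
# clean_old_thumbs CACHE_DIR DAYS
# Hypothetical cleaner: remove .jpg files under CACHE_DIR whose
# modification time is more than DAYS days in the past.
clean_old_thumbs() {
    cache_dir="$1"
    days="$2"
    find "$cache_dir" -type f -name '*.jpg' -mtime +"$days" -exec rm -f {} +
}

# Possible use from the crontab of the thumbnail user:
#   clean_old_thumbs "$HOME/thumb_api_fast" 30
```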

File system

The filesystem should allow filenames of up to 255 characters. ext3/ext4 should be just fine.


Installation instructions for the open source screenshot API

1. Create an account with an arbitrary username. Any name will do, but pick one that brute-force SSH password guessing software will not try. Here it is "wWw1Lser2RNd", but DO NOT use this one, as it is no longer random. Simply:

adduser wWw1Lser2RNd

2. Log in as wWw1Lser2RNd

3. We need some files on /usr/share/gnome-web-photo/
At least these files should be present: prefs.js style.css
- on Ubuntu 10.04, you'd want to:

sudo apt-get install gnome-web-photo

4. Get Firefox 4.0.1. Note that 5.0 does not have GtkMozEmbed unless you backport it, so you cannot use Firefox versions newer than 4.0.1. Download firefox

5. Build / install the firefox (only xulrunner needed):

./configure --enable-application=xulrunner
make
sudo make install

It should install libraries on /usr/local/* - it will not wipe out your existing one (unless the same version at same dir was there before, not likely). It should be xulrunner-devel-2.0.1.

6. Get the gnome-web-photo at the commit that still was based on Firefox:

git clone git://git.gnome.org/gnome-web-photo
cd gnome-web-photo/
git reset --hard d0391cdd8a65d366e208e8bc9a4e41440804844d

7. Apply the ugly patch for the gnome-web-photo

git am 0001-firefox-4.0.1-update.patch

Find it gzipped: here

8. Build it

./autogen.sh
make

9. Check it built on top of Firefox 4.0.1 (version must be 2.0.1):

grep xulrunner *

config.log:libxul_cv_include_root=/usr/local/include/xulrunner-2.0.1
config.log:libxul_cv_libdir=/usr/local/lib/xulrunner-devel-2.0.1/bin
config.log:libxul_cv_sdkdir=/usr/local/lib/xulrunner-devel-2.0.1
config.log:LIBXUL_INCLUDE_ROOT='/usr/local/include/xulrunner-2.0.1'
...

10. Grant the user wWw1Lser2RNd xserver access:

sudo xhost +local:wWw1Lser2RNd

That needs to be done every time the server is booted, unless put in the start-up scripts.

11. Make Apache run as wWw1Lser2RNd by setting these in /etc/apache2/envvars:
export APACHE_RUN_USER=wWw1Lser2RNd
export APACHE_RUN_GROUP=wWw1Lser2RNd


Restart Apache:
/etc/init.d/apache2 restart

12. Create the thumbnail/screenshot 'database' (very simple):

cd
mkdir thumb_api_fast

13. Find the error images here

cd
mkdir thumb_api_errors
tar xzf error_images.tar.gz


Website screenshot scripts - linux shell scripts for the thumbnail API

We'll add more scripts here. Here's all you need to get started. Remember to check your /var/log/apache2/error.log periodically, as all errors end up there. Errors caused by the scripts must be fixed (for example, if your hard disk is not /dev/sda2, you'd want to correct that). If there are complaints in Apache's error log, please fix all of them.

Glue code - integrating the www and the engine

You need the shell scripts. This is an example of the immediatenet.com/t/m/ script. Download it here, gzipped: here
I declare it as GPL2 licensed code; check the licence file here. The licence is not in the script itself, as it would slow down the processing. Please note that running shell scripts isn't the most efficient way of handling requests, but it's still far better than what the majority of similar screenshot/thumbnail services have! A compiled binary written in C could be much, much faster. Anyway, let's have a look at it:

#!/bin/sh

# Immediate Net

cd /home/wWw1Lser2RNd/thumb_api_fast 2>/dev/null > /dev/null
if [ $? -ne 0 ]; then
        LEN=`/usr/bin/stat -c %s /home/wWw1Lser2RNd/thumb_api_errors/unexpected.jpg`
        /bin/echo "Content-Length: $LEN"
        /bin/echo "Content-type: image/jpeg"
        /bin/echo ""
        /bin/cat /home/wWw1Lser2RNd/thumb_api_errors/unexpected.jpg
        exit
fi

First we'd like to go into directory /home/wWw1Lser2RNd/thumb_api_fast. If that cannot be done, return an error image.
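
The same five-line "send an image" sequence recurs throughout the script. One way to avoid the repetition is a small helper function. This is only a sketch: send_image is a hypothetical name that does not appear in the original script, and the commands are called without absolute paths here.

```shell
#!/bin/sh
# send_image FILE
# Hypothetical helper: write a complete CGI response for one JPEG file
# (headers, blank line, then the image bytes) to standard output.
send_image() {
    img="$1"
    # stat -c %s prints the file size in bytes (GNU coreutils syntax).
    len=$(stat -c %s "$img") || return 1
    printf 'Content-Length: %s\n' "$len"
    printf 'Content-type: image/jpeg\n'
    printf '\n'
    cat "$img"
}

# An error branch in the script would then shrink to:
#   send_image /home/wWw1Lser2RNd/thumb_api_errors/unexpected.jpg
#   exit
```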

SIZE=`/bin/echo "$QUERY_STRING" | /bin/sed "s/%60//g" | /usr/bin/tr --delete ';\`"<>,!@#$%^(){}[]' |
/usr/bin/cut -d"&" -f 1 | /bin/sed "s/Size=//"`

SIZE_W=`echo "$SIZE" | /usr/bin/cut -d'x' -f 1`
SIZE_H=`echo "$SIZE" | /usr/bin/cut -d'x' -f 2`

SIZEOK="0"
if [ "$SIZE_W" = "1280" ]; then
	SIZEOK="1"
fi
if [ "$SIZE_W" = "1024" ]; then
        SIZEOK="1"
fi
if [ "$SIZE_W" = "800" ]; then
        SIZEOK="1"
fi

if [ "$SIZEOK" = "0" ]; then
        LEN=`/usr/bin/stat -c %s /home/wWw1Lser2RNd/thumb_api_errors/error_size.jpg`
        /bin/echo "Content-Length: $LEN"
        /bin/echo "Content-type: image/jpeg"
        /bin/echo ""
        /bin/cat /home/wWw1Lser2RNd/thumb_api_errors/error_size.jpg
        exit
fi

SIZEOK="0"
if [ "$SIZE_H" = "1024" ]; then
        SIZEOK="1"
fi
if [ "$SIZE_H" = "768" ]; then
        SIZEOK="1"
fi
if [ "$SIZE_H" = "600" ]; then
        SIZEOK="1"
fi

if [ "$SIZEOK" = "0" ]; then
        LEN=`/usr/bin/stat -c %s /home/wWw1Lser2RNd/thumb_api_errors/error_size.jpg`
        /bin/echo "Content-Length: $LEN"
        /bin/echo "Content-type: image/jpeg"
        /bin/echo ""
        /bin/cat /home/wWw1Lser2RNd/thumb_api_errors/error_size.jpg
        exit
fi

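Next we check the size (above). Any of the widths 1280, 1024, or 800 is accepted together with any of the heights 1024, 768, or 600. This could be done a lot more simply; let it be an exercise for you.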
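
One possible answer to that exercise is a pair of case statements. The sketch below assumes the size arrives as a single WIDTHxHEIGHT string, as SIZE does in the script; size_ok is a hypothetical function name.

```shell
#!/bin/sh
# size_ok WIDTHxHEIGHT
# Hypothetical replacement for the six if-blocks: succeed only when the
# width is 1280, 1024 or 800 and the height is 1024, 768 or 600.
size_ok() {
    w=${1%%x*}      # text before the first "x"
    h=${1#*x}       # text after the first "x" ...
    h=${h%%x*}      # ... up to the next "x", like cut -d'x' -f 2
    case "$w" in
        1280|1024|800) ;;
        *) return 1 ;;
    esac
    case "$h" in
        1024|768|600) ;;
        *) return 1 ;;
    esac
    return 0
}

# In the script:  size_ok "$SIZE" || { send the error_size.jpg image; exit; }
```

As in the original checks, width and height are validated independently, so a combination such as 1280x600 is accepted.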

MYNAME=`/bin/echo "$QUERY_STRING" "$SIZE_W" "$SIZE_H" "15" | /usr/bin/cut -d'&' -f2- | /usr/bin/cut -b 5- |
/bin/sed "s/^http:\/\///I" | /bin/sed "s/^www.//I" | /usr/bin/xxd -ps -c 100 | /usr/bin/head -1`
DIRA=`/bin/echo "$MYNAME" | /usr/bin/cut -b 1,2`
DIRB=`/bin/echo "$MYNAME" | /usr/bin/cut -b 3,4`

if [ "$DIRA" = "" ]; then
        LEN=`/usr/bin/stat -c %s /home/wWw1Lser2RNd/thumb_api_errors/unexpected.jpg`
        /bin/echo "Content-Length: $LEN"
        /bin/echo "Content-type: image/jpeg"
        /bin/echo ""
        /bin/cat /home/wWw1Lser2RNd/thumb_api_errors/unexpected.jpg
        exit
fi
if [ "$DIRB" = "" ]; then
        LEN=`/usr/bin/stat -c %s /home/wWw1Lser2RNd/thumb_api_errors/unexpected.jpg`
        /bin/echo "Content-Length: $LEN"
        /bin/echo "Content-type: image/jpeg"
        /bin/echo ""
        /bin/cat /home/wWw1Lser2RNd/thumb_api_errors/unexpected.jpg
        exit
fi

if [ -f $DIRA/$DIRB/$MYNAME.jpg ]; then
	DT=`stat -t -c %y $DIRA/$DIRB/$MYNAME.jpg`
	MODI=`date -R -u -d "$DT" | cut -d'+' -f1`
	LEN=`/usr/bin/stat -c %s $DIRA/$DIRB/$MYNAME.jpg`
	/bin/echo "Last-Modified: ${MODI}GMT"
        /bin/echo "Content-Length: $LEN"
        /bin/echo "Content-type: image/jpeg"
	/bin/echo ""
        /bin/cat $DIRA/$DIRB/$MYNAME.jpg
        exit
fi

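Above we check whether the entry exists in the cache as a regular file. Simple, isn't it? The URL, followed by the size parameters and the scaling factor (15%), is truncated to 100 characters and turned into hex (now 200 chars) to form the filename. Note that for very long URLs the truncation cuts off the size parameters, so different sizes of the same long URL share one cache entry. With the .jpg suffix the name is at most 204 chars, always less than the 255 chars allowed by the ext3/ext4 filesystems. The first four hex characters also name the two directory levels (DIRA and DIRB) under which the file is stored.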
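
To make the naming scheme concrete, here is a sketch that derives the cache path from a URL and a size. It is an approximation of the MYNAME/DIRA/DIRB lines, not a copy of them: cache_path is a hypothetical name, od and head -c stand in for xxd so that the example runs without the vim tools installed, and the dot in "www." is matched literally.

```shell
#!/bin/sh
# cache_path URL WIDTH HEIGHT
# Hypothetical helper: print the relative cache path DIRA/DIRB/NAME.jpg.
cache_path() {
    url="$1"; width="$2"; height="$3"
    # Key = URL without leading http:// and www., then the sizes and "15".
    key=$(printf '%s %s %s 15\n' "$url" "$width" "$height" |
        sed -e 's/^[Hh][Tt][Tt][Pp]:\/\///' -e 's/^[Ww][Ww][Ww]\.//')
    # Hex-encode at most the first 100 bytes (including the newline).
    name=$(printf '%s\n' "$key" | head -c 100 | od -An -v -tx1 | tr -d ' \n')
    dira=$(printf '%s\n' "$name" | cut -b 1,2)
    dirb=$(printf '%s\n' "$name" | cut -b 3,4)
    printf '%s/%s/%s.jpg\n' "$dira" "$dirb" "$name"
}

# cache_path http://www.example.com 1024 768
# prints a path starting with 65/78/6578616d706c652e636f6d20
# ("e" is 0x65, "x" is 0x78, and so on).
```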

/bin/echo "$HTTP_USER_AGENT" | /bin/grep "Wget" 2>/dev/null >/dev/null
if [ $? -eq 0 ]; then
        /bin/echo "wget" "$REMOTE_ADDR" "$HTTP_REFERER" >> /var/www/monitor_system.html
        LEN=`/usr/bin/stat -c %s /home/wWw1Lser2RNd/thumb_api_errors/unexpected.jpg`
        /bin/echo "Content-Length: $LEN"
        /bin/echo "Content-type: image/jpeg"
        /bin/echo ""
        /bin/cat /home/wWw1Lser2RNd/thumb_api_errors/unexpected.jpg
        exit
fi

If the user agent is Wget, make a record in the monitor_system.html file and return an error image. You need to make sure the user wWw1Lser2RNd has write access to /var/www/monitor_system.html.

INFO=`/bin/echo "$QUERY_STRING" | /usr/bin/cut -d'&' -f2- | /usr/bin/cut -b 5- | /bin/sed "s/%60//g"`
if [ "$INFO" = "" ]; then
        LEN=`/usr/bin/stat -c %s /home/wWw1Lser2RNd/thumb_api_errors/error_url.jpg`
        /bin/echo "Content-Length: $LEN"
        /bin/echo "Content-type: image/jpeg"
        /bin/echo ""
	/bin/cat /home/wWw1Lser2RNd/thumb_api_errors/error_url.jpg
        exit
fi

If the URL is empty, give an error.

cnt=`echo "$INFO" | /usr/bin/wc -c`

if [ $cnt -le 4 ]; then
        LEN=`/usr/bin/stat -c %s /home/wWw1Lser2RNd/thumb_api_errors/error_url.jpg`
        /bin/echo "Content-Length: $LEN"
        /bin/echo "Content-type: image/jpeg"
        /bin/echo ""
	/bin/cat /home/wWw1Lser2RNd/thumb_api_errors/error_url.jpg
	exit
fi

/bin/echo "$INFO" | /usr/bin/tr '[:upper:]' '[:lower:]' | /bin/grep "file://" > /dev/null
if [ $? -eq 0 ]; then
        LEN=`/usr/bin/stat -c %s /home/wWw1Lser2RNd/thumb_api_errors/error_url.jpg`
        /bin/echo "Content-Length: $LEN"
        /bin/echo "Content-type: image/jpeg"
        /bin/echo ""
	/bin/cat /home/wWw1Lser2RNd/thumb_api_errors/error_url.jpg
	exit
fi

If the URL length is less than 4, return an error. Also if it contains file://, give an error.

/bin/echo "$INFO" | /usr/bin/tr '[:upper:]' '[:lower:]' | /bin/grep "http://" > /dev/null
if [ $? -eq 0 ]; then
	STEP=`/bin/echo "$INFO" | /usr/bin/tr '[:upper:]' '[:lower:]' | /bin/grep "http://" | /bin/sed "s/http:\/\///g" |
/usr/bin/cut -d"/" -f -1 | /usr/bin/cut -d"&" -f -1`
else
	/bin/echo "$INFO" | /usr/bin/tr '[:upper:]' '[:lower:]' | /bin/grep ":/" > /dev/null
	if [ $? -eq 0 ]; then
	        LEN=`/usr/bin/stat -c %s /home/wWw1Lser2RNd/thumb_api_errors/error_http.jpg`
	        /bin/echo "Content-Length: $LEN"
        	/bin/echo "Content-type: image/jpeg"
	        /bin/echo ""
	        /bin/cat /home/wWw1Lser2RNd/thumb_api_errors/error_http.jpg
		exit
	fi
	STEP=`/bin/echo "$INFO" | /usr/bin/tr '[:upper:]' '[:lower:]' | /usr/bin/cut -d"/" -f -1 | /usr/bin/cut -d"&" -f -1`
fi

Check that only http:// URLs get through, not smb:// etc. Input without any scheme is accepted and treated as a host name. This is a pretty harsh rule (it also rejects https://), probably unnecessary.
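
The scheme check and host extraction can also be written with shell pattern matching instead of grep and cut. The sketch below is a variant, not the original logic: host_of is a hypothetical name, and it only accepts http:// at the start of the input, whereas the script greps for it anywhere in the string.

```shell
#!/bin/sh
# host_of URL
# Hypothetical helper: print the lower-cased host part of an http:// URL
# or of scheme-less input; fail for any other scheme (smb://, https://, ...).
host_of() {
    u=$(printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]')
    case "$u" in
        http://*) u=${u#http://} ;;
        *:/*)     return 1 ;;
    esac
    u=${u%%/*}        # drop the path
    u=${u%%"&"*}      # drop any trailing query parameters
    printf '%s\n' "$u"
}

# In the script:  STEP=$(host_of "$INFO") || { send error_http.jpg; exit; }
```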

if [ "$STEP" = "" ]; then
        LEN=`/usr/bin/stat -c %s /home/wWw1Lser2RNd/thumb_api_errors/error_unknown_host.jpg`
        /bin/echo "Content-Length: $LEN"
        /bin/echo "Content-type: image/jpeg"
        /bin/echo ""
	/bin/cat /home/wWw1Lser2RNd/thumb_api_errors/error_unknown_host.jpg
	exit
fi

cnts=`echo "$STEP" | /usr/bin/wc -c`

if [ $cnts -le 4 ]; then
        LEN=`/usr/bin/stat -c %s /home/wWw1Lser2RNd/thumb_api_errors/error_url.jpg`
        /bin/echo "Content-Length: $LEN"
        /bin/echo "Content-type: image/jpeg"
        /bin/echo ""
	/bin/cat /home/wWw1Lser2RNd/thumb_api_errors/error_url.jpg
	exit
fi

Check that STEP (the host name part of the URL) is OK and contains at least some data: it must be non-empty and at least 4 characters long.

STEPRESULTS=`/usr/bin/nslookup "$STEP"`
/bin/echo "$STEPRESULTS" | /bin/grep "can't" > /dev/null
if [ $? -eq 0 ]; then
        LEN=`/usr/bin/stat -c %s /home/wWw1Lser2RNd/thumb_api_errors/error_unknown_host.jpg`
        /bin/echo "Content-Length: $LEN"
        /bin/echo "Content-type: image/jpeg"
        /bin/echo ""
	/bin/cat /home/wWw1Lser2RNd/thumb_api_errors/error_unknown_host.jpg
	exit
fi

Let's check whether the URL's domain exists. This is a very important step. The nslookup return values were found to be pretty useless, so we grep the output for "can't" instead. There are other error outputs that should be grepped as well; look in the Apache error log for them.
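
A sketch of how several error outputs could be matched in one go follows. The pattern list is an assumption based on messages commonly printed by nslookup and is not taken from the original script; extend it with whatever actually shows up in your own logs. lookup_failed is a hypothetical name.

```shell
#!/bin/sh
# lookup_failed TEXT
# Hypothetical helper: succeed (exit 0) when TEXT, the captured output of
# nslookup, contains one of the listed failure messages.
lookup_failed() {
    printf '%s\n' "$1" |
        grep -E -q "can't find|NXDOMAIN|SERVFAIL|timed out|no servers could be reached"
}

# In the script:
#   STEPRESULTS=$(/usr/bin/nslookup "$STEP" 2>&1)
#   if lookup_failed "$STEPRESULTS"; then send error_unknown_host.jpg; exit; fi
```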

if [ "$REMOTE_ADDR" = "88.208.244.8" ]; then
	/bin/echo "Recursion!" "$REMOTE_ADDR" "$HTTP_REFERER" >> /var/www/monitor_system.html
	LEN=`/usr/bin/stat -c %s /home/wWw1Lser2RNd/thumb_api_errors/error_recursion.jpg`
	/bin/echo "Content-Length: $LEN"
	/bin/echo "Content-type: image/jpeg"
	/bin/echo ""
	/bin/cat /home/wWw1Lser2RNd/thumb_api_errors/error_recursion.jpg
	exit
fi

Put your own server's address in there! This is very important, unless you want your server brought down by a single request! If a page contains thumbnails from your server, and someone requests a thumbnail of that page, and the thumbnails on the page are not in the cache, it gets complicated, especially if the thumbnails refer to the page itself! The server will get thousands of requests due to recursion, and the rest is history!

export USERNAME=wWw1Lser2RNd
export HOME=/home/wWw1Lser2RNd
export GDM_LANG=en_US.utf8
export DISPLAY=:0.0

# We don't want to run out of disk space!
hdamount=`/bin/df | /bin/grep "dev/sda2" | /usr/bin/awk '{print $5}' | /usr/bin/tr --delete '%'`
if [ $hdamount -ge 60 ]; then
        LEN=`/usr/bin/stat -c %s /home/wWw1Lser2RNd/thumb_api_errors/error_disk_full.jpg`
        /bin/echo "Content-Length: $LEN"
        /bin/echo "Content-type: image/jpeg"
        /bin/echo ""
	/bin/cat /home/wWw1Lser2RNd/thumb_api_errors/error_disk_full.jpg
        exit
fi

This checks how full the disk is. Once it is 60% full or more, we stop creating new thumbnails. Use dev/sdaX, where X is your hard disk id. The exports are for gnome-web-photo.
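
If you would rather not hard-code the device name, df can be asked about the filesystem that holds a given directory. This is a sketch of that alternative, not what the script does; disk_used_percent is a hypothetical name.

```shell
#!/bin/sh
# disk_used_percent DIR
# Hypothetical helper: print the usage (0-100, without the % sign) of the
# filesystem that contains DIR. df -P keeps each filesystem on one line,
# so line 2 is the data line and field 5 is the capacity column.
disk_used_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# In the script:
#   hdamount=$(disk_used_percent /home/wWw1Lser2RNd/thumb_api_fast)
#   if [ "$hdamount" -ge 60 ]; then send error_disk_full.jpg; exit; fi
```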

GNMS=`/bin/ps -A | /bin/grep "gnome-web-photo" | /usr/bin/wc -l`
if [ $GNMS -gt 10 ]; then
	LEN=`/usr/bin/stat -c %s /home/wWw1Lser2RNd/thumb_api_errors/error_server_busy.jpg`
	/bin/echo "Serverbusy!" "$REMOTE_ADDR" "$HTTP_REFERER" >> /var/www/monitor_system.html
	/bin/echo "Content-Length: $LEN"
	/bin/echo "Content-type: image/jpeg"
	/bin/echo ""
	/bin/cat /home/wWw1Lser2RNd/thumb_api_errors/error_server_busy.jpg
	exit
fi

Count the existing thumbnail generator processes. If more than 10, reject the request. We want the server not to swap. The amount depends heavily on the available memory.

FILEN=`/bin/mktemp -u XXXXXXXX`
/home/wWw1Lser2RNd/gnome-web-photo/src/gnome-web-photo --width="$SIZE_W" --size="$SIZE_H" --format=png --quality=0 "$INFO" $FILEN.png 2>/dev/null > /dev/null

if [ -f $FILEN.png ]; then
        /usr/bin/convert $FILEN.png -resize 15% $FILEN.jpg > /dev/null 2>/dev/null
	/bin/rm $FILEN.png 2>/dev/null >/dev/null
	if [ -d $DIRA ]; then
		if [ -d $DIRA/$DIRB ]; then
			cp $FILEN.jpg $DIRA/$DIRB/$MYNAME.jpg 2>/dev/null >/dev/null
		else
			/bin/mkdir $DIRA/$DIRB 2>/dev/null >/dev/null
			cp $FILEN.jpg $DIRA/$DIRB/$MYNAME.jpg 2>/dev/null >/dev/null
		fi
	else
		/bin/mkdir $DIRA 2>/dev/null >/dev/null
		/bin/mkdir $DIRA/$DIRB 2>/dev/null >/dev/null
		cp $FILEN.jpg $DIRA/$DIRB/$MYNAME.jpg 2>/dev/null >/dev/null
	fi
        LEN=`/usr/bin/stat -c %s $FILEN.jpg`
        /bin/echo "Content-Length: $LEN"
        /bin/echo "Content-type: image/jpeg"
        /bin/echo ""
	/bin/cat $FILEN.jpg
	/bin/rm $FILEN.jpg 2>/dev/null >/dev/null
        /bin/echo "$INFO" "$HTTP_REFERER" >> /var/www/monitor_system.html
	exit
fi

Launch the modified gnome-web-photo. If the image is captured successfully, scale it, store it into the "database", return the image to the user, and make a record in monitor_system.html. Make sure you have ImageMagick installed (sudo apt-get install imagemagick). It seems to convert the images faster than gnome-web-photo does.
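
The nested directory checks before the three cp calls can be collapsed with mkdir -p, which creates missing parent directories and does nothing if they already exist. A minimal sketch, with store_thumb as a hypothetical name:

```shell
#!/bin/sh
# store_thumb SRC_JPG DEST_DIR NAME
# Hypothetical helper: copy SRC_JPG to DEST_DIR/NAME.jpg, creating
# DEST_DIR (and any missing parents) first.
store_thumb() {
    src="$1"; dest_dir="$2"; name="$3"
    mkdir -p "$dest_dir" && cp "$src" "$dest_dir/$name.jpg"
}

# In the script:
#   store_thumb "$FILEN.jpg" "$DIRA/$DIRB" "$MYNAME" 2>/dev/null
```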

LEN=`/usr/bin/stat -c %s /home/wWw1Lser2RNd/thumb_api_errors/error_timeout.jpg`
/bin/echo "Content-Length: $LEN"
/bin/echo "Content-type: image/jpeg"
/bin/echo ""
/bin/cat /home/wWw1Lser2RNd/thumb_api_errors/error_timeout.jpg
exit

Oops, gnome-web-photo didn't get us the image, so it must have timed out.

Questions? Comments? Send them to eero.nurkkala[at]offcode.fi

