Friday, 15 August 2008

crazy idea: user supported snapshot.debian.net

I just had this crazy idea for solving snapshot.debian.net's space issue, and it just might work if done right.

We all know that snapshot.debian.net ran out of space in early May this year, and we all lost a valuable service. Since there are probably many users, developers and maintainers who find this service useful, it might make sense to think that the users themselves could solve the problem.

How? Think about implementing a truly distributed storage/file system (not necessarily a real file system) that relies on cluster nodes which can enter or exit at any time and are contributed by users, much like the Google File System (or the clients in a torrent swarm).
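To make this a bit more concrete, here is a rough sketch in Python (all names and numbers are made up) of the kind of index such a cluster would need: it maps each chunk's digest to the set of volunteer nodes currently advertising that chunk, and it has to cope with nodes joining and leaving at will.

    import hashlib

    CHUNK_SIZE = 256 * 1024  # arbitrary chunk size, just for the sketch

    class ChunkIndex:
        def __init__(self):
            # sha1 hex digest of a chunk -> set of node ids that hold it
            self.locations = {}

        def add_file(self, data):
            """Split a file into chunks; return the chunk digests in order."""
            digests = []
            for off in range(0, len(data), CHUNK_SIZE):
                d = hashlib.sha1(data[off:off + CHUNK_SIZE]).hexdigest()
                self.locations.setdefault(d, set())
                digests.append(d)
            return digests

        def node_has(self, node_id, digest):
            """A volunteer node announces that it stores a chunk."""
            self.locations.setdefault(digest, set()).add(node_id)

        def node_left(self, node_id):
            """Forget a node that dropped out of the network."""
            for holders in self.locations.values():
                holders.discard(node_id)

        def rare_chunks(self, threshold=3):
            """Chunks held by fewer than `threshold` nodes need re-replication."""
            return [d for d, h in self.locations.items() if len(h) < threshold]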


First problems that I can already see:
  • there must be some indexing service running 24x7 to keep track of which blocks are wanted and where in the cluster they are stored
  • some data should always be kept on a trusted machine - the GPG signatures, the Packages index files and the metadata
  • there might be a security risk involved since data would be stored on untrusted machines, but a sha1/md5 check in combination with the GPG signature should be enough (which isn't always doable - see large packages like openoffice.org or game data packages); if there are reasons why this might not be secure, maybe we need to rethink the dpkg-sig package a little and see if we can get all the packages in the archive individually signed when they enter the archive, or something of that sort
  • the metadata of the FS cluster might be really big, but my gut feeling tells me it won't be as big as the current snapshot.d.n storage
  • when a file is requested, it needs to be cached on a central server so it can be assembled and its checksum verified before being delivered to the requesting client (see the sketch after this list)
  • people might donate really little space compared to their usage
  • some nodes of the network are special, and this could lead to the failure of the entire infrastructure if those special nodes fail; some of these nodes might also turn out to need really big muscles
  • some data might be unavailable at times, depending on which clients are connected to the network at any given moment
  • in an attempt to create more copies of a block across more nodes, a DoS might be triggered if many clients request the same rare info from the (set of) node(s) containing it - still, torrent's way of working seems to cope quite well with that model
  • none of this is implemented and I probably won't do it myself, although it could be a really nice project, even if it is doomed to fail from day 1
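To illustrate the caching/verification point from the list, here is a small Python sketch (hypothetical helper names, and assuming fetch_chunk(digest) pulls a chunk from whichever volunteer node currently holds it): the central server checks every chunk against its content address, reassembles the file, and only serves it if the whole-file digest matches the one recorded in the trusted, GPG-signed metadata.

    import hashlib

    def assemble_and_verify(chunk_digests, fetch_chunk, trusted_sha1):
        parts = []
        for d in chunk_digests:
            chunk = fetch_chunk(d)
            # chunks are content-addressed, so each one can be checked on arrival
            if hashlib.sha1(chunk).hexdigest() != d:
                raise ValueError("node returned a corrupt chunk: %s" % d)
            parts.append(chunk)
        blob = b"".join(parts)
        # the whole-file checksum comes from the trusted (signed) metadata
        if hashlib.sha1(blob).hexdigest() != trusted_sha1:
            raise ValueError("assembled file failed verification, not serving it")
        return blob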

Other ideas:
  • maybe it would make more sense for the FS to actually be an ever-growing torrent which people can connect to via some special client that allows pushing data to the clients so they can store it
  • the number of copies of a file (or chunk of a file) in the distributed storage should be higher for more recent packages and lower for older ones (still, one copy might need to be stored on safe nodes - i.e. controlled and trusted by Debian so the information doesn't get lost - which brings us back to our current storage problem); see the sketch after this list
  • probably the entire thing could be built by piggy-backing on the current apt-transport-debtorrent implementation or debtorrent, or on apt's cache - the idea being that although not all files might be available, some people might still have relatively recent package files in their apt cache
  • such a system, if done right, might prove usable as a live backup system between different nodes in the cluster, so it could also be used by individuals in other situations, without the need for a central server - think trackerless torrents
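And a toy version of the age-based replication policy from the second bullet above (the numbers are completely made up): recent packages get many copies spread over volunteer nodes, older ones fewer, and everything keeps at least one copy on a Debian-controlled safe node so nothing is ever lost outright.

    def target_replicas(age_days, safe_copies=1):
        """How many copies of a chunk the network should aim for."""
        if age_days < 30:        # fresh packages: heavily requested
            volunteer_copies = 10
        elif age_days < 365:     # current-ish packages
            volunteer_copies = 5
        else:                    # old history: rarely needed
            volunteer_copies = 2
        return safe_copies + volunteer_copies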
What do other people think?

5 comments:

Anonymous said...

“Ganneff announced that snapshot.d.o will become an official service (ETA: some months).”

<20080814211643.GA12330@xanadu.blop.info>

Anonymous said...

You may wanna take a look at Tahoe/Allmydata, a distributed storage system designed for unreliable nodes.

Anonymous said...

Hi Eddyp, how's Debconf? :)

I just asked Fumitoshi, the snapshot maintainer, for an update.
He said the snapshot archiving was still alive, but the indexing was dead.
He estimated it would take about a day to fix the problem by hand, but it depends on when he can free up some time from his busy schedule.

Anyway, building an alternative service is a good way to avoid a single point of failure.

Thanks,

Javi said...

why not with debdiffs? (it would reduce space a lot)

They call me Brett said...

My first thought was to put all the files in git. I would guess most file chunks from different versions of the same package would match and so not take up extra space.