ArchiveBox or similar for shared archiving of research project
from Stopwatch1986@lemmy.ml to selfhosted@lemmy.world on 22 Jun 18:24
https://lemmy.ml/post/49100701

I am one of a network of academic researchers from around the world working on collecting media market data. One problem is that referenced sources often disappear, which makes later validation difficult or impossible. So, I thought I would recommend self-hosting something like archive.org that would allow affiliated researchers to submit their web references and have their sources efficiently archived in a central project repository. That would allow validation and continuity when web-hosted text and files disappear or researchers leave.

I have been looking at ArchiveBox. If you have experience of this or a similar solution, would that fit the bill? The important thing is efficiency for researchers submitting/retrieving pages and files, and openness in structure and formats so that the archive would remain useful if ArchiveBox or similar disappears. FOSS of course means you can’t be locked out anyway.

#selfhosted


irmadlad@lemmy.world on 22 Jun 18:41

I use ArchiveBox occasionally to archive websites into a browsable, offline copy. The copy stays usable regardless of the data disappearing online, and independently of whether or not ArchiveBox is in operation after the archiving finishes, provided of course that you persist the data locally. I’ve archived several self-hosted sites because they contained data I would like to conserve for personal use at a later date. It does it quite thoroughly, tho obviously large sites would take a little time to ingest. It might be worth spinning up a Docker instance and running it through its paces to see if it would fit your criteria.

Stopwatch1986@lemmy.ml on 23 Jun 12:06

I wonder if an authorised remote user (ie an affiliated researcher) can easily instruct ArchiveBox to store a URL and later retrieve it. Also, ideally a random user should be able to retrieve the archived web page or file (eg a PDF, CSV etc). The idea is that authorised researchers can get URLs archived, and then any user reading our reports can click on a citation and get our archived source if the original is not available any more. I’ll need to run it and see, but it looks promising.
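As a rough sketch of the "click a citation, get the archived source if the original is gone" idea, something like the following could sit behind the citation links. It is not tied to ArchiveBox or any other tool: the function names and the archive URL layout are made up for illustration, and a real deployment would need to decide what counts as "not available" (soft 404s, paywalls, and so on).

```python
import http.client
import urllib.request
from typing import Callable


def is_reachable(url: str, timeout: float = 10.0) -> bool:
    """Return True if a HEAD request to url ends in a 2xx/3xx status."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError, http.client.HTTPException):
        # Covers DNS failures, timeouts, HTTP errors and malformed URLs.
        return False


def resolve_citation(
    original_url: str,
    archived_url: str,
    check: Callable[[str], bool] = is_reachable,
) -> str:
    """Prefer the live source; fall back to the project's archived copy."""
    return original_url if check(original_url) else archived_url
```

For example, `resolve_citation("https://example.com/report.pdf", "https://archive.example.org/0042/report.pdf")` would hand back the second URL once the first stops responding. Both URLs here are placeholders.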

Keeping the archive alive for years later, possibly after funding dries up, is another challenge but there are public repositories that may be suitable for that.

irmadlad@lemmy.world on 23 Jun 14:59

I wonder if an authorised remote user (ie an affiliated researcher) can easily instruct ArchiveBox to store a URL and later retrieve it

Once you download the data and persist it on local storage, it’s available to whomever has access to that drive or server.

Also, ideally a random user should be able to retrieve the archived web page or file (eg a PDF, CSV etc).

For rando access, you could put the data on a public ftp server, or even get fancier with html styled pages. If I understand you correctly, you want a random user to be reading your report that has citations, so that when a rando user clicks the citation, they are presented with whatever you downloaded with ArchiveBox. Kind of Wikipedia style. Speaking of which, a wiki framework might be just the ticket you are looking for.

Download the data, integrate it into a selfhosted wiki, and it would be available to rando users. Of course your wiki server will have to have all the accoutrements of security so you don’t get hacked by a bazillion bots.

Stopwatch1986@lemmy.ml on 24 Jun 12:41

A wiki is a good idea. Putting a SingleFile or similar all-in-one file in a repository and providing index numbers organised as a look-up table would also work for easy retrieval by a random research user. Both require some admin and more effort from the researchers.
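A minimal sketch of what such a look-up table could be, assuming the snapshots simply sit as files in one folder. The function name, the CSV columns and the SHA-256 column are my own choices for illustration, not something SingleFile provides.

```python
import csv
import hashlib
from pathlib import Path


def build_index(snapshot_dir: Path, index_path: Path) -> list[dict]:
    """Write a CSV look-up table (index number, file name, SHA-256) for a folder."""
    index_resolved = index_path.resolve()
    # Sort by name so the numbering is reproducible for a given set of files.
    files = sorted(
        p for p in snapshot_dir.iterdir()
        if p.is_file() and p.resolve() != index_resolved
    )
    rows = []
    for number, path in enumerate(files, start=1):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        rows.append({"index": number, "file": path.name, "sha256": digest})
    with index_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["index", "file", "sha256"])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Numbering by sorted file name is only stable if new files always sort last (for example with a date or counter prefix in the name). Otherwise the index number would have to be assigned once and stored rather than recomputed.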

I wish there was a hostable version of archive.is for near-zero maintenance. You just submit a URL over the internet and the web page is cached once along with a screenshot. Then, anyone can access the archived version. This can be done already with archive.is but we have no control over its future, which is critical for long-term dependable archiving.

irmadlad@lemmy.world on 24 Jun 15:52

This can be done already with archive.is but we have no control

Did a little digging this morning. I honestly can’t find a selfhosted, archive.is alternative. All the solutions I came up with are either paid for and online use only, or free, but still online use only.

Stopwatch1986@lemmy.ml on 24 Jun 19:45

Thanks for doing the digging. An archivist may know something more. Or the archive.is people.

irmadlad@lemmy.world on 24 Jun 19:56

It might be worthwhile to run your scenario by the folks at lemmy.world/c/datahoarder

hexagonwin@lemmy.today on 23 Jun 09:49

webrecorder browsertrix should work for this. they even have a hosted/paid service which could be better than selfhosting depending on the circumstance.

saving as html with singlefile and sharing manually could be easier/simpler, the concept is easy to understand for non computer people imo.

other than that i recently found out about hoardy-web. it doesn’t really fit your usecase though, as it basically saves everything you see in your browser for personal archiving. very well made but somehow it isn’t as widely known as other stuff in this area…

Stopwatch1986@lemmy.ml on 23 Jun 12:29

Having webrecorder host our archived pages is both an advantage and a disadvantage: the archive may survive longer than our project, or not as long.

I have been using singlefile for years. It’s great, but not for seamlessly making cached web pages available to members of the general public who are reading our reports and find that cited links are now dead. And it doesn’t support URLs that point to PDF or CSV files. A public-facing repository of singlefile files with an index for ToC might do it though. Simplicity is good for future-proofing an archive.

Something like archive.org and archive.is would be ideal, but we have no control over their future and practices.

moonpiedumplings@programming.dev on 24 Jun 16:28

Check out Zotero: www.zotero.org

Zotero is an open source bibliography manager. It’s my main go to tool for generating works cited pages, like during essays.

But, it also has a browser extension, which can download and archive the sites or academic articles you are adding to your sources. I would then use the fulltext search that zotero provides for easy searching of sources.

Unfortunately, it’s not hosted, which would make it difficult to share.

EDIT: It does look like the server component is open source, AGPLv3: github.com/zotero/dataserver/

But, I cannot find any deployment instructions. But, it looks like their hosted version lets users create groups of shared items, including sharing archived snapshots of the various items.

Stopwatch1986@lemmy.ml on 24 Jun 19:39

I have been using Zotero every day for more than two decades and somehow it hasn’t crossed my mind. You may be on to something.

Zotero supports public and private shared bibliographies that you can subscribe to through the client or their web interface. Each entry contains the bibliographical details, notes attachments, file attachments and links to local files. It also captures webpages and metadata through the browser addon. The local database can be backed up and, if self-hosted, you have control. The best part is that academic researchers will be familiar with the software and process. One downside is that the cached file is not independently archived so it could be tampered with. Thanks for the idea.

moonpiedumplings@programming.dev on 24 Jun 20:12

One downside is that the cached file is not independently archived so it could be tampered with. Thanks for the idea.

You could have multiple researchers archive it and store copies independently. Then tampering would show up across copies.
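A rough sketch of how that cross-check could look, assuming each researcher keeps a copy of the same snapshot file and the copies are compared by SHA-256. The holder names and function names are hypothetical. Note that this only works for byte-identical copies (a shared snapshot, or a static file such as a PDF): two separate captures of a live web page will usually differ anyway because of timestamps, ads and the like.

```python
import hashlib
from collections import Counter
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_outliers(copies: dict[str, Path]) -> list[str]:
    """Return the holders whose copy differs from the most common digest."""
    digests = {holder: sha256_of(path) for holder, path in copies.items()}
    # Assumes a clear majority; with a tie, the first digest seen wins.
    majority_digest, _ = Counter(digests.values()).most_common(1)[0]
    return sorted(h for h, d in digests.items() if d != majority_digest)
```

With three holders where one file has been altered, `find_outliers` returns just that holder's name. With only two copies it can say that they differ but not which one was changed.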

Unfortunately, central hosting doesn’t guarantee that it is tamper free. The host could be hacked, or could be malicious. Archive.is was caught tampering with their archived pages:

en.wikipedia.org/…/Wikipedia:Archive.today_guidan…