Data Access and Transfer
Where is my CryoEM Data?
All data is kept at the SLAC Shared Scientific Data Facility (SDF), which provides the petaflops of compute and petabytes of storage needed for cryoEM data set management and analysis. The SDF consists of many high-powered compute servers interconnected by high-speed networking fabrics. These are in turn connected to high-performance storage that is available from all centrally managed hosts.
Introduction
There are numerous ways to access your data. Each has its pros and cons, which are discussed below. Key points common to all access methods:
- You will need a SLAC Unix computer account to access your data.
- All experimental data is stored under this directory path structure: /sdf/group/cryoem/exp/YYYYMM/YYYYMMDD-AA##_BBBB
  - YYYY is the year, MM is the month, and DD is the day.
  - AA## is the 4- or 5-character alphanumeric project number assigned to your project.
  - BBBB is the instrument used in your experiment:
    - TEM1
    - TEM2
    - TEM3
    - TEM4
    - TEMALPHA
    - TEMBETA
    - TEMGAMMA
    - TEMDELTA
  - For example, if your experiment name is 20220423-CA107, on instrument TEMBETA, then your data will be found at /sdf/group/cryoem/exp/202204/20220423-CA107_TEMBETA
- Permission to access data is controlled via the CryoEM eLogBook.
  - Anyone added as a collaborator to an experiment will have access to the data associated with that experiment.
- The SDF has multiple data access nodes: dtn01.slac.stanford.edu, dtn02.slac.stanford.edu, and dtn03.slac.stanford.edu.
- To access or check your data in the SDF, open a terminal emulator and run ssh <username>@dtn01.slac.stanford.edu, then enter your SLAC Unix account password when prompted.
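The path convention above can be sketched as a small shell function. This is only an illustration: exp_path is a hypothetical helper name, not a tool provided on the SDF, and it assumes the experiment name always begins with the eight-digit date.

```shell
#!/bin/sh
# Build the SDF directory path for an experiment.
# exp_path is a hypothetical helper, not an SDF-provided command.
exp_path() {
    exp_name=$1      # experiment name, e.g. 20220423-CA107
    instrument=$2    # instrument name, e.g. TEMBETA
    # The first six characters of the experiment name are YYYYMM.
    yyyymm=$(printf '%s' "$exp_name" | cut -c1-6)
    printf '/sdf/group/cryoem/exp/%s/%s_%s\n' "$yyyymm" "$exp_name" "$instrument"
}

exp_path 20220423-CA107 TEMBETA
# prints /sdf/group/cryoem/exp/202204/20220423-CA107_TEMBETA
```

Once logged in to one of the data access nodes, the printed path can be passed to ls or cd to check that the experiment directory exists and that you have permission to read it.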
Data Organization
Within your experiment's directory (i.e., YYYYMMDD-AA##_BBBB), two directories are created for each sample recorded in the cryoEM eLogBook. The first has a unique 24-character alphanumeric name and contains your data for that sample. The second is a symbolic link, named after the sample in the cryoEM eLogBook, that points to the corresponding alphanumeric directory for that sample.
Inside each sample directory is the output of our bespoke data processing pipeline, which performs frame alignment (i.e., motion correction) and CTF estimation before and after alignment, and posts results to an experiment-specific private Slack channel for near real-time feedback during data collection. The output directories from our data pipeline are as follows:
- raw
  - Raw contains the original data directory collected via EPU, Serial EPU, SerialEM, etc. that is transferred to the SDF from the microscope.
- aligned
  - Aligned contains the output from MotionCor2.
- summed
  - Summed contains the output from CTF estimation via CTFFind4.
- particles
  - Particles contains the output from particle picking via Relion.
- previews
  - Previews contains the image files that are posted to the experiment's Slack channel.
- logs
  - Logs contains the logs of all the preprocessing pipeline tasks: frame alignment, CTF estimation, particle picking, and preview generation.
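To see which alphanumeric directory each sample name points to, the symbolic links in an experiment directory can be listed with a short loop. The sketch below is illustrative only: list_samples is a hypothetical helper name, and whether the links are stored as relative or absolute paths on the SDF is not specified here, so the printed targets may look different from the test layout.

```shell
#!/bin/sh
# Print "sample name -> target directory" for every symbolic link
# found directly inside an experiment directory.
# list_samples is a hypothetical helper, not an SDF-provided command.
list_samples() {
    exp_dir=$1
    for entry in "$exp_dir"/*; do
        # Only the sample-name entries are symbolic links.
        if [ -L "$entry" ]; then
            printf '%s -> %s\n' "$(basename "$entry")" "$(readlink "$entry")"
        fi
    done
}

# Usage (run on a data access node):
#   list_samples /sdf/group/cryoem/exp/202204/20220423-CA107_TEMBETA
```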
Data Collected using EPU
The format of the EPU file naming scheme is:
FoilHole_[Hole ID]_Data_[Acquisition Area ID]_[date]_[time].mrc
- For example, in FoilHole_31545690_Data_31547881_31547882_20190601_081945.mrc:
  - [Hole ID] is 31545690
  - [Acquisition Area ID] is 31547881_31547882
  - [date] is 20190601 in yyyymmdd format
  - [time] is 081945 in 24-hour hhmmss format
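The naming scheme can be split into its fields with plain shell parameter expansion. This is a minimal sketch, assuming the file name follows exactly the pattern above with a .mrc extension; parse_epu_name is a hypothetical helper name, and names with extra suffixes (such as the "Fractions" stacks) are not handled.

```shell
#!/bin/sh
# Split an EPU file name into hole ID, acquisition area ID, date, and time.
# parse_epu_name is a hypothetical helper; it assumes the pattern
# FoilHole_[Hole ID]_Data_[Acquisition Area ID]_[date]_[time].mrc
parse_epu_name() {
    name=$(basename "$1" .mrc)       # drop any directory and the .mrc suffix
    rest=${name#FoilHole_}           # remove the fixed prefix
    hole_id=${rest%%_Data_*}         # everything before "_Data_"
    fields=${rest#*_Data_}           # area ID, date, and time
    acq_time=${fields##*_}           # last field: hhmmss
    fields=${fields%_*}
    acq_date=${fields##*_}           # next-to-last field: yyyymmdd
    area_id=${fields%_*}             # whatever remains is the area ID
    printf 'hole_id=%s\narea_id=%s\ndate=%s\ntime=%s\n' \
        "$hole_id" "$area_id" "$acq_date" "$acq_time"
}

parse_epu_name FoilHole_31545690_Data_31547881_31547882_20190601_081945.mrc
# prints:
# hole_id=31545690
# area_id=31547881_31547882
# date=20190601
# time=081945
```

Peeling the date and time off the right-hand end first means the acquisition area ID may itself contain underscores, as it does in the example.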
EPU organizes these files in the following directory scheme by grid square with the top directory being the EPU session name, located in your raw directory:
- EPU Session
  - Metadata
  - Images-Disc1
    - GridSquare_########
      - FoilHoles
      - Data
    - ...
    - GridSquare_########
      - FoilHoles
      - Data
    - GridSquare_########
Each "FoilHoles" directory contains images of the foil hole that data was collected from in both .jpg and .mrc format. Each "Data" directory under an EPU session contains the files stored for each image acquisition.
Each image acquisition in EPU results in six files:
- A high-quality MRC or TIFF image file, which is the unaligned sum of the dose-fraction image stack, plus a .jpg copy of this image.
- A high-quality MRC or TIFF image stack file. This is your raw data: an unaligned stack of dose-fraction images, with "Fractions" appended to the file naming scheme. A checksum file ending in .dm5, created when this file is transferred to the data server, can also be found here.
- Two XML files with metadata, one for the raw image stack and one for the unaligned summed image.
| | Globus | SAMBA | SSHFS | RSYNC/SCP | BBCP |
|---|---|---|---|---|---|
| Software Install | Local Clients available. | Implemented in Mac and Windows | Requires FUSE | Command line tools | Command line |
| Graphical Interface | Yes (web based) | Yes (OS based) | Yes (OS based) | Potentially | No |
| Command Line Interface | Yes | Yes (standard OS) | Yes (standard | Yes (standard OS) | Yes (standard |
| Performance | Fast | Fast | Slow | Slow | Fast |
| Access | Anywhere | At SLAC only | Anywhere | Anywhere | Anywhere |
| Credentials | Globus ID & SLAC Unix | SLAC Windows | SLAC Unix | SLAC Unix | SLAC Unix |
| Ease of Use | Easy | Easy | Medium | Difficult | Difficult |
Globus
Globus is a group at the University of Chicago that develops and operates a non-profit service for use by the research community. Globus was originally developed in 1997 to enable grid computing, with the approach that, by connecting computing resources, data can be freed from its initial source and made portable, even if it is huge. The Globus project has grown considerably from its beginnings in the research of Ian Foster, Carl Kesselman, and Steve Tuecke. Staying true to those roots in scientific research, Globus services are used today by tens of thousands of researchers at many hundreds of universities, laboratories, and computing facilities around the world.
The mission of Globus is to help researchers focus on their research rather than on IT issues, by providing users (as well as administrators of computing facilities and labs) with powerful tools built for solving the problems of data-intensive research. Globus enables virtual organizations to collaborate across organizational boundaries by staying at the forefront of conversations about research data management, campus bridging, data planning, and models for creating sustainable software. Globus products and services are developed and operated by the University of Chicago and Argonne National Laboratory, supported by funding from the Department of Energy, the National Science Foundation, and the National Institutes of Health, as well as the generosity of subscribers.
- First, create a free Globus account at https://globus.org/.
- From there, you can download a local client (Globus Connect Personal) so that files can be copied to, say, your workstation, or staged to another Globus Endpoint that your institution may run.
- The Endpoint (i.e., Collection) for SLAC CryoEM's data is slac#cryoem.
- Once you have a Globus account, log in to the slac#cryoem endpoint with your SLAC Unix account credentials and navigate to your experiment's directory path.
- First-time access via Globus requires your SLAC Unix account to be added to the CryoEM Globus AFS group at SLAC.
  - Contact our Information Systems Administrator, Patrick J. Pascual, or our Information Systems Specialist, Yee-Ting Li, via email or Slack direct message, to be added.
- For step-by-step instructions, see the "How to Log in and Transfer Files with Globus" guide by the Globus developers.
SAMBA
SAMBA is the standard Windows interoperability suite of programs for Linux and Unix. Since 1992, SAMBA has provided secure, stable, and fast file and print services for all clients using the Server Message Block/Common Internet File System (SMB/CIFS) protocol, such as all versions of DOS and Windows, OS/2, Linux and many others.
If you are onsite at SLAC, you can access the data via the SMB/CIFS protocol using SAMBA. Connect to zslaccfs to browse the global directory. From there, you can access the cryoEM data storage disks under cryoem.
Log in with your SLAC Windows account to use this method.
- On your Linux machine, open a terminal window.
- Install the necessary software with the command: sudo apt-get install -y samba samba-common python-glade2 system-config-samba
- Type your sudo password and press Enter.
- Allow the installation to complete.
- Open a new file browser window.
- At the bottom of the left navigation pane, click "Other Locations".
- At the bottom of the window, in the "Connect to Server" field, type smb://zslaccfs/
- Open the cryoem directory.
- Log in with your SLAC Windows account username and password. Leave the "Workgroup" field at its default value or, if you are using Ubuntu, set the domain to "slac".
- You can now browse the CryoEM disks in your Linux file browser.
Alternatively, you can mount it from the command line.
- Install cifs-utils: sudo apt-get install cifs-utils
- Create a directory where you want to mount it: sudo mkdir /mnt/slac
- Mount the share (a single command, entered on one line): sudo mount -t cifs -o username=SLAC_USERNAME,vers=1.0,domain=slac,uid=$(id -u) //zslaccfs.slac.stanford.edu/cryoem/ /mnt/slac
- Note that the uid option is important: $(id -u) expands to your numeric user ID, so that you as a user have the correct permissions on the mounted files on your local desktop.
SSHFS
You can also use the FUSE-based SSHFS to mount the filesystem over ssh. We recommend using dtn02.slac.stanford.edu as the ssh host for this method.
With SSHFS, the filesystem is mounted locally so that you can browse the directories as you would with SAMBA.
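A minimal sketch of an SSHFS mount is shown below, assuming sshfs is already installed on your machine. The function names, the DRY_RUN switch, and the choice of /sdf/group/cryoem/exp as the remote directory are illustrative assumptions rather than a site-mandated procedure. Setting DRY_RUN=1 prints the commands for review instead of running them.

```shell
#!/bin/sh
# Run a command, or only print it when DRY_RUN=1 (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Mount the cryoEM experiment tree on a local directory over ssh.
# mount_cryoem is a hypothetical helper name.
mount_cryoem() {
    user=$1          # your SLAC Unix account name
    mountpoint=$2    # local directory to mount on, e.g. ~/cryoem
    run mkdir -p "$mountpoint"
    # -o reconnect asks sshfs to re-establish the connection if it drops.
    run sshfs "$user@dtn02.slac.stanford.edu:/sdf/group/cryoem/exp" "$mountpoint" -o reconnect
}

# Usage:
#   mount_cryoem <username> ~/cryoem
```

When you are finished, unmount with fusermount -u on Linux or umount on macOS, giving the same local directory.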
RSYNC/SCP
Rsync can be used over ssh to bulk transfer/synchronise data across different locations. Please use dtn02.slac.stanford.edu for this purpose.
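As an illustration, the sketch below copies one experiment directory from the SDF to a local directory with rsync over ssh. The function names and the DRY_RUN switch are hypothetical conveniences, and the flags shown are one reasonable choice rather than a required set. With DRY_RUN=1 the command is printed instead of executed.

```shell
#!/bin/sh
# Run a command, or only print it when DRY_RUN=1 (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Copy the contents of a remote experiment directory to a local directory.
# pull_experiment is a hypothetical helper name.
pull_experiment() {
    user=$1          # your SLAC Unix account name
    remote_dir=$2    # e.g. /sdf/group/cryoem/exp/202204/20220423-CA107_TEMBETA
    local_dir=$3     # destination directory on your machine
    # -a preserves timestamps and permissions, -v lists files as they copy,
    # --partial keeps partly transferred files so a rerun can resume them,
    # and --progress shows per-file progress.
    run rsync -av --partial --progress -e ssh \
        "$user@dtn02.slac.stanford.edu:$remote_dir/" "$local_dir/"
}

# Usage:
#   pull_experiment <username> /sdf/group/cryoem/exp/202204/20220423-CA107_TEMBETA ./CA107
```

The trailing slash on the remote path tells rsync to copy the directory's contents into the destination rather than creating an extra nested directory. Rerunning the same command transfers only files that are new or changed.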
BBCP
BBCP provides multi-stream parallel data transfers, which allow transfer speeds significantly higher than those of scp. Contact our Information Systems Administrator, Patrick J. Pascual, or our Information Systems Specialist, Yee-Ting Li, via email or Slack direct message, if you wish to use this method of data transfer.