JWST Science Data Overview

This article provides a brief introduction to the JWST data products, their formats, and their naming conventions. Note that this is intended as a general-purpose summary; links are provided throughout to the more extensive ReadTheDocs documentation for further details.

The four stages of JWST science data products

Figure 1 shows the flow of data through the JWST Science Calibration Pipeline. This pipeline and the corresponding science data products that it produces can be divided into 4 main stages, depending on the degree of processing:

  • Stage 0: Uncalibrated raw data products from single exposures in units of total DN (e.g., "uncal.fits")
  • Stage 1: Data products that have been corrected for certain detector effects and converted to units of DN/s (e.g., "rate.fits", produced by the stage 1 pipeline)
  • Stage 2: Calibrated data products from single or multiple exposures with world coordinates and photometric information (e.g., "cal.fits", produced by the stage 2 pipeline)
  • Stage 3: Calibrated data products resulting from the combination of multiple exposures into a single integrated product (e.g., "i2d.fits", "s3d.fits", "x1d.fits", produced by the stage 3 pipeline)

All of these data stages are archived in MAST. For detailed information on all JWST data product types, see ReadTheDocs: Data Product Types.

Stage 0 uncalibrated data

Uncalibrated data (i.e., "*uncal.fits" files) are produced by the Science Data Processing (SDP) system from raw telemetry data downloaded from the spacecraft. This is the stage at which key header information is populated from a variety of sources including onboard telemetry, APT proposal planning, JPL spacecraft ephemerides, etc. General users will not interact with the SDP code, and instead start their processing with the "*uncal.fits" files that it produces.

These "*uncal.fits" files are input to stage 1 of the calibration pipeline and usually have 4 dimensions, since all JWST detectors use up-the-ramp readout (sometimes referred to as MULTIACCUM) in which pixel values increase between groups within a given integration. The first two dimensions are the column and row axes of the detector, while the third dimension is determined by the number of groups per integration, and the fourth dimension by the number of integrations in the exposure. Note that all 4 dimensions will be used, even if Ngroup = 1 or Nint = 1.
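As a rough illustration of this layout, the sketch below builds a synthetic ramp with NumPy. NumPy lists axes in the reverse order of the FITS NAXIS keywords, so the array shape reads (integrations, groups, rows, columns). The helper name and the count values are hypothetical and are not part of the pipeline.

```python
import numpy as np


def make_synthetic_ramp(nints, ngroups, nrows, ncols, dn_per_group=10):
    """Hypothetical helper: noiseless up-the-ramp data in NumPy axis order."""
    # Accumulated counts grow linearly with group number within an integration.
    counts = np.arange(1, ngroups + 1, dtype=np.uint16) * dn_per_group
    ramp = counts[np.newaxis, :, np.newaxis, np.newaxis]
    return np.broadcast_to(ramp, (nints, ngroups, nrows, ncols)).copy()


# NumPy shape is (Nint, Ngroup, rows, columns), the reverse of FITS NAXIS1..NAXIS4.
ramp = make_synthetic_ramp(nints=2, ngroups=5, nrows=4, ncols=6)
print(ramp.shape)

# All four axes are kept even when Ngroup = 1 and Nint = 1.
single = make_synthetic_ramp(nints=1, ngroups=1, nrows=4, ncols=6)
print(single.ndim, single.shape)
```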

These data come from single exposures and are usually contained within a single FITS file. However, when the raw data volume for an individual exposure is large enough, as for time-series observations, the uncalibrated data can be broken into multiple segments of less than 2 GB each to keep file sizes manageable. Such segmented exposures usually include "segNNN" in the file names, where NNN is 1-indexed and always includes any leading zeros.

Figure 1. Overview of JWST pipeline stages

Stage 1 products

Stage 1 pipeline products are produced by the calwebb_detector1 pipeline from single exposures and can be science or non-science products. Science products typically include two-dimensional final count rate images in units of DN/s (i.e., "*rate.fits" files) in addition to products generated by intermediate steps of the calwebb_detector1 pipeline (e.g., per-integration count rate images "*rateints.fits"). Some of these intermediate data are generated by default and can be retrieved from MAST, while others are optional products that can be generated by reprocessing the data offline with custom parameters. Non-science products can include dark exposures that use some of the same pipeline steps as regular science observations, or auxiliary products that provide relevant information about the data, such as charge trap state data.

Stage 2 products

Stage 2 pipeline products depend on the observing mode and are produced by the calwebb_image2 or calwebb_spec2 pipeline. These can be generated from a single exposure or from the combination of more than one exposure, but are usually two-dimensional calibrated images in units of MJy/sr (i.e., "*cal.fits"). The number and type of products generated will vary for imaging, spectroscopy, and TSO data. For spectroscopic data, it will also depend on how the observation was planned. When observations include background exposures, for instance, the background-subtracted data may also be produced by default for some JWST observing modes. Information about what data should be combined is captured in stage 2 association files. Note that spectroscopic data products have wavelengths given in the barycentric vacuum rest frame.

Stage 3 products

Stage 3 pipeline products result from the combination of stage 2 exposures into a single integrated product in stage 3 of the JWST science calibration pipeline. The exact form of processing depends on the JWST observing mode in use (e.g., imaging will be processed by the calwebb_image3 pipeline, spectroscopy by the calwebb_spec3 pipeline, etc.), but typically involves the combination of dithered observations onto a single output pixel grid. The type and number of data products of this stage vary with the type of observation, governed by the relevant association file describing the relation between the individual input exposures. The stage 3 data are typically science-ready, fully calibrated in units of MJy/sr (for images and IFU data cubes) or Jy (for one-dimensional spectra), and provided on a regular grid (e.g., "*i2d.fits" files, "*s3d.fits" files, and "*x1d.fits" files). These data products can also include catalog-level information produced from the final calibrated data.



Included data

JWST science data products typically include a "SCI" extension providing the measured science values, an "ERR" extension giving an estimated uncertainty on those values, and a "DQ" extension giving the corresponding data quality information.

Science arrays (SCI)

Science arrays are the primary results provided by the JWST pipeline, and give values corresponding to the signal measured from a given astronomical scene. The units of the science arrays change throughout the pipeline, starting as raw measured detector counts in stage 0 and becoming calibrated radiometric units in stages 2 and 3.

Error arrays (ERR)

Error arrays are initialized in stage 1 processing during the initial ramp fitting step, and have contributions from the detector read noise, the Poisson noise of the measured signal, and the Poisson noise of the subtracted reference dark. These terms are tracked independently, with variances due to Poisson noise stored in the "VAR_POISSON" extension, and variances due to read noise stored in the "VAR_RNOISE" extension. At each step, the total "ERR" array is recomputed as the square root of the quadratic sum of the individual variance terms. These variances and the corresponding total error are propagated through each step of the pipeline using a noise model, and updated as necessary. Stage 2 calibration, for instance, introduces a new variance term due to the uncertainty in the applied flatfield ("VAR_FLAT") that is combined into the total estimated error budget.
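The recombination of variance terms into a total error can be sketched for a single pixel as follows. This is a minimal illustration with made-up variance values in arbitrary units; the function name is hypothetical and the pipeline itself operates on full arrays.

```python
import math


def total_error(var_poisson, var_rnoise, var_flat=0.0):
    """Combine independent variance terms into a single 1-sigma uncertainty."""
    return math.sqrt(var_poisson + var_rnoise + var_flat)


# Illustrative variances for one pixel (arbitrary units, not real data).
err_without_flat = total_error(var_poisson=0.09, var_rnoise=0.16)
err_with_flat = total_error(var_poisson=0.09, var_rnoise=0.16, var_flat=0.11)
print(err_without_flat, err_with_flat)
```

Adding the flatfield term can only increase the total error, since every variance contribution is non-negative.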

Note that different uncertainty sources behave in different ways. Some noise sources (e.g., photon noise) are independent between integrations and others (e.g., flat field noise) are not. By propagating each term through the calibration pipeline, the use of each term can be customized for the processing. For example, the use of the flat field noise term is different between non-dithered and dithered observations. For the former, the noise does not reduce with the addition of more integrations while for the latter it does. Note that while covariance can be significant (especially in resampled data) it is not explicitly tracked by the JWST pipeline at the present time (although see discussion of covariance in MIRI MRS data cubes).
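The different behavior of the flat field term can be illustrated with an idealized model, sketched below. It assumes that photon noise is independent between measurements, that non-dithered data reuse the same pixels (so the flat field error is fully correlated), and that dithered data sample different pixels with independent flat field errors. The function name and input values are hypothetical.

```python
import math


def combined_error(sigma_photon, sigma_flat, n_measurements, dithered):
    """Idealized uncertainty on the mean of n_measurements of one source."""
    # Photon noise is independent between measurements and averages down.
    var_photon = sigma_photon**2 / n_measurements
    if dithered:
        # Different pixels are sampled, so flat field errors also average down.
        var_flat = sigma_flat**2 / n_measurements
    else:
        # The same pixels are reused, so the flat field error does not reduce.
        var_flat = sigma_flat**2
    return math.sqrt(var_photon + var_flat)


for n in (1, 4, 16):
    fixed = combined_error(1.0, 0.5, n, dithered=False)
    moved = combined_error(1.0, 0.5, n, dithered=True)
    print(n, round(fixed, 3), round(moved, 3))
```

In this model the non-dithered error approaches the flat field term as a floor, while the dithered error keeps decreasing as more measurements are added.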

For a detailed list of the variances created and/or updated by each pipeline step, see ReadTheDocs Error Propagation.

Data quality arrays (DQ)

The data quality (DQ) initialization step in the calibration pipeline (see ReadTheDocs) populates the data quality mask for a dataset to flag any pixels that may be unreliable or unusable for a number of reasons, such as dead pixels, hot pixels, etc. This data quality array is a bitmask, with integer values reflecting which combinations of multiple different possible quality flags have been set for each pixel. These flags are carried through the steps of the pipeline and may inform how the calculations within a calibration step are performed for a pixel. Different instruments monitor different characteristics and hence may have differing pixel flags. The most important flag, however, is the "DO_NOT_USE" flag (bit 2^0 = 1) which is typically used to exclude certain pixels from later calculations. For a full list of data quality flags used by the pipeline, see the ReadTheDocs data quality flag summary. Likewise, for tips on working with DQ arrays see Tips and Tricks for Working with the JWST Pipeline.
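Because each flag occupies its own bit, a DQ value can be decoded with bitwise operations. The sketch below uses plain Python integers. Only the "DO_NOT_USE" value is taken from the text above; the other two flag names and values are assumptions for illustration and should be checked against the ReadTheDocs flag summary.

```python
# DO_NOT_USE (bit 0) is stated in the text; the other two entries are assumed
# here for illustration and should be verified against the pipeline flag table.
DQ_FLAGS = {
    "DO_NOT_USE": 2**0,
    "SATURATED": 2**1,
    "JUMP_DET": 2**2,
}


def decode_dq(value):
    """Return the names of all known flags set in a single DQ integer."""
    return [name for name, bit in DQ_FLAGS.items() if value & bit]


def is_usable(value):
    """Treat a pixel as usable when the DO_NOT_USE bit is not set."""
    return (value & DQ_FLAGS["DO_NOT_USE"]) == 0


for dq in (0, 1, 4, 5):
    print(dq, decode_dq(dq), is_usable(dq))
```

A pixel can carry other flags and still be usable; in this sketch only the "DO_NOT_USE" bit excludes it.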

Throughout stage 1 processing, data quality information is stored in the "PIXELDQ" and "GROUPDQ" extensions; these represent flags affecting entire pixels, and flags affecting only certain readout groups of a given pixel, respectively. At the end of stage 1 processing and throughout stage 2, the individual "PIXELDQ" and "GROUPDQ" extensions for a ramp are replaced by a single "DQ" extension, which is a data array containing DQ flags for each pixel, for each integration (or for averaged integrations, depending on the data product type). In stage 3 processing, the data are resampled based on the WCS and distortion information and then combined into a single undistorted product.

Resampled imaging and slit spectroscopy data products (see ReadTheDocs) contain "WHT" and "CON" extensions in place of the "DQ" extension. These extensions provide observers with the 2-D weight image giving the relative weight of the output pixels (WHT) and the 2-D context image, which encodes information about which input images contribute to a specific output pixel (CON). Resampled IFU spectroscopic data products (see ReadTheDocs) contain "DQ" and "WMAP" extensions, with the former conveying data quality information and the latter the relative weights of the output voxels.

Associations

Relationships between multiple exposures are captured in an association, which is a means of identifying a set of exposures that belong together and may be dependent upon one another. The association concept permits exposures to be calibrated, archived, retrieved, and reprocessed as a set rather than as individual objects. Association files follow a similar naming convention to stage 3 data products.

In order to capture a list of exposures that could potentially form an association product and provide relevant information about those exposures, the JWST Science Calibration Pipeline first generates an association pool. These are Astropy tables that contain the metadata for all the data in a given proposal. These pools are then used by the association generator to create stage 2 associations or stage 3 associations in JSON format for a particular type of processing. Based on this JSON file, the JWST calibration pipeline can create the relevant data products.

An association file can contain multiple intermediate science data products, related files that support the science data (e.g., jitter data or target acquisition images), and contemporaneous calibration observations. Stage 2 associations might typically include calibration data, such as background exposures and NIRSpec MSA imprint exposures, in order to allow the relevant steps to subtract these exposures from the science data. Stage 3 associations typically list the multiple dithers or mosaic tile positions within a given observation (see an example stage 3 association file), so that the stage 3 pipeline can combine these data into a single composite science product. Any dedicated background or PSF reference observation can be included in more than one association product.
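Since associations are JSON files, they can be inspected with standard tools. The sketch below parses a small, made-up association and lists its members by exposure type. The key names are assumed to follow the layout described in the ReadTheDocs association documentation, and all file names and values are placeholders.

```python
import json

# Illustrative association only: every name and value here is a placeholder.
asn_text = """
{
  "asn_type": "spec3",
  "asn_id": "o010",
  "products": [
    {
      "name": "example_product",
      "members": [
        {"expname": "example_dither1_cal.fits", "exptype": "science"},
        {"expname": "example_dither2_cal.fits", "exptype": "science"},
        {"expname": "example_bkg_cal.fits", "exptype": "background"}
      ]
    }
  ]
}
"""

asn = json.loads(asn_text)


def members_by_type(association, exptype):
    """List the file names of all members with the given exposure type."""
    return [
        member["expname"]
        for product in association["products"]
        for member in product["members"]
        if member["exptype"] == exptype
    ]


print(members_by_type(asn, "science"))
print(members_by_type(asn, "background"))
```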

Association files are provided with the standard data products available through MAST; users wishing to reprocess their data offline can either edit these association files or use the tools provided by the JWST pipeline notebooks to regenerate custom associations using the relevant association commands.

At present, two kinds of associations are provided in MAST for stage 3 data products: "observation" and "candidate" associations. As indicated by Table 1, these associations differ slightly in their data content. In brief, "observation" associations (indicated by the "oNNN" name identifier) combine data from within a given observation, while "candidate" associations (indicated by the "c1NNN" name identifier) combine data across observations, for example by subtracting dedicated background observations linked via an APT special requirement.


Table 1. Association types and type of data included

Association candidate: observation
Name identifier: oNNN
What it means: Combine data within an observation
Type of data included:
  • All exposures within the same observation of a pointing for all points within a dither, using the same optical elements and configuration like filter, detector, readout, etc.
  • All exposures within the same observation that can be associated. Examples are WFSS data where observations from different wavelength ranges taken with different detectors can be associated (or combined).

Association candidate: candidate
Name identifier: c1NNN
What it means: Combine data across different observations
Type of data included:
  • Data that should be combined at stage 3; e.g. WFSS simultaneous images get associated with the grism observations for the purpose of using the catalog for the stage 3 processing
  • Data that has been grouped together via special requirements like group-non-ints or background observations
  • All data for a given configuration that has been taken in a mosaic


For more detailed information on the construction and usage of associations, see the ReadTheDocs Association documentation.



File naming conventions

Stage 0–2 file names

JWST science data files are exposure-based up to stage 2 of processing, meaning they contain only the values from a single exposure for a single detector. As a result, exposure-based file names are constructed using information from the exposure itself such as the program ID, visit number, and detector ID. All JWST filenames are preceded by the characters "jw".

As an example, a typical filename might be jw01523003001_03102_00002_mirifulong_rate.fits 

In this case, this is a MIRI MRS exposure. "01523" indicates that the data comes from Program ID 1523, "003001" indicates that it corresponds to the first visit of observation 3, "03102" indicates the parallel sequence and activity number, "00002" indicates that it is exposure number 2, "mirifulong" indicates that the data come from the long-wavelength MIRI IFU detector, and "rate" indicates that this is a stage 1 data product.
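The breakdown above can be sketched as a regular expression. This is an illustrative parser only, not a pipeline utility: the function and field names are hypothetical, the five-character block after the visit ID is kept as one opaque field, and the position of the optional "-segNNN" suffix is an assumption that should be checked against ReadTheDocs.

```python
import re

# Field names are hypothetical. The "-segNNN" position is an assumption.
EXPOSURE_NAME = re.compile(
    r"^jw(?P<program>\d{5})(?P<observation>\d{3})(?P<visit>\d{3})"
    r"_(?P<sequence_activity>[0-9a-z]{5})"
    r"_(?P<exposure>\d{5})(?:-seg(?P<segment>\d{3}))?"
    r"_(?P<detector>[a-z0-9]+)"
    r"_(?P<product>[a-z0-9_]+)\.fits$"
)


def parse_exposure_name(filename):
    """Split an exposure-based file name into its labeled components."""
    match = EXPOSURE_NAME.match(filename)
    if match is None:
        raise ValueError(f"not an exposure-based file name: {filename}")
    return match.groupdict()


info = parse_exposure_name("jw01523003001_03102_00002_mirifulong_rate.fits")
for key, value in info.items():
    print(f"{key:18s} {value}")
```

All fields are returned as strings so that leading zeros are preserved; the segment field is None when the name has no segment suffix.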

For further details, see ReadTheDocs: Exposure File Names.

Stage 3 file names

Stage 3 data products can be either target-based or source-based, as they combine data from multiple different exposures. As a result, these filenames are constructed using information such as the target and/or source ID, instrument, and optical element in use.

As an example, a typical filename might be jw01523-o010_t010_miri_f1800w-sub64_i2d.fits

In this case, this is a MIRI imaging mosaic. "01523" indicates that the data comes from Program ID 1523, "o010" indicates that the data come from observation 10, "t010" indicates that this data corresponds to target ID number 10 from the APT program, "miri" indicates that the MIRI instrument was in use, "f1800w" that the observations used the F1800W filter, "sub64" that they used the SUB64 detector readout subarray, and "i2d" that this is a rectified and calibrated image mosaic.
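A similar sketch can split a stage 3 file name into the components described above. The pattern below is an assumption based only on this example and will not cover every naming variant; the function and field names are hypothetical. The optical element and subarray are returned together as a list, since they cannot be told apart from the name alone.

```python
import re

# Pattern derived from the single example above; other variants may exist.
STAGE3_NAME = re.compile(
    r"^jw(?P<program>\d{5})-(?P<association>[oc]\d+)"
    r"_(?P<target>[ts]\d+)"
    r"_(?P<instrument>[a-z]+)"
    r"_(?P<configuration>[a-z0-9-]+)"
    r"_(?P<product>[a-z0-9]+)\.fits$"
)


def parse_stage3_name(filename):
    """Split a stage 3 file name into its labeled components."""
    match = STAGE3_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a recognized stage 3 file name: {filename}")
    fields = match.groupdict()
    # Optical element(s) and subarray are joined by hyphens in the name.
    fields["configuration"] = fields["configuration"].split("-")
    # "o" marks an observation association, "c" a candidate association.
    is_observation = fields["association"].startswith("o")
    fields["association_type"] = "observation" if is_observation else "candidate"
    return fields


parsed = parse_stage3_name("jw01523-o010_t010_miri_f1800w-sub64_i2d.fits")
for key, value in parsed.items():
    print(f"{key:18s} {value}")
```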

For further details, see ReadTheDocs: Stage 3 File Names.



Data storage formats

Data produced by the JWST Science Calibration Pipeline is primarily stored in Flexible Image Transport System (FITS) files, but can also include Advanced Scientific Data Format (ASDF) files, JavaScript Object Notation (JSON) files, and Enhanced Character Separated Values (ECSV) files. Similar formats are also used for the calibration reference files. Note that the simplest way to interact with JWST data products is often via the JWST Data Models; see Tips and Tricks for Working with the JWST Pipeline for details.

Multi-extension FITS format

The Flexible Image Transport System (FITS) is a standard format for exchanging astronomical data, independent of the hardware platform and software environment. FITS format files consist of a series of header data units (HDUs), each containing 2 components: an ASCII text header and binary data. The header contains a series of keywords that describe the data in a particular HDU; the data component may immediately follow the header.

For JWST FITS data, the first HDU, or primary header, only contains header information in the form of keyword records with an empty data array, which is indicated by the occurrence of NAXIS = 0 in the primary header. Keywords in the primary header usually pertain to the entire file.  The primary header may be followed by one or more HDUs called extensions, which may take the form of images, binary tables, ASCII text tables, or ASDF-format information. The data type for each extension is recorded in the XTENSION header keyword. These extensions also contain header keywords that record metadata pertinent to that extension.

Typically, JWST FITS extensions include a "SCI" extension containing the science pixel values, an "ERR" extension containing the estimated uncertainties of each pixel value, and a "DQ" extension which records any binary data quality flags which have been set for each pixel. See Error and Data Quality Arrays for further information.
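This layout can be explored with Astropy. The sketch below builds a small in-memory file that mimics the structure described above (it is not a real JWST product, and the array sizes and keyword value are placeholders), then inspects the primary header and the named extensions.

```python
import numpy as np
from astropy.io import fits

# Build a tiny in-memory stand-in for a JWST product (placeholder values only).
primary = fits.PrimaryHDU()  # header only, no data array
primary.header["INSTRUME"] = "MIRI"
sci = fits.ImageHDU(np.zeros((4, 6), dtype=np.float32), name="SCI")
err = fits.ImageHDU(np.ones((4, 6), dtype=np.float32), name="ERR")
dq = fits.ImageHDU(np.zeros((4, 6), dtype=np.uint32), name="DQ")
hdul = fits.HDUList([primary, sci, err, dq])

# The primary HDU carries keywords only, which is signaled by NAXIS = 0.
print(hdul[0].header["NAXIS"], hdul[0].data)

# Each extension records its type in XTENSION and can be accessed by name.
for hdu in hdul[1:]:
    print(hdu.name, hdu.header["XTENSION"], hdu.data.shape)
print(hdul["SCI"].data.dtype)
```

For a file on disk, the same inspection would start from fits.open with the file path in place of the hand-built list.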

Note that interferometric data (i.e., NIRISS AMI) uses the OIFITS2 (Optical Interferometry FITS) format instead.

FITS header keywords

The FITS headers of all JWST data contain keywords required by the FITS standard and keywords relevant to the observation. Additionally, most JWST data products also record keywords summarizing successfully executed pipeline processing steps, and the versions of both pipeline software and the calibration reference files that were used. By examining the file header using tools such as Astropy, observers can find detailed information about the data, including:

  • Coordinates of the target (TARG_RA, TARG_DEC), program number (PROGRAM), observation (OBSERVTN), and other observation identifiers

  • Date and time of the observation including start, mid-exposure, and end times (MJD-BEG, MJD-MID, MJD-END respectively)

  • Instrument (INSTRUME) and configuration information (DETECTOR, FILTER, SUBARRAY)
  • Readout definition parameters (READPATT, NINTS, NGROUPS, and GROUPGAP)

  • Spacecraft attitude information (RA_V1, DEC_V1, PA_V3, etc., contained within the "SCI" extension header)

  • Calibration pipeline version (CAL_VER) and CRDS reference file context (CRDS_CTX)
  • Information about individual pipeline step completion (e.g., S_STRAY = 'COMPLETE'           / Straylight Correction)
  • Information about specific reference files used by the pipeline (e.g., R_DISTOR= 'crds://jwst_miri_distortion_0133.asdf' / Distortion reference file information)

Note that pixel positions referenced in FITS header keywords are 1-indexed, while Astropy and the JWST Science Calibration Pipeline typically use a 0-indexed system.

Header keywords related to a particular topic are kept together logically, such as the program information or target information. This sample data header shows some of the keywords and groupings. The full sample of schematic headers for all the JWST modes can also be found in MAST. The JWST Keyword Dictionary in the MAST documentation contains the complete list of standard JWST header keywords, the FITS header extension where they can be found, where the information comes from, and their valid values. 

Figure 2 shows a schematic representation of how data from many sources are used by the Science Data Processing (SDP) subsystem to populate the uncalibrated (stage 0) FITS files of the science data. These will have keywords required by the FITS standard and keywords relevant to the observation that are extracted from the telemetry packet headers and science image header.

Following FITS conventions, each keyword is no longer than 8 characters, and their values can be an integer, real (floating-point) number, or a character string. Several keywords are common to all JWST data, and others are instrument-specific. 

Figure 2. Source and usage of information by the SDP system to populate the headers of science data

 


Timing keywords

See JWST Time Definitions for an overview of timing-related header keywords.

ASDF format

The Advanced Scientific Data Format (ASDF) is a next generation, human-readable, hierarchical metadata structure made up of basic dynamic data types such as strings, numbers, lists, basic transformation functions, and mappings. Data can be saved as plain text or as binary arrays. ASDF is primarily intended as an interchange format for delivering information about the science instruments or how products were created to scientists or, for example, between stages of the calibration pipeline. ASDF-formatted data is often included as an extension in the FITS files produced by the pipeline. As an example, the complex spatial and spectral distortion models needed to transform detector positions to a world coordinate frame (R.A., Decl., wavelength) are attached to data by the assign_wcs pipeline step and stored as an ASDF extension. Likewise, many calibration reference files are stored in ASDF format.

Basic functions for interacting with ASDF data are provided by the Astropy ASDF library.

ECSV format

The Enhanced Character Separated Values (ECSV) format, which is standard for the interchange of tabular data in a text-only format, is used to store a catalog of derived data for sources identified in some final stages of the calibration pipeline. This file includes a header section with the definition, data type, and description for the columns, and a data section with as many rows as sources identified in an image. Besides a simple comma-separated delimited text file reader, these files can also be read, modified, or created using the Astropy ECSV library.

JSON format

The JavaScript Object Notation (JSON) is a language-independent data format that many modern programming languages understand, and is used to transfer populated data structures. The calibration pipeline uses this type of file to provide information about relationships between multiple exposures or associations. This is also used by some calibration reference files.


