File Formats for Data Files

The data statistics, covariance matrix, and grid file are all read from plaintext files by the pyRSD.rsdfit module, and each requires a specific format. We describe those file formats in this section.

The Power Statistics

The plaintext file holding the data statistics must be specified in the parameter file using the data.data_file parameter. The package can handle reading either \(P(k,\mu)\) or \(P_\ell(k)\) data, and we will give examples for both below.

\(P(k,\mu)\) Data

For \(P(k,\mu)\) data, we have power measured in wide bins of \(\mu\). If there are \(N_k\) \(k\) bins and \(N_\mu\) \(\mu\) bins, the pyRSD code expects the following data:

  1. k :

    an array of shape (\(N_k\), \(N_\mu\)) providing the mean wavenumber value in each bin (units: \(h/\mathrm{Mpc}\))

  2. mu :

    an array of shape (\(N_k\), \(N_\mu\)) providing the mean \(\mu\) value in each bin

  3. power :

    an array of shape (\(N_k\), \(N_\mu\)) providing the mean power value in each bin (units: \(h^{-3} \mathrm{Mpc}^3\))

The data can be saved easily to a plaintext file from a numpy structured array using the pyRSD.rsdfit.data.PowerMeasurements class. For example, here we will generate some fake \(P(k,\mu)\) data and write it to a plaintext file with the correct format.

In [8]: from pyRSD.rsdfit.data import PowerMeasurements

In [9]: import numpy as np

In [10]: Nk = 20

In [11]: Nmu = 5

# generate 5 mu bins from 0 to 1
In [12]: mu_edges = np.linspace(0, 1, Nmu+1)

In [13]: mu_cen = 0.5 * (mu_edges[1:] + mu_edges[:-1])

# generate 20 k bins from 0.01 to 0.4
In [14]: k_edges = np.linspace(0.01, 0.4, Nk+1)

In [15]: k_cen = 0.5 * (k_edges[1:] + k_edges[:-1])

# make 2D k, mu arrays with shape (20, 5)
In [16]: k, mu = np.meshgrid(k_cen, mu_cen, indexing='ij')

# some random power data
In [17]: pkmu = np.random.random(size=(Nk,Nmu))

# make a structured array
In [18]: data = np.empty((Nk,Nmu), dtype=[('k', 'f8'), ('mu', 'f8'), ('power', 'f8')])

In [19]: data['k'] = k[:]

In [20]: data['mu'] = mu[:]

In [21]: data['power'] = pkmu[:]

# identifying names for each statistic
In [22]: names = ['pkmu_0.1', 'pkmu_0.3', 'pkmu_0.5', 'pkmu_0.7', 'pkmu_0.9']

# initialize the PowerMeasurements object
In [23]: measurements = PowerMeasurements.from_array(names, data)

In [24]: measurements
Out[24]: 
[<PowerMeasurement P(k, mu=0.1), (0.0198 - 0.39) h/Mpc, 20 data points>,
 <PowerMeasurement P(k, mu=0.3), (0.0198 - 0.39) h/Mpc, 20 data points>,
 <PowerMeasurement P(k, mu=0.5), (0.0198 - 0.39) h/Mpc, 20 data points>,
 <PowerMeasurement P(k, mu=0.7), (0.0198 - 0.39) h/Mpc, 20 data points>,
 <PowerMeasurement P(k, mu=0.9), (0.0198 - 0.39) h/Mpc, 20 data points>]

# save to file
In [25]: measurements.to_plaintext("pkmu_data.dat")

Our fake data has been saved to a plaintext file with the desired format. The first few lines of this file look like:

20 5
k mu power
1.97500e-02 1.00000e-01 2.73693e-01
3.92500e-02 1.00000e-01 1.12058e-01
5.87500e-02 1.00000e-01 7.47377e-01
7.82500e-02 1.00000e-01 7.84050e-01
9.77500e-02 1.00000e-01 2.49533e-01
1.17250e-01 1.00000e-01 2.79883e-02
1.36750e-01 1.00000e-01 8.17067e-01
1.56250e-01 1.00000e-01 6.27760e-01
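
For reference, the layout can be inspected directly with numpy. The following is only a sketch, assuming the format shown above: an "Nk Nmu" header line, a line of column names, and then each \(\mu\) bin's \(N_k\) rows written in sequence.

import numpy as np

# read the header: the number of k and mu bins, plus the column names
with open('pkmu_data.dat') as ff:
    Nk, Nmu = (int(n) for n in ff.readline().split())
    columns = ff.readline().split()               # ['k', 'mu', 'power']

# load the flat table and reshape each column back to a (Nk, Nmu) array
flat = np.loadtxt('pkmu_data.dat', skiprows=2)    # shape (Nk*Nmu, 3)
k, mu, power = (flat[:, i].reshape(Nmu, Nk).T for i in range(3))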

We can easily re-initialize a PowerMeasurements object from this plaintext file using

In [26]: names = ['pkmu_0.1', 'pkmu_0.3', 'pkmu_0.5', 'pkmu_0.7', 'pkmu_0.9']

In [27]: measurements_2 = PowerMeasurements.from_plaintext(names, 'pkmu_data.dat')

In [28]: measurements_2
Out[28]: 
[<PowerMeasurement P(k, mu=0.1), (0.0198 - 0.39) h/Mpc, 20 data points>,
 <PowerMeasurement P(k, mu=0.3), (0.0198 - 0.39) h/Mpc, 20 data points>,
 <PowerMeasurement P(k, mu=0.5), (0.0198 - 0.39) h/Mpc, 20 data points>,
 <PowerMeasurement P(k, mu=0.7), (0.0198 - 0.39) h/Mpc, 20 data points>,
 <PowerMeasurement P(k, mu=0.9), (0.0198 - 0.39) h/Mpc, 20 data points>]

Note

The names specified in this example serve as identifying strings for each power bin. They should be specified via the data.statistics keyword in the parameter file. For \(P(k,\mu)\) data, the names must begin with pkmu_ and are typically followed by the mean \(\mu\) value in each power bin, i.e., pkmu_0.1 identifies the statistic measuring \(P(k,\mu)\) in a bin centered at \(\mu=0.1\).
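
For example, a fragment of a parameter file using these names and the file written above might look like the following; this is only a sketch, assuming the key = value style of rsdfit parameter files.

# identifying names of the statistics and the data file written above
data.statistics = ['pkmu_0.1', 'pkmu_0.3', 'pkmu_0.5', 'pkmu_0.7', 'pkmu_0.9']
data.data_file = 'pkmu_data.dat'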

\(P_\ell(k)\) Data

For \(P_\ell(k)\) data, we have measured the multipole moments of \(P(k,\mu)\) for several multipole numbers \(\ell\). If there are \(N_k\) \(k\) bins and \(N_\ell\) multipoles, the pyRSD code expects the following data:

  1. k :

    an array of shape (\(N_k\), \(N_\ell\)) providing the mean wavenumber for each multipole (units: \(h/\mathrm{Mpc}\))

  2. power :

    an array of shape (\(N_k\), \(N_\ell\)) providing the mean power for each multipole (units: \(h^{-3} \mathrm{Mpc}^3\))

Again, as an example, here we will generate some fake \(P_\ell(k)\) data and write it to a plaintext file with the correct format.

In [29]: Nk = 20

In [30]: Nell = 3

# fit the monopole, quadrupole, and hexadecapole
In [31]: ells = [0, 2, 4]

# generate 20 k bins from 0.01 to 0.4
In [32]: k_edges = np.linspace(0.01, 0.4, Nk+1)

In [33]: k_cen = 0.5 * (k_edges[1:] + k_edges[:-1])

# make a structured array
In [34]: data = np.empty((Nk,Nell), dtype=[('k', 'f8'), ('power', 'f8')])

In [35]: for i, ell in enumerate(ells):
   ....:     data['k'][:,i] = k_cen[:]
   ....:     data['power'][:,i] = np.random.random(size=Nk)
   ....: 

# identifying names for each statistic
In [36]: names = ['pole_0', 'pole_2', 'pole_4']

# initialize the PowerMeasurements object
In [37]: measurements = PowerMeasurements.from_array(names, data)

In [38]: measurements
Out[38]: 
[<PowerMeasurement P(k, ell=0), (0.0198 - 0.39) h/Mpc, 20 data points>,
 <PowerMeasurement P(k, ell=2), (0.0198 - 0.39) h/Mpc, 20 data points>,
 <PowerMeasurement P(k, ell=4), (0.0198 - 0.39) h/Mpc, 20 data points>]

# save to file
In [39]: measurements.to_plaintext("pole_data.dat")

Our fake data has been saved to a plaintext file with the desired format. The first few lines of this file look like:

20 3
k power
1.97500e-02 1.03185e-01
3.92500e-02 3.20316e-02
5.87500e-02 9.39877e-01
7.82500e-02 3.90996e-01
9.77500e-02 2.75281e-01
1.17250e-01 1.82381e-01
1.36750e-01 8.72181e-01
1.56250e-01 8.33602e-01

We can easily re-initialize a PowerMeasurements object from this plaintext file using

In [40]: names = ['pole_0', 'pole_2', 'pole_4']

In [41]: measurements_2 = PowerMeasurements.from_plaintext(names, 'pole_data.dat')

In [42]: measurements_2
Out[42]: 
[<PowerMeasurement P(k, ell=0), (0.0198 - 0.39) h/Mpc, 20 data points>,
 <PowerMeasurement P(k, ell=2), (0.0198 - 0.39) h/Mpc, 20 data points>,
 <PowerMeasurement P(k, ell=4), (0.0198 - 0.39) h/Mpc, 20 data points>]

Note

The names specified in this example serve as identifying strings for each multipole. They should be specified via the data.statistics keyword in the parameter file. For \(P_\ell(k)\) data, the names must begin with pole_ and are typically followed by the multipole number, i.e., pole_0 identifies the monopole (\(\ell=0\)).

The Covariance Matrix

The parameter estimation procedure relies on the likelihood function, which tells us the probability of the measured data given our theoretical model. In order to evaluate the likelihood, we need the covariance matrix of the data statistics. pyRSD provides two classes for dealing with covariance matrices: pyRSD.rsdfit.data.PkmuCovarianceMatrix for \(P(k,\mu)\) data and pyRSD.rsdfit.data.PoleCovarianceMatrix for \(P_\ell(k)\) data.

Again, when running the rsdfit command, the covariance matrix will be loaded from a plaintext file with a specific format. The easiest way to save your covariance matrix in this format is to take advantage of the functionality provided by the PkmuCovarianceMatrix and PoleCovarianceMatrix classes.

The most important thing to realize when dealing with the covariance matrix is that it is the covariance matrix of the full, concatenated data vector. For example, if we are fitting the \(\ell=0,2,4\) multipoles, then the full data vector is

\[\mathcal{D} = [P_0, P_2, P_4].\]

If we have \(N_k\) data points measured for each multipole, the covariance matrix has shape \((3 N_k, 3 N_k)\). The upper-left sub-matrix of shape \((N_k, N_k)\) is the covariance of \(P_0\) with itself, the next sub-matrix of shape \((N_k, N_k)\) to the right is the covariance between \(P_0\) and \(P_2\), and similarly, the last sub-matrix in the top row is the covariance between \(P_0\) and \(P_4\).
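
To make this layout concrete, here is a minimal numpy sketch, with cov standing in for a generic \((3 N_k, 3 N_k)\) covariance array, showing how the individual sub-matrices would be sliced out:

# cov: a (3*Nk, 3*Nk) covariance of the concatenated [P0, P2, P4] data vector
C00 = cov[0*Nk:1*Nk, 0*Nk:1*Nk]   # covariance of P0 with itself
C02 = cov[0*Nk:1*Nk, 1*Nk:2*Nk]   # cross-covariance of P0 and P2
C04 = cov[0*Nk:1*Nk, 2*Nk:3*Nk]   # cross-covariance of P0 and P4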

As an example, we will generate some fake multipole data and compute the covariance below,

In [43]: from pyRSD.rsdfit.data import PoleCovarianceMatrix

# generate 100 fake monopoles
In [44]: P0 = np.random.random(size=(100, Nk)) # Nk = 20, see above

# 100 fake quadrupoles
In [45]: P2 = np.random.random(size=(100, Nk))

# 100 fake hexadecapoles
In [46]: P4 = np.random.random(size=(100, Nk))

# make the full data vector
In [47]: D = np.concatenate([P0, P2, P4], axis=-1) # shape is (100, 60)

# compute the covariance matrix
In [48]: cov = np.cov(D, rowvar=False) # shape is (60, 60)

# initialize the PoleCovarianceMatrix
In [49]: ells = [0,2,4]

In [50]: C = PoleCovarianceMatrix(cov, k_cen, ells)

In [51]: C
Out[51]: <PoleCovarianceMatrix: size 60x60>

# write to plaintext file
In [52]: C.to_plaintext('pole_cov.dat')

The covariance matrix of our fake data has been saved to a plaintext file with the desired format. The first few lines of this file look like:

60
0.08749548040812441
-0.016328517094392307
0.03065759204306653
-0.0027165435929309336
0.002304111735306902
-0.003093621922043565
-0.01174711039079392
0.004067073725322639
0.017856388426813617

We can easily re-initialize a PoleCovarianceMatrix object from this plaintext file using

In [54]: C_2 = PoleCovarianceMatrix.from_plaintext('pole_cov.dat')

In [55]: C_2
Out[55]: <PoleCovarianceMatrix: size 60x60>

The case of \(P(k,\mu)\) data is very similar to the multipoles, except that the full data vector is now concatenated over the \(\mu\) bins rather than the multipole numbers. The PkmuCovarianceMatrix class should be used for \(P(k,\mu)\) data.
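
As a rough sketch only, assuming PkmuCovarianceMatrix mirrors the PoleCovarianceMatrix constructor by accepting the covariance array together with the \(k\) and \(\mu\) bin centers (and shares the same to_plaintext method), the analogous steps for the fake \(P(k,\mu)\) data generated earlier would look like:

from pyRSD.rsdfit.data import PkmuCovarianceMatrix

# fake mocks of the concatenated P(k,mu) data vector: Nmu = 5 bins of Nk = 20 points
D_pkmu = np.random.random(size=(100, Nk*Nmu))
cov_pkmu = np.cov(D_pkmu, rowvar=False)                 # shape (100, 100)

# assumed constructor signature, analogous to PoleCovarianceMatrix(cov, k_cen, ells)
C_pkmu = PkmuCovarianceMatrix(cov_pkmu, k_cen, mu_cen)
C_pkmu.to_plaintext('pkmu_cov.dat')                     # assumed shared with PoleCovarianceMatrix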

Warning

Make sure that the number of data points, specifically the \(k\) range used, of the data statistics in the file specified by the data.data_file attribute agrees with the size of the covariance matrix.
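
For the fake multipoles generated above, this consistency amounts to a simple shape check, sketched here:

# sketch: the covariance should match the concatenated data vector,
# which has Nk points for each of the Nell multipoles
assert cov.shape == (Nk * Nell, Nk * Nell)   # (60, 60) for Nk = 20, Nell = 3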

The Window Function

If the user wishes to fit window-convolved power spectrum multipoles, a file holding the correlation function multipoles of the window function should be specified using the data.window_file parameter. The theoretical model will then be convolved with this window function so that model and data can be compared accurately.

The window function file should hold several columns of data, with the first column specifying the separation array \(s\), and the other columns giving the even-numbered correlation function multipoles of the window. See Window-convolved Power Spectra for a full example of the window multipoles.
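
As a sketch of that column layout (plain numpy, not a pyRSD call), suppose the hypothetical arrays s, xi0, xi2, and xi4 hold the separation grid and the measured even window multipoles; the file could then be written as:

# hypothetical arrays: s is the separation grid, xi0/xi2/xi4 are the measured
# even correlation function multipoles of the window (not computed here)
window = np.vstack([s, xi0, xi2, xi4]).T   # columns: s, xi_0(s), xi_2(s), xi_4(s)
np.savetxt('window.dat', window)           # file to pass via data.window_file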

The \(P(k,\mu)\) Grid

As discussed in Discrete Binning Effects, the user can optionally account for the effects of discretely binned data with a finely-binned \(P(k,\mu)\) grid. The name of the file holding this grid should be specified in the parameter file using the data.grid_file attribute.

Given a 2D data array that specifies this grid, the easiest way to save the grid to a plaintext file in the right format is to use the PkmuGrid.to_plaintext() function. See Discrete Binning Effects for a full example of how to do this.
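
For instance, assuming a PkmuGrid instance named grid has already been constructed as in Discrete Binning Effects, saving it is a one-liner; the file name here is only illustrative:

# sketch: `grid` is an existing PkmuGrid instance (see Discrete Binning Effects)
grid.to_plaintext('pkmu_grid.dat')   # file name should match data.grid_file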

Note

If the user is fitting window-convolved multipoles, the grid file does not need to be specified.