File Formats for Data Files¶
In [1]: import os
In [2]: startdir = os.path.abspath('.')
In [4]: home = startdir.rsplit('docs', 1)[0]
In [4]: os.chdir(home);
In [5]: os.chdir('docs/source')
In [6]: if not os.path.exists('generated'):
...: os.makedirs('generated')
...:
In [7]: os.chdir('generated')
The data statistics, covariance matrix, and grid file are all read
from plaintext files by the pyRSD.rsdfit
module, and
each requires a specific format. We describe those file formats
in this section.
The Power Statistics¶
The plaintext file holding the data statistics must be specified in the
parameter file using the data.data_file
parameter. The package
can handle reading either \(P(k,\mu)\) or \(P_\ell(k)\) data, and
we will give examples for both below.
\(P(k,\mu)\) Data¶
For \(P(k,\mu)\) data, we have power measured in wide bins of \(\mu\). If there are \(N_k\) \(k\) bins and \(N_\mu\) \(\mu\) bins, the pyRSD code expects the following data:
k :
an array of shape (\(N_k\), \(N_\mu\)) providing the mean wavenumber value in each bin (units: \(h/\mathrm{Mpc}\))
mu :
an array of shape (\(N_k\), \(N_\mu\)) providing the mean \(\mu\) value in each bin
power :
an array of shape (\(N_k\), \(N_\mu\)) providing the mean power value in each bin (units: \(h^{-3} \mathrm{Mpc}^3\))
The data can be saved easily to a plaintext file from a numpy structured array
using the pyRSD.rsdfit.data.PowerMeasurements
class. For example,
here we will generate some fake \(P(k,\mu)\) data and write it to a
plaintext file with the correct format.
In [8]: from pyRSD.rsdfit.data import PowerMeasurements
In [9]: import numpy as np
In [10]: Nk = 20
In [11]: Nmu = 5
# generate 5 mu bins from 0 to 1
In [12]: mu_edges = np.linspace(0, 1, Nmu+1)
In [13]: mu_cen = 0.5 * (mu_edges[1:] + mu_edges[:-1])
# generate 20 k bins from 0.01 to 0.4
In [14]: k_edges = np.linspace(0.01, 0.4, Nk+1)
In [15]: k_cen = 0.5 * (k_edges[1:] + k_edges[:-1])
# make 2D k, mu arrays with shape (20, 5)
In [16]: k, mu = np.meshgrid(k_cen, mu_cen, indexing='ij')
# some random power data
In [17]: pkmu = np.random.random(size=(Nk,Nmu))
# make a structured array
In [18]: data = np.empty((Nk,Nmu), dtype=[('k', 'f8'), ('mu', 'f8'), ('power', 'f8')])
In [19]: data['k'] = k[:]
In [20]: data['mu'] = mu[:]
In [21]: data['power'] = pkmu[:]
# identifying names for each statistic
In [22]: names = ['pkmu_0.1', 'pkmu_0.3', 'pkmu_0.5', 'pkmu_0.7', 'pkmu_0.9']
# initialize the PowerMeasurements object
In [23]: measurements = PowerMeasurements.from_array(names, data)
In [24]: measurements
Out[24]:
[<PowerMeasurement P(k, mu=0.1), (0.0198 - 0.39) h/Mpc, 20 data points>,
<PowerMeasurement P(k, mu=0.3), (0.0198 - 0.39) h/Mpc, 20 data points>,
<PowerMeasurement P(k, mu=0.5), (0.0198 - 0.39) h/Mpc, 20 data points>,
<PowerMeasurement P(k, mu=0.7), (0.0198 - 0.39) h/Mpc, 20 data points>,
<PowerMeasurement P(k, mu=0.9), (0.0198 - 0.39) h/Mpc, 20 data points>]
# save to file
In [25]: measurements.to_plaintext("pkmu_data.dat")
Our fake data has been saved to a plaintext file with the desired format. The first few lines of this file look like:
20 5
k mu power
1.97500e-02 1.00000e-01 2.73693e-01
3.92500e-02 1.00000e-01 1.12058e-01
5.87500e-02 1.00000e-01 7.47377e-01
7.82500e-02 1.00000e-01 7.84050e-01
9.77500e-02 1.00000e-01 2.49533e-01
1.17250e-01 1.00000e-01 2.79883e-02
1.36750e-01 1.00000e-01 8.17067e-01
1.56250e-01 1.00000e-01 6.27760e-01
We can easily re-initialize a PowerMeasurements
object from this plaintext
file using
In [26]: names = ['pkmu_0.1', 'pkmu_0.3', 'pkmu_0.5', 'pkmu_0.7', 'pkmu_0.9']
In [27]: measurements_2 = PowerMeasurements.from_plaintext(names, 'pkmu_data.dat')
In [28]: measurements_2
Out[28]:
[<PowerMeasurement P(k, mu=0.1), (0.0198 - 0.39) h/Mpc, 20 data points>,
<PowerMeasurement P(k, mu=0.3), (0.0198 - 0.39) h/Mpc, 20 data points>,
<PowerMeasurement P(k, mu=0.5), (0.0198 - 0.39) h/Mpc, 20 data points>,
<PowerMeasurement P(k, mu=0.7), (0.0198 - 0.39) h/Mpc, 20 data points>,
<PowerMeasurement P(k, mu=0.9), (0.0198 - 0.39) h/Mpc, 20 data points>]
Note
The names
specified in this example serve as identifying strings for
each power bin. They should be specified via the data.statistics
keyword in the parameter file. For \(P(k,\mu)\) data, the names
must begin with pkmu_
and are typically followed by the mean \(\mu\)
value in each power bin, i.e., pkmu_0.1
identifies the statistic
measuring \(P(k,\mu)\) in a bin centered at \(\mu=0.1\).
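To make the naming convention concrete, the mean \(\mu\) value can be recovered from each name by stripping the required prefix (a small illustration only, not part of the pyRSD API):

```python
names = ['pkmu_0.1', 'pkmu_0.3', 'pkmu_0.5', 'pkmu_0.7', 'pkmu_0.9']

# strip the required 'pkmu_' prefix to recover the mean mu value of each bin
mus = [float(name[len('pkmu_'):]) for name in names]
print(mus)  # [0.1, 0.3, 0.5, 0.7, 0.9]
```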
\(P_\ell(k)\) Data¶
For \(P_\ell(k)\) data, we have measured the multipole moments of \(P(k,\mu)\) for several multipole numbers \(\ell\). If there are \(N_k\) \(k\) bins and \(N_\ell\) multipoles, the pyRSD code expects the following data:
k :
an array of shape (\(N_k\), \(N_\ell\)) providing the mean wavenumber for each multipole (units: \(h/\mathrm{Mpc}\))
power :
an array of shape (\(N_k\), \(N_\ell\)) providing the mean power for each multipole (units: \(h^{-3} \mathrm{Mpc}^3\))
Again, as an example, we will generate some fake \(P_\ell(k)\) data and write it to a plaintext file with the correct format.
In [29]: Nk = 20
In [30]: Nell = 3
# fit the monopole, quadrupole, and hexadecapole
In [31]: ells = [0, 2, 4]
# generate 20 k bins from 0.01 to 0.4
In [32]: k_edges = np.linspace(0.01, 0.4, Nk+1)
In [33]: k_cen = 0.5 * (k_edges[1:] + k_edges[:-1])
# make a structured array
In [34]: data = np.empty((Nk,Nell), dtype=[('k', 'f8'), ('power', 'f8')])
In [35]: for i, ell in enumerate(ells):
....: data['k'][:,i] = k_cen[:]
....: data['power'][:,i] = np.random.random(size=Nk)
....:
# identifying names for each statistic
In [36]: names = ['pole_0', 'pole_2', 'pole_4']
# initialize the PowerMeasurements object
In [37]: measurements = PowerMeasurements.from_array(names, data)
In [38]: measurements
Out[38]:
[<PowerMeasurement P(k, ell=0), (0.0198 - 0.39) h/Mpc, 20 data points>,
<PowerMeasurement P(k, ell=2), (0.0198 - 0.39) h/Mpc, 20 data points>,
<PowerMeasurement P(k, ell=4), (0.0198 - 0.39) h/Mpc, 20 data points>]
# save to file
In [39]: measurements.to_plaintext("pole_data.dat")
Our fake data has been saved to a plaintext file with the desired format. The first few lines of this file look like:
20 3
k power
1.97500e-02 1.03185e-01
3.92500e-02 3.20316e-02
5.87500e-02 9.39877e-01
7.82500e-02 3.90996e-01
9.77500e-02 2.75281e-01
1.17250e-01 1.82381e-01
1.36750e-01 8.72181e-01
1.56250e-01 8.33602e-01
We can easily re-initialize a PowerMeasurements
object from this plaintext
file using
In [40]: names = ['pole_0', 'pole_2', 'pole_4']
In [41]: measurements_2 = PowerMeasurements.from_plaintext(names, 'pole_data.dat')
In [42]: measurements_2
Out[42]:
[<PowerMeasurement P(k, ell=0), (0.0198 - 0.39) h/Mpc, 20 data points>,
<PowerMeasurement P(k, ell=2), (0.0198 - 0.39) h/Mpc, 20 data points>,
<PowerMeasurement P(k, ell=4), (0.0198 - 0.39) h/Mpc, 20 data points>]
Note
The names
specified in this example serve as identifying strings for
each multipole. They should be specified via the data.statistics
keyword in the parameter file. For \(P_\ell(k)\) data, the names
must begin with pole_
and are typically followed by the multipole
number, i.e., pole_0
identifies the monopole (\(\ell=0\)).
The Covariance Matrix¶
The parameter estimation procedure relies on the likelihood function, which tells
us the probability of the measured data given our theoretical model. In order
to evaluate the likelihood, we need the covariance matrix of the data statistics.
pyRSD provides two classes for dealing with covariance matrices,
pyRSD.rsdfit.data.PkmuCovarianceMatrix
for \(P(k,\mu)\) data and
pyRSD.rsdfit.data.PoleCovarianceMatrix
for \(P_\ell(k)\) data.
Again, when running the rsdfit
command, the covariance matrix will be
loaded from a plaintext file with a specific format. The easiest way to
save your desired covariance matrix is to take advantage of the functionality
provided by the PkmuCovarianceMatrix
and
pyRSD.rsdfit.data.PoleCovarianceMatrix
classes.
The most important thing to realize when dealing with the covariance matrix is that it is the covariance matrix of the full, concatenated data vector. For example, if we are fitting the \(\ell=0,2,4\) multipoles, then the full data vector is \(\mathcal{D} = [P_0, P_2, P_4]\).
If we have \(N_k\) data points measured for each multipole, then the covariance matrix has a size of \((3 N_k, 3 N_k)\). The upper-left sub-matrix of size \((N_k, N_k)\) is the covariance of \(P_0\) with itself, the next sub-matrix of size \((N_k, N_k)\) to the right is the covariance between \(P_0\) and \(P_2\), and similarly, the last sub-matrix on the top row is the covariance between \(P_0\) and \(P_4\).
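The block layout described above can be illustrated with plain numpy indexing (the array contents here are random; only the slicing matters):

```python
import numpy as np

Nk = 20

# stand-in for the full covariance of the concatenated [P0, P2, P4] vector
C = np.random.random(size=(3 * Nk, 3 * Nk))

C_00 = C[:Nk, :Nk]        # cov(P0, P0): upper-left block
C_02 = C[:Nk, Nk:2*Nk]    # cov(P0, P2): next block to the right
C_04 = C[:Nk, 2*Nk:]      # cov(P0, P4): last block on the top row

print(C_00.shape, C_02.shape, C_04.shape)  # (20, 20) (20, 20) (20, 20)
```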
As an example, we will generate some fake multipole data and compute the covariance below,
In [43]: from pyRSD.rsdfit.data import PoleCovarianceMatrix
# generate 100 fake monopoles
In [44]: P0 = np.random.random(size=(100, Nk)) # Nk = 20, see above
# 100 fake quadrupoles
In [45]: P2 = np.random.random(size=(100, Nk))
# 100 fake hexadecapoles
In [46]: P4 = np.random.random(size=(100, Nk))
# make the full data vector
In [47]: D = np.concatenate([P0, P2, P4], axis=-1) # shape is (100, 60)
# compute the covariance matrix
In [48]: cov = np.cov(D, rowvar=False) # shape is (60,60)
# initialize the PoleCovarianceMatrix
In [49]: ells = [0,2,4]
In [50]: C = PoleCovarianceMatrix(cov, k_cen, ells)
In [51]: C
Out[51]: <PoleCovarianceMatrix: size 60x60>
# write to plaintext file
In [52]: C.to_plaintext('pole_cov.dat')
The covariance matrix of our fake data has been saved to a plaintext file with the desired format. The first few lines of this file look like:
60 0.08749548040812441 -0.016328517094392307 0.03065759204306653 -0.0027165435929309336 0.002304111735306902 -0.003093621922043565 -0.01174711039079392 0.004067073725322639 0.017856388426813617
We can easily re-initialize a PoleCovarianceMatrix
object from this plaintext
file using
In [54]: C_2 = PoleCovarianceMatrix.from_plaintext('pole_cov.dat')
In [55]: C_2
Out[55]: <PoleCovarianceMatrix: size 60x60>
The case of \(P(k,\mu)\) data is very similar to multipoles, except now
the second dimension specifies the \(\mu\) bins, rather than the
multipole numbers. The class PkmuCovarianceMatrix
should be used
for \(P(k,\mu)\) data.
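The concatenated data vector is built the same way for \(\mu\) bins; for instance, with plain numpy (array contents are random, only the shapes matter):

```python
import numpy as np

Nk, Nmu, Nmocks = 20, 5, 100

# fake P(k, mu) measurements from 100 mocks; the full data vector
# concatenates the mu bins, giving Nk * Nmu columns
mocks = np.random.random(size=(Nmocks, Nk * Nmu))

# covariance of the full concatenated data vector
cov = np.cov(mocks, rowvar=False)
print(cov.shape)  # (100, 100)
```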
Warning
Make sure that the number of data points, and specifically the
\(k\) range used, of the data statistics in the file specified by the
data.data_file
attribute agrees with the size of
the covariance matrix.
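A minimal sketch of such a consistency check, using the shapes from the examples above (`cov` here is a stand-in for the matrix loaded from file):

```python
import numpy as np

Nk, Nstat = 20, 3        # e.g. 20 k bins per statistic, 3 multipoles
ndata = Nk * Nstat       # length of the concatenated data vector

cov = np.eye(ndata)      # stand-in for the covariance loaded from file
assert cov.shape == (ndata, ndata), "covariance size must match the data vector"
```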
The Window Function¶
If the user wishes to fit window-convolved power spectrum multipoles, a
file holding the correlation function multipoles of the window function should
be specified using the data.window_file
parameter. The theoretical
model will then be convolved with this window function to accurately compare
model and data.
The window function file should hold several columns of data, with the first column specifying the separation array \(s\), and the other columns giving the even-numbered correlation function multipoles of the window. See Window-convolved Power Spectra for a full example of the window multipoles.
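As a sketch of that column layout (the functional forms and filename below are made up purely for illustration; only the ordering of the columns reflects the description above):

```python
import numpy as np

# separation values for the window multipoles
s = np.linspace(1.0, 2000.0, 200)

# made-up even-numbered window multipoles, purely to show the layout;
# real ones would be measured from the survey window
W0 = np.exp(-s / 500.0)
W2 = 0.1 * np.exp(-s / 300.0)
W4 = 0.01 * np.exp(-s / 200.0)

# first column is s; the remaining columns are the even multipoles
np.savetxt('window.dat', np.column_stack([s, W0, W2, W4]))
```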
The \(P(k,\mu)\) Grid¶
As discussed in Discrete Binning Effects, the user can optionally take into account
the effects of discretely binned data with a finely-binned \(P(k,\mu)\) grid.
The name of the file holding this grid should be specified in the parameter
file using the data.grid_file
attribute.
Given a 2D data array specifying this grid, the easiest way to save the grid
to a plaintext file in the right format is to use the PkmuGrid.to_plaintext()
function. See Discrete Binning Effects for a full example of how to do this.
Note
If the user is fitting window-convolved multipoles, the grid file does not need to be specified.