ProbabilisticEchoInversion.jl

Introduction

Welcome to the documentation for ProbabilisticEchoInversion!

This package is designed to solve the "inverse problem" for acoustic backscatter at multiple frequencies in a Bayesian statistical framework. In other words, given observed echoes from one or more types of scatterers, this package will help you infer

  1. What they were,
  2. How many of them were present, and
  3. How sure you can be about (1) and (2).

We call this approach Automatic Probabilistic Echo Solving, or APES.

This documentation provides a short introduction to the problem and general approach, as well as a simple tutorial on how to use the package. It assumes basic familiarity with the principles of fisheries acoustics and Bayesian statistical modeling (e.g. experience with Turing.jl, Stan, or JAGS). While the package is written in Julia, we don't assume prior Julia knowledge, and you should hopefully be able to follow along if you are familiar with scientific programming in a similar scripting language like R, Python, or Matlab. Extensive resources for learning Julia are available at the official website.

Background

The typical approach in fisheries acoustics is to use relative differences in backscatter strength at multiple frequencies to classify parts of the water column as one thing or another (fish vs. zooplankton, large vs. small fish, etc.). These frequency responses are usually calculated by subtracting backscatter at two different frequencies in the decibel domain ("dB differencing"). Once the echogram has been classified using multifrequency information, scatterer density is estimated by echo-integrating at a single frequency, using the relationship

\[s_v = \langle \sigma_{bs} \rangle n, \;\;\;\;\;\;(1)\]

where $s_v$ is the volume backscattering coefficient, $\langle \sigma_{bs} \rangle$ is the average backscattering cross section of a single scatterer, and $n$ is their numerical density.
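
For instance, with made-up numbers (not from the tutorial data): a fish with a target strength of -40 dB has $\langle \sigma_{bs} \rangle = 10^{-4}$ m², so a measured $s_v$ of $10^{-6}$ m⁻¹ (i.e., $S_v = -60$ dB) implies a density of 0.01 fish per cubic meter:

σbs = exp10(-40 / 10)   # backscattering cross-section of one fish (m²), from TS = -40 dB
sv = exp10(-60 / 10)    # measured volume backscattering coefficient (m⁻¹), i.e. Sv = -60 dB
n = sv / σbs            # rearranging Equation 1: n = 0.01 fish m⁻³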

The Inverse Approach

The inverse approach relies on the same theory and assumptions, but instead of classifying first using multiple frequencies and then integrating using only one, it does both at the same time, using all frequencies available. Mathematically, if we have a vector of backscatter at multiple frequencies $\mathbf{s}_v$, and a matrix $\Sigma$ of backscattering cross-sections, where the $i,j^{th}$ entry holds $\langle \sigma_{bs} \rangle$ for species $j$ at frequency $i$, then solving the inverse problem means finding the vector of scatterer densities $\mathbf{n} \ge 0$ that solves the equation

\[\mathbf{s}_v = \Sigma \mathbf{n}. \;\;\;\;\;\;(2)\]
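
To make Equation 2 concrete, here is a toy example with made-up numbers (plain linear algebra, not this package): two scatterer types observed at three frequencies. Note that an ordinary least-squares solve like this enforces neither $\mathbf{n} \ge 0$ nor any accounting for uncertainty.

# Toy version of Equation 2: three frequencies, two scatterer types (made-up values)
Σ = [1e-4  1e-7;    # ⟨σ_bs⟩ for species 1 and 2 at the first frequency
     9e-5  5e-7;    # ... at the second frequency
     7e-5  1e-6]    # ... at the third frequency
n_true = [0.01, 50.0]   # true densities (animals m⁻³)
sv = Σ * n_true         # forward model: predicted volume backscattering coefficients
n_hat = Σ \ sv          # naive least-squares inversion; recovers n_true here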

The inverse approach has several advantages over the classify-then-integrate approach:

  • It uses all available frequency information to integrate, in theory giving a more robust estimate of animal density than single-frequency integration.
  • It extends naturally from a small number of narrowband frequencies to broadband spectra, or a mixture of the two.
  • It can handle mixtures of different scatterers, a situation where dB differencing struggles.

However, even though the inverse approach has been around for some time, it has never achieved widespread use in practice, because Equation 2 is often (if not usually) underdetermined, and it involves inherent uncertainties that are nonlinear and potentially difficult to quantify.

Bayesian Inverse Modeling

One way to address these challenges is to implement the inverse problem as a Bayesian statistical model. Like all inversion methods, a Bayesian approach handles species mixtures, uses all available frequencies, and extends naturally to broadband signals. However, it has a few distinct advantages. The priors required for a Bayesian model provide a rigorous way to incorporate assumptions, ecological knowledge, and/or data from direct sampling. Good priors provide the inverse model with additional information, improving the quality of the solution and allowing even underdetermined problems to be solved. Bayesian models can incorporate multiple sources of uncertainty and propagate them through to the solution, increasing the robustness of the results: a well-specified model should not produce solutions that are simultaneously wrong and confident. Finally, these models are based on physical scattering processes, so their output is fully interpretable, unlike some machine learning methods. Taken together, these advantages make the inverse approach robust and reliable enough to be used on real-world acoustic data.

Tutorial

Installation and setup

ProbabilisticEchoInversion.jl is a Julia package, so if you have not installed the Julia programming language, that's the first thing to do. You can download the latest version for free from the official website. Even better, use the Juliaup installer/version manager, which makes it much easier to upgrade when new Julia versions are released.

You can work with Julia files in any text editor, but for a nice integrated experience, we can recommend Visual Studio Code with the Julia extension, Jupyter, or Pluto.

Once you have Julia installed, open the Julia command line (a.k.a. the read-evaluate-print-loop, or REPL). While not required, it is easy and highly recommended to set up a local environment for each of your projects. To do that for this tutorial, run the following commands:

# create a new folder--could call it whatever you want
julia> mkdir("APESTutorial")

# change directory to the folder we just created
julia> cd("APESTutorial")

# type `]` to enter package manager mode
julia> ]

# activate the current directory as project environment
(@v1.9) pkg> activate .

(APESTutorial) pkg>

Install the package to this new project environment by running the following command:

(APESTutorial) pkg> add ProbabilisticEchoInversion

To recreate the graphics, install Plots.jl and ColorSchemes.jl the same way. The rest of the tutorial also uses CSV.jl, DataFrames.jl, DimensionalData.jl, and Turing.jl directly, so add those too:

(APESTutorial) pkg> add Plots ColorSchemes CSV DataFrames DimensionalData Turing

Note : Using local environments

You don't strictly need to create a local environment, and can install ProbabilisticEchoInversion into the top-level Julia environment (i.e., (@v1.9) instead of APESTutorial). This will make it available automatically for all projects. However, the more packages you install in the top-level environment, the more likely you are to end up with conflicting versions and dependencies. In our experience, working with local environments is much easier in the long run. As a pleasant side effect, it also makes it much easier to reproduce your analyses, since all the precise package versions you used are recorded automatically in the Project.toml and Manifest.toml files.

Once everything has downloaded and precompiled, you can exit the package manager by hitting backspace. To run the rest of this tutorial yourself, you'll need the data files located here. Download them to the project directory you just created. You can also download the example.jl script, which contains all the following code in one place.

Loading and arranging your data

ProbabilisticEchoInversion expects your acoustic data to be arranged in a multidimensional DimArray from DimensionalData.jl. DimArrays behave like normal Julia arrays, but also let you index with named dimensions and have a bunch of nice functionality for subsetting, slicing, and plotting. For a typical downward-looking echosounder on a moving ship, your data have three dimensions - depth, distance, and frequency - meaning this array will be three-dimensional.

If your data are already stored this way, for instance in a NetCDF or .mat file, it will be easy to convert them to a DimArray (refer to the DimensionalData.jl documentation for details). If your data are stored as a table in "long" format, as is typical for .CSV exports from Echoview, you will need to do some reshaping first. This package provides a function unstack_echogram to take care of that reshaping.

First, load the required packages.

using ProbabilisticEchoInversion
using Turing
using CSV, DataFrames
using DimensionalData, DimensionalData.Dimensions
using Plots, Statistics

The tutorial data consist of five comma-delimited files, one for each frequency. We read them in, add a frequency column to each one, and use vcat to stack them all into a single DataFrame.

freqs = [18, 38, 70, 120, 200]
echo_df = map(freqs) do f
    filename = joinpath(@__DIR__, "DY1702_haul24_$(f)kHz.csv")
    df = CSV.read(filename, DataFrame)
    df[!, :frequency] .= f
    return df
end
echo_df = vcat(echo_df...)

Next, we'll transform this data frame into a 3-d DimArray. The unstack_echogram function takes a long-format DataFrame as its first argument. That DataFrame needs to have at least four columns:

  1. An x-coordinate, such as along-track distance or time
  2. A y-coordinate, such as depth or range from the transducer
  3. Acoustic frequency, and
  4. Acoustic mean volume backscatter (can be in linear or decibel units)

The names of these columns are passed to unstack_echogram as the second through fifth arguments, respectively. In this example, the data files are standard .csv exports from Echoview, and the relevant column names are:

echo = unstack_echogram(echo_df, :Dist_M, :Layer_depth_min, :frequency, :Sv_mean)

By default, the axes of echo will be named X, Y, and F. If you'd like to define your own dimensions, this is easy to do with the @dim macro from DimensionalData.Dimensions. You can then supply them in the optional final three arguments to unstack_echogram and they will be applied to the DimArray it returns.

@dim Z YDim "Depth (m)"
@dim D XDim "Distance (km)"
echo = unstack_echogram(echo_df, :Dist_M, :Layer_depth_min, :frequency, :Sv_mean, D, Z)

It is now easy to manipulate the multifrequency echogram, for instance by selecting a slice by frequency and plotting it. Refer to the DimensionalData.jl docs to learn more about how to slice and dice DimArrays.

heatmap(echo[F(At(120))], yflip=true)

Echogram of example data at 120 kHz
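
For example, assuming the depth lookup values are in meters, you could select just the upper 100 m of the 38 kHz channel using DimensionalData's selectors (a sketch; nothing here is specific to ProbabilisticEchoInversion):

# Select the 38 kHz channel and the upper 100 m (depth values assumed to be in meters)
echo_38_shallow = echo[F(At(38)), Z(Between(0, 100))]
heatmap(echo_38_shallow, yflip=true)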

Defining the model

Once the data are loaded in, we need to define the inverse model we want to solve. This is done using the probabilistic programming language Turing.jl. If you are familiar with BUGS, JAGS, or Stan, model definitions in Turing are conceptually very similar. If you have not worked with it before, it is worth studying the Turing documentation before going any further.

A very simple inverse model is defined below.

@model function examplemodel(data, params)
    nfreq, nspp = size(params.TS)
    Σ = exp10.(params.TS ./ 10)

    # define priors
    logn ~ arraydist(Normal.(zeros(nspp), fill(3, nspp))) # scatterer log-densities
    ϵ ~ Exponential(1.0) # observation error standard deviation (dB)

    # Predict Sv based on scatterer density and TS
    n = exp10.(logn)
    μ = 10log10.(Σ * n)

    # Compare observed to predicted backscatter
    data.backscatter .~ Normal.(μ, fill(ϵ, nfreq))
end

To work with APES, your model function must take two arguments. The first, data, is a NamedTuple containing acoustic data and metadata from a single cell. These data tuples will be generated automatically when you run the analysis using the apes function (described below), and will contain three fields, all accessible using dot-notation:

  • data.backscatter : Vector of backscatter values (in whatever units you defined your echo array)
  • data.freqs : Vector of frequencies at which backscatter was recorded
  • data.coords : Spatial/temporal coordinates of the cell within the echo array

You can use any of these fields inside the model, but data.backscatter is the only one you must use, since it contains the actual observations.

The second argument, params, is for any constants or auxiliary information you want to pass to the model. It will typically be a NamedTuple, but can be any data type you want. If your model doesn't need any other info, just supply an empty tuple (). Here, we'll use params to hold a single item, a matrix of target strengths (TS).

This model assumes a fixed number of scattering classes are present, each with a known TS spectrum. It puts a vague prior on their log-densities, and assumes a single error variance for all frequencies.

Note : Writing models in the log-domain

Note that this model is defined in the logarithmic domain - that is, the scatterer densities are written as log-densities, and the observed data are assumed to be decibel-valued mean volume backscattering strengths ($S_v$) instead of linear mean volume backscattering coefficients ($s_v$). While not strictly required, defining your models this way is a really good idea. The small absolute values and wide ranges of both scatterer densities and observed backscatter mean that linear-domain models often have problems with floating-point precision, which can manifest as inefficient and/or incorrect inference.
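
For reference, the decibel/linear conversions used inside the model above are just powers of ten. Hypothetical helper functions (not part of the package) would look like this:

# Hypothetical helpers mirroring the exp10.(TS ./ 10) and 10log10.(...) calls in examplemodel
to_linear(dB) = exp10.(dB ./ 10)   # Sv (dB re 1 m⁻¹) -> sv (m⁻¹)
to_dB(x) = 10 .* log10.(x)         # and back again

to_dB(to_linear(-70.0))   # ≈ -70.0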

The last step in setting up our model is to choose our candidate scatterers and construct the TS matrix. A research trawl performed at this location found a mixture of Alaska pollock (Gadus chalcogrammus), unidentified lanternfish, and Pacific glass shrimp (Pasiphaea pacifica). We will assume these were the main scatterers present and define three plausible TS spectra at our five frequencies. We then concatenate them into a matrix and wrap it in a named tuple.

TS_pollock = [-34.6, -35.0, -35.6, -36.6, -38.5]
TS_myctophid = [-73.0, -58.0, -65, -67.2, -70.0]
TS_shrimp = [-100, -90, -82, -76.2, -73.7]
TS = [TS_pollock TS_myctophid TS_shrimp]
params = (; TS)

plot(freqs, TS, marker=:o, label=["Pollock" "Myctophid" "Shrimp"],
    xlabel="Frequency (kHz)", ylabel="TS (dB re m⁻²)")

TS spectra of pollock, myctophids, and shrimp
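
Before inverting the whole echogram, it can be reassuring to sanity-check the model on a single spectrum. The sketch below does this with the iterspectra function (documented in the API section), which yields the same kind of NamedTuple that apes passes to the model; it assumes the first cell happens to contain no missing values.

# Optional sanity check: condition the model on one spectrum and sample it directly
spectrum = first(iterspectra(echo))
test_chain = sample(examplemodel(spectrum, params), NUTS(), 1000)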

Running the model

Once the data, parameters, and model are all set up, running it is just one line of code.

solution_mcmc = apes(echo, examplemodel, MCMCSolver(), params=params)

This will draw 1,000 samples from the joint posterior of the model for each acoustic cell, using the No-U-Turn Sampler (NUTS) from Turing. Any cells where all backscatter values are missing (e.g., below-bottom cells) will be skipped. Altogether, the inference will take on the order of 5-20 minutes to finish, depending on your machine.
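
If the defaults don't suit your problem, MCMCSolver accepts keyword arguments (see the API section below). For example, based on the keyword names in its docstring, something like this should draw more samples using a higher target acceptance rate:

# Sketch of a customized solver; keyword names follow the MCMCSolver docstring
solver = MCMCSolver(sampler=NUTS(0.9), nsamples=2000)
solution_long = apes(echo, examplemodel, solver, params=params)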

If you don't have time to wait for (asymptotically) exact inference, you can opt for a much faster option: maximum-a-posteriori optimization, with errors estimated via the delta method. This is done by changing the solver argument to MAPSolver(). Inference in this case takes just a few seconds.

solution_map = apes(echo, examplemodel, MAPSolver(), params=params)

In either case (MCMC chains or optimization fits), the results are returned in a DimArray with the same first two dimensions as echo, so they are also easy to manipulate and plot. For instance, arrays of the posterior mean and standard deviation of each parameter can be obtained this way,

post_mean = passmissing(chn -> mean(Array(chn), dims=1)).(solution_mcmc);
post_std = passmissing(chn -> std(Array(chn), dims=1)).(solution_mcmc);

where we use passmissing to deal with the fact that some of the result cells contain MCMC chains and some are missing.
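
To visualize one of these summaries as an echogram, pull a single parameter out of each cell. The sketch below assumes the first entry of each per-cell summary corresponds to logn[1] (the pollock log-density); verify the parameter ordering by inspecting one of the chains (e.g., first(skipmissing(solution_mcmc))) before relying on it.

# Hypothetical example: map the posterior mean pollock log-density
pollock_logn = passmissing(m -> m[1]).(post_mean)
heatmap(coalesce.(pollock_logn, NaN), yflip=true)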

More Advanced Examples

Because the inverse model is defined in Turing.jl's modeling language, APES is incredibly flexible in terms of the data and situations to which it can be applied. For a more in-depth look at some of its capabilities, please check out the fully-worked example problems in the APESExamples repository. These reproduce the analyses and figures in Urmy et al. (2023), and include:

  • A simulated fish/zooplankton mixture, demonstrating how to solve simultaneously for scatterer size and density,
  • A simulated mesopelagic scattering layer, demonstrating how to use ground-truth data to constrain the solution of an underdetermined inverse problem,
  • An application of APES to a variety of mixed scattering types at the Aleutian shelfbreak in the Gulf of Alaska,
  • And an application of APES to broadband backscatter from zooplankton and fish of mixed sizes in Barnabas Trough, south of Kodiak Island.

Using, Citing, and Contributing

This software was developed by Sam Urmy at NOAA's Alaska Fisheries Science Center. As a product of the U.S. Government, it is free for anyone to use under a public-domain Creative Commons CC0 license.

ProbabilisticEchoInversion.jl has been tested and peer-reviewed, but should still be considered research-grade beta software rather than fully production-ready. If you try it on your own data, please do submit bug reports, comments, and feature requests via the project's GitHub repository. Pull requests are welcome, both for code and documentation!

Finally, if you do use APES in your own work, please cite the following publication:

Urmy, De Robertis, and Bassett (2023). A Bayesian inverse approach to identify and quantify organisms from fisheries acoustic data. ICES Journal of Marine Science, https://doi.org/10.1093/icesjms/fsad102

API

ProbabilisticEchoInversion.MAPMCMCSolverType
MAPMCMCSolver([;optimizer, options])

Construct an MAPMCMCSolver, specifying how to invert a probabilistic backscattering model using a combination of maximum a posteriori optimization and Markov-chain Monte Carlo. This simply means that an optimization routine finds the MAP point estimate of the parameters, which is then used as the starting point for the MCMC run.

Arguments correspond exactly to the ones for MAPSolver and MCMCSolver; refer to their documentation for details.

source
ProbabilisticEchoInversion.MAPSolverType
MAPSolver([;optimizer, options])

Construct a MAPSolver, specifying how to invert a probabilistic backscattering model using maximum a posteriori optimization. Default optimizer is LBFGS. See Turing.jl and Optim.jl documentation for more information on available solvers and options.

source
ProbabilisticEchoInversion.MCMCSolverType
MCMCSolver([;sampler, parallel, nsamples, nchains; kwargs, verbose])

Construct an MCMCSolver, specifying how to invert a probabilistic backscattering model using Markov-chain Monte Carlo. By default uses the no-U-turn sampler with acceptance rate 0.8 and collects 1000 samples. See Turing.jl documentation for more information on options for MCMC sampling.

source
ProbabilisticEchoInversion.apesMethod
apes(echogram, model, solver[; params, result_handler, safe_precision, distributed])

Run the automatic probabilistic echo solver defined by the inverse model and solution method solver on the acoustic backscatter data in echogram.

Arguments

  • echogram::DimArray: Acoustic backscatter data in the linear domain (i.e., volume backscattering coefficient $s_v$, area backscattering coefficient $s_a$, or nautical area scattering coefficient $NASC$). The last dimension of the DimArray should be named :F and index the acoustic frequencies; all other dimensions should reference spatial/temporal coordinates.
  • model::Function: Probabilistic inverse model defined with Turing.jl or DynamicPPL.jl. This model should have the signature model(data, params), where data and params contain the acoustic data and any additional parameters. See below for more details.
  • solver::AbstractSolver: The method used to solve the inverse problem specified in model. See MCMCSolver and MAPSolver for more detail.
  • params: Optional additional params to pass to model.
  • result_handler: Optional function to transform the output of the solver before it is stored in the results array (for instance, by calculating the means of a Markov chain).
  • distributed::Bool=false: Whether to use all available processors when fitting model to echogram cells.

Details

This function applies a probabilistic inverse backscattering model defined using Turing.jl to each spectrum in a multifrequency echogram. The model's constructor must accept two arguments:

  • data: A NamedTuple or other structure accessible by dot-notation, with fields coords, freqs, and backscatter. These contain the observed acoustic data. The model doesn't have to use all of them.
  • params: Optional NamedTuple or other object, containing any constants or auxiliary information used by the model.
source
ProbabilisticEchoInversion.iterspectraFunction
iterspectra(echogram[, freqdim])

Given a multifrequency or broadband echogram in the form of a DimArray, with the acoustic frequencies in one dimension (by default named :F), iterate over each spectrum. The iterator yields NamedTuples with three fields:

  • coords: Coordinates of the spectrum on the non-:F dimensions of the array (i.e., its location in space/time)
  • freqs: Array of acoustic frequencies
  • backscatter: Array of backscatter values
source
ProbabilisticEchoInversion.mapspectraMethod
mapspectra(f, echogram[; freqdim, distributed])

Map the function f over each spectrum in the DimArray echogram. By default, the acoustic frequencies are assumed to be recorded in the dimension named :F; if this is not the case, specify the name of the dimension using the freqdim argument.

source
ProbabilisticEchoInversion.solveFunction
solve(data, model, solver[, params])

Run the probabilistic inverse model defined by model on the acoustic backscatter spectrum in data, using solver as the inference engine.

source
ProbabilisticEchoInversion.unstack_echogramFunction
unstack_echogram(echo_df, xcol, ycol, fcol, svcol[, X, Y, F])

Convert a long-format DataFrame of multifrequency acoustic data into a three-dimensional DimArray representing an echogram. This DataFrame must contain the following columns:

  1. An x-coordinate, such as along-track distance or time
  2. A y-coordinate, such as depth or range from the transducer
  3. Acoustic frequency, and
  4. Acoustic mean volume backscatter (can be in linear or decibel units)

The names of these columns are passed to unstack_echogram as the second through fifth arguments, respectively.

  • echo_df::DataFrame : Long-format DataFrame with columns for an x and y coordinate, acoustic frequency, and backscatter.
  • xcol, ycol, fcol, svcol : Names of the columns in echo_df (as Symbols)
  • X, Y, F : Optional Dimensions to apply to the x, y, and frequency axes of the resulting DimArray.

source