Patrick Hangl
2026-05-05 09:53:04 +02:00
parent 5e5331f099
commit 0a5244ec3f
9 changed files with 312 additions and 244 deletions
+137 -126
@@ -60,26 +60,26 @@ As the \ac{ARM} operation is not the main focus of this thesis and its behavior
The implementation of the \ac{ANR} algorithm on the \ac{DSP} core is structured into several key sections, each responsible for specific aspects of the algorithm's functionality. The following paragraphs outline the main components:
\paragraph{Memory initialization and mapping}
The memory initialization section starts with the definition of the interrupt register (0xC00004) and the corresponding bit masks used to control the interrupt behavior of the \ac{DSP} core. Afterwards, a section in the shared memory is defined for storing input audio samples after their transport from the \ac{PCM} interface and output audio samples before their transport to it. The output section is placed at an offset of 16 bytes from the input section (0x800000), giving each of the two memory sections a capacity of four 32-bit double-words. This is more than currently needed, but avoids relocating memory should more space be required in the future. The interrupt register and the memory sections are then declared as volatile variables, which tells the compiler that they can change outside the normal program flow (e.g., through hardware interrupts) and prevents certain optimizations. The input/output buffers themselves are declared as two 16-bit arrays of 4 elements each. Finally, a flag variable is declared that signals to the \ac{DSP} core that an interrupt has occurred, which changes the state of the interrupt register and indicates a processing request.
\begin{figure}[H]
\centering
\begin{lstlisting}[language=C]
#define CSS_CMD 0xC00004
#define CSS_CMD_0 (1<<0)
#define CSS_CMD_1 (1<<1)
\begin{listing}[H]
\centering
\begin{lstlisting}[style=cstyle]
#define CSS_CMD 0xC00004
#define CSS_CMD_0 (1 << 0)
#define CSS_CMD_1 (1 << 1)
#define INPUT_PORT0_ADD 0x800000
#define OUTPUT_PORT_ADD (INPUT_PORT0_ADD + 16)
#define INPUT_PORT0_ADD 0x800000
#define OUTPUT_PORT_ADD (INPUT_PORT0_ADD + 16)
volatile static unsigned char set_storage(DMIO:CSS_CMD) css_cmd_flag;
volatile static unsigned char set_storage(DMIO:CSS_CMD) css_cmd_flag;
static volatile int16_t set_storage(DMB:INPUT_PORT0_ADD) input_port[4];
static volatile int16_t set_storage(DMB:OUTPUT_PORT_ADD) output_port[4];
static volatile int16_t set_storage(DMB:INPUT_PORT0_ADD) input_port[4];
static volatile int16_t set_storage(DMB:OUTPUT_PORT_ADD) output_port[4];
static volatile int action_required;
\end{lstlisting}
\label{fig:fig_dps_code_memory}
\caption{Low-level implementation: Memory initialization and mapping}
\end{figure}
static volatile int action_required;
\end{lstlisting}
\caption{Low-level implementation: Memory initialization and mapping}
\label{lst:lst_dsp_code_memory}
\end{listing}
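The address arithmetic of the mapping can be checked on a host machine. The following is a minimal sketch, assuming byte-addressed memory; the compiler-specific storage qualifiers are omitted, and the helper names $section\_bytes()$, $buffer\_bytes()$ and $check\_layout()$ are hypothetical and not part of the original code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INPUT_PORT0_ADD 0x800000u
#define OUTPUT_PORT_ADD (INPUT_PORT0_ADD + 16u)

/* Hypothetical helper: bytes reserved between input and output section. */
static size_t section_bytes(void) {
    return (size_t)(OUTPUT_PORT_ADD - INPUT_PORT0_ADD);
}

/* Hypothetical helper: bytes occupied by one 4-element int16_t buffer. */
static size_t buffer_bytes(void) {
    return 4u * sizeof(int16_t);
}

/* The buffer must fit into the reserved section (assumes byte addressing). */
static void check_layout(void) {
    assert(buffer_bytes() <= section_bytes());
}
```

Under these assumptions, the reserved 16 bytes correspond to four 32-bit words, while each 16-bit buffer occupies 8 bytes, which matches the statement that more space is reserved than is currently needed.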
\noindent Figure \ref{fig:fig_compiler.jpg} shows an example memory map of the input buffer array, taken from the compiler's debugger. Because the array is declared as a 16-bit integer array, each element occupies 2 bytes of memory.
\begin{figure}[H]
\centering
@@ -90,30 +90,41 @@ The memory initialization section starts with the definition of the interrupt re
\paragraph{Main loop and interrupt handling}
The main loop of the \ac{DSP} core is quite compact, as it mainly handles interrupts and delegates the sample processing to the \ac{ANR} function. The main function starts by enabling interrupts with a compiler-specific function and setting up pointers for the output buffer and the sample variable. After setting the action flag to zero, it enters an infinite loop, in which it signals its halted state to the \ac{ARM} core by writing CSS\_CMD\_1 to the interrupt register and then halts the core.\\ \\
If the \ac{ARM} core requests a sample to be processed, it activates the \ac{DSP} core and triggers an interrupt, which sets the action flag to 1. The main loop then checks the action flag and writes CSS\_CMD\_0 to the interrupt register, indicating to the \ac{ARM} core that it is now processing the sample. After resetting the action flag, the output pointer is advanced to the next position in the output buffer using a cyclic addition function. Before calling the calculate\_output()-function, the sample calculated in the previous cycle is moved from its temporary memory location to the current position in the output buffer. Afterwards, the calculate\_output()-function is called for the current cycle and the loop restarts. The flow diagram in Figure \ref{fig:fig_dsp_logic.jpg} visualizes the described behavior of the main loop and interrupt handling on the \ac{DSP} core.
\begin{figure}[H]
\centering
\begin{lstlisting}[language=C]
int main(void) {
enable_interrupts();
output_pointer = &output_port[1];
sample_pointer = &sample;
\begin{listing}[H]
\centering
\begin{lstlisting}[style=cstyle]
int main(void) {
enable_interrupts();
output_pointer = &output_port[1];
sample_pointer = &sample;
action_required = 0;
while (1) {
css_cmd_flag = CSS_CMD_1;
core_halt();
if (action_required == 1) {
css_cmd_flag = CSS_CMD_0;
action_required = 0;
while (1){
css_cmd_flag = CSS_CMD_1;
core_halt();
if (action_required == 1) {
css_cmd_flag = CSS_CMD_0;
action_required = 0;
output_pointer = cyclic_add(output_pointer, 2, output_port, 4);
*output_pointer = *sample_pointer;
calculate_output(&corrupted_signal, &reference_noise_signal, mode, &input_port[1], &input_port[0], sample_pointer);
}
}
output_pointer = cyclic_add(output_pointer, 2, output_port, 4);
*output_pointer = *sample_pointer;
calculate_output(
&corrupted_signal,
&reference_noise_signal,
mode,
&input_port[1],
&input_port[0],
sample_pointer
);
}
\end{lstlisting}
\caption{Low-level implementation: Main loop and interrupt handling}
\label{fig:fig_dps_code_mainloop}
\end{figure}
}
}
\end{lstlisting}
\caption{Low-level implementation: Main loop and interrupt handling}
\label{lst:lst_dsp_code_mainloop}
\end{listing}
\begin{figure}[H]
\centering
\includegraphics[width=1.0\linewidth]{Bilder/fig_dsp_logic.jpg}
@@ -129,37 +140,37 @@ The ability to process audio samples in real-time on the \ac{DSP} core is strong
In the following, some examples of optimization possibilities are outlined before the entire \ac{ANR} implementation on the \ac{DSP} is analyzed with regard to its performance.
\paragraph{Logic operations}
Logic operations, such as finding the maximum or minimum of two values, are quite common in signal processing algorithms. However, their implementation in C usually involves conditional statements (if-else), which can be inefficient on certain architectures due to pipeline stalls.\\ \\
The simple function shown in Figure \ref{fig:fig_dsp_code_find_max} returns the maximum of two given integer values. Processing this manual implementation on the \ac{DSP} takes 12 cycles to execute, while the intrinsic function of the \ac{DSP} compiler allows a 4-cycle execution.
\begin{figure}[H]
\centering
\begin{lstlisting}[language=C]
int find_max(int a, int b){
return (a > b) ? a : b;
}
\end{lstlisting}
\caption{Manual implementation of a max-function, returning the maximum of two integer values, taking 12 cycles to execute. The intrinsic functions of the DSP compiler allows a 4-cycle implementation of such an operation.}
\label{fig:fig_dsp_code_find_max}
\end{figure}
The simple function shown in Listing \ref{lst:lst_dsp_code_find_max} returns the maximum of two given integer values. Processing this manual implementation on the \ac{DSP} takes 12 cycles to execute, while the intrinsic function of the \ac{DSP} compiler allows a 4-cycle execution.
\begin{listing}[H]
\centering
\begin{lstlisting}[style=cstyle]
int find_max(int a, int b) {
return (a > b) ? a : b;
}
\end{lstlisting}
\caption{Manual implementation of a max-function, returning the maximum of two integer values, taking 12 cycles to execute. The intrinsic functions of the DSP compiler allow a 4-cycle implementation of such an operation.}
\label{lst:lst_dsp_code_find_max}
\end{listing}
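To illustrate why avoiding the conditional can help, the maximum can also be written without a branch in portable C. This is a minimal sketch and not the intrinsic of the \ac{DSP} compiler; the function name $max\_branchless()$ is hypothetical, and no cycle count is claimed for it.

```c
#include <assert.h>
#include <stdint.h>

/* Branch-free maximum of two 32-bit integers (illustrative only,
   not the DSP compiler intrinsic). */
static int32_t max_branchless(int32_t a, int32_t b) {
    /* mask is all ones when a < b, otherwise all zeros */
    int32_t mask = -(int32_t)(a < b);
    /* (a ^ b) & mask is (a ^ b) when a < b, so the result is b; else it is a */
    return a ^ ((a ^ b) & mask);
}
```

The comparison result is turned into a bit mask that selects one of the two operands, so no jump instruction is needed in the source-level formulation.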
\paragraph{Cyclic array iteration}
Nearly every part of the \ac{ANR} algorithm relies on iterating through memory sections in a cyclic manner. In C, this is usually implemented by defining an array that contains the data and a pointer that is incremented after each access. When the pointer reaches the end of the array, it is reset to the beginning. This approach requires several different operations, such as pointer increments, if-clauses and for-loops.
\begin{figure}[H]
\centering
\begin{lstlisting}[language=C]
int* cyclic_array_iteration(int *pointer, int increment, int *pointer_start, int buffer_length){
int *new_pointer=pointer;
for (int i=0; i < abs(increment); i+=1){
new_pointer ++;
if (new_pointer >= pointer_start + buffer_length){
new_pointer=pointer_start;
}
\begin{listing}[H]
\centering
\begin{lstlisting}[style=cstyle]
int* cyclic_array_iteration(int *pointer, int increment, int *pointer_start, int buffer_length){
int *new_pointer=pointer;
for (int i=0; i < abs(increment); i+=1){
new_pointer ++;
if (new_pointer >= pointer_start + buffer_length){
new_pointer=pointer_start;
}
return new_pointer;
}
\end{lstlisting}
\caption{Manual implementation of a cyclic array iteration function in C, taking the core 20 cycles to execute a pointer increment of 1. The intrinsic functions of the DSP compiler allow a single-cycle implementation of such cyclic additions.}
\label{fig:fig_dsp_code_cyclic_add}
\end{figure}
\noindent Figure \ref{fig:fig_dsp_code_cyclic_add} shows a manual implementation of such a cyclic array iteration function in C, which updates the pointer to a new address. This implementation takes the \ac{DSP} 20 cycles to execute, while the already implemented compiler-optimized version only takes one cycle, making use of the specific architecture of the \ac{DSP} allowing such a single-cycle operation.
}
return new_pointer;
}
\end{lstlisting}
\caption{Manual implementation of a cyclic array iteration function in C, taking the core 20 cycles to execute a pointer increment of 1. The intrinsic functions of the DSP compiler allow a single-cycle implementation of such cyclic additions.}
\label{lst:lst_dsp_code_cyclic_add}
\end{listing}
\noindent Listing \ref{lst:lst_dsp_code_cyclic_add} shows a manual implementation of such a cyclic array iteration function in C, which updates the pointer to a new address. This implementation takes the \ac{DSP} 20 cycles to execute, while the already implemented compiler-optimized version only takes one cycle, making use of the specific architecture of the \ac{DSP} allowing such a single-cycle operation.
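The intended semantics of a cyclic pointer update can also be expressed without a loop. The following is a minimal sketch with a hypothetical name, $cyclic\_add\_ref()$; it is not the compiler intrinsic, and the modulo operation it uses may itself be expensive on the target. Unlike the loop-based version, it also steps backwards for negative increments, as needed for calls with an increment of $-1$ or $-2$.

```c
#include <assert.h>

/* Hypothetical reference for a cyclic pointer update; not the intrinsic.
   Moves `pointer` by `increment` elements inside [start, start + length). */
static int *cyclic_add_ref(int *pointer, int increment, int *start, int length) {
    assert(length > 0);
    int offset = (int)(pointer - start);        /* current index in the buffer */
    int index = (offset + increment) % length;  /* remainder may be negative in C */
    if (index < 0) {
        index += length;                        /* wrap negative results around */
    }
    return start + index;
}
```

For a buffer of length 4, stepping forward by 1 from the last element yields the first element, and stepping back by 1 from the first element yields the last one.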
\paragraph{Fractional fixed-point arithmetic}
As mentioned at the beginning of this chapter, the \ac{DSP} used is a fixed-point processor, meaning that it does not support floating-point arithmetic natively. Instead, it relies on fixed-point arithmetic, which represents numbers as integers scaled by a fixed factor. This is a key requirement, as it allows the use of the implemented dual \ac{MAC} \ac{ALU}s. This approach is also faster and more energy efficient, and therefore more suitable for embedded systems. However, it also introduces challenges in terms of precision and range, which need to be taken into account when conducting certain calculations.\\ \\
To tackle these issues, the \ac{DSP} compiler provides intrinsic functions for fractional fixed-point arithmetic, such as a fractional multiplication function, which takes two 32-bit integers as input and returns an already bit-shifted 64-bit output representing the fractional multiplication result. This removes the need for manual bit-shifting operations after each multiplication.\\ \\
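What such a fractional multiplication does can be sketched in portable C. The sketch assumes the common Q1.31 input format and a Q1.63 result; the exact formats and the saturation behavior of the actual intrinsic are not stated here, and the name $q31\_mul()$ is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a fractional multiply: Q1.31 x Q1.31 -> Q1.63.
   The formats and the saturation are assumptions, not the documented intrinsic. */
static int64_t q31_mul(int32_t a, int32_t b) {
    /* The only overflowing case is (-1.0) * (-1.0); saturate it. */
    if (a == INT32_MIN && b == INT32_MIN) {
        return INT64_MAX;
    }
    /* The raw 64-bit product is in Q2.62; doubling it realigns the
       binary point to Q1.63, which is the "bit shift" done manually otherwise. */
    return (int64_t)a * (int64_t)b * 2;
}
```

With these formats, multiplying 0.5 by 0.5 (both stored as $2^{30}$) yields $2^{61}$, which represents 0.25 in Q1.63.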
@@ -169,103 +180,103 @@ To support such operations, a 72-bit accumulator is provided, allowing to store
The $calculate\_output()$-function forms the center of the \ac{ANR} algorithm on the \ac{DSP} core and is responsible for the actual processing of the audio samples. The general functionality of the function in C is the same as in the high-level implementation (refer to Figure \ref{fig:fig_anr_logic}) and will therefore not be described in detail again. The main focus now lies on the computational efficiency of the different parts of the function, with the goal of deriving a formula by quantifying the computational effort of the different sub-parts in relation to adjustable parameters like the filter length.\\ \\
The $calculate\_output()$-function consists of the following five main parts:
\begin{itemize}
\item $write\_buffer$: Pointer handling and buffer management
\item $apply\_fir\_filter$: Application of the \ac{FIR} filter on the reference noise signal
\item $update\_output$: Calculation of the output sample (=error signal)
\item $update\_filter\_coefficients$: Update of the \ac{FIR} filter coefficients based on the error signal
\item $write\_output$: Writing the output sample back to the output port in the shared memory section
\item $write\_buffer()$: Pointer handling and buffer management
\item $apply\_fir\_filter()$: Application of the \ac{FIR} filter on the reference noise signal
\item $update\_output()$: Calculation of the output sample (=error signal)
\item $update\_filter\_coefficients()$: Update of the \ac{FIR} filter coefficients based on the error signal
\item $write\_output()$: Writing the output sample back to the output port in the shared memory section
\end{itemize}
These sub-functions feature \ac{DSP}-specific optimizations, and their computational cost partly depends on adjustable parameters like the filter length. The following paragraphs analyze the computational efficiency of these sub-functions in detail.
\paragraph{write\_buffer}The $write\_buffer$-function is responsible for managing the input line, where the samples of the reference noise signal are stored for further processing. The buffer management mainly consists of a cyclic pointer increment and a pointer dereference to write the new sample into the buffer. The cyclic pointer increment is implemented using the already mentioned intrinsic function of the \ac{DSP} compiler and takes a single cycle, while the pointer dereference takes 15 cycles to execute. This results in a total duration of 16 cycles for the $write\_buffer$-function, independent of the filter length or other parameters.
\paragraph{apply\_fir\_filter} The $apply\_fir\_filter$-function is responsible for applying the coefficients of the \ac{FIR} filter to the reference noise signal samples stored in the input line. The number of cycles needed for this function mainly depends on the length of the filter, as the number of multiplications and additions increases with the filter length. To increase the performance, the dual \ac{MAC} architecture of the \ac{DSP} is utilized, allowing two multiplications and two additions to be performed in a single cycle. Another \ac{DSP}-specific optimization is the use of the already introduced 72-bit accumulators and the fractional multiplication function, which allows multiplying two 32-bit integers without losing precision or requiring manual bit-shifting operations.
\begin{figure}[H]
\centering
\begin{lstlisting}[language=C]
for (int i=0; i < n_coeff; i+=2) chess_loop_range(1,){
x0 = *p_x0;
w0 = *p_w;
p_w++;
p_x0 = cyclic_add(p_x0, -1, p_xstart, sample_line_len);
x1 = *p_x0;
w1 = *p_w;
p_w++;
p_x0 = cyclic_add(p_x0, -1, p_xstart, sample_line_len);
\paragraph{write\_buffer()}The $write\_buffer()$-function is responsible for managing the input line, where the samples of the reference noise signal are stored for further processing. The buffer management mainly consists of a cyclic pointer increment and a pointer dereference to write the new sample into the buffer. The cyclic pointer increment is implemented using the already mentioned intrinsic function of the \ac{DSP} compiler and takes a single cycle, while the pointer dereference takes 15 cycles to execute. This results in a total duration of 16 cycles for the $write\_buffer()$-function, independent of the filter length or other parameters.
\paragraph{apply\_fir\_filter()} The $apply\_fir\_filter()$-function is responsible for applying the coefficients of the \ac{FIR} filter to the reference noise signal samples stored in the input line. The number of cycles needed for this function mainly depends on the length of the filter, as the number of multiplications and additions increases with the filter length. To increase the performance, the dual \ac{MAC} architecture of the \ac{DSP} is utilized, allowing two multiplications and two additions to be performed in a single cycle. Another \ac{DSP}-specific optimization is the use of the already introduced 72-bit accumulators and the fractional multiplication function, which allows multiplying two 32-bit integers without losing precision or requiring manual bit-shifting operations.
\begin{listing}[H]
\centering
\begin{lstlisting}[style=cstyle]
for (int i=0; i < n_coeff; i+=2) chess_loop_range(1,){
x0 = *p_x0;
w0 = *p_w;
p_w++;
p_x0 = cyclic_add(p_x0, -1, p_xstart, sample_line_len);
x1 = *p_x0;
w1 = *p_w;
p_w++;
p_x0 = cyclic_add(p_x0, -1, p_xstart, sample_line_len);
acc_fir_1+=fract_mult(x0, w0);
acc_fir_2+=fract_mult(x1, w1);
}
\end{lstlisting}
\caption{Code snippet of the $apply\_fir\_filter$-function, showing the use of the dual \ac{MAC} architecture of the \ac{DSP} and the fractional multiplication function. The loop iterates through the filter coefficients and reference noise signal samples, performing two multiplications and two additions in each cycle.}
\label{fig:fig_dsp_code_apply_fir_filter}
\end{figure}
acc_fir_1+=fract_mult(x0, w0);
acc_fir_2+=fract_mult(x1, w1);
}
\end{lstlisting}
\caption{Code snippet of the $apply\_fir\_filter()$-function, showing the use of the dual \ac{MAC} architecture of the \ac{DSP} and the fractional multiplication function. The loop iterates through the filter coefficients and reference noise signal samples, performing two multiplications and two additions in each cycle.}
\label{lst:lst_dsp_code_apply_fir_filter}
\end{listing}
\begin{figure}[H]
\centering
\includegraphics[width=1.0\linewidth]{Bilder/fig_dsp_fir_cycle.jpg}
\caption{Visualization of the FIR filter calculation in the $apply\_fir\_filter$-function during the 2nd cycle of a calculation loop. The reference noise signal samples are stored in the sample line, while the filter coefficients are stored in a separate memory section (filter line).}
\caption{Visualization of the FIR filter calculation in the $apply\_fir\_filter()$-function during the 2nd cycle of a calculation loop. The reference noise signal samples are stored in the sample line, while the filter coefficients are stored in a separate memory section (filter line).}
\label{fig:fig_dsp_fir_cycle.jpg}
\end{figure}
\noindent The final result is represented in a computing effort of 1 cycle per item in the sample line buffer (which equals the filter length) plus 12 cycles for general function overhead, resulting in a total of $N+12$ cycles for the $apply\_fir\_filter$-function, with $N$ being the filter length.
\paragraph{update\_output} The $update\_output$-function is responsible for calculating the output sample based on the error signal and the accumulated filter output. The calculation is a simple subtraction and only takes 1 cycle to execute, independent of the filter length or other parameters.
\paragraph{update\_filter\_coefficient} The $update\_filter\_coefficient$-function represents the second computationally expensive part of the $calculate\_output()$-function. The output calculated by the previous function is now multiplied with the step size and the corresponding sample from the reference noise signal, which is stored in the sample line buffer. The result is then added to the current filter coefficient to update it for the next cycle. Again, \ac{DSP}-specific optimizations, like the dual \ac{MAC} architecture, are used, resulting in a computing effort of 6 cycles per filter coefficient. Per function call, an overhead of 8 cycles is added, resulting in a total of $6*N+8$ cycles for the $update\_filter\_coefficient$-function, with $N$ again being the filter length.
\begin{figure}[H]
\centering
\begin{lstlisting}[language=C]
for (int i=0; i< n_coeff; i+=2) chess_loop_range(1,){
lldecompose(*((long long *)p_w0), w0, w1);
acc_w0 = to_accum(w0);
acc_w1 = to_accum(w1);
\noindent The resulting computing effort is 1 cycle per item in the sample line buffer (which equals the filter length) plus 12 cycles of general function overhead, giving a total of $\text{N+12}$ cycles for the $apply\_fir\_filter()$-function, with $\text{N}$ being the filter length.
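The structure of the filter loop, with two independent accumulators and a cyclic walk backwards through the sample line, can be reproduced in portable C. The following is a minimal sketch under stated assumptions: it uses plain integer products without fractional scaling, 64-bit accumulators instead of the 72-bit hardware accumulators, and array indices instead of pointers. The name $fir\_dual\_acc()$ and its parameters are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Portable sketch of a FIR sum with two accumulators, mirroring the paired
   multiply-accumulate structure. Plain integer arithmetic is an assumption. */
static int64_t fir_dual_acc(const int32_t *x, const int32_t *w,
                            int n_coeff, int newest, int line_len) {
    assert(n_coeff > 0 && n_coeff % 2 == 0 && n_coeff <= line_len);
    assert(newest >= 0 && newest < line_len);
    int64_t acc0 = 0;
    int64_t acc1 = 0;
    int idx = newest;                               /* most recent sample */
    for (int i = 0; i < n_coeff; i += 2) {
        acc0 += (int64_t)x[idx] * w[i];             /* first MAC of the pair */
        idx = (idx == 0) ? line_len - 1 : idx - 1;  /* cyclic step back */
        acc1 += (int64_t)x[idx] * w[i + 1];         /* second MAC of the pair */
        idx = (idx == 0) ? line_len - 1 : idx - 1;  /* cyclic step back */
    }
    return acc0 + acc1;                             /* combine both partial sums */
}
```

Keeping the two partial sums separate removes the data dependency between consecutive multiply-accumulate operations, which is what lets a dual \ac{MAC} unit process both in parallel.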
\paragraph{update\_output()} The $update\_output()$-function is responsible for calculating the output sample based on the error signal and the accumulated filter output. The calculation is a simple subtraction and only takes 1 cycle to execute, independent of the filter length or other parameters.
\paragraph{update\_filter\_coefficient()} The $update\_filter\_coefficient()$-function represents the second computationally expensive part of the $calculate\_output()$-function. The output calculated by the previous function is now multiplied with the step size and the corresponding sample from the reference noise signal, which is stored in the sample line buffer. The result is then added to the current filter coefficient to update it for the next cycle. Again, \ac{DSP}-specific optimizations, like the dual \ac{MAC} architecture, are used, resulting in a computing effort of 6 cycles per filter coefficient. Per function call, an overhead of 8 cycles is added, resulting in a total of $\text{6*N+8}$ cycles for the $update\_filter\_coefficient()$-function, with $\text{N}$ again being the filter length.
\begin{listing}[H]
\centering
\begin{lstlisting}[style=cstyle]
for (int i=0; i< n_coeff; i+=2) chess_loop_range(1,){
lldecompose(*((long long *)p_w0), w0, w1);
acc_w0 = to_accum(w0);
acc_w1 = to_accum(w1);
acc_w0 += fract_mult(correction, *p_x0);
acc_w1 += fract_mult(correction, *p_x1);
acc_w0 += fract_mult(correction, *p_x0);
acc_w1 += fract_mult(correction, *p_x1);
p_x0 = cyclic_add(p_x0, -2, p_xstart, sample_line_len);
p_x1 = cyclic_add(p_x1, -2, p_xstart, sample_line_len);
p_x0 = cyclic_add(p_x0, -2, p_xstart, sample_line_len);
p_x1 = cyclic_add(p_x1, -2, p_xstart, sample_line_len);
*((long long *)p_w0) = llcompose(rnd_saturate(acc_w0), rnd_saturate(acc_w1));
p_w0+=2;
*((long long *)p_w0) = llcompose(rnd_saturate(acc_w0), rnd_saturate(acc_w1));
p_w0+=2;
}
\end{lstlisting}
\caption{Code snippet of the $update\_filter\_coefficient$-function, again making use of the dual \ac{MAC} architecture of the \ac{DSP} and the fractional multiplication function. Additionally, pairs of 32-bit values are loaded and stored as single 64-bit values using two further intrinsic functions, allowing two filter coefficients to be updated in a single cycle.}
\label{fig:fig_dsp_code_update_filter_coefficients}
\end{figure}
\caption{Code snippet of the $update\_filter\_coefficient()$-function, again making use of the dual \ac{MAC} architecture of the \ac{DSP} and the fractional multiplication function. Additionally, pairs of 32-bit values are loaded and stored as single 64-bit values using two further intrinsic functions, allowing two filter coefficients to be updated in a single cycle.}
\label{lst:lst_dsp_code_update_filter_coefficients}
\end{listing}
\begin{figure}[H]
\centering
\includegraphics[width=1.0\linewidth]{Bilder/fig_dsp_coefficient_cycle.jpg}
\caption{Visualization of the coefficient calculation in the $update\_filter\_coefficient$-function during the 2nd cycle of a calculation loop. The output is multiplied with the step size and the corresponding sample from the sample line, before being added to the current filter coefficient.}
\caption{Visualization of the coefficient calculation in the $update\_filter\_coefficient()$-function during the 2nd cycle of a calculation loop. The output is multiplied with the step size and the corresponding sample from the sample line, before being added to the current filter coefficient.}
\label{fig:fig_dsp_coefficient_cycle.jpg}
\end{figure}
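The coefficient update described above follows the usual least-mean-squares form, in which each coefficient is corrected by the product of a common factor and the matching sample. The following is a minimal sketch in portable C, assuming that $correction$ already holds the product of step size and output sample, and using plain integer arithmetic without fractional scaling, rounding or saturation. The name $lms\_update()$ and its parameters are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Portable sketch of the coefficient update: w[i] += correction * x[k],
   where k walks backwards cyclically from the newest sample.
   Assumes small values; no saturation or fractional scaling is applied. */
static void lms_update(int32_t *w, const int32_t *x, int n_coeff,
                       int newest, int line_len, int32_t correction) {
    assert(n_coeff > 0 && n_coeff <= line_len);
    assert(newest >= 0 && newest < line_len);
    int idx = newest;                               /* most recent sample */
    for (int i = 0; i < n_coeff; i++) {
        w[i] += correction * x[idx];                /* correct one coefficient */
        idx = (idx == 0) ? line_len - 1 : idx - 1;  /* cyclic step back */
    }
}
```

In contrast to the filter sum, every iteration here both reads and writes a coefficient, which is consistent with the higher per-coefficient cost reported for this function.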
\paragraph{write\_output} The $write\_output$-function is responsible for writing the calculated output sample back into the shared memory section. The operation takes 5 cycles to execute, independent of the filter length or other parameters.
\noindent The total computing effort of the $calculate\_output()$-function in dependency of the filter length $N$ can now be calculated by summing up the computing efforts of the different sub-functions:
\paragraph{write\_output()} The $write\_output()$-function is responsible for writing the calculated output sample back into the shared memory section. The operation takes 5 cycles to execute, independent of the filter length or other parameters.
\noindent The total computing effort of the $calculate\_output()$-function in dependency of the filter length $\text{N}$ can now be calculated by summing up the computing efforts of the different sub-functions:
\begin{equation}
\label{equation_computing}
\begin{aligned}
C_{total} = C_{write\_buffer} + C_{apply\_fir\_filter} + C_{update\_output} + \\
C_{update\_filter\_coefficient} + C_{write\_output}
\text{C}_{\text{total}} = \text{C}_{\text{write\_buffer}} + \text{C}_{\text{apply\_fir\_filter}} + \text{C}_{\text{update\_output}} + \\
\text{C}_{\text{update\_filter\_coefficient}} + \text{C}_{\text{write\_output}}
\end{aligned}
\end{equation}
The sub-functions can be expressed separately in dependency of the filter length $N$ and of the update rate of the filter coefficients, which is represented by the parameter $1/U$ (e.g., if the coefficients are updated every 2 cycles, $1/U$ takes a value of 0.5):
The sub-functions can be expressed separately in dependency of the filter length $\text{N}$ and of the update rate of the filter coefficients, which is represented by the parameter $\text{1/U}$ (e.g., if the coefficients are updated every 2 cycles, $\text{1/U}$ takes a value of 0.5):
\begin{gather}
\label{equation_c_1}
C_{write\_buffer} = 16 \\
\text{C}_{\text{write\_buffer()}} = 16 \\
\label{equation_c_2}
C_{apply\_fir\_filter} = N + 12 \\
\text{C}_{\text{apply\_fir\_filter()}} = \text{N} + 12 \\
\label{equation_c_3}
C_{update\_output} = 1 \\
\text{C}_{\text{update\_output()}} = 1 \\
\label{equation_c_4}
C_{update\_filter\_coefficient} = \frac{1}{U}(6*N + 8)\\
\text{C}_{\text{update\_filter\_coefficient()}} = \frac{1}{\text{U}}(6*\text{N} + 8)\\
\label{equation_c_5}
C_{write\_output} = 5
\text{C}_{\text{write\_output()}} = 5
\end{gather}
\noindent By inserting the sub-function costs into the total computing effort formula, Equation \ref{equation_computing} can now be expressed as:
\begin{equation}
\label{equation_computing_final}
C_{total} = N + \frac{6*N+8}{U} + 34
\text{C}_{\text{total}} = \text{N} + \frac{6*\text{N}+8}{\text{U}} + 34
\end{equation}
Equation \ref{equation_computing_final} now provides an estimation of the necessary computing effort for one output sample in relation to the filter length $N$ and the update rate of the filter coefficients $1/U$. This formula can now be used to estimate the needed computing power (and therefore the power consumption) of the \ac{DSP} core for different parameter settings, allowing an optimal parameter configuration to be found with regard to the quality of the noise reduction and the power consumption of the system.
Equation \ref{equation_computing_final} now provides an estimation of the necessary computing effort for one output sample in relation to the filter length $\text{N}$ and the update rate of the filter coefficients $\text{1/U}$. This formula can now be used to estimate the needed computing power (and therefore the power consumption) of the \ac{DSP} core for different parameter settings, allowing an optimal parameter configuration to be found with regard to the quality of the noise reduction and the power consumption of the system.
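For quick parameter studies, Equation \ref{equation_computing_final} can be evaluated with a small helper. The following is a minimal sketch; the function name $total\_cycles()$ is hypothetical, and $u$ is taken to be the number of samples between two coefficient updates, so that $u = 2$ corresponds to $1/U = 0.5$.

```c
#include <assert.h>

/* Cycle estimate C_total = N + (6N + 8)/U + 34 per output sample.
   n: filter length, u: samples between two coefficient updates. */
static double total_cycles(int n, int u) {
    assert(n > 0 && u > 0);
    return (double)n + (6.0 * n + 8.0) / (double)u + 34.0;
}
```

For a filter length of 32, updating the coefficients on every sample gives $32 + 200 + 34 = 266$ cycles, while updating them on every second sample gives $32 + 100 + 34 = 166$ cycles.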
\begin{figure}[H]
\centering
\includegraphics[width=1.0\linewidth]{Bilder/fig_c_total.png}
\caption{Dependence of the total computing effort on the filter length $N$ and update rate $1/U$.}
\caption{Dependence of the total computing effort on the filter length $\text{N}$ and update rate $\text{1/U}$.}
\label{fig:fig_c_total.png}
\end{figure}