Korr 4
@@ -3,8 +3,8 @@ Now, with a functioning high-level implementation in place, the focus shifts to
|
||||
\subsection{Low-power system architecture and integration}
|
||||
This thesis considers a low-power \ac{SOC} architecture that integrates a general-purpose \ac{ARM} core with a dedicated \ac{DSP} core. The system combines the flexibility of an \ac{ARM}-based control processor with the computational efficiency of a specialized \ac{DSP}, splitting general computing tasks from real-time signal processing workloads.
|
||||
\subsubsection{ARM and DSP hardware architecture overview}
|
||||
A 32-bit \ac{ARM} core serves as the primary control unit of the system. It is responsible for high-level application logic, system configuration, peripheral management, and scheduling, and it also acts as the general-purpose processing unit. Due to its extensive input/output interface, the \ac{ARM} core is well suited for handling general tasks and the interaction with the \ac{CI} system. Time-critical numerical processing is offloaded to the \ac{DSP} core in order to reduce computational load and power consumption on the main processor.\\ \\
|
||||
The \ac{DSP} used for the implementation features a 32-bit dual Harvard, dual \ac{MAC} architecture primarily designed for audio signal processing applications in low-power embedded systems. It does not feature a dedicated boot ROM, as it is initialized and managed by the \ac{ARM} core. The firmware executing the \ac{ANR} algorithm is written in C. The proprietary compiler offers highly efficient functions and generates optimized assembly code, which is then translated into machine code to execute the \ac{ANR} algorithm on incoming samples.\\ \\
|
||||
All memory instances and registers of the \ac{SOC} are directly addressable by the \ac{ARM} core through the standard buses, which also simplifies control of the \ac{DSP} through a shared memory section. The memory mainly consists of the following two parts:
|
||||
\begin{itemize}
|
||||
\item \textbf{Program Memory:} This memory section stores the executable code for both the \ac{ARM} core and the \ac{DSP} core. It contains the compiled instructions that define the behavior of the system, including the \ac{ANR} algorithm implemented on the \ac{DSP}.
|
||||
@@ -13,7 +13,7 @@ All memory instances and registers of the \ac{SOC} are directly addressable by t
|
||||
The data memory is supported by an integrated \ac{DMA} controller, which allows efficient data transfers between peripherals and memory without burdening the processing cores. This is particularly needed for transferring audio samples from the \ac{PCM} interface to the shared memory section for further processing by the \ac{DSP}, as well as transferring processed samples back to the \ac{PCM} interface for playback. The shared memory section is crucial for enabling efficient communication and data exchange between the two processing units, further described in the following subchapter.\\ \\
|
||||
When the \ac{DSP} is not required to process audio data, it can be put to sleep by halting the clock provided to the \ac{DSP} core. When halted, the \ac{DSP} core enters a low-power state, still allowing the \ac{ARM} core to access its shared memory and wake up the \ac{DSP} core when needed. This mechanism helps to reduce overall power consumption, which is crucial for battery-operated devices like cochlear implants.\\ \\
|
||||
The processing unit of the \ac{DSP} uses a load/store architecture, meaning that all operands must first be moved from memory into registers before any operation can be performed. The execution units (\ac{ALU} and multiplier) then operate on the data and write the results back into the registers. Finally, the results need to be explicitly moved back to memory.\\ \\
|
||||
Inside the \ac{DSP}, a three-stage pipeline consisting of fetch, decode, and execute stages allows overlapping instruction execution and improved throughput. The architecture is optimized for high cycle efficiency when executing computationally intensive signal processing workloads. The dual Harvard, dual \ac{MAC} architecture (two separate \ac{ALU}s) enables the execution of two \ac{MAC} operations, two memory operations (load/store), and two pointer updates in a single processor cycle.
|
||||
\subsubsection{Intercore communication mechanisms}
|
||||
To ensure smooth yet power-efficient operation together with the \ac{CI} system, interrupt-driven communication between the \ac{ARM} core and the \ac{DSP} core is crucial. The center of communication between the cores is the already mentioned shared memory region, which is accessible by both processing units. This shared memory enables the exchange of data without the need for separate communication protocols or input/output interfaces (see Figure \ref{fig:fig_dsp_setup.jpg}). Synchronization between the cores is achieved using interrupt-based signaling: the \ac{ARM} core initiates processing requests by waking up the \ac{DSP} and triggering an interrupt that sets an action flag, while the \ac{DSP} notifies the \ac{ARM} core upon completion of a task, likewise by changing an interrupt register (for simplicity, this behavior is simply referred to as ``interrupts'' in the remainder of this thesis). This approach ensures efficient coordination while minimizing active waiting (polling) and therefore unnecessary power consumption.
|
||||
\begin{figure}[H]
|
||||
@@ -138,7 +138,7 @@ The ability to process audio samples in real-time on the \ac{DSP} core is strong
|
||||
\subsubsection{DSP-specific optimizations for real-time processing}
|
||||
In the following, some examples of possible optimizations are outlined before the performance of the entire \ac{ANR} implementation on the \ac{DSP} is analyzed.
|
||||
\paragraph{Logic operations}
|
||||
Logic operations, such as finding the maximum or minimum of two values, are quite common in signal processing algorithms. However, their implementation in C usually involves conditional statements, which can be inefficient on certain architectures.\\ \\
|
||||
The simple function shown in Listing \ref{lst:lst_dsp_code_find_max} returns the maximum of two given integer values. Processing this manual implementation on the \ac{DSP} takes 12 cycles to execute, while the intrinsic function of the \ac{DSP} compiler allows a 4-cycle execution.
|
||||
\begin{listing}[H]
|
||||
\centering
|
||||
@@ -186,8 +186,8 @@ The $calculate\_output()$ functions consists out of the following five main part
|
||||
\item $write\_output()$: Writing the output sample back to the output port in the shared memory section
|
||||
\end{itemize}
|
||||
These sub-functions feature \ac{DSP}-specific optimizations, and their computational cost partly depends on configurable parameters such as the filter length. The following paragraphs analyze the computational efficiency of these sub-functions in detail.
|
||||
\paragraph{write\_buffer()} The $write\_buffer()$-function is responsible for managing the Sample Line, where the samples of the reference noise signal are stored for further processing. The buffer management mainly consists of a cyclic pointer increment and a pointer dereference to write the new sample into the buffer. The cyclic pointer increment is implemented using the already mentioned intrinsic function of the \ac{DSP} compiler, while the pointer dereference takes 15 cycles to execute. This results in a total duration of 16 cycles for the $write\_buffer()$-function, independent of the filter length or other parameters.
|
||||
\paragraph{apply\_fir\_filter()} The $apply\_fir\_filter()$-function is responsible for applying the coefficients of the \ac{FIR} filter to the reference noise signal samples stored in the Sample Line. The number of cycles required by this function mainly depends on the length of the filter, as the number of multiplications and additions increases with the filter length. To increase the performance, the dual \ac{MAC} architecture of the \ac{DSP} is utilized, allowing two multiplications and two additions to be performed in a single cycle. Another \ac{DSP}-specific optimization is the use of the already introduced 72-bit accumulators and the fractional multiplication function, which allow two 32-bit integers to be multiplied without losing precision and without the need for manual bit-shifting operations.
|
||||
\begin{listing}[H]
|
||||
\centering
|
||||
\begin{lstlisting}[style=cstyle]
|
||||
@@ -211,12 +211,12 @@ for (int i=0; i < n_coeff; i+=2) chess_loop_range(1,){
|
||||
\begin{figure}[H]
|
||||
\centering
|
||||
\includegraphics[width=1.0\linewidth]{Bilder/fig_dsp_fir_cycle.jpg}
|
||||
\caption{Visualization of the FIR filter calculation in the $apply\_fir\_filter()$-function during the 2nd cycle of a calculation loop. The reference noise signal samples are stored in the Sample Line, while the filter coefficients are stored in a separate memory section called the Filter Line.}
|
||||
\label{fig:fig_dsp_fir_cycle.jpg}
|
||||
\end{figure}
|
||||
\noindent This results in a computing effort of 1 cycle per item in the Sample Line buffer (whose size equals the filter length) plus 12 cycles of general function overhead, giving a total of $\text{N+12}$ cycles for the $apply\_fir\_filter()$-function, with $\text{N}$ being the filter length.
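
\noindent As an illustrative sketch of this calculation, the filter output can be written as a sum that the dual \ac{MAC} architecture processes in pairs of taps. The symbols $y[n]$, $w_i$ and $x[n-i]$ are notation assumed here and are not taken from the listing, and an even filter length $\text{N}$ is assumed:
\begin{equation*}
y[n] = \sum_{i=0}^{\text{N}-1} w_i \, x[n-i] = \sum_{k=0}^{\text{N}/2-1} \left( w_{2k} \, x[n-2k] + w_{2k+1} \, x[n-2k-1] \right)
\end{equation*}
With the stated cost of $\text{N+12}$ cycles, an example filter length of $\text{N} = 56$ gives $56 + 12 = 68$ cycles per call of the $apply\_fir\_filter()$-function.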
|
||||
\paragraph{update\_output()} The $update\_output()$-function is responsible for calculating the output sample based on the error signal and the accumulated filter output. The calculation is a simple subtraction and only takes 1 cycle to execute, independent of the filter length or other parameters.
|
||||
\paragraph{update\_filter\_coefficient()} The $update\_filter\_coefficient()$-function represents the second computationally expensive part of the $calculate\_output()$-function. The calculated output from the previous function is now multiplied by the step size and the corresponding sample of the reference noise signal, which is stored in the Sample Line buffer. The result is then added to the current filter coefficient to update it for the next cycle. Again, \ac{DSP}-specific optimizations, like the dual \ac{MAC} architecture, are used, resulting in a computing effort of 6 cycles per filter coefficient. Per function call, an overhead of 8 cycles is added, resulting in a total of $\text{6*N+8}$ cycles for the $update\_filter\_coefficient()$-function, with $\text{N}$ again being the filter length.
|
||||
\begin{listing}[H]
|
||||
\centering
|
||||
\begin{lstlisting}[style=cstyle]
|
||||
@@ -235,13 +235,13 @@ for (int i=0; i< n_coeff; i+=2) chess_loop_range(1,){
|
||||
p_w0+=2;
|
||||
}
|
||||
\end{lstlisting}
|
||||
\caption{Code snippet of the $update\_filter\_coefficient()$-function, again making use of the dual \ac{MAC} architecture of the \ac{DSP} and the fractional multiplication function. Additionally, 32-bit values are loaded and stored as 64-bit values using two further intrinsic functions, which allows two filter coefficients to be updated in a single cycle.}
|
||||
\label{lst:lst_dsp_code_update_filter_coefficients}
|
||||
\end{listing}
|
||||
\begin{figure}[H]
|
||||
\centering
|
||||
\includegraphics[width=1.0\linewidth]{Bilder/fig_dsp_coefficient_cycle.jpg}
|
||||
\caption{Visualization of the coefficient calculation in the $update\_filter\_coefficient()$-function during the 2nd cycle of a calculation loop. The output is multiplied by the step size and the corresponding sample from the Sample Line, before being added to the current filter coefficient.}
|
||||
\label{fig:fig_dsp_coefficient_cycle.jpg}
|
||||
\end{figure}
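
\noindent The coefficient update described above can be sketched as a single equation. The notation is assumed for this illustration and not taken from the listing: $w_i$ is the $i$-th filter coefficient, $x[n-i]$ the corresponding entry of the Sample Line, $\mu$ the step size and $o[n]$ the calculated output:
\begin{equation*}
w_i \leftarrow w_i + \mu \, o[n] \, x[n-i], \qquad i = 0, \dots, \text{N}-1
\end{equation*}
With the stated cost of $\text{6*N+8}$ cycles, an example filter length of $\text{N} = 56$ gives $6 \cdot 56 + 8 = 344$ cycles per call of the $update\_filter\_coefficient()$-function.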
|
||||
\paragraph{write\_output()} The $write\_output()$-function is responsible for writing the calculated output sample back into the shared memory section. The operation takes 5 cycles to execute, independent of the filter length or other parameters.
|
||||
@@ -253,7 +253,7 @@ for (int i=0; i< n_coeff; i+=2) chess_loop_range(1,){
|
||||
\text{C}_{\text{update\_filter\_coefficient}} + \text{C}_{\text{write\_output}}
|
||||
\end{aligned}
|
||||
\end{equation}
|
||||
The sub-functions can be expressed separately as functions of the filter length $\text{N}$ and of the update rate of the filter coefficients, which is represented by the parameter $\text{1/U}$ (e.g. if the coefficients are updated every two cycles, $\text{1/U}$ has a value of 0.5):
|
||||
\begin{gather}
|
||||
\label{equation_c_1}
|
||||
\text{C}_{\text{write\_buffer()}} = 16 \\
|
||||
@@ -278,5 +278,31 @@ Equation \ref{equation_computing_final} now provides an estimation of the necess
|
||||
\caption{Dependence of the total computing effort on the filter length $\text{N}$ and update rate $\text{1/U}$.}
|
||||
\label{fig:fig_c_total.png}
|
||||
\end{figure}
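
\noindent As a worked example based on the per-function costs stated in the paragraphs above, the total can be evaluated for a filter length of $\text{N} = 56$. This sketch assumes that only the coefficient update is scaled by the update rate $\text{1/U}$ and uses $\text{C}_{\text{total}}$ as an assumed symbol for the sum; Equation \ref{equation_computing_final} remains the authoritative form:
\begin{gather*}
\text{C}_{\text{total}} = 16 + (\text{N}+12) + 1 + \tfrac{1}{\text{U}}\,(6\text{N}+8) + 5 \\
\text{C}_{\text{total}}\big|_{\text{N}=56,\ \text{1/U}=1} = 16 + 68 + 1 + 344 + 5 = 434 \\
\text{C}_{\text{total}}\big|_{\text{N}=56,\ \text{1/U}=0.5} = 16 + 68 + 1 + 172 + 5 = 262
\end{gather*}
Under these assumptions, halving the update rate removes 172 of the 434 cycles, which illustrates that the coefficient update dominates the total effort.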
|
||||
|
||||
\subsection{Verification of the DSP implementation}
|
||||
To verify the general performance of the \ac{DSP}-implemented \ac{ANR} algorithm, the complex use case of the high-level implementation is reused, which again includes a 56-tap \ac{FIR} filter and an update of the filter coefficients every cycle. In contrast to the high-level implementation, the coefficient convergence is no longer part of the evaluation; the metric for the \ac{ANR} performance remains the \ac{SNR} improvement.
|
||||
\begin{figure}[H]
|
||||
\centering
|
||||
\includegraphics[width=1.0\linewidth]{Bilder/fig_plot_1_dsp_complex.png}
|
||||
\caption{Desired signal, corrupted signal, reference noise signal and filter output of the complex \ac{ANR} use case, simulated on the \ac{DSP}}
|
||||
\label{fig:fig_plot_1_dsp_complex.png}
|
||||
\end{figure}
|
||||
\begin{figure}[H]
|
||||
\centering
|
||||
\includegraphics[width=1.0\linewidth]{Bilder/fig_plot_2_dsp_complex.png}
|
||||
\caption{Error signal of the complex \ac{ANR} use case, simulated on the \ac{DSP}}
|
||||
\label{fig:fig_plot_2_dsp_complex.png}
|
||||
\end{figure}
|
||||
\begin{figure}[H]
|
||||
\centering
|
||||
\includegraphics[width=1.0\linewidth]{Bilder/fig_high_low_comparison.png}
|
||||
\caption{Comparison of the high- and low-level simulation output.}
|
||||
\label{fig:fig_high_low_comparison.png}
|
||||
\end{figure}
|
||||
\begin{figure}[H]
|
||||
\centering
|
||||
\includegraphics[width=1.0\linewidth]{Bilder/fig_high_low_comparison_hist.png}
|
||||
\caption{Histogram of the error amplitude between the high- and low-level simulation output.}
|
||||
\label{fig:fig_high_low_comparison_hist.png}
|
||||
\end{figure}
|
||||
\noindent Figures \ref{fig:fig_plot_1_dsp_complex.png} and \ref{fig:fig_plot_2_dsp_complex.png} show the results of the complex \ac{ANR} use case, simulated on the \ac{DSP}. With an \ac{SNR} gain of 10.26 dB, it performs as successfully as the high-level implementation. Figure \ref{fig:fig_high_low_comparison.png} shows both outputs separately and then together in one subfigure, along with the plotted error amplitude. Lastly, Figure \ref{fig:fig_high_low_comparison_hist.png} features a histogram of the error amplitude between the high- and low-level implementations, indicating the correct functionality of the \ac{DSP} implementation. The small deviations can be explained by the fact that the \ac{DSP} implementation is based on fixed-point arithmetic, which leads to slightly different convergence behavior. Nevertheless, the results show that the \ac{DSP} implementation of the \ac{ANR} algorithm achieves the same performance as the high-level implementation. The next step is to evaluate the performance of the \ac{DSP} implementation in terms of computational efficiency under different scenarios and with non-synchronous signals.
|
||||