Korr 1
+10
-11
@@ -1,18 +1,17 @@
\section{Hardware implementation and performance quantification of the ANR algorithm on a low-power system}
This section can be considered as the heart of this thesis. The first subchapter describes the hardware, on which the \ac{ANR} algorithm is implemented, including its environment, which serves as a link to the \ac{CI} system itself. The following subchapter continues with the basic implementation of the \ac{ANR} algorithm on the hardware itself and shall provide the reader with a basic understanding of its challenges, possibilities and limitations. This basic implementation is then low-level simulated with some of the previous use cases to get some idea of the general performance.\\
The last subchapter picks the final optimizations of the \ac{ANR} algorithm itself as a central theme, especially with respect to the capabilities of a hybrid \ac{ANR} approach.
Now, with a functioning high-level implementation in place, the focus shifts to the hardware implementation of the \ac{ANR} algorithm on a low-power system. The first subchapter describes the hardware on which the \ac{ANR} algorithm is implemented, including its environment, which serves as a link to the \ac{CI} system itself. The following subchapter continues with the basic implementation of the \ac{ANR} algorithm on the hardware and is intended to give the reader a basic understanding of its challenges, possibilities and limitations. This implementation is then tested on a simulator and compared to the high-level implementation.\\
\subsection{Low-power system architecture and integration}
This thesis considers a low-power \ac{SOC} architecture that integrates a general-purpose \ac{ARM} core with a dedicated \ac{DSP} core. The system combines the flexibility of an \ac{ARM}-based control processor with the computational efficiency of a specialized \ac{DSP}, splitting general computing tasks from real-time signal processing workloads.
\subsubsection{ARM and DSP hardware architecture overview}
A 32-bit \ac{ARM} core serves as the primary control unit of the system. It is responsible for high-level application logic, system configuration, peripheral management and scheduling, and serves as a general-purpose processing unit. Due to its universal instruction set and extensive input/output interface, the \ac{ARM} core is well suited for handling general tasks and the interaction with the \ac{CI} system. Time-critical numerical processing is intentionally offloaded to the \ac{DSP} core in order to reduce computational load and power consumption on the control processor.\\ \\
The \ac{DSP} used for the implementation features a 32-bit dual Harvard, dual \ac{MAC} architecture primarily designed for audio signal-processing applications in low-power embedded systems. It doesn't feature a designated boot ROM, as it is initialized and managed by the \ac{ARM} core. The firmware executing the \ac{ANR} algorithm is developed and programmed in the C programming language. The used proprietary compiler is highly efficient and generates optimized assembler code, which is then translated in machine code to execute the \ac{ANR} algorithm on incoming samples.\\ \\
The \ac{DSP} used for the implementation features a 32-bit dual Harvard, dual \ac{MAC} architecture primarily designed for audio signal-processing applications in low-power embedded systems. It does not feature a dedicated boot ROM, as it is initialized and managed by the \ac{ARM} core. The firmware executing the \ac{ANR} algorithm is written in the C programming language. The proprietary compiler offers highly efficient functions and generates optimized assembly code, which is then translated into machine code to execute the \ac{ANR} algorithm on incoming samples.\\ \\
All memory instances and registers of the \ac{SOC} are directly addressable by the \ac{ARM} through the standard buses, which also enables simplified control of the \ac{DSP} through a shared memory section. The memory mainly consists of the following two parts:
\begin{itemize}
\item \textbf{Program Memory:} This memory section stores the executable code for both the \ac{ARM} core and the \ac{DSP} core. It contains the compiled instructions that define the behavior of the system, including the \ac{ANR} algorithm implemented on the \ac{DSP}.
\item \textbf{Data Memory:} This memory section is used for storing runtime data and variables required during the execution of the program. This also includes the memory section for input/output audio samples and intermediate processing results. The shared memory section between the \ac{ARM} core and the \ac{DSP} core is also part of the data memory, featuring a total size of 64 KB.
\end{itemize}
The data memory is supported by an integrated \ac{DMA} controller, which allows efficient data transfers between peripherals and memory without burdening the processing cores. This is particularly needed for transferring audio samples from the \ac{PCM} interface to the shared memory section for further processing by the \ac{DSP}, as well as transferring processed samples back to the \ac{PCM} interface for playback. The shared memory section is crucial for enabling efficient communication and data exchange between the two processing units, further described in the following subchapter.\\ \\
When the \ac{DSP} is not required to process audio data, it can be paused by pausing the clock provided to the \ac{DSP} core. When paused, the \ac{DSP} core enters a low-power state, still allowing the \ac{ARM} core to access its shared memory and wake up the \ac{DSP} core when needed. This mechanism helps to reduce overall power consumption, which is crucial for battery-operated devices like cochlear implants.\\ \\
When the \ac{DSP} is not required to process audio data, it can be put to sleep by halting the clock provided to the \ac{DSP} core. When halted, the \ac{DSP} core enters a low-power state, still allowing the \ac{ARM} core to access its shared memory and wake up the \ac{DSP} core when needed. This mechanism helps to reduce overall power consumption, which is crucial for battery-operated devices like cochlear implants.\\ \\
The processing unit of the \ac{DSP} is based on a load/store architecture, meaning that all operands must first be moved from memory into the registers before any operation can be performed. The execution units (\ac{ALU} and multiplier) then perform their operations on the data and write the results back into the registers. Finally, the results must be explicitly moved back to memory.\\ \\
Processor-wise, the \ac{DSP} includes a three-stage pipeline consisting of fetch, decode, and execute stages, allowing for overlapping instruction execution and improved throughput. The architecture is optimized for high cycle efficiency when executing computationally intensive signal-processing workloads. The featured dual Harvard, dual \ac{MAC} architecture (two separate \ac{ALU}s) enables the execution of two \ac{MAC} operations, two memory operations (load/store) and two pointer updates in a single processor cycle.
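To illustrate how C code can be arranged so that two \ac{MAC} operations per cycle become possible, the following minimal sketch splits a multiply-accumulate loop into two independent accumulation chains. It is written in portable C and is not taken from the firmware: the function name $dot\_dual\_mac()$ is hypothetical, a 64-bit integer stands in for the wider hardware accumulator, and whether a given compiler actually maps the two chains onto both \ac{MAC} units is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of the thesis firmware): multiply-accumulate
 * over two 32-bit vectors of length n.
 * Two independent accumulators expose two parallel MAC chains, mirroring the
 * dual-MAC structure described above. int64_t is only a stand-in for the
 * hardware accumulator; the caller must ensure the sum fits into 64 bits. */
static int64_t dot_dual_mac(const int32_t *x, const int32_t *h, size_t n)
{
    int64_t acc0 = 0; /* chain for even indices */
    int64_t acc1 = 0; /* chain for odd indices  */
    size_t i;

    for (i = 0; i + 1 < n; i += 2) {
        acc0 += (int64_t)x[i]     * h[i];
        acc1 += (int64_t)x[i + 1] * h[i + 1];
    }
    if (n & 1u) {
        /* leftover element when n is odd */
        acc0 += (int64_t)x[n - 1] * h[n - 1];
    }
    return acc0 + acc1;
}
```

Each loop iteration contains two loads per operand vector, two multiplications and two additions that do not depend on each other, which is the instruction mix the dual-\ac{MAC} datapath is designed for.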
\subsubsection{Intercore communication mechanisms}
@@ -33,7 +32,7 @@ In order to ensure a smooth, but power-efficient, operation together with the \a
\subsection{Software architecture and execution flow}
\subsubsection{ARM–DSP communication and data exchange details}
In contrary, to the high-level simulation environment written in Python from the previous chapter, the implementation of the \ac{ANR} algorithm on the \ac{DSP} requires a low-level programming approach, as which takes into account the specific architecture and capabilities of the processor and its environment. This includes considerations such as memory management, data types, and optimization techniques specific to the \ac{DSP} architecture. The implementation is required to be done in the C programming language, which is a standard for embedded systems, as it allows low-level hardware implementation.\\ \\
In contrast to the high-level simulation environment written in Python from the previous chapter, the implementation of the \ac{ANR} algorithm on the \ac{DSP} requires a low-level programming approach that takes into account the specific architecture and capabilities of the processor and its environment. This includes considerations such as memory management, data types, and optimization techniques specific to the \ac{DSP} architecture. The implementation must be written in the C programming language, which is a standard for embedded systems.\\ \\
The implementation of the \ac{ANR} algorithm on the \ac{DSP} follows the same overall structure as the high-level variant, but the focus now lies on memory management, interrupt handling and communication between the two cores. The \ac{ARM} operates in a continuous loop, structured into several states:
\begin{itemize}
\item \textbf{Idle:} The \ac{ARM} core waits for an interrupt from the \ac{DMA} controller, indicating that new audio samples are available in the input buffer.
@@ -89,7 +88,7 @@ static volatile int action_required;
\end{figure}
\paragraph{Main loop and interrupt handling}
The main loop of the \ac{DSP} core is quite compact, as it mainly focuses on handling interrupts and delegating the sample processing to the \ac{ANR} function. The main function starts by enabling interrupts with a compiler-specific function and setting up pointers for the output buffer and the sample variable. After setting the action flag to zero, it enters an infinite loop, signaling its halted state to the \ac{ARM} core by setting the interrupt register to 1 and halting the core.\\ \\
If the \ac{ARM} core requests a sample to be processed, it activates the \ac{DSP} core and triggers an interrupt, which sets the action flag to 1. The main loop then checks the action flag, and sets the interrupt register back to 0, indicating the \ac{ARM} core it is now processing the sample. After resetting the action flag, the output pointer is updated to point to the next position in the output buffer using a cyclic addition function. Before triggering the calculate\_output()-function, the calculated sample from the previous cycle is moved from its temporary memory location to the current position in the output buffer. Afterwards, the calculate\_output()-function is triggered for the current cycle and the loop restarts. The flow diagram in Figure \ref{fig:fig_dsp_logic.jpg} visualizes the described behavior of the main loop and interrupt handling on the \ac{DSP} core.
If the \ac{ARM} core requests a sample to be processed, it activates the \ac{DSP} core and triggers an interrupt, which sets the action flag to 1. The main loop then checks the action flag and sets the interrupt register back to 0, indicating to the \ac{ARM} core that it is now processing the sample. After resetting the action flag, the output pointer is updated to point to the next position in the output buffer using a cyclic addition function. Before calling the $calculate\_output()$-function, the calculated sample from the previous cycle is moved from its temporary memory location to the current position in the output buffer. Afterwards, the $calculate\_output()$-function is called for the current cycle and the loop restarts. The flow diagram in Figure \ref{fig:fig_dsp_logic.jpg} visualizes the described behavior of the main loop and interrupt handling on the \ac{DSP} core.
\begin{listing}[H]
\centering
\begin{lstlisting}[style=cstyle]
@@ -132,14 +131,14 @@ int main(void) {
\label{fig:fig_dsp_logic.jpg}
\end{figure}
\paragraph{calculate\_output()-function}
The calculate\_output()-function at the very end of the main process loop represents the heart of the \ac{DSP} code, as it is responsible for applying the \ac{ANR} algorithm on the two input samples. As it follows the same structure as the high-level implementation described in the previous chapter, the general functionality will not be described in detail again. Yet, the technical implementation on the \ac{DSP} however will be outlined in detail in the following subchapter, as the hardware-specific optimizations are key elements for the estimation of the expectable power consumption of the system.\\ \\
The $calculate\_output()$-function at the very end of the main process loop represents the heart of the \ac{DSP} code, as it is responsible for applying the \ac{ANR} algorithm to the two input samples. The technical implementation on the \ac{DSP} is outlined in detail in the following subchapter, as the hardware-specific optimizations are key elements for estimating the expected power consumption of the system.\\ \\
\subsection{DSP-level implementation of the ANR algorithm}
The ability to process audio samples in real-time on the \ac{DSP} core strongly depends on compiler-specific optimizations and hardware-specific implementation techniques, which allow a far more efficient execution of the algorithm compared to a plain C implementation.
\subsubsection{DSP-specific optimizations for real-time processing}
In the following, some examples of optimization possibilities are outlined, before the entire \ac{ANR} implementation on the \ac{DSP} is analyzed with regard to its performance.
\paragraph{Logic operations}
Logic operstions, such as finding the maximum or minimum of two values, are quite common in signal processing algorithms. However, their implementation in C usually involves conditional statements (if-else), which can be inefficient on certain architectures due to pipeline stalls.\\ \\
Logic operations, such as finding the maximum or minimum of two values, are quite common in signal processing algorithms. However, their implementation in C usually involves conditional statements (if-else), which can be inefficient on certain architectures due to pipeline stalls.\\ \\
The simple function shown in Listing \ref{lst:lst_dsp_code_find_max} returns the maximum of two given integer values. Processing this manual implementation on the \ac{DSP} takes 12 cycles to execute, while the intrinsic function of the \ac{DSP} compiler allows a 4-cycle execution.
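Independent of the compiler intrinsic, the general idea of avoiding the conditional branch can be sketched in portable C. The function $max\_branchless()$ below is a hypothetical illustration and not the intrinsic used in this thesis; it selects the result through a bit mask instead of an if-else statement. Whether this form is actually faster than the branching variant depends on the compiler and target, so the cycle counts stated above do not apply to it.

```c
#include <stdint.h>

/* Hypothetical illustration (not the DSP compiler intrinsic):
 * returns the larger of a and b without an if-else statement.
 * The comparison yields 0 or 1; negating it gives a mask of all zeros
 * or all ones, which selects between a and b via XOR. */
static int32_t max_branchless(int32_t a, int32_t b)
{
    int32_t mask = -(int32_t)(a < b); /* all ones if a < b, otherwise 0 */
    return a ^ ((a ^ b) & mask);      /* b if mask is all ones, otherwise a */
}
```

A call such as $max\_branchless(3, 7)$ evaluates the comparison to 1, so the mask is all ones and the expression reduces to $b$.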
\begin{listing}[H]
\centering
@@ -186,9 +185,9 @@ The $calculate\_output()$ functions consists out of the following five main part
\item $update\_filter\_coefficients()$: Update of the \ac{FIR} filter coefficients based on the error signal
\item $write\_output()$: Writing the output sample back to the output port in the shared memory section
\end{itemize}
These sub-functions feature \ac{DSP}-spefic optimizations and are partly depenend on the setable parameters like the filter length in regard of their computational cost. The following paragraphs will analyze the computational efficiency of these sub-functions in detail.
\paragraph{write\_buffer()}The $write\_buffer()$-function is responsible for managing the input line, where the samples of the reference noise signal are stored for further processing. The buffer management mainly consits out of a cyclic pointer increase operation and a pointer dereference operation to write the new sample into the buffer. The cyclic pointer increase operation is implemented using the already mentioned intrinsic function of the \ac{DSP} compiler, while the pointer dereference operation takes 15 cycles to execute. This results in a total duration of 16 cycles for the $write\_buffer()$-function to process, indipentent of the filter length or other parameters.
\paragraph{apply\_fir\_filter()} The $apply\_fir\_filter()$-function is responsible for applying the coefficients of the \ac{FIR} filter on the reference noise signal samples stored in the input line. The needed cycles for this function are mainly depenendent on the lenght of the filter, as the number of multiplications and additions increase with the filter length. To increase the performance, the dual \ac{MAC} architecture of the \ac{DSP} is utilized, allowing two multiplications and two additions to be performed in a single cycle. Another \ac{DSP}-specific optimization is the use of the already introduced 72-bit accumulators and the fractional multiplication function, which allows performing multiplications on two 32-bit integers without losing precision or the need for manual bit-shifting operations.
These sub-functions feature \ac{DSP}-specific optimizations, and their computational cost partly depends on configurable parameters such as the filter length. The following paragraphs analyze the computational efficiency of these sub-functions in detail.
\paragraph{write\_buffer()} The $write\_buffer()$-function is responsible for managing the sample line, where the samples of the reference noise signal are stored for further processing. The buffer management mainly consists of a cyclic pointer increase operation and a pointer dereference operation to write the new sample into the buffer. The cyclic pointer increase operation is implemented using the already mentioned intrinsic function of the \ac{DSP} compiler, while the pointer dereference operation takes 15 cycles to execute. This results in a total duration of 16 cycles for the $write\_buffer()$-function, independent of the filter length or other parameters.
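The two steps described above, a cyclic increase followed by a write through the updated position, can be sketched in portable C as follows. This is a minimal sketch and not the firmware code: the names $sample\_line\_t$, $cyclic\_add()$ and $write\_buffer\_sketch()$ as well as the line length of 8 are assumptions chosen for illustration, and an index is used in place of the hardware pointer that the compiler intrinsic operates on.

```c
#include <stdint.h>

#define LINE_LEN 8u /* hypothetical sample line length */

/* Hypothetical sample line: storage plus the most recently written position */
typedef struct {
    int32_t  data[LINE_LEN];
    unsigned head;
} sample_line_t;

/* Portable stand-in for the cyclic addition intrinsic:
 * advances idx by step and wraps around at len (valid for step <= len). */
static unsigned cyclic_add(unsigned idx, unsigned step, unsigned len)
{
    idx += step;
    if (idx >= len) {
        idx -= len;
    }
    return idx;
}

/* Advance the write position cyclically, then store the new sample there. */
static void write_buffer_sketch(sample_line_t *line, int32_t sample)
{
    line->head = cyclic_add(line->head, 1u, LINE_LEN);
    line->data[line->head] = sample;
}
```

Once the line is full, each new sample overwrites the oldest entry, so the buffer always holds the most recent samples without any data being moved.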
\paragraph{apply\_fir\_filter()} The $apply\_fir\_filter()$-function is responsible for applying the coefficients of the \ac{FIR} filter to the reference noise signal samples stored in the sample line. The number of cycles required for this function mainly depends on the length of the filter, as the number of multiplications and additions increases with the filter length. To increase the performance, the dual \ac{MAC} architecture of the \ac{DSP} is utilized, allowing two multiplications and two additions to be performed in a single cycle. Another \ac{DSP}-specific optimization is the use of the already introduced 72-bit accumulators and the fractional multiplication function, which allows multiplying two 32-bit integers without losing precision or the need for manual bit-shifting operations.
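To make clear what the fractional multiplication function saves, the following sketch shows the manual equivalent in portable C under the assumption that samples and coefficients are stored in Q31 format (32-bit signed values representing the range from $-1$ to just below $1$). The names $sat\_q31()$, $q31\_mul()$ and $fir\_q31()$ are hypothetical, a 64-bit integer replaces the 72-bit accumulator, and each product is shifted before accumulation, which discards low-order bits that the hardware accumulator would keep.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp a 64-bit intermediate value to the 32-bit Q31 range. */
static int32_t sat_q31(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* Manual Q31 fractional multiply: widen to 64 bit, multiply, shift back.
 * Assumes the compiler implements >> on negative values as an arithmetic
 * shift, which is common but implementation-defined in C. */
static int32_t q31_mul(int32_t a, int32_t b)
{
    int64_t product = (int64_t)a * b; /* Q62 result */
    return sat_q31(product >> 31);    /* back to Q31 */
}

/* Hypothetical FIR sketch: sum of x[i] * h[i] in Q31 with a 64-bit
 * accumulator and saturation at the end. */
static int32_t fir_q31(const int32_t *x, const int32_t *h, size_t n)
{
    int64_t acc = 0;
    size_t i;

    for (i = 0; i < n; ++i) {
        acc += ((int64_t)x[i] * h[i]) >> 31;
    }
    return sat_q31(acc);
}
```

In this representation the value $0.5$ corresponds to $2^{30}$, so $q31\_mul()$ applied to two such operands yields $2^{29}$, which represents $0.25$.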
\begin{listing}[H]
\centering
\begin{lstlisting}[style=cstyle]