\@writefile{lof}{\contentsline{figure}{\numberline{33}{\ignorespaces Simplified flowchart of the sample processing between the \ac{ARM} core and the \ac{DSP} core via interrupts and shared memory.}}{43}{}\protected@file@percent }
\@writefile{toc}{\contentsline{subsection}{\numberline{4.2}Software architecture and execution flow}{43}{}\protected@file@percent }
\@writefile{toc}{\contentsline{subsubsection}{\numberline{4.2.1}ARM–DSP communication and data exchange details}{43}{}\protected@file@percent }
\acronymused{ANR}
@@ -117,11 +117,11 @@
\acronymused{PCM}
\acronymused{ARM}
\acronymused{DSP}
\@writefile{lof}{\contentsline{figure}{\numberline{34}{\ignorespaces Detailed visualization of the \ac{DMA} operations between the PCM interface and the shared memory section. When the memory buffer is occupied, an interrupt is triggered, either to the \ac{DSP} core or to the \ac{ARM} core, depending on whether it is triggered during a read or a write operation.}}{45}{}\protected@file@percent }
\@writefile{lof}{\contentsline{figure}{\numberline{35}{\ignorespaces Low-level implementation: Memory initialization and mapping}}{46}{}\protected@file@percent }
\newlabel{fig:fig_compiler.jpg}{{36}{46}{}{}{}}
\@writefile{lof}{\contentsline{figure}{\numberline{36}{\ignorespaces Exemplary memory map of the 4-element input buffer array. As it is initialized as a 16-bit integer array, each element occupies 2 bytes of memory, resulting in a total size of 8 bytes for the entire array. As the DSP architecture works in 32-bit double-words, the bytewise addressing is a result of the compiler abstraction.}}{46}{}\protected@file@percent }
\@writefile{toc}{\contentsline{paragraph}{Main loop and interrupt handling}{46}{}\protected@file@percent }
\acronymused{DSP}
\acronymused{ANR}
@@ -151,10 +151,10 @@
\acronymused{ARM}
\acronymused{DSP}
\@writefile{lof}{\contentsline{figure}{\numberline{37}{\ignorespaces Low-level implementation: Main loop and interrupt handling}}{47}{}\protected@file@percent }
\@writefile{lof}{\contentsline{figure}{\numberline{38}{\ignorespaces Flow diagram of the code implementation of the main loop and interrupt handling on the \ac{DSP} core.}}{48}{}\protected@file@percent }
\@writefile{lof}{\contentsline{figure}{\numberline{39}{\ignorespaces Manual implementation of a max-function, returning the maximum of two integer values, taking 12 cycles to execute. The intrinsic functions of the DSP compiler allow a 4-cycle implementation of such an operation.}}{49}{}\protected@file@percent }
\@writefile{lof}{\contentsline{figure}{\numberline{40}{\ignorespaces Manual implementation of a cyclic array iteration function in C, taking the core 20 cycles to execute a pointer increment of 1. The intrinsic functions of the DSP compiler allow a single-cycle implementation of such cyclic additions.}}{50}{}\protected@file@percent }
\@writefile{lof}{\contentsline{figure}{\numberline{41}{\ignorespaces Code snippet of the $apply\_fir\_filter$-function, showing the use of the dual \ac{MAC} architecture of the \ac{DSP} and the fractional multiplication function. The loop iterates through the filter coefficients and reference noise signal samples, performing two multiplications and two additions in each cycle.}}{52}{}\protected@file@percent }
\@writefile{lof}{\contentsline{figure}{\numberline{42}{\ignorespaces Visualization of the FIR filter calculation in the $apply\_fir\_filter$-function during the 2nd cycle of a calculation loop. The reference noise signal samples are stored in the sample line, while the filter coefficients are stored in a separate memory section (filter line).}}{52}{}\protected@file@percent }
\@writefile{lof}{\contentsline{figure}{\numberline{43}{\ignorespaces Code snippet of the $update\_filter\_coefficient$-function, again making use of the dual \ac{MAC} architecture of the \ac{DSP} and the fractional multiplication function. Additionally, 32-bit values are loaded and stored as 64-bit values using two further intrinsic functions, which allows two filter coefficients to be updated in a single cycle.}}{53}{}\protected@file@percent }
\@writefile{lof}{\contentsline{figure}{\numberline{44}{\ignorespaces Visualization of the coefficient calculation in the $update\_filter\_coefficient$-function during the 2nd cycle of a calculation loop. The output is multiplied with the step size and the corresponding sample from the sample line, before being added to the current filter coefficient.}}{54}{}\protected@file@percent }
\section{Hardware implementation and performance quantification of the ANR Algorithm on a low-power system}
This section forms the core of this thesis. The first subchapter describes the hardware on which the \ac{ANR} algorithm is implemented, including its environment, which serves as the link to the \ac{CI} system. The following subchapter covers the basic implementation of the \ac{ANR} algorithm on this hardware and is intended to give the reader a basic understanding of its challenges, possibilities, and limitations. This basic implementation is then simulated at a low level with some of the previous use cases to obtain a first impression of the general performance.\\
The last subchapter focuses on the final optimizations of the \ac{ANR} algorithm itself, especially with respect to the capabilities of a hybrid \ac{ANR} approach.
\subsection{Low-power system architecture and integration}
This thesis considers a low-power \ac{SOC} architecture that integrates a general-purpose \ac{ARM} core with a dedicated \ac{DSP} core. The system combines the flexibility of an \ac{ARM}-based control processor with the computational efficiency of a specialized \ac{DSP}, splitting general computing tasks from real-time signal processing workloads.
\subsubsection{ARM and DSP hardware architecture overview}
A 32-bit \ac{ARM} core serves as the primary control unit of the system. It is responsible for high-level application logic, system configuration, peripheral management, and scheduling, and serves as a general-purpose processing unit. Due to its universal instruction set and extensive input/output interface, the \ac{ARM} core is well suited for handling general tasks and the interaction with the \ac{CI} system. Time-critical numerical processing is intentionally offloaded to the \ac{DSP} core in order to reduce computational load and power consumption on the control processor.\\\\
The \ac{DSP} used for the implementation features a 32-bit dual Harvard, dual \ac{MAC} architecture primarily designed for audio signal-processing applications in low-power embedded systems. It does not feature a dedicated boot ROM, as it is initialized and managed by the \ac{ARM} core. The firmware executing the \ac{ANR} algorithm is written in the C programming language. The proprietary compiler generates optimized assembler code, which is then translated into machine code to execute the \ac{ANR} algorithm on incoming samples.\\\\
All memory instances and registers of the \ac{SOC} are directly addressable by the \ac{ARM} through the standard buses, which also enables a simplified control of the \ac{DSP} through a shared memory section. The memory consists mainly of the following two parts:
\begin{itemize}
\item\textbf{Program Memory:} This memory section stores the executable code for both the \ac{ARM} core and the \ac{DSP} core. It contains the compiled instructions that define the behavior of the system, including the \ac{ANR} algorithm implemented on the \ac{DSP}.
\item\textbf{Data Memory:} This memory section is used for storing runtime data and variables required during the execution of the program. This also includes the memory section for input/output audio samples and intermediate processing results. The shared memory section between the \ac{ARM} core and the \ac{DSP} core is also part of the data memory, featuring a total size of 64 KB.
@@ -15,15 +15,15 @@ The data memory is supported by an integrated \ac{DMA} controller, which allows
When the \ac{DSP} is not required to process audio data, it can be paused by stopping the clock provided to the \ac{DSP} core. While paused, the \ac{DSP} core remains in a low-power state, which still allows the \ac{ARM} core to access its shared memory and to wake up the \ac{DSP} core when needed. This mechanism helps to reduce overall power consumption, which is crucial for battery-operated devices like cochlear implants.\\\\
The processing unit of the \ac{DSP} uses a load/store architecture, meaning that all operands must first be moved from memory into the registers before any operation can be performed. Once this is done, the execution units (\ac{ALU} and multiplier) perform their operations on the data and write the results back into the registers. Finally, the results need to be explicitly moved back to memory.\\\\
On the processor side, the \ac{DSP} includes a three-stage pipeline consisting of fetch, decode, and execute stages, allowing for overlapping instruction execution and improved throughput. The architecture is optimized for high cycle efficiency when executing computationally intensive signal-processing workloads. The dual Harvard, dual \ac{MAC} architecture (two separate \ac{ALU}s) enables the execution of two \ac{MAC} operations, two memory operations (load/store), and two pointer updates in a single processor cycle.
\subsubsection{Intercore communication mechanisms}
To ensure smooth but power-efficient operation together with the \ac{CI} system, interrupt-driven communication between the \ac{ARM} core and the \ac{DSP} core is crucial. The center of communication between the cores is the already mentioned shared memory region, which is accessible by both processing units. This shared memory enables the exchange of data without the need for separate communication protocols or input/output interfaces (refer to Figure \ref{fig:fig_dsp_setup.jpg}). Synchronization between the cores is achieved using interrupt-based signaling: the \ac{ARM} core initiates processing requests by waking up the \ac{DSP} and triggering an interrupt which sets an action flag, while the \ac{DSP} notifies the \ac{ARM} core upon completion of a task, likewise by changing an interrupt register (for simplicity, this behavior is simply referred to as ``interrupts'' in the remainder of this thesis). This approach ensures efficient coordination while minimizing active waiting (polling) and therefore unnecessary power consumption.
\caption{Simplified visualization of the interaction between the \ac{CI} system, the \ac{ARM} core and the \ac{DSP} core, making use of the \ac{PCM} interface and shared memory for audio data exchange.}
\label{fig:fig_dsp_setup.jpg}
\end{figure}
\noindent The \ac{ARM} core receives the 16-bit audio data (the corrupted signal and the reference noise signal via two channels) from the CI system via a \ac{PCM} interface, which offers one 32-bit input and one 32-bit output register. When the input register is occupied, an interrupt triggers the integrated \ac{DMA} controller, which transfers the audio data from the \ac{PCM} interface to the input buffer at a predefined memory location (again as two 16-bit samples). Once this transfer is completed, the \ac{DSP} core is requested to start processing the audio data. The \ac{DSP} core then reads the audio samples from the shared memory, processes them using the implemented \ac{ANR} algorithm, and writes the 16-bit processed sample back to an output buffer, also located in the shared memory. Finally, the \ac{ARM} core is notified via an interrupt from the \ac{DSP} core that the processing is complete; the \ac{DMA} controller then transfers the processed audio samples from the output buffer back to the \ac{PCM} interface for playback (refer to Figure \ref{fig:fig_dsp_comm.jpg}).\\\\
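\noindent The splitting of one 32-bit \ac{PCM} word into two 16-bit samples can be illustrated with a minimal sketch in portable C. The names \texttt{sample\_pair\_t} and \texttt{unpack\_pcm\_word} are hypothetical, and the assignment of the corrupted signal to the upper half-word is an assumption made only for this illustration; the actual channel order is defined by the \ac{PCM} configuration.
\begin{lstlisting}[language=C]
#include <stdint.h>

/* Hypothetical container for one sample of each channel. */
typedef struct {
    int16_t corrupted;  /* corrupted signal sample       */
    int16_t reference;  /* reference noise signal sample */
} sample_pair_t;

/* Reinterpret a 16-bit pattern as a signed two's-complement value
   without relying on implementation-defined conversions. */
static int16_t to_int16(uint16_t v)
{
    return (int16_t)((v & 0x8000u) ? (int32_t)v - 65536 : (int32_t)v);
}

/* Assumption: upper half-word = corrupted signal,
   lower half-word = reference noise signal. */
static sample_pair_t unpack_pcm_word(uint32_t word)
{
    sample_pair_t p;
    p.corrupted = to_int16((uint16_t)(word >> 16));
    p.reference = to_int16((uint16_t)(word & 0xFFFFu));
    return p;
}
\end{lstlisting}
In the described system this step is carried out by the \ac{DMA} transfer itself, so the sketch only mirrors the resulting data layout.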
@@ -34,11 +34,11 @@ In order to ensure a smooth, but power-efficient, operation together with the \a
\subsection{Software architecture and execution flow}
\subsubsection{ARM–DSP communication and data exchange details}
In contrast to the high-level simulation environment written in Python from the previous chapter, the implementation of the \ac{ANR} algorithm on the \ac{DSP} requires a low-level programming approach that takes into account the specific architecture and capabilities of the processor and its environment. This includes considerations such as memory management, data types, and optimization techniques specific to the \ac{DSP} architecture. The implementation has to be written in the C programming language, which is a standard for embedded systems, as it allows low-level access to the hardware.\\\\
The implementation of the \ac{ANR} algorithm on the \ac{DSP} follows the same overall structure as the high-level variant, but the focus now lies on memory management, interrupt handling, and communication between the two cores. The \ac{ARM} operates in a continuous loop, structured into several states:
\begin{itemize}
\item\textbf{Idle:} The \ac{ARM} core waits for an interrupt from the \ac{DMA} controller, indicating that new audio samples are available in the input buffer.
\item\textbf{Work:} After receiving the interrupt, the \ac{ARM} core triggers an interrupt on the \ac{DSP} core to start processing the audio samples.
\item\textbf{Wait:} After receiving the first interrupt, which signals that the \ac{DSP} has started processing the sample, the \ac{ARM} core waits for the second interrupt, indicating that the processing is complete.
\item\textbf{Done/Idle:} Once the processing is complete, the \ac{ARM} core triggers the \ac{DMA} controller to transfer the processed audio samples from the output buffer back to the \ac{PCM} interface for playback. The \ac{ARM} core then returns to the idle state, waiting for the next batch of audio samples.
\end{itemize}
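\noindent The state sequence listed above can be summarized as a small transition function. The following is only a hedged sketch for illustration: the type and function names (\texttt{arm\_state\_t}, \texttt{arm\_event\_t}, \texttt{arm\_next\_state}) are hypothetical and do not stem from the actual \ac{ARM} firmware, and the hardware actions are indicated as comments only.
\begin{lstlisting}[language=C]
/* Hypothetical states of the ARM control loop. */
typedef enum { ARM_IDLE, ARM_WORK, ARM_WAIT, ARM_DONE } arm_state_t;

/* Hypothetical events, each corresponding to one interrupt. */
typedef enum {
    EV_NONE,
    EV_DMA_INPUT_READY,  /* DMA: new samples in the input buffer */
    EV_DSP_STARTED,      /* DSP: processing has started          */
    EV_DSP_FINISHED      /* DSP: processing is complete          */
} arm_event_t;

static arm_state_t arm_next_state(arm_state_t state, arm_event_t ev)
{
    switch (state) {
    case ARM_IDLE:
        /* Wait for the DMA interrupt. */
        return (ev == EV_DMA_INPUT_READY) ? ARM_WORK : ARM_IDLE;
    case ARM_WORK:
        /* The DSP interrupt is triggered here; continue once the
           DSP signals that it has started processing. */
        return (ev == EV_DSP_STARTED) ? ARM_WAIT : ARM_WORK;
    case ARM_WAIT:
        /* Wait for the second DSP interrupt. */
        return (ev == EV_DSP_FINISHED) ? ARM_DONE : ARM_WAIT;
    case ARM_DONE:
        /* The DMA transfer of the output buffer is triggered here;
           afterwards the loop returns to idle. */
        return ARM_IDLE;
    }
    return ARM_IDLE;
}
\end{lstlisting}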
In contrast, the \ac{DSP} core operates in an interrupt-driven manner:
@@ -51,15 +51,15 @@ On the contrary, the \ac{DSP} core operates in an interrupt-driven manner:
\caption{Detailed visualization of the \ac{DMA} operations between the PCM interface and the shared memory section. When the memory buffer is occupied, an interrupt is triggered, either to the \ac{DSP} core or to the \ac{ARM} core, depending on whether it is triggered during a read or a write operation.}
\label{fig:fig_dsp_dma.jpg}
\end{figure}
\noindent Figure \ref{fig:fig_dsp_dma.jpg} visualizes the concrete operation of the \ac{DMA} controller during the audio sample processing. The \ac{DMA} controller is configured to transfer the audio samples one sample at a time from the \ac{PCM} interface to the input buffer of the shared memory. When the input buffer holds one sample of each channel, an interrupt is sent to the \ac{DSP} core, notifying it to start processing the available samples. After processing, the results are written into the output buffer in the shared memory. Once the output buffer is occupied, another interrupt is sent to the \ac{DMA} controller, indicating that the processed samples are ready to be transferred back to the \ac{PCM} interface for playback. \\\\
As the \ac{ARM} operation is not the main focus of this thesis and its behavior has already been described sufficiently, further implementation details are omitted in the following, and the focus is placed on the implementation of the \ac{ANR} algorithm on the \ac{DSP} core itself.
\subsubsection{System control flow and main processing loop}
The implementation of the \ac{ANR} algorithm on the \ac{DSP} core is structured into several key sections, each responsible for specific aspects of the algorithm's functionality. The following paragraphs outline the main components:
\paragraph{Memory initialization and mapping}
The memory initialization section starts with the definition of the interrupt register (0xC00004) and the corresponding bit masks used to control the interrupt behavior of the \ac{DSP} core. Afterwards, a section in the shared memory is defined for the storage of input and output audio samples after/before the transport to/from the \ac{PCM} interface. The output section is initialized with an offset of 16 bytes from the input section (0x800000), resulting in a storage capacity of four 32-bit double-words for each of the two memory sections; this is more than needed, but avoids a later memory relocation should the need for more space arise. After this initialization, the interrupt register and the memory sections are declared as volatile variables, telling the compiler that these variables can be changed outside the normal program flow (e.g., by hardware interrupts), which prevents certain optimizations. The final input/output buffers are then declared in the form of two 16-bit arrays, consisting of 4 elements each. Finally, a flag variable is declared which signals to the \ac{DSP} core that an interrupt has occurred, i.e., that the state of the interrupt register has changed and a processing request is pending.
\begin{figure}[H]
\centering
\begin{lstlisting}[language=C]
@@ -80,16 +80,16 @@ The memory initialization section starts with the definition of the interrupt re
\label{fig:fig_dps_code_memory}
\caption{Low-level implementation: Memory initialization and mapping}
\end{figure}
\noindent Figure \ref{fig:fig_compiler.jpg} shows an exemplary memory map of the input buffer array, taken from the compiler debugger. As the array is initialized as a 16-bit integer array, each element occupies 2 bytes of memory.
\caption{Exemplary memory map of the 4-element input buffer array. As it is initialized as a 16-bit integer array, each element occupies 2 bytes of memory, resulting in a total size of 8 bytes for the entire array. As the DSP architecture works in 32-bit double-words, the bytewise addressing is a result of the compiler abstraction.}
\label{fig:fig_compiler.jpg}
\end{figure}
\paragraph{Main loop and interrupt handling}
The main loop of the \ac{DSP} core is quite compact, as it mainly focuses on handling interrupts and delegating the sample processing to the \ac{ANR} function. The main function starts by enabling interrupts with a compiler-specific function and setting up pointers for the output buffer and the sample variable. After setting the action flag to zero, the main function enters an infinite loop, in which it signals its halted state to the \ac{ARM} core by setting the interrupt register to 1 and then halts the core.\\\\
If the \ac{ARM} core requests a sample to be processed, it activates the \ac{DSP} core and triggers an interrupt, which sets the action flag to 1. The main loop then checks the action flag and sets the interrupt register back to 0, indicating to the \ac{ARM} core that it is now processing the sample. After resetting the action flag, the output pointer is updated to point to the next position in the output buffer using a cyclic addition function. Before the calculate\_output()-function is called, the calculated sample from the previous cycle is moved from its temporary memory location to the current position in the output buffer. Afterwards, the calculate\_output()-function is called for the current cycle and the loop restarts. The flow diagram in Figure \ref{fig:fig_dsp_logic.jpg} visualizes the described behavior of the main loop and interrupt handling on the \ac{DSP} core.
\begin{figure}[H]
\centering
\begin{lstlisting}[language=C]
@@ -126,7 +126,7 @@ The calculate\_output()-function at the very end of the main process loop repres
\subsection{DSP-level implementation of the ANR algorithm}
The ability to process audio samples in real-time on the \ac{DSP} core is strongly dependent on compiler-specific optimizations and hardware-specific implementation techniques, which allow a far more efficient execution of the algorithm compared to a native C implementation.
\subsubsection{DSP-specific optimizations for real-time processing}
In the following, some examples of possible optimizations are outlined, before the entire \ac{ANR} implementation on the \ac{DSP} is analyzed with regard to its performance.
\paragraph{Logic operations}
Logic operations, such as finding the maximum or minimum of two values, are quite common in signal processing algorithms. However, their implementation in C usually involves conditional statements (if-else), which can be inefficient on certain architectures due to pipeline stalls.\\\\
The simple function shown in Figure \ref{fig:fig_dsp_code_find_max} returns the maximum of two given integer values. Processing this manual implementation on the \ac{DSP} takes 12 cycles to execute, while the intrinsic function of the \ac{DSP} compiler allows a 4-cycle execution.
@@ -137,7 +137,7 @@ The simple function shown in Figure \ref{fig:fig_dsp_code_find_max} returns the
return (a > b) ? a : b;
}
\end{lstlisting}
\caption{Manual implementation of a max-function, returning the maximum of two integer values, taking 12 cycles to execute. The intrinsic functions of the DSP compiler allow a 4-cycle implementation of such an operation.}
\label{fig:fig_dsp_code_find_max}
\end{figure}
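\noindent For comparison, the same operation can be written without a conditional expression in portable C, as sketched below. This is not the intrinsic function of the \ac{DSP} compiler, whose name and signature are not reproduced here; the function name \texttt{max\_select} is hypothetical, and whether a given compiler actually emits branch-free code for it is not guaranteed.
\begin{lstlisting}[language=C]
#include <stdint.h>

/* Hypothetical branch-free formulation: the comparison result
   (0 or 1) is used as an index instead of steering a jump. */
static int32_t max_select(int32_t a, int32_t b)
{
    const int32_t candidates[2] = { a, b };
    return candidates[a < b];  /* index 1 selects b when a < b */
}
\end{lstlisting}
The sketch only illustrates the idea of replacing control flow by data selection; no cycle count on the \ac{DSP} is claimed for it.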
\paragraph{Cyclic array iteration}
@@ -156,12 +156,12 @@ Basically every part of the \ac{ANR} algorithm relies on iterating through memor
return new_pointer;
}
\end{lstlisting}
\caption{Manual implementation of a cyclic array iteration function in C, taking the core 20 cycles to execute a pointer increment of 1. The intrinsic functions of the DSP compiler allow a single-cycle implementation of such cyclic additions.}
\label{fig:fig_dsp_code_cyclic_add}
\end{figure}
\noindent Figure \ref{fig:fig_dsp_code_cyclic_add} shows a manual implementation of such a cyclic array iteration function in C, which updates the pointer to a new address. This implementation takes the \ac{DSP} 20 cycles to execute, while the compiler-provided intrinsic version takes only one cycle, as the \ac{DSP} architecture supports such cyclic additions as a single-cycle operation.
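\noindent The wrap-around behavior of such a cyclic addition can be illustrated with the following self-contained sketch in portable C. It is neither the manual implementation from the figure nor the compiler intrinsic; the function name \texttt{cyclic\_add\_sketch} and its parameter list are assumptions made for this illustration.
\begin{lstlisting}[language=C]
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: advance ptr by step elements inside the
   buffer [base, base + len) and wrap around at both ends. */
static int32_t *cyclic_add_sketch(int32_t *ptr, ptrdiff_t step,
                                  int32_t *base, size_t len)
{
    assert(len > 0 && ptr >= base && ptr < base + len);

    ptrdiff_t n   = (ptrdiff_t)len;
    ptrdiff_t idx = ((ptr - base) + step) % n;
    if (idx < 0) {
        idx += n;  /* C's % keeps the sign of the dividend */
    }
    return base + idx;
}
\end{lstlisting}
With a buffer of four elements, a step of $+1$ from the last element leads back to the first element, and a step of $-1$ from the first element leads to the last one.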
\paragraph{Fractional fixed-point arithmetic}
As already mentioned at the beginning of the current chapter, the \ac{DSP} used is a fixed-point processor, meaning that it does not support floating-point arithmetic natively. Instead, it relies on fixed-point arithmetic, which represents numbers as integers scaled by a fixed factor. This is a key requirement, as it allows the use of the implemented dual \ac{MAC}\ac{ALU}s. This approach is also faster and more energy efficient, and therefore more suitable for embedded systems. However, it also introduces challenges in terms of precision and range, which need to be taken into account when conducting certain calculations.\\\\
To tackle these issues, the \ac{DSP} compiler provides intrinsic functions for fractional fixed-point arithmetic, such as a fractional multiplication function, which takes two 32-bit integers as input and returns an already bit-shifted 64-bit output, representing the fractional multiplication result. This approach removes the need for manual bit-shifting operations after each multiplication.\\\\
To support such operations, a 72-bit accumulator is provided, which can store intermediate 64-bit results of 32-bit multiplications without losing precision, while the remaining 8 bits serve as overflow space. If needed, a saturation function is also provided to round the 64-bit result back to a 32-bit value.
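
\noindent As a worked example of this scaling, assume that both operands use the common Q1.31 format, in which a 32-bit integer $A$ represents the value $a = A / 2^{31}$. This format is an assumption made purely for illustration, as the exact format used by the intrinsic function is not stated here. The raw 64-bit product then carries a scaling factor of $2^{62}$, and a left shift by one bit aligns it to a scaling factor of $2^{63}$:
\[
A \cdot B = (a \cdot 2^{31}) \cdot (b \cdot 2^{31}) = ab \cdot 2^{62}, \qquad 2 \cdot (A \cdot B) = ab \cdot 2^{63}
\]
For $a = 0.5$ and $b = 0.25$, this gives $A = 2^{30}$, $B = 2^{29}$ and $A \cdot B = 2^{59} = 0.125 \cdot 2^{62}$, so that the shifted result $2^{60}$ represents the value $0.125$ in the assumed 64-bit fractional format.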
@@ -177,7 +177,7 @@ The $calculate\_output()$ functions consists out of the following five main part
\end{itemize}
These sub-functions feature \ac{DSP}-specific optimizations, and their computational cost partly depends on settable parameters such as the filter length. The following paragraphs analyze the computational efficiency of these sub-functions in detail.
\paragraph{write\_buffer}The $write\_buffer$-function is responsible for managing the input line, where the samples of the reference noise signal are stored for further processing. The buffer management mainly consists of a cyclic pointer increase operation and a pointer dereference operation to write the new sample into the buffer. The cyclic pointer increase operation is implemented using the already mentioned intrinsic function of the \ac{DSP} compiler, while the pointer dereference operation takes 15 cycles to execute. This results in a total duration of 16 cycles for the $write\_buffer$-function, independent of the filter length or other parameters.
\paragraph{apply\_fir\_filter} The $apply\_fir\_filter$-function is responsible for applying the coefficients of the \ac{FIR} filter to the reference noise signal samples stored in the input line. The number of cycles needed for this function mainly depends on the length of the filter, as the number of multiplications and additions increases with the filter length. To increase the performance, the dual \ac{MAC} architecture of the \ac{DSP} is utilized, allowing two multiplications and two additions to be performed in a single cycle. Another \ac{DSP}-specific optimization is the use of the already introduced 72-bit accumulators and the fractional multiplication function, which allows performing multiplications on two 32-bit integers without losing precision or the need for manual bit-shifting operations.
\begin{figure}[H]
\centering
\begin{lstlisting}[language=C]
@@ -206,7 +206,7 @@ These sub-functions feature \ac{DSP}-spefic optimizations and are partly depenen
\end{figure}
\noindent The final result is represented in a computing effort of 1 cycle per item in the sample line buffer (which equals the filter length) plus 12 cycles for general function overhead, resulting in a total of $N+12$ cycles for the $apply\_fir\_filter$-function, with $N$ being the filter length.
\paragraph{update\_output} The $update\_output$-function is responsible for calculating the output sample based on the error signal and the accumulated filter output. The calculation is a simple subtraction and only takes 1 cycle to execute, independent of the filter length or other parameters.
\paragraph{update\_filter\_coefficient} The $update\_filter\_coefficient$-function represents the second computationally expensive part of the $calculate\_output()$-function. The calculated output from the previous function is now multiplied with the step size and the corresponding sample from the reference noise signal, which is stored in the sample line buffer. The result is then added to the current filter coefficient to update it for the next cycle. Again, \ac{DSP}-specific optimizations, like the dual \ac{MAC} architecture, are used, resulting in a computing effort of 6 cycles per filter coefficient. Per function call, an overhead of 8 cycles is added, resulting in a total of $6 \cdot N+8$ cycles for the $update\_filter\_coefficient$-function, with $N$ again being the filter length.
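
\noindent As a rough per-sample estimate, the cycle counts stated above for the four sub-functions described so far can be summed. This is only a sketch, as it ignores any further parts of $calculate\_output()$ and any call overhead not listed here:
\[
C(N) = 16 + (N + 12) + 1 + (6 \cdot N + 8) = 7 \cdot N + 37
\]
For example, a filter length of $N = 16$ gives $C(16) = 149$ cycles, of which $6 \cdot 16 + 8 = 104$ cycles stem from the coefficient update alone.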
\begin{figure}[H]
\centering
\begin{lstlisting}[language=C]
@@ -225,7 +225,7 @@ These sub-functions feature \ac{DSP}-spefic optimizations and are partly depenen
p_w0+=2;
}
\end{lstlisting}
\caption{Code snippet of the $update\_filter\_coefficient$-function, again making use of the dual \ac{MAC} architecture of the \ac{DSP} and the fractional multiplication function. Additionally, 32-bit values are loaded and stored as 64-bit values using two further intrinsic functions, which allows two filter coefficients to be updated in a single cycle.}
\@writefile{lof}{\contentsline{figure}{\numberline{45}{\ignorespaces Desired signal, corrupted signal, reference noise signal and filter output of the complex \ac{ANR} use case, simulated on the \ac{DSP}}}{56}{}\protected@file@percent }
\@writefile{lof}{\contentsline{figure}{\numberline{46}{\ignorespaces Error signal and filter coefficient evolution of the complex \ac{ANR} use case, simulated on the \ac{DSP}}}{57}{}\protected@file@percent }
\section{Performance evaluation of different implementation variants}
To verify the general performance of the \ac{ANR} algorithm implemented on the \ac{DSP}, the complex use case of the high-level implementation is used, which again includes a 16-tap \ac{FIR} filter and an update of the filter coefficients every cycle. In contrast to the high-level implementation, the coefficient convergence is no longer included in the evaluation, but the metric for the \ac{ANR} performance remains the \ac{SNR} improvement.
@@ -12,4 +12,20 @@ To verify the general performance of the \ac{DSP} implemented \ac{ANR} algorithm
\caption{Error signal and filter coefficient evolution of the complex \ac{ANR} use case, simulated on the \ac{DSP}}
\label{fig:fig_plot_2_dsp_complex.png}
\end{figure}
Figures \ref{fig:fig_plot_1_dsp_complex.png} and \ref{fig:fig_plot_2_dsp_complex.png} show the results of the complex \ac{ANR} use case, simulated on the \ac{DSP}. The \ac{SNR} improvement of XXXX dB is nearly the same as the one of the high-level implementation, which is XXXX dB. The small difference can be explained by the fact that the \ac{DSP} implementation is based on fixed-point arithmetic, which leads to a slightly different convergence behavior. Nevertheless, the results show that the \ac{DSP} implementation of the \ac{ANR} algorithm is able to achieve a similar performance as the high-level implementation, again indicating that 16 filter coefficients are insufficient to filter out a complex, phase-shifted noise signal. The next step is to evaluate the performance of the \ac{DSP} implementation in terms of computational efficiency under different scenarios.
\subsection{Computational efficiency evaluation}
\noindent For the evaluation of the computational efficiency, different combinations of desired signals and noise signals are considered. This approach reduces the risk that a single combination of signals is not representative of the overall performance of the \ac{ANR} algorithm.
The desired signals are chosen as follows:
\begin{itemize}
\item A male speaker on TV
\item A short music jingle
\end{itemize}
PLOT
These two desired signals are corrupted with 3 different noise signals:
\begin{itemize}
\item The already used breathing sound
\item A chewing sound
\item A scratching sound
\end{itemize}
PLOT
The combination of the stated sets yields six different scenarios, each posing different challenges for the \ac{ANR} algorithm. For every scenario, the \ac{SNR} gain is calculated with an increasing number of filter coefficients, ranging from 16 to 64.
@@ -47,7 +47,7 @@ in cochlear implant systems}}\\[-0.5ex]
\textbf{\Large{Master of Science}}\\
\bigskip\par
by \par
\large{Patrick Hangl, B.Sc.}\\[-1ex]
\large{Matriculation Nr.: q4179749}\\ [-1ex]
\vspace{0.6cm}
\end{center}