
Memory analysis of pkQueueTS as a cv::Mat buffer. Part 2, real case.

Memory Analysis - Part 1 concluded that a std::queue of OpenCV Mats uses memory efficiently. In this article we test our pkQueueTS queue with a grabber thread and a processor thread, and we analyse memory usage to confirm the preliminary conclusions about recycling and efficiency.

Architecture of the test

We want to investigate how std::queue and cv::Mat use memory. A single-producer, single-consumer multithreaded environment is adequate to evaluate the memory behaviour of the queue.

To this end we will create an OpenCV application with 2 concurrent threads. A grabber thread captures images from a webcam and pushes them onto a queue. A processor thread retrieves images from the queue and processes them.

Unfortunately std::queue isn't thread safe, and cv::Mat can't be used directly with it. For these reasons we will use our pkQueueTS, which is a thread-safe std::queue with tools to enqueue cv::Mat. In addition, pkQueueTS offers the OnPush feature, which can be used to investigate the memory behaviour of the queue.
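To make the idea concrete, here is a minimal sketch of a thread-safe wrapper around std::queue with a push hook. It is not the real pkQueueTS: the class name TinyQueueTS, the PushHook type and the bool return value of Pop() are assumptions made only for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <utility>

// Hypothetical stand-in for pkQueueTS: only the general idea, not the real class.
template <class T>
class TinyQueueTS
{
public:
    // Hypothetical hook, called on every push while the queue lock is held.
    using PushHook = std::function<void(std::size_t queueSize, const void *elementPtr)>;

    explicit TinyQueueTS(PushHook hook = nullptr) : m_hook(std::move(hook)) {}

    // Copies the element into the queue and returns the new queue size.
    std::size_t Push(const T &element)
    {
        std::size_t n = 0;
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_queue.push(element);
            n = m_queue.size();
            if (m_hook)
                m_hook(n, &m_queue.back());
        }
        m_notEmpty.notify_one(); // wake up a waiting consumer
        return n;
    }

    // Waits up to timeoutMs for an element; returns false on timeout.
    bool Pop(T &out, unsigned timeoutMs)
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        const bool ready = m_notEmpty.wait_for(
            lock, std::chrono::milliseconds(timeoutMs),
            [this] { return !m_queue.empty(); });
        if (!ready)
            return false;
        out = std::move(m_queue.front());
        m_queue.pop();
        return true;
    }

private:
    std::queue<T> m_queue;
    std::mutex m_mutex;
    std::condition_variable m_notEmpty;
    PushHook m_hook;
};
```

Calling the hook while the lock is held means the handler sees a consistent queue size and element address, at the price of making every push slightly longer.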

The grabber follows the webcam at 20 FPS (one frame every 50 ms), while processing is quite fast, let's say 10 ms. In order to generate a variable queue size we will introduce some delay in the processor thread. Because we want repeatable, comparable tests, the number of grabbed frames and the delay are constant.
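To get a feeling for the queue sizes these timings should produce, the schedule can be simulated on a 1 ms clock. The sketch below is not part of the test application. The function name and its parameters are made up for illustration, and it assumes an ideal grabber that pushes exactly one frame per period and a processor that pauses every pauseEvery frames (500 ms every 100 frames in the test application).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Idealised model: returns the largest queue size seen right after a push.
inline std::size_t SimulateMaxQueueSize(unsigned frames, unsigned grabPeriodMs,
                                        unsigned processMs, unsigned pauseEvery,
                                        unsigned pauseMs)
{
    std::size_t queued = 0, maxQueued = 0;
    unsigned pushed = 0, processed = 0;
    unsigned long busyUntil = 0; // time at which the processor is free again

    for (unsigned long t = grabPeriodMs; processed < frames; ++t)
    {
        // Grabber: one push every grabPeriodMs
        if (pushed < frames && t % grabPeriodMs == 0)
        {
            ++pushed;
            ++queued;
            maxQueued = std::max(maxQueued, queued);
        }
        // Processor: pops as soon as it is idle and a frame is available
        if (queued > 0 && t >= busyUntil)
        {
            --queued;
            ++processed;
            busyUntil = t + processMs;
            if (pauseEvery != 0 && processed % pauseEvery == 0)
                busyUntil += pauseMs; // fixed delay
        }
    }
    return maxQueued;
}
```

In this model the backlog during a pause is roughly the pause length divided by the grab period: SimulateMaxQueueSize(450, 50, 10, 100, 500) peaks at 500 / 50 = 10 queued frames. A real webcam and scheduler will of course add jitter.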

Each time we push an element onto the queue we take some memory measurements. At the end we produce a report to better understand the memory behaviour of the queue and its elements.

Memory analysis with pkQueueTS

The idea is to log some interesting "memory indicators" while the queue is running. In detail, we would like to record:

  • Time point
  • Number of push calls
  • Memory working set (MByte)
  • Queue max size
  • Queue current size
  • Number of unique memory addresses for queue elements
  • Number of unique memory addresses for cv::Mat data generated

According to the pkQueueTS documentation, the OnPush feature can be used for this purpose. The MemoryMeter class has been created and is used as the OnPush event handler.

Here is our memory meter class (the template type parameter tells the meter the queue's data type):

 
/** \brief Custom class to collect statistics about memory usage using the `OnPush` handler
*/
template <class TElement>
class MemoryMeter : public pkQueueOnPushBase
{
public:
    /** \brief Default constructor */
    MemoryMeter(const string &fname);
 
    /** \brief Destructor */
    ~MemoryMeter();
 
    /** \brief Measures current memory info and writes it to the log file
    * \param queueSize current queue size;
    * \param elementPtr address of the pushed element (`&m_queue.back()`)
    * \pre This function is called by pkQueueTS::Push() while it holds the lock on the queue
    */
    void OnPush(size_t queueSize, const void *elementPtr)
    {
        m_queueSize = queueSize;
        m_queueSizeMax = max(m_queueSizeMax, m_queueSize);
        m_numOfElements++;
        m_counterQueueSize[m_queueSize]++;
        // Count unique element allocations
        m_counterMemoryElements[elementPtr]++;
        // Count unique mat data allocations
        const void *imgDataPtr = static_cast<const TElement*>(elementPtr)->mat.data;
        m_counterMemoryImgData[imgDataPtr]++;
        Log();
    }
 
    /** \brief Prepare a report and printout to the console or save it to a file.
    * \param fname file name where to save the report. If empty the console will be used
    */
    void Report(const string &fname = "");
 
private:
#ifdef _WIN32
    DWORD  m_processID = 0;
 
    /** \brief Returns WorkingSetSize Memory Usage for given process
    *
    * \return WorkingSetSize in byte
    */
    SIZE_T MemoryWorkingSetSize();
#else
#error "Please write code to get working memory for your platform"
#endif
 
    /** \brief Write current memory info to the log file */
    void Log(bool init = false);
 
    /** \brief Returns current counters as string */
    string AsString(bool title = true, bool value = true);
 
    chrono::steady_clock::time_point m_startTime;   //!< Clock at construction time
    ofstream m_logFile;                             //!< The log stream
    string m_fname;                                 //!< The file name for the log
    size_t m_numOfElements;                         //!< Total number of elements
    size_t m_queueSize;                             //!< Current queue size
    size_t m_queueSizeMax;                          //!< Max queue size
 
    /** Unique memory addresses of queued elements */
    map<const void*, uint64> m_counterMemoryElements;
    /** Unique memory addresses of cv::Mat matrix data */
    map<const void*, uint64> m_counterMemoryImgData;
    /** Keep History of queue size. Elements are sorted by size ascending. */
    std::map<size_t, uint64> m_counterQueueSize;
};
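Log() is only declared above. As a rough idea of what one CSV record per push could look like, here is a minimal sketch. The struct MeterSample, the column names and the helper functions are assumptions made for illustration, not the author's implementation.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical snapshot of the indicators listed in the article.
struct MeterSample
{
    double seconds;           // time since the meter was created
    std::size_t pushCount;    // number of push calls so far
    double workingSetMB;      // memory working set
    std::size_t queueSizeMax; // max queue size so far
    std::size_t queueSize;    // current queue size
    std::size_t elementAddrs; // unique addresses of queue elements
    std::size_t matDataAddrs; // unique addresses of cv::Mat data
};

// Column titles, in the same order as WriteCsvRow() writes the values.
inline std::string CsvHeader()
{
    return "time_s,push_count,working_set_mb,queue_max,queue_size,"
           "element_addrs,mat_data_addrs";
}

// Writes one comma separated record followed by a newline.
inline void WriteCsvRow(std::ostream &os, const MeterSample &s)
{
    os << s.seconds << ',' << s.pushCount << ',' << s.workingSetMB << ','
       << s.queueSizeMax << ',' << s.queueSize << ','
       << s.elementAddrs << ',' << s.matDataAddrs << '\n';
}

// Seconds elapsed since `start`, measured with a monotonic clock.
inline double SecondsSince(std::chrono::steady_clock::time_point start)
{
    return std::chrono::duration<double>(
               std::chrono::steady_clock::now() - start).count();
}
```

Writing to a std::ostream keeps the sketch independent of the file handling; a std::ofstream can be passed in the same way.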

The MemoryMeter::OnPush() method is called by the queue on each push; it collects all the needed information and writes a log record.

Grabber -> pkQueueTS::Push() -> MemoryMeter::OnPush() -> MemoryMeter::Log()
  • std::map has been used as a frequency counter for memory addresses. Its size is the number of unique addresses;
  • To get the working memory size, because we are on Windows, the WorkingSetSize (PSAPI) has been used in MemoryMeter::Log();
  • All information is stored in a CSV log file for easy analysis;
  • Locks have been avoided because OnPush() already runs while Push() holds the queue lock, and we don't read the MemoryMeter members while the test is running.
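The body of MemoryWorkingSetSize() is not shown in this article. A minimal sketch of how the working set could be read is given below. The Windows branch uses GetProcessMemoryInfo() from PSAPI. The other branch is an assumption for Linux only, where /proc/self/statm reports sizes in pages. The function name WorkingSetBytes is made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#ifdef _WIN32
#include <windows.h>
#include <psapi.h> // link with psapi.lib
#else
#include <fstream>
#include <unistd.h>
#endif

// Returns the resident memory of the current process in bytes, or 0 on failure.
inline std::size_t WorkingSetBytes()
{
#ifdef _WIN32
    PROCESS_MEMORY_COUNTERS pmc;
    if (GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc)))
        return static_cast<std::size_t>(pmc.WorkingSetSize);
    return 0;
#else
    // Assumption: Linux. The first two fields of /proc/self/statm are the
    // total program size and the resident set size, both in pages.
    std::ifstream statm("/proc/self/statm");
    unsigned long totalPages = 0, residentPages = 0;
    if (!(statm >> totalPages >> residentPages))
        return 0;
    return static_cast<std::size_t>(residentPages) *
           static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
#endif
}
```

Dividing the result by 1024 * 1024 gives the MByte value used in the report.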

Use of std::map to collect addresses

A map can be pictured as a table with N columns and 2 rows: the first row contains the keys, the second row contains the values. Writing map[key]++ gives us a counter for that key. Because we are using memory addresses as keys, map[address]++ creates a counter per address.

So our std::map<const void *, uint64> m_counterMemory contains:

  • iterator->first memory addresses;
  • iterator->second number of times i-th address has been used;
  • m_counterMemory.size() the number of different addresses;

Example: 4 different memory addresses have been allocated. The address 0x12345 has been used 25 times, and so on...

        +----------------------------------------+
 first: | 0x12345  | 0x133A5 | 0x143B2 | 0x153E5 | <= element pointers
second: | 25       | 48      | 22      | 6       | <= counts
        +----------------------------------------+
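The same counting idiom in isolation, as a small sketch: the helper CountAddresses and the alias AddressCounter are hypothetical names, and std::uint64_t stands in for the uint64 type used in the class.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

using AddressCounter = std::map<const void *, std::uint64_t>;

// Builds a frequency counter from a sequence of observed addresses.
inline AddressCounter CountAddresses(const std::vector<const void *> &seen)
{
    AddressCounter counter;
    for (const void *p : seen)
        counter[p]++; // operator[] inserts a missing key with value 0 first
    return counter;
}
```

After counting, counter.size() is the number of distinct addresses and counter[address] is how many times that address has been seen.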

The test application

OK, now we have the queue with a nice meter. Let's create a real test case using the initial architecture.

 
/** Instance of our memory meter */
string fname = "./memoryMeter";
MemoryMeter<MatCapsule> memoryMeter(fname + ".csv");
/** The queue, using our custom memory meter as OnPush handler */
pkQueueTS<MatCapsule> theQueue(&memoryMeter);
/** Lock free var to control the threads */
atomic<bool> processorOn, grabberOn;
 
/** \brief The grabber thread.
 *  Grabs from a webcam and writes frames on the queue.
 *  It also shows the grabbed frames and current queue size.
 */
void TheGrabberThread(int device)
{
...
    cv::VideoCapture cap(device);
    MatCapsule data;
    grabberOn.store(true);
    while (grabberOn.load())
    {
        cap >> data.mat;
        if (data.mat.empty()) continue;
...
        // Push on the queue
        size_t N = theQueue.Push(data);
...
        imshow(winName, data.mat);
        waitKey(1);
    }
}
 
/** \brief The processing function.
 *  Applies morphology gradient on the image.
 *  Is called by the processing thread.
 */
void TheImageProcess(cv::Mat &img, unsigned frameNum, const string &winName);
 
/** \brief The processor thread.
 *  Starts the Grabber Thread, reads frames from the queue and processes them.
 *  Every 100 frames it sleeps for 500 ms.
 * \param maxFrame Number of frames to process. Use 0 to run forever
 */
void TheProcessorThread(unsigned maxFrame)
{
...
    thread grab(TheGrabberThread, 0);      // start the grabbing task
    MatCapsule data;
    unsigned frameCount = 0;
    processorOn.store(true);
    while (processorOn.load())
    {
        pkQueueResults res = theQueue.Pop(data, 2000);
        if (res != PK_QTS_OK)
        {
            if (res == PK_QTS_TIMEOUT)
                cout << "WARNING: time out reading from the queue!" << endl;
            if (res == PK_QTS_EMPTY)
                cout << "INFO: the queue is empty!" << endl;
            // pass the control to other threads
            this_thread::yield();
            continue;
        }
        frameCount++;
        /// Delay each 100 frames
        bool pause = (frameCount % 100 == 0);
...
        TheImageProcess(data.mat, frameCount, winName);
        if (pause) // Fixed delay
            this_thread::sleep_for(chrono::milliseconds(500));
 
        // check if it's time to terminate
        if (maxFrame && (frameCount == maxFrame))
            processorOn.store(false);
    }
 
    // stop the grab loop
    grabberOn.store(false);
    grab.join();
    // flush the buffer
    while (PK_QTS_EMPTY != theQueue.Pop(data, 0))
        TheImageProcess(data.mat, ++frameCount, winName);
}
 
 
/** \brief Main application.
It runs the Processor thread
*/
int main()
{
...
    unsigned frameToGrab = 450;
    thread thProc(TheProcessorThread, frameToGrab);   // start the processing thread
 
    // your own GUI
    cout << endl << "Grabbing " << frameToGrab << " frames..." << endl;
 
    // terminate
    processorOn.store(false);       // stop the processing thread
    thProc.join();                  // wait for thread termination
 
    memoryMeter.Report(fname + ".txt");
...
}
 

Results

  • Target machine: Win7 x64, i3, 8 GB RAM
  • Compiler: Visual Studio 2013 and TDM-GCC version 5.1.0 (tdm64-1). Both x64 platform
  • OpenCV 3.1.0 (debug build)
  • Grabbing @ 20FPS

Summary report produced by memoryMeter.Report(). Below is the MSVC version; the GCC version is quite similar. Check the plots below for details.

Summary:
--------
24.9619	[ s ] Time	
451	[ # ] Push Cout	
45	[MByte] Mem working set	
10	[ # ] Queue max size	
1	[ # ] Queue current size	
16	[ # ] Memory addrs for elements	
25	[ # ] Memory addrs for cv::Mat data	

Queue size frequencies:
-------------------
Sizes:	1	2	3	4	5	6	7	8	9	10
Count:	382	9	6	8	8	8	7	7	6	10

Plots of data produced by MemoryMeter::Log()


Plot 1 - Memory vs time

Analysis:

The reader should remember that the Processor thread is delayed for 500 ms every 100 frames. This explains the regular peaks in Queue current size.

  • The text report says that 451 frames have been grabbed (and pushed), but we have only a few memory addresses. This means that many memory blocks have been recycled. In detail:
    16	[ # ] Memory addrs for elements
    25	[ # ] Memory addrs for cv::Mat data
  • Looking at the plots, while Queue current size doesn't change, the memory addresses are always recycled.
  • When the queue size increases, the memory is partially recycled. Some new memory blocks are required, but not all of them are allocated at the same addresses as before, so the number of unique memory addresses also increases (especially for cv::Mat data). However, the number of addresses grows less than the queue size, hence memory is partially recycled.
  • Memory working set depends on the current queue size, not on the push count or on how effective memory recycling is.
  • Memory working set shrinks together with the queue size even when memory recycling occurs (see time point 17 s). This means that the process releases the memory used by previous pushes to the OS, yet it was possible to get the same memory back on the next push.
  • The described behaviour is repeatable and comparable across multiple runs of our test. Memory recycling is always effective, even if not always to the same degree; the overall machine load probably has some influence on recycling.
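Address recycling can also be observed without OpenCV. The sketch below pushes freshly allocated byte buffers through a std::queue kept at a fixed depth and counts the distinct data addresses, in the same spirit as the meter. It is only an illustration: std::vector stands in for a deep-copied frame, the function name is hypothetical, and the resulting count depends entirely on the allocator, the buffer size and the platform.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <set>
#include <vector>

// Pushes `pushes` heap buffers through a queue limited to `depth` elements
// and returns how many distinct data addresses have been observed.
inline std::size_t CountUniqueBufferAddresses(std::size_t pushes,
                                              std::size_t depth,
                                              std::size_t bytes)
{
    std::queue<std::vector<unsigned char>> q;
    std::set<const void *> addresses;
    for (std::size_t i = 0; i < pushes; ++i)
    {
        // Fresh allocation on every push, like a deep-copied frame
        q.push(std::vector<unsigned char>(bytes));
        addresses.insert(q.back().data());
        if (q.size() > depth)
            q.pop(); // destroys the oldest buffer and frees its memory
    }
    return addresses.size();
}
```

If the allocator hands freed blocks back, a call such as CountUniqueBufferAddresses(451, 10, 640 * 480 * 3) reports far fewer addresses than pushes. The count can never be lower than the number of buffers alive at the same time.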

Conclusion

We can conclude that:

  • A std::queue of Mats like our pkQueueTS is memory efficient.
  • A lot of memory recycling is performed by the memory manager.
  • The required memory depends on the size (length) of the queue, regardless of how many pushes we perform.

We should also remember that memory recycling can't save us from:

  • memory copies (memcpy, ...): each push creates a copy of the element. In the case of images, the cost of the copy can be relevant;
  • memory allocations (malloc, free, new, delete): each pop calls the element destructor, then the memory is freed. Even if memory is recycled, the memory manager has to allocate the block again when it is needed. Heap allocation can block all threads that share the same heap. Buffers with dynamic memory should be used with care when high multithreaded performance is a must; static structures like circular arrays can make the difference.
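As an illustration of the circular array idea, here is a minimal sketch of a fixed-capacity ring buffer for exactly one producer thread and one consumer thread. It is a different technique from pkQueueTS, not a replacement provided by the library; the name SpscRing and its interface are assumptions. All slots are created once, so pushing and popping never allocates queue nodes.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Single-producer, single-consumer ring buffer with N slots.
// One slot is always left empty, so it holds at most N - 1 elements.
template <class T, std::size_t N>
class SpscRing
{
public:
    // Called by the producer thread only. Returns false when the ring is full.
    bool TryPush(const T &value)
    {
        const std::size_t head = m_head.load(std::memory_order_relaxed);
        const std::size_t next = (head + 1) % N;
        if (next == m_tail.load(std::memory_order_acquire))
            return false;                 // full: the caller decides what to drop
        m_slots[head] = value;            // copy-assign into the existing slot
        m_head.store(next, std::memory_order_release);
        return true;
    }

    // Called by the consumer thread only. Returns false when the ring is empty.
    bool TryPop(T &out)
    {
        const std::size_t tail = m_tail.load(std::memory_order_relaxed);
        if (tail == m_head.load(std::memory_order_acquire))
            return false;                 // empty
        out = m_slots[tail];
        m_tail.store((tail + 1) % N, std::memory_order_release);
        return true;
    }

private:
    std::array<T, N> m_slots{};           // storage created once, reused forever
    std::atomic<std::size_t> m_head{0};   // next slot to write
    std::atomic<std::size_t> m_tail{0};   // next slot to read
};
```

Whether the element copy itself avoids allocation depends on the element type: a type whose assignment reuses its existing buffer benefits most. The trade-off is a hard capacity limit, so the grabber must decide what to do when the ring is full.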

Enjoy testing and please send us your comments!

The code, illustrations and examples on this page are for illustrative purposes only. The author takes no responsibility for their use by the end user.
This material is the property of Pk Lab and may be used freely provided the source is cited.