minesweeper

A minesweeper implementation to play around with Hare and Raylib
git clone https://git.tronto.net/minesweeper

stb_image_resize2.h (451105B)


      1 /* stb_image_resize2 - v2.12 - public domain image resizing
      2 
      3    by Jeff Roberts (v2) and Jorge L Rodriguez
      4    http://github.com/nothings/stb
      5 
      6    Can be threaded with the extended API. SSE2, AVX, Neon and WASM SIMD support. Only
      7    scaling and translation are supported, no rotations or shears.
      8 
      9    COMPILING & LINKING
     10       In one C/C++ file that #includes this file, do this:
     11          #define STB_IMAGE_RESIZE_IMPLEMENTATION
     12       before the #include. That will create the implementation in that file.
     13 
     14    EASY API CALLS:
     15      Easy API downsamples w/Mitchell filter, upsamples w/cubic interpolation, clamps to edge.
     16 
     17      stbir_resize_uint8_srgb( input_pixels,  input_w,  input_h,  input_stride_in_bytes,
     18                               output_pixels, output_w, output_h, output_stride_in_bytes,
     19                               pixel_layout_enum )
     20 
     21      stbir_resize_uint8_linear( input_pixels,  input_w,  input_h,  input_stride_in_bytes,
     22                                 output_pixels, output_w, output_h, output_stride_in_bytes,
     23                                 pixel_layout_enum )
     24 
     25      stbir_resize_float_linear( input_pixels,  input_w,  input_h,  input_stride_in_bytes,
     26                                 output_pixels, output_w, output_h, output_stride_in_bytes,
     27                                 pixel_layout_enum )
     28 
     29      If you pass NULL or zero for the output_pixels, we will allocate the output buffer
     30      for you and return it from the function (free with free() or STBIR_FREE).
     31      As a special case, XX_stride_in_bytes of 0 means packed continuously in memory.
     32 
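     For example, a minimal downscale that lets the library allocate the output
     buffer (a sketch - the big_rgba buffer, the 64x64 input and the 32x32 output
     sizes are just illustrative):

        unsigned char * small_rgba = stbir_resize_uint8_srgb( big_rgba, 64, 64, 0,
                                                              NULL,     32, 32, 0,
                                                              STBIR_RGBA );
        // ... use small_rgba, then release it ...
        free( small_rgba );
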
     33    API LEVELS
     34       There are three levels of API - easy-to-use, medium-complexity and extended-complexity.
     35 
     36       See the "header file" section of the source for API documentation.
     37 
     38    ADDITIONAL DOCUMENTATION
     39 
     40       MEMORY ALLOCATION
     41          By default, we use malloc and free for memory allocation.  To override the
     42          memory allocation, before the implementation #include, add a:
     43 
     44             #define STBIR_MALLOC(size,user_data) ...
     45             #define STBIR_FREE(ptr,user_data)   ...
     46 
     47          Each resize makes exactly one call to malloc/free (unless you use the
     48          extended API where you can do one allocation for many resizes). Under
     49          address sanitizer, we do separate allocations to find overread/writes.
     50 
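         For example, to route allocations through your own functions (a sketch -
         my_alloc and my_free are placeholder names, not part of this library):

            #define STB_IMAGE_RESIZE_IMPLEMENTATION
            #define STBIR_MALLOC(size,user_data) my_alloc(size)
            #define STBIR_FREE(ptr,user_data)    my_free(ptr)
            #include "stb_image_resize2.h"
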
     51       PERFORMANCE
     52          This library was written with an emphasis on performance. When testing
     53          stb_image_resize with RGBA, the fastest mode is STBIR_4CHANNEL with
     54          STBIR_TYPE_UINT8 pixels and CLAMPed edges (which is what many other resize
     55          libs do by default). Also, make sure SIMD is turned on of course (default
     56          for 64-bit targets). Avoid WRAP edge mode if you want the fastest speed.
     57 
     58          This library also comes with profiling built-in. If you define STBIR_PROFILE,
     59          you can use the advanced API and get low-level profiling information by
     60          calling stbir_resize_extended_profile_info() or stbir_resize_split_profile_info()
     61          after a resize.
     62 
     63       SIMD
     64          Most of the routines have optimized SSE2, AVX, NEON and WASM versions.
     65 
     66          On Microsoft compilers, we automatically turn on SIMD for 64-bit x64 and
     67          ARM; for 32-bit x86 and ARM, you select SIMD mode by defining STBIR_SSE2 or
     68          STBIR_NEON. For AVX and AVX2, we auto-select it by detecting the /arch:AVX
     69          or /arch:AVX2 switches. You can also always manually turn SSE2, AVX or AVX2
     70          support on by defining STBIR_SSE2, STBIR_AVX or STBIR_AVX2.
     71 
     72          On Linux, SSE2 and NEON are on by default for 64-bit x64 or ARM64. For 32-bit,
     73          we select x86 SIMD mode by whether you have -msse2, -mavx or -mavx2 enabled
     74          on the command line. For 32-bit ARM, you must pass -mfpu=neon-vfpv4 for both
     75          clang and GCC, but GCC also requires an additional -mfp16-format=ieee to
     76          automatically enable NEON.
     77 
     78          On x86 platforms, you can also define STBIR_FP16C to turn on FP16C instructions
     79          for converting back and forth to half-floats. This is autoselected when we
     80          are using AVX2. Clang and GCC also require the -mf16c switch. ARM always uses
     81          the built-in half float hardware NEON instructions.
     82 
     83          You can also tell us to use multiply-add instructions with STBIR_USE_FMA.
     84          Because x86 doesn't always have fma, we turn it off by default to maintain
     85          determinism across all platforms. If you don't care about non-FMA determinism
     86          and are willing to restrict yourself to more recent x86 CPUs (around the AVX
     87          timeframe), then fma will give you around a 15% speedup.
     88 
     89          You can force off SIMD in all cases by defining STBIR_NO_SIMD. You can turn
     90          off AVX or AVX2 specifically with STBIR_NO_AVX or STBIR_NO_AVX2. AVX is 10%
     91          to 40% faster, and AVX2 is generally another 12%.
     92 
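         For example, to force the AVX2 code paths on a compiler where they are not
         auto-detected (a sketch - make sure your compiler is actually emitting
         AVX2-capable code first):

            #define STB_IMAGE_RESIZE_IMPLEMENTATION
            #define STBIR_AVX2
            #include "stb_image_resize2.h"
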
     93       ALPHA CHANNEL
     94          Most of the resizing functions provide the ability to control how the alpha
     95          channel of an image is processed.
     96 
     97          When alpha represents transparency, it is important that when combining
     98          colors with filtering, the pixels should not be treated equally; they
     99          should use a weighted average based on their alpha values. For example,
    100          if a pixel is 1% opaque bright green and another pixel is 99% opaque
    101          black and you average them, the average will be 50% opaque, but the
    102          unweighted average will be a middling green color, while the weighted
    103          average will be nearly black. This means the unweighted version introduced
    104          green energy that didn't exist in the source image.
    105 
    106          (If you want to know why this makes sense, you can work out the math for
    107          the following: consider what happens if you alpha composite a source image
    108          over a fixed color and then average the output, vs. if you average the
    109          source image pixels and then composite that over the same fixed color.
    110          Only the weighted average produces the same result as the ground truth
    111          composite-then-average result.)
    112 
    113          Therefore, it is in general best to "alpha weight" the pixels when applying
    114          filters to them. This essentially means multiplying the colors by the alpha
    115          values before combining them, and then dividing by the alpha value at the
    116          end.
    117 
    118          The computer graphics industry introduced a technique called "premultiplied
    119          alpha" or "associated alpha" in which image colors are stored in image files
    120          already multiplied by their alpha. This saves some math when compositing,
    121          and also avoids the need to divide by the alpha at the end (which is quite
    122          inefficient). However, while premultiplied alpha is common in the movie CGI
    123          industry, it is not commonplace in other industries like videogames, and most
    124          consumer file formats are generally expected to contain not-premultiplied
    125          colors. For example, Photoshop saves PNG files "unpremultiplied", and web
    126          browsers like Chrome and Firefox expect PNG images to be unpremultiplied.
    127 
    128          Note that there are three possibilities that might describe your image
    129          and resize expectation:
    130 
    131              1. images are not premultiplied, alpha weighting is desired
    132              2. images are not premultiplied, alpha weighting is not desired
    133              3. images are premultiplied
    134 
    135          Both case #2 and case #3 require the exact same math: no alpha weighting
    136          should be applied or removed. Only case 1 requires extra math operations;
    137          the other two cases can be handled identically.
    138 
    139          stb_image_resize expects case #1 by default, applying alpha weighting to
    140          images, expecting the input images to be unpremultiplied. This is what the
    141          COLOR+ALPHA buffer types tell the resizer to do.
    142 
    143          When you use the pixel layouts STBIR_RGBA, STBIR_BGRA, STBIR_ARGB,
    144          STBIR_ABGR, STBIR_RA, or STBIR_AR, you are telling us that the pixels are
    145          non-premultiplied. In these cases, the resizer will alpha weight the colors
    146          (effectively creating the premultiplied image), do the filtering, and then
    147          convert back to non-premult on exit.
    148 
    149          When you use the pixel layouts STBIR_RGBA_PM, STBIR_BGRA_PM, STBIR_ARGB_PM,
    150          STBIR_ABGR_PM, STBIR_RA_PM or STBIR_AR_PM, you are telling us that the pixels
    151          ARE premultiplied. In this case, the resizer doesn't have to do the
    152          premultiplying - it can filter directly on the input. This is about twice as
    153          fast as the non-premultiplied case, so it's the right option if your data is
    154          already set up correctly.
    155 
    156          When you use the pixel layout STBIR_4CHANNEL or STBIR_2CHANNEL, you are
    157          telling us that there is no channel that represents transparency; it may be
    158          RGB and some unrelated fourth channel that has been stored in the alpha
    159          channel, but it is actually not alpha. No special processing will be
    160          performed.
    161 
    162          The difference between the generic 4 or 2 channel layouts and the
    163          specialized _PM versions is that with the _PM versions you are telling us
    164          the data *is* alpha, just already premultiplied. That's important when
    165          using SRGB pixel formats: we need to know where the alpha is, because
    166          it is converted linearly (rather than with the SRGB converters).
    167 
    168          Because alpha weighting produces the same effect as premultiplying, you
    169          even have the option with non-premultiplied inputs to let the resizer
    170          produce a premultiplied output. Because the initially computed alpha-weighted
    171          output image is effectively premultiplied, this is actually more performant
    172          than the normal path which un-premultiplies the output image as a final step.
    173 
    174          Finally, when converting both in and out of non-premultiplied space (for
    175          example, when using STBIR_RGBA), we go to somewhat heroic measures to
    176          ensure that areas with zero alpha value pixels get something reasonable
    177          in the RGB values. If you don't care about the RGB values of zero alpha
    178          pixels, you can call the stbir_set_non_pm_alpha_speed_over_quality()
    179          function - this runs a premultiplied resize about 25% faster. That said,
    180          when you really care about speed, using premultiplied pixels for both in
    181          and out (STBIR_RGBA_PM, etc) is much faster than either of these
    182          non-premultiplied options.
    183 
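         For example (a sketch - the buffer names and sizes are illustrative only):

            // input is NOT premultiplied: the resizer alpha-weights internally
            stbir_resize_uint8_linear( in, 640, 480, 0, out, 320, 240, 0, STBIR_RGBA );

            // input IS already premultiplied: no weighting work, roughly 2x faster
            stbir_resize_uint8_linear( in, 640, 480, 0, out, 320, 240, 0, STBIR_RGBA_PM );
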
    184       PIXEL LAYOUT CONVERSION
    185          The resizer can convert from some pixel layouts to others. When using the
    186          stbir_set_pixel_layouts(), you can, for example, specify STBIR_RGBA
    187          on input, and STBIR_ARGB on output, and it will re-organize the channels
    188          during the resize. Currently, you can only convert between two pixel
    189          layouts with the same number of channels.
    190 
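         For example, converting RGBA input to ARGB output with the extended API
         (a sketch - the buffers and sizes are illustrative only):

            STBIR_RESIZE r;
            stbir_resize_init( &r, in, 640, 480, 0, out, 320, 240, 0,
                               STBIR_RGBA, STBIR_TYPE_UINT8 );
            stbir_set_pixel_layouts( &r, STBIR_RGBA, STBIR_ARGB );
            stbir_resize_extended( &r );
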
    191       DETERMINISM
    192          We commit to being deterministic (from x64 to ARM to scalar to SIMD, etc).
    193          This requires compiling with fast-math off (using at least /fp:precise).
    194          Also, you must turn off fp-contracting (which turns mult+adds into fmas)!
    195          We attempt to do this with pragmas, but with Clang, you usually want to add
    196          -ffp-contract=off to the command line as well.
    197 
    198          For 32-bit x86, you must use SSE and SSE2 codegen for determinism. That is,
    199          if the scalar x87 unit gets used at all, we immediately lose determinism.
    200          On Microsoft Visual Studio 2008 and earlier, from what we can tell there is
    201          no way to be deterministic in 32-bit x86 (some x87 always leaks in, even
    202          with fp:strict). On 32-bit x86 GCC, determinism requires both -msse2 and
    203          -mfpmath=sse.
    204 
    205          Note that we will not be deterministic with float data containing NaNs -
    206          the NaNs will propagate differently on different SIMD and platforms.
    207 
    208          If you turn on STBIR_USE_FMA, then we will be deterministic with other
    209          fma targets, but we will differ from non-fma targets (this is unavoidable,
    210          because a fma isn't simply an add with a mult - it also introduces a
    211          rounding difference compared to non-fma instruction sequences).
    212 
    213       FLOAT PIXEL FORMAT RANGE
    214          Any range of values can be used for the non-alpha float data that you pass
    215          in (0 to 1, -1 to 1, whatever). However, if you are inputting float values
    216          but *outputting* bytes or shorts, you must use a range of 0 to 1 so that we
    217          scale back properly. The alpha channel must also be 0 to 1 for any format
    218          that does premultiplication prior to resizing.
    219 
    220          Note also that with float output, using filters with negative lobes, the
    221          output filtered values might go slightly out of range. You can define
    222          STBIR_FLOAT_LOW_CLAMP and/or STBIR_FLOAT_HIGH_CLAMP to specify the range
    223          to clamp to on output, if that's important.
    224 
    225       MAX/MIN SCALE FACTORS
    226          The input pixel resolutions are in integers, and we do the internal pointer
    227          resolution in size_t sized integers. However, the scale ratio from input
    228          resolution to output resolution is calculated in float form. This means
    229          the effective possible scale ratio is limited to 24 bits (or 16 million
    230          to 1). As you get close to the size of the float resolution (again, 16
    231          million pixels wide or high), you might start seeing float inaccuracy
    232          issues in general in the pipeline. If you have to do extreme resizes,
    233          you can usually do this in multiple stages (using float intermediate
    234          buffers).
    235 
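         For example, an extreme shrink could be staged through a float intermediate
         buffer (a sketch - the sizes and buffer names are illustrative only):

            // stage 1: shrink to an intermediate float image
            stbir_resize_float_linear( in,  in_w,  in_h,  0, mid, mid_w, mid_h, 0, STBIR_4CHANNEL );
            // stage 2: shrink the intermediate down to the final size
            stbir_resize_float_linear( mid, mid_w, mid_h, 0, out, out_w, out_h, 0, STBIR_4CHANNEL );
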
    236       FLIPPED IMAGES
    237          Stride is just the delta from one scanline to the next. This means you can
    238          use a negative stride to handle inverted images: point to the final
    239          scanline and negate the stride. This works for the input, the output,
    240          or both.
    241 
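         For example, feeding a bottom-up RGBA8 image (a sketch - the names are
         illustrative only):

            int stride = width * 4;   // bytes per scanline
            const unsigned char * last_row = pixels + (size_t)( height - 1 ) * stride;
            stbir_resize_uint8_linear( last_row, width, height, -stride,
                                       out, out_w, out_h, 0, STBIR_RGBA );
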
    242       DEFAULT FILTERS
    243          For functions which don't provide explicit control over what filters to
    244          use, you can change the compile-time defaults with:
    245 
    246             #define STBIR_DEFAULT_FILTER_UPSAMPLE     STBIR_FILTER_something
    247             #define STBIR_DEFAULT_FILTER_DOWNSAMPLE   STBIR_FILTER_something
    248 
    249          See stbir_filter in the header-file section for the list of filters.
    250 
    251       NEW FILTERS
    252          A number of 1D filter kernels are supplied. For a list of supported
    253          filters, see the stbir_filter enum. You can install your own filters by
    254          using the stbir_set_filter_callbacks function.
    255 
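         For example, installing a simple triangle (tent) kernel on both axes with
         the extended API (a sketch - my_tent and my_support are illustrative names):

            static float my_tent( float x, float scale, void * user )   // centered at zero
            {  if ( x < 0 ) x = -x;  return ( x < 1.0f ) ? ( 1.0f - x ) : 0.0f;  }
            static float my_support( float scale, void * user )  { return 1.0f; }

            stbir_set_filter_callbacks( &r, my_tent, my_support, my_tent, my_support );
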
    256       PROGRESS
    257          For interactive use with slow resize operations, you can use the
    258          scanline callbacks in the extended API. It would have to be a *very* large
    259          image resample to need progress though - we're very fast.
    260 
    261       CEIL and FLOOR
    262          In scalar mode, the only functions we use from math.h are ceilf and floorf,
    263          but if you have your own versions, you can define the STBIR_CEILF(v) and
    264          STBIR_FLOORF(v) macros and we'll use them instead. In SIMD, we just use
    265          our own versions.
    266 
    267       ASSERT
    268          Define STBIR_ASSERT(boolval) to override assert() and not use assert.h
    269 
    270      PORTING FROM VERSION 1
    271         The API has changed. You can continue to use the old version of stb_image_resize.h,
    272         which is available in the "deprecated/" directory.
    273 
    274         If you're using the old simple-to-use API, porting is straightforward.
    275         (For more advanced APIs, read the documentation.)
    276 
    277           stbir_resize_uint8():
    278             - call `stbir_resize_uint8_linear`, cast channel count to `stbir_pixel_layout`
    279 
    280           stbir_resize_float():
    281             - call `stbir_resize_float_linear`, cast channel count to `stbir_pixel_layout`
    282 
    283           stbir_resize_uint8_srgb():
    284             - function name is unchanged
    285             - cast channel count to `stbir_pixel_layout`
    286             - above is sufficient unless your image has alpha and it's not RGBA/BGRA
    287               - in that case, follow the below instructions for stbir_resize_uint8_srgb_edgemode
    288 
    289           stbir_resize_uint8_srgb_edgemode()
    290             - switch to the "medium complexity" API
    291             - stbir_resize(), very similar API but a few more parameters:
    292               - pixel_layout: cast channel count to `stbir_pixel_layout`
    293               - data_type:    STBIR_TYPE_UINT8_SRGB
    294               - edge:         unchanged (STBIR_EDGE_WRAP, etc.)
    295               - filter:       STBIR_FILTER_DEFAULT
    296             - which channel is alpha is specified in stbir_pixel_layout, see enum for details
    297 
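        For example, a typical old 4-channel call (a sketch - the names are
        illustrative, and 4 casts to STBIR_RGBA per the enum below):

            stbir_resize_uint8( in, in_w, in_h, 0, out, out_w, out_h, 0, 4 );

        becomes:

            stbir_resize_uint8_linear( in, in_w, in_h, 0, out, out_w, out_h, 0,
                                       (stbir_pixel_layout)4 );   // 4 == STBIR_RGBA
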
    298       FUTURE TODOS
    299         *  For polyphase integral filters, we just memcpy the coeffs to dupe
    300            them, but we should indirect and use the same coeff memory.
    301         *  Add pixel layout conversions for sensible different channel counts
    302            (maybe, 1->3/4, 3->4, 4->1, 3->1).
    303          * For SIMD encode and decode scanline routines, do any pre-aligning
    304            for bad input/output buffer alignments and pitch?
    305          * For very wide scanlines, should we do vertical strips to stay within
    306            L2 cache. Maybe do chunks of 1K pixels at a time. There would be
    307            some pixel reconversion, but probably dwarfed by things falling out
    308            of cache. Probably also something possible with alternating between
    309            scattering and gathering at high resize scales?
    310          * Rewrite the coefficient generator to do many at once.
    311          * AVX-512 vertical kernels - worried about downclocking here.
    312          * Convert the reincludes to macros when we know they aren't changing.
    313          * Experiment with pivoting the horizontal and always using the
    314            vertical filters (which are faster, but perhaps not enough to overcome
    315            the pivot cost and the extra memory touches). Need to buffer the whole
    316            image so have to balance memory use.
    317          * Most of our code is internally function pointers, should we compile
    318            all the SIMD stuff always and dynamically dispatch?
    319 
    320    CONTRIBUTORS
    321       Jeff Roberts: 2.0 implementation, optimizations, SIMD
    322       Martins Mozeiko: NEON simd, WASM simd, clang and GCC whisperer
    323       Fabian Giesen: half float and srgb converters
    324       Sean Barrett: API design, optimizations
    325       Jorge L Rodriguez: Original 1.0 implementation
    326       Aras Pranckevicius: bugfixes
    327       Nathan Reed: warning fixes for 1.0
    328 
    329    REVISIONS
    330       2.12 (2024-10-18) fix incorrect use of user_data with STBIR_FREE
    331       2.11 (2024-09-08) fix harmless asan warnings in 2-channel and 3-channel mode
    332                           with AVX-2, fix some weird scaling edge conditions with
    333                           point sample mode.
    334       2.10 (2024-07-27) fix the defines GCC and mingw for loop unroll control,
    335                           fix MSVC 32-bit arm half float routines.
    336       2.09 (2024-06-19) fix the defines for 32-bit ARM GCC builds (was selecting
    337                           hardware half floats).
    338       2.08 (2024-06-10) fix for RGB->BGR three channel flips and add SIMD (thanks
    339                           to Ryan Salsbury), fix for sub-rect resizes, use the
    340                           pragmas to control unrolling when they are available.
    341       2.07 (2024-05-24) fix for slow final split during threaded conversions of very 
    342                           wide scanlines when downsampling (caused by extra input 
    343                           converting), fix for wide scanline resamples with many 
    344                           splits (int overflow), fix GCC warning.
    345       2.06 (2024-02-10) fix for identical width/height 3x or more down-scaling 
    346                           undersampling a single row on rare resize ratios (about 1%).
    347       2.05 (2024-02-07) fix for 2 pixel to 1 pixel resizes with wrap (thanks Aras),
    348                         fix for output callback (thanks Julien Koenen).
    349       2.04 (2023-11-17) fix for rare AVX bug, shadowed symbol (thanks Nikola Smiljanic).
    350       2.03 (2023-11-01) ASAN and TSAN warnings fixed, minor tweaks.
    351       2.00 (2023-10-10) mostly new source: new api, optimizations, simd, vertical-first, etc
    352                           2x-5x faster without simd, 4x-12x faster with simd,
    353                           in some cases, 20x to 40x faster esp resizing large to very small.
    354       0.96 (2019-03-04) fixed warnings
    355       0.95 (2017-07-23) fixed warnings
    356       0.94 (2017-03-18) fixed warnings
    357       0.93 (2017-03-03) fixed bug with certain combinations of heights
    358       0.92 (2017-01-02) fix integer overflow on large (>2GB) images
    359       0.91 (2016-04-02) fix warnings; fix handling of subpixel regions
    360       0.90 (2014-09-17) first released version
    361 
    362    LICENSE
    363      See end of file for license information.
    364 */
    365 
    366 #if !defined(STB_IMAGE_RESIZE_DO_HORIZONTALS) && !defined(STB_IMAGE_RESIZE_DO_VERTICALS) && !defined(STB_IMAGE_RESIZE_DO_CODERS)   // for internal re-includes
    367 
    368 #ifndef STBIR_INCLUDE_STB_IMAGE_RESIZE2_H
    369 #define STBIR_INCLUDE_STB_IMAGE_RESIZE2_H
    370 
    371 #include <stddef.h>
    372 #ifdef _MSC_VER
    373 typedef unsigned char    stbir_uint8;
    374 typedef unsigned short   stbir_uint16;
    375 typedef unsigned int     stbir_uint32;
    376 typedef unsigned __int64 stbir_uint64;
    377 #else
    378 #include <stdint.h>
    379 typedef uint8_t  stbir_uint8;
    380 typedef uint16_t stbir_uint16;
    381 typedef uint32_t stbir_uint32;
    382 typedef uint64_t stbir_uint64;
    383 #endif
    384 
    385 #ifdef _M_IX86_FP
    386 #if ( _M_IX86_FP >= 1 )
    387 #ifndef STBIR_SSE
    388 #define STBIR_SSE
    389 #endif
    390 #endif
    391 #endif
    392 
    393 #if defined(_x86_64) || defined( __x86_64__ ) || defined( _M_X64 ) || defined(__x86_64) || defined(_M_AMD64) || defined(__SSE2__) || defined(STBIR_SSE) || defined(STBIR_SSE2)
    394   #ifndef STBIR_SSE2
    395     #define STBIR_SSE2
    396   #endif
    397   #if defined(__AVX__) || defined(STBIR_AVX2)
    398     #ifndef STBIR_AVX
    399       #ifndef STBIR_NO_AVX
    400         #define STBIR_AVX
    401       #endif
    402     #endif
    403   #endif
    404   #if defined(__AVX2__) || defined(STBIR_AVX2)
    405     #ifndef STBIR_NO_AVX2
    406       #ifndef STBIR_AVX2
    407         #define STBIR_AVX2
    408       #endif
    409       #if defined( _MSC_VER ) && !defined(__clang__)
    410         #ifndef STBIR_FP16C  // FP16C instructions are on all AVX2 cpus, so we can autoselect it here on microsoft - clang needs -mf16c
    411           #define STBIR_FP16C
    412         #endif
    413       #endif
    414     #endif
    415   #endif
    416   #ifdef __F16C__
    417     #ifndef STBIR_FP16C  // turn on FP16C instructions if the define is set (for clang and gcc)
    418       #define STBIR_FP16C
    419     #endif
    420   #endif
    421 #endif
    422 
    423 #if defined( _M_ARM64 ) || defined( __aarch64__ ) || defined( __arm64__ ) || ((__ARM_NEON_FP & 4) != 0) || defined(__ARM_NEON__)
    424 #ifndef STBIR_NEON
    425 #define STBIR_NEON
    426 #endif
    427 #endif
    428 
    429 #if defined(_M_ARM) || defined(__arm__)
    430 #ifdef STBIR_USE_FMA
    431 #undef STBIR_USE_FMA // no FMA for 32-bit arm on MSVC
    432 #endif
    433 #endif
    434 
    435 #if defined(__wasm__) && defined(__wasm_simd128__)
    436 #ifndef STBIR_WASM
    437 #define STBIR_WASM
    438 #endif
    439 #endif
    440 
    441 #ifndef STBIRDEF
    442 #ifdef STB_IMAGE_RESIZE_STATIC
    443 #define STBIRDEF static
    444 #else
    445 #ifdef __cplusplus
    446 #define STBIRDEF extern "C"
    447 #else
    448 #define STBIRDEF extern
    449 #endif
    450 #endif
    451 #endif
    452 
    453 //////////////////////////////////////////////////////////////////////////////
    454 ////   start "header file" ///////////////////////////////////////////////////
    455 //
    456 // Easy-to-use API:
    457 //
    458 //     * stride is the offset between successive rows of image data
    459 //        in memory, in bytes. specify 0 for packed continuously in memory
    460 //     * colorspace is linear or sRGB as specified by function name
    461 //     * Uses the default filters
    462 //     * Uses edge mode clamped
    463 //     * returned result is 1 for success or 0 in case of an error.
    464 
    465 
    466 // stbir_pixel_layout specifies:
    467 //   number of channels
    468 //   order of channels
    469 //   whether color is premultiplied by alpha
    470 // for back compatibility, you can cast the old channel count to an stbir_pixel_layout
    471 typedef enum
    472 {
    473   STBIR_1CHANNEL = 1,
    474   STBIR_2CHANNEL = 2,
    475   STBIR_RGB      = 3,               // 3-chan, with order specified (for channel flipping)
    476   STBIR_BGR      = 0,               // 3-chan, with order specified (for channel flipping)
    477   STBIR_4CHANNEL = 5,
    478 
    479   STBIR_RGBA = 4,                   // alpha formats, where alpha is NOT premultiplied into color channels
    480   STBIR_BGRA = 6,
    481   STBIR_ARGB = 7,
    482   STBIR_ABGR = 8,
    483   STBIR_RA   = 9,
    484   STBIR_AR   = 10,
    485 
    486   STBIR_RGBA_PM = 11,               // alpha formats, where alpha is premultiplied into color channels
    487   STBIR_BGRA_PM = 12,
    488   STBIR_ARGB_PM = 13,
    489   STBIR_ABGR_PM = 14,
    490   STBIR_RA_PM   = 15,
    491   STBIR_AR_PM   = 16,
    492 
    493   STBIR_RGBA_NO_AW = 11,            // alpha formats, where NO alpha weighting is applied at all!
    494   STBIR_BGRA_NO_AW = 12,            //   these are just synonyms for the _PM flags (which also do
    495   STBIR_ARGB_NO_AW = 13,            //   no alpha weighting). These names just make it more clear
    496   STBIR_ABGR_NO_AW = 14,            //   for some folks.
    497   STBIR_RA_NO_AW   = 15,
    498   STBIR_AR_NO_AW   = 16,
    499 
    500 } stbir_pixel_layout;
    501 
    502 //===============================================================
    503 //  Simple-complexity API
    504 //
    505 //    If output_pixels is NULL (0), then we will allocate the buffer and return it to you.
    506 //--------------------------------
    507 
    508 STBIRDEF unsigned char * stbir_resize_uint8_srgb( const unsigned char *input_pixels , int input_w , int input_h, int input_stride_in_bytes,
    509                                                         unsigned char *output_pixels, int output_w, int output_h, int output_stride_in_bytes,
    510                                                         stbir_pixel_layout pixel_type );
    511 
    512 STBIRDEF unsigned char * stbir_resize_uint8_linear( const unsigned char *input_pixels , int input_w , int input_h, int input_stride_in_bytes,
    513                                                           unsigned char *output_pixels, int output_w, int output_h, int output_stride_in_bytes,
    514                                                           stbir_pixel_layout pixel_type );
    515 
    516 STBIRDEF float * stbir_resize_float_linear( const float *input_pixels , int input_w , int input_h, int input_stride_in_bytes,
    517                                                   float *output_pixels, int output_w, int output_h, int output_stride_in_bytes,
    518                                                   stbir_pixel_layout pixel_type );
    519 //===============================================================
    520 
    521 //===============================================================
    522 // Medium-complexity API
    523 //
    524 // This extends the easy-to-use API as follows:
    525 //
    526 //     * Can specify the datatype - U8, U8_SRGB, U16, FLOAT, HALF_FLOAT
    527 //     * Edge wrap can be selected explicitly
    528 //     * Filter can be selected explicitly
    529 //--------------------------------
    530 
    531 typedef enum
    532 {
    533   STBIR_EDGE_CLAMP   = 0,
    534   STBIR_EDGE_REFLECT = 1,
    535   STBIR_EDGE_WRAP    = 2,  // this edge mode is slower and uses more memory
    536   STBIR_EDGE_ZERO    = 3,
    537 } stbir_edge;
    538 
    539 typedef enum
    540 {
    541   STBIR_FILTER_DEFAULT      = 0,  // use same filter type that easy-to-use API chooses
    542   STBIR_FILTER_BOX          = 1,  // A trapezoid w/1-pixel wide ramps, same result as box for integer scale ratios
    543   STBIR_FILTER_TRIANGLE     = 2,  // On upsampling, produces same results as bilinear texture filtering
    544   STBIR_FILTER_CUBICBSPLINE = 3,  // The cubic b-spline (aka Mitchell-Netravali with B=1,C=0), gaussian-esque
    545   STBIR_FILTER_CATMULLROM   = 4,  // An interpolating cubic spline
    546   STBIR_FILTER_MITCHELL     = 5,  // Mitchell-Netravali filter with B=1/3, C=1/3
    547   STBIR_FILTER_POINT_SAMPLE = 6,  // Simple point sampling
    548   STBIR_FILTER_OTHER        = 7,  // User callback specified
    549 } stbir_filter;
    550 
    551 typedef enum
    552 {
    553   STBIR_TYPE_UINT8            = 0,
    554   STBIR_TYPE_UINT8_SRGB       = 1,
    555   STBIR_TYPE_UINT8_SRGB_ALPHA = 2,  // alpha channel, when present, should also be SRGB (this is very unusual)
    556   STBIR_TYPE_UINT16           = 3,
    557   STBIR_TYPE_FLOAT            = 4,
    558   STBIR_TYPE_HALF_FLOAT       = 5
    559 } stbir_datatype;
    560 
    561 // medium api
    562 STBIRDEF void *  stbir_resize( const void *input_pixels , int input_w , int input_h, int input_stride_in_bytes,
    563                                      void *output_pixels, int output_w, int output_h, int output_stride_in_bytes,
    564                                stbir_pixel_layout pixel_layout, stbir_datatype data_type,
    565                                stbir_edge edge, stbir_filter filter );
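
// For example, a medium-complexity call (a sketch - the buffer names and sizes are
//   illustrative only):
//
//    stbir_resize( in,  640, 480, 0,
//                  out, 320, 240, 0,
//                  STBIR_RGBA, STBIR_TYPE_UINT8_SRGB,
//                  STBIR_EDGE_WRAP, STBIR_FILTER_MITCHELL );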
    566 //===============================================================
    567 
    568 
    569 
    570 //===============================================================
    571 // Extended-complexity API
    572 //
    573 // This API exposes all resize functionality.
    574 //
    575 //     * Separate filter types for each axis
    576 //     * Separate edge modes for each axis
    577 //     * Separate input and output data types
    578 //     * Can specify regions with subpixel correctness
    579 //     * Can specify alpha flags
    580 //     * Can specify a memory callback
    581 //     * Can specify a callback data type for pixel input and output
    582 //     * Can be threaded for a single resize
    583 //     * Can be used to resize many frames without recalculating the sampler info
    584 //
    585 //  Use this API as follows:
    586 //     1) Call the stbir_resize_init function on a local STBIR_RESIZE structure
    587 //     2) Call any of the stbir_set functions
    588 //     3) Optionally call stbir_build_samplers() if you are going to resample multiple times
    589 //        with the same input and output dimensions (like resizing video frames)
    590 //     4) Resample by calling stbir_resize_extended().
    591 //     5) Call stbir_free_samplers() if you called stbir_build_samplers()
    592 //--------------------------------
    593 
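// For example, resizing many same-sized frames while building the samplers once
//   (a sketch - the frames/outputs arrays, num_frames and the sizes are illustrative only):
//
//      STBIR_RESIZE r;
//      stbir_resize_init( &r, frames[0], 1920, 1080, 0, outputs[0], 1280, 720, 0,
//                         STBIR_RGBA, STBIR_TYPE_UINT8 );
//      stbir_set_edgemodes( &r, STBIR_EDGE_CLAMP, STBIR_EDGE_CLAMP );
//      stbir_build_samplers( &r );                    // optional: lets us reuse the samplers
//      for ( int f = 0 ; f < num_frames ; f++ )
//      {
//        stbir_set_buffer_ptrs( &r, frames[f], 0, outputs[f], 0 );
//        stbir_resize_extended( &r );
//      }
//      stbir_free_samplers( &r );                     // required because we called build_samplers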
    594 
    595 // Types:
    596 
    597 // INPUT CALLBACK: this callback is used for input scanlines
    598 typedef void const * stbir_input_callback( void * optional_output, void const * input_ptr, int num_pixels, int x, int y, void * context );
    599 
    600 // OUTPUT CALLBACK: this callback is used for output scanlines
    601 typedef void stbir_output_callback( void const * output_ptr, int num_pixels, int y, void * context );
    602 
    603 // callbacks for user installed filters
    604 typedef float stbir__kernel_callback( float x, float scale, void * user_data ); // centered at zero
    605 typedef float stbir__support_callback( float scale, void * user_data );
    606 
    607 // internal structure with precomputed scaling
    608 typedef struct stbir__info stbir__info;
    609 
    610 typedef struct STBIR_RESIZE  // use the stbir_resize_init and stbir_set_* functions to set these values for future compatibility
    611 {
    612   void * user_data;
    613   void const * input_pixels;
    614   int input_w, input_h;
    615   double input_s0, input_t0, input_s1, input_t1;
    616   stbir_input_callback * input_cb;
    617   void * output_pixels;
    618   int output_w, output_h;
    619   int output_subx, output_suby, output_subw, output_subh;
    620   stbir_output_callback * output_cb;
    621   int input_stride_in_bytes;
    622   int output_stride_in_bytes;
    623   int splits;
    624   int fast_alpha;
    625   int needs_rebuild;
    626   int called_alloc;
    627   stbir_pixel_layout input_pixel_layout_public;
    628   stbir_pixel_layout output_pixel_layout_public;
    629   stbir_datatype input_data_type;
    630   stbir_datatype output_data_type;
    631   stbir_filter horizontal_filter, vertical_filter;
    632   stbir_edge horizontal_edge, vertical_edge;
    633   stbir__kernel_callback * horizontal_filter_kernel; stbir__support_callback * horizontal_filter_support;
    634   stbir__kernel_callback * vertical_filter_kernel; stbir__support_callback * vertical_filter_support;
    635   stbir__info * samplers;
    636 } STBIR_RESIZE;
    637 
    638 // extended complexity api
    639 
    640 
    641 // First off, you must ALWAYS call stbir_resize_init on your resize structure before any of the other calls!
    642 STBIRDEF void stbir_resize_init( STBIR_RESIZE * resize,
    643                                  const void *input_pixels,  int input_w,  int input_h, int input_stride_in_bytes, // stride can be zero
    644                                        void *output_pixels, int output_w, int output_h, int output_stride_in_bytes, // stride can be zero
    645                                  stbir_pixel_layout pixel_layout, stbir_datatype data_type );
    646 
    647 //===============================================================
    648 // You can update these parameters any time after resize_init and there is no cost
    649 //--------------------------------
    650 
    651 STBIRDEF void stbir_set_datatypes( STBIR_RESIZE * resize, stbir_datatype input_type, stbir_datatype output_type );
    652 STBIRDEF void stbir_set_pixel_callbacks( STBIR_RESIZE * resize, stbir_input_callback * input_cb, stbir_output_callback * output_cb );   // no callbacks by default
    653 STBIRDEF void stbir_set_user_data( STBIR_RESIZE * resize, void * user_data );                                               // pass back STBIR_RESIZE* by default
    654 STBIRDEF void stbir_set_buffer_ptrs( STBIR_RESIZE * resize, const void * input_pixels, int input_stride_in_bytes, void * output_pixels, int output_stride_in_bytes );
    655 
    656 //===============================================================
    657 
    658 
    659 //===============================================================
    660 // If you call any of these functions, you will trigger a sampler rebuild!
    661 //--------------------------------
    662 
    663 STBIRDEF int stbir_set_pixel_layouts( STBIR_RESIZE * resize, stbir_pixel_layout input_pixel_layout, stbir_pixel_layout output_pixel_layout );  // sets new buffer layouts
    664 STBIRDEF int stbir_set_edgemodes( STBIR_RESIZE * resize, stbir_edge horizontal_edge, stbir_edge vertical_edge );       // CLAMP by default
    665 
    666 STBIRDEF int stbir_set_filters( STBIR_RESIZE * resize, stbir_filter horizontal_filter, stbir_filter vertical_filter ); // STBIR_DEFAULT_FILTER_UPSAMPLE/DOWNSAMPLE by default
    667 STBIRDEF int stbir_set_filter_callbacks( STBIR_RESIZE * resize, stbir__kernel_callback * horizontal_filter, stbir__support_callback * horizontal_support, stbir__kernel_callback * vertical_filter, stbir__support_callback * vertical_support );
    668 
    669 STBIRDEF int stbir_set_pixel_subrect( STBIR_RESIZE * resize, int subx, int suby, int subw, int subh );        // sets both sub-regions (full regions by default)
    670 STBIRDEF int stbir_set_input_subrect( STBIR_RESIZE * resize, double s0, double t0, double s1, double t1 );    // sets input sub-region (full region by default)
    671 STBIRDEF int stbir_set_output_pixel_subrect( STBIR_RESIZE * resize, int subx, int suby, int subw, int subh ); // sets output sub-region (full region by default)
    672 
    673 // when inputting AND outputting non-premultiplied alpha pixels, we use a slower but higher quality technique
    674 //   that fills the zero alpha pixel's RGB values with something plausible.  If you don't care about areas of
    675 //   zero alpha, you can call this function to get about a 25% speed improvement for STBIR_RGBA to STBIR_RGBA
    676 //   types of resizes.
    677 STBIRDEF int stbir_set_non_pm_alpha_speed_over_quality( STBIR_RESIZE * resize, int non_pma_alpha_speed_over_quality );
    678 //===============================================================
    679 
    680 
    681 //===============================================================
    682 // You can call build_samplers to prebuild all the internal data we need to resample.
    683 //   Then, if you call resize_extended many times with the same resize, you only pay the
    684 //   cost once.
    685 // If you do call build_samplers, you MUST call free_samplers eventually.
    686 //--------------------------------
    687 
    688 // This builds the samplers and does one allocation
    689 STBIRDEF int stbir_build_samplers( STBIR_RESIZE * resize );
    690 
    691 // You MUST call this, if you call stbir_build_samplers or stbir_build_samplers_with_splits
    692 STBIRDEF void stbir_free_samplers( STBIR_RESIZE * resize );
    693 //===============================================================
    694 
    695 
    696 // And this is the main function to perform the resize synchronously on one thread.
    697 STBIRDEF int stbir_resize_extended( STBIR_RESIZE * resize );
    698 
    699 
    700 //===============================================================
    701 // Use these functions for multithreading.
    702 //   1) You call stbir_build_samplers_with_splits first on the main thread
    703 //   2) Then stbir_resize_extended_split on each thread
    704 //   3) stbir_free_samplers when done on the main thread
    705 //--------------------------------
    706 
    707 // This will build samplers for threading.
    708 //   You can pass in the number of threads you'd like to use (try_splits).
    709 //   It returns the number of splits (threads) that you can call it with.
    710 //   It might be less if the image resize can't be split up that many ways.
    711 
    712 STBIRDEF int stbir_build_samplers_with_splits( STBIR_RESIZE * resize, int try_splits );
    713 
    714 // This function does a split of the resizing (you call this function for each
    715 // split, on multiple threads). A split is a piece of the output resize pixel space.
    716 
    717 // Note that you MUST call stbir_build_samplers_with_splits before stbir_resize_extended_split!
    718 
    719 // Usually, you will always call stbir_resize_extended_split with split_start as the thread_index
    720 //   and "1" for the split_count.
    721 // But, if you have a weird situation where you MIGHT want 8 threads, but sometimes
    722 //   only 4 threads, you can use 0,2,4,6 for the split_start's and use "2" for the
    723 //   split_count each time to turn it into a 4 thread resize. (This is unusual).
    724 
    725 STBIRDEF int stbir_resize_extended_split( STBIR_RESIZE * resize, int split_start, int split_count );
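
// For example (a sketch - launch_worker and join_workers are placeholder thread
//   helpers, not part of this library; each worker i calls
//   stbir_resize_extended_split( &r, i, 1 )):
//
//      int splits = stbir_build_samplers_with_splits( &r, 8 );   // may return fewer than 8
//      for ( int i = 0 ; i < splits ; i++ )
//        launch_worker( resize_one_split, &r, i );
//      join_workers();
//      stbir_free_samplers( &r );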
    726 //===============================================================
    727 
    728 
    729 //===============================================================
    730 // Pixel Callbacks info:
    731 //--------------------------------
    732 
    733 //   The input callback is super flexible - it calls you with the input address
    734 //   (based on the stride and base pointer), it gives you an optional_output
    735 //   pointer that you can fill, or you can just return your own pointer into
    736 //   your own data.
    737 //
    738 //   You can also do conversion from non-supported data types if necessary - in
    739 //   this case, you ignore the input_ptr and just use the x and y parameters to
    740 //   calculate your own input_ptr based on the size of each non-supported pixel.
    741 //   (Something like the third example below.)
    742 //
    743 //   You can also install just an input or just an output callback by setting the
    744 //   callback that you don't want to zero.
    745 //
    746 //     First example, progress: (a callback from which you can monitor progress):
    747 //        void const * my_callback( void * optional_output, void const * input_ptr, int num_pixels, int x, int y, void * context )
    748 //        {
    749 //           percentage_done = y / input_height;
    750 //           return input_ptr;  // use buffer from call
    751 //        }
    752 //
    753 //     Next example, copying: (copy from some other buffer or stream):
    754 //        void const * my_callback( void * optional_output, void const * input_ptr, int num_pixels, int x, int y, void * context )
    755 //        {
    756 //           CopyOrStreamData( optional_output, other_data_src, num_pixels * pixel_width_in_bytes );
    757 //           return optional_output;  // return the optional buffer that we filled
    758 //        }
    759 //
    760 //     Third example, input another buffer without copying: (zero-copy from other buffer):
    761 //        void const * my_callback( void * optional_output, void const * input_ptr, int num_pixels, int x, int y, void * context )
    762 //        {
    763 //           void * pixels = ( (char*) other_image_base ) + ( y * other_image_stride ) + ( x * other_pixel_width_in_bytes );
    764 //           return pixels;       // return pointer to your data without copying
    765 //        }
    766 //
    767 //
    768 //   The output callback is considerably simpler - it just calls you so that you can dump
    769 //   out each scanline. You could even directly copy out to disk if you have a simple format
    770 //   like TGA or BMP. You can also convert to other output types here if you want.
    771 //
    772 //   Simple example:
    773 //        void my_output( void const * output_ptr, int num_pixels, int y, void * context )
    774 //        {
    775 //           percentage_done = y / output_height;
    776 //           fwrite( output_ptr, pixel_width_in_bytes, num_pixels, output_file );
    777 //        }
    778 //===============================================================
    779 
    780 
    781 
    782 
    783 //===============================================================
    784 // optional built-in profiling API
    785 //--------------------------------
    786 
    787 #ifdef STBIR_PROFILE
    788 
    789 typedef struct STBIR_PROFILE_INFO
    790 {
    791   stbir_uint64 total_clocks;
    792 
    793   // how many clocks spent (of total_clocks) in the various resize routines, along with a string description
    794   //    there are "resize_count" number of zones
    795   stbir_uint64 clocks[ 8 ];
    796   char const ** descriptions;
    797 
    798   // count of clocks and descriptions
    799   stbir_uint32 count;
    800 } STBIR_PROFILE_INFO;
    801 
    802 // use after calling stbir_resize_extended (or stbir_build_samplers or stbir_build_samplers_with_splits)
    803 STBIRDEF void stbir_resize_build_profile_info( STBIR_PROFILE_INFO * out_info, STBIR_RESIZE const * resize );
    804 
    805 // use after calling stbir_resize_extended
    806 STBIRDEF void stbir_resize_extended_profile_info( STBIR_PROFILE_INFO * out_info, STBIR_RESIZE const * resize );
    807 
    808 // use after calling stbir_resize_extended_split
    809 STBIRDEF void stbir_resize_split_profile_info( STBIR_PROFILE_INFO * out_info, STBIR_RESIZE const * resize, int split_start, int split_num );
    810 
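// For example (a sketch, assuming STBIR_PROFILE was defined when compiling the
//   implementation and a resize structure r was already set up):
//
//      STBIR_PROFILE_INFO info;
//      stbir_resize_extended( &r );
//      stbir_resize_extended_profile_info( &info, &r );
//      for ( stbir_uint32 i = 0 ; i < info.count ; i++ )
//        printf( "%s: %llu clocks\n", info.descriptions[i], (unsigned long long) info.clocks[i] );
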
    811 //===============================================================
    812 
    813 #endif
    814 
    815 
    816 ////   end header file   /////////////////////////////////////////////////////
    817 #endif // STBIR_INCLUDE_STB_IMAGE_RESIZE2_H
    818 
    819 #if defined(STB_IMAGE_RESIZE_IMPLEMENTATION) || defined(STB_IMAGE_RESIZE2_IMPLEMENTATION)
    820 
    821 #ifndef STBIR_ASSERT
    822 #include <assert.h>
    823 #define STBIR_ASSERT(x) assert(x)
    824 #endif
    825 
    826 #ifndef STBIR_MALLOC
    827 #include <stdlib.h>
    828 #define STBIR_MALLOC(size,user_data) ((void)(user_data), malloc(size))
    829 #define STBIR_FREE(ptr,user_data)    ((void)(user_data), free(ptr))
    830 // (we used the comma operator to evaluate user_data, to avoid "unused parameter" warnings)
    831 #endif
    832 
    833 #ifdef _MSC_VER
    834 
    835 #define stbir__inline __forceinline
    836 
    837 #else
    838 
    839 #define stbir__inline __inline__
    840 
    841 // Clang address sanitizer
    842 #if defined(__has_feature)
    843   #if __has_feature(address_sanitizer) || __has_feature(memory_sanitizer)
    844     #ifndef STBIR__SEPARATE_ALLOCATIONS
    845       #define STBIR__SEPARATE_ALLOCATIONS
    846     #endif
    847   #endif
    848 #endif
    849 
    850 #endif
    851 
    852 // GCC and MSVC
    853 #if defined(__SANITIZE_ADDRESS__)
    854   #ifndef STBIR__SEPARATE_ALLOCATIONS
    855     #define STBIR__SEPARATE_ALLOCATIONS
    856   #endif
    857 #endif
    858 
    859 // Always turn off automatic FMA use - use STBIR_USE_FMA if you want.
    860 // Otherwise, this is a determinism disaster.
    861 #ifndef STBIR_DONT_CHANGE_FP_CONTRACT  // override in case you don't want this behavior
    862 #if defined(_MSC_VER) && !defined(__clang__)
    863 #if _MSC_VER > 1200
    864 #pragma fp_contract(off)
    865 #endif
    866 #elif defined(__GNUC__) &&  !defined(__clang__)
    867 #pragma GCC optimize("fp-contract=off")
    868 #else
    869 #pragma STDC FP_CONTRACT OFF
    870 #endif
    871 #endif
    872 
    873 #ifdef _MSC_VER
    874 #define STBIR__UNUSED(v)  (void)(v)
    875 #else
    876 #define STBIR__UNUSED(v)  (void)sizeof(v)
    877 #endif
    878 
    879 #define STBIR__ARRAY_SIZE(a) (sizeof((a))/sizeof((a)[0]))
    880 
    881 
    882 #ifndef STBIR_DEFAULT_FILTER_UPSAMPLE
    883 #define STBIR_DEFAULT_FILTER_UPSAMPLE    STBIR_FILTER_CATMULLROM
    884 #endif
    885 
    886 #ifndef STBIR_DEFAULT_FILTER_DOWNSAMPLE
    887 #define STBIR_DEFAULT_FILTER_DOWNSAMPLE  STBIR_FILTER_MITCHELL
    888 #endif
    889 
    890 
    891 #ifndef STBIR__HEADER_FILENAME
    892 #define STBIR__HEADER_FILENAME "stb_image_resize2.h"
    893 #endif
    894 
    895 // the internal pixel layout enums are in a different order, so we can easily do range comparisons of types
    896 //   the public pixel layout is ordered in a way that if you cast num_channels (1-4) to the enum, you get something sensible
    897 typedef enum
    898 {
    899   STBIRI_1CHANNEL = 0,
    900   STBIRI_2CHANNEL = 1,
    901   STBIRI_RGB      = 2,
    902   STBIRI_BGR      = 3,
    903   STBIRI_4CHANNEL = 4,
    904 
    905   STBIRI_RGBA = 5,
    906   STBIRI_BGRA = 6,
    907   STBIRI_ARGB = 7,
    908   STBIRI_ABGR = 8,
    909   STBIRI_RA   = 9,
    910   STBIRI_AR   = 10,
    911 
    912   STBIRI_RGBA_PM = 11,
    913   STBIRI_BGRA_PM = 12,
    914   STBIRI_ARGB_PM = 13,
    915   STBIRI_ABGR_PM = 14,
    916   STBIRI_RA_PM   = 15,
    917   STBIRI_AR_PM   = 16,
    918 } stbir_internal_pixel_layout;
    919 
    920 // define the public pixel layouts to not compile inside the implementation (to avoid accidental use)
    921 #define STBIR_BGR bad_dont_use_in_implementation
    922 #define STBIR_1CHANNEL STBIR_BGR
    923 #define STBIR_2CHANNEL STBIR_BGR
    924 #define STBIR_RGB STBIR_BGR
    925 #define STBIR_RGBA STBIR_BGR
    926 #define STBIR_4CHANNEL STBIR_BGR
    927 #define STBIR_BGRA STBIR_BGR
    928 #define STBIR_ARGB STBIR_BGR
    929 #define STBIR_ABGR STBIR_BGR
    930 #define STBIR_RA STBIR_BGR
    931 #define STBIR_AR STBIR_BGR
    932 #define STBIR_RGBA_PM STBIR_BGR
    933 #define STBIR_BGRA_PM STBIR_BGR
    934 #define STBIR_ARGB_PM STBIR_BGR
    935 #define STBIR_ABGR_PM STBIR_BGR
    936 #define STBIR_RA_PM STBIR_BGR
    937 #define STBIR_AR_PM STBIR_BGR
    938 
    939 // must match stbir_datatype
    940 static unsigned char stbir__type_size[] = {
    941   1,1,1,2,4,2 // STBIR_TYPE_UINT8,STBIR_TYPE_UINT8_SRGB,STBIR_TYPE_UINT8_SRGB_ALPHA,STBIR_TYPE_UINT16,STBIR_TYPE_FLOAT,STBIR_TYPE_HALF_FLOAT
    942 };
    943 
    944 // When gathering, the contributors are which source pixels contribute.
    945 // When scattering, the contributors are which destination pixels are contributed to.
    946 typedef struct
    947 {
    948   int n0; // First contributing pixel
    949   int n1; // Last contributing pixel
    950 } stbir__contributors;
    951 
    952 typedef struct
    953 {
    954   int lowest;    // First sample index for whole filter
    955   int highest;   // Last sample index for whole filter
    956   int widest;    // widest single set of samples for an output
    957 } stbir__filter_extent_info;
    958 
    959 typedef struct
    960 {
    961   int n0; // First pixel of decode buffer to write to
    962   int n1; // Last pixel of decode that will be written to
    963   int pixel_offset_for_input;  // Pixel offset into input_scanline
    964 } stbir__span;
    965 
    966 typedef struct stbir__scale_info
    967 {
    968   int input_full_size;
    969   int output_sub_size;
    970   float scale;
    971   float inv_scale;
    972   float pixel_shift; // starting shift in output pixel space (in pixels)
    973   int scale_is_rational;
    974   stbir_uint32 scale_numerator, scale_denominator;
    975 } stbir__scale_info;
    976 
    977 typedef struct
    978 {
    979   stbir__contributors * contributors;
    980   float* coefficients;
    981   stbir__contributors * gather_prescatter_contributors;
    982   float * gather_prescatter_coefficients;
    983   stbir__scale_info scale_info;
    984   float support;
    985   stbir_filter filter_enum;
    986   stbir__kernel_callback * filter_kernel;
    987   stbir__support_callback * filter_support;
    988   stbir_edge edge;
    989   int coefficient_width;
    990   int filter_pixel_width;
    991   int filter_pixel_margin;
    992   int num_contributors;
    993   int contributors_size;
    994   int coefficients_size;
    995   stbir__filter_extent_info extent_info;
    996   int is_gather;  // 0 = scatter, 1 = gather with scale >= 1, 2 = gather with scale < 1
    997   int gather_prescatter_num_contributors;
    998   int gather_prescatter_coefficient_width;
    999   int gather_prescatter_contributors_size;
   1000   int gather_prescatter_coefficients_size;
   1001 } stbir__sampler;
   1002 
   1003 typedef struct
   1004 {
   1005   stbir__contributors conservative;
   1006   int edge_sizes[2];    // this can be less than filter_pixel_margin, if the filter and scaling falls off
   1007   stbir__span spans[2]; // can be two spans, if doing input subrect with clamp mode WRAP
   1008 } stbir__extents;
   1009 
   1010 typedef struct
   1011 {
   1012 #ifdef STBIR_PROFILE
   1013   union
   1014   {
   1015     struct { stbir_uint64 total, looping, vertical, horizontal, decode, encode, alpha, unalpha; } named;
   1016     stbir_uint64 array[8];
   1017   } profile;
   1018   stbir_uint64 * current_zone_excluded_ptr;
   1019 #endif
   1020   float* decode_buffer;
   1021 
   1022   int ring_buffer_first_scanline;
   1023   int ring_buffer_last_scanline;
   1024   int ring_buffer_begin_index;    // first_scanline is at this index in the ring buffer
   1025   int start_output_y, end_output_y;
   1026   int start_input_y, end_input_y;  // used in scatter only
   1027 
   1028   #ifdef STBIR__SEPARATE_ALLOCATIONS
   1029     float** ring_buffers; // one pointer for each ring buffer
   1030   #else
   1031     float* ring_buffer;  // one big buffer that we index into
   1032   #endif
   1033 
   1034   float* vertical_buffer;
   1035 
   1036   char no_cache_straddle[64];
   1037 } stbir__per_split_info;
   1038 
   1039 typedef void stbir__decode_pixels_func( float * decode, int width_times_channels, void const * input );
   1040 typedef void stbir__alpha_weight_func( float * decode_buffer, int width_times_channels );
   1041 typedef void stbir__horizontal_gather_channels_func( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer,
   1042   stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width );
   1043 typedef void stbir__alpha_unweight_func(float * encode_buffer, int width_times_channels );
   1044 typedef void stbir__encode_pixels_func( void * output, int width_times_channels, float const * encode );
   1045 
   1046 struct stbir__info
   1047 {
   1048 #ifdef STBIR_PROFILE
   1049   union
   1050   {
   1051     struct { stbir_uint64 total, build, alloc, horizontal, vertical, cleanup, pivot; } named;
   1052     stbir_uint64 array[7];
   1053   } profile;
   1054   stbir_uint64 * current_zone_excluded_ptr;
   1055 #endif
   1056   stbir__sampler horizontal;
   1057   stbir__sampler vertical;
   1058 
   1059   void const * input_data;
   1060   void * output_data;
   1061 
   1062   int input_stride_bytes;
   1063   int output_stride_bytes;
   1064   int ring_buffer_length_bytes;   // The length of an individual entry in the ring buffer. The total number of ring buffers is stbir__get_filter_pixel_width(filter)
   1065   int ring_buffer_num_entries;    // Total number of entries in the ring buffer.
   1066 
   1067   stbir_datatype input_type;
   1068   stbir_datatype output_type;
   1069 
   1070   stbir_input_callback * in_pixels_cb;
   1071   void * user_data;
   1072   stbir_output_callback * out_pixels_cb;
   1073 
   1074   stbir__extents scanline_extents;
   1075 
   1076   void * alloced_mem;
   1077   stbir__per_split_info * split_info;  // by default 1, but there will be N of these allocated based on the thread init you did
   1078 
   1079   stbir__decode_pixels_func * decode_pixels;
   1080   stbir__alpha_weight_func * alpha_weight;
   1081   stbir__horizontal_gather_channels_func * horizontal_gather_channels;
   1082   stbir__alpha_unweight_func * alpha_unweight;
   1083   stbir__encode_pixels_func * encode_pixels;
   1084 
   1085   int alloc_ring_buffer_num_entries;    // Number of entries in the ring buffer that will be allocated
   1086   int splits; // count of splits
   1087 
   1088   stbir_internal_pixel_layout input_pixel_layout_internal;
   1089   stbir_internal_pixel_layout output_pixel_layout_internal;
   1090 
   1091   int input_color_and_type;
   1092   int offset_x, offset_y; // offset within output_data
   1093   int vertical_first;
   1094   int channels;
   1095   int effective_channels; // same as channels, except on RGBA/ARGB (7), or XA/AX (3)
   1096   size_t alloced_total;
   1097 };
   1098 
   1099 
   1100 #define stbir__max_uint8_as_float             255.0f
   1101 #define stbir__max_uint16_as_float            65535.0f
   1102 #define stbir__max_uint8_as_float_inverted    (1.0f/255.0f)
   1103 #define stbir__max_uint16_as_float_inverted   (1.0f/65535.0f)
   1104 #define stbir__small_float ((float)1 / (1 << 20) / (1 << 20) / (1 << 20) / (1 << 20) / (1 << 20) / (1 << 20))
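        // stbir__small_float evaluates to 2^-120 (1 divided by 2^20 six times), a tiny positive
        // normal float used later as a "practically zero" epsilon.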
   1105 
   1106 // min/max friendly
   1107 #define STBIR_CLAMP(x, xmin, xmax) for(;;) { \
   1108   if ( (x) < (xmin) ) (x) = (xmin);     \
   1109   if ( (x) > (xmax) ) (x) = (xmax);     \
   1110   break;                                \
   1111 }
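        // STBIR_CLAMP clamps x into [xmin, xmax] in place. Illustrative usage (hypothetical
        // variable, not taken from this file's code paths):
        //
        //   int n = some_index;
        //   STBIR_CLAMP( n, 0, 255 );   // n now lies within [0, 255]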
   1112 
   1113 static stbir__inline int stbir__min(int a, int b)
   1114 {
   1115   return a < b ? a : b;
   1116 }
   1117 
   1118 static stbir__inline int stbir__max(int a, int b)
   1119 {
   1120   return a > b ? a : b;
   1121 }
   1122 
   1123 static float stbir__srgb_uchar_to_linear_float[256] = {
   1124   0.000000f, 0.000304f, 0.000607f, 0.000911f, 0.001214f, 0.001518f, 0.001821f, 0.002125f, 0.002428f, 0.002732f, 0.003035f,
   1125   0.003347f, 0.003677f, 0.004025f, 0.004391f, 0.004777f, 0.005182f, 0.005605f, 0.006049f, 0.006512f, 0.006995f, 0.007499f,
   1126   0.008023f, 0.008568f, 0.009134f, 0.009721f, 0.010330f, 0.010960f, 0.011612f, 0.012286f, 0.012983f, 0.013702f, 0.014444f,
   1127   0.015209f, 0.015996f, 0.016807f, 0.017642f, 0.018500f, 0.019382f, 0.020289f, 0.021219f, 0.022174f, 0.023153f, 0.024158f,
   1128   0.025187f, 0.026241f, 0.027321f, 0.028426f, 0.029557f, 0.030713f, 0.031896f, 0.033105f, 0.034340f, 0.035601f, 0.036889f,
   1129   0.038204f, 0.039546f, 0.040915f, 0.042311f, 0.043735f, 0.045186f, 0.046665f, 0.048172f, 0.049707f, 0.051269f, 0.052861f,
   1130   0.054480f, 0.056128f, 0.057805f, 0.059511f, 0.061246f, 0.063010f, 0.064803f, 0.066626f, 0.068478f, 0.070360f, 0.072272f,
   1131   0.074214f, 0.076185f, 0.078187f, 0.080220f, 0.082283f, 0.084376f, 0.086500f, 0.088656f, 0.090842f, 0.093059f, 0.095307f,
   1132   0.097587f, 0.099899f, 0.102242f, 0.104616f, 0.107023f, 0.109462f, 0.111932f, 0.114435f, 0.116971f, 0.119538f, 0.122139f,
   1133   0.124772f, 0.127438f, 0.130136f, 0.132868f, 0.135633f, 0.138432f, 0.141263f, 0.144128f, 0.147027f, 0.149960f, 0.152926f,
   1134   0.155926f, 0.158961f, 0.162029f, 0.165132f, 0.168269f, 0.171441f, 0.174647f, 0.177888f, 0.181164f, 0.184475f, 0.187821f,
   1135   0.191202f, 0.194618f, 0.198069f, 0.201556f, 0.205079f, 0.208637f, 0.212231f, 0.215861f, 0.219526f, 0.223228f, 0.226966f,
   1136   0.230740f, 0.234551f, 0.238398f, 0.242281f, 0.246201f, 0.250158f, 0.254152f, 0.258183f, 0.262251f, 0.266356f, 0.270498f,
   1137   0.274677f, 0.278894f, 0.283149f, 0.287441f, 0.291771f, 0.296138f, 0.300544f, 0.304987f, 0.309469f, 0.313989f, 0.318547f,
   1138   0.323143f, 0.327778f, 0.332452f, 0.337164f, 0.341914f, 0.346704f, 0.351533f, 0.356400f, 0.361307f, 0.366253f, 0.371238f,
   1139   0.376262f, 0.381326f, 0.386430f, 0.391573f, 0.396755f, 0.401978f, 0.407240f, 0.412543f, 0.417885f, 0.423268f, 0.428691f,
   1140   0.434154f, 0.439657f, 0.445201f, 0.450786f, 0.456411f, 0.462077f, 0.467784f, 0.473532f, 0.479320f, 0.485150f, 0.491021f,
   1141   0.496933f, 0.502887f, 0.508881f, 0.514918f, 0.520996f, 0.527115f, 0.533276f, 0.539480f, 0.545725f, 0.552011f, 0.558340f,
   1142   0.564712f, 0.571125f, 0.577581f, 0.584078f, 0.590619f, 0.597202f, 0.603827f, 0.610496f, 0.617207f, 0.623960f, 0.630757f,
   1143   0.637597f, 0.644480f, 0.651406f, 0.658375f, 0.665387f, 0.672443f, 0.679543f, 0.686685f, 0.693872f, 0.701102f, 0.708376f,
   1144   0.715694f, 0.723055f, 0.730461f, 0.737911f, 0.745404f, 0.752942f, 0.760525f, 0.768151f, 0.775822f, 0.783538f, 0.791298f,
   1145   0.799103f, 0.806952f, 0.814847f, 0.822786f, 0.830770f, 0.838799f, 0.846873f, 0.854993f, 0.863157f, 0.871367f, 0.879622f,
   1146   0.887923f, 0.896269f, 0.904661f, 0.913099f, 0.921582f, 0.930111f, 0.938686f, 0.947307f, 0.955974f, 0.964686f, 0.973445f,
   1147   0.982251f, 0.991102f, 1.0f
   1148 };
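        // The table above is the standard sRGB decode (EOTF) sampled at every 8-bit code value v:
        //   c = v / 255
        //   linear = c / 12.92                          if c <= 0.04045
        //   linear = pow( (c + 0.055) / 1.055, 2.4 )    otherwise
        // e.g. v = 1 gives (1/255)/12.92 = 0.000304, matching the second entry.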
   1149 
   1150 typedef union
   1151 {
   1152   unsigned int u;
   1153   float f;
   1154 } stbir__FP32;
   1155 
   1156 // From https://gist.github.com/rygorous/2203834
   1157 
   1158 static const stbir_uint32 fp32_to_srgb8_tab4[104] = {
   1159   0x0073000d, 0x007a000d, 0x0080000d, 0x0087000d, 0x008d000d, 0x0094000d, 0x009a000d, 0x00a1000d,
   1160   0x00a7001a, 0x00b4001a, 0x00c1001a, 0x00ce001a, 0x00da001a, 0x00e7001a, 0x00f4001a, 0x0101001a,
   1161   0x010e0033, 0x01280033, 0x01410033, 0x015b0033, 0x01750033, 0x018f0033, 0x01a80033, 0x01c20033,
   1162   0x01dc0067, 0x020f0067, 0x02430067, 0x02760067, 0x02aa0067, 0x02dd0067, 0x03110067, 0x03440067,
   1163   0x037800ce, 0x03df00ce, 0x044600ce, 0x04ad00ce, 0x051400ce, 0x057b00c5, 0x05dd00bc, 0x063b00b5,
   1164   0x06970158, 0x07420142, 0x07e30130, 0x087b0120, 0x090b0112, 0x09940106, 0x0a1700fc, 0x0a9500f2,
   1165   0x0b0f01cb, 0x0bf401ae, 0x0ccb0195, 0x0d950180, 0x0e56016e, 0x0f0d015e, 0x0fbc0150, 0x10630143,
   1166   0x11070264, 0x1238023e, 0x1357021d, 0x14660201, 0x156601e9, 0x165a01d3, 0x174401c0, 0x182401af,
   1167   0x18fe0331, 0x1a9602fe, 0x1c1502d2, 0x1d7e02ad, 0x1ed4028d, 0x201a0270, 0x21520256, 0x227d0240,
   1168   0x239f0443, 0x25c003fe, 0x27bf03c4, 0x29a10392, 0x2b6a0367, 0x2d1d0341, 0x2ebe031f, 0x304d0300,
   1169   0x31d105b0, 0x34a80555, 0x37520507, 0x39d504c5, 0x3c37048b, 0x3e7c0458, 0x40a8042a, 0x42bd0401,
   1170   0x44c20798, 0x488e071e, 0x4c1c06b6, 0x4f76065d, 0x52a50610, 0x55ac05cc, 0x5892058f, 0x5b590559,
   1171   0x5e0c0a23, 0x631c0980, 0x67db08f6, 0x6c55087f, 0x70940818, 0x74a007bd, 0x787d076c, 0x7c330723,
   1172 };
   1173 
   1174 static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in)
   1175 {
   1176   static const stbir__FP32 almostone = { 0x3f7fffff }; // 1-eps
   1177   static const stbir__FP32 minval = { (127-13) << 23 };
   1178   stbir_uint32 tab,bias,scale,t;
   1179   stbir__FP32 f;
   1180 
   1181   // Clamp to [2^(-13), 1-eps]; these two values map to 0 and 1, respectively.
   1182   // The tests are carefully written so that NaNs map to 0, same as in the reference
   1183   // implementation.
   1184   if (!(in > minval.f)) // written this way to catch NaNs
   1185       return 0;
   1186   if (in > almostone.f)
   1187       return 255;
   1188 
   1189   // Do the table lookup and unpack bias, scale
   1190   f.f = in;
   1191   tab = fp32_to_srgb8_tab4[(f.u - minval.u) >> 20];
   1192   bias = (tab >> 16) << 9;
   1193   scale = tab & 0xffff;
   1194 
   1195   // Grab next-highest mantissa bits and perform linear interpolation
   1196   t = (f.u >> 12) & 0xff;
   1197   return (unsigned char) ((bias + scale*t) >> 16);
   1198 }
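        // How the lookup above works (descriptive note): inputs are clamped to [2^-13, 1-eps],
        // which spans 13 power-of-two ranges; each range is split into 8 buckets by the top 3
        // mantissa bits, giving the 13*8 = 104 table entries indexed by (f.u - minval.u) >> 20.
        // Each 32-bit entry packs a bias in its high 16 bits and a scale in its low 16 bits, and
        // the next 8 mantissa bits interpolate linearly within the bucket:
        //
        //   t      = (f.u >> 12) & 0xff
        //   result = ( (hi16 << 9) + lo16 * t ) >> 16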
   1199 
   1200 #ifndef STBIR_FORCE_GATHER_FILTER_SCANLINES_AMOUNT
   1201 #define STBIR_FORCE_GATHER_FILTER_SCANLINES_AMOUNT 32 // when downsampling and <= 32 scanlines of buffering, use gather. gather used down to 1/8th scaling for 25% win.
   1202 #endif
   1203 
   1204 #ifndef STBIR_FORCE_MINIMUM_SCANLINES_FOR_SPLITS
   1205 #define STBIR_FORCE_MINIMUM_SCANLINES_FOR_SPLITS 4 // when threading, what is the minimum number of scanlines for a split?
   1206 #endif
   1207 
   1208 // restrict pointers for the output pointers, other loop and unroll control
   1209 #if defined( _MSC_VER ) && !defined(__clang__)
   1210   #define STBIR_STREAMOUT_PTR( star ) star __restrict
   1211   #define STBIR_NO_UNROLL( ptr ) __assume(ptr) // this oddly keeps msvc from unrolling a loop
   1212   #if _MSC_VER >= 1900
   1213     #define STBIR_NO_UNROLL_LOOP_START __pragma(loop( no_vector )) 
   1214   #else
   1215     #define STBIR_NO_UNROLL_LOOP_START 
   1216   #endif
   1217 #elif defined( __clang__ )
   1218   #define STBIR_STREAMOUT_PTR( star ) star __restrict__
   1219   #define STBIR_NO_UNROLL( ptr ) __asm__ (""::"r"(ptr)) 
   1220   #if ( __clang_major__ >= 4 ) || ( ( __clang_major__ >= 3 ) && ( __clang_minor__ >= 5 ) )
   1221     #define STBIR_NO_UNROLL_LOOP_START _Pragma("clang loop unroll(disable)") _Pragma("clang loop vectorize(disable)")
   1222   #else
   1223     #define STBIR_NO_UNROLL_LOOP_START
   1224   #endif 
   1225 #elif defined( __GNUC__ )
   1226   #define STBIR_STREAMOUT_PTR( star ) star __restrict__
   1227   #define STBIR_NO_UNROLL( ptr ) __asm__ (""::"r"(ptr))
   1228   #if __GNUC__ >= 14
   1229     #define STBIR_NO_UNROLL_LOOP_START _Pragma("GCC unroll 0") _Pragma("GCC novector")
   1230   #else
   1231     #define STBIR_NO_UNROLL_LOOP_START
   1232   #endif
   1233   #define STBIR_NO_UNROLL_LOOP_START_INF_FOR
   1234 #else
   1235   #define STBIR_STREAMOUT_PTR( star ) star
   1236   #define STBIR_NO_UNROLL( ptr )
   1237   #define STBIR_NO_UNROLL_LOOP_START
   1238 #endif
   1239 
   1240 #ifndef STBIR_NO_UNROLL_LOOP_START_INF_FOR
   1241 #define STBIR_NO_UNROLL_LOOP_START_INF_FOR STBIR_NO_UNROLL_LOOP_START
   1242 #endif
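        // Illustrative sketch of how the macros above get applied in an inner loop (hypothetical
        // variable names, not a real code path from this file): the __restrict qualifier and the
        // "don't unroll / don't vectorize" hints keep compilers from rewriting the hand-scheduled
        // loops that follow later in the file.
        //
        //   float STBIR_STREAMOUT_PTR( * ) out = dest;   // e.g. expands to "float * __restrict out"
        //   STBIR_NO_UNROLL_LOOP_START
        //   for ( i = 0 ; i < n ; i++ )
        //   {
        //     STBIR_NO_UNROLL( out );          // discourages unrolling of this loop body
        //     out[ i ] = src[ i ] * scale;
        //   }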
   1243 
   1244 #ifdef STBIR_NO_SIMD // force simd off for whatever reason
   1245 
   1246 // force simd off overrides everything else, so clear it all
   1247 
   1248 #ifdef STBIR_SSE2
   1249 #undef STBIR_SSE2
   1250 #endif
   1251 
   1252 #ifdef STBIR_AVX
   1253 #undef STBIR_AVX
   1254 #endif
   1255 
   1256 #ifdef STBIR_NEON
   1257 #undef STBIR_NEON
   1258 #endif
   1259 
   1260 #ifdef STBIR_AVX2
   1261 #undef STBIR_AVX2
   1262 #endif
   1263 
   1264 #ifdef STBIR_FP16C
   1265 #undef STBIR_FP16C
   1266 #endif
   1267 
   1268 #ifdef STBIR_WASM
   1269 #undef STBIR_WASM
   1270 #endif
   1271 
   1272 #ifdef STBIR_SIMD
   1273 #undef STBIR_SIMD
   1274 #endif
   1275 
   1276 #else // STBIR_SIMD
   1277 
   1278 #ifdef STBIR_SSE2
   1279   #include <emmintrin.h>
   1280 
   1281   #define stbir__simdf __m128
   1282   #define stbir__simdi __m128i
   1283 
   1284   #define stbir_simdi_castf( reg ) _mm_castps_si128(reg)
   1285   #define stbir_simdf_casti( reg ) _mm_castsi128_ps(reg)
   1286 
   1287   #define stbir__simdf_load( reg, ptr ) (reg) = _mm_loadu_ps( (float const*)(ptr) )
   1288   #define stbir__simdi_load( reg, ptr ) (reg) = _mm_loadu_si128 ( (stbir__simdi const*)(ptr) )
   1289   #define stbir__simdf_load1( out, ptr ) (out) = _mm_load_ss( (float const*)(ptr) )  // top values can be random (not denormal or nan for perf)
   1290   #define stbir__simdi_load1( out, ptr ) (out) = _mm_castps_si128( _mm_load_ss( (float const*)(ptr) ))
   1291   #define stbir__simdf_load1z( out, ptr ) (out) = _mm_load_ss( (float const*)(ptr) )  // top values must be zero
   1292   #define stbir__simdf_frep4( fvar ) _mm_set_ps1( fvar )
   1293   #define stbir__simdf_load1frep4( out, fvar ) (out) = _mm_set_ps1( fvar )
   1294   #define stbir__simdf_load2( out, ptr ) (out) = _mm_castsi128_ps( _mm_loadl_epi64( (__m128i*)(ptr)) ) // top values can be random (not denormal or nan for perf)
   1295   #define stbir__simdf_load2z( out, ptr ) (out) = _mm_castsi128_ps( _mm_loadl_epi64( (__m128i*)(ptr)) ) // top values must be zero
   1296   #define stbir__simdf_load2hmerge( out, reg, ptr ) (out) = _mm_castpd_ps(_mm_loadh_pd( _mm_castps_pd(reg), (double*)(ptr) ))
   1297 
   1298   #define stbir__simdf_zeroP() _mm_setzero_ps()
   1299   #define stbir__simdf_zero( reg ) (reg) = _mm_setzero_ps()
   1300 
   1301   #define stbir__simdf_store( ptr, reg )  _mm_storeu_ps( (float*)(ptr), reg )
   1302   #define stbir__simdf_store1( ptr, reg ) _mm_store_ss( (float*)(ptr), reg )
   1303   #define stbir__simdf_store2( ptr, reg ) _mm_storel_epi64( (__m128i*)(ptr), _mm_castps_si128(reg) )
   1304   #define stbir__simdf_store2h( ptr, reg ) _mm_storeh_pd( (double*)(ptr), _mm_castps_pd(reg) )
   1305 
   1306   #define stbir__simdi_store( ptr, reg )  _mm_storeu_si128( (__m128i*)(ptr), reg )
   1307   #define stbir__simdi_store1( ptr, reg ) _mm_store_ss( (float*)(ptr), _mm_castsi128_ps(reg) )
   1308   #define stbir__simdi_store2( ptr, reg ) _mm_storel_epi64( (__m128i*)(ptr), (reg) )
   1309 
   1310   #define stbir__prefetch( ptr ) _mm_prefetch((char*)(ptr), _MM_HINT_T0 )
   1311 
   1312   #define stbir__simdi_expand_u8_to_u32(out0,out1,out2,out3,ireg) \
   1313   { \
   1314     stbir__simdi zero = _mm_setzero_si128(); \
   1315     out2 = _mm_unpacklo_epi8( ireg, zero ); \
   1316     out3 = _mm_unpackhi_epi8( ireg, zero ); \
   1317     out0 = _mm_unpacklo_epi16( out2, zero ); \
   1318     out1 = _mm_unpackhi_epi16( out2, zero ); \
   1319     out2 = _mm_unpacklo_epi16( out3, zero ); \
   1320     out3 = _mm_unpackhi_epi16( out3, zero ); \
   1321   }
   1322 
   1323 #define stbir__simdi_expand_u8_to_1u32(out,ireg) \
   1324   { \
   1325     stbir__simdi zero = _mm_setzero_si128(); \
   1326     out = _mm_unpacklo_epi8( ireg, zero ); \
   1327     out = _mm_unpacklo_epi16( out, zero ); \
   1328   }
   1329 
   1330   #define stbir__simdi_expand_u16_to_u32(out0,out1,ireg) \
   1331   { \
   1332     stbir__simdi zero = _mm_setzero_si128(); \
   1333     out0 = _mm_unpacklo_epi16( ireg, zero ); \
   1334     out1 = _mm_unpackhi_epi16( ireg, zero ); \
   1335   }
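          // The expand macros above widen unsigned integers by interleaving with a zero register:
          // unpacking u8 lanes against zero yields u16 lanes, and unpacking those against zero again
          // yields u32 lanes ready for stbir__simdi_convert_i32_to_float. For example, the 16 bytes
          // {0,1,...,15} expand into four vectors holding {0,1,2,3}, {4,5,6,7}, {8,9,10,11} and
          // {12,13,14,15} as 32-bit integers.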
   1336 
   1337   #define stbir__simdf_convert_float_to_i32( i, f ) (i) = _mm_cvttps_epi32(f)
   1338   #define stbir__simdf_convert_float_to_int( f ) _mm_cvtt_ss2si(f)
   1339   #define stbir__simdf_convert_float_to_uint8( f ) ((unsigned char)_mm_cvtsi128_si32(_mm_cvttps_epi32(_mm_max_ps(_mm_min_ps(f,STBIR__CONSTF(STBIR_max_uint8_as_float)),_mm_setzero_ps()))))
   1340   #define stbir__simdf_convert_float_to_short( f ) ((unsigned short)_mm_cvtsi128_si32(_mm_cvttps_epi32(_mm_max_ps(_mm_min_ps(f,STBIR__CONSTF(STBIR_max_uint16_as_float)),_mm_setzero_ps()))))
   1341 
   1342   #define stbir__simdi_to_int( i ) _mm_cvtsi128_si32(i)
   1343   #define stbir__simdi_convert_i32_to_float(out, ireg) (out) = _mm_cvtepi32_ps( ireg )
   1344   #define stbir__simdf_add( out, reg0, reg1 ) (out) = _mm_add_ps( reg0, reg1 )
   1345   #define stbir__simdf_mult( out, reg0, reg1 ) (out) = _mm_mul_ps( reg0, reg1 )
   1346   #define stbir__simdf_mult_mem( out, reg, ptr ) (out) = _mm_mul_ps( reg, _mm_loadu_ps( (float const*)(ptr) ) )
   1347   #define stbir__simdf_mult1_mem( out, reg, ptr ) (out) = _mm_mul_ss( reg, _mm_load_ss( (float const*)(ptr) ) )
   1348   #define stbir__simdf_add_mem( out, reg, ptr ) (out) = _mm_add_ps( reg, _mm_loadu_ps( (float const*)(ptr) ) )
   1349   #define stbir__simdf_add1_mem( out, reg, ptr ) (out) = _mm_add_ss( reg, _mm_load_ss( (float const*)(ptr) ) )
   1350 
   1351   #ifdef STBIR_USE_FMA           // not on by default to maintain bit identical simd to non-simd
   1352   #include <immintrin.h>
   1353   #define stbir__simdf_madd( out, add, mul1, mul2 ) (out) = _mm_fmadd_ps( mul1, mul2, add )
   1354   #define stbir__simdf_madd1( out, add, mul1, mul2 ) (out) = _mm_fmadd_ss( mul1, mul2, add )
   1355   #define stbir__simdf_madd_mem( out, add, mul, ptr ) (out) = _mm_fmadd_ps( mul, _mm_loadu_ps( (float const*)(ptr) ), add )
   1356   #define stbir__simdf_madd1_mem( out, add, mul, ptr ) (out) = _mm_fmadd_ss( mul, _mm_load_ss( (float const*)(ptr) ), add )
   1357   #else
   1358   #define stbir__simdf_madd( out, add, mul1, mul2 ) (out) = _mm_add_ps( add, _mm_mul_ps( mul1, mul2 ) )
   1359   #define stbir__simdf_madd1( out, add, mul1, mul2 ) (out) = _mm_add_ss( add, _mm_mul_ss( mul1, mul2 ) )
   1360   #define stbir__simdf_madd_mem( out, add, mul, ptr ) (out) = _mm_add_ps( add, _mm_mul_ps( mul, _mm_loadu_ps( (float const*)(ptr) ) ) )
   1361   #define stbir__simdf_madd1_mem( out, add, mul, ptr ) (out) = _mm_add_ss( add, _mm_mul_ss( mul, _mm_load_ss( (float const*)(ptr) ) ) )
   1362   #endif
   1363 
   1364   #define stbir__simdf_add1( out, reg0, reg1 ) (out) = _mm_add_ss( reg0, reg1 )
   1365   #define stbir__simdf_mult1( out, reg0, reg1 ) (out) = _mm_mul_ss( reg0, reg1 )
   1366 
   1367   #define stbir__simdf_and( out, reg0, reg1 ) (out) = _mm_and_ps( reg0, reg1 )
   1368   #define stbir__simdf_or( out, reg0, reg1 ) (out) = _mm_or_ps( reg0, reg1 )
   1369 
   1370   #define stbir__simdf_min( out, reg0, reg1 ) (out) = _mm_min_ps( reg0, reg1 )
   1371   #define stbir__simdf_max( out, reg0, reg1 ) (out) = _mm_max_ps( reg0, reg1 )
   1372   #define stbir__simdf_min1( out, reg0, reg1 ) (out) = _mm_min_ss( reg0, reg1 )
   1373   #define stbir__simdf_max1( out, reg0, reg1 ) (out) = _mm_max_ss( reg0, reg1 )
   1374 
   1375   #define stbir__simdf_0123ABCDto3ABx( out, reg0, reg1 ) (out)=_mm_castsi128_ps( _mm_shuffle_epi32( _mm_castps_si128( _mm_shuffle_ps( reg1,reg0, (0<<0) + (1<<2) + (2<<4) + (3<<6) )), (3<<0) + (0<<2) + (1<<4) + (2<<6) ) )
   1376   #define stbir__simdf_0123ABCDto23Ax( out, reg0, reg1 ) (out)=_mm_castsi128_ps( _mm_shuffle_epi32( _mm_castps_si128( _mm_shuffle_ps( reg1,reg0, (0<<0) + (1<<2) + (2<<4) + (3<<6) )), (2<<0) + (3<<2) + (0<<4) + (1<<6) ) )
   1377 
   1378   static const stbir__simdf STBIR_zeroones = { 0.0f,1.0f,0.0f,1.0f };
   1379   static const stbir__simdf STBIR_onezeros = { 1.0f,0.0f,1.0f,0.0f };
   1380   #define stbir__simdf_aaa1( out, alp, ones ) (out)=_mm_castsi128_ps( _mm_shuffle_epi32( _mm_castps_si128( _mm_movehl_ps( ones, alp ) ), (1<<0) + (1<<2) + (1<<4) + (2<<6) ) )
   1381   #define stbir__simdf_1aaa( out, alp, ones ) (out)=_mm_castsi128_ps( _mm_shuffle_epi32( _mm_castps_si128( _mm_movelh_ps( ones, alp ) ), (0<<0) + (2<<2) + (2<<4) + (2<<6) ) )
   1382   #define stbir__simdf_a1a1( out, alp, ones) (out) = _mm_or_ps( _mm_castsi128_ps( _mm_srli_epi64( _mm_castps_si128(alp), 32 ) ), STBIR_zeroones )
   1383   #define stbir__simdf_1a1a( out, alp, ones) (out) = _mm_or_ps( _mm_castsi128_ps( _mm_slli_epi64( _mm_castps_si128(alp), 32 ) ), STBIR_onezeros )
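          // The four alpha-shuffle macros above build weighting vectors from a loaded pixel:
          // aaa1 yields (a,a,a,1) for RGBA order and 1aaa yields (1,a,a,a) for ARGB order, so a
          // single multiply premultiplies the color channels while leaving alpha untouched;
          // a1a1 / 1a1a do the same for a pair of two-channel (XA / AX) pixels, using the
          // STBIR_zeroones / STBIR_onezeros constants to drop a 1 into each alpha slot.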
   1384 
   1385   #define stbir__simdf_swiz( reg, one, two, three, four ) _mm_castsi128_ps( _mm_shuffle_epi32( _mm_castps_si128( reg ), (one<<0) + (two<<2) + (three<<4) + (four<<6) ) )
   1386 
   1387   #define stbir__simdi_and( out, reg0, reg1 ) (out) = _mm_and_si128( reg0, reg1 )
   1388   #define stbir__simdi_or( out, reg0, reg1 ) (out) = _mm_or_si128( reg0, reg1 )
   1389   #define stbir__simdi_16madd( out, reg0, reg1 ) (out) = _mm_madd_epi16( reg0, reg1 )
   1390 
   1391   #define stbir__simdf_pack_to_8bytes(out,aa,bb) \
   1392   { \
   1393     stbir__simdf af,bf; \
   1394     stbir__simdi a,b; \
   1395     af = _mm_min_ps( aa, STBIR_max_uint8_as_float ); \
   1396     bf = _mm_min_ps( bb, STBIR_max_uint8_as_float ); \
   1397     af = _mm_max_ps( af, _mm_setzero_ps() ); \
   1398     bf = _mm_max_ps( bf, _mm_setzero_ps() ); \
   1399     a = _mm_cvttps_epi32( af ); \
   1400     b = _mm_cvttps_epi32( bf ); \
   1401     a = _mm_packs_epi32( a, b ); \
   1402     out = _mm_packus_epi16( a, a ); \
   1403   }
   1404 
   1405   #define stbir__simdf_load4_transposed( o0, o1, o2, o3, ptr ) \
   1406       stbir__simdf_load( o0, (ptr) );    \
   1407       stbir__simdf_load( o1, (ptr)+4 );  \
   1408       stbir__simdf_load( o2, (ptr)+8 );  \
   1409       stbir__simdf_load( o3, (ptr)+12 ); \
   1410       {                                  \
   1411         __m128 tmp0, tmp1, tmp2, tmp3;   \
   1412         tmp0 = _mm_unpacklo_ps(o0, o1);  \
   1413         tmp2 = _mm_unpacklo_ps(o2, o3);  \
   1414         tmp1 = _mm_unpackhi_ps(o0, o1);  \
   1415         tmp3 = _mm_unpackhi_ps(o2, o3);  \
   1416         o0 = _mm_movelh_ps(tmp0, tmp2);  \
   1417         o1 = _mm_movehl_ps(tmp2, tmp0);  \
   1418         o2 = _mm_movelh_ps(tmp1, tmp3);  \
   1419         o3 = _mm_movehl_ps(tmp3, tmp1);  \
   1420       }
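          // stbir__simdf_load4_transposed is the classic 4x4 SSE transpose: after the unpacklo/
          // unpackhi + movelh/movehl sequence, o0..o3 hold the four columns of the 4x4 float block
          // at ptr, i.e. loading 4 RGBA pixels leaves all the R's in o0, G's in o1, B's in o2 and
          // A's in o3 (the same deinterleaved layout vld4q_f32 produces directly in the NEON path).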
   1421 
   1422   #define stbir__interleave_pack_and_store_16_u8( ptr, r0, r1, r2, r3 ) \
   1423       r0 = _mm_packs_epi32( r0, r1 ); \
   1424       r2 = _mm_packs_epi32( r2, r3 ); \
   1425       r1 = _mm_unpacklo_epi16( r0, r2 ); \
   1426       r3 = _mm_unpackhi_epi16( r0, r2 ); \
   1427       r0 = _mm_unpacklo_epi16( r1, r3 ); \
   1428       r2 = _mm_unpackhi_epi16( r1, r3 ); \
   1429       r0 = _mm_packus_epi16( r0, r2 ); \
   1430       stbir__simdi_store( ptr, r0 ); \
   1431 
   1432   #define stbir__simdi_32shr( out, reg, imm ) out = _mm_srli_epi32( reg, imm )
   1433 
   1434   #if defined(_MSC_VER) && !defined(__clang__)
    1435     // msvc inits __m128i constants with 8-bit values
   1436     #define STBIR__CONST_32_TO_8( v ) (char)(unsigned char)((v)&255),(char)(unsigned char)(((v)>>8)&255),(char)(unsigned char)(((v)>>16)&255),(char)(unsigned char)(((v)>>24)&255)
   1437     #define STBIR__CONST_4_32i( v ) STBIR__CONST_32_TO_8( v ), STBIR__CONST_32_TO_8( v ), STBIR__CONST_32_TO_8( v ), STBIR__CONST_32_TO_8( v )
   1438     #define STBIR__CONST_4d_32i( v0, v1, v2, v3 ) STBIR__CONST_32_TO_8( v0 ), STBIR__CONST_32_TO_8( v1 ), STBIR__CONST_32_TO_8( v2 ), STBIR__CONST_32_TO_8( v3 )
   1439   #else
    1440     // everything else inits with long longs
   1441     #define STBIR__CONST_4_32i( v ) (long long)((((stbir_uint64)(stbir_uint32)(v))<<32)|((stbir_uint64)(stbir_uint32)(v))),(long long)((((stbir_uint64)(stbir_uint32)(v))<<32)|((stbir_uint64)(stbir_uint32)(v)))
   1442     #define STBIR__CONST_4d_32i( v0, v1, v2, v3 ) (long long)((((stbir_uint64)(stbir_uint32)(v1))<<32)|((stbir_uint64)(stbir_uint32)(v0))),(long long)((((stbir_uint64)(stbir_uint32)(v3))<<32)|((stbir_uint64)(stbir_uint32)(v2)))
   1443   #endif
   1444 
   1445   #define STBIR__SIMDF_CONST(var, x) stbir__simdf var = { x, x, x, x }
   1446   #define STBIR__SIMDI_CONST(var, x) stbir__simdi var = { STBIR__CONST_4_32i(x) }
   1447   #define STBIR__CONSTF(var) (var)
   1448   #define STBIR__CONSTI(var) (var)
   1449 
   1450   #if defined(STBIR_AVX) || defined(__SSE4_1__)
   1451     #include <smmintrin.h>
   1452     #define stbir__simdf_pack_to_8words(out,reg0,reg1) out = _mm_packus_epi32(_mm_cvttps_epi32(_mm_max_ps(_mm_min_ps(reg0,STBIR__CONSTF(STBIR_max_uint16_as_float)),_mm_setzero_ps())), _mm_cvttps_epi32(_mm_max_ps(_mm_min_ps(reg1,STBIR__CONSTF(STBIR_max_uint16_as_float)),_mm_setzero_ps())))
   1453   #else
   1454     STBIR__SIMDI_CONST(stbir__s32_32768, 32768);
   1455     STBIR__SIMDI_CONST(stbir__s16_32768, ((32768<<16)|32768));
   1456 
   1457     #define stbir__simdf_pack_to_8words(out,reg0,reg1) \
   1458       { \
   1459         stbir__simdi tmp0,tmp1; \
   1460         tmp0 = _mm_cvttps_epi32(_mm_max_ps(_mm_min_ps(reg0,STBIR__CONSTF(STBIR_max_uint16_as_float)),_mm_setzero_ps())); \
   1461         tmp1 = _mm_cvttps_epi32(_mm_max_ps(_mm_min_ps(reg1,STBIR__CONSTF(STBIR_max_uint16_as_float)),_mm_setzero_ps())); \
   1462         tmp0 = _mm_sub_epi32( tmp0, stbir__s32_32768 ); \
   1463         tmp1 = _mm_sub_epi32( tmp1, stbir__s32_32768 ); \
   1464         out = _mm_packs_epi32( tmp0, tmp1 ); \
   1465         out = _mm_sub_epi16( out, stbir__s16_32768 ); \
   1466       }
   1467 
   1468   #endif
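          // The SSE2 fallback above emulates the missing _mm_packus_epi32 (unsigned 32->16 pack,
          // an SSE4.1 instruction): values already clamped to [0,65535] are biased down by 32768
          // so they fit the signed range, packed with the signed _mm_packs_epi32, and then the
          // bias is undone by subtracting 0x8000 in each 16-bit lane (which, with 16-bit
          // wraparound, is the same as adding 32768 back).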
   1469 
   1470   #define STBIR_SIMD
   1471 
   1472   // if we detect AVX, set the simd8 defines
   1473   #ifdef STBIR_AVX
   1474     #include <immintrin.h>
   1475     #define STBIR_SIMD8
   1476     #define stbir__simdf8 __m256
   1477     #define stbir__simdi8 __m256i
   1478     #define stbir__simdf8_load( out, ptr ) (out) = _mm256_loadu_ps( (float const *)(ptr) )
   1479     #define stbir__simdi8_load( out, ptr ) (out) = _mm256_loadu_si256( (__m256i const *)(ptr) )
   1480     #define stbir__simdf8_mult( out, a, b ) (out) = _mm256_mul_ps( (a), (b) )
   1481     #define stbir__simdf8_store( ptr, out ) _mm256_storeu_ps( (float*)(ptr), out )
   1482     #define stbir__simdi8_store( ptr, reg )  _mm256_storeu_si256( (__m256i*)(ptr), reg )
   1483     #define stbir__simdf8_frep8( fval ) _mm256_set1_ps( fval )
   1484 
   1485     #define stbir__simdf8_min( out, reg0, reg1 ) (out) = _mm256_min_ps( reg0, reg1 )
   1486     #define stbir__simdf8_max( out, reg0, reg1 ) (out) = _mm256_max_ps( reg0, reg1 )
   1487 
   1488     #define stbir__simdf8_add4halves( out, bot4, top8 ) (out) = _mm_add_ps( bot4, _mm256_extractf128_ps( top8, 1 ) )
   1489     #define stbir__simdf8_mult_mem( out, reg, ptr ) (out) = _mm256_mul_ps( reg, _mm256_loadu_ps( (float const*)(ptr) ) )
   1490     #define stbir__simdf8_add_mem( out, reg, ptr ) (out) = _mm256_add_ps( reg, _mm256_loadu_ps( (float const*)(ptr) ) )
   1491     #define stbir__simdf8_add( out, a, b ) (out) = _mm256_add_ps( a, b )
   1492     #define stbir__simdf8_load1b( out, ptr ) (out) = _mm256_broadcast_ss( ptr )
   1493     #define stbir__simdf_load1rep4( out, ptr ) (out) = _mm_broadcast_ss( ptr )  // avx load instruction
   1494 
   1495     #define stbir__simdi8_convert_i32_to_float(out, ireg) (out) = _mm256_cvtepi32_ps( ireg )
   1496     #define stbir__simdf8_convert_float_to_i32( i, f ) (i) = _mm256_cvttps_epi32(f)
   1497 
   1498     #define stbir__simdf8_bot4s( out, a, b ) (out) = _mm256_permute2f128_ps(a,b, (0<<0)+(2<<4) )
   1499     #define stbir__simdf8_top4s( out, a, b ) (out) = _mm256_permute2f128_ps(a,b, (1<<0)+(3<<4) )
   1500 
   1501     #define stbir__simdf8_gettop4( reg ) _mm256_extractf128_ps(reg,1)
   1502 
   1503     #ifdef STBIR_AVX2
   1504 
   1505     #define stbir__simdi8_expand_u8_to_u32(out0,out1,ireg) \
   1506     { \
   1507       stbir__simdi8 a, zero  =_mm256_setzero_si256();\
   1508       a = _mm256_permute4x64_epi64( _mm256_unpacklo_epi8( _mm256_permute4x64_epi64(_mm256_castsi128_si256(ireg),(0<<0)+(2<<2)+(1<<4)+(3<<6)), zero ),(0<<0)+(2<<2)+(1<<4)+(3<<6)); \
   1509       out0 = _mm256_unpacklo_epi16( a, zero ); \
   1510       out1 = _mm256_unpackhi_epi16( a, zero ); \
   1511     }
   1512 
   1513     #define stbir__simdf8_pack_to_16bytes(out,aa,bb) \
   1514     { \
   1515       stbir__simdi8 t; \
   1516       stbir__simdf8 af,bf; \
   1517       stbir__simdi8 a,b; \
   1518       af = _mm256_min_ps( aa, STBIR_max_uint8_as_floatX ); \
   1519       bf = _mm256_min_ps( bb, STBIR_max_uint8_as_floatX ); \
   1520       af = _mm256_max_ps( af, _mm256_setzero_ps() ); \
   1521       bf = _mm256_max_ps( bf, _mm256_setzero_ps() ); \
   1522       a = _mm256_cvttps_epi32( af ); \
   1523       b = _mm256_cvttps_epi32( bf ); \
   1524       t = _mm256_permute4x64_epi64( _mm256_packs_epi32( a, b ), (0<<0)+(2<<2)+(1<<4)+(3<<6) ); \
   1525       out = _mm256_castsi256_si128( _mm256_permute4x64_epi64( _mm256_packus_epi16( t, t ), (0<<0)+(2<<2)+(1<<4)+(3<<6) ) ); \
   1526     }
   1527 
   1528     #define stbir__simdi8_expand_u16_to_u32(out,ireg) out = _mm256_unpacklo_epi16( _mm256_permute4x64_epi64(_mm256_castsi128_si256(ireg),(0<<0)+(2<<2)+(1<<4)+(3<<6)), _mm256_setzero_si256() );
   1529 
   1530     #define stbir__simdf8_pack_to_16words(out,aa,bb) \
   1531       { \
   1532         stbir__simdf8 af,bf; \
   1533         stbir__simdi8 a,b; \
   1534         af = _mm256_min_ps( aa, STBIR_max_uint16_as_floatX ); \
   1535         bf = _mm256_min_ps( bb, STBIR_max_uint16_as_floatX ); \
   1536         af = _mm256_max_ps( af, _mm256_setzero_ps() ); \
   1537         bf = _mm256_max_ps( bf, _mm256_setzero_ps() ); \
   1538         a = _mm256_cvttps_epi32( af ); \
   1539         b = _mm256_cvttps_epi32( bf ); \
   1540         (out) = _mm256_permute4x64_epi64( _mm256_packus_epi32(a, b), (0<<0)+(2<<2)+(1<<4)+(3<<6) ); \
   1541       }
   1542 
   1543     #else
   1544 
   1545     #define stbir__simdi8_expand_u8_to_u32(out0,out1,ireg) \
   1546     { \
   1547       stbir__simdi a,zero = _mm_setzero_si128(); \
   1548       a = _mm_unpacklo_epi8( ireg, zero ); \
   1549       out0 = _mm256_setr_m128i( _mm_unpacklo_epi16( a, zero ), _mm_unpackhi_epi16( a, zero ) ); \
   1550       a = _mm_unpackhi_epi8( ireg, zero ); \
   1551       out1 = _mm256_setr_m128i( _mm_unpacklo_epi16( a, zero ), _mm_unpackhi_epi16( a, zero ) ); \
   1552     }
   1553 
   1554     #define stbir__simdf8_pack_to_16bytes(out,aa,bb) \
   1555     { \
   1556       stbir__simdi t; \
   1557       stbir__simdf8 af,bf; \
   1558       stbir__simdi8 a,b; \
   1559       af = _mm256_min_ps( aa, STBIR_max_uint8_as_floatX ); \
   1560       bf = _mm256_min_ps( bb, STBIR_max_uint8_as_floatX ); \
   1561       af = _mm256_max_ps( af, _mm256_setzero_ps() ); \
   1562       bf = _mm256_max_ps( bf, _mm256_setzero_ps() ); \
   1563       a = _mm256_cvttps_epi32( af ); \
   1564       b = _mm256_cvttps_epi32( bf ); \
   1565       out = _mm_packs_epi32( _mm256_castsi256_si128(a), _mm256_extractf128_si256( a, 1 ) ); \
   1566       out = _mm_packus_epi16( out, out ); \
   1567       t = _mm_packs_epi32( _mm256_castsi256_si128(b), _mm256_extractf128_si256( b, 1 ) ); \
   1568       t = _mm_packus_epi16( t, t ); \
   1569       out = _mm_castps_si128( _mm_shuffle_ps( _mm_castsi128_ps(out), _mm_castsi128_ps(t), (0<<0)+(1<<2)+(0<<4)+(1<<6) ) ); \
   1570     }
   1571 
   1572     #define stbir__simdi8_expand_u16_to_u32(out,ireg) \
   1573     { \
   1574       stbir__simdi a,b,zero = _mm_setzero_si128(); \
   1575       a = _mm_unpacklo_epi16( ireg, zero ); \
   1576       b = _mm_unpackhi_epi16( ireg, zero ); \
   1577       out = _mm256_insertf128_si256( _mm256_castsi128_si256( a ), b, 1 ); \
   1578     }
   1579 
   1580     #define stbir__simdf8_pack_to_16words(out,aa,bb) \
   1581       { \
   1582         stbir__simdi t0,t1; \
   1583         stbir__simdf8 af,bf; \
   1584         stbir__simdi8 a,b; \
   1585         af = _mm256_min_ps( aa, STBIR_max_uint16_as_floatX ); \
   1586         bf = _mm256_min_ps( bb, STBIR_max_uint16_as_floatX ); \
   1587         af = _mm256_max_ps( af, _mm256_setzero_ps() ); \
   1588         bf = _mm256_max_ps( bf, _mm256_setzero_ps() ); \
   1589         a = _mm256_cvttps_epi32( af ); \
   1590         b = _mm256_cvttps_epi32( bf ); \
   1591         t0 = _mm_packus_epi32( _mm256_castsi256_si128(a), _mm256_extractf128_si256( a, 1 ) ); \
   1592         t1 = _mm_packus_epi32( _mm256_castsi256_si128(b), _mm256_extractf128_si256( b, 1 ) ); \
   1593         out = _mm256_setr_m128i( t0, t1 ); \
   1594       }
   1595 
   1596     #endif
   1597 
   1598     static __m256i stbir_00001111 = { STBIR__CONST_4d_32i( 0, 0, 0, 0 ), STBIR__CONST_4d_32i( 1, 1, 1, 1 ) };
   1599     #define stbir__simdf8_0123to00001111( out, in ) (out) = _mm256_permutevar_ps ( in, stbir_00001111 )
   1600 
   1601     static __m256i stbir_22223333 = { STBIR__CONST_4d_32i( 2, 2, 2, 2 ), STBIR__CONST_4d_32i( 3, 3, 3, 3 ) };
   1602     #define stbir__simdf8_0123to22223333( out, in ) (out) = _mm256_permutevar_ps ( in, stbir_22223333 )
   1603 
   1604     #define stbir__simdf8_0123to2222( out, in ) (out) = stbir__simdf_swiz(_mm256_castps256_ps128(in), 2,2,2,2 )
   1605 
   1606     #define stbir__simdf8_load4b( out, ptr ) (out) = _mm256_broadcast_ps( (__m128 const *)(ptr) )
   1607 
   1608     static __m256i stbir_00112233 = { STBIR__CONST_4d_32i( 0, 0, 1, 1 ), STBIR__CONST_4d_32i( 2, 2, 3, 3 ) };
   1609     #define stbir__simdf8_0123to00112233( out, in ) (out) = _mm256_permutevar_ps ( in, stbir_00112233 )
   1610     #define stbir__simdf8_add4( out, a8, b ) (out) = _mm256_add_ps( a8,  _mm256_castps128_ps256( b ) )
   1611 
   1612     static __m256i stbir_load6 = { STBIR__CONST_4_32i( 0x80000000 ), STBIR__CONST_4d_32i(  0x80000000,  0x80000000, 0, 0 ) };
   1613     #define stbir__simdf8_load6z( out, ptr ) (out) = _mm256_maskload_ps( ptr, stbir_load6 )
   1614 
   1615     #define stbir__simdf8_0123to00000000( out, in ) (out) =  _mm256_shuffle_ps ( in, in, (0<<0)+(0<<2)+(0<<4)+(0<<6) )
   1616     #define stbir__simdf8_0123to11111111( out, in ) (out) =  _mm256_shuffle_ps ( in, in, (1<<0)+(1<<2)+(1<<4)+(1<<6) )
   1617     #define stbir__simdf8_0123to22222222( out, in ) (out) =  _mm256_shuffle_ps ( in, in, (2<<0)+(2<<2)+(2<<4)+(2<<6) )
   1618     #define stbir__simdf8_0123to33333333( out, in ) (out) =  _mm256_shuffle_ps ( in, in, (3<<0)+(3<<2)+(3<<4)+(3<<6) )
   1619     #define stbir__simdf8_0123to21032103( out, in ) (out) =  _mm256_shuffle_ps ( in, in, (2<<0)+(1<<2)+(0<<4)+(3<<6) )
   1620     #define stbir__simdf8_0123to32103210( out, in ) (out) =  _mm256_shuffle_ps ( in, in, (3<<0)+(2<<2)+(1<<4)+(0<<6) )
   1621     #define stbir__simdf8_0123to12301230( out, in ) (out) =  _mm256_shuffle_ps ( in, in, (1<<0)+(2<<2)+(3<<4)+(0<<6) )
   1622     #define stbir__simdf8_0123to10321032( out, in ) (out) =  _mm256_shuffle_ps ( in, in, (1<<0)+(0<<2)+(3<<4)+(2<<6) )
   1623     #define stbir__simdf8_0123to30123012( out, in ) (out) =  _mm256_shuffle_ps ( in, in, (3<<0)+(0<<2)+(1<<4)+(2<<6) )
   1624 
   1625     #define stbir__simdf8_0123to11331133( out, in ) (out) =  _mm256_shuffle_ps ( in, in, (1<<0)+(1<<2)+(3<<4)+(3<<6) )
   1626     #define stbir__simdf8_0123to00220022( out, in ) (out) =  _mm256_shuffle_ps ( in, in, (0<<0)+(0<<2)+(2<<4)+(2<<6) )
   1627 
   1628     #define stbir__simdf8_aaa1( out, alp, ones ) (out) = _mm256_blend_ps( alp, ones, (1<<0)+(1<<1)+(1<<2)+(0<<3)+(1<<4)+(1<<5)+(1<<6)+(0<<7)); (out)=_mm256_shuffle_ps( out,out, (3<<0) + (3<<2) + (3<<4) + (0<<6) )
   1629     #define stbir__simdf8_1aaa( out, alp, ones ) (out) = _mm256_blend_ps( alp, ones, (0<<0)+(1<<1)+(1<<2)+(1<<3)+(0<<4)+(1<<5)+(1<<6)+(1<<7)); (out)=_mm256_shuffle_ps( out,out, (1<<0) + (0<<2) + (0<<4) + (0<<6) )
   1630     #define stbir__simdf8_a1a1( out, alp, ones) (out) = _mm256_blend_ps( alp, ones, (1<<0)+(0<<1)+(1<<2)+(0<<3)+(1<<4)+(0<<5)+(1<<6)+(0<<7)); (out)=_mm256_shuffle_ps( out,out, (1<<0) + (0<<2) + (3<<4) + (2<<6) )
   1631     #define stbir__simdf8_1a1a( out, alp, ones) (out) = _mm256_blend_ps( alp, ones, (0<<0)+(1<<1)+(0<<2)+(1<<3)+(0<<4)+(1<<5)+(0<<6)+(1<<7)); (out)=_mm256_shuffle_ps( out,out, (1<<0) + (0<<2) + (3<<4) + (2<<6) )
   1632 
   1633     #define stbir__simdf8_zero( reg ) (reg) = _mm256_setzero_ps()
   1634 
   1635     #ifdef STBIR_USE_FMA           // not on by default to maintain bit identical simd to non-simd
   1636     #define stbir__simdf8_madd( out, add, mul1, mul2 ) (out) = _mm256_fmadd_ps( mul1, mul2, add )
   1637     #define stbir__simdf8_madd_mem( out, add, mul, ptr ) (out) = _mm256_fmadd_ps( mul, _mm256_loadu_ps( (float const*)(ptr) ), add )
   1638     #define stbir__simdf8_madd_mem4( out, add, mul, ptr )(out) = _mm256_fmadd_ps( _mm256_setr_m128( mul, _mm_setzero_ps() ), _mm256_setr_m128( _mm_loadu_ps( (float const*)(ptr) ), _mm_setzero_ps() ), add )
   1639     #else
   1640     #define stbir__simdf8_madd( out, add, mul1, mul2 ) (out) = _mm256_add_ps( add, _mm256_mul_ps( mul1, mul2 ) )
   1641     #define stbir__simdf8_madd_mem( out, add, mul, ptr ) (out) = _mm256_add_ps( add, _mm256_mul_ps( mul, _mm256_loadu_ps( (float const*)(ptr) ) ) )
   1642     #define stbir__simdf8_madd_mem4( out, add, mul, ptr )  (out) = _mm256_add_ps( add, _mm256_setr_m128( _mm_mul_ps( mul, _mm_loadu_ps( (float const*)(ptr) ) ), _mm_setzero_ps() ) )
   1643     #endif
   1644     #define stbir__if_simdf8_cast_to_simdf4( val ) _mm256_castps256_ps128( val )
   1645 
   1646   #endif
   1647 
   1648   #ifdef STBIR_FLOORF
   1649   #undef STBIR_FLOORF
   1650   #endif
   1651   #define STBIR_FLOORF stbir_simd_floorf
   1652   static stbir__inline float stbir_simd_floorf(float x)  // martins floorf
   1653   {
   1654     #if defined(STBIR_AVX) || defined(__SSE4_1__) || defined(STBIR_SSE41)
   1655     __m128 t = _mm_set_ss(x);
   1656     return _mm_cvtss_f32( _mm_floor_ss(t, t) );
   1657     #else
   1658     __m128 f = _mm_set_ss(x);
   1659     __m128 t = _mm_cvtepi32_ps(_mm_cvttps_epi32(f));
   1660     __m128 r = _mm_add_ss(t, _mm_and_ps(_mm_cmplt_ss(f, t), _mm_set_ss(-1.0f)));
   1661     return _mm_cvtss_f32(r);
   1662     #endif
   1663   }
   1664 
   1665   #ifdef STBIR_CEILF
   1666   #undef STBIR_CEILF
   1667   #endif
   1668   #define STBIR_CEILF stbir_simd_ceilf
   1669   static stbir__inline float stbir_simd_ceilf(float x)  // martins ceilf
   1670   {
   1671     #if defined(STBIR_AVX) || defined(__SSE4_1__) || defined(STBIR_SSE41)
   1672     __m128 t = _mm_set_ss(x);
   1673     return _mm_cvtss_f32( _mm_ceil_ss(t, t) );
   1674     #else
   1675     __m128 f = _mm_set_ss(x);
   1676     __m128 t = _mm_cvtepi32_ps(_mm_cvttps_epi32(f));
   1677     __m128 r = _mm_add_ss(t, _mm_and_ps(_mm_cmplt_ss(t, f), _mm_set_ss(1.0f)));
   1678     return _mm_cvtss_f32(r);
   1679     #endif
   1680   }
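          // In the non-SSE4.1 fallbacks above, _mm_cvttps_epi32 truncates toward zero; the compare
          // produces an all-ones mask that, ANDed with -1.0f (floor) or +1.0f (ceil), adds the
          // correction only when truncation landed on the wrong side of the input.
          // e.g. floorf(-1.25f): truncation gives -1.0f, (-1.25 < -1.0) is true, so -1.0 + (-1.0) = -2.0f.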
   1681 
   1682 #elif defined(STBIR_NEON)
   1683 
   1684   #include <arm_neon.h>
   1685 
   1686   #define stbir__simdf float32x4_t
   1687   #define stbir__simdi uint32x4_t
   1688 
   1689   #define stbir_simdi_castf( reg ) vreinterpretq_u32_f32(reg)
   1690   #define stbir_simdf_casti( reg ) vreinterpretq_f32_u32(reg)
   1691 
   1692   #define stbir__simdf_load( reg, ptr ) (reg) = vld1q_f32( (float const*)(ptr) )
   1693   #define stbir__simdi_load( reg, ptr ) (reg) = vld1q_u32( (uint32_t const*)(ptr) )
   1694   #define stbir__simdf_load1( out, ptr ) (out) = vld1q_dup_f32( (float const*)(ptr) ) // top values can be random (not denormal or nan for perf)
   1695   #define stbir__simdi_load1( out, ptr ) (out) = vld1q_dup_u32( (uint32_t const*)(ptr) )
   1696   #define stbir__simdf_load1z( out, ptr ) (out) = vld1q_lane_f32( (float const*)(ptr), vdupq_n_f32(0), 0 ) // top values must be zero
   1697   #define stbir__simdf_frep4( fvar ) vdupq_n_f32( fvar )
   1698   #define stbir__simdf_load1frep4( out, fvar ) (out) = vdupq_n_f32( fvar )
   1699   #define stbir__simdf_load2( out, ptr ) (out) = vcombine_f32( vld1_f32( (float const*)(ptr) ), vcreate_f32(0) ) // top values can be random (not denormal or nan for perf)
   1700   #define stbir__simdf_load2z( out, ptr ) (out) = vcombine_f32( vld1_f32( (float const*)(ptr) ), vcreate_f32(0) )  // top values must be zero
   1701   #define stbir__simdf_load2hmerge( out, reg, ptr ) (out) = vcombine_f32( vget_low_f32(reg), vld1_f32( (float const*)(ptr) ) )
   1702 
   1703   #define stbir__simdf_zeroP() vdupq_n_f32(0)
   1704   #define stbir__simdf_zero( reg ) (reg) = vdupq_n_f32(0)
   1705 
   1706   #define stbir__simdf_store( ptr, reg )  vst1q_f32( (float*)(ptr), reg )
   1707   #define stbir__simdf_store1( ptr, reg ) vst1q_lane_f32( (float*)(ptr), reg, 0)
   1708   #define stbir__simdf_store2( ptr, reg ) vst1_f32( (float*)(ptr), vget_low_f32(reg) )
   1709   #define stbir__simdf_store2h( ptr, reg ) vst1_f32( (float*)(ptr), vget_high_f32(reg) )
   1710 
   1711   #define stbir__simdi_store( ptr, reg )  vst1q_u32( (uint32_t*)(ptr), reg )
   1712   #define stbir__simdi_store1( ptr, reg ) vst1q_lane_u32( (uint32_t*)(ptr), reg, 0 )
   1713   #define stbir__simdi_store2( ptr, reg ) vst1_u32( (uint32_t*)(ptr), vget_low_u32(reg) )
   1714 
   1715   #define stbir__prefetch( ptr )
   1716 
   1717   #define stbir__simdi_expand_u8_to_u32(out0,out1,out2,out3,ireg) \
   1718   { \
   1719     uint16x8_t l = vmovl_u8( vget_low_u8 ( vreinterpretq_u8_u32(ireg) ) ); \
   1720     uint16x8_t h = vmovl_u8( vget_high_u8( vreinterpretq_u8_u32(ireg) ) ); \
   1721     out0 = vmovl_u16( vget_low_u16 ( l ) ); \
   1722     out1 = vmovl_u16( vget_high_u16( l ) ); \
   1723     out2 = vmovl_u16( vget_low_u16 ( h ) ); \
   1724     out3 = vmovl_u16( vget_high_u16( h ) ); \
   1725   }
   1726 
   1727   #define stbir__simdi_expand_u8_to_1u32(out,ireg) \
   1728   { \
   1729     uint16x8_t tmp = vmovl_u8( vget_low_u8( vreinterpretq_u8_u32(ireg) ) ); \
   1730     out = vmovl_u16( vget_low_u16( tmp ) ); \
   1731   }
   1732 
   1733   #define stbir__simdi_expand_u16_to_u32(out0,out1,ireg) \
   1734   { \
   1735     uint16x8_t tmp = vreinterpretq_u16_u32(ireg); \
   1736     out0 = vmovl_u16( vget_low_u16 ( tmp ) ); \
   1737     out1 = vmovl_u16( vget_high_u16( tmp ) ); \
   1738   }
   1739 
   1740   #define stbir__simdf_convert_float_to_i32( i, f ) (i) = vreinterpretq_u32_s32( vcvtq_s32_f32(f) )
   1741   #define stbir__simdf_convert_float_to_int( f ) vgetq_lane_s32(vcvtq_s32_f32(f), 0)
   1742   #define stbir__simdi_to_int( i ) (int)vgetq_lane_u32(i, 0)
   1743   #define stbir__simdf_convert_float_to_uint8( f ) ((unsigned char)vgetq_lane_s32(vcvtq_s32_f32(vmaxq_f32(vminq_f32(f,STBIR__CONSTF(STBIR_max_uint8_as_float)),vdupq_n_f32(0))), 0))
   1744   #define stbir__simdf_convert_float_to_short( f ) ((unsigned short)vgetq_lane_s32(vcvtq_s32_f32(vmaxq_f32(vminq_f32(f,STBIR__CONSTF(STBIR_max_uint16_as_float)),vdupq_n_f32(0))), 0))
   1745   #define stbir__simdi_convert_i32_to_float(out, ireg) (out) = vcvtq_f32_s32( vreinterpretq_s32_u32(ireg) )
   1746   #define stbir__simdf_add( out, reg0, reg1 ) (out) = vaddq_f32( reg0, reg1 )
   1747   #define stbir__simdf_mult( out, reg0, reg1 ) (out) = vmulq_f32( reg0, reg1 )
   1748   #define stbir__simdf_mult_mem( out, reg, ptr ) (out) = vmulq_f32( reg, vld1q_f32( (float const*)(ptr) ) )
   1749   #define stbir__simdf_mult1_mem( out, reg, ptr ) (out) = vmulq_f32( reg, vld1q_dup_f32( (float const*)(ptr) ) )
   1750   #define stbir__simdf_add_mem( out, reg, ptr ) (out) = vaddq_f32( reg, vld1q_f32( (float const*)(ptr) ) )
   1751   #define stbir__simdf_add1_mem( out, reg, ptr ) (out) = vaddq_f32( reg, vld1q_dup_f32( (float const*)(ptr) ) )
   1752 
   1753   #ifdef STBIR_USE_FMA           // not on by default to maintain bit identical simd to non-simd (and also x64 no madd to arm madd)
   1754   #define stbir__simdf_madd( out, add, mul1, mul2 ) (out) = vfmaq_f32( add, mul1, mul2 )
   1755   #define stbir__simdf_madd1( out, add, mul1, mul2 ) (out) = vfmaq_f32( add, mul1, mul2 )
   1756   #define stbir__simdf_madd_mem( out, add, mul, ptr ) (out) = vfmaq_f32( add, mul, vld1q_f32( (float const*)(ptr) ) )
   1757   #define stbir__simdf_madd1_mem( out, add, mul, ptr ) (out) = vfmaq_f32( add, mul, vld1q_dup_f32( (float const*)(ptr) ) )
   1758   #else
   1759   #define stbir__simdf_madd( out, add, mul1, mul2 ) (out) = vaddq_f32( add, vmulq_f32( mul1, mul2 ) )
   1760   #define stbir__simdf_madd1( out, add, mul1, mul2 ) (out) = vaddq_f32( add, vmulq_f32( mul1, mul2 ) )
   1761   #define stbir__simdf_madd_mem( out, add, mul, ptr ) (out) = vaddq_f32( add, vmulq_f32( mul, vld1q_f32( (float const*)(ptr) ) ) )
   1762   #define stbir__simdf_madd1_mem( out, add, mul, ptr ) (out) = vaddq_f32( add, vmulq_f32( mul, vld1q_dup_f32( (float const*)(ptr) ) ) )
   1763   #endif
   1764 
   1765   #define stbir__simdf_add1( out, reg0, reg1 ) (out) = vaddq_f32( reg0, reg1 )
   1766   #define stbir__simdf_mult1( out, reg0, reg1 ) (out) = vmulq_f32( reg0, reg1 )
   1767 
   1768   #define stbir__simdf_and( out, reg0, reg1 ) (out) = vreinterpretq_f32_u32( vandq_u32( vreinterpretq_u32_f32(reg0), vreinterpretq_u32_f32(reg1) ) )
   1769   #define stbir__simdf_or( out, reg0, reg1 ) (out) = vreinterpretq_f32_u32( vorrq_u32( vreinterpretq_u32_f32(reg0), vreinterpretq_u32_f32(reg1) ) )
   1770 
   1771   #define stbir__simdf_min( out, reg0, reg1 ) (out) = vminq_f32( reg0, reg1 )
   1772   #define stbir__simdf_max( out, reg0, reg1 ) (out) = vmaxq_f32( reg0, reg1 )
   1773   #define stbir__simdf_min1( out, reg0, reg1 ) (out) = vminq_f32( reg0, reg1 )
   1774   #define stbir__simdf_max1( out, reg0, reg1 ) (out) = vmaxq_f32( reg0, reg1 )
   1775 
   1776   #define stbir__simdf_0123ABCDto3ABx( out, reg0, reg1 ) (out) = vextq_f32( reg0, reg1, 3 )
   1777   #define stbir__simdf_0123ABCDto23Ax( out, reg0, reg1 ) (out) = vextq_f32( reg0, reg1, 2 )
   1778 
   1779   #define stbir__simdf_a1a1( out, alp, ones ) (out) = vzipq_f32(vuzpq_f32(alp, alp).val[1], ones).val[0]
   1780   #define stbir__simdf_1a1a( out, alp, ones ) (out) = vzipq_f32(ones, vuzpq_f32(alp, alp).val[0]).val[0]
   1781 
   1782   #if defined( _M_ARM64 ) || defined( __aarch64__ ) || defined( __arm64__ )
   1783 
   1784     #define stbir__simdf_aaa1( out, alp, ones ) (out) = vcopyq_laneq_f32(vdupq_n_f32(vgetq_lane_f32(alp, 3)), 3, ones, 3)
   1785     #define stbir__simdf_1aaa( out, alp, ones ) (out) = vcopyq_laneq_f32(vdupq_n_f32(vgetq_lane_f32(alp, 0)), 0, ones, 0)
   1786 
   1787     #if defined( _MSC_VER ) && !defined(__clang__)
   1788       #define stbir_make16(a,b,c,d) vcombine_u8( \
   1789         vcreate_u8( (4*a+0) | ((4*a+1)<<8) | ((4*a+2)<<16) | ((4*a+3)<<24) | \
   1790           ((stbir_uint64)(4*b+0)<<32) | ((stbir_uint64)(4*b+1)<<40) | ((stbir_uint64)(4*b+2)<<48) | ((stbir_uint64)(4*b+3)<<56)), \
   1791         vcreate_u8( (4*c+0) | ((4*c+1)<<8) | ((4*c+2)<<16) | ((4*c+3)<<24) | \
   1792           ((stbir_uint64)(4*d+0)<<32) | ((stbir_uint64)(4*d+1)<<40) | ((stbir_uint64)(4*d+2)<<48) | ((stbir_uint64)(4*d+3)<<56) ) )
   1793 
   1794       static stbir__inline uint8x16x2_t stbir_make16x2(float32x4_t rega,float32x4_t regb)
   1795       {
   1796         uint8x16x2_t r = { vreinterpretq_u8_f32(rega), vreinterpretq_u8_f32(regb) };
   1797         return r;
   1798       }
   1799     #else
   1800       #define stbir_make16(a,b,c,d) (uint8x16_t){4*a+0,4*a+1,4*a+2,4*a+3,4*b+0,4*b+1,4*b+2,4*b+3,4*c+0,4*c+1,4*c+2,4*c+3,4*d+0,4*d+1,4*d+2,4*d+3}
   1801       #define stbir_make16x2(a,b) (uint8x16x2_t){{vreinterpretq_u8_f32(a),vreinterpretq_u8_f32(b)}}
   1802     #endif
   1803 
   1804     #define stbir__simdf_swiz( reg, one, two, three, four ) vreinterpretq_f32_u8( vqtbl1q_u8( vreinterpretq_u8_f32(reg), stbir_make16(one, two, three, four) ) )
   1805     #define stbir__simdf_swiz2( rega, regb, one, two, three, four ) vreinterpretq_f32_u8( vqtbl2q_u8( stbir_make16x2(rega,regb), stbir_make16(one, two, three, four) ) )
   1806 
   1807     #define stbir__simdi_16madd( out, reg0, reg1 ) \
   1808     { \
   1809       int16x8_t r0 = vreinterpretq_s16_u32(reg0); \
   1810       int16x8_t r1 = vreinterpretq_s16_u32(reg1); \
   1811       int32x4_t tmp0 = vmull_s16( vget_low_s16(r0), vget_low_s16(r1) ); \
   1812       int32x4_t tmp1 = vmull_s16( vget_high_s16(r0), vget_high_s16(r1) ); \
   1813       (out) = vreinterpretq_u32_s32( vpaddq_s32(tmp0, tmp1) ); \
   1814     }
   1815 
   1816   #else
   1817 
   1818     #define stbir__simdf_aaa1( out, alp, ones ) (out) = vsetq_lane_f32(1.0f, vdupq_n_f32(vgetq_lane_f32(alp, 3)), 3)
   1819     #define stbir__simdf_1aaa( out, alp, ones ) (out) = vsetq_lane_f32(1.0f, vdupq_n_f32(vgetq_lane_f32(alp, 0)), 0)
   1820 
   1821     #if defined( _MSC_VER ) && !defined(__clang__)
   1822       static stbir__inline uint8x8x2_t stbir_make8x2(float32x4_t reg)
   1823       {
   1824         uint8x8x2_t r = { { vget_low_u8(vreinterpretq_u8_f32(reg)), vget_high_u8(vreinterpretq_u8_f32(reg)) } };
   1825         return r;
   1826       }
   1827       #define stbir_make8(a,b) vcreate_u8( \
   1828         (4*a+0) | ((4*a+1)<<8) | ((4*a+2)<<16) | ((4*a+3)<<24) | \
   1829         ((stbir_uint64)(4*b+0)<<32) | ((stbir_uint64)(4*b+1)<<40) | ((stbir_uint64)(4*b+2)<<48) | ((stbir_uint64)(4*b+3)<<56) )
   1830     #else
   1831       #define stbir_make8x2(reg) (uint8x8x2_t){ { vget_low_u8(vreinterpretq_u8_f32(reg)), vget_high_u8(vreinterpretq_u8_f32(reg)) } }
   1832       #define stbir_make8(a,b) (uint8x8_t){4*a+0,4*a+1,4*a+2,4*a+3,4*b+0,4*b+1,4*b+2,4*b+3}
   1833     #endif
   1834 
   1835     #define stbir__simdf_swiz( reg, one, two, three, four ) vreinterpretq_f32_u8( vcombine_u8( \
   1836         vtbl2_u8( stbir_make8x2( reg ), stbir_make8( one, two ) ), \
   1837         vtbl2_u8( stbir_make8x2( reg ), stbir_make8( three, four ) ) ) )
   1838 
   1839     #define stbir__simdi_16madd( out, reg0, reg1 ) \
   1840     { \
   1841       int16x8_t r0 = vreinterpretq_s16_u32(reg0); \
   1842       int16x8_t r1 = vreinterpretq_s16_u32(reg1); \
   1843       int32x4_t tmp0 = vmull_s16( vget_low_s16(r0), vget_low_s16(r1) ); \
   1844       int32x4_t tmp1 = vmull_s16( vget_high_s16(r0), vget_high_s16(r1) ); \
   1845       int32x2_t out0 = vpadd_s32( vget_low_s32(tmp0), vget_high_s32(tmp0) ); \
   1846       int32x2_t out1 = vpadd_s32( vget_low_s32(tmp1), vget_high_s32(tmp1) ); \
   1847       (out) = vreinterpretq_u32_s32( vcombine_s32(out0, out1) ); \
   1848     }
   1849 
   1850   #endif
   1851 
   1852   #define stbir__simdi_and( out, reg0, reg1 ) (out) = vandq_u32( reg0, reg1 )
   1853   #define stbir__simdi_or( out, reg0, reg1 ) (out) = vorrq_u32( reg0, reg1 )
   1854 
   1855   #define stbir__simdf_pack_to_8bytes(out,aa,bb) \
   1856   { \
   1857     float32x4_t af = vmaxq_f32( vminq_f32(aa,STBIR__CONSTF(STBIR_max_uint8_as_float) ), vdupq_n_f32(0) ); \
   1858     float32x4_t bf = vmaxq_f32( vminq_f32(bb,STBIR__CONSTF(STBIR_max_uint8_as_float) ), vdupq_n_f32(0) ); \
   1859     int16x4_t ai = vqmovn_s32( vcvtq_s32_f32( af ) ); \
   1860     int16x4_t bi = vqmovn_s32( vcvtq_s32_f32( bf ) ); \
   1861     uint8x8_t out8 = vqmovun_s16( vcombine_s16(ai, bi) ); \
   1862     out = vreinterpretq_u32_u8( vcombine_u8(out8, out8) ); \
   1863   }
   1864 
   1865   #define stbir__simdf_pack_to_8words(out,aa,bb) \
   1866   { \
   1867     float32x4_t af = vmaxq_f32( vminq_f32(aa,STBIR__CONSTF(STBIR_max_uint16_as_float) ), vdupq_n_f32(0) ); \
   1868     float32x4_t bf = vmaxq_f32( vminq_f32(bb,STBIR__CONSTF(STBIR_max_uint16_as_float) ), vdupq_n_f32(0) ); \
   1869     int32x4_t ai = vcvtq_s32_f32( af ); \
   1870     int32x4_t bi = vcvtq_s32_f32( bf ); \
   1871     out = vreinterpretq_u32_u16( vcombine_u16(vqmovun_s32(ai), vqmovun_s32(bi)) ); \
   1872   }
   1873 
   1874   #define stbir__interleave_pack_and_store_16_u8( ptr, r0, r1, r2, r3 ) \
   1875   { \
   1876     int16x4x2_t tmp0 = vzip_s16( vqmovn_s32(vreinterpretq_s32_u32(r0)), vqmovn_s32(vreinterpretq_s32_u32(r2)) ); \
   1877     int16x4x2_t tmp1 = vzip_s16( vqmovn_s32(vreinterpretq_s32_u32(r1)), vqmovn_s32(vreinterpretq_s32_u32(r3)) ); \
   1878     uint8x8x2_t out = \
   1879     { { \
   1880       vqmovun_s16( vcombine_s16(tmp0.val[0], tmp0.val[1]) ), \
   1881       vqmovun_s16( vcombine_s16(tmp1.val[0], tmp1.val[1]) ), \
   1882     } }; \
   1883     vst2_u8(ptr, out); \
   1884   }
   1885 
   1886   #define stbir__simdf_load4_transposed( o0, o1, o2, o3, ptr ) \
   1887   { \
   1888     float32x4x4_t tmp = vld4q_f32(ptr); \
   1889     o0 = tmp.val[0]; \
   1890     o1 = tmp.val[1]; \
   1891     o2 = tmp.val[2]; \
   1892     o3 = tmp.val[3]; \
   1893   }
   1894 
   1895   #define stbir__simdi_32shr( out, reg, imm ) out = vshrq_n_u32( reg, imm )
   1896 
   1897   #if defined( _MSC_VER ) && !defined(__clang__)
   1898     #define STBIR__SIMDF_CONST(var, x) __declspec(align(8)) float var[] = { x, x, x, x }
   1899     #define STBIR__SIMDI_CONST(var, x) __declspec(align(8)) uint32_t var[] = { x, x, x, x }
   1900     #define STBIR__CONSTF(var) (*(const float32x4_t*)var)
   1901     #define STBIR__CONSTI(var) (*(const uint32x4_t*)var)
   1902   #else
   1903     #define STBIR__SIMDF_CONST(var, x) stbir__simdf var = { x, x, x, x }
   1904     #define STBIR__SIMDI_CONST(var, x) stbir__simdi var = { x, x, x, x }
   1905     #define STBIR__CONSTF(var) (var)
   1906     #define STBIR__CONSTI(var) (var)
   1907   #endif
   1908 
   1909   #ifdef STBIR_FLOORF
   1910   #undef STBIR_FLOORF
   1911   #endif
   1912   #define STBIR_FLOORF stbir_simd_floorf
   1913   static stbir__inline float stbir_simd_floorf(float x)
   1914   {
   1915     #if defined( _M_ARM64 ) || defined( __aarch64__ ) || defined( __arm64__ )
   1916     return vget_lane_f32( vrndm_f32( vdup_n_f32(x) ), 0);
   1917     #else
   1918     float32x2_t f = vdup_n_f32(x);
   1919     float32x2_t t = vcvt_f32_s32(vcvt_s32_f32(f));
   1920     uint32x2_t a = vclt_f32(f, t);
   1921     uint32x2_t b = vreinterpret_u32_f32(vdup_n_f32(-1.0f));
   1922     float32x2_t r = vadd_f32(t, vreinterpret_f32_u32(vand_u32(a, b)));
   1923     return vget_lane_f32(r, 0);
   1924     #endif
   1925   }
   1926 
   1927   #ifdef STBIR_CEILF
   1928   #undef STBIR_CEILF
   1929   #endif
   1930   #define STBIR_CEILF stbir_simd_ceilf
   1931   static stbir__inline float stbir_simd_ceilf(float x)
   1932   {
   1933     #if defined( _M_ARM64 ) || defined( __aarch64__ ) || defined( __arm64__ )
   1934     return vget_lane_f32( vrndp_f32( vdup_n_f32(x) ), 0);
   1935     #else
   1936     float32x2_t f = vdup_n_f32(x);
   1937     float32x2_t t = vcvt_f32_s32(vcvt_s32_f32(f));
   1938     uint32x2_t a = vclt_f32(t, f);
   1939     uint32x2_t b = vreinterpret_u32_f32(vdup_n_f32(1.0f));
   1940     float32x2_t r = vadd_f32(t, vreinterpret_f32_u32(vand_u32(a, b)));
   1941     return vget_lane_f32(r, 0);
   1942     #endif
   1943   }
   1944 
   1945   #define STBIR_SIMD
   1946 
   1947 #elif defined(STBIR_WASM)
   1948 
   1949   #include <wasm_simd128.h>
   1950 
   1951   #define stbir__simdf v128_t
   1952   #define stbir__simdi v128_t
   1953 
   1954   #define stbir_simdi_castf( reg ) (reg)
   1955   #define stbir_simdf_casti( reg ) (reg)
   1956 
   1957   #define stbir__simdf_load( reg, ptr )             (reg) = wasm_v128_load( (void const*)(ptr) )
   1958   #define stbir__simdi_load( reg, ptr )             (reg) = wasm_v128_load( (void const*)(ptr) )
   1959   #define stbir__simdf_load1( out, ptr )            (out) = wasm_v128_load32_splat( (void const*)(ptr) ) // top values can be random (not denormal or nan for perf)
   1960   #define stbir__simdi_load1( out, ptr )            (out) = wasm_v128_load32_splat( (void const*)(ptr) )
   1961   #define stbir__simdf_load1z( out, ptr )           (out) = wasm_v128_load32_zero( (void const*)(ptr) ) // top values must be zero
   1962   #define stbir__simdf_frep4( fvar )                wasm_f32x4_splat( fvar )
   1963   #define stbir__simdf_load1frep4( out, fvar )      (out) = wasm_f32x4_splat( fvar )
   1964   #define stbir__simdf_load2( out, ptr )            (out) = wasm_v128_load64_splat( (void const*)(ptr) ) // top values can be random (not denormal or nan for perf)
   1965   #define stbir__simdf_load2z( out, ptr )           (out) = wasm_v128_load64_zero( (void const*)(ptr) ) // top values must be zero
   1966   #define stbir__simdf_load2hmerge( out, reg, ptr ) (out) = wasm_v128_load64_lane( (void const*)(ptr), reg, 1 )
   1967 
   1968   #define stbir__simdf_zeroP() wasm_f32x4_const_splat(0)
   1969   #define stbir__simdf_zero( reg ) (reg) = wasm_f32x4_const_splat(0)
   1970 
   1971   #define stbir__simdf_store( ptr, reg )   wasm_v128_store( (void*)(ptr), reg )
   1972   #define stbir__simdf_store1( ptr, reg )  wasm_v128_store32_lane( (void*)(ptr), reg, 0 )
   1973   #define stbir__simdf_store2( ptr, reg )  wasm_v128_store64_lane( (void*)(ptr), reg, 0 )
   1974   #define stbir__simdf_store2h( ptr, reg ) wasm_v128_store64_lane( (void*)(ptr), reg, 1 )
   1975 
   1976   #define stbir__simdi_store( ptr, reg )  wasm_v128_store( (void*)(ptr), reg )
   1977   #define stbir__simdi_store1( ptr, reg ) wasm_v128_store32_lane( (void*)(ptr), reg, 0 )
   1978   #define stbir__simdi_store2( ptr, reg ) wasm_v128_store64_lane( (void*)(ptr), reg, 0 )
   1979 
   1980   #define stbir__prefetch( ptr )
   1981 
   1982   #define stbir__simdi_expand_u8_to_u32(out0,out1,out2,out3,ireg) \
   1983   { \
   1984     v128_t l = wasm_u16x8_extend_low_u8x16 ( ireg ); \
   1985     v128_t h = wasm_u16x8_extend_high_u8x16( ireg ); \
   1986     out0 = wasm_u32x4_extend_low_u16x8 ( l ); \
   1987     out1 = wasm_u32x4_extend_high_u16x8( l ); \
   1988     out2 = wasm_u32x4_extend_low_u16x8 ( h ); \
   1989     out3 = wasm_u32x4_extend_high_u16x8( h ); \
   1990   }
   1991 
   1992   #define stbir__simdi_expand_u8_to_1u32(out,ireg) \
   1993   { \
   1994     v128_t tmp = wasm_u16x8_extend_low_u8x16(ireg); \
   1995     out = wasm_u32x4_extend_low_u16x8(tmp); \
   1996   }
   1997 
   1998   #define stbir__simdi_expand_u16_to_u32(out0,out1,ireg) \
   1999   { \
   2000     out0 = wasm_u32x4_extend_low_u16x8 ( ireg ); \
   2001     out1 = wasm_u32x4_extend_high_u16x8( ireg ); \
   2002   }
   2003 
   2004   #define stbir__simdf_convert_float_to_i32( i, f )    (i) = wasm_i32x4_trunc_sat_f32x4(f)
   2005   #define stbir__simdf_convert_float_to_int( f )       wasm_i32x4_extract_lane(wasm_i32x4_trunc_sat_f32x4(f), 0)
   2006   #define stbir__simdi_to_int( i )                     wasm_i32x4_extract_lane(i, 0)
   2007   #define stbir__simdf_convert_float_to_uint8( f )     ((unsigned char)wasm_i32x4_extract_lane(wasm_i32x4_trunc_sat_f32x4(wasm_f32x4_max(wasm_f32x4_min(f,STBIR_max_uint8_as_float),wasm_f32x4_const_splat(0))), 0))
   2008   #define stbir__simdf_convert_float_to_short( f )     ((unsigned short)wasm_i32x4_extract_lane(wasm_i32x4_trunc_sat_f32x4(wasm_f32x4_max(wasm_f32x4_min(f,STBIR_max_uint16_as_float),wasm_f32x4_const_splat(0))), 0))
   2009   #define stbir__simdi_convert_i32_to_float(out, ireg) (out) = wasm_f32x4_convert_i32x4(ireg)
   2010   #define stbir__simdf_add( out, reg0, reg1 )          (out) = wasm_f32x4_add( reg0, reg1 )
   2011   #define stbir__simdf_mult( out, reg0, reg1 )         (out) = wasm_f32x4_mul( reg0, reg1 )
   2012   #define stbir__simdf_mult_mem( out, reg, ptr )       (out) = wasm_f32x4_mul( reg, wasm_v128_load( (void const*)(ptr) ) )
   2013   #define stbir__simdf_mult1_mem( out, reg, ptr )      (out) = wasm_f32x4_mul( reg, wasm_v128_load32_splat( (void const*)(ptr) ) )
   2014   #define stbir__simdf_add_mem( out, reg, ptr )        (out) = wasm_f32x4_add( reg, wasm_v128_load( (void const*)(ptr) ) )
   2015   #define stbir__simdf_add1_mem( out, reg, ptr )       (out) = wasm_f32x4_add( reg, wasm_v128_load32_splat( (void const*)(ptr) ) )
   2016 
   2017   #define stbir__simdf_madd( out, add, mul1, mul2 )    (out) = wasm_f32x4_add( add, wasm_f32x4_mul( mul1, mul2 ) )
   2018   #define stbir__simdf_madd1( out, add, mul1, mul2 )   (out) = wasm_f32x4_add( add, wasm_f32x4_mul( mul1, mul2 ) )
   2019   #define stbir__simdf_madd_mem( out, add, mul, ptr )  (out) = wasm_f32x4_add( add, wasm_f32x4_mul( mul, wasm_v128_load( (void const*)(ptr) ) ) )
   2020   #define stbir__simdf_madd1_mem( out, add, mul, ptr ) (out) = wasm_f32x4_add( add, wasm_f32x4_mul( mul, wasm_v128_load32_splat( (void const*)(ptr) ) ) )
   2021 
   2022   #define stbir__simdf_add1( out, reg0, reg1 )  (out) = wasm_f32x4_add( reg0, reg1 )
   2023   #define stbir__simdf_mult1( out, reg0, reg1 ) (out) = wasm_f32x4_mul( reg0, reg1 )
   2024 
   2025   #define stbir__simdf_and( out, reg0, reg1 ) (out) = wasm_v128_and( reg0, reg1 )
   2026   #define stbir__simdf_or( out, reg0, reg1 )  (out) = wasm_v128_or( reg0, reg1 )
   2027 
   2028   #define stbir__simdf_min( out, reg0, reg1 ) (out) = wasm_f32x4_min( reg0, reg1 )
   2029   #define stbir__simdf_max( out, reg0, reg1 ) (out) = wasm_f32x4_max( reg0, reg1 )
   2030   #define stbir__simdf_min1( out, reg0, reg1 ) (out) = wasm_f32x4_min( reg0, reg1 )
   2031   #define stbir__simdf_max1( out, reg0, reg1 ) (out) = wasm_f32x4_max( reg0, reg1 )
   2032 
   2033   #define stbir__simdf_0123ABCDto3ABx( out, reg0, reg1 ) (out) = wasm_i32x4_shuffle( reg0, reg1, 3, 4, 5, -1 )
   2034   #define stbir__simdf_0123ABCDto23Ax( out, reg0, reg1 ) (out) = wasm_i32x4_shuffle( reg0, reg1, 2, 3, 4, -1 )
   2035 
   2036   #define stbir__simdf_aaa1(out,alp,ones) (out) = wasm_i32x4_shuffle(alp, ones, 3, 3, 3, 4)
   2037   #define stbir__simdf_1aaa(out,alp,ones) (out) = wasm_i32x4_shuffle(alp, ones, 4, 0, 0, 0)
   2038   #define stbir__simdf_a1a1(out,alp,ones) (out) = wasm_i32x4_shuffle(alp, ones, 1, 4, 3, 4)
   2039   #define stbir__simdf_1a1a(out,alp,ones) (out) = wasm_i32x4_shuffle(alp, ones, 4, 0, 4, 2)
   2040 
   2041   #define stbir__simdf_swiz( reg, one, two, three, four ) wasm_i32x4_shuffle(reg, reg, one, two, three, four)
   2042 
   2043   #define stbir__simdi_and( out, reg0, reg1 )    (out) = wasm_v128_and( reg0, reg1 )
   2044   #define stbir__simdi_or( out, reg0, reg1 )     (out) = wasm_v128_or( reg0, reg1 )
   2045   #define stbir__simdi_16madd( out, reg0, reg1 ) (out) = wasm_i32x4_dot_i16x8( reg0, reg1 )
   2046 
   2047   #define stbir__simdf_pack_to_8bytes(out,aa,bb) \
   2048   { \
   2049     v128_t af = wasm_f32x4_max( wasm_f32x4_min(aa, STBIR_max_uint8_as_float), wasm_f32x4_const_splat(0) ); \
   2050     v128_t bf = wasm_f32x4_max( wasm_f32x4_min(bb, STBIR_max_uint8_as_float), wasm_f32x4_const_splat(0) ); \
   2051     v128_t ai = wasm_i32x4_trunc_sat_f32x4( af ); \
   2052     v128_t bi = wasm_i32x4_trunc_sat_f32x4( bf ); \
   2053     v128_t out16 = wasm_i16x8_narrow_i32x4( ai, bi ); \
   2054     out = wasm_u8x16_narrow_i16x8( out16, out16 ); \
   2055   }
   2056 
   2057   #define stbir__simdf_pack_to_8words(out,aa,bb) \
   2058   { \
   2059     v128_t af = wasm_f32x4_max( wasm_f32x4_min(aa, STBIR_max_uint16_as_float), wasm_f32x4_const_splat(0)); \
   2060     v128_t bf = wasm_f32x4_max( wasm_f32x4_min(bb, STBIR_max_uint16_as_float), wasm_f32x4_const_splat(0)); \
   2061     v128_t ai = wasm_i32x4_trunc_sat_f32x4( af ); \
   2062     v128_t bi = wasm_i32x4_trunc_sat_f32x4( bf ); \
   2063     out = wasm_u16x8_narrow_i32x4( ai, bi ); \
   2064   }
   2065 
   2066   #define stbir__interleave_pack_and_store_16_u8( ptr, r0, r1, r2, r3 ) \
   2067   { \
   2068     v128_t tmp0 = wasm_i16x8_narrow_i32x4(r0, r1); \
   2069     v128_t tmp1 = wasm_i16x8_narrow_i32x4(r2, r3); \
   2070     v128_t tmp = wasm_u8x16_narrow_i16x8(tmp0, tmp1); \
   2071     tmp = wasm_i8x16_shuffle(tmp, tmp, 0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15); \
   2072     wasm_v128_store( (void*)(ptr), tmp); \
   2073   }
   2074 
   2075   #define stbir__simdf_load4_transposed( o0, o1, o2, o3, ptr ) \
   2076   { \
   2077     v128_t t0 = wasm_v128_load( ptr    ); \
   2078     v128_t t1 = wasm_v128_load( ptr+4  ); \
   2079     v128_t t2 = wasm_v128_load( ptr+8  ); \
   2080     v128_t t3 = wasm_v128_load( ptr+12 ); \
   2081     v128_t s0 = wasm_i32x4_shuffle(t0, t1, 0, 4, 2, 6); \
   2082     v128_t s1 = wasm_i32x4_shuffle(t0, t1, 1, 5, 3, 7); \
   2083     v128_t s2 = wasm_i32x4_shuffle(t2, t3, 0, 4, 2, 6); \
   2084     v128_t s3 = wasm_i32x4_shuffle(t2, t3, 1, 5, 3, 7); \
   2085     o0 = wasm_i32x4_shuffle(s0, s2, 0, 1, 4, 5); \
   2086     o1 = wasm_i32x4_shuffle(s1, s3, 0, 1, 4, 5); \
   2087     o2 = wasm_i32x4_shuffle(s0, s2, 2, 3, 6, 7); \
   2088     o3 = wasm_i32x4_shuffle(s1, s3, 2, 3, 6, 7); \
   2089   }
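         // Note on the macro above: stbir__simdf_load4_transposed loads 16 consecutive floats
         // (four 4-float vectors starting at ptr) and transposes them, so each output holds one
         // "column": o0 = { ptr[0], ptr[4], ptr[8], ptr[12] }, o1 = { ptr[1], ptr[5], ptr[9], ptr[13] }, etc.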
   2090 
   2091   #define stbir__simdi_32shr( out, reg, imm ) out = wasm_u32x4_shr( reg, imm )
   2092 
   2093   typedef float stbir__f32x4 __attribute__((__vector_size__(16), __aligned__(16)));
   2094   #define STBIR__SIMDF_CONST(var, x) stbir__simdf var = (v128_t)(stbir__f32x4){ x, x, x, x }
   2095   #define STBIR__SIMDI_CONST(var, x) stbir__simdi var = { x, x, x, x }
   2096   #define STBIR__CONSTF(var) (var)
   2097   #define STBIR__CONSTI(var) (var)
   2098 
   2099   #ifdef STBIR_FLOORF
   2100   #undef STBIR_FLOORF
   2101   #endif
   2102   #define STBIR_FLOORF stbir_simd_floorf
   2103   static stbir__inline float stbir_simd_floorf(float x)
   2104   {
   2105     return wasm_f32x4_extract_lane( wasm_f32x4_floor( wasm_f32x4_splat(x) ), 0);
   2106   }
   2107 
   2108   #ifdef STBIR_CEILF
   2109   #undef STBIR_CEILF
   2110   #endif
   2111   #define STBIR_CEILF stbir_simd_ceilf
   2112   static stbir__inline float stbir_simd_ceilf(float x)
   2113   {
   2114     return wasm_f32x4_extract_lane( wasm_f32x4_ceil( wasm_f32x4_splat(x) ), 0);
   2115   }
   2116 
   2117   #define STBIR_SIMD
   2118 
   2119 #endif  // SSE2/NEON/WASM
   2120 
   2121 #endif // NO SIMD
   2122 
   2123 #ifdef STBIR_SIMD8
   2124   #define stbir__simdfX stbir__simdf8
   2125   #define stbir__simdiX stbir__simdi8
   2126   #define stbir__simdfX_load stbir__simdf8_load
   2127   #define stbir__simdiX_load stbir__simdi8_load
   2128   #define stbir__simdfX_mult stbir__simdf8_mult
   2129   #define stbir__simdfX_add_mem stbir__simdf8_add_mem
   2130   #define stbir__simdfX_madd_mem stbir__simdf8_madd_mem
   2131   #define stbir__simdfX_store stbir__simdf8_store
   2132   #define stbir__simdiX_store stbir__simdi8_store
   2133   #define stbir__simdf_frepX  stbir__simdf8_frep8
   2134   #define stbir__simdfX_madd stbir__simdf8_madd
   2135   #define stbir__simdfX_min stbir__simdf8_min
   2136   #define stbir__simdfX_max stbir__simdf8_max
   2137   #define stbir__simdfX_aaa1 stbir__simdf8_aaa1
   2138   #define stbir__simdfX_1aaa stbir__simdf8_1aaa
   2139   #define stbir__simdfX_a1a1 stbir__simdf8_a1a1
   2140   #define stbir__simdfX_1a1a stbir__simdf8_1a1a
   2141   #define stbir__simdfX_convert_float_to_i32 stbir__simdf8_convert_float_to_i32
   2142   #define stbir__simdfX_pack_to_words stbir__simdf8_pack_to_16words
   2143   #define stbir__simdfX_zero stbir__simdf8_zero
   2144   #define STBIR_onesX STBIR_ones8
   2145   #define STBIR_max_uint8_as_floatX STBIR_max_uint8_as_float8
   2146   #define STBIR_max_uint16_as_floatX STBIR_max_uint16_as_float8
   2147   #define STBIR_simd_point5X STBIR_simd_point58
   2148   #define stbir__simdfX_float_count 8
   2149   #define stbir__simdfX_0123to1230 stbir__simdf8_0123to12301230
   2150   #define stbir__simdfX_0123to2103 stbir__simdf8_0123to21032103
   2151   static const stbir__simdf8 STBIR_max_uint16_as_float_inverted8 = { stbir__max_uint16_as_float_inverted,stbir__max_uint16_as_float_inverted,stbir__max_uint16_as_float_inverted,stbir__max_uint16_as_float_inverted,stbir__max_uint16_as_float_inverted,stbir__max_uint16_as_float_inverted,stbir__max_uint16_as_float_inverted,stbir__max_uint16_as_float_inverted };
   2152   static const stbir__simdf8 STBIR_max_uint8_as_float_inverted8 = { stbir__max_uint8_as_float_inverted,stbir__max_uint8_as_float_inverted,stbir__max_uint8_as_float_inverted,stbir__max_uint8_as_float_inverted,stbir__max_uint8_as_float_inverted,stbir__max_uint8_as_float_inverted,stbir__max_uint8_as_float_inverted,stbir__max_uint8_as_float_inverted };
   2153   static const stbir__simdf8 STBIR_ones8 = { 1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0 };
   2154   static const stbir__simdf8 STBIR_simd_point58 = { 0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5 };
   2155   static const stbir__simdf8 STBIR_max_uint8_as_float8 = { stbir__max_uint8_as_float,stbir__max_uint8_as_float,stbir__max_uint8_as_float,stbir__max_uint8_as_float, stbir__max_uint8_as_float,stbir__max_uint8_as_float,stbir__max_uint8_as_float,stbir__max_uint8_as_float };
   2156   static const stbir__simdf8 STBIR_max_uint16_as_float8 = { stbir__max_uint16_as_float,stbir__max_uint16_as_float,stbir__max_uint16_as_float,stbir__max_uint16_as_float, stbir__max_uint16_as_float,stbir__max_uint16_as_float,stbir__max_uint16_as_float,stbir__max_uint16_as_float };
   2157 #else
   2158   #define stbir__simdfX stbir__simdf
   2159   #define stbir__simdiX stbir__simdi
   2160   #define stbir__simdfX_load stbir__simdf_load
   2161   #define stbir__simdiX_load stbir__simdi_load
   2162   #define stbir__simdfX_mult stbir__simdf_mult
   2163   #define stbir__simdfX_add_mem stbir__simdf_add_mem
   2164   #define stbir__simdfX_madd_mem stbir__simdf_madd_mem
   2165   #define stbir__simdfX_store stbir__simdf_store
   2166   #define stbir__simdiX_store stbir__simdi_store
   2167   #define stbir__simdf_frepX  stbir__simdf_frep4
   2168   #define stbir__simdfX_madd stbir__simdf_madd
   2169   #define stbir__simdfX_min stbir__simdf_min
   2170   #define stbir__simdfX_max stbir__simdf_max
   2171   #define stbir__simdfX_aaa1 stbir__simdf_aaa1
   2172   #define stbir__simdfX_1aaa stbir__simdf_1aaa
   2173   #define stbir__simdfX_a1a1 stbir__simdf_a1a1
   2174   #define stbir__simdfX_1a1a stbir__simdf_1a1a
   2175   #define stbir__simdfX_convert_float_to_i32 stbir__simdf_convert_float_to_i32
   2176   #define stbir__simdfX_pack_to_words stbir__simdf_pack_to_8words
   2177   #define stbir__simdfX_zero stbir__simdf_zero
   2178   #define STBIR_onesX STBIR__CONSTF(STBIR_ones)
   2179   #define STBIR_simd_point5X STBIR__CONSTF(STBIR_simd_point5)
   2180   #define STBIR_max_uint8_as_floatX STBIR__CONSTF(STBIR_max_uint8_as_float)
   2181   #define STBIR_max_uint16_as_floatX STBIR__CONSTF(STBIR_max_uint16_as_float)
   2182   #define stbir__simdfX_float_count 4
   2183   #define stbir__if_simdf8_cast_to_simdf4( val ) ( val )
   2184   #define stbir__simdfX_0123to1230 stbir__simdf_0123to1230
   2185   #define stbir__simdfX_0123to2103 stbir__simdf_0123to2103
   2186 #endif
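         // The stbir__simdfX_* aliases above let the rest of the file be written once against a
         // "native width" vector: 8 floats per register when STBIR_SIMD8 is defined, otherwise 4.
         // stbir__simdfX_float_count is how that width is exposed, so loops (for example the SIMD
         // memcpy further down) can step by multiples of it and work in either configuration.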
   2187 
   2188 
   2189 #if defined(STBIR_NEON) && !defined(_M_ARM) && !defined(__arm__)
   2190 
   2191   #if defined( _MSC_VER ) && !defined(__clang__)
   2192   typedef __int16 stbir__FP16;
   2193   #else
   2194   typedef float16_t stbir__FP16;
   2195   #endif
   2196 
   2197 #else // no NEON, or 32-bit ARM for MSVC
   2198 
   2199   typedef union stbir__FP16
   2200   {
   2201     unsigned short u;
   2202   } stbir__FP16;
   2203 
   2204 #endif
   2205 
   2206 #if (!defined(STBIR_NEON) && !defined(STBIR_FP16C)) || (defined(STBIR_NEON) && defined(_M_ARM)) || (defined(STBIR_NEON) && defined(__arm__))
   2207 
   2208   // Fabian's half float routines, see: https://gist.github.com/rygorous/2156668
   2209 
   2210   static stbir__inline float stbir__half_to_float( stbir__FP16 h )
   2211   {
   2212     static const stbir__FP32 magic = { (254 - 15) << 23 };
   2213     static const stbir__FP32 was_infnan = { (127 + 16) << 23 };
   2214     stbir__FP32 o;
   2215 
   2216     o.u = (h.u & 0x7fff) << 13;     // exponent/mantissa bits
   2217     o.f *= magic.f;                 // exponent adjust
   2218     if (o.f >= was_infnan.f)        // make sure Inf/NaN survive
   2219       o.u |= 255 << 23;
   2220     o.u |= (h.u & 0x8000) << 16;    // sign bit
   2221     return o.f;
   2222   }
   2223 
   2224   static stbir__inline stbir__FP16 stbir__float_to_half(float val)
   2225   {
   2226     stbir__FP32 f32infty = { 255 << 23 };
   2227     stbir__FP32 f16max   = { (127 + 16) << 23 };
   2228     stbir__FP32 denorm_magic = { ((127 - 15) + (23 - 10) + 1) << 23 };
   2229     unsigned int sign_mask = 0x80000000u;
   2230     stbir__FP16 o = { 0 };
   2231     stbir__FP32 f;
   2232     unsigned int sign;
   2233 
   2234     f.f = val;
   2235     sign = f.u & sign_mask;
   2236     f.u ^= sign;
   2237 
   2238     if (f.u >= f16max.u) // result is Inf or NaN (all exponent bits set)
   2239       o.u = (f.u > f32infty.u) ? 0x7e00 : 0x7c00; // NaN->qNaN and Inf->Inf
   2240     else // (De)normalized number or zero
   2241     {
   2242       if (f.u < (113 << 23)) // resulting FP16 is subnormal or zero
   2243       {
   2244         // use a magic value to align our 10 mantissa bits at the bottom of
   2245         // the float. as long as FP addition is round-to-nearest-even this
   2246         // just works.
   2247         f.f += denorm_magic.f;
   2248         // and one integer subtract of the bias later, we have our final float!
   2249         o.u = (unsigned short) ( f.u - denorm_magic.u );
   2250       }
   2251       else
   2252       {
   2253         unsigned int mant_odd = (f.u >> 13) & 1; // resulting mantissa is odd
   2254         // update exponent, rounding bias part 1
   2255         f.u = f.u + ((15u - 127) << 23) + 0xfff;
   2256         // rounding bias part 2
   2257         f.u += mant_odd;
   2258         // take the bits!
   2259         o.u = (unsigned short) ( f.u >> 13 );
   2260       }
   2261     }
   2262 
   2263     o.u |= sign >> 16;
   2264     return o;
   2265   }
   2266 
   2267 #endif
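         // A minimal round-trip sketch for the scalar routines above (illustrative only, not part of
         // the library; it assumes this scalar path is the one compiled in, so stbir__FP16 is the
         // union with a .u member):
         #if 0
         static void stbir__example_half_roundtrip( void )
         {
           stbir__FP16 h   = stbir__float_to_half( 1.5f );     // 1.5f is exactly representable: h.u == 0x3e00
           float       f   = stbir__half_to_float( h );        // converts back to exactly 1.5f
           stbir__FP16 big = stbir__float_to_half( 1.0e6f );   // above the FP16 max (65504), so it saturates
           STBIR_ASSERT( f == 1.5f );
           STBIR_ASSERT( big.u == 0x7c00 );                    // +Inf in FP16
         }
         #endif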
   2268 
   2269 
   2270 #if defined(STBIR_FP16C)
   2271 
   2272   #include <immintrin.h>
   2273 
   2274   static stbir__inline void stbir__half_to_float_SIMD(float * output, stbir__FP16 const * input)
   2275   {
   2276     _mm256_storeu_ps( (float*)output, _mm256_cvtph_ps( _mm_loadu_si128( (__m128i const* )input ) ) );
   2277   }
   2278 
   2279   static stbir__inline void stbir__float_to_half_SIMD(stbir__FP16 * output, float const * input)
   2280   {
   2281     _mm_storeu_si128( (__m128i*)output, _mm256_cvtps_ph( _mm256_loadu_ps( input ), 0 ) );
   2282   }
   2283 
   2284   static stbir__inline float stbir__half_to_float( stbir__FP16 h )
   2285   {
   2286     return _mm_cvtss_f32( _mm_cvtph_ps( _mm_cvtsi32_si128( (int)h.u ) ) );
   2287   }
   2288 
   2289   static stbir__inline stbir__FP16 stbir__float_to_half( float f )
   2290   {
   2291     stbir__FP16 h;
   2292     h.u = (unsigned short) _mm_cvtsi128_si32( _mm_cvtps_ph( _mm_set_ss( f ), 0 ) );
   2293     return h;
   2294   }
   2295 
   2296 #elif defined(STBIR_SSE2)
   2297 
   2298   // Fabian's half float routines, see: https://gist.github.com/rygorous/2156668
   2299   stbir__inline static void stbir__half_to_float_SIMD(float * output, void const * input)
   2300   {
   2301     static const STBIR__SIMDI_CONST(mask_nosign,      0x7fff);
   2302     static const STBIR__SIMDI_CONST(smallest_normal,  0x0400);
   2303     static const STBIR__SIMDI_CONST(infinity,         0x7c00);
   2304     static const STBIR__SIMDI_CONST(expadjust_normal, (127 - 15) << 23);
   2305     static const STBIR__SIMDI_CONST(magic_denorm,     113 << 23);
   2306 
   2307     __m128i i = _mm_loadu_si128 ( (__m128i const*)(input) );
   2308     __m128i h = _mm_unpacklo_epi16 ( i, _mm_setzero_si128() );
   2309     __m128i mnosign     = STBIR__CONSTI(mask_nosign);
   2310     __m128i eadjust     = STBIR__CONSTI(expadjust_normal);
   2311     __m128i smallest    = STBIR__CONSTI(smallest_normal);
   2312     __m128i infty       = STBIR__CONSTI(infinity);
   2313     __m128i expmant     = _mm_and_si128(mnosign, h);
   2314     __m128i justsign    = _mm_xor_si128(h, expmant);
   2315     __m128i b_notinfnan = _mm_cmpgt_epi32(infty, expmant);
   2316     __m128i b_isdenorm  = _mm_cmpgt_epi32(smallest, expmant);
   2317     __m128i shifted     = _mm_slli_epi32(expmant, 13);
   2318     __m128i adj_infnan  = _mm_andnot_si128(b_notinfnan, eadjust);
   2319     __m128i adjusted    = _mm_add_epi32(eadjust, shifted);
   2320     __m128i den1        = _mm_add_epi32(shifted, STBIR__CONSTI(magic_denorm));
   2321     __m128i adjusted2   = _mm_add_epi32(adjusted, adj_infnan);
   2322     __m128  den2        = _mm_sub_ps(_mm_castsi128_ps(den1), *(const __m128 *)&magic_denorm);
   2323     __m128  adjusted3   = _mm_and_ps(den2, _mm_castsi128_ps(b_isdenorm));
   2324     __m128  adjusted4   = _mm_andnot_ps(_mm_castsi128_ps(b_isdenorm), _mm_castsi128_ps(adjusted2));
   2325     __m128  adjusted5   = _mm_or_ps(adjusted3, adjusted4);
   2326     __m128i sign        = _mm_slli_epi32(justsign, 16);
   2327     __m128  final       = _mm_or_ps(adjusted5, _mm_castsi128_ps(sign));
   2328     stbir__simdf_store( output + 0,  final );
   2329 
   2330     h = _mm_unpackhi_epi16 ( i, _mm_setzero_si128() );
   2331     expmant     = _mm_and_si128(mnosign, h);
   2332     justsign    = _mm_xor_si128(h, expmant);
   2333     b_notinfnan = _mm_cmpgt_epi32(infty, expmant);
   2334     b_isdenorm  = _mm_cmpgt_epi32(smallest, expmant);
   2335     shifted     = _mm_slli_epi32(expmant, 13);
   2336     adj_infnan  = _mm_andnot_si128(b_notinfnan, eadjust);
   2337     adjusted    = _mm_add_epi32(eadjust, shifted);
   2338     den1        = _mm_add_epi32(shifted, STBIR__CONSTI(magic_denorm));
   2339     adjusted2   = _mm_add_epi32(adjusted, adj_infnan);
   2340     den2        = _mm_sub_ps(_mm_castsi128_ps(den1), *(const __m128 *)&magic_denorm);
   2341     adjusted3   = _mm_and_ps(den2, _mm_castsi128_ps(b_isdenorm));
   2342     adjusted4   = _mm_andnot_ps(_mm_castsi128_ps(b_isdenorm), _mm_castsi128_ps(adjusted2));
   2343     adjusted5   = _mm_or_ps(adjusted3, adjusted4);
   2344     sign        = _mm_slli_epi32(justsign, 16);
   2345     final       = _mm_or_ps(adjusted5, _mm_castsi128_ps(sign));
   2346     stbir__simdf_store( output + 4,  final );
   2347 
   2348     // ~38 SSE2 ops for 8 values
   2349   }
   2350 
    2351   // Fabian's round-to-nearest-even float-to-half conversion
    2352   // ~48 SSE2 ops for 8 output values
   2353   stbir__inline static void stbir__float_to_half_SIMD(void * output, float const * input)
   2354   {
   2355     static const STBIR__SIMDI_CONST(mask_sign,      0x80000000u);
   2356     static const STBIR__SIMDI_CONST(c_f16max,       (127 + 16) << 23); // all FP32 values >=this round to +inf
   2357     static const STBIR__SIMDI_CONST(c_nanbit,        0x200);
   2358     static const STBIR__SIMDI_CONST(c_infty_as_fp16, 0x7c00);
   2359     static const STBIR__SIMDI_CONST(c_min_normal,    (127 - 14) << 23); // smallest FP32 that yields a normalized FP16
   2360     static const STBIR__SIMDI_CONST(c_subnorm_magic, ((127 - 15) + (23 - 10) + 1) << 23);
   2361     static const STBIR__SIMDI_CONST(c_normal_bias,    0xfff - ((127 - 15) << 23)); // adjust exponent and add mantissa rounding
   2362 
   2363     __m128  f           =  _mm_loadu_ps(input);
   2364     __m128  msign       = _mm_castsi128_ps(STBIR__CONSTI(mask_sign));
   2365     __m128  justsign    = _mm_and_ps(msign, f);
   2366     __m128  absf        = _mm_xor_ps(f, justsign);
   2367     __m128i absf_int    = _mm_castps_si128(absf); // the cast is "free" (extra bypass latency, but no thruput hit)
   2368     __m128i f16max      = STBIR__CONSTI(c_f16max);
   2369     __m128  b_isnan     = _mm_cmpunord_ps(absf, absf); // is this a NaN?
   2370     __m128i b_isregular = _mm_cmpgt_epi32(f16max, absf_int); // (sub)normalized or special?
   2371     __m128i nanbit      = _mm_and_si128(_mm_castps_si128(b_isnan), STBIR__CONSTI(c_nanbit));
   2372     __m128i inf_or_nan  = _mm_or_si128(nanbit, STBIR__CONSTI(c_infty_as_fp16)); // output for specials
   2373 
   2374     __m128i min_normal  = STBIR__CONSTI(c_min_normal);
   2375     __m128i b_issub     = _mm_cmpgt_epi32(min_normal, absf_int);
   2376 
   2377     // "result is subnormal" path
   2378     __m128  subnorm1    = _mm_add_ps(absf, _mm_castsi128_ps(STBIR__CONSTI(c_subnorm_magic))); // magic value to round output mantissa
   2379     __m128i subnorm2    = _mm_sub_epi32(_mm_castps_si128(subnorm1), STBIR__CONSTI(c_subnorm_magic)); // subtract out bias
   2380 
   2381     // "result is normal" path
   2382     __m128i mantoddbit  = _mm_slli_epi32(absf_int, 31 - 13); // shift bit 13 (mantissa LSB) to sign
   2383     __m128i mantodd     = _mm_srai_epi32(mantoddbit, 31); // -1 if FP16 mantissa odd, else 0
   2384 
   2385     __m128i round1      = _mm_add_epi32(absf_int, STBIR__CONSTI(c_normal_bias));
   2386     __m128i round2      = _mm_sub_epi32(round1, mantodd); // if mantissa LSB odd, bias towards rounding up (RTNE)
   2387     __m128i normal      = _mm_srli_epi32(round2, 13); // rounded result
   2388 
   2389     // combine the two non-specials
   2390     __m128i nonspecial  = _mm_or_si128(_mm_and_si128(subnorm2, b_issub), _mm_andnot_si128(b_issub, normal));
   2391 
   2392     // merge in specials as well
   2393     __m128i joined      = _mm_or_si128(_mm_and_si128(nonspecial, b_isregular), _mm_andnot_si128(b_isregular, inf_or_nan));
   2394 
   2395     __m128i sign_shift  = _mm_srai_epi32(_mm_castps_si128(justsign), 16);
    2396     __m128i final2, final = _mm_or_si128(joined, sign_shift);
   2397 
   2398     f           =  _mm_loadu_ps(input+4);
   2399     justsign    = _mm_and_ps(msign, f);
   2400     absf        = _mm_xor_ps(f, justsign);
   2401     absf_int    = _mm_castps_si128(absf); // the cast is "free" (extra bypass latency, but no thruput hit)
   2402     b_isnan     = _mm_cmpunord_ps(absf, absf); // is this a NaN?
   2403     b_isregular = _mm_cmpgt_epi32(f16max, absf_int); // (sub)normalized or special?
   2404     nanbit      = _mm_and_si128(_mm_castps_si128(b_isnan), c_nanbit);
   2405     inf_or_nan  = _mm_or_si128(nanbit, STBIR__CONSTI(c_infty_as_fp16)); // output for specials
   2406 
   2407     b_issub     = _mm_cmpgt_epi32(min_normal, absf_int);
   2408 
   2409     // "result is subnormal" path
   2410     subnorm1    = _mm_add_ps(absf, _mm_castsi128_ps(STBIR__CONSTI(c_subnorm_magic))); // magic value to round output mantissa
   2411     subnorm2    = _mm_sub_epi32(_mm_castps_si128(subnorm1), STBIR__CONSTI(c_subnorm_magic)); // subtract out bias
   2412 
   2413     // "result is normal" path
   2414     mantoddbit  = _mm_slli_epi32(absf_int, 31 - 13); // shift bit 13 (mantissa LSB) to sign
   2415     mantodd     = _mm_srai_epi32(mantoddbit, 31); // -1 if FP16 mantissa odd, else 0
   2416 
   2417     round1      = _mm_add_epi32(absf_int, STBIR__CONSTI(c_normal_bias));
   2418     round2      = _mm_sub_epi32(round1, mantodd); // if mantissa LSB odd, bias towards rounding up (RTNE)
   2419     normal      = _mm_srli_epi32(round2, 13); // rounded result
   2420 
   2421     // combine the two non-specials
   2422     nonspecial  = _mm_or_si128(_mm_and_si128(subnorm2, b_issub), _mm_andnot_si128(b_issub, normal));
   2423 
   2424     // merge in specials as well
   2425     joined      = _mm_or_si128(_mm_and_si128(nonspecial, b_isregular), _mm_andnot_si128(b_isregular, inf_or_nan));
   2426 
   2427     sign_shift  = _mm_srai_epi32(_mm_castps_si128(justsign), 16);
   2428     final2      = _mm_or_si128(joined, sign_shift);
   2429     final       = _mm_packs_epi32(final, final2);
   2430     stbir__simdi_store( output,final );
   2431   }
   2432 
   2433 #elif defined(STBIR_NEON) && defined(_MSC_VER) && defined(_M_ARM64) && !defined(__clang__) // 64-bit ARM on MSVC (not clang)
   2434 
   2435   static stbir__inline void stbir__half_to_float_SIMD(float * output, stbir__FP16 const * input)
   2436   {
   2437     float16x4_t in0 = vld1_f16(input + 0);
   2438     float16x4_t in1 = vld1_f16(input + 4);
   2439     vst1q_f32(output + 0, vcvt_f32_f16(in0));
   2440     vst1q_f32(output + 4, vcvt_f32_f16(in1));
   2441   }
   2442 
   2443   static stbir__inline void stbir__float_to_half_SIMD(stbir__FP16 * output, float const * input)
   2444   {
   2445     float16x4_t out0 = vcvt_f16_f32(vld1q_f32(input + 0));
   2446     float16x4_t out1 = vcvt_f16_f32(vld1q_f32(input + 4));
   2447     vst1_f16(output+0, out0);
   2448     vst1_f16(output+4, out1);
   2449   }
   2450 
   2451   static stbir__inline float stbir__half_to_float( stbir__FP16 h )
   2452   {
   2453     return vgetq_lane_f32(vcvt_f32_f16(vld1_dup_f16(&h)), 0);
   2454   }
   2455 
   2456   static stbir__inline stbir__FP16 stbir__float_to_half( float f )
   2457   {
   2458     return vget_lane_f16(vcvt_f16_f32(vdupq_n_f32(f)), 0).n16_u16[0];
   2459   }
   2460 
   2461 #elif defined(STBIR_NEON) && ( defined( _M_ARM64 ) || defined( __aarch64__ ) || defined( __arm64__ ) ) // 64-bit ARM
   2462 
   2463   static stbir__inline void stbir__half_to_float_SIMD(float * output, stbir__FP16 const * input)
   2464   {
   2465     float16x8_t in = vld1q_f16(input);
   2466     vst1q_f32(output + 0, vcvt_f32_f16(vget_low_f16(in)));
   2467     vst1q_f32(output + 4, vcvt_f32_f16(vget_high_f16(in)));
   2468   }
   2469 
   2470   static stbir__inline void stbir__float_to_half_SIMD(stbir__FP16 * output, float const * input)
   2471   {
   2472     float16x4_t out0 = vcvt_f16_f32(vld1q_f32(input + 0));
   2473     float16x4_t out1 = vcvt_f16_f32(vld1q_f32(input + 4));
   2474     vst1q_f16(output, vcombine_f16(out0, out1));
   2475   }
   2476 
   2477   static stbir__inline float stbir__half_to_float( stbir__FP16 h )
   2478   {
   2479     return vgetq_lane_f32(vcvt_f32_f16(vdup_n_f16(h)), 0);
   2480   }
   2481 
   2482   static stbir__inline stbir__FP16 stbir__float_to_half( float f )
   2483   {
   2484     return vget_lane_f16(vcvt_f16_f32(vdupq_n_f32(f)), 0);
   2485   }
   2486 
   2487 #elif defined(STBIR_WASM) || (defined(STBIR_NEON) && (defined(_MSC_VER) || defined(_M_ARM) || defined(__arm__))) // WASM or 32-bit ARM on MSVC/clang
   2488 
   2489   static stbir__inline void stbir__half_to_float_SIMD(float * output, stbir__FP16 const * input)
   2490   {
   2491     for (int i=0; i<8; i++)
   2492     {
   2493       output[i] = stbir__half_to_float(input[i]);
   2494     }
   2495   }
   2496   static stbir__inline void stbir__float_to_half_SIMD(stbir__FP16 * output, float const * input)
   2497   {
   2498     for (int i=0; i<8; i++)
   2499     {
   2500       output[i] = stbir__float_to_half(input[i]);
   2501     }
   2502   }
   2503 
   2504 #endif
   2505 
   2506 
   2507 #ifdef STBIR_SIMD
   2508 
   2509 #define stbir__simdf_0123to3333( out, reg ) (out) = stbir__simdf_swiz( reg, 3,3,3,3 )
   2510 #define stbir__simdf_0123to2222( out, reg ) (out) = stbir__simdf_swiz( reg, 2,2,2,2 )
   2511 #define stbir__simdf_0123to1111( out, reg ) (out) = stbir__simdf_swiz( reg, 1,1,1,1 )
   2512 #define stbir__simdf_0123to0000( out, reg ) (out) = stbir__simdf_swiz( reg, 0,0,0,0 )
   2513 #define stbir__simdf_0123to0003( out, reg ) (out) = stbir__simdf_swiz( reg, 0,0,0,3 )
   2514 #define stbir__simdf_0123to0001( out, reg ) (out) = stbir__simdf_swiz( reg, 0,0,0,1 )
   2515 #define stbir__simdf_0123to1122( out, reg ) (out) = stbir__simdf_swiz( reg, 1,1,2,2 )
   2516 #define stbir__simdf_0123to2333( out, reg ) (out) = stbir__simdf_swiz( reg, 2,3,3,3 )
   2517 #define stbir__simdf_0123to0023( out, reg ) (out) = stbir__simdf_swiz( reg, 0,0,2,3 )
   2518 #define stbir__simdf_0123to1230( out, reg ) (out) = stbir__simdf_swiz( reg, 1,2,3,0 )
   2519 #define stbir__simdf_0123to2103( out, reg ) (out) = stbir__simdf_swiz( reg, 2,1,0,3 )
   2520 #define stbir__simdf_0123to3210( out, reg ) (out) = stbir__simdf_swiz( reg, 3,2,1,0 )
   2521 #define stbir__simdf_0123to2301( out, reg ) (out) = stbir__simdf_swiz( reg, 2,3,0,1 )
   2522 #define stbir__simdf_0123to3012( out, reg ) (out) = stbir__simdf_swiz( reg, 3,0,1,2 )
   2523 #define stbir__simdf_0123to0011( out, reg ) (out) = stbir__simdf_swiz( reg, 0,0,1,1 )
   2524 #define stbir__simdf_0123to1100( out, reg ) (out) = stbir__simdf_swiz( reg, 1,1,0,0 )
   2525 #define stbir__simdf_0123to2233( out, reg ) (out) = stbir__simdf_swiz( reg, 2,2,3,3 )
   2526 #define stbir__simdf_0123to1133( out, reg ) (out) = stbir__simdf_swiz( reg, 1,1,3,3 )
   2527 #define stbir__simdf_0123to0022( out, reg ) (out) = stbir__simdf_swiz( reg, 0,0,2,2 )
   2528 #define stbir__simdf_0123to1032( out, reg ) (out) = stbir__simdf_swiz( reg, 1,0,3,2 )
   2529 
   2530 typedef union stbir__simdi_u32
   2531 {
   2532   stbir_uint32 m128i_u32[4];
   2533   int m128i_i32[4];
   2534   stbir__simdi m128i_i128;
   2535 } stbir__simdi_u32;
   2536 
   2537 static const int STBIR_mask[9] = { 0,0,0,-1,-1,-1,0,0,0 };
   2538 
   2539 static const STBIR__SIMDF_CONST(STBIR_max_uint8_as_float,           stbir__max_uint8_as_float);
   2540 static const STBIR__SIMDF_CONST(STBIR_max_uint16_as_float,          stbir__max_uint16_as_float);
   2541 static const STBIR__SIMDF_CONST(STBIR_max_uint8_as_float_inverted,  stbir__max_uint8_as_float_inverted);
   2542 static const STBIR__SIMDF_CONST(STBIR_max_uint16_as_float_inverted, stbir__max_uint16_as_float_inverted);
   2543 
   2544 static const STBIR__SIMDF_CONST(STBIR_simd_point5,   0.5f);
   2545 static const STBIR__SIMDF_CONST(STBIR_ones,          1.0f);
   2546 static const STBIR__SIMDI_CONST(STBIR_almost_zero,   (127 - 13) << 23);
   2547 static const STBIR__SIMDI_CONST(STBIR_almost_one,    0x3f7fffff);
   2548 static const STBIR__SIMDI_CONST(STBIR_mastissa_mask, 0xff);
   2549 static const STBIR__SIMDI_CONST(STBIR_topscale,      0x02000000);
   2550 
   2551 //   Basically, in simd mode, we unroll the proper amount, and we don't want
    2552     //   the non-simd remnant loops to be unrolled because they only run a few times.
    2553     //   Adding this switch saves about 5K on clang, which is Captain Unroll the 3rd.
   2554 #define STBIR_SIMD_STREAMOUT_PTR( star )  STBIR_STREAMOUT_PTR( star )
   2555 #define STBIR_SIMD_NO_UNROLL(ptr) STBIR_NO_UNROLL(ptr)
   2556 #define STBIR_SIMD_NO_UNROLL_LOOP_START STBIR_NO_UNROLL_LOOP_START
   2557 #define STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR STBIR_NO_UNROLL_LOOP_START_INF_FOR
   2558 
   2559 #ifdef STBIR_MEMCPY
   2560 #undef STBIR_MEMCPY
   2561 #endif
   2562 #define STBIR_MEMCPY stbir_simd_memcpy
   2563 
   2564 // override normal use of memcpy with much simpler copy (faster and smaller with our sized copies)
   2565 static void stbir_simd_memcpy( void * dest, void const * src, size_t bytes )
   2566 {
   2567   char STBIR_SIMD_STREAMOUT_PTR (*) d = (char*) dest;
   2568   char STBIR_SIMD_STREAMOUT_PTR( * ) d_end = ((char*) dest) + bytes;
   2569   ptrdiff_t ofs_to_src = (char*)src - (char*)dest;
   2570 
   2571   // check overlaps
   2572   STBIR_ASSERT( ( ( d >= ( (char*)src) + bytes ) ) || ( ( d + bytes ) <= (char*)src ) );
   2573 
   2574   if ( bytes < (16*stbir__simdfX_float_count) )
   2575   {
   2576     if ( bytes < 16 )
   2577     {
   2578       if ( bytes )
   2579       {
   2580         STBIR_SIMD_NO_UNROLL_LOOP_START
   2581         do
   2582         {
   2583           STBIR_SIMD_NO_UNROLL(d);
   2584           d[ 0 ] = d[ ofs_to_src ];
   2585           ++d;
   2586         } while ( d < d_end );
   2587       }
   2588     }
   2589     else
   2590     {
   2591       stbir__simdf x;
   2592       // do one unaligned to get us aligned for the stream out below
   2593       stbir__simdf_load( x, ( d + ofs_to_src ) );
   2594       stbir__simdf_store( d, x );
   2595       d = (char*)( ( ( (size_t)d ) + 16 ) & ~15 );
   2596 
   2597       STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
   2598       for(;;)
   2599       {
   2600         STBIR_SIMD_NO_UNROLL(d);
   2601 
   2602         if ( d > ( d_end - 16 ) )
   2603         {
   2604           if ( d == d_end )
   2605             return;
   2606           d = d_end - 16;
   2607         }
   2608 
   2609         stbir__simdf_load( x, ( d + ofs_to_src ) );
   2610         stbir__simdf_store( d, x );
   2611         d += 16;
   2612       }
   2613     }
   2614   }
   2615   else
   2616   {
   2617     stbir__simdfX x0,x1,x2,x3;
   2618 
   2619     // do one unaligned to get us aligned for the stream out below
   2620     stbir__simdfX_load( x0, ( d + ofs_to_src ) +  0*stbir__simdfX_float_count );
   2621     stbir__simdfX_load( x1, ( d + ofs_to_src ) +  4*stbir__simdfX_float_count );
   2622     stbir__simdfX_load( x2, ( d + ofs_to_src ) +  8*stbir__simdfX_float_count );
   2623     stbir__simdfX_load( x3, ( d + ofs_to_src ) + 12*stbir__simdfX_float_count );
   2624     stbir__simdfX_store( d +  0*stbir__simdfX_float_count, x0 );
   2625     stbir__simdfX_store( d +  4*stbir__simdfX_float_count, x1 );
   2626     stbir__simdfX_store( d +  8*stbir__simdfX_float_count, x2 );
   2627     stbir__simdfX_store( d + 12*stbir__simdfX_float_count, x3 );
   2628     d = (char*)( ( ( (size_t)d ) + (16*stbir__simdfX_float_count) ) & ~((16*stbir__simdfX_float_count)-1) );
   2629 
   2630     STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
   2631     for(;;)
   2632     {
   2633       STBIR_SIMD_NO_UNROLL(d);
   2634 
   2635       if ( d > ( d_end - (16*stbir__simdfX_float_count) ) )
   2636       {
   2637         if ( d == d_end )
   2638           return;
   2639         d = d_end - (16*stbir__simdfX_float_count);
   2640       }
   2641 
   2642       stbir__simdfX_load( x0, ( d + ofs_to_src ) +  0*stbir__simdfX_float_count );
   2643       stbir__simdfX_load( x1, ( d + ofs_to_src ) +  4*stbir__simdfX_float_count );
   2644       stbir__simdfX_load( x2, ( d + ofs_to_src ) +  8*stbir__simdfX_float_count );
   2645       stbir__simdfX_load( x3, ( d + ofs_to_src ) + 12*stbir__simdfX_float_count );
   2646       stbir__simdfX_store( d +  0*stbir__simdfX_float_count, x0 );
   2647       stbir__simdfX_store( d +  4*stbir__simdfX_float_count, x1 );
   2648       stbir__simdfX_store( d +  8*stbir__simdfX_float_count, x2 );
   2649       stbir__simdfX_store( d + 12*stbir__simdfX_float_count, x3 );
   2650       d += (16*stbir__simdfX_float_count);
   2651     }
   2652   }
   2653 }
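         // Usage sketch for stbir_simd_memcpy (illustrative only, not part of the library): it is a
         // drop-in for memcpy on the library's own copies, but note the assert above -- source and
         // dest must NOT overlap, because the tail is handled by backing up to (end - chunk) and
         // re-copying a few bytes, which is only safe for disjoint ranges.
         #if 0
         static void stbir__example_simd_memcpy( float * dst, float const * src, int count )
         {
           // dst and src are assumed to be separate buffers of at least `count` floats each
           STBIR_MEMCPY( dst, src, (size_t)count * sizeof(float) );  // STBIR_MEMCPY resolves to stbir_simd_memcpy here
         }
         #endif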
   2654 
    2655 // memcpy that is specifically intentionally overlapping (src is smaller than dest, so can be
   2656 //   a normal forward copy, bytes is divisible by 4 and bytes is greater than or equal to
   2657 //   the diff between dest and src)
   2658 static void stbir_overlapping_memcpy( void * dest, void const * src, size_t bytes )
   2659 {
   2660   char STBIR_SIMD_STREAMOUT_PTR (*) sd = (char*) src;
   2661   char STBIR_SIMD_STREAMOUT_PTR( * ) s_end = ((char*) src) + bytes;
   2662   ptrdiff_t ofs_to_dest = (char*)dest - (char*)src;
   2663 
    2664   if ( ofs_to_dest >= 16 ) // is the dest at least 16 bytes past the src?
   2665   {
   2666     char STBIR_SIMD_STREAMOUT_PTR( * ) s_end16 = ((char*) src) + (bytes&~15);
   2667     STBIR_SIMD_NO_UNROLL_LOOP_START
   2668     do
   2669     {
   2670       stbir__simdf x;
   2671       STBIR_SIMD_NO_UNROLL(sd);
   2672       stbir__simdf_load( x, sd );
   2673       stbir__simdf_store(  ( sd + ofs_to_dest ), x );
   2674       sd += 16;
   2675     } while ( sd < s_end16 );
   2676 
   2677     if ( sd == s_end )
   2678       return;
   2679   }
   2680 
   2681   do
   2682   {
   2683     STBIR_SIMD_NO_UNROLL(sd);
   2684     *(int*)( sd + ofs_to_dest ) = *(int*) sd;
   2685     sd += 4;
   2686   } while ( sd < s_end );
   2687 }
   2688 
    2689 #else // no SIMD
   2690 
   2691 // when in scalar mode, we let unrolling happen, so this macro just does the __restrict
   2692 #define STBIR_SIMD_STREAMOUT_PTR( star ) STBIR_STREAMOUT_PTR( star )
   2693 #define STBIR_SIMD_NO_UNROLL(ptr)
   2694 #define STBIR_SIMD_NO_UNROLL_LOOP_START
   2695 #define STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
   2696 
    2697 #endif // STBIR_SIMD
   2698 
   2699 
   2700 #ifdef STBIR_PROFILE
   2701 
   2702 #ifndef STBIR_PROFILE_FUNC
   2703 
   2704 #if defined(_x86_64) || defined( __x86_64__ ) || defined( _M_X64 ) || defined(__x86_64) || defined(__SSE2__) || defined(STBIR_SSE) || defined( _M_IX86_FP ) || defined(__i386) || defined( __i386__ ) || defined( _M_IX86 ) || defined( _X86_ )
   2705 
   2706 #ifdef _MSC_VER
   2707 
   2708   STBIRDEF stbir_uint64 __rdtsc();
   2709   #define STBIR_PROFILE_FUNC() __rdtsc()
   2710 
   2711 #else // non msvc
   2712 
   2713   static stbir__inline stbir_uint64 STBIR_PROFILE_FUNC()
   2714   {
   2715     stbir_uint32 lo, hi;
   2716     asm volatile ("rdtsc" : "=a" (lo), "=d" (hi) );
   2717     return ( ( (stbir_uint64) hi ) << 32 ) | ( (stbir_uint64) lo );
   2718   }
   2719 
   2720 #endif  // msvc
   2721 
   2722 #elif defined( _M_ARM64 ) || defined( __aarch64__ ) || defined( __arm64__ ) || defined(__ARM_NEON__)
   2723 
   2724 #if defined( _MSC_VER ) && !defined(__clang__)
   2725 
   2726   #define STBIR_PROFILE_FUNC() _ReadStatusReg(ARM64_CNTVCT)
   2727 
   2728 #else
   2729 
   2730   static stbir__inline stbir_uint64 STBIR_PROFILE_FUNC()
   2731   {
   2732     stbir_uint64 tsc;
   2733     asm volatile("mrs %0, cntvct_el0" : "=r" (tsc));
   2734     return tsc;
   2735   }
   2736 
   2737 #endif
   2738 
   2739 #else // x64, arm
   2740 
   2741 #error Unknown platform for profiling.
   2742 
   2743 #endif  // x64, arm
   2744 
   2745 #endif // STBIR_PROFILE_FUNC
   2746 
   2747 #define STBIR_ONLY_PROFILE_GET_SPLIT_INFO ,stbir__per_split_info * split_info
   2748 #define STBIR_ONLY_PROFILE_SET_SPLIT_INFO ,split_info
   2749 
   2750 #define STBIR_ONLY_PROFILE_BUILD_GET_INFO ,stbir__info * profile_info
   2751 #define STBIR_ONLY_PROFILE_BUILD_SET_INFO ,profile_info
   2752 
   2753 // super light-weight micro profiler
   2754 #define STBIR_PROFILE_START_ll( info, wh ) { stbir_uint64 wh##thiszonetime = STBIR_PROFILE_FUNC(); stbir_uint64 * wh##save_parent_excluded_ptr = info->current_zone_excluded_ptr; stbir_uint64 wh##current_zone_excluded = 0; info->current_zone_excluded_ptr = &wh##current_zone_excluded;
   2755 #define STBIR_PROFILE_END_ll( info, wh ) wh##thiszonetime = STBIR_PROFILE_FUNC() - wh##thiszonetime; info->profile.named.wh += wh##thiszonetime - wh##current_zone_excluded; *wh##save_parent_excluded_ptr += wh##thiszonetime; info->current_zone_excluded_ptr = wh##save_parent_excluded_ptr; }
   2756 #define STBIR_PROFILE_FIRST_START_ll( info, wh ) { int i; info->current_zone_excluded_ptr = &info->profile.named.total; for(i=0;i<STBIR__ARRAY_SIZE(info->profile.array);i++) info->profile.array[i]=0; } STBIR_PROFILE_START_ll( info, wh );
   2757 #define STBIR_PROFILE_CLEAR_EXTRAS_ll( info, num ) { int extra; for(extra=1;extra<(num);extra++) { int i; for(i=0;i<STBIR__ARRAY_SIZE((info)->profile.array);i++) (info)[extra].profile.array[i]=0; } }
   2758 
   2759 // for thread data
   2760 #define STBIR_PROFILE_START( wh ) STBIR_PROFILE_START_ll( split_info, wh )
   2761 #define STBIR_PROFILE_END( wh ) STBIR_PROFILE_END_ll( split_info, wh )
   2762 #define STBIR_PROFILE_FIRST_START( wh ) STBIR_PROFILE_FIRST_START_ll( split_info, wh )
   2763 #define STBIR_PROFILE_CLEAR_EXTRAS() STBIR_PROFILE_CLEAR_EXTRAS_ll( split_info, split_count )
   2764 
   2765 // for build data
   2766 #define STBIR_PROFILE_BUILD_START( wh ) STBIR_PROFILE_START_ll( profile_info, wh )
   2767 #define STBIR_PROFILE_BUILD_END( wh ) STBIR_PROFILE_END_ll( profile_info, wh )
   2768 #define STBIR_PROFILE_BUILD_FIRST_START( wh ) STBIR_PROFILE_FIRST_START_ll( profile_info, wh )
   2769 #define STBIR_PROFILE_BUILD_CLEAR( info ) { int i; for(i=0;i<STBIR__ARRAY_SIZE(info->profile.array);i++) info->profile.array[i]=0; }
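         // Usage sketch (illustrative only): a zone is opened and closed with a matching pair of
         // macros, where the zone name must be one of the fields of profile.named (the macros
         // expand to info->profile.named.wh). `myzone` below is a hypothetical placeholder:
         //
         //   STBIR_PROFILE_START( myzone );
         //   ... timed work ...
         //   STBIR_PROFILE_END( myzone );
         //
         // Zones nest: the current_zone_excluded_ptr bookkeeping adds each inner zone's time to the
         // enclosing zone's "excluded" total, so every named counter ends up with self-time only.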
   2770 
   2771 #else  // no profile
   2772 
   2773 #define STBIR_ONLY_PROFILE_GET_SPLIT_INFO
   2774 #define STBIR_ONLY_PROFILE_SET_SPLIT_INFO
   2775 
   2776 #define STBIR_ONLY_PROFILE_BUILD_GET_INFO
   2777 #define STBIR_ONLY_PROFILE_BUILD_SET_INFO
   2778 
   2779 #define STBIR_PROFILE_START( wh )
   2780 #define STBIR_PROFILE_END( wh )
   2781 #define STBIR_PROFILE_FIRST_START( wh )
   2782 #define STBIR_PROFILE_CLEAR_EXTRAS( )
   2783 
   2784 #define STBIR_PROFILE_BUILD_START( wh )
   2785 #define STBIR_PROFILE_BUILD_END( wh )
   2786 #define STBIR_PROFILE_BUILD_FIRST_START( wh )
   2787 #define STBIR_PROFILE_BUILD_CLEAR( info )
   2788 
   2789 #endif  // stbir_profile
   2790 
   2791 #ifndef STBIR_CEILF
   2792 #include <math.h>
   2793 #if _MSC_VER <= 1200 // support VC6 for Sean
   2794 #define STBIR_CEILF(x) ((float)ceil((float)(x)))
   2795 #define STBIR_FLOORF(x) ((float)floor((float)(x)))
   2796 #else
   2797 #define STBIR_CEILF(x) ceilf(x)
   2798 #define STBIR_FLOORF(x) floorf(x)
   2799 #endif
   2800 #endif
   2801 
   2802 #ifndef STBIR_MEMCPY
   2803 // For memcpy
   2804 #include <string.h>
   2805 #define STBIR_MEMCPY( dest, src, len ) memcpy( dest, src, len )
   2806 #endif
   2807 
   2808 #ifndef STBIR_SIMD
   2809 
    2810 // memcpy that is specifically intentionally overlapping (src is smaller than dest, so can be
   2811 //   a normal forward copy, bytes is divisible by 4 and bytes is greater than or equal to
   2812 //   the diff between dest and src)
   2813 static void stbir_overlapping_memcpy( void * dest, void const * src, size_t bytes )
   2814 {
   2815   char STBIR_SIMD_STREAMOUT_PTR (*) sd = (char*) src;
   2816   char STBIR_SIMD_STREAMOUT_PTR( * ) s_end = ((char*) src) + bytes;
   2817   ptrdiff_t ofs_to_dest = (char*)dest - (char*)src;
   2818 
    2819   if ( ofs_to_dest >= 8 ) // is the dest at least 8 bytes past the src?
   2820   {
   2821     char STBIR_SIMD_STREAMOUT_PTR( * ) s_end8 = ((char*) src) + (bytes&~7);
   2822     STBIR_NO_UNROLL_LOOP_START
   2823     do
   2824     {
   2825       STBIR_NO_UNROLL(sd);
   2826       *(stbir_uint64*)( sd + ofs_to_dest ) = *(stbir_uint64*) sd;
   2827       sd += 8;
   2828     } while ( sd < s_end8 );
   2829 
   2830     if ( sd == s_end )
   2831       return;
   2832   }
   2833 
   2834   STBIR_NO_UNROLL_LOOP_START
   2835   do
   2836   {
   2837     STBIR_NO_UNROLL(sd);
   2838     *(int*)( sd + ofs_to_dest ) = *(int*) sd;
   2839     sd += 4;
   2840   } while ( sd < s_end );
   2841 }
   2842 
   2843 #endif
   2844 
   2845 static float stbir__filter_trapezoid(float x, float scale, void * user_data)
   2846 {
   2847   float halfscale = scale / 2;
   2848   float t = 0.5f + halfscale;
   2849   STBIR_ASSERT(scale <= 1);
   2850   STBIR__UNUSED(user_data);
   2851 
   2852   if ( x < 0.0f ) x = -x;
   2853 
   2854   if (x >= t)
   2855     return 0.0f;
   2856   else
   2857   {
   2858     float r = 0.5f - halfscale;
   2859     if (x <= r)
   2860       return 1.0f;
   2861     else
   2862       return (t - x) / scale;
   2863   }
   2864 }
   2865 
   2866 static float stbir__support_trapezoid(float scale, void * user_data)
   2867 {
   2868   STBIR__UNUSED(user_data);
   2869   return 0.5f + scale / 2.0f;
   2870 }
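         // Worked example for the trapezoid filter (illustrative): with scale = 0.5 (a 2x downsample),
         // halfscale = 0.25, so the kernel is 1.0 for |x| <= 0.25, falls linearly to 0 at |x| = 0.75,
         // and stbir__support_trapezoid returns 0.75. For instance,
         // stbir__filter_trapezoid( 0.5f, 0.5f, 0 ) = (0.75 - 0.5) / 0.5 = 0.5.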
   2871 
   2872 static float stbir__filter_triangle(float x, float s, void * user_data)
   2873 {
   2874   STBIR__UNUSED(s);
   2875   STBIR__UNUSED(user_data);
   2876 
   2877   if ( x < 0.0f ) x = -x;
   2878 
   2879   if (x <= 1.0f)
   2880     return 1.0f - x;
   2881   else
   2882     return 0.0f;
   2883 }
   2884 
   2885 static float stbir__filter_point(float x, float s, void * user_data)
   2886 {
   2887   STBIR__UNUSED(x);
   2888   STBIR__UNUSED(s);
   2889   STBIR__UNUSED(user_data);
   2890 
   2891   return 1.0f;
   2892 }
   2893 
   2894 static float stbir__filter_cubic(float x, float s, void * user_data)
   2895 {
   2896   STBIR__UNUSED(s);
   2897   STBIR__UNUSED(user_data);
   2898 
   2899   if ( x < 0.0f ) x = -x;
   2900 
   2901   if (x < 1.0f)
   2902     return (4.0f + x*x*(3.0f*x - 6.0f))/6.0f;
   2903   else if (x < 2.0f)
   2904     return (8.0f + x*(-12.0f + x*(6.0f - x)))/6.0f;
   2905 
   2906   return (0.0f);
   2907 }
   2908 
   2909 static float stbir__filter_catmullrom(float x, float s, void * user_data)
   2910 {
   2911   STBIR__UNUSED(s);
   2912   STBIR__UNUSED(user_data);
   2913 
   2914   if ( x < 0.0f ) x = -x;
   2915 
   2916   if (x < 1.0f)
   2917     return 1.0f - x*x*(2.5f - 1.5f*x);
   2918   else if (x < 2.0f)
   2919     return 2.0f - x*(4.0f + x*(0.5f*x - 2.5f));
   2920 
   2921   return (0.0f);
   2922 }
   2923 
   2924 static float stbir__filter_mitchell(float x, float s, void * user_data)
   2925 {
   2926   STBIR__UNUSED(s);
   2927   STBIR__UNUSED(user_data);
   2928 
   2929   if ( x < 0.0f ) x = -x;
   2930 
   2931   if (x < 1.0f)
   2932     return (16.0f + x*x*(21.0f * x - 36.0f))/18.0f;
   2933   else if (x < 2.0f)
   2934     return (32.0f + x*(-60.0f + x*(36.0f - 7.0f*x)))/18.0f;
   2935 
   2936   return (0.0f);
   2937 }
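         // The three kernels above are the cubic B-spline, Catmull-Rom, and Mitchell-Netravali
         // (B = C = 1/3) filters, normalized so that samples taken at unit spacing sum to 1. For
         // example, sampling at x = 0 and x = +/-1: cubic gives 4/6 + 2*(1/6) = 1, Catmull-Rom
         // gives 1 + 2*0 = 1, and Mitchell gives 16/18 + 2*(1/18) = 1.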
   2938 
   2939 static float stbir__support_zeropoint5(float s, void * user_data)
   2940 {
   2941   STBIR__UNUSED(s);
   2942   STBIR__UNUSED(user_data);
   2943   return 0.5f;
   2944 }
   2945 
   2946 static float stbir__support_one(float s, void * user_data)
   2947 {
   2948   STBIR__UNUSED(s);
   2949   STBIR__UNUSED(user_data);
   2950   return 1;
   2951 }
   2952 
   2953 static float stbir__support_two(float s, void * user_data)
   2954 {
   2955   STBIR__UNUSED(s);
   2956   STBIR__UNUSED(user_data);
   2957   return 2;
   2958 }
   2959 
   2960 // This is the maximum number of input samples that can affect an output sample
   2961 // with the given filter from the output pixel's perspective
   2962 static int stbir__get_filter_pixel_width(stbir__support_callback * support, float scale, void * user_data)
   2963 {
   2964   STBIR_ASSERT(support != 0);
   2965 
   2966   if ( scale >= ( 1.0f-stbir__small_float ) ) // upscale
   2967     return (int)STBIR_CEILF(support(1.0f/scale,user_data) * 2.0f);
   2968   else
   2969     return (int)STBIR_CEILF(support(scale,user_data) * 2.0f / scale);
   2970 }
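         // Worked example (illustrative): for the cubic filter the support callback returns 2.
         // Upsampling (scale >= 1) gives ceil(2 * 2) = 4 input pixels per output pixel, while
         // downsampling to 1/4 size (scale = 0.25) gives ceil(2 * 2 / 0.25) = 16, because the
         // filter is stretched across the input when minifying.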
   2971 
    2972 // this is how many coefficients there are per run of the filter (which is different
    2973 //   from the filter_pixel_width depending on whether we are scattering or gathering)
   2974 static int stbir__get_coefficient_width(stbir__sampler * samp, int is_gather, void * user_data)
   2975 {
   2976   float scale = samp->scale_info.scale;
   2977   stbir__support_callback * support = samp->filter_support;
   2978 
   2979   switch( is_gather )
   2980   {
   2981     case 1:
   2982       return (int)STBIR_CEILF(support(1.0f / scale, user_data) * 2.0f);
   2983     case 2:
   2984       return (int)STBIR_CEILF(support(scale, user_data) * 2.0f / scale);
   2985     case 0:
   2986       return (int)STBIR_CEILF(support(scale, user_data) * 2.0f);
   2987     default:
   2988       STBIR_ASSERT( (is_gather >= 0 ) && (is_gather <= 2 ) );
   2989       return 0;
   2990   }
   2991 }
   2992 
   2993 static int stbir__get_contributors(stbir__sampler * samp, int is_gather)
   2994 {
   2995   if (is_gather)
   2996       return samp->scale_info.output_sub_size;
   2997   else
   2998       return (samp->scale_info.input_full_size + samp->filter_pixel_margin * 2);
   2999 }
   3000 
   3001 static int stbir__edge_zero_full( int n, int max )
   3002 {
   3003   STBIR__UNUSED(n);
   3004   STBIR__UNUSED(max);
   3005   return 0; // NOTREACHED
   3006 }
   3007 
   3008 static int stbir__edge_clamp_full( int n, int max )
   3009 {
   3010   if (n < 0)
   3011     return 0;
   3012 
   3013   if (n >= max)
   3014     return max - 1;
   3015 
   3016   return n; // NOTREACHED
   3017 }
   3018 
   3019 static int stbir__edge_reflect_full( int n, int max )
   3020 {
   3021   if (n < 0)
   3022   {
   3023     if (n > -max)
   3024       return -n;
   3025     else
   3026       return max - 1;
   3027   }
   3028 
   3029   if (n >= max)
   3030   {
   3031     int max2 = max * 2;
   3032     if (n >= max2)
   3033       return 0;
   3034     else
   3035       return max2 - n - 1;
   3036   }
   3037 
   3038   return n; // NOTREACHED
   3039 }
   3040 
   3041 static int stbir__edge_wrap_full( int n, int max )
   3042 {
   3043   if (n >= 0)
   3044     return (n % max);
   3045   else
   3046   {
   3047     int m = (-n) % max;
   3048 
   3049     if (m != 0)
   3050       m = max - m;
   3051 
   3052     return (m);
   3053   }
   3054 }
   3055 
   3056 typedef int stbir__edge_wrap_func( int n, int max );
   3057 static stbir__edge_wrap_func * stbir__edge_wrap_slow[] =
   3058 {
   3059   stbir__edge_clamp_full,    // STBIR_EDGE_CLAMP
   3060   stbir__edge_reflect_full,  // STBIR_EDGE_REFLECT
   3061   stbir__edge_wrap_full,     // STBIR_EDGE_WRAP
   3062   stbir__edge_zero_full,     // STBIR_EDGE_ZERO
   3063 };
   3064 
   3065 stbir__inline static int stbir__edge_wrap(stbir_edge edge, int n, int max)
   3066 {
   3067   // avoid per-pixel switch
   3068   if (n >= 0 && n < max)
   3069       return n;
   3070   return stbir__edge_wrap_slow[edge]( n, max );
   3071 }
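         // Worked examples for the edge handlers above (illustrative), with max = 4 (valid 0..3):
         //   CLAMP:   n = -2 -> 0,   n = 5 -> 3
         //   REFLECT: n = -2 -> 2,   n = 5 -> 2
         //   WRAP:    n = -1 -> 3,   n = 5 -> 1
         // In-range indices never reach the slow path: stbir__edge_wrap( edge, 2, 4 ) == 2 for any edge.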
   3072 
   3073 #define STBIR__MERGE_RUNS_PIXEL_THRESHOLD 16
   3074 
   3075 // get information on the extents of a sampler
   3076 static void stbir__get_extents( stbir__sampler * samp, stbir__extents * scanline_extents )
   3077 {
   3078   int j, stop;
   3079   int left_margin, right_margin;
   3080   int min_n = 0x7fffffff, max_n = -0x7fffffff;
   3081   int min_left = 0x7fffffff, max_left = -0x7fffffff;
   3082   int min_right = 0x7fffffff, max_right = -0x7fffffff;
   3083   stbir_edge edge = samp->edge;
   3084   stbir__contributors* contributors = samp->contributors;
   3085   int output_sub_size = samp->scale_info.output_sub_size;
   3086   int input_full_size = samp->scale_info.input_full_size;
   3087   int filter_pixel_margin = samp->filter_pixel_margin;
   3088 
   3089   STBIR_ASSERT( samp->is_gather );
   3090 
   3091   stop = output_sub_size;
   3092   for (j = 0; j < stop; j++ )
   3093   {
   3094     STBIR_ASSERT( contributors[j].n1 >= contributors[j].n0 );
   3095     if ( contributors[j].n0 < min_n )
   3096     {
   3097       min_n = contributors[j].n0;
   3098       stop = j + filter_pixel_margin;  // if we find a new min, only scan another filter width
   3099       if ( stop > output_sub_size ) stop = output_sub_size;
   3100     }
   3101   }
   3102 
   3103   stop = 0;
   3104   for (j = output_sub_size - 1; j >= stop; j-- )
   3105   {
   3106     STBIR_ASSERT( contributors[j].n1 >= contributors[j].n0 );
   3107     if ( contributors[j].n1 > max_n )
   3108     {
   3109       max_n = contributors[j].n1;
   3110       stop = j - filter_pixel_margin;  // if we find a new max, only scan another filter width
   3111       if (stop<0) stop = 0;
   3112     }
   3113   }
   3114 
   3115   STBIR_ASSERT( scanline_extents->conservative.n0 <= min_n );
   3116   STBIR_ASSERT( scanline_extents->conservative.n1 >= max_n );
   3117 
   3118   // now calculate how much into the margins we really read
   3119   left_margin = 0;
   3120   if ( min_n < 0 )
   3121   {
   3122     left_margin = -min_n;
   3123     min_n = 0;
   3124   }
   3125 
   3126   right_margin = 0;
   3127   if ( max_n >= input_full_size )
   3128   {
   3129     right_margin = max_n - input_full_size + 1;
   3130     max_n = input_full_size - 1;
   3131   }
   3132 
   3133   // index 1 is margin pixel extents (how many pixels we hang over the edge)
   3134   scanline_extents->edge_sizes[0] = left_margin;
   3135   scanline_extents->edge_sizes[1] = right_margin;
   3136 
   3137   // index 2 is pixels read from the input
   3138   scanline_extents->spans[0].n0 = min_n;
   3139   scanline_extents->spans[0].n1 = max_n;
   3140   scanline_extents->spans[0].pixel_offset_for_input = min_n;
   3141 
   3142   // default to no other input range
   3143   scanline_extents->spans[1].n0 = 0;
   3144   scanline_extents->spans[1].n1 = -1;
   3145   scanline_extents->spans[1].pixel_offset_for_input = 0;
   3146 
   3147   // don't have to do edge calc for zero clamp
   3148   if ( edge == STBIR_EDGE_ZERO )
   3149     return;
   3150 
   3151   // convert margin pixels to the pixels within the input (min and max)
   3152   for( j = -left_margin ; j < 0 ; j++ )
   3153   {
   3154       int p = stbir__edge_wrap( edge, j, input_full_size );
   3155       if ( p < min_left )
   3156         min_left = p;
   3157       if ( p > max_left )
   3158         max_left = p;
   3159   }
   3160 
   3161   for( j = input_full_size ; j < (input_full_size + right_margin) ; j++ )
   3162   {
   3163       int p = stbir__edge_wrap( edge, j, input_full_size );
   3164       if ( p < min_right )
   3165         min_right = p;
   3166       if ( p > max_right )
   3167         max_right = p;
   3168   }
   3169 
    3170   // merge the left margin pixel region if it connects within STBIR__MERGE_RUNS_PIXEL_THRESHOLD pixels of the main pixel region
   3171   if ( min_left != 0x7fffffff )
   3172   {
   3173     if ( ( ( min_left <= min_n ) && ( ( max_left  + STBIR__MERGE_RUNS_PIXEL_THRESHOLD ) >= min_n ) ) ||
   3174          ( ( min_n <= min_left ) && ( ( max_n  + STBIR__MERGE_RUNS_PIXEL_THRESHOLD ) >= max_left ) ) )
   3175     {
   3176       scanline_extents->spans[0].n0 = min_n = stbir__min( min_n, min_left );
   3177       scanline_extents->spans[0].n1 = max_n = stbir__max( max_n, max_left );
   3178       scanline_extents->spans[0].pixel_offset_for_input = min_n;
   3179       left_margin = 0;
   3180     }
   3181   }
   3182 
    3183   // merge the right margin pixel region if it connects within STBIR__MERGE_RUNS_PIXEL_THRESHOLD pixels of the main pixel region
   3184   if ( min_right != 0x7fffffff )
   3185   {
   3186     if ( ( ( min_right <= min_n ) && ( ( max_right  + STBIR__MERGE_RUNS_PIXEL_THRESHOLD ) >= min_n ) ) ||
   3187          ( ( min_n <= min_right ) && ( ( max_n  + STBIR__MERGE_RUNS_PIXEL_THRESHOLD ) >= max_right ) ) )
   3188     {
   3189       scanline_extents->spans[0].n0 = min_n = stbir__min( min_n, min_right );
   3190       scanline_extents->spans[0].n1 = max_n = stbir__max( max_n, max_right );
   3191       scanline_extents->spans[0].pixel_offset_for_input = min_n;
   3192       right_margin = 0;
   3193     }
   3194   }
   3195 
   3196   STBIR_ASSERT( scanline_extents->conservative.n0 <= min_n );
   3197   STBIR_ASSERT( scanline_extents->conservative.n1 >= max_n );
   3198 
    3199   // you get two ranges when you have the WRAP edge mode and you are doing just a piece of the resize,
   3200   //   so you need to get a second run of pixels from the opposite side of the scanline (which you
   3201   //   wouldn't need except for WRAP)
   3202 
   3203 
   3204   // if we can't merge the min_left range, add it as a second range
   3205   if ( ( left_margin ) && ( min_left != 0x7fffffff ) )
   3206   {
   3207     stbir__span * newspan = scanline_extents->spans + 1;
   3208     STBIR_ASSERT( right_margin == 0 );
   3209     if ( min_left < scanline_extents->spans[0].n0 )
   3210     {
   3211       scanline_extents->spans[1].pixel_offset_for_input = scanline_extents->spans[0].n0;
   3212       scanline_extents->spans[1].n0 = scanline_extents->spans[0].n0;
   3213       scanline_extents->spans[1].n1 = scanline_extents->spans[0].n1;
   3214       --newspan;
   3215     }
   3216     newspan->pixel_offset_for_input = min_left;
   3217     newspan->n0 = -left_margin;
   3218     newspan->n1 = ( max_left - min_left ) - left_margin;
   3219     scanline_extents->edge_sizes[0] = 0;  // don't need to copy the left margin, since we are directly decoding into the margin
   3220     return;
   3221   }
   3222 
   3223   // if we can't merge the min_right range, add it as a second range
   3224   if ( ( right_margin ) && ( min_right != 0x7fffffff ) )
   3225   {
   3226     stbir__span * newspan = scanline_extents->spans + 1;
   3227     if ( min_right < scanline_extents->spans[0].n0 )
   3228     {
   3229       scanline_extents->spans[1].pixel_offset_for_input = scanline_extents->spans[0].n0;
   3230       scanline_extents->spans[1].n0 = scanline_extents->spans[0].n0;
   3231       scanline_extents->spans[1].n1 = scanline_extents->spans[0].n1;
   3232       --newspan;
   3233     }
   3234     newspan->pixel_offset_for_input = min_right;
   3235     newspan->n0 = scanline_extents->spans[1].n1 + 1;
   3236     newspan->n1 = scanline_extents->spans[1].n1 + 1 + ( max_right - min_right );
   3237     scanline_extents->edge_sizes[1] = 0;  // don't need to copy the right margin, since we are directly decoding into the margin
   3238     return;
   3239   }
   3240 }
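// illustrative example (hypothetical numbers): how two spans fall out for a WRAP
//   resize of just a piece of a scanline.  assume input_full_size = 10 and the
//   conservative range needs input pixels -2..3 (left_margin = 2).  wrapping maps
//   pixels -2,-1 to 8,9, which is too far from the main run 0..3 to merge, so:
//     spans[0]: n0 =  0, n1 =  3, pixel_offset_for_input = 0   (main run)
//     spans[1]: n0 = -2, n1 = -1, pixel_offset_for_input = 8   (decode input pixels
//               8..9 straight into margin positions -2..-1 of the decode buffer)
//   edge_sizes[0] is then zeroed, since nothing needs to be memcpy'd afterwards.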
   3241 
   3242 static void stbir__calculate_in_pixel_range( int * first_pixel, int * last_pixel, float out_pixel_center, float out_filter_radius, float inv_scale, float out_shift, int input_size, stbir_edge edge )
   3243 {
   3244   int first, last;
   3245   float out_pixel_influence_lowerbound = out_pixel_center - out_filter_radius;
   3246   float out_pixel_influence_upperbound = out_pixel_center + out_filter_radius;
   3247 
   3248   float in_pixel_influence_lowerbound = (out_pixel_influence_lowerbound + out_shift) * inv_scale;
   3249   float in_pixel_influence_upperbound = (out_pixel_influence_upperbound + out_shift) * inv_scale;
   3250 
   3251   first = (int)(STBIR_FLOORF(in_pixel_influence_lowerbound + 0.5f));
   3252   last = (int)(STBIR_FLOORF(in_pixel_influence_upperbound - 0.5f));
   3253   if ( last < first ) last = first; // point sample mode can span a value *right* at 0.5, and cause these to cross
   3254 
   3255   if ( edge == STBIR_EDGE_WRAP )
   3256   {
   3257     if ( first < -input_size )
   3258       first = -input_size;
   3259     if ( last >= (input_size*2))
   3260       last = (input_size*2) - 1;
   3261   }
   3262 
   3263   *first_pixel = first;
   3264   *last_pixel = last;
   3265 }
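// illustrative example (hypothetical numbers): a 4 -> 8 upsample with a support-2
//   filter (e.g. cubic), so scale = 2, inv_scale = 0.5, out_shift = 0 and
//   out_filter_radius = 2 * scale = 4.  for output pixel 0 (out_pixel_center = 0.5):
//     in lowerbound = (0.5 - 4) * 0.5 = -1.75  ->  first = floor(-1.25) = -2
//     in upperbound = (0.5 + 4) * 0.5 =  2.25  ->  last  = floor( 1.75) =  1
//   so input pixels -2..1 contribute to output pixel 0; the out-of-range ones are
//   handled later according to the edge mode.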
   3266 
   3267 static void stbir__calculate_coefficients_for_gather_upsample( float out_filter_radius, stbir__kernel_callback * kernel, stbir__scale_info * scale_info, int num_contributors, stbir__contributors* contributors, float* coefficient_group, int coefficient_width, stbir_edge edge, void * user_data )
   3268 {
   3269   int n, end;
   3270   float inv_scale = scale_info->inv_scale;
   3271   float out_shift = scale_info->pixel_shift;
   3272   int input_size  = scale_info->input_full_size;
   3273   int numerator = scale_info->scale_numerator;
   3274   int polyphase = ( ( scale_info->scale_is_rational ) && ( numerator < num_contributors ) );
   3275 
   3276   // Looping through out pixels
   3277   end = num_contributors; if ( polyphase ) end = numerator;
   3278   for (n = 0; n < end; n++)
   3279   {
   3280     int i;
   3281     int last_non_zero;
   3282     float out_pixel_center = (float)n + 0.5f;
   3283     float in_center_of_out = (out_pixel_center + out_shift) * inv_scale;
   3284 
   3285     int in_first_pixel, in_last_pixel;
   3286 
   3287     stbir__calculate_in_pixel_range( &in_first_pixel, &in_last_pixel, out_pixel_center, out_filter_radius, inv_scale, out_shift, input_size, edge );
   3288 
   3289     // make sure we never generate a range larger than our precalculated coeff width
   3290     //   this only happens in point sample mode, but it's a good safe thing to do anyway
   3291     if ( ( in_last_pixel - in_first_pixel + 1 ) > coefficient_width )
   3292       in_last_pixel = in_first_pixel + coefficient_width - 1;
   3293 
   3294     last_non_zero = -1;
   3295     for (i = 0; i <= in_last_pixel - in_first_pixel; i++)
   3296     {
   3297       float in_pixel_center = (float)(i + in_first_pixel) + 0.5f;
   3298       float coeff = kernel(in_center_of_out - in_pixel_center, inv_scale, user_data);
   3299 
   3300       // kill denormals
   3301       if ( ( ( coeff < stbir__small_float ) && ( coeff > -stbir__small_float ) ) )
   3302       {
   3303         if ( i == 0 )  // if we're at the front, just eat zero contributors
   3304         {
   3305           STBIR_ASSERT ( ( in_last_pixel - in_first_pixel ) != 0 ); // there should be at least one contrib
   3306           ++in_first_pixel;
   3307           i--;
   3308           continue;
   3309         }
   3310         coeff = 0;  // make sure is fully zero (should keep denormals away)
   3311       }
   3312       else
   3313         last_non_zero = i;
   3314 
   3315       coefficient_group[i] = coeff;
   3316     }
   3317 
   3318     in_last_pixel = last_non_zero+in_first_pixel; // kills trailing zeros
   3319     contributors->n0 = in_first_pixel;
   3320     contributors->n1 = in_last_pixel;
   3321 
   3322     STBIR_ASSERT(contributors->n1 >= contributors->n0);
   3323 
   3324     ++contributors;
   3325     coefficient_group += coefficient_width;
   3326   }
   3327 }
   3328 
   3329 static void stbir__insert_coeff( stbir__contributors * contribs, float * coeffs, int new_pixel, float new_coeff, int max_width )
   3330 {
   3331   if ( new_pixel <= contribs->n1 )  // before the end
   3332   {
   3333     if ( new_pixel < contribs->n0 ) // before the front?
   3334     {
   3335       if ( ( contribs->n1 - new_pixel + 1 ) <= max_width )
   3336       { 
   3337         int j, o = contribs->n0 - new_pixel;
   3338         for ( j = contribs->n1 - contribs->n0 ; j >= 0 ; j-- ) // shift the existing coeffs up by o to make room at the front
   3339           coeffs[ j + o ] = coeffs[ j ];
   3340         for ( j = 1 ; j < o ; j++ )  // zero the gap between the new front coeff and the old n0
   3341           coeffs[ j ] = 0;
   3342         coeffs[ 0 ] = new_coeff;
   3343         contribs->n0 = new_pixel;
   3344       }
   3345     }
   3346     else
   3347     {
   3348       coeffs[ new_pixel - contribs->n0 ] += new_coeff;
   3349     }
   3350   }
   3351   else
   3352   {
   3353     if ( ( new_pixel - contribs->n0 + 1 ) <= max_width )
   3354     {
   3355       int j, e = new_pixel - contribs->n0;
   3356       for( j = ( contribs->n1 - contribs->n0 ) + 1 ; j < e ; j++ ) // clear in-between coeffs if there are any
   3357         coeffs[j] = 0;
   3358 
   3359       coeffs[ e ] = new_coeff;
   3360       contribs->n1 = new_pixel;
   3361     }
   3362   }
   3363 }
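// illustrative example (hypothetical weights): intended behavior of stbir__insert_coeff
//   for a contributor covering pixels 2..4 with weights [a,b,c] and a generous max_width:
//     insert at pixel 3, weight w  ->  pixels 2..4, weights [a, b+w, c]      (accumulate)
//     insert at pixel 6, weight w  ->  pixels 2..6, weights [a, b, c, 0, w]  (append, gap zeroed)
//     insert at pixel 0, weight w  ->  pixels 0..4, weights [w, 0, a, b, c]  (prepend, gap zeroed)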
   3364 
   3365 static void stbir__calculate_out_pixel_range( int * first_pixel, int * last_pixel, float in_pixel_center, float in_pixels_radius, float scale, float out_shift, int out_size )
   3366 {
   3367   float in_pixel_influence_lowerbound = in_pixel_center - in_pixels_radius;
   3368   float in_pixel_influence_upperbound = in_pixel_center + in_pixels_radius;
   3369   float out_pixel_influence_lowerbound = in_pixel_influence_lowerbound * scale - out_shift;
   3370   float out_pixel_influence_upperbound = in_pixel_influence_upperbound * scale - out_shift;
   3371   int out_first_pixel = (int)(STBIR_FLOORF(out_pixel_influence_lowerbound + 0.5f));
   3372   int out_last_pixel = (int)(STBIR_FLOORF(out_pixel_influence_upperbound - 0.5f));
   3373 
   3374   if ( out_first_pixel < 0 )
   3375     out_first_pixel = 0;
   3376   if ( out_last_pixel >= out_size )
   3377     out_last_pixel = out_size - 1;
   3378   *first_pixel = out_first_pixel;
   3379   *last_pixel = out_last_pixel;
   3380 }
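// illustrative example (hypothetical numbers): an 8 -> 4 downsample with a support-2
//   filter, so scale = 0.5, inv_scale = 2, out_shift = 0 and
//   in_pixels_radius = 2 * inv_scale = 4.  for input pixel 0 (in_pixel_center = 0.5):
//     out lowerbound = (0.5 - 4) * 0.5 = -1.75  ->  floor(-1.25) = -2, clamped to 0
//     out upperbound = (0.5 + 4) * 0.5 =  2.25  ->  floor( 1.75) =  1
//   so input pixel 0 contributes to output pixels 0 and 1.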
   3381 
   3382 static void stbir__calculate_coefficients_for_gather_downsample( int start, int end, float in_pixels_radius, stbir__kernel_callback * kernel, stbir__scale_info * scale_info, int coefficient_width, int num_contributors, stbir__contributors * contributors, float * coefficient_group, void * user_data )
   3383 {
   3384   int in_pixel;
   3385   int i;
   3386   int first_out_inited = -1;
   3387   float scale = scale_info->scale;
   3388   float out_shift = scale_info->pixel_shift;
   3389   int out_size = scale_info->output_sub_size;
   3390   int numerator = scale_info->scale_numerator;
   3391   int polyphase = ( ( scale_info->scale_is_rational ) && ( numerator < out_size ) );
   3392 
   3393   STBIR__UNUSED(num_contributors);
   3394 
   3395   // Loop through the input pixels
   3396   for (in_pixel = start; in_pixel < end; in_pixel++)
   3397   {
   3398     float in_pixel_center = (float)in_pixel + 0.5f;
   3399     float out_center_of_in = in_pixel_center * scale - out_shift;
   3400     int out_first_pixel, out_last_pixel;
   3401 
   3402     stbir__calculate_out_pixel_range( &out_first_pixel, &out_last_pixel, in_pixel_center, in_pixels_radius, scale, out_shift, out_size );
   3403 
   3404     if ( out_first_pixel > out_last_pixel )
   3405       continue;
   3406 
   3407     // clamp or exit if we are using polyphase filtering, and the limit is up
   3408     if ( polyphase )
   3409     {
   3410       // when polyphase, you only have to do coeffs up to the numerator count
   3411       if ( out_first_pixel == numerator )
   3412         break;
   3413 
   3414       // don't do any extra work, clamp last pixel at numerator too
   3415       if ( out_last_pixel >= numerator )
   3416         out_last_pixel = numerator - 1;
   3417     }
   3418 
   3419     for (i = 0; i <= out_last_pixel - out_first_pixel; i++)
   3420     {
   3421       float out_pixel_center = (float)(i + out_first_pixel) + 0.5f;
   3422       float x = out_pixel_center - out_center_of_in;
   3423       float coeff = kernel(x, scale, user_data) * scale;
   3424 
   3425       // kill the coeff if it's too small (avoid denormals)
   3426       if ( ( ( coeff < stbir__small_float ) && ( coeff > -stbir__small_float ) ) )
   3427         coeff = 0.0f;
   3428 
   3429       {
   3430         int out = i + out_first_pixel;
   3431         float * coeffs = coefficient_group + out * coefficient_width;
   3432         stbir__contributors * contribs = contributors + out;
   3433 
   3434         // is this the first time this output pixel has been seen?  Init it.
   3435         if ( out > first_out_inited )
   3436         {
   3437           STBIR_ASSERT( out == ( first_out_inited + 1 ) ); // ensure we have only advanced one at time
   3438           first_out_inited = out;
   3439           contribs->n0 = in_pixel;
   3440           contribs->n1 = in_pixel;
   3441           coeffs[0]  = coeff;
   3442         }
   3443         else
   3444         {
   3445           // insert on end (always in order)
   3446           if ( coeffs[0] == 0.0f )  // if the first coefficient is zero, then zap it for these coeffs
   3447           {
   3448             STBIR_ASSERT( ( in_pixel - contribs->n0 ) == 1 ); // ensure that when we zap, we're at the 2nd pos
   3449             contribs->n0 = in_pixel;
   3450           }
   3451           contribs->n1 = in_pixel;
   3452           STBIR_ASSERT( ( in_pixel - contribs->n0 ) < coefficient_width );
   3453           coeffs[in_pixel - contribs->n0]  = coeff;
   3454         }
   3455       }
   3456     }
   3457   }
   3458 }
   3459 
   3460 #ifdef STBIR_RENORMALIZE_IN_FLOAT
   3461 #define STBIR_RENORM_TYPE float
   3462 #else
   3463 #define STBIR_RENORM_TYPE double
   3464 #endif
   3465 
   3466 static void stbir__cleanup_gathered_coefficients( stbir_edge edge, stbir__filter_extent_info* filter_info, stbir__scale_info * scale_info, int num_contributors, stbir__contributors* contributors, float * coefficient_group, int coefficient_width )
   3467 {
   3468   int input_size = scale_info->input_full_size;
   3469   int input_last_n1 = input_size - 1;
   3470   int n, end;
   3471   int lowest = 0x7fffffff;
   3472   int highest = -0x7fffffff;
   3473   int widest = -1;
   3474   int numerator = scale_info->scale_numerator;
   3475   int denominator = scale_info->scale_denominator;
   3476   int polyphase = ( ( scale_info->scale_is_rational ) && ( numerator < num_contributors ) );
   3477   float * coeffs;
   3478   stbir__contributors * contribs;
   3479 
   3480   // weight all the coeffs for each sample
   3481   coeffs = coefficient_group;
   3482   contribs = contributors;
   3483   end = num_contributors; if ( polyphase ) end = numerator;
   3484   for (n = 0; n < end; n++)
   3485   {
   3486     int i;
   3487     STBIR_RENORM_TYPE filter_scale, total_filter = 0;
   3488     int e;
   3489 
   3490     // add all contribs
   3491     e = contribs->n1 - contribs->n0;
   3492     for( i = 0 ; i <= e ; i++ )
   3493     {
   3494       total_filter += (STBIR_RENORM_TYPE) coeffs[i];
   3495       STBIR_ASSERT( ( coeffs[i] >= -2.0f ) && ( coeffs[i] <= 2.0f )  ); // check for wonky weights
   3496     }
   3497 
   3498     // rescale
   3499     if ( ( total_filter < stbir__small_float ) && ( total_filter > -stbir__small_float ) )
   3500     {
   3501       // all coeffs are extremely small, just zero it
   3502       contribs->n1 = contribs->n0;
   3503       coeffs[0] = 0.0f;
   3504     }
   3505     else
   3506     {
   3507       // if the total isn't 1.0, rescale everything
   3508       if ( ( total_filter < (1.0f-stbir__small_float) ) || ( total_filter > (1.0f+stbir__small_float) ) )
   3509       {
   3510         filter_scale = ((STBIR_RENORM_TYPE)1.0) / total_filter;
   3511 
   3512         // scale them all
   3513         for (i = 0; i <= e; i++)
   3514           coeffs[i] = (float) ( coeffs[i] * filter_scale );
   3515       }
   3516     }
   3517     ++contribs;
   3518     coeffs += coefficient_width;
   3519   }
   3520 
   3521   // if we have a rational for the scale, we can exploit the polyphaseness to not calculate
   3522   //   most of the coefficients, so we copy them here
   3523   if ( polyphase )
   3524   {
   3525     stbir__contributors * prev_contribs = contributors;
   3526     stbir__contributors * cur_contribs = contributors + numerator;
   3527 
   3528     for( n = numerator ; n < num_contributors ; n++ )
   3529     {
   3530       cur_contribs->n0 = prev_contribs->n0 + denominator;
   3531       cur_contribs->n1 = prev_contribs->n1 + denominator;
   3532       ++cur_contribs;
   3533       ++prev_contribs;
   3534     }
   3535     stbir_overlapping_memcpy( coefficient_group + numerator * coefficient_width, coefficient_group, ( num_contributors - numerator ) * coefficient_width * sizeof( coeffs[ 0 ] ) );
   3536   }
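  // illustrative example (hypothetical numbers): for a rational scale of 2/3
  //   (numerator = 2, denominator = 3), output pixels n and n+2 see the input with
  //   exactly the same sub-pixel phase, so their weights are identical and only their
  //   pixel ranges differ by 3.  the weighting loop above only touches contributors
  //   0 and 1, and the copy above stamps out the rest as
  //     contributors[n].n0 = contributors[n-2].n0 + 3  (and the same for n1)
  //   with the coefficient values copied unchanged.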
   3537 
   3538   coeffs = coefficient_group;
   3539   contribs = contributors;
   3540 
   3541   for (n = 0; n < num_contributors; n++)
   3542   {
   3543     int i;
   3544 
   3545     // in zero edge mode, just remove out of bounds contribs completely (since their weights are accounted for now)
   3546     if ( edge == STBIR_EDGE_ZERO )
   3547     {
   3548       // shrink the right side if necessary
   3549       if ( contribs->n1 > input_last_n1 )
   3550         contribs->n1 = input_last_n1;
   3551 
   3552       // shrink the left side
   3553       if ( contribs->n0 < 0 )
   3554       {
   3555         int j, left, skips = 0;
   3556 
   3557         skips = -contribs->n0;
   3558         contribs->n0 = 0;
   3559 
   3560         // now move down the weights
   3561         left = contribs->n1 - contribs->n0 + 1;
   3562         if ( left > 0 )
   3563         {
   3564           for( j = 0 ; j < left ; j++ )
   3565             coeffs[ j ] = coeffs[ j + skips ];
   3566         }
   3567       }
   3568     }
   3569     else if ( ( edge == STBIR_EDGE_CLAMP ) || ( edge == STBIR_EDGE_REFLECT ) )
   3570     {
   3571       // for clamp and reflect, calculate the true inbounds position (based on edge type) and just add that to the existing weight
   3572 
   3573       // right hand side first
   3574       if ( contribs->n1 > input_last_n1 )
   3575       {
   3576         int start = contribs->n0;
   3577         int endi = contribs->n1;
   3578         contribs->n1 = input_last_n1;
   3579         for( i = input_size; i <= endi; i++ )
   3580           stbir__insert_coeff( contribs, coeffs, stbir__edge_wrap_slow[edge]( i, input_size ), coeffs[i-start], coefficient_width );
   3581       }
   3582 
   3583       // now check left hand edge
   3584       if ( contribs->n0 < 0 )
   3585       {
   3586         int save_n0;
   3587         float save_n0_coeff;
   3588         float * c = coeffs - ( contribs->n0 + 1 );
   3589 
   3590         // reinsert the coeffs at their reflected or clamped positions (insert accumulates if the coeff already exists)
   3591         for( i = -1 ; i > contribs->n0 ; i-- )
   3592           stbir__insert_coeff( contribs, coeffs, stbir__edge_wrap_slow[edge]( i, input_size ), *c--, coefficient_width );
   3593         save_n0 = contribs->n0;
   3594         save_n0_coeff = c[0]; // save it, since we didn't do the final one (i==n0), because there might be too many coeffs to hold (before we resize)!
   3595 
   3596         // now slide all the coeffs down (since we have accumulated them in the positive contribs) and reset the first contrib
   3597         contribs->n0 = 0;
   3598         for(i = 0 ; i <= contribs->n1 ; i++ )
   3599           coeffs[i] = coeffs[i-save_n0];
   3600 
   3601         // now that we have shrunk down the contribs, we insert the first one safely
   3602         stbir__insert_coeff( contribs, coeffs, stbir__edge_wrap_slow[edge]( save_n0, input_size ), save_n0_coeff, coefficient_width );
   3603       }
   3604     }
   3605 
   3606     if ( contribs->n0 <= contribs->n1 )
   3607     {
   3608       int diff = contribs->n1 - contribs->n0 + 1;
   3609       while ( diff && ( coeffs[ diff-1 ] == 0.0f ) )
   3610         --diff;
   3611 
   3612       contribs->n1 = contribs->n0 + diff - 1;
   3613 
   3614       if ( contribs->n0 <= contribs->n1 )
   3615       {
   3616         if ( contribs->n0 < lowest )
   3617           lowest = contribs->n0;
   3618         if ( contribs->n1 > highest )
   3619           highest = contribs->n1;
   3620         if ( diff > widest )
   3621           widest = diff;
   3622       }
   3623 
   3624       // re-zero out unused coefficients (if any)
   3625       for( i = diff ; i < coefficient_width ; i++ )
   3626         coeffs[i] = 0.0f;
   3627     }
   3628 
   3629     ++contribs;
   3630     coeffs += coefficient_width;
   3631   }
   3632   filter_info->lowest = lowest;
   3633   filter_info->highest = highest;
   3634   filter_info->widest = widest;
   3635 }
   3636 
   3637 #undef STBIR_RENORM_TYPE 
   3638 
   3639 static int stbir__pack_coefficients( int num_contributors, stbir__contributors* contributors, float * coefficents, int coefficient_width, int widest, int row0, int row1 ) 
   3640 {
   3641   #define STBIR_MOVE_1( dest, src ) { STBIR_NO_UNROLL(dest); ((stbir_uint32*)(dest))[0] = ((stbir_uint32*)(src))[0]; }
   3642   #define STBIR_MOVE_2( dest, src ) { STBIR_NO_UNROLL(dest); ((stbir_uint64*)(dest))[0] = ((stbir_uint64*)(src))[0]; }
   3643   #ifdef STBIR_SIMD
   3644   #define STBIR_MOVE_4( dest, src ) { stbir__simdf t; STBIR_NO_UNROLL(dest); stbir__simdf_load( t, src ); stbir__simdf_store( dest, t ); }
   3645   #else
   3646   #define STBIR_MOVE_4( dest, src ) { STBIR_NO_UNROLL(dest); ((stbir_uint64*)(dest))[0] = ((stbir_uint64*)(src))[0]; ((stbir_uint64*)(dest))[1] = ((stbir_uint64*)(src))[1]; }
   3647   #endif
   3648 
   3649   int row_end = row1 + 1;
   3650   STBIR__UNUSED( row0 ); // only used in an assert
   3651 
   3652   if ( coefficient_width != widest )
   3653   {
   3654     float * pc = coefficents;
   3655     float * coeffs = coefficents;
   3656     float * pc_end = coefficents + num_contributors * widest;
   3657     switch( widest )
   3658     {
   3659       case 1:
   3660         STBIR_NO_UNROLL_LOOP_START
   3661         do {
   3662           STBIR_MOVE_1( pc, coeffs );
   3663           ++pc;
   3664           coeffs += coefficient_width;
   3665         } while ( pc < pc_end );
   3666         break;
   3667       case 2:
   3668         STBIR_NO_UNROLL_LOOP_START
   3669         do {
   3670           STBIR_MOVE_2( pc, coeffs );
   3671           pc += 2;
   3672           coeffs += coefficient_width;
   3673         } while ( pc < pc_end );
   3674         break;
   3675       case 3:
   3676         STBIR_NO_UNROLL_LOOP_START
   3677         do {
   3678           STBIR_MOVE_2( pc, coeffs );
   3679           STBIR_MOVE_1( pc+2, coeffs+2 );
   3680           pc += 3;
   3681           coeffs += coefficient_width;
   3682         } while ( pc < pc_end );
   3683         break;
   3684       case 4:
   3685         STBIR_NO_UNROLL_LOOP_START
   3686         do {
   3687           STBIR_MOVE_4( pc, coeffs );
   3688           pc += 4;
   3689           coeffs += coefficient_width;
   3690         } while ( pc < pc_end );
   3691         break;
   3692       case 5:
   3693         STBIR_NO_UNROLL_LOOP_START
   3694         do {
   3695           STBIR_MOVE_4( pc, coeffs );
   3696           STBIR_MOVE_1( pc+4, coeffs+4 );
   3697           pc += 5;
   3698           coeffs += coefficient_width;
   3699         } while ( pc < pc_end );
   3700         break;
   3701       case 6:
   3702         STBIR_NO_UNROLL_LOOP_START
   3703         do {
   3704           STBIR_MOVE_4( pc, coeffs );
   3705           STBIR_MOVE_2( pc+4, coeffs+4 );
   3706           pc += 6;
   3707           coeffs += coefficient_width;
   3708         } while ( pc < pc_end );
   3709         break;
   3710       case 7:
   3711         STBIR_NO_UNROLL_LOOP_START
   3712         do {
   3713           STBIR_MOVE_4( pc, coeffs );
   3714           STBIR_MOVE_2( pc+4, coeffs+4 );
   3715           STBIR_MOVE_1( pc+6, coeffs+6 );
   3716           pc += 7;
   3717           coeffs += coefficient_width;
   3718         } while ( pc < pc_end );
   3719         break;
   3720       case 8:
   3721         STBIR_NO_UNROLL_LOOP_START
   3722         do {
   3723           STBIR_MOVE_4( pc, coeffs );
   3724           STBIR_MOVE_4( pc+4, coeffs+4 );
   3725           pc += 8;
   3726           coeffs += coefficient_width;
   3727         } while ( pc < pc_end );
   3728         break;
   3729       case 9:
   3730         STBIR_NO_UNROLL_LOOP_START
   3731         do {
   3732           STBIR_MOVE_4( pc, coeffs );
   3733           STBIR_MOVE_4( pc+4, coeffs+4 );
   3734           STBIR_MOVE_1( pc+8, coeffs+8 );
   3735           pc += 9;
   3736           coeffs += coefficient_width;
   3737         } while ( pc < pc_end );
   3738         break;
   3739       case 10:
   3740         STBIR_NO_UNROLL_LOOP_START
   3741         do {
   3742           STBIR_MOVE_4( pc, coeffs );
   3743           STBIR_MOVE_4( pc+4, coeffs+4 );
   3744           STBIR_MOVE_2( pc+8, coeffs+8 );
   3745           pc += 10;
   3746           coeffs += coefficient_width;
   3747         } while ( pc < pc_end );
   3748         break;
   3749       case 11:
   3750         STBIR_NO_UNROLL_LOOP_START
   3751         do {
   3752           STBIR_MOVE_4( pc, coeffs );
   3753           STBIR_MOVE_4( pc+4, coeffs+4 );
   3754           STBIR_MOVE_2( pc+8, coeffs+8 );
   3755           STBIR_MOVE_1( pc+10, coeffs+10 );
   3756           pc += 11;
   3757           coeffs += coefficient_width;
   3758         } while ( pc < pc_end );
   3759         break;
   3760       case 12:
   3761         STBIR_NO_UNROLL_LOOP_START
   3762         do {
   3763           STBIR_MOVE_4( pc, coeffs );
   3764           STBIR_MOVE_4( pc+4, coeffs+4 );
   3765           STBIR_MOVE_4( pc+8, coeffs+8 );
   3766           pc += 12;
   3767           coeffs += coefficient_width;
   3768         } while ( pc < pc_end );
   3769         break;
   3770       default:
   3771         STBIR_NO_UNROLL_LOOP_START
   3772         do {
   3773           float * copy_end = pc + widest - 4;
   3774           float * c = coeffs;
   3775           do {
   3776             STBIR_NO_UNROLL( pc );
   3777             STBIR_MOVE_4( pc, c );
   3778             pc += 4;
   3779             c += 4;
   3780           } while ( pc <= copy_end );
   3781           copy_end += 4;
   3782           STBIR_NO_UNROLL_LOOP_START
   3783           while ( pc < copy_end )
   3784           {
   3785             STBIR_MOVE_1( pc, c );
   3786             ++pc; ++c;
   3787           }
   3788           coeffs += coefficient_width;
   3789         } while ( pc < pc_end );
   3790         break;
   3791     }
   3792   }
   3793 
   3794   // some horizontal routines read one float off the end (which is then masked off), so put in a sentinel so we don't read an sNaN or denormal
   3795   coefficents[ widest * num_contributors ] = 8888.0f;
   3796 
   3797   // the minimum we might read for unrolled filter widths is 12. So, we need to
   3798   //   make sure we never read outside the decode buffer, by possibly moving
   3799   //   the sample area back into the scanline, and putting zero weights first.
   3800   // we start on the right edge and check until we're well past the possible
   3801   //   clip area (2*widest).
   3802   {
   3803     stbir__contributors * contribs = contributors + num_contributors - 1;
   3804     float * coeffs = coefficents + widest * ( num_contributors - 1 );
   3805 
   3806     // go until no chance of clipping (this is usually less than 8 loops)
   3807     while ( ( contribs >= contributors ) && ( ( contribs->n0 + widest*2 ) >= row_end ) )
   3808     {
   3809       // might we clip??
   3810       if ( ( contribs->n0 + widest ) > row_end )
   3811       {
   3812         int stop_range = widest;
   3813 
   3814         // if range is larger than 12, it will be handled by generic loops that can terminate on the exact length
   3815         //   of this contrib n1, instead of a fixed widest amount - so calculate this
   3816         if ( widest > 12 )
   3817         {
   3818           int mod;
   3819 
   3820           // how far will be read in the n_coeff loop (which depends on the widest count mod 4)
   3821           mod = widest & 3;
   3822           stop_range = ( ( ( contribs->n1 - contribs->n0 + 1 ) - mod + 3 ) & ~3 ) + mod;
   3823 
   3824           // the n_coeff loops do a minimum amount of coeffs, so factor that in!
   3825           if ( stop_range < ( 8 + mod ) ) stop_range = 8 + mod;
   3826         }
   3827 
   3828         // now see if we still clip with the refined range
   3829         if ( ( contribs->n0 + stop_range ) > row_end )
   3830         {
   3831           int new_n0 = row_end - stop_range;
   3832           int num = contribs->n1 - contribs->n0 + 1;
   3833           int backup = contribs->n0 - new_n0;
   3834           float * from_co = coeffs + num - 1;
   3835           float * to_co = from_co + backup;
   3836 
   3837           STBIR_ASSERT( ( new_n0 >= row0 ) && ( new_n0 < contribs->n0 ) );
   3838 
   3839           // move the coeffs over
   3840           while( num )
   3841           {
   3842             *to_co-- = *from_co--;
   3843             --num;
   3844           }
   3845           // zero new positions
   3846           while ( to_co >= coeffs )
   3847             *to_co-- = 0;
   3848           // set new start point
   3849           contribs->n0 = new_n0;
   3850           if ( widest > 12 )
   3851           {
   3852             int mod;
   3853 
   3854             // how far will be read in the n_coeff loop (which depends on the widest count mod 4)
   3855             mod = widest & 3;
   3856             stop_range = ( ( ( contribs->n1 - contribs->n0 + 1 ) - mod + 3 ) & ~3 ) + mod;
   3857 
   3858             // the n_coeff loops do a minimum amount of coeffs, so factor that in!
   3859             if ( stop_range < ( 8 + mod ) ) stop_range = 8 + mod;
   3860           }
   3861         }
   3862       }
   3863       --contribs;
   3864       coeffs -= widest;
   3865     }
   3866   }
   3867 
   3868   return widest;
   3869   #undef STBIR_MOVE_1
   3870   #undef STBIR_MOVE_2
   3871   #undef STBIR_MOVE_4
   3872 }
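// illustrative example (hypothetical numbers): the clip fixup at the end of
//   stbir__pack_coefficients.  say widest = 4, row_end = 10 and the last contributor
//   covers pixels 8..9 with weights [w0,w1].  a 4-wide unrolled horizontal loop
//   starting at n0 = 8 would touch pixels 8..11, past the end of the decode buffer,
//   so the contributor is slid back to n0 = 10 - 4 = 6 and becomes pixels 6..9 with
//   weights [0,0,w0,w1] - same result, no out-of-bounds read.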
   3873 
   3874 static void stbir__calculate_filters( stbir__sampler * samp, stbir__sampler * other_axis_for_pivot, void * user_data STBIR_ONLY_PROFILE_BUILD_GET_INFO )
   3875 {
   3876   int n;
   3877   float scale = samp->scale_info.scale;
   3878   stbir__kernel_callback * kernel = samp->filter_kernel;
   3879   stbir__support_callback * support = samp->filter_support;
   3880   float inv_scale = samp->scale_info.inv_scale;
   3881   int input_full_size = samp->scale_info.input_full_size;
   3882   int gather_num_contributors = samp->num_contributors;
   3883   stbir__contributors* gather_contributors = samp->contributors;
   3884   float * gather_coeffs = samp->coefficients;
   3885   int gather_coefficient_width = samp->coefficient_width;
   3886 
   3887   switch ( samp->is_gather )
   3888   {
   3889     case 1: // gather upsample
   3890     {
   3891       float out_pixels_radius = support(inv_scale,user_data) * scale;
   3892 
   3893       stbir__calculate_coefficients_for_gather_upsample( out_pixels_radius, kernel, &samp->scale_info, gather_num_contributors, gather_contributors, gather_coeffs, gather_coefficient_width, samp->edge, user_data );
   3894 
   3895       STBIR_PROFILE_BUILD_START( cleanup );
   3896       stbir__cleanup_gathered_coefficients( samp->edge, &samp->extent_info, &samp->scale_info, gather_num_contributors, gather_contributors, gather_coeffs, gather_coefficient_width );
   3897       STBIR_PROFILE_BUILD_END( cleanup );
   3898     }
   3899     break;
   3900 
   3901     case 0: // scatter downsample (only on vertical)
   3902     case 2: // gather downsample
   3903     {
   3904       float in_pixels_radius = support(scale,user_data) * inv_scale;
   3905       int filter_pixel_margin = samp->filter_pixel_margin;
   3906       int input_end = input_full_size + filter_pixel_margin;
   3907 
   3908       // if this is a scatter, we do a downsample gather to get the coeffs, and then pivot after
   3909       if ( !samp->is_gather )
   3910       {
   3911         // check if we are using the same gather downsample on the horizontal as this vertical,
   3912         //   if so, then we don't have to generate them, we can just pivot from the horizontal.
   3913         if ( other_axis_for_pivot )
   3914         {
   3915           gather_contributors = other_axis_for_pivot->contributors;
   3916           gather_coeffs = other_axis_for_pivot->coefficients;
   3917           gather_coefficient_width = other_axis_for_pivot->coefficient_width;
   3918           gather_num_contributors = other_axis_for_pivot->num_contributors;
   3919           samp->extent_info.lowest = other_axis_for_pivot->extent_info.lowest;
   3920           samp->extent_info.highest = other_axis_for_pivot->extent_info.highest;
   3921           samp->extent_info.widest = other_axis_for_pivot->extent_info.widest;
   3922           goto jump_right_to_pivot;
   3923         }
   3924 
   3925         gather_contributors = samp->gather_prescatter_contributors;
   3926         gather_coeffs = samp->gather_prescatter_coefficients;
   3927         gather_coefficient_width = samp->gather_prescatter_coefficient_width;
   3928         gather_num_contributors = samp->gather_prescatter_num_contributors;
   3929       }
   3930 
   3931       stbir__calculate_coefficients_for_gather_downsample( -filter_pixel_margin, input_end, in_pixels_radius, kernel, &samp->scale_info, gather_coefficient_width, gather_num_contributors, gather_contributors, gather_coeffs, user_data );
   3932 
   3933       STBIR_PROFILE_BUILD_START( cleanup );
   3934       stbir__cleanup_gathered_coefficients( samp->edge, &samp->extent_info, &samp->scale_info, gather_num_contributors, gather_contributors, gather_coeffs, gather_coefficient_width );
   3935       STBIR_PROFILE_BUILD_END( cleanup );
   3936 
   3937       if ( !samp->is_gather )
   3938       {
   3939         // if this is a scatter (vertical only), then we need to pivot the coeffs
   3940         stbir__contributors * scatter_contributors;
   3941         int highest_set;
   3942 
   3943         jump_right_to_pivot:
   3944 
   3945         STBIR_PROFILE_BUILD_START( pivot );
   3946 
   3947         highest_set = (-filter_pixel_margin) - 1;
   3948         for (n = 0; n < gather_num_contributors; n++)
   3949         {
   3950           int k;
   3951           int gn0 = gather_contributors->n0, gn1 = gather_contributors->n1;
   3952           int scatter_coefficient_width = samp->coefficient_width;
   3953           float * scatter_coeffs = samp->coefficients + ( gn0 + filter_pixel_margin ) * scatter_coefficient_width;
   3954           float * g_coeffs = gather_coeffs;
   3955           scatter_contributors = samp->contributors + ( gn0 + filter_pixel_margin );
   3956 
   3957           for (k = gn0 ; k <= gn1 ; k++ )
   3958           {
   3959             float gc = *g_coeffs++;
   3960             
   3961             // skip zero and denormals - must skip zeros to avoid adding coeffs beyond scatter_coefficient_width
   3962             //   (which happens when pivoting from horizontal, which might have dummy zeros)
   3963             if ( ( ( gc >= stbir__small_float ) || ( gc <= -stbir__small_float ) ) )
   3964             {
   3965               if ( ( k > highest_set ) || ( scatter_contributors->n0 > scatter_contributors->n1 ) )
   3966               {
   3967                 {
   3968                   // if we are skipping over several contributors, we need to clear the skipped ones
   3969                   stbir__contributors * clear_contributors = samp->contributors + ( highest_set + filter_pixel_margin + 1);
   3970                   while ( clear_contributors < scatter_contributors )
   3971                   {
   3972                     clear_contributors->n0 = 0;
   3973                     clear_contributors->n1 = -1;
   3974                     ++clear_contributors;
   3975                   }
   3976                 }
   3977                 scatter_contributors->n0 = n;
   3978                 scatter_contributors->n1 = n;
   3979                 scatter_coeffs[0]  = gc;
   3980                 highest_set = k;
   3981               }
   3982               else
   3983               {
   3984                 stbir__insert_coeff( scatter_contributors, scatter_coeffs, n, gc, scatter_coefficient_width );
   3985               }
   3986               STBIR_ASSERT( ( scatter_contributors->n1 - scatter_contributors->n0 + 1 ) <= scatter_coefficient_width );
   3987             }
   3988             ++scatter_contributors;
   3989             scatter_coeffs += scatter_coefficient_width;
   3990           }
   3991 
   3992           ++gather_contributors;
   3993           gather_coeffs += gather_coefficient_width;
   3994         }
   3995 
   3996         // now clear any unset contribs
   3997         {
   3998           stbir__contributors * clear_contributors = samp->contributors + ( highest_set + filter_pixel_margin + 1);
   3999           stbir__contributors * end_contributors = samp->contributors + samp->num_contributors;
   4000           while ( clear_contributors < end_contributors )
   4001           {
   4002             clear_contributors->n0 = 0;
   4003             clear_contributors->n1 = -1;
   4004             ++clear_contributors;
   4005           }
   4006         }
   4007 
   4008         STBIR_PROFILE_BUILD_END( pivot );
   4009       }
   4010     }
   4011     break;
   4012   }
   4013 }
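// illustrative example (hypothetical weights): the "pivot" above is a sparse transpose
//   of the coefficient matrix.  if the gather form says
//     out 0 <- { in 0: 0.6, in 1: 0.4 }
//     out 1 <- { in 1: 0.5, in 2: 0.5 }
//   then the scattered (vertical) form stores, per input scanline, which outputs it feeds:
//     in 0 -> { out 0: 0.6 }
//     in 1 -> { out 0: 0.4, out 1: 0.5 }
//     in 2 -> { out 1: 0.5 }
//   which is what the jump_right_to_pivot loop builds via stbir__insert_coeff.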
   4014 
   4015 
   4016 //========================================================================================================
   4017 // scanline decoders and encoders
   4018 
   4019 #define stbir__coder_min_num 1
   4020 #define STB_IMAGE_RESIZE_DO_CODERS
   4021 #include STBIR__HEADER_FILENAME
   4022 
   4023 #define stbir__decode_suffix BGRA
   4024 #define stbir__decode_swizzle
   4025 #define stbir__decode_order0  2
   4026 #define stbir__decode_order1  1
   4027 #define stbir__decode_order2  0
   4028 #define stbir__decode_order3  3
   4029 #define stbir__encode_order0  2
   4030 #define stbir__encode_order1  1
   4031 #define stbir__encode_order2  0
   4032 #define stbir__encode_order3  3
   4033 #define stbir__coder_min_num 4
   4034 #define STB_IMAGE_RESIZE_DO_CODERS
   4035 #include STBIR__HEADER_FILENAME
   4036 
   4037 #define stbir__decode_suffix ARGB
   4038 #define stbir__decode_swizzle
   4039 #define stbir__decode_order0  1
   4040 #define stbir__decode_order1  2
   4041 #define stbir__decode_order2  3
   4042 #define stbir__decode_order3  0
   4043 #define stbir__encode_order0  3
   4044 #define stbir__encode_order1  0
   4045 #define stbir__encode_order2  1
   4046 #define stbir__encode_order3  2
   4047 #define stbir__coder_min_num 4
   4048 #define STB_IMAGE_RESIZE_DO_CODERS
   4049 #include STBIR__HEADER_FILENAME
   4050 
   4051 #define stbir__decode_suffix ABGR
   4052 #define stbir__decode_swizzle
   4053 #define stbir__decode_order0  3
   4054 #define stbir__decode_order1  2
   4055 #define stbir__decode_order2  1
   4056 #define stbir__decode_order3  0
   4057 #define stbir__encode_order0  3
   4058 #define stbir__encode_order1  2
   4059 #define stbir__encode_order2  1
   4060 #define stbir__encode_order3  0
   4061 #define stbir__coder_min_num 4
   4062 #define STB_IMAGE_RESIZE_DO_CODERS
   4063 #include STBIR__HEADER_FILENAME
   4064 
   4065 #define stbir__decode_suffix AR
   4066 #define stbir__decode_swizzle
   4067 #define stbir__decode_order0  1
   4068 #define stbir__decode_order1  0
   4069 #define stbir__decode_order2  3
   4070 #define stbir__decode_order3  2
   4071 #define stbir__encode_order0  1
   4072 #define stbir__encode_order1  0
   4073 #define stbir__encode_order2  3
   4074 #define stbir__encode_order3  2
   4075 #define stbir__coder_min_num 2
   4076 #define STB_IMAGE_RESIZE_DO_CODERS
   4077 #include STBIR__HEADER_FILENAME
   4078 
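// illustrative sketch (kept out of the build with #if 0, names are hypothetical): each
//   block above sets the channel-order macros and re-includes this same header; the
//   section guarded by STB_IMAGE_RESIZE_DO_CODERS expands into one decode/encode pair
//   per layout and then #undefs the macros.  for the BGRA order (2,1,0,3) the generated
//   uint8 decoder behaves roughly like this:
#if 0
static void example_decode_uint8_BGRA( float * decodep, int width_times_channels, void const * inputp )
{
  unsigned char const * input = (unsigned char const *)inputp;
  float * decode = decodep;
  float const * decode_end = decode + width_times_channels;
  while ( decode < decode_end )
  {
    decode[0] = (float) input[ 2 ];  // stbir__decode_order0
    decode[1] = (float) input[ 1 ];  // stbir__decode_order1
    decode[2] = (float) input[ 0 ];  // stbir__decode_order2
    decode[3] = (float) input[ 3 ];  // stbir__decode_order3
    decode += 4; input += 4;
  }
}
#endif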
   4079 
   4080 // fancy alpha means we expand to keep both premultiplied and non-premultiplied color channels
   4081 static void stbir__fancy_alpha_weight_4ch( float * out_buffer, int width_times_channels )
   4082 {
   4083   float STBIR_STREAMOUT_PTR(*) out = out_buffer;
   4084   float const * end_decode = out_buffer + ( width_times_channels / 4 ) * 7;  // decode buffer aligned to end of out_buffer
   4085   float STBIR_STREAMOUT_PTR(*) decode = (float*)end_decode - width_times_channels;
   4086 
   4087   // fancy alpha is stored internally as R G B A Rpm Gpm Bpm
   4088 
   4089   #ifdef STBIR_SIMD
   4090 
   4091   #ifdef STBIR_SIMD8
   4092   decode += 16;
   4093   STBIR_NO_UNROLL_LOOP_START
   4094   while ( decode <= end_decode )
   4095   {
   4096     stbir__simdf8 d0,d1,a0,a1,p0,p1;
   4097     STBIR_NO_UNROLL(decode);
   4098     stbir__simdf8_load( d0, decode-16 );
   4099     stbir__simdf8_load( d1, decode-16+8 );
   4100     stbir__simdf8_0123to33333333( a0, d0 );
   4101     stbir__simdf8_0123to33333333( a1, d1 );
   4102     stbir__simdf8_mult( p0, a0, d0 );
   4103     stbir__simdf8_mult( p1, a1, d1 );
   4104     stbir__simdf8_bot4s( a0, d0, p0 );
   4105     stbir__simdf8_bot4s( a1, d1, p1 );
   4106     stbir__simdf8_top4s( d0, d0, p0 );
   4107     stbir__simdf8_top4s( d1, d1, p1 );
   4108     stbir__simdf8_store ( out, a0 );
   4109     stbir__simdf8_store ( out+7, d0 );
   4110     stbir__simdf8_store ( out+14, a1 );
   4111     stbir__simdf8_store ( out+21, d1 );
   4112     decode += 16;
   4113     out += 28;
   4114   }
   4115   decode -= 16;
   4116   #else
   4117   decode += 8;
   4118   STBIR_NO_UNROLL_LOOP_START
   4119   while ( decode <= end_decode )
   4120   {
   4121     stbir__simdf d0,a0,d1,a1,p0,p1;
   4122     STBIR_NO_UNROLL(decode);
   4123     stbir__simdf_load( d0, decode-8 );
   4124     stbir__simdf_load( d1, decode-8+4 );
   4125     stbir__simdf_0123to3333( a0, d0 );
   4126     stbir__simdf_0123to3333( a1, d1 );
   4127     stbir__simdf_mult( p0, a0, d0 );
   4128     stbir__simdf_mult( p1, a1, d1 );
   4129     stbir__simdf_store ( out, d0 );
   4130     stbir__simdf_store ( out+4, p0 );
   4131     stbir__simdf_store ( out+7, d1 );
   4132     stbir__simdf_store ( out+7+4, p1 );
   4133     decode += 8;
   4134     out += 14;
   4135   }
   4136   decode -= 8;
   4137   #endif
   4138 
   4139   // might be one last odd pixel
   4140   #ifdef STBIR_SIMD8
   4141   STBIR_NO_UNROLL_LOOP_START
   4142   while ( decode < end_decode )
   4143   #else
   4144   if ( decode < end_decode )
   4145   #endif
   4146   {
   4147     stbir__simdf d,a,p;
   4148     STBIR_NO_UNROLL(decode);
   4149     stbir__simdf_load( d, decode );
   4150     stbir__simdf_0123to3333( a, d );
   4151     stbir__simdf_mult( p, a, d );
   4152     stbir__simdf_store ( out, d );
   4153     stbir__simdf_store ( out+4, p );
   4154     decode += 4;
   4155     out += 7;
   4156   }
   4157 
   4158   #else
   4159 
   4160   while( decode < end_decode )
   4161   {
   4162     float r = decode[0], g = decode[1], b = decode[2], alpha = decode[3];
   4163     out[0] = r;
   4164     out[1] = g;
   4165     out[2] = b;
   4166     out[3] = alpha;
   4167     out[4] = r * alpha;
   4168     out[5] = g * alpha;
   4169     out[6] = b * alpha;
   4170     out += 7;
   4171     decode += 4;
   4172   }
   4173 
   4174   #endif
   4175 }
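// illustrative example (hypothetical pixels): why both the straight and premultiplied
//   channels are carried.  averaging opaque red (1,0,0,1) with fully transparent green
//   (0,1,0,0) on straight RGBA gives (.5,.5,0,.5) - the invisible green bleeds in.
//   filtering the premultiplied copy instead gives (.5,0,0) with alpha .5, and
//   stbir__fancy_alpha_unweight_4ch divides by alpha to recover (1,0,0,.5).  keeping
//   the straight copy as well lets fully transparent regions keep their original color
//   instead of collapsing to black.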
   4176 
   4177 static void stbir__fancy_alpha_weight_2ch( float * out_buffer, int width_times_channels )
   4178 {
   4179   float STBIR_STREAMOUT_PTR(*) out = out_buffer;
   4180   float const * end_decode = out_buffer + ( width_times_channels / 2 ) * 3;
   4181   float STBIR_STREAMOUT_PTR(*) decode = (float*)end_decode - width_times_channels;
   4182 
   4183   //  for fancy alpha, turns into: [X A Xpm][X A Xpm],etc
   4184 
   4185   #ifdef STBIR_SIMD
   4186 
   4187   decode += 8;
   4188   if ( decode <= end_decode )
   4189   {
   4190     STBIR_NO_UNROLL_LOOP_START
   4191     do {
   4192       #ifdef STBIR_SIMD8
   4193       stbir__simdf8 d0,a0,p0;
   4194       STBIR_NO_UNROLL(decode);
   4195       stbir__simdf8_load( d0, decode-8 );
   4196       stbir__simdf8_0123to11331133( p0, d0 );
   4197       stbir__simdf8_0123to00220022( a0, d0 );
   4198       stbir__simdf8_mult( p0, p0, a0 );
   4199 
   4200       stbir__simdf_store2( out, stbir__if_simdf8_cast_to_simdf4( d0 ) );
   4201       stbir__simdf_store( out+2, stbir__if_simdf8_cast_to_simdf4( p0 ) );
   4202       stbir__simdf_store2h( out+3, stbir__if_simdf8_cast_to_simdf4( d0 ) );
   4203 
   4204       stbir__simdf_store2( out+6, stbir__simdf8_gettop4( d0 ) );
   4205       stbir__simdf_store( out+8, stbir__simdf8_gettop4( p0 ) );
   4206       stbir__simdf_store2h( out+9, stbir__simdf8_gettop4( d0 ) );
   4207       #else
   4208       stbir__simdf d0,a0,d1,a1,p0,p1;
   4209       STBIR_NO_UNROLL(decode);
   4210       stbir__simdf_load( d0, decode-8 );
   4211       stbir__simdf_load( d1, decode-8+4 );
   4212       stbir__simdf_0123to1133( p0, d0 );
   4213       stbir__simdf_0123to1133( p1, d1 );
   4214       stbir__simdf_0123to0022( a0, d0 );
   4215       stbir__simdf_0123to0022( a1, d1 );
   4216       stbir__simdf_mult( p0, p0, a0 );
   4217       stbir__simdf_mult( p1, p1, a1 );
   4218 
   4219       stbir__simdf_store2( out, d0 );
   4220       stbir__simdf_store( out+2, p0 );
   4221       stbir__simdf_store2h( out+3, d0 );
   4222 
   4223       stbir__simdf_store2( out+6, d1 );
   4224       stbir__simdf_store( out+8, p1 );
   4225       stbir__simdf_store2h( out+9, d1 );
   4226       #endif
   4227       decode += 8;
   4228       out += 12;
   4229     } while ( decode <= end_decode );
   4230   }
   4231   decode -= 8;
   4232   #endif
   4233 
   4234   STBIR_SIMD_NO_UNROLL_LOOP_START
   4235   while( decode < end_decode )
   4236   {
   4237     float x = decode[0], y = decode[1];
   4238     STBIR_SIMD_NO_UNROLL(decode);
   4239     out[0] = x;
   4240     out[1] = y;
   4241     out[2] = x * y;
   4242     out += 3;
   4243     decode += 2;
   4244   }
   4245 }
   4246 
   4247 static void stbir__fancy_alpha_unweight_4ch( float * encode_buffer, int width_times_channels )
   4248 {
   4249   float STBIR_SIMD_STREAMOUT_PTR(*) encode = encode_buffer;
   4250   float STBIR_SIMD_STREAMOUT_PTR(*) input = encode_buffer;
   4251   float const * end_output = encode_buffer + width_times_channels;
   4252 
   4253   // fancy RGBA is stored internally as R G B A Rpm Gpm Bpm
   4254 
   4255   STBIR_SIMD_NO_UNROLL_LOOP_START
   4256   do {
   4257     float alpha = input[3];
   4258 #ifdef STBIR_SIMD
   4259     stbir__simdf i,ia;
   4260     STBIR_SIMD_NO_UNROLL(encode);
   4261     if ( alpha < stbir__small_float )
   4262     {
   4263       stbir__simdf_load( i, input );
   4264       stbir__simdf_store( encode, i );
   4265     }
   4266     else
   4267     {
   4268       stbir__simdf_load1frep4( ia, 1.0f / alpha );
   4269       stbir__simdf_load( i, input+4 );
   4270       stbir__simdf_mult( i, i, ia );
   4271       stbir__simdf_store( encode, i );
   4272       encode[3] = alpha;
   4273     }
   4274 #else
   4275     if ( alpha < stbir__small_float )
   4276     {
   4277       encode[0] = input[0];
   4278       encode[1] = input[1];
   4279       encode[2] = input[2];
   4280     }
   4281     else
   4282     {
   4283       float ialpha = 1.0f / alpha;
   4284       encode[0] = input[4] * ialpha;
   4285       encode[1] = input[5] * ialpha;
   4286       encode[2] = input[6] * ialpha;
   4287     }
   4288     encode[3] = alpha;
   4289 #endif
   4290 
   4291     input += 7;
   4292     encode += 4;
   4293   } while ( encode < end_output );
   4294 }
   4295 
   4296 //  format: [X A Xpm][X A Xpm] etc
   4297 static void stbir__fancy_alpha_unweight_2ch( float * encode_buffer, int width_times_channels )
   4298 {
   4299   float STBIR_SIMD_STREAMOUT_PTR(*) encode = encode_buffer;
   4300   float STBIR_SIMD_STREAMOUT_PTR(*) input = encode_buffer;
   4301   float const * end_output = encode_buffer + width_times_channels;
   4302 
   4303   do {
   4304     float alpha = input[1];
   4305     encode[0] = input[0];
   4306     if ( alpha >= stbir__small_float )
   4307       encode[0] = input[2] / alpha;
   4308     encode[1] = alpha;
   4309 
   4310     input += 3;
   4311     encode += 2;
   4312   } while ( encode < end_output );
   4313 }
   4314 
   4315 static void stbir__simple_alpha_weight_4ch( float * decode_buffer, int width_times_channels )
   4316 {
   4317   float STBIR_STREAMOUT_PTR(*) decode = decode_buffer;
   4318   float const * end_decode = decode_buffer + width_times_channels;
   4319 
   4320   #ifdef STBIR_SIMD
   4321   {
   4322     decode += 2 * stbir__simdfX_float_count;
   4323     STBIR_NO_UNROLL_LOOP_START
   4324     while ( decode <= end_decode )
   4325     {
   4326       stbir__simdfX d0,a0,d1,a1;
   4327       STBIR_NO_UNROLL(decode);
   4328       stbir__simdfX_load( d0, decode-2*stbir__simdfX_float_count );
   4329       stbir__simdfX_load( d1, decode-2*stbir__simdfX_float_count+stbir__simdfX_float_count );
   4330       stbir__simdfX_aaa1( a0, d0, STBIR_onesX );
   4331       stbir__simdfX_aaa1( a1, d1, STBIR_onesX );
   4332       stbir__simdfX_mult( d0, d0, a0 );
   4333       stbir__simdfX_mult( d1, d1, a1 );
   4334       stbir__simdfX_store ( decode-2*stbir__simdfX_float_count, d0 );
   4335       stbir__simdfX_store ( decode-2*stbir__simdfX_float_count+stbir__simdfX_float_count, d1 );
   4336       decode += 2 * stbir__simdfX_float_count;
   4337     }
   4338     decode -= 2 * stbir__simdfX_float_count;
   4339 
   4340     // last few pixel remnants
   4341     #ifdef STBIR_SIMD8
   4342     STBIR_NO_UNROLL_LOOP_START
   4343     while ( decode < end_decode )
   4344     #else
   4345     if ( decode < end_decode )
   4346     #endif
   4347     {
   4348       stbir__simdf d,a;
   4349       stbir__simdf_load( d, decode );
   4350       stbir__simdf_aaa1( a, d, STBIR__CONSTF(STBIR_ones) );
   4351       stbir__simdf_mult( d, d, a );
   4352       stbir__simdf_store ( decode, d );
   4353       decode += 4;
   4354     }
   4355   }
   4356 
   4357   #else
   4358 
   4359   while( decode < end_decode )
   4360   {
   4361     float alpha = decode[3];
   4362     decode[0] *= alpha;
   4363     decode[1] *= alpha;
   4364     decode[2] *= alpha;
   4365     decode += 4;
   4366   }
   4367 
   4368   #endif
   4369 }
   4370 
   4371 static void stbir__simple_alpha_weight_2ch( float * decode_buffer, int width_times_channels )
   4372 {
   4373   float STBIR_STREAMOUT_PTR(*) decode = decode_buffer;
   4374   float const * end_decode = decode_buffer + width_times_channels;
   4375 
   4376   #ifdef STBIR_SIMD
   4377   decode += 2 * stbir__simdfX_float_count;
   4378   STBIR_NO_UNROLL_LOOP_START
   4379   while ( decode <= end_decode )
   4380   {
   4381     stbir__simdfX d0,a0,d1,a1;
   4382     STBIR_NO_UNROLL(decode);
   4383     stbir__simdfX_load( d0, decode-2*stbir__simdfX_float_count );
   4384     stbir__simdfX_load( d1, decode-2*stbir__simdfX_float_count+stbir__simdfX_float_count );
   4385     stbir__simdfX_a1a1( a0, d0, STBIR_onesX );
   4386     stbir__simdfX_a1a1( a1, d1, STBIR_onesX );
   4387     stbir__simdfX_mult( d0, d0, a0 );
   4388     stbir__simdfX_mult( d1, d1, a1 );
   4389     stbir__simdfX_store ( decode-2*stbir__simdfX_float_count, d0 );
   4390     stbir__simdfX_store ( decode-2*stbir__simdfX_float_count+stbir__simdfX_float_count, d1 );
   4391     decode += 2 * stbir__simdfX_float_count;
   4392   }
   4393   decode -= 2 * stbir__simdfX_float_count;
   4394   #endif
   4395 
   4396   STBIR_SIMD_NO_UNROLL_LOOP_START
   4397   while( decode < end_decode )
   4398   {
   4399     float alpha = decode[1];
   4400     STBIR_SIMD_NO_UNROLL(decode);
   4401     decode[0] *= alpha;
   4402     decode += 2;
   4403   }
   4404 }
   4405 
   4406 static void stbir__simple_alpha_unweight_4ch( float * encode_buffer, int width_times_channels )
   4407 {
   4408   float STBIR_SIMD_STREAMOUT_PTR(*) encode = encode_buffer;
   4409   float const * end_output = encode_buffer + width_times_channels;
   4410 
   4411   STBIR_SIMD_NO_UNROLL_LOOP_START
   4412   do {
   4413     float alpha = encode[3];
   4414 
   4415 #ifdef STBIR_SIMD
   4416     stbir__simdf i,ia;
   4417     STBIR_SIMD_NO_UNROLL(encode);
   4418     if ( alpha >= stbir__small_float )
   4419     {
   4420       stbir__simdf_load1frep4( ia, 1.0f / alpha );
   4421       stbir__simdf_load( i, encode );
   4422       stbir__simdf_mult( i, i, ia );
   4423       stbir__simdf_store( encode, i );
   4424       encode[3] = alpha;
   4425     }
   4426 #else
   4427     if ( alpha >= stbir__small_float )
   4428     {
   4429       float ialpha = 1.0f / alpha;
   4430       encode[0] *= ialpha;
   4431       encode[1] *= ialpha;
   4432       encode[2] *= ialpha;
   4433     }
   4434 #endif
   4435     encode += 4;
   4436   } while ( encode < end_output );
   4437 }
   4438 
   4439 static void stbir__simple_alpha_unweight_2ch( float * encode_buffer, int width_times_channels )
   4440 {
   4441   float STBIR_SIMD_STREAMOUT_PTR(*) encode = encode_buffer;
   4442   float const * end_output = encode_buffer + width_times_channels;
   4443 
   4444   do {
   4445     float alpha = encode[1];
   4446     if ( alpha >= stbir__small_float )
   4447       encode[0] /= alpha;
   4448     encode += 2;
   4449   } while ( encode < end_output );
   4450 }
   4451 
   4452 
   4453 // only used in RGB->BGR or BGR->RGB
   4454 static void stbir__simple_flip_3ch( float * decode_buffer, int width_times_channels )
   4455 {
   4456   float STBIR_STREAMOUT_PTR(*) decode = decode_buffer;
   4457   float const * end_decode = decode_buffer + width_times_channels;
   4458 
   4459 #ifdef STBIR_SIMD
   4460     #ifdef stbir__simdf_swiz2 // do we have two argument swizzles?
   4461       end_decode -= 12; 
   4462       STBIR_NO_UNROLL_LOOP_START
   4463       while( decode <= end_decode )
   4464       {
   4465         // on arm64 8 instructions, no overlapping stores
   4466         stbir__simdf a,b,c,na,nb;
   4467         STBIR_SIMD_NO_UNROLL(decode);
   4468         stbir__simdf_load( a, decode );
   4469         stbir__simdf_load( b, decode+4 );
   4470         stbir__simdf_load( c, decode+8 );
   4471 
   4472         na = stbir__simdf_swiz2( a, b, 2, 1, 0, 5 );   
   4473         b  = stbir__simdf_swiz2( a, b, 4, 3, 6, 7 );   
   4474         nb = stbir__simdf_swiz2( b, c, 0, 1, 4, 3 );   
   4475         c  = stbir__simdf_swiz2( b, c, 2, 7, 6, 5 );   
   4476 
   4477         stbir__simdf_store( decode, na );
   4478         stbir__simdf_store( decode+4, nb ); 
   4479         stbir__simdf_store( decode+8, c );
   4480         decode += 12;
   4481       }
   4482       end_decode += 12;
   4483     #else
   4484       end_decode -= 24;
   4485       STBIR_NO_UNROLL_LOOP_START
   4486       while( decode <= end_decode )
   4487       {
   4488         // 26 instructions on x64
   4489         stbir__simdf a,b,c,d,e,f,g;
   4490         float i21, i23;
   4491         STBIR_SIMD_NO_UNROLL(decode);
   4492         stbir__simdf_load( a, decode );
   4493         stbir__simdf_load( b, decode+3 );
   4494         stbir__simdf_load( c, decode+6 );
   4495         stbir__simdf_load( d, decode+9 );
   4496         stbir__simdf_load( e, decode+12 );
   4497         stbir__simdf_load( f, decode+15 );
   4498         stbir__simdf_load( g, decode+18 );
   4499 
   4500         a = stbir__simdf_swiz( a, 2, 1, 0, 3 );   
   4501         b = stbir__simdf_swiz( b, 2, 1, 0, 3 );   
   4502         c = stbir__simdf_swiz( c, 2, 1, 0, 3 );   
   4503         d = stbir__simdf_swiz( d, 2, 1, 0, 3 );   
   4504         e = stbir__simdf_swiz( e, 2, 1, 0, 3 );   
   4505         f = stbir__simdf_swiz( f, 2, 1, 0, 3 );   
   4506         g = stbir__simdf_swiz( g, 2, 1, 0, 3 );   
   4507 
   4508         // stores overlap, so they need to be in order
   4509         stbir__simdf_store( decode,    a );
   4510         i21 = decode[21];
   4511         stbir__simdf_store( decode+3,  b ); 
   4512         i23 = decode[23];
   4513         stbir__simdf_store( decode+6,  c );
   4514         stbir__simdf_store( decode+9,  d );
   4515         stbir__simdf_store( decode+12, e );
   4516         stbir__simdf_store( decode+15, f );
   4517         stbir__simdf_store( decode+18, g );
   4518         decode[21] = i23;
   4519         decode[23] = i21;
   4520         decode += 24;
   4521       }
   4522       end_decode += 24;
   4523     #endif
   4524 #else
   4525   end_decode -= 12;
   4526   STBIR_NO_UNROLL_LOOP_START
   4527   while( decode <= end_decode )
   4528   {
   4529     // 16 instructions
   4530     float t0,t1,t2,t3;
   4531     STBIR_NO_UNROLL(decode);
   4532     t0 = decode[0]; t1 = decode[3]; t2 = decode[6]; t3 = decode[9];
   4533     decode[0] = decode[2]; decode[3] = decode[5]; decode[6] = decode[8]; decode[9] = decode[11];
   4534     decode[2] = t0; decode[5] = t1; decode[8] = t2; decode[11] = t3;
   4535     decode += 12;
   4536   }
   4537   end_decode += 12;
   4538 #endif
   4539 
   4540   STBIR_NO_UNROLL_LOOP_START
   4541   while( decode < end_decode )
   4542   {
   4543     float t = decode[0];
   4544     STBIR_NO_UNROLL(decode);
   4545     decode[0] = decode[2];
   4546     decode[2] = t;
   4547     decode += 3;
   4548   }
   4549 }
   4550 
   4551 
   4552 
   4553 static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float * output_buffer STBIR_ONLY_PROFILE_GET_SPLIT_INFO )
   4554 {
   4555   int channels = stbir_info->channels;
   4556   int effective_channels = stbir_info->effective_channels;
   4557   int input_sample_in_bytes = stbir__type_size[stbir_info->input_type] * channels;
   4558   stbir_edge edge_horizontal = stbir_info->horizontal.edge;
   4559   stbir_edge edge_vertical = stbir_info->vertical.edge;
   4560   int row = stbir__edge_wrap(edge_vertical, n, stbir_info->vertical.scale_info.input_full_size);
   4561   const void* input_plane_data = ( (char *) stbir_info->input_data ) + (size_t)row * (size_t) stbir_info->input_stride_bytes;
   4562   stbir__span const * spans = stbir_info->scanline_extents.spans;
   4563   float* full_decode_buffer = output_buffer - stbir_info->scanline_extents.conservative.n0 * effective_channels;
   4564 
   4565   // if we are on edge_zero and we get in here with an out of bounds n, then the filter calculation has failed
   4566   STBIR_ASSERT( !(edge_vertical == STBIR_EDGE_ZERO && (n < 0 || n >= stbir_info->vertical.scale_info.input_full_size)) );
   4567 
   4568   do
   4569   {
   4570     float * decode_buffer;
   4571     void const * input_data;
   4572     float * end_decode;
   4573     int width_times_channels;
   4574     int width;
   4575 
   4576     if ( spans->n1 < spans->n0 )
   4577       break;
   4578 
   4579     width = spans->n1 + 1 - spans->n0;
   4580     decode_buffer = full_decode_buffer + spans->n0 * effective_channels;
   4581     end_decode = full_decode_buffer + ( spans->n1 + 1 ) * effective_channels;
   4582     width_times_channels = width * channels;
   4583 
   4584     // read directly out of input plane by default
   4585     input_data = ( (char*)input_plane_data ) + spans->pixel_offset_for_input * input_sample_in_bytes;
   4586 
   4587     // if we have an input callback, call it to get the input data
   4588     if ( stbir_info->in_pixels_cb )
   4589     {
    4590       // call the callback with a temp buffer (which it can choose to use or not).  the temp buffer is just right-aligned memory inside the decode_buffer itself
   4591       input_data = stbir_info->in_pixels_cb( ( (char*) end_decode ) - ( width * input_sample_in_bytes ), input_plane_data, width, spans->pixel_offset_for_input, row, stbir_info->user_data );
   4592     }
   4593 
   4594     STBIR_PROFILE_START( decode );
    4595     // convert the pixels into the float decode_buffer (we index from end_decode, so that when channels<effective_channels, we are right justified in the buffer)
   4596     stbir_info->decode_pixels( (float*)end_decode - width_times_channels, width_times_channels, input_data );
   4597     STBIR_PROFILE_END( decode );
   4598 
   4599     if (stbir_info->alpha_weight)
   4600     {
   4601       STBIR_PROFILE_START( alpha );
   4602       stbir_info->alpha_weight( decode_buffer, width_times_channels );
   4603       STBIR_PROFILE_END( alpha );
   4604     }
   4605 
   4606     ++spans;
   4607   } while ( spans <= ( &stbir_info->scanline_extents.spans[1] ) );
   4608 
   4609   // handle the edge_wrap filter (all other types are handled back out at the calculate_filter stage)
   4610   // basically the idea here is that if we have the whole scanline in memory, we don't redecode the
   4611   //   wrapped edge pixels, and instead just memcpy them from the scanline into the edge positions
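           //   for example, with input_full_size == 8 and a left margin of 2, columns -2 and -1 are
           //   filled by copying the already-decoded columns 6 and 7 (their wrapped positions)
           //   instead of decoding those pixels a second time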
   4612   if ( ( edge_horizontal == STBIR_EDGE_WRAP ) && ( stbir_info->scanline_extents.edge_sizes[0] | stbir_info->scanline_extents.edge_sizes[1] ) )
   4613   {
   4614     // this code only runs if we're in edge_wrap, and we're doing the entire scanline
   4615     int e, start_x[2];
   4616     int input_full_size = stbir_info->horizontal.scale_info.input_full_size;
   4617 
   4618     start_x[0] = -stbir_info->scanline_extents.edge_sizes[0];  // left edge start x
   4619     start_x[1] =  input_full_size;                             // right edge
   4620 
   4621     for( e = 0; e < 2 ; e++ )
   4622     {
   4623       // do each margin
   4624       int margin = stbir_info->scanline_extents.edge_sizes[e];
   4625       if ( margin )
   4626       {
   4627         int x = start_x[e];
   4628         float * marg = full_decode_buffer + x * effective_channels;
   4629         float const * src = full_decode_buffer + stbir__edge_wrap(edge_horizontal, x, input_full_size) * effective_channels;
   4630         STBIR_MEMCPY( marg, src, margin * effective_channels * sizeof(float) );
   4631       }
   4632     }
   4633   }
   4634 }
   4635 
   4636 
   4637 //=================
   4638 // Do 1 channel horizontal routines
   4639 
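         // note: each "Do N channel" section below only defines the per-coefficient-count macros
         //   (stbir__1_coeff_only, stbir__4_coeff_start, stbir__store_output, etc.); the actual
         //   horizontal resample loops are stamped out when this file re-includes itself via
         //   STBIR__HEADER_FILENAME with STB_IMAGE_RESIZE_DO_HORIZONTALS and
         //   STBIR__horizontal_channels defined for the given channel count.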
   4640 #ifdef STBIR_SIMD
   4641 
   4642 #define stbir__1_coeff_only()          \
   4643     stbir__simdf tot,c;                \
   4644     STBIR_SIMD_NO_UNROLL(decode);      \
   4645     stbir__simdf_load1( c, hc );       \
   4646     stbir__simdf_mult1_mem( tot, c, decode );
   4647 
   4648 #define stbir__2_coeff_only()          \
   4649     stbir__simdf tot,c,d;              \
   4650     STBIR_SIMD_NO_UNROLL(decode);      \
   4651     stbir__simdf_load2z( c, hc );      \
   4652     stbir__simdf_load2( d, decode );   \
   4653     stbir__simdf_mult( tot, c, d );    \
   4654     stbir__simdf_0123to1230( c, tot ); \
   4655     stbir__simdf_add1( tot, tot, c );
   4656 
   4657 #define stbir__3_coeff_only()                  \
   4658     stbir__simdf tot,c,t;                      \
   4659     STBIR_SIMD_NO_UNROLL(decode);              \
   4660     stbir__simdf_load( c, hc );                \
   4661     stbir__simdf_mult_mem( tot, c, decode );   \
   4662     stbir__simdf_0123to1230( c, tot );         \
   4663     stbir__simdf_0123to2301( t, tot );         \
   4664     stbir__simdf_add1( tot, tot, c );          \
   4665     stbir__simdf_add1( tot, tot, t );
   4666 
   4667 #define stbir__store_output_tiny()                \
   4668     stbir__simdf_store1( output, tot );           \
   4669     horizontal_coefficients += coefficient_width; \
   4670     ++horizontal_contributors;                    \
   4671     output += 1;
   4672 
   4673 #define stbir__4_coeff_start()                 \
   4674     stbir__simdf tot,c;                        \
   4675     STBIR_SIMD_NO_UNROLL(decode);              \
   4676     stbir__simdf_load( c, hc );                \
    4677     stbir__simdf_mult_mem( tot, c, decode );
   4678 
   4679 #define stbir__4_coeff_continue_from_4( ofs )  \
   4680     STBIR_SIMD_NO_UNROLL(decode);              \
   4681     stbir__simdf_load( c, hc + (ofs) );        \
   4682     stbir__simdf_madd_mem( tot, tot, c, decode+(ofs) );
   4683 
   4684 #define stbir__1_coeff_remnant( ofs )          \
   4685     { stbir__simdf d;                          \
   4686     stbir__simdf_load1z( c, hc + (ofs) );      \
   4687     stbir__simdf_load1( d, decode + (ofs) );   \
   4688     stbir__simdf_madd( tot, tot, d, c ); }
   4689 
   4690 #define stbir__2_coeff_remnant( ofs )          \
   4691     { stbir__simdf d;                          \
   4692     stbir__simdf_load2z( c, hc+(ofs) );        \
   4693     stbir__simdf_load2( d, decode+(ofs) );     \
   4694     stbir__simdf_madd( tot, tot, d, c ); }
   4695 
   4696 #define stbir__3_coeff_setup()                 \
   4697     stbir__simdf mask;                         \
   4698     stbir__simdf_load( mask, STBIR_mask + 3 );
   4699 
   4700 #define stbir__3_coeff_remnant( ofs )                  \
   4701     stbir__simdf_load( c, hc+(ofs) );                  \
   4702     stbir__simdf_and( c, c, mask );                    \
   4703     stbir__simdf_madd_mem( tot, tot, c, decode+(ofs) );
   4704 
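         // note: the shuffles and adds below horizontally sum the four partial products into
         //   lane 0 before the single output float is stored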
   4705 #define stbir__store_output()                     \
   4706     stbir__simdf_0123to2301( c, tot );            \
   4707     stbir__simdf_add( tot, tot, c );              \
   4708     stbir__simdf_0123to1230( c, tot );            \
   4709     stbir__simdf_add1( tot, tot, c );             \
   4710     stbir__simdf_store1( output, tot );           \
   4711     horizontal_coefficients += coefficient_width; \
   4712     ++horizontal_contributors;                    \
   4713     output += 1;
   4714 
   4715 #else
   4716 
   4717 #define stbir__1_coeff_only()  \
   4718     float tot;                 \
   4719     tot = decode[0]*hc[0];
   4720 
   4721 #define stbir__2_coeff_only()  \
   4722     float tot;                 \
   4723     tot = decode[0] * hc[0];   \
   4724     tot += decode[1] * hc[1];
   4725 
   4726 #define stbir__3_coeff_only()  \
   4727     float tot;                 \
   4728     tot = decode[0] * hc[0];   \
   4729     tot += decode[1] * hc[1];  \
   4730     tot += decode[2] * hc[2];
   4731 
   4732 #define stbir__store_output_tiny()                \
   4733     output[0] = tot;                              \
   4734     horizontal_coefficients += coefficient_width; \
   4735     ++horizontal_contributors;                    \
   4736     output += 1;
   4737 
   4738 #define stbir__4_coeff_start()  \
   4739     float tot0,tot1,tot2,tot3;  \
   4740     tot0 = decode[0] * hc[0];   \
   4741     tot1 = decode[1] * hc[1];   \
   4742     tot2 = decode[2] * hc[2];   \
   4743     tot3 = decode[3] * hc[3];
   4744 
   4745 #define stbir__4_coeff_continue_from_4( ofs )  \
   4746     tot0 += decode[0+(ofs)] * hc[0+(ofs)];     \
   4747     tot1 += decode[1+(ofs)] * hc[1+(ofs)];     \
   4748     tot2 += decode[2+(ofs)] * hc[2+(ofs)];     \
   4749     tot3 += decode[3+(ofs)] * hc[3+(ofs)];
   4750 
   4751 #define stbir__1_coeff_remnant( ofs )        \
   4752     tot0 += decode[0+(ofs)] * hc[0+(ofs)];
   4753 
   4754 #define stbir__2_coeff_remnant( ofs )        \
   4755     tot0 += decode[0+(ofs)] * hc[0+(ofs)];   \
    4756     tot1 += decode[1+(ofs)] * hc[1+(ofs)];
   4757 
   4758 #define stbir__3_coeff_remnant( ofs )        \
   4759     tot0 += decode[0+(ofs)] * hc[0+(ofs)];   \
   4760     tot1 += decode[1+(ofs)] * hc[1+(ofs)];   \
   4761     tot2 += decode[2+(ofs)] * hc[2+(ofs)];
   4762 
   4763 #define stbir__store_output()                     \
   4764     output[0] = (tot0+tot2)+(tot1+tot3);          \
   4765     horizontal_coefficients += coefficient_width; \
   4766     ++horizontal_contributors;                    \
   4767     output += 1;
   4768 
   4769 #endif
   4770 
   4771 #define STBIR__horizontal_channels 1
   4772 #define STB_IMAGE_RESIZE_DO_HORIZONTALS
   4773 #include STBIR__HEADER_FILENAME
   4774 
   4775 
   4776 //=================
   4777 // Do 2 channel horizontal routines
   4778 
   4779 #ifdef STBIR_SIMD
   4780 
   4781 #define stbir__1_coeff_only()         \
   4782     stbir__simdf tot,c,d;             \
   4783     STBIR_SIMD_NO_UNROLL(decode);     \
   4784     stbir__simdf_load1z( c, hc );     \
   4785     stbir__simdf_0123to0011( c, c );  \
   4786     stbir__simdf_load2( d, decode );  \
   4787     stbir__simdf_mult( tot, d, c );
   4788 
   4789 #define stbir__2_coeff_only()         \
   4790     stbir__simdf tot,c;               \
   4791     STBIR_SIMD_NO_UNROLL(decode);     \
   4792     stbir__simdf_load2( c, hc );      \
   4793     stbir__simdf_0123to0011( c, c );  \
   4794     stbir__simdf_mult_mem( tot, c, decode );
   4795 
   4796 #define stbir__3_coeff_only()                \
   4797     stbir__simdf tot,c,cs,d;                 \
   4798     STBIR_SIMD_NO_UNROLL(decode);            \
   4799     stbir__simdf_load( cs, hc );             \
   4800     stbir__simdf_0123to0011( c, cs );        \
   4801     stbir__simdf_mult_mem( tot, c, decode ); \
   4802     stbir__simdf_0123to2222( c, cs );        \
   4803     stbir__simdf_load2z( d, decode+4 );      \
   4804     stbir__simdf_madd( tot, tot, d, c );
   4805 
   4806 #define stbir__store_output_tiny()                \
   4807     stbir__simdf_0123to2301( c, tot );            \
   4808     stbir__simdf_add( tot, tot, c );              \
   4809     stbir__simdf_store2( output, tot );           \
   4810     horizontal_coefficients += coefficient_width; \
   4811     ++horizontal_contributors;                    \
   4812     output += 2;
   4813 
   4814 #ifdef STBIR_SIMD8
   4815 
   4816 #define stbir__4_coeff_start()                    \
   4817     stbir__simdf8 tot0,c,cs;                      \
   4818     STBIR_SIMD_NO_UNROLL(decode);                 \
   4819     stbir__simdf8_load4b( cs, hc );               \
   4820     stbir__simdf8_0123to00112233( c, cs );        \
   4821     stbir__simdf8_mult_mem( tot0, c, decode );
   4822 
   4823 #define stbir__4_coeff_continue_from_4( ofs )        \
   4824     STBIR_SIMD_NO_UNROLL(decode);                    \
   4825     stbir__simdf8_load4b( cs, hc + (ofs) );          \
   4826     stbir__simdf8_0123to00112233( c, cs );           \
   4827     stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*2 );
   4828 
   4829 #define stbir__1_coeff_remnant( ofs )                \
   4830     { stbir__simdf t,d;                              \
   4831     stbir__simdf_load1z( t, hc + (ofs) );            \
   4832     stbir__simdf_load2( d, decode + (ofs) * 2 );     \
   4833     stbir__simdf_0123to0011( t, t );                 \
   4834     stbir__simdf_mult( t, t, d );                    \
   4835     stbir__simdf8_add4( tot0, tot0, t ); }
   4836  
   4837 #define stbir__2_coeff_remnant( ofs )                \
   4838     { stbir__simdf t;                                \
   4839     stbir__simdf_load2( t, hc + (ofs) );             \
   4840     stbir__simdf_0123to0011( t, t );                 \
   4841     stbir__simdf_mult_mem( t, t, decode+(ofs)*2 );   \
   4842     stbir__simdf8_add4( tot0, tot0, t ); }
   4843 
   4844 #define stbir__3_coeff_remnant( ofs )                \
   4845     { stbir__simdf8 d;                               \
   4846     stbir__simdf8_load4b( cs, hc + (ofs) );          \
   4847     stbir__simdf8_0123to00112233( c, cs );           \
   4848     stbir__simdf8_load6z( d, decode+(ofs)*2 );       \
   4849     stbir__simdf8_madd( tot0, tot0, c, d ); }
   4850 
   4851 #define stbir__store_output()                     \
   4852     { stbir__simdf t,d;                           \
   4853     stbir__simdf8_add4halves( t, stbir__if_simdf8_cast_to_simdf4(tot0), tot0 );    \
   4854     stbir__simdf_0123to2301( d, t );              \
   4855     stbir__simdf_add( t, t, d );                  \
   4856     stbir__simdf_store2( output, t );             \
   4857     horizontal_coefficients += coefficient_width; \
   4858     ++horizontal_contributors;                    \
   4859     output += 2; }
   4860 
   4861 #else
   4862 
   4863 #define stbir__4_coeff_start()                   \
   4864     stbir__simdf tot0,tot1,c,cs;                 \
   4865     STBIR_SIMD_NO_UNROLL(decode);                \
   4866     stbir__simdf_load( cs, hc );                 \
   4867     stbir__simdf_0123to0011( c, cs );            \
   4868     stbir__simdf_mult_mem( tot0, c, decode );    \
   4869     stbir__simdf_0123to2233( c, cs );            \
   4870     stbir__simdf_mult_mem( tot1, c, decode+4 );
   4871 
   4872 #define stbir__4_coeff_continue_from_4( ofs )                \
   4873     STBIR_SIMD_NO_UNROLL(decode);                            \
   4874     stbir__simdf_load( cs, hc + (ofs) );                     \
   4875     stbir__simdf_0123to0011( c, cs );                        \
   4876     stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*2 );  \
   4877     stbir__simdf_0123to2233( c, cs );                        \
   4878     stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*2+4 );
   4879 
   4880 #define stbir__1_coeff_remnant( ofs )            \
   4881     { stbir__simdf d;                            \
   4882     stbir__simdf_load1z( cs, hc + (ofs) );       \
   4883     stbir__simdf_0123to0011( c, cs );            \
   4884     stbir__simdf_load2( d, decode + (ofs) * 2 ); \
   4885     stbir__simdf_madd( tot0, tot0, d, c ); }
   4886 
   4887 #define stbir__2_coeff_remnant( ofs )                      \
   4888     stbir__simdf_load2( cs, hc + (ofs) );                  \
   4889     stbir__simdf_0123to0011( c, cs );                      \
   4890     stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*2 );
   4891 
   4892 #define stbir__3_coeff_remnant( ofs )                       \
   4893     { stbir__simdf d;                                       \
   4894     stbir__simdf_load( cs, hc + (ofs) );                    \
   4895     stbir__simdf_0123to0011( c, cs );                       \
   4896     stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*2 ); \
   4897     stbir__simdf_0123to2222( c, cs );                       \
   4898     stbir__simdf_load2z( d, decode + (ofs) * 2 + 4 );       \
   4899     stbir__simdf_madd( tot1, tot1, d, c ); }
   4900 
   4901 #define stbir__store_output()                     \
   4902     stbir__simdf_add( tot0, tot0, tot1 );         \
   4903     stbir__simdf_0123to2301( c, tot0 );           \
   4904     stbir__simdf_add( tot0, tot0, c );            \
   4905     stbir__simdf_store2( output, tot0 );          \
   4906     horizontal_coefficients += coefficient_width; \
   4907     ++horizontal_contributors;                    \
   4908     output += 2;
   4909 
   4910 #endif
   4911 
   4912 #else
   4913 
   4914 #define stbir__1_coeff_only()  \
   4915     float tota,totb,c;         \
   4916     c = hc[0];                 \
   4917     tota = decode[0]*c;        \
   4918     totb = decode[1]*c;
   4919 
   4920 #define stbir__2_coeff_only()  \
   4921     float tota,totb,c;         \
   4922     c = hc[0];                 \
   4923     tota = decode[0]*c;        \
   4924     totb = decode[1]*c;        \
   4925     c = hc[1];                 \
   4926     tota += decode[2]*c;       \
   4927     totb += decode[3]*c;
   4928 
    4929 // this unusual add order (hc[2] before hc[1]) matches the order used by the SIMD path above
   4930 #define stbir__3_coeff_only()  \
   4931     float tota,totb,c;         \
   4932     c = hc[0];                 \
   4933     tota = decode[0]*c;        \
   4934     totb = decode[1]*c;        \
   4935     c = hc[2];                 \
   4936     tota += decode[4]*c;       \
   4937     totb += decode[5]*c;       \
   4938     c = hc[1];                 \
   4939     tota += decode[2]*c;       \
   4940     totb += decode[3]*c;
   4941 
   4942 #define stbir__store_output_tiny()                \
   4943     output[0] = tota;                             \
   4944     output[1] = totb;                             \
   4945     horizontal_coefficients += coefficient_width; \
   4946     ++horizontal_contributors;                    \
   4947     output += 2;
   4948 
   4949 #define stbir__4_coeff_start()      \
   4950     float tota0,tota1,tota2,tota3,totb0,totb1,totb2,totb3,c;  \
   4951     c = hc[0];                      \
   4952     tota0 = decode[0]*c;            \
   4953     totb0 = decode[1]*c;            \
   4954     c = hc[1];                      \
   4955     tota1 = decode[2]*c;            \
   4956     totb1 = decode[3]*c;            \
   4957     c = hc[2];                      \
   4958     tota2 = decode[4]*c;            \
   4959     totb2 = decode[5]*c;            \
   4960     c = hc[3];                      \
   4961     tota3 = decode[6]*c;            \
   4962     totb3 = decode[7]*c;
   4963 
   4964 #define stbir__4_coeff_continue_from_4( ofs )  \
   4965     c = hc[0+(ofs)];                           \
   4966     tota0 += decode[0+(ofs)*2]*c;              \
   4967     totb0 += decode[1+(ofs)*2]*c;              \
   4968     c = hc[1+(ofs)];                           \
   4969     tota1 += decode[2+(ofs)*2]*c;              \
   4970     totb1 += decode[3+(ofs)*2]*c;              \
   4971     c = hc[2+(ofs)];                           \
   4972     tota2 += decode[4+(ofs)*2]*c;              \
   4973     totb2 += decode[5+(ofs)*2]*c;              \
   4974     c = hc[3+(ofs)];                           \
   4975     tota3 += decode[6+(ofs)*2]*c;              \
   4976     totb3 += decode[7+(ofs)*2]*c;
   4977 
   4978 #define stbir__1_coeff_remnant( ofs )  \
   4979     c = hc[0+(ofs)];                   \
   4980     tota0 += decode[0+(ofs)*2] * c;    \
   4981     totb0 += decode[1+(ofs)*2] * c;
   4982 
   4983 #define stbir__2_coeff_remnant( ofs )  \
   4984     c = hc[0+(ofs)];                   \
   4985     tota0 += decode[0+(ofs)*2] * c;    \
   4986     totb0 += decode[1+(ofs)*2] * c;    \
   4987     c = hc[1+(ofs)];                   \
   4988     tota1 += decode[2+(ofs)*2] * c;    \
   4989     totb1 += decode[3+(ofs)*2] * c;
   4990 
   4991 #define stbir__3_coeff_remnant( ofs )  \
   4992     c = hc[0+(ofs)];                   \
   4993     tota0 += decode[0+(ofs)*2] * c;    \
   4994     totb0 += decode[1+(ofs)*2] * c;    \
   4995     c = hc[1+(ofs)];                   \
   4996     tota1 += decode[2+(ofs)*2] * c;    \
   4997     totb1 += decode[3+(ofs)*2] * c;    \
   4998     c = hc[2+(ofs)];                   \
   4999     tota2 += decode[4+(ofs)*2] * c;    \
   5000     totb2 += decode[5+(ofs)*2] * c;
   5001 
   5002 #define stbir__store_output()                     \
   5003     output[0] = (tota0+tota2)+(tota1+tota3);      \
   5004     output[1] = (totb0+totb2)+(totb1+totb3);      \
   5005     horizontal_coefficients += coefficient_width; \
   5006     ++horizontal_contributors;                    \
   5007     output += 2;
   5008 
   5009 #endif
   5010 
   5011 #define STBIR__horizontal_channels 2
   5012 #define STB_IMAGE_RESIZE_DO_HORIZONTALS
   5013 #include STBIR__HEADER_FILENAME
   5014 
   5015 
   5016 //=================
   5017 // Do 3 channel horizontal routines
   5018 
   5019 #ifdef STBIR_SIMD
   5020 
   5021 #define stbir__1_coeff_only()         \
   5022     stbir__simdf tot,c,d;             \
   5023     STBIR_SIMD_NO_UNROLL(decode);     \
   5024     stbir__simdf_load1z( c, hc );     \
   5025     stbir__simdf_0123to0001( c, c );  \
   5026     stbir__simdf_load( d, decode );   \
   5027     stbir__simdf_mult( tot, d, c );
   5028 
   5029 #define stbir__2_coeff_only()         \
   5030     stbir__simdf tot,c,cs,d;          \
   5031     STBIR_SIMD_NO_UNROLL(decode);     \
   5032     stbir__simdf_load2( cs, hc );     \
   5033     stbir__simdf_0123to0000( c, cs ); \
   5034     stbir__simdf_load( d, decode );   \
   5035     stbir__simdf_mult( tot, d, c );   \
   5036     stbir__simdf_0123to1111( c, cs ); \
   5037     stbir__simdf_load( d, decode+3 ); \
   5038     stbir__simdf_madd( tot, tot, d, c );
   5039 
   5040 #define stbir__3_coeff_only()            \
   5041     stbir__simdf tot,c,d,cs;             \
   5042     STBIR_SIMD_NO_UNROLL(decode);        \
   5043     stbir__simdf_load( cs, hc );         \
   5044     stbir__simdf_0123to0000( c, cs );    \
   5045     stbir__simdf_load( d, decode );      \
   5046     stbir__simdf_mult( tot, d, c );      \
   5047     stbir__simdf_0123to1111( c, cs );    \
   5048     stbir__simdf_load( d, decode+3 );    \
   5049     stbir__simdf_madd( tot, tot, d, c ); \
   5050     stbir__simdf_0123to2222( c, cs );    \
   5051     stbir__simdf_load( d, decode+6 );    \
   5052     stbir__simdf_madd( tot, tot, d, c );
   5053 
   5054 #define stbir__store_output_tiny()                \
   5055     stbir__simdf_store2( output, tot );           \
   5056     stbir__simdf_0123to2301( tot, tot );          \
   5057     stbir__simdf_store1( output+2, tot );         \
   5058     horizontal_coefficients += coefficient_width; \
   5059     ++horizontal_contributors;                    \
   5060     output += 3;
   5061 
   5062 #ifdef STBIR_SIMD8
   5063 
    5064 // note that we load from the XXXYYY decode data at an offset of -1 so that the two 3-channel pixels (the XXX and YYY triples) land in different halves of the AVX register
   5065 #define stbir__4_coeff_start()                     \
   5066     stbir__simdf8 tot0,tot1,c,cs; stbir__simdf t;  \
   5067     STBIR_SIMD_NO_UNROLL(decode);                  \
   5068     stbir__simdf8_load4b( cs, hc );                \
   5069     stbir__simdf8_0123to00001111( c, cs );         \
   5070     stbir__simdf8_mult_mem( tot0, c, decode - 1 ); \
   5071     stbir__simdf8_0123to22223333( c, cs );         \
   5072     stbir__simdf8_mult_mem( tot1, c, decode+6 - 1 );
   5073 
   5074 #define stbir__4_coeff_continue_from_4( ofs )      \
   5075     STBIR_SIMD_NO_UNROLL(decode);                  \
   5076     stbir__simdf8_load4b( cs, hc + (ofs) );        \
   5077     stbir__simdf8_0123to00001111( c, cs );         \
   5078     stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*3 - 1 ); \
   5079     stbir__simdf8_0123to22223333( c, cs );         \
   5080     stbir__simdf8_madd_mem( tot1, tot1, c, decode+(ofs)*3 + 6 - 1 );
   5081 
   5082 #define stbir__1_coeff_remnant( ofs )                          \
   5083     STBIR_SIMD_NO_UNROLL(decode);                              \
   5084     stbir__simdf_load1rep4( t, hc + (ofs) );                   \
   5085     stbir__simdf8_madd_mem4( tot0, tot0, t, decode+(ofs)*3 - 1 );
   5086 
   5087 #define stbir__2_coeff_remnant( ofs )                          \
   5088     STBIR_SIMD_NO_UNROLL(decode);                              \
   5089     stbir__simdf8_load4b( cs, hc + (ofs) - 2 );                \
   5090     stbir__simdf8_0123to22223333( c, cs );                     \
   5091     stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*3 - 1 );
   5092 
    5093 #define stbir__3_coeff_remnant( ofs )                           \
   5094     STBIR_SIMD_NO_UNROLL(decode);                                \
   5095     stbir__simdf8_load4b( cs, hc + (ofs) );                      \
   5096     stbir__simdf8_0123to00001111( c, cs );                       \
   5097     stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*3 - 1 ); \
   5098     stbir__simdf8_0123to2222( t, cs );                           \
   5099     stbir__simdf8_madd_mem4( tot1, tot1, t, decode+(ofs)*3 + 6 - 1 );
   5100 
   5101 #define stbir__store_output()                       \
   5102     stbir__simdf8_add( tot0, tot0, tot1 );          \
   5103     stbir__simdf_0123to1230( t, stbir__if_simdf8_cast_to_simdf4( tot0 ) ); \
   5104     stbir__simdf8_add4halves( t, t, tot0 );         \
   5105     horizontal_coefficients += coefficient_width;   \
   5106     ++horizontal_contributors;                      \
   5107     output += 3;                                    \
   5108     if ( output < output_end )                      \
   5109     {                                               \
   5110       stbir__simdf_store( output-3, t );            \
   5111       continue;                                     \
   5112     }                                               \
   5113     { stbir__simdf tt; stbir__simdf_0123to2301( tt, t ); \
   5114     stbir__simdf_store2( output-3, t );             \
   5115     stbir__simdf_store1( output+2-3, tt ); }        \
   5116     break;
   5117 
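         // note: the 4-wide store in stbir__store_output above writes one float past the three
         //   output channels, which is harmless while a following pixel will overwrite it; the
         //   last pixel of the row takes the store2 + store1 path instead, so we never write past
         //   the end of the output buffer.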
   5118 
   5119 #else
   5120 
   5121 #define stbir__4_coeff_start()                  \
   5122     stbir__simdf tot0,tot1,tot2,c,cs;           \
   5123     STBIR_SIMD_NO_UNROLL(decode);               \
   5124     stbir__simdf_load( cs, hc );                \
   5125     stbir__simdf_0123to0001( c, cs );           \
   5126     stbir__simdf_mult_mem( tot0, c, decode );   \
   5127     stbir__simdf_0123to1122( c, cs );           \
   5128     stbir__simdf_mult_mem( tot1, c, decode+4 ); \
   5129     stbir__simdf_0123to2333( c, cs );           \
   5130     stbir__simdf_mult_mem( tot2, c, decode+8 );
   5131 
   5132 #define stbir__4_coeff_continue_from_4( ofs )                 \
   5133     STBIR_SIMD_NO_UNROLL(decode);                             \
   5134     stbir__simdf_load( cs, hc + (ofs) );                      \
   5135     stbir__simdf_0123to0001( c, cs );                         \
   5136     stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*3 );   \
   5137     stbir__simdf_0123to1122( c, cs );                         \
   5138     stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*3+4 ); \
   5139     stbir__simdf_0123to2333( c, cs );                         \
   5140     stbir__simdf_madd_mem( tot2, tot2, c, decode+(ofs)*3+8 );
   5141 
   5142 #define stbir__1_coeff_remnant( ofs )         \
   5143     STBIR_SIMD_NO_UNROLL(decode);             \
   5144     stbir__simdf_load1z( c, hc + (ofs) );     \
   5145     stbir__simdf_0123to0001( c, c );          \
   5146     stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*3 );
   5147 
   5148 #define stbir__2_coeff_remnant( ofs )                       \
   5149     { stbir__simdf d;                                       \
   5150     STBIR_SIMD_NO_UNROLL(decode);                           \
   5151     stbir__simdf_load2z( cs, hc + (ofs) );                  \
   5152     stbir__simdf_0123to0001( c, cs );                       \
   5153     stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*3 ); \
   5154     stbir__simdf_0123to1122( c, cs );                       \
   5155     stbir__simdf_load2z( d, decode+(ofs)*3+4 );             \
   5156     stbir__simdf_madd( tot1, tot1, c, d ); }
   5157 
   5158 #define stbir__3_coeff_remnant( ofs )                         \
   5159     { stbir__simdf d;                                         \
   5160     STBIR_SIMD_NO_UNROLL(decode);                             \
   5161     stbir__simdf_load( cs, hc + (ofs) );                      \
   5162     stbir__simdf_0123to0001( c, cs );                         \
   5163     stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*3 );   \
   5164     stbir__simdf_0123to1122( c, cs );                         \
   5165     stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*3+4 ); \
   5166     stbir__simdf_0123to2222( c, cs );                         \
   5167     stbir__simdf_load1z( d, decode+(ofs)*3+8 );               \
   5168     stbir__simdf_madd( tot2, tot2, c, d );  }
   5169 
   5170 #define stbir__store_output()                       \
   5171     stbir__simdf_0123ABCDto3ABx( c, tot0, tot1 );   \
   5172     stbir__simdf_0123ABCDto23Ax( cs, tot1, tot2 );  \
   5173     stbir__simdf_0123to1230( tot2, tot2 );          \
   5174     stbir__simdf_add( tot0, tot0, cs );             \
   5175     stbir__simdf_add( c, c, tot2 );                 \
   5176     stbir__simdf_add( tot0, tot0, c );              \
   5177     horizontal_coefficients += coefficient_width;   \
   5178     ++horizontal_contributors;                      \
   5179     output += 3;                                    \
   5180     if ( output < output_end )                      \
   5181     {                                               \
   5182       stbir__simdf_store( output-3, tot0 );         \
   5183       continue;                                     \
   5184     }                                               \
   5185     stbir__simdf_0123to2301( tot1, tot0 );          \
   5186     stbir__simdf_store2( output-3, tot0 );          \
   5187     stbir__simdf_store1( output+2-3, tot1 );        \
   5188     break;
   5189 
   5190 #endif
   5191 
   5192 #else
   5193 
   5194 #define stbir__1_coeff_only()  \
   5195     float tot0, tot1, tot2, c; \
   5196     c = hc[0];                 \
   5197     tot0 = decode[0]*c;        \
   5198     tot1 = decode[1]*c;        \
   5199     tot2 = decode[2]*c;
   5200 
   5201 #define stbir__2_coeff_only()  \
   5202     float tot0, tot1, tot2, c; \
   5203     c = hc[0];                 \
   5204     tot0 = decode[0]*c;        \
   5205     tot1 = decode[1]*c;        \
   5206     tot2 = decode[2]*c;        \
   5207     c = hc[1];                 \
   5208     tot0 += decode[3]*c;       \
   5209     tot1 += decode[4]*c;       \
   5210     tot2 += decode[5]*c;
   5211 
   5212 #define stbir__3_coeff_only()  \
   5213     float tot0, tot1, tot2, c; \
   5214     c = hc[0];                 \
   5215     tot0 = decode[0]*c;        \
   5216     tot1 = decode[1]*c;        \
   5217     tot2 = decode[2]*c;        \
   5218     c = hc[1];                 \
   5219     tot0 += decode[3]*c;       \
   5220     tot1 += decode[4]*c;       \
   5221     tot2 += decode[5]*c;       \
   5222     c = hc[2];                 \
   5223     tot0 += decode[6]*c;       \
   5224     tot1 += decode[7]*c;       \
   5225     tot2 += decode[8]*c;
   5226 
   5227 #define stbir__store_output_tiny()                \
   5228     output[0] = tot0;                             \
   5229     output[1] = tot1;                             \
   5230     output[2] = tot2;                             \
   5231     horizontal_coefficients += coefficient_width; \
   5232     ++horizontal_contributors;                    \
   5233     output += 3;
   5234 
   5235 #define stbir__4_coeff_start()      \
   5236     float tota0,tota1,tota2,totb0,totb1,totb2,totc0,totc1,totc2,totd0,totd1,totd2,c;  \
   5237     c = hc[0];                      \
   5238     tota0 = decode[0]*c;            \
   5239     tota1 = decode[1]*c;            \
   5240     tota2 = decode[2]*c;            \
   5241     c = hc[1];                      \
   5242     totb0 = decode[3]*c;            \
   5243     totb1 = decode[4]*c;            \
   5244     totb2 = decode[5]*c;            \
   5245     c = hc[2];                      \
   5246     totc0 = decode[6]*c;            \
   5247     totc1 = decode[7]*c;            \
   5248     totc2 = decode[8]*c;            \
   5249     c = hc[3];                      \
   5250     totd0 = decode[9]*c;            \
   5251     totd1 = decode[10]*c;           \
   5252     totd2 = decode[11]*c;
   5253 
   5254 #define stbir__4_coeff_continue_from_4( ofs )  \
   5255     c = hc[0+(ofs)];                           \
   5256     tota0 += decode[0+(ofs)*3]*c;              \
   5257     tota1 += decode[1+(ofs)*3]*c;              \
   5258     tota2 += decode[2+(ofs)*3]*c;              \
   5259     c = hc[1+(ofs)];                           \
   5260     totb0 += decode[3+(ofs)*3]*c;              \
   5261     totb1 += decode[4+(ofs)*3]*c;              \
   5262     totb2 += decode[5+(ofs)*3]*c;              \
   5263     c = hc[2+(ofs)];                           \
   5264     totc0 += decode[6+(ofs)*3]*c;              \
   5265     totc1 += decode[7+(ofs)*3]*c;              \
   5266     totc2 += decode[8+(ofs)*3]*c;              \
   5267     c = hc[3+(ofs)];                           \
   5268     totd0 += decode[9+(ofs)*3]*c;              \
   5269     totd1 += decode[10+(ofs)*3]*c;             \
   5270     totd2 += decode[11+(ofs)*3]*c;
   5271 
   5272 #define stbir__1_coeff_remnant( ofs )  \
   5273     c = hc[0+(ofs)];                   \
   5274     tota0 += decode[0+(ofs)*3]*c;      \
   5275     tota1 += decode[1+(ofs)*3]*c;      \
   5276     tota2 += decode[2+(ofs)*3]*c;
   5277 
   5278 #define stbir__2_coeff_remnant( ofs )  \
   5279     c = hc[0+(ofs)];                   \
   5280     tota0 += decode[0+(ofs)*3]*c;      \
   5281     tota1 += decode[1+(ofs)*3]*c;      \
   5282     tota2 += decode[2+(ofs)*3]*c;      \
   5283     c = hc[1+(ofs)];                   \
   5284     totb0 += decode[3+(ofs)*3]*c;      \
   5285     totb1 += decode[4+(ofs)*3]*c;      \
    5286     totb2 += decode[5+(ofs)*3]*c;
   5287 
   5288 #define stbir__3_coeff_remnant( ofs )  \
   5289     c = hc[0+(ofs)];                   \
   5290     tota0 += decode[0+(ofs)*3]*c;      \
   5291     tota1 += decode[1+(ofs)*3]*c;      \
   5292     tota2 += decode[2+(ofs)*3]*c;      \
   5293     c = hc[1+(ofs)];                   \
   5294     totb0 += decode[3+(ofs)*3]*c;      \
   5295     totb1 += decode[4+(ofs)*3]*c;      \
   5296     totb2 += decode[5+(ofs)*3]*c;      \
   5297     c = hc[2+(ofs)];                   \
   5298     totc0 += decode[6+(ofs)*3]*c;      \
   5299     totc1 += decode[7+(ofs)*3]*c;      \
   5300     totc2 += decode[8+(ofs)*3]*c;
   5301 
   5302 #define stbir__store_output()                     \
   5303     output[0] = (tota0+totc0)+(totb0+totd0);      \
   5304     output[1] = (tota1+totc1)+(totb1+totd1);      \
   5305     output[2] = (tota2+totc2)+(totb2+totd2);      \
   5306     horizontal_coefficients += coefficient_width; \
   5307     ++horizontal_contributors;                    \
   5308     output += 3;
   5309 
   5310 #endif
   5311 
   5312 #define STBIR__horizontal_channels 3
   5313 #define STB_IMAGE_RESIZE_DO_HORIZONTALS
   5314 #include STBIR__HEADER_FILENAME
   5315 
   5316 //=================
   5317 // Do 4 channel horizontal routines
   5318 
   5319 #ifdef STBIR_SIMD
   5320 
   5321 #define stbir__1_coeff_only()             \
   5322     stbir__simdf tot,c;                   \
   5323     STBIR_SIMD_NO_UNROLL(decode);         \
   5324     stbir__simdf_load1( c, hc );          \
   5325     stbir__simdf_0123to0000( c, c );      \
   5326     stbir__simdf_mult_mem( tot, c, decode );
   5327 
   5328 #define stbir__2_coeff_only()                       \
   5329     stbir__simdf tot,c,cs;                          \
   5330     STBIR_SIMD_NO_UNROLL(decode);                   \
   5331     stbir__simdf_load2( cs, hc );                   \
   5332     stbir__simdf_0123to0000( c, cs );               \
   5333     stbir__simdf_mult_mem( tot, c, decode );        \
   5334     stbir__simdf_0123to1111( c, cs );               \
   5335     stbir__simdf_madd_mem( tot, tot, c, decode+4 );
   5336 
   5337 #define stbir__3_coeff_only()                       \
   5338     stbir__simdf tot,c,cs;                          \
   5339     STBIR_SIMD_NO_UNROLL(decode);                   \
   5340     stbir__simdf_load( cs, hc );                    \
   5341     stbir__simdf_0123to0000( c, cs );               \
   5342     stbir__simdf_mult_mem( tot, c, decode );        \
   5343     stbir__simdf_0123to1111( c, cs );               \
   5344     stbir__simdf_madd_mem( tot, tot, c, decode+4 ); \
   5345     stbir__simdf_0123to2222( c, cs );               \
   5346     stbir__simdf_madd_mem( tot, tot, c, decode+8 );
   5347 
   5348 #define stbir__store_output_tiny()                \
   5349     stbir__simdf_store( output, tot );            \
   5350     horizontal_coefficients += coefficient_width; \
   5351     ++horizontal_contributors;                    \
   5352     output += 4;
   5353 
   5354 #ifdef STBIR_SIMD8
   5355 
   5356 #define stbir__4_coeff_start()                     \
   5357     stbir__simdf8 tot0,c,cs; stbir__simdf t;  \
   5358     STBIR_SIMD_NO_UNROLL(decode);                  \
   5359     stbir__simdf8_load4b( cs, hc );                \
   5360     stbir__simdf8_0123to00001111( c, cs );         \
   5361     stbir__simdf8_mult_mem( tot0, c, decode );     \
   5362     stbir__simdf8_0123to22223333( c, cs );         \
   5363     stbir__simdf8_madd_mem( tot0, tot0, c, decode+8 );
   5364 
   5365 #define stbir__4_coeff_continue_from_4( ofs )                  \
   5366     STBIR_SIMD_NO_UNROLL(decode);                              \
   5367     stbir__simdf8_load4b( cs, hc + (ofs) );                    \
   5368     stbir__simdf8_0123to00001111( c, cs );                     \
   5369     stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*4 );   \
   5370     stbir__simdf8_0123to22223333( c, cs );                     \
   5371     stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*4+8 );
   5372 
   5373 #define stbir__1_coeff_remnant( ofs )                          \
   5374     STBIR_SIMD_NO_UNROLL(decode);                              \
   5375     stbir__simdf_load1rep4( t, hc + (ofs) );                   \
   5376     stbir__simdf8_madd_mem4( tot0, tot0, t, decode+(ofs)*4 );
   5377 
   5378 #define stbir__2_coeff_remnant( ofs )                          \
   5379     STBIR_SIMD_NO_UNROLL(decode);                              \
   5380     stbir__simdf8_load4b( cs, hc + (ofs) - 2 );                \
   5381     stbir__simdf8_0123to22223333( c, cs );                     \
   5382     stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*4 );
   5383 
    5384 #define stbir__3_coeff_remnant( ofs )                         \
   5385     STBIR_SIMD_NO_UNROLL(decode);                              \
   5386     stbir__simdf8_load4b( cs, hc + (ofs) );                    \
   5387     stbir__simdf8_0123to00001111( c, cs );                     \
   5388     stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*4 );   \
   5389     stbir__simdf8_0123to2222( t, cs );                         \
   5390     stbir__simdf8_madd_mem4( tot0, tot0, t, decode+(ofs)*4+8 );
   5391 
   5392 #define stbir__store_output()                      \
   5393     stbir__simdf8_add4halves( t, stbir__if_simdf8_cast_to_simdf4(tot0), tot0 );     \
   5394     stbir__simdf_store( output, t );               \
   5395     horizontal_coefficients += coefficient_width;  \
   5396     ++horizontal_contributors;                     \
   5397     output += 4;
   5398 
   5399 #else
   5400 
   5401 #define stbir__4_coeff_start()                        \
   5402     stbir__simdf tot0,tot1,c,cs;                      \
   5403     STBIR_SIMD_NO_UNROLL(decode);                     \
   5404     stbir__simdf_load( cs, hc );                      \
   5405     stbir__simdf_0123to0000( c, cs );                 \
   5406     stbir__simdf_mult_mem( tot0, c, decode );         \
   5407     stbir__simdf_0123to1111( c, cs );                 \
   5408     stbir__simdf_mult_mem( tot1, c, decode+4 );       \
   5409     stbir__simdf_0123to2222( c, cs );                 \
   5410     stbir__simdf_madd_mem( tot0, tot0, c, decode+8 ); \
   5411     stbir__simdf_0123to3333( c, cs );                 \
   5412     stbir__simdf_madd_mem( tot1, tot1, c, decode+12 );
   5413 
   5414 #define stbir__4_coeff_continue_from_4( ofs )                  \
   5415     STBIR_SIMD_NO_UNROLL(decode);                              \
   5416     stbir__simdf_load( cs, hc + (ofs) );                       \
   5417     stbir__simdf_0123to0000( c, cs );                          \
   5418     stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*4 );    \
   5419     stbir__simdf_0123to1111( c, cs );                          \
   5420     stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*4+4 );  \
   5421     stbir__simdf_0123to2222( c, cs );                          \
   5422     stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*4+8 );  \
   5423     stbir__simdf_0123to3333( c, cs );                          \
   5424     stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*4+12 );
   5425 
   5426 #define stbir__1_coeff_remnant( ofs )                       \
   5427     STBIR_SIMD_NO_UNROLL(decode);                           \
   5428     stbir__simdf_load1( c, hc + (ofs) );                    \
   5429     stbir__simdf_0123to0000( c, c );                        \
   5430     stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*4 );
   5431 
   5432 #define stbir__2_coeff_remnant( ofs )                         \
   5433     STBIR_SIMD_NO_UNROLL(decode);                             \
   5434     stbir__simdf_load2( cs, hc + (ofs) );                     \
   5435     stbir__simdf_0123to0000( c, cs );                         \
   5436     stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*4 );   \
   5437     stbir__simdf_0123to1111( c, cs );                         \
   5438     stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*4+4 );
   5439 
   5440 #define stbir__3_coeff_remnant( ofs )                          \
   5441     STBIR_SIMD_NO_UNROLL(decode);                              \
   5442     stbir__simdf_load( cs, hc + (ofs) );                       \
   5443     stbir__simdf_0123to0000( c, cs );                          \
   5444     stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*4 );    \
   5445     stbir__simdf_0123to1111( c, cs );                          \
   5446     stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*4+4 );  \
   5447     stbir__simdf_0123to2222( c, cs );                          \
   5448     stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*4+8 );
   5449 
   5450 #define stbir__store_output()                     \
   5451     stbir__simdf_add( tot0, tot0, tot1 );         \
   5452     stbir__simdf_store( output, tot0 );           \
   5453     horizontal_coefficients += coefficient_width; \
   5454     ++horizontal_contributors;                    \
   5455     output += 4;
   5456 
   5457 #endif
   5458 
   5459 #else
   5460 
   5461 #define stbir__1_coeff_only()         \
   5462     float p0,p1,p2,p3,c;              \
   5463     STBIR_SIMD_NO_UNROLL(decode);     \
   5464     c = hc[0];                        \
   5465     p0 = decode[0] * c;               \
   5466     p1 = decode[1] * c;               \
   5467     p2 = decode[2] * c;               \
   5468     p3 = decode[3] * c;
   5469 
   5470 #define stbir__2_coeff_only()         \
   5471     float p0,p1,p2,p3,c;              \
   5472     STBIR_SIMD_NO_UNROLL(decode);     \
   5473     c = hc[0];                        \
   5474     p0 = decode[0] * c;               \
   5475     p1 = decode[1] * c;               \
   5476     p2 = decode[2] * c;               \
   5477     p3 = decode[3] * c;               \
   5478     c = hc[1];                        \
   5479     p0 += decode[4] * c;              \
   5480     p1 += decode[5] * c;              \
   5481     p2 += decode[6] * c;              \
   5482     p3 += decode[7] * c;
   5483 
   5484 #define stbir__3_coeff_only()         \
   5485     float p0,p1,p2,p3,c;              \
   5486     STBIR_SIMD_NO_UNROLL(decode);     \
   5487     c = hc[0];                        \
   5488     p0 = decode[0] * c;               \
   5489     p1 = decode[1] * c;               \
   5490     p2 = decode[2] * c;               \
   5491     p3 = decode[3] * c;               \
   5492     c = hc[1];                        \
   5493     p0 += decode[4] * c;              \
   5494     p1 += decode[5] * c;              \
   5495     p2 += decode[6] * c;              \
   5496     p3 += decode[7] * c;              \
   5497     c = hc[2];                        \
   5498     p0 += decode[8] * c;              \
   5499     p1 += decode[9] * c;              \
   5500     p2 += decode[10] * c;             \
   5501     p3 += decode[11] * c;
   5502 
   5503 #define stbir__store_output_tiny()                \
   5504     output[0] = p0;                               \
   5505     output[1] = p1;                               \
   5506     output[2] = p2;                               \
   5507     output[3] = p3;                               \
   5508     horizontal_coefficients += coefficient_width; \
   5509     ++horizontal_contributors;                    \
   5510     output += 4;
   5511 
   5512 #define stbir__4_coeff_start()        \
   5513     float x0,x1,x2,x3,y0,y1,y2,y3,c;  \
   5514     STBIR_SIMD_NO_UNROLL(decode);     \
   5515     c = hc[0];                        \
   5516     x0 = decode[0] * c;               \
   5517     x1 = decode[1] * c;               \
   5518     x2 = decode[2] * c;               \
   5519     x3 = decode[3] * c;               \
   5520     c = hc[1];                        \
   5521     y0 = decode[4] * c;               \
   5522     y1 = decode[5] * c;               \
   5523     y2 = decode[6] * c;               \
   5524     y3 = decode[7] * c;               \
   5525     c = hc[2];                        \
   5526     x0 += decode[8] * c;              \
   5527     x1 += decode[9] * c;              \
   5528     x2 += decode[10] * c;             \
   5529     x3 += decode[11] * c;             \
   5530     c = hc[3];                        \
   5531     y0 += decode[12] * c;             \
   5532     y1 += decode[13] * c;             \
   5533     y2 += decode[14] * c;             \
   5534     y3 += decode[15] * c;
   5535 
   5536 #define stbir__4_coeff_continue_from_4( ofs ) \
   5537     STBIR_SIMD_NO_UNROLL(decode);     \
   5538     c = hc[0+(ofs)];                  \
   5539     x0 += decode[0+(ofs)*4] * c;      \
   5540     x1 += decode[1+(ofs)*4] * c;      \
   5541     x2 += decode[2+(ofs)*4] * c;      \
   5542     x3 += decode[3+(ofs)*4] * c;      \
   5543     c = hc[1+(ofs)];                  \
   5544     y0 += decode[4+(ofs)*4] * c;      \
   5545     y1 += decode[5+(ofs)*4] * c;      \
   5546     y2 += decode[6+(ofs)*4] * c;      \
   5547     y3 += decode[7+(ofs)*4] * c;      \
   5548     c = hc[2+(ofs)];                  \
   5549     x0 += decode[8+(ofs)*4] * c;      \
   5550     x1 += decode[9+(ofs)*4] * c;      \
   5551     x2 += decode[10+(ofs)*4] * c;     \
   5552     x3 += decode[11+(ofs)*4] * c;     \
   5553     c = hc[3+(ofs)];                  \
   5554     y0 += decode[12+(ofs)*4] * c;     \
   5555     y1 += decode[13+(ofs)*4] * c;     \
   5556     y2 += decode[14+(ofs)*4] * c;     \
   5557     y3 += decode[15+(ofs)*4] * c;
   5558 
   5559 #define stbir__1_coeff_remnant( ofs ) \
   5560     STBIR_SIMD_NO_UNROLL(decode);     \
   5561     c = hc[0+(ofs)];                  \
   5562     x0 += decode[0+(ofs)*4] * c;      \
   5563     x1 += decode[1+(ofs)*4] * c;      \
   5564     x2 += decode[2+(ofs)*4] * c;      \
   5565     x3 += decode[3+(ofs)*4] * c;
   5566 
   5567 #define stbir__2_coeff_remnant( ofs ) \
   5568     STBIR_SIMD_NO_UNROLL(decode);     \
   5569     c = hc[0+(ofs)];                  \
   5570     x0 += decode[0+(ofs)*4] * c;      \
   5571     x1 += decode[1+(ofs)*4] * c;      \
   5572     x2 += decode[2+(ofs)*4] * c;      \
   5573     x3 += decode[3+(ofs)*4] * c;      \
   5574     c = hc[1+(ofs)];                  \
   5575     y0 += decode[4+(ofs)*4] * c;      \
   5576     y1 += decode[5+(ofs)*4] * c;      \
   5577     y2 += decode[6+(ofs)*4] * c;      \
   5578     y3 += decode[7+(ofs)*4] * c;
   5579 
   5580 #define stbir__3_coeff_remnant( ofs ) \
   5581     STBIR_SIMD_NO_UNROLL(decode);     \
   5582     c = hc[0+(ofs)];                  \
   5583     x0 += decode[0+(ofs)*4] * c;      \
   5584     x1 += decode[1+(ofs)*4] * c;      \
   5585     x2 += decode[2+(ofs)*4] * c;      \
   5586     x3 += decode[3+(ofs)*4] * c;      \
   5587     c = hc[1+(ofs)];                  \
   5588     y0 += decode[4+(ofs)*4] * c;      \
   5589     y1 += decode[5+(ofs)*4] * c;      \
   5590     y2 += decode[6+(ofs)*4] * c;      \
   5591     y3 += decode[7+(ofs)*4] * c;      \
   5592     c = hc[2+(ofs)];                  \
   5593     x0 += decode[8+(ofs)*4] * c;      \
   5594     x1 += decode[9+(ofs)*4] * c;      \
   5595     x2 += decode[10+(ofs)*4] * c;     \
   5596     x3 += decode[11+(ofs)*4] * c;
   5597 
   5598 #define stbir__store_output()                     \
   5599     output[0] = x0 + y0;                          \
   5600     output[1] = x1 + y1;                          \
   5601     output[2] = x2 + y2;                          \
   5602     output[3] = x3 + y3;                          \
   5603     horizontal_coefficients += coefficient_width; \
   5604     ++horizontal_contributors;                    \
   5605     output += 4;
   5606 
   5607 #endif
   5608 
   5609 #define STBIR__horizontal_channels 4
   5610 #define STB_IMAGE_RESIZE_DO_HORIZONTALS
   5611 #include STBIR__HEADER_FILENAME
   5612 
   5613 
   5614 
   5615 //=================
   5616 // Do 7 channel horizontal routines
   5617 
   5618 #ifdef STBIR_SIMD
   5619 
   5620 #define stbir__1_coeff_only()                   \
   5621     stbir__simdf tot0,tot1,c;                   \
   5622     STBIR_SIMD_NO_UNROLL(decode);               \
   5623     stbir__simdf_load1( c, hc );                \
   5624     stbir__simdf_0123to0000( c, c );            \
   5625     stbir__simdf_mult_mem( tot0, c, decode );   \
   5626     stbir__simdf_mult_mem( tot1, c, decode+3 );
   5627 
   5628 #define stbir__2_coeff_only()                         \
   5629     stbir__simdf tot0,tot1,c,cs;                      \
   5630     STBIR_SIMD_NO_UNROLL(decode);                     \
   5631     stbir__simdf_load2( cs, hc );                     \
   5632     stbir__simdf_0123to0000( c, cs );                 \
   5633     stbir__simdf_mult_mem( tot0, c, decode );         \
   5634     stbir__simdf_mult_mem( tot1, c, decode+3 );       \
   5635     stbir__simdf_0123to1111( c, cs );                 \
   5636     stbir__simdf_madd_mem( tot0, tot0, c, decode+7 ); \
   5637     stbir__simdf_madd_mem( tot1, tot1, c,decode+10 );
   5638 
   5639 #define stbir__3_coeff_only()                           \
   5640     stbir__simdf tot0,tot1,c,cs;                        \
   5641     STBIR_SIMD_NO_UNROLL(decode);                       \
   5642     stbir__simdf_load( cs, hc );                        \
   5643     stbir__simdf_0123to0000( c, cs );                   \
   5644     stbir__simdf_mult_mem( tot0, c, decode );           \
   5645     stbir__simdf_mult_mem( tot1, c, decode+3 );         \
   5646     stbir__simdf_0123to1111( c, cs );                   \
   5647     stbir__simdf_madd_mem( tot0, tot0, c, decode+7 );   \
   5648     stbir__simdf_madd_mem( tot1, tot1, c, decode+10 );  \
   5649     stbir__simdf_0123to2222( c, cs );                   \
   5650     stbir__simdf_madd_mem( tot0, tot0, c, decode+14 );  \
   5651     stbir__simdf_madd_mem( tot1, tot1, c, decode+17 );
   5652 
   5653 #define stbir__store_output_tiny()                \
   5654     stbir__simdf_store( output+3, tot1 );         \
   5655     stbir__simdf_store( output, tot0 );           \
   5656     horizontal_coefficients += coefficient_width; \
   5657     ++horizontal_contributors;                    \
   5658     output += 7;
   5659 
   5660 #ifdef STBIR_SIMD8
   5661 
   5662 #define stbir__4_coeff_start()                     \
   5663     stbir__simdf8 tot0,tot1,c,cs;                  \
   5664     STBIR_SIMD_NO_UNROLL(decode);                  \
   5665     stbir__simdf8_load4b( cs, hc );                \
   5666     stbir__simdf8_0123to00000000( c, cs );         \
   5667     stbir__simdf8_mult_mem( tot0, c, decode );     \
   5668     stbir__simdf8_0123to11111111( c, cs );         \
   5669     stbir__simdf8_mult_mem( tot1, c, decode+7 );   \
   5670     stbir__simdf8_0123to22222222( c, cs );         \
   5671     stbir__simdf8_madd_mem( tot0, tot0, c, decode+14 );  \
   5672     stbir__simdf8_0123to33333333( c, cs );         \
   5673     stbir__simdf8_madd_mem( tot1, tot1, c, decode+21 );
   5674 
   5675 #define stbir__4_coeff_continue_from_4( ofs )                   \
   5676     STBIR_SIMD_NO_UNROLL(decode);                               \
   5677     stbir__simdf8_load4b( cs, hc + (ofs) );                     \
   5678     stbir__simdf8_0123to00000000( c, cs );                      \
   5679     stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*7 );    \
   5680     stbir__simdf8_0123to11111111( c, cs );                      \
   5681     stbir__simdf8_madd_mem( tot1, tot1, c, decode+(ofs)*7+7 );  \
   5682     stbir__simdf8_0123to22222222( c, cs );                      \
   5683     stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*7+14 ); \
   5684     stbir__simdf8_0123to33333333( c, cs );                      \
   5685     stbir__simdf8_madd_mem( tot1, tot1, c, decode+(ofs)*7+21 );
   5686 
   5687 #define stbir__1_coeff_remnant( ofs )                           \
   5688     STBIR_SIMD_NO_UNROLL(decode);                               \
   5689     stbir__simdf8_load1b( c, hc + (ofs) );                      \
   5690     stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*7 );
   5691 
   5692 #define stbir__2_coeff_remnant( ofs )                           \
   5693     STBIR_SIMD_NO_UNROLL(decode);                               \
   5694     stbir__simdf8_load1b( c, hc + (ofs) );                      \
   5695     stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*7 );    \
   5696     stbir__simdf8_load1b( c, hc + (ofs)+1 );                    \
   5697     stbir__simdf8_madd_mem( tot1, tot1, c, decode+(ofs)*7+7 );
   5698 
   5699 #define stbir__3_coeff_remnant( ofs )                           \
   5700     STBIR_SIMD_NO_UNROLL(decode);                               \
   5701     stbir__simdf8_load4b( cs, hc + (ofs) );                     \
   5702     stbir__simdf8_0123to00000000( c, cs );                      \
   5703     stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*7 );    \
   5704     stbir__simdf8_0123to11111111( c, cs );                      \
   5705     stbir__simdf8_madd_mem( tot1, tot1, c, decode+(ofs)*7+7 );  \
   5706     stbir__simdf8_0123to22222222( c, cs );                      \
   5707     stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*7+14 );
   5708 
   5709 #define stbir__store_output()                     \
   5710     stbir__simdf8_add( tot0, tot0, tot1 );        \
   5711     horizontal_coefficients += coefficient_width; \
   5712     ++horizontal_contributors;                    \
   5713     output += 7;                                  \
   5714     if ( output < output_end )                    \
   5715     {                                             \
   5716       stbir__simdf8_store( output-7, tot0 );      \
   5717       continue;                                   \
   5718     }                                             \
   5719     stbir__simdf_store( output-7+3, stbir__simdf_swiz(stbir__simdf8_gettop4(tot0),0,0,1,2) ); \
   5720     stbir__simdf_store( output-7, stbir__if_simdf8_cast_to_simdf4(tot0) );           \
   5721     break;
   5722 
   5723 #else
   5724 
   5725 #define stbir__4_coeff_start()                    \
   5726     stbir__simdf tot0,tot1,tot2,tot3,c,cs;        \
   5727     STBIR_SIMD_NO_UNROLL(decode);                 \
   5728     stbir__simdf_load( cs, hc );                  \
   5729     stbir__simdf_0123to0000( c, cs );             \
   5730     stbir__simdf_mult_mem( tot0, c, decode );     \
   5731     stbir__simdf_mult_mem( tot1, c, decode+3 );   \
   5732     stbir__simdf_0123to1111( c, cs );             \
   5733     stbir__simdf_mult_mem( tot2, c, decode+7 );   \
   5734     stbir__simdf_mult_mem( tot3, c, decode+10 );  \
   5735     stbir__simdf_0123to2222( c, cs );             \
   5736     stbir__simdf_madd_mem( tot0, tot0, c, decode+14 );  \
   5737     stbir__simdf_madd_mem( tot1, tot1, c, decode+17 );  \
   5738     stbir__simdf_0123to3333( c, cs );                   \
   5739     stbir__simdf_madd_mem( tot2, tot2, c, decode+21 );  \
   5740     stbir__simdf_madd_mem( tot3, tot3, c, decode+24 );
   5741 
   5742 #define stbir__4_coeff_continue_from_4( ofs )                   \
   5743     STBIR_SIMD_NO_UNROLL(decode);                               \
   5744     stbir__simdf_load( cs, hc + (ofs) );                        \
   5745     stbir__simdf_0123to0000( c, cs );                           \
   5746     stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*7 );     \
   5747     stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*7+3 );   \
   5748     stbir__simdf_0123to1111( c, cs );                           \
   5749     stbir__simdf_madd_mem( tot2, tot2, c, decode+(ofs)*7+7 );   \
   5750     stbir__simdf_madd_mem( tot3, tot3, c, decode+(ofs)*7+10 );  \
   5751     stbir__simdf_0123to2222( c, cs );                           \
   5752     stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*7+14 );  \
   5753     stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*7+17 );  \
   5754     stbir__simdf_0123to3333( c, cs );                           \
   5755     stbir__simdf_madd_mem( tot2, tot2, c, decode+(ofs)*7+21 );  \
   5756     stbir__simdf_madd_mem( tot3, tot3, c, decode+(ofs)*7+24 );
   5757 
   5758 #define stbir__1_coeff_remnant( ofs )                           \
   5759     STBIR_SIMD_NO_UNROLL(decode);                               \
   5760     stbir__simdf_load1( c, hc + (ofs) );                        \
   5761     stbir__simdf_0123to0000( c, c );                            \
   5762     stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*7 );     \
    5763     stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*7+3 );
   5764 
   5765 #define stbir__2_coeff_remnant( ofs )                           \
   5766     STBIR_SIMD_NO_UNROLL(decode);                               \
   5767     stbir__simdf_load2( cs, hc + (ofs) );                       \
   5768     stbir__simdf_0123to0000( c, cs );                           \
   5769     stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*7 );     \
   5770     stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*7+3 );   \
   5771     stbir__simdf_0123to1111( c, cs );                           \
   5772     stbir__simdf_madd_mem( tot2, tot2, c, decode+(ofs)*7+7 );   \
   5773     stbir__simdf_madd_mem( tot3, tot3, c, decode+(ofs)*7+10 );
   5774 
   5775 #define stbir__3_coeff_remnant( ofs )                           \
   5776     STBIR_SIMD_NO_UNROLL(decode);                               \
   5777     stbir__simdf_load( cs, hc + (ofs) );                        \
   5778     stbir__simdf_0123to0000( c, cs );                           \
   5779     stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*7 );     \
   5780     stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*7+3 );   \
   5781     stbir__simdf_0123to1111( c, cs );                           \
   5782     stbir__simdf_madd_mem( tot2, tot2, c, decode+(ofs)*7+7 );   \
   5783     stbir__simdf_madd_mem( tot3, tot3, c, decode+(ofs)*7+10 );  \
   5784     stbir__simdf_0123to2222( c, cs );                           \
   5785     stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*7+14 );  \
   5786     stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*7+17 );
   5787 
   5788 #define stbir__store_output()                     \
   5789     stbir__simdf_add( tot0, tot0, tot2 );         \
   5790     stbir__simdf_add( tot1, tot1, tot3 );         \
   5791     stbir__simdf_store( output+3, tot1 );         \
   5792     stbir__simdf_store( output, tot0 );           \
   5793     horizontal_coefficients += coefficient_width; \
   5794     ++horizontal_contributors;                    \
   5795     output += 7;
   5796 
   5797 #endif
   5798 
   5799 #else
   5800 
   5801 #define stbir__1_coeff_only()        \
   5802     float tot0, tot1, tot2, tot3, tot4, tot5, tot6, c; \
   5803     c = hc[0];                       \
   5804     tot0 = decode[0]*c;              \
   5805     tot1 = decode[1]*c;              \
   5806     tot2 = decode[2]*c;              \
   5807     tot3 = decode[3]*c;              \
   5808     tot4 = decode[4]*c;              \
   5809     tot5 = decode[5]*c;              \
   5810     tot6 = decode[6]*c;
   5811 
   5812 #define stbir__2_coeff_only()        \
   5813     float tot0, tot1, tot2, tot3, tot4, tot5, tot6, c; \
   5814     c = hc[0];                       \
   5815     tot0 = decode[0]*c;              \
   5816     tot1 = decode[1]*c;              \
   5817     tot2 = decode[2]*c;              \
   5818     tot3 = decode[3]*c;              \
   5819     tot4 = decode[4]*c;              \
   5820     tot5 = decode[5]*c;              \
   5821     tot6 = decode[6]*c;              \
   5822     c = hc[1];                       \
   5823     tot0 += decode[7]*c;             \
   5824     tot1 += decode[8]*c;             \
   5825     tot2 += decode[9]*c;             \
   5826     tot3 += decode[10]*c;            \
   5827     tot4 += decode[11]*c;            \
   5828     tot5 += decode[12]*c;            \
    5829     tot6 += decode[13]*c;
   5830 
   5831 #define stbir__3_coeff_only()        \
   5832     float tot0, tot1, tot2, tot3, tot4, tot5, tot6, c; \
   5833     c = hc[0];                       \
   5834     tot0 = decode[0]*c;              \
   5835     tot1 = decode[1]*c;              \
   5836     tot2 = decode[2]*c;              \
   5837     tot3 = decode[3]*c;              \
   5838     tot4 = decode[4]*c;              \
   5839     tot5 = decode[5]*c;              \
   5840     tot6 = decode[6]*c;              \
   5841     c = hc[1];                       \
   5842     tot0 += decode[7]*c;             \
   5843     tot1 += decode[8]*c;             \
   5844     tot2 += decode[9]*c;             \
   5845     tot3 += decode[10]*c;            \
   5846     tot4 += decode[11]*c;            \
   5847     tot5 += decode[12]*c;            \
   5848     tot6 += decode[13]*c;            \
   5849     c = hc[2];                       \
   5850     tot0 += decode[14]*c;            \
   5851     tot1 += decode[15]*c;            \
   5852     tot2 += decode[16]*c;            \
   5853     tot3 += decode[17]*c;            \
   5854     tot4 += decode[18]*c;            \
   5855     tot5 += decode[19]*c;            \
    5856     tot6 += decode[20]*c;
   5857 
   5858 #define stbir__store_output_tiny()                \
   5859     output[0] = tot0;                             \
   5860     output[1] = tot1;                             \
   5861     output[2] = tot2;                             \
   5862     output[3] = tot3;                             \
   5863     output[4] = tot4;                             \
   5864     output[5] = tot5;                             \
   5865     output[6] = tot6;                             \
   5866     horizontal_coefficients += coefficient_width; \
   5867     ++horizontal_contributors;                    \
   5868     output += 7;
   5869 
   5870 #define stbir__4_coeff_start()    \
   5871     float x0,x1,x2,x3,x4,x5,x6,y0,y1,y2,y3,y4,y5,y6,c; \
   5872     STBIR_SIMD_NO_UNROLL(decode); \
   5873     c = hc[0];                    \
   5874     x0 = decode[0] * c;           \
   5875     x1 = decode[1] * c;           \
   5876     x2 = decode[2] * c;           \
   5877     x3 = decode[3] * c;           \
   5878     x4 = decode[4] * c;           \
   5879     x5 = decode[5] * c;           \
   5880     x6 = decode[6] * c;           \
   5881     c = hc[1];                    \
   5882     y0 = decode[7] * c;           \
   5883     y1 = decode[8] * c;           \
   5884     y2 = decode[9] * c;           \
   5885     y3 = decode[10] * c;          \
   5886     y4 = decode[11] * c;          \
   5887     y5 = decode[12] * c;          \
   5888     y6 = decode[13] * c;          \
   5889     c = hc[2];                    \
   5890     x0 += decode[14] * c;         \
   5891     x1 += decode[15] * c;         \
   5892     x2 += decode[16] * c;         \
   5893     x3 += decode[17] * c;         \
   5894     x4 += decode[18] * c;         \
   5895     x5 += decode[19] * c;         \
   5896     x6 += decode[20] * c;         \
   5897     c = hc[3];                    \
   5898     y0 += decode[21] * c;         \
   5899     y1 += decode[22] * c;         \
   5900     y2 += decode[23] * c;         \
   5901     y3 += decode[24] * c;         \
   5902     y4 += decode[25] * c;         \
   5903     y5 += decode[26] * c;         \
   5904     y6 += decode[27] * c;
   5905 
   5906 #define stbir__4_coeff_continue_from_4( ofs ) \
   5907     STBIR_SIMD_NO_UNROLL(decode);  \
   5908     c = hc[0+(ofs)];               \
   5909     x0 += decode[0+(ofs)*7] * c;   \
   5910     x1 += decode[1+(ofs)*7] * c;   \
   5911     x2 += decode[2+(ofs)*7] * c;   \
   5912     x3 += decode[3+(ofs)*7] * c;   \
   5913     x4 += decode[4+(ofs)*7] * c;   \
   5914     x5 += decode[5+(ofs)*7] * c;   \
   5915     x6 += decode[6+(ofs)*7] * c;   \
   5916     c = hc[1+(ofs)];               \
   5917     y0 += decode[7+(ofs)*7] * c;   \
   5918     y1 += decode[8+(ofs)*7] * c;   \
   5919     y2 += decode[9+(ofs)*7] * c;   \
   5920     y3 += decode[10+(ofs)*7] * c;  \
   5921     y4 += decode[11+(ofs)*7] * c;  \
   5922     y5 += decode[12+(ofs)*7] * c;  \
   5923     y6 += decode[13+(ofs)*7] * c;  \
   5924     c = hc[2+(ofs)];               \
   5925     x0 += decode[14+(ofs)*7] * c;  \
   5926     x1 += decode[15+(ofs)*7] * c;  \
   5927     x2 += decode[16+(ofs)*7] * c;  \
   5928     x3 += decode[17+(ofs)*7] * c;  \
   5929     x4 += decode[18+(ofs)*7] * c;  \
   5930     x5 += decode[19+(ofs)*7] * c;  \
   5931     x6 += decode[20+(ofs)*7] * c;  \
   5932     c = hc[3+(ofs)];               \
   5933     y0 += decode[21+(ofs)*7] * c;  \
   5934     y1 += decode[22+(ofs)*7] * c;  \
   5935     y2 += decode[23+(ofs)*7] * c;  \
   5936     y3 += decode[24+(ofs)*7] * c;  \
   5937     y4 += decode[25+(ofs)*7] * c;  \
   5938     y5 += decode[26+(ofs)*7] * c;  \
   5939     y6 += decode[27+(ofs)*7] * c;
   5940 
   5941 #define stbir__1_coeff_remnant( ofs ) \
   5942     STBIR_SIMD_NO_UNROLL(decode);  \
   5943     c = hc[0+(ofs)];               \
   5944     x0 += decode[0+(ofs)*7] * c;   \
   5945     x1 += decode[1+(ofs)*7] * c;   \
   5946     x2 += decode[2+(ofs)*7] * c;   \
   5947     x3 += decode[3+(ofs)*7] * c;   \
   5948     x4 += decode[4+(ofs)*7] * c;   \
   5949     x5 += decode[5+(ofs)*7] * c;   \
    5950     x6 += decode[6+(ofs)*7] * c;
   5951 
   5952 #define stbir__2_coeff_remnant( ofs ) \
   5953     STBIR_SIMD_NO_UNROLL(decode);  \
   5954     c = hc[0+(ofs)];               \
   5955     x0 += decode[0+(ofs)*7] * c;   \
   5956     x1 += decode[1+(ofs)*7] * c;   \
   5957     x2 += decode[2+(ofs)*7] * c;   \
   5958     x3 += decode[3+(ofs)*7] * c;   \
   5959     x4 += decode[4+(ofs)*7] * c;   \
   5960     x5 += decode[5+(ofs)*7] * c;   \
   5961     x6 += decode[6+(ofs)*7] * c;   \
   5962     c = hc[1+(ofs)];               \
   5963     y0 += decode[7+(ofs)*7] * c;   \
   5964     y1 += decode[8+(ofs)*7] * c;   \
   5965     y2 += decode[9+(ofs)*7] * c;   \
   5966     y3 += decode[10+(ofs)*7] * c;  \
   5967     y4 += decode[11+(ofs)*7] * c;  \
   5968     y5 += decode[12+(ofs)*7] * c;  \
    5969     y6 += decode[13+(ofs)*7] * c;
   5970 
   5971 #define stbir__3_coeff_remnant( ofs ) \
   5972     STBIR_SIMD_NO_UNROLL(decode);  \
   5973     c = hc[0+(ofs)];               \
   5974     x0 += decode[0+(ofs)*7] * c;   \
   5975     x1 += decode[1+(ofs)*7] * c;   \
   5976     x2 += decode[2+(ofs)*7] * c;   \
   5977     x3 += decode[3+(ofs)*7] * c;   \
   5978     x4 += decode[4+(ofs)*7] * c;   \
   5979     x5 += decode[5+(ofs)*7] * c;   \
   5980     x6 += decode[6+(ofs)*7] * c;   \
   5981     c = hc[1+(ofs)];               \
   5982     y0 += decode[7+(ofs)*7] * c;   \
   5983     y1 += decode[8+(ofs)*7] * c;   \
   5984     y2 += decode[9+(ofs)*7] * c;   \
   5985     y3 += decode[10+(ofs)*7] * c;  \
   5986     y4 += decode[11+(ofs)*7] * c;  \
   5987     y5 += decode[12+(ofs)*7] * c;  \
   5988     y6 += decode[13+(ofs)*7] * c;  \
   5989     c = hc[2+(ofs)];               \
   5990     x0 += decode[14+(ofs)*7] * c;  \
   5991     x1 += decode[15+(ofs)*7] * c;  \
   5992     x2 += decode[16+(ofs)*7] * c;  \
   5993     x3 += decode[17+(ofs)*7] * c;  \
   5994     x4 += decode[18+(ofs)*7] * c;  \
   5995     x5 += decode[19+(ofs)*7] * c;  \
    5996     x6 += decode[20+(ofs)*7] * c;
   5997 
   5998 #define stbir__store_output()                     \
   5999     output[0] = x0 + y0;                          \
   6000     output[1] = x1 + y1;                          \
   6001     output[2] = x2 + y2;                          \
   6002     output[3] = x3 + y3;                          \
   6003     output[4] = x4 + y4;                          \
   6004     output[5] = x5 + y5;                          \
   6005     output[6] = x6 + y6;                          \
   6006     horizontal_coefficients += coefficient_width; \
   6007     ++horizontal_contributors;                    \
   6008     output += 7;
   6009 
   6010 #endif
   6011 
   6012 #define STBIR__horizontal_channels 7
   6013 #define STB_IMAGE_RESIZE_DO_HORIZONTALS
   6014 #include STBIR__HEADER_FILENAME
   6015 
   6016 
   6017 // include all of the vertical resamplers (both scatter and gather versions)
   6018 
   6019 #define STBIR__vertical_channels 1
   6020 #define STB_IMAGE_RESIZE_DO_VERTICALS
   6021 #include STBIR__HEADER_FILENAME
   6022 
   6023 #define STBIR__vertical_channels 1
   6024 #define STB_IMAGE_RESIZE_DO_VERTICALS
   6025 #define STB_IMAGE_RESIZE_VERTICAL_CONTINUE
   6026 #include STBIR__HEADER_FILENAME
   6027 
   6028 #define STBIR__vertical_channels 2
   6029 #define STB_IMAGE_RESIZE_DO_VERTICALS
   6030 #include STBIR__HEADER_FILENAME
   6031 
   6032 #define STBIR__vertical_channels 2
   6033 #define STB_IMAGE_RESIZE_DO_VERTICALS
   6034 #define STB_IMAGE_RESIZE_VERTICAL_CONTINUE
   6035 #include STBIR__HEADER_FILENAME
   6036 
   6037 #define STBIR__vertical_channels 3
   6038 #define STB_IMAGE_RESIZE_DO_VERTICALS
   6039 #include STBIR__HEADER_FILENAME
   6040 
   6041 #define STBIR__vertical_channels 3
   6042 #define STB_IMAGE_RESIZE_DO_VERTICALS
   6043 #define STB_IMAGE_RESIZE_VERTICAL_CONTINUE
   6044 #include STBIR__HEADER_FILENAME
   6045 
   6046 #define STBIR__vertical_channels 4
   6047 #define STB_IMAGE_RESIZE_DO_VERTICALS
   6048 #include STBIR__HEADER_FILENAME
   6049 
   6050 #define STBIR__vertical_channels 4
   6051 #define STB_IMAGE_RESIZE_DO_VERTICALS
   6052 #define STB_IMAGE_RESIZE_VERTICAL_CONTINUE
   6053 #include STBIR__HEADER_FILENAME
   6054 
   6055 #define STBIR__vertical_channels 5
   6056 #define STB_IMAGE_RESIZE_DO_VERTICALS
   6057 #include STBIR__HEADER_FILENAME
   6058 
   6059 #define STBIR__vertical_channels 5
   6060 #define STB_IMAGE_RESIZE_DO_VERTICALS
   6061 #define STB_IMAGE_RESIZE_VERTICAL_CONTINUE
   6062 #include STBIR__HEADER_FILENAME
   6063 
   6064 #define STBIR__vertical_channels 6
   6065 #define STB_IMAGE_RESIZE_DO_VERTICALS
   6066 #include STBIR__HEADER_FILENAME
   6067 
   6068 #define STBIR__vertical_channels 6
   6069 #define STB_IMAGE_RESIZE_DO_VERTICALS
   6070 #define STB_IMAGE_RESIZE_VERTICAL_CONTINUE
   6071 #include STBIR__HEADER_FILENAME
   6072 
   6073 #define STBIR__vertical_channels 7
   6074 #define STB_IMAGE_RESIZE_DO_VERTICALS
   6075 #include STBIR__HEADER_FILENAME
   6076 
   6077 #define STBIR__vertical_channels 7
   6078 #define STB_IMAGE_RESIZE_DO_VERTICALS
   6079 #define STB_IMAGE_RESIZE_VERTICAL_CONTINUE
   6080 #include STBIR__HEADER_FILENAME
   6081 
   6082 #define STBIR__vertical_channels 8
   6083 #define STB_IMAGE_RESIZE_DO_VERTICALS
   6084 #include STBIR__HEADER_FILENAME
   6085 
   6086 #define STBIR__vertical_channels 8
   6087 #define STB_IMAGE_RESIZE_DO_VERTICALS
   6088 #define STB_IMAGE_RESIZE_VERTICAL_CONTINUE
   6089 #include STBIR__HEADER_FILENAME
   6090 
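// (editor's note) Each define + #include pair above re-includes
// STBIR__HEADER_FILENAME (this same header), so the generator section guarded
// by STB_IMAGE_RESIZE_DO_VERTICALS is expanded once per STBIR__vertical_channels
// value; adding STB_IMAGE_RESIZE_VERTICAL_CONTINUE also emits the accumulating
// "..._cont" variants collected in the dispatch tables below.  A minimal,
// disabled sketch of the same re-include technique follows -- the file name
// "sum_template.h" and the SUM_* macros are hypothetical, not part of stb.
#if 0
/* contents of the hypothetical "sum_template.h", emitted once per inclusion
   and specialized on SUM_N (the count is pasted into the function name): */
#define SUM_NAME2( n ) sum_with_##n##_terms
#define SUM_NAME( n )  SUM_NAME2( n )
static float SUM_NAME( SUM_N )( float const * v )
{
  int i; float t = 0.0f;
  for ( i = 0 ; i < SUM_N ; i++ )
    t += v[ i ];
  return t;
}
#undef SUM_NAME
#undef SUM_NAME2
#undef SUM_N

/* and in the including file, one define + include per specialization: */
#define SUM_N 2
#include "sum_template.h"   /* emits sum_with_2_terms() */
#define SUM_N 3
#include "sum_template.h"   /* emits sum_with_3_terms() */
#endif
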
   6091 typedef void STBIR_VERTICAL_GATHERFUNC( float * output, float const * coeffs, float const ** inputs, float const * input0_end );
   6092 
   6093 static STBIR_VERTICAL_GATHERFUNC * stbir__vertical_gathers[ 8 ] =
   6094 {
   6095   stbir__vertical_gather_with_1_coeffs,stbir__vertical_gather_with_2_coeffs,stbir__vertical_gather_with_3_coeffs,stbir__vertical_gather_with_4_coeffs,stbir__vertical_gather_with_5_coeffs,stbir__vertical_gather_with_6_coeffs,stbir__vertical_gather_with_7_coeffs,stbir__vertical_gather_with_8_coeffs
   6096 };
   6097 
   6098 static STBIR_VERTICAL_GATHERFUNC * stbir__vertical_gathers_continues[ 8 ] =
   6099 {
   6100   stbir__vertical_gather_with_1_coeffs_cont,stbir__vertical_gather_with_2_coeffs_cont,stbir__vertical_gather_with_3_coeffs_cont,stbir__vertical_gather_with_4_coeffs_cont,stbir__vertical_gather_with_5_coeffs_cont,stbir__vertical_gather_with_6_coeffs_cont,stbir__vertical_gather_with_7_coeffs_cont,stbir__vertical_gather_with_8_coeffs_cont
   6101 };
   6102 
   6103 typedef void STBIR_VERTICAL_SCATTERFUNC( float ** outputs, float const * coeffs, float const * input, float const * input_end );
   6104 
   6105 static STBIR_VERTICAL_SCATTERFUNC * stbir__vertical_scatter_sets[ 8 ] =
   6106 {
   6107   stbir__vertical_scatter_with_1_coeffs,stbir__vertical_scatter_with_2_coeffs,stbir__vertical_scatter_with_3_coeffs,stbir__vertical_scatter_with_4_coeffs,stbir__vertical_scatter_with_5_coeffs,stbir__vertical_scatter_with_6_coeffs,stbir__vertical_scatter_with_7_coeffs,stbir__vertical_scatter_with_8_coeffs
   6108 };
   6109 
   6110 static STBIR_VERTICAL_SCATTERFUNC * stbir__vertical_scatter_blends[ 8 ] =
   6111 {
   6112   stbir__vertical_scatter_with_1_coeffs_cont,stbir__vertical_scatter_with_2_coeffs_cont,stbir__vertical_scatter_with_3_coeffs_cont,stbir__vertical_scatter_with_4_coeffs_cont,stbir__vertical_scatter_with_5_coeffs_cont,stbir__vertical_scatter_with_6_coeffs_cont,stbir__vertical_scatter_with_7_coeffs_cont,stbir__vertical_scatter_with_8_coeffs_cont
   6113 };
   6114 
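// (editor's note) A sketch of how the 8-entry tables above are used: the index
// is count-1, the plain functions overwrite their destination and the
// "..._cont"/blend variants accumulate into it, so only the first batch of a
// long run uses the plain table.  This mirrors stbir__resample_vertical_gather
// below; example_blend and its assumption that the caller already gathered all
// 'total' scanline pointers into one array are hypothetical.
#if 0
static void example_blend( float * dest, float const * coeffs,
                           float const ** scanlines,           /* 'total' pointers */
                           int width_times_channels, int total )
{
  int k = 0;
  while ( total )
  {
    int cnt = ( total > 8 ) ? 8 : total;    /* blend at most 8 scanlines per call */
    ( ( k == 0 ) ? stbir__vertical_gathers : stbir__vertical_gathers_continues )[ cnt - 1 ]
      ( dest, coeffs + k, scanlines + k, scanlines[ k ] + width_times_channels );
    k += cnt;
    total -= cnt;
  }
}
#endif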
   6115 
   6116 static void stbir__encode_scanline( stbir__info const * stbir_info, void *output_buffer_data, float * encode_buffer, int row  STBIR_ONLY_PROFILE_GET_SPLIT_INFO )
   6117 {
   6118   int num_pixels = stbir_info->horizontal.scale_info.output_sub_size;
   6119   int channels = stbir_info->channels;
   6120   int width_times_channels = num_pixels * channels;
   6121   void * output_buffer;
   6122 
   6123   // un-alpha weight if we need to
   6124   if ( stbir_info->alpha_unweight )
   6125   {
   6126     STBIR_PROFILE_START( unalpha );
   6127     stbir_info->alpha_unweight( encode_buffer, width_times_channels );
   6128     STBIR_PROFILE_END( unalpha );
   6129   }
   6130 
   6131   // write directly into output by default
   6132   output_buffer = output_buffer_data;
   6133 
    6134   // if we have an output callback, we first convert the float buffer in place (and then hand that to the callback)
   6135   if ( stbir_info->out_pixels_cb )
   6136     output_buffer = encode_buffer;
   6137 
   6138   STBIR_PROFILE_START( encode );
   6139   // convert into the output buffer
   6140   stbir_info->encode_pixels( output_buffer, width_times_channels, encode_buffer );
   6141   STBIR_PROFILE_END( encode );
   6142 
   6143   // if we have an output callback, call it to send the data
   6144   if ( stbir_info->out_pixels_cb )
   6145     stbir_info->out_pixels_cb( output_buffer, num_pixels, row, stbir_info->user_data );
   6146 }
   6147 
   6148 
   6149 // Get the ring buffer pointer for an index
   6150 static float* stbir__get_ring_buffer_entry(stbir__info const * stbir_info, stbir__per_split_info const * split_info, int index )
   6151 {
   6152   STBIR_ASSERT( index < stbir_info->ring_buffer_num_entries );
   6153 
   6154   #ifdef STBIR__SEPARATE_ALLOCATIONS
   6155     return split_info->ring_buffers[ index ];
   6156   #else
   6157     return (float*) ( ( (char*) split_info->ring_buffer ) + ( index * stbir_info->ring_buffer_length_bytes ) );
   6158   #endif
   6159 }
   6160 
   6161 // Get the specified scan line from the ring buffer
   6162 static float* stbir__get_ring_buffer_scanline(stbir__info const * stbir_info, stbir__per_split_info const * split_info, int get_scanline)
   6163 {
   6164   int ring_buffer_index = (split_info->ring_buffer_begin_index + (get_scanline - split_info->ring_buffer_first_scanline)) % stbir_info->ring_buffer_num_entries;
   6165   return stbir__get_ring_buffer_entry( stbir_info, split_info, ring_buffer_index );
   6166 }
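
// (editor's note) Worked example of the indexing above, with hypothetical
// values: if ring_buffer_num_entries == 4, ring_buffer_first_scanline == 10
// and ring_buffer_begin_index == 2, then scanline 13 lives in entry
// ( 2 + ( 13 - 10 ) ) % 4 == 1, so logically consecutive scanlines simply
// wrap around the physical ring buffer entries.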
   6167 
   6168 static void stbir__resample_horizontal_gather(stbir__info const * stbir_info, float* output_buffer, float const * input_buffer STBIR_ONLY_PROFILE_GET_SPLIT_INFO )
   6169 {
   6170   float const * decode_buffer = input_buffer - ( stbir_info->scanline_extents.conservative.n0 * stbir_info->effective_channels );
   6171 
   6172   STBIR_PROFILE_START( horizontal );
   6173   if ( ( stbir_info->horizontal.filter_enum == STBIR_FILTER_POINT_SAMPLE ) && ( stbir_info->horizontal.scale_info.scale == 1.0f ) )
   6174     STBIR_MEMCPY( output_buffer, input_buffer, stbir_info->horizontal.scale_info.output_sub_size * sizeof( float ) * stbir_info->effective_channels );
   6175   else
   6176     stbir_info->horizontal_gather_channels( output_buffer, stbir_info->horizontal.scale_info.output_sub_size, decode_buffer, stbir_info->horizontal.contributors, stbir_info->horizontal.coefficients, stbir_info->horizontal.coefficient_width );
   6177   STBIR_PROFILE_END( horizontal );
   6178 }
   6179 
   6180 static void stbir__resample_vertical_gather(stbir__info const * stbir_info, stbir__per_split_info* split_info, int n, int contrib_n0, int contrib_n1, float const * vertical_coefficients )
   6181 {
   6182   float* encode_buffer = split_info->vertical_buffer;
   6183   float* decode_buffer = split_info->decode_buffer;
   6184   int vertical_first = stbir_info->vertical_first;
   6185   int width = (vertical_first) ? ( stbir_info->scanline_extents.conservative.n1-stbir_info->scanline_extents.conservative.n0+1 ) : stbir_info->horizontal.scale_info.output_sub_size;
   6186   int width_times_channels = stbir_info->effective_channels * width;
   6187 
   6188   STBIR_ASSERT( stbir_info->vertical.is_gather );
   6189 
   6190   // loop over the contributing scanlines and scale into the buffer
   6191   STBIR_PROFILE_START( vertical );
   6192   {
   6193     int k = 0, total = contrib_n1 - contrib_n0 + 1;
   6194     STBIR_ASSERT( total > 0 );
   6195     do {
   6196       float const * inputs[8];
   6197       int i, cnt = total; if ( cnt > 8 ) cnt = 8;
   6198       for( i = 0 ; i < cnt ; i++ )
   6199         inputs[ i ] = stbir__get_ring_buffer_scanline(stbir_info, split_info, k+i+contrib_n0 );
   6200 
   6201       // call the N scanlines at a time function (up to 8 scanlines of blending at once)
   6202       ((k==0)?stbir__vertical_gathers:stbir__vertical_gathers_continues)[cnt-1]( (vertical_first) ? decode_buffer : encode_buffer, vertical_coefficients + k, inputs, inputs[0] + width_times_channels );
   6203       k += cnt;
   6204       total -= cnt;
   6205     } while ( total );
   6206   }
   6207   STBIR_PROFILE_END( vertical );
   6208 
   6209   if ( vertical_first )
   6210   {
   6211     // Now resample the gathered vertical data in the horizontal axis into the encode buffer
   6212     stbir__resample_horizontal_gather(stbir_info, encode_buffer, decode_buffer  STBIR_ONLY_PROFILE_SET_SPLIT_INFO );
   6213   }
   6214 
   6215   stbir__encode_scanline( stbir_info, ( (char *) stbir_info->output_data ) + ((size_t)n * (size_t)stbir_info->output_stride_bytes),
   6216                           encode_buffer, n  STBIR_ONLY_PROFILE_SET_SPLIT_INFO );
   6217 }
   6218 
   6219 static void stbir__decode_and_resample_for_vertical_gather_loop(stbir__info const * stbir_info, stbir__per_split_info* split_info, int n)
   6220 {
   6221   int ring_buffer_index;
   6222   float* ring_buffer;
   6223 
   6224   // Decode the nth scanline from the source image into the decode buffer.
   6225   stbir__decode_scanline( stbir_info, n, split_info->decode_buffer  STBIR_ONLY_PROFILE_SET_SPLIT_INFO );
   6226 
   6227   // update new end scanline
   6228   split_info->ring_buffer_last_scanline = n;
   6229 
   6230   // get ring buffer
   6231   ring_buffer_index = (split_info->ring_buffer_begin_index + (split_info->ring_buffer_last_scanline - split_info->ring_buffer_first_scanline)) % stbir_info->ring_buffer_num_entries;
   6232   ring_buffer = stbir__get_ring_buffer_entry(stbir_info, split_info, ring_buffer_index);
   6233 
   6234   // Now resample it into the ring buffer.
   6235   stbir__resample_horizontal_gather( stbir_info, ring_buffer, split_info->decode_buffer  STBIR_ONLY_PROFILE_SET_SPLIT_INFO );
   6236 
   6237   // Now it's sitting in the ring buffer ready to be used as source for the vertical sampling.
   6238 }
   6239 
   6240 static void stbir__vertical_gather_loop( stbir__info const * stbir_info, stbir__per_split_info* split_info, int split_count )
   6241 {
   6242   int y, start_output_y, end_output_y;
   6243   stbir__contributors* vertical_contributors = stbir_info->vertical.contributors;
   6244   float const * vertical_coefficients = stbir_info->vertical.coefficients;
   6245 
   6246   STBIR_ASSERT( stbir_info->vertical.is_gather );
   6247 
   6248   start_output_y = split_info->start_output_y;
   6249   end_output_y = split_info[split_count-1].end_output_y;
   6250 
   6251   vertical_contributors += start_output_y;
   6252   vertical_coefficients += start_output_y * stbir_info->vertical.coefficient_width;
   6253 
   6254   // initialize the ring buffer for gathering
   6255   split_info->ring_buffer_begin_index = 0;
   6256   split_info->ring_buffer_first_scanline = vertical_contributors->n0;
   6257   split_info->ring_buffer_last_scanline = split_info->ring_buffer_first_scanline - 1; // means "empty"
   6258 
   6259   for (y = start_output_y; y < end_output_y; y++)
   6260   {
   6261     int in_first_scanline, in_last_scanline;
   6262 
   6263     in_first_scanline = vertical_contributors->n0;
   6264     in_last_scanline = vertical_contributors->n1;
   6265 
   6266     // make sure the indexing hasn't broken
   6267     STBIR_ASSERT( in_first_scanline >= split_info->ring_buffer_first_scanline );
   6268 
   6269     // Load in new scanlines
   6270     while (in_last_scanline > split_info->ring_buffer_last_scanline)
   6271     {
   6272       STBIR_ASSERT( ( split_info->ring_buffer_last_scanline - split_info->ring_buffer_first_scanline + 1 ) <= stbir_info->ring_buffer_num_entries );
   6273 
    6274       // if the ring buffer is full, drop the oldest scanline to make room for the new one
   6275       if ( ( split_info->ring_buffer_last_scanline - split_info->ring_buffer_first_scanline + 1 ) == stbir_info->ring_buffer_num_entries )
   6276       {
   6277         split_info->ring_buffer_first_scanline++;
   6278         split_info->ring_buffer_begin_index++;
   6279       }
   6280 
   6281       if ( stbir_info->vertical_first )
   6282       {
   6283         float * ring_buffer = stbir__get_ring_buffer_scanline( stbir_info, split_info, ++split_info->ring_buffer_last_scanline );
   6284         // Decode the nth scanline from the source image into the decode buffer.
   6285         stbir__decode_scanline( stbir_info, split_info->ring_buffer_last_scanline, ring_buffer  STBIR_ONLY_PROFILE_SET_SPLIT_INFO );
   6286       }
   6287       else
   6288       {
   6289         stbir__decode_and_resample_for_vertical_gather_loop(stbir_info, split_info, split_info->ring_buffer_last_scanline + 1);
   6290       }
   6291     }
   6292 
   6293     // Now all buffers should be ready to write a row of vertical sampling, so do it.
   6294     stbir__resample_vertical_gather(stbir_info, split_info, y, in_first_scanline, in_last_scanline, vertical_coefficients );
   6295 
   6296     ++vertical_contributors;
   6297     vertical_coefficients += stbir_info->vertical.coefficient_width;
   6298   }
   6299 }
   6300 
   6301 #define STBIR__FLOAT_EMPTY_MARKER 3.0e+38F
   6302 #define STBIR__FLOAT_BUFFER_IS_EMPTY(ptr) ((ptr)[0]==STBIR__FLOAT_EMPTY_MARKER)
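
// (editor's note) The marker is a float value that real pixel data is not
// expected to reach; writing it into the first float of a ring buffer row tags
// the row as "empty".  stbir__resample_vertical_scatter uses the tag to choose
// between the overwriting "set" kernels and the accumulating "blend" kernels,
// and the two scatter eviction helpers below re-tag a row once it has been
// encoded out.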
   6303 
   6304 static void stbir__encode_first_scanline_from_scatter(stbir__info const * stbir_info, stbir__per_split_info* split_info)
   6305 {
   6306   // evict a scanline out into the output buffer
   6307   float* ring_buffer_entry = stbir__get_ring_buffer_entry(stbir_info, split_info, split_info->ring_buffer_begin_index );
   6308 
   6309   // dump the scanline out
   6310   stbir__encode_scanline( stbir_info, ( (char *)stbir_info->output_data ) + ( (size_t)split_info->ring_buffer_first_scanline * (size_t)stbir_info->output_stride_bytes ), ring_buffer_entry, split_info->ring_buffer_first_scanline  STBIR_ONLY_PROFILE_SET_SPLIT_INFO );
   6311 
   6312   // mark it as empty
   6313   ring_buffer_entry[ 0 ] = STBIR__FLOAT_EMPTY_MARKER;
   6314 
   6315   // advance the first scanline
   6316   split_info->ring_buffer_first_scanline++;
   6317   if ( ++split_info->ring_buffer_begin_index == stbir_info->ring_buffer_num_entries )
   6318     split_info->ring_buffer_begin_index = 0;
   6319 }
   6320 
   6321 static void stbir__horizontal_resample_and_encode_first_scanline_from_scatter(stbir__info const * stbir_info, stbir__per_split_info* split_info)
   6322 {
   6323   // evict a scanline out into the output buffer
   6324 
   6325   float* ring_buffer_entry = stbir__get_ring_buffer_entry(stbir_info, split_info, split_info->ring_buffer_begin_index );
   6326 
   6327   // Now resample it into the buffer.
   6328   stbir__resample_horizontal_gather( stbir_info, split_info->vertical_buffer, ring_buffer_entry  STBIR_ONLY_PROFILE_SET_SPLIT_INFO );
   6329 
   6330   // dump the scanline out
   6331   stbir__encode_scanline( stbir_info, ( (char *)stbir_info->output_data ) + ( (size_t)split_info->ring_buffer_first_scanline * (size_t)stbir_info->output_stride_bytes ), split_info->vertical_buffer, split_info->ring_buffer_first_scanline  STBIR_ONLY_PROFILE_SET_SPLIT_INFO );
   6332 
   6333   // mark it as empty
   6334   ring_buffer_entry[ 0 ] = STBIR__FLOAT_EMPTY_MARKER;
   6335 
   6336   // advance the first scanline
   6337   split_info->ring_buffer_first_scanline++;
   6338   if ( ++split_info->ring_buffer_begin_index == stbir_info->ring_buffer_num_entries )
   6339     split_info->ring_buffer_begin_index = 0;
   6340 }
   6341 
   6342 static void stbir__resample_vertical_scatter(stbir__info const * stbir_info, stbir__per_split_info* split_info, int n0, int n1, float const * vertical_coefficients, float const * vertical_buffer, float const * vertical_buffer_end )
   6343 {
   6344   STBIR_ASSERT( !stbir_info->vertical.is_gather );
   6345 
   6346   STBIR_PROFILE_START( vertical );
   6347   {
   6348     int k = 0, total = n1 - n0 + 1;
   6349     STBIR_ASSERT( total > 0 );
   6350     do {
   6351       float * outputs[8];
   6352       int i, n = total; if ( n > 8 ) n = 8;
   6353       for( i = 0 ; i < n ; i++ )
   6354       {
   6355         outputs[ i ] = stbir__get_ring_buffer_scanline(stbir_info, split_info, k+i+n0 );
   6356         if ( ( i ) && ( STBIR__FLOAT_BUFFER_IS_EMPTY( outputs[i] ) != STBIR__FLOAT_BUFFER_IS_EMPTY( outputs[0] ) ) ) // make sure runs are of the same type
   6357         {
   6358           n = i;
   6359           break;
   6360         }
   6361       }
   6362       // call the scatter to N scanlines at a time function (up to 8 scanlines of scattering at once)
   6363       ((STBIR__FLOAT_BUFFER_IS_EMPTY( outputs[0] ))?stbir__vertical_scatter_sets:stbir__vertical_scatter_blends)[n-1]( outputs, vertical_coefficients + k, vertical_buffer, vertical_buffer_end );
   6364       k += n;
   6365       total -= n;
   6366     } while ( total );
   6367   }
   6368 
   6369   STBIR_PROFILE_END( vertical );
   6370 }
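
// (editor's note) In the loop above, a batch of up to 8 output rows is cut
// short as soon as a row's "empty" state differs from the first row's, so each
// call through the tables is uniformly a "set" (rows still tagged empty) or a
// "blend" (rows already holding partial sums); the remaining rows are picked up
// on the next pass of the do/while.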
   6371 
   6372 typedef void stbir__handle_scanline_for_scatter_func(stbir__info const * stbir_info, stbir__per_split_info* split_info);
   6373 
   6374 static void stbir__vertical_scatter_loop( stbir__info const * stbir_info, stbir__per_split_info* split_info, int split_count )
   6375 {
   6376   int y, start_output_y, end_output_y, start_input_y, end_input_y;
   6377   stbir__contributors* vertical_contributors = stbir_info->vertical.contributors;
   6378   float const * vertical_coefficients = stbir_info->vertical.coefficients;
   6379   stbir__handle_scanline_for_scatter_func * handle_scanline_for_scatter;
   6380   void * scanline_scatter_buffer;
   6381   void * scanline_scatter_buffer_end;
   6382   int on_first_input_y, last_input_y;
   6383 
   6384   STBIR_ASSERT( !stbir_info->vertical.is_gather );
   6385 
   6386   start_output_y = split_info->start_output_y;
   6387   end_output_y = split_info[split_count-1].end_output_y;  // may do multiple split counts
   6388 
   6389   start_input_y = split_info->start_input_y;
   6390   end_input_y = split_info[split_count-1].end_input_y;
   6391 
   6392   // adjust for starting offset start_input_y
   6393   y = start_input_y + stbir_info->vertical.filter_pixel_margin;
   6394   vertical_contributors += y ;
   6395   vertical_coefficients += stbir_info->vertical.coefficient_width * y;
   6396 
   6397   if ( stbir_info->vertical_first )
   6398   {
   6399     handle_scanline_for_scatter = stbir__horizontal_resample_and_encode_first_scanline_from_scatter;
   6400     scanline_scatter_buffer = split_info->decode_buffer;
   6401     scanline_scatter_buffer_end = ( (char*) scanline_scatter_buffer ) + sizeof( float ) * stbir_info->effective_channels * (stbir_info->scanline_extents.conservative.n1-stbir_info->scanline_extents.conservative.n0+1);
   6402   }
   6403   else
   6404   {
   6405     handle_scanline_for_scatter = stbir__encode_first_scanline_from_scatter;
   6406     scanline_scatter_buffer = split_info->vertical_buffer;
   6407     scanline_scatter_buffer_end = ( (char*) scanline_scatter_buffer ) + sizeof( float ) * stbir_info->effective_channels * stbir_info->horizontal.scale_info.output_sub_size;
   6408   }
   6409 
   6410   // initialize the ring buffer for scattering
   6411   split_info->ring_buffer_first_scanline = start_output_y;
   6412   split_info->ring_buffer_last_scanline = -1;
   6413   split_info->ring_buffer_begin_index = -1;
   6414 
   6415   // mark all the buffers as empty to start
   6416   for( y = 0 ; y < stbir_info->ring_buffer_num_entries ; y++ )
   6417     stbir__get_ring_buffer_entry( stbir_info, split_info, y )[0] = STBIR__FLOAT_EMPTY_MARKER; // only used on scatter
   6418 
   6419   // do the loop in input space
   6420   on_first_input_y = 1; last_input_y = start_input_y;
   6421   for (y = start_input_y ; y < end_input_y; y++)
   6422   {
   6423     int out_first_scanline, out_last_scanline;
   6424 
   6425     out_first_scanline = vertical_contributors->n0;
   6426     out_last_scanline = vertical_contributors->n1;
   6427 
   6428     STBIR_ASSERT(out_last_scanline - out_first_scanline + 1 <= stbir_info->ring_buffer_num_entries);
   6429 
   6430     if ( ( out_last_scanline >= out_first_scanline ) && ( ( ( out_first_scanline >= start_output_y ) && ( out_first_scanline < end_output_y ) ) || ( ( out_last_scanline >= start_output_y ) && ( out_last_scanline < end_output_y ) ) ) )
   6431     {
   6432       float const * vc = vertical_coefficients;
   6433 
   6434       // keep track of the range actually seen for the next resize
   6435       last_input_y = y;
   6436       if ( ( on_first_input_y ) && ( y > start_input_y ) )
   6437         split_info->start_input_y = y;
   6438       on_first_input_y = 0;
   6439 
   6440       // clip the region
   6441       if ( out_first_scanline < start_output_y )
   6442       {
   6443         vc += start_output_y - out_first_scanline;
   6444         out_first_scanline = start_output_y;
   6445       }
   6446 
   6447       if ( out_last_scanline >= end_output_y )
   6448         out_last_scanline = end_output_y - 1;
   6449 
   6450       // if very first scanline, init the index
   6451       if (split_info->ring_buffer_begin_index < 0)
   6452         split_info->ring_buffer_begin_index = out_first_scanline - start_output_y;
   6453 
   6454       STBIR_ASSERT( split_info->ring_buffer_begin_index <= out_first_scanline );
   6455 
   6456       // Decode the nth scanline from the source image into the decode buffer.
   6457       stbir__decode_scanline( stbir_info, y, split_info->decode_buffer  STBIR_ONLY_PROFILE_SET_SPLIT_INFO );
   6458 
   6459       // When horizontal first, we resample horizontally into the vertical buffer before we scatter it out
   6460       if ( !stbir_info->vertical_first )
   6461         stbir__resample_horizontal_gather( stbir_info, split_info->vertical_buffer, split_info->decode_buffer  STBIR_ONLY_PROFILE_SET_SPLIT_INFO );
   6462 
   6463       // Now it's sitting in the buffer ready to be distributed into the ring buffers.
   6464 
    6465       // evict from the ring buffer if it's full and we need room for more output scanlines
   6466       if ( ( ( split_info->ring_buffer_last_scanline - split_info->ring_buffer_first_scanline + 1 ) == stbir_info->ring_buffer_num_entries ) &&
   6467            ( out_last_scanline > split_info->ring_buffer_last_scanline ) )
   6468         handle_scanline_for_scatter( stbir_info, split_info );
   6469 
   6470       // Now the horizontal buffer is ready to write to all ring buffer rows, so do it.
   6471       stbir__resample_vertical_scatter(stbir_info, split_info, out_first_scanline, out_last_scanline, vc, (float*)scanline_scatter_buffer, (float*)scanline_scatter_buffer_end );
   6472 
   6473       // update the end of the buffer
   6474       if ( out_last_scanline > split_info->ring_buffer_last_scanline )
   6475         split_info->ring_buffer_last_scanline = out_last_scanline;
   6476     }
   6477     ++vertical_contributors;
   6478     vertical_coefficients += stbir_info->vertical.coefficient_width;
   6479   }
   6480 
   6481   // now evict the scanlines that are left over in the ring buffer
   6482   while ( split_info->ring_buffer_first_scanline < end_output_y )
   6483     handle_scanline_for_scatter(stbir_info, split_info);
   6484 
   6485   // update the end_input_y if we do multiple resizes with the same data
   6486   ++last_input_y;
   6487   for( y = 0 ; y < split_count; y++ )
   6488     if ( split_info[y].end_input_y > last_input_y )
   6489       split_info[y].end_input_y = last_input_y;
   6490 }
   6491 
   6492 
   6493 static stbir__kernel_callback * stbir__builtin_kernels[] =   { 0, stbir__filter_trapezoid,  stbir__filter_triangle, stbir__filter_cubic, stbir__filter_catmullrom, stbir__filter_mitchell, stbir__filter_point };
   6494 static stbir__support_callback * stbir__builtin_supports[] = { 0, stbir__support_trapezoid, stbir__support_one,     stbir__support_two,  stbir__support_two,       stbir__support_two,     stbir__support_zeropoint5 };
   6495 
   6496 static void stbir__set_sampler(stbir__sampler * samp, stbir_filter filter, stbir__kernel_callback * kernel, stbir__support_callback * support, stbir_edge edge, stbir__scale_info * scale_info, int always_gather, void * user_data )
   6497 {
   6498   // set filter
   6499   if (filter == 0)
   6500   {
   6501     filter = STBIR_DEFAULT_FILTER_DOWNSAMPLE; // default to downsample
   6502     if (scale_info->scale >= ( 1.0f - stbir__small_float ) )
   6503     {
   6504       if ( (scale_info->scale <= ( 1.0f + stbir__small_float ) ) && ( STBIR_CEILF(scale_info->pixel_shift) == scale_info->pixel_shift ) )
   6505         filter = STBIR_FILTER_POINT_SAMPLE;
   6506       else
   6507         filter = STBIR_DEFAULT_FILTER_UPSAMPLE;
   6508     }
   6509   }
   6510   samp->filter_enum = filter;
   6511 
   6512   STBIR_ASSERT(samp->filter_enum != 0);
   6513   STBIR_ASSERT((unsigned)samp->filter_enum < STBIR_FILTER_OTHER);
   6514   samp->filter_kernel = stbir__builtin_kernels[ filter ];
   6515   samp->filter_support = stbir__builtin_supports[ filter ];
   6516 
   6517   if ( kernel && support )
   6518   {
   6519     samp->filter_kernel = kernel;
   6520     samp->filter_support = support;
   6521     samp->filter_enum = STBIR_FILTER_OTHER;
   6522   }
   6523 
   6524   samp->edge = edge;
   6525   samp->filter_pixel_width  = stbir__get_filter_pixel_width (samp->filter_support, scale_info->scale, user_data );
    6526   // Gather is always better, but in extreme downsamples, you have to keep most or all of the data in memory
   6527   //    For horizontal, we always have all the pixels, so we always use gather here (always_gather==1).
   6528   //    For vertical, we use gather if scaling up (which means we will have samp->filter_pixel_width
   6529   //    scanlines in memory at once).
   6530   samp->is_gather = 0;
   6531   if ( scale_info->scale >= ( 1.0f - stbir__small_float ) )
   6532     samp->is_gather = 1;
   6533   else if ( ( always_gather ) || ( samp->filter_pixel_width <= STBIR_FORCE_GATHER_FILTER_SCANLINES_AMOUNT ) )
   6534     samp->is_gather = 2;
   6535 
   6536   // pre calculate stuff based on the above
   6537   samp->coefficient_width = stbir__get_coefficient_width(samp, samp->is_gather, user_data);
   6538 
   6539   // filter_pixel_width is the conservative size in pixels of input that affect an output pixel.
   6540   //   In rare cases (only with 2 pix to 1 pix with the default filters), it's possible that the 
   6541   //   filter will extend before or after the scanline beyond just one extra entire copy of the 
   6542   //   scanline (we would hit the edge twice). We don't let you do that, so we clamp the total 
    6543   //   width to 3x the total input pixel count (once for the scanline, once for the left side 
    6544   //   overhang, and once for the right side). We only do this for wrap edge mode, since the other 
   6545   //   modes can just re-edge clamp back in again.
   6546   if ( edge == STBIR_EDGE_WRAP )
   6547     if ( samp->filter_pixel_width > ( scale_info->input_full_size * 3 ) )
   6548       samp->filter_pixel_width = scale_info->input_full_size * 3;
   6549 
   6550   // This is how much to expand buffers to account for filters seeking outside
   6551   // the image boundaries.
   6552   samp->filter_pixel_margin = samp->filter_pixel_width / 2;
   6553   
   6554   // filter_pixel_margin is the amount that this filter can overhang on just one side of either 
   6555   //   end of the scanline (left or the right). Since we only allow you to overhang 1 scanline's 
   6556   //   worth of pixels, we clamp this one side of overhang to the input scanline size. Again, 
   6557   //   this clamping only happens in rare cases with the default filters (2 pix to 1 pix). 
   6558   if ( edge == STBIR_EDGE_WRAP )
   6559     if ( samp->filter_pixel_margin > scale_info->input_full_size )
   6560       samp->filter_pixel_margin = scale_info->input_full_size;
   6561 
   6562   samp->num_contributors = stbir__get_contributors(samp, samp->is_gather);
   6563 
   6564   samp->contributors_size = samp->num_contributors * sizeof(stbir__contributors);
   6565   samp->coefficients_size = samp->num_contributors * samp->coefficient_width * sizeof(float) + sizeof(float); // extra sizeof(float) is padding
   6566 
   6567   samp->gather_prescatter_contributors = 0;
   6568   samp->gather_prescatter_coefficients = 0;
   6569   if ( samp->is_gather == 0 )
   6570   {
   6571     samp->gather_prescatter_coefficient_width = samp->filter_pixel_width;
   6572     samp->gather_prescatter_num_contributors  = stbir__get_contributors(samp, 2);
   6573     samp->gather_prescatter_contributors_size = samp->gather_prescatter_num_contributors * sizeof(stbir__contributors);
   6574     samp->gather_prescatter_coefficients_size = samp->gather_prescatter_num_contributors * samp->gather_prescatter_coefficient_width * sizeof(float);
   6575   }
   6576 }
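
// (editor's note) After stbir__set_sampler, samp->is_gather encodes the
// strategy: 0 == scatter (vertical downsample), 1 == gather for an upsample
// (scale >= ~1x), 2 == gather forced for a downsample (always_gather set, or
// the filter only touches a few scanlines).  stbir__get_conservative_extents
// below treats 1 and 2 slightly differently.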
   6577 
   6578 static void stbir__get_conservative_extents( stbir__sampler * samp, stbir__contributors * range, void * user_data )
   6579 {
   6580   float scale = samp->scale_info.scale;
   6581   float out_shift = samp->scale_info.pixel_shift;
   6582   stbir__support_callback * support = samp->filter_support;
   6583   int input_full_size = samp->scale_info.input_full_size;
   6584   stbir_edge edge = samp->edge;
   6585   float inv_scale = samp->scale_info.inv_scale;
   6586 
   6587   STBIR_ASSERT( samp->is_gather != 0 );
   6588 
   6589   if ( samp->is_gather == 1 )
   6590   {
   6591     int in_first_pixel, in_last_pixel;
   6592     float out_filter_radius = support(inv_scale, user_data) * scale;
   6593 
   6594     stbir__calculate_in_pixel_range( &in_first_pixel, &in_last_pixel, 0.5, out_filter_radius, inv_scale, out_shift, input_full_size, edge );
   6595     range->n0 = in_first_pixel;
   6596     stbir__calculate_in_pixel_range( &in_first_pixel, &in_last_pixel, ( (float)(samp->scale_info.output_sub_size-1) ) + 0.5f, out_filter_radius, inv_scale, out_shift, input_full_size, edge );
   6597     range->n1 = in_last_pixel;
   6598   }
   6599   else if ( samp->is_gather == 2 ) // downsample gather, refine
   6600   {
   6601     float in_pixels_radius = support(scale, user_data) * inv_scale;
   6602     int filter_pixel_margin = samp->filter_pixel_margin;
   6603     int output_sub_size = samp->scale_info.output_sub_size;
   6604     int input_end;
   6605     int n;
   6606     int in_first_pixel, in_last_pixel;
   6607 
   6608     // get a conservative area of the input range
   6609     stbir__calculate_in_pixel_range( &in_first_pixel, &in_last_pixel, 0, 0, inv_scale, out_shift, input_full_size, edge );
   6610     range->n0 = in_first_pixel;
   6611     stbir__calculate_in_pixel_range( &in_first_pixel, &in_last_pixel, (float)output_sub_size, 0, inv_scale, out_shift, input_full_size, edge );
   6612     range->n1 = in_last_pixel;
   6613 
   6614     // now go through the margin to the start of area to find bottom
   6615     n = range->n0 + 1;
   6616     input_end = -filter_pixel_margin;
   6617     while( n >= input_end )
   6618     {
   6619       int out_first_pixel, out_last_pixel;
   6620       stbir__calculate_out_pixel_range( &out_first_pixel, &out_last_pixel, ((float)n)+0.5f, in_pixels_radius, scale, out_shift, output_sub_size );
   6621       if ( out_first_pixel > out_last_pixel )
   6622         break;
   6623 
   6624       if ( ( out_first_pixel < output_sub_size ) || ( out_last_pixel >= 0 ) )
   6625         range->n0 = n;
   6626       --n;
   6627     }
   6628 
   6629     // now go through the end of the area through the margin to find top
   6630     n = range->n1 - 1;
   6631     input_end = n + 1 + filter_pixel_margin;
   6632     while( n <= input_end )
   6633     {
   6634       int out_first_pixel, out_last_pixel;
   6635       stbir__calculate_out_pixel_range( &out_first_pixel, &out_last_pixel, ((float)n)+0.5f, in_pixels_radius, scale, out_shift, output_sub_size );
   6636       if ( out_first_pixel > out_last_pixel )
   6637         break;
   6638       if ( ( out_first_pixel < output_sub_size ) || ( out_last_pixel >= 0 ) )
   6639         range->n1 = n;
   6640       ++n;
   6641     }
   6642   }
   6643 
   6644   if ( samp->edge == STBIR_EDGE_WRAP )
   6645   {
   6646     // if we are wrapping, and we are very close to the image size (so the edges might merge), just use the scanline up to the edge
   6647     if ( ( range->n0 > 0 ) && ( range->n1 >= input_full_size ) )
   6648     {
   6649       int marg = range->n1 - input_full_size + 1;
   6650       if ( ( marg + STBIR__MERGE_RUNS_PIXEL_THRESHOLD ) >= range->n0 )
   6651         range->n0 = 0;
   6652     }
   6653     if ( ( range->n0 < 0 ) && ( range->n1 < (input_full_size-1) ) )
   6654     {
   6655       int marg = -range->n0;
   6656       if ( ( input_full_size - marg - STBIR__MERGE_RUNS_PIXEL_THRESHOLD - 1 ) <= range->n1 )
   6657         range->n1 = input_full_size - 1;
   6658     }
   6659   }
   6660   else
   6661   {
   6662     // for non-edge-wrap modes, we never read over the edge, so clamp
   6663     if ( range->n0 < 0 )
   6664       range->n0 = 0;
   6665     if ( range->n1 >= input_full_size )
   6666       range->n1 = input_full_size - 1;
   6667   }
   6668 }
   6669 
   6670 static void stbir__get_split_info( stbir__per_split_info* split_info, int splits, int output_height, int vertical_pixel_margin, int input_full_height )
   6671 {
   6672   int i, cur;
   6673   int left = output_height;
   6674 
   6675   cur = 0;
   6676   for( i = 0 ; i < splits ; i++ )
   6677   {
   6678     int each;
   6679     split_info[i].start_output_y = cur;
   6680     each = left / ( splits - i );
   6681     split_info[i].end_output_y = cur + each;
   6682     cur += each;
   6683     left -= each;
   6684 
   6685     // scatter range (updated to minimum as you run it)
   6686     split_info[i].start_input_y = -vertical_pixel_margin;
   6687     split_info[i].end_input_y = input_full_height + vertical_pixel_margin;
   6688   }
   6689 }
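
// (editor's note) Worked example with hypothetical numbers: output_height ==
// 100 and splits == 3 gives each = left / ( splits - i ) = 33, 33, 34, so the
// splits cover output rows [0,33), [33,66) and [66,100); every split starts
// with the full conservative input range [-margin, input_full_height+margin).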
   6690 
   6691 static void stbir__free_internal_mem( stbir__info *info )
   6692 {
   6693   #define STBIR__FREE_AND_CLEAR( ptr ) { if ( ptr ) { void * p = (ptr); (ptr) = 0; STBIR_FREE( p, info->user_data); } }
   6694 
   6695   if ( info )
   6696   {
   6697   #ifndef STBIR__SEPARATE_ALLOCATIONS
   6698     STBIR__FREE_AND_CLEAR( info->alloced_mem );
   6699   #else
   6700     int i,j;
   6701 
   6702     if ( ( info->vertical.gather_prescatter_contributors ) && ( (void*)info->vertical.gather_prescatter_contributors != (void*)info->split_info[0].decode_buffer ) )
   6703     {
   6704       STBIR__FREE_AND_CLEAR( info->vertical.gather_prescatter_coefficients );
   6705       STBIR__FREE_AND_CLEAR( info->vertical.gather_prescatter_contributors );
   6706     }
   6707     for( i = 0 ; i < info->splits ; i++ )
   6708     {
   6709       for( j = 0 ; j < info->alloc_ring_buffer_num_entries ; j++ )
   6710       {
   6711         #ifdef STBIR_SIMD8
   6712         if ( info->effective_channels == 3 )
   6713           --info->split_info[i].ring_buffers[j]; // avx in 3 channel mode needs one float at the start of the buffer
   6714         #endif
   6715         STBIR__FREE_AND_CLEAR( info->split_info[i].ring_buffers[j] );
   6716       }
   6717 
   6718       #ifdef STBIR_SIMD8
   6719       if ( info->effective_channels == 3 )
   6720         --info->split_info[i].decode_buffer; // avx in 3 channel mode needs one float at the start of the buffer
   6721       #endif
   6722       STBIR__FREE_AND_CLEAR( info->split_info[i].decode_buffer );
   6723       STBIR__FREE_AND_CLEAR( info->split_info[i].ring_buffers );
   6724       STBIR__FREE_AND_CLEAR( info->split_info[i].vertical_buffer );
   6725     }
   6726     STBIR__FREE_AND_CLEAR( info->split_info );
   6727     if ( info->vertical.coefficients != info->horizontal.coefficients )
   6728     {
   6729       STBIR__FREE_AND_CLEAR( info->vertical.coefficients );
   6730       STBIR__FREE_AND_CLEAR( info->vertical.contributors );
   6731     }
   6732     STBIR__FREE_AND_CLEAR( info->horizontal.coefficients );
   6733     STBIR__FREE_AND_CLEAR( info->horizontal.contributors );
   6734     STBIR__FREE_AND_CLEAR( info->alloced_mem );
   6735     STBIR_FREE( info, info->user_data );
   6736   #endif
   6737   }
   6738 
   6739   #undef STBIR__FREE_AND_CLEAR
   6740 }
   6741 
   6742 static int stbir__get_max_split( int splits, int height )
   6743 {
   6744   int i;
   6745   int max = 0;
   6746 
   6747   for( i = 0 ; i < splits ; i++ )
   6748   {
   6749     int each = height / ( splits - i );
   6750     if ( each > max )
   6751       max = each;
   6752     height -= each;
   6753   }
   6754   return max;
   6755 }
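
// (editor's note) This mirrors the division done in stbir__get_split_info: for
// the hypothetical 100-row, 3-way example above it returns 34, the largest
// output height handed to any single split.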
   6756 
   6757 static stbir__horizontal_gather_channels_func ** stbir__horizontal_gather_n_coeffs_funcs[8] =
   6758 {
   6759   0, stbir__horizontal_gather_1_channels_with_n_coeffs_funcs, stbir__horizontal_gather_2_channels_with_n_coeffs_funcs, stbir__horizontal_gather_3_channels_with_n_coeffs_funcs, stbir__horizontal_gather_4_channels_with_n_coeffs_funcs, 0,0, stbir__horizontal_gather_7_channels_with_n_coeffs_funcs
   6760 };
   6761 
   6762 static stbir__horizontal_gather_channels_func ** stbir__horizontal_gather_channels_funcs[8] =
   6763 {
   6764   0, stbir__horizontal_gather_1_channels_funcs, stbir__horizontal_gather_2_channels_funcs, stbir__horizontal_gather_3_channels_funcs, stbir__horizontal_gather_4_channels_funcs, 0,0, stbir__horizontal_gather_7_channels_funcs
   6765 };
   6766 
    6767 // resize classifications index the rows of the weight tables below (see stbir__should_do_vertical_first): 0 == vertical scatter, 1 == vertical gather <= 1x scale, 2 == vertical gather 1x-2x scale, 3 == vertical gather 2x-3x scale, 5 == vertical gather 3x-4x scale, 6 == vertical gather > 4x scale or <= 4 pixel output height, 7 == <= 4 pixel wide output column
   6768 #define STBIR_RESIZE_CLASSIFICATIONS 8
   6769 
   6770 static float stbir__compute_weights[5][STBIR_RESIZE_CLASSIFICATIONS][4]=  // 5 = 0=1chan, 1=2chan, 2=3chan, 3=4chan, 4=7chan
   6771 {
   6772   {
   6773     { 1.00000f, 1.00000f, 0.31250f, 1.00000f },
   6774     { 0.56250f, 0.59375f, 0.00000f, 0.96875f },
   6775     { 1.00000f, 0.06250f, 0.00000f, 1.00000f },
   6776     { 0.00000f, 0.09375f, 1.00000f, 1.00000f },
   6777     { 1.00000f, 1.00000f, 1.00000f, 1.00000f },
   6778     { 0.03125f, 0.12500f, 1.00000f, 1.00000f },
   6779     { 0.06250f, 0.12500f, 0.00000f, 1.00000f },
   6780     { 0.00000f, 1.00000f, 0.00000f, 0.03125f },
   6781   }, {
   6782     { 0.00000f, 0.84375f, 0.00000f, 0.03125f },
   6783     { 0.09375f, 0.93750f, 0.00000f, 0.78125f },
   6784     { 0.87500f, 0.21875f, 0.00000f, 0.96875f },
   6785     { 0.09375f, 0.09375f, 1.00000f, 1.00000f },
   6786     { 1.00000f, 1.00000f, 1.00000f, 1.00000f },
   6787     { 0.03125f, 0.12500f, 1.00000f, 1.00000f },
   6788     { 0.06250f, 0.12500f, 0.00000f, 1.00000f },
   6789     { 0.00000f, 1.00000f, 0.00000f, 0.53125f },
   6790   }, {
   6791     { 0.00000f, 0.53125f, 0.00000f, 0.03125f },
   6792     { 0.06250f, 0.96875f, 0.00000f, 0.53125f },
   6793     { 0.87500f, 0.18750f, 0.00000f, 0.93750f },
   6794     { 0.00000f, 0.09375f, 1.00000f, 1.00000f },
   6795     { 1.00000f, 1.00000f, 1.00000f, 1.00000f },
   6796     { 0.03125f, 0.12500f, 1.00000f, 1.00000f },
   6797     { 0.06250f, 0.12500f, 0.00000f, 1.00000f },
   6798     { 0.00000f, 1.00000f, 0.00000f, 0.56250f },
   6799   }, {
   6800     { 0.00000f, 0.50000f, 0.00000f, 0.71875f },
   6801     { 0.06250f, 0.84375f, 0.00000f, 0.87500f },
   6802     { 1.00000f, 0.50000f, 0.50000f, 0.96875f },
   6803     { 1.00000f, 0.09375f, 0.31250f, 0.50000f },
   6804     { 1.00000f, 1.00000f, 1.00000f, 1.00000f },
   6805     { 1.00000f, 0.03125f, 0.03125f, 0.53125f },
   6806     { 0.18750f, 0.12500f, 0.00000f, 1.00000f },
   6807     { 0.00000f, 1.00000f, 0.03125f, 0.18750f },
   6808   }, {
   6809     { 0.00000f, 0.59375f, 0.00000f, 0.96875f },
   6810     { 0.06250f, 0.81250f, 0.06250f, 0.59375f },
   6811     { 0.75000f, 0.43750f, 0.12500f, 0.96875f },
   6812     { 0.87500f, 0.06250f, 0.18750f, 0.43750f },
   6813     { 1.00000f, 1.00000f, 1.00000f, 1.00000f },
   6814     { 0.15625f, 0.12500f, 1.00000f, 1.00000f },
   6815     { 0.06250f, 0.12500f, 0.00000f, 1.00000f },
   6816     { 0.00000f, 1.00000f, 0.03125f, 0.34375f },
   6817   }
   6818 };
   6819 
    6820 // structure that allows us to query and override info for training the costs
   6821 typedef struct STBIR__V_FIRST_INFO
   6822 {
   6823   double v_cost, h_cost;
   6824   int control_v_first; // 0 = no control, 1 = force hori, 2 = force vert
   6825   int v_first;
   6826   int v_resize_classification;
   6827   int is_gather;
   6828 } STBIR__V_FIRST_INFO;
   6829 
   6830 #ifdef STBIR__V_FIRST_INFO_BUFFER
   6831 static STBIR__V_FIRST_INFO STBIR__V_FIRST_INFO_BUFFER = {0};
   6832 #define STBIR__V_FIRST_INFO_POINTER &STBIR__V_FIRST_INFO_BUFFER
   6833 #else
   6834 #define STBIR__V_FIRST_INFO_POINTER 0
   6835 #endif
   6836 
   6837 // Figure out whether to scale along the horizontal or vertical first.
    6838 //   This is only *super* important when you are scaling by a massively
   6839 //   different amount in the vertical vs the horizontal (for example, if
   6840 //   you are scaling by 2x in the width, and 0.5x in the height, then you
   6841 //   want to do the vertical scale first, because it's around 3x faster
   6842 //   in that order.
   6843 //
   6844 //   In more normal circumstances, this makes a 20-40% differences, so
   6845 //     it's good to get right, but not critical. The normal way that you
   6846 //     decide which direction goes first is just figuring out which
   6847 //     direction does more multiplies. But with modern CPUs with their
   6848 //     fancy caches and SIMD and high IPC abilities, so there's just a lot
   6849 //     more that goes into it.
   6850 //
   6851 //   My handwavy sort of solution is to have an app that does a whole
   6852 //     bunch of timing for both vertical and horizontal first modes,
   6853 //     and then another app that can read lots of these timing files
   6854 //     and try to search for the best weights to use. Dotimings.c
   6855 //     is the app that does a bunch of timings, and vf_train.c is the
   6856 //     app that solves for the best weights (and shows how well it
   6857 //     does currently).
   6858 
   6859 static int stbir__should_do_vertical_first( float weights_table[STBIR_RESIZE_CLASSIFICATIONS][4], int horizontal_filter_pixel_width, float horizontal_scale, int horizontal_output_size, int vertical_filter_pixel_width, float vertical_scale, int vertical_output_size, int is_gather, STBIR__V_FIRST_INFO * info )
   6860 {
   6861   double v_cost, h_cost;
   6862   float * weights;
   6863   int vertical_first;
   6864   int v_classification;
   6865 
   6866   // categorize the resize into buckets
   6867   if ( ( vertical_output_size <= 4 ) || ( horizontal_output_size <= 4 ) )
   6868     v_classification = ( vertical_output_size < horizontal_output_size ) ? 6 : 7;
   6869   else if ( vertical_scale <= 1.0f )
   6870     v_classification = ( is_gather ) ? 1 : 0;
   6871   else if ( vertical_scale <= 2.0f)
   6872     v_classification = 2;
   6873   else if ( vertical_scale <= 3.0f)
   6874     v_classification = 3;
   6875   else if ( vertical_scale <= 4.0f)
   6876     v_classification = 5;
   6877   else
   6878     v_classification = 6;
   6879 
   6880   // use the right weights
   6881   weights = weights_table[ v_classification ];
   6882 
   6883   // these are the costs when you don't take into account modern CPUs with high ipc and simd and caches - wish we had a better estimate
   6884   h_cost = (float)horizontal_filter_pixel_width * weights[0] + horizontal_scale * (float)vertical_filter_pixel_width * weights[1];
   6885   v_cost = (float)vertical_filter_pixel_width  * weights[2] + vertical_scale * (float)horizontal_filter_pixel_width * weights[3];
   6886 
   6887   // use computation estimate to decide vertical first or not
   6888   vertical_first = ( v_cost <= h_cost ) ? 1 : 0;
   6889 
   6890   // save these, if requested
   6891   if ( info )
   6892   {
   6893     info->h_cost = h_cost;
   6894     info->v_cost = v_cost;
   6895     info->v_resize_classification = v_classification;
   6896     info->v_first = vertical_first;
   6897     info->is_gather = is_gather;
   6898   }
   6899 
   6900   // and this allows us to override everything for testing (see dotiming.c)
   6901   if ( ( info ) && ( info->control_v_first ) )
   6902     vertical_first = ( info->control_v_first == 2 ) ? 1 : 0;
   6903 
   6904   return vertical_first;
   6905 }
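// (worked example, not part of the original library) Plugging hypothetical numbers into the
//   cost model above: for 4-channel pixels (the 4-channel weight table), a 2x upscale in both
//   axes with outputs larger than 4 pixels falls in classification 2, whose trained weights are
//   { 1.0, 0.5, 0.5, 0.96875 }. Assuming, purely for illustration, 8-tap filters in each
//   direction:
//
//      h_cost = 8*1.0 + 2.0*8*0.5     = 16.0
//      v_cost = 8*0.5 + 2.0*8*0.96875 = 19.5
//
//   v_cost > h_cost, so this resize would run horizontal-first. The 8-tap widths and 2x scales
//   are made-up inputs chosen only to show the arithmetic.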
   6906 
   6907 // layout lookups - must match stbir_internal_pixel_layout
   6908 static unsigned char stbir__pixel_channels[] = {
   6909   1,2,3,3,4,   // 1ch, 2ch, rgb, bgr, 4ch
   6910   4,4,4,4,2,2, // RGBA,BGRA,ARGB,ABGR,RA,AR
   6911   4,4,4,4,2,2, // RGBA_PM,BGRA_PM,ARGB_PM,ABGR_PM,RA_PM,AR_PM
   6912 };
   6913 
   6914 // the internal pixel layout enums are in a different order, so we can easily do range comparisons of types
   6915 //   the public pixel layout is ordered in a way that if you cast num_channels (1-4) to the enum, you get something sensible
   6916 static stbir_internal_pixel_layout stbir__pixel_layout_convert_public_to_internal[] = {
   6917   STBIRI_BGR, STBIRI_1CHANNEL, STBIRI_2CHANNEL, STBIRI_RGB, STBIRI_RGBA,
   6918   STBIRI_4CHANNEL, STBIRI_BGRA, STBIRI_ARGB, STBIRI_ABGR, STBIRI_RA, STBIRI_AR,
   6919   STBIRI_RGBA_PM, STBIRI_BGRA_PM, STBIRI_ARGB_PM, STBIRI_ABGR_PM, STBIRI_RA_PM, STBIRI_AR_PM,
   6920 };
   6921 
   6922 static stbir__info * stbir__alloc_internal_mem_and_build_samplers( stbir__sampler * horizontal, stbir__sampler * vertical, stbir__contributors * conservative, stbir_pixel_layout input_pixel_layout_public, stbir_pixel_layout output_pixel_layout_public, int splits, int new_x, int new_y, int fast_alpha, void * user_data STBIR_ONLY_PROFILE_BUILD_GET_INFO )
   6923 {
   6924   static char stbir_channel_count_index[8]={ 9,0,1,2, 3,9,9,4 };
   6925 
   6926   stbir__info * info = 0;
   6927   void * alloced = 0;
   6928   size_t alloced_total = 0;
   6929   int vertical_first;
   6930   int decode_buffer_size, ring_buffer_length_bytes, ring_buffer_size, vertical_buffer_size, alloc_ring_buffer_num_entries;
   6931 
   6932   int alpha_weighting_type = 0; // 0=none, 1=simple weight only (premultiplied output), 2=fancy weight and unweight, 3=simple unweight only (premultiplied input), 4=simple weight and unweight (fast alpha)
   6933   int conservative_split_output_size = stbir__get_max_split( splits, vertical->scale_info.output_sub_size );
   6934   stbir_internal_pixel_layout input_pixel_layout = stbir__pixel_layout_convert_public_to_internal[ input_pixel_layout_public ];
   6935   stbir_internal_pixel_layout output_pixel_layout = stbir__pixel_layout_convert_public_to_internal[ output_pixel_layout_public ];
   6936   int channels = stbir__pixel_channels[ input_pixel_layout ];
   6937   int effective_channels = channels;
   6938 
   6939   // first figure out what type of alpha weighting to use (if any)
   6940   if ( ( horizontal->filter_enum != STBIR_FILTER_POINT_SAMPLE ) || ( vertical->filter_enum != STBIR_FILTER_POINT_SAMPLE ) ) // no alpha weighting on point sampling
   6941   {
   6942     if ( ( input_pixel_layout >= STBIRI_RGBA ) && ( input_pixel_layout <= STBIRI_AR ) && ( output_pixel_layout >= STBIRI_RGBA ) && ( output_pixel_layout <= STBIRI_AR ) )
   6943     {
   6944       if ( fast_alpha )
   6945       {
   6946         alpha_weighting_type = 4;
   6947       }
   6948       else
   6949       {
   6950         static int fancy_alpha_effective_cnts[6] = { 7, 7, 7, 7, 3, 3 };
   6951         alpha_weighting_type = 2;
   6952         effective_channels = fancy_alpha_effective_cnts[ input_pixel_layout - STBIRI_RGBA ];
   6953       }
   6954     }
   6955     else if ( ( input_pixel_layout >= STBIRI_RGBA_PM ) && ( input_pixel_layout <= STBIRI_AR_PM ) && ( output_pixel_layout >= STBIRI_RGBA ) && ( output_pixel_layout <= STBIRI_AR ) )
   6956     {
   6957       // input premult, output non-premult
   6958       alpha_weighting_type = 3;
   6959     }
   6960     else if ( ( input_pixel_layout >= STBIRI_RGBA ) && ( input_pixel_layout <= STBIRI_AR ) && ( output_pixel_layout >= STBIRI_RGBA_PM ) && ( output_pixel_layout <= STBIRI_AR_PM ) )
   6961     {
   6962       // input non-premult, output premult
   6963       alpha_weighting_type = 1;
   6964     }
   6965   }
   6966 
   6967   // channel in and out count must match currently
   6968   if ( channels != stbir__pixel_channels[ output_pixel_layout ] )
   6969     return 0;
   6970 
   6971   // get vertical first
   6972   vertical_first = stbir__should_do_vertical_first( stbir__compute_weights[ (int)stbir_channel_count_index[ effective_channels ] ], horizontal->filter_pixel_width, horizontal->scale_info.scale, horizontal->scale_info.output_sub_size, vertical->filter_pixel_width, vertical->scale_info.scale, vertical->scale_info.output_sub_size, vertical->is_gather, STBIR__V_FIRST_INFO_POINTER );
   6973 
   6974   // we sometimes read one float past the end in some of the unrolled loops (with a zero coefficient weight, so it has no effect)
   6975   decode_buffer_size = ( conservative->n1 - conservative->n0 + 1 ) * effective_channels * sizeof(float) + sizeof(float); // extra float for padding
   6976 
   6977 #if defined( STBIR__SEPARATE_ALLOCATIONS ) && defined(STBIR_SIMD8)
   6978   if ( effective_channels == 3 )
   6979     decode_buffer_size += sizeof(float); // avx in 3 channel mode needs one float at the start of the buffer (only with separate allocations)
   6980 #endif
   6981 
   6982   ring_buffer_length_bytes = horizontal->scale_info.output_sub_size * effective_channels * sizeof(float) + sizeof(float); // extra float for padding
   6983 
   6984   // if we do vertical first, the ring buffer holds a whole decoded line
   6985   if ( vertical_first )
   6986     ring_buffer_length_bytes = ( decode_buffer_size + 15 ) & ~15;
   6987 
   6988   if ( ( ring_buffer_length_bytes & 4095 ) == 0 ) ring_buffer_length_bytes += 64*3; // avoid 4k alias
   6989 
   6990   // One extra entry because floating point precision problems sometimes cause an extra to be necessary.
   6991   alloc_ring_buffer_num_entries = vertical->filter_pixel_width + 1;
   6992 
   6993   // we never need more ring buffer entries than the scanlines we're outputting when in scatter mode
   6994   if ( ( !vertical->is_gather ) && ( alloc_ring_buffer_num_entries > conservative_split_output_size ) )
   6995     alloc_ring_buffer_num_entries = conservative_split_output_size;
   6996 
   6997   ring_buffer_size = alloc_ring_buffer_num_entries * ring_buffer_length_bytes;
   6998 
   6999   // The vertical buffer is used differently, depending on whether we are scattering
   7000   //   the vertical scanlines, or gathering them.
   7001   //   If scattering, it's used as the temp buffer to accumulate each output.
   7002   //   If gathering, it's just the output buffer.
   7003   vertical_buffer_size = horizontal->scale_info.output_sub_size * effective_channels * sizeof(float) + sizeof(float);  // extra float for padding
   7004 
   7005   // we make two passes through this loop, 1st to add everything up, 2nd to allocate and init
   7006   for(;;)
   7007   {
   7008     int i;
   7009     void * advance_mem = alloced;
   7010     int copy_horizontal = 0;
   7011     stbir__sampler * possibly_use_horizontal_for_pivot = 0;
   7012 
   7013 #ifdef STBIR__SEPARATE_ALLOCATIONS
   7014     #define STBIR__NEXT_PTR( ptr, size, ntype ) if ( alloced ) { void * p = STBIR_MALLOC( size, user_data); if ( p == 0 ) { stbir__free_internal_mem( info ); return 0; } (ptr) = (ntype*)p; }
   7015 #else
   7016     #define STBIR__NEXT_PTR( ptr, size, ntype ) advance_mem = (void*) ( ( ((size_t)advance_mem) + 15 ) & ~15 ); if ( alloced ) ptr = (ntype*)advance_mem; advance_mem = ((char*)advance_mem) + (size);
   7017 #endif
   7018 
   7019     STBIR__NEXT_PTR( info, sizeof( stbir__info ), stbir__info );
   7020 
   7021     STBIR__NEXT_PTR( info->split_info, sizeof( stbir__per_split_info ) * splits, stbir__per_split_info );
   7022 
   7023     if ( info )
   7024     {
   7025       static stbir__alpha_weight_func * fancy_alpha_weights[6]  =    { stbir__fancy_alpha_weight_4ch,   stbir__fancy_alpha_weight_4ch,   stbir__fancy_alpha_weight_4ch,   stbir__fancy_alpha_weight_4ch,   stbir__fancy_alpha_weight_2ch,   stbir__fancy_alpha_weight_2ch };
   7026       static stbir__alpha_unweight_func * fancy_alpha_unweights[6] = { stbir__fancy_alpha_unweight_4ch, stbir__fancy_alpha_unweight_4ch, stbir__fancy_alpha_unweight_4ch, stbir__fancy_alpha_unweight_4ch, stbir__fancy_alpha_unweight_2ch, stbir__fancy_alpha_unweight_2ch };
   7027       static stbir__alpha_weight_func * simple_alpha_weights[6] = { stbir__simple_alpha_weight_4ch, stbir__simple_alpha_weight_4ch, stbir__simple_alpha_weight_4ch, stbir__simple_alpha_weight_4ch, stbir__simple_alpha_weight_2ch, stbir__simple_alpha_weight_2ch };
   7028       static stbir__alpha_unweight_func * simple_alpha_unweights[6] = { stbir__simple_alpha_unweight_4ch, stbir__simple_alpha_unweight_4ch, stbir__simple_alpha_unweight_4ch, stbir__simple_alpha_unweight_4ch, stbir__simple_alpha_unweight_2ch, stbir__simple_alpha_unweight_2ch };
   7029 
   7030       // initialize info fields
   7031       info->alloced_mem = alloced;
   7032       info->alloced_total = alloced_total;
   7033 
   7034       info->channels = channels;
   7035       info->effective_channels = effective_channels;
   7036 
   7037       info->offset_x = new_x;
   7038       info->offset_y = new_y;
   7039       info->alloc_ring_buffer_num_entries = alloc_ring_buffer_num_entries;
   7040       info->ring_buffer_num_entries = 0;
   7041       info->ring_buffer_length_bytes = ring_buffer_length_bytes;
   7042       info->splits = splits;
   7043       info->vertical_first = vertical_first;
   7044 
   7045       info->input_pixel_layout_internal = input_pixel_layout;
   7046       info->output_pixel_layout_internal = output_pixel_layout;
   7047 
   7048       // setup alpha weight functions
   7049       info->alpha_weight = 0;
   7050       info->alpha_unweight = 0;
   7051 
   7052       // handle alpha weighting functions and overrides
   7053       if ( alpha_weighting_type == 2 )
   7054       {
   7055         // high quality alpha multiplying on the way in, dividing on the way out
   7056         info->alpha_weight = fancy_alpha_weights[ input_pixel_layout - STBIRI_RGBA ];
   7057         info->alpha_unweight = fancy_alpha_unweights[ output_pixel_layout - STBIRI_RGBA ];
   7058       }
   7059       else if ( alpha_weighting_type == 4 )
   7060       {
   7061         // fast alpha multiplying on the way in, dividing on the way out
   7062         info->alpha_weight = simple_alpha_weights[ input_pixel_layout - STBIRI_RGBA ];
   7063         info->alpha_unweight = simple_alpha_unweights[ output_pixel_layout - STBIRI_RGBA ];
   7064       }
   7065       else if ( alpha_weighting_type == 1 )
   7066       {
   7067         // fast alpha on the way in, leave in premultiplied form on way out
   7068         info->alpha_weight = simple_alpha_weights[ input_pixel_layout - STBIRI_RGBA ];
   7069       }
   7070       else if ( alpha_weighting_type == 3 )
   7071       {
   7072         // incoming is premultiplied, fast alpha dividing on the way out - non-premultiplied output
   7073         info->alpha_unweight = simple_alpha_unweights[ output_pixel_layout - STBIRI_RGBA ];
   7074       }
   7075 
   7076       // handle 3-chan color flipping, using the alpha weight path
   7077       if ( ( ( input_pixel_layout == STBIRI_RGB ) && ( output_pixel_layout == STBIRI_BGR ) ) ||
   7078            ( ( input_pixel_layout == STBIRI_BGR ) && ( output_pixel_layout == STBIRI_RGB ) ) )
   7079       {
   7080         // do the flipping on the smaller of the two ends
   7081         if ( horizontal->scale_info.scale < 1.0f )
   7082           info->alpha_unweight = stbir__simple_flip_3ch;
   7083         else
   7084           info->alpha_weight = stbir__simple_flip_3ch;
   7085       }
   7086 
   7087     }
   7088 
   7089     // get all the per-split buffers
   7090     for( i = 0 ; i < splits ; i++ )
   7091     {
   7092       STBIR__NEXT_PTR( info->split_info[i].decode_buffer, decode_buffer_size, float );
   7093 
   7094 #ifdef STBIR__SEPARATE_ALLOCATIONS
   7095 
   7096       #ifdef STBIR_SIMD8
   7097       if ( ( info ) && ( effective_channels == 3 ) )
   7098         ++info->split_info[i].decode_buffer; // avx in 3 channel mode needs one float at the start of the buffer
   7099       #endif
   7100 
   7101       STBIR__NEXT_PTR( info->split_info[i].ring_buffers, alloc_ring_buffer_num_entries * sizeof(float*), float* );
   7102       {
   7103         int j;
   7104         for( j = 0 ; j < alloc_ring_buffer_num_entries ; j++ )
   7105         {
   7106           STBIR__NEXT_PTR( info->split_info[i].ring_buffers[j], ring_buffer_length_bytes, float );
   7107           #ifdef STBIR_SIMD8
   7108           if ( ( info ) && ( effective_channels == 3 ) )
   7109             ++info->split_info[i].ring_buffers[j]; // avx in 3 channel mode needs one float at the start of the buffer
   7110           #endif
   7111         }
   7112       }
   7113 #else
   7114       STBIR__NEXT_PTR( info->split_info[i].ring_buffer, ring_buffer_size, float );
   7115 #endif
   7116       STBIR__NEXT_PTR( info->split_info[i].vertical_buffer, vertical_buffer_size, float );
   7117     }
   7118 
   7119     // alloc memory for to-be-pivoted coeffs (if necessary)
   7120     if ( vertical->is_gather == 0 )
   7121     {
   7122       int both;
   7123       int temp_mem_amt;
   7124 
   7125       // when in vertical scatter mode, we first build the coefficients in gather mode, and then pivot after,
   7126       //   that means we need two buffers, so we try to use the decode buffer and ring buffer for this. if that
   7127       //   is too small, we just allocate extra memory to use as this temp.
   7128 
   7129       both = vertical->gather_prescatter_contributors_size + vertical->gather_prescatter_coefficients_size;
   7130 
   7131 #ifdef STBIR__SEPARATE_ALLOCATIONS
   7132       temp_mem_amt = decode_buffer_size;
   7133 
   7134       #ifdef STBIR_SIMD8
   7135       if ( effective_channels == 3 )
   7136         --temp_mem_amt; // avx in 3 channel mode needs one float at the start of the buffer
   7137       #endif
   7138 #else
   7139       temp_mem_amt = ( decode_buffer_size + ring_buffer_size + vertical_buffer_size ) * splits;
   7140 #endif
   7141       if ( temp_mem_amt >= both )
   7142       {
   7143         if ( info )
   7144         {
   7145           vertical->gather_prescatter_contributors = (stbir__contributors*)info->split_info[0].decode_buffer;
   7146           vertical->gather_prescatter_coefficients = (float*) ( ( (char*)info->split_info[0].decode_buffer ) + vertical->gather_prescatter_contributors_size );
   7147         }
   7148       }
   7149       else
   7150       {
   7151         // ring+decode memory is too small, so allocate temp memory
   7152         STBIR__NEXT_PTR( vertical->gather_prescatter_contributors, vertical->gather_prescatter_contributors_size, stbir__contributors );
   7153         STBIR__NEXT_PTR( vertical->gather_prescatter_coefficients, vertical->gather_prescatter_coefficients_size, float );
   7154       }
   7155     }
   7156 
   7157     STBIR__NEXT_PTR( horizontal->contributors, horizontal->contributors_size, stbir__contributors );
   7158     STBIR__NEXT_PTR( horizontal->coefficients, horizontal->coefficients_size, float );
   7159 
   7160     // are the two filters identical?? (happens a lot with mipmap generation)
   7161     if ( ( horizontal->filter_kernel == vertical->filter_kernel ) && ( horizontal->filter_support == vertical->filter_support ) && ( horizontal->edge == vertical->edge ) && ( horizontal->scale_info.output_sub_size == vertical->scale_info.output_sub_size ) )
   7162     {
   7163       float diff_scale = horizontal->scale_info.scale - vertical->scale_info.scale;
   7164       float diff_shift = horizontal->scale_info.pixel_shift - vertical->scale_info.pixel_shift;
   7165       if ( diff_scale < 0.0f ) diff_scale = -diff_scale;
   7166       if ( diff_shift < 0.0f ) diff_shift = -diff_shift;
   7167       if ( ( diff_scale <= stbir__small_float ) && ( diff_shift <= stbir__small_float ) )
   7168       {
   7169         if ( horizontal->is_gather == vertical->is_gather )
   7170         {
   7171           copy_horizontal = 1;
   7172           goto no_vert_alloc;
   7173         }
   7174         // everything matches, but vertical is scatter, horizontal is gather, use horizontal coeffs for vertical pivot coeffs
   7175         possibly_use_horizontal_for_pivot = horizontal;
   7176       }
   7177     }
   7178 
   7179     STBIR__NEXT_PTR( vertical->contributors, vertical->contributors_size, stbir__contributors );
   7180     STBIR__NEXT_PTR( vertical->coefficients, vertical->coefficients_size, float );
   7181 
   7182    no_vert_alloc:
   7183 
   7184     if ( info )
   7185     {
   7186       STBIR_PROFILE_BUILD_START( horizontal );
   7187 
   7188       stbir__calculate_filters( horizontal, 0, user_data STBIR_ONLY_PROFILE_BUILD_SET_INFO );
   7189 
   7190       // setup the horizontal gather functions
   7191       // start with defaulting to the n_coeffs functions (specialized on channels and remnant leftover)
   7192       info->horizontal_gather_channels = stbir__horizontal_gather_n_coeffs_funcs[ effective_channels ][ horizontal->extent_info.widest & 3 ];
   7193       // but if the number of coeffs <= 12, use another set of special cases. <=12 coeffs is any enlarging resize, or shrinking resize down to about 1/3 size
   7194       if ( horizontal->extent_info.widest <= 12 )
   7195         info->horizontal_gather_channels = stbir__horizontal_gather_channels_funcs[ effective_channels ][ horizontal->extent_info.widest - 1 ];
   7196 
   7197       info->scanline_extents.conservative.n0 = conservative->n0;
   7198       info->scanline_extents.conservative.n1 = conservative->n1;
   7199 
   7200       // get exact extents
   7201       stbir__get_extents( horizontal, &info->scanline_extents );
   7202 
   7203       // pack the horizontal coeffs
   7204       horizontal->coefficient_width = stbir__pack_coefficients(horizontal->num_contributors, horizontal->contributors, horizontal->coefficients, horizontal->coefficient_width, horizontal->extent_info.widest, info->scanline_extents.conservative.n0, info->scanline_extents.conservative.n1 );
   7205 
   7206       STBIR_MEMCPY( &info->horizontal, horizontal, sizeof( stbir__sampler ) );
   7207 
   7208       STBIR_PROFILE_BUILD_END( horizontal );
   7209 
   7210       if ( copy_horizontal )
   7211       {
   7212         STBIR_MEMCPY( &info->vertical, horizontal, sizeof( stbir__sampler ) );
   7213       }
   7214       else
   7215       {
   7216         STBIR_PROFILE_BUILD_START( vertical );
   7217 
   7218         stbir__calculate_filters( vertical, possibly_use_horizontal_for_pivot, user_data STBIR_ONLY_PROFILE_BUILD_SET_INFO );
   7219         STBIR_MEMCPY( &info->vertical, vertical, sizeof( stbir__sampler ) );
   7220 
   7221         STBIR_PROFILE_BUILD_END( vertical );
   7222       }
   7223 
   7224       // setup the vertical split ranges
   7225       stbir__get_split_info( info->split_info, info->splits, info->vertical.scale_info.output_sub_size, info->vertical.filter_pixel_margin, info->vertical.scale_info.input_full_size );
   7226 
   7227       // now we know precisely how many entries we need
   7228       info->ring_buffer_num_entries = info->vertical.extent_info.widest;
   7229 
   7230       // we never need more ring buffer entries than the scanlines we're outputting
   7231       if ( ( !info->vertical.is_gather ) && ( info->ring_buffer_num_entries > conservative_split_output_size ) )
   7232         info->ring_buffer_num_entries = conservative_split_output_size;
   7233       STBIR_ASSERT( info->ring_buffer_num_entries <= info->alloc_ring_buffer_num_entries );
   7234 
   7235       // a few of the horizontal gather functions read past the end of the decode buffer (but mask it out),
   7236       //   so put in normal values so no SNaNs or denormals accidentally sneak in (and do the same for the
   7237       //   ring buffer when running vertical first)
   7238       for( i = 0 ; i < splits ; i++ )
   7239       {
   7240         int t, ofs, start;
   7241 
   7242         ofs = decode_buffer_size / 4;
   7243 
   7244         #if defined( STBIR__SEPARATE_ALLOCATIONS ) && defined(STBIR_SIMD8)
   7245         if ( effective_channels == 3 ) 
   7246           --ofs; // avx in 3 channel mode needs one float at the start of the buffer, so we snap back for clearing
   7247         #endif
   7248 
   7249         start = ofs - 4;
   7250         if ( start < 0 ) start = 0;
   7251 
   7252         for( t = start ; t < ofs; t++ )
   7253           info->split_info[i].decode_buffer[ t ] = 9999.0f;
   7254 
   7255         if ( vertical_first )
   7256         {
   7257           int j;
   7258           for( j = 0; j < info->ring_buffer_num_entries ; j++ )
   7259           {
   7260             for( t = start ; t < ofs; t++ )
   7261               stbir__get_ring_buffer_entry( info, info->split_info + i, j )[ t ] = 9999.0f;
   7262           }
   7263         }
   7264       }
   7265     }
   7266 
   7267     #undef STBIR__NEXT_PTR
   7268 
   7269 
   7270     // is this the first time through loop?
   7271     if ( info == 0 )
   7272     {
   7273       alloced_total = ( 15 + (size_t)advance_mem );
   7274       alloced = STBIR_MALLOC( alloced_total, user_data );
   7275       if ( alloced == 0 )
   7276         return 0;
   7277     }
   7278     else
   7279       return info;  // success
   7280   }
   7281 }
   7282 
   7283 static int stbir__perform_resize( stbir__info const * info, int split_start, int split_count )
   7284 {
   7285   stbir__per_split_info * split_info = info->split_info + split_start;
   7286 
   7287   STBIR_PROFILE_CLEAR_EXTRAS();
   7288 
   7289   STBIR_PROFILE_FIRST_START( looping );
   7290   if (info->vertical.is_gather)
   7291     stbir__vertical_gather_loop( info, split_info, split_count );
   7292   else
   7293     stbir__vertical_scatter_loop( info, split_info, split_count );
   7294   STBIR_PROFILE_END( looping );
   7295 
   7296   return 1;
   7297 }
   7298 
   7299 static void stbir__update_info_from_resize( stbir__info * info, STBIR_RESIZE * resize )
   7300 {
   7301   static stbir__decode_pixels_func * decode_simple[STBIR_TYPE_HALF_FLOAT-STBIR_TYPE_UINT8_SRGB+1]=
   7302   {
   7303     /* 1ch-4ch */ stbir__decode_uint8_srgb, stbir__decode_uint8_srgb, 0, stbir__decode_float_linear, stbir__decode_half_float_linear,
   7304   };
   7305 
   7306   static stbir__decode_pixels_func * decode_alphas[STBIRI_AR-STBIRI_RGBA+1][STBIR_TYPE_HALF_FLOAT-STBIR_TYPE_UINT8_SRGB+1]=
   7307   {
   7308     { /* RGBA */ stbir__decode_uint8_srgb4_linearalpha,      stbir__decode_uint8_srgb,      0, stbir__decode_float_linear,      stbir__decode_half_float_linear },
   7309     { /* BGRA */ stbir__decode_uint8_srgb4_linearalpha_BGRA, stbir__decode_uint8_srgb_BGRA, 0, stbir__decode_float_linear_BGRA, stbir__decode_half_float_linear_BGRA },
   7310     { /* ARGB */ stbir__decode_uint8_srgb4_linearalpha_ARGB, stbir__decode_uint8_srgb_ARGB, 0, stbir__decode_float_linear_ARGB, stbir__decode_half_float_linear_ARGB },
   7311     { /* ABGR */ stbir__decode_uint8_srgb4_linearalpha_ABGR, stbir__decode_uint8_srgb_ABGR, 0, stbir__decode_float_linear_ABGR, stbir__decode_half_float_linear_ABGR },
   7312     { /* RA   */ stbir__decode_uint8_srgb2_linearalpha,      stbir__decode_uint8_srgb,      0, stbir__decode_float_linear,      stbir__decode_half_float_linear },
   7313     { /* AR   */ stbir__decode_uint8_srgb2_linearalpha_AR,   stbir__decode_uint8_srgb_AR,   0, stbir__decode_float_linear_AR,   stbir__decode_half_float_linear_AR },
   7314   };
   7315 
   7316   static stbir__decode_pixels_func * decode_simple_scaled_or_not[2][2]=
   7317   {
   7318     { stbir__decode_uint8_linear_scaled,  stbir__decode_uint8_linear }, { stbir__decode_uint16_linear_scaled, stbir__decode_uint16_linear },
   7319   };
   7320 
   7321   static stbir__decode_pixels_func * decode_alphas_scaled_or_not[STBIRI_AR-STBIRI_RGBA+1][2][2]=
   7322   {
   7323     { /* RGBA */ { stbir__decode_uint8_linear_scaled,       stbir__decode_uint8_linear },      { stbir__decode_uint16_linear_scaled,      stbir__decode_uint16_linear } },
   7324     { /* BGRA */ { stbir__decode_uint8_linear_scaled_BGRA,  stbir__decode_uint8_linear_BGRA }, { stbir__decode_uint16_linear_scaled_BGRA, stbir__decode_uint16_linear_BGRA } },
   7325     { /* ARGB */ { stbir__decode_uint8_linear_scaled_ARGB,  stbir__decode_uint8_linear_ARGB }, { stbir__decode_uint16_linear_scaled_ARGB, stbir__decode_uint16_linear_ARGB } },
   7326     { /* ABGR */ { stbir__decode_uint8_linear_scaled_ABGR,  stbir__decode_uint8_linear_ABGR }, { stbir__decode_uint16_linear_scaled_ABGR, stbir__decode_uint16_linear_ABGR } },
   7327     { /* RA   */ { stbir__decode_uint8_linear_scaled,       stbir__decode_uint8_linear },      { stbir__decode_uint16_linear_scaled,      stbir__decode_uint16_linear } },
   7328     { /* AR   */ { stbir__decode_uint8_linear_scaled_AR,    stbir__decode_uint8_linear_AR },   { stbir__decode_uint16_linear_scaled_AR,   stbir__decode_uint16_linear_AR } }
   7329   };
   7330 
   7331   static stbir__encode_pixels_func * encode_simple[STBIR_TYPE_HALF_FLOAT-STBIR_TYPE_UINT8_SRGB+1]=
   7332   {
   7333     /* 1ch-4ch */ stbir__encode_uint8_srgb, stbir__encode_uint8_srgb, 0, stbir__encode_float_linear, stbir__encode_half_float_linear,
   7334   };
   7335 
   7336   static stbir__encode_pixels_func * encode_alphas[STBIRI_AR-STBIRI_RGBA+1][STBIR_TYPE_HALF_FLOAT-STBIR_TYPE_UINT8_SRGB+1]=
   7337   {
   7338     { /* RGBA */ stbir__encode_uint8_srgb4_linearalpha,      stbir__encode_uint8_srgb,      0, stbir__encode_float_linear,      stbir__encode_half_float_linear },
   7339     { /* BGRA */ stbir__encode_uint8_srgb4_linearalpha_BGRA, stbir__encode_uint8_srgb_BGRA, 0, stbir__encode_float_linear_BGRA, stbir__encode_half_float_linear_BGRA },
   7340     { /* ARGB */ stbir__encode_uint8_srgb4_linearalpha_ARGB, stbir__encode_uint8_srgb_ARGB, 0, stbir__encode_float_linear_ARGB, stbir__encode_half_float_linear_ARGB },
   7341     { /* ABGR */ stbir__encode_uint8_srgb4_linearalpha_ABGR, stbir__encode_uint8_srgb_ABGR, 0, stbir__encode_float_linear_ABGR, stbir__encode_half_float_linear_ABGR },
   7342     { /* RA   */ stbir__encode_uint8_srgb2_linearalpha,      stbir__encode_uint8_srgb,      0, stbir__encode_float_linear,      stbir__encode_half_float_linear },
   7343     { /* AR   */ stbir__encode_uint8_srgb2_linearalpha_AR,   stbir__encode_uint8_srgb_AR,   0, stbir__encode_float_linear_AR,   stbir__encode_half_float_linear_AR }
   7344   };
   7345 
   7346   static stbir__encode_pixels_func * encode_simple_scaled_or_not[2][2]=
   7347   {
   7348     { stbir__encode_uint8_linear_scaled,  stbir__encode_uint8_linear }, { stbir__encode_uint16_linear_scaled, stbir__encode_uint16_linear },
   7349   };
   7350 
   7351   static stbir__encode_pixels_func * encode_alphas_scaled_or_not[STBIRI_AR-STBIRI_RGBA+1][2][2]=
   7352   {
   7353     { /* RGBA */ { stbir__encode_uint8_linear_scaled,       stbir__encode_uint8_linear },       { stbir__encode_uint16_linear_scaled,      stbir__encode_uint16_linear } },
   7354     { /* BGRA */ { stbir__encode_uint8_linear_scaled_BGRA,  stbir__encode_uint8_linear_BGRA },  { stbir__encode_uint16_linear_scaled_BGRA, stbir__encode_uint16_linear_BGRA } },
   7355     { /* ARGB */ { stbir__encode_uint8_linear_scaled_ARGB,  stbir__encode_uint8_linear_ARGB },  { stbir__encode_uint16_linear_scaled_ARGB, stbir__encode_uint16_linear_ARGB } },
   7356     { /* ABGR */ { stbir__encode_uint8_linear_scaled_ABGR,  stbir__encode_uint8_linear_ABGR },  { stbir__encode_uint16_linear_scaled_ABGR, stbir__encode_uint16_linear_ABGR } },
   7357     { /* RA   */ { stbir__encode_uint8_linear_scaled,       stbir__encode_uint8_linear },       { stbir__encode_uint16_linear_scaled,      stbir__encode_uint16_linear } },
   7358     { /* AR   */ { stbir__encode_uint8_linear_scaled_AR,    stbir__encode_uint8_linear_AR },    { stbir__encode_uint16_linear_scaled_AR,   stbir__encode_uint16_linear_AR } }
   7359   };
   7360 
   7361   stbir__decode_pixels_func * decode_pixels = 0;
   7362   stbir__encode_pixels_func * encode_pixels = 0;
   7363   stbir_datatype input_type, output_type;
   7364 
   7365   input_type = resize->input_data_type;
   7366   output_type = resize->output_data_type;
   7367   info->input_data = resize->input_pixels;
   7368   info->input_stride_bytes = resize->input_stride_in_bytes;
   7369   info->output_stride_bytes = resize->output_stride_in_bytes;
   7370 
   7371   // if we're completely point sampling, then we can turn off SRGB
   7372   if ( ( info->horizontal.filter_enum == STBIR_FILTER_POINT_SAMPLE ) && ( info->vertical.filter_enum == STBIR_FILTER_POINT_SAMPLE ) )
   7373   {
   7374     if ( ( ( input_type  == STBIR_TYPE_UINT8_SRGB ) || ( input_type  == STBIR_TYPE_UINT8_SRGB_ALPHA ) ) &&
   7375          ( ( output_type == STBIR_TYPE_UINT8_SRGB ) || ( output_type == STBIR_TYPE_UINT8_SRGB_ALPHA ) ) )
   7376     {
   7377       input_type = STBIR_TYPE_UINT8;
   7378       output_type = STBIR_TYPE_UINT8;
   7379     }
   7380   }
   7381 
   7382   // recalc the output and input strides
   7383   if ( info->input_stride_bytes == 0 )
   7384     info->input_stride_bytes = info->channels * info->horizontal.scale_info.input_full_size * stbir__type_size[input_type];
   7385 
   7386   if ( info->output_stride_bytes == 0 )
   7387     info->output_stride_bytes = info->channels * info->horizontal.scale_info.output_sub_size * stbir__type_size[output_type];
   7388 
   7389   // calc offset
   7390   info->output_data = ( (char*) resize->output_pixels ) + ( (size_t) info->offset_y * (size_t) resize->output_stride_in_bytes ) + ( info->offset_x * info->channels * stbir__type_size[output_type] );
   7391 
   7392   info->in_pixels_cb = resize->input_cb;
   7393   info->user_data = resize->user_data;
   7394   info->out_pixels_cb = resize->output_cb;
   7395 
   7396   // setup the input format converters
   7397   if ( ( input_type == STBIR_TYPE_UINT8 ) || ( input_type == STBIR_TYPE_UINT16 ) )
   7398   {
   7399     int non_scaled = 0;
   7400 
   7401     // check if we can run unscaled - 0-255.0/0-65535.0 instead of 0-1.0 (which is a tiny bit faster when doing linear 8->8 or 16->16)
   7402     if ( ( !info->alpha_weight ) && ( !info->alpha_unweight )  ) // don't short circuit when alpha weighting (get everything to 0-1.0 as usual)
   7403       if ( ( ( input_type == STBIR_TYPE_UINT8 ) && ( output_type == STBIR_TYPE_UINT8 ) ) || ( ( input_type == STBIR_TYPE_UINT16 ) && ( output_type == STBIR_TYPE_UINT16 ) ) )
   7404         non_scaled = 1;
   7405 
   7406     if ( info->input_pixel_layout_internal <= STBIRI_4CHANNEL )
   7407       decode_pixels = decode_simple_scaled_or_not[ input_type == STBIR_TYPE_UINT16 ][ non_scaled ];
   7408     else
   7409       decode_pixels = decode_alphas_scaled_or_not[ ( info->input_pixel_layout_internal - STBIRI_RGBA ) % ( STBIRI_AR-STBIRI_RGBA+1 ) ][ input_type == STBIR_TYPE_UINT16 ][ non_scaled ];
   7410   }
   7411   else
   7412   {
   7413     if ( info->input_pixel_layout_internal <= STBIRI_4CHANNEL )
   7414       decode_pixels = decode_simple[ input_type - STBIR_TYPE_UINT8_SRGB ];
   7415     else
   7416       decode_pixels = decode_alphas[ ( info->input_pixel_layout_internal - STBIRI_RGBA ) % ( STBIRI_AR-STBIRI_RGBA+1 ) ][ input_type - STBIR_TYPE_UINT8_SRGB ];
   7417   }
   7418 
   7419   // setup the output format converters
   7420   if ( ( output_type == STBIR_TYPE_UINT8 ) || ( output_type == STBIR_TYPE_UINT16 ) )
   7421   {
   7422     int non_scaled = 0;
   7423 
   7424     // check if we can run unscaled - 0-255.0/0-65535.0 instead of 0-1.0 (which is a tiny bit faster when doing linear 8->8 or 16->16)
   7425     if ( ( !info->alpha_weight ) && ( !info->alpha_unweight ) ) // don't short circuit when alpha weighting (get everything to 0-1.0 as usual)
   7426       if ( ( ( input_type == STBIR_TYPE_UINT8 ) && ( output_type == STBIR_TYPE_UINT8 ) ) || ( ( input_type == STBIR_TYPE_UINT16 ) && ( output_type == STBIR_TYPE_UINT16 ) ) )
   7427         non_scaled = 1;
   7428 
   7429     if ( info->output_pixel_layout_internal <= STBIRI_4CHANNEL )
   7430       encode_pixels = encode_simple_scaled_or_not[ output_type == STBIR_TYPE_UINT16 ][ non_scaled ];
   7431     else
   7432       encode_pixels = encode_alphas_scaled_or_not[ ( info->output_pixel_layout_internal - STBIRI_RGBA ) % ( STBIRI_AR-STBIRI_RGBA+1 ) ][ output_type == STBIR_TYPE_UINT16 ][ non_scaled ];
   7433   }
   7434   else
   7435   {
   7436     if ( info->output_pixel_layout_internal <= STBIRI_4CHANNEL )
   7437       encode_pixels = encode_simple[ output_type - STBIR_TYPE_UINT8_SRGB ];
   7438     else
   7439       encode_pixels = encode_alphas[ ( info->output_pixel_layout_internal - STBIRI_RGBA ) % ( STBIRI_AR-STBIRI_RGBA+1 ) ][ output_type - STBIR_TYPE_UINT8_SRGB ];
   7440   }
   7441 
   7442   info->input_type = input_type;
   7443   info->output_type = output_type;
   7444   info->decode_pixels = decode_pixels;
   7445   info->encode_pixels = encode_pixels;
   7446 }
   7447 
   7448 static void stbir__clip( int * outx, int * outsubw, int outw, double * u0, double * u1 )
   7449 {
   7450   double per, adj;
   7451   int over;
   7452 
   7453   // do left/top edge
   7454   if ( *outx < 0 )
   7455   {
   7456     per = ( (double)*outx ) / ( (double)*outsubw ); // is negative
   7457     adj = per * ( *u1 - *u0 );
   7458     *u0 -= adj; // increases u0
   7459     *outx = 0;
   7460   }
   7461 
   7462   // do right/bot edge
   7463   over = outw - ( *outx + *outsubw );
   7464   if ( over < 0 )
   7465   {
   7466     per = ( (double)over ) / ( (double)*outsubw ); // is negative
   7467     adj = per * ( *u1 - *u0 );
   7468     *u1 += adj; // decrease u1
   7469     *outsubw = outw - *outx;
   7470   }
   7471 }
   7472 
   7473 // converts a double to a rational that has less than one float bit of error (returns 0 if unable to do so)
   7474 static int stbir__double_to_rational(double f, stbir_uint32 limit, stbir_uint32 *numer, stbir_uint32 *denom, int limit_denom ) // limit_denom (1) or limit numer (0)
   7475 {
   7476   double err;
   7477   stbir_uint64 top, bot;
   7478   stbir_uint64 numer_last = 0;
   7479   stbir_uint64 denom_last = 1;
   7480   stbir_uint64 numer_estimate = 1;
   7481   stbir_uint64 denom_estimate = 0;
   7482 
   7483   // scale to past float error range
   7484   top = (stbir_uint64)( f * (double)(1 << 25) );
   7485   bot = 1 << 25;
   7486 
   7487   // keep refining, but this usually stops in a few loops - around 5 iterations even for bad cases
   7488   for(;;)
   7489   {
   7490     stbir_uint64 est, temp;
   7491 
   7492     // hit limit, break out and do best full range estimate
   7493     if ( ( ( limit_denom ) ? denom_estimate : numer_estimate ) >= limit )
   7494       break;
   7495 
   7496     // is the current error less than 1 bit of a float? if so, we're done
   7497     if ( denom_estimate )
   7498     {
   7499       err = ( (double)numer_estimate / (double)denom_estimate ) - f;
   7500       if ( err < 0.0 ) err = -err;
   7501       if ( err < ( 1.0 / (double)(1<<24) ) )
   7502       {
   7503         // yup, found it
   7504         *numer = (stbir_uint32) numer_estimate;
   7505         *denom = (stbir_uint32) denom_estimate;
   7506         return 1;
   7507       }
   7508     }
   7509 
   7510     // no more refinement bits left? break out and do full range estimate
   7511     if ( bot == 0 )
   7512       break;
   7513 
   7514     // gcd the estimate bits
   7515     est = top / bot;
   7516     temp = top % bot;
   7517     top = bot;
   7518     bot = temp;
   7519 
   7520     // move remainders
   7521     temp = est * denom_estimate + denom_last;
   7522     denom_last = denom_estimate;
   7523     denom_estimate = temp;
   7524 
   7525     // move remainders
   7526     temp = est * numer_estimate + numer_last;
   7527     numer_last = numer_estimate;
   7528     numer_estimate = temp;
   7529   }
   7530 
   7531   // we didn't find anything good enough for float, use a full range estimate
   7532   if ( limit_denom )
   7533   {
   7534     numer_estimate= (stbir_uint64)( f * (double)limit + 0.5 );
   7535     denom_estimate = limit;
   7536   }
   7537   else
   7538   {
   7539     numer_estimate = limit;
   7540     denom_estimate = (stbir_uint64)( ( (double)limit / f ) + 0.5 );
   7541   }
   7542 
   7543   *numer = (stbir_uint32) numer_estimate;
   7544   *denom = (stbir_uint32) denom_estimate;
   7545 
   7546   err = ( denom_estimate ) ? ( ( (double)(stbir_uint32)numer_estimate / (double)(stbir_uint32)denom_estimate ) - f ) : 1.0;
   7547   if ( err < 0.0 ) err = -err;
   7548   return ( err < ( 1.0 / (double)(1<<24) ) ) ? 1 : 0;
   7549 }
   7550 
   7551 static int stbir__calculate_region_transform( stbir__scale_info * scale_info, int output_full_range, int * output_offset, int output_sub_range, int input_full_range, double input_s0, double input_s1 )
   7552 {
   7553   double output_range, input_range, output_s, input_s, ratio, scale;
   7554 
   7555   input_s = input_s1 - input_s0;
   7556 
   7557   // null area
   7558   if ( ( output_full_range == 0 ) || ( input_full_range == 0 ) ||
   7559        ( output_sub_range == 0 ) || ( input_s <= stbir__small_float ) )
   7560     return 0;
   7561 
   7562   // are either of the ranges completely out of bounds?
   7563   if ( ( *output_offset >= output_full_range ) || ( ( *output_offset + output_sub_range ) <= 0 ) || ( input_s0 >= (1.0f-stbir__small_float) ) || ( input_s1 <= stbir__small_float ) )
   7564     return 0;
   7565 
   7566   output_range = (double)output_full_range;
   7567   input_range = (double)input_full_range;
   7568 
   7569   output_s = ( (double)output_sub_range) / output_range;
   7570 
   7571   // figure out the scaling to use
   7572   ratio = output_s / input_s;
   7573 
   7574   // save scale before clipping
   7575   scale = ( output_range / input_range ) * ratio;
   7576   scale_info->scale = (float)scale;
   7577   scale_info->inv_scale = (float)( 1.0 / scale );
   7578 
   7579   // clip output area to left/right output edges (and adjust input area)
   7580   stbir__clip( output_offset, &output_sub_range, output_full_range, &input_s0, &input_s1 );
   7581 
   7582   // recalc input area
   7583   input_s = input_s1 - input_s0;
   7584 
   7585   // after clipping do we have zero input area?
   7586   if ( input_s <= stbir__small_float )
   7587     return 0;
   7588 
   7589   // calculate and store the starting source offsets in output pixel space
   7590   scale_info->pixel_shift = (float) ( input_s0 * ratio * output_range );
   7591 
   7592   scale_info->scale_is_rational = stbir__double_to_rational( scale, ( scale <= 1.0 ) ? output_full_range : input_full_range, &scale_info->scale_numerator, &scale_info->scale_denominator, ( scale >= 1.0 ) );
   7593 
   7594   scale_info->input_full_size = input_full_range;
   7595   scale_info->output_sub_size = output_sub_range;
   7596 
   7597   return 1;
   7598 }
   7599 
   7600 
   7601 static void stbir__init_and_set_layout( STBIR_RESIZE * resize, stbir_pixel_layout pixel_layout, stbir_datatype data_type )
   7602 {
   7603   resize->input_cb = 0;
   7604   resize->output_cb = 0;
   7605   resize->user_data = resize;
   7606   resize->samplers = 0;
   7607   resize->called_alloc = 0;
   7608   resize->horizontal_filter = STBIR_FILTER_DEFAULT;
   7609   resize->horizontal_filter_kernel = 0; resize->horizontal_filter_support = 0;
   7610   resize->vertical_filter = STBIR_FILTER_DEFAULT;
   7611   resize->vertical_filter_kernel = 0; resize->vertical_filter_support = 0;
   7612   resize->horizontal_edge = STBIR_EDGE_CLAMP;
   7613   resize->vertical_edge = STBIR_EDGE_CLAMP;
   7614   resize->input_s0 = 0; resize->input_t0 = 0; resize->input_s1 = 1; resize->input_t1 = 1;
   7615   resize->output_subx = 0; resize->output_suby = 0; resize->output_subw = resize->output_w; resize->output_subh = resize->output_h;
   7616   resize->input_data_type = data_type;
   7617   resize->output_data_type = data_type;
   7618   resize->input_pixel_layout_public = pixel_layout;
   7619   resize->output_pixel_layout_public = pixel_layout;
   7620   resize->needs_rebuild = 1;
   7621 }
   7622 
   7623 STBIRDEF void stbir_resize_init( STBIR_RESIZE * resize,
   7624                                  const void *input_pixels,  int input_w,  int input_h, int input_stride_in_bytes, // stride can be zero
   7625                                        void *output_pixels, int output_w, int output_h, int output_stride_in_bytes, // stride can be zero
   7626                                  stbir_pixel_layout pixel_layout, stbir_datatype data_type )
   7627 {
   7628   resize->input_pixels = input_pixels;
   7629   resize->input_w = input_w;
   7630   resize->input_h = input_h;
   7631   resize->input_stride_in_bytes = input_stride_in_bytes;
   7632   resize->output_pixels = output_pixels;
   7633   resize->output_w = output_w;
   7634   resize->output_h = output_h;
   7635   resize->output_stride_in_bytes = output_stride_in_bytes;
   7636   resize->fast_alpha = 0;
   7637 
   7638   stbir__init_and_set_layout( resize, pixel_layout, data_type );
   7639 }
   7640 
   7641 // You can update parameters any time after resize_init
   7642 STBIRDEF void stbir_set_datatypes( STBIR_RESIZE * resize, stbir_datatype input_type, stbir_datatype output_type )  // by default, datatype from resize_init
   7643 {
   7644   resize->input_data_type = input_type;
   7645   resize->output_data_type = output_type;
   7646   if ( ( resize->samplers ) && ( !resize->needs_rebuild ) )
   7647     stbir__update_info_from_resize( resize->samplers, resize );
   7648 }
   7649 
   7650 STBIRDEF void stbir_set_pixel_callbacks( STBIR_RESIZE * resize, stbir_input_callback * input_cb, stbir_output_callback * output_cb )   // no callbacks by default
   7651 {
   7652   resize->input_cb = input_cb;
   7653   resize->output_cb = output_cb;
   7654 
   7655   if ( ( resize->samplers ) && ( !resize->needs_rebuild ) )
   7656   {
   7657     resize->samplers->in_pixels_cb = input_cb;
   7658     resize->samplers->out_pixels_cb = output_cb;
   7659   }
   7660 }
   7661 
   7662 STBIRDEF void stbir_set_user_data( STBIR_RESIZE * resize, void * user_data )                                     // pass back STBIR_RESIZE* by default
   7663 {
   7664   resize->user_data = user_data;
   7665   if ( ( resize->samplers ) && ( !resize->needs_rebuild ) )
   7666     resize->samplers->user_data = user_data;
   7667 }
   7668 
   7669 STBIRDEF void stbir_set_buffer_ptrs( STBIR_RESIZE * resize, const void * input_pixels, int input_stride_in_bytes, void * output_pixels, int output_stride_in_bytes )
   7670 {
   7671   resize->input_pixels = input_pixels;
   7672   resize->input_stride_in_bytes = input_stride_in_bytes;
   7673   resize->output_pixels = output_pixels;
   7674   resize->output_stride_in_bytes = output_stride_in_bytes;
   7675   if ( ( resize->samplers ) && ( !resize->needs_rebuild ) )
   7676     stbir__update_info_from_resize( resize->samplers, resize );
   7677 }
   7678 
   7679 
   7680 STBIRDEF int stbir_set_edgemodes( STBIR_RESIZE * resize, stbir_edge horizontal_edge, stbir_edge vertical_edge )       // CLAMP by default
   7681 {
   7682   resize->horizontal_edge = horizontal_edge;
   7683   resize->vertical_edge = vertical_edge;
   7684   resize->needs_rebuild = 1;
   7685   return 1;
   7686 }
   7687 
   7688 STBIRDEF int stbir_set_filters( STBIR_RESIZE * resize, stbir_filter horizontal_filter, stbir_filter vertical_filter ) // STBIR_DEFAULT_FILTER_UPSAMPLE/DOWNSAMPLE by default
   7689 {
   7690   resize->horizontal_filter = horizontal_filter;
   7691   resize->vertical_filter = vertical_filter;
   7692   resize->needs_rebuild = 1;
   7693   return 1;
   7694 }
   7695 
   7696 STBIRDEF int stbir_set_filter_callbacks( STBIR_RESIZE * resize, stbir__kernel_callback * horizontal_filter, stbir__support_callback * horizontal_support, stbir__kernel_callback * vertical_filter, stbir__support_callback * vertical_support )
   7697 {
   7698   resize->horizontal_filter_kernel = horizontal_filter; resize->horizontal_filter_support = horizontal_support;
   7699   resize->vertical_filter_kernel = vertical_filter; resize->vertical_filter_support = vertical_support;
   7700   resize->needs_rebuild = 1;
   7701   return 1;
   7702 }
   7703 
   7704 STBIRDEF int stbir_set_pixel_layouts( STBIR_RESIZE * resize, stbir_pixel_layout input_pixel_layout, stbir_pixel_layout output_pixel_layout )   // sets new pixel layouts
   7705 {
   7706   resize->input_pixel_layout_public = input_pixel_layout;
   7707   resize->output_pixel_layout_public = output_pixel_layout;
   7708   resize->needs_rebuild = 1;
   7709   return 1;
   7710 }
   7711 
   7712 
   7713 STBIRDEF int stbir_set_non_pm_alpha_speed_over_quality( STBIR_RESIZE * resize, int non_pma_alpha_speed_over_quality )   // sets alpha speed
   7714 {
   7715   resize->fast_alpha = non_pma_alpha_speed_over_quality;
   7716   resize->needs_rebuild = 1;
   7717   return 1;
   7718 }
   7719 
   7720 STBIRDEF int stbir_set_input_subrect( STBIR_RESIZE * resize, double s0, double t0, double s1, double t1 )                 // sets input region (full region by default)
   7721 {
   7722   resize->input_s0 = s0;
   7723   resize->input_t0 = t0;
   7724   resize->input_s1 = s1;
   7725   resize->input_t1 = t1;
   7726   resize->needs_rebuild = 1;
   7727 
   7728   // are we inbounds?
   7729   if ( ( s1 < stbir__small_float ) || ( (s1-s0) < stbir__small_float ) ||
   7730        ( t1 < stbir__small_float ) || ( (t1-t0) < stbir__small_float ) ||
   7731        ( s0 > (1.0f-stbir__small_float) ) ||
   7732        ( t0 > (1.0f-stbir__small_float) ) )
   7733     return 0;
   7734 
   7735   return 1;
   7736 }
   7737 
   7738 STBIRDEF int stbir_set_output_pixel_subrect( STBIR_RESIZE * resize, int subx, int suby, int subw, int subh )          // sets output region (full region by default)
   7739 {
   7740   resize->output_subx = subx;
   7741   resize->output_suby = suby;
   7742   resize->output_subw = subw;
   7743   resize->output_subh = subh;
   7744   resize->needs_rebuild = 1;
   7745 
   7746   // are we inbounds?
   7747   if ( ( subx >= resize->output_w ) || ( ( subx + subw ) <= 0 ) || ( suby >= resize->output_h ) || ( ( suby + subh ) <= 0 ) || ( subw == 0 ) || ( subh == 0 ) )
   7748     return 0;
   7749 
   7750   return 1;
   7751 }
   7752 
   7753 STBIRDEF int stbir_set_pixel_subrect( STBIR_RESIZE * resize, int subx, int suby, int subw, int subh )                 // sets both regions (full regions by default)
   7754 {
   7755   double s0, t0, s1, t1;
   7756 
   7757   s0 = ( (double)subx ) / ( (double)resize->output_w );
   7758   t0 = ( (double)suby ) / ( (double)resize->output_h );
   7759   s1 = ( (double)(subx+subw) ) / ( (double)resize->output_w );
   7760   t1 = ( (double)(suby+subh) ) / ( (double)resize->output_h );
   7761 
   7762   resize->input_s0 = s0;
   7763   resize->input_t0 = t0;
   7764   resize->input_s1 = s1;
   7765   resize->input_t1 = t1;
   7766   resize->output_subx = subx;
   7767   resize->output_suby = suby;
   7768   resize->output_subw = subw;
   7769   resize->output_subh = subh;
   7770   resize->needs_rebuild = 1;
   7771 
   7772   // are we inbounds?
   7773   if ( ( subx >= resize->output_w ) || ( ( subx + subw ) <= 0 ) || ( suby >= resize->output_h ) || ( ( suby + subh ) <= 0 ) || ( subw == 0 ) || ( subh == 0 ) )
   7774     return 0;
   7775 
   7776   return 1;
   7777 }
   7778 
   7779 static int stbir__perform_build( STBIR_RESIZE * resize, int splits )
   7780 {
   7781   stbir__contributors conservative = { 0, 0 };
   7782   stbir__sampler horizontal, vertical;
   7783   int new_output_subx, new_output_suby;
   7784   stbir__info * out_info;
   7785   #ifdef STBIR_PROFILE
   7786   stbir__info profile_infod;  // used to contain building profile info before everything is allocated
   7787   stbir__info * profile_info = &profile_infod;
   7788   #endif
   7789 
   7790   // have we already built the samplers?
   7791   if ( resize->samplers )
   7792     return 0;
   7793 
   7794   #define STBIR_RETURN_ERROR_AND_ASSERT( exp )  STBIR_ASSERT( !(exp) ); if (exp) return 0;
   7795   STBIR_RETURN_ERROR_AND_ASSERT( (unsigned)resize->horizontal_filter >= STBIR_FILTER_OTHER)
   7796   STBIR_RETURN_ERROR_AND_ASSERT( (unsigned)resize->vertical_filter >= STBIR_FILTER_OTHER)
   7797   #undef STBIR_RETURN_ERROR_AND_ASSERT
   7798 
   7799   if ( splits <= 0 )
   7800     return 0;
   7801 
   7802   STBIR_PROFILE_BUILD_FIRST_START( build );
   7803 
   7804   new_output_subx = resize->output_subx;
   7805   new_output_suby = resize->output_suby;
   7806 
   7807   // do horizontal clip and scale calcs
   7808   if ( !stbir__calculate_region_transform( &horizontal.scale_info, resize->output_w, &new_output_subx, resize->output_subw, resize->input_w, resize->input_s0, resize->input_s1 ) )
   7809     return 0;
   7810 
   7811   // do vertical clip and scale calcs
   7812   if ( !stbir__calculate_region_transform( &vertical.scale_info, resize->output_h, &new_output_suby, resize->output_subh, resize->input_h, resize->input_t0, resize->input_t1 ) )
   7813     return 0;
   7814 
   7815   // if nothing to do, just return
   7816   if ( ( horizontal.scale_info.output_sub_size == 0 ) || ( vertical.scale_info.output_sub_size == 0 ) )
   7817     return 0;
   7818 
   7819   stbir__set_sampler(&horizontal, resize->horizontal_filter, resize->horizontal_filter_kernel, resize->horizontal_filter_support, resize->horizontal_edge, &horizontal.scale_info, 1, resize->user_data );
   7820   stbir__get_conservative_extents( &horizontal, &conservative, resize->user_data );
   7821   stbir__set_sampler(&vertical, resize->vertical_filter, resize->vertical_filter_kernel, resize->vertical_filter_support, resize->vertical_edge, &vertical.scale_info, 0, resize->user_data ); // the vertical sampler must use the vertical filter kernel (pairing the horizontal kernel with the vertical support would mismatch)
   7822 
   7823   if ( ( vertical.scale_info.output_sub_size / splits ) < STBIR_FORCE_MINIMUM_SCANLINES_FOR_SPLITS ) // each split should be a minimum of 4 scanlines (handwavey choice)
   7824   {
   7825     splits = vertical.scale_info.output_sub_size / STBIR_FORCE_MINIMUM_SCANLINES_FOR_SPLITS;
   7826     if ( splits == 0 ) splits = 1;
   7827   }
   7828 
   7829   STBIR_PROFILE_BUILD_START( alloc );
   7830   out_info = stbir__alloc_internal_mem_and_build_samplers( &horizontal, &vertical, &conservative, resize->input_pixel_layout_public, resize->output_pixel_layout_public, splits, new_output_subx, new_output_suby, resize->fast_alpha, resize->user_data STBIR_ONLY_PROFILE_BUILD_SET_INFO );
   7831   STBIR_PROFILE_BUILD_END( alloc );
   7832   STBIR_PROFILE_BUILD_END( build );
   7833 
   7834   if ( out_info )
   7835   {
   7836     resize->splits = splits;
   7837     resize->samplers = out_info;
   7838     resize->needs_rebuild = 0;
   7839     #ifdef STBIR_PROFILE
   7840       STBIR_MEMCPY( &out_info->profile, &profile_infod.profile, sizeof( out_info->profile ) );
   7841     #endif
   7842 
   7843     // update anything that can be changed without recalcing samplers
   7844     stbir__update_info_from_resize( out_info, resize );
   7845 
   7846     return splits;
   7847   }
   7848 
   7849   return 0;
   7850 }
   7851 
   7852 void stbir_free_samplers( STBIR_RESIZE * resize )
   7853 {
   7854   if ( resize->samplers )
   7855   {
   7856     stbir__free_internal_mem( resize->samplers );
   7857     resize->samplers = 0;
   7858     resize->called_alloc = 0;
   7859   }
   7860 }
   7861 
   7862 STBIRDEF int stbir_build_samplers_with_splits( STBIR_RESIZE * resize, int splits )
   7863 {
   7864   if ( ( resize->samplers == 0 ) || ( resize->needs_rebuild ) )
   7865   {
   7866     if ( resize->samplers )
   7867       stbir_free_samplers( resize );
   7868 
   7869     resize->called_alloc = 1;
   7870     return stbir__perform_build( resize, splits );
   7871   }
   7872 
   7873   STBIR_PROFILE_BUILD_CLEAR( resize->samplers );
   7874 
   7875   return 1;
   7876 }
   7877 
   7878 STBIRDEF int stbir_build_samplers( STBIR_RESIZE * resize )
   7879 {
   7880   return stbir_build_samplers_with_splits( resize, 1 );
   7881 }
   7882 
   7883 STBIRDEF int stbir_resize_extended( STBIR_RESIZE * resize )
   7884 {
   7885   int result;
   7886 
   7887   if ( ( resize->samplers == 0 ) || ( resize->needs_rebuild ) )
   7888   {
   7889     int alloc_state = resize->called_alloc;  // remember allocated state
   7890 
   7891     if ( resize->samplers )
   7892     {
   7893       stbir__free_internal_mem( resize->samplers );
   7894       resize->samplers = 0;
   7895     }
   7896 
   7897     if ( !stbir_build_samplers( resize ) )
   7898       return 0;
   7899 
   7900     resize->called_alloc = alloc_state;
   7901 
   7902     // if build_samplers succeeded (above), but there are no samplers set, then
   7903     //   the area to stretch into was zero pixels, so don't do anything and return
   7904     //   success
   7905     if ( resize->samplers == 0 )
   7906       return 1;
   7907   }
   7908   else
   7909   {
   7910     // didn't build anything - clear it
   7911     STBIR_PROFILE_BUILD_CLEAR( resize->samplers );
   7912   }
   7913 
   7914   // do resize
   7915   result = stbir__perform_resize( resize->samplers, 0, resize->splits );
   7916 
   7917   // if the samplers were built inside this call (the caller never called a build function), free them now
   7918   if ( !resize->called_alloc )
   7919   {
   7920     stbir_free_samplers( resize );
   7921     resize->samplers = 0;
   7922   }
   7923 
   7924   return result;
   7925 }
   7926 
   7927 STBIRDEF int stbir_resize_extended_split( STBIR_RESIZE * resize, int split_start, int split_count )
   7928 {
   7929   STBIR_ASSERT( resize->samplers );
   7930 
   7931   // if we're just doing the whole thing, call full
   7932   if ( ( split_start == -1 ) || ( ( split_start == 0 ) && ( split_count == resize->splits ) ) )
   7933     return stbir_resize_extended( resize );
   7934 
   7935   // you **must** build samplers first when using split resize
   7936   if ( ( resize->samplers == 0 ) || ( resize->needs_rebuild ) )
   7937     return 0;
   7938 
   7939   if ( ( split_start >= resize->splits ) || ( split_start < 0 ) || ( ( split_start + split_count ) > resize->splits ) || ( split_count <= 0 ) )
   7940     return 0;
   7941 
   7942   // do resize
   7943   return stbir__perform_resize( resize->samplers, split_start, split_count );
   7944 }
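
        /*
           Illustrative usage sketch (not library code): how the split functions above are
           typically driven from multiple threads. The worker function run_on_worker() and the
           pixel buffers are assumptions for the example; only the stbir_* calls are real.

             STBIR_RESIZE resize;
             int splits, i;
             stbir_resize_init( &resize, in_pixels,  in_w,  in_h,  0,
                                         out_pixels, out_w, out_h, 0,
                                         STBIR_RGBA, STBIR_TYPE_UINT8 );
             splits = stbir_build_samplers_with_splits( &resize, num_threads );  // may return fewer than requested, 0 on failure
             for ( i = 0 ; i < splits ; i++ )
               run_on_worker( i );              // each worker calls stbir_resize_extended_split( &resize, i, 1 )
             // ...wait for all workers to finish...
             stbir_free_samplers( &resize );    // required here because we built the samplers ourselves
        */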
   7945 
   7946 static int stbir__check_output_stuff( void ** ret_ptr, int * ret_pitch, void * output_pixels, int type_size, int output_w, int output_h, int output_stride_in_bytes, stbir_internal_pixel_layout pixel_layout )
   7947 {
   7948   size_t size;
   7949   int pitch;
   7950   void * ptr;
   7951 
   7952   pitch = output_w * type_size * stbir__pixel_channels[ pixel_layout ];
   7953   if ( pitch == 0 )
   7954     return 0;
   7955 
   7956   if ( output_stride_in_bytes == 0 )
   7957     output_stride_in_bytes = pitch;
   7958 
   7959   if ( output_stride_in_bytes < pitch )
   7960     return 0;
   7961 
   7962   size = (size_t)output_stride_in_bytes * (size_t)output_h;
   7963   if ( size == 0 )
   7964     return 0;
   7965 
   7966   *ret_ptr = 0;
   7967   *ret_pitch = output_stride_in_bytes;
   7968 
   7969   if ( output_pixels == 0 )
   7970   {
   7971     ptr = STBIR_MALLOC( size, 0 );
   7972     if ( ptr == 0 )
   7973       return 0;
   7974 
   7975     *ret_ptr = ptr;
   7976     *ret_pitch = pitch;
   7977   }
   7978 
   7979   return 1;
   7980 }
   7981 
   7982 
   7983 STBIRDEF unsigned char * stbir_resize_uint8_linear( const unsigned char *input_pixels , int input_w , int input_h, int input_stride_in_bytes,
   7984                                                           unsigned char *output_pixels, int output_w, int output_h, int output_stride_in_bytes,
   7985                                                           stbir_pixel_layout pixel_layout )
   7986 {
   7987   STBIR_RESIZE resize;
   7988   unsigned char * optr;
   7989   int opitch;
   7990 
   7991   if ( !stbir__check_output_stuff( (void**)&optr, &opitch, output_pixels, sizeof( unsigned char ), output_w, output_h, output_stride_in_bytes, stbir__pixel_layout_convert_public_to_internal[ pixel_layout ] ) )
   7992     return 0;
   7993 
   7994   stbir_resize_init( &resize,
   7995                      input_pixels,  input_w,  input_h,  input_stride_in_bytes,
   7996                      (optr) ? optr : output_pixels, output_w, output_h, opitch,
   7997                      pixel_layout, STBIR_TYPE_UINT8 );
   7998 
   7999   if ( !stbir_resize_extended( &resize ) )
   8000   {
   8001     if ( optr )
   8002       STBIR_FREE( optr, 0 );
   8003     return 0;
   8004   }
   8005 
   8006   return (optr) ? optr : output_pixels;
   8007 }
   8008 
   8009 STBIRDEF unsigned char * stbir_resize_uint8_srgb( const unsigned char *input_pixels , int input_w , int input_h, int input_stride_in_bytes,
   8010                                                         unsigned char *output_pixels, int output_w, int output_h, int output_stride_in_bytes,
   8011                                                         stbir_pixel_layout pixel_layout )
   8012 {
   8013   STBIR_RESIZE resize;
   8014   unsigned char * optr;
   8015   int opitch;
   8016 
   8017   if ( !stbir__check_output_stuff( (void**)&optr, &opitch, output_pixels, sizeof( unsigned char ), output_w, output_h, output_stride_in_bytes, stbir__pixel_layout_convert_public_to_internal[ pixel_layout ] ) )
   8018     return 0;
   8019 
   8020   stbir_resize_init( &resize,
   8021                      input_pixels,  input_w,  input_h,  input_stride_in_bytes,
   8022                      (optr) ? optr : output_pixels, output_w, output_h, opitch,
   8023                      pixel_layout, STBIR_TYPE_UINT8_SRGB );
   8024 
   8025   if ( !stbir_resize_extended( &resize ) )
   8026   {
   8027     if ( optr )
   8028       STBIR_FREE( optr, 0 );
   8029     return 0;
   8030   }
   8031 
   8032   return (optr) ? optr : output_pixels;
   8033 }
   8034 
   8035 
   8036 STBIRDEF float * stbir_resize_float_linear( const float *input_pixels , int input_w , int input_h, int input_stride_in_bytes,
   8037                                                   float *output_pixels, int output_w, int output_h, int output_stride_in_bytes,
   8038                                                   stbir_pixel_layout pixel_layout )
   8039 {
   8040   STBIR_RESIZE resize;
   8041   float * optr;
   8042   int opitch;
   8043 
   8044   if ( !stbir__check_output_stuff( (void**)&optr, &opitch, output_pixels, sizeof( float ), output_w, output_h, output_stride_in_bytes, stbir__pixel_layout_convert_public_to_internal[ pixel_layout ] ) )
   8045     return 0;
   8046 
   8047   stbir_resize_init( &resize,
   8048                      input_pixels,  input_w,  input_h,  input_stride_in_bytes,
   8049                      (optr) ? optr : output_pixels, output_w, output_h, opitch,
   8050                      pixel_layout, STBIR_TYPE_FLOAT );
   8051 
   8052   if ( !stbir_resize_extended( &resize ) )
   8053   {
   8054     if ( optr )
   8055       STBIR_FREE( optr, 0 );
   8056     return 0;
   8057   }
   8058 
   8059   return (optr) ? optr : output_pixels;
   8060 }
   8061 
   8062 
   8063 STBIRDEF void * stbir_resize( const void *input_pixels , int input_w , int input_h, int input_stride_in_bytes,
   8064                                     void *output_pixels, int output_w, int output_h, int output_stride_in_bytes,
   8065                               stbir_pixel_layout pixel_layout, stbir_datatype data_type,
   8066                               stbir_edge edge, stbir_filter filter )
   8067 {
   8068   STBIR_RESIZE resize;
   8069   void * optr;
   8070   int opitch;
   8071 
   8072   if ( !stbir__check_output_stuff( (void**)&optr, &opitch, output_pixels, stbir__type_size[data_type], output_w, output_h, output_stride_in_bytes, stbir__pixel_layout_convert_public_to_internal[ pixel_layout ] ) )
   8073     return 0;
   8074 
   8075   stbir_resize_init( &resize,
   8076                      input_pixels,  input_w,  input_h,  input_stride_in_bytes,
   8077                      (optr) ? optr : output_pixels, output_w, output_h, output_stride_in_bytes,
   8078                      pixel_layout, data_type );
   8079 
   8080   resize.horizontal_edge = edge;
   8081   resize.vertical_edge = edge;
   8082   resize.horizontal_filter = filter;
   8083   resize.vertical_filter = filter;
   8084 
   8085   if ( !stbir_resize_extended( &resize ) )
   8086   {
   8087     if ( optr )
   8088       STBIR_FREE( optr, 0 );
   8089     return 0;
   8090   }
   8091 
   8092   return (optr) ? optr : output_pixels;
   8093 }
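
        /*
           Illustrative usage sketch (not library code): stbir_resize() is the medium-complexity
           entry point - the easy-API allocation rules plus an explicit edge mode and filter.
           The buffers and sizes below are assumptions for the example.

             float * out = (float *) stbir_resize( in_pixels, in_w,  in_h,  0,
                                                   NULL,      out_w, out_h, 0,
                                                   STBIR_RGBA, STBIR_TYPE_FLOAT,
                                                   STBIR_EDGE_WRAP, STBIR_FILTER_POINT_SAMPLE );
             // Passing NULL for the output makes the library allocate a packed buffer
             // (returned here, NULL on failure); release it when you are done with it.
        */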
   8094 
   8095 #ifdef STBIR_PROFILE
   8096 
   8097 STBIRDEF void stbir_resize_build_profile_info( STBIR_PROFILE_INFO * info, STBIR_RESIZE const * resize )
   8098 {
   8099   static char const * bdescriptions[6] = { "Building", "Allocating", "Horizontal sampler", "Vertical sampler", "Coefficient cleanup", "Coefficient pivot" } ;
   8100   stbir__info* samp = resize->samplers;
   8101   int i;
   8102 
   8103   typedef int testa[ (STBIR__ARRAY_SIZE( bdescriptions ) == (STBIR__ARRAY_SIZE( samp->profile.array )-1) )?1:-1];
   8104   typedef int testb[ (sizeof( samp->profile.array ) == (sizeof(samp->profile.named)) )?1:-1];
   8105   typedef int testc[ (sizeof( info->clocks ) >= (sizeof(samp->profile.named)) )?1:-1];
   8106 
   8107   for( i = 0 ; i < STBIR__ARRAY_SIZE( bdescriptions ) ; i++)
   8108     info->clocks[i] = samp->profile.array[i+1];
   8109 
   8110   info->total_clocks = samp->profile.named.total;
   8111   info->descriptions = bdescriptions;
   8112   info->count = STBIR__ARRAY_SIZE( bdescriptions );
   8113 }
   8114 
   8115 STBIRDEF void stbir_resize_split_profile_info( STBIR_PROFILE_INFO * info, STBIR_RESIZE const * resize, int split_start, int split_count )
   8116 {
   8117   static char const * descriptions[7] = { "Looping", "Vertical sampling", "Horizontal sampling", "Scanline input", "Scanline output", "Alpha weighting", "Alpha unweighting" };
   8118   stbir__per_split_info * split_info;
   8119   int s, i;
   8120 
   8121   typedef int testa[ (STBIR__ARRAY_SIZE( descriptions ) == (STBIR__ARRAY_SIZE( split_info->profile.array )-1) )?1:-1];
   8122   typedef int testb[ (sizeof( split_info->profile.array ) == (sizeof(split_info->profile.named)) )?1:-1];
   8123   typedef int testc[ (sizeof( info->clocks ) >= (sizeof(split_info->profile.named)) )?1:-1];
   8124 
   8125   if ( split_start == -1 )
   8126   {
   8127     split_start = 0;
   8128     split_count = resize->samplers->splits;
   8129   }
   8130 
   8131   if ( ( split_start >= resize->splits ) || ( split_start < 0 ) || ( ( split_start + split_count ) > resize->splits ) || ( split_count <= 0 ) )
   8132   {
   8133     info->total_clocks = 0;
   8134     info->descriptions = 0;
   8135     info->count = 0;
   8136     return;
   8137   }
   8138 
   8139   split_info = resize->samplers->split_info + split_start;
   8140 
   8141   // sum up the profile from all the splits
   8142   for( i = 0 ; i < STBIR__ARRAY_SIZE( descriptions ) ; i++ )
   8143   {
   8144     stbir_uint64 sum = 0;
   8145     for( s = 0 ; s < split_count ; s++ )
   8146       sum += split_info[s].profile.array[i+1];
   8147     info->clocks[i] = sum;
   8148   }
   8149 
   8150   info->total_clocks = split_info->profile.named.total;
   8151   info->descriptions = descriptions;
   8152   info->count = STBIR__ARRAY_SIZE( descriptions );
   8153 }
   8154 
   8155 STBIRDEF void stbir_resize_extended_profile_info( STBIR_PROFILE_INFO * info, STBIR_RESIZE const * resize )
   8156 {
   8157   stbir_resize_split_profile_info( info, resize, -1, 0 );
   8158 }
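
        /*
           Illustrative usage sketch (not library code): reading the profile data when
           STBIR_PROFILE is defined. The samplers must still be alive, so build them explicitly;
           printf is just a stand-in for whatever logging you use.

             STBIR_PROFILE_INFO info;
             int i;
             // resize initialized with stbir_resize_init() as usual
             stbir_build_samplers( &resize );
             stbir_resize_extended( &resize );
             stbir_resize_extended_profile_info( &info, &resize );  // or stbir_resize_build_profile_info for build-time costs
             for ( i = 0 ; i < info.count ; i++ )
               printf( "%s: %llu clocks\n", info.descriptions[i], (unsigned long long) info.clocks[i] );
             stbir_free_samplers( &resize );
        */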
   8159 
   8160 #endif // STBIR_PROFILE
   8161 
   8162 #undef STBIR_BGR
   8163 #undef STBIR_1CHANNEL
   8164 #undef STBIR_2CHANNEL
   8165 #undef STBIR_RGB
   8166 #undef STBIR_RGBA
   8167 #undef STBIR_4CHANNEL
   8168 #undef STBIR_BGRA
   8169 #undef STBIR_ARGB
   8170 #undef STBIR_ABGR
   8171 #undef STBIR_RA
   8172 #undef STBIR_AR
   8173 #undef STBIR_RGBA_PM
   8174 #undef STBIR_BGRA_PM
   8175 #undef STBIR_ARGB_PM
   8176 #undef STBIR_ABGR_PM
   8177 #undef STBIR_RA_PM
   8178 #undef STBIR_AR_PM
   8179 
   8180 #endif // STB_IMAGE_RESIZE_IMPLEMENTATION
   8181 
   8182 #else  // STB_IMAGE_RESIZE_HORIZONTALS&STB_IMAGE_RESIZE_DO_VERTICALS
   8183 
   8184 // we reinclude the header file to define all the horizontal functions
   8185 //   specializing each function for the number of coeffs is 20-40% faster *OVERALL*
   8186 
   8187 // by including the header file again this way, we can still debug the functions
   8188 
   8189 #define STBIR_strs_join2( start, mid, end ) start##mid##end
   8190 #define STBIR_strs_join1( start, mid, end ) STBIR_strs_join2( start, mid, end )
   8191 
   8192 #define STBIR_strs_join24( start, mid1, mid2, end ) start##mid1##mid2##end
   8193 #define STBIR_strs_join14( start, mid1, mid2, end ) STBIR_strs_join24( start, mid1, mid2, end )
   8194 
   8195 #ifdef STB_IMAGE_RESIZE_DO_CODERS
   8196 
   8197 #ifdef stbir__decode_suffix
   8198 #define STBIR__CODER_NAME( name ) STBIR_strs_join1( name, _, stbir__decode_suffix )
   8199 #else
   8200 #define STBIR__CODER_NAME( name ) name
   8201 #endif
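
        // For example, if this file were re-included with stbir__decode_suffix defined as BGRA,
        // STBIR__CODER_NAME( stbir__decode_uint8_linear ) would paste (via STBIR_strs_join1) into
        // stbir__decode_uint8_linear_BGRA, so each swizzle gets its own specialized set of coders.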
   8202 
   8203 #ifdef stbir__decode_swizzle
   8204 #define stbir__decode_simdf8_flip(reg) STBIR_strs_join1( STBIR_strs_join1( STBIR_strs_join1( STBIR_strs_join1( stbir__simdf8_0123to,stbir__decode_order0,stbir__decode_order1),stbir__decode_order2,stbir__decode_order3),stbir__decode_order0,stbir__decode_order1),stbir__decode_order2,stbir__decode_order3)(reg, reg)
   8205 #define stbir__decode_simdf4_flip(reg) STBIR_strs_join1( STBIR_strs_join1( stbir__simdf_0123to,stbir__decode_order0,stbir__decode_order1),stbir__decode_order2,stbir__decode_order3)(reg, reg)
   8206 #define stbir__encode_simdf8_unflip(reg) STBIR_strs_join1( STBIR_strs_join1( STBIR_strs_join1( STBIR_strs_join1( stbir__simdf8_0123to,stbir__encode_order0,stbir__encode_order1),stbir__encode_order2,stbir__encode_order3),stbir__encode_order0,stbir__encode_order1),stbir__encode_order2,stbir__encode_order3)(reg, reg)
   8207 #define stbir__encode_simdf4_unflip(reg) STBIR_strs_join1( STBIR_strs_join1( stbir__simdf_0123to,stbir__encode_order0,stbir__encode_order1),stbir__encode_order2,stbir__encode_order3)(reg, reg)
   8208 #else
   8209 #define stbir__decode_order0 0
   8210 #define stbir__decode_order1 1
   8211 #define stbir__decode_order2 2
   8212 #define stbir__decode_order3 3
   8213 #define stbir__encode_order0 0
   8214 #define stbir__encode_order1 1
   8215 #define stbir__encode_order2 2
   8216 #define stbir__encode_order3 3
   8217 #define stbir__decode_simdf8_flip(reg)
   8218 #define stbir__decode_simdf4_flip(reg)
   8219 #define stbir__encode_simdf8_unflip(reg)
   8220 #define stbir__encode_simdf4_unflip(reg)
   8221 #endif
   8222 
   8223 #ifdef STBIR_SIMD8
   8224 #define stbir__encode_simdfX_unflip  stbir__encode_simdf8_unflip
   8225 #else
   8226 #define stbir__encode_simdfX_unflip  stbir__encode_simdf4_unflip
   8227 #endif
   8228 
   8229 static void STBIR__CODER_NAME( stbir__decode_uint8_linear_scaled )( float * decodep, int width_times_channels, void const * inputp )
   8230 {
   8231   float STBIR_STREAMOUT_PTR( * ) decode = decodep;
   8232   float * decode_end = (float*) decode + width_times_channels;
   8233   unsigned char const * input = (unsigned char const*)inputp;
   8234 
   8235   #ifdef STBIR_SIMD
   8236   unsigned char const * end_input_m16 = input + width_times_channels - 16;
   8237   if ( width_times_channels >= 16 )
   8238   {
   8239     decode_end -= 16;
   8240     STBIR_NO_UNROLL_LOOP_START_INF_FOR
   8241     for(;;)
   8242     {
   8243       #ifdef STBIR_SIMD8
   8244       stbir__simdi i; stbir__simdi8 o0,o1;
   8245       stbir__simdf8 of0, of1;
   8246       STBIR_NO_UNROLL(decode);
   8247       stbir__simdi_load( i, input );
   8248       stbir__simdi8_expand_u8_to_u32( o0, o1, i );
   8249       stbir__simdi8_convert_i32_to_float( of0, o0 );
   8250       stbir__simdi8_convert_i32_to_float( of1, o1 );
   8251       stbir__simdf8_mult( of0, of0, STBIR_max_uint8_as_float_inverted8);
   8252       stbir__simdf8_mult( of1, of1, STBIR_max_uint8_as_float_inverted8);
   8253       stbir__decode_simdf8_flip( of0 );
   8254       stbir__decode_simdf8_flip( of1 );
   8255       stbir__simdf8_store( decode + 0, of0 );
   8256       stbir__simdf8_store( decode + 8, of1 );
   8257       #else
   8258       stbir__simdi i, o0, o1, o2, o3;
   8259       stbir__simdf of0, of1, of2, of3;
   8260       STBIR_NO_UNROLL(decode);
   8261       stbir__simdi_load( i, input );
   8262       stbir__simdi_expand_u8_to_u32( o0,o1,o2,o3,i);
   8263       stbir__simdi_convert_i32_to_float( of0, o0 );
   8264       stbir__simdi_convert_i32_to_float( of1, o1 );
   8265       stbir__simdi_convert_i32_to_float( of2, o2 );
   8266       stbir__simdi_convert_i32_to_float( of3, o3 );
   8267       stbir__simdf_mult( of0, of0, STBIR__CONSTF(STBIR_max_uint8_as_float_inverted) );
   8268       stbir__simdf_mult( of1, of1, STBIR__CONSTF(STBIR_max_uint8_as_float_inverted) );
   8269       stbir__simdf_mult( of2, of2, STBIR__CONSTF(STBIR_max_uint8_as_float_inverted) );
   8270       stbir__simdf_mult( of3, of3, STBIR__CONSTF(STBIR_max_uint8_as_float_inverted) );
   8271       stbir__decode_simdf4_flip( of0 );
   8272       stbir__decode_simdf4_flip( of1 );
   8273       stbir__decode_simdf4_flip( of2 );
   8274       stbir__decode_simdf4_flip( of3 );
   8275       stbir__simdf_store( decode + 0,  of0 );
   8276       stbir__simdf_store( decode + 4,  of1 );
   8277       stbir__simdf_store( decode + 8,  of2 );
   8278       stbir__simdf_store( decode + 12, of3 );
   8279       #endif
   8280       decode += 16;
   8281       input += 16;
   8282       if ( decode <= decode_end )
   8283         continue;
   8284       if ( decode == ( decode_end + 16 ) )
   8285         break;
   8286       decode = decode_end; // backup and do last couple
   8287       input = end_input_m16;
   8288     }
   8289     return;
   8290   }
   8291   #endif
   8292 
   8293   // try to do blocks of 4 when you can
   8294   #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
   8295   decode += 4;
   8296   STBIR_SIMD_NO_UNROLL_LOOP_START
   8297   while( decode <= decode_end )
   8298   {
   8299     STBIR_SIMD_NO_UNROLL(decode);
   8300     decode[0-4] = ((float)(input[stbir__decode_order0])) * stbir__max_uint8_as_float_inverted;
   8301     decode[1-4] = ((float)(input[stbir__decode_order1])) * stbir__max_uint8_as_float_inverted;
   8302     decode[2-4] = ((float)(input[stbir__decode_order2])) * stbir__max_uint8_as_float_inverted;
   8303     decode[3-4] = ((float)(input[stbir__decode_order3])) * stbir__max_uint8_as_float_inverted;
   8304     decode += 4;
   8305     input += 4;
   8306   }
   8307   decode -= 4;
   8308   #endif
   8309 
   8310   // do the remnants
   8311   #if stbir__coder_min_num < 4
   8312   STBIR_NO_UNROLL_LOOP_START
   8313   while( decode < decode_end )
   8314   {
   8315     STBIR_NO_UNROLL(decode);
   8316     decode[0] = ((float)(input[stbir__decode_order0])) * stbir__max_uint8_as_float_inverted;
   8317     #if stbir__coder_min_num >= 2
   8318     decode[1] = ((float)(input[stbir__decode_order1])) * stbir__max_uint8_as_float_inverted;
   8319     #endif
   8320     #if stbir__coder_min_num >= 3
   8321     decode[2] = ((float)(input[stbir__decode_order2])) * stbir__max_uint8_as_float_inverted;
   8322     #endif
   8323     decode += stbir__coder_min_num;
   8324     input += stbir__coder_min_num;
   8325   }
   8326   #endif
   8327 }
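
        // Worked example of the "backup and do last couple" pattern above (an observation, not
        // additional library code): with width_times_channels == 20, the SIMD loop converts
        // elements 0..15, sees it has passed decode_end (decodep+4), backs both pointers up to
        // element 4 and converts 4..19 - elements 4..15 are rewritten with identical values,
        // which is cheaper than falling back to a scalar tail.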
   8328 
   8329 static void STBIR__CODER_NAME( stbir__encode_uint8_linear_scaled )( void * outputp, int width_times_channels, float const * encode )
   8330 {
   8331   unsigned char STBIR_SIMD_STREAMOUT_PTR( * ) output = (unsigned char *) outputp;
   8332   unsigned char * end_output = ( (unsigned char *) output ) + width_times_channels;
   8333 
   8334   #ifdef STBIR_SIMD
   8335   if ( width_times_channels >= stbir__simdfX_float_count*2 )
   8336   {
   8337     float const * end_encode_m8 = encode + width_times_channels - stbir__simdfX_float_count*2;
   8338     end_output -= stbir__simdfX_float_count*2;
   8339     STBIR_NO_UNROLL_LOOP_START_INF_FOR
   8340     for(;;)
   8341     {
   8342       stbir__simdfX e0, e1;
   8343       stbir__simdi i;
   8344       STBIR_SIMD_NO_UNROLL(encode);
   8345       stbir__simdfX_madd_mem( e0, STBIR_simd_point5X, STBIR_max_uint8_as_floatX, encode );
   8346       stbir__simdfX_madd_mem( e1, STBIR_simd_point5X, STBIR_max_uint8_as_floatX, encode+stbir__simdfX_float_count );
   8347       stbir__encode_simdfX_unflip( e0 );
   8348       stbir__encode_simdfX_unflip( e1 );
   8349       #ifdef STBIR_SIMD8
   8350       stbir__simdf8_pack_to_16bytes( i, e0, e1 );
   8351       stbir__simdi_store( output, i );
   8352       #else
   8353       stbir__simdf_pack_to_8bytes( i, e0, e1 );
   8354       stbir__simdi_store2( output, i );
   8355       #endif
   8356       encode += stbir__simdfX_float_count*2;
   8357       output += stbir__simdfX_float_count*2;
   8358       if ( output <= end_output )
   8359         continue;
   8360       if ( output == ( end_output + stbir__simdfX_float_count*2 ) )
   8361         break;
   8362       output = end_output; // backup and do last couple
   8363       encode = end_encode_m8;
   8364     }
   8365     return;
   8366   }
   8367 
   8368   // try to do blocks of 4 when you can
   8369   #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
   8370   output += 4;
   8371   STBIR_NO_UNROLL_LOOP_START
   8372   while( output <= end_output )
   8373   {
   8374     stbir__simdf e0;
   8375     stbir__simdi i0;
   8376     STBIR_NO_UNROLL(encode);
   8377     stbir__simdf_load( e0, encode );
   8378     stbir__simdf_madd( e0, STBIR__CONSTF(STBIR_simd_point5), STBIR__CONSTF(STBIR_max_uint8_as_float), e0 );
   8379     stbir__encode_simdf4_unflip( e0 );
   8380     stbir__simdf_pack_to_8bytes( i0, e0, e0 );  // only use first 4
   8381     *(int*)(output-4) = stbir__simdi_to_int( i0 );
   8382     output += 4;
   8383     encode += 4;
   8384   }
   8385   output -= 4;
   8386   #endif
   8387 
   8388   // do the remnants
   8389   #if stbir__coder_min_num < 4
   8390   STBIR_NO_UNROLL_LOOP_START
   8391   while( output < end_output )
   8392   {
   8393     stbir__simdf e0;
   8394     STBIR_NO_UNROLL(encode);
   8395     stbir__simdf_madd1_mem( e0, STBIR__CONSTF(STBIR_simd_point5), STBIR__CONSTF(STBIR_max_uint8_as_float), encode+stbir__encode_order0 ); output[0] = stbir__simdf_convert_float_to_uint8( e0 );
   8396     #if stbir__coder_min_num >= 2
   8397     stbir__simdf_madd1_mem( e0, STBIR__CONSTF(STBIR_simd_point5), STBIR__CONSTF(STBIR_max_uint8_as_float), encode+stbir__encode_order1 ); output[1] = stbir__simdf_convert_float_to_uint8( e0 );
   8398     #endif
   8399     #if stbir__coder_min_num >= 3
   8400     stbir__simdf_madd1_mem( e0, STBIR__CONSTF(STBIR_simd_point5), STBIR__CONSTF(STBIR_max_uint8_as_float), encode+stbir__encode_order2 ); output[2] = stbir__simdf_convert_float_to_uint8( e0 );
   8401     #endif
   8402     output += stbir__coder_min_num;
   8403     encode += stbir__coder_min_num;
   8404   }
   8405   #endif
   8406 
   8407   #else
   8408 
   8409   // try to do blocks of 4 when you can
   8410   #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
   8411   output += 4;
   8412   while( output <= end_output )
   8413   {
   8414     float f;
   8415     f = encode[stbir__encode_order0] * stbir__max_uint8_as_float + 0.5f; STBIR_CLAMP(f, 0, 255); output[0-4] = (unsigned char)f;
   8416     f = encode[stbir__encode_order1] * stbir__max_uint8_as_float + 0.5f; STBIR_CLAMP(f, 0, 255); output[1-4] = (unsigned char)f;
   8417     f = encode[stbir__encode_order2] * stbir__max_uint8_as_float + 0.5f; STBIR_CLAMP(f, 0, 255); output[2-4] = (unsigned char)f;
   8418     f = encode[stbir__encode_order3] * stbir__max_uint8_as_float + 0.5f; STBIR_CLAMP(f, 0, 255); output[3-4] = (unsigned char)f;
   8419     output += 4;
   8420     encode += 4;
   8421   }
   8422   output -= 4;
   8423   #endif
   8424 
   8425   // do the remnants
   8426   #if stbir__coder_min_num < 4
   8427   STBIR_NO_UNROLL_LOOP_START
   8428   while( output < end_output )
   8429   {
   8430     float f;
   8431     STBIR_NO_UNROLL(encode);
   8432     f = encode[stbir__encode_order0] * stbir__max_uint8_as_float + 0.5f; STBIR_CLAMP(f, 0, 255); output[0] = (unsigned char)f;
   8433     #if stbir__coder_min_num >= 2
   8434     f = encode[stbir__encode_order1] * stbir__max_uint8_as_float + 0.5f; STBIR_CLAMP(f, 0, 255); output[1] = (unsigned char)f;
   8435     #endif
   8436     #if stbir__coder_min_num >= 3
   8437     f = encode[stbir__encode_order2] * stbir__max_uint8_as_float + 0.5f; STBIR_CLAMP(f, 0, 255); output[2] = (unsigned char)f;
   8438     #endif
   8439     output += stbir__coder_min_num;
   8440     encode += stbir__coder_min_num;
   8441   }
   8442   #endif
   8443   #endif
   8444 }
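
        // Worked example of the scalar rounding above (an observation, not additional library
        // code): 0.6f becomes 0.6*255 + 0.5 = 153.5, which is inside [0,255] and truncates to 153;
        // anything below 0 clamps to 0 and anything at or above 1.0 clamps to 255, so the cast to
        // unsigned char cannot overflow.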
   8445 
   8446 static void STBIR__CODER_NAME(stbir__decode_uint8_linear)( float * decodep, int width_times_channels, void const * inputp )
   8447 {
   8448   float STBIR_STREAMOUT_PTR( * ) decode = decodep;
   8449   float * decode_end = (float*) decode + width_times_channels;
   8450   unsigned char const * input = (unsigned char const*)inputp;
   8451 
   8452   #ifdef STBIR_SIMD
   8453   unsigned char const * end_input_m16 = input + width_times_channels - 16;
   8454   if ( width_times_channels >= 16 )
   8455   {
   8456     decode_end -= 16;
   8457     STBIR_NO_UNROLL_LOOP_START_INF_FOR
   8458     for(;;)
   8459     {
   8460       #ifdef STBIR_SIMD8
   8461       stbir__simdi i; stbir__simdi8 o0,o1;
   8462       stbir__simdf8 of0, of1;
   8463       STBIR_NO_UNROLL(decode);
   8464       stbir__simdi_load( i, input );
   8465       stbir__simdi8_expand_u8_to_u32( o0, o1, i );
   8466       stbir__simdi8_convert_i32_to_float( of0, o0 );
   8467       stbir__simdi8_convert_i32_to_float( of1, o1 );
   8468       stbir__decode_simdf8_flip( of0 );
   8469       stbir__decode_simdf8_flip( of1 );
   8470       stbir__simdf8_store( decode + 0, of0 );
   8471       stbir__simdf8_store( decode + 8, of1 );
   8472       #else
   8473       stbir__simdi i, o0, o1, o2, o3;
   8474       stbir__simdf of0, of1, of2, of3;
   8475       STBIR_NO_UNROLL(decode);
   8476       stbir__simdi_load( i, input );
   8477       stbir__simdi_expand_u8_to_u32( o0,o1,o2,o3,i);
   8478       stbir__simdi_convert_i32_to_float( of0, o0 );
   8479       stbir__simdi_convert_i32_to_float( of1, o1 );
   8480       stbir__simdi_convert_i32_to_float( of2, o2 );
   8481       stbir__simdi_convert_i32_to_float( of3, o3 );
   8482       stbir__decode_simdf4_flip( of0 );
   8483       stbir__decode_simdf4_flip( of1 );
   8484       stbir__decode_simdf4_flip( of2 );
   8485       stbir__decode_simdf4_flip( of3 );
   8486       stbir__simdf_store( decode + 0,  of0 );
   8487       stbir__simdf_store( decode + 4,  of1 );
   8488       stbir__simdf_store( decode + 8,  of2 );
   8489       stbir__simdf_store( decode + 12, of3 );
   8490 #endif
   8491       decode += 16;
   8492       input += 16;
   8493       if ( decode <= decode_end )
   8494         continue;
   8495       if ( decode == ( decode_end + 16 ) )
   8496         break;
   8497       decode = decode_end; // backup and do last couple
   8498       input = end_input_m16;
   8499     }
   8500     return;
   8501   }
   8502   #endif
   8503 
   8504   // try to do blocks of 4 when you can
   8505   #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
   8506   decode += 4;
   8507   STBIR_SIMD_NO_UNROLL_LOOP_START
   8508   while( decode <= decode_end )
   8509   {
   8510     STBIR_SIMD_NO_UNROLL(decode);
   8511     decode[0-4] = ((float)(input[stbir__decode_order0]));
   8512     decode[1-4] = ((float)(input[stbir__decode_order1]));
   8513     decode[2-4] = ((float)(input[stbir__decode_order2]));
   8514     decode[3-4] = ((float)(input[stbir__decode_order3]));
   8515     decode += 4;
   8516     input += 4;
   8517   }
   8518   decode -= 4;
   8519   #endif
   8520 
   8521   // do the remnants
   8522   #if stbir__coder_min_num < 4
   8523   STBIR_NO_UNROLL_LOOP_START
   8524   while( decode < decode_end )
   8525   {
   8526     STBIR_NO_UNROLL(decode);
   8527     decode[0] = ((float)(input[stbir__decode_order0]));
   8528     #if stbir__coder_min_num >= 2
   8529     decode[1] = ((float)(input[stbir__decode_order1]));
   8530     #endif
   8531     #if stbir__coder_min_num >= 3
   8532     decode[2] = ((float)(input[stbir__decode_order2]));
   8533     #endif
   8534     decode += stbir__coder_min_num;
   8535     input += stbir__coder_min_num;
   8536   }
   8537   #endif
   8538 }
   8539 
   8540 static void STBIR__CODER_NAME( stbir__encode_uint8_linear )( void * outputp, int width_times_channels, float const * encode )
   8541 {
   8542   unsigned char STBIR_SIMD_STREAMOUT_PTR( * ) output = (unsigned char *) outputp;
   8543   unsigned char * end_output = ( (unsigned char *) output ) + width_times_channels;
   8544 
   8545   #ifdef STBIR_SIMD
   8546   if ( width_times_channels >= stbir__simdfX_float_count*2 )
   8547   {
   8548     float const * end_encode_m8 = encode + width_times_channels - stbir__simdfX_float_count*2;
   8549     end_output -= stbir__simdfX_float_count*2;
   8550     STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
   8551     for(;;)
   8552     {
   8553       stbir__simdfX e0, e1;
   8554       stbir__simdi i;
   8555       STBIR_SIMD_NO_UNROLL(encode);
   8556       stbir__simdfX_add_mem( e0, STBIR_simd_point5X, encode );
   8557       stbir__simdfX_add_mem( e1, STBIR_simd_point5X, encode+stbir__simdfX_float_count );
   8558       stbir__encode_simdfX_unflip( e0 );
   8559       stbir__encode_simdfX_unflip( e1 );
   8560       #ifdef STBIR_SIMD8
   8561       stbir__simdf8_pack_to_16bytes( i, e0, e1 );
   8562       stbir__simdi_store( output, i );
   8563       #else
   8564       stbir__simdf_pack_to_8bytes( i, e0, e1 );
   8565       stbir__simdi_store2( output, i );
   8566       #endif
   8567       encode += stbir__simdfX_float_count*2;
   8568       output += stbir__simdfX_float_count*2;
   8569       if ( output <= end_output )
   8570         continue;
   8571       if ( output == ( end_output + stbir__simdfX_float_count*2 ) )
   8572         break;
   8573       output = end_output; // backup and do last couple
   8574       encode = end_encode_m8;
   8575     }
   8576     return;
   8577   }
   8578 
   8579   // try to do blocks of 4 when you can
   8580   #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
   8581   output += 4;
   8582   STBIR_NO_UNROLL_LOOP_START
   8583   while( output <= end_output )
   8584   {
   8585     stbir__simdf e0;
   8586     stbir__simdi i0;
   8587     STBIR_NO_UNROLL(encode);
   8588     stbir__simdf_load( e0, encode );
   8589     stbir__simdf_add( e0, STBIR__CONSTF(STBIR_simd_point5), e0 );
   8590     stbir__encode_simdf4_unflip( e0 );
   8591     stbir__simdf_pack_to_8bytes( i0, e0, e0 );  // only use first 4
   8592     *(int*)(output-4) = stbir__simdi_to_int( i0 );
   8593     output += 4;
   8594     encode += 4;
   8595   }
   8596   output -= 4;
   8597   #endif
   8598 
   8599   #else
   8600 
   8601   // try to do blocks of 4 when you can
   8602   #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
   8603   output += 4;
   8604   while( output <= end_output )
   8605   {
   8606     float f;
   8607     f = encode[stbir__encode_order0] + 0.5f; STBIR_CLAMP(f, 0, 255); output[0-4] = (unsigned char)f;
   8608     f = encode[stbir__encode_order1] + 0.5f; STBIR_CLAMP(f, 0, 255); output[1-4] = (unsigned char)f;
   8609     f = encode[stbir__encode_order2] + 0.5f; STBIR_CLAMP(f, 0, 255); output[2-4] = (unsigned char)f;
   8610     f = encode[stbir__encode_order3] + 0.5f; STBIR_CLAMP(f, 0, 255); output[3-4] = (unsigned char)f;
   8611     output += 4;
   8612     encode += 4;
   8613   }
   8614   output -= 4;
   8615   #endif
   8616 
   8617   #endif
   8618 
   8619   // do the remnants
   8620   #if stbir__coder_min_num < 4
   8621   STBIR_NO_UNROLL_LOOP_START
   8622   while( output < end_output )
   8623   {
   8624     float f;
   8625     STBIR_NO_UNROLL(encode);
   8626     f = encode[stbir__encode_order0] + 0.5f; STBIR_CLAMP(f, 0, 255); output[0] = (unsigned char)f;
   8627     #if stbir__coder_min_num >= 2
   8628     f = encode[stbir__encode_order1] + 0.5f; STBIR_CLAMP(f, 0, 255); output[1] = (unsigned char)f;
   8629     #endif
   8630     #if stbir__coder_min_num >= 3
   8631     f = encode[stbir__encode_order2] + 0.5f; STBIR_CLAMP(f, 0, 255); output[2] = (unsigned char)f;
   8632     #endif
   8633     output += stbir__coder_min_num;
   8634     encode += stbir__coder_min_num;
   8635   }
   8636   #endif
   8637 }
   8638 
   8639 static void STBIR__CODER_NAME(stbir__decode_uint8_srgb)( float * decodep, int width_times_channels, void const * inputp )
   8640 {
   8641   float STBIR_STREAMOUT_PTR( * ) decode = decodep;
   8642   float const * decode_end = (float*) decode + width_times_channels;
   8643   unsigned char const * input = (unsigned char const *)inputp;
   8644 
   8645   // try to do blocks of 4 when you can
   8646   #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
   8647   decode += 4;
   8648   while( decode <= decode_end )
   8649   {
   8650     decode[0-4] = stbir__srgb_uchar_to_linear_float[ input[ stbir__decode_order0 ] ];
   8651     decode[1-4] = stbir__srgb_uchar_to_linear_float[ input[ stbir__decode_order1 ] ];
   8652     decode[2-4] = stbir__srgb_uchar_to_linear_float[ input[ stbir__decode_order2 ] ];
   8653     decode[3-4] = stbir__srgb_uchar_to_linear_float[ input[ stbir__decode_order3 ] ];
   8654     decode += 4;
   8655     input += 4;
   8656   }
   8657   decode -= 4;
   8658   #endif
   8659 
   8660   // do the remnants
   8661   #if stbir__coder_min_num < 4
   8662   STBIR_NO_UNROLL_LOOP_START
   8663   while( decode < decode_end )
   8664   {
   8665     STBIR_NO_UNROLL(decode);
   8666     decode[0] = stbir__srgb_uchar_to_linear_float[ input[ stbir__decode_order0 ] ];
   8667     #if stbir__coder_min_num >= 2
   8668     decode[1] = stbir__srgb_uchar_to_linear_float[ input[ stbir__decode_order1 ] ];
   8669     #endif
   8670     #if stbir__coder_min_num >= 3
   8671     decode[2] = stbir__srgb_uchar_to_linear_float[ input[ stbir__decode_order2 ] ];
   8672     #endif
   8673     decode += stbir__coder_min_num;
   8674     input += stbir__coder_min_num;
   8675   }
   8676   #endif
   8677 }
   8678 
   8679 #define stbir__min_max_shift20( i, f ) \
   8680     stbir__simdf_max( f, f, stbir_simdf_casti(STBIR__CONSTI( STBIR_almost_zero )) ); \
   8681     stbir__simdf_min( f, f, stbir_simdf_casti(STBIR__CONSTI( STBIR_almost_one  )) ); \
   8682     stbir__simdi_32shr( i, stbir_simdi_castf( f ), 20 );
   8683 
   8684 #define stbir__scale_and_convert( i, f ) \
   8685     stbir__simdf_madd( f, STBIR__CONSTF( STBIR_simd_point5 ), STBIR__CONSTF( STBIR_max_uint8_as_float ), f ); \
   8686     stbir__simdf_max( f, f, stbir__simdf_zeroP() ); \
   8687     stbir__simdf_min( f, f, STBIR__CONSTF( STBIR_max_uint8_as_float ) ); \
   8688     stbir__simdf_convert_float_to_i32( i, f );
   8689 
   8690 #define stbir__linear_to_srgb_finish( i, f ) \
   8691 { \
   8692     stbir__simdi temp;  \
   8693     stbir__simdi_32shr( temp, stbir_simdi_castf( f ), 12 ) ; \
   8694     stbir__simdi_and( temp, temp, STBIR__CONSTI(STBIR_mastissa_mask) ); \
   8695     stbir__simdi_or( temp, temp, STBIR__CONSTI(STBIR_topscale) ); \
   8696     stbir__simdi_16madd( i, i, temp ); \
   8697     stbir__simdi_32shr( i, i, 16 ); \
   8698 }
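
        // Note on the three helpers above (descriptive only, inferred from the surrounding code):
        // stbir__min_max_shift20 clamps the linear value into (0,1) and shifts its raw IEEE-754
        // bits right by 20, so the exponent plus the top mantissa bits index fp32_to_srgb8_tab4;
        // each table entry appears to pack a bias and a scale, and stbir__linear_to_srgb_finish
        // combines that entry with further mantissa bits via a 16-bit multiply-add to produce the
        // 8-bit sRGB value - a piecewise-linear fit of the sRGB curve done entirely in integer
        // math. stbir__scale_and_convert is the plain linear path: scale by 255, round, clamp.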
   8699 
   8700 #define stbir__simdi_table_lookup2( v0,v1, table ) \
   8701 { \
   8702   stbir__simdi_u32 temp0,temp1; \
   8703   temp0.m128i_i128 = v0; \
   8704   temp1.m128i_i128 = v1; \
   8705   temp0.m128i_u32[0] = table[temp0.m128i_i32[0]]; temp0.m128i_u32[1] = table[temp0.m128i_i32[1]]; temp0.m128i_u32[2] = table[temp0.m128i_i32[2]]; temp0.m128i_u32[3] = table[temp0.m128i_i32[3]]; \
   8706   temp1.m128i_u32[0] = table[temp1.m128i_i32[0]]; temp1.m128i_u32[1] = table[temp1.m128i_i32[1]]; temp1.m128i_u32[2] = table[temp1.m128i_i32[2]]; temp1.m128i_u32[3] = table[temp1.m128i_i32[3]]; \
   8707   v0 = temp0.m128i_i128; \
   8708   v1 = temp1.m128i_i128; \
   8709 }
   8710 
   8711 #define stbir__simdi_table_lookup3( v0,v1,v2, table ) \
   8712 { \
   8713   stbir__simdi_u32 temp0,temp1,temp2; \
   8714   temp0.m128i_i128 = v0; \
   8715   temp1.m128i_i128 = v1; \
   8716   temp2.m128i_i128 = v2; \
   8717   temp0.m128i_u32[0] = table[temp0.m128i_i32[0]]; temp0.m128i_u32[1] = table[temp0.m128i_i32[1]]; temp0.m128i_u32[2] = table[temp0.m128i_i32[2]]; temp0.m128i_u32[3] = table[temp0.m128i_i32[3]]; \
   8718   temp1.m128i_u32[0] = table[temp1.m128i_i32[0]]; temp1.m128i_u32[1] = table[temp1.m128i_i32[1]]; temp1.m128i_u32[2] = table[temp1.m128i_i32[2]]; temp1.m128i_u32[3] = table[temp1.m128i_i32[3]]; \
   8719   temp2.m128i_u32[0] = table[temp2.m128i_i32[0]]; temp2.m128i_u32[1] = table[temp2.m128i_i32[1]]; temp2.m128i_u32[2] = table[temp2.m128i_i32[2]]; temp2.m128i_u32[3] = table[temp2.m128i_i32[3]]; \
   8720   v0 = temp0.m128i_i128; \
   8721   v1 = temp1.m128i_i128; \
   8722   v2 = temp2.m128i_i128; \
   8723 }
   8724 
   8725 #define stbir__simdi_table_lookup4( v0,v1,v2,v3, table ) \
   8726 { \
   8727   stbir__simdi_u32 temp0,temp1,temp2,temp3; \
   8728   temp0.m128i_i128 = v0; \
   8729   temp1.m128i_i128 = v1; \
   8730   temp2.m128i_i128 = v2; \
   8731   temp3.m128i_i128 = v3; \
   8732   temp0.m128i_u32[0] = table[temp0.m128i_i32[0]]; temp0.m128i_u32[1] = table[temp0.m128i_i32[1]]; temp0.m128i_u32[2] = table[temp0.m128i_i32[2]]; temp0.m128i_u32[3] = table[temp0.m128i_i32[3]]; \
   8733   temp1.m128i_u32[0] = table[temp1.m128i_i32[0]]; temp1.m128i_u32[1] = table[temp1.m128i_i32[1]]; temp1.m128i_u32[2] = table[temp1.m128i_i32[2]]; temp1.m128i_u32[3] = table[temp1.m128i_i32[3]]; \
   8734   temp2.m128i_u32[0] = table[temp2.m128i_i32[0]]; temp2.m128i_u32[1] = table[temp2.m128i_i32[1]]; temp2.m128i_u32[2] = table[temp2.m128i_i32[2]]; temp2.m128i_u32[3] = table[temp2.m128i_i32[3]]; \
   8735   temp3.m128i_u32[0] = table[temp3.m128i_i32[0]]; temp3.m128i_u32[1] = table[temp3.m128i_i32[1]]; temp3.m128i_u32[2] = table[temp3.m128i_i32[2]]; temp3.m128i_u32[3] = table[temp3.m128i_i32[3]]; \
   8736   v0 = temp0.m128i_i128; \
   8737   v1 = temp1.m128i_i128; \
   8738   v2 = temp2.m128i_i128; \
   8739   v3 = temp3.m128i_i128; \
   8740 }
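
        // The lookup macros above fetch each lane through the stbir__simdi_u32 union with four
        // scalar loads per register - presumably because a gather instruction can't be assumed
        // on the minimum SIMD targets, so the vector is spilled, indexed element by element,
        // and reloaded.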
   8741 
   8742 static void STBIR__CODER_NAME( stbir__encode_uint8_srgb )( void * outputp, int width_times_channels, float const * encode )
   8743 {
   8744   unsigned char STBIR_SIMD_STREAMOUT_PTR( * ) output = (unsigned char*) outputp;
   8745   unsigned char * end_output = ( (unsigned char*) output ) + width_times_channels;
   8746 
   8747   #ifdef STBIR_SIMD
   8748 
   8749   if ( width_times_channels >= 16 )
   8750   {
   8751     float const * end_encode_m16 = encode + width_times_channels - 16;
   8752     end_output -= 16;
   8753     STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
   8754     for(;;)
   8755     {
   8756       stbir__simdf f0, f1, f2, f3;
   8757       stbir__simdi i0, i1, i2, i3;
   8758       STBIR_SIMD_NO_UNROLL(encode);
   8759 
   8760       stbir__simdf_load4_transposed( f0, f1, f2, f3, encode );
   8761 
   8762       stbir__min_max_shift20( i0, f0 );
   8763       stbir__min_max_shift20( i1, f1 );
   8764       stbir__min_max_shift20( i2, f2 );
   8765       stbir__min_max_shift20( i3, f3 );
   8766 
   8767       stbir__simdi_table_lookup4( i0, i1, i2, i3, ( fp32_to_srgb8_tab4 - (127-13)*8 ) );
   8768 
   8769       stbir__linear_to_srgb_finish( i0, f0 );
   8770       stbir__linear_to_srgb_finish( i1, f1 );
   8771       stbir__linear_to_srgb_finish( i2, f2 );
   8772       stbir__linear_to_srgb_finish( i3, f3 );
   8773 
   8774       stbir__interleave_pack_and_store_16_u8( output,  STBIR_strs_join1(i, ,stbir__encode_order0), STBIR_strs_join1(i, ,stbir__encode_order1), STBIR_strs_join1(i, ,stbir__encode_order2), STBIR_strs_join1(i, ,stbir__encode_order3) );
   8775 
   8776       encode += 16;
   8777       output += 16;
   8778       if ( output <= end_output )
   8779         continue;
   8780       if ( output == ( end_output + 16 ) )
   8781         break;
   8782       output = end_output; // backup and do last couple
   8783       encode = end_encode_m16;
   8784     }
   8785     return;
   8786   }
   8787   #endif
   8788 
   8789   // try to do blocks of 4 when you can
   8790   #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
   8791   output += 4;
   8792   STBIR_SIMD_NO_UNROLL_LOOP_START
   8793   while ( output <= end_output )
   8794   {
   8795     STBIR_SIMD_NO_UNROLL(encode);
   8796 
   8797     output[0-4] = stbir__linear_to_srgb_uchar( encode[stbir__encode_order0] );
   8798     output[1-4] = stbir__linear_to_srgb_uchar( encode[stbir__encode_order1] );
   8799     output[2-4] = stbir__linear_to_srgb_uchar( encode[stbir__encode_order2] );
   8800     output[3-4] = stbir__linear_to_srgb_uchar( encode[stbir__encode_order3] );
   8801 
   8802     output += 4;
   8803     encode += 4;
   8804   }
   8805   output -= 4;
   8806   #endif
   8807 
   8808   // do the remnants
   8809   #if stbir__coder_min_num < 4
   8810   STBIR_NO_UNROLL_LOOP_START
   8811   while( output < end_output )
   8812   {
   8813     STBIR_NO_UNROLL(encode);
   8814     output[0] = stbir__linear_to_srgb_uchar( encode[stbir__encode_order0] );
   8815     #if stbir__coder_min_num >= 2
   8816     output[1] = stbir__linear_to_srgb_uchar( encode[stbir__encode_order1] );
   8817     #endif
   8818     #if stbir__coder_min_num >= 3
   8819     output[2] = stbir__linear_to_srgb_uchar( encode[stbir__encode_order2] );
   8820     #endif
   8821     output += stbir__coder_min_num;
   8822     encode += stbir__coder_min_num;
   8823   }
   8824   #endif
   8825 }
   8826 
   8827 #if ( stbir__coder_min_num == 4 ) || ( ( stbir__coder_min_num == 1 ) && ( !defined(stbir__decode_swizzle) ) )
   8828 
   8829 static void STBIR__CODER_NAME(stbir__decode_uint8_srgb4_linearalpha)( float * decodep, int width_times_channels, void const * inputp )
   8830 {
   8831   float STBIR_STREAMOUT_PTR( * ) decode = decodep;
   8832   float const * decode_end = (float*) decode + width_times_channels;
   8833   unsigned char const * input = (unsigned char const *)inputp;
   8834   do {
   8835     decode[0] = stbir__srgb_uchar_to_linear_float[ input[stbir__decode_order0] ];
   8836     decode[1] = stbir__srgb_uchar_to_linear_float[ input[stbir__decode_order1] ];
   8837     decode[2] = stbir__srgb_uchar_to_linear_float[ input[stbir__decode_order2] ];
   8838     decode[3] = ( (float) input[stbir__decode_order3] ) * stbir__max_uint8_as_float_inverted;
   8839     input += 4;
   8840     decode += 4;
   8841   } while( decode < decode_end );
   8842 }
   8843 
   8844 
   8845 static void STBIR__CODER_NAME( stbir__encode_uint8_srgb4_linearalpha )( void * outputp, int width_times_channels, float const * encode )
   8846 {
   8847   unsigned char STBIR_SIMD_STREAMOUT_PTR( * ) output = (unsigned char*) outputp;
   8848   unsigned char * end_output = ( (unsigned char*) output ) + width_times_channels;
   8849 
   8850   #ifdef STBIR_SIMD
   8851 
   8852   if ( width_times_channels >= 16 )
   8853   {
   8854     float const * end_encode_m16 = encode + width_times_channels - 16;
   8855     end_output -= 16;
   8856     STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
   8857     for(;;)
   8858     {
   8859       stbir__simdf f0, f1, f2, f3;
   8860       stbir__simdi i0, i1, i2, i3;
   8861 
   8862       STBIR_SIMD_NO_UNROLL(encode);
   8863       stbir__simdf_load4_transposed( f0, f1, f2, f3, encode );
   8864 
   8865       stbir__min_max_shift20( i0, f0 );
   8866       stbir__min_max_shift20( i1, f1 );
   8867       stbir__min_max_shift20( i2, f2 );
   8868       stbir__scale_and_convert( i3, f3 );
   8869 
   8870       stbir__simdi_table_lookup3( i0, i1, i2, ( fp32_to_srgb8_tab4 - (127-13)*8 ) );
   8871 
   8872       stbir__linear_to_srgb_finish( i0, f0 );
   8873       stbir__linear_to_srgb_finish( i1, f1 );
   8874       stbir__linear_to_srgb_finish( i2, f2 );
   8875 
   8876       stbir__interleave_pack_and_store_16_u8( output,  STBIR_strs_join1(i, ,stbir__encode_order0), STBIR_strs_join1(i, ,stbir__encode_order1), STBIR_strs_join1(i, ,stbir__encode_order2), STBIR_strs_join1(i, ,stbir__encode_order3) );
   8877 
   8878       output += 16;
   8879       encode += 16;
   8880 
   8881       if ( output <= end_output )
   8882         continue;
   8883       if ( output == ( end_output + 16 ) )
   8884         break;
   8885       output = end_output; // backup and do last couple
   8886       encode = end_encode_m16;
   8887     }
   8888     return;
   8889   }
   8890   #endif
   8891 
   8892   STBIR_SIMD_NO_UNROLL_LOOP_START
   8893   do {
   8894     float f;
   8895     STBIR_SIMD_NO_UNROLL(encode);
   8896 
   8897     output[stbir__decode_order0] = stbir__linear_to_srgb_uchar( encode[0] );
   8898     output[stbir__decode_order1] = stbir__linear_to_srgb_uchar( encode[1] );
   8899     output[stbir__decode_order2] = stbir__linear_to_srgb_uchar( encode[2] );
   8900 
   8901     f = encode[3] * stbir__max_uint8_as_float + 0.5f;
   8902     STBIR_CLAMP(f, 0, 255);
   8903     output[stbir__decode_order3] = (unsigned char) f;
   8904 
   8905     output += 4;
   8906     encode += 4;
   8907   } while( output < end_output );
   8908 }
   8909 
   8910 #endif
   8911 
   8912 #if ( stbir__coder_min_num == 2 ) || ( ( stbir__coder_min_num == 1 ) && ( !defined(stbir__decode_swizzle) ) )
   8913 
   8914 static void STBIR__CODER_NAME(stbir__decode_uint8_srgb2_linearalpha)( float * decodep, int width_times_channels, void const * inputp )
   8915 {
   8916   float STBIR_STREAMOUT_PTR( * ) decode = decodep;
   8917   float const * decode_end = (float*) decode + width_times_channels;
   8918   unsigned char const * input = (unsigned char const *)inputp;
   8919   decode += 4;
   8920   while( decode <= decode_end )
   8921   {
   8922     decode[0-4] = stbir__srgb_uchar_to_linear_float[ input[stbir__decode_order0] ];
   8923     decode[1-4] = ( (float) input[stbir__decode_order1] ) * stbir__max_uint8_as_float_inverted;
   8924     decode[2-4] = stbir__srgb_uchar_to_linear_float[ input[stbir__decode_order0+2] ];
   8925     decode[3-4] = ( (float) input[stbir__decode_order1+2] ) * stbir__max_uint8_as_float_inverted;
   8926     input += 4;
   8927     decode += 4;
   8928   }
   8929   decode -= 4;
   8930   if( decode < decode_end )
   8931   {
   8932     decode[0] = stbir__srgb_uchar_to_linear_float[ input[ stbir__decode_order0 ] ];
   8933     decode[1] = ( (float) input[stbir__decode_order1] ) * stbir__max_uint8_as_float_inverted;
   8934   }
   8935 }
   8936 
   8937 static void STBIR__CODER_NAME( stbir__encode_uint8_srgb2_linearalpha )( void * outputp, int width_times_channels, float const * encode )
   8938 {
   8939   unsigned char STBIR_SIMD_STREAMOUT_PTR( * ) output = (unsigned char*) outputp;
   8940   unsigned char * end_output = ( (unsigned char*) output ) + width_times_channels;
   8941 
   8942   #ifdef STBIR_SIMD
   8943 
   8944   if ( width_times_channels >= 16 )
   8945   {
   8946     float const * end_encode_m16 = encode + width_times_channels - 16;
   8947     end_output -= 16;
   8948     STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
   8949     for(;;)
   8950     {
   8951       stbir__simdf f0, f1, f2, f3;
   8952       stbir__simdi i0, i1, i2, i3;
   8953 
   8954       STBIR_SIMD_NO_UNROLL(encode);
   8955       stbir__simdf_load4_transposed( f0, f1, f2, f3, encode );
   8956 
   8957       stbir__min_max_shift20( i0, f0 );
   8958       stbir__scale_and_convert( i1, f1 );
   8959       stbir__min_max_shift20( i2, f2 );
   8960       stbir__scale_and_convert( i3, f3 );
   8961 
   8962       stbir__simdi_table_lookup2( i0, i2, ( fp32_to_srgb8_tab4 - (127-13)*8 ) );
   8963 
   8964       stbir__linear_to_srgb_finish( i0, f0 );
   8965       stbir__linear_to_srgb_finish( i2, f2 );
   8966 
   8967       stbir__interleave_pack_and_store_16_u8( output,  STBIR_strs_join1(i, ,stbir__encode_order0), STBIR_strs_join1(i, ,stbir__encode_order1), STBIR_strs_join1(i, ,stbir__encode_order2), STBIR_strs_join1(i, ,stbir__encode_order3) );
   8968 
   8969       output += 16;
   8970       encode += 16;
   8971       if ( output <= end_output )
   8972         continue;
   8973       if ( output == ( end_output + 16 ) )
   8974         break;
   8975       output = end_output; // backup and do last couple
   8976       encode = end_encode_m16;
   8977     }
   8978     return;
   8979   }
   8980   #endif
   8981 
   8982   STBIR_SIMD_NO_UNROLL_LOOP_START
   8983   do {
   8984     float f;
   8985     STBIR_SIMD_NO_UNROLL(encode);
   8986 
   8987     output[stbir__decode_order0] = stbir__linear_to_srgb_uchar( encode[0] );
   8988 
   8989     f = encode[1] * stbir__max_uint8_as_float + 0.5f;
   8990     STBIR_CLAMP(f, 0, 255);
   8991     output[stbir__decode_order1] = (unsigned char) f;
   8992 
   8993     output += 2;
   8994     encode += 2;
   8995   } while( output < end_output );
   8996 }
   8997 
   8998 #endif
   8999 
   9000 static void STBIR__CODER_NAME(stbir__decode_uint16_linear_scaled)( float * decodep, int width_times_channels, void const * inputp )
   9001 {
   9002   float STBIR_STREAMOUT_PTR( * ) decode = decodep;
   9003   float * decode_end = (float*) decode + width_times_channels;
   9004   unsigned short const * input = (unsigned short const *)inputp;
   9005 
   9006   #ifdef STBIR_SIMD
   9007   unsigned short const * end_input_m8 = input + width_times_channels - 8;
   9008   if ( width_times_channels >= 8 )
   9009   {
   9010     decode_end -= 8;
   9011     STBIR_NO_UNROLL_LOOP_START_INF_FOR
   9012     for(;;)
   9013     {
   9014       #ifdef STBIR_SIMD8
   9015       stbir__simdi i; stbir__simdi8 o;
   9016       stbir__simdf8 of;
   9017       STBIR_NO_UNROLL(decode);
   9018       stbir__simdi_load( i, input );
   9019       stbir__simdi8_expand_u16_to_u32( o, i );
   9020       stbir__simdi8_convert_i32_to_float( of, o );
   9021       stbir__simdf8_mult( of, of, STBIR_max_uint16_as_float_inverted8);
   9022       stbir__decode_simdf8_flip( of );
   9023       stbir__simdf8_store( decode + 0, of );
   9024       #else
   9025       stbir__simdi i, o0, o1;
   9026       stbir__simdf of0, of1;
   9027       STBIR_NO_UNROLL(decode);
   9028       stbir__simdi_load( i, input );
   9029       stbir__simdi_expand_u16_to_u32( o0,o1,i );
   9030       stbir__simdi_convert_i32_to_float( of0, o0 );
   9031       stbir__simdi_convert_i32_to_float( of1, o1 );
   9032       stbir__simdf_mult( of0, of0, STBIR__CONSTF(STBIR_max_uint16_as_float_inverted) );
   9033       stbir__simdf_mult( of1, of1, STBIR__CONSTF(STBIR_max_uint16_as_float_inverted));
   9034       stbir__decode_simdf4_flip( of0 );
   9035       stbir__decode_simdf4_flip( of1 );
   9036       stbir__simdf_store( decode + 0,  of0 );
   9037       stbir__simdf_store( decode + 4,  of1 );
   9038       #endif
   9039       decode += 8;
   9040       input += 8;
   9041       if ( decode <= decode_end )
   9042         continue;
   9043       if ( decode == ( decode_end + 8 ) )
   9044         break;
   9045       decode = decode_end; // backup and do last couple
   9046       input = end_input_m8;
   9047     }
   9048     return;
   9049   }
   9050   #endif
   9051 
   9052   // try to do blocks of 4 when you can
   9053   #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
   9054   decode += 4;
   9055   STBIR_SIMD_NO_UNROLL_LOOP_START
   9056   while( decode <= decode_end )
   9057   {
   9058     STBIR_SIMD_NO_UNROLL(decode);
   9059     decode[0-4] = ((float)(input[stbir__decode_order0])) * stbir__max_uint16_as_float_inverted;
   9060     decode[1-4] = ((float)(input[stbir__decode_order1])) * stbir__max_uint16_as_float_inverted;
   9061     decode[2-4] = ((float)(input[stbir__decode_order2])) * stbir__max_uint16_as_float_inverted;
   9062     decode[3-4] = ((float)(input[stbir__decode_order3])) * stbir__max_uint16_as_float_inverted;
   9063     decode += 4;
   9064     input += 4;
   9065   }
   9066   decode -= 4;
   9067   #endif
   9068 
   9069   // do the remnants
   9070   #if stbir__coder_min_num < 4
   9071   STBIR_NO_UNROLL_LOOP_START
   9072   while( decode < decode_end )
   9073   {
   9074     STBIR_NO_UNROLL(decode);
   9075     decode[0] = ((float)(input[stbir__decode_order0])) * stbir__max_uint16_as_float_inverted;
   9076     #if stbir__coder_min_num >= 2
   9077     decode[1] = ((float)(input[stbir__decode_order1])) * stbir__max_uint16_as_float_inverted;
   9078     #endif
   9079     #if stbir__coder_min_num >= 3
   9080     decode[2] = ((float)(input[stbir__decode_order2])) * stbir__max_uint16_as_float_inverted;
   9081     #endif
   9082     decode += stbir__coder_min_num;
   9083     input += stbir__coder_min_num;
   9084   }
   9085   #endif
   9086 }
   9087 
   9088 
   9089 static void STBIR__CODER_NAME(stbir__encode_uint16_linear_scaled)( void * outputp, int width_times_channels, float const * encode )
   9090 {
   9091   unsigned short STBIR_SIMD_STREAMOUT_PTR( * ) output = (unsigned short*) outputp;
   9092   unsigned short * end_output = ( (unsigned short*) output ) + width_times_channels;
   9093 
   9094   #ifdef STBIR_SIMD
   9095   {
   9096     if ( width_times_channels >= stbir__simdfX_float_count*2 )
   9097     {
   9098       float const * end_encode_m8 = encode + width_times_channels - stbir__simdfX_float_count*2;
   9099       end_output -= stbir__simdfX_float_count*2;
   9100       STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
   9101       for(;;)
   9102       {
   9103         stbir__simdfX e0, e1;
   9104         stbir__simdiX i;
   9105         STBIR_SIMD_NO_UNROLL(encode);
   9106         stbir__simdfX_madd_mem( e0, STBIR_simd_point5X, STBIR_max_uint16_as_floatX, encode );
   9107         stbir__simdfX_madd_mem( e1, STBIR_simd_point5X, STBIR_max_uint16_as_floatX, encode+stbir__simdfX_float_count );
   9108         stbir__encode_simdfX_unflip( e0 );
   9109         stbir__encode_simdfX_unflip( e1 );
   9110         stbir__simdfX_pack_to_words( i, e0, e1 );
   9111         stbir__simdiX_store( output, i );
   9112         encode += stbir__simdfX_float_count*2;
   9113         output += stbir__simdfX_float_count*2;
   9114         if ( output <= end_output )
   9115           continue;
   9116         if ( output == ( end_output + stbir__simdfX_float_count*2 ) )
   9117           break;
   9118         output = end_output;     // backup and do last couple
   9119         encode = end_encode_m8;
   9120       }
   9121       return;
   9122     }
   9123   }
   9124 
   9125   // try to do blocks of 4 when you can
   9126   #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
   9127   output += 4;
   9128   STBIR_NO_UNROLL_LOOP_START
   9129   while( output <= end_output )
   9130   {
   9131     stbir__simdf e;
   9132     stbir__simdi i;
   9133     STBIR_NO_UNROLL(encode);
   9134     stbir__simdf_load( e, encode );
   9135     stbir__simdf_madd( e, STBIR__CONSTF(STBIR_simd_point5), STBIR__CONSTF(STBIR_max_uint16_as_float), e );
   9136     stbir__encode_simdf4_unflip( e );
   9137     stbir__simdf_pack_to_8words( i, e, e );  // only use first 4
   9138     stbir__simdi_store2( output-4, i );
   9139     output += 4;
   9140     encode += 4;
   9141   }
   9142   output -= 4;
   9143   #endif
   9144 
   9145   // do the remnants
   9146   #if stbir__coder_min_num < 4
   9147   STBIR_NO_UNROLL_LOOP_START
   9148   while( output < end_output )
   9149   {
   9150     stbir__simdf e;
   9151     STBIR_NO_UNROLL(encode);
   9152     stbir__simdf_madd1_mem( e, STBIR__CONSTF(STBIR_simd_point5), STBIR__CONSTF(STBIR_max_uint16_as_float), encode+stbir__encode_order0 ); output[0] = stbir__simdf_convert_float_to_short( e );
   9153     #if stbir__coder_min_num >= 2
   9154     stbir__simdf_madd1_mem( e, STBIR__CONSTF(STBIR_simd_point5), STBIR__CONSTF(STBIR_max_uint16_as_float), encode+stbir__encode_order1 ); output[1] = stbir__simdf_convert_float_to_short( e );
   9155     #endif
   9156     #if stbir__coder_min_num >= 3
   9157     stbir__simdf_madd1_mem( e, STBIR__CONSTF(STBIR_simd_point5), STBIR__CONSTF(STBIR_max_uint16_as_float), encode+stbir__encode_order2 ); output[2] = stbir__simdf_convert_float_to_short( e );
   9158     #endif
   9159     output += stbir__coder_min_num;
   9160     encode += stbir__coder_min_num;
   9161   }
   9162   #endif
   9163 
   9164   #else
   9165 
   9166   // try to do blocks of 4 when you can
   9167   #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
   9168   output += 4;
   9169   STBIR_SIMD_NO_UNROLL_LOOP_START
   9170   while( output <= end_output )
   9171   {
   9172     float f;
   9173     STBIR_SIMD_NO_UNROLL(encode);
   9174     f = encode[stbir__encode_order0] * stbir__max_uint16_as_float + 0.5f; STBIR_CLAMP(f, 0, 65535); output[0-4] = (unsigned short)f;
   9175     f = encode[stbir__encode_order1] * stbir__max_uint16_as_float + 0.5f; STBIR_CLAMP(f, 0, 65535); output[1-4] = (unsigned short)f;
   9176     f = encode[stbir__encode_order2] * stbir__max_uint16_as_float + 0.5f; STBIR_CLAMP(f, 0, 65535); output[2-4] = (unsigned short)f;
   9177     f = encode[stbir__encode_order3] * stbir__max_uint16_as_float + 0.5f; STBIR_CLAMP(f, 0, 65535); output[3-4] = (unsigned short)f;
   9178     output += 4;
   9179     encode += 4;
   9180   }
   9181   output -= 4;
   9182   #endif
   9183 
   9184   // do the remnants
   9185   #if stbir__coder_min_num < 4
   9186   STBIR_NO_UNROLL_LOOP_START
   9187   while( output < end_output )
   9188   {
   9189     float f;
   9190     STBIR_NO_UNROLL(encode);
   9191     f = encode[stbir__encode_order0] * stbir__max_uint16_as_float + 0.5f; STBIR_CLAMP(f, 0, 65535); output[0] = (unsigned short)f;
   9192     #if stbir__coder_min_num >= 2
   9193     f = encode[stbir__encode_order1] * stbir__max_uint16_as_float + 0.5f; STBIR_CLAMP(f, 0, 65535); output[1] = (unsigned short)f;
   9194     #endif
   9195     #if stbir__coder_min_num >= 3
   9196     f = encode[stbir__encode_order2] * stbir__max_uint16_as_float + 0.5f; STBIR_CLAMP(f, 0, 65535); output[2] = (unsigned short)f;
   9197     #endif
   9198     output += stbir__coder_min_num;
   9199     encode += stbir__coder_min_num;
   9200   }
   9201   #endif
   9202   #endif
   9203 }
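/* The scaled uint16 encoder above maps linear floats in [0,1] onto the full
   16-bit range. Per sample it is equivalent to this illustrative scalar sketch
   (mirroring the non-SIMD path):

      f = value * stbir__max_uint16_as_float + 0.5f;   // scale and round to nearest
      STBIR_CLAMP( f, 0, 65535 );
      out = (unsigned short) f;

   The SIMD loop fuses the scale and rounding into a single madd against
   STBIR_max_uint16_as_floatX and relies on the saturating pack-to-words for
   the clamp. The "backup and do last couple" step at the end of the infinite
   for(;;) re-runs the final full-width chunk from end_output / end_encode_m8,
   so no scalar tail loop is needed once the SIMD path is taken. */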
   9204 
   9205 static void STBIR__CODER_NAME(stbir__decode_uint16_linear)( float * decodep, int width_times_channels, void const * inputp )
   9206 {
   9207   float STBIR_STREAMOUT_PTR( * ) decode = decodep;
   9208   float * decode_end = (float*) decode + width_times_channels;
   9209   unsigned short const * input = (unsigned short const *)inputp;
   9210 
   9211   #ifdef STBIR_SIMD
   9212   unsigned short const * end_input_m8 = input + width_times_channels - 8;
   9213   if ( width_times_channels >= 8 )
   9214   {
   9215     decode_end -= 8;
   9216     STBIR_NO_UNROLL_LOOP_START_INF_FOR
   9217     for(;;)
   9218     {
   9219       #ifdef STBIR_SIMD8
   9220       stbir__simdi i; stbir__simdi8 o;
   9221       stbir__simdf8 of;
   9222       STBIR_NO_UNROLL(decode);
   9223       stbir__simdi_load( i, input );
   9224       stbir__simdi8_expand_u16_to_u32( o, i );
   9225       stbir__simdi8_convert_i32_to_float( of, o );
   9226       stbir__decode_simdf8_flip( of );
   9227       stbir__simdf8_store( decode + 0, of );
   9228       #else
   9229       stbir__simdi i, o0, o1;
   9230       stbir__simdf of0, of1;
   9231       STBIR_NO_UNROLL(decode);
   9232       stbir__simdi_load( i, input );
   9233       stbir__simdi_expand_u16_to_u32( o0, o1, i );
   9234       stbir__simdi_convert_i32_to_float( of0, o0 );
   9235       stbir__simdi_convert_i32_to_float( of1, o1 );
   9236       stbir__decode_simdf4_flip( of0 );
   9237       stbir__decode_simdf4_flip( of1 );
   9238       stbir__simdf_store( decode + 0,  of0 );
   9239       stbir__simdf_store( decode + 4,  of1 );
   9240       #endif
   9241       decode += 8;
   9242       input += 8;
   9243       if ( decode <= decode_end )
   9244         continue;
   9245       if ( decode == ( decode_end + 8 ) )
   9246         break;
   9247       decode = decode_end; // backup and do last couple
   9248       input = end_input_m8;
   9249     }
   9250     return;
   9251   }
   9252   #endif
   9253 
   9254   // try to do blocks of 4 when you can
   9255   #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
   9256   decode += 4;
   9257   STBIR_SIMD_NO_UNROLL_LOOP_START
   9258   while( decode <= decode_end )
   9259   {
   9260     STBIR_SIMD_NO_UNROLL(decode);
   9261     decode[0-4] = ((float)(input[stbir__decode_order0]));
   9262     decode[1-4] = ((float)(input[stbir__decode_order1]));
   9263     decode[2-4] = ((float)(input[stbir__decode_order2]));
   9264     decode[3-4] = ((float)(input[stbir__decode_order3]));
   9265     decode += 4;
   9266     input += 4;
   9267   }
   9268   decode -= 4;
   9269   #endif
   9270 
   9271   // do the remnants
   9272   #if stbir__coder_min_num < 4
   9273   STBIR_NO_UNROLL_LOOP_START
   9274   while( decode < decode_end )
   9275   {
   9276     STBIR_NO_UNROLL(decode);
   9277     decode[0] = ((float)(input[stbir__decode_order0]));
   9278     #if stbir__coder_min_num >= 2
   9279     decode[1] = ((float)(input[stbir__decode_order1]));
   9280     #endif
   9281     #if stbir__coder_min_num >= 3
   9282     decode[2] = ((float)(input[stbir__decode_order2]));
   9283     #endif
   9284     decode += stbir__coder_min_num;
   9285     input += stbir__coder_min_num;
   9286   }
   9287   #endif
   9288 }
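/* The linear uint16 decoder above only widens each sample to float and leaves
   it on the 0..65535 scale (no normalization); the matching linear encoder
   below just rounds and clamps without rescaling. The SIMD path expands
   u16 -> u32 -> float eight samples at a time, and the stbir__decode_orderN
   indices apply the optional channel swizzle (e.g. BGRA -> RGBA) as the
   values are written out. */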
   9289 
   9290 static void STBIR__CODER_NAME(stbir__encode_uint16_linear)( void * outputp, int width_times_channels, float const * encode )
   9291 {
   9292   unsigned short STBIR_SIMD_STREAMOUT_PTR( * ) output = (unsigned short*) outputp;
   9293   unsigned short * end_output = ( (unsigned short*) output ) + width_times_channels;
   9294 
   9295   #ifdef STBIR_SIMD
   9296   {
   9297     if ( width_times_channels >= stbir__simdfX_float_count*2 )
   9298     {
   9299       float const * end_encode_m8 = encode + width_times_channels - stbir__simdfX_float_count*2;
   9300       end_output -= stbir__simdfX_float_count*2;
   9301       STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
   9302       for(;;)
   9303       {
   9304         stbir__simdfX e0, e1;
   9305         stbir__simdiX i;
   9306         STBIR_SIMD_NO_UNROLL(encode);
   9307         stbir__simdfX_add_mem( e0, STBIR_simd_point5X, encode );
   9308         stbir__simdfX_add_mem( e1, STBIR_simd_point5X, encode+stbir__simdfX_float_count );
   9309         stbir__encode_simdfX_unflip( e0 );
   9310         stbir__encode_simdfX_unflip( e1 );
   9311         stbir__simdfX_pack_to_words( i, e0, e1 );
   9312         stbir__simdiX_store( output, i );
   9313         encode += stbir__simdfX_float_count*2;
   9314         output += stbir__simdfX_float_count*2;
   9315         if ( output <= end_output )
   9316           continue;
   9317         if ( output == ( end_output + stbir__simdfX_float_count*2 ) )
   9318           break;
   9319         output = end_output; // backup and do last couple
   9320         encode = end_encode_m8;
   9321       }
   9322       return;
   9323     }
   9324   }
   9325 
   9326   // try to do blocks of 4 when you can
   9327   #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
   9328   output += 4;
   9329   STBIR_NO_UNROLL_LOOP_START
   9330   while( output <= end_output )
   9331   {
   9332     stbir__simdf e;
   9333     stbir__simdi i;
   9334     STBIR_NO_UNROLL(encode);
   9335     stbir__simdf_load( e, encode );
   9336     stbir__simdf_add( e, STBIR__CONSTF(STBIR_simd_point5), e );
   9337     stbir__encode_simdf4_unflip( e );
   9338     stbir__simdf_pack_to_8words( i, e, e );  // only use first 4
   9339     stbir__simdi_store2( output-4, i );
   9340     output += 4;
   9341     encode += 4;
   9342   }
   9343   output -= 4;
   9344   #endif
   9345 
   9346   #else
   9347 
   9348   // try to do blocks of 4 when you can
   9349   #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
   9350   output += 4;
   9351   STBIR_SIMD_NO_UNROLL_LOOP_START
   9352   while( output <= end_output )
   9353   {
   9354     float f;
   9355     STBIR_SIMD_NO_UNROLL(encode);
   9356     f = encode[stbir__encode_order0] + 0.5f; STBIR_CLAMP(f, 0, 65535); output[0-4] = (unsigned short)f;
   9357     f = encode[stbir__encode_order1] + 0.5f; STBIR_CLAMP(f, 0, 65535); output[1-4] = (unsigned short)f;
   9358     f = encode[stbir__encode_order2] + 0.5f; STBIR_CLAMP(f, 0, 65535); output[2-4] = (unsigned short)f;
   9359     f = encode[stbir__encode_order3] + 0.5f; STBIR_CLAMP(f, 0, 65535); output[3-4] = (unsigned short)f;
   9360     output += 4;
   9361     encode += 4;
   9362   }
   9363   output -= 4;
   9364   #endif
   9365 
   9366   #endif
   9367 
   9368   // do the remnants
   9369   #if stbir__coder_min_num < 4
   9370   STBIR_NO_UNROLL_LOOP_START
   9371   while( output < end_output )
   9372   {
   9373     float f;
   9374     STBIR_NO_UNROLL(encode);
   9375     f = encode[stbir__encode_order0] + 0.5f; STBIR_CLAMP(f, 0, 65535); output[0] = (unsigned short)f;
   9376     #if stbir__coder_min_num >= 2
   9377     f = encode[stbir__encode_order1] + 0.5f; STBIR_CLAMP(f, 0, 65535); output[1] = (unsigned short)f;
   9378     #endif
   9379     #if stbir__coder_min_num >= 3
   9380     f = encode[stbir__encode_order2] + 0.5f; STBIR_CLAMP(f, 0, 65535); output[2] = (unsigned short)f;
   9381     #endif
   9382     output += stbir__coder_min_num;
   9383     encode += stbir__coder_min_num;
   9384   }
   9385   #endif
   9386 }
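/* Unlike the scaled variant, the linear uint16 encoder above assumes the
   incoming floats are already on the 0..65535 scale, so it only adds 0.5 for
   round-to-nearest and then clamps: via the saturating pack step in the SIMD
   path, or via the explicit STBIR_CLAMP( f, 0, 65535 ) in the scalar path. */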
   9387 
   9388 static void STBIR__CODER_NAME(stbir__decode_half_float_linear)( float * decodep, int width_times_channels, void const * inputp )
   9389 {
   9390   float STBIR_STREAMOUT_PTR( * ) decode = decodep;
   9391   float * decode_end = (float*) decode + width_times_channels;
   9392   stbir__FP16 const * input = (stbir__FP16 const *)inputp;
   9393 
   9394   #ifdef STBIR_SIMD
   9395   if ( width_times_channels >= 8 )
   9396   {
   9397     stbir__FP16 const * end_input_m8 = input + width_times_channels - 8;
   9398     decode_end -= 8;
   9399     STBIR_NO_UNROLL_LOOP_START_INF_FOR
   9400     for(;;)
   9401     {
   9402       STBIR_NO_UNROLL(decode);
   9403 
   9404       stbir__half_to_float_SIMD( decode, input );
   9405       #ifdef stbir__decode_swizzle
   9406       #ifdef STBIR_SIMD8
   9407       {
   9408         stbir__simdf8 of;
   9409         stbir__simdf8_load( of, decode );
   9410         stbir__decode_simdf8_flip( of );
   9411         stbir__simdf8_store( decode, of );
   9412       }
   9413       #else
   9414       {
   9415         stbir__simdf of0,of1;
   9416         stbir__simdf_load( of0, decode );
   9417         stbir__simdf_load( of1, decode+4 );
   9418         stbir__decode_simdf4_flip( of0 );
   9419         stbir__decode_simdf4_flip( of1 );
   9420         stbir__simdf_store( decode, of0 );
   9421         stbir__simdf_store( decode+4, of1 );
   9422       }
   9423       #endif
   9424       #endif
   9425       decode += 8;
   9426       input += 8;
   9427       if ( decode <= decode_end )
   9428         continue;
   9429       if ( decode == ( decode_end + 8 ) )
   9430         break;
   9431       decode = decode_end; // backup and do last couple
   9432       input = end_input_m8;
   9433     }
   9434     return;
   9435   }
   9436   #endif
   9437 
   9438   // try to do blocks of 4 when you can
   9439   #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
   9440   decode += 4;
   9441   STBIR_SIMD_NO_UNROLL_LOOP_START
   9442   while( decode <= decode_end )
   9443   {
   9444     STBIR_SIMD_NO_UNROLL(decode);
   9445     decode[0-4] = stbir__half_to_float(input[stbir__decode_order0]);
   9446     decode[1-4] = stbir__half_to_float(input[stbir__decode_order1]);
   9447     decode[2-4] = stbir__half_to_float(input[stbir__decode_order2]);
   9448     decode[3-4] = stbir__half_to_float(input[stbir__decode_order3]);
   9449     decode += 4;
   9450     input += 4;
   9451   }
   9452   decode -= 4;
   9453   #endif
   9454 
   9455   // do the remnants
   9456   #if stbir__coder_min_num < 4
   9457   STBIR_NO_UNROLL_LOOP_START
   9458   while( decode < decode_end )
   9459   {
   9460     STBIR_NO_UNROLL(decode);
   9461     decode[0] = stbir__half_to_float(input[stbir__decode_order0]);
   9462     #if stbir__coder_min_num >= 2
   9463     decode[1] = stbir__half_to_float(input[stbir__decode_order1]);
   9464     #endif
   9465     #if stbir__coder_min_num >= 3
   9466     decode[2] = stbir__half_to_float(input[stbir__decode_order2]);
   9467     #endif
   9468     decode += stbir__coder_min_num;
   9469     input += stbir__coder_min_num;
   9470   }
   9471   #endif
   9472 }
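/* The half-float decoder above converts eight FP16 values per iteration with
   stbir__half_to_float_SIMD directly into the decode buffer, then (only when
   stbir__decode_swizzle is defined) reloads and flips the channel order in
   place. The scalar remnant path converts one value at a time with
   stbir__half_to_float, applying the same stbir__decode_orderN indices. */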
   9473 
   9474 static void STBIR__CODER_NAME( stbir__encode_half_float_linear )( void * outputp, int width_times_channels, float const * encode )
   9475 {
   9476   stbir__FP16 STBIR_SIMD_STREAMOUT_PTR( * ) output = (stbir__FP16*) outputp;
   9477   stbir__FP16 * end_output = ( (stbir__FP16*) output ) + width_times_channels;
   9478 
   9479   #ifdef STBIR_SIMD
   9480   if ( width_times_channels >= 8 )
   9481   {
   9482     float const * end_encode_m8 = encode + width_times_channels - 8;
   9483     end_output -= 8;
   9484     STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
   9485     for(;;)
   9486     {
   9487       STBIR_SIMD_NO_UNROLL(encode);
   9488       #ifdef stbir__decode_swizzle
   9489       #ifdef STBIR_SIMD8
   9490       {
   9491         stbir__simdf8 of;
   9492         stbir__simdf8_load( of, encode );
   9493         stbir__encode_simdf8_unflip( of );
   9494         stbir__float_to_half_SIMD( output, (float*)&of );
   9495       }
   9496       #else
   9497       {
   9498         stbir__simdf of[2];
   9499         stbir__simdf_load( of[0], encode );
   9500         stbir__simdf_load( of[1], encode+4 );
   9501         stbir__encode_simdf4_unflip( of[0] );
   9502         stbir__encode_simdf4_unflip( of[1] );
   9503         stbir__float_to_half_SIMD( output, (float*)of );
   9504       }
   9505       #endif
   9506       #else
   9507       stbir__float_to_half_SIMD( output, encode );
   9508       #endif
   9509       encode += 8;
   9510       output += 8;
   9511       if ( output <= end_output )
   9512         continue;
   9513       if ( output == ( end_output + 8 ) )
   9514         break;
   9515       output = end_output; // backup and do last couple
   9516       encode = end_encode_m8;
   9517     }
   9518     return;
   9519   }
   9520   #endif
   9521 
   9522   // try to do blocks of 4 when you can
   9523   #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
   9524   output += 4;
   9525   STBIR_SIMD_NO_UNROLL_LOOP_START
   9526   while( output <= end_output )
   9527   {
   9528     STBIR_SIMD_NO_UNROLL(output);
   9529     output[0-4] = stbir__float_to_half(encode[stbir__encode_order0]);
   9530     output[1-4] = stbir__float_to_half(encode[stbir__encode_order1]);
   9531     output[2-4] = stbir__float_to_half(encode[stbir__encode_order2]);
   9532     output[3-4] = stbir__float_to_half(encode[stbir__encode_order3]);
   9533     output += 4;
   9534     encode += 4;
   9535   }
   9536   output -= 4;
   9537   #endif
   9538 
   9539   // do the remnants
   9540   #if stbir__coder_min_num < 4
   9541   STBIR_NO_UNROLL_LOOP_START
   9542   while( output < end_output )
   9543   {
   9544     STBIR_NO_UNROLL(output);
   9545     output[0] = stbir__float_to_half(encode[stbir__encode_order0]);
   9546     #if stbir__coder_min_num >= 2
   9547     output[1] = stbir__float_to_half(encode[stbir__encode_order1]);
   9548     #endif
   9549     #if stbir__coder_min_num >= 3
   9550     output[2] = stbir__float_to_half(encode[stbir__encode_order2]);
   9551     #endif
   9552     output += stbir__coder_min_num;
   9553     encode += stbir__coder_min_num;
   9554   }
   9555   #endif
   9556 }
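/* The half-float encoder above is the mirror image: when a swizzle is active,
   the floats are reordered in registers first (stbir__encode_simdfX_unflip)
   and then handed to stbir__float_to_half_SIMD; without a swizzle the encode
   buffer is converted directly, eight values per iteration. */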
   9557 
   9558 static void STBIR__CODER_NAME(stbir__decode_float_linear)( float * decodep, int width_times_channels, void const * inputp )
   9559 {
   9560   #ifdef stbir__decode_swizzle
   9561   float STBIR_STREAMOUT_PTR( * ) decode = decodep;
   9562   float * decode_end = (float*) decode + width_times_channels;
   9563   float const * input = (float const *)inputp;
   9564 
   9565   #ifdef STBIR_SIMD
   9566   if ( width_times_channels >= 16 )
   9567   {
   9568     float const * end_input_m16 = input + width_times_channels - 16;
   9569     decode_end -= 16;
   9570     STBIR_NO_UNROLL_LOOP_START_INF_FOR
   9571     for(;;)
   9572     {
   9573       STBIR_NO_UNROLL(decode);
   9574       #ifdef stbir__decode_swizzle
   9575       #ifdef STBIR_SIMD8
   9576       {
   9577         stbir__simdf8 of0,of1;
   9578         stbir__simdf8_load( of0, input );
   9579         stbir__simdf8_load( of1, input+8 );
   9580         stbir__decode_simdf8_flip( of0 );
   9581         stbir__decode_simdf8_flip( of1 );
   9582         stbir__simdf8_store( decode, of0 );
   9583         stbir__simdf8_store( decode+8, of1 );
   9584       }
   9585       #else
   9586       {
   9587         stbir__simdf of0,of1,of2,of3;
   9588         stbir__simdf_load( of0, input );
   9589         stbir__simdf_load( of1, input+4 );
   9590         stbir__simdf_load( of2, input+8 );
   9591         stbir__simdf_load( of3, input+12 );
   9592         stbir__decode_simdf4_flip( of0 );
   9593         stbir__decode_simdf4_flip( of1 );
   9594         stbir__decode_simdf4_flip( of2 );
   9595         stbir__decode_simdf4_flip( of3 );
   9596         stbir__simdf_store( decode, of0 );
   9597         stbir__simdf_store( decode+4, of1 );
   9598         stbir__simdf_store( decode+8, of2 );
   9599         stbir__simdf_store( decode+12, of3 );
   9600       }
   9601       #endif
   9602       #endif
   9603       decode += 16;
   9604       input += 16;
   9605       if ( decode <= decode_end )
   9606         continue;
   9607       if ( decode == ( decode_end + 16 ) )
   9608         break;
   9609       decode = decode_end; // backup and do last couple
   9610       input = end_input_m16;
   9611     }
   9612     return;
   9613   }
   9614   #endif
   9615 
   9616   // try to do blocks of 4 when you can
   9617   #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
   9618   decode += 4;
   9619   STBIR_SIMD_NO_UNROLL_LOOP_START
   9620   while( decode <= decode_end )
   9621   {
   9622     STBIR_SIMD_NO_UNROLL(decode);
   9623     decode[0-4] = input[stbir__decode_order0];
   9624     decode[1-4] = input[stbir__decode_order1];
   9625     decode[2-4] = input[stbir__decode_order2];
   9626     decode[3-4] = input[stbir__decode_order3];
   9627     decode += 4;
   9628     input += 4;
   9629   }
   9630   decode -= 4;
   9631   #endif
   9632 
   9633   // do the remnants
   9634   #if stbir__coder_min_num < 4
   9635   STBIR_NO_UNROLL_LOOP_START
   9636   while( decode < decode_end )
   9637   {
   9638     STBIR_NO_UNROLL(decode);
   9639     decode[0] = input[stbir__decode_order0];
   9640     #if stbir__coder_min_num >= 2
   9641     decode[1] = input[stbir__decode_order1];
   9642     #endif
   9643     #if stbir__coder_min_num >= 3
   9644     decode[2] = input[stbir__decode_order2];
   9645     #endif
   9646     decode += stbir__coder_min_num;
   9647     input += stbir__coder_min_num;
   9648   }
   9649   #endif
   9650 
   9651   #else
   9652 
   9653   if ( (void*)decodep != inputp )
   9654     STBIR_MEMCPY( decodep, inputp, width_times_channels * sizeof( float ) );
   9655 
   9656   #endif
   9657 }
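/* When no channel swizzle is configured, the float decoder above degenerates
   into a single STBIR_MEMCPY (skipped entirely if the decode buffer already
   aliases the input). With a swizzle it permutes 16 floats per iteration
   using the same flip helpers as the other decoders. */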
   9658 
   9659 static void STBIR__CODER_NAME( stbir__encode_float_linear )( void * outputp, int width_times_channels, float const * encode )
   9660 {
   9661   #if !defined( STBIR_FLOAT_HIGH_CLAMP ) && !defined(STBIR_FLOAT_LOW_CLAMP) && !defined(stbir__decode_swizzle)
   9662 
   9663   if ( (void*)outputp != (void*) encode )
   9664     STBIR_MEMCPY( outputp, encode, width_times_channels * sizeof( float ) );
   9665 
   9666   #else
   9667 
   9668   float STBIR_SIMD_STREAMOUT_PTR( * ) output = (float*) outputp;
   9669   float * end_output = ( (float*) output ) + width_times_channels;
   9670 
   9671   #ifdef STBIR_FLOAT_HIGH_CLAMP
   9672   #define stbir_scalar_hi_clamp( v ) if ( v > STBIR_FLOAT_HIGH_CLAMP ) v = STBIR_FLOAT_HIGH_CLAMP;
   9673   #else
   9674   #define stbir_scalar_hi_clamp( v )
   9675   #endif
   9676   #ifdef STBIR_FLOAT_LOW_CLAMP
   9677   #define stbir_scalar_lo_clamp( v ) if ( v < STBIR_FLOAT_LOW_CLAMP ) v = STBIR_FLOAT_LOW_CLAMP;
   9678   #else
   9679   #define stbir_scalar_lo_clamp( v )
   9680   #endif
   9681 
   9682   #ifdef STBIR_SIMD
   9683 
   9684   #ifdef STBIR_FLOAT_HIGH_CLAMP
   9685   const stbir__simdfX high_clamp = stbir__simdf_frepX(STBIR_FLOAT_HIGH_CLAMP);
   9686   #endif
   9687   #ifdef STBIR_FLOAT_LOW_CLAMP
   9688   const stbir__simdfX low_clamp = stbir__simdf_frepX(STBIR_FLOAT_LOW_CLAMP);
   9689   #endif
   9690 
   9691   if ( width_times_channels >= ( stbir__simdfX_float_count * 2 ) )
   9692   {
   9693     float const * end_encode_m8 = encode + width_times_channels - ( stbir__simdfX_float_count * 2 );
   9694     end_output -= ( stbir__simdfX_float_count * 2 );
   9695     STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
   9696     for(;;)
   9697     {
   9698       stbir__simdfX e0, e1;
   9699       STBIR_SIMD_NO_UNROLL(encode);
   9700       stbir__simdfX_load( e0, encode );
   9701       stbir__simdfX_load( e1, encode+stbir__simdfX_float_count );
   9702 #ifdef STBIR_FLOAT_HIGH_CLAMP
   9703       stbir__simdfX_min( e0, e0, high_clamp );
   9704       stbir__simdfX_min( e1, e1, high_clamp );
   9705 #endif
   9706 #ifdef STBIR_FLOAT_LOW_CLAMP
   9707       stbir__simdfX_max( e0, e0, low_clamp );
   9708       stbir__simdfX_max( e1, e1, low_clamp );
   9709 #endif
   9710       stbir__encode_simdfX_unflip( e0 );
   9711       stbir__encode_simdfX_unflip( e1 );
   9712       stbir__simdfX_store( output, e0 );
   9713       stbir__simdfX_store( output+stbir__simdfX_float_count, e1 );
   9714       encode += stbir__simdfX_float_count * 2;
   9715       output += stbir__simdfX_float_count * 2;
   9716       if ( output < end_output )
   9717         continue;
   9718       if ( output == ( end_output + ( stbir__simdfX_float_count * 2 ) ) )
   9719         break;
   9720       output = end_output; // backup and do last couple
   9721       encode = end_encode_m8;
   9722     }
   9723     return;
   9724   }
   9725 
   9726   // try to do blocks of 4 when you can
   9727   #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
   9728   output += 4;
   9729   STBIR_NO_UNROLL_LOOP_START
   9730   while( output <= end_output )
   9731   {
   9732     stbir__simdf e0;
   9733     STBIR_NO_UNROLL(encode);
   9734     stbir__simdf_load( e0, encode );
   9735 #ifdef STBIR_FLOAT_HIGH_CLAMP
   9736     stbir__simdf_min( e0, e0, high_clamp );
   9737 #endif
   9738 #ifdef STBIR_FLOAT_LOW_CLAMP
   9739     stbir__simdf_max( e0, e0, low_clamp );
   9740 #endif
   9741     stbir__encode_simdf4_unflip( e0 );
   9742     stbir__simdf_store( output-4, e0 );
   9743     output += 4;
   9744     encode += 4;
   9745   }
   9746   output -= 4;
   9747   #endif
   9748 
   9749   #else
   9750 
   9751   // try to do blocks of 4 when you can
   9752   #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
   9753   output += 4;
   9754   STBIR_SIMD_NO_UNROLL_LOOP_START
   9755   while( output <= end_output )
   9756   {
   9757     float e;
   9758     STBIR_SIMD_NO_UNROLL(encode);
   9759     e = encode[ stbir__encode_order0 ]; stbir_scalar_hi_clamp( e ); stbir_scalar_lo_clamp( e ); output[0-4] = e;
   9760     e = encode[ stbir__encode_order1 ]; stbir_scalar_hi_clamp( e ); stbir_scalar_lo_clamp( e ); output[1-4] = e;
   9761     e = encode[ stbir__encode_order2 ]; stbir_scalar_hi_clamp( e ); stbir_scalar_lo_clamp( e ); output[2-4] = e;
   9762     e = encode[ stbir__encode_order3 ]; stbir_scalar_hi_clamp( e ); stbir_scalar_lo_clamp( e ); output[3-4] = e;
   9763     output += 4;
   9764     encode += 4;
   9765   }
   9766   output -= 4;
   9767 
   9768   #endif
   9769 
   9770   #endif
   9771 
   9772   // do the remnants
   9773   #if stbir__coder_min_num < 4
   9774   STBIR_NO_UNROLL_LOOP_START
   9775   while( output < end_output )
   9776   {
   9777     float e;
   9778     STBIR_NO_UNROLL(encode);
   9779     e = encode[ stbir__encode_order0 ]; stbir_scalar_hi_clamp( e ); stbir_scalar_lo_clamp( e ); output[0] = e;
   9780     #if stbir__coder_min_num >= 2
   9781     e = encode[ stbir__encode_order1 ]; stbir_scalar_hi_clamp( e ); stbir_scalar_lo_clamp( e ); output[1] = e;
   9782     #endif
   9783     #if stbir__coder_min_num >= 3
   9784     e = encode[ stbir__encode_order2 ]; stbir_scalar_hi_clamp( e ); stbir_scalar_lo_clamp( e ); output[2] = e;
   9785     #endif
   9786     output += stbir__coder_min_num;
   9787     encode += stbir__coder_min_num;
   9788   }
   9789   #endif
   9790 
   9791   #endif
   9792 }
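/* The float encoder above has the same memcpy fast path when neither
   STBIR_FLOAT_HIGH_CLAMP, STBIR_FLOAT_LOW_CLAMP nor a swizzle is in effect.
   Otherwise each value is optionally clamped from above and/or below, as in
   this illustrative scalar sketch of the two optional clamps:

      e = encode[i];
      if ( e > STBIR_FLOAT_HIGH_CLAMP ) e = STBIR_FLOAT_HIGH_CLAMP;   // only if defined
      if ( e < STBIR_FLOAT_LOW_CLAMP  ) e = STBIR_FLOAT_LOW_CLAMP;    // only if defined
      output[i] = e;

   which the SIMD path expresses as stbir__simdfX_min / stbir__simdfX_max
   against splatted clamp constants. */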
   9793 
   9794 #undef stbir__decode_suffix
   9795 #undef stbir__decode_simdf8_flip
   9796 #undef stbir__decode_simdf4_flip
   9797 #undef stbir__decode_order0
   9798 #undef stbir__decode_order1
   9799 #undef stbir__decode_order2
   9800 #undef stbir__decode_order3
   9801 #undef stbir__encode_order0
   9802 #undef stbir__encode_order1
   9803 #undef stbir__encode_order2
   9804 #undef stbir__encode_order3
   9805 #undef stbir__encode_simdf8_unflip
   9806 #undef stbir__encode_simdf4_unflip
   9807 #undef stbir__encode_simdfX_unflip
   9808 #undef STBIR__CODER_NAME
   9809 #undef stbir__coder_min_num
   9810 #undef stbir__decode_swizzle
   9811 #undef stbir_scalar_hi_clamp
   9812 #undef stbir_scalar_lo_clamp
   9813 #undef STB_IMAGE_RESIZE_DO_CODERS
   9814 
   9815 #elif defined( STB_IMAGE_RESIZE_DO_VERTICALS)
   9816 
   9817 #ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
   9818 #define STBIR_chans( start, end ) STBIR_strs_join14(start,STBIR__vertical_channels,end,_cont)
   9819 #else
   9820 #define STBIR_chans( start, end ) STBIR_strs_join1(start,STBIR__vertical_channels,end)
   9821 #endif
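/* STBIR_chans pastes the vertical channel count (and, for the CONTINUE
   variants, a _cont suffix) into the function names below, so each inclusion
   of this vertical template emits a distinctly named scatter/gather pair per
   channel count. */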
   9822 
   9823 #if STBIR__vertical_channels >= 1
   9824 #define stbIF0( code ) code
   9825 #else
   9826 #define stbIF0( code )
   9827 #endif
   9828 #if STBIR__vertical_channels >= 2
   9829 #define stbIF1( code ) code
   9830 #else
   9831 #define stbIF1( code )
   9832 #endif
   9833 #if STBIR__vertical_channels >= 3
   9834 #define stbIF2( code ) code
   9835 #else
   9836 #define stbIF2( code )
   9837 #endif
   9838 #if STBIR__vertical_channels >= 4
   9839 #define stbIF3( code ) code
   9840 #else
   9841 #define stbIF3( code )
   9842 #endif
   9843 #if STBIR__vertical_channels >= 5
   9844 #define stbIF4( code ) code
   9845 #else
   9846 #define stbIF4( code )
   9847 #endif
   9848 #if STBIR__vertical_channels >= 6
   9849 #define stbIF5( code ) code
   9850 #else
   9851 #define stbIF5( code )
   9852 #endif
   9853 #if STBIR__vertical_channels >= 7
   9854 #define stbIF6( code ) code
   9855 #else
   9856 #define stbIF6( code )
   9857 #endif
   9858 #if STBIR__vertical_channels >= 8
   9859 #define stbIF7( code ) code
   9860 #else
   9861 #define stbIF7( code )
   9862 #endif
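/* The stbIF0..stbIF7 macros above compile a statement in or out depending on
   STBIR__vertical_channels, so a single function body below services every
   channel count from 1 to 8. For example, with STBIR__vertical_channels == 2,
   stbIF0(x) and stbIF1(x) expand to x while stbIF2..stbIF7 expand to nothing,
   leaving only the two live output/input lanes in the generated code. */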
   9863 
   9864 static void STBIR_chans( stbir__vertical_scatter_with_,_coeffs)( float ** outputs, float const * vertical_coefficients, float const * input, float const * input_end )
   9865 {
   9866   stbIF0( float STBIR_SIMD_STREAMOUT_PTR( * ) output0 = outputs[0]; float c0s = vertical_coefficients[0]; )
   9867   stbIF1( float STBIR_SIMD_STREAMOUT_PTR( * ) output1 = outputs[1]; float c1s = vertical_coefficients[1]; )
   9868   stbIF2( float STBIR_SIMD_STREAMOUT_PTR( * ) output2 = outputs[2]; float c2s = vertical_coefficients[2]; )
   9869   stbIF3( float STBIR_SIMD_STREAMOUT_PTR( * ) output3 = outputs[3]; float c3s = vertical_coefficients[3]; )
   9870   stbIF4( float STBIR_SIMD_STREAMOUT_PTR( * ) output4 = outputs[4]; float c4s = vertical_coefficients[4]; )
   9871   stbIF5( float STBIR_SIMD_STREAMOUT_PTR( * ) output5 = outputs[5]; float c5s = vertical_coefficients[5]; )
   9872   stbIF6( float STBIR_SIMD_STREAMOUT_PTR( * ) output6 = outputs[6]; float c6s = vertical_coefficients[6]; )
   9873   stbIF7( float STBIR_SIMD_STREAMOUT_PTR( * ) output7 = outputs[7]; float c7s = vertical_coefficients[7]; )
   9874 
   9875   #ifdef STBIR_SIMD
   9876   {
   9877     stbIF0(stbir__simdfX c0 = stbir__simdf_frepX( c0s ); )
   9878     stbIF1(stbir__simdfX c1 = stbir__simdf_frepX( c1s ); )
   9879     stbIF2(stbir__simdfX c2 = stbir__simdf_frepX( c2s ); )
   9880     stbIF3(stbir__simdfX c3 = stbir__simdf_frepX( c3s ); )
   9881     stbIF4(stbir__simdfX c4 = stbir__simdf_frepX( c4s ); )
   9882     stbIF5(stbir__simdfX c5 = stbir__simdf_frepX( c5s ); )
   9883     stbIF6(stbir__simdfX c6 = stbir__simdf_frepX( c6s ); )
   9884     stbIF7(stbir__simdfX c7 = stbir__simdf_frepX( c7s ); )
   9885     STBIR_SIMD_NO_UNROLL_LOOP_START
   9886     while ( ( (char*)input_end - (char*) input ) >= (16*stbir__simdfX_float_count) )
   9887     {
   9888       stbir__simdfX o0, o1, o2, o3, r0, r1, r2, r3;
   9889       STBIR_SIMD_NO_UNROLL(output0);
   9890 
   9891       stbir__simdfX_load( r0, input );               stbir__simdfX_load( r1, input+stbir__simdfX_float_count );     stbir__simdfX_load( r2, input+(2*stbir__simdfX_float_count) );      stbir__simdfX_load( r3, input+(3*stbir__simdfX_float_count) );
   9892 
   9893       #ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
   9894       stbIF0( stbir__simdfX_load( o0, output0 );     stbir__simdfX_load( o1, output0+stbir__simdfX_float_count );   stbir__simdfX_load( o2, output0+(2*stbir__simdfX_float_count) );    stbir__simdfX_load( o3, output0+(3*stbir__simdfX_float_count) );
   9895               stbir__simdfX_madd( o0, o0, r0, c0 );  stbir__simdfX_madd( o1, o1, r1, c0 );  stbir__simdfX_madd( o2, o2, r2, c0 );   stbir__simdfX_madd( o3, o3, r3, c0 );
   9896               stbir__simdfX_store( output0, o0 );    stbir__simdfX_store( output0+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output0+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output0+(3*stbir__simdfX_float_count), o3 ); )
   9897       stbIF1( stbir__simdfX_load( o0, output1 );     stbir__simdfX_load( o1, output1+stbir__simdfX_float_count );   stbir__simdfX_load( o2, output1+(2*stbir__simdfX_float_count) );    stbir__simdfX_load( o3, output1+(3*stbir__simdfX_float_count) );
   9898               stbir__simdfX_madd( o0, o0, r0, c1 );  stbir__simdfX_madd( o1, o1, r1, c1 );  stbir__simdfX_madd( o2, o2, r2, c1 );   stbir__simdfX_madd( o3, o3, r3, c1 );
   9899               stbir__simdfX_store( output1, o0 );    stbir__simdfX_store( output1+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output1+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output1+(3*stbir__simdfX_float_count), o3 ); )
   9900       stbIF2( stbir__simdfX_load( o0, output2 );     stbir__simdfX_load( o1, output2+stbir__simdfX_float_count );   stbir__simdfX_load( o2, output2+(2*stbir__simdfX_float_count) );    stbir__simdfX_load( o3, output2+(3*stbir__simdfX_float_count) );
   9901               stbir__simdfX_madd( o0, o0, r0, c2 );  stbir__simdfX_madd( o1, o1, r1, c2 );  stbir__simdfX_madd( o2, o2, r2, c2 );   stbir__simdfX_madd( o3, o3, r3, c2 );
   9902               stbir__simdfX_store( output2, o0 );    stbir__simdfX_store( output2+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output2+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output2+(3*stbir__simdfX_float_count), o3 ); )
   9903       stbIF3( stbir__simdfX_load( o0, output3 );     stbir__simdfX_load( o1, output3+stbir__simdfX_float_count );   stbir__simdfX_load( o2, output3+(2*stbir__simdfX_float_count) );    stbir__simdfX_load( o3, output3+(3*stbir__simdfX_float_count) );
   9904               stbir__simdfX_madd( o0, o0, r0, c3 );  stbir__simdfX_madd( o1, o1, r1, c3 );  stbir__simdfX_madd( o2, o2, r2, c3 );   stbir__simdfX_madd( o3, o3, r3, c3 );
   9905               stbir__simdfX_store( output3, o0 );    stbir__simdfX_store( output3+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output3+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output3+(3*stbir__simdfX_float_count), o3 ); )
   9906       stbIF4( stbir__simdfX_load( o0, output4 );     stbir__simdfX_load( o1, output4+stbir__simdfX_float_count );   stbir__simdfX_load( o2, output4+(2*stbir__simdfX_float_count) );    stbir__simdfX_load( o3, output4+(3*stbir__simdfX_float_count) );
   9907               stbir__simdfX_madd( o0, o0, r0, c4 );  stbir__simdfX_madd( o1, o1, r1, c4 );  stbir__simdfX_madd( o2, o2, r2, c4 );   stbir__simdfX_madd( o3, o3, r3, c4 );
   9908               stbir__simdfX_store( output4, o0 );    stbir__simdfX_store( output4+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output4+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output4+(3*stbir__simdfX_float_count), o3 ); )
   9909       stbIF5( stbir__simdfX_load( o0, output5 );     stbir__simdfX_load( o1, output5+stbir__simdfX_float_count );   stbir__simdfX_load( o2, output5+(2*stbir__simdfX_float_count));    stbir__simdfX_load( o3, output5+(3*stbir__simdfX_float_count) );
   9910               stbir__simdfX_madd( o0, o0, r0, c5 );  stbir__simdfX_madd( o1, o1, r1, c5 );  stbir__simdfX_madd( o2, o2, r2, c5 );   stbir__simdfX_madd( o3, o3, r3, c5 );
   9911               stbir__simdfX_store( output5, o0 );    stbir__simdfX_store( output5+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output5+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output5+(3*stbir__simdfX_float_count), o3 ); )
   9912       stbIF6( stbir__simdfX_load( o0, output6 );     stbir__simdfX_load( o1, output6+stbir__simdfX_float_count );   stbir__simdfX_load( o2, output6+(2*stbir__simdfX_float_count) );    stbir__simdfX_load( o3, output6+(3*stbir__simdfX_float_count) );
   9913               stbir__simdfX_madd( o0, o0, r0, c6 );  stbir__simdfX_madd( o1, o1, r1, c6 );  stbir__simdfX_madd( o2, o2, r2, c6 );   stbir__simdfX_madd( o3, o3, r3, c6 );
   9914               stbir__simdfX_store( output6, o0 );    stbir__simdfX_store( output6+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output6+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output6+(3*stbir__simdfX_float_count), o3 ); )
   9915       stbIF7( stbir__simdfX_load( o0, output7 );     stbir__simdfX_load( o1, output7+stbir__simdfX_float_count );   stbir__simdfX_load( o2, output7+(2*stbir__simdfX_float_count) );    stbir__simdfX_load( o3, output7+(3*stbir__simdfX_float_count) );
   9916               stbir__simdfX_madd( o0, o0, r0, c7 );  stbir__simdfX_madd( o1, o1, r1, c7 );  stbir__simdfX_madd( o2, o2, r2, c7 );   stbir__simdfX_madd( o3, o3, r3, c7 );
   9917               stbir__simdfX_store( output7, o0 );    stbir__simdfX_store( output7+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output7+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output7+(3*stbir__simdfX_float_count), o3 ); )
   9918       #else
   9919       stbIF0( stbir__simdfX_mult( o0, r0, c0 );      stbir__simdfX_mult( o1, r1, c0 );      stbir__simdfX_mult( o2, r2, c0 );       stbir__simdfX_mult( o3, r3, c0 );
   9920               stbir__simdfX_store( output0, o0 );    stbir__simdfX_store( output0+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output0+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output0+(3*stbir__simdfX_float_count), o3 ); )
   9921       stbIF1( stbir__simdfX_mult( o0, r0, c1 );      stbir__simdfX_mult( o1, r1, c1 );      stbir__simdfX_mult( o2, r2, c1 );       stbir__simdfX_mult( o3, r3, c1 );
   9922               stbir__simdfX_store( output1, o0 );    stbir__simdfX_store( output1+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output1+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output1+(3*stbir__simdfX_float_count), o3 ); )
   9923       stbIF2( stbir__simdfX_mult( o0, r0, c2 );      stbir__simdfX_mult( o1, r1, c2 );      stbir__simdfX_mult( o2, r2, c2 );       stbir__simdfX_mult( o3, r3, c2 );
   9924               stbir__simdfX_store( output2, o0 );    stbir__simdfX_store( output2+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output2+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output2+(3*stbir__simdfX_float_count), o3 ); )
   9925       stbIF3( stbir__simdfX_mult( o0, r0, c3 );      stbir__simdfX_mult( o1, r1, c3 );      stbir__simdfX_mult( o2, r2, c3 );       stbir__simdfX_mult( o3, r3, c3 );
   9926               stbir__simdfX_store( output3, o0 );    stbir__simdfX_store( output3+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output3+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output3+(3*stbir__simdfX_float_count), o3 ); )
   9927       stbIF4( stbir__simdfX_mult( o0, r0, c4 );      stbir__simdfX_mult( o1, r1, c4 );      stbir__simdfX_mult( o2, r2, c4 );       stbir__simdfX_mult( o3, r3, c4 );
   9928               stbir__simdfX_store( output4, o0 );    stbir__simdfX_store( output4+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output4+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output4+(3*stbir__simdfX_float_count), o3 ); )
   9929       stbIF5( stbir__simdfX_mult( o0, r0, c5 );      stbir__simdfX_mult( o1, r1, c5 );      stbir__simdfX_mult( o2, r2, c5 );       stbir__simdfX_mult( o3, r3, c5 );
   9930               stbir__simdfX_store( output5, o0 );    stbir__simdfX_store( output5+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output5+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output5+(3*stbir__simdfX_float_count), o3 ); )
   9931       stbIF6( stbir__simdfX_mult( o0, r0, c6 );      stbir__simdfX_mult( o1, r1, c6 );      stbir__simdfX_mult( o2, r2, c6 );       stbir__simdfX_mult( o3, r3, c6 );
   9932               stbir__simdfX_store( output6, o0 );    stbir__simdfX_store( output6+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output6+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output6+(3*stbir__simdfX_float_count), o3 ); )
   9933       stbIF7( stbir__simdfX_mult( o0, r0, c7 );      stbir__simdfX_mult( o1, r1, c7 );      stbir__simdfX_mult( o2, r2, c7 );       stbir__simdfX_mult( o3, r3, c7 );
   9934               stbir__simdfX_store( output7, o0 );    stbir__simdfX_store( output7+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output7+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output7+(3*stbir__simdfX_float_count), o3 ); )
   9935       #endif
   9936 
   9937       input += (4*stbir__simdfX_float_count);
   9938       stbIF0( output0 += (4*stbir__simdfX_float_count); ) stbIF1( output1 += (4*stbir__simdfX_float_count); ) stbIF2( output2 += (4*stbir__simdfX_float_count); ) stbIF3( output3 += (4*stbir__simdfX_float_count); ) stbIF4( output4 += (4*stbir__simdfX_float_count); ) stbIF5( output5 += (4*stbir__simdfX_float_count); ) stbIF6( output6 += (4*stbir__simdfX_float_count); ) stbIF7( output7 += (4*stbir__simdfX_float_count); )
   9939     }
   9940     STBIR_SIMD_NO_UNROLL_LOOP_START
   9941     while ( ( (char*)input_end - (char*) input ) >= 16 )
   9942     {
   9943       stbir__simdf o0, r0;
   9944       STBIR_SIMD_NO_UNROLL(output0);
   9945 
   9946       stbir__simdf_load( r0, input );
   9947 
   9948       #ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
   9949       stbIF0( stbir__simdf_load( o0, output0 );  stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c0 ) );  stbir__simdf_store( output0, o0 ); )
   9950       stbIF1( stbir__simdf_load( o0, output1 );  stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c1 ) );  stbir__simdf_store( output1, o0 ); )
   9951       stbIF2( stbir__simdf_load( o0, output2 );  stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c2 ) );  stbir__simdf_store( output2, o0 ); )
   9952       stbIF3( stbir__simdf_load( o0, output3 );  stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c3 ) );  stbir__simdf_store( output3, o0 ); )
   9953       stbIF4( stbir__simdf_load( o0, output4 );  stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c4 ) );  stbir__simdf_store( output4, o0 ); )
   9954       stbIF5( stbir__simdf_load( o0, output5 );  stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c5 ) );  stbir__simdf_store( output5, o0 ); )
   9955       stbIF6( stbir__simdf_load( o0, output6 );  stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c6 ) );  stbir__simdf_store( output6, o0 ); )
   9956       stbIF7( stbir__simdf_load( o0, output7 );  stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c7 ) );  stbir__simdf_store( output7, o0 ); )
   9957       #else
   9958       stbIF0( stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c0 ) );   stbir__simdf_store( output0, o0 ); )
   9959       stbIF1( stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c1 ) );   stbir__simdf_store( output1, o0 ); )
   9960       stbIF2( stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c2 ) );   stbir__simdf_store( output2, o0 ); )
   9961       stbIF3( stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c3 ) );   stbir__simdf_store( output3, o0 ); )
   9962       stbIF4( stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c4 ) );   stbir__simdf_store( output4, o0 ); )
   9963       stbIF5( stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c5 ) );   stbir__simdf_store( output5, o0 ); )
   9964       stbIF6( stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c6 ) );   stbir__simdf_store( output6, o0 ); )
   9965       stbIF7( stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c7 ) );   stbir__simdf_store( output7, o0 ); )
   9966       #endif
   9967 
   9968       input += 4;
   9969       stbIF0( output0 += 4; ) stbIF1( output1 += 4; ) stbIF2( output2 += 4; ) stbIF3( output3 += 4; ) stbIF4( output4 += 4; ) stbIF5( output5 += 4; ) stbIF6( output6 += 4; ) stbIF7( output7 += 4; )
   9970     }
   9971   }
   9972   #else
   9973   STBIR_NO_UNROLL_LOOP_START
   9974   while ( ( (char*)input_end - (char*) input ) >= 16 )
   9975   {
   9976     float r0, r1, r2, r3;
   9977     STBIR_NO_UNROLL(input);
   9978 
   9979     r0 = input[0], r1 = input[1], r2 = input[2], r3 = input[3];
   9980 
   9981     #ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
   9982     stbIF0( output0[0] += ( r0 * c0s ); output0[1] += ( r1 * c0s ); output0[2] += ( r2 * c0s ); output0[3] += ( r3 * c0s ); )
   9983     stbIF1( output1[0] += ( r0 * c1s ); output1[1] += ( r1 * c1s ); output1[2] += ( r2 * c1s ); output1[3] += ( r3 * c1s ); )
   9984     stbIF2( output2[0] += ( r0 * c2s ); output2[1] += ( r1 * c2s ); output2[2] += ( r2 * c2s ); output2[3] += ( r3 * c2s ); )
   9985     stbIF3( output3[0] += ( r0 * c3s ); output3[1] += ( r1 * c3s ); output3[2] += ( r2 * c3s ); output3[3] += ( r3 * c3s ); )
   9986     stbIF4( output4[0] += ( r0 * c4s ); output4[1] += ( r1 * c4s ); output4[2] += ( r2 * c4s ); output4[3] += ( r3 * c4s ); )
   9987     stbIF5( output5[0] += ( r0 * c5s ); output5[1] += ( r1 * c5s ); output5[2] += ( r2 * c5s ); output5[3] += ( r3 * c5s ); )
   9988     stbIF6( output6[0] += ( r0 * c6s ); output6[1] += ( r1 * c6s ); output6[2] += ( r2 * c6s ); output6[3] += ( r3 * c6s ); )
   9989     stbIF7( output7[0] += ( r0 * c7s ); output7[1] += ( r1 * c7s ); output7[2] += ( r2 * c7s ); output7[3] += ( r3 * c7s ); )
   9990     #else
   9991     stbIF0( output0[0]  = ( r0 * c0s ); output0[1]  = ( r1 * c0s ); output0[2]  = ( r2 * c0s ); output0[3]  = ( r3 * c0s ); )
   9992     stbIF1( output1[0]  = ( r0 * c1s ); output1[1]  = ( r1 * c1s ); output1[2]  = ( r2 * c1s ); output1[3]  = ( r3 * c1s ); )
   9993     stbIF2( output2[0]  = ( r0 * c2s ); output2[1]  = ( r1 * c2s ); output2[2]  = ( r2 * c2s ); output2[3]  = ( r3 * c2s ); )
   9994     stbIF3( output3[0]  = ( r0 * c3s ); output3[1]  = ( r1 * c3s ); output3[2]  = ( r2 * c3s ); output3[3]  = ( r3 * c3s ); )
   9995     stbIF4( output4[0]  = ( r0 * c4s ); output4[1]  = ( r1 * c4s ); output4[2]  = ( r2 * c4s ); output4[3]  = ( r3 * c4s ); )
   9996     stbIF5( output5[0]  = ( r0 * c5s ); output5[1]  = ( r1 * c5s ); output5[2]  = ( r2 * c5s ); output5[3]  = ( r3 * c5s ); )
   9997     stbIF6( output6[0]  = ( r0 * c6s ); output6[1]  = ( r1 * c6s ); output6[2]  = ( r2 * c6s ); output6[3]  = ( r3 * c6s ); )
   9998     stbIF7( output7[0]  = ( r0 * c7s ); output7[1]  = ( r1 * c7s ); output7[2]  = ( r2 * c7s ); output7[3]  = ( r3 * c7s ); )
   9999     #endif
  10000 
  10001     input += 4;
  10002     stbIF0( output0 += 4; ) stbIF1( output1 += 4; ) stbIF2( output2 += 4; ) stbIF3( output3 += 4; ) stbIF4( output4 += 4; ) stbIF5( output5 += 4; ) stbIF6( output6 += 4; ) stbIF7( output7 += 4; )
  10003   }
  10004   #endif
  10005   STBIR_NO_UNROLL_LOOP_START
  10006   while ( input < input_end )
  10007   {
  10008     float r = input[0];
  10009     STBIR_NO_UNROLL(output0);
  10010 
  10011     #ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
  10012     stbIF0( output0[0] += ( r * c0s ); )
  10013     stbIF1( output1[0] += ( r * c1s ); )
  10014     stbIF2( output2[0] += ( r * c2s ); )
  10015     stbIF3( output3[0] += ( r * c3s ); )
  10016     stbIF4( output4[0] += ( r * c4s ); )
  10017     stbIF5( output5[0] += ( r * c5s ); )
  10018     stbIF6( output6[0] += ( r * c6s ); )
  10019     stbIF7( output7[0] += ( r * c7s ); )
  10020     #else
  10021     stbIF0( output0[0]  = ( r * c0s ); )
  10022     stbIF1( output1[0]  = ( r * c1s ); )
  10023     stbIF2( output2[0]  = ( r * c2s ); )
  10024     stbIF3( output3[0]  = ( r * c3s ); )
  10025     stbIF4( output4[0]  = ( r * c4s ); )
  10026     stbIF5( output5[0]  = ( r * c5s ); )
  10027     stbIF6( output6[0]  = ( r * c6s ); )
  10028     stbIF7( output7[0]  = ( r * c7s ); )
  10029     #endif
  10030 
  10031     ++input;
  10032     stbIF0( ++output0; ) stbIF1( ++output1; ) stbIF2( ++output2; ) stbIF3( ++output3; ) stbIF4( ++output4; ) stbIF5( ++output5; ) stbIF6( ++output6; ) stbIF7( ++output7; )
  10033   }
  10034 }
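/* The vertical scatter above takes one input scanline and writes it, weighted,
   into up to STBIR__vertical_channels output rows: output_k[i] = input[i]*c_k,
   or output_k[i] += input[i]*c_k in the STB_IMAGE_RESIZE_VERTICAL_CONTINUE
   ("_cont") variant that accumulates into partially built rows. The wide SIMD
   loop handles four vectors per iteration, then a 4-float loop and a scalar
   loop mop up the remainder. */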
  10035 
  10036 static void STBIR_chans( stbir__vertical_gather_with_,_coeffs)( float * outputp, float const * vertical_coefficients, float const ** inputs, float const * input0_end )
  10037 {
  10038   float STBIR_SIMD_STREAMOUT_PTR( * ) output = outputp;
  10039 
  10040   stbIF0( float const * input0 = inputs[0]; float c0s = vertical_coefficients[0]; )
  10041   stbIF1( float const * input1 = inputs[1]; float c1s = vertical_coefficients[1]; )
  10042   stbIF2( float const * input2 = inputs[2]; float c2s = vertical_coefficients[2]; )
  10043   stbIF3( float const * input3 = inputs[3]; float c3s = vertical_coefficients[3]; )
  10044   stbIF4( float const * input4 = inputs[4]; float c4s = vertical_coefficients[4]; )
  10045   stbIF5( float const * input5 = inputs[5]; float c5s = vertical_coefficients[5]; )
  10046   stbIF6( float const * input6 = inputs[6]; float c6s = vertical_coefficients[6]; )
  10047   stbIF7( float const * input7 = inputs[7]; float c7s = vertical_coefficients[7]; )
  10048 
  10049 #if ( STBIR__vertical_channels == 1 ) && !defined(STB_IMAGE_RESIZE_VERTICAL_CONTINUE)
  10050   // check single channel one weight
  10051   if ( ( c0s >= (1.0f-0.000001f) ) && ( c0s <= (1.0f+0.000001f) ) )
  10052   {
  10053     STBIR_MEMCPY( output, input0, (char*)input0_end - (char*)input0 );
  10054     return;
  10055   }
  10056 #endif
  10057 
  10058   #ifdef STBIR_SIMD
  10059   {
  10060     stbIF0(stbir__simdfX c0 = stbir__simdf_frepX( c0s ); )
  10061     stbIF1(stbir__simdfX c1 = stbir__simdf_frepX( c1s ); )
  10062     stbIF2(stbir__simdfX c2 = stbir__simdf_frepX( c2s ); )
  10063     stbIF3(stbir__simdfX c3 = stbir__simdf_frepX( c3s ); )
  10064     stbIF4(stbir__simdfX c4 = stbir__simdf_frepX( c4s ); )
  10065     stbIF5(stbir__simdfX c5 = stbir__simdf_frepX( c5s ); )
  10066     stbIF6(stbir__simdfX c6 = stbir__simdf_frepX( c6s ); )
  10067     stbIF7(stbir__simdfX c7 = stbir__simdf_frepX( c7s ); )
  10068 
  10069     STBIR_SIMD_NO_UNROLL_LOOP_START
  10070     while ( ( (char*)input0_end - (char*) input0 ) >= (16*stbir__simdfX_float_count) )
  10071     {
  10072       stbir__simdfX o0, o1, o2, o3, r0, r1, r2, r3;
  10073       STBIR_SIMD_NO_UNROLL(output);
  10074 
  10075       // prefetch four loop iterations ahead (doesn't affect much for small resizes, but helps with big ones)
  10076       stbIF0( stbir__prefetch( input0 + (16*stbir__simdfX_float_count) ); )
  10077       stbIF1( stbir__prefetch( input1 + (16*stbir__simdfX_float_count) ); )
  10078       stbIF2( stbir__prefetch( input2 + (16*stbir__simdfX_float_count) ); )
  10079       stbIF3( stbir__prefetch( input3 + (16*stbir__simdfX_float_count) ); )
  10080       stbIF4( stbir__prefetch( input4 + (16*stbir__simdfX_float_count) ); )
  10081       stbIF5( stbir__prefetch( input5 + (16*stbir__simdfX_float_count) ); )
  10082       stbIF6( stbir__prefetch( input6 + (16*stbir__simdfX_float_count) ); )
  10083       stbIF7( stbir__prefetch( input7 + (16*stbir__simdfX_float_count) ); )
  10084 
  10085       #ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
  10086       stbIF0( stbir__simdfX_load( o0, output );      stbir__simdfX_load( o1, output+stbir__simdfX_float_count );   stbir__simdfX_load( o2, output+(2*stbir__simdfX_float_count) );   stbir__simdfX_load( o3, output+(3*stbir__simdfX_float_count) );
  10087               stbir__simdfX_load( r0, input0 );      stbir__simdfX_load( r1, input0+stbir__simdfX_float_count );   stbir__simdfX_load( r2, input0+(2*stbir__simdfX_float_count) );   stbir__simdfX_load( r3, input0+(3*stbir__simdfX_float_count) );
  10088               stbir__simdfX_madd( o0, o0, r0, c0 );  stbir__simdfX_madd( o1, o1, r1, c0 );                         stbir__simdfX_madd( o2, o2, r2, c0 );                             stbir__simdfX_madd( o3, o3, r3, c0 ); )
  10089       #else
  10090       stbIF0( stbir__simdfX_load( r0, input0 );      stbir__simdfX_load( r1, input0+stbir__simdfX_float_count );   stbir__simdfX_load( r2, input0+(2*stbir__simdfX_float_count) );   stbir__simdfX_load( r3, input0+(3*stbir__simdfX_float_count) );
  10091               stbir__simdfX_mult( o0, r0, c0 );      stbir__simdfX_mult( o1, r1, c0 );                             stbir__simdfX_mult( o2, r2, c0 );                                 stbir__simdfX_mult( o3, r3, c0 );  )
  10092       #endif
  10093 
  10094       stbIF1( stbir__simdfX_load( r0, input1 );      stbir__simdfX_load( r1, input1+stbir__simdfX_float_count );   stbir__simdfX_load( r2, input1+(2*stbir__simdfX_float_count) );   stbir__simdfX_load( r3, input1+(3*stbir__simdfX_float_count) );
  10095               stbir__simdfX_madd( o0, o0, r0, c1 );  stbir__simdfX_madd( o1, o1, r1, c1 );                         stbir__simdfX_madd( o2, o2, r2, c1 );                             stbir__simdfX_madd( o3, o3, r3, c1 ); )
  10096       stbIF2( stbir__simdfX_load( r0, input2 );      stbir__simdfX_load( r1, input2+stbir__simdfX_float_count );   stbir__simdfX_load( r2, input2+(2*stbir__simdfX_float_count) );   stbir__simdfX_load( r3, input2+(3*stbir__simdfX_float_count) );
  10097               stbir__simdfX_madd( o0, o0, r0, c2 );  stbir__simdfX_madd( o1, o1, r1, c2 );                         stbir__simdfX_madd( o2, o2, r2, c2 );                             stbir__simdfX_madd( o3, o3, r3, c2 ); )
  10098       stbIF3( stbir__simdfX_load( r0, input3 );      stbir__simdfX_load( r1, input3+stbir__simdfX_float_count );   stbir__simdfX_load( r2, input3+(2*stbir__simdfX_float_count) );   stbir__simdfX_load( r3, input3+(3*stbir__simdfX_float_count) );
  10099               stbir__simdfX_madd( o0, o0, r0, c3 );  stbir__simdfX_madd( o1, o1, r1, c3 );                         stbir__simdfX_madd( o2, o2, r2, c3 );                             stbir__simdfX_madd( o3, o3, r3, c3 ); )
  10100       stbIF4( stbir__simdfX_load( r0, input4 );      stbir__simdfX_load( r1, input4+stbir__simdfX_float_count );   stbir__simdfX_load( r2, input4+(2*stbir__simdfX_float_count) );   stbir__simdfX_load( r3, input4+(3*stbir__simdfX_float_count) );
  10101               stbir__simdfX_madd( o0, o0, r0, c4 );  stbir__simdfX_madd( o1, o1, r1, c4 );                         stbir__simdfX_madd( o2, o2, r2, c4 );                             stbir__simdfX_madd( o3, o3, r3, c4 ); )
  10102       stbIF5( stbir__simdfX_load( r0, input5 );      stbir__simdfX_load( r1, input5+stbir__simdfX_float_count );   stbir__simdfX_load( r2, input5+(2*stbir__simdfX_float_count) );   stbir__simdfX_load( r3, input5+(3*stbir__simdfX_float_count) );
  10103               stbir__simdfX_madd( o0, o0, r0, c5 );  stbir__simdfX_madd( o1, o1, r1, c5 );                         stbir__simdfX_madd( o2, o2, r2, c5 );                             stbir__simdfX_madd( o3, o3, r3, c5 ); )
  10104       stbIF6( stbir__simdfX_load( r0, input6 );      stbir__simdfX_load( r1, input6+stbir__simdfX_float_count );   stbir__simdfX_load( r2, input6+(2*stbir__simdfX_float_count) );   stbir__simdfX_load( r3, input6+(3*stbir__simdfX_float_count) );
  10105               stbir__simdfX_madd( o0, o0, r0, c6 );  stbir__simdfX_madd( o1, o1, r1, c6 );                         stbir__simdfX_madd( o2, o2, r2, c6 );                             stbir__simdfX_madd( o3, o3, r3, c6 ); )
  10106       stbIF7( stbir__simdfX_load( r0, input7 );      stbir__simdfX_load( r1, input7+stbir__simdfX_float_count );   stbir__simdfX_load( r2, input7+(2*stbir__simdfX_float_count) );   stbir__simdfX_load( r3, input7+(3*stbir__simdfX_float_count) );
  10107               stbir__simdfX_madd( o0, o0, r0, c7 );  stbir__simdfX_madd( o1, o1, r1, c7 );                         stbir__simdfX_madd( o2, o2, r2, c7 );                             stbir__simdfX_madd( o3, o3, r3, c7 ); )
  10108 
  10109       stbir__simdfX_store( output, o0 );             stbir__simdfX_store( output+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output+(2*stbir__simdfX_float_count), o2 );  stbir__simdfX_store( output+(3*stbir__simdfX_float_count), o3 );
  10110       output += (4*stbir__simdfX_float_count);
  10111       stbIF0( input0 += (4*stbir__simdfX_float_count); ) stbIF1( input1 += (4*stbir__simdfX_float_count); ) stbIF2( input2 += (4*stbir__simdfX_float_count); ) stbIF3( input3 += (4*stbir__simdfX_float_count); ) stbIF4( input4 += (4*stbir__simdfX_float_count); ) stbIF5( input5 += (4*stbir__simdfX_float_count); ) stbIF6( input6 += (4*stbir__simdfX_float_count); ) stbIF7( input7 += (4*stbir__simdfX_float_count); )
  10112     }
  10113 
  10114     STBIR_SIMD_NO_UNROLL_LOOP_START
  10115     while ( ( (char*)input0_end - (char*) input0 ) >= 16 )
  10116     {
  10117       stbir__simdf o0, r0;
  10118       STBIR_SIMD_NO_UNROLL(output);
  10119 
  10120       #ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
  10121       stbIF0( stbir__simdf_load( o0, output );   stbir__simdf_load( r0, input0 ); stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c0 ) ); )
  10122       #else
  10123       stbIF0( stbir__simdf_load( r0, input0 );  stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c0 ) ); )
  10124       #endif
  10125       stbIF1( stbir__simdf_load( r0, input1 );  stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c1 ) ); )
  10126       stbIF2( stbir__simdf_load( r0, input2 );  stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c2 ) ); )
  10127       stbIF3( stbir__simdf_load( r0, input3 );  stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c3 ) ); )
  10128       stbIF4( stbir__simdf_load( r0, input4 );  stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c4 ) ); )
  10129       stbIF5( stbir__simdf_load( r0, input5 );  stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c5 ) ); )
  10130       stbIF6( stbir__simdf_load( r0, input6 );  stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c6 ) ); )
  10131       stbIF7( stbir__simdf_load( r0, input7 );  stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c7 ) ); )
  10132 
  10133       stbir__simdf_store( output, o0 );
  10134       output += 4;
  10135       stbIF0( input0 += 4; ) stbIF1( input1 += 4; ) stbIF2( input2 += 4; ) stbIF3( input3 += 4; ) stbIF4( input4 += 4; ) stbIF5( input5 += 4; ) stbIF6( input6 += 4; ) stbIF7( input7 += 4; )
  10136     }
  10137   }
  10138   #else
  10139   STBIR_NO_UNROLL_LOOP_START
  10140   while ( ( (char*)input0_end - (char*) input0 ) >= 16 )
  10141   {
  10142     float o0, o1, o2, o3;
  10143     STBIR_NO_UNROLL(output);
  10144     #ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
  10145     stbIF0( o0 = output[0] + input0[0] * c0s; o1 = output[1] + input0[1] * c0s; o2 = output[2] + input0[2] * c0s; o3 = output[3] + input0[3] * c0s; )
  10146     #else
  10147     stbIF0( o0  = input0[0] * c0s; o1  = input0[1] * c0s; o2  = input0[2] * c0s; o3  = input0[3] * c0s; )
  10148     #endif
  10149     stbIF1( o0 += input1[0] * c1s; o1 += input1[1] * c1s; o2 += input1[2] * c1s; o3 += input1[3] * c1s; )
  10150     stbIF2( o0 += input2[0] * c2s; o1 += input2[1] * c2s; o2 += input2[2] * c2s; o3 += input2[3] * c2s; )
  10151     stbIF3( o0 += input3[0] * c3s; o1 += input3[1] * c3s; o2 += input3[2] * c3s; o3 += input3[3] * c3s; )
  10152     stbIF4( o0 += input4[0] * c4s; o1 += input4[1] * c4s; o2 += input4[2] * c4s; o3 += input4[3] * c4s; )
  10153     stbIF5( o0 += input5[0] * c5s; o1 += input5[1] * c5s; o2 += input5[2] * c5s; o3 += input5[3] * c5s; )
  10154     stbIF6( o0 += input6[0] * c6s; o1 += input6[1] * c6s; o2 += input6[2] * c6s; o3 += input6[3] * c6s; )
  10155     stbIF7( o0 += input7[0] * c7s; o1 += input7[1] * c7s; o2 += input7[2] * c7s; o3 += input7[3] * c7s; )
  10156     output[0] = o0; output[1] = o1; output[2] = o2; output[3] = o3;
  10157     output += 4;
  10158     stbIF0( input0 += 4; ) stbIF1( input1 += 4; ) stbIF2( input2 += 4; ) stbIF3( input3 += 4; ) stbIF4( input4 += 4; ) stbIF5( input5 += 4; ) stbIF6( input6 += 4; ) stbIF7( input7 += 4; )
  10159   }
  10160   #endif
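  /* Editor's note (descriptive comment): tail loop shared by the SIMD and
     scalar paths above -- finishes any remaining floats (fewer than 4) one
     at a time with the same weighted sum. */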
  10161   STBIR_NO_UNROLL_LOOP_START
  10162   while ( input0 < input0_end )
  10163   {
  10164     float o0;
  10165     STBIR_NO_UNROLL(output);
  10166     #ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
  10167     stbIF0( o0 = output[0] + input0[0] * c0s; )
  10168     #else
  10169     stbIF0( o0  = input0[0] * c0s; )
  10170     #endif
  10171     stbIF1( o0 += input1[0] * c1s; )
  10172     stbIF2( o0 += input2[0] * c2s; )
  10173     stbIF3( o0 += input3[0] * c3s; )
  10174     stbIF4( o0 += input4[0] * c4s; )
  10175     stbIF5( o0 += input5[0] * c5s; )
  10176     stbIF6( o0 += input6[0] * c6s; )
  10177     stbIF7( o0 += input7[0] * c7s; )
  10178     output[0] = o0;
  10179     ++output;
  10180     stbIF0( ++input0; ) stbIF1( ++input1; ) stbIF2( ++input2; ) stbIF3( ++input3; ) stbIF4( ++input4; ) stbIF5( ++input5; ) stbIF6( ++input6; ) stbIF7( ++input7; )
  10181   }
  10182 }
  10183 
  10184 #undef stbIF0
  10185 #undef stbIF1
  10186 #undef stbIF2
  10187 #undef stbIF3
  10188 #undef stbIF4
  10189 #undef stbIF5
  10190 #undef stbIF6
  10191 #undef stbIF7
  10192 #undef STB_IMAGE_RESIZE_DO_VERTICALS
  10193 #undef STBIR__vertical_channels
  10194 #undef STB_IMAGE_RESIZE_DO_HORIZONTALS
  10195 #undef STBIR_strs_join24
  10196 #undef STBIR_strs_join14
  10197 #undef STBIR_chans
  10198 #ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
  10199 #undef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
  10200 #endif
  10201 
  10202 #else // !STB_IMAGE_RESIZE_DO_VERTICALS
  10203 
  10204 #define STBIR_chans( start, end ) STBIR_strs_join1(start,STBIR__horizontal_channels,end)
  10205 
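/* Editor's note (descriptive comment): the section below builds the
   horizontal gather kernels. The including code defines the per-SIMD,
   per-channel-count building blocks before this point (stbir__1_coeff_only,
   stbir__1_coeff_remnant, stbir__store_output, ...); the #ifndef defaults
   that follow simply compose larger coefficient groups out of the
   one-coefficient primitives when no specialized version was provided. */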
  10206 #ifndef stbir__2_coeff_only
  10207 #define stbir__2_coeff_only()             \
  10208     stbir__1_coeff_only();                \
  10209     stbir__1_coeff_remnant(1);
  10210 #endif
  10211 
  10212 #ifndef stbir__2_coeff_remnant
  10213 #define stbir__2_coeff_remnant( ofs )     \
  10214     stbir__1_coeff_remnant(ofs);          \
  10215     stbir__1_coeff_remnant((ofs)+1);
  10216 #endif
  10217 
  10218 #ifndef stbir__3_coeff_only
  10219 #define stbir__3_coeff_only()             \
  10220     stbir__2_coeff_only();                \
  10221     stbir__1_coeff_remnant(2);
  10222 #endif
  10223 
  10224 #ifndef stbir__3_coeff_remnant
  10225 #define stbir__3_coeff_remnant( ofs )     \
  10226     stbir__2_coeff_remnant(ofs);          \
  10227     stbir__1_coeff_remnant((ofs)+2);
  10228 #endif
  10229 
  10230 #ifndef stbir__3_coeff_setup
  10231 #define stbir__3_coeff_setup()
  10232 #endif
  10233 
  10234 #ifndef stbir__4_coeff_start
  10235 #define stbir__4_coeff_start()            \
  10236     stbir__2_coeff_only();                \
  10237     stbir__2_coeff_remnant(2);
  10238 #endif
  10239 
  10240 #ifndef stbir__4_coeff_continue_from_4
  10241 #define stbir__4_coeff_continue_from_4( ofs )     \
  10242     stbir__2_coeff_remnant(ofs);                  \
  10243     stbir__2_coeff_remnant((ofs)+2);
  10244 #endif
  10245 
  10246 #ifndef stbir__store_output_tiny
  10247 #define stbir__store_output_tiny stbir__store_output
  10248 #endif
  10249 
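/* Editor's note (descriptive comment): fixed-width kernels, one function per
   exact coefficient count (1..12). Each walks the output row, starts reading
   the decode buffer at the first contributing input pixel
   (horizontal_contributors->n0), and accumulates coefficient-weighted pixels
   into STBIR__horizontal_channels floats per output pixel. */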
  10250 static void STBIR_chans( stbir__horizontal_gather_,_channels_with_1_coeff)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
  10251 {
  10252   float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  10253   float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  10254   STBIR_SIMD_NO_UNROLL_LOOP_START
  10255   do {
  10256     float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
  10257     float const * hc = horizontal_coefficients;
  10258     stbir__1_coeff_only();
  10259     stbir__store_output_tiny();
  10260   } while ( output < output_end );
  10261 }
  10262 
  10263 static void STBIR_chans( stbir__horizontal_gather_,_channels_with_2_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
  10264 {
  10265   float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  10266   float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  10267   STBIR_SIMD_NO_UNROLL_LOOP_START
  10268   do {
  10269     float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
  10270     float const * hc = horizontal_coefficients;
  10271     stbir__2_coeff_only();
  10272     stbir__store_output_tiny();
  10273   } while ( output < output_end );
  10274 }
  10275 
  10276 static void STBIR_chans( stbir__horizontal_gather_,_channels_with_3_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
  10277 {
  10278   float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  10279   float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  10280   STBIR_SIMD_NO_UNROLL_LOOP_START
  10281   do {
  10282     float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
  10283     float const * hc = horizontal_coefficients;
  10284     stbir__3_coeff_only();
  10285     stbir__store_output_tiny();
  10286   } while ( output < output_end );
  10287 }
  10288 
  10289 static void STBIR_chans( stbir__horizontal_gather_,_channels_with_4_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
  10290 {
  10291   float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  10292   float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  10293   STBIR_SIMD_NO_UNROLL_LOOP_START
  10294   do {
  10295     float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
  10296     float const * hc = horizontal_coefficients;
  10297     stbir__4_coeff_start();
  10298     stbir__store_output();
  10299   } while ( output < output_end );
  10300 }
  10301 
  10302 static void STBIR_chans( stbir__horizontal_gather_,_channels_with_5_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
  10303 {
  10304   float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  10305   float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  10306   STBIR_SIMD_NO_UNROLL_LOOP_START
  10307   do {
  10308     float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
  10309     float const * hc = horizontal_coefficients;
  10310     stbir__4_coeff_start();
  10311     stbir__1_coeff_remnant(4);
  10312     stbir__store_output();
  10313   } while ( output < output_end );
  10314 }
  10315 
  10316 static void STBIR_chans( stbir__horizontal_gather_,_channels_with_6_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
  10317 {
  10318   float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  10319   float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  10320   STBIR_SIMD_NO_UNROLL_LOOP_START
  10321   do {
  10322     float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
  10323     float const * hc = horizontal_coefficients;
  10324     stbir__4_coeff_start();
  10325     stbir__2_coeff_remnant(4);
  10326     stbir__store_output();
  10327   } while ( output < output_end );
  10328 }
  10329 
  10330 static void STBIR_chans( stbir__horizontal_gather_,_channels_with_7_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
  10331 {
  10332   float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  10333   float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  10334   stbir__3_coeff_setup();
  10335   STBIR_SIMD_NO_UNROLL_LOOP_START
  10336   do {
  10337     float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
  10338     float const * hc = horizontal_coefficients;
  10339 
  10340     stbir__4_coeff_start();
  10341     stbir__3_coeff_remnant(4);
  10342     stbir__store_output();
  10343   } while ( output < output_end );
  10344 }
  10345 
  10346 static void STBIR_chans( stbir__horizontal_gather_,_channels_with_8_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
  10347 {
  10348   float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  10349   float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  10350   STBIR_SIMD_NO_UNROLL_LOOP_START
  10351   do {
  10352     float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
  10353     float const * hc = horizontal_coefficients;
  10354     stbir__4_coeff_start();
  10355     stbir__4_coeff_continue_from_4(4);
  10356     stbir__store_output();
  10357   } while ( output < output_end );
  10358 }
  10359 
  10360 static void STBIR_chans( stbir__horizontal_gather_,_channels_with_9_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
  10361 {
  10362   float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  10363   float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  10364   STBIR_SIMD_NO_UNROLL_LOOP_START
  10365   do {
  10366     float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
  10367     float const * hc = horizontal_coefficients;
  10368     stbir__4_coeff_start();
  10369     stbir__4_coeff_continue_from_4(4);
  10370     stbir__1_coeff_remnant(8);
  10371     stbir__store_output();
  10372   } while ( output < output_end );
  10373 }
  10374 
  10375 static void STBIR_chans( stbir__horizontal_gather_,_channels_with_10_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
  10376 {
  10377   float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  10378   float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  10379   STBIR_SIMD_NO_UNROLL_LOOP_START
  10380   do {
  10381     float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
  10382     float const * hc = horizontal_coefficients;
  10383     stbir__4_coeff_start();
  10384     stbir__4_coeff_continue_from_4(4);
  10385     stbir__2_coeff_remnant(8);
  10386     stbir__store_output();
  10387   } while ( output < output_end );
  10388 }
  10389 
  10390 static void STBIR_chans( stbir__horizontal_gather_,_channels_with_11_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
  10391 {
  10392   float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  10393   float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  10394   stbir__3_coeff_setup();
  10395   STBIR_SIMD_NO_UNROLL_LOOP_START
  10396   do {
  10397     float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
  10398     float const * hc = horizontal_coefficients;
  10399     stbir__4_coeff_start();
  10400     stbir__4_coeff_continue_from_4(4);
  10401     stbir__3_coeff_remnant(8);
  10402     stbir__store_output();
  10403   } while ( output < output_end );
  10404 }
  10405 
  10406 static void STBIR_chans( stbir__horizontal_gather_,_channels_with_12_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
  10407 {
  10408   float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  10409   float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  10410   STBIR_SIMD_NO_UNROLL_LOOP_START
  10411   do {
  10412     float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
  10413     float const * hc = horizontal_coefficients;
  10414     stbir__4_coeff_start();
  10415     stbir__4_coeff_continue_from_4(4);
  10416     stbir__4_coeff_continue_from_4(8);
  10417     stbir__store_output();
  10418   } while ( output < output_end );
  10419 }
  10420 
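/* Editor's note (descriptive comment): variable-width kernels for
   coefficient counts too large for the fixed versions above. Each does an
   initial group of 4 coefficients, loops over further groups of 4, then
   handles a remainder of 0..3 coefficients (the mod0..mod3 variants). The
   loop count
       n = ((n1 - n0 + 1) - 4 - remainder + 3) >> 2
   is the number of full 4-coefficient groups left after the first one. */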
  10421 static void STBIR_chans( stbir__horizontal_gather_,_channels_with_n_coeffs_mod0 )( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
  10422 {
  10423   float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  10424   float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  10425   STBIR_SIMD_NO_UNROLL_LOOP_START
  10426   do {
  10427     float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
  10428     int n = ( ( horizontal_contributors->n1 - horizontal_contributors->n0 + 1 ) - 4 + 3 ) >> 2;
  10429     float const * hc = horizontal_coefficients;
  10430 
  10431     stbir__4_coeff_start();
  10432     STBIR_SIMD_NO_UNROLL_LOOP_START
  10433     do {
  10434       hc += 4;
  10435       decode += STBIR__horizontal_channels * 4;
  10436       stbir__4_coeff_continue_from_4( 0 );
  10437       --n;
  10438     } while ( n > 0 );
  10439     stbir__store_output();
  10440   } while ( output < output_end );
  10441 }
  10442 
  10443 static void STBIR_chans( stbir__horizontal_gather_,_channels_with_n_coeffs_mod1 )( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
  10444 {
  10445   float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  10446   float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  10447   STBIR_SIMD_NO_UNROLL_LOOP_START
  10448   do {
  10449     float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
  10450     int n = ( ( horizontal_contributors->n1 - horizontal_contributors->n0 + 1 ) - 5 + 3 ) >> 2;
  10451     float const * hc = horizontal_coefficients;
  10452 
  10453     stbir__4_coeff_start();
  10454     STBIR_SIMD_NO_UNROLL_LOOP_START
  10455     do {
  10456       hc += 4;
  10457       decode += STBIR__horizontal_channels * 4;
  10458       stbir__4_coeff_continue_from_4( 0 );
  10459       --n;
  10460     } while ( n > 0 );
  10461     stbir__1_coeff_remnant( 4 );
  10462     stbir__store_output();
  10463   } while ( output < output_end );
  10464 }
  10465 
  10466 static void STBIR_chans( stbir__horizontal_gather_,_channels_with_n_coeffs_mod2 )( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
  10467 {
  10468   float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  10469   float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  10470   STBIR_SIMD_NO_UNROLL_LOOP_START
  10471   do {
  10472     float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
  10473     int n = ( ( horizontal_contributors->n1 - horizontal_contributors->n0 + 1 ) - 6 + 3 ) >> 2;
  10474     float const * hc = horizontal_coefficients;
  10475 
  10476     stbir__4_coeff_start();
  10477     STBIR_SIMD_NO_UNROLL_LOOP_START
  10478     do {
  10479       hc += 4;
  10480       decode += STBIR__horizontal_channels * 4;
  10481       stbir__4_coeff_continue_from_4( 0 );
  10482       --n;
  10483     } while ( n > 0 );
  10484     stbir__2_coeff_remnant( 4 );
  10485 
  10486     stbir__store_output();
  10487   } while ( output < output_end );
  10488 }
  10489 
  10490 static void STBIR_chans( stbir__horizontal_gather_,_channels_with_n_coeffs_mod3 )( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
  10491 {
  10492   float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  10493   float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  10494   stbir__3_coeff_setup();
  10495   STBIR_SIMD_NO_UNROLL_LOOP_START
  10496   do {
  10497     float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
  10498     int n = ( ( horizontal_contributors->n1 - horizontal_contributors->n0 + 1 ) - 7 + 3 ) >> 2;
  10499     float const * hc = horizontal_coefficients;
  10500 
  10501     stbir__4_coeff_start();
  10502     STBIR_SIMD_NO_UNROLL_LOOP_START
  10503     do {
  10504       hc += 4;
  10505       decode += STBIR__horizontal_channels * 4;
  10506       stbir__4_coeff_continue_from_4( 0 );
  10507       --n;
  10508     } while ( n > 0 );
  10509     stbir__3_coeff_remnant( 4 );
  10510 
  10511     stbir__store_output();
  10512   } while ( output < output_end );
  10513 }
  10514 
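/* Editor's note (descriptive comment): dispatch tables for this channel
   count. The including code presumably indexes _channels_funcs by
   (coefficient count - 1) for widths 1..12, and falls back to
   _channels_with_n_coeffs_funcs, indexed by the coefficient count mod 4,
   for anything wider; the exact selection logic lives elsewhere in the
   file. */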
  10515 static stbir__horizontal_gather_channels_func * STBIR_chans(stbir__horizontal_gather_,_channels_with_n_coeffs_funcs)[4]=
  10516 {
  10517   STBIR_chans(stbir__horizontal_gather_,_channels_with_n_coeffs_mod0),
  10518   STBIR_chans(stbir__horizontal_gather_,_channels_with_n_coeffs_mod1),
  10519   STBIR_chans(stbir__horizontal_gather_,_channels_with_n_coeffs_mod2),
  10520   STBIR_chans(stbir__horizontal_gather_,_channels_with_n_coeffs_mod3),
  10521 };
  10522 
  10523 static stbir__horizontal_gather_channels_func * STBIR_chans(stbir__horizontal_gather_,_channels_funcs)[12]=
  10524 {
  10525   STBIR_chans(stbir__horizontal_gather_,_channels_with_1_coeff),
  10526   STBIR_chans(stbir__horizontal_gather_,_channels_with_2_coeffs),
  10527   STBIR_chans(stbir__horizontal_gather_,_channels_with_3_coeffs),
  10528   STBIR_chans(stbir__horizontal_gather_,_channels_with_4_coeffs),
  10529   STBIR_chans(stbir__horizontal_gather_,_channels_with_5_coeffs),
  10530   STBIR_chans(stbir__horizontal_gather_,_channels_with_6_coeffs),
  10531   STBIR_chans(stbir__horizontal_gather_,_channels_with_7_coeffs),
  10532   STBIR_chans(stbir__horizontal_gather_,_channels_with_8_coeffs),
  10533   STBIR_chans(stbir__horizontal_gather_,_channels_with_9_coeffs),
  10534   STBIR_chans(stbir__horizontal_gather_,_channels_with_10_coeffs),
  10535   STBIR_chans(stbir__horizontal_gather_,_channels_with_11_coeffs),
  10536   STBIR_chans(stbir__horizontal_gather_,_channels_with_12_coeffs),
  10537 };
  10538 
  10539 #undef STBIR__horizontal_channels
  10540 #undef STB_IMAGE_RESIZE_DO_HORIZONTALS
  10541 #undef stbir__1_coeff_only
  10542 #undef stbir__1_coeff_remnant
  10543 #undef stbir__2_coeff_only
  10544 #undef stbir__2_coeff_remnant
  10545 #undef stbir__3_coeff_only
  10546 #undef stbir__3_coeff_remnant
  10547 #undef stbir__3_coeff_setup
  10548 #undef stbir__4_coeff_start
  10549 #undef stbir__4_coeff_continue_from_4
  10550 #undef stbir__store_output
  10551 #undef stbir__store_output_tiny
  10552 #undef STBIR_chans
  10553 
  10554 #endif  // HORIZONTALS
  10555 
  10556 #undef STBIR_strs_join2
  10557 #undef STBIR_strs_join1
  10558 
  10559 #endif // STB_IMAGE_RESIZE_DO_HORIZONTALS/VERTICALS/CODERS
  10560 
  10561 /*
  10562 ------------------------------------------------------------------------------
  10563 This software is available under 2 licenses -- choose whichever you prefer.
  10564 ------------------------------------------------------------------------------
  10565 ALTERNATIVE A - MIT License
  10566 Copyright (c) 2017 Sean Barrett
  10567 Permission is hereby granted, free of charge, to any person obtaining a copy of
  10568 this software and associated documentation files (the "Software"), to deal in
  10569 the Software without restriction, including without limitation the rights to
  10570 use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
  10571 of the Software, and to permit persons to whom the Software is furnished to do
  10572 so, subject to the following conditions:
  10573 The above copyright notice and this permission notice shall be included in all
  10574 copies or substantial portions of the Software.
  10575 THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
  10576 IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
  10577 FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
  10578 AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
  10579 LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
  10580 OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
  10581 SOFTWARE.
  10582 ------------------------------------------------------------------------------
  10583 ALTERNATIVE B - Public Domain (www.unlicense.org)
  10584 This is free and unencumbered software released into the public domain.
  10585 Anyone is free to copy, modify, publish, use, compile, sell, or distribute this
  10586 software, either in source code form or as a compiled binary, for any purpose,
  10587 commercial or non-commercial, and by any means.
  10588 In jurisdictions that recognize copyright laws, the author or authors of this
  10589 software dedicate any and all copyright interest in the software to the public
  10590 domain. We make this dedication for the benefit of the public at large and to
  10591 the detriment of our heirs and successors. We intend this dedication to be an
  10592 overt act of relinquishment in perpetuity of all present and future rights to
  10593 this software under copyright law.
  10594 THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
  10595 IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
  10596 FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
  10597 AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
  10598 ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
  10599 WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
  10600 ------------------------------------------------------------------------------
  10601 */